ARROW-15617: [Doc][C++] Document environment variables #12624
Conversation
The number of worker threads in the global (process-wide) CPU thread pool.
If this environment variable is not defined, the available hardware
concurrency is determined using a platform-specific routine.
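For illustration, such a variable would typically be applied per invocation. This is a hedged sketch: the excerpt above does not show the directive line, so the variable name OMP_NUM_THREADS is an assumption based on Arrow's thread-pool code, and the `sh -c` command merely stands in for a real Arrow-based program.

```shell
# Hedged sketch: cap the global CPU thread pool at 4 workers for a single
# child process only. OMP_NUM_THREADS is an assumed variable name; the
# `sh -c` command stands in for an actual Arrow-based program.
OMP_NUM_THREADS=4 sh -c 'echo "child sees OMP_NUM_THREADS=$OMP_NUM_THREADS"'
# prints: child sees OMP_NUM_THREADS=4
```

Setting the variable on the command line like this scopes it to the child process, so the parent shell and other programs are unaffected.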
@westonpace I notice the IO thread pool size cannot be influenced for now, unless I'm mistaken. Is this something we'd like to make configurable?
There is a function, though not an env var:
arrow/cpp/src/arrow/io/type_fwd.h
Lines 46 to 52 in 5cb5afc
/// \brief Set the capacity of the global I/O thread pool
///
/// Set the number of worker threads in the thread pool to which
/// Arrow dispatches various I/O-bound tasks.
///
/// The current number is returned by GetIOThreadPoolCapacity().
ARROW_EXPORT Status SetIOThreadPoolCapacity(int threads);
Also, hmm.
Lines 51 to 57 in 5cb5afc
// [[arrow::export]]
int GetIOThreadPoolCapacity() { return arrow::GetCpuThreadPoolCapacity(); }

// [[arrow::export]]
void SetIOThreadPoolCapacity(int threads) {
  StopIfNotOk(arrow::SetCpuThreadPoolCapacity(threads));
}
:-D. Do you want to open a JIRA for R?
At the moment I think it is very important this is configurable. So I am +1 on being able to expose this via an environment variable. We have had at least one customer that was using S3 and benefited from setting this larger than the initial default.
At some point though I think we want to move towards having I/O context / thread pools specific to the filesystem. A single global default doesn't make a lot of sense when you might have a mix of local and remote workloads. Even then I suppose we might still have a global default as a fallback in case the user doesn't specify anything.
Ok, I created https://issues.apache.org/jira/browse/ARROW-15941 for an IO thread pool environment variable.
The backend to which `OpenTelemetry <https://opentelemetry.io/>`_-based
execution traces are exported. Possible values are:

- ``ostream``: emit textual log messages to stdout;
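As a hedged illustration of how this backend would be selected per run: the excerpt above does not show the variable name, so ARROW_TRACING_BACKEND is an assumption, and the `sh -c` command stands in for a traced Arrow C++ program.

```shell
# Hedged sketch: select the textual (ostream) tracing backend for one run.
# ARROW_TRACING_BACKEND is an assumed variable name; `sh -c` stands in
# for an actual Arrow C++ program that emits traces.
ARROW_TRACING_BACKEND=ostream sh -c 'echo "backend=$ARROW_TRACING_BACKEND"'
# prints: backend=ostream
```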
@lidavidm It seems one cannot choose between stdout/stderr?
We can add ostream_stdout and ostream_stderr then.
Force-pushed from 5dc79ad to cd94081.
This is very useful information. It's nice to see it all gathered in one place.
Do we want to specify the default behavior if the environment variable is not set? Most of the documented variables here do not have this info.
Also, do we want to add suggestions for when it might be appropriate to set an environment variable? For example, under JAVA_HOME we say "This may be required..." but under HADOOP_HOME we have no such statement.
These environment variables apply when running python, R, etc. as well. Do we want to add a small snippet referencing this page in the documentation for those languages as well?
In addition to runtime dispatch, the compile-time SIMD level can
be set using the ``ARROW_SIMD_LEVEL`` CMake configuration variable.
Unlike runtime dispatch, compile-time SIMD optimizations cannot be
changed at runtime (for example, if you compile Arrow C++ with AVX512
enabled, the resulting binary will only run on AVX512-enabled CPUs).
I'm not sure this fully explains the interplay between runtime and compile-time settings. Would the user ever specify ARROW_SIMD_LEVEL at build time and still use ARROW_USER_SIMD_LEVEL at runtime? Or does ARROW_USER_SIMD_LEVEL only make sense if ARROW_SIMD_LEVEL was not specified?
ARROW_SIMD_LEVEL determines the compiler flags used when building Arrow C++, so it functions as a baseline for the CPU requirements. Even if you set ARROW_USER_SIMD_LEVEL to a lower value, the compile-time optimizations enabled by ARROW_SIMD_LEVEL will still drive the CPU requirements (hence the example in parentheses).
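The distinction can be sketched as follows. This is a hedged build/run fragment, not a runnable example: only the names ARROW_SIMD_LEVEL and ARROW_USER_SIMD_LEVEL come from the discussion above, while the AVX2/NONE values and the application name are placeholders.

```shell
# Build time: bake AVX2 instructions into the Arrow C++ binary. The
# resulting library requires an AVX2-capable CPU regardless of any
# runtime setting.
cmake -DARROW_SIMD_LEVEL=AVX2 ..

# Run time: runtime dispatch can only lower the level used for dispatched
# kernels; it cannot relax the compile-time AVX2 baseline set above.
ARROW_USER_SIMD_LEVEL=NONE ./my_arrow_app
```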
The number of entries to keep in the Gandiva JIT compilation cache.
The cache is in-memory and does not persist across processes.
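A hedged usage sketch for this setting: the excerpt above does not show the variable name, so GANDIVA_CACHE_SIZE is an assumption, and the `sh -c` command stands in for a program that compiles Gandiva expressions.

```shell
# Hedged sketch: enlarge the Gandiva JIT cache for one child process.
# GANDIVA_CACHE_SIZE is an assumed variable name; `sh -c` stands in for
# a program that actually compiles Gandiva expressions.
GANDIVA_CACHE_SIZE=5000 sh -c 'echo "cache entries=$GANDIVA_CACHE_SIZE"'
# prints: cache entries=5000
```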
Do we have any other Gandiva documentation that we can link to for more information?
Unfortunately there's no user-facing documentation for Gandiva.
List and describe the environment variables which influence the behaviour of Arrow C++ at runtime.
Ideally, but I'm not sure where to put that. The Python docs don't have a natural place for it currently. As for the R docs, I'd rather leave this to the R developers.
Force-pushed from f491954 to 855899b.
Ok, I added a dedicated PyArrow doc page about environment variables as well.
.. envvar:: PYARROW_IGNORE_TIMEZONE

   By default, PyArrow propagates the timezone value when converting
   Arrow data to/from Python datetime objects. If this environment variable
   is set to a non-empty value, the timezone is not propagated.
@jorisvandenbossche Does this seem accurate or do you want to suggest a better wording?
That sounds correct, yes. Now, checking the code, I see this note:
arrow/cpp/src/arrow/python/python_to_arrow.h
Lines 56 to 59 in ecf8c75
/// Used to maintain backwards compatibility for
/// timezone bugs (see ARROW-9528). Should be removed
/// after Arrow 2.0 release.
bool ignore_timezone = false;
Although I see that spark is still using the env variable (cc @BryanCutler). In the spark code (apache/spark#30111) it points to https://issues.apache.org/jira/browse/SPARK-32285, which is not yet resolved.
If this environment variable was mainly for pyspark compatibility, and if the plan still is to remove this once it is solved on the pyspark side, maybe we should not document it? (Because that would only encourage others to also start making use of it.)
Hmm, I see. If this is only meant for use by Spark, then I agree we should probably not document it, or perhaps alter the documentation accordingly.
I am not 100% sure it's only for Spark, but I would also say that if we wanted to expose this as an option for the conversion for general use, it should be an argument to conversion functions (as we already have others), and not controlled through an environment variable.
Benchmark runs are scheduled for baseline = 3eaa7dd and contender = 40d8e7e. 40d8e7e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.