Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15617: [Doc][C++] Document environment variables #12624

Closed
wants to merge 3 commits into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Mar 14, 2022

List and describe the environment variables which influence the behaviour of Arrow C++ at runtime.


The number of worker threads in the global (process-wide) CPU thread pool.
If this environment variable is not defined, the available hardware
concurrency is determined using a platform-specific routine.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace I notice the IO thread pool size cannot be influenced for now, unless I'm mistaken. Is this something we'd like to make configurable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a function, though not an env var:

/// \brief Set the capacity of the global I/O thread pool
///
/// Set the number of worker threads in the thread pool to which
/// Arrow dispatches various I/O-bound tasks.
///
/// The current number is returned by GetIOThreadPoolCapacity().
ARROW_EXPORT Status SetIOThreadPoolCapacity(int threads);

Also, hmm.

// [[arrow::export]]
int GetIOThreadPoolCapacity() { return arrow::GetCpuThreadPoolCapacity(); }
// [[arrow::export]]
void SetIOThreadPoolCapacity(int threads) {
StopIfNotOk(arrow::SetCpuThreadPoolCapacity(threads));
}

Copy link
Member Author

@pitrou pitrou Mar 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-D. Do you want to open a JIRA for R?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment I think it is very important this is configurable. So I am +1 on being able to expose this via an environment variable. We have had at least one customer that was using S3 and benefited from setting this larger than the initial default.

At some point though I think we want to move towards having I/O context / thread pools specific to the filesystem. A single global default doesn't make a lot of sense when you might have a mix of local and remote workloads. Even then I suppose we might still have a global default as a fallback in case the user doesn't specify anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I created https://issues.apache.org/jira/browse/ARROW-15941 for an IO thread pool environment variable.

The backend where to export `OpenTelemetry <https://opentelemetry.io/>`_-based
execution traces. Possible values are:

- ``ostream``: emit textual log messages to stdout;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidavidm It seems once cannot choose between stdout/stderr?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add ostream_stdout and ostream_stderr then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions
Copy link

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very useful information. It's nice to see it all gathered in one place.

Do we want to specify the default behavior if the environment variable is not set? Most of the documented variables here do not have this info.

Also, do we want to add suggestions for when it might be appropriate to set an environment variable?

For example, under JAVA_HOME we say "This may be required..." but under HADOOP_HOME we have no such statement.

These environment variables apply when running python, R, etc. as well. Do we want to add a small snippet referencing this page in the documentation for those languages as well?

docs/source/cpp/env_vars.rst Show resolved Hide resolved
docs/source/cpp/env_vars.rst Show resolved Hide resolved
Comment on lines +94 to +103
In addition to runtime dispatch, the compile-time SIMD level can
be set using the ``ARROW_SIMD_LEVEL`` CMake configuration variable.
Unlike runtime dispatch, compile-time SIMD optimizations cannot be
changed at runtime (for example, if you compile Arrow C++ with AVX512
enabled, the resulting binary will only run on AVX512-enabled CPUs).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this fully explains the interplay between runtime and compile time settings. Would the user ever specify ARROW_SIMD_LEVEL at build time and still use ARROW_USER_SIMD_LEVEL at runtime? Or does ARROW_USER_SIMD_LEVEL only make sense if ARROW_SIMD_LEVEL was not specified.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARROW_SIMD_LEVEL determines the compiler flags used when building Arrow C++, so it functions as a baseline for the CPU requirements. Even if you set ARROW_USER_SIMD_LEVEL to a lower value, the compile-time optimizations enabled by ARROW_SIMD_LEVEL will still drive the CPU requirements (hence the example in parentheses).

Comment on lines +102 to +108
The number of entries to keep in the Gandiva JIT compilation cache.
The cache is in-memory and does not persist accross processes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any other Gandiva documentation that we can link to for more information?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately there's no user-facing documentation for Gandiva.

List and describe the environment variables which influence the behaviour of Arrow C++ at runtime.
@pitrou
Copy link
Member Author

pitrou commented Mar 15, 2022

These environment variables apply when running python, R, etc. as well. Do we want to add a small snippet referencing this page in the documentation for those languages as well?

Ideally but I'm not sure where to put that. The Python docs don't have a natural place for it currently. As for the R docs, I'd rather leave this to the R developers.

@pitrou
Copy link
Member Author

pitrou commented Mar 15, 2022

Ok, I added a dedicated PyArrow doc page about environment variables as well.

Comment on lines +53 to +57
.. envvar:: PYARROW_IGNORE_TIMEZONE

By default, PyArrow propagates the timezone value when converting
Arrow data to/from Python datetime objects. If this environment variable
is set to a non-empty value, the timezone is not propagated.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Does this seem accurate or do you want to suggest a better wording?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds correct, yes. Now, checking the code, I see this note:

/// Used to maintain backwards compatibility for
/// timezone bugs (see ARROW-9528). Should be removed
/// after Arrow 2.0 release.
bool ignore_timezone = false;

Although I see that spark is still using the env variable (cc @BryanCutler). In the spark code (apache/spark#30111) it points to https://issues.apache.org/jira/browse/SPARK-32285, which is not yet resolved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this environment variable was mainly for pyspark compatibility, and if the plan still is too remove this once it is solved on the pyspark side, we should not maybe not document it? (because that would only encourage others to also start making use of it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I see. If this is only meant for use by Spark, then I agree we should probably not document it, or perhaps alter the documentation accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure it's only for Spark, but I would also say that if we wanted to expose this as an option for the conversion for general use, it should be an argument to conversion functions (as we already have others), and not controlled through an environment variable.

@pitrou pitrou closed this in 40d8e7e Mar 16, 2022
@pitrou pitrou deleted the ARROW-15617-doc-env-vars- branch March 16, 2022 16:22
@ursabot
Copy link

ursabot commented Mar 16, 2022

Benchmark runs are scheduled for baseline = 3eaa7dd and contender = 40d8e7e. 40d8e7e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.17% ⬆️0.04%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.72% ⬆️0.77%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants