ARROW-15617: [Doc][C++] Document environment variables #12624
Conversation
The number of worker threads in the global (process-wide) CPU thread pool.
If this environment variable is not defined, the available hardware
concurrency is determined using a platform-specific routine.
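For illustration, such a variable would typically be applied per invocation. This is a hedged sketch: the excerpt above does not show the directive line, so the variable name OMP_NUM_THREADS is an assumption based on Arrow's thread-pool code, and the `sh -c` command merely stands in for a real Arrow-based program.

```shell
# Hedged sketch: cap the global CPU thread pool at 4 workers for a single
# child process only. OMP_NUM_THREADS is an assumed variable name; the
# `sh -c` command stands in for an actual Arrow-based program.
OMP_NUM_THREADS=4 sh -c 'echo "child sees OMP_NUM_THREADS=$OMP_NUM_THREADS"'
# prints: child sees OMP_NUM_THREADS=4
```

Setting the variable on the command line like this scopes it to the child process, so the parent shell and other programs are unaffected.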
@westonpace I notice the IO thread pool size cannot be influenced for now, unless I'm mistaken. Is this something we'd like to make configurable?
There is a function, though not an env var:
arrow/cpp/src/arrow/io/type_fwd.h
Lines 46 to 52 in 5cb5afc
/// \brief Set the capacity of the global I/O thread pool
///
/// Set the number of worker threads in the thread pool to which
/// Arrow dispatches various I/O-bound tasks.
///
/// The current number is returned by GetIOThreadPoolCapacity().
ARROW_EXPORT Status SetIOThreadPoolCapacity(int threads);
Also, hmm.
Lines 51 to 57 in 5cb5afc
// [[arrow::export]]
int GetIOThreadPoolCapacity() { return arrow::GetCpuThreadPoolCapacity(); }

// [[arrow::export]]
void SetIOThreadPoolCapacity(int threads) {
  StopIfNotOk(arrow::SetCpuThreadPoolCapacity(threads));
}
:-D. Do you want to open a JIRA for R?
At the moment I think it is very important this is configurable. So I am +1 on being able to expose this via an environment variable. We have had at least one customer that was using S3 and benefited from setting this larger than the initial default.
At some point though I think we want to move towards having I/O context / thread pools specific to the filesystem. A single global default doesn't make a lot of sense when you might have a mix of local and remote workloads. Even then I suppose we might still have a global default as a fallback in case the user doesn't specify anything.
Ok, I created https://issues.apache.org/jira/browse/ARROW-15941 for an IO thread pool environment variable.
The backend to which `OpenTelemetry <https://opentelemetry.io/>`_-based
execution traces are exported. Possible values are:

- ``ostream``: emit textual log messages to stdout;
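As a hedged illustration of how this backend would be selected per run: the excerpt above does not show the variable name, so ARROW_TRACING_BACKEND is an assumption, and the `sh -c` command stands in for a traced Arrow C++ program.

```shell
# Hedged sketch: select the textual (ostream) tracing backend for one run.
# ARROW_TRACING_BACKEND is an assumed variable name; `sh -c` stands in
# for an actual Arrow C++ program that emits traces.
ARROW_TRACING_BACKEND=ostream sh -c 'echo "backend=$ARROW_TRACING_BACKEND"'
# prints: backend=ostream
```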
@lidavidm It seems one cannot choose between stdout/stderr?
We can add ostream_stdout and ostream_stderr then.
Force-pushed from 5dc79ad to cd94081.
This is very useful information. It's nice to see it all gathered in one place.
Do we want to specify the default behavior if the environment variable is not set? Most of the documented variables here do not have this info.
Also, do we want to add suggestions for when it might be appropriate to set an environment variable? For example, under JAVA_HOME we say "This may be required..." but under HADOOP_HOME we have no such statement.
These environment variables apply when running python, R, etc. as well. Do we want to add a small snippet referencing this page in the documentation for those languages as well?
In addition to runtime dispatch, the compile-time SIMD level can
be set using the ``ARROW_SIMD_LEVEL`` CMake configuration variable.
Unlike runtime dispatch, compile-time SIMD optimizations cannot be
changed at runtime (for example, if you compile Arrow C++ with AVX512
enabled, the resulting binary will only run on AVX512-enabled CPUs).
I'm not sure this fully explains the interplay between runtime and compile-time settings. Would the user ever specify ARROW_SIMD_LEVEL at build time and still use ARROW_USER_SIMD_LEVEL at runtime? Or does ARROW_USER_SIMD_LEVEL only make sense if ARROW_SIMD_LEVEL was not specified?
ARROW_SIMD_LEVEL determines the compiler flags used when building Arrow C++, so it functions as a baseline for the CPU requirements. Even if you set ARROW_USER_SIMD_LEVEL to a lower value, the compile-time optimizations enabled by ARROW_SIMD_LEVEL will still drive the CPU requirements (hence the example in parentheses).
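The distinction can be sketched as follows. This is a hedged build/run fragment, not a runnable example: only the names ARROW_SIMD_LEVEL and ARROW_USER_SIMD_LEVEL come from the discussion above, while the AVX2/NONE values and the application name are placeholders.

```shell
# Build time: bake AVX2 instructions into the Arrow C++ binary. The
# resulting library requires an AVX2-capable CPU regardless of any
# runtime setting.
cmake -DARROW_SIMD_LEVEL=AVX2 ..

# Run time: runtime dispatch can only lower the level used for dispatched
# kernels; it cannot relax the compile-time AVX2 baseline set above.
ARROW_USER_SIMD_LEVEL=NONE ./my_arrow_app
```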
The number of entries to keep in the Gandiva JIT compilation cache.
The cache is in-memory and does not persist across processes.
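A hedged usage sketch for this setting: the excerpt above does not show the variable name, so GANDIVA_CACHE_SIZE is an assumption, and the `sh -c` command stands in for a program that compiles Gandiva expressions.

```shell
# Hedged sketch: enlarge the Gandiva JIT cache for one child process.
# GANDIVA_CACHE_SIZE is an assumed variable name; `sh -c` stands in for
# a program that actually compiles Gandiva expressions.
GANDIVA_CACHE_SIZE=5000 sh -c 'echo "cache entries=$GANDIVA_CACHE_SIZE"'
# prints: cache entries=5000
```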
Do we have any other Gandiva documentation that we can link to for more information?
Unfortunately there's no user-facing documentation for Gandiva.
List and describe the environment variables which influence the behaviour of Arrow C++ at runtime.
Ideally, but I'm not sure where to put that. The Python docs don't have a natural place for it currently. As for the R docs, I'd rather leave this to the R developers.
Force-pushed from f491954 to 855899b.
Ok, I added a dedicated PyArrow doc page about environment variables as well.
.. envvar:: PYARROW_IGNORE_TIMEZONE

   By default, PyArrow propagates the timezone value when converting
   Arrow data to/from Python datetime objects. If this environment variable
   is set to a non-empty value, the timezone is not propagated.
@jorisvandenbossche Does this seem accurate or do you want to suggest a better wording?
That sounds correct, yes. Now, checking the code, I see this note:
arrow/cpp/src/arrow/python/python_to_arrow.h
Lines 56 to 59 in ecf8c75
/// Used to maintain backwards compatibility for
/// timezone bugs (see ARROW-9528). Should be removed
/// after Arrow 2.0 release.
bool ignore_timezone = false;
Although I see that spark is still using the env variable (cc @BryanCutler). In the spark code (apache/spark#30111) it points to https://issues.apache.org/jira/browse/SPARK-32285, which is not yet resolved.
If this environment variable was mainly for pyspark compatibility, and if the plan still is to remove this once it is solved on the pyspark side, maybe we should not document it? (Because that would only encourage others to also start making use of it.)
Hmm, I see. If this is only meant for use by Spark, then I agree we should probably not document it, or perhaps alter the documentation accordingly.
I am not 100% sure it's only for Spark, but I would also say that if we wanted to expose this as an option for the conversion for general use, it should be an argument to conversion functions (as we already have others), and not controlled through an environment variable.
Benchmark runs are scheduled for baseline = 3eaa7dd and contender = 40d8e7e. 40d8e7e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.