[SPARK-28843][PYTHON] Set OMP_NUM_THREADS to executor cores for python if not set

### What changes were proposed in this pull request?

When starting python processes, set `OMP_NUM_THREADS` to the number of cores allocated to an executor or driver if `OMP_NUM_THREADS` is not already set. Each python process will use the same `OMP_NUM_THREADS` setting, even if workers are not shared.

This avoids creating an OpenMP thread pool for parallel processing with a number of threads equal to the number of cores on the executor and [significantly reduces memory consumption](numpy/numpy#10455). Instead, this threadpool should use the number of cores allocated to the executor, if available. If a setting for number of cores is not available, this doesn't change any behavior. OpenMP is used by numpy and pandas.

### Why are the changes needed?

To reduce memory consumption for PySpark jobs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Validated this reduces python worker memory consumption by more than 1GB on our cluster.

Closes #25545 from rdblue/SPARK-28843-set-omp-num-cores.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
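The logic described above can be sketched as a small helper. This is a minimal illustration, not the actual Spark code: `build_worker_env` and `executor_cores` are hypothetical names standing in for Spark's worker-launch environment setup and its `spark.executor.cores` / `spark.driver.cores` setting.

```python
import os


def build_worker_env(base_env, executor_cores=None):
    """Build the environment for a new Python worker process.

    Hypothetical sketch of the SPARK-28843 behavior: if the user has not
    already set OMP_NUM_THREADS, cap OpenMP's thread pool at the number
    of cores allocated to the executor. If no core count is available,
    leave the environment unchanged.
    """
    env = dict(base_env)
    if "OMP_NUM_THREADS" not in env and executor_cores is not None:
        # OpenMP (used by numpy/pandas) otherwise sizes its pool to all
        # machine cores, inflating per-worker memory use.
        env["OMP_NUM_THREADS"] = str(executor_cores)
    return env
```

Because an explicit `OMP_NUM_THREADS` in the inherited environment is respected, users who tune the thread pool themselves see no behavior change.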