Thread overloading with OpenBLAS #187
Would it not be possible to control this via NUMBA_NUM_THREADS ?
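One way, assuming the variable is exported before the Python process starts (Numba reads NUMBA_NUM_THREADS once, at import time, so setting it later in the same process has no effect). The script name below is hypothetical:

```shell
# NUMBA_NUM_THREADS is read once, when numba is first imported,
# so it must be in the environment before Python starts.
export NUMBA_NUM_THREADS=1
# python run_beamforming.py   # hypothetical Acoular script, run with the limit in place
```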
Strange that you see all cores in use. 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz × 4, numba 0.56.3, with and without `numba.set_num_threads(1)`: [screenshots]

Without `numba.set_num_threads(1)` vs. with `numba.set_num_threads(1)`, Dell XPS i5: [screenshots]
compute3: Intel(R) Xeon(R) CPU E5-2630L v4 @ 1.80GHz, 20 cores

compute4: Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, 48 cores
Docker container (latest Acoular): default threading layer omp
Standard console: default threading layer tbb

DELL XPS 13, 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, 16 GB RAM
Would it be a solution here to enforce tbb upon start of Acoular?
I think we cannot really enforce the use of tbb. The problem in my case was that the libtbb-dev library was not installed, and installing tbb with conda or pip is not reliable (Numba issue). However, we should at least emit a warning if the threading layer is omp and tbb cannot be chosen.
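A minimal sketch of such a warning (an assumption on my part, not the implementation that landed): `numba.threading_layer()` only reports the layer after some parallel code has executed, so the check first runs a trivial parallel function.

```python
import warnings

def warn_if_omp():
    """Warn when Numba falls back to the omp threading layer.

    numba.threading_layer() is only meaningful after at least one
    @njit(parallel=True) function has run, so we trigger a tiny one first.
    """
    try:
        import numba
        import numpy as np
    except ImportError:
        return None  # nothing to check without numba

    @numba.njit(parallel=True)
    def _touch(x):
        return x.sum()

    _touch(np.ones(4))  # force threading-layer initialization
    layer = numba.threading_layer()
    if layer == "omp":
        warnings.warn(
            "Numba uses the 'omp' threading layer; consider installing "
            "tbb to avoid thread oversubscription."
        )
    return layer
```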
How would we find out whether tbb is used before running some code? The numba docs say that
It would be nice to check this already when acoular is installed; however, I don't know how to do this cleanly atm. For now, we could at least check if
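One cheap check along these lines (a sketch, assuming the relevant condition is whether the `tbb` Python package is importable in the current environment, which is what Numba needs in order to select the tbb layer):

```python
import importlib.util

def tbb_available():
    # Numba can only select the tbb threading layer if the tbb package
    # (and the libtbb it ships) is importable in the current environment.
    return importlib.util.find_spec("tbb") is not None
```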
I would absolutely avoid interfering with the install process. It is already complicated enough to have a solution that works for pip + conda with 5 different Python versions and 3 different OSes.
Do we know if this still happens with more recent Numba versions? This Numba issue hints that the problem with no accessible TBB is resolved. With Acoular 23.6 out, we can indeed install recent versions of all packages (except for sklearn).
The problem seems to be the versions installed via pip. With these, all cores are used regardless of whether tbb, omp, or workqueue is the threading layer, despite the setting to use just 1 thread. Moreover, numba from PyPI does not depend on tbb, so omp is always used on a fresh install. If no limit on the number of threads is set, computations take forever. In addition, the numba/numpy builds from PyPI are much slower than the conda versions: they do not use any speedups from MKL and the like. While there is intel-numpy on PyPI, it is not possible to install it alongside acoular.
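Independently of the threading layer, the BLAS thread count can be capped at runtime with threadpoolctl. This is a sketch, not something the thread confirms Acoular adopted; it falls back to an unlimited call if threadpoolctl is not installed:

```python
import numpy as np

def matmul_single_threaded_blas(a):
    """Multiply a by itself with OpenBLAS/MKL limited to one thread.

    Inside the threadpool_limits block, BLAS uses a single thread, so a
    Numba-parallel caller does not oversubscribe the CPU.
    """
    try:
        from threadpoolctl import threadpool_limits
    except ImportError:
        return a @ a  # fall back: no limiting available
    with threadpool_limits(limits=1, user_api="blas"):
        return a @ a
```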
Possible workarounds:
As of today I am no longer able to reproduce the problem. I suspect that some OMP component from the system is used that was updated in the meantime. This is exceptionally hard to debug. |
I was still not able to reproduce even with a fresh install of anaconda on compute2. However, by changing to
I did a fresh install based on mamba:
Resulting in:
And a fresh pip env install with:
Both resulting in:
I can even install the tbb package with
If I install tbb before installing Acoular (and Numba), the environment-specific libtbb.so is used.
As I wrote above, it seems that not OMP but OpenBLAS is the real problem. Maybe the problem vanished miraculously with the changes introduced in !227? The overloading problem resulted from OpenBLAS calls inside Numba-compiled parallel multithreaded functions. If the functions no longer call OpenBLAS, the problem is gone. This is just a hypothesis and hard to analyze. Does the problem return if Acoular prior to !227 is used?
My comment was not addressing OMP and OpenBLAS. It is still the case that Numba obviously ignores the tbb library installed in the corresponding environment. Yes, OpenBLAS=1 solves the issue for OMP. However, in my case, the problem has not vanished: without setting OpenBLAS=1, thread overloading still occurs!
And now I understood that just OpenBLAS matters and not the threading layer... |
I see. I was thinking of the following 'solution'.
This sounds reasonable.
From the docs: This is a dictionary that maps module names to modules which have already been loaded. This can be manipulated to force reloading of modules and other tricks. However, replacing the dictionary will not necessarily work as expected and deleting essential items from the dictionary may cause Python to fail. If you want to iterate over this global dictionary always use sys.modules.copy() or tuple(sys.modules) to avoid exceptions as its size may change during iteration as a side effect of code or activity in other threads.
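Putting the sys.modules check to use, a sketch of the kind of guard presumably under discussion (my reconstruction, not the merged code; OPENBLAS_NUM_THREADS is the standard OpenBLAS environment variable): OpenBLAS reads its thread limit once, when the shared library is loaded at the first `import numpy`, so the variable can only help if numpy is not loaded yet.

```python
import os
import sys

# OpenBLAS reads OPENBLAS_NUM_THREADS once, when the shared library is
# loaded, which happens at the first `import numpy`.  sys.modules tells
# us whether it is already too late for the variable to take effect.
if "numpy" not in sys.modules:
    os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # OpenBLAS now starts single-threaded (if the variable took effect)
```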
Just for completeness: Numpy docs say (https://numpy.org/install/):
I could try to hack something tomorrow.

Resolved by !236
In my case, performance of beamforming algorithms is unnecessarily slow with the current numba setting.
Numba overloads the available threads if not set manually.
From the Numba documentation https://numba.pydata.org/numba-doc/dev/user/threading-layer.html#setting-the-number-of-threads:
In this example, suppose the machine we are running on has 8 cores (so numba.config.NUMBA_NUM_THREADS would be 8). Suppose we want to run some code with @njit(parallel=True), but we also want to run our code concurrently in 4 different processes. With the default number of threads, each Python process would run 8 threads, for a total of 4 * 8 = 32 threads, which is oversubscription for our 8 cores. We should rather limit each process to 2 threads, so that the total will be 4 * 2 = 8, which matches our number of physical cores.
If one considers the following example:
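The example itself is not reproduced above; a hypothetical stand-in that exhibits the same pattern (a Numba `prange` loop whose body calls into BLAS via `np.dot`) might look like this:

```python
import time

def bench(n=64):
    """Hypothetical reconstruction of the missing benchmark: a
    Numba-parallel loop calling BLAS, the pattern that triggered the
    thread oversubscription.  Returns the elapsed time, or None if
    numba is not installed."""
    try:
        import numba
        import numpy as np
    except ImportError:
        return None

    @numba.njit(parallel=True)
    def work(a, b):
        out = np.zeros(a.shape[0])
        for i in numba.prange(a.shape[0]):
            out[i] = np.dot(a[i], b)  # BLAS call inside a parallel region
        return out

    a = np.random.rand(n, n)
    b = np.random.rand(n)
    t0 = time.perf_counter()
    work(a, b)
    return time.perf_counter() - t0
```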
On my machine (DELL XPS 13, 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz, 16GB RAM, 8 cores, 16 threads):
Numba (1 thread): 7.5 seconds
Numba (default): 273.3 seconds
Threading layer chosen: omp
Even if I set the default number of threads to 1, numba utilizes all available cores!