
Thread overloading with OpenBLAS #187

Closed
adku1173 opened this issue Apr 9, 2024 · 26 comments
Labels
bug gitlab-legacy Refers to issues and PRs migrated from GitLab (issue creation date = migration date).

Comments

@adku1173
Member

adku1173 commented Apr 9, 2024

In my case, the performance of beamforming algorithms is unnecessarily slow with the current Numba settings.

Numba oversubscribes the available threads if their number is not set manually.

From the Numba documentation https://numba.pydata.org/numba-doc/dev/user/threading-layer.html#setting-the-number-of-threads:

In this example, suppose the machine we are running on has 8 cores (so numba.config.NUMBA_NUM_THREADS would be 8). Suppose we want to run some code with @njit(parallel=True), but we also want to run our code concurrently in 4 different processes. With the default number of threads, each Python process would run 8 threads, for a total of 4 * 8 = 32 threads, which is oversubscription for our 8 cores. We should rather limit each process to 2 threads, so that the total will be 4 * 2 = 8, which matches our number of physical cores.

Consider the following example:


import numba
# numba.set_num_threads(1)  # uncomment to limit numba to a single thread
from os import path
from acoular import __file__ as bpath, MicGeom, WNoiseGenerator, PointSource,\
 Mixer, WriteH5, TimeSamples, PowerSpectra, RectGrid, SteeringVector,\
 BeamformerCleansc, config
config.global_caching = 'none'
from time import time
# set up the parameters
sfreq = 51200 
duration = 1
nsamples = duration*sfreq
micgeofile = path.join(path.split(bpath)[0],'xml','array_64.xml')
h5savefile = 'three_sources.h5'

# generate test data, in real life this would come from an array measurement
mg = MicGeom( from_file=micgeofile )
n1 = WNoiseGenerator( sample_freq=sfreq, numsamples=nsamples, seed=1 )
n2 = WNoiseGenerator( sample_freq=sfreq, numsamples=nsamples, seed=2, rms=0.7 )
n3 = WNoiseGenerator( sample_freq=sfreq, numsamples=nsamples, seed=3, rms=0.5 )
p1 = PointSource( signal=n1, mics=mg,  loc=(-0.1,-0.1,0.3) )
p2 = PointSource( signal=n2, mics=mg,  loc=(0.15,0,0.3) )
p3 = PointSource( signal=n3, mics=mg,  loc=(0,0.1,0.3) )
pa = Mixer( source=p1, sources=[p2,p3] )
wh5 = WriteH5( source=pa, name=h5savefile )
wh5.save()

# analyze the data and generate map

ts = TimeSamples( name=h5savefile )
ps = PowerSpectra( time_data=ts, block_size=128, window='Hanning' )

rg = RectGrid( x_min=-0.2, x_max=0.2, y_min=-0.2, y_max=0.2, z=0.3, \
increment=0.02 )
st = SteeringVector(grid=rg, mics=mg)
bb = BeamformerCleansc( freq_data=ps, steer=st)

t = time()
pm = bb.synthetic( 8000, 0 )
print(time()-t)
print("Threading layer chosen: %s" % numba.threading_layer())

On my machine (DELL XPS 13, 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz, 16GB RAM, 8 cores, 16 threads):

Numba (1 thread): 7.5 seconds

Numba (default): 273.3 seconds

Threading layer chosen: omp

Even if I set the number of threads to 1, Numba still utilizes all available cores!

@adku1173 added the bug and gitlab-legacy labels Apr 9, 2024
@esarradj
Member

esarradj commented Apr 9, 2024

Would it not be possible to control this via NUMBA_NUM_THREADS?
Also, I get different timings (8-thread i7, XPS 13, 16 GB):

| Kind | Timing (s) |
| --- | --- |
| 1 proc., unlimited threads | 3.6 |
| 2 proc., unlimited threads | 2 × 4.2 |
| 4 proc., unlimited threads | 4 × 15.9 |
| 8 proc., unlimited threads | 8 × 47.6 |
| 1 proc., 1 numba thread | 6.2 |
| 2 proc., 1 numba thread | 2 × 6.4 |
| 4 proc., 1 numba thread | 4 × 6.9 |
| 8 proc., 1 numba thread | 8 × 11.1 |

Strange that you see all cores busy with numba.set_num_threads(1); I got only one.
Because numba implements only part of the algorithm, I am afraid it is not very clear what really happens here.

@adku1173
Member Author

adku1173 commented Apr 9, 2024

Yes, in my case NUMBA_NUM_THREADS sets the number of threads per process!

If I run NUMBA_NUM_THREADS=1 python timing.py , htop cmd shows:

[screenshot from 2023-03-13: htop output]

@gherold
Contributor

gherold commented Apr 9, 2024

11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz × 4, 15.4 GiB RAM

numba 0.56.3

with numba.set_num_threads(1):

3.0756850242614746
Threading layer chosen: tbb

w/o numba.set_num_threads(1) or with numba.set_num_threads(8), audible coil whining:

1.5986864566802979
Threading layer chosen: tbb

with numba.set_num_threads(4), fluttering coil whining:

1.7498419284820557
Threading layer chosen: tbb

@sjekosch
Contributor

sjekosch commented Apr 9, 2024

Without numba.set_num_threads(1):
4.015026330947876
Threading layer chosen: tbb
numba.get_num_threads() : 4

With numba.set_num_threads(1):
5.576324939727783
Threading layer chosen: tbb
numba.get_num_threads() : 1

Dell XPS i5

@esarradj
Member

esarradj commented Apr 9, 2024

compute3: Intel(R) Xeon(R) CPU E5-2630L v4 @ 1.80GHz, 20 cores

| Kind | Timing (s) |
| --- | --- |
| 1 proc., unlimited threads | 4.3 |
| 1 proc., 2 threads | 4.7 |
| 1 proc., 1 thread | 5.3 |
| 10 proc., 1 thread | 10 × 6.0–6.6 |
| 10 proc., 2 threads | 10 × 5.1–5.3 |
| 20 proc., 1 thread | 20 × 7.3–7.7 |
| 20 proc., 2 threads | 20 × 6.7–7.2 |
| 40 proc., 1 thread | 40 × 11.5–12.2 |

@adku1173
Member Author

adku1173 commented Apr 9, 2024

compute4: Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, 48 cores

Docker container (latest Acoular): default threading layer omp

| Kind | Timing (s) |
| --- | --- |
| unlimited threads | 103.3 |
| 1 thread | 6.4 |
| 2 threads | 5.1 |

Standard console: default threading layer tbb

| Kind | Timing (s) |
| --- | --- |
| unlimited threads | 3.7 |
| 1 thread | 5.8 |
| 2 threads | 4.5 |

@artpelling
Member

artpelling commented Apr 9, 2024

DELL XPS 13, 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, 16GB RAM

  • all threads:
267.18502497673035
Threading layer chosen: omp
  • 1 thread:
4.87727689743042
Threading layer chosen: omp

@esarradj
Member

esarradj commented Apr 9, 2024

Would it be a solution here to enforce tbb upon start of Acoular?
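A minimal sketch of what enforcing could look like, assuming Numba's documented config API (illustration only, not tested here): a layer can be requested before the first parallel function compiles, either via the NUMBA_THREADING_LAYER environment variable or in code:

from numba import config

# request tbb before the first @njit(parallel=True) function is compiled;
# numba raises at compile time if the tbb layer cannot be loaded
config.THREADING_LAYER = 'tbb'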

@adku1173
Member Author

adku1173 commented Apr 9, 2024

I think we cannot really enforce the use of tbb. The problem in my case was that the libtbb-dev library was not installed, and installing tbb with conda or pip is not reliable (Numba issue). However, we should at least throw a warning if the threading layer is omp and tbb can't be chosen.

@esarradj
Member

esarradj commented Apr 9, 2024

How would we find out if tbb is used before running some code? The numba docs say that threading_layer() will tell us only after some parallel code has run.
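A possible trick (my own sketch, not from the numba docs): run a trivial parallel function once so the layer gets initialized, then query it:

import numba

@numba.njit(parallel=True)
def _touch(n):
    # trivial parallel loop whose only purpose is to load the threading layer
    s = 0.0
    for i in numba.prange(n):
        s += i
    return s

_touch(10)
print(numba.threading_layer())  # now valid, e.g. 'tbb', 'omp' or 'workqueue'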

@artpelling
Member

artpelling commented Apr 9, 2024

It would be nice to check this already when acoular is installed; however, I don't know how to do this cleanly atm. For now, we could at least check whether libtbb-dev is installed by doing something like

import subprocess

dpkg = subprocess.Popen(('dpkg', '-l'), stdout=subprocess.PIPE)
result = str(subprocess.check_output(('grep', 'libtbb'), stdin=dpkg.stdout))
version = [s.split(':')[0] for s in result.split() if 'libtbb' in s]
if not version:  # `version is []` is always False, so test for an empty list
    print('warning')
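A more portable probe (my suggestion, untested) could ask the dynamic loader directly via ctypes instead of dpkg, which only exists on Debian-based systems:

import ctypes.util

# find_library searches the standard loader paths, so no dpkg is needed
if ctypes.util.find_library('tbb') is None:
    print('warning: no libtbb found; numba will fall back to omp or workqueue')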

@esarradj
Member

esarradj commented Apr 9, 2024

I would absolutely avoid interfering with the install process. It is already complicated enough to have a solution that works for pip + conda with 5 different Python versions and 3 different OSes.

@esarradj
Member

esarradj commented Apr 9, 2024

Do we know if this still happens with more recent Numba versions? This Numba issue hints that the problem with no accessible TBB is resolved.

With Acoular 23.6 out, we can indeed install recent versions of all packages (except for sklearn).

@esarradj
Member

esarradj commented Apr 9, 2024

The problem seems to be the versions installed via pip. With these, tbb, omp and workqueue all use all cores despite the setting to use just 1 thread.

Moreover, numba from PyPI does not depend on tbb. Therefore, omp is always used on a fresh install. If no limit on the number of threads is set, this means it takes forever.

Moreover, the numba/numpy builds from PyPI are much slower than the conda versions. They do not use any speedups from MKL and the like. While there is intel-numpy on PyPI, it is not possible to install it alongside acoular.

@esarradj
Member

esarradj commented Apr 9, 2024

Possible workarounds:

  • tbb as extra dependency for Acoular, enforcing either tbb or workqueue at start
  • warning if no MKL speedups available

@esarradj
Member

esarradj commented Apr 9, 2024

As of today I am no longer able to reproduce the problem.
After the problem disappeared from my pip-created acoular environment, I created a new one from scratch:
conda create -n ac311a python=3.11
pip install acoular
and the runtime with unlimited threads (8 in this case) was 3.1 s (1 thread: 12.1 s).

I suspect that some OMP component from the system is used that was updated in the meantime. This is exceptionally hard to debug.

@esarradj
Member

esarradj commented Apr 9, 2024

I was still not able to reproduce this, even with a fresh install of Anaconda on compute2. However, by changing to BeamformerCleansc and a coarser grid, the problem arises again.
**Workaround:** set the environment variable OPENBLAS_NUM_THREADS=1. This is recommended by OpenBLAS themselves. With pip it is always OpenBLAS and not MKL that gets installed, and it is responsible for the thread overloading. Unfortunately there is no way to set the OpenBLAS number of threads after importing numpy.
One possible solution could be to use the threadpoolctl module, but I am not sure how this can be implemented easily 😦
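A minimal sketch of both variants, assuming the threadpoolctl package (pip install threadpoolctl) for the second one:

# the environment variable must be set before numpy is imported anywhere
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import numpy as np  # OpenBLAS now initializes with a single thread

# alternatively, threadpoolctl can cap an already-initialized OpenBLAS
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1, user_api='blas'):
    pass  # run the beamforming code here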

@adku1173
Member Author

adku1173 commented Apr 9, 2024

I did a fresh install based on mamba:

mamba create -n test python=3.11
mamba install -c acoular acoular

Resulting in:

/home/kujawski/mambaforge/envs/test/lib/python3.11/site-packages/numba/np/ufunc/parallel.py:371: NumbaWarning: The TBB threading layer requires TBB version 2021 update 6 or later i.e., TBB_INTERFACE_VERSION >= 12060. Found TBB_INTERFACE_VERSION = 12050. The TBB threading layer is disabled.
  warnings.warn(problem)
5.321517467498779
Threading layer chosen: omp

And a fresh pip env install with:

python3 -m venv .venv
source .venv/bin/activate
pip install acoular

Both resulting in:

NumbaWarning: The TBB threading layer requires TBB version 2021 update 6 or later i.e., TBB_INTERFACE_VERSION >= 12060. Found TBB_INTERFACE_VERSION = 12050. The TBB threading layer is disabled.
  warnings.warn(problem)
5.542165279388428
Threading layer chosen: omp

I can even install the tbb package with pip install tbb
and find the corresponding libtbb.so with find / -name libtbb.so 2>/dev/null in /home/kujawski/Desktop/.venv/lib/libtbb.so.
However, numba still ignores this and uses the system library at /usr/lib/x86_64-linux-gnu/libtbb.so, which is version 12050 according to

echo /usr/lib/x86_64-linux-gnu/libtbb.so | python3 -c 'import ctypes;print(ctypes.CDLL(input()).TBB_runtime_interface_version())'

If I install tbb before installing Acoular (and Numba), the environment specific libtbb.so is used.

@esarradj
Member

esarradj commented Apr 9, 2024

As I wrote above, it seems that not OMP but OpenBLAS is the real problem.

Maybe the problem vanished miraculously due to the changes introduced in !227? The overloading problem resulted from OpenBLAS calls inside Numba-compiled parallel multithreaded functions. If the functions no longer call OpenBLAS, the problem is gone. This is just a hypothesis and hard to analyze. Does the problem return if Acoular prior to !227 is used?

@adku1173
Member Author

adku1173 commented Apr 9, 2024

My comment was not addressing OMP vs. OpenBLAS.

It is still the case that Numba obviously ignores the tbb library installed in the corresponding environment. Yes, OPENBLAS_NUM_THREADS=1 solves the issue for OMP. However, in my case the problem has not vanished. Without setting OPENBLAS_NUM_THREADS=1, thread overloading still occurs!

@adku1173
Member Author

adku1173 commented Apr 9, 2024

And now I understand that just OpenBLAS matters and not the threading layer...

@esarradj
Member

esarradj commented Apr 9, 2024

I see. I was thinking of the following 'solution':

  1. First thing when loading Acoular, we check if Numpy is already loaded. I am not sure how, but there is probably a way to do this.
  2. If Numpy is not loaded, set OPENBLAS_NUM_THREADS=1. End.
  3. If Numpy is loaded, check if it is linked against OpenBLAS. This can be done with numpy.show_config(), but it is ugly.
  4. If OpenBLAS is linked and OPENBLAS_NUM_THREADS is not 1, either set NUMBA_NUM_THREADS=1 and print an informative warning, or don't set NUMBA_NUM_THREADS and print a warning explaining to set OPENBLAS_NUM_THREADS=1 before the program starts.

@adku1173
Member Author

adku1173 commented Apr 9, 2024

This sounds reasonable.

  1. can be achieved with the sys.modules mapping:

From the docs: This is a dictionary that maps module names to modules which have already been loaded. This can be manipulated to force reloading of modules and other tricks. However, replacing the dictionary will not necessarily work as expected and deleting essential items from the dictionary may cause Python to fail. If you want to iterate over this global dictionary always use sys.modules.copy() or tuple(sys.modules) to avoid exceptions as its size may change during iteration as a side effect of code or activity in other threads.

import os
import sys

numpy = sys.modules.get('numpy')  # sys.modules is a dict, not callable
if numpy is None:
    os.environ['OPENBLAS_NUM_THREADS'] = '1'  # environment values must be strings
else:
    # mode argument is only available from numpy version 1.26 onwards :(
    config = numpy.show_config(mode='dicts')['Build Dependencies']
...
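A hedged sketch of how step 4 could continue from the config dict above (key names assume numpy >= 1.26; the warning text is just illustrative):

blas_name = config.get('blas', {}).get('name', '')
if 'openblas' in blas_name.lower() and os.environ.get('OPENBLAS_NUM_THREADS') != '1':
    import warnings
    # warn instead of silently changing settings the user may rely on
    warnings.warn('numpy is linked against OpenBLAS; set OPENBLAS_NUM_THREADS=1 '
                  'before starting Python to avoid thread oversubscription.')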

@adku1173
Member Author

adku1173 commented Apr 9, 2024

Just for completeness: Numpy docs say (https://numpy.org/install/):

  • The NumPy wheels on PyPI, which is what pip installs, are built with OpenBLAS. The OpenBLAS libraries are included in the wheel. This makes the wheel larger, and if a user installs (for example) SciPy as well, they will now have two copies of OpenBLAS on disk.

  • In the conda defaults channel, NumPy is built against Intel MKL. MKL is a separate package that will be installed in the users’ environment when they install NumPy.

  • In the conda-forge channel, NumPy is built against a dummy “BLAS” package. When a user installs NumPy from conda-forge, that BLAS package then gets installed together with the actual library - this defaults to OpenBLAS, but it can also be MKL (from the defaults channel), or even BLIS or reference BLAS.
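Whichever channel numpy came from, threadpoolctl can report at runtime which BLAS was actually loaded. A small sketch, assuming threadpoolctl is installed:

import numpy  # noqa: F401 -- numpy must be imported so its BLAS library is loaded
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # each entry describes one native thread pool, e.g. OpenBLAS, MKL or OpenMP
    print(pool['user_api'], pool['internal_api'], pool['num_threads'])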

@esarradj
Member

esarradj commented Apr 9, 2024

I could try to hack something tomorrow.

@esarradj
Member

esarradj commented Apr 9, 2024

Resolved by !236

@adku1173 closed this as completed Apr 9, 2024