
Thread overloading with OpenBLAS #187

Closed
adku1173 opened this issue Apr 9, 2024 · 26 comments
Labels
bug gitlab-legacy Refers to issues and PRs migrated from GitLab (issue creation date = migration date).

Comments

@adku1173
Member

adku1173 commented Apr 9, 2024

In my case, the performance of beamforming algorithms is unnecessarily slow with the current Numba settings.

Numba oversubscribes the available threads if their number is not set manually.

From the Numba documentation https://numba.pydata.org/numba-doc/dev/user/threading-layer.html#setting-the-number-of-threads:

In this example, suppose the machine we are running on has 8 cores (so numba.config.NUMBA_NUM_THREADS would be 8). Suppose we want to run some code with @njit(parallel=True), but we also want to run our code concurrently in 4 different processes. With the default number of threads, each Python process would run 8 threads, for a total of 4 * 8 = 32 threads, which is oversubscription for our 8 cores. We should rather limit each process to 2 threads, so that the total will be 4 * 2 = 8, which matches our number of physical cores.

Consider the following example:


import numba
# numba.set_num_threads(1)  # uncomment to limit numba to a single thread
from os import path
from acoular import __file__ as bpath, MicGeom, WNoiseGenerator, PointSource,\
 Mixer, WriteH5, TimeSamples, PowerSpectra, RectGrid, SteeringVector,\
 BeamformerCleansc, config
config.global_caching = 'none'
from time import time
# set up the parameters
sfreq = 51200 
duration = 1
nsamples = duration*sfreq
micgeofile = path.join(path.split(bpath)[0],'xml','array_64.xml')
h5savefile = 'three_sources.h5'

# generate test data, in real life this would come from an array measurement
mg = MicGeom( from_file=micgeofile )
n1 = WNoiseGenerator( sample_freq=sfreq, numsamples=nsamples, seed=1 )
n2 = WNoiseGenerator( sample_freq=sfreq, numsamples=nsamples, seed=2, rms=0.7 )
n3 = WNoiseGenerator( sample_freq=sfreq, numsamples=nsamples, seed=3, rms=0.5 )
p1 = PointSource( signal=n1, mics=mg,  loc=(-0.1,-0.1,0.3) )
p2 = PointSource( signal=n2, mics=mg,  loc=(0.15,0,0.3) )
p3 = PointSource( signal=n3, mics=mg,  loc=(0,0.1,0.3) )
pa = Mixer( source=p1, sources=[p2,p3] )
wh5 = WriteH5( source=pa, name=h5savefile )
wh5.save()

# analyze the data and generate map

ts = TimeSamples( name=h5savefile )
ps = PowerSpectra( time_data=ts, block_size=128, window='Hanning' )

rg = RectGrid( x_min=-0.2, x_max=0.2, y_min=-0.2, y_max=0.2, z=0.3, \
increment=0.02 )
st = SteeringVector(grid=rg, mics=mg)
bb = BeamformerCleansc( freq_data=ps, steer=st)

t = time()
pm = bb.synthetic( 8000, 0 )
print(time()-t)
print("Threading layer chosen: %s" % numba.threading_layer())

On my machine (DELL XPS 13, 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz, 16GB RAM, 8 cores, 16 threads):

Numba (1 thread): 7.5 seconds

Numba (default): 273.3 seconds

Threading layer chosen: omp

Even if I set the number of threads to 1, Numba still utilizes all available cores!

@adku1173 added the bug and gitlab-legacy labels Apr 9, 2024
@esarradj
Member

esarradj commented Apr 9, 2024

Would it not be possible to control this via NUMBA_NUM_THREADS?
Also, I get different timings (8-thread i7, XPS 13, 16 GB):

| Kind | Timing (s) |
| --- | --- |
| 1 proc., unlimited threads | 3.6 |
| 2 proc., unlimited threads | 2 × 4.2 |
| 4 proc., unlimited threads | 4 × 15.9 |
| 8 proc., unlimited threads | 8 × 47.6 |
| 1 proc., 1 numba thread | 6.2 |
| 2 proc., 1 numba thread | 2 × 6.4 |
| 4 proc., 1 numba thread | 4 × 6.9 |
| 8 proc., 1 numba thread | 8 × 11.1 |

Strange that you see all cores busy with numba.set_num_threads(1); I got only one.
Because numba implements only part of the algorithm, I am afraid it is not very clear what really happens here.

@adku1173
Member Author

adku1173 commented Apr 9, 2024

Yes, in my case NUMBA_NUM_THREADS sets the number of threads per process!

If I run NUMBA_NUM_THREADS=1 python timing.py , htop cmd shows:

[screenshot from 2023-03-13: htop output]

@gherold
Contributor

gherold commented Apr 9, 2024

11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz × 4, 15.4 GiB RAM

numba 0.56.3

with numba.set_num_threads(1):

3.0756850242614746
Threading layer chosen: tbb

w/o numba.set_num_threads(1) or with numba.set_num_threads(8), audible coil whining:

1.5986864566802979
Threading layer chosen: tbb

with numba.set_num_threads(4), fluttering coil whining:

1.7498419284820557
Threading layer chosen: tbb

@sjekosch
Contributor

sjekosch commented Apr 9, 2024

Without numba.set_num_threads(1):
4.015026330947876
Threading layer chosen: tbb
numba.get_num_threads() : 4

With numba.set_num_threads(1):
5.576324939727783
Threading layer chosen: tbb
numba.get_num_threads() : 1

Dell XPS i5

@esarradj
Member

esarradj commented Apr 9, 2024

compute3: Intel(R) Xeon(R) CPU E5-2630L v4 @ 1.80GHz, 20 cores

| Kind | Timing (s) |
| --- | --- |
| 1 proc., unlimited threads | 4.3 |
| 1 proc., 2 threads | 4.7 |
| 1 proc., 1 thread | 5.3 |
| 10 proc., 1 thread | 10 × 6.0–6.6 |
| 10 proc., 2 threads | 10 × 5.1–5.3 |
| 20 proc., 1 thread | 20 × 7.3–7.7 |
| 20 proc., 2 threads | 20 × 6.7–7.2 |
| 40 proc., 1 thread | 40 × 11.5–12.2 |

@adku1173
Member Author

adku1173 commented Apr 9, 2024

compute4: Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, 48 cores

Docker container (latest Acoular): default threading layer omp

| Kind | Timing (s) |
| --- | --- |
| unlimited threads | 103.3 |
| 1 thread | 6.4 |
| 2 threads | 5.1 |

Standard console: default threading layer tbb

| Kind | Timing (s) |
| --- | --- |
| unlimited threads | 3.7 |
| 1 thread | 5.8 |
| 2 threads | 4.5 |

@artpelling
Member

artpelling commented Apr 9, 2024

DELL XPS 13, 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, 16GB RAM

  • all threads:
267.18502497673035
Threading layer chosen: omp
  • 1 thread:
4.87727689743042
Threading layer chosen: omp

@esarradj
Member

esarradj commented Apr 9, 2024

Would it be a solution here to enforce tbb upon start of Acoular?
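A minimal sketch of what enforcing could look like, assuming Numba's documented config API (illustration only, not tested here): a layer can be requested before the first parallel function compiles, either via the NUMBA_THREADING_LAYER environment variable or in code:

from numba import config

# request tbb before the first @njit(parallel=True) function is compiled;
# numba raises at compile time if the tbb layer cannot be loaded
config.THREADING_LAYER = 'tbb'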

@adku1173
Member Author

adku1173 commented Apr 9, 2024

I think we cannot really enforce the use of tbb. The problem in my case was that the libtbb-dev library was not installed, and installing tbb with conda or pip is not reliable (Numba issue). However, we should at least throw a warning if the threading layer is omp and tbb can't be chosen.

@esarradj
Member

esarradj commented Apr 9, 2024

How would we find out if tbb is used before running some code? The numba docs say that threading_layer() will tell us only after some parallel code has run.
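A possible trick (my own sketch, not from the numba docs): run a trivial parallel function once so the layer gets initialized, then query it:

import numba

@numba.njit(parallel=True)
def _touch(n):
    # trivial parallel loop whose only purpose is to load the threading layer
    s = 0.0
    for i in numba.prange(n):
        s += i
    return s

_touch(10)
print(numba.threading_layer())  # now valid, e.g. 'tbb', 'omp' or 'workqueue'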

@artpelling
Member

artpelling commented Apr 9, 2024

It would be nice to check this already when acoular is installed; however, I don't know how to do this cleanly atm. For now, we could at least check whether libtbb-dev is installed by doing something like

import subprocess

dpkg = subprocess.Popen(('dpkg', '-l'), stdout=subprocess.PIPE)
result = str(subprocess.check_output(('grep', 'libtbb'), stdin=dpkg.stdout))
version = [s.split(':')[0] for s in result.split() if 'libtbb' in s]
if not version:  # `version is []` is always False, so test for an empty list
    print('warning')
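A more portable probe (my suggestion, untested) could ask the dynamic loader directly via ctypes instead of dpkg, which only exists on Debian-based systems:

import ctypes.util

# find_library searches the standard loader paths, so no dpkg is needed
if ctypes.util.find_library('tbb') is None:
    print('warning: no libtbb found; numba will fall back to omp or workqueue')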

@esarradj
Member

esarradj commented Apr 9, 2024

I would absolutely avoid interfering with the install process. It is already complicated enough to have a solution that works for pip + conda with 5 different Python versions and 3 different OSes.

@esarradj
Member

esarradj commented Apr 9, 2024

Do we know if this still happens with more recent Numba versions? This Numba issue hints that the problem with no accessible TBB is resolved.

With Acoular 23.6 out, we can indeed install recent versions of all packages (except for sklearn).

@esarradj
Member

esarradj commented Apr 9, 2024

The problem seems to be the versions installed via pip. With these, tbb, omp and workqueue all use all cores despite the setting to use just 1 thread.

Moreover, numba from PyPI does not depend on tbb. Therefore, omp is always used on a fresh install. If no limit on the number of threads is set, this means it takes forever.

Moreover, the numba/numpy builds from PyPI are much slower than the conda versions. They do not use any speedups from MKL and the like. While there is intel-numpy on PyPI, it is not possible to install it alongside acoular.

@esarradj
Member

esarradj commented Apr 9, 2024

Possible workarounds:

  • tbb as extra dependency for Acoular, enforcing either tbb or workqueue at start
  • warning if no MKL speedups available

@esarradj
Member

esarradj commented Apr 9, 2024

As of today I am no longer able to reproduce the problem.
After the problem disappeared from my pip-created acoular environment, I created a new one from scratch:
conda create -n ac311a python=3.11
pip install acoular
and the runtime with unlimited threads (8 in this case) was 3.1 s (1 thread: 12.1 s).

I suspect that some OMP component from the system is used that was updated in the meantime. This is exceptionally hard to debug.

@esarradj
Member

esarradj commented Apr 9, 2024

I was still not able to reproduce this, even with a fresh install of Anaconda on compute2. However, by changing to BeamformerCleansc and a coarser grid, the problem arises again.
**Workaround:** set the environment variable OPENBLAS_NUM_THREADS=1. This is recommended by OpenBLAS themselves. With pip it is always OpenBLAS and not MKL that gets installed, and it is responsible for the thread overloading. Unfortunately there is no way to set the OpenBLAS number of threads after importing numpy.
One possible solution could be to use the threadpoolctl module, but I am not sure how this can be implemented easily 😦
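A minimal sketch of both variants, assuming the threadpoolctl package (pip install threadpoolctl) for the second one:

# the environment variable must be set before numpy is imported anywhere
import os
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import numpy as np  # OpenBLAS now initializes with a single thread

# alternatively, threadpoolctl can cap an already-initialized OpenBLAS
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1, user_api='blas'):
    pass  # run the beamforming code here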

@adku1173
Member Author

adku1173 commented Apr 9, 2024

I did a fresh install based on mamba:

mamba create -n test python=3.11
mamba install -c acoular acoular

Resulting in:

/home/kujawski/mambaforge/envs/test/lib/python3.11/site-packages/numba/np/ufunc/parallel.py:371: NumbaWarning: The TBB threading layer requires TBB version 2021 update 6 or later i.e., TBB_INTERFACE_VERSION >= 12060. Found TBB_INTERFACE_VERSION = 12050. The TBB threading layer is disabled.
  warnings.warn(problem)
5.321517467498779
Threading layer chosen: omp

And a fresh pip env install with:

python3 -m venv .venv
source .venv/bin/activate
pip install acoular

Both resulting in:

NumbaWarning: The TBB threading layer requires TBB version 2021 update 6 or later i.e., TBB_INTERFACE_VERSION >= 12060. Found TBB_INTERFACE_VERSION = 12050. The TBB threading layer is disabled.
  warnings.warn(problem)
5.542165279388428
Threading layer chosen: omp

I can even install the tbb package with pip install tbb
and find the corresponding libtbb.so with find / -name libtbb.so 2>/dev/null in /home/kujawski/Desktop/.venv/lib/libtbb.so.
However, numba still ignores this and uses the system library at /usr/lib/x86_64-linux-gnu/libtbb.so, which is version 12050 according to

echo /usr/lib/x86_64-linux-gnu/libtbb.so | python3 -c 'import ctypes;print(ctypes.CDLL(input()).TBB_runtime_interface_version())'

If I install tbb before installing Acoular (and Numba), the environment specific libtbb.so is used.

@esarradj
Member

esarradj commented Apr 9, 2024

As I wrote above, it seems that not OMP but OpenBLAS is the real problem.

Maybe the problem vanished miraculously due to the changes introduced in !227? The overloading problem resulted from OpenBLAS calls inside Numba-compiled parallel multithreaded functions. If the functions no longer call OpenBLAS, the problem is gone. This is just a hypothesis and hard to analyze. Does the problem return if Acoular prior to !227 is used?

@adku1173
Member Author

adku1173 commented Apr 9, 2024

My comment was not addressing OMP vs. OpenBLAS.

It is still the case that Numba obviously ignores the tbb library installed in the corresponding environment. Yes, OPENBLAS_NUM_THREADS=1 solves the issue for OMP. However, in my case the problem has not vanished. Without setting OPENBLAS_NUM_THREADS=1, thread overloading still occurs!

@adku1173
Member Author

adku1173 commented Apr 9, 2024

And now I understand that just OpenBLAS matters and not the threading layer...

@esarradj
Member

esarradj commented Apr 9, 2024

I see. I was thinking of the following 'solution':

  1. First thing when loading Acoular, we check if Numpy is already loaded. I am not sure how, but there is probably a way to do this.
  2. If Numpy is not loaded, set OPENBLAS_NUM_THREADS=1. End.
  3. If Numpy is loaded, check if it is linked against OpenBLAS. This can be done with numpy.show_config(), but it is ugly.
  4. If OpenBLAS is linked and OPENBLAS_NUM_THREADS is not 1, either set NUMBA_NUM_THREADS=1 and print an informative warning, or don't set NUMBA_NUM_THREADS and print a warning explaining to set OPENBLAS_NUM_THREADS=1 before the program starts.

@adku1173
Member Author

adku1173 commented Apr 9, 2024

This sounds reasonable.

  1. can be achieved with the sys.modules mapping:

From the docs: This is a dictionary that maps module names to modules which have already been loaded. This can be manipulated to force reloading of modules and other tricks. However, replacing the dictionary will not necessarily work as expected and deleting essential items from the dictionary may cause Python to fail. If you want to iterate over this global dictionary always use sys.modules.copy() or tuple(sys.modules) to avoid exceptions as its size may change during iteration as a side effect of code or activity in other threads.

import os
import sys

numpy = sys.modules.get('numpy')  # sys.modules is a dict, not callable
if numpy is None:
    os.environ['OPENBLAS_NUM_THREADS'] = '1'  # environment values must be strings
else:
    # mode argument is only available from numpy version 1.26 onwards :(
    config = numpy.show_config(mode='dicts')['Build Dependencies']
...
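A hedged sketch of how step 4 could continue from the config dict above (key names assume numpy >= 1.26; the warning text is just illustrative):

blas_name = config.get('blas', {}).get('name', '')
if 'openblas' in blas_name.lower() and os.environ.get('OPENBLAS_NUM_THREADS') != '1':
    import warnings
    # warn instead of silently changing settings the user may rely on
    warnings.warn('numpy is linked against OpenBLAS; set OPENBLAS_NUM_THREADS=1 '
                  'before starting Python to avoid thread oversubscription.')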

@adku1173
Member Author

adku1173 commented Apr 9, 2024

Just for completeness: Numpy docs say (https://numpy.org/install/):

  • The NumPy wheels on PyPI, which is what pip installs, are built with OpenBLAS. The OpenBLAS libraries are included in the wheel. This makes the wheel larger, and if a user installs (for example) SciPy as well, they will now have two copies of OpenBLAS on disk.

  • In the conda defaults channel, NumPy is built against Intel MKL. MKL is a separate package that will be installed in the users’ environment when they install NumPy.

  • In the conda-forge channel, NumPy is built against a dummy “BLAS” package. When a user installs NumPy from conda-forge, that BLAS package then gets installed together with the actual library - this defaults to OpenBLAS, but it can also be MKL (from the defaults channel), or even BLIS or reference BLAS.
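Whichever channel numpy came from, threadpoolctl can report at runtime which BLAS was actually loaded. A small sketch, assuming threadpoolctl is installed:

import numpy  # noqa: F401 -- numpy must be imported so its BLAS library is loaded
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # each entry describes one native thread pool, e.g. OpenBLAS, MKL or OpenMP
    print(pool['user_api'], pool['internal_api'], pool['num_threads'])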

@esarradj
Member

esarradj commented Apr 9, 2024

I could try to hack something tomorrow.

@esarradj
Member

esarradj commented Apr 9, 2024

Resolved by !236

@adku1173 closed this as completed Apr 9, 2024