
GIL release #14

Closed
robbmcleod opened this issue Jan 16, 2018 · 8 comments

@robbmcleod

Nominally, in C code that takes some time to run, one can release the GIL with:

Py_BEGIN_ALLOW_THREADS
/* long-running code that makes no Python C-API calls */
Py_END_ALLOW_THREADS

It looks like it would be pretty trivial to enclose your for loops in these macros; is this something you're interested in?

@astrofrog
Owner

@robbmcleod - good idea, do you want to give it a try and see if it leads to any improvements in performance?

@robbmcleod
Author

Initially it didn't help when I tried multi-threading with concurrent.futures, as too much time is spent in the Python wrapper. It won't hurt anything, though, and if the Python overhead can be reduced it will help anyone doing asynchronous work (like me, with a live view in pyqtgraph).
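
For reference, this is roughly the asynchronous pattern I have in mind (a sketch with illustrative names and chunking; it only pays off once the C loop runs with the GIL released):

import fast_histogram
from concurrent.futures import ThreadPoolExecutor

def histograms_threaded(frames, bins, limits, workers=4):
    # With the GIL released inside the C loop these calls can overlap;
    # otherwise the threads serialize on the interpreter lock.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fast_histogram.histogram1d, f, bins, limits)
                   for f in frames]
        return [fut.result() for fut in futures]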

Those np.ascontiguousarray calls are really quite expensive. What I can try, then, is applying those calls only conditionally, checking x.flags.c_contiguous first; the vast majority of NumPy arrays are contiguous. Similarly, we don't want to cast unnecessarily. NumPy assumes in those calls that you want a copy and that you're not modifying the input array, so if a copy isn't needed, the call should be skipped.
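
Something along these lines, say (a hypothetical helper; the check is just the array flags and dtype):

import numpy as np

def ensure_c_f64(x):
    # Only copy when the input is not already a C-contiguous float64
    # array; most NumPy arrays in the wild already are.
    if x.flags.c_contiguous and x.dtype == np.float64:
        return x
    return np.ascontiguousarray(x, dtype=np.float64)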

Also, we could explicitly create npy_float32 and npy_float64 branches. The easiest way to do this would be to keep two copies of each function; I suspect that's not too much maintenance overhead. Then map everything from npy_int32 up to npy_float64, and all the smaller dtypes to npy_float32.
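
As a sketch of that mapping (a hypothetical wrapper; _histogram1d and _histogram1d_f32 stand in for the two C entry points, as called in the test code below):

import numpy as np
from fast_histogram._histogram_core import _histogram1d, _histogram1d_f32

_SMALL_DTYPES = (np.float32, np.float16, np.int16, np.uint16, np.int8, np.uint8)

def histogram1d_typed(x, bins, limits):
    # int32 and anything wider goes through the float64 kernel;
    # the smaller dtypes go through the float32 kernel.
    if x.dtype in _SMALL_DTYPES:
        return _histogram1d_f32(np.ascontiguousarray(x, dtype=np.float32),
                                bins, *limits)
    return _histogram1d(np.ascontiguousarray(x, dtype=np.float64),
                        bins, *limits)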

@robbmcleod
Author

robbmcleod commented Jan 17, 2018

Releasing the GIL isn't useful here, as the histogram calculation is too fast for the context switches to be worthwhile. My experience with NumExpr is that even with p-threads you're looking at about 2–5 µs per thread just to pass the thread barrier.

Bypassing the Python checks to call the C-API directly is the best speed-up I found:

Python call histogram average: 7.928e-06 +/- 4.195e-08 s
C-API call histogram f32 average: 5.724e-06 +/- 2.548e-08 s
C-API call histogram f64 average: 5.795e-06 +/- 4.494e-08 s

The only other thing that strikes me as a possible optimization would be SIMD-vectorizing the loops. Neither MSVC nor GCC is vectorizing the loops as they stand now (which is fairly obvious both from the uncompelling float32 benchmarks and from the compiler messages). The computation ix = (int)((x[i] - xmin) * normx * fnx); could be vectorized, whereas the accumulations would have to be unrolled manually to the SIMD vector length. I don't know that it's worth the bother to implement.
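
For illustration, here is the vectorizable part expressed in NumPy terms (the C version would rely on intrinsics or compiler auto-vectorization; np.add.at stands in for the scatter-style accumulation that resists straight SIMD stores):

import numpy as np

def histogram1d_numpy_sketch(x, xmin, xmax, nx):
    normx = 1.0 / (xmax - xmin)
    fnx = float(nx)
    # Element-wise arithmetic: maps cleanly onto SIMD lanes.
    ix = ((x - xmin) * normx * fnx).astype(np.intp)
    # Scatter accumulation: the part that would need manual unrolling.
    counts = np.zeros(nx, dtype=np.intp)
    valid = (ix >= 0) & (ix < nx)
    np.add.at(counts, ix[valid], 1)
    return counts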

Test code:

import numpy as np
import fast_histogram
from time import perf_counter as pc

tries = 50
stackDepth = 10000
nBins = 256
imSize = 4096
data = np.random.normal(size=(imSize**2)).reshape(imSize, imSize)
data32 = data.astype('float32')
minVal = data.min();  maxVal = data.max()
print(data.flags)

times_direct = np.zeros(tries)
times_direct32 = np.zeros(tries)
times_single = np.zeros(tries)
# Call C-API directly:
for I in range(tries):
    t0 = pc()
    for chunk in range(stackDepth):
        _ = fast_histogram._histogram_core._histogram1d(data, nBins, minVal, maxVal)
    times_direct[I] = pc() - t0

# Call C-API 32-bit version directly:
for I in range(tries):
    t0 = pc()
    for chunk in range(stackDepth):
        _ = fast_histogram._histogram_core._histogram1d_f32(data32, nBins, minVal, maxVal)
    times_direct32[I] = pc() - t0    

# Call Python API:
for I in range(tries):
    t0 = pc()
    for chunk in range(stackDepth):
        _ = fast_histogram.histogram1d(data, nBins, (minVal, maxVal))
    times_single[I] = pc() - t0

times_single /= stackDepth
times_direct /= stackDepth
times_direct32 /= stackDepth

print( f'Python call histogram average: {times_single.mean():.3e} +/- {times_single.std():.3e} s' )
print( f'C-API call histogram f32 average: {times_direct32.mean():.3e} +/- {times_direct32.std():.3e} s')
print( f'C-API call histogram f64 average: {times_direct.mean():.3e} +/- {times_direct.std():.3e} s')

Changes can be seen at my fork:

https://github.com/robbmcleod/fast-histogram

@robbmcleod
Author

Shall we close this?

@astrofrog
Owner

@robbmcleod - thanks for investigating this! I'd like to keep this open, as I want to take some time to go over the changes and make sure there's nothing we want to include.

@astrofrog
Owner

@robbmcleod - one of the changes here that does make a big difference for large arrays is the 32-bit version you implemented for 32-bit arrays (to avoid the increased memory usage), so I was wondering if you would be happy to open a pull request to add those changes? (If you don't have time, I can pull in the commits you made.)

@robbmcleod
Author

I'll take a look when I have time.

@astrofrog
Owner

I'm going to close this since 32-bit floating point arrays should be faster now (#23).
