Numba #21

Closed
18 of 21 tasks
prisae opened this issue Jun 29, 2018 · 9 comments · Fixed by #77
Labels
enhancement (New feature or request), question (User questions)
Milestone

Comments

prisae (Member) commented Jun 29, 2018

There are currently two branches which try to implement numba:

Roadmap:


Below was the original commit from 2018.

Currently, 13 of the heaviest computations are implemented twice: once with NumPy and once as strings for numexpr (switched via opt=None or opt='parallel'). They are all in empymod.kernel, and are

  • 1x sqrt of complex matrix of size (nfreq, noff, nlay, nwvnr)
  • 11x exp of complex matrix of size (nfreq, noff, nwvnr)
  • 1x division of complex matrices of size (nfreq, noff, nwvnr)
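As an illustration of this dual-implementation pattern, a minimal sketch (hypothetical code, not empymod's actual source; the function name is made up) with the same expression written once for NumPy and once as a string for numexpr:

```python
import numpy as np
import numexpr as ne

def gamma_sqrt(x, opt=None):
    """Complex sqrt, duplicated for NumPy and numexpr (illustrative)."""
    if opt == 'parallel':
        # numexpr compiles the string expression and evaluates it
        # multi-threaded, using all available cores by default.
        return ne.evaluate("sqrt(x)")
    # Plain NumPy: single-threaded, but fastest for small arrays.
    return np.sqrt(x)
```

Both paths return the same values; only the backend (and hence the threading behaviour) differs, which is the duplication the list above refers to.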

Maybe these two implementations should be replaced by one single implementation using numba. See the notebooks in https://github.com/prisae/tmp-share.

It looks like a speed-up of up to 70% could be achieved, at the cost of readability. Then again, given the list above, three numba functions might be enough: sqrt, exp, and the division.

Something to think about.

@prisae prisae added enhancement New feature or request question User questions labels Jun 29, 2018
prisae (Member Author) commented Jun 30, 2018

An implementation is underway in the branch numba2.

There is an older, rather stale branch as well: numba.

prisae (Member Author) commented Jun 30, 2018

It is not a straightforward issue, and I think a lot more profiling would be required even to decide whether it is worth it or not. The numba2 branch has a very basic implementation of numba, so there would be lots of room for improvement. But in its current state, a very rough comparison looks like this:

[Figure: runtime comparison of NumPy, numba, spline, and numexpr]

Except for single offsets, spline is the fastest method, but since it interpolates it falls outside this comparison. For very small models, nothing beats plain NumPy. For larger models, numexpr is the fastest, though numba is also faster than NumPy. (And remember, numba is not yet implemented in as many calculations as numexpr, so it could catch up there.)

However, I think it boils down to how well the different methods parallelize. In this test, numexpr always used the 4 available cores, while numba only did so some of the time, and not as consistently (judging from the system monitor). NumPy, on the other hand, ran on just one core.

The heavy calculations, or bottlenecks, boil down to the sqrt and exp functions. So there may actually not be much room for improvement beyond better parallelization. For instance, the whole empymod.wavenumber could be run in parallel, with all results gathered before the transforms.
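That coarse-grained idea could be sketched like this (kernel_wavenumber below is a stand-in placeholder, not empymod's actual function): compute each frequency's wavenumber response in parallel and gather the results before the transform.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def kernel_wavenumber(freq, lam):
    # Placeholder for the heavy sqrt/exp work in the kernel.
    gam = np.sqrt(2j * np.pi * freq * lam**2 + 1.0)
    return np.exp(-gam)

freqs = [0.1, 1.0, 10.0]
lam = np.linspace(1e-6, 1e-3, 201)   # wavenumbers

# One task per frequency; NumPy releases the GIL inside its ufuncs,
# so threads can overlap the heavy array work.
with ThreadPoolExecutor() as ex:
    results = list(ex.map(lambda f: kernel_wavenumber(f, lam), freqs))
# `results` holds one wavenumber response per frequency,
# ready to be gathered for the transform.
```

This stays entirely in the standard library; whether threads, processes, or numba's own parallelism fit best would depend on the profiling discussed above.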

So for the moment I leave it as it is: NumPy and numexpr, where the latter can be used to run on as many cores as desired. I will keep the numba branches in case I or someone else decides to pick this up again.

prisae (Member Author) commented Jul 1, 2018

For future reference, if I ever pick that up again:

[Figure: profiling comparison]

The issue is complex. The comparison shows various interesting things:

  • The implementation to calculate Gamma is pretty good in numba.
  • The implementation of exp(-Gam*d) is not so good in numba.
  • spline is so fast that other things (the DLF, and generally just everything else) become important. Not much to improve there without rewriting the entire empymod in C or Fortran.
  • For anything but the splined version, almost the entire runtime of empymod is spent in the kernel (everything in and below wavenumber in the figure). Not even the Hankel transform exceeds 1% of the total computation time.

The last point shows that if there is any speed-up to be achieved, the only important functions are:

  • empymod.kernel.greenfct,
  • empymod.kernel.reflections, and
  • empymod.kernel.fields.
    These three functions do the number crunching, and it boils down to complex square roots of massive matrices (Gamma) plus exponentials and other operations with this Gamma. Gamma can reach millions if not billions of elements (if memory cannot hold it, then one has to loop over offsets or frequencies).

Now, two things could be done:

  • Mixture of numba and numexpr (don't think it is a good idea).
  • Improve numba for the exp(-Gam*d) calculations.

However, given that spline is accurate enough for any real-life calculation, it is probably not worth it at all. For the exact solution, time shouldn't matter that much, or numexpr can be used. So I think empymod is just fine as it is at the moment.

prisae (Member Author) commented Jul 3, 2018

Running the above again, but using only one thread for both numba and numexpr, shows (almost) no gain at all over the pure NumPy implementation. So the benefit comes purely from parallelization.
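For reproducing that single-thread comparison: numexpr.set_num_threads is numexpr's real API for pinning its thread pool, while numba reads the NUMBA_NUM_THREADS environment variable at import time (a rough sketch; the actual timing calls are left out):

```python
import numpy as np
import numexpr as ne

ne.set_num_threads(1)   # pin numexpr to a single thread
# (For numba, set NUMBA_NUM_THREADS=1 before importing it.)

x = np.random.rand(100_000) + 1j * np.random.rand(100_000)
r_np = np.exp(x)              # plain NumPy, single-threaded
r_ne = ne.evaluate("exp(x)")  # numexpr, now also single-threaded
# With one thread each, timing both (e.g. with timeit) should show
# roughly the same runtime, matching the observation above.
```

The results are identical either way; only the wall-clock time differs with the thread count.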

The best way to achieve this, however, is to parallelize the whole kernel: either the calls to kernel.wavenumber, which would have to be done in every Hankel transform method within transform, or the call to kernel.greenfct within kernel.wavenumber, which is probably the better approach.

I am closing this issue. I won't implement numba. I keep numexpr in for now, because it is an easy way to run in parallel. But ideally the whole kernel, and not just a few of the calculations, would run in parallel.

@prisae prisae closed this as completed Jul 3, 2018
prisae added a commit that referenced this issue Jul 3, 2018
@prisae prisae added the wontfix (This will not be worked on) label Jul 4, 2018
prisae (Member Author) commented Jul 1, 2019

After all the experience with emg3d, I am sure that numba could replace numexpr, leaving only one implementation (no numpy/numexpr duplication). Re-opening to keep it on the list...

@prisae prisae reopened this Jul 1, 2019
@prisae prisae removed the wontfix (This will not be worked on) label Oct 14, 2019
prisae (Member Author) commented Oct 14, 2019

Note that 'parallel' (i.e., the use of numexpr) is really only faster if it can use several threads; otherwise plain NumPy is faster. Maybe we should get rid of it altogether to simplify things?

prisae (Member Author) commented Nov 10, 2019

Replacing the whole kernel with numba should be the approach...

@prisae prisae added this to the v2.0.0 milestone Nov 10, 2019
prisae (Member Author) commented Nov 23, 2019

As a note

The difficulty comes from the many different scenarios, which yield different things:

  • E.g., if there are dozens of layers, the recursion becomes expensive, and just using numexpr or numba for a few functions does not help anything, as the whole reflection function would have to be rewritten.
  • If time-domain data is wanted, the spline version is not used, which puts the above comments in a different light.
  • If spline can be used, then so much stuff becomes important (DLF etc) that it is a whole other issue.

So this is a problem that depends a lot on the use case, which makes it difficult: there might be absolutely no gain in some cases, and a lot of gain in others.

prisae (Member Author) commented Mar 10, 2020

Will be closed by #77

@prisae prisae mentioned this issue Mar 11, 2020
@prisae prisae linked a pull request Mar 12, 2020 that will close this issue
@prisae prisae closed this as completed Apr 29, 2020