Numba #21

Closed
18 of 21 tasks
prisae opened this issue Jun 29, 2018 · 9 comments · Fixed by #77
Labels
enhancement (New feature or request), question (User questions)
Milestone

Comments

prisae (Member) commented Jun 29, 2018

There are currently two branches which try to implement numba:

Roadmap:


Below was the original commit from 2018.

Currently, 13 of the heaviest computations are implemented twice: once with NumPy and once as strings for numexpr (switched via opt=None or opt='parallel'). They are all in empymod.kernel, and are

  • 1x sqrt of complex matrix of size (nfreq, noff, nlay, nwvnr)
  • 11x exp of complex matrix of size (nfreq, noff, nwvnr)
  • 1x division of complex matrices of size (nfreq, noff, nwvnr)
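As an illustration of this dual-implementation pattern, a minimal sketch (hypothetical code, not empymod's actual source; the function name is made up) with the same expression written once for NumPy and once as a string for numexpr:

```python
import numpy as np
import numexpr as ne

def gamma_sqrt(x, opt=None):
    """Complex sqrt, duplicated for NumPy and numexpr (illustrative)."""
    if opt == 'parallel':
        # numexpr compiles the string expression and evaluates it
        # multi-threaded, using all available cores by default.
        return ne.evaluate("sqrt(x)")
    # Plain NumPy: single-threaded, but fastest for small arrays.
    return np.sqrt(x)
```

Both paths return the same values; only the backend (and hence the threading behaviour) differs, which is the duplication the list above refers to.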

Maybe these two implementations should be replaced by one single implementation using numba. See the notebooks in https://github.com/prisae/tmp-share.

It looks like a speed-up of up to 70% could be achieved, at the cost of readability. Then again, given the list above, three numba functions might be enough: sqrt, exp, and the division.

Something to think about.

@prisae prisae added enhancement New feature or request question User questions labels Jun 29, 2018
prisae (Member Author) commented Jun 30, 2018

An implementation is underway in the branch numba2.

There is an older, rather stale branch as well: numba.

prisae (Member Author) commented Jun 30, 2018

It is not a straightforward issue, and I think a lot more profiling would be required even to decide whether it is worth it or not. The numba2 branch has a very basic implementation of numba, so there would be lots of room for improvement. But in its current state, a very rough comparison looks like this:

[Figure: runtime comparison of NumPy, numba, spline, and numexpr]

Except for single offsets, spline is the fastest method, but since it interpolates it falls outside this comparison. For very small models, nothing beats plain NumPy. For larger models, numexpr is the fastest, though numba is also faster than NumPy. (And remember, numba is not yet implemented in as many calculations as numexpr, so it could catch up there.)

However, I think it boils down to how well the different methods parallelize. In this test, numexpr always used the 4 available cores, while numba only did so some of the time, and not as consistently (judging from the system monitor). NumPy, on the other hand, ran on just one core.

The heavy calculations, or bottlenecks, boil down to the sqrt and exp functions. So there may actually not be much room for improvement beyond better parallelization. For instance, the whole empymod.wavenumber could be run in parallel, with all results gathered before the transforms.
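That coarse-grained idea could be sketched like this (kernel_wavenumber below is a stand-in placeholder, not empymod's actual function): compute each frequency's wavenumber response in parallel and gather the results before the transform.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def kernel_wavenumber(freq, lam):
    # Placeholder for the heavy sqrt/exp work in the kernel.
    gam = np.sqrt(2j * np.pi * freq * lam**2 + 1.0)
    return np.exp(-gam)

freqs = [0.1, 1.0, 10.0]
lam = np.linspace(1e-6, 1e-3, 201)   # wavenumbers

# One task per frequency; NumPy releases the GIL inside its ufuncs,
# so threads can overlap the heavy array work.
with ThreadPoolExecutor() as ex:
    results = list(ex.map(lambda f: kernel_wavenumber(f, lam), freqs))
# `results` holds one wavenumber response per frequency,
# ready to be gathered for the transform.
```

This stays entirely in the standard library; whether threads, processes, or numba's own parallelism fit best would depend on the profiling discussed above.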

So for the moment I leave it as it is: NumPy and numexpr, where the latter can be used to run on as many cores as desired. I will keep the numba branches in case I or someone else decides to pick this up again.

prisae (Member Author) commented Jul 1, 2018

For future reference, if I ever pick that up again:

[Figure: profiling comparison]

The issue is complex. The comparison shows various interesting things:

  • The implementation to calculate Gamma is pretty good in numba.
  • The implementation of exp(-Gam*d) is not so good in numba.
  • spline is so fast that other things (the DLF, and generally just everything else) become important. Not much to improve there without rewriting the entire empymod in C or Fortran.
  • For anything but the splined version, almost the entire runtime of empymod is spent in the kernel (everything in and below wavenumber in the figure). Not even the Hankel transform exceeds 1% of the total computation time.

The last point shows that if there is any speed-up to be achieved, the only important functions are:

  • empymod.kernel.greenfct,
  • empymod.kernel.reflections, and
  • empymod.kernel.fields.
    These three functions do the number crunching, and it boils down to complex square roots of massive matrices (Gamma) plus exponentials and other operations with this Gamma. Gamma can reach millions if not billions of elements (if memory cannot hold it, then one has to loop over offsets or frequencies).

Now, two things could be done:

  • Mixture of numba and numexpr (don't think it is a good idea).
  • Improve numba for the exp(-Gam*d) calculations.

However, given that spline is accurate enough for any real-life calculation, it is probably not worth it at all. For the exact solution, time shouldn't matter that much, or numexpr can be used. So I think empymod is just fine as it is at the moment.

prisae (Member Author) commented Jul 3, 2018

Running the above again, but using only one thread for both numba and numexpr, shows (almost) no gain at all over the pure NumPy implementation. So the benefit comes purely from parallelization.
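For reproducing that single-thread comparison: numexpr.set_num_threads is numexpr's real API for pinning its thread pool, while numba reads the NUMBA_NUM_THREADS environment variable at import time (a rough sketch; the actual timing calls are left out):

```python
import numpy as np
import numexpr as ne

ne.set_num_threads(1)   # pin numexpr to a single thread
# (For numba, set NUMBA_NUM_THREADS=1 before importing it.)

x = np.random.rand(100_000) + 1j * np.random.rand(100_000)
r_np = np.exp(x)              # plain NumPy, single-threaded
r_ne = ne.evaluate("exp(x)")  # numexpr, now also single-threaded
# With one thread each, timing both (e.g. with timeit) should show
# roughly the same runtime, matching the observation above.
```

The results are identical either way; only the wall-clock time differs with the thread count.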

The best way to achieve this, however, is to parallelize the whole kernel: either the calls to kernel.wavenumber, which would have to be done in every Hankel transform method within transform, or the call to kernel.greenfct within kernel.wavenumber, which is probably the better approach.

I am closing this issue. I won't implement numba. I keep numexpr in for now, because it is an easy way to run in parallel. But ideally the whole kernel, and not just a few of the calculations, would run in parallel.

@prisae prisae closed this as completed Jul 3, 2018
prisae added a commit that referenced this issue Jul 3, 2018
@prisae prisae added the wontfix (This will not be worked on) label Jul 4, 2018
prisae (Member Author) commented Jul 1, 2019

After all the experience with emg3d, I am sure that numba could replace numexpr, leaving only one implementation (no numpy/numexpr duplication). Re-opening to keep it on the list...

@prisae prisae reopened this Jul 1, 2019
@prisae prisae removed the wontfix (This will not be worked on) label Oct 14, 2019
prisae (Member Author) commented Oct 14, 2019

Note that 'parallel' (i.e., the use of numexpr) is really only faster if it can use several threads; otherwise plain NumPy is faster. Maybe we should get rid of it altogether to simplify things?

prisae (Member Author) commented Nov 10, 2019

Replacing the whole kernel with numba should be the approach...

@prisae prisae added this to the v2.0.0 milestone Nov 10, 2019
prisae (Member Author) commented Nov 23, 2019

As a note

The difficulty comes from the many different scenarios, which yield different things:

  • E.g., if there are dozens of layers, the recursion becomes expensive, and just using numexpr or numba for a few functions does not help anything, as the whole reflection function would have to be rewritten.
  • If time-domain data is wanted, the spline version is not used, which puts the above comments in a different light.
  • If spline can be used, then so much stuff becomes important (DLF etc) that it is a whole other issue.

So this is a problem that depends a lot on the use case, which makes it difficult: there might be absolutely no gain in some cases, and a lot of gain in others.

prisae (Member Author) commented Mar 10, 2020

Will be closed by #77

@prisae prisae mentioned this issue Mar 11, 2020
@prisae prisae linked a pull request Mar 12, 2020 that will close this issue
@prisae prisae closed this as completed Apr 29, 2020