
Implement sparse summed fit statistics in cython #2104

Merged
merged 4 commits into gammapy:master from adonath:fit_statistics_cython on Apr 3, 2019

@adonath (Member) commented Apr 1, 2019

This PR is an "experimental" change in which I tried to implement a "sparse" evaluation for the cash and cstat fit statistics. The idea is that the computationally costly log term in the likelihood is omitted when the counts in the corresponding bin are zero. This is one of the tricks that @eacharles mentioned at PyGamma19. Here are some references he provided:
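To illustrate the trick itself, here is a minimal sketch in plain numpy (hypothetical code, not the PR implementation; the PR does the equivalent counts > 0 check per element inside a Cython loop, which avoids building the mask array):

import numpy as np

def cash_sum_sparse(counts, npred):
    """Cash statistic, evaluating the log term only in bins with counts > 0."""
    mask = counts > 0
    # the npred sum runs over all bins; the costly log only over non-empty ones
    return 2 * (np.sum(npred) - np.sum(counts[mask] * np.log(npred[mask])))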

Here is the runtime of the affected notebooks with and without this change:

INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/detect_ts.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 15.3 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/hess.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 22.4 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/fermi_lat.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 13.4 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/analysis_3d_joint.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 49.5 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/cta_1dc_introduction.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 14.6 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/mcmc_sampling.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 101.5 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/analysis_3d.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 35.7 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED

And with the current master:

INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/detect_ts.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 17.4 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/hess.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 24.5 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/fermi_lat.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 13.4 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/analysis_3d_joint.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 72.5 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/cta_1dc_introduction.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 14.7 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/mcmc_sampling.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 96.5 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED
INFO:gammapy.scripts.jupyter:   ... EXECUTING tutorials/analysis_3d.ipynb
INFO:gammapy.scripts.jupyter:   ... Executing duration: 42.5 seconds
INFO:gammapy.scripts.jupyter:   ... PASSED

It's clear that this is not the main bottleneck in our likelihood evaluation; still, it seems we spend a non-negligible amount of computing time evaluating the likelihood for empty bins.

So the speed-ups I see on my machine are on the order of 10-35%. For some reason that I don't understand, the MCMC tutorial runs slower. What do you think, @registerrier and @AtreyeeS?

@adonath adonath self-assigned this Apr 1, 2019

@adonath adonath added the feature label Apr 1, 2019

@adonath adonath added this to In progress in Modeling via automation Apr 1, 2019

@adonath adonath added this to the 0.12 milestone Apr 1, 2019

@cdeil (Member) commented Apr 1, 2019

@adonath - Such a PR on performance is really hard to review.

You show timings for a few high-level analyses - that's good. But a few got a bit faster, and at least one became slower, and in all cases we don't really know why. It's hard to reproduce or check as a reviewer, because you don't add the new & fast implementation as an extra option, but replace the old one.

Overall I would find it surprising if we spent a significant amount of time in cash, so I'm surprised by the ~20% performance improvement you mention for some notebooks. The way we handle masks is slow. And we also evaluate several trig functions as part of the model evaluation, which are just as slow as log. This is what I get:

In [21]: data = np.random.random(int(1e6))
    ...: mask = data > 0.5
    ...: %timeit np.sin(data)
    ...: %timeit np.log(data)
    ...: %timeit data[mask]

694 µs ± 8.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
630 µs ± 70.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.36 ms ± 67.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: data = np.random.random(int(1e3))
    ...: mask = data > 0.5
    ...: %timeit np.sin(data)
    ...: %timeit np.log(data)
    ...: %timeit data[mask]

7.7 µs ± 32.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
7.38 µs ± 19.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.4 µs ± 15 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For the change in cash you could check via a notebook (using numba or %%cython) whether the saved calls to log really outweigh the extra if for each pixel. Probably yes, and eventually we want to have this performance optimisation in Gammapy. But the way it should be done is to avoid doing part of the mask apply in Numpy and then another mask apply (counts > 0) via an if in C, no? There should be one mask that is fixed, cached, and applied as fast as possible.

Or even better would be to re-order the pixels to have contiguous arrays when evaluating the different cases for the likelihood - but I guess that's not possible if there is a PSF and EDISP: then pixels cannot be re-ordered, or the cost of re-doing it for npred on each evaluation is prohibitive?
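For example, such a check could look like this (a hypothetical numba sketch, not code from this PR, comparing the branchy and branch-free loops on mostly-empty data):

import numpy as np
from numba import njit

@njit
def cash_dense(counts, npred):
    total = 0.0
    for i in range(counts.size):
        total += npred[i] - counts[i] * np.log(npred[i])
    return 2.0 * total

@njit
def cash_sparse(counts, npred):
    total = 0.0
    for i in range(counts.size):
        total += npred[i]
        if counts[i] > 0.0:  # extra branch per pixel, saves a log in empty bins
            total -= counts[i] * np.log(npred[i])
    return 2.0 * total

counts = np.random.poisson(0.1, int(1e6)).astype(np.float64)  # ~90% empty bins
npred = np.random.uniform(0.01, 1.0, int(1e6))

cash_dense(counts, npred)   # first calls trigger compilation
cash_sparse(counts, npred)  # then time each, e.g. with %timeit in a notebook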


So @adonath - overall I'm not sure how to continue.

One idea could be to simplify as much as possible first:

  • do we really need the FLUX_FACTOR or was that added for no good reason and can be removed?
  • can we reduce to only one fixed mask per dataset during the likelihood fit and focus on one path to optimise?
  • remove cstat and focus on cash for the next months?
  • if there are multiple cash implementations like we have two now - can you add a test that they give exactly the same result, covering all the edge cases with zero counts or npred? (A sketch of such a test follows this list.)
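A sketch of such a test (hypothetical, reusing the cash_sum_sparse sketch from the PR description above; four entries covering empty bins with zero and non-zero npred plus two regular bins):

import numpy as np
from numpy.testing import assert_allclose

def test_cash_sparse_edge_cases():
    counts = np.array([0.0, 0.0, 2.0, 5.0])   # two empty bins
    npred = np.array([0.0, 1.5, 2.0, 4.0])    # including npred = 0
    # reference computed by hand: 2 * (sum(npred) - log terms of counts > 0 bins)
    expected = 2 * (npred.sum() - 2 * np.log(2) - 5 * np.log(4))
    assert_allclose(cash_sum_sparse(counts, npred), expected)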

And then to get a well-defined test case and insight into performance:

  • add high-level test case that we want to optimise in https://github.com/gammapy/gammapy-benchmarks
  • add some performance info on MapDataset - a show_performance method that prints how much memory it uses, plus some accumulators for how often the relevant path in the likelihood was evaluated and how many pixels were in the mask. Mostly, performance would then be improved by doing less work overall through caching; improving cash by evaluating fewer logs is only a micro-optimisation that will overall only matter at the ~1% level?

Overall I wouldn't have suggested working on this performance optimisation now - but if what comes out is some insight into where we spend our time during map fitting, or better handling / caching of masks on MapDataset, then that is a good thing to look at already now.

@adonath (Member, Author) commented Apr 2, 2019

Thanks @cdeil, I'd like to make a few comments and add a little bit more information.

Here is the %%prun output for a call to .likelihood() for the datasets_combined in the analysis_3d notebook:

[Screenshot: %%prun profile output]

I was also surprised that the calls to cash show up even before the calls to e.g. fftconvolve. But I remember that the computation of the cash likelihood also became the bottleneck in the TS map computation at some point. The Fermi people seem to have made the same observation, so I think there is little reason to doubt it in general.

The main point you miss, e.g. when comparing to the spatial model evaluation calls, is the number of pixels the computation is done for. There are ~O(1e6) pixels in a typical map dataset, but the spatial models are evaluated on a cutout and using factorisation, so they are only evaluated for ~O(1e4) pixels - two orders of magnitude fewer. The cash likelihood, however, is evaluated for all ~O(1e6) pixels in each iteration. Assuming 3 datasets and 300 function evaluations until the likelihood fit converges (as in the joint analysis notebook), you spend ~20 s just evaluating the log:

[Screenshot: estimate of the time spent evaluating the log]

That is basically the difference in runtime we see in the analysis_3d_joint notebook. The main point here is that the log is evaluated even if 90% of the pixels are empty, so 90% of the time spent in the log is wasted. That's the motivation for this PR. I agree it's not a huge improvement, but it's clearly not micro-optimisation, especially because the required changes are small for a non-negligible improvement.
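As a rough sanity check of that ~20 s figure (with a purely assumed per-call cost, which is machine- and dtype-dependent):

# Back-of-envelope with assumed numbers: if one log evaluation over a
# ~1e6-pixel map costs roughly 20 ms, then for 3 datasets and 300 evaluations:
n_datasets, n_evals, t_log = 3, 300, 20e-3      # t_log in seconds, assumed
print(f"{n_datasets * n_evals * t_log:.0f} s")  # -> 18 s, i.e. ~20 s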

In the MCMC notebook, because it uses a simulated dataset with high statistics, the fraction of empty pixels is ~50%. This is probably where the additional if statement comes into play (just guessing...).

> You show timings for a few high-level analyses - that's good. But a few got a bit faster, and at least one became slower, and in all cases we don't really know why. It's hard to reproduce or check as a reviewer, because you don't add the new & fast implementation as an extra option, but replace the old one.

Yes, but I don't think it's really a good idea to make this an option and expose it in the API - how would the user decide whether to use it or not? I think in this case it's rather a non-option, and just an example of where we waste computation time because of the limitations we have with numpy.

The main question for me is what cost this change comes at. Given that our data is typically sparse, is it acceptable to introduce a custom likelihood evaluation for this case (basically reimplementing ~20 lines of fit statistics in Cython for a 20% performance gain) or not? I think the decision is not completely obvious. Implementing this change took basically as long as writing this comment, so I don't feel strongly about whether this PR is merged or not, but I think it's good to have a place for this kind of discussion.

I definitely don't want to make performance optimisation a priority for Gammapy now. Doing extensive performance studies and setting up a benchmark system is definitely future work (even after 1.0...). But in general I'm open to accepting this kind of little improvement (which probably every developer enjoys), if it comes at little cost.

@cdeil (Member) commented Apr 2, 2019

@adonath - OK to merge this soon if you like. I would suggest adding the test with the edge cases for zero counts or npred that shows that the results of our two cash functions are the same (four array entries are needed, I guess).

Discussed with @adonath offline just now.

  • Yes, I didn't realise that spatial model evaluation is only done on a subset of bins - not for each energy. That explains why spatial model evaluation isn't dominating our execution time.

  • Probably the mask apply doesn't show up in the profile for analysis_3d_joint because no mask is set there. If my example above for 1e6 pixels is correct, the mask apply as we currently do it is very slow and will dominate the runtime over the cash evaluation or any log call optimisation. -> I still suggest simplifying the mask handling (only one mask per dataset, and never do any mask computation during the likelihood loop), and probably also passing the mask to Cython to avoid the mask apply in Numpy completely. (A sketch of that idea follows this list.)
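A sketch of that idea (hypothetical, using numba in place of Cython): the single cached fit mask is passed into the compiled loop, so no numpy mask apply happens at all:

import numpy as np
from numba import njit

@njit
def cash_sum_masked(counts, npred, mask):
    total = 0.0
    for i in range(counts.size):
        if mask[i]:                  # one cached boolean fit mask, applied here
            total += npred[i]
            if counts[i] > 0.0:      # sparse trick on top: skip log in empty bins
                total -= counts[i] * np.log(npred[i])
    return 2.0 * total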

A general note: results are very much dtype-dependent, for me like this:

In [32]: data = np.ones(int(1e6), dtype=int)

In [33]: %timeit np.log(data)
6.72 ms ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [34]: %timeit data + data
730 µs ± 1.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [35]: %timeit data * data
762 µs ± 5.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [36]: data = np.ones(int(1e6), dtype=float)

In [37]: %timeit np.log(data)
547 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [38]: %timeit data + data
358 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [39]: %timeit data + data
371 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [40]: data = np.ones(int(1e6), dtype=np.float32)

In [41]: %timeit np.log(data)
193 µs ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [42]: %timeit data * data
132 µs ± 393 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [43]: %timeit data + data
120 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So basically a log call is only as costly as two additions?!

So checking which dtypes are used in our datasets is also important, and has the potential to speed computations up by a factor of ~2.
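For instance (a generic numpy example, not gammapy code): an integer array forces an implicit cast inside every np.log call, so casting once up front avoids paying that cost on every likelihood evaluation:

import numpy as np

counts = np.ones(int(1e6), dtype=int)
print(np.log(counts).dtype)            # float64: an implicit cast on every call
counts32 = counts.astype(np.float32)   # cast once, store as float32
print(np.log(counts32).dtype)          # float32: the fastest case timed above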

@adonath adonath force-pushed the adonath:fit_statistics_cython branch from 9d4378d to bc8f1e6 Apr 3, 2019

@adonath adonath force-pushed the adonath:fit_statistics_cython branch from bc8f1e6 to 3eefe97 Apr 3, 2019

@adonath (Member, Author) commented Apr 3, 2019

Thanks @cdeil for the additional comments. I added a reference test and a TODO comment to move the mask handling to the Cython functions as well. I'll address this when refactoring the parameter and mask handling in our dataset classes. I'll merge this PR now.

@adonath adonath merged commit ef8b9e0 into gammapy:master Apr 3, 2019

8 of 9 checks passed

continuous-integration/travis-ci/pr: The Travis CI build is in progress
Codacy/PR Quality Review: Up to standards. A positive pull request.
Scrutinizer: Analysis: 4 updated code elements – Tests: passed
gammapy.gammapy: Build #20190403.6 succeeded
gammapy.gammapy (DevDocs): DevDocs succeeded
gammapy.gammapy (Lint): Lint succeeded
gammapy.gammapy (Test Python35): Test Python35 succeeded
gammapy.gammapy (Test Windows35): Test Windows35 succeeded
gammapy.gammapy (Test Windows37): Test Windows37 succeeded

Modeling automation moved this from In progress to Done Apr 3, 2019

@cdeil (Member) commented Apr 12, 2019

@adonath and all - the C++ HPC lectures from Pierre this week were very interesting: there are many different effects and tricks one can use to make something like the Cash function faster.

See https://asterics2020-obelics.github.io/Lecture/CodeOptimisationAndPythonWrappers/index.html

This PR introduced several changes and overall got a speedup - but it's not quite clear what is going on. Specifically, this talk suggests that introducing an if is usually very costly and should be avoided:
https://indico.in2p3.fr/event/18333/contributions/71087/attachments/52935/68664/presentation_branching_predicator.pdf

Now in our case, this has to be weighed against the cost of a log call. I don't think we really know how costly that is: np.log could be super fast if it calls into MKL using special CPU instructions, while a log call in Cython might resolve to a different log implementation and be much slower.

So overall I think if one really wants to optimise cash, one would need a notebook with the different implementations and time them - it could be that a speedup of 10x or more is still easily possible with a small change.
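As a starting point, such a notebook could compare a few numpy-level variants (a sketch; the Cython implementation from this PR and the numba loops above would be further candidates):

import numpy as np
import timeit

counts = np.random.poisson(0.1, int(1e6)).astype(float)  # ~90% empty bins
npred = np.random.uniform(0.01, 1.0, int(1e6))

def cash_dense(c, n):
    return 2 * np.sum(n - c * np.log(n))

def cash_sparse(c, n):
    m = c > 0
    return 2 * (np.sum(n) - np.sum(c[m] * np.log(n[m])))

def cash_where(c, n):
    # branch-free at the Python level: the ufunc only evaluates where c > 0
    log_n = np.log(n, out=np.zeros_like(n), where=c > 0)
    return 2 * (np.sum(n) - np.sum(c * log_n))

for f in (cash_dense, cash_sparse, cash_where):
    t = timeit.timeit(lambda f=f: f(counts, npred), number=100) / 100
    print(f"{f.__name__}: {t * 1e3:.2f} ms")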
