
PIG 8 - Datasets #1986

Merged: 7 commits merged from adonath:datasets_pig into gammapy:master on Dec 1, 2019

Conversation

@adonath (Member) commented Jan 14, 2019

This PR adds PIG 8: a proposal and work plan to enable joint likelihood fitting in Gammapy.

To-do list:

  • Clarify whether the current SpectrumObservation and SpectrumObservationList can be fully replaced by the SpectrumDataset and Datasets classes
@adonath adonath self-assigned this Jan 14, 2019
@adonath adonath added the pig label Jan 14, 2019
@adonath adonath modified the milestones: 0.12, 0.11 Jan 14, 2019
@adonath adonath force-pushed the adonath:datasets_pig branch 5 times, most recently from c827cc8 to ad8f2b9 Jan 15, 2019
@cdeil (Member) commented Jan 28, 2019

@adonath - One point that I think is important is to aim for a dataset / model design in Gammapy that will allow efficient use of compute resources.

Most users have multi-core CPUs (a few cores on their laptop, a few tens of cores on servers at their institute), and many have access to CPU clusters where they can easily get 10 or 100 CPUs. It would be very nice if that compute power were available for likelihood fitting; not necessarily with high priority now, but hopefully something we can add in a year or two once Gammapy is more mature. Currently we are mostly restricted to single-core CPU processing. Accessing GPUs and other modern hardware is possible in principle, but in practice is hard to support, because the Python frameworks for doing so are only now adding support or are not yet stable enough.

In the last days I was looking a bit at concurrent.futures.ProcessPoolExecutor (a nice API on top of multiprocessing) from the Python standard library, and I started looking at dask.distributed. Unfortunately I think the design decision we're leaning towards, having Parameter objects that are shared between models and datasets, makes it impossible to scale to a multi-CPU cluster: that requires message passing and can't be done with shared memory. Supporting multi-core could be possible, but will likely require special care in the design to work at all and to not be terribly inefficient (e.g. keeping a copy of all datasets in every process).

@adonath - I think either you and I focus on this and do some prototyping very soon (within the next ~ week) to try to get a design that works for multi-core or even multi-CPU (e.g. by having Datasets push parameter update values to each Dataset, instead of having shared Parameter objects). Or we postpone this, and just add a very short section / comment here saying that we're aware of this limitation and plan to re-consider the Dataset / Datasets / Model design at some later point to achieve parallel processing. Probably somewhere between 90% and 99% of the API, tests and code could remain the same, so IMO postponing this critical part of the design isn't unreasonable if it lets us get the features in quickly, already in v0.11.
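
To make the "push parameter values" idea concrete, here is a minimal sketch, assuming hypothetical parameters.set_values() and stat_sum() methods on each dataset (illustrative names, not the actual Gammapy API):

# Hypothetical sketch: evaluate per-dataset likelihoods in worker
# processes by pushing plain parameter values, instead of sharing
# Parameter objects across process boundaries.
from concurrent.futures import ProcessPoolExecutor

def evaluate_stat(dataset, values):
    # Worker: receives plain floats, updates its local copy of the
    # dataset's parameters, returns the fit statistic.
    dataset.parameters.set_values(values)  # hypothetical setter
    return dataset.stat_sum()              # hypothetical statistic method

def total_stat(datasets, values):
    # Joint statistic as the sum over datasets; only `values` crosses
    # the process boundary. Note this naive version re-pickles each
    # dataset on every call, exactly the inefficiency mentioned above;
    # persistent workers would be needed to avoid that.
    with ProcessPoolExecutor() as executor:
        stats = executor.map(evaluate_stat, datasets, [values] * len(datasets))
    return sum(stats)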

@cdeil (Member) commented Jan 30, 2019

Fermi ST added a tutorial showing how to do summed likelihood: https://fermi.gsfc.nasa.gov/ssc/data/analysis/scitools/summed_tutorial.html

@adonath - looking at this, and likewise at how Fermipy does it, could be interesting for the Gammapy datasets design; it might even be worth adding a few lines with links in the PIG.

@adonath adonath force-pushed the adonath:datasets_pig branch 2 times, most recently from 89dd0e8 to 623129f Feb 1, 2019
@adonath adonath force-pushed the adonath:datasets_pig branch from 623129f to 1c02405 Feb 18, 2019
@adonath adonath modified the milestones: 0.11, 0.12 Mar 8, 2019
@cdeil (Member) left a comment

@adonath - I did a final review of this PR. Some comments are quite substantial. @registerrier should also chime in, and you should do a final update of this document. Overall I'd suggest finalising this among the lead devs soon, and then posting it on the mailing list and CC list for final review, with a deadline of one week for final comments.

[Resolved review threads on docs/development/pigs/pig-008.rst]
mask, when multiple `MapDataset` are fitted jointly.


`MapDatasetOnOff`

@cdeil (Member) commented Apr 26, 2019

Are you sure a separate class for on/off is a good idea?

I think the options are:

  • separate class
  • subclass on/off from on
  • one class with optional data members for off (in HESS we have this as BgMaps and it works)

All of those have pros / cons in terms of simplicity vs code duplication.

Suggest changing the PIG to put 2-3 sentences mentioning the options, without deciding on one.
But if @registerrier and you have discussed this and are sure you want separate classes for on/off, OK to leave as-is.
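
For illustration, a minimal sketch of the subclass option, with hypothetical attribute names; the other two options differ mainly in whether the off-related members live in a fully separate class or as optional members of a single class:

# Hypothetical sketch of option 2: subclass the on/off case from the
# "on" dataset, adding only the off-related data members.
class MapDataset:
    def __init__(self, counts=None, exposure=None, background_model=None):
        self.counts = counts
        self.exposure = exposure
        self.background_model = background_model

class MapDatasetOnOff(MapDataset):
    def __init__(self, counts_off=None, acceptance=None, acceptance_off=None, **kwargs):
        super().__init__(**kwargs)
        self.counts_off = counts_off          # off-region counts
        self.acceptance = acceptance          # on-region acceptance
        self.acceptance_off = acceptance_off  # off-region acceptance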

@adonath (Author, Member) commented Jul 2, 2019

Whether to introduce separate ...OnOff datasets is probably something we should discuss tomorrow...

The `SpectrumDatasets` with a parametric background model will be introduced first.
The existing `SpectrumObservation` can be refactored later into a `SpectrumOnOffDataset`
or `PHASpectrumDatasets`.

@cdeil (Member) commented Apr 26, 2019

Same suggestion: don't commit to a separate class for on/off in the PIG, mention the options. But as you prefer, OK to merge either way.
Suggest removing the "or PHASpectrumDatasets": PHA is a cryptic term that shouldn't be in the Gammapy name / design; it's just a legacy thing that we support as an I/O format.

or `PHASpectrumDatasets`.


`FluxPointDataset`

@cdeil (Member) commented Apr 26, 2019

You have a mix of FluxPointDataset (singular) and FluxPointsDataset. Make it consistent.
I would have put FluxPointDataset, but I think you've already mostly gone the FluxPointsDataset way and prefer it; OK, then edit here.

@adonath (Author, Member) commented May 2, 2019

I changed consistently to FluxPointsDataset.

@cdeil (Member) commented May 3, 2019

Then you could also consider changing FluxPointEstimator to FluxPointsEstimator for overall consistency.

`Datasets.read()`, such that the individual datasets are only read from disk on
request and are not loaded into memory when calling `.read()`. We propose to handle
this with a generic lazy-loading mechanism for `Map` objects, where the data is loaded
on access of the `.data` attribute.

@cdeil (Member) commented Apr 26, 2019

Really introduce lazy-loading in Map?
Not sure it's a good idea; it introduces a lot of complexity, and the FITS I/O has to change.
At the DL3 obs level we introduce the lazy-loading at a higher level, in Observations, not in the EventList and IRF classes.

Suggest putting this as an open question, saying lazy-loading is maybe something we want, e.g. to support parallel evaluation (see paragraph above), but that how it will be done isn't decided yet.

Do you discuss the question of whether Dataset is used as a container in data reduction?
Doing data reduction for 100s of runs would be another use case where writing data and discarding it from memory is probably a good idea, at least as an option.
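
For reference, a minimal sketch of the lazy-loading mechanism under discussion, written as a hypothetical wrapper class rather than a proposed change to Map itself:

# Hypothetical sketch: defer reading map data from disk until the
# .data attribute is first accessed, then cache it in memory.
from astropy.io import fits

class LazyMap:
    def __init__(self, filename, hdu="PRIMARY"):
        self.filename = filename
        self.hdu = hdu
        self._data = None  # nothing read from disk yet

    @property
    def data(self):
        if self._data is None:
            with fits.open(self.filename) as hdulist:
                self._data = hdulist[self.hdu].data
        return self._data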

[Resolved review threads on docs/development/pigs/pig-008.rst]

Alternatives
============
Joint-likelihood analyses for multiple instruments are already possible using

@cdeil (Member) commented Apr 26, 2019

I don't think 3ML is a full alternative.

We need Gammapy by itself to do good IACT data analysis. This requires datasets and joint fit across datasets, where the dataset can be spectra or maps from IACT observations.

Also per-obs background model with individual parameters is very common and needs to be supported. Suggest to mention this specifically as a use case also in the abstract.

I'm not sure there's really a full alternative to this PIG. You could change this section to "Future work" or "Open questions" or something like that, and as part of it share some thoughts: a "Gammapy dataset" is similar to a "3ML plugin", our focus is on the kinds of datasets used for IACTs (3D maps, 2D maps, 1D spectra, spectral points and likelihood profiles), and this is already pretty much a full MWL solution, like Sherpa or 3ML, even though our motivation was just to do IACT analysis well.

@adonath (Author, Member) commented Jul 2, 2019

I agree there is no real full alternative, so I just decided to remove the Alternatives section.

@adonath adonath force-pushed the adonath:datasets_pig branch from f3affad to 3269319 May 2, 2019
@adonath adonath force-pushed the adonath:datasets_pig branch 2 times, most recently from 1c05e73 to 779a117 May 24, 2019
@adonath adonath modified the milestones: 0.12, 0.13 May 28, 2019
@adonath adonath force-pushed the adonath:datasets_pig branch 2 times, most recently from 85f6fd8 to 99da234 Jun 13, 2019
@adonath adonath force-pushed the adonath:datasets_pig branch 2 times, most recently from 52518e5 to fc36796 Jul 2, 2019
@adonath adonath force-pushed the adonath:datasets_pig branch 3 times, most recently from 7463da3 to bd480fe Jul 3, 2019
@cdeil cdeil requested review from cdeil and registerrier Jul 5, 2019
@cdeil (Member) left a comment

@adonath - Thanks!

I re-read the whole thing. My comments are inline.

the abstraction layer of `Dataset` in Gammapy. A dataset bundles the reduced data
with a parametric model and fit statistics function. It evaluates the model and
log-likelihood and passes it on to the fit object. Datasets can be combined by
adding their log-likelihood values and concatenating their model parameters.

@cdeil (Member) commented Jul 7, 2019

The abstract is very much focused on the joint likelihood case. But already the fact that different kinds of analyses should be structured in a similar way, and that dataset-specific things should be separated from generic fitting things, is enough motivation to introduce several Dataset classes and a single Fit class. Suggest rewording a little, to make it clearer that this is something that was needed in Gammapy independent of the joint likelihood fit use case (which is where Datasets comes in).

Proposal
========
We propose to introduce the following classes to implement the dataset framework
in Gammapy:

@cdeil (Member) commented Jul 7, 2019

Suggest to start with a section describing the general idea and the generic classes and how they interrelate: Dataset, Datasets, Fit.

Below you show, for each example, different aspects of how to fit a dataset, mostly in the MapDataset section, but also similar and other things in the other sections. The whole PIG becomes shorter and easier to understand if you start with the generic section.
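
As a rough illustration of what such a generic section could describe, a sketch of how Dataset, Datasets and Fit might interrelate (class names follow the PIG; the method details are assumptions):

# Hypothetical sketch: a Dataset evaluates its own fit statistic,
# Datasets sums statistics and concatenates parameters, and Fit only
# talks to this generic interface.
class Dataset:
    parameters = None  # model parameters exposed to the fit

    def stat_sum(self):
        # Each concrete dataset (maps, spectra, flux points, ...)
        # implements its own statistic.
        raise NotImplementedError

class Datasets:
    def __init__(self, datasets):
        self.datasets = datasets

    @property
    def parameters(self):
        # Concatenated parameters of all datasets.
        return [p for d in self.datasets for p in d.parameters]

    def stat_sum(self):
        # Joint statistic: sum of the per-dataset statistics.
        return sum(d.stat_sum() for d in self.datasets)

class Fit:
    def __init__(self, datasets):
        self.datasets = datasets

    def optimize(self):
        # Vary datasets.parameters to minimise datasets.stat_sum()
        # using the chosen optimisation backend.
        ...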

case a spatially varying PSF is handled. The same holds for the `edisp`
argument.

A separate setup step creates cutouts of the exposure map for the individual

@cdeil (Member) commented Jul 7, 2019

It's not clear to me yet how setup works, or whether the evaluator objects are a good idea, i.e. whether they could be removed and caching done e.g. on model or dataset objects. Is setup auto-called on __init__, or on optimise, or ...? Does the map geom of all maps have to be identical, or does setup reproject to a common geom?

I don't think we have the answers to these questions yet, and it's not done consistently for the various datasets. Maybe just delete this paragraph and keep the scope of the PIG on the introduction of Dataset and Fit? I would prefer this, or to expand on / improve these aspects, especially if you already consider something like the list of map evaluators to be the v1.0 design.


The `stat` argument allows choosing between built-in fit statistics
such as `cash` or `cstat` and could later be extended to allow for arbitrary,
user-defined fit statistics.

@cdeil (Member) commented Jul 7, 2019

I think the idea is that string arguments for stat are only a convenience, and what will happen is that on __init__ this creates Stat objects? And customisation would be done by users passing in their own pre-instantiated Stat objects?
This is not really clear, because you only ever show stat strings, and there is no registry of stats by string, so it's not user-extensible like that. Suggest moving this paragraph to the generic section discussing the Dataset base class and saying something there (not knowing how this will work in detail is also OK; just say it).

@cdeil (Member) commented Jul 7, 2019

The other thing that you could mention is how users can customise or plug in their own use cases: they can pass a user-defined Stat object, they can sub-class and modify an existing Dataset, or they can create their own Dataset from scratch, which isn't very complicated (and they can use composition with any other features in Gammapy).

E.g. if someone wants to use Gammapy for HAWC, or for unbinned analysis, or whatever they want really, that's possible; the key thing achieved here is a framework design that is user-extensible, so not everything has to be built into Gammapy.
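
A sketch of the convenience pattern discussed in these two comments, with a hypothetical registry and Stat base class:

# Hypothetical sketch: stat strings resolve to Stat objects on __init__,
# while a pre-instantiated (possibly user-defined) Stat is used as-is.
import numpy as np

class Stat:
    def __call__(self, counts, npred):
        raise NotImplementedError

class Cash(Stat):
    def __call__(self, counts, npred):
        # Cash statistic, up to terms independent of the model.
        return 2 * np.sum(npred - counts * np.log(npred))

STATS = {"cash": Cash}  # registry of built-in fit statistics

class MapDataset:
    def __init__(self, stat="cash"):
        self.stat = STATS[stat]() if isinstance(stat, str) else stat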

dataset = MapDataset()

# compute predicted number of counts
dataset.npred()

@cdeil (Member) commented Jul 7, 2019

I think npred already isn't required? E.g. FluxPointsDataset doesn't have it.
So either remove it, or say that counts-based datasets will have this, but it's not required by the Dataset ABC.

@adonath (Author, Member) commented Jul 9, 2019

Removed.

background-model: "obs-123/background.fits"
model: "model.yaml" # optionally

Additionally one could introduce a single FITS file serialization for quickly writing /

@cdeil (Member) commented Jul 7, 2019

Remove "for quickly ...". It's not more or less quick than the YAML index file plus FITS file(s) solution.
You could write something like "Another serialisation format we are considering to implement is as a single FITS file."

dataset.write("dataset.fits")

The `SpectrumDatasetOnOff` in addition implements serialisation to the standard
OGIP format described on the `gamma-astro-data-formats PHA`_ page:

@cdeil (Member) commented Jul 7, 2019

":" -> "."

@adonath (Author, Member) commented Jul 9, 2019

Done


10. Implement dataset serialization to yaml (**v0.14**).

11. Implement datasets serialization to a single fits file (`#2264 <https://github.com/gammapy/gammapy/issues/2264>`_).

@cdeil (Member) commented Jul 7, 2019

"fits" -> "FITS"

@adonath (Author, Member) commented Jul 9, 2019

Done

8. Add tutorial for joint-likelihood fitting of IACT and Fermi-LAT data, based
on the joint-crab dataset (**v0.14**).

9. Add a `name` attribute to datasets and a `Datasets.__getitem__` method (**v0.14**).

@cdeil (Member) commented Jul 7, 2019

Did you discuss introducing names for datasets?
I think, as for models, that's something we might or might not want.
If you really propose to add it and mention it in the task list here, you should probably say something in the PIG about how it would work. Do names have to be unique? Who generates them, in what way? Do we in the framework ever rely on them and access by name? Or is this just something for user display, e.g. an obs_id?
One use case could be the lightcurve analysis, where you either have one run split into few-minute chunks, or you have time bins that span many full and partial observations. Is it the responsibility of the Gammapy framework to generate names there? Does reading back datasets for lightcurve analysis rely on those names?

My naive feeling is that we want to avoid names, or if you allow them, then they will be "user-land", never used within Gammapy. Probably a Dataset should have a meta dict as user-land? That might be something you could add in the PIG?
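
For concreteness, a sketch of the name-based access proposed in the task list (hypothetical; whether names must be unique is exactly the open question raised above):

# Hypothetical sketch: datasets carry an optional user-land name, and
# Datasets supports lookup by name as well as by integer index.
class Datasets:
    def __init__(self, datasets):
        self.datasets = datasets

    def __getitem__(self, key):
        if isinstance(key, str):
            # Name lookup; assumes names are unique within the list.
            for dataset in self.datasets:
                if dataset.name == key:
                    return dataset
            raise KeyError(f"No dataset named {key!r}")
        return self.datasets[key]  # integer index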


Lazy loading of Datasets
------------------------
For the use-case of inspecting or stacking individual datasets (e.g. MapDataset per observation)

@cdeil (Member) commented Jul 7, 2019

Probably reading of multiple datasets should be moved from the outlook to the main section.
We already had this for spectra; it was in the SpectrumObservationList.
This functionality should be brought back, and since it's not obvious how to do it, the proposed plan to re-add it should IMO be mentioned in the PIG.

Personally I'm +1 to introduce a YAML-based format that points to FITS files, and to make this work for multiple spectra somehow (I don't care whether via generated filenames or multiple folders for now), i.e. to propose something that lets us store results between spectrum data reduction and fitting; we can then improve / refactor the serialisation formats in the coming months. But the need for this should be mentioned in the PIG IMO.
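
For concreteness, such a YAML index file for multiple spectra might look like this, extending the per-observation snippet quoted earlier (keys and filenames are illustrative only):

datasets:
  - type: SpectrumDatasetOnOff
    filename: "obs-123/pha_obs123.fits"
    model: "model.yaml"
  - type: SpectrumDatasetOnOff
    filename: "obs-124/pha_obs124.fits"
    model: "model.yaml"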

@registerrier (Contributor) left a comment

Thanks @adonath. Minor comments inline

dataset = MapDataset(counts=None)

# to sample a counts map from npred
dataset.fake(random_seed=0)

@registerrier (Contributor) commented Jul 9, 2019

Should the .fake method overwrite the counts? Maybe return a new Dataset? Or simply the counts?

A use case where this might be a problem: a user wants to test whether their fit is reasonable and performs a number of fake & fit iterations to check the validity of the fitted parameters. There you'd want to keep the initial counts.
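
A sketch of that fake & fit use case, assuming a hypothetical copy() method on the dataset so the original counts stay untouched:

# Hypothetical sketch: repeated fake & fit on copies of the dataset,
# preserving the original counts for later comparison.
fitted_values = []
for seed in range(100):
    fake_dataset = dataset.copy()        # hypothetical copy; original untouched
    fake_dataset.fake(random_seed=seed)  # overwrite counts on the copy only
    fit = Fit(fake_dataset)
    result = fit.optimize()
    fitted_values.append(result.parameters)  # hypothetical result attribute
# The spread of fitted_values can then be compared with the parameters
# fitted on the real counts.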

Proposal
========
We propose to introduce the following classes to implement the dataset framework
in Gammapy:

@registerrier (Contributor) commented Jul 9, 2019

Why not briefly describe what all Datasets have in common? This would also show that any user can add their own type of Dataset.

Abstract
========
An essential feature of Gammapy for modeling gamma-ray data will be the possibility
of joint-likelihood analyses. This includes the joint-likelhood fitting of a model

@registerrier (Contributor) commented Jul 9, 2019

"likelhood" -> "likelihood"

with a parametric model and fit statistics function. It evaluates the model and
log-likelihood and passes it on to the fit object. Datasets can be combined by
adding their log-likelihood values and concatenating their model parameters.

@registerrier (Contributor) commented Jul 9, 2019

I would add something to say that datasets are the end product of the data reduction step, and that once associated with a model, they are the basis of the fitting part.

Mask and data range handling
-----------------------------
For a typical gamma-ray analysis there are two kinds of data ranges: the safe
data range defined for each dataset, e.g. by safe energy thresholds or offset cuts

@registerrier (Contributor) commented Jul 9, 2019

defined for each dataset during data reduction

ranges with separate masks `mask_safe` and `mask_fit` on the datasets. The main
purpose is that safe data ranges are set in advance during data reduction and
not modified later, while the fit mask can be modified by users to manually
define the fit range or by algorithms e.g. flux point computation.

@registerrier (Contributor) commented Jul 9, 2019

by some algorithm
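
A sketch of how the two masks could combine when evaluating the fit statistic (mask_safe and mask_fit as in the PIG text; stat_array as a hypothetical per-bin statistic):

# Hypothetical sketch: the statistic is summed only over bins inside
# both the safe data range and the user/algorithm fit range.
import numpy as np

def stat_sum(dataset):
    stat = dataset.stat_array()  # hypothetical per-bin statistic values
    mask = np.ones(stat.shape, dtype=bool)
    if dataset.mask_safe is not None:
        mask &= dataset.mask_safe  # fixed during data reduction
    if dataset.mask_fit is not None:
        mask &= dataset.mask_fit   # set by the user or an algorithm
    return stat[mask].sum()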

fit = Fit(datasets)
fit.optimize()

@registerrier (Contributor) commented Jul 9, 2019

Maybe add something about dataset stacking?
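
For illustration, a sketch of what a stacking alternative to the joint fit could look like (copy and stack are hypothetical method names):

# Hypothetical sketch: reduce all datasets to one stacked dataset
# first, then fit that, instead of a joint fit over the full list.
stacked = datasets[0].copy()
for dataset in datasets[1:]:
    # Add counts and background, combine exposure-weighted IRFs,
    # and intersect the safe masks.
    stacked.stack(dataset)

fit = Fit(stacked)
fit.optimize()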

background-model: "obs-123/background.fits"
model: "model.yaml" # optionally

Additionally one could introduce a single FITS file serialization for quickly writing /

@registerrier (Contributor) commented Jul 9, 2019

This only supports the reduced data part.

has to be passed on `MapDataset.__init__()` (**v0.14**).

8. Add tutorial for joint-likelihood fitting of IACT and Fermi-LAT data, based
on the joint-crab dataset (**v0.14**).

@registerrier (Contributor) commented Jul 9, 2019

Note that @luca-giunti has nice working examples for this

@adonath adonath force-pushed the adonath:datasets_pig branch from bd480fe to 91ff63a Jul 9, 2019
@cdeil cdeil modified the milestones: 0.13, 0.14 Jul 18, 2019
@adonath adonath modified the milestones: 0.14, 0.15 Sep 26, 2019
@cdeil (Member) commented Nov 12, 2019

@adonath - How should we finish up this PIG?

  1. Withdraw with a comment that it's mostly implemented?
  2. Invest half a day to polish, and circulate with Monday as deadline for comments?

If we do option 2, we could shorten it to 1/3 or 1/2 of the length by replacing each dataset description with a link to the corresponding v0.14 docs dataset class. Overall I'm +1 to do option 2, because most of the stuff from this PIG was done in Gammapy, and there are still a few details, especially how Datasets, SkyModels, Fit and Parameters interact, that would benefit from some more review and discussion, so that we can finalise the design next week and try to ship it in v0.15.

Let me know if you want help here.

@adonath adonath merged commit 264b325 into gammapy:master Dec 1, 2019
9 checks passed:

  • Codacy/PR Quality Review: Up to standards. A positive pull request.
  • Scrutinizer: Analysis: 162 Issues, 0 Patches; Tests: passed
  • continuous-integration/travis-ci/pr: The Travis CI build passed
  • gammapy.gammapy #20191130.5: succeeded
  • gammapy.gammapy (DevDocs): succeeded
  • gammapy.gammapy (Lint): succeeded
  • gammapy.gammapy (Test Python36): succeeded
  • gammapy.gammapy (Test Windows36): succeeded
  • gammapy.gammapy (Test Windows37): succeeded
@cdeil (Member) commented Dec 1, 2019

@adonath - Thanks for merging the PIG.

Please change the status to "withdrawn". It never went to official review via the mailing list and CC list, so according to our rules (see PIG 1), it can't be "accepted", since some relevant people didn't even see it, never mind approved it.

I suggest saying in the decision section that this PIG was mostly implemented in 2019, maybe that it should have gone to review in early 2019, or you could mention that important and tricky details are still unresolved, e.g. how Datasets and Models and Parameters interact, or how background models are handled (still not clear; currently a model object on maps, but a counts spectrum on spectra), and personally I'm also not a fan of the two-masks-on-the-dataset solution. So I think this was the reason this PIG never went for official review: important design aspects weren't clear (some still aren't), and the PIG wasn't proposed as a short-term plan (a few months) but as the final solution, which no one could design or envision in one go.

Maybe it's not worth having a big paragraph in "decision", and just putting a short sentence, like for the modeling PIG, would be best? https://docs.gammapy.org/dev/development/pigs/pig-007.html#decision

You also need to add it to the toctree and fix RST formatting issues:
https://gist.github.com/cdeil/4eefb89e6ce0cd4539fbc6b6ec8860ad
