
Add datasets serialization #2314

Merged
merged 7 commits into from Aug 30, 2019

Conversation

@QRemy (Contributor) commented Aug 1, 2019

add .from_yaml() and .to_yaml() to datasets
modify dataset.from_hdulist() and to_hdulist() to handle multiple backgrounds

@cdeil cdeil self-assigned this Aug 2, 2019
@cdeil cdeil added the feature label Aug 2, 2019
@cdeil cdeil added this to the 0.14 milestone Aug 2, 2019
@cdeil (Member) left a comment

This is a continuation of #2296

@QRemy - Thanks!

I've left a bunch of inline comments. Many of them are superficial - I apologise, but I don't have time today to read and understand this code in depth. I'm still OK with polishing this a bit and merging it in, and then in a few weeks Axel and I can do a larger review of the serialisation code.

My main comment here is a concern about adding obs_id on datasets as you do here.
There are several use cases where a dataset will not have one unique obs_id.
E.g. when creating a lightcurve with 5-min time intervals, several bins can contain datasets that come from the same observation, i.e. the same obs_id.
Or if you do a stacked analysis, or a lightcurve with monthly time bins, many obs_id can contribute to a single stacked Dataset.

Can you avoid introducing obs_id on dataset here?

If you absolutely need it, my suggestion would be to add a "name" attribute for datasets instead, which could then be filled with the obs_id, or the strings "stacked" or "no name", or ... there are more options with a string than with an int or list of ints for obs_id.
This needs careful thought; much like compound model and parameter names, it's a tricky problem.
If you introduce it for now, please add some docs somewhere on how it works: in particular whether the serialisation logic relies on unique names within a Datasets, or whether having the same name multiple times still works; and what the special strings like "local" and "global" that I see in this PR mean.

For the next round of review I would then start from this high-level description to understand the intent of your solution, and then review the implementation.
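
For illustration, here is a minimal sketch of what such a string "name" could look like on a dataset; the ToyDataset class and the example values are invented for this sketch, not the actual Gammapy API:

class ToyDataset:
    """Illustration only: a dataset identified by a free-form string name."""

    def __init__(self, name="no name"):
        self.name = name


# The name can hold an obs_id rendered as a string, a label for stacked data, or anything else.
per_obs = ToyDataset(name="obs-23523")
stacked = ToyDataset(name="stacked")
print(per_obs.name, stacked.name)  # obs-23523 stacked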

@@ -98,3 +104,92 @@ def _dict_to_skymodel(model):
return SkyModel(
name=model["name"], spatial_model=spatial_model, spectral_model=spectral_model
)


def _get_backgrounds_names(dataset):

@cdeil (Member) commented Aug 2, 2019

Suggest to remove this helper function. In the one place where you use it you can just use a 1-line list comprehension:

[model.name for model in dataset.background_model.models]
return BGnames


def models_to_datasets(datasets, components, get_lists=False):

@cdeil (Member) commented Aug 2, 2019

I think the get_lists option isn't useful?

Suggest to remove it. Generally functions that always return the same thing are strongly preferred in coding to functions that sometimes return one thing and sometimes something else.

Currently it looks like you don't have a caller that looks at the return value, so you could just not return anything?

If it really is important, you could just always return that information, and then callers that don't care for it simply don't use it.
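
As an illustration of that point, a small before/after sketch with invented function names (not the actual code in this PR):

# Before: the shape of the return value depends on a flag.
def process_before(datasets, components, get_lists=False):
    models, backgrounds = [], []
    # ... update the datasets in place and fill the lists ...
    if get_lists:
        return models, backgrounds


# After: the function always does (and returns) the same thing;
# callers that need the lists can collect them themselves.
def process_after(datasets, components):
    # ... update the datasets in place ...
    return None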

Parameters
----------
datasets : object

@cdeil (Member) commented Aug 2, 2019

The syntax in Numpy style docstrings is always this:

Parameters
-----------
name : type
    description

I think in this case it might be this?

datasets : `~gammapy.utils.fitting.Datasets`
    Datasets

(Sometimes the description isn't really useful if the parameter name is already very descriptive, but one still needs to include it.)

BGnames = _get_backgrounds_names(dataset)

backgrounds = []
for component in components["components"]:

@cdeil (Member) commented Aug 2, 2019

This function is very long - I have to scroll up and down to read it.
Can you refactor one or two helper functions out?
E.g. if processing each dataset is mostly independent, that could be a helper function?
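
A rough sketch of that kind of refactoring; the helper name and its arguments are made up for illustration:

def _update_dataset_from_components(dataset, components):
    """Hypothetical helper holding the per-dataset logic from the long loop body."""
    # ... match components to this dataset, set its models and backgrounds ...
    pass


def models_to_datasets(datasets_list, components):
    # The top-level function then reduces to a short, readable loop.
    for dataset in datasets_list:
        _update_dataset_from_components(dataset, components)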

for dataset in datasets_list:
    if not isinstance(dataset.background_model, BackgroundModels):
        dataset.background_model = BackgroundModels[dataset.background_model]

@cdeil (Member) commented Aug 2, 2019

Do you test this line? Is it really possible to index into the BackgroundModels class? Probably not, and this is a bug?

models_list = []
backgrounds_list = []
datasets_dictlist = []
for dataset in self.datasets:

@cdeil (Member) commented Aug 2, 2019

Why do we have some serialisation logic here in datasets.py (the for loop and isinstance checks), and other logic in serialisation.py? Is there any reasoning to what goes where? Or would it be better to move all to serialisation.py and to only re-expose in small helper methods, like how you did it for sky models?

backgrounds_list = []
datasets_dictlist = []
for dataset in self.datasets:
    filename = str(path) + "maps_" + dataset.obs_id + ".fits"

@cdeil (Member) commented Aug 2, 2019

Be careful with paths, using str and + is error-prone and often fails on Windows.
To generate filename strings you can still use str.
Here I would recommend this:

path / "maps_{}.fits".format(dataset.obs_id)
    dataset.write(filename, overwrite)
    datasets_dictlist.append({"id": dataset.obs_id, "filename": filename})
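
For example, pathlib handles the separators portably, and you only convert to str at the very end if some API needs it (a generic sketch, not this PR's code):

from pathlib import Path

path = Path("out") / "datasets"
obs_id = 110380  # any identifier
filename = path / "maps_{}.fits".format(obs_id)
print(str(filename))  # out/datasets/maps_110380.fits (with backslashes on Windows)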

    if isinstance(dataset.background_model, BackgroundModels):

@cdeil (Member) commented Aug 2, 2019

Why do you add logic here and in other places that checks for single versus container?
That's bad, see my comment in #2102

If you don't need it now, please remove this logic.
If it is needed, but easy to fix by creating datasets in a uniform way to always have the same data members (either single or container, possibly with one entry), please apply these fixes in this PR.
If it's difficult to clean up, then OK to keep this code for now, but please add a reminder:

# TODO: remove isinstance checks once #2102  is resolved

@QRemy (Author, Contributor) commented Aug 2, 2019

I agree serialization would be simpler if datasets were initialized with only lists.

Is there something equivalent to SkyModels and BackgroundModels for spectral models?

I guess removing all the isinstance checks and forcing lists-only input, as you suggest, will break several tests and notebooks; I would prefer to look at this in a separate pull request.

@cdeil (Member) commented Aug 2, 2019

Then leave the TODO for now, and we can continue the discussion in #2102 .
Would be great if you take that on in a separate PR.
The way forward is to just simplify the code, see what breaks, and adjust tests and examples if needed.

It can be OK to accept several things on input. But before storing the value as a data member, one should always convert it to a uniform representation. This way, when accessing data members like you do here, you no longer have to deal with the multiple cases; that happens only once, in the setter.
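
A minimal sketch of that pattern with a toy class (not the actual Gammapy code): the setter normalises the input once, so all other code can rely on a single representation.

class ToyMapDataset:
    """Illustration only: accept a single background model or a list, always store a list."""

    def __init__(self, background_model=None):
        self.background_model = background_model

    @property
    def background_model(self):
        return self._background_model

    @background_model.setter
    def background_model(self, value):
        # Normalise once here; accessors never need isinstance checks.
        if value is None:
            value = []
        elif not isinstance(value, list):
            value = [value]
        self._background_model = value


d1 = ToyMapDataset(background_model="bkg-1")
d2 = ToyMapDataset(background_model=["bkg-1", "bkg-2"])
print(d1.background_model, d2.background_model)  # ['bkg-1'] ['bkg-1', 'bkg-2']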

@cdeil (Member) commented Aug 2, 2019

Doing test-driven development really helps to write such code.
E.g. the case where you have BackgroundModels[...], which is broken, should just be deleted, and only added back once you have a use case and a test that require it.

@@ -165,8 +165,8 @@ def test_map_dataset_fits_io(tmpdir, sky_model, geom, geom_etrue):
"COUNTS_BANDS",
"EXPOSURE",
"EXPOSURE_BANDS",
"BACKGROUND",
"BACKGROUND_BANDS",
"BG_BACKGROUND",

@cdeil (Member) commented Aug 2, 2019

Why is this change BACKGROUND -> BG_BACKGROUND needed?
Can't you just keep BACKGROUND and make that work?

More generally, we had issues for IRFs because of a mix of bg and bkg, and we cleaned that up a year ago to only have bkg or background, no more bg. So please review this PR and get rid of bg everywhere, using either bkg or background as names.

@@ -160,7 +160,7 @@ components:
frozen: true
- name: cube_iem
id: global
filename: $GAMMAPY_DATA/fermi-3fhl/gll_iem_v06_cutout.fits
filename: $GAMMAPY_DATA/fermi_3fhl/gll_iem_v06_cutout.fits

@cdeil (Member) commented Aug 2, 2019

So this was broken, the filename was incorrect? Is there a test now that uses this filename, or can you add a minimal test?
(alternatively, delete that part of the example YAML - we should aim to have minimal examples and tests to keep things maintainable)

@cdeil cdeil changed the title Add datasets seriliazation Add datasets serialization Aug 2, 2019
@QRemy (Contributor, Author) commented Aug 2, 2019

I agree that obs_id is confusing; I would prefer just id or dataset_id.
But Axel told me that SpectrumDataset already has an obs_id attribute:
https://docs.gammapy.org/0.13/api/gammapy.spectrum.SpectrumDataset.html#gammapy.spectrum.SpectrumDataset
So I used it for homogeneity until we decide how to rename it.

@cdeil (Member) commented Aug 2, 2019

@QRemy - Handling spectrum and map and other datasets in a homogeneous way is important.

But I think adding obs_id is going in the wrong direction, meaning more clean-up to do in the coming weeks, and having to adapt the complex serialisation code.

As far as I know, it's agreed that to support the use cases I mentioned we would not use obs_id at the datasets level, because there can be datasets that were stacked from multiple observations, and there can be a single observation that results in multiple datasets.

The solution is to have an ID or equivalently name at the dataset level.
Sherpa calls it ID in some places (e.g. here), but on their dataset class it's called "name" (see here).
You already call it "id" in the YAML serialisation.

Note that it isn't just about calling it obs_id vs name; it's mostly about the data type and the logic of handling it during datasets creation and serialisation. Currently SpectrumDataset has obs_id which can be an int or a list of int (see here), so you'd have to implement logic to handle the list-of-int case. As far as I can see, you don't support SpectrumDataset at all at the moment? So you would be free to implement an as-good-as-we-can-envision solution for map datasets and map data reduction, and then we adjust SpectrumDataset later (which I think was always the assumption).

Overall deciding what to do here is difficult, because to judge one would need to see real-world use cases as in https://docs.gammapy.org/0.13/tutorials.html and to have data reduction emit datasets and to read / write them before fitting. That's a larger task and I guess you don't want to start coding on that, leaving it to Axel and Regis to do in ~ a month?
Then I would suggest to merge this in soon (with map dataset obs_id if you prefer), and to either wait, or to focus on model serialisation instead in the meantime.

# TODO: improve file-to-file comparison test
# fitsdata="$GAMMAPY_DATA/tests/models/maps_CTA-gc.fits"
# assert filecmp.cmp(path + "maps_CTA-gc.fits", fitsdata)
# seems to not work for FITS files except if they are true copies

@cdeil (Member) commented Aug 2, 2019

Usually the FITS header contains a timestamp and you will not get bit-by-bit reproducible FITS output files.
For YAML you might get them (or you might not; there are probably float repr or other formatting changes across systems or versions).
If you can get it to work (and pass in CI) for YAML, you can keep using these character-by-character asserts.

But for FITS at least my suggestion would be that you remove this TODO, and change to more specific asserts, e.g. on some FITS header key or on the FITS data sum if there's any question whether the data was handled properly.
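
A self-contained sketch of that kind of targeted assert with astropy.io.fits; the file name, header key and values are invented for illustration:

import numpy as np
from astropy.io import fits

# Write a small FITS file as a stand-in for the serialised dataset maps.
data = np.ones((3, 3))
hdu = fits.PrimaryHDU(data=data)
hdu.header["OBJECT"] = "toy"
hdu.writeto("toy_maps.fits", overwrite=True)

# Instead of a byte-by-byte file comparison, assert on specific content
# that does not depend on timestamps or float formatting.
with fits.open("toy_maps.fits") as hdulist:
    assert hdulist[0].header["OBJECT"] == "toy"
    assert np.sum(hdulist[0].data) == 9.0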


path = str(tmpdir / "written_")
datasets.to_yaml(
    "$GAMMAPY_DATA/tests/models/written_", selection="all", overwrite=True

@cdeil (Member) commented Aug 2, 2019

You're writing to $GAMMAPY_DATA from the tests?
I think we should avoid this, leave that directory alone and only read from it, and only write / read from tmpdir, for I/O tests.
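
For example, with pytest's tmp_path (or tmpdir) fixture the test only writes inside a per-test temporary directory; a generic sketch, not the actual test in this PR:

def test_write_only_to_tmpdir(tmp_path):
    # tmp_path is a fresh pathlib.Path for each test and is cleaned up automatically,
    # so $GAMMAPY_DATA is only ever read from, never written to.
    out = tmp_path / "written_datasets.yaml"
    out.write_text("datasets: []\n")
    assert out.read_text() == "datasets: []\n"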

@cdeil cdeil assigned adonath and unassigned cdeil Aug 23, 2019
@cdeil cdeil requested review from cdeil and removed request for cdeil Aug 23, 2019
@cdeil (Member) commented Aug 23, 2019

@QRemy @adonath - as discussed this morning, this PR strongly relates to one of the most difficult questions for Gammapy - how models and datasets are linked - and this needs an as-simple-as possible and clearly understood design.

Now simple could mean different things: it could mean linking only one way (creating a tree data structure) instead of both ways (creating a graph data structure), or it could mean avoiding a tree data structure altogether and having a flat list of objects after serialisation, with some name and reference scheme used to recreate the in-memory linked data structure on read.
There are also tradeoffs between a simple structure with human-readable YAML files vs simple serialisation code that accepts a complex data structure in the YAML file.

One suggestion I have is to study what JWST is developing to solve a very similar problem:

They have complex compound WCS models that they represent as Astropy models, and they have developed ASDF to serialise them. We could use Astropy models and / or ASDF directly, or we could study how it works and then consider using a similar solution.

After some struggle (see spacetelescope/asdf#684 (comment)) I managed to get a working example:

from astropy.modeling import models
import asdf

poly_x = models.Polynomial2D(degree=2, c0_0=.2, c1_0=.11, c2_0=2.3, c0_1=.43, c0_2=.1, c1_1=.5)
model = poly_x | models.Scale(-2.3)
model(1, 2)

f = asdf.AsdfFile()
f.tree['model'] = model
f.write_to("model.asdf")

f2 = asdf.open("model.asdf")
model2 = f2.tree["model"]
model2(1, 2)

The next step (which I don't have time to pursue) would be to explore more complex models, or to see how to add extra objects like a ToyDataset to the ASDF serialisation machinery.

If you continue here with a hand-crafted YAML solution, my suggestions would be:

  • Pretty sure this is a good idea: Try to link from the dataset to the models, not the other way or both ways. I.e. create a tree data structure, not a graph data structure in the serialised version.
  • Not sure if this is a good idea: Try to use the built-in YAML feature of references (aliases) to map the in-memory linked data structure (a reference to the same source model from multiple datasets) to the same data structure in YAML, and so avoid developing an ad-hoc name-based scheme for the linked data structure in YAML.

Here's an example showing how to serialise a toy linked data structure with PyYAML:

>>> import yaml
>>> data = yaml.load("""
... left hand: &A
...   name: The Bastard Sword of Eowyn
...   weight: 30
... right hand: *A
... """)
__main__:6: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
>>> id(data["left hand"])
4422085440
>>> id(data["right hand"])
4422085440
>>> yaml.dump(data)
'left hand: &id001\n  name: The Bastard Sword of Eowyn\n  weight: 30\nright hand: *id001\n'
>>> print(yaml.dump(data))
left hand: &id001
  name: The Bastard Sword of Eowyn
  weight: 30
right hand: *id001

I think it's clear that for compound models and datasets with linked models we no longer aim for a human-writable serialisation format? For config-file style input, users would either need to write Python code, or we would have to develop a second, much simpler format limited to specifying models for the most common use cases?

@cdeil cdeil referenced this pull request Aug 23, 2019
@cdeil (Member) commented Aug 28, 2019

@adonath - there's now merge conflicts because of #2328 and #2329 on the import lines.
Do you want to resolve this and merge this PR in? Or do you want @QRemy or me to rebase?

QRemy added 7 commits Jul 27, 2019
- add Datasets .to_yaml(), .from_yaml() and set_models_from_yaml()
- modify MapDataset .from_hdulist() and .to_hdulist() to handle multiple backgrounds (i.e. background_model is an instance of BackgroundModels)
- add datasets_to_dict()
- redefine models_to_datasets() as a class and decompose it into 3 helper functions
- add models and backgrounds lists in datasets.yaml
- rename dataset_id to name and remove it from backgrounds attributes
@QRemy QRemy force-pushed the QRemy:datasets_yaml branch from e19df95 to e6aa9f0 Aug 29, 2019
@adonath adonath force-pushed the QRemy:datasets_yaml branch from 663c755 to e6aa9f0 Aug 29, 2019
@adonath (Member) commented Aug 29, 2019

I reverted my last commit in this PR and instead updated the datasets index files directly in master (c52bc2a). This should give us a green light for the CI builds once I restart them.

@adonath adonath merged commit 710bea7 into gammapy:master Aug 30, 2019
8 of 9 checks passed
Codacy/PR Quality Review: Hang in there, Codacy is reviewing your Pull request.
Scrutinizer: Analysis: 1 new issues, 14 updated code elements – Tests: passed
continuous-integration/travis-ci/pr: The Travis CI build passed
gammapy.gammapy: Build #20190829.5 had test failures
gammapy.gammapy (DevDocs): DevDocs succeeded
gammapy.gammapy (Lint): Lint succeeded
gammapy.gammapy (Test Python36): Test Python36 succeeded
gammapy.gammapy (Test Windows36): Test Windows36 succeeded
gammapy.gammapy (Test Windows37): Test Windows37 succeeded