Add datasets serialization #2314
cdeil left a comment
This is a continuation of #2296
@QRemy - Thanks!
I've left a bunch of inline comments. Many of them are superficial - I apologise, but I don't have time today to read and understand this code in-depth. I'm still OK to just polish this a bit and merge it in, and then in a few weeks with Axel we can do a larger review of the serialisation code.
My main comment here is a concern about adding `obs_id` to datasets.
Can you avoid introducing `obs_id`?
If you absolutely need it, my suggestion would be to add a "name" attribute for datasets instead, which could then be filled with the observation ID where relevant.
For the next round of review I would then start with this high-level description to know your intent for the solution, and then to review the implementation.
I agree that obs_id is confusing, would prefer just id or dataset_id.
@QRemy - Handling spectrum and map and other datasets in a homogeneous way is important.
But I think adding `obs_id` to datasets is problematic.
As far as I know, it's agreed that to support the use cases I mentioned we would not use `obs_id`.
The solution is to have an ID, or equivalently a name, at the dataset level.
Note that it isn't just a question of what to call the attribute: a dataset is not in general tied to a single observation.
Overall, deciding what to do here is difficult, because to judge one would need to see real-world use cases as in https://docs.gammapy.org/0.13/tutorials.html, and to have data reduction emit datasets and to read / write them before fitting. That's a larger task, and I guess you don't want to start coding on that, leaving it to Axel and Regis to do in about a month?
@QRemy @adonath - as discussed this morning, this PR strongly relates to one of the most difficult questions for Gammapy - how models and datasets are linked - and this needs an as-simple-as possible and clearly understood design.
Now "simple" could mean different things: it could mean linking only one way (creating a tree data structure) instead of both ways (creating a graph with back-references), or it could mean avoiding a linked structure in the serialised form and having a flat list of objects plus a name-and-reference scheme that is used to recreate the in-memory linked data structure on read.
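To make the second option concrete, here is a minimal, hypothetical sketch (plain Python, not actual gammapy API; all keys and names are made up for illustration) of a flat serialised structure where datasets reference models by name, and the links are resolved on read:

```python
# Hypothetical flat serialisation with a name/reference scheme.
# None of these keys are gammapy API; they only illustrate the idea.
serialised = {
    "models": [
        {"name": "source1", "type": "PowerLawSpectralModel"},
    ],
    "datasets": [
        {"name": "dataset-a", "model": "source1"},
        {"name": "dataset-b", "model": "source1"},
    ],
}

# On read, resolve the "model" name references back into shared objects.
models = {m["name"]: m for m in serialised["models"]}
datasets = [
    {"name": d["name"], "model": models[d["model"]]}
    for d in serialised["datasets"]
]

# Both datasets now point at the same in-memory model object.
assert datasets[0]["model"] is datasets[1]["model"]
```

The serialised form stays a flat, order-independent list, and all linking lives in one resolution step on read.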
One suggestion I have is to study what JWST is developing to solve a very similar problem:
They have complex compound WCS models that they represent as Astropy models, and they have developed ASDF to serialise them. We could use Astropy models and / or ASDF directly, or we could study how it works and then consider using a similar solution.
After some struggle (see spacetelescope/asdf#684 (comment)) I managed to get a working example:
The next step (which I don't have time to pursue) would be to explore more complex models, or to see how to add extra custom objects to the tree.
If you continue here with a hand-crafted YAML solution, my suggestions would be:
Here's an example showing how to serialise a toy linked data structure with PyYAML:
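A minimal version of such an example (assuming PyYAML; `safe_dump` emits an anchor/alias pair for an object that appears more than once, and `safe_load` resolves the alias back to a shared object):

```python
import yaml

# Toy linked structure: two datasets share one model object.
model = {"name": "source1", "type": "PowerLaw"}
datasets = [
    {"name": "dataset-a", "model": model},
    {"name": "dataset-b", "model": model},
]

# The shared model is written once as an anchor (&id001) and then
# referenced as an alias (*id001).
text = yaml.safe_dump(datasets)

# On load, the alias resolves to the very same object again.
loaded = yaml.safe_load(text)
assert loaded[0]["model"] is loaded[1]["model"]
```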
I think it's clear that for compound models and datasets with linked models we no longer aim to have a human-writable serialisation format? For config-file style input, users would either need to write Python code, or we would have to develop a second, much simpler format limited to specifying models for the most common use cases?
- add `Datasets` `.to_yaml()`, `.from_yaml()` and `set_models_from_yaml()`
- modify `MapDataset` `.from_hdulist()` and `.to_hdulist()` to handle multiple backgrounds (i.e. `background_model` isinstance `BackgroundModels`)
- add `datasets_to_dict()`
- redefine `models_to_datasets()` as a class and decompose it into 3 helper functions
- add models and backgrounds lists in `datasets.yaml`
- rename `dataset_id` into `name` and remove it from the backgrounds attributes