Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MapDatasetMetaData container #4853

Merged
merged 13 commits into from
Feb 5, 2024
Merged

Conversation

AtreyeeS
Copy link
Member

@AtreyeeS AtreyeeS commented Oct 24, 2023

Addresses #4792.

This PR adds a meta container on the MapDataset object.

  1. The stacking is not done in place currently.
  2. The optional dict keys are just stacked - and then users can decide what to do with it
  3. Stacking multiple instruments raises a warning - change to error and forbid?
  4. The GTI table is already present on the map dataset, not replicating it here
  5. Maybe another property maker can be added? To know how this dataset was created, eg: the outcome of some maker process or created by hand by the user. Or is that provenance information @bkhelifi
  6. In https://github.com/gammapy/gammapy/blob/main/gammapy/maps/utils.py#L77 we define INVALID_VALUE.int = np.nan
    However, isinstance(np.nan, int) returns False, which raises ValidationError. I have put -999 as default for the event types, any better solution?

opinions please @registerrier @adonath @QRemy @maxnoe

Still needs in future PRs:

  • Deprecate the meta table?
  • MapDatasetMaker.create_metadata()
  • FluxPointsDataset meta object

@AtreyeeS AtreyeeS added this to In progress in gammapy.datasets via automation Oct 24, 2023
@AtreyeeS AtreyeeS added this to the 1.2 milestone Oct 24, 2023
@AtreyeeS AtreyeeS linked an issue Oct 24, 2023 that may be closed by this pull request
@bkhelifi
Copy link
Member

Thanks a lot, @AtreyeeS .
About 2, it is not clear how the metadata reduction will be made. In order to avoid an exponential growing of metadata, I would not stack them per default now. I propose to think about them later when exchanging about metata reduction (with @kosack and @maxnoe ).
About 3, I would have say that this is forbidden, when instrument is not None. I do not see a reasonable use case for which it will be possible. However, one could change this behaviour in a future PR.
About 5, I think that this is indeed a provenance information.

For information, I am working on the FluxMetaData container. Once, we will have the main containers of the DL3 and DL4, one could work on the makers... the same for the estimators when we will have the main one for the DL4 and DL5.

Copy link
Contributor

@registerrier registerrier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AtreyeeS .

Maybe a very general comment first. We already see that informations are replicated between many metadata objects. Also, it seems useless to validate the same information in many different places.

This means that we need a higher granularity in metadata structures. Our ObjectMetaData classes will then mostly contain lower level metadata objects + some informations specific to the object.

Then stacking would mostly be building a list of e.g. ObsInfoMetaData and EventMetaData.

It would be good to also identify entries that are specific to the Dataset, if there are.
Maybe the name, possibly the target_name

event_type: Optional[Union[int, list[int]]]
optional: Optional[dict]

@validator("creation")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure you need a validation here.
Technically, in general, I think the CreatorMetaData can be None until the object is serialized (unless the information is read from disk).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought MapDataset.empty() can add the CreatorMetaData and the rest of the information filled in by the MapDatasetMaker

f"Incorrect pointing. Expect CreatorMetaData got {type(v)} instead."
)

@validator("instrument")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here no validation is needed either as str is a standard type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this I will remove. I wanted to be sure we forbid a list of strings here

telescope: Optional[Union[str, list[str]]]
observation_mode: Optional[Union[str, list]]
pointing: Optional[Union[SkyCoord, list[SkyCoord]]]
obs_ids: Optional[Union[int, list[int]]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure obs_id can only be int?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same question about event types as well.

Comment on lines 91 to 94
@validator("obs_ids", "event_type")
def validate_obs_ids(cls, v):
if v is None:
return -999
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would you use a value like -999. It could stay None. No?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With None, I get a pydantic.error_wrappers.ValidationError none is not an allowed value (type=type_error.none.not_allowed)

@AtreyeeS
Copy link
Member Author

I updated to use pydantic 2.0.

I think the present stacking logic is too repetitive. Can it be simplified @adonath ?

@adonath
Copy link
Member

adonath commented Jan 23, 2024

Thanks a lot @AtreyeeS! Probably the following code structure is simpler?

from pydantic import BaseModel, RootModel
from typing import List, Optional


class MapDatasetMetaDataItem(BaseModel):
    name: str
    event_type: Optional[str] = None


class MapDatasetMetaData(RootModel):
    root: List[MapDatasetMetaDataItem]
    instrument: Optional[str] = None
    creation: ...

    def __iter__(self):
        return iter(self.root)

    def __getitem__(self, item):
        # could use the obs id here
        return self.root[item]
    
    def to_table(self):
        pass

    def from_table(self):
        pass
        
    def stack(self, item):
        # just append the list here
        pass


meta = MapDatasetMetaData(
    [
        {"name": "hess"},
        {"name": "magic"},
        {"name": "veritas"},
        {"name": "fact"},
        {"name": "cta"},
    ]
)

meta[3].name

Having parallel lists is usually not a good idea, as it leads to these repetitive code statements. Introducing two separate classes might be the better solution here.

@bkhelifi
Copy link
Member

If we follow the ideaof @adonath , maybe one should keep the same logics of naming scheme between the class and the metadata class, ie:
MapDatasetMetaData, MapDatasetsMetaData (then leading to a change of the name of the python file)

In the list of instruments, one could add hawc, fermilat, astri....

Otherwise, I am wondering whether this class with this content is more a DatasetMetaData class... It seems very generic to all dataset child classes. It not it?

Copy link

codecov bot commented Jan 29, 2024

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (b1b171c) 75.00% compared to head (1474bb9) 75.04%.
Report is 5 commits behind head on main.

Files Patch % Lines
gammapy/datasets/map.py 88.23% 2 Missing ⚠️
gammapy/datasets/metadata.py 97.61% 1 Missing ⚠️
gammapy/utils/fits.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4853      +/-   ##
==========================================
+ Coverage   75.00%   75.04%   +0.04%     
==========================================
  Files         232      233       +1     
  Lines       34554    34622      +68     
==========================================
+ Hits        25917    25982      +65     
- Misses       8637     8640       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@AtreyeeS
Copy link
Member Author

I added a from_meta_table function to create a meta object from a meta table, and added computation of the meta in makers.

So the current behaviour is that there is a mechanism

  1. to compute the meta container from a meta table, and the meta table keeps information on all dataset operations (stacking, to_image, slicing etc). The meta container just contains the creation info.
  2. Create the meta container during data reduction using the meta table

Copy link
Contributor

@registerrier registerrier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AtreyeeS . I have left a few comments.

gammapy/datasets/metadata.py Outdated Show resolved Hide resolved
gammapy/datasets/metadata.py Outdated Show resolved Hide resolved
gammapy/datasets/metadata.py Outdated Show resolved Hide resolved
"instrument": "INSTRUM",
"telescope": "TELESCOP",
"observation_mode": "OBS_MODE",
"pointing": "POINTING",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this you don't need since "pointing" is already a MetaData object

@AtreyeeS
Copy link
Member Author

AtreyeeS commented Feb 2, 2024

This now uses the ObsInfoMetaData.
Ready for final review from side

"creation": "CREATION",
"event_types": "EVT_TYPE",
"optional": "OPTIONAL",
"pointing": "POINTING",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without adding pointing, to_header fails if altaz is not present

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I see. This is then a bug. Initially with SkyCoord set to np.nan the serialization wouldn't fail because v.alt and v.az would exist. This is no longer the case...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #5075 and #5076.

registerrier
registerrier previously approved these changes Feb 2, 2024
Copy link
Contributor

@registerrier registerrier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AtreyeeS . Looks good to me! No further comment.

gammapy/datasets/tests/test_metadata.py Show resolved Hide resolved
@AtreyeeS
Copy link
Member Author

AtreyeeS commented Feb 2, 2024

I checked that the meta information is not serialised with the dataset object.
I can add that in a following PR.

_tag: ClassVar[Literal["obs_info"]] = "obs_info"

obs_id: int
obs_id: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it changed to str?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed previously that obs_ids don't need to be necessarily integers, and can be strings, and thus the MapDatasetMetaData was taking strings for obs_id

Having a str accepts both int and str on input

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the gadf requirement is that this is an int. That's why I kept this constrain when introducing the ObsInfoMetaData.
Unless there is a strong constraint for the MapDatasetMetaData I'd rather keep the same solution and update later if necessary. No?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no opinions on this.. I can then change the behaviour for the MapDataset as well...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One should take care of this type, as obs_id is used everywhere... I think that most of the time an int is used.
If we plan to change this for our internal data (here, we do not care of gadf), many many codes should be checked and changed.
But it is true that having internally a str is more generic. (but it requires many changes in our code)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkhelifi The current implementation allows to read int as seen in the tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but pydantic is changing the type to str.
For now, I think it is simpler to stick to int.

@bkhelifi
Copy link
Member

bkhelifi commented Feb 2, 2024

A test is failing:
FAILED ../../.tox/py311-test-alldeps_noray/lib/python3.11/site-packages/gammapy/data/tests/test_observations.py::test_observation_read - AssertionError: assert '20136' == 20136

  • where '20136' = <gammapy.data.observations.Observation object at 0x7febafbb75d0>.obs_id

Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
Signed-off-by: Atreyee Sinha <asinha@ucm.es>
@AtreyeeS
Copy link
Member Author

AtreyeeS commented Feb 2, 2024

  1. reverted obs_id to int
  2. a minimal serialisation of the meta implemented by extending the PRIMARY header
  3. the serialisation does not work correctly when obs_info is a list of ObsInfoMetaData - I can try to fix that for 1.3

registerrier
registerrier previously approved these changes Feb 2, 2024
Copy link
Contributor

@registerrier registerrier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AtreyeeS . This looks good! No further comment from my side.

Signed-off-by: Régis Terrier <rterrier@apc.in2p3.fr>
@registerrier
Copy link
Contributor

I have corrected an issue with PointingInfoMetaData.from_header() directly on your branch @AtreyeeS . The issue was that the default value was set to SkyCoord(np.nan, no.nan) which was breaking fits serialization.
This is now solved with default values set to None if the header keywords are not defined.

Copy link
Contributor

@registerrier registerrier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @AtreyeeS . Let's wait for the CI to run and merge.

@registerrier registerrier merged commit 8577bb9 into gammapy:main Feb 5, 2024
16 checks passed
gammapy.datasets automation moved this from In progress to Done Feb 5, 2024
@bkhelifi
Copy link
Member

This PR is associated to the issue #4791

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

Define and add metadata on MapDataset
4 participants