# Metadata in ATLAS Open Data

All datasets have associated metadata -- properties shared by all the events in the dataset. In ATLAS Open Data, metadata include things like a unique numerical dataset identifier, the production cross section, and a description of the physics processes included in the sample. These metadata must be used to properly normalize the MC simulation datasets when comparing to detector data, and they can also be used to search for related samples.

The metadata are available in a package called `atlasopenmagic` that is [available on PyPI](http://pypi.org/project/atlasopenmagic/). Let's set it up and use it to explore the metadata of the most recent ATLAS Open Data release!

Note that the metadata we will be looking at are also [available on the web](https://opendata.atlas.cern/docs/data/for_education/13TeV25_metadata) in a big table, but a Python module makes it much easier to grab the numbers you need for an analysis automatically.

In [1]:
# First we install atlasopenmagic into our SWAN environment
# Notice that we need --user to avoid trying to install the package in a read-only file system
# This is a problem unique to SWAN; on binder or colab you won't need --user
%pip install --user atlasopenmagic

Collecting atlasopenmagic
  Using cached atlasopenmagic-1.0.1-py3-none-any.whl.metadata (7.2 kB)
Using cached atlasopenmagic-1.0.1-py3-none-any.whl (15 kB)
Installing collected packages: atlasopenmagic
Successfully installed atlasopenmagic-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Now we have to do a little bit of work to make sure that atlasopenmagic is available in our python path
# This is because SWAN by default does not include the local package installation area in the PYTHONPATH
# Again, this is not necessary on binder or colab
import sys
import os
sys.path += [ f'{os.environ["HOME"]}/.local/lib/python{sys.version_info.major}.{sys.version_info.minor}/site-packages' ]

# Now we can safely import atlasopenmagic
import atlasopenmagic as atom

In [3]:
# Now let's see what releases are available to us
atom.available_releases()

Available releases:
2016e-8tev        2016 Open Data for education release of 8 TeV proton-proton collisions (https://opendata.cern.ch/record/3860).
2020e-13tev       2020 Open Data for education release of 13 TeV proton-proton collisions (https://cern.ch/2r7xt).
2024r-pp          2024 Open Data for research release for proton-proton collisions (https://opendata.cern.ch/record/80020).
2024r-hi          2024 Open Data for research release for heavy-ion collisions (https://opendata.cern.ch/record/80035).
2025e-13tev-beta  2025 Open Data for education and outreach beta release for 13 TeV proton-proton collisions (https://opendata.cern.ch/record/93910).
2025r-evgen       2025 Open Data for research release for event generation (https://opendata.cern.ch/record/160000).


In [5]:
# And let's plan to use the latest release of Open Data for Outreach and Education
atom.set_release('2025e-13tev-beta')

Active release set to: 2025e-13tev-beta. Metadata cache cleared.


In [6]:
# Now we can look at the metadata for a specific sample
atom.get_metadata(345060)
# Notice that the function here will accept either the dataset identifier or the "physics short", a short unique descriptor for the sample

Fetching and caching all metadata for release: 2025e-13tev-beta...
Successfully cached 374 datasets.


{'dataset_number': '345060',
 'physics_short': 'PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l',
 'e_tag': 'e7735',
 'cross_section_pb': 28.3,
 'genFiltEff': 0.000124,
 'kFactor': 1.717,
 'nEvents': 1598000,
 'sumOfWeights': 45231011.19517517,
 'sumOfWeightsSquared': 1296676130.5944173,
 'process': 'ggH H->ZZ->llll',
 'generator': 'Powheg+Pythia8(v.230)+EvtGen(v.1.6.0)',
 'keywords': ['Higgs', 'SM', 'SMHiggs', 'ZZ', 'mH125'],
 'file_list': ['root://eospublic.cern.ch:1094//eos/opendata/atlas/rucio/user/egramsta/mc_345060.PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l.noskim.root'],
 'description': '125 GeV',
 'job_path': 'https://gitlab.cern.ch/atlas-physics/pmg/infrastructure/mc15joboptions/-/tree/master/share/DSID345xxx/MC15.345060.PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l.py',
 'CoMEnergy': None,
 'GenEvents': None,
 'GenTune': None,
 'PDF': None,
 'Release': None,
 'Filters': None,
 'release': {'name': '2025e-13tev-beta'},
 'skims': [{'skim_type': '2J2LMET30',
   'file_list': [

That's a lot of metadata! Let's go through the fields a bit:

* `dataset_number`: Unique identifier assigned to each dataset.
* `physics_short`: Short name with information regarding the content of the dataset.
* `cross_section_pb`: Production cross section of the process, measured in picobarns (pb). It quantifies the probability of a particular interaction occurring and is a fundamental parameter for understanding how likely specific particle interactions are under given conditions.
* `genFiltEff`: Generator filter efficiency, i.e. the fraction of generated events that pass the filters applied at the event generation stage. It must be included alongside the cross section when normalizing the sample.
* `kFactor`: Multiplicative correction factor used to account for higher-order effects in theoretical calculations. It adjusts the leading-order theoretical predictions to better match the observed data by incorporating next-to-leading order (NLO) or next-to-next-to-leading order (NNLO) corrections.
* `nEvents`: Total count of the events in the (unskimmed) dataset.
* `sumOfWeights` (`sumOfWeightsSquared`): Sum of the event weights (or event weights squared) in the released dataset. Use `sumOfWeights` to normalize the samples, and `sumOfWeightsSquared` to understand the statistical power of the dataset.
* `generator`: Specifies the simulation software used to generate the data. Information about the generators can be found in the Simulation Tools section.
* `keywords`: Terms or phrases associated with the dataset that help to find specific datasets.
* `process`: Brief description of the physics process being studied. For instance, "H->γγ" denotes the Higgs boson decaying into two photons.
* `job_path`: Link to the specific code or configuration files used to generate the sample. Sometimes these will be fairly easy to understand; in some cases they are quite complex and difficult for non-experts to understand.
* `description`: Longer description of the process that was generated to create the sample.
* `e_tag`: ATLAS software configuration (including software version) that was used to create the sample.
* `file_list`: The list of paths for the unskimmed files in the dataset. 
* `skims`: The list of metadata for the skimmed files in the dataset. Only one skimmed or unskimmed version should be used at a time. Within the skims we have:
   * A file list for the skim.
   * The name of the skim. The exact selections are [described here](https://opendata.atlas.cern/docs/data/for_education/13TeV25_details).
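
Since only one skimmed or unskimmed version should be used at a time, it can help to pick the file list in one place. The following is a minimal sketch that works on a dictionary shaped like the `get_metadata` output above; the helper name `files_for` and the shortened example record are hypothetical and only illustrate the structure.

```python
def files_for(metadata, skim=None):
    """Return the unskimmed file list, or the file list of one named skim."""
    if skim is None:
        return list(metadata["file_list"])
    # 'skims' is a list of dictionaries, each with 'skim_type' and 'file_list'
    for entry in metadata.get("skims") or []:
        if entry["skim_type"] == skim:
            return list(entry["file_list"])
    raise KeyError(f"no skim named {skim!r} in dataset {metadata.get('dataset_number')}")


# Hypothetical, shortened record used only for illustration
example = {
    "dataset_number": "345060",
    "file_list": ["unskimmed.root"],
    "skims": [{"skim_type": "2J2LMET30", "file_list": ["skim_a.root", "skim_b.root"]}],
}

print(files_for(example))                    # unskimmed files
print(files_for(example, skim="2J2LMET30"))  # files for one skim
```

In a real analysis the `example` dictionary would be replaced by the result of `atom.get_metadata(...)`.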

In [9]:
# Get an individual field of metadata
xsec = atom.get_metadata('345060', 'cross_section_pb')
print(xsec)

28.3
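
With individual fields in hand, the normalization of a simulated sample can be sketched as follows. This is a minimal illustration rather than an official recipe: it assumes the usual convention that each event weight is scaled by luminosity × cross section × k-factor × filter efficiency / sum of weights, and the luminosity value is a placeholder that should be replaced with the number quoted for the release you are using. The helper names are hypothetical. A full analysis weight would typically also include per-event weights and scale factors stored in the files themselves.

```python
def normalization_weight(metadata, luminosity_fb):
    """Per-event scale: lumi * xsec * kFactor * genFiltEff / sumOfWeights."""
    luminosity_pb = luminosity_fb * 1000.0  # 1 fb^-1 = 1000 pb^-1
    effective_xsec_pb = (
        metadata["cross_section_pb"] * metadata["kFactor"] * metadata["genFiltEff"]
    )
    return luminosity_pb * effective_xsec_pb / metadata["sumOfWeights"]


def effective_events(metadata):
    """Equivalent number of unweighted events: (sum w)^2 / (sum w^2)."""
    return metadata["sumOfWeights"] ** 2 / metadata["sumOfWeightsSquared"]


# Values copied from the metadata printed earlier for dataset 345060
sample = {
    "cross_section_pb": 28.3,
    "genFiltEff": 0.000124,
    "kFactor": 1.717,
    "sumOfWeights": 45231011.19517517,
    "sumOfWeightsSquared": 1296676130.5944173,
}

LUMINOSITY_FB = 36.0  # assumed placeholder, not an official value

scale = normalization_weight(sample, LUMINOSITY_FB)
print(f"per-event scale factor: {scale:.3e}")
print(f"effective number of events: {effective_events(sample):.0f}")
```

The effective number of events gives a rough sense of the statistical power of a weighted sample: the larger the spread of the weights, the further it falls below the raw event count.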


In [None]:
# Keywords are a great way to find datasets that you're interested in. Let's see what keywords are available.
atom.available_keywords()

In [None]:
# Now let's find datasets that match one of those keywords
atom.match_metadata(field='keywords',value='Higgs')
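
If you want to require several keywords at once, one option is to filter metadata dictionaries yourself. The sketch below is plain Python and is not the library's implementation; it assumes records shaped like the `get_metadata` output, with a `keywords` list. The helper name and the second and third records (including their dataset numbers) are hypothetical.

```python
def with_all_keywords(records, required):
    """Keep the records whose 'keywords' list contains every required keyword."""
    required = set(required)
    return [r for r in records if required <= set(r.get("keywords") or [])]


# Illustrative records; only the first mirrors a dataset shown in this notebook
records = [
    {"dataset_number": "345060", "keywords": ["Higgs", "SM", "SMHiggs", "ZZ", "mH125"]},
    {"dataset_number": "000001", "keywords": ["Higgs", "WW"]},  # hypothetical
    {"dataset_number": "000002", "keywords": ["ZZ"]},           # hypothetical
]

selected = with_all_keywords(records, ["Higgs", "ZZ"])
print([r["dataset_number"] for r in selected])
```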

The [atlasopenmagic documentation](https://pypi.org/project/atlasopenmagic/) has many more examples of searches; please feel free to play with the package yourself, and [open an issue](https://github.com/atlas-outreach-data-tools/atlasopenmagic/issues) or a [pull request](https://github.com/atlas-outreach-data-tools/atlasopenmagic/pulls) if there is a feature that you would like to have available!