# Metadata in ATLAS Open Data

All datasets have associated metadata -- properties shared by all the events in the dataset. In the ATLAS Open Data, metadata includes things like a unique numerical dataset identifier, the production cross section, and the description of the physics processes included in the sample. These metadata must be used to properly normalize the MC simulation datasets when comparing to detector data, and they can also be used to search for relevant or related samples.

The metadata are available in a package called `atlasopenmagic` that is [available on PyPI](http://pypi.org/project/atlasopenmagic/). Let's set it up and use it to explore the metadata of the most recent ATLAS Open Data release!

Note that the metadata we will be looking at are also [available on the web](https://opendata.atlas.cern/docs/data/for_education/13TeV25_metadata) in a big table, but using a Python module makes it much easier to grab the numbers you need for an analysis automatically.

In [1]:
# First we install atlasopenmagic into our SWAN environment.
# Notice that we need --user to avoid trying to install the package in a
# read-only file system. This is a problem unique to SWAN; on binder or colab
# you won't need --user, but it doesn't hurt.
%pip install --user atlasopenmagic

Collecting atlasopenmagic
  Downloading atlasopenmagic-1.8.0-py3-none-any.whl.metadata (5.8 kB)
Downloading atlasopenmagic-1.8.0-py3-none-any.whl (21 kB)
Installing collected packages: atlasopenmagic
Successfully installed atlasopenmagic-1.8.0


In [2]:
# Now we have to do a little bit of work to make sure that atlasopenmagic is
# available in our python path. This is because SWAN by default does not include
# the local package installation area in the PYTHONPATH. Again, this is not
# necessary on binder or colab; there you can remove these lines if you like,
# though they don't do any harm.
import sys
import os
sys.path += [ f'{os.environ["HOME"]}/.local/lib/python{sys.version_info.major}.{sys.version_info.minor}/site-packages' ]

# Now we can safely import atlasopenmagic
import atlasopenmagic as atom

In [3]:
# Now let's see what releases are available to us
atom.available_releases()

Available releases:
2016e-8tev           2016 Open Data for education release of 8 TeV proton-proton collisions (https://opendata.cern.ch/record/3860).
2020e-13tev          2020 Open Data for education release of 13 TeV proton-proton collisions (https://cern.ch/2r7xt).
2024r-pp             2024 Open Data for research release for proton-proton collisions (https://opendata.cern.record/80020).
2024r-hi             2024 Open Data for research release for heavy-ion collisions (https://opendata.cern.ch/record/80035).
2025e-13tev-beta     2025 Open Data for education and outreach beta release for 13 TeV proton-proton collisions (https://opendata.cern.ch/record/93910).
2025r-evgen-13tev    2025 Open Data for research release for event generation at 13 TeV (https://opendata.cern.ch/record/160000).
2025r-evgen-13p6tev  2025 Open Data for research release for event generation at 13.6 TeV (https://opendata.cern.ch/record/160000).


{'2016e-8tev': '2016 Open Data for education release of 8 TeV proton-proton collisions (https://opendata.cern.ch/record/3860).',
 '2020e-13tev': '2020 Open Data for education release of 13 TeV proton-proton collisions (https://cern.ch/2r7xt).',
 '2024r-pp': '2024 Open Data for research release for proton-proton collisions (https://opendata.cern.record/80020).',
 '2024r-hi': '2024 Open Data for research release for heavy-ion collisions (https://opendata.cern.ch/record/80035).',
 '2025e-13tev-beta': '2025 Open Data for education and outreach beta release for 13 TeV proton-proton collisions (https://opendata.cern.ch/record/93910).',
 '2025r-evgen-13tev': '2025 Open Data for research release for event generation at 13 TeV (https://opendata.cern.ch/record/160000).',
 '2025r-evgen-13p6tev': '2025 Open Data for research release for event generation at 13.6 TeV (https://opendata.cern.ch/record/160000).'}

In [4]:
# And let's use the latest release of Open Data for Outreach and Education
atom.set_release('2025e-13tev-beta')

Fetching metadata for release: 2025e-13tev-beta...
Fetching datasets: 100%|██████████| 374/374 [00:00<00:00, 683.86datasets/s]
✓ Successfully cached 374 datasets.
Active release: 2025e-13tev-beta. (Datasets path: REMOTE)


If you want to immediately find out what samples are available, you can use:
```python
atom.available_datasets()
```

That will return a long list of integer "dataset identifiers". There aren't good rules about what dataset gets assigned what integer, so they're most useful for software that needs to keep lists of samples. In this notebook, we're going to use a sample that we already know exists; later on we'll show you how to identify the right samples for your use-case.

In [5]:
# Now we can look at the metadata for a specific sample
atom.get_metadata(345060)
# Notice that the function here will accept either the dataset identifier or the
# "physics short", a short unique descriptor for the sample

{'dataset_number': '345060',
 'physics_short': 'PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l',
 'e_tag': 'e7735',
 'cross_section_pb': 0.006024,
 'genFiltEff': 1.0,
 'kFactor': 1.0,
 'nEvents': 1598000,
 'sumOfWeights': 45231011.19517517,
 'sumOfWeightsSquared': 1296676130.5944173,
 'process': 'ggH H->ZZ->llll',
 'generator': 'Powheg+Pythia8(v.230)+EvtGen(v.1.6.0)',
 'keywords': ['Higgs', 'SM', 'SMHiggs', 'ZZ', 'mH125'],
 'description': '125 GeV',
 'job_path': 'https://gitlab.cern.ch/atlas-physics/pmg/infrastructure/mc15joboptions/-/tree/master/share/DSID345xxx/MC15.345060.PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l.py',
 'CoMEnergy': None,
 'GenEvents': None,
 'GenTune': None,
 'PDF': None,
 'Release': None,
 'Filters': None,
 'cross_section_uncertainty': None,
 'hepmc_version': None,
 'release': {'name': '2025e-13tev-beta'}}

That's a lot of metadata! Let's go through the fields a bit:

* `dataset_number`: Unique identifier assigned to each dataset.
* `physics_short`: Short name with information regarding the content of the dataset.
* `cross_section_pb`: A [cross section](https://atlas.cern/glossary/cross-section) represents the probability of a particular interaction occurring, measured in picobarns (pb). It is a fundamental parameter that helps in understanding the likelihood of specific particle interactions under given conditions. The cross section is usually what is returned by the generator; you need to multiply by the filter efficiency and k-factor (see below) to get a complete sample weight.
* `genFiltEff`: Generator filter efficiency. It indicates the fraction of generated events that pass the filters applied when the sample was produced.
* `kFactor`: Multiplicative correction factor used to account for higher-order effects in theoretical calculations. It adjusts the leading-order theoretical predictions to better match the observed data by incorporating next-to-leading order (NLO) or next-to-next-to-leading order (NNLO) corrections.
* `nEvents`: Total count of the events in the (unskimmed) dataset.
* `sumOfWeights` (`sumOfWeightsSquared`): Sum of the event weights (or of the squared event weights) in the released dataset. Use `sumOfWeights` to normalize the sample, and `sumOfWeightsSquared` to understand the statistical power of the dataset.
* `generator`: Specifies the simulation software used to generate the data. Generators are described in more detail [here](https://opendata.atlas.cern/docs/documentation/monte_carlo/simulation_tools).
* `keywords`: Terms or phrases associated with the dataset that help to find specific datasets.
* `process`: Brief description of the physics process being studied. For instance, "H->γγ" denotes the Higgs boson decaying into two photons. Often this describes the [Feynman diagram](https://cds.cern.ch/record/2759490/) of the highest-energy interaction in the event.
* `job_path`: Link to the specific code or configuration files used to generate the sample. Sometimes these will be fairly easy to understand; in some cases they are quite complex and difficult for non-experts to understand.
* `description`: Longer description of the process that was generated to create the sample.
* `e_tag`: ATLAS software configuration (including software version) that was used to create the sample.
* `file_list`: The list of paths for the unskimmed files in the dataset.
* `skims`: The list of metadata for the skimmed files in the dataset. Only one skimmed or unskimmed version should be used at a time. Within the skims we have:
   * A file list for the skim.
   * The name of the skim. The exact selections are [described here](https://opendata.atlas.cern/docs/data/for_education/13TeV25_details).
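
As a rough sketch of how these fields combine, the snippet below computes a per-event normalization factor and an effective event count from the values shown above for dataset 345060. The helper names `normalization_scale` and `effective_events` and the integrated luminosity of 36000 pb^-1 are illustrative assumptions, not part of `atlasopenmagic`; substitute the luminosity of the data you actually compare to.

```python
# Metadata values copied from the get_metadata(345060) output above.
meta = {
    "cross_section_pb": 0.006024,
    "genFiltEff": 1.0,
    "kFactor": 1.0,
    "nEvents": 1598000,
    "sumOfWeights": 45231011.19517517,
    "sumOfWeightsSquared": 1296676130.5944173,
}

# ASSUMPTION: illustrative integrated luminosity in pb^-1, not a release value.
LUMI_PB = 36000.0


def normalization_scale(meta, lumi_pb):
    """Factor to multiply each MC event weight by (hypothetical helper)."""
    expected = (
        meta["cross_section_pb"] * meta["genFiltEff"] * meta["kFactor"] * lumi_pb
    )
    return expected / meta["sumOfWeights"]


def effective_events(meta):
    """Number of unit-weight events with the same statistical power."""
    return meta["sumOfWeights"] ** 2 / meta["sumOfWeightsSquared"]


scale = normalization_scale(meta, LUMI_PB)
expected_yield = scale * meta["sumOfWeights"]  # events before any selection
n_eff = effective_events(meta)

print(f"per-event scale factor: {scale:.3e}")
print(f"expected events before selection: {expected_yield:.1f}")
print(f"effective number of events: {n_eff:.0f}")
```

The effective event count can never exceed `nEvents`, so comparing the two is a quick sanity check on the weights.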

In [6]:
# Get an individual field of metadata
xsec = atom.get_metadata('345060', 'cross_section_pb')
print(xsec)

0.006024


In [7]:
# Keywords are a great way to find datasets that you're interested in. Let's see
# what keywords are available.
atom.available_keywords()

['1jet',
 '1lepton',
 '2electron',
 '2lepton',
 '2muon',
 '2neutrino',
 '2photon',
 '2tau',
 '3lepton',
 '3photon',
 '4lepton',
 '4top',
 'BSM',
 'BSMHiggs',
 'BSMtop',
 'Higgs',
 'NLO',
 'QCD',
 'SM',
 'SMHiggs',
 'SSM',
 'VBF',
 'VBS',
 'W',
 'WHiggs',
 'WIMP',
 'WW',
 'WZ',
 'Wprime',
 'Wt',
 'Z',
 'ZH',
 'ZHiggs',
 'ZZ',
 'Zgamma',
 'Zprime',
 'allHadronic',
 'allhadronic',
 'bbbar',
 'bsm',
 'bsmtop',
 'chargino',
 'diboson',
 'dijet',
 'diphoton',
 'egamma',
 'electron',
 'electroweak',
 'exotic',
 'gaugino',
 'gluino',
 'gluonFusionHiggs',
 'gravitino',
 'heavyBoson',
 'higgs',
 'higgsino',
 'inclusive',
 'invisible',
 'jets',
 'lepton',
 'leptoquark',
 'mH125',
 'multilepton',
 'muon',
 'neutralino',
 'neutrino',
 'nlo',
 'performance',
 'photon',
 'quark',
 'resonance',
 'rpv',
 'sChannel',
 'scalar',
 'schannel',
 'simplifiedModel',
 'simplifiedmodel',
 'singleTop',
 'singletop',
 'sm',
 'smhiggs',
 'squark',
 'stau',
 'stop',
 'susy',
 'tHiggs',
 'tZ',
 'tau',
 'thiggs',
 't

In [8]:
# Now let's find datasets that match the `Higgs` keyword
atom.match_metadata(field='keywords',value='Higgs')

[('341456', 'PowhegPythia8EvtGen_CT10_AZNLO_ZH125J_MINLO_veveWWlvqq_VpT'),
 ('341458', 'PowhegPythia8EvtGen_CT10_AZNLO_ZH125J_MINLO_vmuvmuWWlvqq_VpT'),
 ('341460', 'PowhegPy8EG_CT10_AZNLO_ZH125J_MINLO_vtauvtauWWlvqq_VpT'),
 ('343981', 'PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_gamgam'),
 ('344158', 'aMcAtNloPythia8EvtGen_A14NNPDF23LO_ppx0_FxFx_Np012_SM'),
 ('345056', 'PowhegPythia8EvtGen_NNPDF3_AZNLO_ZH125J_MINLO_vvbb_VpT'),
 ('345058', 'PowhegPythia8EvtGen_NNPDF3_AZNLO_ggZH125_vvbb'),
 ('345060', 'PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l'),
 ('345061', 'PowhegPythia8EvtGen_NNPDF3_AZNLO_ggZH125_HgamgamZinc'),
 ('345066', 'PowhegPythia8EvtGen_NNPDF3_AZNLO_ggZH125_ZZ4lepZinc'),
 ('345097', 'PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_mumu'),
 ('345098', 'PowhegPythia8EvtGen_NNPDF3_AZNLO_ggZH125_Hmumu_Zinc'),
 ('345103', 'PowhegPythia8EvtGen_NNPDF30_AZNLO_ZH125J_Hmumu_Zincl_MINLO'),
 ('345104', 'PowhegPythia8EvtGen_NNPDF30_AZNLO_WpH125J_Hmumu_Wincl_MINLO'),
 ('345105', 'PowhegPythia8E

In [12]:
# We can also pass a list to find datasets that match all of the given keywords
atom.match_metadata(field='keywords',value=['Higgs','ZZ'])

[('344158', 'aMcAtNloPythia8EvtGen_A14NNPDF23LO_ppx0_FxFx_Np012_SM'),
 ('345060', 'PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4l'),
 ('345066', 'PowhegPythia8EvtGen_NNPDF3_AZNLO_ggZH125_ZZ4lepZinc'),
 ('346340', 'PowhegPy8EG_A14NNPDF23_NNPDF30ME_ttH125_ZZ4l_allhad'),
 ('346341', 'PowhegPy8EG_A14NNPDF23_NNPDF30ME_ttH125_ZZ4l_semilep'),
 ('346342', 'PowhegPy8EG_A14NNPDF23_NNPDF30ME_ttH125_ZZ4l_dilep'),
 ('346447', 'PhH7EG_H7UE_NNPDF30_VBFH125_ZZ4lep_noTau'),
 ('346451', 'PhH7EG_H7UE_NNPDF3_ggZH125_ZZ4lepZinc'),
 ('346452', 'PhH7EG_H7UE_NNPDF30ME_ttH125_ZZ4l_allhad_noTau'),
 ('346453', 'PhH7EG_H7UE_NNPDF30ME_ttH125_ZZ4l_semilep_noTau'),
 ('346454', 'PhH7EG_H7UE_NNPDF30ME_ttH125_ZZ4l_dilep_noTau'),
 ('346588', 'PowhegPythia8EvtGen_NNLOPS_nnlo_30_ggH125_ZZ4nu_MET75'),
 ('346633', 'PowhegPy8EG_A14NNPDF23_NNPDF30ME_ttH125_ZZ4nu_allhad'),
 ('346634', 'PowhegPy8EG_A14NNPDF23_NNPDF30ME_ttH125_ZZ4nu_dilep'),
 ('450576', 'PowhegPythia8EvtGen_NNLOPS_nnlo_30_VBFH125_ZZllbb')]

In [15]:
# The same works for other combinations, such as `Higgs` and `WW`
atom.match_metadata(field='keywords',value=['Higgs','WW'])

[('345324', 'PowhegPythia8EvtGen_NNLOPS_NN30_ggH125_WWlvlv_EF_15_5'),
 ('345948', 'PowhegPy8EG_NNPDF30_AZNLOCTEQ6L1_VBFH125_WWlvlv'),
 ('346877', 'PhH7EG_NNPDF30_VBFH125_WWlvlv'),
 ('500335', 'MGH7EG_VBFHWWlvlv')]
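
The search results are plain lists of `(dataset_number, physics_short)` tuples, so ordinary Python can refine them further. As a minimal sketch, the snippet below copies the four results shown above and splits them by whether the name contains `ggH` or `VBFH`. Reading the production mode from those substrings is an assumption about the naming convention; check the `process` field of each sample if in doubt.

```python
# Results copied from the ['Higgs', 'WW'] search output above.
matches = [
    ("345324", "PowhegPythia8EvtGen_NNLOPS_NN30_ggH125_WWlvlv_EF_15_5"),
    ("345948", "PowhegPy8EG_NNPDF30_AZNLOCTEQ6L1_VBFH125_WWlvlv"),
    ("346877", "PhH7EG_NNPDF30_VBFH125_WWlvlv"),
    ("500335", "MGH7EG_VBFHWWlvlv"),
]

# ASSUMPTION: these substrings mark the production mode in physics_short names.
tags = {"gluon fusion": "ggH", "vector-boson fusion": "VBFH"}

# Collect the dataset numbers whose name contains each tag.
by_mode = {
    mode: [dsid for dsid, name in matches if tag in name]
    for mode, tag in tags.items()
}
print(by_mode)
```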

In [16]:
# Now let's look for samples that have `W` somewhere in the process name.
# Notice that this also catches things like `Wprime`!
atom.match_metadata(field='process',value='W')

[('301243', 'Pythia8EvtGen_A14NNPDF23LO_Wprime_enu_SSM3000'),
 ('301247', 'Pythia8EvtGen_A14NNPDF23LO_Wprime_munu_SSM3000'),
 ('301826', 'Pythia8EvtGen_A14NNPDF23LO_Wprime_qq_3000'),
 ('302733', 'MadGraphPythia8EvtGen_A14NNPDF23LO_WpL_tblep_M3000'),
 ('306149', 'MadGraphPythia8EvtGen_A14NNPDF23LO_WpL_tbhad_M3000'),
 ('341456', 'PowhegPythia8EvtGen_CT10_AZNLO_ZH125J_MINLO_veveWWlvqq_VpT'),
 ('341458', 'PowhegPythia8EvtGen_CT10_AZNLO_ZH125J_MINLO_vmuvmuWWlvqq_VpT'),
 ('341460', 'PowhegPy8EG_CT10_AZNLO_ZH125J_MINLO_vtauvtauWWlvqq_VpT'),
 ('345104', 'PowhegPythia8EvtGen_NNPDF30_AZNLO_WpH125J_Hmumu_Wincl_MINLO'),
 ('345105', 'PowhegPythia8EvtGen_NNPDF30_AZNLO_WmH125J_Hmumu_Wincl_MINLO'),
 ('345211', 'PowhegPy8EG_NNPDF30_AZNLO_WmH125J_Winc_MINLO_tautau'),
 ('345212', 'PowhegPy8EG_NNPDF30_AZNLO_WpH125J_Winc_MINLO_tautau'),
 ('345213', 'PowhegPy8EG_NNPDF30_AZNLO_WmH125J_Winc_MINLO_etau'),
 ('345214', 'PowhegPy8EG_NNPDF30_AZNLO_WpH125J_Winc_MINLO_etau'),
 ('345215', 'PowhegPy8EG_NNPDF30_AZNLO_W
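
To see why a substring search is broader than an exact match, here is a small illustration in plain Python. It does not call `atlasopenmagic`; it only reuses the keyword names beginning with `W` from the `available_keywords()` output above, and the variable names are made up for this sketch.

```python
# Keyword names starting with 'W', copied from the available_keywords() output.
names = ["W", "WHiggs", "WIMP", "WW", "WZ", "Wprime", "Wt"]

# Substring test: every name that contains the letter W matches.
substring_hits = [n for n in names if "W" in n]

# Exact test: only the name that is exactly "W" matches.
exact_hits = [n for n in names if n == "W"]

print(substring_hits)
print(exact_hits)
```

If a substring search returns more than you wanted, filtering the returned names in this way is one option for narrowing the list.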

In [10]:
# We can also find samples that have a cross section near one we're interested in.
# By default, the matching is pretty tight (1%):
atom.match_metadata('cross_section_pb',0.001762)

[('301204', 'Pythia8EvtGen_A14MSTW2008LO_Zprime_NoInt_ee_SSM3000'),
 ('301209', 'Pythia8EvtGen_A14MSTW2008LO_Zprime_NoInt_mumu_SSM3000')]

In [11]:
# But we can loosen the tolerance to 50% to see what other samples have similar
# cross sections.
atom.match_metadata('cross_section_pb',0.001762,float_tolerance=0.50)

[('301204', 'Pythia8EvtGen_A14MSTW2008LO_Zprime_NoInt_ee_SSM3000'),
 ('301209', 'Pythia8EvtGen_A14MSTW2008LO_Zprime_NoInt_mumu_SSM3000'),
 ('302527', 'Pythia8EvtGen_A14NNPDF23LO_2DP20_Mass_1500_2000'),
 ('304014', 'MadGraphPythia8EvtGen_A14NNPDF23_3top_SM'),
 ('345317', 'PowhegPythia8EvtGen_NNPDF30_AZNLO_WmH125J_Hyy_Wincl_MINLO'),
 ('345318', 'PowhegPythia8EvtGen_NNPDF30_AZNLO_WpH125J_Hyy_Wincl_MINLO'),
 ('345319', 'PowhegPythia8EvtGen_NNPDF30_AZNLO_ZH125J_Hyy_Zincl_MINLO'),
 ('346525', 'PowhegPythia8EvtGen_A14NNPDF23_NNPDF30ME_ttH125_gamgam'),
 ('364243', 'Sherpa_222_NNPDF30NNLO_WWZ_4l2v_EW6'),
 ('364246', 'Sherpa_222_NNPDF30NNLO_WZZ_3l3v_EW6'),
 ('364710', 'Pythia8EvtGen_A14NNPDF23LO_jetjet_JZ10WithSW'),
 ('375884', 'MGPy8EG_A14N_GG_ttn1_2000_5000_200'),
 ('392202', 'MGPy8EG_A14N23LO_C1N2_WZ_500p0_100p0_3L_2L7'),
 ('507366', 'MGPy8EG_A14NNPDF23LO_GGM_N1N2C1_950'),
 ('523692', 'MGPy8EG_A14N23LO_TT_tN1_1200_200_MS'),
 ('700591', 'Sh_2212_lllljj_Int')]
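
The tolerance can be pictured as a window around the target value. The sketch below assumes the tolerance is a relative one (a fraction of the target), which is what the 1% and 50% figures above suggest; the helper `within_tolerance` is hypothetical and is not the package's own implementation.

```python
def within_tolerance(value, target, rel_tol=0.01):
    """True if value lies within rel_tol (a fraction) of target."""
    return abs(value - target) <= rel_tol * abs(target)


target = 0.001762  # pb, the cross section searched for above

# Dataset 345060 has a cross section of 0.006024 pb (see its metadata earlier),
# so it should fall outside even the loosened 50% window.
higgs_in_window = within_tolerance(0.006024, target, rel_tol=0.50)
print(higgs_in_window)

# The 50% window written out as explicit bounds.
low, high = target * (1 - 0.50), target * (1 + 0.50)
print(f"{low:.6f} pb to {high:.6f} pb")
```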

The [atlasopenmagic documentation](https://pypi.org/project/atlasopenmagic/) has many more examples of searches; please feel free to play with the package yourself, and [open an issue](https://github.com/atlas-outreach-data-tools/atlasopenmagic/issues) or a [pull request](https://github.com/atlas-outreach-data-tools/atlasopenmagic/pulls) if there is a feature that you would like to have available!

In case you would like to explore the metadata some more on your own, here are some things you might try:

* Go back to the list of samples that we found that match the `Higgs` keyword. Can you understand how each is different based on the names and metadata?
* Let's pretend we're going to do an analysis using the QCD jet spectrum. Try using the `QCD` keyword to identify the samples that you would want to use.
    * You probably found quite a few samples! Can you tell what the differences are between them based on the metadata?
* Let's pretend we're going to do a physics analysis in a dilepton final state. How can you identify all the samples that you are going to want? Can you think through all the physics processes that might be relevant and identify the corresponding samples?
* Try loading up the 13 TeV event generation open data metadata (the release is `2025r-evgen-13tev`). There are a LOT more samples available, so this will take a moment to load (typically around 20-30s).
    * What keywords are available? Are they different from the ones you found before?
    * Now try looking for the QCD dijet samples again. There are even more! What are the differences between them?
* Pick your favorite [ATLAS paper](https://twiki.cern.ch/twiki/bin/view/AtlasPublic/Publications) and see if you can identify the actual samples that were used in that paper in the Open Data. They aren't all available, but most of them are!


<div class="alert alert-block alert-info">
We welcome your feedback on this notebook or any of our other materials! Please <a href="https://forms.gle/zKBqS1opAHHemv9U7">fill out this survey</a> to let us know how we're doing, and you can enter a raffle to win some <a href="https://atlas-secretariat.web.cern.ch/merchandise">ATLAS merchandise</a>!
</div>