CMS - create a script for derived data records #212

Open
katilp opened this issue Nov 6, 2023 · 5 comments

@katilp
Member

katilp commented Nov 6, 2023

The CMS 2016 release will include several "derived data" records, structurally similar to e.g. https://opendata.cern.ch/record/12341
They will be:

  • PF-enriched samples produced from the 2016 OD MiniAOD samples
  • NanoAOD-type derivation from Run1 OD AOD samples

We should have a script template to create such records, which can be run in a similar way to those for collision or MC records.

For the provenance, they will link to the parent dataset and the SW that was used to produce them (e.g. Run1 Nano: cernopendata/opendata.cern.ch#3281). Both will be available as CODP records, so there is no need for an extended provenance listing, as it is already available in the parent dataset record.

For the variable description, these records can link to listings of this type.
These HTML files (one per type of production) should be hosted on the OD portal.

In the scripts, all metadata variables should be collected at the start of the script, for ease of reuse.
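A minimal sketch of that layout, assuming a Python script that writes JSON records. Every constant name and record field below is a placeholder chosen for illustration, not the schema used by the existing collision or MC scripts.

```python
import json

# --- Metadata variables, collected at the top for easy reuse ---
# (all names and values here are hypothetical placeholders)
EXPERIMENT = "CMS"
RECORD_TYPE = "Dataset"
RECORD_SUBTYPE = "Derived"


def make_record(dataset_name, recid):
    """Build one record dict from the constants above plus per-dataset values."""
    return {
        "recid": recid,
        "title": dataset_name,
        "experiment": EXPERIMENT,
        "type": {"primary": RECORD_TYPE, "secondary": [RECORD_SUBTYPE]},
    }


def main():
    # A single made-up dataset, just to show the output shape.
    records = [make_record("ExampleDataset", 1)]
    print(json.dumps(records, indent=2))


if __name__ == "__main__":
    main()
```

Keeping the constants in one block means a new production type only needs edits in that block, not in the record-building code.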

@nancyhamdan
Member

You can refer to this script from cms-2012-event-display-files as a starting point for the new script for derived datasets, taking into account the following notes:

  • Dataset listings for each type, to be used as input for the script, can be found in EOS with eos ls /eos/opendata/cms/derived-data/, e.g. eos ls /eos/opendata/cms/derived-data/POET/23-Jul-22/. "merged" files should be excluded
  • The authors field will be set to "CMS Open Data Group" for all datasets
  • number_files can be extracted from the derived-data directory in EOS
  • methodology, corresponding to the "How were these data selected" section, will differ from one type to another and can be changed in the Jinja HTML template with some conditional logic. The file linked in this section will be a link to the corresponding software record
  • Refer to cernopendata/opendata.cern.ch/issues/3349 for the title of each type of dataset
  • Values for description under abstract can be extracted from a CSV file that has the datasets' descriptions. We could have a description template for each type of derived dataset
  • dataset_semantics, corresponding to the "Dataset characteristics" section, can also be extracted as input from a CSV file listing them for each dataset
  • Instead of extracting values for description and dataset_semantics from CSV files, and to make it easy to configure the script with any other necessary hard-coded values, we could have a YAML file that holds these values and is taken as input by the script. The YAML file could have a structure similar to this:
common_values:
    collision_energy: "0.2TeV"
    keywords:
        - education
        - outreach
    description: >
        This is a very long description saying this and that. It can even have many lines. So it could be quite comfortable to
        enter desired dataset descriptions even if they are really long.

datasets:
    - name: "/BTag/Run2012C-22Jan2013-V1/AOD"
      number_of_events: 123456789
    - name: "/CTag/Run2012C-22Jan2013-V1/AOD"
      number_of_events: 456
    - name: "/DTag/Run2012C-22Jan2013-V1/AOD"
      number_of_events: 456324234
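
One way the script could consume such a file, sketched under assumptions: the parsed YAML is a dict with a common_values mapping and a datasets list (both key names are illustrative). The parsing step itself, e.g. with PyYAML's yaml.safe_load, is left out so that the snippet runs with the standard library alone.

```python
def build_records(config):
    """Merge the shared values into each per-dataset entry.

    Per-dataset keys win over shared ones, so a dataset can override a default.
    """
    common = config.get("common_values", {})
    return [{**common, **dataset} for dataset in config.get("datasets", [])]


# Stand-in for the dict a YAML parser would return for a file like the one above.
config = {
    "common_values": {
        "collision_energy": "0.2TeV",
        "keywords": ["education", "outreach"],
    },
    "datasets": [
        {"name": "/BTag/Run2012C-22Jan2013-V1/AOD", "number_of_events": 123456789},
        {"name": "/CTag/Run2012C-22Jan2013-V1/AOD", "number_of_events": 456},
    ],
}

for record in build_records(config):
    print(record["name"], record["number_of_events"], record["collision_energy"])
```

The merge is shallow, which is enough as long as the per-dataset entries only add or replace top-level fields.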

@katilp
Member Author

katilp commented Nov 27, 2023

For the number of events, you can use the following (if running where ROOT is available)

NanoAODRun1 and PFNano:

import ROOT

# Open the file remotely over HTTP and count the entries of its Events tree
myfile = ROOT.TFile.Open("http://opendata.cern.ch/eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/Run2012C_SingleMu/FF6C9C2D-3B37-43D7-A9B0-043CB2AC8202.root")
print(myfile.Events.GetEntries())

In older versions, the number of events might appear as a "long" integer, e.g. 563709L. In that case, use int(myfile.Events.GetEntries()).

POET:

The POET output has a different structure, and there are two versions of it:

  • for the "flat" files, everything is under the events tree: myfile.events.GetEntries() (NB: lower-case e)
  • for the direct POET output (not "flat"), the values are under object-specific trees, and an intermediate tree needs to be defined: myfile.myelectrons.Events.GetEntries() (NB: upper-case E)

The number of events is the same in both cases.
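
The three access patterns above could be wrapped in one helper, roughly as sketched here. The labels "POET_flat" and "POET" are names made up for this sketch, and the function expects a file that has already been opened, so ROOT itself is not imported.

```python
# Where the event tree lives for each kind of file described above.
# Each entry is the chain of attribute names to follow from the open file.
TREE_PATHS = {
    "NanoAODRun1": ("Events",),
    "PFNano": ("Events",),
    "POET_flat": ("events",),           # lower-case e
    "POET": ("myelectrons", "Events"),  # object-specific level, upper-case E
}


def count_events(root_file, kind):
    """Return the number of events as a plain int for an already opened file."""
    node = root_file
    for name in TREE_PATHS[kind]:
        node = getattr(node, name)
    # int() also normalises the "long" values seen with older versions
    return int(node.GetEntries())
```

Where ROOT is available, this would be called as count_events(ROOT.TFile.Open(url), "NanoAODRun1"), with url pointing to one of the files.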

@katilp
Member Author

katilp commented Dec 11, 2023

Further details for the three types of derived data:

POET

Files under /eos/opendata/cms/derived-data/POET/23-Jul-22/

These are the files used in the 2022 workshop lesson
https://cms-opendata-workshop.github.io/workshop2022-lesson-ttbarljetsanalysis/

For each dataset, we have:

  • <dataset> directory with root files as a direct output of POET (separate trees for each object)
  • <dataset>_flat directory with root files "flattened" to a single tree, as required when used as input to coffea with nanoevents schema
  • <dataset>_flat.root with the separate files in the <dataset>_flat directory merged into one file

e.g.

RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8
RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat
RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat.root
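
A possible way to group the names from a directory listing by dataset, assuming the _flat and _flat.root suffixes are used exactly as in the example above. The function name and the "direct", "flat" and "merged" labels are made up for this sketch.

```python
from collections import defaultdict


def group_poet_entries(entries):
    """Group directory-listing names by parent dataset.

    Returns {dataset: {"direct": ..., "flat": ..., "merged": ...}} where each
    value is the matching entry name (missing pieces are simply absent).
    """
    groups = defaultdict(dict)
    for entry in entries:
        # Check the longer suffix first, since "_flat.root" also contains "_flat".
        if entry.endswith("_flat.root"):
            groups[entry[: -len("_flat.root")]]["merged"] = entry
        elif entry.endswith("_flat"):
            groups[entry[: -len("_flat")]]["flat"] = entry
        else:
            groups[entry]["direct"] = entry
    return dict(groups)


listing = [
    "RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8",
    "RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat",
    "RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat.root",
]
for dataset, parts in group_poet_entries(listing).items():
    print(dataset, sorted(parts))
```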

Finally, there is no reason to leave out the merged file; we can just as well have it in the record.
So all files go in a single derived <dataset> record:

  • title: <dataset> dataset in reduced NanoAOD-like format
  • description: This dataset contains information extracted from different physics objects from the 2015 MiniAOD parent <dataset> dataset, readable with bare ROOT or other ROOT-compatible software. It was produced for the CMS open data workshop tutorials. It is provided in three different structures (then the list from above)
  • author: CMS Open data group
  • produced with: https://opendata.cern.ch/record/12502
  • example usage:
  • file type: nanoaod-poet

NanoAODRun1

Files under /eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/

These are the files used in the 2022 workshop
https://cms-opendata-workshop.github.io/workshop2022-lesson-run1example/

For each dataset we have

So all files go in a single derived <dataset> record:

  • title: <dataset> dataset in Run1 NanoAOD-like format
  • description: <dataset> dataset in a NanoAOD-like research-level Ntuple format for CMS Run1 data, readable with bare ROOT or other ROOT-compatible software, and containing the per-event information that is needed in most generic analyses. In contrast to the CMS NanoAOD format which is derived from MiniAOD, it is generated directly from the AOD format with completely independent code provided by the CMS open data group. Nevertheless, there is a large overlap in functionality and content between NanoAODRun1 and NanoAOD such that common analyses are possible. It is provided as a collection of root files under <dataset> directory, and in <dataset>_merged.root with the separate files in the <dataset> directory merged into one file.
  • author: CMS Open data group
  • produced with: new recid for https://github.com/cms-opendata-analyses/NanoAODRun1ProducerTool (see CMS: prepare SW records for NanoAODRun1 opendata.cern.ch#3281)
  • example usage: see CMS: prepare the SW record for NanoAODRun1 usage examples opendata.cern.ch#3495
  • file type: nanoaod-run1

For titles and format, see cernopendata/opendata.cern.ch#3349 (comment)

PFNano

Files to be moved

For each dataset, files are under /<dataset>/Run2016G-UL2016_MiniAODv2-v2_PFNanoAODv1/

The derived <dataset> records:

  • title: <dataset> dataset in NanoAOD format enhanced with Particle Flow candidates
  • description: <dataset> dataset in NanoAOD format enhanced with Particle Flow candidates, readable with bare ROOT or other ROOT-compatible software. In addition to the default NanoAOD content, it contains the information (@jmhogan, maybe we should point to some documentation on the PF here)
  • author: CMS Open data group
  • produced with: new recid for https://github.com/cms-opendata-analyses/PFNanoProducerTool (
  • example usage: can point to standard NanoAOD docs
  • file type: nanoaod-pf
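
The three title patterns listed in this comment could be kept in one lookup table, as in this sketch. The dictionary keys and the function name are labels chosen here, and how <dataset> itself is spelled in the final titles is settled in the referenced issue, not by this code.

```python
# Title patterns quoted from the three type descriptions above;
# the dictionary keys are labels chosen for this sketch.
TITLE_TEMPLATES = {
    "POET": "{dataset} dataset in reduced NanoAOD-like format",
    "NanoAODRun1": "{dataset} dataset in Run1 NanoAOD-like format",
    "PFNano": "{dataset} dataset in NanoAOD format enhanced with Particle Flow candidates",
}


def make_title(kind, dataset):
    """Fill the title pattern of the given production type."""
    return TITLE_TEMPLATES[kind].format(dataset=dataset)


print(make_title("NanoAODRun1", "Run2012C_SingleMu"))
```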

@katilp
Member Author

katilp commented Dec 14, 2023

The file types of the "normal" collision and simulated data will be nanoaod and nanoaodsim, respectively.
We should reflect that in the types of derived datasets, so that collision data (all derived datasets starting with RunYYYYN_) will be nanoaod-<type> and simulated data will be nanoaodsim-<type>.
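
A sketch of that rule, assuming RunYYYYN_ means "Run", a four-digit year, one era letter and an underscore. The regular expression and the function name are illustrative, and the type suffixes are the ones used in the file types listed earlier (poet, run1, pf).

```python
import re

# Collision datasets are assumed to start with "Run", a four-digit year,
# one era letter and an underscore, e.g. "Run2012C_".
_COLLISION_PREFIX = re.compile(r"^Run\d{4}[A-Z]_")


def file_type(dataset, suffix):
    """Return e.g. "nanoaod-run1" or "nanoaodsim-poet" for a derived dataset."""
    base = "nanoaod" if _COLLISION_PREFIX.match(dataset) else "nanoaodsim"
    return f"{base}-{suffix}"


print(file_type("Run2012C_SingleMu", "run1"))
print(file_type("RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8", "poet"))
```

Note that a simulated dataset such as RunIIFall15MiniAODv2_ZZ_... also starts with "Run", so matching on the digits of the year is what keeps it out of the collision branch.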

@katilp
Member Author

katilp commented Jan 23, 2024

The recids for the production code are

  • NanoAODRun1: 12505
  • PFNano: 12504
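
For the "produced with" link, the record ids quoted in this thread could go into one table, as sketched below. The dictionary keys and the function name are made up here, and the URL pattern follows the record links already used above.

```python
# Record ids quoted in this thread (the POET one is taken from its
# "produced with" link in the earlier comment).
SOFTWARE_RECIDS = {
    "POET": 12502,
    "NanoAODRun1": 12505,
    "PFNano": 12504,
}


def software_record_url(kind):
    """Return the portal link to the software record of a production type."""
    return f"https://opendata.cern.ch/record/{SOFTWARE_RECIDS[kind]}"


print(software_record_url("NanoAODRun1"))
```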

Projects
Status: In progress