CMS - create a script for derived data records #212

katilp · 2023-11-06T13:39:38Z

CMS 2016 release will include several "derived data" records structurally similar to e.g. https://opendata.cern.ch/record/12341
They will be:

PF-enriched samples produced from the 2016 OD MiniAOD samples
NanoAOD-type derivation from Run1 OD AOD samples

We should have a script template to create such records that can be run in a similar way to those for collision or MC records.

For the provenance, they will link to the parent dataset and the SW that was used to produce them (e.g. Run1 Nano: cernopendata/opendata.cern.ch#3281). Both will be available as CODP records, so there is no need for an extended provenance listing, as it is already available in the parent dataset record.

For the variable description, these records can link to listings of this type.
These HTML files (one per type of production) should be hosted on the OD portal.

In the scripts, all metadata variables should be collected at the start of the script, for ease of reuse.

nancyhamdan · 2023-11-21T21:14:23Z

Can refer to this script from cms-2012-event-display-files as a starting point for a new script for derived datasets, taking into account the following notes:

Dataset listings for each type, to be used as input for the script, can be found in EOS with eos ls /eos/opendata/cms/derived-data/, e.g. eos ls /eos/opendata/cms/derived-data/POET/23-Jul-22/. "merged" files should be excluded
authors field will be set as "CMS Open Data Group" for all datasets
number_files can be extracted from the derived-data directory in eos
methodology corresponding to the "How were these data selected" section will differ from one type to another, can be changed in the Jinja html template with some conditional logic. File linked in this section will be a link to a corresponding software record
Refer to cernopendata/opendata.cern.ch/issues/3349 for title of each type of dataset
Values for description under abstract can be extracted from a csv file that has datasets' descriptions. We could have a description template for each type of derived datasets
dataset_semantics corresponding to the "Dataset characteristics" section can also be extracted as input from a csv file listing them for each dataset
Instead of having to extract values for description and dataset_semantics from csv files, and to easily be able to configure the script with any other necessary hard coded values, we could have a yaml file that has these values and the script could take this file as input. The yaml file could have similar structure to this:

common_values:
    collision_energy: "0.2TeV"
    keywords:
        - education
        - outreach
    description: >
        This is a very long description saying this and that. It can even have many lines. So it could be quite comfortable to
        enter desired dataset descriptions even if they are really long.

datasets:
    - name: "/BTag/Run2012C-22Jan2013-V1/AOD"
      number_of_events: 123456789
    - name: "/CTag/Run2012C-22Jan2013-V1/AOD"
      number_of_events: 456
    - name: "/DTag/Run2012C-22Jan2013-V1/AOD"
      number_of_events: 456324234
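
A minimal sketch of how a script might consume such a file, assuming it has already been parsed into a dictionary (for instance with `yaml.safe_load` from PyYAML). The key names `common_values` and `datasets`, the helper `build_records`, and the idea of keeping the per-dataset entries in a list are assumptions made for illustration, not part of any existing script. A list is used because repeated keys at the same level of a YAML mapping cannot all be kept.

```python
import copy


def build_records(config):
    """Merge the shared values into each dataset entry.

    `config` mirrors the parsed YAML file. Per-dataset values take
    precedence over the common ones.
    """
    common = config.get("common_values", {})
    records = []
    for dataset in config.get("datasets", []):
        record = copy.deepcopy(common)  # avoid sharing lists between records
        record.update(dataset)
        records.append(record)
    return records


# Stand-in for the dictionary a YAML parser would return (hypothetical values
# taken from the example above).
config = {
    "common_values": {
        "collision_energy": "0.2TeV",
        "keywords": ["education", "outreach"],
    },
    "datasets": [
        {"name": "/BTag/Run2012C-22Jan2013-V1/AOD", "number_of_events": 123456789},
        {"name": "/CTag/Run2012C-22Jan2013-V1/AOD", "number_of_events": 456},
    ],
}

records = build_records(config)
for record in records:
    print(record["name"], record["number_of_events"], record["keywords"])
```

Copying the common block per record keeps later per-dataset tweaks (e.g. appending a keyword) from leaking into the other records.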

katilp · 2023-11-27T09:11:29Z

For the number of events, you can use the following (if running where ROOT is available)

NanoAODRun1 and PFNano:

import ROOT
myfile = ROOT.TFile.Open("http://opendata.cern.ch/eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/Run2012C_SingleMu/FF6C9C2D-3B37-43D7-A9B0-043CB2AC8202.root")
myfile.Events.GetEntries()

In older versions, the number of events might appear as a "long" integer, e.g. 563709L; in that case, use int(myfile.Events.GetEntries())

POET:

The POET output has a different structure, and there are two versions of it:

for the "flat" files, everything is under the events tree : myfile.events.GetEntries() (NB: lower case e)
for the direct POET output (no "flat"), the values are under object-specific trees, and an intermediate tree needs to be defined: myfile.myelectrons.Events.GetEntries() (NB: upper case E)

The number of events is the same in both cases.
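
The tree locations described in this comment could be gathered in one helper, sketched below. The labels in `TREE_PATHS` and the function `count_events` are hypothetical names; the sketch assumes the file object exposes its trees as attributes, as in the `myfile.Events.GetEntries()` calls above. A stand-in object replaces the ROOT file so that the example runs without ROOT installed.

```python
from functools import reduce
from types import SimpleNamespace

# Attribute path from the opened file to the tree holding the events.
# The dictionary keys are illustrative labels, not official type names.
TREE_PATHS = {
    "nanoaodrun1": ("Events",),
    "pfnano": ("Events",),
    "poet_flat": ("events",),            # lower case e
    "poet": ("myelectrons", "Events"),   # intermediate tree, upper case E
}


def count_events(root_file, kind):
    """Return the number of events in an already opened file."""
    tree = reduce(getattr, TREE_PATHS[kind], root_file)
    # int() also normalises a "long" result such as 563709L.
    return int(tree.GetEntries())


# Stand-in objects so the sketch runs without ROOT. With ROOT available,
# one would pass the result of ROOT.TFile.Open(url) instead.
fake_tree = SimpleNamespace(GetEntries=lambda: 563709)
fake_nano_file = SimpleNamespace(Events=fake_tree)
fake_poet_file = SimpleNamespace(myelectrons=SimpleNamespace(Events=fake_tree))

print(count_events(fake_nano_file, "nanoaodrun1"))
print(count_events(fake_poet_file, "poet"))
```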

katilp · 2023-12-11T12:01:02Z

Further details for the three types of derived data :

POET

Files under /eos/opendata/cms/derived-data/POET/23-Jul-22/

These are the files used in the 2022 workshop lesson
https://cms-opendata-workshop.github.io/workshop2022-lesson-ttbarljetsanalysis/

For each dataset, we have:

<dataset> directory with root files as a direct output of POET (separate trees for each object)
<dataset>_flat directory with root files "flattened" to a single tree, as required when used as input to coffea with nanoevents schema
<dataset>_flat.root with the separate files in the <dataset>_flat directory merged into one file

e.g.

RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8
RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat
RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat.root

Finally, no reason to leave out the merged file, we can as well have it in the record.
So all files go in a single derived <dataset> record:

title: <dataset> dataset in reduced NanoAOD-like format
description: This dataset contains information extracted from different physics objects from the 2015 MiniAOD parent <dataset> dataset, readable with bare ROOT or other ROOT-compatible software. It was produced for the CMS open data workshop tutorials. It is provided in three different structures (then the list from above)
author: CMS Open data group
produced with: https://opendata.cern.ch/record/12502
example usage:
- text: The use of this dataset does not require any software specific to the CMS experiment. It can be read with the ROOT package
- example link: https://cms-opendata-workshop.github.io/workshop2022-lesson-ttbarljetsanalysis/ (this is to be checked, it might refer to old file locations)
file type: nanoaod-poet
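
As an illustration of how the three POET structures could be collected into one record, the sketch below groups the entries of a directory listing by dataset using the `_flat` and `_flat.root` suffixes described above. The function names and the structure labels (`direct`, `flat`, `merged`) are hypothetical.

```python
def classify_poet_entry(entry):
    """Map one listing entry to (dataset, structure)."""
    # Check the longer suffix first, since "_flat.root" also contains "_flat".
    if entry.endswith("_flat.root"):
        return entry[: -len("_flat.root")], "merged"
    if entry.endswith("_flat"):
        return entry[: -len("_flat")], "flat"
    return entry, "direct"


def group_by_dataset(entries):
    """Collect the entries belonging to each dataset into one dictionary."""
    groups = {}
    for entry in entries:
        dataset, structure = classify_poet_entry(entry)
        groups.setdefault(dataset, {})[structure] = entry
    return groups


listing = [
    "RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8",
    "RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat",
    "RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8_flat.root",
]
groups = group_by_dataset(listing)
for dataset, structures in groups.items():
    print(dataset, sorted(structures))
```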

NanoAODRun1

Files under /eos/opendata/cms/derived-data/NanoAODRun1/01-Jul-22/

These are the files used in the 2022 workshop
https://cms-opendata-workshop.github.io/workshop2022-lesson-run1example/

For each dataset we have

<dataset> directory with root files as produced by https://github.com/cms-opendata-analyses/NanoAODRun1ProducerTool
<dataset>_merged.root with the separate files in the <dataset> directory merged into one file

So all files go in a single derived <dataset> record:

title: <dataset> dataset in Run1 NanoAOD-like format
description: <dataset> dataset in a NanoAOD-like research-level Ntuple format for CMS Run1 data, readable with bare ROOT or other ROOT-compatible software, and containing the per-event information that is needed in most generic analyses. In contrast to the CMS NanoAOD format which is derived from MiniAOD, it is generated directly from the AOD format with completely independent code provided by the CMS open data group. Nevertheless, there is a large overlap in functionality and content between NanoAODRun1 and NanoAOD such that common analyses are possible. It is provided as a collection of root files under <dataset> directory, and in <dataset>_merged.root with the separate files in the <dataset> directory merged into one file.
author: CMS Open data group
produced with: new recid for https://github.com/cms-opendata-analyses/NanoAODRun1ProducerTool (see CMS: prepare SW records for NanoAODRun1 opendata.cern.ch#3281)
example usage: see CMS: prepare the SW record for NanoAODRun1 usage examples opendata.cern.ch#3495
file type: nanoaod-run1

For titles and format, see cernopendata/opendata.cern.ch#3349 (comment)

PFNano

Files to be moved

For each dataset, files are under /<dataset>/Run2016G-UL2016_MiniAODv2-v2_PFNanoAODv1/

The derived <dataset> records:

title: <dataset> dataset in NanoAOD format enhanced with Particle Flow candidates
description: <dataset> dataset in NanoAOD format enhanced with Particle Flow candidates, readable with bare ROOT or other ROOT-compatible software. In addition to the default NanoAOD content, it contains the information (@jmhogan, maybe we should point to some documentation on the PF here)
author: CMS Open data group
produced with: new recid for https://github.com/cms-opendata-analyses/PFNanoProducerTool (
example usage: can point to standard NanoAOD docs
file type: nanoaod-pf
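
Since the three title patterns differ only in their fixed text, a script could keep them in one table, as sketched here. The keys `poet`, `run1` and `pf` are borrowed from the file type suffixes above; `TITLE_TEMPLATES` and `make_title` are hypothetical names.

```python
# Title patterns quoted from the per-type descriptions above.
TITLE_TEMPLATES = {
    "poet": "{dataset} dataset in reduced NanoAOD-like format",
    "run1": "{dataset} dataset in Run1 NanoAOD-like format",
    "pf": "{dataset} dataset in NanoAOD format enhanced with Particle Flow candidates",
}


def make_title(dataset, kind):
    """Fill the dataset name into the title pattern of the given type."""
    return TITLE_TEMPLATES[kind].format(dataset=dataset)


print(make_title("Run2012C_SingleMu", "run1"))
```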

katilp · 2023-12-14T08:37:01Z

The file types of the "normal" collision and simulated data will be nanoaod and nanoaodsim, respectively.
We should reflect that in the types of derived datasets so that collision data (all derived datasets starting with RunYYYYN_) will be nanoaod-<type> and simulated data nanoaodsim-<type>
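
One possible way to apply this rule in a script is sketched below. It assumes that `RunYYYYN_` means "Run", a four-digit year and a single upper-case era letter followed by an underscore; that reading, the regular expression and the function name `derived_file_type` are assumptions for illustration.

```python
import re

# Assumed reading of "RunYYYYN_": four-digit year plus one upper-case letter.
COLLISION_PREFIX = re.compile(r"^Run\d{4}[A-Z]_")


def derived_file_type(dataset, kind):
    """Return nanoaod-<kind> for collision data, nanoaodsim-<kind> otherwise."""
    base = "nanoaod" if COLLISION_PREFIX.match(dataset) else "nanoaodsim"
    return f"{base}-{kind}"


print(derived_file_type("Run2012C_SingleMu", "run1"))
print(derived_file_type("RunIIFall15MiniAODv2_ZZ_TuneCUETP8M1_13TeV-pythia8", "poet"))
```

Simulated dataset names such as `RunIIFall15MiniAODv2_...` also start with "Run", which is why the pattern requires the four digits rather than testing the prefix alone.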

katilp · 2024-01-23T17:51:48Z

The recids for the production code are

NanoAODRun1: 12505
PFNano: 12504

This was referenced Nov 6, 2023

CMS: 2016 data release checklist #124

Open

CMS - add creation of NanoAOD content documentation to NanoAOD scripts #213

Open

katilp assigned katilp, nancyhamdan and joudmas and unassigned katilp Nov 11, 2023

nancyhamdan mentioned this issue Dec 7, 2023

cms-derived-datasets: add the script #214

Merged

katilp mentioned this issue Dec 11, 2023

CMS: prepare the SW record for PFNanoProducerTool cernopendata/opendata.cern.ch#3496

Open
