
CMS: Run2 QCD MC for data science jettuples #2447

Closed
7 of 8 tasks
katilp opened this issue Oct 27, 2018 · 36 comments

Comments

@katilp
Member

katilp commented Oct 27, 2018

In connection with #2440, this issue tracks the jettuples to be produced from Run 2 AOD samples, to be made available on the portal.

The datasets:

Data science jettuples (contact Kimmo Kallonen HIP):

  • /QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM
    • Release: CMSSW_8_0_21 Global Tag: 80X_mcRun2_asymptotic_2016_TrancheIV_v6

To do:

For contributions, see also https://github.com/cernopendata/opendata.cern.ch/wiki/Contributing-content-to-CERN-Open-Data

@katilp
Member Author

katilp commented Nov 19, 2018

The path to files can be found through
https://eospublichttp01.cern.ch/eos/opendata/cms/MonteCarlo2016/RunIISummer16MiniAODv2
(later, they will be available through the portal record)

NB (from Kimmo): the VM default architecture is slc6_amd64_gcc472, but CMSSW_8_0_26 needs slc6_amd64_gcc530. The arch must be changed before doing cmsrel.

@kimmokal

kimmokal commented Nov 19, 2018

After trying out the cmsrel CMSSW_8_0_26 command a few times, I got two different results. There was either just a warning:
WARNING: Release CMSSW_8_0_26 is not available for architecture slc6_amd64_gcc472. Developer's area is created for available architecture slc6_amd64_gcc530.

Or an error:
ERROR: Unable to find release area for "CMSSW" version "CMMSW_8_0_26" for arch slc6_amd64_gcc472. Please make sure you have used the correct name/version.

I couldn't figure out when the result would be just a warning and when an actual error. This can be avoided by manually changing the architecture with export SCRAM_ARCH=slc6_amd64_gcc530. After that cmsrel CMSSW_8_0_26 works fine, and running cmsenv will automatically set the arch to gcc530 in later shell instances.

It is worth noting that with the SCRAM arch set to gcc530, the cmsrel command for earlier releases (such as 5_3_32) used for Run I datasets doesn't work without changing the arch again. So I don't know which arch is better as the default.

UPDATE:
I now realize that the error above was actually my mistake and was caused by a careless typo... Here is a concise dissection of the situation anyway:

  • The default SCRAM_ARCH of the VM is slc6_amd64_gcc472
  • cmsrel CMSSW_8_0_26 prints a warning, but creates the CMSSW area anyway
  • Setting cmsenv at CMSSW_8_0_26/src/ changes SCRAM_ARCH to slc6_amd64_gcc530
  • After the SCRAM_ARCH has changed, cmsrel 5_3_32 doesn't work and prints an error
  • A new shell instance will again have SCRAM_ARCH=slc6_amd64_gcc472

I guess that having slc6_amd64_gcc472 as the default SCRAM architecture is then fine, but there might be some confusion if someone first works with a Run II-friendly CMSSW version and then tries to go back to creating a CMSSW area for Run I datasets using the same shell instance. This is probably a very rare issue, but it's something to be aware of.
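The arch-switching workflow dissected above can be sketched as a shell session. Note that cmsrel and cmsenv only exist inside the CMS VM environment, so they are shown as comments here; only the SCRAM_ARCH export is generic:

```shell
# Default architecture on the CMS open data VM (as noted above):
#   SCRAM_ARCH=slc6_amd64_gcc472
# CMSSW_8_0_26 needs gcc530, so switch before creating the work area:
export SCRAM_ARCH=slc6_amd64_gcc530
echo "SCRAM_ARCH is now: $SCRAM_ARCH"

# These steps require the CMS VM and are illustrative only:
# cmsrel CMSSW_8_0_26              # creates CMSSW_8_0_26/ for gcc530
# cd CMSSW_8_0_26/src && cmsenv    # sets up the runtime environment

# To go back to a Run I release in the same shell, reset the arch first:
# export SCRAM_ARCH=slc6_amd64_gcc472
# cmsrel CMSSW_5_3_32
```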

@katilp
Member Author

katilp commented Mar 27, 2019

@kimmokal Are the tuples ready to be copied over?

@kimmokal

@katilp The .root files are placed in my EOS space and can be found in the path /eos/user/k/kkallone/JetNTuple_QCD_RunII_13TeV_MC/

The HDF5 conversion has truly been a headache due to data columns with variable length, but I think/hope I have now conquered the major obstacles. I will spend this afternoon processing all the files and validating that they work as they should. If all goes well, they can be copied over later today.
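One common way around the variable-length-column problem (a sketch of a typical workaround, not necessarily the approach used for these files) is to pad each jagged per-jet list to a fixed width so it can be stored as a plain 2D HDF5 dataset:

```python
import numpy as np

def pad_jagged(rows, max_len, fill=0.0):
    """Pad variable-length per-jet lists to a fixed-width float32 array.

    Lists longer than max_len are truncated; shorter ones are padded
    with `fill`, so the result fits a regular 2D HDF5 dataset.
    """
    out = np.full((len(rows), max_len), fill, dtype=np.float32)
    for i, row in enumerate(rows):
        n = min(len(row), max_len)
        out[i, :n] = row[:n]
    return out

# Example: per-jet PF-candidate pT lists of varying length (made-up values)
jet_cand_pt = [[30.2, 12.1, 5.0], [44.8], [9.9, 8.7]]
padded = pad_jagged(jet_cand_pt, max_len=4)
print(padded.shape)  # (3, 4)
```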

@katilp
Member Author

katilp commented Mar 27, 2019

@kimmokal when ready check the permissions of the directory (for the moment I can't access it)

@kimmokal

@katilp Can you now access the directory?

I verified that the HDF5 conversion is working as it should. However, it turns out that lxplus is so ridiculously slow right now that I will do the conversion locally, which admittedly will also take a long time. Hence, the .h5 files will be ready to be copied over tomorrow.

@kimmokal

@katilp The conversion is now ready. I ended up doing it in a parallel fashion on lxplus. I was perhaps a bit unwise in putting the converted .h5 files in the same folder as the .root files. So if you already started copying the files, you might have ended up copying unfinished .h5 files in the process.

@katilp
Member Author

katilp commented Mar 28, 2019

@kimmokal OK, thanks, we'll have a look. We did not start copying yet. For the variable description in https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016, would you be able to provide a public page with the relevant information which now resides in CMS internal pages?

@kimmokal

@katilp I actually updated the README for the Master branch earlier, removing all references to the internal twiki pages, but I forgot to update the 2016 branch as well. I'll fix that.

@katilp
Member Author

katilp commented Mar 28, 2019

@kimmokal Could you also provide a description text for the purpose of these files (cf. the beginning of http://opendata-dev.cern.ch/record/328, but it does not need to be that long)?

@katilp
Member Author

katilp commented Mar 28, 2019

@ArtemisLav Could you kindly build a record (similar to http://opendata-dev.cern.ch/record/328) for these Data science jettuples:

We discussed with @tiborsimko that it could go to a new cms-derived-Run2-datascience.json

It would be good to have this as an example record, so that other similar records (3 more to come) can be based on this. Thanks!

@kimmokal

@katilp I updated the readme of the github repo and merged the master into the 2016 branch, so it's up-to-date now. Note that there is the line 'git clone https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/', where the 'cms-legacydata-analyses' part needs to be changed in the actual release.

I am now in the process of writing the description for the dataset. I'll send it to you (or to @ArtemisLav ?) during the weekend. I don't think there's a need for the 'How were these data validated?' section.

ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Mar 31, 2019
* Closes cernopendata#2447

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
@ArtemisLav
Member

@katilp semantics are fixed 44907a0

@kimmokal

kimmokal commented Apr 3, 2019

@katilp @ArtemisLav I wrote up this description:

"The dataset consists of particle jets extracted from simulated proton-proton collision events at a center-of-mass energy of 13 TeV generated with Pythia 8. The particles emerging from the collisions traverse through a simulation of the CMS detector. The particles were reconstructed from the simulated detector signals using the particle-flow (PF) algorithm. The reconstructed particles are also called PF candidates. The jets in this dataset were clustered from the PF candidates of each collision event using the anti-$k_t$ algorithm with distance parameter $R = 0.4$.

From each collision event, only those jets with transverse momentum exceeding 30 GeV were saved to file. The jets were also required to have pseudorapidity of less than 2.5 (this indicates the jet's position in the detector). For each jet, there are variables describing the jet on a high-level, particle-level and generator-level. There are also some variables describing the collision event and the conditions of its simulation. All of the variables are saved on a jet-by-jet basis, which means that one row of data corresponds to one jet.

The origin of a jet is particularly interesting. This so-called flavor of the jet is obtained from the generator-level particles by a jet flavor algorithm, which attempts to match a reconstructed jet to a single initiating particle. As a consequence, the jet flavor definition depends on the chosen algorithm. Here three different flavor definitions are available. The ‘hadron’ definition identifies b- and c-hadrons from the jet’s constituents, so it is only useful for b-tagging studies. The ‘parton’ definition extends this to include the light jet flavors (u, d, s and gluon). Finally there is the ‘physics’ definition, which looks at the quarks and gluons of the initial collision. The ‘parton’ and ‘physics’ definitions both identify all jet flavors, but the former is more biased towards b- and c-quarks. If in doubt, it is recommended to use the ‘physics’ definition."

I can extend it if it's too short or is missing necessary details.

Also, my Orcid-id is 0000-0001-9769-7163. Is there something else required from me for the metadata?

@katilp
Member Author

katilp commented Apr 3, 2019

@ArtemisLav Could you also add:

"relations": [
      {
        "doi": "FIXME", 
        "recid": "12021", 
        "title": "/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM", 
        "type": "isChildOf"
      }
    ], 

This file is on dev but does not have a DOI yet

@katilp
Member Author

katilp commented Apr 3, 2019

@kimmokal If you have an example notebook or similar, it could possibly be entered under "usage". The regular data samples have something like:

(screenshot: usage section)

so, in contrast, it would maybe be useful to mention that this dataset does not require any CMS-experiment-specific environment and can be used as in some example (if you have a link)

@ArtemisLav
Member

Thanks @kimmokal

Is there something else required from me for the metadata?

I just need a title for the record and if possible the distribution information (dataset characteristics):
"distribution": { "formats": [ "e.g. root" ], "number_events": 11111, "number_files": 2222, "size": 3333 },

@tiborsimko
Member

@ArtemisLav I'm copying the files to the final destination, I'll supply all the file information (except number of events).

BTW we'll have 122 files in ROOT and H5 formats, and the data contained in them should be equivalent, so I wonder whether we shall say that this dataset contains 122 or 244 files? I guess the latter, but that could also confuse some people, e.g. those that only want ROOT and might wonder why there are only 122 of them... Any DCAT etc. standards out there for these "alternative formats" cases?

@ArtemisLav
Member

@tiborsimko hmm would it be easier if we just add a note in usage perhaps?

@tiborsimko
Member

Yes, I would list all files the record holds, and usage note could explain ROOT vs H5 formats indeed.

Note that there is transfer trouble with three H5 files, but otherwise we are good to create this test record.

@ArtemisLav
Member

OK, could someone please provide that description?

@katilp
Member Author

katilp commented Apr 4, 2019

Could be something like this, I leave @kimmokal to complete:
"The use of these files does not require any software specific to the CMS experiment. There are two sets of equivalent files in two different formats: ROOT and H5."

@kimmokal you could add the mention of h5 specific stuff maybe.

@tiborsimko Should we use H5, h5, HD5, hd5, HDF5, hdf5... in the text?

@kimmokal

kimmokal commented Apr 4, 2019

@katilp I have been struggling with the notebook and making it practical enough :/
I could provide in the usage part just two short examples of scripts for loading the .root and .h5 files in Python?

@ArtemisLav @tiborsimko How do we deal with the "number_events" here, because the files don't contain full events, only jets? Should it just be the total number of jets then?

Do you have any suggestion for the title of the record?
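A short loading script of the kind mentioned might look like this. This is only a sketch: the file and dataset/branch names are hypothetical, h5py is assumed for the .h5 files, and uproot for the .root files:

```python
import h5py
import numpy as np

def load_jet_column(path, dataset="jet_pt"):
    """Read one jet-level column from an .h5 file into a numpy array."""
    with h5py.File(path, "r") as f:
        return f[dataset][:]

# For the .root files one would use uproot in a similar way, e.g.:
#   import uproot
#   tree = uproot.open("JetNTuple.root")["tree_name"]   # hypothetical names
#   jet_pt = tree["jet_pt"].array()

# Self-contained demo: write a tiny .h5 file, then read it back
with h5py.File("demo_jets.h5", "w") as f:
    f.create_dataset("jet_pt",
                     data=np.array([35.1, 48.7, 31.2], dtype=np.float32))

print(load_jet_column("demo_jets.h5"))
```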

@katilp
Member Author

katilp commented Apr 4, 2019

@kimmokal that would be very good as well.
Good point about event numbers, it may well be the same for some other ML samples.
Should we have an alternative "number_entries" or similar?

ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 5, 2019
* Addresses cernopendata#2447

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 5, 2019
* Addresses cernopendata#2447

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
tiborsimko pushed a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 5, 2019
* Addresses cernopendata#2447

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
Signed-off-by: Tibor Simko <tibor.simko@cern.ch>
@katilp
Member Author

katilp commented Apr 7, 2019

Some suggestions to the current draft:

  • can we have Data science as a tab in the header part, in a similar way as the categories for MC in:
    (screenshot: simheader)
    i.e. there:
    (screenshot: MLheader)
    • datascience is a keyword, whereas the processes for the MC record are Categories. Now we have two possible keywords for derived datasets, either masterclass or datascience; it would be good to display them after Dataset Derived (they would display better starting with a capital letter, though)
  • it would be better to have a tabular display for data semantics
  • How to use (usage) should come before dataset semantics
  • Can Related datasets have a description? If so then:
"relations": [
     {
       "description": "<p>This dataset was derived from: </p> ",
       "recid": "12021", 
       "title": "/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM", 
       "type": "isChildOf"
     }
  • for the h5 files, the file listing with the root prefix is not useful, if I'm not mistaken; would a listing directly usable for wget be better?
  • in this context if we do not expect people to download the files with xrootd (not the h5), the "Download" button is misleading. Can it be "Download index"?
  • will the file generation/production part be in methodology or else? If so then
"methodology": {
     "description": " <p>This dataset was produced with the software available in: </p>"
 }

with the link to the record to be built for the SW (fifth item in the initial list at the start of this issue), @tiborsimko would that work?

  • As the dataset does not contain events but jets, would it be possible to have an alternative for number_events in:
"distribution": {
      "formats": [
        "root", 
        "h5"
      ], 
      "number_events": 11111, 
      "number_files": 244, 
      "size": 204611954128
    }, 

it could be "Entries" (number_entries). This may be needed for some other ML samples as well.
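The suggested alternative could look like the fragment below (a sketch only: number_entries is a proposed, not an existing, property, and the count is a placeholder):

```json
"distribution": {
      "formats": [
        "root",
        "h5"
      ],
      "number_entries": 11111,
      "number_files": 244,
      "size": 204611954128
    },
```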

@tiborsimko
Member

@katilp It would be nice to open independent issues for these things, so that the work can be parallelised. E.g. @okraskaj can take care of the template amendment while @ArtemisLav could take care of the metadata editing... I'll be busy today until late in the afternoon.

Some quick comments:

  • Yes we can display keywords after categories. (@okraskaj )
  • Yes for nicer semantics display templates: nicer semantic display #2577
  • Usage before semantics is not obvious, since we also have selection and validation sections, and we'd have to move characteristics as well to stay close to semantics... The last two had better stay close together
  • Yes for methodology and software record, @ArtemisLav will you have time to add it?
  • As for the number_events, the number 1111 was just a placeholder I think(?); we can simply remove it. Not sure about the change to number_entries everywhere; I think we should rather introduce a new property number_jets if needed? (@okraskaj)

@katilp
Member Author

katilp commented Apr 8, 2019

@tiborsimko yes, indeed number_entries or number_jets, just as an alternative, like here, not changing it everywhere

@ArtemisLav
Member

  • Yes for methodology and software record, @ArtemisLav will you have time to add it?

Sure, is there metadata somewhere?

@katilp
Member Author

katilp commented Apr 8, 2019

@ArtemisLav the description is above i.e.

"methodology": {
     "description": " <p>This dataset was produced with the software available in: </p>"
 }

and the link to the software record will need to be a placeholder for now

@ArtemisLav
Member

@katilp I meant for the software record. Are we doing that now?

@katilp
Member Author

katilp commented Apr 8, 2019

@ArtemisLav not yet done, but I can add it. Should I open an issue for all sw records needed for these ML samples, or one by one?

ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 8, 2019
* addresses cernopendata#2447.

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
tiborsimko pushed a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 8, 2019
* addresses cernopendata#2447.

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
@katilp
Member Author

katilp commented Jul 15, 2019

Closing as remaining issues of Run2 MiniAODSIM provenance followed up in #2525

@katilp katilp closed this as completed Jul 15, 2019