
CMS: Run2 QCD MC for data science jettuples #2447

Closed
7 of 8 tasks
katilp opened this issue Oct 27, 2018 · 36 comments

Comments

@katilp
Member

katilp commented Oct 27, 2018

In connection with #2440, this issue tracks the jettuples to be produced from Run 2 AOD samples, to be made available on the portal.

The datasets:

Data science jettuples (contact Kimmo Kallonen HIP):

  • /QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM
    • Release: CMSSW_8_0_21 Global Tag: 80X_mcRun2_asymptotic_2016_TrancheIV_v6

To do:

For contributions, see also https://github.com/cernopendata/opendata.cern.ch/wiki/Contributing-content-to-CERN-Open-Data

@katilp
Member Author

katilp commented Nov 19, 2018

The path to files can be found through
https://eospublichttp01.cern.ch/eos/opendata/cms/MonteCarlo2016/RunIISummer16MiniAODv2
(later, they will be available through the portal record)

NB (from Kimmo): the VM default architecture is slc6_amd64_gcc472, but CMSSW_8_0_26 needs slc6_amd64_gcc530. The arch must be changed before doing cmsrel.

@kimmokal

kimmokal commented Nov 19, 2018

After trying out the cmsrel CMSSW_8_0_26 command a few times, I got two different results. There was either just a warning:
WARNING: Release CMSSW_8_0_26 is not available for architecture slc6_amd64_gcc472. Developer's area is created for available architecture slc6_amd64_gcc530.

Or an error:
ERROR: Unable to find release area for "CMSSW" version "CMMSW_8_0_26" for arch slc6_amd64_gcc472. Please make sure you have used the correct name/version.

I couldn't figure out when the result would be just a warning and when an actual error. This can be avoided by manually changing the architecture with export SCRAM_ARCH=slc6_amd64_gcc530. After that cmsrel CMSSW_8_0_26 works fine, and running cmsenv will automatically set the arch to gcc530 in later shell instances.

It is worth noting that with the SCRAM arch set to gcc530, the cmsrel command for earlier releases (such as 5_3_32) used for Run I datasets doesn't work without changing the arch again. So I don't know which arch is better as the default.

UPDATE:
I now realize that the error above was actually my mistake and was caused by a careless typo... Here is a concise dissection of the situation anyway:

  • The default SCRAM_ARCH of the VM is slc6_amd64_gcc472
  • cmsrel CMSSW_8_0_26 prints a warning, but creates the CMSSW area anyway
  • Setting cmsenv at CMSSW_8_0_26/src/ changes SCRAM_ARCH to slc6_amd64_gcc530
  • After the SCRAM_ARCH has changed, cmsrel 5_3_32 doesn't work and prints an error
  • A new shell instance will again have SCRAM_ARCH=slc6_amd64_gcc472

I guess that having slc6_amd64_gcc472 as the default SCRAM architecture is then fine, but there might be some confusion if someone first works with a Run II-friendly CMSSW version and then tries to go back to creating a CMSSW area for Run I datasets using the same shell instance. This is probably a very rare issue, but it's something to be aware of.
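The arch-switching workflow dissected above can be sketched as a shell session. Note that cmsrel and cmsenv only exist inside the CMS VM environment, so they are shown as comments here; only the SCRAM_ARCH export is generic:

```shell
# Default architecture on the CMS open data VM (as noted above):
#   SCRAM_ARCH=slc6_amd64_gcc472
# CMSSW_8_0_26 needs gcc530, so switch before creating the work area:
export SCRAM_ARCH=slc6_amd64_gcc530
echo "SCRAM_ARCH is now: $SCRAM_ARCH"

# These steps require the CMS VM and are illustrative only:
# cmsrel CMSSW_8_0_26              # creates CMSSW_8_0_26/ for gcc530
# cd CMSSW_8_0_26/src && cmsenv    # sets up the runtime environment

# To go back to a Run I release in the same shell, reset the arch first:
# export SCRAM_ARCH=slc6_amd64_gcc472
# cmsrel CMSSW_5_3_32
```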

@katilp
Member Author

katilp commented Mar 27, 2019

@kimmokal Are the tuples ready to be copied over?

@kimmokal

@katilp The .root files are placed in my EOS space and can be found in the path /eos/user/k/kkallone/JetNTuple_QCD_RunII_13TeV_MC/

The HDF5 conversion has truly been a headache due to data columns with variable length, but I think/hope I have now conquered the major obstacles. I will spend this afternoon processing all the files and validating that they work as they should. If all goes well, they can be copied over later today.
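One common way around the variable-length-column problem (a sketch of a typical workaround, not necessarily the approach used for these files) is to pad each jagged per-jet list to a fixed width so it can be stored as a plain 2D HDF5 dataset:

```python
import numpy as np

def pad_jagged(rows, max_len, fill=0.0):
    """Pad variable-length per-jet lists to a fixed-width float32 array.

    Lists longer than max_len are truncated; shorter ones are padded
    with `fill`, so the result fits a regular 2D HDF5 dataset.
    """
    out = np.full((len(rows), max_len), fill, dtype=np.float32)
    for i, row in enumerate(rows):
        n = min(len(row), max_len)
        out[i, :n] = row[:n]
    return out

# Example: per-jet PF-candidate pT lists of varying length (made-up values)
jet_cand_pt = [[30.2, 12.1, 5.0], [44.8], [9.9, 8.7]]
padded = pad_jagged(jet_cand_pt, max_len=4)
print(padded.shape)  # (3, 4)
```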

@katilp
Member Author

katilp commented Mar 27, 2019

@kimmokal when ready check the permissions of the directory (for the moment I can't access it)

@kimmokal

@katilp Can you now access the directory?

I verified that the HDF5 conversion is working as it should. However, it turns out that lxplus is so ridiculously slow right now that I will do the conversion locally, which admittedly will also take a long time. Hence, the .h5 files will be ready to be copied over tomorrow.

@kimmokal

@katilp The conversion is now ready. I ended up doing it in a parallel fashion on lxplus. I was perhaps a bit unwise in putting the converted .h5 files in the same folder as the .root files. So if you already started copying the files, you might have ended up copying unfinished .h5 files in the process.

@katilp
Member Author

katilp commented Mar 28, 2019

@kimmokal OK, thanks, we'll have a look. We did not start copying yet. For the variable description in https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016, would you be able to provide a public page with the relevant information which now resides in CMS internal pages?

@kimmokal

@katilp I actually updated the README for the Master branch earlier, removing all references to the internal twiki pages, but I forgot to update the 2016 branch as well. I'll fix that.

@katilp
Member Author

katilp commented Mar 28, 2019

@kimmokal Could you also provide a description text for the purpose of these files (cf. the beginning of http://opendata-dev.cern.ch/record/328, but it does not need to be that long)?

@katilp
Member Author

katilp commented Mar 28, 2019

@ArtemisLav Could you kindly build a record (similar to http://opendata-dev.cern.ch/record/328) for these Data science jettuples:

We discussed with @tiborsimko that it could go to a new cms-derived-Run2-datascience.json

It would be good to have this as an example record, so that other similar records (3 more to come) can be based on this. Thanks!

@kimmokal

@katilp I updated the readme of the github repo and merged the master into the 2016 branch, so it's up-to-date now. Note that there is the line 'git clone https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/', where the 'cms-legacydata-analyses' part needs to be changed in the actual release.

I am now in the process of writing the description for the dataset. I'll send it to you (or to @ArtemisLav ?) during the weekend. I don't think there's a need for the 'How were these data validated?' section.

ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Mar 31, 2019
* Closes cernopendata#2447

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
@ArtemisLav
Member

@katilp semantics are fixed 44907a0

@kimmokal

kimmokal commented Apr 3, 2019

@katilp @ArtemisLav I wrote up this description:

"The dataset consists of particle jets extracted from simulated proton-proton collision events at a center-of-mass energy of 13 TeV generated with Pythia 8. The particles emerging from the collisions traverse through a simulation of the CMS detector. The particles were reconstructed from the simulated detector signals using the particle-flow (PF) algorithm. The reconstructed particles are also called PF candidates. The jets in this dataset were clustered from the PF candidates of each collision event using the anti-$k_t$ algorithm with distance parameter $R = 0.4$.

From each collision event, only those jets with transverse momentum exceeding 30 GeV were saved to file. The jets were also required to have pseudorapidity of less than 2.5 (this indicates the jet's position in the detector). For each jet, there are variables describing the jet on a high-level, particle-level and generator-level. There are also some variables describing the collision event and the conditions of its simulation. All of the variables are saved on a jet-by-jet basis, which means that one row of data corresponds to one jet.

The origin of a jet is particularly interesting. This so-called flavor of the jet is obtained from the generator-level particles by a jet flavor algorithm, which attempts to match a reconstructed jet to a single initiating particle. As a consequence, the jet flavor definition depends on the chosen algorithm. Here three different flavor definitions are available. The ‘hadron’ definition identifies b- and c-hadrons from the jet’s constituents, so it is only useful for b-tagging studies. The ‘parton’ definition extends this to include the light jet flavors (u, d, s and gluon). Finally there is the ‘physics’ definition, which looks at the quarks and gluons of the initial collision. The ‘parton’ and ‘physics’ definitions both identify all jet flavors, but the former is more biased towards b- and c-quarks. If in doubt, it is recommended to use the ‘physics’ definition."

I can extend it if it's too short or is missing necessary details.

Also, my Orcid-id is 0000-0001-9769-7163. Is there something else required from me for the metadata?

@katilp
Member Author

katilp commented Apr 3, 2019

@ArtemisLav Could you also add:

"relations": [
      {
        "doi": "FIXME", 
        "recid": "12021", 
        "title": "/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM", 
        "type": "isChildOf"
      }
    ], 

This file is on dev but does not have a DOI yet

@katilp
Member Author

katilp commented Apr 3, 2019

@kimmokal If you have an example notebook or similar, it could possibly be entered under "usage". The regular data samples have something like:

(screenshot: usage section)

so, in contrast, it would maybe be useful to mention that this dataset does not require any CMS-experiment-specific environment and can be used as in some example (if you have a link)

@ArtemisLav
Member

Thanks @kimmokal

Is there something else required from me for the metadata?

I just need a title for the record and if possible the distribution information (dataset characteristics):
"distribution": { "formats": [ "e.g. root" ], "number_events": 11111, "number_files": 2222, "size": 3333 },

@tiborsimko
Member

@ArtemisLav I'm copying the files to the final destination, I'll supply all the file information (except number of events).

BTW we'll have 122 files in ROOT and H5 formats, and the data contained in them should be equivalent, so I wonder whether we shall say that this dataset contains 122 or 244 files? I guess the latter, but that could also confuse some people, e.g. those that only want ROOT and might wonder why there are only 122 of them... Any DCAT etc. standards out there for these "alternative formats" cases?

@ArtemisLav
Member

@tiborsimko hmm would it be easier if we just add a note in usage perhaps?

@tiborsimko
Member

Yes, I would list all files the record holds, and usage note could explain ROOT vs H5 formats indeed.

Note that there is transfer trouble with three H5 files, but otherwise we are good to create this test record.

@ArtemisLav
Member

OK, could someone please provide that description?

@katilp
Member Author

katilp commented Apr 4, 2019

Could be something like this, I leave @kimmokal to complete:
"The use of these files does not require any software specific to the CMS experiment. There are two sets of equivalent files in two different formats: ROOT and H5."

@kimmokal you could add the mention of h5 specific stuff maybe.

@tiborsimko Should we use H5, h5, HD5, hd5, HDF5, hdf5... in the text?

@kimmokal

kimmokal commented Apr 4, 2019

@katilp I have been struggling with the notebook and making it practical enough :/
I could provide in the usage part just two short examples of scripts for loading the .root and .h5 files in Python?

@ArtemisLav @tiborsimko How do we deal with the "number_events" here, because the files don't contain full events, only jets? Should it just be the total number of jets then?

Do you have any suggestion for the title of the record?
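A short loading script of the kind mentioned might look like this. This is only a sketch: the file and dataset/branch names are hypothetical, h5py is assumed for the .h5 files, and uproot for the .root files:

```python
import h5py
import numpy as np

def load_jet_column(path, dataset="jet_pt"):
    """Read one jet-level column from an .h5 file into a numpy array."""
    with h5py.File(path, "r") as f:
        return f[dataset][:]

# For the .root files one would use uproot in a similar way, e.g.:
#   import uproot
#   tree = uproot.open("JetNTuple.root")["tree_name"]   # hypothetical names
#   jet_pt = tree["jet_pt"].array()

# Self-contained demo: write a tiny .h5 file, then read it back
with h5py.File("demo_jets.h5", "w") as f:
    f.create_dataset("jet_pt",
                     data=np.array([35.1, 48.7, 31.2], dtype=np.float32))

print(load_jet_column("demo_jets.h5"))
```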

@katilp
Member Author

katilp commented Apr 4, 2019

@kimmokal that would be very good as well.
Good point about event numbers, it may well be the same for some other ML samples.
Should we have an alternative "number_entries" or similar?

ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 5, 2019
* Addresses cernopendata#2447

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 5, 2019
* Addresses cernopendata#2447

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
tiborsimko pushed a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 5, 2019
* Addresses cernopendata#2447

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
Signed-off-by: Tibor Simko <tibor.simko@cern.ch>
@katilp
Member Author

katilp commented Apr 7, 2019

Some suggestions to the current draft:

  • can we have Data science as a tab in the header part, in a similar way as the categories for MC in:
    (screenshot: simheader)
    i.e. there:
    (screenshot: MLheader)
    • datascience is a keyword, whereas the processes for the MC record are Categories. Now we have two possible keywords for derived datasets, either masterclass or datascience; it would be good to display them after Dataset Derived (they would display better starting with a capital letter, though)
  • it would be better to have a tabular display for data semantics
  • How to use (usage) should come before dataset semantics
  • Can Related datasets have a description? If so then:
"relations": [
     {
       "description": "<p>This dataset was derived from: </p> ",
       "recid": "12021", 
       "title": "/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM", 
       "type": "isChildOf"
     }
  • for the h5 files, the file listing with the root prefix is not useful, if I'm not mistaken; would a listing directly usable for wget be better?
  • in this context if we do not expect people to download the files with xrootd (not the h5), the "Download" button is misleading. Can it be "Download index"?
  • will the file generation/production part be in methodology or else? If so then
"methodology": {
     "description": " <p>This dataset was produced with the software available in: </p>"
 }

with the link to the record to be built for the SW (fifth item in the initial list at the start of this issue), @tiborsimko would that work?

  • As the dataset does not contain events but jets, would it be possible to have an alternative for number_events in:
"distribution": {
      "formats": [
        "root", 
        "h5"
      ], 
      "number_events": 11111, 
      "number_files": 244, 
      "size": 204611954128
    }, 

it could be "Entries" (number_entries). This may be needed for some other ML samples as well.
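The suggested alternative could look like the fragment below (a sketch only: number_entries is a proposed, not an existing, property, and the count is a placeholder):

```json
"distribution": {
      "formats": [
        "root",
        "h5"
      ],
      "number_entries": 11111,
      "number_files": 244,
      "size": 204611954128
    },
```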

@tiborsimko
Member

@katilp It would be nice to open independent issues for these things, so that the work can be parallelised. E.g. @okraskaj can take care of the template amendment while @ArtemisLav could take care of the metadata editing... I'll be busy today until late in the afternoon.

Some quick comments:

  • Yes we can display keywords after categories. (@okraskaj )
  • Yes for nicer semantics display templates: nicer semantic display #2577
  • Usage before semantics is not obvious, since we also have selection and validation sections, and we'd have to move characteristics as well to stay close to semantics... The last two had better stay close together
  • Yes for methodology and software record, @ArtemisLav will you have time to add it?
  • As for the number_events, the number 1111 was just a placeholder I think(?); we can simply remove it. Not sure about the change to number_entries everywhere; I think we should rather introduce a new property number_jets if needed? (@okraskaj)

@katilp
Member Author

katilp commented Apr 8, 2019

@tiborsimko yes, indeed number_entries or number_jets, just as an alternative, like here, not changing it everywhere

@ArtemisLav
Member

  • Yes for methodology and software record, @ArtemisLav will you have time to add it?

Sure, is there metadata somewhere?

@katilp
Member Author

katilp commented Apr 8, 2019

@ArtemisLav the description is above i.e.

"methodology": {
     "description": " <p>This dataset was produced with the software available in: </p>"
 }

and the link to the software record will need to be a placeholder for now

@ArtemisLav
Member

@katilp I meant for the software record. Are we doing that now?

@katilp
Member Author

katilp commented Apr 8, 2019

@ArtemisLav not yet done, but I can add it. Should I open an issue for all sw records needed for these ML samples, or one by one?

ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 8, 2019
* addresses cernopendata#2447.

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
tiborsimko pushed a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 8, 2019
* addresses cernopendata#2447.

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
@katilp
Member Author

katilp commented Jul 15, 2019

Closing as remaining issues of Run2 MiniAODSIM provenance followed up in #2525

@katilp katilp closed this as completed Jul 15, 2019