CMS: Run2 QCD MC for data science jettuples #2447
The path to the files can be found through … NB (from Kimmo): the VM default architecture is slc6_amd64_gcc472, but CMSSW_8_0_26 needs slc6_amd64_gcc530, so the arch must be changed when doing cmsrel.
After trying out the … Or an error: I couldn't figure out whether the result was just a warning or an actual error. This can be avoided by manually changing the architecture. It is worth noting that with the SCRAM arch set to gcc530, the … UPDATE:
I guess that having slc6_amd64_gcc472 as the default SCRAM architecture is fine, but there might be some confusion if someone first works with a Run II-friendly CMSSW version and then tries to go back to creating a CMSSW area for Run I datasets using the same shell instance. This is probably a very rare issue, but it's something to be aware of.
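For reference, the architecture switch described above can be sketched with the standard CMSSW workflow. These commands assume a CMS software environment (lxplus or the open data VM) and will not work elsewhere:

```shell
# Sketch: assumes a CMS software environment (lxplus / CMS open data VM).
# Override the default SCRAM architecture before creating the CMSSW area.
export SCRAM_ARCH=slc6_amd64_gcc530

# Create a CMSSW 8.0.26 working area and set up its runtime environment.
cmsrel CMSSW_8_0_26
cd CMSSW_8_0_26/src
cmsenv

# Verify which architecture is in use.
echo $SCRAM_ARCH
```

If the shell later needs a Run I CMSSW area again, SCRAM_ARCH has to be set back (or a fresh shell used), which is exactly the confusion mentioned above.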
@kimmokal Are the tuples ready to be copied over?
@katilp The .root files are placed in my EOS space and can be found in the path /eos/user/k/kkallone/JetNTuple_QCD_RunII_13TeV_MC/ . The HDF5 conversion has truly been a headache due to data columns with variable length, but I think/hope I have now conquered the major obstacles. I will spend this afternoon processing all the files and validating that they work as they should. If all goes well, they can also be copied over later today.
@kimmokal when ready, check the permissions of the directory (at the moment I can't access it).
@katilp Can you now access the directory? I verified that the HDF5 conversion is working as it should. However, it turns out that lxplus is so ridiculously slow right now that I will do the conversion locally, which admittedly will also take a long time. Hence, the .h5 files will be ready to be copied over tomorrow.
@katilp The conversion is now ready. I ended up doing it in a parallel fashion on lxplus. I was perhaps a bit unwise in putting the converted .h5 files in the same folder as the .root files. So if you already started copying the files, you might have ended up copying unfinished .h5 files in the process.
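For anyone wondering how variable-length columns can be stored in HDF5 at all, here is a minimal, self-contained sketch using h5py's variable-length dtypes. This is an illustration, not the actual conversion script; the dataset name `constituent_pt` and the example values are made up:

```python
import h5py
import numpy as np

# A variable-length float64 dtype: each dataset entry can hold a
# different number of values (e.g. per-jet constituent momenta).
vlen_float = h5py.vlen_dtype(np.dtype("float64"))

# Toy data: three jets with 2, 1 and 3 constituents respectively.
jets_pt = [
    np.array([31.2, 14.5]),
    np.array([55.0]),
    np.array([12.1, 9.8, 7.7]),
]

# Write one variable-length row per jet.
with h5py.File("jets_sketch.h5", "w") as f:
    dset = f.create_dataset("constituent_pt", (len(jets_pt),), dtype=vlen_float)
    for i, arr in enumerate(jets_pt):
        dset[i] = arr

# Read everything back; each row comes back as a ragged array.
with h5py.File("jets_sketch.h5", "r") as f:
    read_back = [np.array(x) for x in f["constituent_pt"][:]]

n_constituents = [len(a) for a in read_back]
```

The same pattern extends to any per-jet ragged column; an alternative is zero-padding to a fixed width, which trades file size for simpler downstream reading.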
@kimmokal OK, thanks we'll have a look. We did not start copying yet. For the variable description in https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016, would you be able to provide a public page with the relevant information which now resides in CMS internal pages
@katilp I actually updated the README for the Master branch earlier, removing all references to the internal twiki pages, but I forgot to update the 2016 branch as well. I'll fix that. |
@kimmokal Could you also provide a description text for the purpose of these files (cf. the beginning of http://opendata-dev.cern.ch/record/328, but it does not need to be that long)?
@ArtemisLav Could you kindly build a record (similar to http://opendata-dev.cern.ch/record/328) for these Data science jettuples:
We discussed with @tiborsimko that it could go to a new cms-derived-Run2-datascience.json. It would be good to have this as an example record, so that other similar records (3 more to come) can be based on it. Thanks!
@katilp I updated the readme of the github repo and merged the master into the 2016 branch, so it's up-to-date now. Note that there is the line 'git clone https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/', where the 'cms-legacydata-analyses' part needs to be changed in the actual release. I am now in the process of writing the description for the dataset. I'll send it to you (or to @ArtemisLav ?) during the weekend. I don't think there's a need for the 'How were these data validated?' section.
* Closes cernopendata#2447 Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
@katilp @ArtemisLav I wrote up this description:

"The dataset consists of particle jets extracted from simulated proton-proton collision events at a center-of-mass energy of 13 TeV, generated with Pythia 8. The particles emerging from the collisions traverse a simulation of the CMS detector. The particles were reconstructed from the simulated detector signals using the particle-flow (PF) algorithm; the reconstructed particles are also called PF candidates. The jets in this dataset were clustered from the PF candidates of each collision event using the anti-$k_t$ algorithm with distance parameter. From each collision event, only those jets with transverse momentum exceeding 30 GeV were saved to file. The jets were also required to have pseudorapidity of less than 2.5 (this indicates the jet's position in the detector).

For each jet, there are variables describing the jet at the high level, particle level and generator level. There are also some variables describing the collision event and the conditions of its simulation. All of the variables are saved on a jet-by-jet basis, which means that one row of data corresponds to one jet.

The origin of a jet is particularly interesting. This so-called flavor of the jet is obtained from the generator-level particles by a jet flavor algorithm, which attempts to match a reconstructed jet to a single initiating particle. As a consequence, the jet flavor definition depends on the chosen algorithm. Here three different flavor definitions are available. The 'hadron' definition identifies b- and c-hadrons from the jet's constituents, so it is only useful for b-tagging studies. The 'parton' definition extends this to include the light jet flavors (u, d, s and gluon). Finally there is the 'physics' definition, which looks at the quarks and gluons of the initial collision. The 'parton' and 'physics' definitions both identify all jet flavors, but the former is more biased towards b- and c-quarks. If in doubt, it is recommended to use the 'physics' definition."

I can extend it if it's too short or missing necessary details. Also, my ORCID iD is 0000-0001-9769-7163. Is there anything else required from me for the metadata?
@ArtemisLav Could you also add:
This file is on dev but does not have a DOI yet
@kimmokal If you have an example notebook or similar, it could possibly be entered under "usage". The regular data samples have something like …; in contrast, it would maybe be useful to mention that this dataset does not require any CMS experiment-specific environment and can be used as in some example (if you have a link).
Thanks @kimmokal
I just need a title for the record and if possible the distribution information (dataset characteristics):
@ArtemisLav I'm copying the files to the final destination, and I'll supply all the file information (except the number of events). BTW we'll have 122 files in ROOT and H5 formats, and the data contained in them should be equivalent, so I wonder whether we shall say that this dataset contains 122 or 244 files? I guess the latter, but that could also confuse some people, e.g. those who only want ROOT and might wonder why there are only 122 of them... Are there any DCAT etc. standards out there for these "alternative formats" cases?
@tiborsimko hmm would it be easier if we just add a note in |
Yes, I would list all the files the record holds, and a usage note could explain the ROOT vs H5 formats indeed. Note that there is a transfer problem with three H5 files, but otherwise we are good to create this test record.
OK, could someone please provide that description? |
Could be something like this, I leave @kimmokal to complete: @kimmokal you could add the mention of h5 specific stuff maybe. @tiborsimko Should we use H5, h5, HD5, hd5, HDF5, hdf5... in the text? |
@katilp I have been struggling with the notebook and making it practical enough :/ @ArtemisLav @tiborsimko How do we deal with the "number_events" here, because the files don't contain full events, only jets? Should it just be the total number of jets then? Do you have any suggestion for the title of the record? |
@kimmokal that would be very good as well. |
* Addresses cernopendata#2447 Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
* Addresses cernopendata#2447 Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch> Signed-off-by: Tibor Simko <tibor.simko@cern.ch>
Some suggestions to the current draft:
with the link to the record to be built for the SW (fifth item in the initial list at the start of this issue), @tiborsimko would that work?
it could be "Entries" (
@katilp It would be nice to open independent issues for these things, so that the work can be parallelised. E.g. @okraskaj can take care of the template amendment while @ArtemisLav could take care of metadata editing... I'll be busy today until late in the afternoon. Some quick comments:
@tiborsimko yes, indeed number_entries or number_jets just as alternative, like here, not change it everywhere |
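As a sketch of what was discussed, the distribution part of the record could look something like this. The field names other than the discussed "number_entries" alternative are assumptions for illustration, the file count comes from the 122 ROOT + 122 H5 files mentioned above, and the null values are placeholders to be filled in:

```json
{
  "distribution": {
    "formats": ["root", "h5"],
    "number_entries": null,
    "number_files": 244,
    "size": null
  }
}
```

A usage note in the same record would then explain that the ROOT and H5 files carry equivalent data in two formats.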
Sure, is there metadata somewhere? |
@ArtemisLav the description is above i.e.
and the link to the software record will need to be a place holder for now |
@katilp I meant for the software record. Are we doing that now? |
@ArtemisLav not yet done, but I can add it. Should I open an issue for all sw records needed for these ML samples, or one by one? |
* addresses cernopendata#2447. Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
Closing as remaining issues of Run2 MiniAODSIM provenance followed up in #2525 |
In connection with #2440, this issue follows the jettuples to be produced from run2 AOD samples, to be made available on the portal.
The datasets:
Data science jettuples (contact Kimmo Kallonen HIP):
To do:
For contributions, see also https://github.com/cernopendata/opendata.cern.ch/wiki/Contributing-content-to-CERN-Open-Data