CMS: SW records for ML sample production (Run2 QCD for jettuples) #2584

katilp · 2019-04-08T14:46:17Z

From #2447
The structure can be as in http://opendata-dev.cern.ch/record/5104

@caredg @laramaktub This is what we would need to extract from a github repo or CAP in order to build a SW record on the portal

Needed:

title: JetNtupleProducerTool - a Jet tuple producer from CMS Run2 MiniAOD
authors: Kimmo Kallonen @kimmokal others?
tags/keywords: Software, Tool
the datasets that are used as input: recid 12021 (or any Run2 MiniAOD containing jets)
System details
- VM recid 252
- CMSSW version: CMSSW_8_0_26
how to use: the GitHub repo is at
https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016
but this is a placeholder, as it will be moved to https://github.com/cms-opendata-analyses/JetNtupleProducerTool/tree/2016 and get a release tag there
licence: @kimmokal do you agree with GNU General Public License (GPL) version 3?

@ArtemisLav @tiborsimko :

for run periods, I think we really must add Run2, Phase2 instead of specific run periods. It does not matter if they show separately. For Phase2 (=upgrade) SW (another record), we have not yet done the data-taking so it would be impossible to give the exact run periods...
these ML SW records produce a dataset which is on the portal; for this one it is recid 12100
- do we have an appropriate json field for that?
- it should have as description "The output dataset of this software tool is available in:"

json could be:

{
  "created": "FIXME", 
  "id": FIXME, 
  "metadata": {
    "$schema": "http://opendata.cern.ch/schema/records/record-v1.0.0.json", 
    "abstract": {
      "description": "<p>This is a CMSSW module for producing flat tuples containing jet properties from 13 TeV Run2 MC samples. The code is intended to run inside the CMS Open Data environment.</p>"
    }, 
    "accelerator": "CERN-LHC", 
    "authors": [
      {
        "name": "Kallonen, Kimmo"
      }
    ], 
    "collections": [
      "FIXME Is this needed/necessary: CMS-Validation-Utilities"
    ], 
    "control_number": "FIXME", 
    "date_created": [
      "2019"
    ], 
    "date_published": "2019", 
    "distribution": {
      "formats": [
        "gz"
      ], 
      "number_files": 1, 
      "size": FIXME
    }, 
    "doi": "FIXME", 
    "experiment": "CMS", 
    "files": [
      {
        "checksum": "FIXME", 
        "filename": "FIXME check JetNtupleProducerTool-1.0.0.tar.gz", 
        "size": 16966, 
        "uri_http": "FIXME", 
        "uri_root": "FIXME"
      }
    ], 
    "index_files": [], 
    "license": {
      "attribution": "GNU General Public License (GPL) version 3"
    }, 
    "publisher": "CERN Open Data Portal", 
    "recid": "FIXME", 
    "run_period": [
      "Run2"
    ], 
    "system_details": {
      "description": "Use this code with the CMS Open Data VM environment", 
      "recid": "252", 
      "release": "CMSSW_8_0_26"
    }, 
    "title": "JetNtupleProducerTool - a Jet tuple producer from CMS Run2 MiniAOD", 
    "type": {
      "primary": "Software", 
      "secondary": [
        "Tool"
      ]
    }, 
    "usage": {
      "description": "<p>If you do not have the CERN Virtual Machine for CMS open data installed, follow the instructions in step 1 at <a href=\"/VM/CMS/2011\">How to install a CERN Virtual Machine</a>.</p> <p>To run the analysis, follow the instructions in FIXME <a href=\"https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016\">https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016</a>.</p>"
    }, 
    "use_with": {
      "description": "Use this with Run2 QCD MiniAODSIM dataset (or any Run2 MiniAOD containing jets).", 
      "links": [
        {
          "recid": "12021"
        }
      ]
    }
  }, 
  "updated": "FIXME"
}
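Since the draft above still contains bare `FIXME` values (for example `"id": FIXME`), it is not yet valid JSON and cannot be checked with a JSON parser. A minimal sketch of a text-based check that lists the lines still to be filled in is given below. The helper name `find_placeholders` and the shortened `draft` string are assumptions made for illustration and are not part of the portal tooling.

```python
import re


def find_placeholders(draft_text, marker="FIXME"):
    """Return (line_number, field_name) pairs for lines containing the marker.

    field_name is None when the line holds a bare value (e.g. a list entry)
    rather than a "key": value pair.
    """
    hits = []
    for number, line in enumerate(draft_text.splitlines(), start=1):
        if marker in line:
            # Pick up the key when the line looks like:  "key": ...
            match = re.match(r'\s*"([^"]+)"\s*:', line)
            hits.append((number, match.group(1) if match else None))
    return hits


# Shortened, hypothetical excerpt of the record draft (not the full record)
draft = '''{
  "created": "FIXME",
  "id": FIXME,
  "metadata": {
    "doi": "FIXME",
    "experiment": "CMS",
    "collections": [
      "FIXME Is this needed/necessary: CMS-Validation-Utilities"
    ]
  }
}'''

for number, field in find_placeholders(draft):
    print(number, field)
```

Scanning the text line by line is a deliberate choice here: it keeps working while the draft is still syntactically incomplete.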


kimmokal · 2019-04-09T10:19:22Z

@katilp The new github repository in cms-opendata-analyses/JetNtupleProducerTool is up now and I licensed it under GPL 3.0.

If I provide the example Python scripts for loading the ntuples, where should I host them? Or should they be written directly in the record?

katilp · 2019-04-10T06:23:41Z

@kimmokal @caredg I see different options:

if very short, it can be in the record
if it can be considered as a test for the output, it could go in the JetNtupleProducerTool repo
if more extensive, it could have a repo of its own - @caredg would it make sense to have a separate cms-opendata-datascience organization or does this fit in cms-opendata-analyses?

in case it goes to a separate repo, I would be in favour of a new organization in a similar way as for cms-opendata-education. That would follow the logic:

analysis on CMS official open data (i.e. analysis on AOD) -> cms-opendata-analyses
analysis/examples on csv or other data for education purposes -> cms-opendata-education
examples on ML samples usage -> cms-opendata-datascience

caredg · 2019-04-10T15:54:31Z

I see no problem in creating another organization if needed. If it is a test of the JetNtupleProducerTool, then maybe it could go into another package within the same repository as well. In any case, let me know and I can create the additional organization if needed. BTW, once everything is in place in the cms-opendata-analyses organization, the JetNtupleProducerTool should be deleted from cms-legacydata-analyses in order not to duplicate efforts.

kimmokal · 2019-04-11T15:14:50Z

@katilp I reckon the examples are rather short, so I don't think it makes sense in this case to create a separate repository. I would provide the following two very similar code snippets.

Loading a ROOT file to a Pandas dataframe:

import pandas as pd
import uproot

# Load a ROOT file
filePath = 'JetNtuple_RunIISummer16_13TeV_MC_1.root'
rootFile = uproot.open(filePath)['AK4jets']['jetTree']

# Create and fill a dataframe
df = pd.DataFrame()
for key in rootFile.keys():
    df[key] = rootFile.array(key)
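Note that calling `array(key)` directly on the tree, as in the snippet above, is the uproot 3 interface. In uproot 4 and later, a single branch is read with `tree[key].array()` and several branches can be read into a dataframe in one call with `tree.arrays(..., library="pd")`. A minimal sketch for the newer interface follows. The helper name `load_flat_branches` and its parameters are assumptions for illustration, and it presumes that the requested branches are flat (one value per jet), since variable-length branches may need extra packages to convert to pandas.

```python
def load_flat_branches(path, tree_path="AK4jets/jetTree", branches=None):
    """Read branches of a TTree into a pandas DataFrame (uproot >= 4).

    path      -- ROOT file to open
    tree_path -- location of the tree inside the file
    branches  -- list of branch names to read, or None for all branches
    """
    # Imported here so that defining the helper does not require uproot
    import uproot

    with uproot.open(path) as root_file:
        tree = root_file[tree_path]
        # library="pd" asks uproot to return a pandas DataFrame
        return tree.arrays(branches, library="pd")


# Example call (file name taken from the snippet above):
# df = load_flat_branches('JetNtuple_RunIISummer16_13TeV_MC_1.root')
```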

Loading an HDF5 file to a Pandas dataframe:

import pandas as pd
import h5py

# Load an HDF5 file
filePath = 'JetNtuple_RunIISummer16_13TeV_MC_1.h5'
h5File = h5py.File(filePath, 'r')

# Create and fill a dataframe
df = pd.DataFrame()
for key in h5File.keys():
    # [()] reads the dataset into a NumPy array before assignment
    df[key] = h5File[key][()]
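The HDF5 loading step can be tried out without the real ntuple by writing a small toy file first. The sketch below does that and then reads it back; the helper name `h5_to_dataframe`, the toy file name and the dataset names `jetPt` and `jetEta` are assumptions for illustration and are not taken from the actual jet tuple. It also assumes that only one-dimensional, top-level datasets should become dataframe columns.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def h5_to_dataframe(path):
    """Read every top-level 1-D dataset of an HDF5 file into a DataFrame."""
    columns = {}
    with h5py.File(path, "r") as h5_file:
        for key, item in h5_file.items():
            # Skip groups and multi-dimensional datasets
            if isinstance(item, h5py.Dataset) and item.ndim == 1:
                columns[key] = item[()]  # copy the data into memory
    return pd.DataFrame(columns)


# Write a toy file with made-up values (hypothetical dataset names)
toy_path = os.path.join(tempfile.mkdtemp(), "toy_jets.h5")
with h5py.File(toy_path, "w") as h5_file:
    h5_file["jetPt"] = np.array([35.2, 61.0, 120.5])
    h5_file["jetEta"] = np.array([-1.1, 0.3, 2.0])

df = h5_to_dataframe(toy_path)
print(df)
```

Opening the file in a `with` block closes it as soon as the data has been copied, so the dataframe does not depend on an open file handle.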

I guess they could just go directly in the record.

@caredg I verified that the cms-opendata-analyses/JetNtupleProducerTool repository works as intended, so should I now just delete the repository in cms-legacydata-analyses?

caredg · 2019-04-11T15:19:00Z

@kimmokal, yes I think that would be good.

kimmokal · 2019-04-11T15:22:28Z

@caredg Done!


katilp added Topic: records Experiment: CMS CMS: software labels Apr 8, 2019

katilp added this to the CMS-Q4-Updates milestone Apr 8, 2019

katilp assigned ArtemisLav Apr 8, 2019

This was referenced Apr 8, 2019

CMS: Run2 QCD MC for data science jettuples #2447

Closed

CMS: SW record for ML sample production (Run2 Hbb and QCD MC for ML studies) #2585

Closed

ArtemisLav added a commit to ArtemisLav/opendata.cern.ch that referenced this issue Apr 8, 2019

records: adds new CMS ML software records

9e4fce0

* (closes cernopendata#2584) Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>

This was referenced Apr 8, 2019

records: fixes for record 12100 #2582

Merged

records: adds new CMS ML software records #2586

Merged

ghost added the Status: in review label Apr 8, 2019

katilp mentioned this issue Apr 8, 2019

CMS: SW record for ML sample production (upgrade MC for tracking GPU studies) #2589

Closed

tiborsimko pushed a commit to ArtemisLav/opendata.cern.ch that referenced this issue May 13, 2019

records: CMS Run2 datascience tools

aaca7a7

* (closes cernopendata#2584) Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch> Signed-off-by: Tibor Simko <tibor.simko@cern.ch>

tiborsimko pushed a commit to ArtemisLav/opendata.cern.ch that referenced this issue May 13, 2019

records: CMS Run2 datascience tools

c419b2c

* (closes cernopendata#2584) Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch> Signed-off-by: Tibor Simko <tibor.simko@cern.ch>

tiborsimko closed this as completed in #2586 May 13, 2019

ghost removed the Status: in review label May 13, 2019

tiborsimko added this to Done in CMS-Q4-Updates Jun 6, 2019
