CMS: SW records for ML sample production (Run2 QCD for jettuples) #2584

Closed
katilp opened this issue Apr 8, 2019 · 6 comments · Fixed by #2586

Comments

@katilp
Member

katilp commented Apr 8, 2019

From #2447
The structure can be as in http://opendata-dev.cern.ch/record/5104

@caredg @laramaktub This is what we would need to extract from a github repo or CAP in order to build a SW record on the portal

Needed:

@ArtemisLav @tiborsimko :

  • For run periods, I think we really must add Run2 and Phase2 as values instead of specific run periods. It does not matter if they are shown separately. For Phase2 (= upgrade) SW (another record), data-taking has not happened yet, so it would be impossible to give the exact run periods.
  • These ML SW records produce a dataset that is on the portal; for this one it is recid 12100.
    • Do we have an appropriate json field for that?
    • It should have as description "The output dataset of this software tool is available in:"

json could be:

{
  "created": "FIXME", 
  "id": FIXME, 
  "metadata": {
    "$schema": "http://opendata.cern.ch/schema/records/record-v1.0.0.json", 
    "abstract": {
      "description": "<p>This is a CMSSW module for producing flat tuples containing jet properties from 13 TeV Run2 MC samples. The code is intended to run inside the CMS Open Data environment.</p>"
    }, 
    "accelerator": "CERN-LHC", 
    "authors": [
      {
        "name": "Kallonen, Kimmo"
      }
    ], 
    "collections": [
      "FIXME Is this needed/necessary: CMS-Validation-Utilities"
    ], 
    "control_number": "FIXME", 
    "date_created": [
      "2019"
    ], 
    "date_published": "2019", 
    "distribution": {
      "formats": [
        "gz"
      ], 
      "number_files": 1, 
      "size": FIXME
    }, 
    "doi": "FIXME", 
    "experiment": "CMS", 
    "files": [
      {
        "checksum": "FIXME", 
        "filename": "FIXME check JetNtupleProducerTool-1.0.0.tar.gz", 
        "size": 16966, 
        "uri_http": "FIXME", 
        "uri_root": "FIXME"
      }
    ], 
    "index_files": [], 
    "license": {
      "attribution": "GNU General Public License (GPL) version 3"
    }, 
    "publisher": "CERN Open Data Portal", 
    "recid": "FIXME", 
    "run_period": [
      "Run2"
    ], 
    "system_details": {
      "description": "Use this code with the CMS Open Data VM environment", 
      "recid": "252", 
      "release": "CMSSW_8_0_26"
    }, 
    "title": "JetNtupleProducerTool - a Jet tuple producer from CMS Run2 MiniAOD", 
    "type": {
      "primary": "Software", 
      "secondary": [
        "Tool"
      ]
    }, 
    "usage": {
      "description": "<p>If you do not have the CERN Virtual Machine for CMS open data installed, follow the instructions in step 1 at <a href=\"/VM/CMS/2011\">How to install a CERN Virtual Machine</a>.</p> <p>To run the analysis, follow the instructions in FIXME <a href=\"https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016\">https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016</a>.</p>"
    }, 
    "use_with": {
      "description": "Use this with Run2 QCD MiniAODSIM dataset (or any Run2 MiniAOD containing jets).", 
      "links": [
        {
          "recid": "12021"
        }
      ]
    }
  }, 
  "updated": "FIXME"
}
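
Before a draft like this can become a record, every FIXME needs a real value, and the "files" entry needs the actual size and checksum of the tarball. A minimal sketch of both checks follows. The helper names find_fixmes and describe_file are hypothetical, the draft is a trimmed-down stand-in with the bare FIXME values quoted so that it parses as JSON, and the "adler32:" checksum format is an assumption rather than something this thread confirms.

```python
import json
import os
import zlib


def find_fixmes(node, path=""):
    """Yield (path, value) for every string value that still contains FIXME."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_fixmes(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_fixmes(value, f"{path}[{index}]")
    elif isinstance(node, str) and "FIXME" in node:
        yield path, node


def describe_file(file_path):
    """Return filename, size in bytes and an adler32 checksum string.

    The "adler32:<hex>" format is an assumption made for illustration.
    """
    checksum = 1  # adler32 starting value
    with open(file_path, "rb") as handle:
        # Read in 1 MiB chunks so large tarballs are not loaded at once
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            checksum = zlib.adler32(chunk, checksum)
    return {
        "filename": os.path.basename(file_path),
        "size": os.path.getsize(file_path),
        "checksum": f"adler32:{checksum & 0xFFFFFFFF:08x}",
    }


# Trimmed-down stand-in for the draft record, with FIXME values quoted
draft = json.loads("""
{
  "id": "FIXME",
  "metadata": {
    "doi": "FIXME",
    "experiment": "CMS",
    "files": [{"checksum": "FIXME", "size": 16966}],
    "run_period": ["Run2"]
  }
}
""")

# List every field that still needs a real value
for field_path, value in find_fixmes(draft):
    print(f"{field_path}: {value}")
```

Calling describe_file on the released tarball would give the values to copy into the "files" entry; the record is ready for review once find_fixmes yields nothing.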
@kimmokal

kimmokal commented Apr 9, 2019

@katilp The new github repository in cms-opendata-analyses/JetNtupleProducerTool is up now and I licensed it under GPL 3.0.

If I provide the example Python scripts for loading the ntuples, where should I host them? Or should they be written directly in the record?

@katilp
Member Author

katilp commented Apr 10, 2019

@kimmokal @caredg I see different options:

  • if very short, it can be in the record
  • if it can be considered a test of the output, it could go in the JetNtupleProducerTool repo
  • if more extensive, it could have a repo of its own - @caredg would it make sense to have a separate cms-opendata-datascience organization, or does this fit in cms-opendata-analyses?

in case it goes to a separate repo, I would be in favour of a new organization in a similar way as for cms-opendata-education. That would follow the logic:

  • analysis on CMS official open data (i.e. analysis on AOD) -> cms-opendata-analyses
  • analysis/examples on csv or other data for education purposes -> cms-opendata-education
  • examples on ML samples usage -> cms-opendata-datascience

@caredg
Contributor

caredg commented Apr 10, 2019

I see no problem in creating another organization if needed. If it is a test of the JetNtupleProducerTool, then maybe it could go into another package within the same repository as well. In any case, let me know and I can create the additional organization. BTW, once everything is in place in the cms-opendata-analyses organization, the JetNtupleProducerTool should be deleted from cms-legacydata-analyses in order not to duplicate efforts.

@kimmokal

@katilp I reckon the examples are rather short, so I don't think it makes sense in this case to create a separate repository. I would provide the following two very similar code snippets.

Loading a ROOT file to a Pandas dataframe:

import pandas as pd
import uproot

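# Open the ROOT file and get the jet tree
filePath = 'JetNtuple_RunIISummer16_13TeV_MC_1.root'
tree = uproot.open(filePath)['AK4jets']['jetTree']

# Create a dataframe and fill it with one column per branch
# (tree.array(key) is the uproot 3 API; uproot 4+ uses tree[key].array(library="np"))
df = pd.DataFrame()
for key in tree.keys():
    df[key] = tree.array(key)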

Loading an HDF5 file to a Pandas dataframe:

import pandas as pd
import h5py

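# Open the HDF5 file; the context manager closes it when done
filePath = 'JetNtuple_RunIISummer16_13TeV_MC_1.h5'
with h5py.File(filePath, 'r') as h5File:
    # Create a dataframe and fill it with one column per dataset
    df = pd.DataFrame()
    for key in h5File.keys():
        # [:] reads the dataset into memory as a NumPy array
        df[key] = h5File[key][:]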

I guess they could just go directly in the record.

@caredg I verified that the cms-opendata-analyses/JetNtupleProducerTool repository works as intended, so should I now just delete the repository in cms-legacydata-analyses?

@caredg
Contributor

caredg commented Apr 11, 2019

@kimmokal, yes I think that would be good.

@kimmokal

@caredg Done!

tiborsimko pushed a commit to ArtemisLav/opendata.cern.ch that referenced this issue May 13, 2019
* (closes cernopendata#2584)

Signed-off-by: Artemis Lavasa <artemis.lavasa@cern.ch>
Signed-off-by: Tibor Simko <tibor.simko@cern.ch>
@ghost ghost removed the Status: in review label May 13, 2019
@tiborsimko tiborsimko added this to Done in CMS-Q4-Updates Jun 6, 2019