### When running this notebook via the Galaxy portal
You can access your data via the dataset number. Using a Python kernel, you can access dataset number 42 with ``handle = open(get(42), 'r')``.
To save data, write it to a file and then call ``put('filename.txt')``. The dataset will then be available in your Galaxy history.
<br><br>Note that if you are putting to or getting from a history other than your default history, you must also provide the history ID.
<br><br>More information, including the available Galaxy-related environment variables, can be found at https://github.com/bgruening/docker-jupyter-notebook. This notebook runs in a Docker container based on the Docker Jupyter container described at that link.


# ATLAS OpenData with RDataFrame

This notebook uses <a href="https://root.cern/doc/master/classROOT_1_1RDataFrame.html" target="_blank">RDataFrame</a> in ROOT to perform an analysis of the 13 TeV ATLAS OpenData. It needs ROOT version >= 6.24/02. 

## Includes and imports

The following cell imports the required libraries together with a helper notebook that provides functions for retrieving the available samples and the categorization of backgrounds. See the output for more information.

In [1]:
import ROOT
ROOT.EnableImplicitMT(220)
import os
import import_ipynb
import setPath
from Input.OpenDataPandaFramework13TeV import *
%jsroot on

Welcome to JupyROOT 6.24/02
importing Jupyter notebook from setPath.ipynb
importing Jupyter notebook from /home/eirikgr/software/Input/OpenDataPandaFramework13TeV.ipynb
This library contains handy functions to ease the access and use of the 13TeV ATLAS OpenData release

getBkgCategories()
	 Dumps the name of the various background cataegories available 
	 as well as the number of samples contained in each category.
	 Returns a vector with the name of the categories

getSamplesInCategory(cat)
	 Dumps the name of the samples contained in a given category (cat)
	 Returns dictionary with keys being DSIDs and values physics process name from filename.

getMCCategory()
	 Returns dictionary with keys DSID and values MC category

initialize(indir)
	 Collects all the root files available in a certain directory (indir)

getSkims(indir)
	 Prints all available skims in the directory



Setting luminosity to 10064 pb^-1

###############################
#### Background categories ####
##############

In [2]:
import socket
print(socket.gethostname())

hepp03.hpc.uio.no


In [3]:
# Not strictly needed: lumi is already set as a global variable by the import above
lumi = 10064.0
print('Run on data corresponding to {:.2f} fb^-1'.format(lumi/ 1000.0))

Run on data corresponding to 10.06 fb^-1
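
The luminosity enters the analysis through the per-event Monte Carlo weight defined further down, where each simulated event is scaled by `XSection * lumi / SumWeights` on top of its generator weight and scale factors. The following is a minimal pure-Python sketch of that arithmetic. The function name `event_weight` and all numerical inputs are illustrative assumptions, and the cross-section is assumed to be given in pb so that it matches the luminosity in pb^-1.

```python
def event_weight(mc_weight, scale_factors, xsec_pb, sum_weights, lumi_pb):
    """Per-event MC weight normalised to an integrated luminosity.

    xsec_pb is the cross-section in pb and lumi_pb the luminosity in pb^-1,
    so xsec_pb * lumi_pb is the expected number of events for the sample.
    """
    sf = 1.0
    for s in scale_factors:
        sf *= s
    return sf * mc_weight * xsec_pb * lumi_pb / sum_weights

# Hypothetical sample: 2 pb cross-section and 1e6 summed generator weights
w = event_weight(mc_weight=1.0, scale_factors=[1.0, 1.0, 1.0, 1.0],
                 xsec_pb=2.0, sum_weights=1.0e6, lumi_pb=10064.0)
print(w)
```

Data events are not rescaled, which is why the notebook assigns them a weight of 1.0.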


## Get the samples and categories

Set the path to the location of the OpenData ntuples and the <a href="http://opendata.atlas.cern/release/2020/documentation/datasets/files.html" target="_blank">dataset</a> you want to run over. The *initialize()* function finds all available samples in the directory and categorizes them accordingly.

In [4]:
dir = "/storage/shared/data/fys5555/ATLAS_opendata/"
#dir = "/storage/shared/data/fys5555/ATLAS_opendata/RNTuples/" #use RNtuple
ana = "2lep"
mcfiles = initialize(dir+"/"+ana+"/MC")
datafiles = initialize(dir+"/"+ana+"/Data")
allfiles = {**mcfiles, **datafiles}
Backgrounds = getBkgCategories()
Signals = getSignalCategories()

####################################################################################################
BACKGROIUND SAMPLES
####################################################################################################

###############################
#### Background categories ####
###############################
Category             N(samples)
-------------------------------
Diboson                      10
Higgs                        20
Wjets                        42
Wjetsincl                     6
Zjets                        42
Zjetsincl                     3
singleTop                     6
topX                          3
ttbar                         1
###############################
#### Signal categories ####
###############################
Category             N(samples)
-------------------------------
GG_ttn1                       4
Gee                           5
Gmumu                         5
RS_G_ZZ                       5
SUSYC1C1                     10
SUSYC1N2     

In [5]:
processes = allfiles.keys()
df = {}
all_samples_MC = []
for p in processes:
    samples = []
    datafrs = []
    ns = 0
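    for d in allfiles[p]["files"]:
        # On the first file, check whether a merged ("hadded") file <dir>/<process>.root exists
        if ns == 0:
            fold = "/".join(d.split("/")[:-1])
            haddfile = "%s/%s.root"%(fold,p)
            if os.path.isfile(haddfile):
                break
        samples.append(d)
        # Collect every MC file so that a combined "mc" dataframe can be built later
        if 'data' not in p:
            all_samples_MC.append(d)
        ns += 1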
    if len(samples):
        print("Using %i unhadded files for %s"%(len(samples),p))
        df[p] = ROOT.RDataFrame("mini", samples)
    else:
        print("Using hadded file %s for %s"%(haddfile,p))
        df[p] = ROOT.RDataFrame("mini", haddfile)

Using 3 unhadded files for topX
Using 3 unhadded files for Zjetsincl
Using 10 unhadded files for Diboson
Using 5 unhadded files for Gmumu
Using 4 unhadded files for ZPrimeee
Using 10 unhadded files for dmV_Zll
Using 5 unhadded files for RS_G_ZZ
Using 13 unhadded files for Higgs
Using 4 unhadded files for ZPrimemumu
Using 12 unhadded files for ZPrimett
Using 6 unhadded files for Wjetsincl
Using 5 unhadded files for Gee
Using 42 unhadded files for Zjets
Using 42 unhadded files for Wjets
Using 10 unhadded files for SUSYC1C1
Using 4 unhadded files for GG_ttn1
Using 4 unhadded files for TT_directTT
Using 18 unhadded files for SUSYC1N2
Using 14 unhadded files for SUSYSlepSlep
Using 1 unhadded files for mc_999999
Using 1 unhadded files for ttbar
Using 6 unhadded files for singleTop
Using 4 unhadded files for data


In [6]:
df["mc"] = ROOT.RDataFrame("mini", all_samples_MC)

In [7]:
! g++ -shared -fPIC -o Cfunctions.so /storage/shared/software/Input/Cfunctions.cxx `root-config --cflags --glibs`

/storage/shared/software/Input/Cfunctions.cxx: In function ‘std::vector<std::__cxx11::basic_string<char> > DropColumns(std::vector<std::__cxx11::basic_string<char> >&&)’:
    auto is_blacklisted = [&blacklist](const std::string &s)  { return std::find(blacklist.begin(), blacklist.end(), s) != blacklist.end(); };
                            ^~~~~~~~~
/storage/shared/software/Input/Cfunctions.cxx:150:42: note: ‘const std::vector<std::__cxx11::basic_string<char> > blacklist’ declared here
    static const std::vector<std::string> blacklist = {"useless", "columns"};
                                          ^~~~~~~~~


Load the compiled C++ library of helper functions. Call *ROOT.help()* to list its contents.

In [8]:
ROOT.gSystem.AddDynamicPath("/storage/shared/software/Input/")
ROOT.gROOT.ProcessLine(".include /storage/shared/software/Input/")
ROOT.gInterpreter.AddIncludePath("/storage/shared/software/Input/")
ROOT.gInterpreter.Declare('#include "/storage/shared/software/Input/Cfunctions.h"') # Header declaring the helper functions
ROOT.gSystem.Load("Cfunctions.so") # Shared library compiled in the previous cell

0

In [9]:
ROOT.help()

Library of handy functions to be used with RDataFrame
isOS(const ROOT::VecOps::RVec<int>& chlep)
	 Checks if pair of leptons has opposite sign. Returns bool
isSF(Vec_t& fllep)
	 Checks if pair of leptons has same flavour (i.e. electron, muon, tau etc.). Returns bool
ComputeInvariantMass(Vec2_t& pt, Vec2_t& eta, Vec2_t& phi, Vec2_t& e)
	 Computes invariant mass of leptons. Input can be any size, function will compute the total invariant mass of all objects.  Returns float
calcMT2(Vec2_t& pt, Vec2_t& eta, Vec2_t& phi, Vec2_t& e, Float_t met_et, Float_t met_phi)
	 Computes the stransverse mass [Ref.: https://gitlab.cern.ch/atlas-phys-susy-wg/CalcGenericMT2].  Returns float
costhetastar(Vec2_t& pt, Vec2_t& eta, Vec2_t& phi, Vec2_t& e)
	 Computes the cos(theta)* of two leptons.  Returns float
deltaPhi_ll(Vec2_t& pt, Vec2_t& eta, Vec2_t& phi, Vec2_t& e)
	 Computes the difference in phi between two leptons.  Returns float
deltaPhi_metl(Vec2_t& pt, Vec2_t& eta, Vec2_t& phi, Vec2_t& e)
	 Comput
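
The help text above describes *ComputeInvariantMass* as summing any number of objects and returning the invariant mass of the combined system. The C++ source is not shown in this notebook, so the following pure-Python function is only a sketch of the standard calculation from (pt, eta, phi, E), not the library's implementation. The name `invariant_mass` is hypothetical, and the example values are assumed to be in MeV.

```python
import math

def invariant_mass(pt, eta, phi, e):
    """Invariant mass of the system formed by all input objects.

    Inputs are equal-length sequences; pt and e must share the same unit.
    """
    px = sum(p * math.cos(f) for p, f in zip(pt, phi))
    py = sum(p * math.sin(f) for p, f in zip(pt, phi))
    pz = sum(p * math.sinh(h) for p, h in zip(pt, eta))
    etot = sum(e)
    m2 = etot**2 - (px**2 + py**2 + pz**2)
    # Guard against tiny negative values caused by floating-point rounding
    return math.sqrt(max(m2, 0.0))

# Two massless leptons, back to back in the transverse plane at eta = 0
m = invariant_mass(pt=[45000.0, 45000.0], eta=[0.0, 0.0],
                   phi=[0.0, math.pi], e=[45000.0, 45000.0])
print(m)
```

For this back-to-back configuration the momenta cancel, so the result equals the summed energy of the two leptons.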

In [10]:
nlep = 2
lepv = ["lep_pt","lep_eta","lep_phi","lep_E",
        "lep_ptcone30","lep_etcone20",
        "lep_trackd0pvunbiased","lep_tracksigd0pvunbiased",
        "lep_isTightID","lep_z0"]

In [11]:
MCcat = {}
MCdescr = {}
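for cat in allfiles:
    for n, dsid in enumerate(allfiles[cat]["dsid"]):
        try:
            MCcat[int(dsid)] = cat
            MCdescr[int(dsid)] = allfiles[cat]["files"][n].split("/")[-1].split(".")[1]
        except (ValueError, TypeError, IndexError):
            # Skip entries without a numeric DSID or with an unexpected file name
            continue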
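
The cell above takes the physics description from the second dot-separated field of each file name. As a standalone illustration, the sketch below parses both the DSID and the description from a path. It assumes file names of the form `mc_<DSID>.<description>.<skim>.root`, which is an assumption about the naming scheme; the notebook itself reads the DSID from `allfiles`. The function name `parse_sample_name` and the directory in the example are hypothetical.

```python
def parse_sample_name(path):
    """Return (dsid, description) for a path like '.../mc_301333.ZPrime3000_tt.2lep.root'.

    Returns None when the file name does not follow that pattern.
    """
    fields = path.split("/")[-1].split(".")
    if len(fields) < 2 or "_" not in fields[0]:
        return None
    dsid = fields[0].split("_")[-1]
    if not dsid.isdigit():
        return None
    return int(dsid), fields[1]

print(parse_sample_name("/some/dir/mc_301333.ZPrime3000_tt.2lep.root"))
print(parse_sample_name("/some/dir/notes.txt"))
```

Returning `None` for non-matching names makes the skipped entries explicit instead of relying on an exception.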

In [12]:
# Look up the MC category for a given DSID
def getCategory(dsid):
    return MCcat[dsid]

In [13]:
# Look up the physics-process description for a given DSID
def getDescr(dsid):
    return MCdescr[dsid]

In [14]:
%%time
import time
for p in df.keys():
    
    print(p)
    
    #if not p in ["mc"]: continue
    
    print("Looking at %s"%p)
    
    # Define good leptons using relative calorimeter and track isolation (no pT cut is applied here)
    df[p] = df[p].Define("goodLEP","lep_etcone20/lep_pt < 0.15 && lep_ptcone30/lep_pt < 0.15")
    df[p] = df[p].Define("n_goodLEP","Sum(goodLEP)")
    # Require number of good leptons
    df[p] = df[p].Filter("n_goodLEP == 2","2 good leptons")
    
    # Calculate flavour and charge of the two leptons
    df[p] = df[p].Define("isOS","isOS(lep_charge[goodLEP])")
    df[p] = df[p].Define("isSF","isSF(lep_type[goodLEP])")
    
    
    for i in range(nlep):
        df[p] = df[p].Define("lep%i_flav"%(i+1),"getTypeTimesCharge(lep_charge[goodLEP],lep_type[goodLEP],%i)"%(i))
        for v in lepv:
            if "lep_" in v:
                var = v.replace("lep_","")
            else:
                var = v
            #print(var)
            df[p] = df[p].Define("lep%i_%s"%(i+1,var),"getVar(lep_%s[goodLEP],%i)"%(var,i))
            
    #df[p] = df[p].Define("lep2_pt","getVar(lep_pt[goodLEP],1)")
    # Cut on SF + OS
    #df[p] = df[p].Filter("isSF","Same flavour")
    #df[p] = df[p].Filter("isOS","Opposite sign")
    # Compute mll
    df[p] = df[p].Define("mll","ComputeInvariantMass(lep_pt[goodLEP],lep_eta[goodLEP],lep_phi[goodLEP],lep_E[goodLEP])")
    df[p] = df[p].Define("mt2","calcMT2(lep_pt[goodLEP],lep_eta[goodLEP],lep_phi[goodLEP],lep_E[goodLEP],met_et,met_phi)")
    
    df[p] = df[p].Define("njet20","countJets(jet_pt,20000)")
    df[p] = df[p].Define("njet60","countJets(jet_pt,60000)")
    
    df[p] = df[p].Define("nbjet60","countBJets(jet_pt,jet_MV2c10,20000,60)")
    df[p] = df[p].Define("nbjet70","countBJets(jet_pt,jet_MV2c10,20000,70)")
    df[p] = df[p].Define("nbjet77","countBJets(jet_pt,jet_MV2c10,20000,77)")
    df[p] = df[p].Define("nbjet85","countBJets(jet_pt,jet_MV2c10,20000,85)")
    
    # Compute costheta*
    df[p] = df[p].Define("costhstar","costhetastar(lep_pt[goodLEP],lep_eta[goodLEP],lep_phi[goodLEP],lep_E[goodLEP])")
    
    
    # Calculate weight for scaling (includes scaling to luminosity)
    if "data" in p:
        df[p] = df[p].Define("weight", "1.0")
        #df[p] = df[p].Define("category","getDataMetaData(channelNumber)")
        #df[p] = df[p].Define("physdescr","getDataMetaData(channelNumber)")
    else:
        df[p] = df[p].Define("weight", "scaleFactor_ELE * scaleFactor_MUON * scaleFactor_LepTRIGGER * scaleFactor_PILEUP * mcWeight * (XSection * {} / SumWeights)".format(lumi))
        #df[p] = df[p].Define("category","getMCMetaData(channelNumber).second")
        #df[p] = df[p].Define("physdescr","getMCMetaData(channelNumber).first")

topX
Looking at topX
Zjetsincl
Looking at Zjetsincl
Diboson
Looking at Diboson
Gmumu
Looking at Gmumu
ZPrimeee
Looking at ZPrimeee
dmV_Zll
Looking at dmV_Zll
RS_G_ZZ
Looking at RS_G_ZZ
Higgs
Looking at Higgs
ZPrimemumu
Looking at ZPrimemumu
ZPrimett
Looking at ZPrimett
Wjetsincl
Looking at Wjetsincl
Gee
Looking at Gee
Zjets
Looking at Zjets
Wjets
Looking at Wjets
SUSYC1C1
Looking at SUSYC1C1
GG_ttn1
Looking at GG_ttn1
TT_directTT
Looking at TT_directTT
SUSYC1N2
Looking at SUSYC1N2
SUSYSlepSlep
Looking at SUSYSlepSlep
mc_999999
Looking at mc_999999
ttbar
Looking at ttbar
singleTop
Looking at singleTop
data
Looking at data
mc
Looking at mc
CPU times: user 519 ms, sys: 58.6 ms, total: 578 ms
Wall time: 673 ms
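
The lepton selection in the cell above builds a per-lepton boolean mask from two relative isolation cuts, counts the leptons that pass, and keeps events with exactly two of them. The pure-Python sketch below mirrors that logic for a single event. The function name `good_lepton_mask` and the example values are hypothetical, and the values are assumed to be in MeV.

```python
def good_lepton_mask(lep_pt, lep_etcone20, lep_ptcone30, max_rel_iso=0.15):
    """Per-lepton flags: True when both relative isolations are below the cut."""
    return [
        (et / pt < max_rel_iso) and (ptc / pt < max_rel_iso)
        for pt, et, ptc in zip(lep_pt, lep_etcone20, lep_ptcone30)
    ]

# Hypothetical event with three leptons; the third fails the track isolation cut
mask = good_lepton_mask(
    lep_pt=[50000.0, 30000.0, 20000.0],
    lep_etcone20=[1000.0, -200.0, 500.0],
    lep_ptcone30=[0.0, 1500.0, 8000.0],
)
n_good = sum(mask)          # counterpart of Sum(goodLEP)
keep_event = n_good == 2    # counterpart of Filter("n_goodLEP == 2")
print(mask, n_good, keep_event)
```

In the RDataFrame version the same mask is then used to index the lepton columns, as in `lep_pt[goodLEP]`.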


In [15]:
all_cols = []
for c in df["Zjets"].GetColumnNames():
    all_cols.append(str(c))
    print(c)

goodLEP
n_goodLEP
isOS
isSF
lep1_flav
lep1_pt
lep1_eta
lep1_phi
lep1_E
lep1_ptcone30
lep1_etcone20
lep1_trackd0pvunbiased
lep1_tracksigd0pvunbiased
lep1_isTightID
lep1_z0
lep2_flav
lep2_pt
lep2_eta
lep2_phi
lep2_E
lep2_ptcone30
lep2_etcone20
lep2_trackd0pvunbiased
lep2_tracksigd0pvunbiased
lep2_isTightID
lep2_z0
mll
mt2
njet20
njet60
nbjet60
nbjet70
nbjet77
nbjet85
costhstar
weight
runNumber
eventNumber
channelNumber
mcWeight
scaleFactor_PILEUP
scaleFactor_ELE
scaleFactor_MUON
scaleFactor_PHOTON
scaleFactor_TAU
scaleFactor_BTAG
scaleFactor_LepTRIGGER
scaleFactor_PhotonTRIGGER
trigE
trigM
trigP
lep_n
lep_truthMatched
lep_trigMatched
lep_pt
lep_eta
lep_phi
lep_E
lep_z0
lep_charge
lep_type
lep_isTightID
lep_ptcone30
lep_etcone20
lep_trackd0pvunbiased
lep_tracksigd0pvunbiased
met_et
met_phi
jet_n
jet_pt
jet_eta
jet_phi
jet_E
jet_jvt
jet_trueflav
jet_truthMatched
jet_MV2c10
photon_n
photon_truthMatched
photon_trigMatched
photon_pt
photon_eta
photon_phi
photon_E
photon_isTightID
photon_ptcon

In [16]:
all_cols = ['njet20','njet60','nbjet60','nbjet70','nbjet77','nbjet85',
            #'category','physdescr',
            'isOS','isSF','mll','mt2','met_et', 'met_phi',
            'lep1_flav',
            'lep1_pt',
            'lep1_eta',
            'lep1_phi',
            'lep1_E',
            'lep1_ptcone30',
            'lep1_etcone20',
            'lep1_trackd0pvunbiased',
            'lep1_tracksigd0pvunbiased',
            'lep1_isTightID',
            'lep1_z0',
            'lep2_flav',
            'lep2_pt',
            'lep2_eta',
            'lep2_phi',
            'lep2_E',
            'lep2_ptcone30',
            'lep2_etcone20',
            'lep2_trackd0pvunbiased',
            'lep2_tracksigd0pvunbiased',
            'lep2_isTightID',
            'lep2_z0',
            'channelNumber',
            'costhstar']
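
The per-lepton entries in the list above follow the same naming pattern as the columns defined in the event loop, so the list can also be generated rather than typed out. The following sketch shows one way to do that; the names `event_cols`, `lepton_cols` and `generated_cols` are hypothetical and not used elsewhere in the notebook.

```python
nlep = 2
lepv = ["lep_pt", "lep_eta", "lep_phi", "lep_E",
        "lep_ptcone30", "lep_etcone20",
        "lep_trackd0pvunbiased", "lep_tracksigd0pvunbiased",
        "lep_isTightID", "lep_z0"]

# Event-level columns, in the same order as the hand-written list
event_cols = ["njet20", "njet60", "nbjet60", "nbjet70", "nbjet77", "nbjet85",
              "isOS", "isSF", "mll", "mt2", "met_et", "met_phi"]

# Per-lepton columns: lep1_flav, lep1_pt, ..., lep2_z0
lepton_cols = []
for i in range(1, nlep + 1):
    lepton_cols.append("lep%i_flav" % i)
    lepton_cols += ["lep%i_%s" % (i, v.replace("lep_", "", 1)) for v in lepv]

generated_cols = event_cols + lepton_cols + ["channelNumber", "costhstar"]
print(len(generated_cols))
```

The generated list has 36 entries, the same number of columns that `df.shape` reports further down.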

In [17]:
#df["Gmumu"].Display("category").Print()

In [18]:
#df['Gmumu'].Snapshot("thinned_tree", "out3.root", ROOT.DropColumns(all_cols));

In [19]:
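# Materialise the selected columns as a dict of NumPy arrays
arrays = df["mc"].AsNumpy(all_cols)

In [20]:
import pandas as pd
# Note: from here on `df` is a pandas DataFrame, no longer the dict of RDataFrames
df = pd.DataFrame(data=arrays)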

In [21]:
df.shape

(130848028, 36)

In [22]:
df['category'] = df.apply(lambda row : getCategory(row['channelNumber']), axis = 1)
df['physdescr'] = df.apply(lambda row : getDescr(row['channelNumber']), axis = 1)
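
Applying a Python function row by row is slow on a frame with this many rows. A possible alternative, sketched here on a toy frame, is to map the `channelNumber` column through the lookup dictionaries with `Series.map`. The dictionaries `mc_cat` and `mc_descr` and their contents are hypothetical stand-ins for `MCcat` and `MCdescr`.

```python
import pandas as pd

# Hypothetical lookup tables standing in for MCcat and MCdescr
mc_cat = {301333: "ZPrimett", 410000: "ttbar"}
mc_descr = {301333: "ZPrime3000_tt", 410000: "ttbar_lep"}

toy = pd.DataFrame({"channelNumber": [301333, 410000, 301333]})
# Series.map looks up every value in the dictionary in one vectorised call
toy["category"] = toy["channelNumber"].map(mc_cat)
toy["physdescr"] = toy["channelNumber"].map(mc_descr)
print(toy)
```

One difference in behaviour: a DSID missing from the dictionary becomes `NaN` with `map`, whereas the row-wise lookup raises a `KeyError`.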

In [23]:
df.head()

Unnamed: 0,njet20,njet60,nbjet60,nbjet70,nbjet77,nbjet85,isOS,isSF,mll,mt2,...,lep2_ptcone30,lep2_etcone20,lep2_trackd0pvunbiased,lep2_tracksigd0pvunbiased,lep2_isTightID,lep2_z0,channelNumber,costhstar,category,physdescr
0,2,2,0,0,1,1,1,1,260112.453125,274073.0625,...,0.0,-460.782593,0.010055,1.36722,1.0,-0.023045,301333,0.842731,ZPrimett,ZPrime3000_tt
1,3,2,2,2,2,2,1,0,416591.125,250325.65625,...,0.0,-83.693237,0.018806,1.681733,1.0,0.07151,301333,0.699803,ZPrimett,ZPrime3000_tt
2,4,2,0,0,0,0,1,0,808462.0625,586403.375,...,0.0,-723.799927,0.025268,2.887424,1.0,0.045441,301333,1.062639,ZPrimett,ZPrime3000_tt
3,4,2,1,2,2,2,1,1,240576.65625,326527.9375,...,0.0,90.810547,0.004849,0.640823,1.0,0.030815,301333,0.428338,ZPrimett,ZPrime3000_tt
4,4,4,0,1,1,1,0,1,387675.84375,885130.4375,...,0.0,4380.707031,0.019385,1.494093,1.0,-0.014116,301333,0.646013,ZPrimett,ZPrime3000_tt


In [24]:
df.to_hdf("/storage/shared/data/2lep_df_forML_bkg_signal_fromRDF.hdf5", key="mini")