# Jet Substructure

Boosted jets contain the hadronic decay products of a heavy particle produced with high momentum, so dedicated tools have been developed to study their internal structure. This topic is usually called jet substructure.

Jet substructure tools can be divided into three main categories:
 * **grooming algorithms** attempt to reduce the impact of *soft* contributions to the clustering sequence by imposing additional criteria. Examples of these algorithms are soft drop, trimming, and pruning.
 * **substructure variables** are observables that try to quantify how many cores or prongs can be identified within the boosted jet. Examples of these variables are N-subjettiness and energy correlation functions.
 * **taggers** are more sophisticated algorithms that attempt to identify the origin of the boosted jet. Current taggers are based on machine-learning techniques that use as much information as possible in order to efficiently identify boosted W/Z/Higgs/top jets. Examples of these taggers in CMS are deepAK8/ParticleNet and deepDoubleB.
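
As an illustration of the kind of criterion a grooming algorithm applies, here is a minimal sketch of the soft drop condition evaluated at a single declustering step. The function name `passes_soft_drop` and its default parameters (`z_cut=0.1`, `beta=0.0`, `r0=0.8`) are assumptions chosen for this example and are not taken from the tutorial script; a full groomer would also recluster the jet and iterate over the clustering history, which is not shown here.

```python
def passes_soft_drop(pt1, pt2, dr12, z_cut=0.1, beta=0.0, r0=0.8):
    """Soft drop condition for one splitting into two subjets.

    pt1, pt2 : transverse momenta of the two subjets
    dr12     : angular distance between them in the (eta, phi) plane
    The splitting is kept if the momentum fraction of the softer subjet
    exceeds z_cut * (dr12 / r0) ** beta; otherwise the softer branch
    would be dropped and the declustering would continue.
    """
    z = min(pt1, pt2) / (pt1 + pt2)
    return z > z_cut * (dr12 / r0) ** beta


# A fairly symmetric splitting (z = 0.4) survives the grooming
print(passes_soft_drop(300.0, 200.0, 0.4))
# A very asymmetric splitting (z = 0.02) is treated as soft radiation
print(passes_soft_drop(490.0, 10.0, 0.4))
```

With `beta = 0` the angular factor is equal to one, so the condition reduces to a simple cut on the momentum fraction of the softer subjet.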
 
For further reading, here are several measurements and talks on jet substructure:
 * [Studies of jet mass in dijet and W/Z+jet events](http://arxiv.org/abs/1303.4811) (CMS).
 * [Jet mass and substructure of inclusive jets in sqrt(s) = 7 TeV pp collisions with the ATLAS experiment](http://arxiv.org/abs/1203.4606) (ATLAS).
 * [Theory slides](http://www.hri.res.in/~sangam/sangam18/talks/Marzani-2.pdf) 
 * [More theory slides](http://indico.hep.manchester.ac.uk/getFile.py/access?contribId=14&resId=0&materialId=slides&confId=4413)
 * [Talk from Phil Harris](https://web.pa.msu.edu/seminars/hep_seminars/abstracts/2018/Harris-HEPSeminar-Slides-4172018.pdf) on searching for boosted $W$ bosons.
 
In this part of the tutorial, we will compare different substructure algorithms as well as some commonly used substructure variables.

In [1]:
### RUN THIS CELL ONLY IF YOU ARE USING SWAN 
import os

##### REMEMBER TO MANUALLY COPY THE PROXY TO YOUR CERNBOX FOLDER AND TO MODIFY THE NEXT LINE
#os.environ['X509_USER_PROXY'] = '/eos/home-X/Y/x509up_u0000'
os.environ['X509_USER_PROXY'] = '/eos/home-a/algomez/tmp/x509up_u15148'
# Warn if the proxy file is not where we expect it
if not os.path.isfile(os.environ['X509_USER_PROXY']):
    print("Proxy file not found: " + os.environ['X509_USER_PROXY'])
os.environ['X509_CERT_DIR'] = '/cvmfs/cms.cern.ch/grid/etc/grid-security/certificates'
os.environ['X509_VOMS_DIR'] = '/cvmfs/cms.cern.ch/grid/etc/grid-security/vomsdir'

The script we will use, `$CMSSW_BASE/src/Analysis/JMEDAS/scripts/jmedas_make_histograms.py`, is a Python script that reads miniAOD information and fills the histograms that we will compare in the next steps of the tutorial.

The content of the script can be seen [here](https://github.com/cms-jet/JMEDAS/blob/DASSep2020/scripts/jmedas_make_histograms.py). Keep in mind that the next step can take a few minutes.

In [9]:
%%bash
python $CMSSW_BASE/src/Analysis/JMEDAS/scripts/jmedas_make_histograms.py --files=$CMSSW_BASE/src/Analysis/JMEDAS/data/MiniAODs/RunIIFall17MiniAODv2/rsgluon_ttbar_3000GeV.txt --outname=$CMSSW_BASE/src/Analysis/JMEDAS/notebooks/files/rsgluon_ttbar_3000GeV.root --maxevents=2000 --maxFiles 10 --maxjets=2 --correctJets Fall17_17Nov2017_V32_MC
python $CMSSW_BASE/src/Analysis/JMEDAS/scripts/jmedas_make_histograms.py --files=$CMSSW_BASE/src/Analysis/JMEDAS/data/MiniAODs/RunIIFall17MiniAODv2/ttjets.txt --outname=$CMSSW_BASE/src/Analysis/JMEDAS/notebooks/files/ttjets.root --maxevents=2000 --maxjets=6 --maxFiles 5

Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/RSGluonToTT_M-3000_TuneCP5_13TeV-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/100000/386070E5-12B5-E811-8186-0CC47A78A426.root
Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/RSGluonToTT_M-3000_TuneCP5_13TeV-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/100000/5A0FAD4B-12B5-E811-B5E4-FA163EC2D066.root
Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/RSGluonToTT_M-3000_TuneCP5_13TeV-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/100000/727ADE98-24B0-E811-B795-FA163E5CEAA8.root
Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/RSGluonToTT_M-3000_TuneCP5_13TeV-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/100000/8EDC1A54-12B1-E811-A827-FA163EFD3E7C.root
Added root://cmsxrootd.fnal.gov//store/mc/RunIIFall17MiniAODv2/RSGluonToTT_M-3000_TuneCP5_13TeV-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/1000

## Grooming and PU removal algorithms

Now, let's compare the jet masses for ungroomed, pruned, soft drop (SD), PUPPI, and SD+PUPPI:

In [1]:
import ROOT
%jsroot on    

ROOT.gStyle.SetOptStat(0)
ROOT.gStyle.SetOptTitle(0)
f = ROOT.TFile("$CMSSW_BASE/src/Analysis/JMEDAS/notebooks/files/rsgluon_ttbar_3000GeV.root")

h_mAK8   = f.Get("h_mAK8")
h_msoftdropAK8 = f.Get("h_msoftdropAK8")
h_mprunedAK8   = f.Get("h_mprunedAK8")
h_mpuppiAK8 = f.Get("h_mpuppiAK8")
h_mSDpuppiAK8 = f.Get("h_mSDpuppiAK8")

h_msoftdropAK8.SetLineColor(2)
h_mprunedAK8.SetLineColor(4)
h_mprunedAK8.SetLineStyle(3)
h_mpuppiAK8.SetLineColor(ROOT.kGreen+3)
h_mSDpuppiAK8.SetLineColor(ROOT.kOrange+3)

leg = ROOT.TLegend(0.6, 0.6, 0.88, 0.88)
leg.SetFillColor(0)
leg.SetBorderSize(0)
leg.AddEntry( h_mAK8, "Ungroomed", 'l')
leg.AddEntry( h_msoftdropAK8, "Soft Drop", 'l')
leg.AddEntry( h_mprunedAK8, "Pruned", 'l')
leg.AddEntry( h_mpuppiAK8, "PUPPI", 'l')
leg.AddEntry( h_mSDpuppiAK8, "PUPPI+SD", 'l')

c_mass = ROOT.TCanvas('c_mass', 'c_mass')
h_mprunedAK8.SetMaximum(500)
h_mprunedAK8.Draw()
#h_mprunedAK8.GetXaxis().SetRangeUser(0, 400)
#h_mprunedAK8.GetYaxis().SetRangeUser(0, 500)
h_msoftdropAK8.Draw("same") 
h_mAK8.Draw("same") 
h_mpuppiAK8.Draw("same")
h_mSDpuppiAK8.Draw("same")
#h_mprunedAK8.SetMaximum(h_msoftdropAK8.GetMaximum()*1.2)

leg.Draw()

c_mass.Draw()


Welcome to JupyROOT 6.14/09


Notice that the previous plot is interactive. Click on it or move your mouse to zoom in or out.

<details>
<summary>
    <font color='blue'>The histogram should look like this:</font>
</summary>
<img src="../files/ex5_rsg_jetmass.png" width=400px/>
</details>

Note that the histogram has two peaks. What do these correspond to? How do the algorithms affect the relative size of the two populations?

## Substructure variables

Now, let's compare the different substructure variables between two different samples. The next cell defines a function that builds comparison plots from the histograms you created in the previous steps.

In [87]:
canvas = {}
f1 = ROOT.TFile("$CMSSW_BASE/src/Analysis/JMEDAS/notebooks/files/ttjets.root")
f2 = ROOT.TFile("$CMSSW_BASE/src/Analysis/JMEDAS/notebooks/files/rsgluon_ttbar_3000GeV.root")
leg = ROOT.TLegend(0.15, 0.75, 0.4, 0.85)
leg.SetFillColor(0)
leg.SetBorderSize(0)

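def compareHistogram( variable ):
    """Overlay the unit-normalized h_<variable> histograms from ttjets (f1) and RS KK gluon (f2)."""

    h1 = f1.Get("h_"+variable)
    h2 = f2.Get("h_"+variable)

    # Merge bins in groups of 5 and scale each histogram to unit area
    h1.SetLineColor(1)
    h1.Rebin(5)
    h1.Scale( 1/ h1.Integral() )
    h2.SetLineColor(2)
    h2.Rebin(5)
    h2.Scale( 1/ h2.Integral() )
    # Leave 20% headroom above the taller of the two histograms
    h1.SetMaximum( 1.2*max([h1.GetMaximum(), h2.GetMaximum()]) )

    # The legend is shared between calls, so reset it first
    leg.Clear()
    leg.AddEntry( h1, "t#bar{t}", 'l')
    leg.AddEntry( h2, "RS KK Gluon", 'l')

    # Store the canvas in the global dictionary so it outlives this function
    canvas[variable] = ROOT.TCanvas(variable, variable)

    h1.DrawNormalized('hist')
    h2.DrawNormalized("hist same")

    leg.Draw()
    canvas[variable].Draw()
    del h1, h2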

Let's start with N-subjettiness ratios. The variable $\tau_N$ quantifies how well the jet constituents are described by $N$ or fewer prongs (cores): small values of $\tau_N$ indicate that the radiation is aligned with $N$ or fewer axes. The N-subjettiness variables themselves ($\tau_{N}$) are known not to provide good discrimination power, but their ratios do. A ratio $\tau_{MN} = \dfrac{\tau_M}{\tau_N}$ tests whether the jet is more compatible with an M-prong than with an N-prong structure. For instance, we expect 2 prongs for boosted jets originating from hadronic W decays, while we expect 1 prong for high-$p_{\mathrm{T}}$ jets from QCD multijet processes.
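
To make the definition concrete, here is a minimal sketch of $\tau_N = \frac{1}{d_0}\sum_k p_{\mathrm{T},k}\,\min(\Delta R_{1,k},\dots,\Delta R_{N,k})$ with $d_0 = R_0 \sum_k p_{\mathrm{T},k}$, evaluated on a toy jet. The helper names (`delta_r`, `tau_n`), the toy constituents, and the hand-picked subjet axes are assumptions made for this illustration; real implementations determine the axes with a dedicated procedure (for example exclusive clustering or a minimization), which is not reproduced here.

```python
import math


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the (eta, phi) plane, with delta-phi wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)


def tau_n(constituents, axes, r0=0.8):
    """N-subjettiness for a given set of N candidate subjet axes.

    constituents : list of (pt, eta, phi) tuples
    axes         : list of (eta, phi) tuples, one per candidate subjet
    r0           : jet radius parameter (0.8 assumed for AK8 jets)
    """
    d0 = r0 * sum(pt for pt, _eta, _phi in constituents)
    weighted = sum(
        pt * min(delta_r(eta, phi, ax_eta, ax_phi) for ax_eta, ax_phi in axes)
        for pt, eta, phi in constituents
    )
    return weighted / d0


# Toy two-prong jet: two hard constituents plus one soft one (hypothetical values)
toy_jet = [(100.0, 0.0, -0.3), (100.0, 0.0, 0.3), (5.0, 0.1, 0.3)]

tau1 = tau_n(toy_jet, axes=[(0.0, 0.0)])               # one axis at the jet centre
tau2 = tau_n(toy_jet, axes=[(0.0, -0.3), (0.0, 0.3)])  # one axis per hard prong
tau21 = tau2 / tau1
print(tau1, tau2, tau21)
```

For this two-prong toy jet, $\tau_2$ is much smaller than $\tau_1$, so $\tau_{21}$ is close to zero, which is the behaviour expected for jets from boosted W decays.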

Let's compare one of the most common N-subjettiness ratios, $\tau_{21}$:

In [82]:
compareHistogram( 'tau21AK8' )

What can you say about the two histograms? Is $\tau_{21}$ telling you something about the nature of the boosted jets selected?

Now let's compare $\tau_{32}$:

In [76]:
compareHistogram( 'tau32AK8' )

What can you say about the two histograms? Is $\tau_{32}$ telling you something about the nature of the boosted jets selected?

Another commonly used substructure variable is $N2$, a ratio of energy correlation functions. Like $\tau_{21}$, $N2$ tests whether the boosted jet is compatible with a 2-prong hypothesis. Now let's compare $N2$ and $N3$:

In [88]:
compareHistogram( 'ak8_N2_beta1' )

In [89]:
compareHistogram( 'ak8_N3_beta1' )

What can you say about the two histograms? Are $N2$ and $N3$ telling you something about the nature of the boosted jets selected?
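
For reference, the $N2$ variable plotted above can be sketched from the generalized energy correlation functions as $N_2^{(\beta)} = {}_2e_3^{(\beta)} / ({}_1e_2^{(\beta)})^2$, with momentum fractions $z_i = p_{\mathrm{T},i} / \sum_k p_{\mathrm{T},k}$ and pairwise distances $\Delta R_{ij}$. The function names and the toy jets below are assumptions made for this illustration, and the tutorial script may compute the variable differently (for example on groomed constituents).

```python
import math
from itertools import combinations


def delta_r(a, b):
    """Angular distance between two (pt, eta, phi) constituents."""
    dphi = (a[2] - b[2] + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(a[1] - b[1], dphi)


def n2(constituents, beta=1.0):
    """N2 = 2e3 / (1e2)^2 for a list of (pt, eta, phi) constituents.

    Needs at least two constituents at different positions, otherwise
    the denominator vanishes.
    """
    pt_sum = sum(c[0] for c in constituents)
    z = [c[0] / pt_sum for c in constituents]
    idx = range(len(constituents))

    # 1e2: sum over pairs of z_i z_j dR_ij^beta
    e2 = sum(
        z[i] * z[j] * delta_r(constituents[i], constituents[j]) ** beta
        for i, j in combinations(idx, 2)
    )

    # 2e3: sum over triplets, weighted by the smallest product of two pairwise angles
    e3 = 0.0
    for i, j, k in combinations(idx, 3):
        d_ij = delta_r(constituents[i], constituents[j]) ** beta
        d_ik = delta_r(constituents[i], constituents[k]) ** beta
        d_jk = delta_r(constituents[j], constituents[k]) ** beta
        e3 += z[i] * z[j] * z[k] * min(d_ij * d_ik, d_ij * d_jk, d_ik * d_jk)

    return e3 / e2 ** 2


# Hypothetical toy jets: two hard prongs plus soft radiation vs. three hard prongs
two_prong = [(100.0, 0.0, -0.3), (100.0, 0.0, 0.3), (1.0, 0.0, 0.0)]
three_prong = [(100.0, 0.0, -0.3), (100.0, 0.0, 0.3), (100.0, 0.5, 0.0)]
print(n2(two_prong), n2(three_prong))
```

In this toy example the two-prong jet gives a much smaller value than the three-prong jet, in line with the interpretation of $N2$ as a test of the 2-prong hypothesis.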

## $\rho$ parameter
A useful variable for massive, fat jets is the QCD scaling parameter $\rho$, defined as:

$\rho=m^2/(p_{\mathrm{T}}R)^2$.

(Sometimes $\rho$ is defined as the logarithm of this quantity.) A useful feature of this variable is that, while the QCD jet mass grows with $p_{\mathrm{T}}$ (the two quantities are strongly correlated), $\rho$ is much less correlated with $p_{\mathrm{T}}$.
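
As a quick numerical check of the definition, here is a minimal sketch that evaluates $\rho$ and its logarithm for a single jet. The function names, the default radius `R = 0.8` (assumed for AK8 jets), and the use of the natural logarithm are assumptions made for this example; which mass (groomed or ungroomed) and which convention the tutorial script uses is not shown here.

```python
import math


def rho(mass, pt, radius=0.8):
    """QCD scaling variable rho = m^2 / (pT * R)^2 (dimensionless)."""
    return mass ** 2 / (pt * radius) ** 2


def log_rho(mass, pt, radius=0.8):
    """Natural logarithm of rho, the alternative definition mentioned in the text."""
    return math.log(rho(mass, pt, radius))


# Hypothetical jet: m = 80 GeV, pT = 500 GeV, R = 0.8
print(rho(80.0, 500.0))      # (80 / 400)^2 = 0.04
print(log_rho(80.0, 500.0))  # ln(0.04), about -3.22
```

Doubling both the mass and the $p_{\mathrm{T}}$ leaves $\rho$ unchanged, which illustrates why it is less sensitive to the jet $p_{\mathrm{T}}$ than the mass itself.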


In [92]:
compareHistogram( 'rhoRatioAK8' )

In [93]:
compareHistogram( 'logrhoRatioAK8' )

In which cases do you think the $\rho$ variable can be used?