# Dilepton analysis (background only)  


This is an example of a simple dilepton analysis, with special emphasis on how to treat the MC background samples, using the ATLAS Open Data dataset. We will run over the full set of background samples, pick out final states with two leptons, scale the background to the correct cross section and luminosity, and finally plot the resulting histograms. 

**Notice:** This is *only an example* of how to do this. Feel free to be creative, and to find better and/or more elegant ways of doing the various steps! 

In [1]:
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>
#include <stdio.h>

In [2]:
%jsroot on

## 1. Reading the dataset

In [3]:
TChain *dataset = new TChain("mini"); 

A list of all the background samples and their IDs can be found in **Background_samples.txt**. We read that list, and add all the samples to the TChain. We also (for later convenience) make a vector containing the dataset IDs. 

In [4]:
TString sample; 
TString path; 
vector<Int_t> dataset_IDs;
Int_t DSID;

In [5]:
ifstream infile("Background_samples.txt");

In [6]:
infile.clear();
infile.seekg(0, ios::beg);  // Start at the beginning of the file
dataset->Reset(); // Reset the TChain (if necessary)  
while(infile >> sample >> DSID){
    path = "DataSamples/MC/"+sample; // Specify path to the samples 
    dataset->Add(path);  
    dataset_IDs.push_back(DSID);
}
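
A loop of the form `while(infile >> sample >> DSID)` ends silently at the first malformed line. As a minimal sketch, assuming the list really has two whitespace-separated columns (file name and dataset ID) per line, a line-by-line reader can skip bad lines instead. The function name `read_sample_list` is hypothetical and not part of the original notebook.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse whitespace-separated "<file name> <dataset ID>" pairs, one per line.
// The two-column layout is an assumption based on the read loop above.
std::vector<std::pair<std::string, int>> read_sample_list(std::istream &in) {
    std::vector<std::pair<std::string, int>> samples;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string file_name;
        int dsid = 0;
        // Skip blank or malformed lines instead of ending the loop.
        if (fields >> file_name >> dsid) {
            samples.emplace_back(file_name, dsid);
        }
    }
    return samples;
}
```

Each returned pair could then be passed to `dataset->Add(...)` and `dataset_IDs.push_back(...)` in the same way as in the cell above.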

Next we define the variables we want to include in the analysis, and link them to branches in the TTree. A few things to notice at this point: 
-  In this example we will only study events with two leptons, so the array variables only need two elements. 
-  The variables are here given names corresponding to the branches in the TTree. This is not necessary, so if you want to give them other names you are free to do so. 
-  The variable called "channelNumber" is the same as we have called "dataset ID" above. These terms are used interchangeably. 

In [7]:
Int_t lep_n, lep_charge[2], lep_type[2], channelNumber; 
Float_t lep_pt[2], lep_E[2], lep_phi[2], lep_eta[2], met_et; 

In [8]:
dataset->SetBranchAddress("lep_n",      &lep_n);
dataset->SetBranchAddress("lep_charge", &lep_charge);
dataset->SetBranchAddress("lep_type",   &lep_type);
dataset->SetBranchAddress("lep_pt",     &lep_pt);
dataset->SetBranchAddress("lep_eta",    &lep_eta);
dataset->SetBranchAddress("lep_phi",    &lep_phi);
dataset->SetBranchAddress("lep_E",      &lep_E);
dataset->SetBranchAddress("met_et",     &met_et); 
dataset->SetBranchAddress("channelNumber", &channelNumber);

## 2. Making (a lot of) histograms

Now that we have read our dataset we want to start analyzing the data. To do so we need to put the data into histograms. For reasons that will become clear later in the analysis we must (for each variable) make one histogram per dataset ID. (We have 31 background samples, so if we want to study 10 variables we have to make 310 histograms!) A very elegant way of dealing with all these histograms is by using [map](http://www.cplusplus.com/reference/map/map/)s (the C$++$ equivalent of Python dictionaries). Below we define one map for each variable. Here the *key values* are the dataset IDs, while the *mapped values* are the histograms.   

In [9]:
map<Int_t, TH1*> hist_mll; 
map<Int_t, TH1*> hist_lep_pt; 
map<Int_t, TH1*> hist_met;

In [10]:
for(const auto & i:dataset_IDs){
    hist_mll[i] = new TH1F(); 
    hist_lep_pt[i] = new TH1F(); 
    hist_met[i] = new TH1F();
}

In [11]:
for(const auto & i:dataset_IDs){
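    // Give each histogram a unique name (useful if they are later written to a file)
    hist_mll[i]->SetNameTitle(Form("hist_mll_%d", i), "Invariant mass"); 
    hist_lep_pt[i]->SetNameTitle(Form("hist_lep_pt_%d", i), "Lepton pT"); 
    hist_met[i]->SetNameTitle(Form("hist_met_%d", i), "Missing ET");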
    hist_mll[i]->SetBins(20,0,500); 
    hist_lep_pt[i]->SetBins(20,0,1000);
    hist_met[i]->SetBins(20,0,500); 
}
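
One caveat with maps is that `operator[]` inserts a default value (here a null pointer) when the key does not exist, so looking up an unknown dataset ID and then calling a method on the result would crash. The sketch below shows a lookup that never inserts. The helper name `find_or_null` is hypothetical; it only assumes that the keys are integers and the mapped values are pointers, as in the maps above.

```cpp
#include <map>

// Return the pointer stored under `key`, or nullptr if the key is absent.
// Unlike operator[], find() never inserts a new element into the map.
template <typename T>
T *find_or_null(const std::map<int, T *> &table, int key) {
    auto it = table.find(key);
    return it == table.end() ? nullptr : it->second;
}
```

In the notebook this could be used as `TH1 *h = find_or_null(hist_mll, channelNumber);` followed by a check that `h` is not null before filling.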

### 2.1 Fill the histograms 
We can now loop over all events in our dataset, implement desired cuts, and fill the histograms we created above. In this example we choose only events containing exactly two same-flavour leptons with opposite charge (i.e. $e^+e^-$ or $\mu^+\mu^-$). 
Before starting the loop we extract the total number of entries (events) in the TChain. We also make [TLorentzVector](https://root.cern.ch/doc/master/classTLorentzVector.html)s, which are very practical for handling the kinematics of the leptons, e.g. calculating the invariant mass of the two leptons. 
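
To make the invariant mass calculation concrete, here is a sketch of the arithmetic in plain C++, using $p_x = p_T\cos\phi$, $p_y = p_T\sin\phi$, $p_z = p_T\sinh\eta$ and $m^2 = E^2 - |\vec{p}|^2$ for the summed four-vector. It is an illustration of the formula, not ROOT's implementation, and the function name `invariant_mass` is hypothetical.

```cpp
#include <cmath>

// Invariant mass of two particles given (pT, eta, phi, E),
// with all momenta and energies in the same unit.
double invariant_mass(double pt1, double eta1, double phi1, double e1,
                      double pt2, double eta2, double phi2, double e2) {
    // Cartesian momentum components of the summed four-vector
    const double px = pt1 * std::cos(phi1) + pt2 * std::cos(phi2);
    const double py = pt1 * std::sin(phi1) + pt2 * std::sin(phi2);
    const double pz = pt1 * std::sinh(eta1) + pt2 * std::sinh(eta2);
    const double e = e1 + e2;
    const double m2 = e * e - (px * px + py * py + pz * pz);
    // Rounding can make m2 slightly negative; clamp it before the square root
    return m2 > 0.0 ? std::sqrt(m2) : 0.0;
}
```

For two massless, back-to-back leptons with $p_T = E = 45$ GeV and $\eta = 0$ this gives 90 GeV, close to the Z mass.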

In [12]:
int nentries = (Int_t)dataset->GetEntries(); 

In [13]:
TLorentzVector l1, l2, dileptons; 

In [14]:
for(const auto & i:dataset_IDs){ // Reset histograms if you have filled them before 
    hist_mll[i]->Reset(); 
    hist_lep_pt[i]->Reset(); 
    hist_met[i]->Reset();
}

In [None]:
for (int i = 0; i < nentries ; i++){
    
    if( i%1000000 == 0 && i>0){ cout << i/1000000 << " million events processed" << endl;}
    dataset->GetEntry(i); // We "pull out" the i'th entry in the chain. The variables are now 
                          // available through the names we have given them. 
    
    // Cut #1: Require (exactly) 2 leptons
    if(lep_n == 2)
    {
        // Cut #2: Require opposite charge
        if(lep_charge[0] != lep_charge[1])
        {
            // Cut #3: Require same flavour (2 electrons or 2 muons)
            if(lep_type[0] == lep_type[1])
            {
                l1.SetPtEtaPhiE(lep_pt[0]/1000., lep_eta[0], lep_phi[0], lep_E[0]/1000.);
                l2.SetPtEtaPhiE(lep_pt[1]/1000., lep_eta[1], lep_phi[1], lep_E[1]/1000.);
                // Variables are stored in the TTree with unit MeV, so we need to divide by 1000 
                // to get GeV, which is a more practical unit. 
                
                dileptons = l1 + l2;   
    
                hist_mll[channelNumber]->Fill(dileptons.M());
                hist_lep_pt[channelNumber]->Fill(l1.Pt());
                hist_lep_pt[channelNumber]->Fill(l2.Pt()); 
                hist_met[channelNumber]->Fill(met_et/1000);   
                
            }
        }
    }      
}
cout << "Done!" << endl; 
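
The three nested cuts can also be collected in a single predicate, which keeps the event loop flat and makes the selection easy to test on its own. This is only a sketch: the function name is hypothetical, and it assumes that `lep_type` holds the absolute PDG ID of each lepton (11 for electrons, 13 for muons).

```cpp
// True if the event has exactly two leptons of the same flavour and opposite charge.
// Assumes type holds the absolute PDG ID (11 for electrons, 13 for muons) and that
// charge and type each point to at least two entries.
bool passes_dilepton_selection(int lep_n, const int *charge, const int *type) {
    if (lep_n != 2) return false;              // Cut #1: exactly two leptons
    if (charge[0] == charge[1]) return false;  // Cut #2: opposite charge
    return type[0] == type[1];                 // Cut #3: same flavour
}
```

Inside the loop the selection would then read `if(!passes_dilepton_selection(lep_n, lep_charge, lep_type)){ continue; }`.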

We have now done the "heavy lifting" of an analysis, i.e. looping through all the events. Usually in such an analysis we create new ROOT files where we store the histograms we made above, and then analyse the output in a separate program/script. The advantage of doing this is that you can do the rest of the analysis in another language, e.g. Python, since we are done with the part that requires the speed of C$++$. If you want to write ROOT files you can check out the [TFile](https://root.cern.ch/doc/master/classTFile.html) class reference. In this example we will however carry on in C$++$. 

## 3. Scale and classify the histograms

Before we are ready to make plots we need to do some further processing of the histograms we made above. The information necessary for doing the two steps below is found in the file **Infofile.txt**.   
1. We need to **scale** the histograms to the right cross section and luminosity. Why? When making the MC samples a certain number of events is simulated, which will usually not correspond to the number of events in our data. The expected number of events from a certain kind of process is given by $N=\sigma L$, where $\sigma$ is the cross section and $L$ is the integrated luminosity. Therefore we need to scale each histogram by a scale factor <br> <br>
$$sf = \frac{N}{N_{MC}} = \frac{ \sigma L }{N_{MC}},$$ <br>  where $N_{MC}$ is the number of generated MC events.  <br> <br>
2. We also need to **classify** the background processes into different categories. This is necessary when we eventually want to make the characteristic colorful background plots you might have seen before.  
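
As a small worked example of the scale factor, the sketch below evaluates $sf = \sigma L / N_{MC}$ directly. The function name and the numbers in the usage note are made up for illustration and do not come from **Infofile.txt**.

```cpp
#include <stdexcept>

// sf = sigma * L / N_MC, with sigma in pb and L in pb^-1.
double scale_factor(double xsec_pb, double lumi_inv_pb, double n_mc) {
    if (n_mc <= 0.0) throw std::invalid_argument("n_mc must be positive");
    return xsec_pb * lumi_inv_pb / n_mc;
}
```

With an illustrative cross section of 2 pb, a luminosity of 1000 pb$^{-1}$ and 4000 generated events, we expect 2000 events, so each simulated event counts as `scale_factor(2.0, 1000.0, 4000.0)`, i.e. 0.5 expected events.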

### 3.1 Make new histograms 
Maybe a bit depressingly we have to make a set of new histograms, this time corresponding to the different background categories, instead of the dataset IDs. Notice that these new histograms are made in a very similar way as above, i.e. with the same range and binning. 

In [16]:
map<TString, TH1*> H_mll; 
map<TString, TH1*> H_lep_pt; 
map<TString, TH1*> H_met;

In [17]:
vector<TString> Backgrounds; 

In [18]:
Backgrounds = {"Higgs","Diboson", "Wjets", "DY", "singleTop", "ttbar", "Zjets"}; 

In [19]:
for(const auto i:Backgrounds){
    H_mll[i] = new TH1F(); 
    H_lep_pt[i] = new TH1F(); 
    H_met[i] = new TH1F(); 
}

In [20]:
for(const auto & i:Backgrounds){
    H_mll[i]->Reset(); 
    H_lep_pt[i]->Reset(); 
    H_met[i]->Reset();
}

In [21]:
for(const auto & i:Backgrounds){
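    // Give each histogram a unique name (useful if they are later written to a file)
    H_mll[i]->SetNameTitle(Form("H_mll_%s", i.Data()), "Invariant mass"); 
    H_lep_pt[i]->SetNameTitle(Form("H_lep_pt_%s", i.Data()), "Lepton pT"); 
    H_met[i]->SetNameTitle(Form("H_met_%s", i.Data()), "Missing ET");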
    H_mll[i]->SetBins(20,0,500); 
    H_lep_pt[i]->SetBins(20,0,1000);
    H_met[i]->SetBins(20,0,500); 
}

### 3.2 Scale and add histograms 
Now we read our info file, scale all (old) histograms, and then add them to the new histograms we just defined.  

In [22]:
ifstream info("Infofile.txt"); 
TString process; 
TString type; 
Int_t dsid; 
Int_t n_events; 
Double_t red_eff; 
Double_t sum_w; 
Double_t x_sec; 
Double_t L = 1000.6; // Integrated luminosity (pb^-1)
Double_t SF; 

In [23]:
info.clear();
info.seekg(0, ios::beg);  
while(info >> process >> type >> dsid >> n_events >> red_eff >> sum_w >> x_sec){
    
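    // Skip entries for which no histogram was booked above
    if(hist_mll.count(dsid) == 0 || H_mll.count(type) == 0){ continue; }
    
    SF = x_sec*L/(n_events*red_eff); // sf = sigma*L/N_MC, with N_MC taken here as n_events*red_eff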
    
    hist_mll[dsid]->Scale(SF); 
    hist_lep_pt[dsid]->Scale(SF); 
    hist_met[dsid]->Scale(SF); 
    
    H_mll[type]->Add(hist_mll[dsid]); 
    H_lep_pt[type]->Add(hist_lep_pt[dsid]); 
    H_met[type]->Add(hist_met[dsid]); 
    
}
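
The scale-and-add step is essentially a weighted sum grouped by category. The sketch below shows the same bookkeeping for total yields instead of full histograms, using only the standard library. The struct `Sample` and the function `yields_by_category` are hypothetical names introduced for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

struct Sample {
    std::string category;  // e.g. "ttbar" or "Zjets"
    double raw_yield;      // unscaled number of selected MC events
    double scale_factor;   // sigma * L / N_MC for this sample
};

// Sum the scaled yields of all samples that belong to the same category.
std::map<std::string, double> yields_by_category(const std::vector<Sample> &samples) {
    std::map<std::string, double> totals;
    for (const auto &s : samples) {
        // operator[] creates a zero-initialised entry the first time a category is seen
        totals[s.category] += s.raw_yield * s.scale_factor;
    }
    return totals;
}
```

Several dataset IDs typically map to one category, which is why the per-sample histograms have to be scaled individually before they are added together.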

### 3.3 Color the histograms 
Make yet another map, this time containing the colors you want the backgrounds to have, and then set the colors of your histograms. Note that colors are defined by integers in ROOT. If you are not happy with the colors chosen below you can have a look at the [TColor](https://root.cern.ch/doc/master/classTColor.html) class reference for more options. 

In [24]:
map<TString, Int_t> colors; 

In [25]:
colors["Diboson"] = kGreen; 
colors["Zjets"] = kYellow; 
colors["ttbar"] = kRed;
colors["singleTop"] = kBlue-7; 
colors["Wjets"] = kBlue+3; 
colors["DY"] = kOrange+1; 
colors["Higgs"] = kMagenta; 

In [26]:
for(const auto h:Backgrounds){
    H_mll[h]->SetFillColor(colors[h]); 
    H_met[h]->SetFillColor(colors[h]);
    H_lep_pt[h]->SetFillColor(colors[h]);
    
    H_mll[h]->SetLineColor(colors[h]); 
    H_met[h]->SetLineColor(colors[h]);
    H_lep_pt[h]->SetLineColor(colors[h]);
}

## 4. Stack and plot the histograms

Finally we have arrived at the part where we can plot the results of all the work done above. For each variable we need to stack the backgrounds on top of each other, which is done by using the [THStack](https://root.cern.ch/doc/master/classTHStack.html) class. In the example below we do this for two variables: invariant mass and missing $E_T$.   

In [27]:
THStack *stack_mll = new THStack("Invariant mass", "");
THStack *stack_met = new THStack("Missing ET", ""); 

In [28]:
for(const auto h:Backgrounds){
    stack_mll->RecursiveRemove(H_mll[h]); // Remove previously stacked histograms  
    stack_met->RecursiveRemove(H_met[h]);
    stack_mll->Add(H_mll[h]); 
    stack_met->Add(H_met[h]);
}    

Now we make a legend (see [TLegend](https://root.cern.ch/doc/master/classTLegend.html)), and add the different backgrounds. Next we make a canvas (see [TCanvas](https://root.cern.ch/doc/master/classTCanvas.html)), which is always necessary when we want to make a plot. Then we draw the stack and the legend, and display them by drawing the canvas. We can also specify axis labels and a bunch of other stuff. 

In [29]:
gStyle->SetLegendBorderSize(0); // Remove (default) border around legend 
TLegend *leg = new TLegend(0.65, 0.60, 0.9, 0.85); 

In [30]:
leg->Clear();
for(const auto i:Backgrounds){
    leg->AddEntry(H_mll[i], i, "f");  // Add your histograms to the legend
} 

In [31]:
TCanvas *C = new TCanvas("c", "c", 600, 600);

In [32]:
gPad->SetLogy(); // Set logarithmic y-axis

In [33]:
stack_mll->Draw("hist"); 
stack_mll->GetYaxis()->SetTitle("# events");
stack_mll->GetYaxis()->SetTitleOffset(1.3); 
stack_mll->GetXaxis()->SetTitle("m_{ll} (GeV)");
stack_mll->GetXaxis()->SetTitleOffset(1.3);
leg->Draw();
C->Draw();

In [34]:
stack_met->Draw("hist"); 
stack_met->GetYaxis()->SetTitle("# events");
stack_met->GetYaxis()->SetTitleOffset(1.3); 
stack_met->GetXaxis()->SetTitle("E_{T}^{miss} (GeV)");
stack_met->GetXaxis()->SetTitleOffset(1.3);
leg->Draw();
C->Draw(); 

## 5. Further steps
Here are some further steps necessary to do a proper analysis. 

### 5.1 MC weights
In addition to scaling of the MC samples we also need to "weight" each individual event. Various weights can be found as variables in the TTree: 
> 
mcWeight <br>
scaleFactor_PILEUP <br>
scaleFactor_ELE <br>
scaleFactor_MUON <br>
scaleFactor_BTAG <br>
scaleFactor_TRIGGER <br>
scaleFactor_JVFSF <br>
scaleFactor_ZVERTEX <br>

These should be multiplied together and applied when filling the histograms, e.g. something like this: 
> 
*histogram->Fill(variable, weight);* 
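
A minimal sketch of combining the weights is shown below. The helper name `combined_weight` is hypothetical, and which of the scale factors are relevant depends on the objects used in the analysis, so the selection of weights passed in is up to you.

```cpp
#include <vector>

// Multiply per-event weights together; an empty list gives a weight of 1.
double combined_weight(const std::vector<double> &weights) {
    double w = 1.0;
    for (double x : weights) w *= x;
    return w;
}
```

After linking the corresponding branches with `SetBranchAddress`, the result of e.g. `combined_weight({scaleFactor_PILEUP, scaleFactor_ELE})` could be passed as the second argument to `Fill`.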

### 5.2 Add data
A very important part of the analysis is of course to also add the real data. This is however slightly less complicated than the background, since you only need one histogram per variable. (We don't know which processes the real data events come from, so we can't classify them like we did for the MC background.) For the data we do however need a certain "quality control" that we don't need for the MC. Variables typically related to this are <br>
>trigE <br>
trigM <br>
passGRL <br>
hasGoodVertex <br> 


These are all binary variables (true (1) or false (0)). The first two indicate whether the event was triggered by electron or muon triggers. We only want triggered events, so we should require at least one of these to be true. The last two variables indicate whether the event (or run) is on the Good Runs List (GRL) and whether the event has a good vertex (i.e. a vertex where particles originate from the same point). The GRL indicates whether or not ATLAS and the LHC were operating properly at the time the event was recorded. Both of these variables need to be true. <br>
<br>
The data should be plotted together with the MC background. If the MC weighting and scaling are done correctly the data and MC should match quite well, given that there is no "new physics" in the data. 
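
The quality requirements described above translate into a single boolean expression. The sketch below uses the branch names as parameter names; the function name `passes_data_quality` is hypothetical.

```cpp
// Keep a data event only if it fired an electron or muon trigger,
// is on the Good Runs List, and has a good vertex.
bool passes_data_quality(bool trigE, bool trigM, bool passGRL, bool hasGoodVertex) {
    return (trigE || trigM) && passGRL && hasGoodVertex;
}
```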

### 5.3 Add signal samples
We also need to study simulations of the signal (process) we are looking for. This is important in order to know how the signal process behaves for different variables, and hence know which part(s) of the parameter space we should study. Typically we don't know exactly what the signal would look like, since the masses of new particles (e.g. Z', W', graviton or sparticles) are unknown. Therefore we usually have several signal samples for a variety of different scenarios. 
These samples should be treated in a somewhat similar way as the MC background (weights and scale factors). 

### 5.4 Introduce further cuts (signal regions)
The cuts introduced in the loop above are merely a first selection of events, and we would never be able to see anything interesting with these cuts only (except checking that data and MC agree). Therefore we need to add more cuts. A cut is usually defined as a requirement on a variable. Examples of such cuts could be:
> 
$m_{\ell\ell}<80\:GeV$ or $m_{\ell\ell}>100\:GeV$ (cut away the Z-peak) <br>
<br>
$E_T^{miss}>150\:GeV$ (require high missing energy) 

A set of cuts that make us sensitive to a signal process is called a **signal region**. A lot of effort is usually spent on optimizing such regions, in order to be as sensitive as possible to the signal, so that we can either discover or exclude the signal process.   
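
As an illustration, the two example cuts can be combined into one signal-region predicate. This is a sketch under the assumption that "cutting away the Z-peak" means rejecting events with an invariant mass between 80 and 100 GeV. The thresholds are the illustrative values from the text, not optimised cuts, and the function name is hypothetical.

```cpp
// Example signal region: Z veto on m_ll plus a missing-ET requirement (all in GeV).
bool in_signal_region(double mll_gev, double met_gev) {
    const bool off_z_peak = (mll_gev < 80.0) || (mll_gev > 100.0);
    const bool high_met = met_gev > 150.0;
    return off_z_peak && high_met;
}
```

In the event loop this could be evaluated as `in_signal_region(dileptons.M(), met_et/1000.)` before filling the histograms.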