# $t\bar{t}$-analysis
This chapter covers the logic of the $t\bar{t}$ analysis. [2015 CMS Open Data](https://cms.cern/news/first-cms-open-data-lhc-run-2-released) is assumed as input data for concreteness (see the **Links for download** section). The aim is to give a basic algorithm for separating the t-quark pair production channel from the competing channels (single t-quark production and W+jets) and to obtain the t-quark mass peak as a histogram.

## Input
The input data consists of five sets of ROOT files. Each set is produced by MC simulation and represents one of five interaction channels: $t\bar{t}$-channel, *single top s*-channel, *single top t*-channel, *single top tW*-channel, *Wjets*-channel.
The ROOT file structure can be represented schematically:
\
\
<img src="images/input_structer.png" width="600" height="300">
\
This diagram shows only the fields required for the analysis: electron, muon, and jet. Each of these branches stores the number of particles ($N_e$, $N_{\mu}$, $N_{jet}$) and the transverse momentum ($P_\mathrm{T}$), which is used in the following sections for event filtering. Jets also carry a b-tag value, which is the output of a discriminator used to identify b-jets (jets originating from a b-quark).

## Output
The task of the analysis is to select, from the whole input data set, the events in which the measured quantities originate from a $t\bar{t}$ decay. In real data, one cannot know exactly in which events a $t\bar{t}$ pair was produced. But since the data here is simulated, we know exactly whether a $t\bar{t}$ pair was produced in any particular event: this is defined by the set to which the file belongs, as mentioned above. Since five channels are involved, five different sets were generated and are given as input data (*ttbar*, *single_top_s_chan*, *single_top_t_chan*, *single_top_tW*, *wjets*).
To select events, we apply some criteria (explained below) and then check what fraction of the selected events were indeed generated with $t\bar{t}$ pair production. The example below shows a successful $t\bar{t}$ analysis: most of the selected events indeed belong to the $t\bar{t}$ channel.


<img src="images/analysis.png" width="500" height="500">

## Events filtering algorithm
Not all particles and their quantities are needed for the analysis. The selection criteria are applied only to leptons (electrons and muons) and jets, which are the products of the $t\bar{t}$ decay. In the semi-leptonic decay channel of $t\bar{t}$ production, two jets, two b-jets, and one outgoing lepton are expected, as can be seen from the diagram below:

<img src="images/ttbardecay.png" width="400" height="400">

So, events that belong to the semi-leptonic $t\bar{t}$ decay mode can be identified by the presence of these products: one lepton, two jets, and two b-tagged jets. This is the foundation of the algorithm presented below.

### 1. Filtering particles by transverse momentum threshold

Following the decay scheme, one would expect the first step to be the selection of events that have at least one lepton and four jets. Before going further, however, one important point must be noted.

Some electrons, muons, and jets must be excluded from the analysis. These particles are not considered as potential decay products, and none of the following steps apply to them.

The transverse momentum ($P_\mathrm{T}$) of the accepted electrons, muons, and jets must be higher than some threshold value.

A threshold of 25 GeV is suggested. One can test other values, but the threshold must be high enough to distinguish collision events from diffractive scattering events and low enough to avoid losing a significant number of interesting events.

This step can be understood as the following input$\Rightarrow$output transformation:
* input: every row (event) contains all of its particles
* output: every row (event) contains only the muons, electrons, and jets that pass the transverse momentum requirement ($P_\mathrm{T} > 25$ GeV).

So, this step is equivalent to defining new branches that contain only the accepted particles. All further steps assume that the user either defines new branches such as electron($P_\mathrm{T, e}>25$ GeV), muon($P_\mathrm{T, \mu}> 25$ GeV), and jets($P_\mathrm{T, jet}>25$ GeV), or takes care of the transverse momentum requirement at each step in some other way.
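The transformation above can be sketched in plain Python. This is only an illustration, not the reference implementation: the event layout (a dict with `electrons`, `muons`, and `jets` lists of per-particle dicts) and the name `filter_by_pt` are assumptions made for this example, and the numbers are made up.

```python
PT_MIN = 25.0  # GeV, the threshold suggested in the text


def filter_by_pt(event, pt_min=PT_MIN):
    """Return a copy of the event that keeps only particles with pT > pt_min."""
    return {
        name: [p for p in particles if p["pt"] > pt_min]
        for name, particles in event.items()
    }


# Hypothetical events: each collection is a list of particles with a pT in GeV.
events = [
    {
        "electrons": [{"pt": 31.2}, {"pt": 12.0}],
        "muons": [],
        "jets": [
            {"pt": 80.1, "btag": 0.9},
            {"pt": 45.3, "btag": 0.2},
            {"pt": 26.0, "btag": 0.7},
            {"pt": 24.9, "btag": 0.8},
        ],
    },
    {
        "electrons": [],
        "muons": [{"pt": 27.5}],
        "jets": [{"pt": 60.0, "btag": 0.1}, {"pt": 22.0, "btag": 0.6}],
    },
]

filtered = [filter_by_pt(ev) for ev in events]
for ev in filtered:
    print({name: len(particles) for name, particles in ev.items()})
```

Keeping each particle as one record means a jet's b-tag value stays attached to it after the cut.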

### 2.  Filtering events by lepton and jet numbers
Next, events that contain exactly one lepton (exactly one electron or exactly one muon) and four or more jets must be selected.
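A minimal sketch of this requirement, assuming each event is a dict with `electrons`, `muons`, and `jets` lists that already contain only particles above the $P_\mathrm{T}$ threshold. The layout and the function name are hypothetical.

```python
def passes_lepton_jet_selection(event):
    """Exactly one lepton (electron or muon) and at least four jets."""
    n_leptons = len(event["electrons"]) + len(event["muons"])
    return n_leptons == 1 and len(event["jets"]) >= 4


def make_jets(n):
    """Build n placeholder jets (made-up values, for illustration only)."""
    return [{"pt": 40.0} for _ in range(n)]


events = [
    {"electrons": [{"pt": 30.0}], "muons": [], "jets": make_jets(4)},              # kept
    {"electrons": [{"pt": 30.0}], "muons": [{"pt": 28.0}], "jets": make_jets(5)},  # two leptons
    {"electrons": [], "muons": [{"pt": 35.0}], "jets": make_jets(3)},              # too few jets
]

selected = [ev for ev in events if passes_lepton_jet_selection(ev)]
print(len(selected), "of", len(events), "events pass")
```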

### 3. Filtering events by jet b-tagging
Next, events that contain at least two b-tagged jets must be selected. A b-tag threshold of 0.5 is suggested, so the output contains only those events that have at least two jets with a b-tag value higher than 0.5.\
Like the $P_\mathrm{T}$ threshold, the b-tag threshold is not fixed and can be varied.
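The b-tag requirement can be sketched the same way. The jet layout (a dict with a `btag` key) and the function names are assumptions for illustration; the threshold is kept as a parameter because it may be varied.

```python
BTAG_MIN = 0.5  # suggested b-tag threshold


def count_btagged(jets, btag_min=BTAG_MIN):
    """Number of jets whose b-tag value is above the threshold."""
    return sum(1 for jet in jets if jet["btag"] > btag_min)


def passes_btag_selection(event, btag_min=BTAG_MIN):
    """Require at least two b-tagged jets in the event."""
    return count_btagged(event["jets"], btag_min) >= 2


# Hypothetical events with made-up b-tag values.
event_pass = {"jets": [{"btag": 0.9}, {"btag": 0.7}, {"btag": 0.2}, {"btag": 0.1}]}
event_fail = {"jets": [{"btag": 0.9}, {"btag": 0.3}, {"btag": 0.2}, {"btag": 0.5}]}

print(passes_btag_selection(event_pass), passes_btag_selection(event_fail))
```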

## $t$-quark mass plotting
At this point, only the events suitable for the analysis remain.
The final task is plotting the $t$-quark mass. As shown in the $t\bar{t}$-decay diagram, one $t$-quark decays into a $b$-quark, a lepton, and a neutrino, while the other ($\bar{t}$-quark) decays into two quarks and one $b$-quark. Consequently, the $t$-quark mass can be reconstructed from the latter products: one b-quark and two quarks, so three jets are needed for plotting the $t$-quark mass.\
To plot the mass, it is therefore necessary to find all possible trijet combinations per event and select the most suitable one. The required properties are the highest total transverse momentum and at least one b-tagged jet. So, each event is assigned a set of three jets that meets these requirements:
* The maximal total $P_\mathrm{T}$
* At least one jet is ***b-tagged***

At this point, everything is ready to plot the trijet mass and obtain a $t$-quark peak similar to the one shown above.
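The trijet selection can be sketched as follows. Several things here are assumptions for illustration: each jet is a dict with `pt`, `eta`, `phi`, `mass`, and `btag` keys (the last four are not all shown in the input diagram), "total $P_\mathrm{T}$" is interpreted as the transverse momentum of the summed trijet four-momentum, and the function names are hypothetical.

```python
import itertools
import math

BTAG_MIN = 0.5  # suggested b-tag threshold


def to_cartesian(jet):
    """Convert (pt, eta, phi, mass) to the four-momentum (E, px, py, pz)."""
    px = jet["pt"] * math.cos(jet["phi"])
    py = jet["pt"] * math.sin(jet["phi"])
    pz = jet["pt"] * math.sinh(jet["eta"])
    energy = math.sqrt(px**2 + py**2 + pz**2 + jet["mass"] ** 2)
    return energy, px, py, pz


def best_trijet_mass(jets, btag_min=BTAG_MIN):
    """Mass of the highest-pT trijet that has at least one b-tagged jet.

    Returns None if no combination satisfies the b-tag requirement.
    """
    best_pt, best_mass = -1.0, None
    for trio in itertools.combinations(jets, 3):
        if not any(jet["btag"] > btag_min for jet in trio):
            continue
        # Sum the three four-momenta component by component.
        e, px, py, pz = (sum(c) for c in zip(*(to_cartesian(j) for j in trio)))
        pt = math.hypot(px, py)
        if pt > best_pt:
            best_pt = pt
            best_mass = math.sqrt(max(e**2 - px**2 - py**2 - pz**2, 0.0))
    return best_mass


# Hypothetical jets of one selected event (made-up values, GeV).
jets = [
    {"pt": 90.0, "eta": 0.3, "phi": 0.1, "mass": 10.0, "btag": 0.9},
    {"pt": 70.0, "eta": -0.5, "phi": 2.0, "mass": 8.0, "btag": 0.1},
    {"pt": 50.0, "eta": 1.1, "phi": -1.5, "mass": 6.0, "btag": 0.8},
    {"pt": 35.0, "eta": 0.0, "phi": 3.0, "mass": 5.0, "btag": 0.2},
]
print(best_trijet_mass(jets))
```

The value returned for each selected event is what would be filled into the mass histogram.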

## Summary
Here is a brief cheat sheet for performing $t\bar{t}$ analysis:

### 1. Select events by the following criteria:
1. The transverse momentum of every particle considered in the selection must be greater than 25 GeV
2. Exactly one lepton and at least four jets
3. At least two b-tagged jets (a b-tag threshold of 0.5 can be used)

### 2. Build all possible trijet combinations in every event and keep the single most suitable one by the following criteria:
1. At least one jet in the three-jet set must have a b-tag value above the threshold (0.5)
2. The trijet $P_\mathrm{T}$ must be the largest among all combinations in the event.


### 3. Graphical schema

<img src="images/pipeline.png" width="600" height="600">

## Weighting
The algorithm described above assumes that data samples generated for different channels enter the histogram with equal weights. This is not the case, as the samples contain different numbers of events. The partial cross-section of each channel must also be taken into account. So, for normalization, the weight must decrease as the total number of events in the sample increases and increase as the cross-section increases. This yields the formula:
$$w_i = \frac{\sigma_i L}{N_i}$$\
where $i$ denotes the partial interaction channel,\
$\sigma_i$ is the partial cross-section,\
$L$ is the luminosity,\
$N_i$ is the total number of events in the data sample.\
$L = 3378\ \mathrm{pb}^{-1}$. The cross-section values for the five interaction channels can be found [here](https://atlas-groupdata.web.cern.ch/atlas-groupdata/dev/AnalysisTop/TopDataPreparation/XSection-MC15-13TeV.data) and are listed below:
* "ttbar": 729.84 pb
* "single_top_s_chan": 3.2944 pb
* "single_top_t_chan": 234.7936 pb
* "single_top_tW": 75.842 pb
* "wjets": 15487.164 pb

These $w_i$ values are the weights assigned to the data samples of the different interaction channels.
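As a sketch, the weights can be computed directly from the formula. The luminosity and cross-sections are the values quoted above; the event counts in `N_EVENTS` are placeholders invented for this example and must be replaced with the actual number of events in each sample.

```python
LUMINOSITY = 3378.0  # pb^-1

# Cross-sections in pb, as listed in the text.
XSEC_PB = {
    "ttbar": 729.84,
    "single_top_s_chan": 3.2944,
    "single_top_t_chan": 234.7936,
    "single_top_tW": 75.842,
    "wjets": 15487.164,
}

# Placeholder event counts (NOT real values, for illustration only).
N_EVENTS = {
    "ttbar": 1_000_000,
    "single_top_s_chan": 100_000,
    "single_top_t_chan": 500_000,
    "single_top_tW": 200_000,
    "wjets": 2_000_000,
}


def sample_weight(xsec_pb, n_events, luminosity=LUMINOSITY):
    """Per-event weight w_i = sigma_i * L / N_i for one sample."""
    return xsec_pb * luminosity / n_events


weights = {name: sample_weight(XSEC_PB[name], N_EVENTS[name]) for name in XSEC_PB}
for name, weight in weights.items():
    print(f"{name}: {weight:.4f}")
```

Every event from sample $i$ then enters the histogram with weight $w_i$ instead of 1.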

## Variations
The last part of this description concerns variations. They are motivated by the fact that the output histograms are passed on to statistical processing. That step is not described here, but in short, the data is fitted to a defined statistical model with some parameters. The statistical model describes the input data and allows observable quantities to be determined correctly, for example the t-quark mass, meaning its mean value and uncertainty.

Two values must be assigned to every bin: a mean value and an uncertainty. Assume there are $N$ counts in some bin. This is the mean value. According to the Poisson distribution, the standard uncertainty is $\sqrt{N}$. This uncertainty determines the weight with which the bin enters the statistical model.

To estimate the $t$-quark mass correctly, one needs these uncertainties, as shown in the example below:

<img src="images/jetvar.png" width="500" height="500">

The uncertainties here are determined by the Poisson distribution ($\sqrt{N}$).
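A minimal sketch of the per-bin values for an unweighted histogram: the count $N$ in each bin and its Poisson uncertainty $\sqrt{N}$. The binning, the helper names, and the mass values are made up for illustration.

```python
import math


def histogram(values, low, high, n_bins):
    """Count values in n_bins equal-width bins covering [low, high)."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        if low <= value < high:
            # min() guards against floating-point rounding at the upper edge.
            index = min(int((value - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


def poisson_uncertainties(counts):
    """Standard uncertainty sqrt(N) for each bin count N."""
    return [math.sqrt(n) for n in counts]


# Hypothetical trijet masses in GeV.
masses = [110.0, 160.0, 165.0, 172.0, 175.0, 240.0]

counts = histogram(masses, low=100.0, high=250.0, n_bins=3)
errors = poisson_uncertainties(counts)
for n, err in zip(counts, errors):
    print(f"N = {n}, uncertainty = {err:.2f}")
```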

## Links for download:
### Samples categorized by process

- **ttbar**:
  - nominal:
    - [19980](https://opendata.cern.ch/record/19980): Powheg + Pythia 8 (ext3), 2413 files, 3.4 TB -> converted
    - [19981](https://opendata.cern.ch/record/19981): Powheg + Pythia 8 (ext4), 4653 files, 6.4 TB -> converted
  - scale variation:
    - [19982](https://opendata.cern.ch/record/19982): same as below, unclear if overlap
    - [19983](https://opendata.cern.ch/record/19983): Powheg + Pythia 8 "scaledown" (ext3), 902 files, 1.4 TB -> converted
    - [19984](https://opendata.cern.ch/record/19984): same as below, unclear if overlap
    - [19985](https://opendata.cern.ch/record/19985): Powheg + Pythia 8 "scaleup" (ext3), 917 files, 1.3 TB -> converted
  - ME variation:
    - [19977](https://opendata.cern.ch/record/19977): same as below, unclear if overlap
    - [19978](https://opendata.cern.ch/record/19978): aMC@NLO + Pythia 8 (ext1), 438 files, 647 GB -> converted
  - PS variation:
    - [19999](https://opendata.cern.ch/record/19999): Powheg + Herwig++, 443 files, 810 GB -> converted

- **single top**:
  - s-channel:
    - [19394](https://opendata.cern.ch/record/19394): aMC@NLO + Pythia 8, 114 files, 76 GB -> converted
  - t-channel:
    - [19406](https://opendata.cern.ch/record/19406): Powheg + Pythia 8 (antitop), 935 files, 1.1 TB -> converted
    - [19408](https://opendata.cern.ch/record/19408): Powheg + Pythia 8 (top), 1571 files, 1.8 TB -> converted
  - tW:
    - [19412](https://opendata.cern.ch/record/19412): Powheg + Pythia 8 (antitop), 27 files, 30 GB -> converted
    - [19419](https://opendata.cern.ch/record/19419): Powheg + Pythia 8 (top), 23 files, 30 GB -> converted

- **W+jets**:
  - nominal (with 1l filter):
    - [20546](https://opendata.cern.ch/record/20546): same as below, unclear if overlap
    - [20547](https://opendata.cern.ch/record/20547): aMC@NLO + Pythia 8 (ext2), 5601 files, 4.5 TB -> converted
    - [20548](https://opendata.cern.ch/record/20548): aMC@NLO + Pythia 8 (ext4), 4598 files, 3.8 TB -> converted
