# $\mathrm{H} \rightarrow \mathrm{ZZ}$ : Higgs boson discovery
---------------------------------------
by Artur Monsch and Artur Gottmann

Last updated on December 18, 2020

-----------------------------------

In 2012, the discovery of the previously predicted Higgs boson was announced at CERN, confirming a piece of the Standard Model of particle physics. One of the decay channels that led to the discovery was the decay into four leptons. Compared to the other decay channels, this channel is particularly well suited for analysis, which you will now carry out in this notebook.

This notebook is the first part of this analysis and deals mainly with the simulated and measured datasets. The aim is to increase the sensitivity by achieving a high signal-to-background ratio. At the end of the task, the significance of the measurement will be estimated. Based on this significance, a first statement about the detection of the Higgs boson can be made. A detailed statistical treatment of the significance, up to the combination of measurements to increase it, is presented in the second part, which you will probably encounter in the TP2 course.

The inspiration for this exercise is the following [example analysis](http://opendata.cern.ch/record/5500).

In [None]:
import sys
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

sys.path.append("..")

from include.RandomHelper import check_data_state

# Check that all folders in the directory are present and unpack them
check_data_state(directory="../data") 

# Events
------------------

To get familiar with the signatures of proton-proton collision events recorded by the CMS experiment, have a look at the visualization of the CMS detector together with the examples of recorded
events with the [**Ispy-WebGL web interface**](https://github.com/cms-outreach/ispy-webgl). The various components of the CMS multi-purpose detector, which are used to detect different particles, are shown in the [**detector slice**](https://cds.cern.ch/record/2120661/files/CMSslice_whiteBackground.png?subformat=icon-1440):
<center><img src="https://cds.cern.ch/record/2120661/files/CMSslice_whiteBackground.png?subformat=icon-1440" width=75%></center>


Using Ispy-WebGL, try to understand the function of each component by enabling it in the 'Detector' menu, and select events with interesting signatures from the event collections. Also try enabling and disabling 'Physics' objects in the corresponding menu.


<div class="alert alert-info">
Select two examples for each of the decays listed below and save them as an image. How would the typical signature of these individual decays appear in the detector?

- $\mathrm{H} \rightarrow \mathrm{ZZ} \rightarrow 4\ell$
- $\mathrm{H} \rightarrow \gamma \gamma$
- $\mathrm{H} \rightarrow \mathrm{W}^+ \mathrm{W}^- \rightarrow 2\ell 2\nu$

</div>


Individual detector components can be switched on, and individual events can be saved as images, from within the web interface. To open the event collection, first download `TP1_event_collection.ig` from the `./data/for_event_display/ig_files/` folder to your computer. Then, in the event display, select and open this file under the "Open" tab (folder symbol) via "open local file". The images can then be inserted directly into the notebook using `<img src="your img url or path">` in a markdown cell.


> Note:
> The `.ig` data format is similar to a `.zip` archive and contains one or more event files. It belongs to ISpy, which CMS uses to display events.

With an Internet connection:

In [None]:
%%html
<iframe src="https://ispy-webgl.web.cern.ch/ispy-webgl/" width="100%" height="700"></iframe>

Without an internet connection: Open the `index.html` from the [Github Repository](https://github.com/cms-outreach/ispy-webgl) locally in a web browser.

The main task in this section is to classify the events manually by decay channel, based on the event images. Later, this classification is automated with dedicated software, so that a large number of events can be analysed in a short time instead of inspecting each collision event image by image.


The focus of this exercise will be to rediscover the decay of the Higgs boson into two Z bosons, which in turn decay into four charged leptons, $H\rightarrow ZZ\rightarrow 4\ell$.
Of all charged leptons, only electrons and muons are used in the analysis, since decays of the Z boson into two $\tau$ leptons, $Z\rightarrow\tau\tau$, are much more difficult to handle
and make it harder to distinguish the $H\rightarrow ZZ$ signal from backgrounds.


>Detailed Explanation:
>
>The $\tau$ leptons decay before they reach the first tracker layer, but they can be reconstructed from their visible decay products. The difficulty arises from the additional neutrinos in the $\tau$ lepton decays. Since these neutrinos carry away part of the energy, the $H\rightarrow ZZ$ peak in the ${m}_{4\ell}$ distribution (where $\ell$ here denotes the visible decay products of the $\tau$ lepton) is smeared out and shifted to lower values. It is therefore much more difficult to distinguish $H\rightarrow ZZ$ from backgrounds (especially $Z\rightarrow 4 \ell$) when using $\tau$ leptons.

### Your possible solution:
-------------------------------------
* $\mathrm{H} \rightarrow \mathrm{ZZ} \rightarrow 4\ell$    
  (Please insert your notes and include your images here)
* $\mathrm{H} \rightarrow \gamma \gamma$    
  (Please insert your notes and include your images here)
* $\mathrm{H} \rightarrow \mathrm{W}^+ \mathrm{W}^- \rightarrow 2\ell 2\nu$    
  (Please insert your notes and include your images here)
--------------------------------------

# Data format
------------------------

In the course of this exercise, we will introduce custom classes that reduce the time and complexity involved in processing datasets. These classes build on packages such as `numpy`, `pandas` or `matplotlib` and combine steps that would otherwise have to be performed explicitly when using these packages, e.g. to obtain a particular kinematic variable of individual particles.

In general, this is a common procedure that you can apply to your future analyses: It is not necessary to rewrite everything from scratch every time you want to process new data sets. Instead, it makes sense to create an intermediate layer between your analysis workflows and the existing packages.

The original datasets are several TB in size, and processing them in a reasonable time requires a cluster of worker nodes running for several hours. In contrast, the datasets used below are only a few MB in size and can be processed quickly as part of this exercise. A very rough preselection of events was made and only the variables necessary for the analysis were written out: certain information about the event, the individual leptons, and the particles reconstructed from them. Each event contains at least four and at most eight leptons.

The data format used in this exercise, from which all the required variables are taken, is a modified human-readable `.csv` format. There are three data sets. In the exercise, an attempt is made to maximize the ratio between signal and background. To do this, the Monte Carlo simulations for the background (`MC_2012_ZZ_to_4L.csv`) and the signal (`MC_2012_H_to_ZZ_to_4L.csv`) are reduced by applying the filters you have developed to whole events or individual leptons.

When the selection of filter values is complete, the filters will be applied to the measurement made in 2012 (`CMS_Run2012_[B,C].csv`).

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from include.Helper import load_dataset, save_dataset
from tqdm import tqdm

from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

df_bkg_init = load_dataset("../data/for_long_analysis/mc_init/MC_2012_ZZ_to_4L_[100,151].csv")
df_sig_init = load_dataset("../data/for_long_analysis/mc_init/MC_2012_H_to_ZZ_to_4L_[100,151].csv")

In [None]:
df_bkg_init

In [None]:
df_sig_init

Information about the variables contained in the data sets:
- **run_evnt_lumisec** are event-specific variables that uniquely identify each event.
- **leptons** contains the list of leptons of the respective event. All important information about each lepton is stored in the corresponding `Lepton` object. A more detailed breakdown is given below. The list is sorted in descending order of transverse momentum, so the first entry is the lepton with the highest transverse momentum.
- **z1, z2** are the pre-reconstructed Z bosons of the event, each stored as a four-vector.
- **four_lep** is the four-vector of the four-lepton system, obtained from the reconstruction of the two Z bosons. Its mass is the invariant mass of the four leptons.
- **changed** is not a physical quantity; it is used only to avoid unnecessary repeated reconstruction. It is set to `True` when a filter changes the number of leptons in an event. A more precise explanation follows in the filter section.
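
The relation between the two Z four-vectors and `four_lep` can be illustrated with plain `numpy`. The following is only a minimal sketch and not the implementation used in `include`: the helper name `invariant_mass` and all numerical values are made up for illustration, and four-vectors are assumed to be arrays in the order `(px, py, pz, E)` in GeV.

```python
import numpy as np

def invariant_mass(four_vector):
    """Invariant mass m = sqrt(E^2 - |p|^2) of a (px, py, pz, E) array."""
    px, py, pz, energy = four_vector
    mass_squared = energy**2 - (px**2 + py**2 + pz**2)
    # Guard against tiny negative values caused by rounding
    return np.sqrt(max(mass_squared, 0.0))

# Hypothetical four-vectors of two Z candidates (illustrative numbers only)
toy_z1 = np.array([20.0, 0.0, 0.0, 93.4])
toy_z2 = np.array([-20.0, 0.0, 0.0, 31.6])

# Four-vectors add component-wise, analogous to four_lep = z1 + z2
toy_four_lep = toy_z1 + toy_z2
toy_m4l = invariant_mass(toy_four_lep)
print(invariant_mass(toy_z1), invariant_mass(toy_z2), toy_m4l)
```

In this toy configuration one Z candidate is close to its nominal mass while the other is much lighter, and the mass of the summed four-vector is what would enter an $m_{4\ell}$ histogram.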

# Lepton object, TLorentzVector and access to kinematic quantities.
-----------

In this section we will now briefly discuss the `Lepton` object, as well as the `TLorentzVector`, and touch on some useful functionality that could avoid re-implementation and/or tedious operation steps.

Let us start with the `Lepton`, where we first take a list of leptons from the first event in the background:

In [None]:
one_event_leptons = df_bkg_init.loc[0].leptons
one_event_leptons

At first only one lepton is to be considered. For this purpose it can either be accessed via `one_event_leptons[0]` or - if desired - be created explicitly. In this case the Lepton class should be imported first.

In [None]:
from include.Particle import Lepton
from uproot3_methods.classes.TLorentzVector import TLorentzVector

one_lepton = Lepton(lv=TLorentzVector(x=-16.588, y=38.981, z=-7.3221, t=42.992), charge=-1, flavour='e', 
                    relpfiso=0.103053, dxy=-0.000581222, dz=0.00161887, sip3d=0.369838)
one_lepton

About the single quantities which describe such a lepton:

- **TLorentzVector** is the four-vector. The components `x`, `y` and `z` hold the momentum components and `t` holds the energy: `(x,y,z,t)->(px,py,pz,E)`
- **charge** is the electric charge of the lepton
- **flavour** distinguishes whether it is a muon (`"m"`) or an electron (`"e"`)
- **relpfiso** is the relative isolation of the lepton. It is the sum of the transverse momenta of all non-leptonic particles in a cone in $(\Delta\eta, \Delta\phi)$ around the lepton, divided by the transverse momentum of the lepton itself. Good relative isolation helps to avoid inaccurate reconstruction and misidentification: the smaller the value, the less activity there is around the lepton, and the better its measurable quantities can be determined in the detector.
- **sip3d** is the significance of the 3D impact parameter, and **dxy** and **dz** are the impact parameters (in cm) transverse and longitudinal, respectively, with respect to the beam axis. These quantities allow distinguishing leptons from primary decays, such as from a Z boson, from leptons originating from decays of particles with larger decay times, such as B mesons, occurring outside the primary vertices.
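
To make the definition of the relative isolation more concrete, here is a small sketch under stated assumptions. It is not the algorithm used to produce `relpfiso` in the data sets: the function names, the `(pt, eta, phi)` tuple layout, the cone size of 0.4 and all numbers are hypothetical choices for illustration.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance sqrt(d_eta^2 + d_phi^2), with d_phi wrapped to [-pi, pi)."""
    d_phi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, d_phi)

def toy_relative_isolation(lepton, particles, cone_size=0.4):
    """Sum of particle pT inside the cone around the lepton, divided by the lepton pT.

    lepton and particles are (pt, eta, phi) tuples; cone_size is an assumed value.
    """
    lep_pt, lep_eta, lep_phi = lepton
    pt_sum = sum(pt for pt, eta, phi in particles
                 if delta_r(lep_eta, lep_phi, eta, phi) < cone_size)
    return pt_sum / lep_pt

# Made-up lepton and surrounding particles
toy_lepton = (40.0, 0.5, 3.0)
toy_particles = [(2.0, 0.6, -3.1), (10.0, -1.5, 0.2), (1.0, 0.4, 2.9)]
toy_iso = toy_relative_isolation(toy_lepton, toy_particles)
print(toy_iso)
```

Note the wrap-around in $\phi$: the first toy particle sits at $\phi=-3.1$ but is still close to the lepton at $\phi=3.0$, so it counts towards the isolation sum.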


Likewise, other quantities that are useful for describing the lepton, such as the pseudorapidity $\eta$ or the azimuthal angle $\phi$, can be accessed in the same way as for the `TLorentzVector`. The available attributes can be listed with the `dir()` function; the `__dunder__` methods are not relevant for us.

In [None]:
print([it for it in dir(Lepton) if  not it.startswith("__")])
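
For orientation, the kinematic quantities mentioned above can also be computed by hand from the momentum components. The sketch below uses only `numpy` and the components of the example lepton created earlier; the function name `kinematics_from_components` is hypothetical, and the `Lepton` and `TLorentzVector` classes provide these quantities directly, so this is not needed in the analysis itself.

```python
import numpy as np

def kinematics_from_components(px, py, pz):
    """Return (pt, eta, phi) for the given momentum components."""
    pt = np.hypot(px, py)          # transverse momentum
    eta = np.arcsinh(pz / pt)      # pseudorapidity
    phi = np.arctan2(py, px)       # azimuthal angle in (-pi, pi]
    return pt, eta, phi

# Momentum components of the example electron from above (in GeV)
pt, eta, phi = kinematics_from_components(-16.588, 38.981, -7.3221)
print(f"pt = {pt:.2f} GeV, eta = {eta:.3f}, phi = {phi:.3f}")
```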

The Z bosons, as well as the four_lep four-vector are stored as `TLorentzVector` - the same attributes (except the lepton specific ones) can be accessed in the same way.

In [None]:
print(df_bkg_init.loc[0].z1)
print(df_bkg_init.loc[0].z1.mass)

For the creation of histograms, a custom pandas accessor is used, which handles the conversion of the individual quantities from the individual events.

In [None]:
from include.QuantityAccessors import ParticleQuantitySeriesAccessor

This intermediate layer can be applied via `quantity` to lepton objects or to the four-vectors of the reconstructed particles to obtain a `pandas.Series`, which can be used in the usual way.

In [None]:
fig, ax = plt.subplots()
print(df_bkg_init.z1.quantity.px)
df_bkg_init.z1.quantity.px.hist(bins=100, range=(-100, 100), ax=ax, label="Z$_1$-Boson")
ax.set_xlabel("$p_x$ in GeV")
ax.set_ylabel("Events")
plt.legend()
plt.show()
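
How such an accessor can work in principle is sketched below with `pd.api.extensions.register_series_accessor`. This is a simplified illustration and most likely differs from the actual `ParticleQuantitySeriesAccessor`: the accessor name `toy_quantity`, the `ToyVector` class and the values are hypothetical, and flavour or lepton-number selection is not covered.

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class ToyVector:
    """Stand-in for a four-vector object stored in a DataFrame column."""
    px: float
    py: float
    pz: float
    energy: float

@pd.api.extensions.register_series_accessor("toy_quantity")
class ToyQuantityAccessor:
    def __init__(self, series):
        self._series = series

    def __getattr__(self, name):
        # Only reached for names that are not regular attributes of the accessor
        if name.startswith("_"):
            raise AttributeError(name)
        # Extract the requested attribute from every object in the Series
        return self._series.apply(lambda obj: getattr(obj, name)).rename(name)

toy_vectors = pd.Series([ToyVector(1.0, 2.0, 3.0, 10.0), ToyVector(-4.0, 0.5, 1.0, 8.0)])
toy_px = toy_vectors.toy_quantity.px
print(toy_px)
```

The result is an ordinary numeric `pandas.Series`, so methods such as `.hist()` or `.mean()` are available on it.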

For the leptons, the selection can be refined further: it can be restricted to electrons or muons only, and it can include either all leptons of each event or only specific ones, for example the first lepton of each event, which has the highest transverse momentum.

In [None]:
fig, ax = plt.subplots(ncols=3, nrows=1, figsize=(20, 5))
df_bkg_init.leptons.quantity.px.hist(bins=100, range=(-100, 100), ax=ax[0],
                                     label="All leptons")
df_bkg_init.leptons.quantity(flavour="e").px.hist(bins=100, range=(-100, 100), ax=ax[1],
                                                  label="Electrons/positrons")
df_bkg_init.leptons.quantity(flavour="e", lep_num=[0, 1]).px.hist(bins=100, range=(-100, 100), ax=ax[2],
                                                                  label="First two electrons/positrons")

# Label every panel
for _ax in ax:
    _ax.set_xlabel("$p_x$ in GeV")
    _ax.set_ylabel("Number of leptons")
    _ax.legend()

plt.show()

# Creation and application of filters
----------

For signal enrichment, certain filters can now be applied that either discard individual leptons from the events or discard entire events. An example of a lepton filter can be the condition for a minimum transverse momentum of leptons.

From a physical point of view, a minimum transverse momentum must be required, since below a certain value the probability of misidentifying a lepton increases. For the CMS detector and the data set used, the minimum transverse momentum is $5 \, \mathrm{GeV}$ for muons and $7 \, \mathrm{GeV}$ for electrons. Any lepton that does not satisfy this condition is removed from its event. After this step, an event can be left with fewer than four leptons; in that case two Z bosons can no longer be reconstructed, so the event must be discarded completely.

The general task for a lepton filter is then given by the two (or three) following points:
1. Calculation of the necessary physical quantity by which the filtering is done, if this quantity is not already available
2. Application of the filter condition to each single lepton in an event, possibly differentiating by lepton flavor
3. Checking that the minimum number of leptons is still fulfilled.

To avoid rewriting the individual points for each filter, it is useful to introduce two helper functions that are used in each filter. We start from the end and introduce a helper function that checks the minimum number of leptons within an event.


In [None]:
def check_min_lepton_number(row: pd.Series):
    """
    Checks whether the minimum number of 
    leptons for reconstruction is present 
    in an event - a row in the data set.
    
    row: pd.Series;
    
    return: bool
    """

    if row.channel != "2e2mu": # Flavour distinction not necessary
        return len(row.leptons) >= 4

    if row.channel == "2e2mu": # Flavour distinction necessary
        _flavour_count = lambda x: sum([1 for lepton in row.leptons if lepton.flavour == x])
        return (_flavour_count("e") >= 2) and (_flavour_count("m") >= 2)

The second helper function is a generic lepton filter that applies an operation - always a function in our case - to all leptons within an event and discards certain leptons after applying the operation.

In [None]:
def generic_lepton_filter(row, operation):
        """
        Function that creates a mask for 
        the leptons in an event based on 
        an operation and applies it.
        
        row: pd.Series - one event
        operation: Filter condition for a lepton;
                   For the case in which the distinction by lepton flavor is necessary:
                   Form: {"e": lambda lepton: <operation>, "m": lambda lepton: <operation>}
                   For the case in which the distinction by lepton flavor is not necessary:
                   Form: lambda lepton: <operation>
                   
        return: pd.Series
        """
        
        # Creation of a filter mask for the case...
        if isinstance(operation, dict):  # ... of a distinction according to the flavor
            _mask = np.array([operation[lep.flavour](lep) for lep in row.leptons])
        else:  # ... of no distinction according to the flavor
            _mask = np.array([operation(lep) for lep in row.leptons])
        
        row.leptons = row.leptons[_mask]  # Application of the created filter mask
        
        if not check_min_lepton_number(row):  # Checking for the minimum number of leptons
            # An entry in the row is set to "nan" and will be sorted out later
            row.run_event_lumisec = np.nan  
        else:
            # Flag the event for repeated reconstruction if leptons were removed,
            # keeping a flag that an earlier filter may already have set.
            row.changed = bool(row.changed) or (len(_mask) != sum(_mask))
        
        return row

`generic_lepton_filter` can now be used to apply lepton filters as follows (`check_min_lepton_number` is also applied implicitly when running `generic_lepton_filter`, as seen above):

In [None]:
def filter_pt_min(row):
    _used_operation = {"m": lambda lep: lep.pt > 5, "e": lambda lep: lep.pt > 7}
    return generic_lepton_filter(row, _used_operation)

In the case where no distinction between the lepton flavors is necessary, the variable `_used_operation` would be simplified to

```python
_used_operation = lambda lep: <something that returns a bool>
```

Filters are applied via the already familiar `pd.DataFrame.apply(myfunction, axis=1)`. All events that do not meet the minimum number of leptons (`row.run_event_lumisec = np.nan`) are dropped via `pd.DataFrame.dropna()`. To make applying the filters more compact for the rest of the task, the following helper is used:

In [None]:
def apply_filter_functions(dataframe, filter_function_list):
    new_dataframe = dataframe
    for filter_function in filter_function_list:
        tqdm.pandas(desc=f"{filter_function.__name__:<30}")
        new_dataframe = new_dataframe.progress_apply(filter_function, axis=1)
        new_dataframe = new_dataframe.dropna()
    return new_dataframe

In this case the `apply` method of `pandas` was replaced by `progress_apply` of `tqdm` to give visual feedback about the progress. The output is always the modified `pd.DataFrame`. It can either be assigned to a new variable or explicitly overwrite the old `pd.DataFrame`. If the old `pd.DataFrame` is overwritten, there is no way to reverse this except by restarting the kernel! It is possible to save a backup of the already processed datasets via `save_dataset`:

```python
from include.Helper import save_dataset
save_dataset(dataframe, "new_name.csv")
```

Applying the minimum transverse momentum filter implemented above:

In [None]:
df_bkg_after_pt_min = apply_filter_functions(df_bkg_init, [filter_pt_min])
print(df_bkg_init.shape, df_bkg_after_pt_min.shape)
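
The mark-and-drop pattern used by the filters (set one entry of a row to `nan`, then call `dropna()`) can be reproduced on a toy `DataFrame`. This is a minimal sketch with made-up columns `event_id` and `n_leptons`, which are not the columns of the actual data sets; the row is copied before it is modified to keep the example independent of how `apply` hands rows to the function.

```python
import numpy as np
import pandas as pd

# Toy events: only the number of leptons is stored (illustrative values)
toy = pd.DataFrame({"event_id": [1.0, 2.0, 3.0], "n_leptons": [4, 3, 5]})

def toy_filter(row):
    """Mark events with fewer than four leptons by writing nan into the row."""
    row = row.copy()
    if row["n_leptons"] < 4:
        row["event_id"] = np.nan
    return row

# Rows containing nan are removed afterwards
toy_selected = toy.apply(toy_filter, axis=1).dropna()
print(toy.shape, toy_selected.shape)
```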

An example of an event filter, on the other hand, is a requirement on the combination of electric charges. If no electrically neutral combination of four leptons (keeping the same lepton flavor in pairs) can be formed within an event, which could contain more than four leptons, then the event is discarded.

In code, this exclusion condition may look like the following in the form of a helper function:

In [None]:
# Useful Python library for combinatorial problems
import itertools

def valid_charge_combination(charges, num=4):
    # All num-fold combinations of charges with early termination
    for combination in itertools.combinations(charges, num): 
        if np.sum(combination) == 0:
            return True 
    else:
        return False
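
As a quick check of this logic, the following self-contained sketch lists which four-lepton combinations of a hypothetical five-lepton event are electrically neutral. The charge list is made up, and the combinations are expressed as index tuples so that they can be traced back to the individual leptons.

```python
import itertools

toy_charges = [1, 1, -1, 1, -1]  # hypothetical event with five leptons

# All four-lepton index combinations whose charges sum to zero
neutral_combinations = [
    indices
    for indices in itertools.combinations(range(len(toy_charges)), 4)
    if sum(toy_charges[i] for i in indices) == 0
]
print(neutral_combinations)
```

Since at least one neutral combination exists, `valid_charge_combination` would return `True` for this event and the event would be kept.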

The actual filter function for this event filter would be:

In [None]:
def filter_electric_charge(row):    
        # Distinction between the decay channels
        if row.channel != "2e2mu": # All leptons have the same flavor
            charge_list = [lep.charge for lep in row.leptons]
            if not valid_charge_combination(charge_list, 4):
                # An entry in the row is set to "nan" and will be sorted out later
                row.run_event_lumisec = np.nan
        
        if row.channel == "2e2mu":  # Mixing channel: A distinction according to the flavor is required 
            charge_list_mu = [lep.charge for lep in row.leptons if lep.flavour == "m"]
            charge_list_el = [lep.charge for lep in row.leptons if lep.flavour == "e"]
            if not valid_charge_combination(charge_list_mu, 2) or not valid_charge_combination(charge_list_el, 2):
                row.run_event_lumisec = np.nan
            
        return row

Applying `filter_electric_charge` to `df_bkg_init` would lead to no change, since the already successful reconstruction of the Z bosons implies this filter condition. For this reason, it makes sense to apply this filter only after a lepton filter.

In [None]:
df_bkg_after_pt_min_q = apply_filter_functions(df_bkg_init, [filter_pt_min, filter_electric_charge])

If an event contained more than four leptons before the lepton filter was applied and still contains four or more afterwards, the selected events must be reconstructed again, because a better reconstruction of the Z bosons might be possible. For `filter_pt_min` followed by `filter_electric_charge`, the number of changed events in the dataset is:

In [None]:
sum(df_bkg_after_pt_min_q.changed)

This reconstruction can be done analogously to `apply_filter_functions`:

In [None]:
from include.ReconstructionFunctions import reconstruct_zz

def reconstruct(dataframe, pt_exact=None):
    new_dataframe = dataframe
    tqdm.pandas(desc=f"{reconstruct_zz.__name__:<30}")
    new_dataframe = new_dataframe.progress_apply(lambda x: reconstruct_zz(x, pt_dict=pt_exact), axis=1, raw=True)
    new_dataframe = new_dataframe.dropna()
    new_dataframe.four_lep = new_dataframe.z1 + new_dataframe.z2
    return new_dataframe

The `pt_exact` argument has, for example, the form `{1: 15, 2: 10, 3: 0, 4: 0}` and imposes a stricter constraint on the transverse momentum of the leptons used for the reconstruction. In this example, it would ensure that at least one of the leptons used for the reconstruction satisfies $p_T > 15 \, \mathrm{GeV}$ and two leptons satisfy $p_T > 10 \, \mathrm{GeV}$. A more detailed motivation and the associated task follow below. Here, the reconstruction overwrites the previously generated `df_bkg_after_pt_min_q`:

In [None]:
df_bkg_after_pt_min_q = reconstruct(df_bkg_after_pt_min_q)
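
One possible reading of the `pt_exact` dictionary is that key `n` gives the threshold for the lepton with the n-th highest transverse momentum. The sketch below implements this reading as an assumption; the function name `passes_ordered_pt_thresholds` is hypothetical and the actual logic inside `reconstruct_zz` may differ.

```python
def passes_ordered_pt_thresholds(pts, pt_dict):
    """True if the n-th hardest lepton has pT above pt_dict[n] for every n."""
    ordered = sorted(pts, reverse=True)
    if len(ordered) < max(pt_dict, default=0):
        return False  # not enough leptons to check every rank
    return all(ordered[rank - 1] > threshold
               for rank, threshold in pt_dict.items())

toy_thresholds = {1: 15, 2: 10, 3: 0, 4: 0}

# Made-up transverse momenta in GeV
print(passes_ordered_pt_thresholds([22, 12, 8, 6], toy_thresholds))
print(passes_ordered_pt_thresholds([14, 12, 8, 6], toy_thresholds))
```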

The direct effect on the transverse momentum can then be visualized as previously illustrated:

In [None]:
fig, ax = plt.subplots()
df_bkg_init.leptons.quantity.pt.hist(bins=100, range=(0, 100), ax=ax, label="Before the filter")
df_bkg_after_pt_min_q.leptons.quantity.pt.hist(bins=100, range=(0, 100), ax=ax, label="After the filter")
ax.set_xlabel("$p_T$ in GeV")
ax.set_ylabel("Number of leptons")
ax.legend()
plt.show()

<div class="alert alert-info">
Other variables from the dataset can be visualized in the same way.   
    
  * Does the distribution of transverse momentum look as expected?
  * On which other variables can you see an effect due to the application of these two filters?
  * Are there any deviations from expectation in the distributions of other quantities?
</div>

# Creation of further selection conditions
-------

Following the example of the minimum transverse momentum requirement, you will define and implement further selection conditions in this section. You can choose the threshold values for the individual conditions yourself. For this purpose, visualizing the distributions, as in the previous sections, may be helpful.

Feel free to try different settings for the selection conditions, as the signal sensitivity of your analysis may vary depending on the choice of a threshold value for a quantity. Therefore, try to find a combination that makes sense.

For a motivation for the thresholds chosen by the CMS collaboration for the 2012 discovery of the Higgs boson, see the [**official publication**](https://arxiv.org/pdf/1207.7235.pdf). This paper is a combination of all the decay channels of the Higgs boson that were studied and combined for the discovery, so you can focus your reading on the decay into four leptons only. If you are interested, you can also look at the other decay channels - but it is not necessary for the exercise.

Also, there is room for you to vary the choice of selection thresholds with respect to the values suggested in the paper in order to try to get a better result. Remember, however, that the choice of values for the thresholds should be based on studies with the MC simulations and not on the measured datasets, to avoid a biased search for a signal peak.

In the following, you are to implement further filters according to the above principle. When choosing the thresholds, you can take the [**official publication**](https://arxiv.org/pdf/1207.7235.pdf) as a basis or deviate from it at your own discretion.
<div class="alert-info">
At a minimum, the following filters should be implemented:

* **Lepton filter** for:
    * **relative isolation** (`relpfiso`), which discards all leptons whose relative isolation value is greater than a threshold to be chosen.
        
        Why is it important to look at this in more detail?
    * **Pseudorapidity** (`eta` or `pseudorapidity`) of individual leptons. Here the detector geometry should be taken into account and weighed whether a distinction by lepton flavor is useful.
    
    (*Note*: Muons can be considered as minimum ionizing particles (MIPs). The following [**image**](http://hep.fi.infn.it/CMS/software/ResultsWebPage/Images/Geometry/Tracker_SubDetectors_x_vs_eta.gif) shows the "material budget" of the CMS detector.)
    * **Impact parameter** of individual leptons, which discards all leptons that exceed a `(dz, dxy, sip3d)` threshold.
    
    Which leptons are discarded by applying this filter step?
* **Event filter** for:
    * **Stricter transverse momentum filter**. For example, it may be required that there is at least one lepton with transverse momentum greater than $20 \, \mathrm{GeV}$ and another that satisfies $p_T > 10 \, \mathrm{GeV}$. After applying this filter, the chosen condition should be passed as a `dict` (`pt_exact`) to the `reconstruct` function (see above).
    
    Specify a possible purpose for this filter.
    * Choosing appropriate **interval(s) for the mass of the Z-boson(s)** to get an off-shell and an on-shell Z-boson.
    
    What is an off-shell Z-boson?    
    Why is there an attempt to reconstruct such a combination?
</div>

# Final application of filtering and reconstruction steps to simulated and measured data sets.
---------------------------------------

In this section, the conditions you previously implemented are applied to the data sets. Also the actual measurement is now introduced here. So, if you are not yet satisfied with some thresholds of the requirements, you should change them before running this section.

Also, this ensures that you don't see the measurement beforehand and try to adjust your requirements thresholds to match the measurement - this would contradict the idea of blind analysis, which is optimized before looking at the data to avoid bias due to subjectivity.

In [None]:
# List to be extended with the filters you implemented yourself
final_filter_function_list = [filter_pt_min, filter_electric_charge]

df_measurement_init = load_dataset("../data/for_long_analysis/ru_init/CMS_Run2012_[B,C]_[100,151].csv")

# Apply the filters to all records
df_bkg = apply_filter_functions(df_bkg_init, final_filter_function_list)
df_sig = apply_filter_functions(df_sig_init, final_filter_function_list)
df_measurement = apply_filter_functions(df_measurement_init, final_filter_function_list)

# Application of reconstruction to changed events:

# Replace with the chosen values
chosen_pt_exact_dict = {1:0, 2:0, 3:0, 4:0}
df_bkg = reconstruct(df_bkg, pt_exact=chosen_pt_exact_dict)
df_sig = reconstruct(df_sig, pt_exact=chosen_pt_exact_dict)
df_measurement = reconstruct(df_measurement, pt_exact=chosen_pt_exact_dict)

# Application of the filter for the selection of the masses of the Z-bosons within a selected interval:
# df_bkg = apply_filter_functions(df_bkg, <your func>)
# df_sig = apply_filter_functions(df_sig, <your func>)
# df_measurement = apply_filter_functions(df_measurement, <your func>)

# Examination of the final distributions
-----------------------

The idea now is to combine all channels in one histogram and look at some specific variables. The aim is to ensure that there is sufficient agreement between the simulated samples and the measured data. In the previous visualizations, the simulated data were shown as histograms without any additional scaling.

In this section, histograms from simulated samples should be scaled to match the integrated luminosity of measured data, and a combination of the three decay channels into a single histogram is performed. This scaling and merging can be expressed as follows for each histogram bin:

$$N_{\mathrm{bin}} = \sum_{i\in\{4\mu, 4e, 2\mu2e\}} N_{\mathrm{bin},i}\frac{\mathcal{L}_{\mathrm{exp}}\sigma_i k}{N_{\mathrm{tot},i}} \quad (*)$$

Where $N_{\mathrm{tot},i}$ is the total number of events from the Monte Carlo simulation of the respective channel, $N_{\mathrm{bin},i}$ is the actual number of events in the considered histogram bin and $\sigma_i$ the cross-section of the respective channel.

The correction factor $k$ ($k=1$ for the signal simulation and $k=1.386$ for the background simulation) is a scaling factor specific to this simulation. It was introduced because the background was simulated only up to next-to-leading order (NLO) precision in QCD, whereas the analysis you perform requires next-to-next-to-leading order (NNLO) precision in QCD to account for all effects.

Instead of running a new simulation, it turned out that introducing this global correction factor is sufficient to account for the step NLO$\rightarrow$NNLO.

$\mathcal{L}_{\mathrm{exp}}$ is the integrated luminosity of the measured data set used.

To avoid confusion, it is important to point out that this scaling is not done event-wise but is applied to the complete histogram (or, more precisely, to the histograms of the individual channels).
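
The scaling and merging in $(*)$ can be sketched with `numpy` as follows. This is an illustration under stated assumptions, not the implementation behind the helper functions of this exercise: the function names are hypothetical, the luminosity, cross-section and event numbers are placeholders rather than real values, and the luminosity and cross-section are assumed to be given in matching units (for example $\mathrm{fb}^{-1}$ and $\mathrm{fb}$).

```python
import numpy as np

def scale_factor(lumi, cross_section, k_factor, n_total):
    """Per-channel weight L * sigma * k / N_tot from equation (*)."""
    return lumi * cross_section * k_factor / n_total

def combined_scaled_histogram(values_by_channel, factor_by_channel, bin_edges):
    """Histogram each channel, scale each histogram as a whole, and sum the channels."""
    combined = np.zeros(len(bin_edges) - 1)
    for channel, values in values_by_channel.items():
        counts, _ = np.histogram(values, bins=bin_edges)
        combined += counts * factor_by_channel[channel]
    return combined

# Placeholder inputs, not real cross-sections or event counts
toy_edges = np.linspace(100, 150, 6)  # five bins of 10 GeV
toy_values = {"4mu": [105, 126, 127], "4e": [124, 148], "2e2mu": [125, 131]}
toy_factors = {channel: scale_factor(lumi=10.0, cross_section=0.5, k_factor=1.0, n_total=100)
               for channel in toy_values}
toy_hist = combined_scaled_histogram(toy_values, toy_factors, toy_edges)
print(toy_hist)
```

In a real analysis each channel has its own cross-section and total event count, so the three factors generally differ; here they are identical only to keep the toy example short.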

Furthermore, only the Higgs boson signal simulation for a Higgs boson mass of 125 GeV is used here, in contrast to the publication, which uses simulated samples with different mass hypotheses. The reason why this simulation is the appropriate one, taking into account the existing, published measurement, will be explained in the second part of the task.

For the scaling in $(*)$ the helper function `mc_hist_scale_factor` will be used. The decay channel (`channel = "2e2mu" | "4e" | "4mu"`) and the considered process (`process = "background" | "signal"`) are passed as arguments.

In [1]:
from include.Helper import mc_hist_scale_factor as mhsf

mhsf(channel="2e2mu", process="background"), mhsf(channel="4e", process="signal")


<div class="alert-info">
Visualize the distribution of
    
   * the invariant mass of the four leptons,
   * the mass of the Z bosons,
   * some kinematic quantities of the leptons, the Z bosons, and the four-lepton system.

Indicate the uncertainties of the measurement in your visualization and compare them with the prediction from the simulated data sets. Where do you see deviations from the background process?

*Note*: Masks of different channels can be applied directly to build histograms and have, for example, the form `dataframe[dataframe.channel == "4e"]`.

</div>
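
A common, simple way to represent the statistical uncertainty of measured bin counts is the Poisson estimate $\sqrt{N}$ per bin. The sketch below shows this with made-up mass values; it is only one possible choice, it becomes a rough approximation for bins with very few events, and all variable names are hypothetical.

```python
import numpy as np

# Made-up four-lepton masses in GeV, standing in for measured events
toy_masses = np.array([104.2, 118.9, 122.5, 124.8, 125.3, 126.1, 127.7, 133.0, 141.6])

counts, edges = np.histogram(toy_masses, bins=5, range=(100, 150))
centers = 0.5 * (edges[:-1] + edges[1:])
errors = np.sqrt(counts)  # Poisson estimate: sqrt(N) per bin

# With matplotlib, plt.errorbar(centers, counts, yerr=errors, fmt="o")
# would draw the measured points with their uncertainties.
for center, count, error in zip(centers, counts, errors):
    print(f"{center:6.1f} GeV: {count} +- {error:.2f}")
```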

# Estimation of statistical significance
--------------------

You will get to know the idea of the determination of statistical significance in the second part of this exercise.


Nevertheless, a simple estimate for the significance can already be made here by:

$$ Z = \sqrt{-2\left( s+(s+b)\ln\left( \frac{b}{s+b} \right) \right)} \, ,$$

where $b$ is the number of background events and $s$ the number of signal events. The derivation of this formula for the significance $Z$ can be found in [**arXiv:1007.1727**](https://arxiv.org/abs/1007.1727).
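
The formula can be evaluated numerically with a few lines of Python. The following is a minimal sketch: the function name `estimated_significance` and the numbers in the example are hypothetical, and choosing suitable values of $s$ and $b$ from your histograms remains part of the task.

```python
import math

def estimated_significance(s, b):
    """Z = sqrt(-2 * (s + (s + b) * ln(b / (s + b)))) for s >= 0 and b > 0."""
    if b <= 0 or s < 0:
        raise ValueError("requires b > 0 and s >= 0")
    value = -2.0 * (s + (s + b) * math.log(b / (s + b)))
    # Rounding can make the value slightly negative when s is tiny
    return math.sqrt(max(value, 0.0))

# Illustrative numbers only, not results of this analysis
toy_significance = estimated_significance(s=10.0, b=100.0)
print(f"Z = {toy_significance:.3f}")
```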

<div class="alert alert-info">

* Where in this equation do you take the measured data into account implicitly?
* Estimate the significance of the Higgs boson with the mass of 125 GeV. What statements can you make about this value? 
* Derive a term for significance assuming that $s\ll b$ and interpret your result. Compare it with previous results.
* Does anything change if you use a different number of bins or consider a different mass interval and if so, why?
</div>

In [None]:
# necessary quantities
_, hist = h.variable("mass_4l", 15, (100, 150))
plt.show()
mc_signal, mc_background, measurement = hist.data["mc_sig"], hist.data["mc_bac"], hist.data["data"]

In [None]:
# your code for this task part