version 0.0.2.9.2

In [None]:
import sys
sys.path.append("..")

from include.RandomHelper import check_data_state
check_data_state()

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {return false;}

# Events
------------------

This section discusses the CMS detector in more detail. This matters because the data used here originate from this detector, and the cuts introduced on the data sets can be explained, for example, by the detector geometry. The visualization uses CMS's own web interface, [iSpy WebGL](https://github.com/cms-outreach/ispy-webgl).

The CMS detector is a multi-purpose detector and consists of several components, each designed to detect different types of particles. From the collection, select relevant events that could contain decays of the Higgs boson and that illustrate how the individual detector components work.

<div class="alert alert-info">
Possible decays contained in the collection are: 

- $\mathrm{H} \rightarrow \mathrm{ZZ} \rightarrow 4\ell$
- $\mathrm{H} \rightarrow \gamma \gamma$
- $\mathrm{H} \rightarrow \mathrm{W}^+ \mathrm{W}^- \rightarrow 2\ell 2\nu$

Select two examples of each decay and save them as images. What would the typical signature of each of these decays look like in the detector?
</div>

Individual detector components can be switched on and off, and individual events can be saved as images, directly within the web interface.



In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))

With an Internet connection:

In [None]:
%%html
<iframe src="https://ispy-webgl.web.cern.ch/ispy-webgl/" width="100%" height="700"></iframe>

Without an Internet connection: open the `index.html` from the [GitHub repository](https://github.com/cms-outreach/ispy-webgl) locally in a web browser.

This exercise studies in detail the decay into four leptons via two Z bosons; it is the only decay considered in this task. Of all leptons, only electrons and muons are analyzed, since tau leptons decay already in the tracker and therefore cannot be seen directly in the ECAL.

# Data format
------------------------

The data format used in this exercise, from which all required quantities are taken, is a modified CSV format. Its advantage is that it can be inspected by a human at any time without much effort.

The purpose of this section is to become familiar with the data format in order to carry out the tasks in the following sections.

Events are stored row-wise in the data set. The individual variables of an event are separated by ";", while the values of the individual leptons within a variable are separated, as usual, by ",". This allows a different number of leptons per event without introducing additional placeholders.
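To illustrate the two separators, here is a minimal sketch that reads a small in-memory table. The column names and values are invented for illustration and are not taken from the actual data sets; the helper `to_array` is likewise hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-event excerpt: ";" separates variables,
# "," separates the leptons within one variable.
raw = (
    "run;event;px;py\n"
    "1;101;10.5,-3.2,7.1,-20.0;1.0,2.5,-6.3,4.4\n"
    "1;102;5.5,-8.1;0.3,-1.2\n"
)

frame = pd.read_csv(io.StringIO(raw), delimiter=";")


def to_array(cell):
    """Convert a comma-separated string such as "1.0,2.5" into a float array."""
    return np.array(str(cell).split(","), dtype=float)


# Each event keeps its own number of leptons, no placeholders needed.
px_per_event = frame["px"].apply(to_array)
n_leptons = px_per_event.apply(len).tolist()
print(n_leptons)
```

The per-lepton columns are read as plain strings, which is why a conversion step is needed before any numerical work.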

In [None]:
%matplotlib inline
#%matplotlib qt
import numpy as np
import pandas as pd

from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

name_1 = "../data/for_long_analysis/mc_init/MC_2012_ZZ_to_4L_2el2mu_init.csv"
name_2 = "../data/for_long_analysis/mc_init/MC_2012_ZZ_to_4L_4mu_init.csv"

dataframe_1 = pd.read_csv(name_1, delimiter=";")
dataframe_2 = pd.read_csv(name_2, delimiter=";")

In [None]:
dataframe_1

In [None]:
dataframe_2

To process individual quantities, the respective elements can be converted from the string (`str`) format back into a list (`list`).
For this purpose, either the `ast` library or the `split` method of a `str` can be used:

In [None]:
import ast
px = ast.literal_eval(f"[{dataframe_2.loc[3, 'px']}]")
px = np.array(px, dtype=float)
px

In [None]:
py = dataframe_2.loc[3, 'py'].split(",")
py = np.array(py, dtype=float)
py

From here on new quantities like the transverse momentum can be determined:

In [None]:
pt = np.sqrt(px ** 2 + py ** 2)
pt

Now the cut on the transverse momentum can be applied. It ensures, for example, that the probability of misidentifying another object as a lepton is sufficiently small.

For muons, the transverse momentum must exceed a minimum value of 5 GeV, since lower values increase the probability of muon misidentification.
All muons that do not meet this condition must therefore be rejected.

In [None]:
pt_minimum_filter = pt > 5
pt_minimum_filter

In [None]:
pt = pt[pt_minimum_filter]
pt

The resulting filter must be applied, event by event, to every variable in the data set, and the processed event must then be saved again. Doing this manually is far too time-consuming, but an existing class automates the procedure.
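What this means for a single event can be sketched as follows. This is only an illustration under stated assumptions: the event is a plain dictionary with invented values, and `filter_event` is a hypothetical helper, not part of the provided framework.

```python
import numpy as np


def filter_event(event, mask, lepton_columns):
    """Apply one boolean mask to every per-lepton variable of a single event."""
    filtered = dict(event)
    for column in lepton_columns:
        values = np.array(event[column].split(","), dtype=float)
        # Keep only the leptons that passed the cut and store them as a string again.
        filtered[column] = ",".join(str(v) for v in values[mask])
    return filtered


# Hypothetical event with four leptons (values invented for illustration).
event = {"event": 101, "px": "10.5,-3.2,7.1,-20.0", "py": "1.0,2.5,-6.3,4.4"}

px = np.array(event["px"].split(","), dtype=float)
py = np.array(event["py"].split(","), dtype=float)
mask = np.sqrt(px ** 2 + py ** 2) > 5  # minimum pt cut for muons in GeV

filtered_event = filter_event(event, mask, ["px", "py"])
print(filtered_event)
```

The same mask is used for every per-lepton column, so the entries of the different variables stay aligned after the cut.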

# Application of the filters by using custom Apply Class
-----------------------------------

A filter is computed per event, applied to all variables within that event, and generally differs from the filter for the next event.
Applying the filters to the whole data set in sequence is already implemented and available via the `Apply` class.

In [None]:
from include.processing.Apply import Apply

When the `Apply` object is created, an "allowed" and a "calculation" class must also be passed. These classes are used to compute the variables and filters required when applying the filters.
For the minimum transverse momentum example, the two classes are simply collections of functions that could stand alone (hence `@staticmethod`) but are grouped logically into two classes.

In [None]:
class Calc_Start(object):
    
    @staticmethod
    def pt(px, py):
        return np.sqrt(px ** 2 + py ** 2)

class Allowed_Start(object):
    
    @staticmethod
    def min_pt(pt, look_for):
        return pt > 7.0 if look_for == "electron" else (
            pt > 5.0 if look_for == "muon" else None)

In [None]:
print(name_2)
process = Apply(input_=name_2, particle_type="muon", use_n_rows=10000,
                calc_instance=Calc_Start, allowed_instance=Allowed_Start,
                use_swifter=True, multi_cpu=False)

In the data set `name_2` the transverse momentum variable does not yet exist. It is calculated in an intermediate step that is already part of the filter `"check_min_pt"`, but it can also be added explicitly:

In [None]:
process.add_variable("pt")
process.data

The histogram of such a variable can be displayed by the method `hist_of_variable`.

In [None]:
process.hist_of_variable(variable="pt", bins=100, hist_range=(0, 80))

Applying the filter for the minimum transverse momentum, and subsequently removing all events that contain fewer than four leptons, yields the corresponding distribution:

In [None]:
process.filter(filter_name="check_min_pt")
process.hist_of_variable(variable="pt", bins=100, hist_range=(0, 80))

Other variables can also be visualized by the histogram.
The performed cut mainly influences the distribution of the transverse momentum, but changes little in the distribution of other variables.

<div class="alert alert-info">
Look at the distributions of some of the quantities in the records. Do any of them deviate from your expectations? If so, which ones and why?

You can also use the code fragments from this section for the tasks below.
</div>

In [None]:
# here goes the code

# Calculation of important variables
--------------------------------

<div class="alert alert-info">

Following the example of the transverse momentum, implement all quantities needed in the following steps and visualize them appropriately:
  * Pseudorapidity $\eta$
  * Transverse momentum $p_T$
  * Azimuthal angle $\phi$

Explain your observations.

(Check your implementations using the MC simulations of the background.)
</div>

The class skeleton below inherits the previous method for the transverse momentum; an initial class containing methods such as the reconstruction of Z-boson pairs is added afterwards.
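For reference, the standard collider definitions of the quantities named in the skeleton are collected here (in natural units, $c = 1$). Whether the provided framework expects exactly these conventions, for example the range of $\phi$ or which pairs of leptons enter $\Delta R$, is an assumption that should be checked against the MC distributions.

$$ p_T = \sqrt{p_x^2 + p_y^2}, \qquad \phi = \operatorname{atan2}(p_y, p_x), \qquad \eta = -\ln \tan\frac{\theta}{2} = \operatorname{artanh}\left(\frac{p_z}{|\vec{p}|}\right) $$

$$ \Delta R = \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}, \qquad m^2 = \Big(\sum_i E_i\Big)^2 - \Big|\sum_i \vec{p}_i\Big|^2 $$

Here $\theta$ is the polar angle with respect to the beam axis, and $\Delta\phi$ has to be mapped back into the interval $(-\pi, \pi]$ before it is used in $\Delta R$.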

In [None]:
class CalcStudent(Calc_Start):
    '''
    Class for the calculation of certain quantities that are used for
    the cuts or are essential for the reconstruction.
    '''

    @staticmethod
    def combined_charge(charge, combine_num):
        '''
        Tests whether an electrically neutral charge combination is possible.

        :param charge: ndarray
                       1D array containing data with "int" type.
        :param combine_num: int
                            4 if look_for is not "both", 2 else
        :return: bool
        '''
        # code

    @staticmethod
    def eta(px=None, py=None, pz=None, energy=None):
        '''
        Calculates the pseudorapidity.
        Optional with or without energy.

        :param px: ndarray
                   1D array containing data with "float" type.
        :param py: ndarray
                    1D array containing data with "float" type.
        :param pz: ndarray
                   1D array containing data with "float" type.
        :param energy: ndarray
                       1D array containing data with "float" type.
        :return: ndarray
                 1D array containing data with "float" type.
        '''
        # code

    @staticmethod
    def invariant_mass_square(px, py, pz, energy=None, eta=None, phi=None):
        '''
        Calculates the square of the invariant mass.
        Optional with or without energy.
        Optionally with or without eta and phi.

        :param phi: ndarray
                    1D array containing data with "float" type.
        :param eta: ndarray
                    1D array containing data with "float" type.
        :param px: ndarray
                   1D array containing data with "float" type.
        :param py: ndarray
                   1D array containing data with "float" type.
        :param pz: ndarray
                   1D array containing data with "float" type.
        :param energy: ndarray
                       1D array containing data with "float" type.
        :return: ndarray
                 1D array containing data with "float" type.
        '''
        # code

    @staticmethod
    def phi(px, py):
        '''
        Calculation of the angle phi.

        :param px: ndarray
                   1D array containing data with "float" type.
        :param py:  ndarray
                    1D array containing data with "float" type.
        :return: ndarray
                 1D array containing data with "float" type.
        '''
        # code

    @staticmethod
    def delta_phi(phi1, phi2):
        '''
        Calculation of the difference between two phi angles.

        :param phi1: ndarray
                     1D array containing data with "float" type.
        :param phi2: ndarray
                     1D array containing data with "float" type.
        :return: ndarray
                 1D array containing data with "float" type.
        '''
        # code

    @staticmethod
    def delta_r(eta, phi):
        '''
        Calculation of delta_r.

        :param eta: ndarray
                    1D array containing data with "float" type.
        :param phi: ndarray
                    1D array containing data with "float" type.
        :return: ndarray
                 1D array containing data with "float" type.
        '''
        # code

# Creating the cuts
------------------------------

Analogous to the cut on the minimum transverse momentum, further cuts can be introduced and applied later. The cuts used by CMS can be found in the [official publication](https://arxiv.org/pdf/1207.7235.pdf).

<div class="alert alert-info">

Implement the methods listed in the class `AllowedStudent`. Estimate suitable cut values using the distributions from the chapter **Application of the filters by using custom Apply Class**, the **detector** and the **kinematic constraints** of the events.

(Check your implementations using the MC simulations of the background).
    
</div>

In [None]:
class AllowedStudent(Allowed_Start):
    '''
    Class that introduces certain cuts and thus restricts the leptons in the events.
    '''
    
    @staticmethod
    def delta_r(delta_r):
        '''
        Checks if delta_r is smaller than the allowed value.

        :param delta_r: ndarray
                        1D array containing data with `float` type.
        :return: ndarray
                 1D array containing data with `bool` type.
        '''
        # code

    @staticmethod
    def rel_pf_iso(rel_pf_iso):
        '''
        Checks if rel_pf_iso is smaller than the allowed value.

        :param rel_pf_iso: ndarray
                           1D array containing data with `float` type.
        :return: ndarray
                 1D array containing data with `bool` type.
        '''
        # code
        
    @staticmethod
    def misshits(misshits):
        '''
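        Checks if the number of missing hits is within the allowed limit.

        :param misshits: ndarray
                         1D array containing the number of missing hits per lepton.
        :return: ndarray
                 1D array containing data with `bool` type.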
        '''
        # code
        
    @staticmethod
    def pt(p_t, look_for, coll_size=4):
        '''
        Checks if the exact condition regarding pt is fulfilled.
        (>20 GeV: >= 1; >10 GeV: >= 2; >Minimum pt: >= 4).

        :param p_t: ndarray
                    1D array containing data with `float` type.
        :param look_for: str
                         "muon"; "electron" or "both"
        :param coll_size: int
                          4 if look_for is not "both", 2 else
        :return: ndarray
                 1D array containing data with `bool` type.
        '''
        # code
        
    @staticmethod
    def eta(eta, look_for):
        '''
        Checks if the pseudorapidity of leptons is valid.

        :param eta: ndarray
                    1D array containing data with "float" type.
        :param look_for: str
                         "muon"; "electron" or "both"
        :return: ndarray
                 1D array containing data with "bool" type.
        '''
        # code

    @staticmethod
    def lepton_type(typ, look_for):
        '''
        Checks for the permitted classification of leptons.

        :param typ: ndarray
                    1D array containing data with "float" type.
        :param look_for: str
                         "muon"; "electron" or "both"
        :return: ndarray
                 1D array containing data with "bool" type.
        '''
        # code

    @staticmethod
    def impact_param(sip3d, dxy, dz):
        '''
        Checks if the impact parameters of the collision are valid and sorts out
        events that do not have a clear and equal collision point.

        :param sip3d: ndarray
                      1D array containing data with "float" type.
        :param dxy: ndarray
                    1D array containing data with "float" type.
        :param dz: ndarray
                   1D array containing data with "float" type.
        :return: ndarray
                 1D array containing data with "bool" type.
        '''
        # code

    @staticmethod
    def zz(z1, z2):
        '''
        Checks if the Z1 candidate and the Z2 candidate are within the allowed range.

        :param z1: float
        :param z2: float
        :return: bool
        '''
        # code

Combination of the implemented code with the partially provided classes:

In [None]:
from include.processing.CalcAndAllowerInit import AllowedInit
from include.processing.CalcAndAllowerInit import CalcInit

AllowedInit.a_allowed_instance = AllowedStudent
AllowedInit.a_calc_instance = CalcStudent
class Allowed(AllowedStudent, AllowedInit):
    pass
    
CalcInit.c_allowed_instance = Allowed
CalcInit.c_calc_instance = CalcStudent
class Calc(CalcStudent, CalcInit):
    pass

# Application of filters and reconstruction on MC simulations
---------------------------------------

The implementations of the cuts are tested using the background MC simulations. This prevents the analysis from being tuned towards a particular outcome in the actual measurement: since only a small number of events is expected there, it would otherwise be easy to select specific events in a region subjectively.

Provided the sections above have been completed, the task of this section is to find a meaningful sequence of the functions defined above. If multiple CPU cores are available, using them is recommended, because the filter and reconstruction time decreases roughly in proportion to the number of cores. It is nevertheless possible to run everything on a single core; it just takes a little longer.

For the routine described below, the information can be summarized in tuples:

In [None]:
from include.processing.ApplyHelper import ProcessHelper


# All background MC and signal MC for m_H = 125 GeV
mc_files = True
# actual measurement
run_files = False

# folder with initial records: measurement
dir_measurement = "../data/for_long_analysis/ru_init/"
# folder with initial records: background MC and signal (125 GeV) MC
dir_mc = "../data/for_long_analysis/mc_init/"

file_tuples = []
if mc_files:
    file_tuples += ProcessHelper.create_tuple(dir_mc)    

if run_files:
    file_tuples += ProcessHelper.create_tuple(dir_measurement)

A single such `namedtuple` contains the file and particle type of the data set (necessary for `Apply`):

In [None]:
file_tuples[0]

With the newly added methods for the different quantities and the created cuts, a routine with a logical order can be built that collects all necessary steps. A possible example of such a routine is shown below. Within this routine, the order can and should be varied to a certain degree.

In [None]:
# possible operations
Apply.help()

def filter_and_reco_process(used_pair):
    process = Apply(input_=used_pair.name, 
                    particle_type=used_pair.particle, 
                    multi_cpu=False, use_swifter=True,
                    calc_instance=Calc, 
                    allowed_instance=Allowed)
    
    # Logical order selection
    # quicksave: Saves the data set AFTER the application 
    # of the filter or reconstruction step
    
    process.filter(filter_name="check_type", quicksave=ProcessHelper.change_on_affix(used_pair.name, "aftT"))
    
    # ProcessHelper.change_on_affix("Name_OldAffix.csv", "NewAffix"):
    # -> This method changes "Name_OldAffix.csv" to "Name_NewAffix.csv"
    #    and saves "Name_NewAffix.csv" in the new directory <ru or mc>_NewAffix
    
    
    process.filter(filter_name="check_q")
    process.filter(filter_name="check_q")
    process.filter(filter_name="check_min_pt")
    process.filter(filter_name="check_impact_param")
    process.filter(filter_name="check_q")
    process.filter(filter_name="check_exact_pt")
    process.filter(filter_name="check_m_2l")
    process.filter(filter_name="check_rel_iso")
    process.filter(filter_name="check_q")
    if process.particle_type != "muon":
        process.filter(filter_name="check_misshit")
    process.filter(filter_name="check_q")
    process.filter(filter_name="check_eta")
    process.filter(filter_name="check_q")
    process.filter(filter_name="check_m_4l", quicksave=ProcessHelper.change_on_affix(used_pair.name, "befZ"))
    process.reconstruct(reco_name="zz", quicksave=ProcessHelper.change_on_affix(used_pair.name, "aftZ"))
    process.reconstruct(reco_name="mass_4l_out_zz", quicksave=ProcessHelper.change_on_affix(used_pair.name, "aftH"))
    del process

<div class="alert alert-info">

Select the appropriate sequence of filter and reconstruction steps.

Do some filter steps have to be performed several times? And why is it not useful to perform the reconstruction of the ZZ bosons very early?

(If you are interested in a time-optimized version using the functions available here, the speed of the individual filters can be measured as follows: use a data set of the background MC simulations and initialize the `Apply` instance with the `kwargs` `multi_cpu=False, use_swifter=True`. A progress display then shows how many rows (events) are processed per second.)
    
</div>

The application of `filter_and_reco_process` to the records: 

In [None]:
from tqdm import tqdm
from IPython.display import clear_output

if input("Run all filter + reco (y/n): ") == "y":
    for pair in tqdm(file_tuples):
        filter_and_reco_process(pair)
        clear_output()

# Examination of the final distributions
-----------------------

After performing the filtering and reconstruction step, the final distributions of individual quantities can now be examined. Here a distinction is made between the signal MC and the background MC. The signal MC simulations used here are those of a Higgs boson with a mass of 125 GeV. The reason why this simulation is the appropriate one - taking into account the existing measurement - will be explained in the second part of the task, which you will encounter in the course TP2.

<div class="alert alert-info">

Consider at least the distribution of the four-lepton invariant mass as well as the masses of the two Z bosons.
    
</div>

Furthermore other quantities can also be displayed, although it is questionable whether `z1_index`, `z2_index`, as well as `z1_tag` or `z2_tag` are quantities that require detailed visual observation.

In [None]:
from include.processing.ApplyHelper import ProcessHelper
print(ProcessHelper.print_possible_variables("../data/for_long_analysis/mc_aftH/MC_2012_H_to_ZZ_to_4L_2el2mu_aftH.csv"))

In [None]:
from include.histogramm.HistOf import HistOf
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams["figure.figsize"] = (12, 9)
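# NOTE: `info` is not defined anywhere in this notebook;
# it must be provided before running this cell.
h = HistOf(mc_dir="../data/for_long_analysis/mc_aftH",
           ru_dir="../data/for_long_analysis/ru_aftH", info=info)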

In [None]:
# example
h.variable("energy", 50, (0, 200))
ax = plt.gca()
ax.set_xlabel(r"$E$ in GeV")
ax.set_ylabel("Entries")
plt.show()

The uncertainties shown here are asymmetric. The measured number of events in each bin is treated as Poisson distributed: starting from the expected value, a lower limit (-34%) and an upper limit (+34%) are determined so that together they span a 68% uncertainty interval. The asymmetry of the Poisson distribution is clearly visible for small expected values. Only in the limit of large expected values does the Poisson distribution approach a Gaussian distribution, and the uncertainties on a measured value become symmetric.
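How such an asymmetric interval might be obtained can be sketched as follows. This is a minimal illustration, assuming a central interval with about 16% probability in each tail; it is not necessarily the procedure used inside `HistOf`, and the helper names `poisson_cdf`, `solve` and `poisson_interval` are hypothetical.

```python
import math


def poisson_cdf(n, mu):
    """Return P(X <= n) for X ~ Poisson(mu)."""
    if n < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total


def solve(func, low, high, iterations=200):
    """Bisection for func(mu) = 0, assuming a sign change on [low, high]."""
    f_low = func(low)
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        f_mid = func(mid)
        if (f_mid > 0) == (f_low > 0):
            low, f_low = mid, f_mid
        else:
            high = mid
    return 0.5 * (low + high)


def poisson_interval(n, cl=0.6827):
    """Central interval for the Poisson mean given n observed events."""
    alpha = 1.0 - cl
    high_bound = n + 10.0 * math.sqrt(n + 1.0) + 10.0
    if n == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= n | mu) = alpha / 2
        lower = solve(lambda mu: 1.0 - poisson_cdf(n - 1, mu) - alpha / 2,
                      0.0, high_bound)
    # Upper limit: P(X <= n | mu) = alpha / 2
    upper = solve(lambda mu: poisson_cdf(n, mu) - alpha / 2, 0.0, high_bound)
    return lower, upper


low_1, up_1 = poisson_interval(1)
low_100, up_100 = poisson_interval(100)
print(low_1, up_1, low_100, up_100)
```

For a single observed event the interval is roughly [0.17, 3.30] and thus clearly asymmetric, whereas for 100 observed events the two sides differ by only about 10%.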

In [None]:
# other histograms

# First estimation of statistical significance

The detailed determination of the statistical significance is covered in the second part. However, a rough estimate can be made with $$ Z \approx \frac{s}{\sqrt{b}}, $$ where $b$ is the number of expected background events taken from the MC simulation. The signal $s$ is taken as the difference between the total number of measured events and the expected background (again taken from the MC simulation).
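As a small sketch of the arithmetic, the estimate can be written as a function. The counts used here are invented for illustration and are not taken from the data sets of this exercise; `naive_significance` is a hypothetical helper name.

```python
import math


def naive_significance(n_observed, n_background):
    """Rough significance Z ~ s / sqrt(b) with s = n_observed - n_background."""
    if n_background <= 0:
        raise ValueError("expected background must be positive")
    signal = n_observed - n_background
    return signal / math.sqrt(n_background)


# Hypothetical counts in a mass window, for illustration only.
z_value = naive_significance(n_observed=25, n_background=16.0)
print(z_value)
```

With the real histograms, the counts would be the sums of the bin contents of the measurement and of the background MC in the chosen mass window.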


<div class="alert alert-info">

Use the above formula to estimate the significance of the Higgs boson with the mass of 125 GeV. What statements can you make about the value? What problems does this kind of estimation have and why is it not really suitable for a quantitative statement?
    
</div>

In [None]:
# useful code
_, hist = h.variable("mass_4l", 15, (100, 150))
mc_sig, mc_bac, measurement = hist.data["mc_sig"], hist.data["mc_bac"], hist.data["data"]

In [None]:
# your code