This notebook is intended as a more open-ended variation of the previous exercise sheet for students. It does not reproduce the complete exercise sheet but only shows the technical details that have changed.

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from uproot3_methods.classes.TLorentzVector import TLorentzVector
from include.Particle import Lepton

from include.QuantityAccessors import ParticleQuantitySeriesAccessor  # 'quantity'
from include.Helper import load_dataset, save_dataset

Let's start with the data set. The individual lepton quantities are no longer stored as lists (or as string representations of lists); instead, each event holds a list of lepton objects that carry these quantities. The data set has already been reduced to events in the plateau range [105, 155] GeV, which effectively removes the peak at the Z mass and the rise above 180 GeV.

In [None]:
df = load_dataset("./data/for_long_analysis/mc_init/MC_2012_ZZ_to_4L_[105,155].csv")
df

`run_event_lumisec` remains the event identifier. `channel` is the decay channel; the channels are no longer split into separate files as before. `leptons` contains the leptons mentioned above. `z1`, `z2` and `four_lep` are the already reconstructed four-vectors of the two Z bosons and of the four-lepton system. `changed` is an auxiliary column for the later (re)reconstruction when the number of leptons changes.

In [None]:
print(df.leptons.iloc[0])
print(df.z1.iloc[0])

The attributes of each `Lepton` are its own plus any additional ones provided by the `TLorentzVector`. Students therefore do not have to implement all the required kinematic quantities *again*.

In [None]:
print([it for it in dir(Lepton) if not it.startswith("_")])

The `x`, `y`, `z`, `t` attributes of the `uproot3-methods` implementation correspond to `px`, `py`, `pz`, `E` in the `ROOT` implementation and can be accessed accordingly. In addition, there are lepton-specific attributes such as the charge, the flavour, or the impact parameters.

In [None]:
print(df.leptons[0][0].charge)
print(df.leptons[0][0].flavour)
print(df.leptons[0][0].dxy, df.leptons[0][0].dz, df.leptons[0][0].sip3d)
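
To make the correspondence between (`x`, `y`, `z`, `t`) and (`px`, `py`, `pz`, `E`) concrete, here is a minimal sketch of how the kinematic quantities follow from the four components. `ToyFourVector` is a hypothetical stand-in written for this illustration only; it is not the `TLorentzVector` of `uproot3-methods`, and natural units (c = 1, everything in GeV) are assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyFourVector:
    """Hypothetical stand-in with components (x, y, z, t) = (px, py, pz, E)."""
    x: float
    y: float
    z: float
    t: float

    @property
    def pt(self):
        # transverse momentum: length of the momentum in the x-y plane
        return math.hypot(self.x, self.y)

    @property
    def eta(self):
        # pseudorapidity; undefined for pt == 0 (raises ZeroDivisionError here)
        return math.asinh(self.z / self.pt)

    @property
    def mass(self):
        # invariant mass from m^2 = E^2 - |p|^2; a negative m^2 is clipped to zero
        m2 = self.t**2 - (self.x**2 + self.y**2 + self.z**2)
        return math.sqrt(max(m2, 0.0))

    def __add__(self, other):
        # four-vectors add component-wise
        return ToyFourVector(self.x + other.x, self.y + other.y,
                             self.z + other.z, self.t + other.t)


lep1 = ToyFourVector(3.0, 4.0, 0.0, 5.0)    # massless, pt = 5
lep2 = ToyFourVector(-3.0, -4.0, 0.0, 5.0)  # back-to-back partner
pair = lep1 + lep2
print(lep1.pt, lep1.eta, pair.mass)  # 5.0 0.0 10.0
```

The sum of two four-vectors is again a four-vector, which is the same principle behind `_df.z1 + _df.z2` later in the notebook.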

To avoid loops in the code, an additional accessor (`quantity`) for the pandas series is introduced:

In [None]:
df.leptons.quantity.pt

In the example above, the `pd.Series` contains __all__ leptons in the data set. To look at only a subset, for example the first two muons of each event, `df.leptons.quantity.pt` can be extended to `df.leptons.quantity(flavour="m", lep_num=[0, 1]).pt`. The four-vectors of the reconstructed objects work analogously (`df.z1.quantity.mass`), but the `quantity()` call variant makes no sense there because each event has only one `z1`. An example application:

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(30, 5))

df.leptons.quantity.pt.hist(bins=10, range=(0, 100), ax=axes[0], label="All Leptons of all events")
axes[0].set_xlabel(r"$p_T$ in GeV")
axes[0].set_ylabel(r"$Leptons$")
axes[0].legend()

df.leptons.quantity(flavour="m", lep_num=[0]).eta.hist(bins=10, range=(-3.5, 3.5), ax=axes[1], label="First muon of all events")
axes[1].set_xlabel(r"$\eta$")
axes[1].set_ylabel(r"$Leptons$")
axes[1].legend()

df.leptons.quantity(lep_num=[0, 2]).relpfiso.hist(bins=20, range=(0, 2), ax=axes[2], label="First and third lepton of all events")
axes[2].set_xlabel(r"relative isolation")
axes[2].set_ylabel(r"$Leptons$")
axes[2].legend()

df.z1.quantity.mass.hist(bins=20, range=(0, 120), ax=axes[3], label="Z1", histtype="step")
df.z2.quantity.mass.hist(bins=20, range=(0, 120), ax=axes[3], label="Z2", histtype="step")
axes[3].set_xlabel(r"$m_{2\ell}$ in GeV")
axes[3].set_ylabel(r"$Events$")
axes[3].legend()

plt.show()
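
For readers curious how such an accessor can be built, the following is a rough sketch using `pd.api.extensions.register_series_accessor`. It is not the implementation from `include.QuantityAccessors`: the names `toy_quantity` and `ToyQuantityAccessor` are hypothetical, the leptons are simple stand-in objects, and it is assumed that `lep_num` indexes the leptons *after* the flavour selection.

```python
from types import SimpleNamespace

import pandas as pd


@pd.api.extensions.register_series_accessor("toy_quantity")
class ToyQuantityAccessor:
    """Hypothetical accessor: collects one attribute from all leptons of all events."""

    def __init__(self, series, flavour=None, lep_num=None):
        self._series = series
        self._flavour = flavour
        self._lep_num = lep_num

    def __call__(self, flavour=None, lep_num=None):
        # return a new accessor so the selection does not leak into later calls
        return ToyQuantityAccessor(self._series, flavour, lep_num)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        values = []
        for leptons in self._series:
            selected = [lep for lep in leptons
                        if self._flavour is None or lep.flavour == self._flavour]
            if self._lep_num is not None:
                selected = [selected[i] for i in self._lep_num if i < len(selected)]
            values.extend(getattr(lep, name) for lep in selected)
        return pd.Series(values, name=name)


toy_events = pd.Series([
    [SimpleNamespace(flavour="m", pt=30.0), SimpleNamespace(flavour="e", pt=12.0)],
    [SimpleNamespace(flavour="m", pt=8.0)],
])
print(toy_events.toy_quantity.pt.tolist())                              # [30.0, 12.0, 8.0]
print(toy_events.toy_quantity(flavour="m", lep_num=[0]).pt.tolist())    # [30.0, 8.0]
```

The loop over the events still exists, but it is hidden inside the accessor, so the notebook code itself stays loop-free.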

Now we come to the filters, which the students still implement themselves. Since the filter "class" (a logical collection of functions used for the same task) no longer needs to be passed to an Apply class, it can be made somewhat simpler. The main idea is that the students consider only single leptons in their implementation (for example `lambda lep: lep.pt > 5`). It would also be possible to replace every filter that the students are supposed to implement with `lambda lep: True`, which lets the whole exercise sheet run through without any filter being applied.

In [None]:
import itertools

# Helper function for an event-based filter:
# checks whether any combination of `num` leptons in an event has a total electric charge of zero.
def valid_charge_combination(charge, num=4):
    for combination in itertools.combinations(charge, num):
        if np.sum(combination) == 0:
            return True
    return False


class Filter(object):
    
    # Task-oriented collection of functions that are used for the filter step
    
    # --- Helper functions ---
    
    # Helper function 1
    @staticmethod
    def check_min_lepton_number(row):
        """
        Checks if the minimum number of leptons in an event is guaranteed.

        row: Pandas.Series
        """
        
        if row.channel != "2e2mu":
            return len(row.leptons) >= 4
        
        if row.channel == "2e2mu":
            _flavour_sublist = lambda x: [1 for lepton in row.leptons if lepton.flavour == x]
            return (len(_flavour_sublist("e")) >= 2) and (len(_flavour_sublist("m")) >= 2)
    
    # Helper function 2
    @staticmethod
    def generic_lepton_filter(row, operation):
        """
        Does the repetitive application of a mask to the list of leptons.

        row: Pandas.Series
        operation: filter instruction for a lepton;
                   Simple function when no distinction between flavors is necessary
                   dict {"e": func, "m": func}, if a distinction between flavours is necessary
        """
        
        if isinstance(operation, dict):  # distinction between flavors
            _mask = np.array([operation[lep.flavour](lep) for lep in row.leptons])
        
        else:  # simple function
            _mask = np.array([operation(lep) for lep in row.leptons])
        
        row.leptons = row.leptons[_mask]  # applying the created _mask
        
        if not Filter.check_min_lepton_number(row):
            row.run_event_lumisec = np.nan  # skipping rows with np.nans and dropping at the end
            return row
        
        if False in _mask:  # Change for new reconstruction of Z bosons
            row.changed = True
        
        return row
    
    # --- exemplary lepton filter ---
    
    @staticmethod
    def pt_min(row):
        """
        Apply the minimum threshold condition of transverse momentum for every lepton

        row: Pandas.Series
        """
        
        if str(row.run_event_lumisec) == "nan":  # skip if the event was thrown away by previous filter
            return row
        
        _v = {"m": lambda lep: lep.pt > 5, "e": lambda lep: lep.pt > 7}  # lepton based application
        
        return Filter.generic_lepton_filter(row, _v)
    
    # --- exemplary event filter
    
    @staticmethod
    def electrinc_charge(row):
        
        """
        Checks whether a charge-neutral combination of leptons is possible

        row: Pandas.Series
        """
        
        if str(row.run_event_lumisec) == "nan":  # skip if the event was thrown away by previous filter
            return row
            
        # A distinction between the mixed channel
        if row.channel != "2e2mu":  # combination of four leptons
            if not valid_charge_combination([lep.charge for lep in row.leptons], 4):
                row.run_event_lumisec = np.nan
        
        if row.channel == "2e2mu":  # combination of two lepton pairs with the same flavour
            if not valid_charge_combination([lep.charge for lep in row.leptons if lep.flavour == "m"], 2) or \
                    not valid_charge_combination([lep.charge for lep in row.leptons if lep.flavour == "e"], 2):
                row.run_event_lumisec = np.nan
            
        return row
    
    # ------- This part is implemented by the students
    
    @staticmethod
    def pseudorapidity(row):
        """
        Checks if the threshold on Eta is fulfilled by the leptons in an event. 

        row: Pandas.Series
        """
        
        if str(row.run_event_lumisec) == "nan":  # skip if the event was thrown away by previous filter
            return row
        
        # The following line(s) are to be implemented by the students; the same applies to all subsequent filters.
        _v = {"m": lambda lep: abs(lep.eta) < 2.4,
              "e": lambda lep: (abs(lep.eta) < 1.479) or (1.653 < abs(lep.eta) < 2.5)}
        
        return Filter.generic_lepton_filter(row, _v)
    
    @staticmethod
    def relative_isolation(row):
        """
        Checks if the threshold on relative isolation is fulfilled by the leptons in an event. 

        row: Pandas.Series
        """
        
        if str(row.run_event_lumisec) == "nan":  # skip if the event was thrown away by previous filter
            return row
        
        return Filter.generic_lepton_filter(row, lambda lep: lep.relpfiso < 0.4)
    
    @staticmethod
    def impact_parameter(row):
        """
        Checks if the threshold on impact parameter is fulfilled by the leptons in an event. 

        row: Pandas.Series
        """
        
        if str(row.run_event_lumisec) == "nan":  # skip if the event was thrown away by previous filter
            return row
        
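        # abs() guards against signed impact parameters; it has no effect if the values are already non-negative
        return Filter.generic_lepton_filter(row, lambda lep: (abs(lep.sip3d) < 4) and (abs(lep.dxy) < 0.5) and (abs(lep.dz) < 1))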
    
    @staticmethod
    def pt_exact(row):
        """
        Checks if the existing leptons in an event fulfill the exact requirement for the transverse momentum.

        row: Pandas.Series
        """
        
        # use of pt_min beforehand is necessary
        
        if str(row.run_event_lumisec) == "nan":  # skip if the event was thrown away by previous filter
            return row
        
        _pt = np.array([lep.pt for lep in row.leptons])
        
        # require at least one lepton with pt > 20 and at least two leptons with pt > 10
        if not ((np.sum(_pt > 20) >= 1) and (np.sum(_pt > 10) >= 2)):
            row.run_event_lumisec = np.nan
        
        return row

If students feel the need to create additional custom filters that do not appear in the list, they can do so simply by creating a new `@staticmethod` or a standalone function using the same principle as shown above. This would also be the only "larger" code block in the notebook.
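
As a minimal, self-contained sketch of that principle, the following applies a per-lepton condition to one toy event. Everything here is a stand-in for illustration: `toy_lepton_filter` is a hypothetical simplification of `generic_lepton_filter`, the event is a `SimpleNamespace` rather than a row of the real data set, and the thresholds are the ones from `pt_min`.

```python
from types import SimpleNamespace

import numpy as np


def toy_lepton_filter(event, operation):
    """Keep only the leptons of `event` that pass `operation`.

    operation: callable lep -> bool, or dict {"e": callable, "m": callable}
    """
    if isinstance(operation, dict):  # flavour-dependent condition
        mask = np.array([operation[lep.flavour](lep) for lep in event.leptons], dtype=bool)
    else:  # same condition for all flavours
        mask = np.array([operation(lep) for lep in event.leptons], dtype=bool)
    event.leptons = event.leptons[mask]                # boolean indexing drops failing leptons
    event.changed = event.changed or not mask.all()    # remember that a lepton was removed
    return event


event = SimpleNamespace(
    leptons=np.array([SimpleNamespace(flavour="m", pt=6.0),
                      SimpleNamespace(flavour="e", pt=6.0),
                      SimpleNamespace(flavour="e", pt=9.0)], dtype=object),
    changed=False,
)
event = toy_lepton_filter(event, {"m": lambda lep: lep.pt > 5, "e": lambda lep: lep.pt > 7})
print([(lep.flavour, lep.pt) for lep in event.leptons], event.changed)
# [('m', 6.0), ('e', 9.0)] True
```

A custom filter then only has to supply the single-lepton condition; the masking and bookkeeping stay in one place.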

After implementing the filters, you can choose an order in the form of a list. The reading direction is from left to right.

In [None]:
filter_steps = [Filter.electrinc_charge, Filter.relative_isolation, Filter.impact_parameter, Filter.pt_min, Filter.pt_exact, Filter.pseudorapidity, Filter.electrinc_charge]

In [None]:
from include.Helper import pipeline
from copy import deepcopy

The application of the filters can again be shown to the students using `pt_min` as an example, as is the case in the previous exercise sheet.

The pandas `apply` variant always processes the __entire__ data set and then discards all entries that contain an `np.nan`, so the data set is reduced step by step. Since the length of the lists in the `leptons` column varies from event to event, vectorised pandas operations cannot really be used here. That would require a change in the data structure, which in turn would make the filtering less intuitive.

In [None]:
def not_so_effective_filter_process(dataframe, filter_list):
    _df = deepcopy(dataframe)    
    for func in filter_list:
        print(f"Doing {func.__name__}...")
        _df = _df.apply(lambda x: func(x), axis=1)
        _df = _df.dropna()
    return _df
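
For comparison, the change of data structure mentioned above could look roughly like the following sketch: one row per lepton instead of one row per event, followed by a `groupby` to get back to events. The column names, the toy values and the function `events_with_min_leptons` are assumptions made for this illustration and do not exist in the notebook's data set.

```python
import pandas as pd


def events_with_min_leptons(events, pt_min, n_min):
    """Return the ids of events with at least `n_min` leptons above `pt_min`."""
    flat = events.explode("pt").astype({"pt": float})  # one row per lepton
    passed = flat[flat["pt"] > pt_min]                 # vectorised lepton cut
    n_left = passed.groupby("event").size()            # leptons left per event
    return n_left[n_left >= n_min].index.tolist()


toy_events = pd.DataFrame({
    "event": [1, 2],
    "pt": [[30.0, 4.0, 12.0, 9.0, 6.0], [25.0, 3.0, 8.0, 2.0]],
})
print(events_with_min_leptons(toy_events, pt_min=5, n_min=4))  # [1]
```

This avoids the row-wise `apply`, but the link between a lepton and its event is now only implicit in the `event` column, which is arguably harder for students to follow.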

If we instead implement the idea of a pipeline, i.e. roughly the form `f(g(h(x)))`, where `x` is an iterator over all rows of the data set and `f`, `g`, `h` are the filter functions, this can also be written in a familiar way. The only drawback is that one row has to pass through the whole pipeline before the next row can start. Therefore, the imported `pipeline(dataframe, function_list, buffer_size)` function dispatches a threaded variant of `f(g(h(x)))`, in which the next row enters the pipeline as soon as the previous one has moved on to the next stage. For full details, see the code and comments in `include.Helper.py`. For less interested students, the import of the function and a short explanation of how it operates would suffice.

__Note__: MyBinder does not seem to support the threaded variant, so performance is even slower there. The whole thing should therefore be tested once on the target architecture (on the local computer used for testing, there was a minimal improvement).

If there is no significant improvement, this variant might not need to be brought up at all, or should only be shown as an alternative.
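
The general idea of a threaded pipeline can be sketched as follows. This is not the code from `include.Helper`; `toy_pipeline` and its helpers are hypothetical names, and the sketch assumes one worker thread per function connected by bounded queues. It also does no error handling: an exception inside a stage would leave the consumer waiting.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def _stage(func, inbox, outbox):
    """Apply `func` to every item from `inbox` and pass the result on to `outbox`."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def toy_pipeline(items, functions, buffer_size=10):
    """Yield the result of applying `functions` (left to right) to each item."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(functions) + 1)]

    def _feed():
        for item in items:
            queues[0].put(item)  # blocks while the first buffer is full
        queues[0].put(_DONE)

    threads = [threading.Thread(target=_feed, daemon=True)]
    threads += [threading.Thread(target=_stage, args=(func, queues[i], queues[i + 1]), daemon=True)
                for i, func in enumerate(functions)]
    for thread in threads:
        thread.start()

    while True:
        result = queues[-1].get()
        if result is _DONE:
            break
        yield result


def add_one(x):
    return x + 1


def double(x):
    return 2 * x


print(list(toy_pipeline(range(5), [add_one, double], buffer_size=2)))  # [2, 4, 6, 8, 10]
```

Because every stage is a single thread reading from a first-in, first-out queue, the order of the rows is preserved. Whether this is faster than a plain loop depends on how much of the work releases Python's global interpreter lock.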

In [None]:
def more_effective_filter_process(dataframe, filter_list, buffer_size=10):
    _df = deepcopy(dataframe)
    processed_events = (event for event in pipeline(_df, filter_list, buffer_size=buffer_size))
    _df = pd.DataFrame(pd.concat((event for i, event in processed_events), ignore_index=True, axis=1).T)
    _df = _df.dropna()
    return _df

In [None]:
%%time
df_after_filter = not_so_effective_filter_process(df, filter_steps)

In [None]:
%%time
more_effective_filter_process(df, filter_steps, 5)

After the filter steps, there are events that, for example, had five leptons before and lost one. For these, `row.changed` was set to `True`. The subsequent (re)reconstruction covers only these cases and skips all unchanged rows. Since the reconstruction time is only an issue for the MC, which simply contains a lot of events, this is one way to make the rather time-consuming reconstruction (see `include.ReconstructionFunctions.py` for details) more efficient. As before, the function itself would be hidden from the students.

In [None]:
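from tqdm.auto import tqdm
from include.ReconstructionFunctions import reconstruct_zz

tqdm.pandas()  # registers `progress_apply` on pandas objects (harmless if already done elsewhere)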

def reconstruct(dataframe, pt_exact=None):
    _df = deepcopy(dataframe)
    _df = _df.progress_apply(lambda x: reconstruct_zz(x, pt_dict=pt_exact), axis=1, raw=True)
    _df = _df.dropna()
    _df.four_lep = _df.z1 + _df.z2
    return _df

In [None]:
%%time
df_after_filter = reconstruct(df_after_filter, pt_exact={1:20, 2:10, 3:0, 4:0})
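
The effect of the `changed` flag can be illustrated with a toy example. The sketch below is not `reconstruct_zz`; `recompute_if_changed`, `toy_reconstruction` and the column names other than `changed` are hypothetical, and the "reconstruction" is a placeholder calculation. It only shows that the expensive function is called for flagged rows and skipped for all others.

```python
import pandas as pd


def recompute_if_changed(row, expensive_function):
    """Run `expensive_function` only for rows whose `changed` flag is set."""
    if not row["changed"]:
        return row  # nothing was removed, keep the existing result
    row = row.copy()
    row["value"] = expensive_function(row)
    row["changed"] = False
    return row


calls = []  # records for which events the expensive function actually ran


def toy_reconstruction(row):
    calls.append(row["event"])
    return 10.0 * row["n_leptons"]  # placeholder for a real calculation


toy = pd.DataFrame({"event": [1, 2, 3], "n_leptons": [4, 5, 4],
                    "value": [1.0, 2.0, 3.0], "changed": [False, True, False]})
result = toy.apply(recompute_if_changed, axis=1, args=(toy_reconstruction,))
print(result["value"].tolist(), calls)  # [1.0, 50.0, 3.0] [2]
```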

Since a new DataFrame was created in each step, they can be used to examine the influences between the individual filter steps. In all of them, all quantities are already calculated and all kinematic variables are accessible via the `quantity` accessor.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 5))

df.four_lep.quantity.mass.hist(bins=20, range=(105, 155), label="Before", ax=axes[0])
df_after_filter.four_lep.quantity.mass.hist(bins=20, range=(105, 155), label="After", ax=axes[0])
axes[0].set_xlabel(r"$m_{4\ell}$ in GeV")
axes[0].set_ylabel("Events")
axes[0].set_title(r"Applying filters and re-running the reconstruction: $m_{4\ell}$")
axes[0].legend()

# explicit four_lep [105, 155] mask:
_df_mask = (df.four_lep.quantity.mass > 105) & (df.four_lep.quantity.mass < 155)
_df_after_filter_mask = (df_after_filter.four_lep.quantity.mass > 105) & (df_after_filter.four_lep.quantity.mass < 155)

df.z1.quantity.mass[_df_mask].hist(bins=40, range=(40, 120), label="Before", ax=axes[1])
df_after_filter.z1.quantity.mass[_df_after_filter_mask].hist(bins=40, range=(40, 120), label="After", ax=axes[1])
axes[1].set_xlabel(r"$m_{2\ell}$ in GeV")
axes[1].set_ylabel("Events")
axes[1].set_title(r"Applying filters and re-running the reconstruction: $Z_1$")
axes[1].legend()

df.z2.quantity.mass[_df_mask].hist(bins=40, range=(0, 120), label="Before", ax=axes[2])
df_after_filter.z2.quantity.mass[_df_after_filter_mask].hist(bins=40, range=(0, 120), label="After", ax=axes[2])
axes[2].set_xlabel(r"$m_{2\ell}$ in GeV")
axes[2].set_ylabel("Events")
axes[2].set_title(r"Applying filters and re-running the reconstruction: $Z_2$")
axes[2].legend()


plt.show()
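
To examine the influence of the individual filter steps numerically as well, one could record how many events survive each step. The following sketch uses the same `apply` plus `dropna` pattern as above on a toy table; `cutflow`, the toy filters and the column names are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def cutflow(dataframe, steps):
    """Return the number of surviving rows after each step in `steps`."""
    counts = {"initial": len(dataframe)}
    current = dataframe
    for step in steps:
        if not current.empty:  # row-wise apply on an empty frame is skipped
            current = current.apply(step, axis=1).dropna()
        counts[step.__name__] = len(current)
    return counts


def pt_above_5(row):
    row = row.copy()
    if row["pt"] <= 5:
        row["event"] = np.nan  # mark the event as discarded
    return row


def pt_above_20(row):
    row = row.copy()
    if row["pt"] <= 20:
        row["event"] = np.nan
    return row


toy = pd.DataFrame({"event": [1.0, 2.0, 3.0], "pt": [3.0, 8.0, 25.0]})
print(cutflow(toy, [pt_above_5, pt_above_20]))
# {'initial': 3, 'pt_above_5': 2, 'pt_above_20': 1}
```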

For a better comparison, it would be possible to additionally place the signal distributions next to it to see that the filters have removed more from the background than from the signal.

Now, if you perform the same filters and reconstruction steps for the measurement...

In [None]:
df_data = load_dataset("./data/for_long_analysis/ru_init/CMS_Run2012_[B,C].csv")
df_data_after_filter = not_so_effective_filter_process(df_data, filter_steps)
df_data_after_filter = reconstruct(df_data_after_filter, pt_exact={1:20, 2:10, 3:0, 4:0})

... you see a bigger effect:

In [None]:
df_data.four_lep.quantity.mass.hist(bins=15, range=(106, 151))
df_data_after_filter.four_lep.quantity.mass.hist(bins=15, range=(106, 151))
plt.xlabel(r"$m_{4\ell}$ in GeV")
plt.ylabel("Events")
plt.show()

In [None]:
from include.Helper import mc_hist_scale_factor

For the comparison between the prediction and the actual measurement, the simulations must be scaled to the data.
The function `mc_hist_scale_factor` already provides the required factor; the scaling is always applied to the whole histogram of a single decay channel in the following form:

$N_i' = \dfrac{\sigma_{\mathrm{channel}} k \mathcal{L}_{\mathrm{int}}}{N_{\mathrm{MC(channel)}_{\mathrm{tot}}}} N_i$

Here, $N_i$ and $N_i'$ are the contents of bin $i$ before and after scaling, $k$ is the scaling factor, $\mathcal{L}_{\mathrm{int}}$ the integrated luminosity, $\sigma_{\mathrm{channel}}$ the channel-specific cross section, and $N_{\mathrm{MC(channel)}_{\mathrm{tot}}}$ the total number of events in the corresponding simulation.

The exact values can be looked up in the function. It is also possible to leave this part to the students; implementing the scaling for each individual channel is then essentially routine work.
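
If the students implement the scaling themselves, the formula translates almost literally into code. The sketch below is not `mc_hist_scale_factor`; `toy_scale_factor` is a hypothetical name, and the numbers are made-up placeholders chosen for easy arithmetic, not the actual cross sections or luminosity. The only assumption is that cross section and luminosity are given in matching units.

```python
import numpy as np


def toy_scale_factor(cross_section, k, luminosity, n_mc_total):
    """Scale factor sigma * k * L_int / N_MC,tot for one decay channel."""
    return cross_section * k * luminosity / n_mc_total


# placeholder values, NOT the real ones used in the analysis
factor = toy_scale_factor(cross_section=2.0, k=1.5, luminosity=1000.0, n_mc_total=6000)

unscaled_hist = np.array([10.0, 40.0, 20.0])  # bin contents N_i of one channel
scaled_hist = factor * unscaled_hist          # bin contents N_i' after scaling
print(factor, scaled_hist.tolist())  # 0.5 [5.0, 20.0, 10.0]
```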

In [None]:
bin_ranges = np.linspace(106, 151, 16)  # edges of 15 equal-width bins between 106 and 151 GeV
hist = np.zeros(len(bin_ranges) - 1)    # accumulated, scaled background per bin
bin_centers = (bin_ranges[:-1] + bin_ranges[1:]) / 2
for ch in np.unique(df_after_filter.channel):
    _hist, _ = np.histogram(df_after_filter[df_after_filter.channel == ch].four_lep.quantity.mass, bins=bin_ranges)
    hist += mc_hist_scale_factor(channel=ch, process="background")*np.array(_hist, dtype=float)
y_data, _ = np.histogram(df_data_after_filter.four_lep.quantity.mass, bins=bin_ranges)
x_data = bin_centers

In [None]:
plt.step(bin_centers, hist, where="mid", label="Background")
plt.errorbar(x_data, y_data, yerr=np.sqrt(y_data), fmt="ko", label="Data")
plt.xlabel(r"$m_{4\ell}$ in GeV")
plt.ylabel("Events")
plt.legend()
plt.show()

This covers the implementation of the exercise sheet in its previous form. Since both the background (`hist`) and the measurement (`y_data`) are available, the significance estimation poses no problem once the same steps are repeated with the signal simulation.
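
As a rough sketch of what such an estimate could look like, the following computes a simple counting significance from binned signal and background expectations. The estimator s/√b is an assumption made here for illustration; the notebook does not state which estimator the exercise sheet uses. `simple_significance` and the toy histograms are hypothetical.

```python
import numpy as np


def simple_significance(signal, background):
    """Counting significance s / sqrt(b), summed over the given bins."""
    s = np.sum(signal)
    b = np.sum(background)
    if b <= 0:
        raise ValueError("The expected background must be positive.")
    return s / np.sqrt(b)


toy_signal = np.array([3.0, 4.0])      # scaled signal expectation per bin
toy_background = np.array([4.0, 5.0])  # scaled background expectation per bin
print(simple_significance(toy_signal, toy_background))  # 7 / sqrt(9), about 2.33
```

In the notebook, `hist` would take the role of the background, and the scaled signal histogram would be built in the same way from the signal simulation.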