### **Importing libraries**

In [None]:
import uproot

import pandas as pd
import numpy as np
from scipy.stats import spearmanr, pearsonr
from scipy.stats import norm

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.figure_factory as ff

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import Lasso, LinearRegression, HuberRegressor
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
np.random.seed(5) # for results to be reproducible

### **Before we start**

Before proceeding any further, please download [this ROOT file](https://drive.google.com/file/d/1i0YbH7rrljIm4oJcAzArrzv8-ED5pXFd/view?usp=sharing) into the `./data` folder, since the download might take a while.

### **Introduction**

In this seminar we are going to train our first **linear model for regression**. For this purpose we will use a sample of simulated events in the CMS detector, namely the process of the **Higgs boson decaying into a pair of tau leptons**, where one tau lepton decays into a final state with an electron and the other decays hadronically. That is,

$\text{H}\rightarrow\tau^+\tau^- \, \rightarrow \, (e \nu_e \nu_\tau) \, (\text{hadrons}+\nu_\tau)$

The Higgs boson is produced in the so-called **gluon-gluon fusion** (ggH for short) production mode. Here is the corresponding Feynman diagram:

<img src="images/Higgs-gluon-fusion.png" alt="drawing" width="300"/>

and, for illustration, an event in the CMS detector (with a muon rather than an electron in the final state):

<img src="images/cms_mt_event.png" alt="drawing" width="900"/>

together with a picture of the CMS detector itself:

<img src="images/cms.png" alt="drawing" width="900"/>

### **Getting data**

Let's first load the data. It is stored in a [**ROOT file**](https://root.cern/), and until very recently it was not trivial to retrieve data from ROOT files (a long-standing data format in HEP) and convert them into NumPy/pandas objects (the common formats of the data science community). Fortunately, the HEP community has developed tools for this, and we will introduce them in the following classes. Here we will use [**uproot**](https://github.com/scikit-hep/uproot) for the ROOT$\rightarrow$pandas conversion. Don't focus on it too much since it's not the main goal at the moment - we will present uproot in all its glory in the upcoming lessons.

In [None]:
data_file = uproot.open('data/et-NOMINAL_ntuple_GluGluHToUncorrTauTau_2018.root')

In [None]:
type(data_file)

In [None]:
# print uproot file content
data_file.keys()

What we've got inside of the ROOT file is a single [TTree](https://root.cern.ch/doc/master/classTTree.html) called "TauCheck" - let's get it:

In [None]:
# fetch TTree
data_tree = data_file['TauCheck'] # as simple as that
type(data_tree)

The data in a TTree is stored in the form of **branches** - essentially, they are like feature columns in a DataFrame. In general, the concepts of TTree and DataFrame have a lot in common, which made it natural to build packages that convert between them, such as uproot.  

In [None]:
# print TTree branches
branches = data_tree.keys()
branches

In [None]:
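# decode into a readable format and print once again
# (uproot3 returns keys as bytes, newer uproot versions return str, so we handle both)
branches = [(branch.decode("utf-8") if isinstance(branch, bytes) else branch).split(';')[0] for branch in branches]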
branches

Here is a brief description of some of the stored branches which we will need in the following:

**run:** run number in data taking  
**evt:** unique event ID  
**pt_1/2:** transverse momentum of the electron/tau, i.e. momentum in the plane transverse to the beam line, in [GeV](https://en.wikipedia.org/wiki/Electronvolt)  
**eta_1/2, phi_1/2:** [pseudorapidity](https://en.wikipedia.org/wiki/Pseudorapidity) and azimuthal angles  
**gen_match_1/2:** particle ID at generator level  
**iso_1/2:** relative isolation of the electron/tau, defined as the sum of the momenta of the particles in a cone around the e/tau divided by the e/tau momentum  
**m_vis:** visible mass of the ditau pair, in GeV  
**os:** whether the electron and the hadronic tau are oppositely charged  
**puppimet:** [missing transverse energy](https://en.wikipedia.org/wiki/Missing_energy), in GeV  
**puppimt_1/2, mt_tot:** [transverse mass](https://en.wikipedia.org/wiki/Transverse_mass) of the electron, tau and ditau system, in GeV  
**pt_tt:** transverse momentum of the (ditau + MET) Lorentz vector, in GeV  
**m_fast, pt_fast:** an [analytical approximation](https://indico.cern.ch/event/684622/contributions/2807248/attachments/1575090/2487044/presentation_tmuller.pdf) of the ditau mass and ditau transverse momentum, in GeV  
**ip{x,y,z}_1/2:** [impact parameter](https://www.desy.de/~ameyer/hq/node25.html) coordinates  
**IP_signif_\*:** impact parameter significance - (slightly naively) the absolute value of the impact parameter divided by its uncertainty  
**trg_\* and os_\*:** [trigger](https://en.wikipedia.org/wiki/Trigger_(particle_physics)) matching booleans  
**njets, nbtag:** number of [jets](https://en.wikipedia.org/wiki/Jet_(particle_physics)) and [b-jets](https://en.wikipedia.org/wiki/B-tagging) in the event  
**mjj, jdeta, dijetpt:** invariant mass, difference in pseudorapidity and transverse momentum of a pair of jets  
**jpt_1/2, jeta_1/2:** transverse momentum (in GeV) and pseudorapidity of the leading and subleading jets  
**acotautau_\*:** CP angles defined in [this publication](http://cms-results.web.cern.ch/cms-results/public-results/preliminary-results/HIG-20-006/index.html)

In [None]:
# remove "evt" branch and branches with "weight" in their names, don't need them here 
branches = [branch for branch in branches if 'weight' not in branch]
branches.remove('evt')
#branches

And here is where the magic happens - with a single line we convert the TTree read by uproot into a pandas DataFrame!

In [None]:
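# uproot3 API; in uproot 4 and later the equivalent call is data_tree.arrays(branches, library="pd")
data = data_tree.pandas.df(branches)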

In [None]:
data.sample(5, random_state=10)

In [None]:
data.shape

So, as we see, there are ~777k events and 75 branches/features. That's too much for now, so we are gonna take 10k events for faster calculations.

In [None]:
# warm-up exercise: select first 10k events

# data_df = 

As mentioned in the beginning, we want to do some regression here. An interesting thing to try is to **predict the visible mass of a ditau pair** from all the other available features. Since there are neutrinos in the final state, which carry away some fraction of the energy without being detected, the resolution of the visible mass is rather poor compared to the case where the decay could be reconstructed completely. Because of this large width, the signal in the visible mass overlaps with, and can hardly be separated from, the main background - the **Drell-Yan process** $Z/\gamma^*\rightarrow\tau\tau$ with the same final state - which reduces the sensitivity of the analysis. Perhaps some Machine Learning can help us improve the resolution and increase the signal/background separation in this variable? The idea is that providing additional information about the event as input (later we will check that this information is indeed helpful and meaningful) can help the model predict the visible mass better.

<img src="images/m_vis.png" alt="drawing" width="400"/>

And for now we are going to probe this idea by training a linear model to predict the visible mass as follows:

$m\_vis \, (\text{event candidate}) \sim \sum_j \theta_j \cdot f_j(\text{event candidate})$,

where $\theta_j$ are the parameters of the model (we will also call them **weights**), and $f_j$ are the input features describing the $H\rightarrow\tau\tau$ event candidate.
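
To make the formula concrete, here is a minimal sketch of the weighted sum on toy numbers. The feature values and weights below are made up for illustration (they are not fitted to our dataset), and note that sklearn's LinearRegression additionally fits a constant intercept term by default.

```python
import numpy as np

# Toy example: 3 event candidates, each described by 2 hypothetical features
features = np.array([
    [40.0, 30.0],
    [55.0, 25.0],
    [35.0, 45.0],
])
theta = np.array([0.8, 1.1])  # hypothetical weights, not fitted values

# Prediction for every event: sum_j theta_j * f_j, written as a matrix product
prediction = features @ theta

# The same thing written as an explicit sum over features
prediction_explicit = np.array(
    [sum(t * f for t, f in zip(theta, row)) for row in features]
)
print(prediction)
```

Fitting the model then simply means finding the values of `theta` which make such predictions as close as possible to the true targets.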

Later in the course you will learn more models, e.g. Random Forests or Neural Nets - don't hesitate to come back to this dataset and probe their performance on the same problem. It would be interesting to know whether they manage to beat a simple linear model😏

In [None]:
target_branch = 'm_vis' 

In [None]:
# exercise: plot the histogram of the target variable

The target has a roughly Gaussian shape, which is a good property for fitting linear models (we leave it to you to find out why). Otherwise, there are ways of transforming the target into a more suitable shape; [here](https://scikit-learn.org/stable/auto_examples/compose/plot_transformed_target.html) you will find some helpful examples.

So, can we get this peak narrower?

### **Vanilla approach**

For starters, let's be somewhat hasty and just do an out-of-the-box fit/predict on all the available data and see what we get.

In [None]:
X = data_df
y = data_df[target_branch]

In [None]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

And here is our model - the class is called **LinearRegression** and it's basically the ordinary least squares linear regression as we know it from the lecture. We imported it at the beginning of the seminar from the **sklearn** library. This is a staple library for classical ML (so not much Deep Learning is available there) and it covers many aspects of it. We will showcase some of its capabilities in this notebook, but don't hesitate to check out [its website](https://scikit-learn.org/stable/) for more tools and features! For example, [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) you will find a comprehensive overview of the implemented linear models.

Also, as you might've noticed, we defined the **target** variable to be inferred and split the data into **train and test parts**. This is done in order to train the model on the first part and then test it on the second one, which the model hasn't seen, so that we can fairly judge its performance. From the lecture you know that there is another method of validating the model - **cross-validation** - but we are not going to use it here; you will have a chance to try it out in the homework.
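
If you want to see what the split does in isolation, here is a small sketch on a toy array (the names `X_toy`, `y_toy` and friends are only for illustration and are not used elsewhere in the notebook). It checks that the rows are shuffled but features and targets stay aligned, and that no sample ends up in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features: row i is [2i, 2i+1]
y_toy = np.arange(10)                 # one target value per sample: i

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10
)

print(X_tr.shape, X_te.shape)

# features and targets are shuffled together, so they remain aligned
aligned = np.array_equal(X_tr[:, 0], 2 * y_tr)

# no sample is shared between the train and test parts
overlap = set(y_tr.tolist()) & set(y_te.tolist())
```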

In [None]:
lin_regr = LinearRegression()

In [None]:
lin_regr.fit(X_train, y_train) # fit it to the train data straight-away

Oops, looks like we've got some Not a Number ([NaNs](https://en.wikipedia.org/wiki/NaN)) here and the model can't be trained! Will need to sort this out.

### **Processing NaNs**

As we just noticed, some algorithms and libraries cannot automatically take care of NaN values in the data (some of them can, though). There are several ways to handle NaNs yourself. Firstly, you can always just remove the samples (or columns) which contain them. However, this might not be optimal, especially if you are short on data - every sample is precious then. Alternatively, you can replace the missing values with some others - e.g. the mean, the most frequent or a random value of the column. [Here](https://scikit-learn.org/stable/modules/impute.html) you can read how this can be done in sklearn.

Also note that data may be missing without being written as NaN explicitly - instead it may have been saved with some default value like 0 or -9999. This case should also be handled by the researcher manually after carefully studying the data (and in our dataset there are such features - can you spot them?).

In our problem we'll just remove what is found to be "NaN" - we can afford that. By the way, this is the first example of **data preprocessing** in this notebook, you will get to know more of them later on.
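
For reference, here is a minimal sketch of these options on a toy DataFrame. The column names and values are made up; `SimpleImputer` is the sklearn class for simple replacement strategies, and -9999 plays the role of a default value which hides a missing entry.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, -9999.0, 40.0],  # -9999 acts as a "hidden" missing value
})

# Option 1: drop every row which has at least one NaN
dropped = toy.dropna()

# Option 2: replace NaNs with the mean of the column
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Hidden missing values have to be turned into NaNs explicitly first,
# otherwise neither dropna() nor the imputer will notice them
toy_explicit = toy.replace(-9999.0, np.nan)
n_missing = toy_explicit.isna().sum()
print(n_missing)
```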

In [None]:
# which features are affected?
data_df.isna().sum()[data_df.isna().sum() != 0] 

In [None]:
# dropping them
data_df.dropna(inplace=True) # inplace=True means that data_df is modified, well, inplace - that is, overwritten

### **Building baseline**

Let's repeat the training for the NaN-cleaned dataframe:

In [None]:
X = data_df
y = data_df[target_branch]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

In [None]:
reg_baseline = LinearRegression()

In [None]:
reg_baseline.fit(X_train, y_train)

Good, so it worked without any errors. Now we need to make predictions for the test set and compare them with the "true" values. For that we use the predict() method, and from now on we will refer to this simple "fit/predict" model as a **baseline** - something basic and simple, which will be improved upon in the following steps.

In [None]:
y_pred_baseline = reg_baseline.predict(X_test)

And finally, plotting prediction side-by-side with true values of m_vis:

In [None]:
plt.hist(y_pred_baseline, bins=30, range=(0, 150), label='baseline', histtype='step', linewidth=2)
plt.hist(y_test, bins=30, range=(0, 150), label='true', histtype='step', linewidth=2)
plt.legend()
plt.grid()
plt.show()

Weird, seems like there's no difference between pred and test 🤔 Let's compare predictions with targets:

In [None]:
list(zip(y_pred_baseline[:5], y_test.values[:5]))

Hmm, they are the same! And what are the weights of the linear model?

In [None]:
reg_baseline.coef_

They all are extremely close to 0! Apart from a single one - which feature does it correspond to?

In [None]:
# exercise: find a feature with a non-zero weight

Gotcha! We've just fallen into the trap of **data leakage** - the target was hanging out amongst the training features. No surprise the model found it to be an easy task, just saying: "~ah you dumb-ass~ I will simply predict the output m_vis based on, well, the input m_vis which you've given me, with the corresponding coefficient equal to 1".

Another example of data leakage would be a case where, for instance, you are asked to classify images of cats and dogs, and all images containing cats are marked with some special symbol. Then your algorithm could learn to find cats based not on features specific to cats, but simply on this special mark. This will likely be revealed at the inference step, when you try to make predictions on new/validation data which doesn't have this mark. What you will find in this case is that the performance of your algorithm drops significantly - just because it can't find the special mark it learned to look for, and it probably gives the answer randomly.

[Here](https://machinelearningmastery.com/data-leakage-machine-learning/) you will find a good article explaining data leakage in more detail and also describing the ways to avoid it.
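
The effect is easy to reproduce on synthetic data. In the sketch below (toy data, all names are hypothetical) the target is put among the inputs on purpose: the model assigns it a weight of 1, ignores the genuine feature and reaches a perfect score, while the honest model does not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
toy_feature = rng.normal(size=n)
toy_target = 2.0 * toy_feature + rng.normal(size=n)  # noisy dependence on the feature

# Leaky inputs: the target itself sits among the features
X_leaky = np.column_stack([toy_feature, toy_target])
leaky_model = LinearRegression().fit(X_leaky, toy_target)
print(leaky_model.coef_)  # weight of the genuine feature vs weight of the leaked target

# Honest inputs: only the genuine feature
X_clean = toy_feature.reshape(-1, 1)
clean_model = LinearRegression().fit(X_clean, toy_target)

r2_leaky = leaky_model.score(X_leaky, toy_target)
r2_clean = clean_model.score(X_clean, toy_target)
print(r2_leaky, r2_clean)
```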

#### **dropping target**

Let's be smarter now and remove the target from the training features:

In [None]:
X = data_df.drop(columns=target_branch)
y = data_df[target_branch]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

In [None]:
reg_baseline = LinearRegression()

In [None]:
reg_baseline.fit(X_train, y_train)

In [None]:
y_pred_baseline = reg_baseline.predict(X_test)
w_baseline = reg_baseline.coef_

In [None]:
plt.hist(y_pred_baseline, bins=30, range=(20, 140), label='baseline', histtype='step', linewidth=2)
plt.hist(y_test, bins=30, range=(20, 140), label='true', histtype='step', linewidth=2)
plt.legend()
plt.grid()
plt.show()

Ain't bad! By doing literally nothing (remember, we didn't really look into the data and simply fitted the model) we already improved the resolution in this variable - by eye, the peak got narrower! But can we do better and improve upon the baseline?

#### **plotting weights**

As we know from the lecture, weights of the linear model contain some useful information about features which the model found to be useful. Let's study them to get some insight into the current model.

In [None]:
w_baseline = reg_baseline.coef_

In [None]:
w_baseline.shape, X_train.columns.shape

Here we will sort the features according to the value of the weight and plot them alongside the scale of the feature (that is, its standard deviation)

In [None]:
sorted_w_to_feature = sorted(zip(w_baseline, X_train.columns, X_train.std()), reverse=True)
weights_baseline = [x[0] for x in sorted_w_to_feature]
features_baseline = [x[1] for x in sorted_w_to_feature]
scales_baseline = [x[2] for x in sorted_w_to_feature]

In [None]:
fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Bar( x=weights_baseline,
            y=features_baseline,
            orientation='h', name="weights"),
    row=1, col=1
)

fig.add_trace(
    go.Bar( x=scales_baseline,
            y=features_baseline,
            orientation='h', name="scales"),
    row=1, col=2
)

fig.show()

OK, so we see that there are some variables with a very large scale. What are they?

In [None]:
# exercise: from interacting with the plot or using a bit of coding find these features' names 

Let's figure out why the scale is that big by plotting the variable

In [None]:
X_train.acotautau_refitbs_uncorr_01.var()

In [None]:
plt.hist(X_train.acotautau_refitbs_uncorr_01)  # same variable as in the cell above
plt.show()

Here it is - there is a bunch of events with -9999 values. In fact, they are also **NaNs**, but preprocessed at the earlier stage of data production by filling them with default values, so that they became **outliers** - samples "outside" of the main bulk of data.

In general, **it's worth checking all variables this way** (e.g. with the [pairplot()](https://seaborn.pydata.org/generated/seaborn.pairplot.html) function from seaborn) and cleaning/preprocessing them, or using loss functions which are robust to outliers. In our case we will be lazy and leave them as they are.

As a short summary, this small study of variances firstly helped us to understand the data a bit better: to find outliers and some strangely behaving variables. Secondly, it helped us to understand the scale of the data. We see that it's not uniform, and linear models are very sensitive to this. Moreover, we can't interpret the weights of the fitted model as some kind of "importance" of features - an extremely large weight may not be a sign of the feature contributing a lot to the final target, but just a way to compensate for the feature's small scale.
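
Here is a tiny illustration of this last point on synthetic data (the feature and its units are hypothetical). The very same feature is given to the model once in GeV and once in MeV: the predictions are identical, but the fitted weight changes by a factor of 1000, so its size alone says nothing about importance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x_gev = rng.normal(50.0, 10.0, size=1000)                 # hypothetical feature in GeV
y_toy = 1.5 * x_gev + rng.normal(0.0, 1.0, size=1000)     # toy target

# Same information, different units: GeV vs MeV
w_gev = LinearRegression().fit(x_gev.reshape(-1, 1), y_toy).coef_[0]
w_mev = LinearRegression().fit((1000.0 * x_gev).reshape(-1, 1), y_toy).coef_[0]
print(w_gev, w_mev)
```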

### **Scaling features**

As weights aren't interpretable at the moment because of the differences in the features' scales, let's scale them! For that purpose, we are gonna use the [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) class from sklearn. Basically, it subtracts the mean from each feature and divides it by its standard deviation, so that afterwards every feature has a mean of 0 and a variance of 1. In addition to interpretability, scaling also makes convergence for linear models more stable: the gradient size (and thus the rate and stability of convergence) will not be affected by large feature values. But this is not the only scaler available on the market, check out [this](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py) comparison of them!

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # why don't we fit test?
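
As a side note, here is a minimal sketch on toy arrays of what the scaler computes (the numbers and the names `X_tr`, `X_te`, `toy_scaler` are made up for illustration). The mean and standard deviation are learned from the train part only and then reused for the test part.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learns mean and std from the train part
X_te_scaled = toy_scaler.transform(X_te)      # reuses the train mean and std

# The same transformation done by hand (NumPy's std uses ddof=0 by default, like the scaler)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_te_scaled)
```

Note that the scaled test sample does not need to have zero mean or unit variance: it is expressed in the units learned from the train data.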

In [None]:
reg_baseline_scaled = LinearRegression()

In [None]:
reg_baseline_scaled.fit(X_train_scaled, y_train)

In [None]:
y_pred_baseline_scaled = reg_baseline_scaled.predict(X_test_scaled)
w_baseline_scaled = reg_baseline_scaled.coef_

In [None]:
plt.hist(y_pred_baseline, bins=30, range=(0, 150), label='baseline', histtype='step', linewidth=2)
plt.hist(y_pred_baseline_scaled, bins=30, range=(0, 150), label='baseline, scaled', histtype='step', linewidth=2)
plt.hist(y_test, bins=30, range=(0, 150), label='true', histtype='step', linewidth=2)
plt.legend()
plt.grid()
plt.show()

Looks like it didn't change the results much, let's plot the weights once again.

In [None]:
sorted_w_to_feature_scaled = sorted(zip(w_baseline_scaled, X_train.columns, X_train_scaled.std(axis=0)), reverse=True)
weights_baseline_scaled = [x[0] for x in sorted_w_to_feature_scaled]
features_baseline_scaled = [x[1] for x in sorted_w_to_feature_scaled]
scales_baseline_scaled = [x[2] for x in sorted_w_to_feature_scaled]

In [None]:
fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Bar( x=weights_baseline_scaled,
            y=features_baseline_scaled,
            orientation='h', name="weights"),
    row=1, col=1
)

fig.add_trace(
    go.Bar( x=scales_baseline_scaled,
            y=features_baseline_scaled,
            orientation='h', name="scales"),
    row=1, col=2
)

fig.show()

In [None]:
# question: and why are there empty spaces on the scales plot?

It looks like even after scaling we still have extremely large weights for acotautau variables, and this looks suspicious. We also saw previously that these variables have -9999 values, maybe it is because of them? Let's remove them and see what we will have.  

In [None]:
acotautau_features = [
                 'acotautau_refitbs_00',
                 'acotautau_refitbs_01',
                 'acotautau_helix_00',
                 'acotautau_helix_01',
                 'acotautau_bs_00',
                 'acotautau_bs_01',
                 'acotautau_00',
                 'acotautau_01',
                 'acotautau_refitbs_uncorr_00',
                 'acotautau_refitbs_uncorr_01',
                 'acotautau_helix_uncorr_00',
                 'acotautau_helix_uncorr_01', # you can actually leave comma in the end
]

In [None]:
acotautau_cuts = [aco_var + '>0' for aco_var in acotautau_features]
acotautau_cut = ' and '.join(acotautau_cuts)
print(acotautau_cut)

In [None]:
X = data_df.query(acotautau_cut).drop(columns=target_branch) # here we use query() method to apply the selection
y = data_df.query(acotautau_cut)[target_branch]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) 

In [None]:
reg_baseline_scaled = LinearRegression()

In [None]:
reg_baseline_scaled.fit(X_train_scaled, y_train)

In [None]:
y_pred_baseline_scaled = reg_baseline_scaled.predict(X_test_scaled)
w_baseline_scaled = reg_baseline_scaled.coef_

In [None]:
plt.hist(y_pred_baseline, bins=30, range=(0, 150), label='baseline', histtype='step', linewidth=2)
plt.hist(y_pred_baseline_scaled, bins=30, range=(0, 150), label='scaled+aco-dropped', histtype='step', linewidth=2)
plt.hist(y_test, bins=30, range=(0, 150), label='true', histtype='step', linewidth=2)
plt.legend()
plt.grid()
plt.show()

OK, kind of narrower, but by eye it also looks like the unscaled baseline has more events in the range of interest.

In [None]:
y_pred_baseline_scaled[y_pred_baseline_scaled < 200].shape

In [None]:
y_pred_baseline[y_pred_baseline < 200].shape

In [None]:
y_test[y_test < 200].shape

Yep, so it's a bit unfair to compare the histograms (remember, we applied the cut and removed some events). A better approach is to look at the **density** of the distribution.
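
To see the difference between counts and densities, consider this sketch with two synthetic samples drawn from the same distribution but with different numbers of events (all numbers are made up). The raw bin counts differ by roughly the ratio of sample sizes, whereas the normalised densities are directly comparable.

```python
import numpy as np

rng = np.random.default_rng(2)
large = rng.normal(90.0, 15.0, size=3000)  # hypothetical sample before a cut
small = rng.normal(90.0, 15.0, size=1000)  # same shape, fewer events

bins = np.linspace(0.0, 200.0, 41)
counts_large, _ = np.histogram(large, bins=bins)
counts_small, _ = np.histogram(small, bins=bins)

# density=True normalises each histogram so that its integral equals 1
dens_large, _ = np.histogram(large, bins=bins, density=True)
dens_small, _ = np.histogram(small, bins=bins, density=True)

print(counts_large.max(), counts_small.max())
print(dens_large.max(), dens_small.max())
```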

In [None]:
dist_data = [y_pred_baseline_scaled[y_pred_baseline_scaled < 200], y_pred_baseline[y_pred_baseline < 200], y_test[y_test < 200]]

group_labels = ['scaled+aco-dropped', 'baseline', 'true']
fig = ff.create_distplot(dist_data, group_labels, bin_size=.2, show_hist=False)

fig.update_layout(
    autosize=False,
    width=1000,
    height=700,
)

fig.show()

So not much of a difference, maybe less **bias** though? We will come back to this later.

In [None]:
np.median(y_test), np.median(y_pred_baseline), np.median(y_pred_baseline_scaled)

And what about weights?

In [None]:
sorted_w_to_feature_scaled = sorted(zip(w_baseline_scaled, X_train.columns, X_train_scaled.std(axis=0)), reverse=True)
weights_baseline_scaled = [x[0] for x in sorted_w_to_feature_scaled]
features_baseline_scaled = [x[1] for x in sorted_w_to_feature_scaled]
scales_baseline_scaled = [x[2] for x in sorted_w_to_feature_scaled]

In [None]:
fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Bar( x=weights_baseline_scaled,
            y=features_baseline_scaled,
            orientation='h', name="weights, scaled"),
    row=1, col=1
)

fig.add_trace(
    go.Bar( x=scales_baseline_scaled,
            y=features_baseline_scaled,
            orientation='h', name="scales, scaled"),
    row=1, col=2
)

fig.show()

Well, they are not as huge as before, but they are still the most influential ones. This is a bit concerning since, from our physics knowledge, these variables aren't supposed to have an impact on the outcome. A scatter plot can help to illustrate this (and also reveal some peculiar outliers):

In [None]:
sns.pairplot(data_df.query(acotautau_cut + ' and m_vis < 200')[['acotautau_refitbs_uncorr_01', 'acotautau_helix_01', 'm_vis']], diag_kind='kde')  # note the leading space before "and"
plt.show()

### **Feature engineering**

This may actually be an indication that our model is not really physical and picks up irrelevant correlations in the data. It is worth checking the result without these acotautau variables. Here we will drop them together with some other features which we know don't affect the prediction of the visible mass. Essentially, this is a very simple example of **feature engineering** - the process of preparing features to be fed into the model. A bit later in this notebook we will have a look at other examples of this.

In [None]:
# exercise: find all boolean features

# bool_features = 

In [None]:
# acotautau vars
acotautau_features

In [None]:
# some other redundant features
redundant_features = ['run', 'htxs_reco_flag_qqh', 'htxs_reco_flag_ggh', 
                      'ff_nom', 'ff_mva', 'ff_sys', 'gen_sm_htt125', 'gen_ps_htt125', 'gen_mm_htt125',
                     ]

In [None]:
X = data_df.drop(columns=acotautau_features + list(bool_features) + redundant_features + [target_branch])
y = data_df[target_branch]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

#### **pipeline**

Previously we applied transformations to the train and test data separately, be it scaling or predicting. In sklearn there is a handy class which encapsulates these steps into a single object - the [pipeline](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline). Thanks to it you can incorporate all preprocessing steps into a single chain and then call its fit()/predict() methods to execute all the underlying steps in one go, one after another. Essentially it is like a pipe through which your data flows towards the end.

In [None]:
pipeline = Pipeline([
    ('Scaling', StandardScaler()),
    ('LinearRegression', LinearRegression())
])

pipeline = pipeline.fit(X_train, y_train) # here we automatically fit StandardScaler() and then fit LinearRegression() to the scaled train data
y_pred_preproc = pipeline.predict(X_test) # and here we automatically apply the already fitted StandardScaler() to the test data (no refitting) and make predictions with the already fitted model
scaler_preproc = pipeline[0] # here we can access the fitted scaler object
weights_preproc = pipeline[1].coef_ # and here the model and consequently its coefficients
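
To convince yourself that the pipeline does nothing more than chain the steps, here is a small sketch on synthetic data (all names and numbers are hypothetical) comparing it with the manual "scale, then fit, then predict" sequence.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
feature_scales = np.array([1.0, 10.0, 100.0])
X_tr = rng.normal(size=(200, 3)) * feature_scales
y_tr = X_tr @ np.array([1.0, 0.1, 0.01]) + rng.normal(scale=0.1, size=200)
X_te = rng.normal(size=(50, 3)) * feature_scales

# Manual chain: fit the scaler on train, fit the model on scaled train,
# then transform test with the same scaler and predict
sc = StandardScaler().fit(X_tr)
reg = LinearRegression().fit(sc.transform(X_tr), y_tr)
pred_manual = reg.predict(sc.transform(X_te))

# The same chain wrapped into a pipeline
pipe = Pipeline([("scaling", StandardScaler()), ("regression", LinearRegression())])
pred_pipe = pipe.fit(X_tr, y_tr).predict(X_te)

print(np.allclose(pred_manual, pred_pipe))
```

Besides saving a few lines, the pipeline makes it harder to accidentally fit a preprocessing step on the test data.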

In [None]:
dist_data = [y_pred_preproc[y_pred_preproc < 200], y_pred_baseline_scaled[y_pred_baseline_scaled < 200], y_pred_baseline[y_pred_baseline < 200], y_test[y_test < 200]]

group_labels = ['preproc', 'scaled', 'baseline', 'true']
fig = ff.create_distplot(dist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False)

fig.update_layout(
    autosize=False,
    width=1000,
    height=700,
)

fig.show()

In [None]:
# well, hard to tell at first glance, but what is disturbing is this bumpiness in the left tail of the "preproc" distribution - 
# can you figure out why it is there? 
# hint: you may want to carefully look at the "m_fast" feature distribution.

In [None]:
sorted_w_to_feature_preproc = sorted(zip(weights_preproc, X_train.columns, scaler_preproc.transform(X_train).std(axis=0)), reverse=True)
sorted_weights_preproc = [x[0] for x in sorted_w_to_feature_preproc]
sorted_features_preproc = [x[1] for x in sorted_w_to_feature_preproc]
sorted_scales_preproc = [x[2] for x in sorted_w_to_feature_preproc]

In [None]:
fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Bar( x=sorted_weights_preproc,
            y=sorted_features_preproc,
            orientation='h', name="weights"),
    row=1, col=1
)

fig.add_trace(
    go.Bar( x=sorted_scales_preproc,
            y=sorted_features_preproc,
            orientation='h', name="scales"),
    row=1, col=2
)

fig.show()

Good, so now we have a model which seems more reasonable from the physics perspective - we know that the momentum of the taus should be somehow connected with the visible mass. The weights are interpretable now and we see that the model assigns higher weights to the pt variables (pt_1 and pt_2) - a good sign of its meaningfulness. The performance of the "preprocessed" model didn't change much compared to the dummy baseline, but we've got plenty of insights into the data and the model itself along the way. Good for us!

In [None]:
# question: but wait, why do the IP_signif variables have the highest weights? 

# this is quite a difficult problem to solve, we must admit - but if you manage to find the answer, you may be pretty surprised (and we will be happy that you are surprised:)
# hint: start by checking the correlation of the outliers in these variables with the target

#### **studying correlations**

At this point we did some simple feature engineering by manually removing features which we know _a priori_ to be useless. However, we could use some guidance in this removal. One option would be to fit a linear model and then remove the features with the smallest weights (thus keeping only the features with the highest impact on the target prediction). Another useful way is to study **correlations** - as we know, linear models by design capture linear dependencies of the target on the input features, so if we use just the features which are most correlated with the target, we could "clean up" our model from redundant features and possibly also improve its predictive power.

To measure the degree of correlation we are going to use [Pearson](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)

$\rho _{X,Y}={\frac { \mathbb{E}[(X-\mu _{X})(Y-\mu _{Y})]}{\sigma _{X}\sigma _{Y}}}$

where 

* $\sigma _{X}$ and $\sigma _{Y}$ are the standard deviations of X and Y,
* $\mu_X$ and $\mu_Y$ are the means of X and Y,
* $\mathbb{E}[\cdot]$ is the [expected value](https://en.wikipedia.org/wiki/Expected_value)


and [Spearman](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) correlation coefficients (essentially, Pearson correlation between the rank values of two variables)
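
The difference between the two coefficients is easiest to see on a toy relation which is monotonic but not linear (synthetic data, for illustration only): the ranks of the two variables coincide, so Spearman reports a perfect correlation, while Pearson is noticeably below 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x_toy = np.linspace(0.0, 5.0, 50)
y_toy = np.exp(x_toy)  # monotonic but strongly non-linear in x_toy

r_pearson = pearsonr(x_toy, y_toy)[0]    # sensitive to linear dependence only
r_spearman = spearmanr(x_toy, y_toy)[0]  # Pearson correlation of the ranks

print(r_pearson, r_spearman)
```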



In [None]:
correlation_spearman = {
    feature: spearmanr(X_train[feature], y_train)[0]
    for feature in X_train.columns if not np.isnan(spearmanr(X_train[feature], y_train)[0])
}

sorted_correlation_spearman = sorted(correlation_spearman.items(), key=lambda x: x[1], reverse=True)
features_spearman = [x[0] for x in sorted_correlation_spearman]
correlations_spearman = [x[1] for x in sorted_correlation_spearman]

In [None]:
correlation_pearson = {
    feature: pearsonr(X_train[feature], y_train)[0]
    for feature in X_train.columns
}

sorted_correlation_pearson = sorted(correlation_pearson.items(), key=lambda x: x[1], reverse=True)
features_pearson = [x[0] for x in sorted_correlation_pearson]
correlations_pearson = [x[1] for x in sorted_correlation_pearson]

In [None]:
fig = go.Figure(data=[
    go.Bar(name='Spearman', x=correlations_spearman, y=features_spearman, orientation='h', marker_color='midnightblue', opacity=0.6),
    go.Bar(name='Pearson', x=correlations_pearson, y=features_pearson, orientation='h', marker_color='tan', opacity=0.6),
])

fig.update_layout(barmode='group', title='Spearman vs Pearson')
fig.show()

In [None]:
# questions: how does this plot relate to the one with weights? are there similarities in the ordering?

Good, so we see that the highest correlation coefficients for both methods are those of m_fast, m_tot, pt_1 and pt_2, which was also the case for the weights in the linear model and which agrees with our physics knowledge. Now it makes sense to verify this by making **scatter plots**:

In [None]:
n_features = 5

In [None]:
fig, axs = plt.subplots(2, n_features, figsize=(26, 10))
for j, feature in enumerate(features_pearson[:n_features]):
    sns.regplot(x=X_train[feature][:1000], y=y_train[:1000], fit_reg=True, robust=False, ax=axs[0,j])  # do you know what "robust=False" means?
    axs[0,j].set_xlabel(feature)
    axs[0,j].set_ylabel(target_branch)

for j, feature in enumerate(features_pearson[-n_features:][::-1]):
    sns.regplot(x=X_train[feature][:1000], y=y_train[:1000], fit_reg=True, robust=False, ax=axs[1,j]) 
    axs[1,j].set_xlabel(feature)
    axs[1,j].set_ylabel(target_branch)

In [None]:
fig, axs = plt.subplots(2, n_features, figsize=(26, 10))
for j, feature in enumerate(features_spearman[:n_features]):
    sns.regplot(x=X_train[feature][:1000], y=y_train[:1000], fit_reg=True, robust=False, ax=axs[0,j])  # do you know what "robust=False" means?
    axs[0,j].set_xlabel(feature)
    axs[0,j].set_ylabel(target_branch)

for j, feature in enumerate(features_spearman[-n_features:][::-1]):
    sns.regplot(x=X_train[feature][:1000], y=y_train[:1000], fit_reg=True, robust=False, ax=axs[1,j]) 
    axs[1,j].set_xlabel(feature)
    axs[1,j].set_ylabel(target_branch)

Yep, we see that there is indeed a correlation between the target and the aforementioned features. Now let's train the model just on those features and see whether we improve upon our baseline.

#### **train on top 1/5/10 (anti)correlated features**

In [None]:
# exercise: select top-1/5/10 correlated + anticorrelated features, fit the model, plot and compare results

You should see a bit of improvement here - the peak got slightly higher!
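
If you are not sure where to start, here is one possible way to set up the selection, shown on toy data. Everything in it is an assumption made for illustration: the frame `X_toy`, the target `y_toy` and the helper `select_top_k` are hypothetical and not part of the seminar code. With the real data you would use `X_train`, `y_train` and the sorted correlations from above.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Toy data (assumption): f0 is correlated, f1 anticorrelated, the rest is mostly noise
X_toy = pd.DataFrame(rng.normal(size=(1000, 8)), columns=[f'f{i}' for i in range(8)])
y_toy = 4.0 * X_toy['f0'] - 3.0 * X_toy['f1'] + 0.5 * X_toy['f2'] + rng.normal(size=1000)

# Pearson coefficients sorted from the most correlated to the most anticorrelated
corr = pd.Series(
    {column: pearsonr(X_toy[column], y_toy)[0] for column in X_toy.columns}
).sort_values(ascending=False)


def select_top_k(sorted_corr, k):
    """Hypothetical helper: k most correlated plus k most anticorrelated feature names."""
    names = list(sorted_corr.index[:k]) + list(sorted_corr.index[-k:])
    return list(dict.fromkeys(names))  # drop duplicates, keep the order


top1_features = select_top_k(corr, 1)

# Same pipeline structure as for the baseline, but trained on the selected columns only
model_top1 = Pipeline([
    ('scaling', StandardScaler()),
    ('regression', LinearRegression()),
]).fit(X_toy[top1_features], y_toy)

print(top1_features, model_top1.score(X_toy[top1_features], y_toy))
```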

### **Lasso**

So far we have done feature engineering by hand-picking features based on our knowledge or on their correlation with the target. However, as we know, **L1 regularisation** can do this for us automatically. This is also called [Lasso regression](https://en.wikipedia.org/wiki/Lasso_(statistics)) (Least Absolute Shrinkage and Selection Operator), and sklearn has a dedicated class for it. Compared to the previously covered LinearRegression(), it adds an L1 regularisation term to the MSE loss:

$\min_{\theta} \, \mathcal{L}(a(X,\theta), y) = \min_{\theta} { \frac{1}{2n_{\text{samples}}} ||X \theta - y||_2 ^ 2 + \alpha ||\theta||_1}$
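
To see how the strength $\alpha$ of the L1 term drives weights to exactly zero, here is a small sketch on toy data. The data set is an assumption for illustration: only the first two of ten columns carry signal, the rest is pure noise.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Toy data (assumption): 10 features, only the first two are informative
X_toy = rng.normal(size=(1000, 10))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=1000)

n_nonzero = {}
for alpha in (0.01, 0.1, 1.0):
    model = Pipeline([
        ('scaling', StandardScaler()),
        ('Lasso', Lasso(alpha=alpha)),
    ]).fit(X_toy, y_toy)
    # count the weights which survived the L1 penalty
    n_nonzero[alpha] = int(np.sum(model.named_steps['Lasso'].coef_ != 0))

print(n_nonzero)
```

The larger $\alpha$ is, the fewer weights survive. Note that `Lasso()` without arguments uses `alpha=1.0`, which is the setting used in the cell below.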

In [None]:
lasso_pipeline = Pipeline([
    ('scaling', StandardScaler()),
    ('Lasso', Lasso()) # here is the lasso model
])

lasso_pipeline = lasso_pipeline.fit(X_train, y_train)
y_pred_train_lasso = lasso_pipeline.predict(X_train)
y_pred_lasso = lasso_pipeline.predict(X_test)

In [None]:
dist_data = [ y_pred_top5[y_pred_top5 < 200], y_pred_lasso[y_pred_lasso < 200], y_pred_baseline[y_pred_baseline < 200], y_test[y_test < 200]]

group_labels = ['top5', 'lasso', 'baseline', 'true']
fig = ff.create_distplot(dist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False)

fig.update_layout(
    autosize=False,
    width=1000,
    height=700,
#     margin=dict(
#         l=50,
#         r=50,
#         b=100,
#         t=100,
#         pad=4
#     ),
#     paper_bgcolor="LightSteelBlue",
)

fig.show()


And we are well beyond the other methods! It would be interesting to find out which features are the most important from Lasso's perspective and to compare them with the ones we picked based on correlations.

In [None]:
# exercise: print sorted non-zero Lasso weights with corresponding feature names; compare results with top-10 (anti)correlated features

Nice, we see that there is an overlap, and one which is also consistent with our physical knowledge - the momenta of the particles and also puppimet enter (actually, it would be useful for you to think about why its anticorrelation is important for the prediction). At this point one should gain more confidence in this model: if it meets our expectations and produces meaningful results, then it is at least not rubbish and quite likely a sensible model.
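
For reference, this is one way to read the weights out of a fitted pipeline, sketched on toy data. The frame `X_toy` and its column names are assumptions for illustration; the step name `'Lasso'` is the one used in `lasso_pipeline` above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Toy data (assumption): only feature_0 and feature_3 carry signal
X_toy = pd.DataFrame(rng.normal(size=(2000, 6)), columns=[f'feature_{i}' for i in range(6)])
y_toy = 5.0 * X_toy['feature_0'] - 3.0 * X_toy['feature_3'] + rng.normal(size=2000)

toy_pipeline = Pipeline([
    ('scaling', StandardScaler()),
    ('Lasso', Lasso(alpha=1.0)),
]).fit(X_toy, y_toy)

# Attach the feature names to the fitted weights and keep the non-zero ones
weights = pd.Series(toy_pipeline.named_steps['Lasso'].coef_, index=X_toy.columns)
nonzero_weights = weights[weights != 0]

# Sort by absolute value so that strongly anticorrelated features are not buried
order = nonzero_weights.abs().sort_values(ascending=False).index
nonzero_weights = nonzero_weights.reindex(order)

print(nonzero_weights)
```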

### **Checking for outliers**

As we saw before, some features had outliers in their distributions. We can remove them (**but be careful** - you might be removing events which are useless from the ML point of view but can still be very important from a physical perspective!), but here we'll have a look at the outliers from the loss point of view, that is, we will remove the events which have the largest prediction error.

In [None]:
lasso_error = (y_train - lasso_pipeline.predict(X_train)) ** 2
plt.hist(lasso_error, bins=50)
plt.grid()
plt.show()

Wow, the errors on the training set are very large. Let's separate core events from outliers in this histogram using the 95% [quantile](https://en.wikipedia.org/wiki/Quantile) of the error (that is, the 5% of events with the largest errors are treated as outliers), and then retrain the model.

In [None]:
error_threshold = np.quantile(lasso_error, 0.95)  # 95% of events lie below this error
core_mask = lasso_error < error_threshold
outlier_mask = ~core_mask

In [None]:
lasso_features

In [None]:
lasso_pipeline_no_outliers = Pipeline([
    ('scaling', StandardScaler()),
    ('Lasso', Lasso())
])

lasso_pipeline_no_outliers = lasso_pipeline_no_outliers.fit(X_train[core_mask], y_train[core_mask])
y_pred_train_lasso_no_outliers = lasso_pipeline_no_outliers.predict(X_train[core_mask])
y_pred_lasso_no_outliers = lasso_pipeline_no_outliers.predict(X_test)

In [None]:
dist_data = [ y_pred_lasso[y_pred_lasso < 200], y_pred_lasso_no_outliers[y_pred_lasso_no_outliers < 200], y_test[y_test < 200]]

group_labels = ['lasso', 'lasso: no outliers', 'true']
fig = ff.create_distplot(dist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False)

fig.update_layout(
    autosize=False,
    width=1000,
    height=700,
)

fig.show()


In [None]:
np.mean(y_pred_lasso_no_outliers), np.mean(y_pred_lasso), np.mean(y_test)

In [None]:
np.median(y_pred_lasso_no_outliers), np.median(y_pred_lasso), np.median(y_test)

Here it is - we noticed the same effect before, when we were removing outliers for the acotautau observables. The predicted distribution is now shifted towards smaller values, so the model is **biased towards smaller values of m_vis** compared to the unprocessed Lasso model. This might be because some of the outliers are actually particles with large pt, so if we remove them from the training data, the model learns to predict generally smaller values of m_vis. This is just a guess and there might be other reasons, so if we were approaching this study seriously (e.g. for publishing a paper), we would have to investigate and understand this effect. For now we will leave it for you to explore😉

In [None]:
lasso_error = (y_train[core_mask] - lasso_pipeline_no_outliers.predict(X_train[core_mask])) ** 2
plt.hist(lasso_error, bins=30)
plt.grid()
plt.show()

As we see, the model's error on the training set has been reduced significantly (partly by construction, since we removed exactly the events with the largest errors). Here we have several comments to make. Firstly, we were manually removing _loss_ outliers, but we could also do the same using the individual distributions of all features and their quantiles. Also, we could use [robust scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) to automate this type of data preprocessing. Moreover, if you remember from the lecture, we could make use of the [Huber loss](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html) in the training, which was specifically designed to tackle the problem of outliers.
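
As a rough illustration of the last two options, the sketch below compares an ordinary least-squares fit with a `RobustScaler` + `HuberRegressor` pipeline. The data are a toy assumption: a straight line with slope 2, where 5% of the targets at the largest x values are shifted far away. The helper `fitted_slope` is hypothetical as well.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(3)

# Toy data (assumption): y = 2x + noise, with the last 5% of targets corrupted
x_toy = np.linspace(0.0, 10.0, 300).reshape(-1, 1)
y_toy = 2.0 * x_toy[:, 0] + rng.normal(scale=0.5, size=300)
y_toy[-15:] += 100.0

ols = LinearRegression().fit(x_toy, y_toy)
huber = Pipeline([
    ('scaling', RobustScaler()),   # centres on the median, scales by the IQR
    ('huber', HuberRegressor()),   # quadratic loss for small residuals, linear for large ones
]).fit(x_toy, y_toy)


def fitted_slope(model):
    """Hypothetical helper: slope of the fitted line in the original units of x."""
    y_at_0, y_at_10 = model.predict(np.array([[0.0], [10.0]]))
    return (y_at_10 - y_at_0) / 10.0


slope_ols = fitted_slope(ols)
slope_huber = fitted_slope(huber)
print(f'true slope: 2.0, OLS: {slope_ols:.2f}, Huber: {slope_huber:.2f}')
```

The Huber fit should stay much closer to the true slope, because each outlier contributes to the loss only linearly rather than quadratically.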

And the last comment is on feature engineering in general: here we were mostly removing features, but one can **add new features** too. We could do this by hand, for example by deriving a new feature corresponding to the magnitude of a particle's momentum rather than its transverse part, which we used in this study. Or we could add [polynomial features](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html), that is, additional powers and products of the existing features, and then check whether the performance improved. There is plenty of room for creativity when you design your features, so give it a go!

### **Quantifying the difference**

And finally, we should say that all this time we were judging the performance of the model by simply comparing distributions _by eye_. To be honest, this isn't the most scientific way of understanding things, so let's calculate some numbers now.

#### **by fitting**

One way to do this is to somehow measure the width of the resulting predicted distribution. As we discussed in the beginning, we want this width to be narrow, so it makes sense to use this variable as the **metric** to evaluate the model's performance. One way to derive the width is by simply calculating the standard deviation (the square root of the variance) of this distribution - this should give a reasonable estimate. Another way would be to fit the distribution with a Gaussian function and take the width from the fit results - as we also noticed before, the target distribution has pretty much a Gaussian shape, so that is suitable in our case. For fitting we will use **scipy**, which was covered in the Python seminar: specifically, a normal distribution with its fit() method from the scipy.stats module.
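
In fact, the two ways are closely related. `norm.fit` performs a maximum-likelihood fit, and for a Gaussian this gives exactly the sample mean and the (biased, `ddof=0`) sample standard deviation. A quick check on toy numbers (the sample `toy_mass` is an assumption for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Toy "mass" sample (assumption): Gaussian with mean 70 and width 12
toy_mass = rng.normal(loc=70.0, scale=12.0, size=5000)

mu_fit, sigma_fit = norm.fit(toy_mass)

print(f'fit:    mu = {mu_fit:.3f}, sigma = {sigma_fit:.3f}')
print(f'sample: mu = {toy_mass.mean():.3f}, sigma = {toy_mass.std():.3f}')
```

So the fitted width is only as meaningful as the Gaussian assumption itself: long tails or outliers inflate it in the same way as they inflate the standard deviation.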

In [None]:
# question: does fitting belong to a Machine Learning gang?

In [None]:
mu_baseline, sigma_baseline = norm.fit(y_pred_baseline)
mu_top1, sigma_top1 = norm.fit(y_pred_top1)
mu_top5, sigma_top5 = norm.fit(y_pred_top5)
mu_top10, sigma_top10 = norm.fit(y_pred_top10)
mu_lasso, sigma_lasso = norm.fit(y_pred_lasso)  
mu_lasso_no_outliers, sigma_lasso_no_outliers = norm.fit(y_pred_lasso_no_outliers)
mu_test, sigma_test = norm.fit(y_test)

In [None]:
left = 40; right = 120; nbins = 30
x = np.linspace(left, right, 100)
fig = plt.figure(figsize=(10,8))

# raw f-strings (rf'...') keep LaTeX commands such as \mu from being read as escape sequences
plt.hist(y_pred_baseline, bins=nbins, range=(left, right), label='baseline', histtype='step', density=True, color='grey')
plt.plot(x, norm.pdf(x, mu_baseline, sigma_baseline), color='grey', label=rf'baseline, $\mu$={mu_baseline:.1f}, $\sigma$={sigma_baseline:.1f}')

# plt.hist(y_pred_top1, bins=nbins, range=(left, right), label='top1', histtype='step', density=True, color='plum')
# plt.plot(x, norm.pdf(x, mu_top1, sigma_top1), color='plum', label=rf'top1, $\mu$={mu_top1:.1f}, $\sigma$={sigma_top1:.1f}')

plt.hist(y_pred_top5, bins=nbins, range=(left, right), label='top5', histtype='step', density=True, color='darkcyan')
plt.plot(x, norm.pdf(x, mu_top5, sigma_top5), color='darkcyan', label=rf'top5, $\mu$={mu_top5:.1f}, $\sigma$={sigma_top5:.1f}')

# plt.hist(y_pred_top10, bins=nbins, range=(left, right), label='top10', histtype='step', density=True, color='steelblue')
# plt.plot(x, norm.pdf(x, mu_top10, sigma_top10), color='steelblue', label=rf'top10, $\mu$={mu_top10:.1f}, $\sigma$={sigma_top10:.1f}')

plt.hist(y_pred_lasso, bins=nbins, range=(left, right), label='lasso', histtype='step', density=True, color='lightcoral')
plt.plot(x, norm.pdf(x, mu_lasso, sigma_lasso), color='lightcoral', label=rf'lasso, $\mu$={mu_lasso:.1f}, $\sigma$={sigma_lasso:.1f}')

plt.hist(y_test, bins=nbins, range=(left, right), label='true', histtype='step', density=True, color='tan')
plt.plot(x, norm.pdf(x, mu_test, sigma_test), color='tan', label=rf'true, $\mu$={mu_test:.1f}, $\sigma$={sigma_test:.1f}')

plt.legend()
plt.grid()
plt.show()

It's worth checking, though, that the fit quality is reasonable and that a Gaussian model describes the data well. But as a zeroth-order approximation it should be fine. From the numbers we see that we improved the resolution from 15.8 to 10.3 GeV, which is a relative improvement of $(15.8 - 10.3)/15.8 \approx 35\%$!

#### **by numbers**

The other way, which you might remember from the lecture, is to look at metrics. In the lecture we covered several of them, and here we'll calculate two, MSE and $R^2$:

$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=1}^{n_\text{samples}} (y_i - \hat{y}_i)^2$

$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n_\text{samples}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_\text{samples}} (y_i - \bar{y})^2}$
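
Both formulas are easy to cross-check by hand against sklearn. The four numbers below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values (assumption), think of them as masses in GeV
y_true_toy = np.array([60.0, 75.0, 90.0, 105.0])
y_pred_toy = np.array([65.0, 70.0, 95.0, 100.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)

# R^2: one minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true_toy, y_pred_toy))
print(r2_manual, r2_score(y_true_toy, y_pred_toy))
```

Every residual here is ±5, so the MSE is 25, while the total sum of squares is 1125, giving $R^2 = 1 - 100/1125 \approx 0.91$.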

In [None]:
r2_baseline = r2_score(y_test, y_pred_baseline)
mse_baseline = mean_squared_error(y_test, y_pred_baseline)

r2_top5 = r2_score(y_test, y_pred_top5)
mse_top5 = mean_squared_error(y_test, y_pred_top5)

r2_lasso = r2_score(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)

In [None]:
print('(r2 | mse)\n')
print(f'baseline: {r2_baseline}, {mse_baseline}')
print(f'top5: {r2_top5}, {mse_top5}')
print(f'lasso: {r2_lasso}, {mse_lasso}')

Here it is - on the contrary, **the metrics disfavour the Lasso model!** How should we interpret this?

Well, in fact these metrics are not really physical, but rather intuitive and mathematical. By design, their goal is to make sure that the model predicts _exactly_ the target it was trained on, event by event - and therefore no change in width would be the perfect case for such metrics. But our goal is different: we want the model prediction to be narrower and, consequently, _different_ from the initial target distribution. So we **can't rely on these metrics in our problem** and need to derive ones which suit our initial goal - to improve the resolution of the mass distribution.

This is one of the challenges of ML in HEP nowadays - how to find a proper physical metric suitable for a given physical problem. Maybe the half width? Or some observable related to theory? Asking these kinds of questions is crucial while doing an ML-based analysis in your research.
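
As one possible starting point for a width-based metric, the sketch below computes the half-width of the central 68% interval directly from quantiles, without assuming any particular shape. The helper `central_half_width` and the toy samples are hypothetical; this is an illustration of the idea, not an established metric of this analysis.

```python
import numpy as np


def central_half_width(values, coverage=0.68):
    """Hypothetical helper: half the width of the central `coverage` interval."""
    lower, upper = np.quantile(values, [(1.0 - coverage) / 2.0, (1.0 + coverage) / 2.0])
    return 0.5 * (upper - lower)


rng = np.random.default_rng(6)

# Toy predictions (assumption): a wide and a narrow Gaussian around 80 GeV
wide_prediction = rng.normal(loc=80.0, scale=10.0, size=100_000)
narrow_prediction = rng.normal(loc=80.0, scale=5.0, size=100_000)

width_wide = central_half_width(wide_prediction)
width_narrow = central_half_width(narrow_prediction)
print(f'wide: {width_wide:.2f} GeV, narrow: {width_narrow:.2f} GeV')
```

For a Gaussian this quantity is close to $\sigma$, but unlike the fitted $\sigma$ it is not driven by a few events far out in the tails.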

So to sum it up, **remember to ask these questions before reaching for the ML apparatus:** what is my goal? what do I want to improve? how can I quantify the improvement? how can I measure the performance of my model _given the physical context?_


### **Overfitted?**

And now definitely the last funny thing: we never actually compared the prediction on the train and on the test samples... Maybe they are different, we are badly overfitted and this was all in vain?

In [None]:
# exercise: plot train vs test true/predict distributions

In [None]:
# question: so are we?

### **Summary**

As you noticed, plenty of important questions arose during this seemingly simple study, and we didn't have time to look into all of them during the seminar. But we highly encourage you to complete the study at home by following the links and the questions we were asking in the notebook!

However, we believe that we managed to cover the essentials:

* guiding you through the first HEP-related application of ML in this course
* starting from some simple data preprocessing
* building a baseline
* working on features
* understanding the data
* understanding the model
* finally measuring the model's performance

and as a result, we've got an improvement of roughly a third in the resolution of the visible mass!

But here is **the most important message we wanted to convey:** it is super easy to train **some** model and make **some** predictions. But these predictions might make no sense whatsoever if one didn't get to know the data - and this is what makes a real impact on the final result and distinguishes a researcher from an amateur. So **do study your data**, preprocess it and inspect it carefully and sensibly to find/add what is useful and remove/transform what is not. **"Garbage in, garbage out"**, as they say.

Another piece of advice would be: **do not go straight away for training some fancy Neural Network** (because "I heard that Deep Learning is cool you know") - it won't get you the best results that simply. On the contrary, the major improvement most often comes from getting your data into a proper shape **using your domain knowledge** (as we did in this notebook) and applying some simple algorithms first - they might actually be David against Goliath. And after you've tried this, increase the complexity of your model to see whether you can beat your simple baseline model.

And lastly, try to **interpret your model**: in case of linear models it's easy, but it can get tricky once the model complexity grows. However, there are tools and ways to make sense of your predictions and to make sure that you can trust your model. In [this awesome book](https://christophm.github.io/interpretable-ml-book/) you will find an overview of them, and we will also go through some of them during this course.

**So as a take-away:**
* **do study your data**
* **train a simple baseline and then improve it**
* **try to understand your model and its predictions**

In the **homework notebook** we will guide you through an interesting problem of **predicting students' results for a Maths exam**. Moreover, there we explain the concepts of data imbalance, hyperparameter optimisation, some new tricks to preprocess data and also a nice bagging of linear models in the end😉 These are very important aspects of ML, so we highly encourage you to have a look there and play around with the code a bit.

And see you in the following lessons!

~~~