### When running this notebook via the Galaxy portal
You can access your data via the dataset number. Using a Python kernel, you can access dataset number 42 with ``handle = open(get(42), 'r')``.
To save data, write your data to a file, and then call ``put('filename.txt')``. The dataset will then be available in your Galaxy history.
<br><br>Note that if you are putting to or getting from a history other than your default history, you must also provide the history ID.
<br><br>More information, including the available Galaxy-related environment variables, can be found at https://github.com/bgruening/docker-jupyter-notebook. This notebook is running in a Docker container based on the Docker Jupyter container described in that link.


# Apply Machine Learning Methods

This notebook applies some example ML methods that follow the <a href="https://scikit-learn.org/stable/" target="_blank">scikit-learn</a> interface. The notebook reads in hdf5 files for training and testing from disk. If these are not available, they can be produced using the following two notebooks:

1. **ConvertNtupToHdf5** - converts openData ntuples to hdf5 files using uproot.
2. **MakeTrainTestSamples** - divides the hdf5 files into training and test samples.
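
How **MakeTrainTestSamples** performs the split is not shown in this notebook. As a rough illustration only, a split of one sample could look like the sketch below. It assumes scikit-learn's `train_test_split` and a toy DataFrame; the values, the 70/30 fraction and the random seed are arbitrary choices made for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for one hdf5 sample (values are made up for illustration)
toy = pd.DataFrame({
    "mll_12": [91e3, 88e3, 95e3, 120e3, 300e3, 250e3, 90e3, 400e3, 92e3, 350e3],
    "isSignal": [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
})

# Hold out 30% of the events for testing; stratify keeps the
# signal/background proportions similar in both parts
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy["isSignal"], random_state=42
)
print(len(toy_train), len(toy_test))
```

The two parts never share an event, which is what makes the test part usable as unseen data later on.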

In [1]:
# To make the notebook view a bit wider
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:70% !important; }</style>"))
import warnings
warnings.filterwarnings('ignore')

### Some imports and includes

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
#from sklearn.model_selection import train_test_split
import xgboost as xgb
import numpy as np
from os import listdir  # used below to list the hdf5 files
import import_ipynb
import setPath
from Input.OpenDataPandaFramework13TeV import * 
%jsroot on

importing Jupyter notebook from setPath.ipynb
importing Jupyter notebook from /home/eirikgr/software/Input/OpenDataPandaFramework13TeV.ipynb
This library contains handy functions to ease the access and use of the 13TeV ATLAS OpenData release

getBkgCategories()
	 Dumps the name of the various background cataegories available 
	 as well as the number of samples contained in each category.
	 Returns a vector with the name of the categories

getSamplesInCategory(cat)
	 Dumps the name of the samples contained in a given category (cat)
	 Returns dictionary with keys being DSIDs and values physics process name from filename.

getMCCategory()
	 Returns dictionary with keys DSID and values MC category

initialize(indir)
	 Collects all the root files available in a certain directory (indir)

getSkims(indir)
	 Prints all available skims in the directory



Setting luminosity to 10064 pb^-1

###############################
#### Background categories ####
###############################
Category  

UsageError: Line magic function `%jsroot` not found.


In [3]:
xgbclassifier = xgb.XGBClassifier()
xgbclassifier.load_model("mymodel.json")

### Function for categorizing training and test into signal and background

In [None]:
def categorizeTrainandTest(samples, Signals, Backgrounds):
    files_signal = []
    files_background = []
    for t in samples:
        found = False
        for s in Signals:
            if s+"_" in t:
                files_signal.append(t)
                found = True
                break
        if found: continue
        for b in Backgrounds:
            if b+"_" in t:
                print(b,t)
                files_background.append(t)
                found = True
                break
    return files_signal, files_background
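
To see what the substring matching does, here is a small sketch with a compact equivalent of the function above (without the debug print) applied to made-up file names. The category names `Zjets`, `Diboson` and `Higgs` and all file names are hypothetical and only serve the example.

```python
def categorize(samples, signals, backgrounds):
    """Compact equivalent of the matching above: signal is checked first."""
    files_signal, files_background = [], []
    for name in samples:
        if any(s + "_" in name for s in signals):
            files_signal.append(name)
        elif any(b + "_" in name for b in backgrounds):
            files_background.append(name)
    return files_signal, files_background

# Hypothetical file names, made up for this example
toy_files = [
    "training_SUSYC1N2_2L_pt25_25_met50.h5",
    "training_Zjets_2L_pt25_25_met50.h5",
    "training_Zjetsincl_2L_pt25_25_met50.h5",
    "training_Higgs_2L_pt25_25_met50.h5",
]
sig, bkg = categorize(toy_files, ["SUSYC1N2"], ["Zjets", "Diboson"])
print(sig)
print(bkg)
```

The trailing underscore in the match keeps `Zjets` from also picking up the `Zjetsincl` file, and files that match no category (here the `Higgs` one) are silently left out of both lists.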

The next cell specifies the following:
* the location of the hdf5 files for MC (signal and background) and data
* the skimtag (used when producing the hdf5 files)
* the signal model/DSID you want to use when training the classifier

In [None]:
mcdir = "/scratch/eirikgr/openData_13TeV/2lep/MC/hdf5/"
datadir = "/scratch/eirikgr/openData_13TeV/2lep/Data/hdf5/"

Backgrounds = getBkgCategories();

skimtag = "2L_pt25_25_met50"

Backgrounds.remove('Wjetsincl')
Backgrounds.remove('Zjetsincl')

print(Backgrounds)

Signals = ["SUSYC1N2"]
# Set to a specific DSID to train on a single signal sample only;
# if <= 0, train on all samples of the signal model specified above
signal_dsid = -1

testing_files = [f for f in listdir(mcdir) if (f.endswith('.h5') and (f.startswith("testing") and skimtag in f))]
training_files = [f for f in listdir(mcdir) if (f.endswith('.h5') and (f.startswith("training") and skimtag in f))]

training_files_signal, training_files_background = categorizeTrainandTest(training_files,Signals,Backgrounds)
testing_files_signal, testing_files_background = categorizeTrainandTest(testing_files,Signals,Backgrounds)
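
The file selection above relies on three conditions on each file name. The sketch below applies the same conditions to a made-up listing instead of a real directory; the helper `select` and the file names are hypothetical.

```python
skimtag = "2L_pt25_25_met50"

# Hypothetical directory listing, standing in for listdir(mcdir)
listing = [
    "training_SUSYC1N2_2L_pt25_25_met50.h5",
    "testing_SUSYC1N2_2L_pt25_25_met50.h5",
    "training_SUSYC1N2_3L_pt25_25_met50.h5",    # different skim
    "training_SUSYC1N2_2L_pt25_25_met50.root",  # wrong extension
]

def select(files, prefix, tag):
    """Keep .h5 files that start with `prefix` and contain the skim tag."""
    return [f for f in files
            if f.endswith(".h5") and f.startswith(prefix) and tag in f]

print(select(listing, "training", skimtag))
print(select(listing, "testing", skimtag))
```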

### Function for getting the training and test data frames

In [None]:
def GetTrainTestDF(f_signal,f_background,signal_dsid = -1):
    sig = []
    bkg = []
    for tfs in f_signal:
        df = pd.read_hdf(mcdir+"/"+tfs)
        if signal_dsid > 0:
            df = df.loc[df['channelNumber'] == signal_dsid]
        sig.append(df)
        print("INFO \t Adding %s to DF"%tfs)
    for tfb in f_background:
        bkg.append(pd.read_hdf(mcdir+"/"+tfb))
        print("INFO \t Adding %s to DF"%tfb)
    merged_train = pd.concat(sig + bkg)
    print("\n")
    return merged_train
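
The DSID filter and the merge can be tried out on toy frames without any hdf5 files. In the sketch below the DSIDs and values are made up. It also passes `ignore_index=True` to `pd.concat`, which is an optional variation and not what the function above does.

```python
import pandas as pd

# Toy frames standing in for two hdf5 files (DSIDs and values are made up)
sig_df = pd.DataFrame({"channelNumber": [111, 111, 222],
                       "mll_12": [250e3, 310e3, 180e3],
                       "isSignal": [1, 1, 1]})
bkg_df = pd.DataFrame({"channelNumber": [333, 333],
                       "mll_12": [91e3, 89e3],
                       "isSignal": [0, 0]})

signal_dsid = 111  # a value <= 0 would keep all signal samples
if signal_dsid > 0:
    sig_df = sig_df.loc[sig_df["channelNumber"] == signal_dsid]

# ignore_index=True gives the merged frame a fresh 0..N-1 index;
# without it, row labels from the individual files repeat
merged = pd.concat([sig_df, bkg_df], ignore_index=True)
print(merged)
```

A unique index is mainly a convenience: it avoids surprises when rows are later selected by label rather than by position.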

In [None]:
# get the training and testing DFs
merged_training = GetTrainTestDF(training_files_signal, training_files_background, signal_dsid)
merged_testing = GetTrainTestDF(testing_files_signal, testing_files_background, signal_dsid)

In [None]:
merged_training["mll_12"]/1000.

#### Choose the variables to drop before training the ML algorithm.
#### Note the beauty of pandas: columns are selected by name, whereas plain numpy arrays are indexed by position only. You can add/remove as many variables as you wish to improve the classification.

In [None]:
for c in merged_training.columns:
    print(c)
todrop = ['XSection','SumWeights','eventNumber','channelNumber',"wgt",'isSignal','MCType']
X_train = merged_training.drop(todrop,axis = 1)
Y_train = merged_training['isSignal']

# The test set must come from the testing sample, not from the training sample
X_test = merged_testing.drop(todrop,axis = 1)
Y_test = merged_testing['isSignal']
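
As a minimal sketch of what the drop does, consider a toy frame with two features and two bookkeeping columns. The feature name `met_et` and all values are made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "mll_12": [91e3, 300e3],
    "met_et": [60e3, 150e3],   # hypothetical feature name
    "wgt": [0.5, 0.01],
    "isSignal": [0, 1],
})
todrop = ["wgt", "isSignal"]

X = toy.drop(columns=todrop)   # same effect as toy.drop(todrop, axis=1)
y = toy["isSignal"]
print(list(X.columns))
```

Keeping the label column (`isSignal`) among the features would let the classifier read off the answer directly, so it always belongs in the drop list.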

#### At this point, you have to choose the [ML algorithm](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) (BDT, logistic regression, ...)

## XGBClassifier
Let's have a look at the [XGBoost classifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)

In [None]:
# BDT classifier
xgbclassifier = xgb.XGBClassifier(
    max_depth=3, 
    n_estimators=120,
    learning_rate=0.1,
    n_jobs=4,
    #scale_pos_weight=sum_wbkg/sum_wsig,
    objective='binary:logistic')
    #missing=-999.0) 
xgbclassifier.fit(X_train, Y_train) 
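
The commented-out `scale_pos_weight` option refers to `sum_wbkg` and `sum_wsig`, which are not defined anywhere in this notebook. One plausible reading, assumed here, is that they are the summed event weights of background and signal in the training sample. A sketch on toy numbers:

```python
import pandas as pd

# Toy training frame; "wgt" and "isSignal" follow the column names used above,
# the values are made up
toy = pd.DataFrame({
    "wgt": [2.0, 1.5, 0.5, 0.1, 0.1],
    "isSignal": [0, 0, 0, 1, 1],
})

# Assumed definitions: summed event weights per class
sum_wbkg = toy.loc[toy["isSignal"] == 0, "wgt"].sum()
sum_wsig = toy.loc[toy["isSignal"] == 1, "wgt"].sum()

# Ratio that could be passed as scale_pos_weight to balance the classes
scale_pos_weight = sum_wbkg / sum_wsig
print(scale_pos_weight)
```

In the notebook itself the ratio would have to be computed from `merged_training` before the `todrop` columns are removed, since `wgt` is one of them.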

#### As with any decision-tree-based model, XGBoost lets you look at the variable importance and a wide range of [other things](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.plotting) for your model. For example, the feature importance:

In [None]:
# Plot variable importance
# Set the default figure size before plotting, so that it also applies to this figure
plt.rcParams["figure.figsize"] = (20, 15)
ax = xgb.plot_importance(xgbclassifier)
ax.xaxis.label.set_size(20)
ax.yaxis.label.set_size(30)
plt.show()
# Predicted class labels and class probabilities for the test set
y_pred = xgbclassifier.predict(X_test)
y_pred_prob = xgbclassifier.predict_proba(X_test)

#### ... or the decision tree:

In [None]:
xgb.plot_tree(xgbclassifier)

The following cells allow you to check the classification performance of your algorithm. In the first histogram, the y-axis is the number of events and the x-axis is the classifier output, i.e. the predicted probability that an event is signal. The blue distribution corresponds to background and the pink one to signal.

## Plotting not specific to any ML model

Let's have a look at some other interesting plots which are not specific to any ML algorithm.

#### First have a look at the distribution of the score from the classification algorithm using the unseen test data set. 

In [None]:
#  histogram of the ML outputs
n, bins, patches = plt.hist(y_pred_prob[:,1][Y_test==0], 100,  facecolor='blue', alpha=0.2,label="Background")
n, bins, patches = plt.hist(y_pred_prob[:,1][Y_test==1], 100,  facecolor='red', alpha=0.2, label="Signal")
plt.xlabel('ML output')
plt.ylabel('Events')
plt.yscale('log')
plt.title('ML output, OpenData dataset, test data')
plt.grid(True)
plt.legend()
plt.show()
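
The indexing `y_pred_prob[:,1][Y_test==0]` first takes the second column of the probability array (the signal probability) and then keeps only the events whose true label is background. The sketch below shows the same selection on made-up scores, using `np.histogram` in place of the plot so that the bin contents can be inspected directly.

```python
import numpy as np

# Toy scores and labels (made up); in the notebook these would be
# y_pred_prob[:, 1] and Y_test
scores = np.array([0.05, 0.10, 0.30, 0.65, 0.70, 0.90, 0.95])
labels = np.array([0, 0, 0, 0, 1, 1, 1])

edges = np.linspace(0.0, 1.0, 6)  # five equal-width bins on [0, 1]
bkg_counts, _ = np.histogram(scores[labels == 0], bins=edges)
sig_counts, _ = np.histogram(scores[labels == 1], bins=edges)
print(bkg_counts)
print(sig_counts)
```

A well-performing classifier piles background up in the low bins and signal in the high bins; the one background event at 0.65 is the kind of overlap that shows up as a false positive.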

#### In machine learning, the performance of an algorithm can be studied using different [metrics](https://scikit-learn.org/stable/modules/classes.html?highlight=sklearn%20metrics#module-sklearn.metrics), such as the receiver operating characteristic ([ROC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve)) curve.

In [None]:
from sklearn.metrics import roc_curve,auc
fpr, tpr, thresholds = roc_curve(Y_test,y_pred_prob[:,1], pos_label=1)
roc_auc = auc(fpr,tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',lw=lw, label='ROC curve (area = %0.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC on OpenData 13TeV dataset')
plt.legend(loc="lower right")
plt.show()
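
To get a feeling for the numbers behind the curve, the same two calls can be run on a four-event toy example (labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per threshold on the score
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
print(fpr)
print(tpr)
print(auc(fpr, tpr))  # area under the curve: 0.75 for this toy example
```

An area of 1 means perfect separation and 0.5 corresponds to the diagonal dashed line, i.e. random guessing. Here one background event scores higher than one signal event, which pulls the area down to 0.75.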

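#### The ROC curve is related to the [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix): each point on the ROC curve corresponds to the confusion matrix obtained at one threshold on the classifier output. The confusion matrix is plotted below.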

In [None]:
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax


np.set_printoptions(precision=2)

plot_confusion_matrix(Y_test, y_pred, ['background','signal'], normalize=False)
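
The link between the confusion matrix and a point on the ROC curve can be made explicit with a small sketch. The labels and scores are made up, and the threshold of 0.5 is chosen for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.2, 0.7, 0.4, 0.9, 0.3])  # toy signal probabilities

# Hard labels at a threshold of 0.5; each threshold gives one ROC point
y_hat = (y_score > 0.5).astype(int)

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
tpr = tp / (tp + fn)  # true positive rate (signal efficiency)
fpr = fp / (fp + tn)  # false positive rate (background accepted as signal)
print(tn, fp, fn, tp, tpr, fpr)
```

Moving the threshold up or down changes all four counts and therefore traces out the other points of the curve.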

#### One can also compare the predicted vs. true distributions of both background and signal for some given variable.

In [None]:
X_test

In [None]:
# Split mll_12 (scaled by 1/1000) by the true label, using a boolean mask
mll = X_test['mll_12'].to_numpy() / 1000.
true_is_bkg = (Y_test.to_numpy() == 0)
h_true_bkg = mll[true_is_bkg]
h_true_sig = mll[~true_is_bkg]

In [None]:
# Split mll_12 (scaled by 1/1000) by the predicted label, using a boolean mask
mll = X_test['mll_12'].to_numpy() / 1000.
pred_is_bkg = (np.asarray(y_pred) == 0)
h_pred_bkg = mll[pred_is_bkg]
h_pred_sig = mll[~pred_is_bkg]

In [None]:
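# Bin edges 0, 20, ..., 9980 (same edges as a loop over i*20 for i in 0..499)
bins = np.arange(0, 10000, 20)
plt.hist(h_pred_sig,bins=bins,log=True, color = 'b',alpha=0.3, label='Predicted signal')
plt.hist(h_true_sig,bins=bins,log=True, color = 'r', alpha=0.3, label='True signal')
plt.xlabel('mll_12 / 1000')
plt.legend()
plt.show()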

In [None]:
plt.hist(h_true_bkg,bins=100,log=True, label='True background')
plt.hist(h_pred_bkg,bins=100,log=True, label='Predicted background')
plt.xlabel('mll_12 / 1000')
plt.legend()
plt.show()

## Logistic Regression
An example of a much simpler classification model is the [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression).

In [None]:
# LOGREG classifier
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
y_pred = logreg.predict(X_test)
y_pred_prob = logreg.predict_proba(X_test)
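
Logistic regression tends to converge more reliably when the input features are on comparable scales. Whether that matters for the features used here has not been checked, so the following is only a sketch of one possible variation: it puts scikit-learn's `StandardScaler` in front of the classifier and fits on a made-up one-feature data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy one-feature data set with large, unscaled values (made up)
X_toy = np.array([[9e3], [10e3], [11e3], [12e3],
                  [90e3], [95e3], [100e3], [105e3]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Standardise each feature to zero mean and unit variance before the fit
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_toy, y_toy)

print(model.predict(X_toy))
print(model.predict_proba(X_toy).shape)
```

The pipeline exposes the same `fit`, `predict` and `predict_proba` methods as the bare classifier, so it could be dropped into the cell above in place of `LogisticRegression()`.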