# **Domain Adaptation to train model-independent classifiers in High Energy Physics**

In this notebook we want to build a simple DNN and train it using a typical HEP approach. Our dataset in this case will consist of only SM events.

# Check the SM dataset

This new dataset is very similar to the previous one, except that it is smaller and the BSM models have been dropped.

Note that in this dataset all the events labelled as `isVBF` are SM VBF events.

In [None]:
import pandas as pd


In [None]:
# IF USING COLAB, Uncomment and run the next lines and comment the next block; OTHERWISE leave all as is
#%pip install pickle5
#import pickle5 as pickle
#!wget https://pandora.infn.it/public/123c9b/dl/dataset_SM.pkl
#with open('dataset_SM.pkl', "rb") as fh:
#  df = pickle.load(fh)
# END COLAB BLOCK

In [None]:
# IF NOT USING COLAB, please use the standard code below
df = pd.read_pickle('https://pandora.infn.it/public/123c9b/dl/dataset_SM.pkl')
# END NON COLAB BLOCK

In [None]:
#BACK to standard flow

pd.set_option('display.max_columns', None)
df

## How many events of each kind do we have?

Let's now check how many events of each process we have in our dataset. In this case we opted for a balanced dataset composed of **VBF**, **ggH** and **BKG** processes in equal proportions.

In [None]:
for col in df.columns[24:35]:
    print (col, len(df[df[col]==1]))

# Define a simple DNN

We now want to build a simple DNN model. For the sake of clarity we maintain the same structure of the ADNN class, although it's much simpler now.

In [None]:
import tensorflow as tf
# import optuna
import tqdm
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Activation, Dense, Dropout, InputLayer
from tensorflow.keras.constraints import max_norm
from tensorflow.keras import initializers
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import Adam
from sklearn.utils import shuffle
import copy
print(tf.__version__)
from tensorflow.python.client import device_lib
tf.config.run_functions_eagerly(True)

In [None]:
class SimpleNeuralNetwork ( tf.Module ):
    def __init__ (self, nEpochs, learning_rate, N_NODES, n_layers, n_features, n_outputsC=3):
        self.learning_rate = learning_rate
        self.optimizer  = tf.optimizers.Adam (self.learning_rate)
        self.nEpochs = nEpochs
        self.N_NODES = N_NODES
        self.n_layers = n_layers
        self.n_features = n_features
        self.n_outputsC = n_outputsC
        self.weights = self.build (self.n_features, self.N_NODES)
        
                                
    # Define the structure of the model
    def build (self, n_input, N_NODES):

        # Classifier model
        self.model1 = Sequential()
        self.model1.add(Dense (self.N_NODES, activation = 'relu', input_dim  = n_input))
        for i in range(self.n_layers):
            self.model1.add(Dense (self.N_NODES, activation = 'relu'))
        self.model1.add(Dense (self.n_outputsC, activation = 'softmax',input_dim = self.N_NODES))      
        
        return self.model1.weights
     
    # Performs the epochs loop and the actual training.
    # Monitors the training and validation loss of the classifier (there is no adversary here).
    # Returns the classifier categorical accuracy.
    def fit (self, X, Y, X_val, Y_val, show_loss = False):
        losses = []
        losses_val = []

        self.means = np.mean ( X, axis = 0)
        self.sigmas = np.std ( X, axis = 0)

        for iEpoch in tqdm.tqdm(range(self.nEpochs)):
                l, l_val = self._train (X, Y, X_val, Y_val)
                losses.append ( l )
                losses_val.append ( l_val )

        losses = np.array(losses)               
        losses_val = np.array(losses_val)
               
        plt.plot (losses, color = "c", label='Training set')
        plt.plot (losses_val, color ='tab:blue', label = "Validation set")
        plt.xlabel ("Epoch"); plt.ylabel ("Loss")
        plt.legend(frameon=False)
        plt.show()
        
        ca = tf.keras.metrics.CategoricalAccuracy()
        ca.update_state(Y, self.predict_proba(X))
        
        return ca.result().numpy()


    @tf.function
    def _train (self, X, Y, X_val, Y_val):
        Y_true = tf.cast (Y, tf.float32)
        Y_true_val = tf.cast (Y_val, tf.float32)

        with tf.GradientTape() as gt:
            #gt.watch ( self.weightsC )
            Y_hat = self.predict_proba (X)
            Y_hat_val = self.predict_proba (X_val)
            
            ## Training set
            # Use the categorical cross-entropy as loss function for the classifier
            cce = tf.keras.losses.CategoricalCrossentropy()
            loss = tf.reduce_mean ( cce( Y_true, Y_hat ) )
            
            ## Validation set
            cce_val = tf.keras.losses.CategoricalCrossentropy()
            loss_val = tf.reduce_mean (cce_val( Y_true_val, Y_hat_val ) )
            
            # Compute the gradient of the overall loss with respect to the classifier weights
            gradients = gt.gradient ( loss, self.weights )

        # Apply the gradients
        self.optimizer.apply_gradients ( zip(gradients, self.weights) )
        
        return loss, loss_val

    
    # Standardizes the input features and returns the clipped class probabilities of the classifier.
    @tf.function
    def predict_proba (self, X):
        ppX = (X - self.means)/self.sigmas
        return  tf.clip_by_value ( self.model1 (ppX) , 1e-7, 1. - 1e-7 )
    
    def save_weights(self, model_name):
        self.model1.save_weights(model_name+'_weights_1')
    
    def load_weights(self, model_name):
        self.model1.load_weights(model_name+'_weights_1')
        
    def save_model(self, model_name):
        self.model1.save("saved_models/"+model_name+"_1")

    def reset_optimizers(self):
        self.optimizer  = tf.optimizers.Adam (self.learning_rate)
        
    def set_epochs(self, epochs):
        self.nEpochs = epochs
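
The pre-processing applied in `predict_proba` is a standardization with the means and standard deviations computed once on the training set in `fit`. A minimal NumPy sketch of that step is given below; the toy arrays and the helper name `standardize` are assumptions made for illustration and are not part of the class above.

```python
import numpy as np

# Toy data standing in for the real features (illustrative assumption)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_new = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Statistics are computed on the training set only, as in fit()
means = X_train.mean(axis=0)
sigmas = X_train.std(axis=0)

def standardize(X, means, sigmas):
    """Hypothetical helper: shift and scale each feature column."""
    return (X - means) / sigmas

Z_train = standardize(X_train, means, sigmas)
# Any other sample is transformed with the *training* statistics
Z_new = standardize(X_new, means, sigmas)
```

Reusing the training statistics for every other sample keeps the inputs seen by the network consistent between training and evaluation.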
        


# Training of the simple DNN model

## Extract the input features and labels from the SM dataset and split in training (80%) and validation (20%)

In [None]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
import numpy as np
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['figure.dpi'] = 150

features = df.columns[:24]

NDIM = len(features)

df = shuffle(df)

# Perform the splitting and define training and validation datasets
msk = np.random.rand(len(df)) < 0.8
df_train = df[msk]
df_val = df[~msk]

X = df_train.values[:,0:NDIM]
Y = df_train.values[:,NDIM:NDIM+3] # isVBF, isGGH, isBKG

X_val = df_val.values[:,0:NDIM]
Y_val = df_val.values[:,NDIM:NDIM+3] # isVBF, isGGH, isBKG
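
Note that a random boolean mask gives a split that is only approximately 80/20, and that it changes at every run unless the random generator is seeded. The sketch below illustrates this on plain indices; the seed and the number of events are arbitrary assumptions.

```python
import numpy as np

# Seeded generator so that the split is reproducible (seed chosen arbitrarily)
rng = np.random.default_rng(42)
n_events = 10_000  # illustrative size, not the size of the real dataset

# Each event is assigned to the training set with probability 0.8
msk = rng.random(n_events) < 0.8

idx = np.arange(n_events)
train_idx = idx[msk]
val_idx = idx[~msk]

# Fraction of events that actually ended up in the training set
train_fraction = msk.mean()
print(f"training fraction: {train_fraction:.3f}")
```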


## Build the simple DNN model and define the needed parameters

In [None]:
dnn = SimpleNeuralNetwork(500, learning_rate=0.0001, N_NODES=50, n_layers=8, n_features=X.shape[1])

# Save initial set of weights (before training) to re-initialize the DNN in later steps.
# Useful if we want to restart always from the same starting point during the optimization studies.
dnn.save_weights("my_simpleDNN_model_init")

## Perform the training

In [None]:
acc = dnn.fit (X.astype(np.float32), Y.astype(np.float32), X_val.astype(np.float32), Y_val.astype(np.float32))
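
The quantity minimised during the training is the categorical cross-entropy evaluated on clipped probabilities. As a rough NumPy sketch of that loss (the function name and the toy arrays are assumptions for illustration, and this is not the Keras implementation itself):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean over events of -sum_k y_true[k] * log(y_prob[k])."""
    # Clipping avoids log(0), mirroring the clip applied in predict_proba
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# One-hot labels for three toy events (VBF, ggH, BKG)
y_true = np.eye(3)

# A classifier with no discriminating power assigns 1/3 to every class
loss_uniform = categorical_cross_entropy(y_true, np.full((3, 3), 1.0 / 3.0))

# A perfect classifier puts all the probability on the true class
loss_perfect = categorical_cross_entropy(y_true, np.eye(3))
```

For three classes the uninformed classifier gives a loss of ln(3), which is a useful reference value when reading the loss curves.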

# How do we quantify the simple DNN performance?

To compare with the ADNN approach, we want to know the accuracy and the performance of the simple DNN on the models belonging to the target domain, i.e. BSM models.

To achieve this, let's resume the ADNN dataset and use it to evaluate the simple DNN.

## Plot the probability distributions for the three output nodes of C

**YOUR TURN: as for the ADNN, redo the same plots also for the simple DNN**

In [None]:
### YOUR CODE HERE

## Plot the probability distributions for labelled events

**YOUR TURN: as for the ADNN, redo the same plots also for the simple DNN**

In [None]:
### YOUR CODE HERE

## Let's calculate the accuracy

**YOUR TURN: as for the ADNN, check the categorical accuracy of the simple DNN.**

In [None]:
### YOUR CODE HERE
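
As a reminder of what the categorical accuracy measures, here is a small NumPy sketch on toy arrays: an event counts as correct when the most probable class matches the one-hot label. The arrays and the function name are assumptions for illustration; on the real data the inputs would be the labels and the output of `dnn.predict_proba`.

```python
import numpy as np

# Toy one-hot labels (columns: isVBF, isGGH, isBKG)
Y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]], dtype=np.float32)

# Toy class probabilities for the same four events
Y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.5, 0.3]], dtype=np.float32)

def categorical_accuracy(y_true, y_prob):
    """Fraction of events whose arg-max class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)))

acc_toy = categorical_accuracy(Y_true, Y_prob)
print(acc_toy)  # 0.5: the first two events are right, the last two are wrong
```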

## Plot the probability distributions of different VBF signal models

Note that to do this we reload the original dataset, which also contained the BSM events, and we evaluate the simple DNN on it.

**YOUR TURN: after reloading the model used for the ADNN, plot the VBF output distribution for all the BSM models. What do you expect in this case?**

In [None]:
# IF NOT USING COLAB, USE LINES BELOW; OTHERWISE COMMENT THEM AND UNCOMMENT NEXT BLOCK
df_bsm = pd.read_pickle('https://pandora.infn.it/public/488317/dl/dataset_DA.pkl')

#BLOCK FOR COLAB
#!wget https://pandora.infn.it/public/488317/dl/dataset_DA.pkl
#with open('dataset_DA.pkl', "rb") as fh2:
#  df_bsm = pickle.load(fh2)
#END COLAB BLOCK

In [None]:
### YOUR CODE HERE

Wow, that's really bad...

# Summary of the simple DNN performance

Let's finish this part by computing the DNN accuracy on BSM models, as well as the metrics used to evaluate the level of agreement between SM and BSM distributions.
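
The agreement metrics themselves are defined in the other notebook and are not reproduced here. Purely as an illustration of how two output distributions can be compared, the sketch below computes a two-sample Kolmogorov-Smirnov distance with NumPy. The choice of this distance, the function name and the toy scores are all assumptions, and the metric actually used in the ADNN notebook may be a different one.

```python
import numpy as np

def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    pooled = np.concatenate([a, b])
    # Empirical CDF of each sample evaluated at every pooled point
    cdf_a = np.searchsorted(a, pooled, side="right") / a.size
    cdf_b = np.searchsorted(b, pooled, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Toy classifier scores in [0, 1] standing in for SM and BSM VBF outputs
rng = np.random.default_rng(1)
score_sm = rng.beta(5.0, 2.0, size=2000)
score_bsm = rng.beta(2.0, 5.0, size=2000)

ks_toy = ks_statistic(score_sm, score_bsm)
print(f"KS distance between the toy distributions: {ks_toy:.3f}")
```

A value close to 0 means the two distributions are nearly indistinguishable, while a value close to 1 means they barely overlap.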

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools
#from matplotlib.backends.backend_pdf import PdfPages

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix `cm`, annotating each cell with its count.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45, fontsize = 16)
    plt.yticks(tick_marks, classes, fontsize = 16)

    thresh = cm.max() / 1.2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black",fontsize=10)

    plt.xlabel("Predicted label", fontsize=16)
    plt.ylabel("True label", fontsize=16)

    
    plt.tight_layout()
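
The matrix passed to `plot_confusion_matrix` has the true classes along the rows and the predicted classes along the columns. A minimal NumPy sketch of how such a matrix is filled is shown below; the toy labels and the name `cm_toy` are assumptions for illustration, and in the notebook the same matrix could be obtained from `confusion_matrix` in `sklearn.metrics`.

```python
import numpy as np

classes = ["VBF", "ggH", "BKG"]

# Toy class indices (0 = VBF, 1 = ggH, 2 = BKG); on real data these would be
# the arg-max of the one-hot labels and of the predicted probabilities
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

# Rows: true class, columns: predicted class
cm_toy = np.zeros((len(classes), len(classes)), dtype=int)
np.add.at(cm_toy, (y_true, y_pred), 1)

# The diagonal counts the correctly classified events
accuracy_toy = np.trace(cm_toy) / cm_toy.sum()
print(cm_toy)
```

The resulting array can be passed directly as the `cm` argument, with `classes` providing the tick labels.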

**YOUR TURN: use the `summary()` function defined in the other notebook and adapt it to produce the same outputs for the simple DNN**

In [None]:
### YOUR CODE HERE