## Training of the High Level Feature classifier with TensorFlow/Keras

**4.0 TensorFlow/Keras, HLF classifier** This notebook trains a dense neural network for the particle classifier using High Level Features. It uses TensorFlow/Keras on a single node. Spark is used in local mode to read the data.

To run this notebook we used the following configuration:
* *Software stack*: Spark 2.4.3, TensorFlow 1.14.0 or 2.0.0_beta1
* *Platform*: CentOS 7, Python 3.6
* *Spark*: local mode

In [1]:
# pip install pyspark or use your favorite way to set Spark Home, here we use findspark
import findspark
findspark.init('/home/luca/Spark/spark-2.4.3-bin-hadoop2.7') #set path to SPARK_HOME

In [2]:
# Configure according to your environment
# Spark is used in local mode, just to read the data for the HLF classifier (~300 MB)

from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .appName("4.0c-Tensorflow-Keras HLF classifier") \
        .master("local[*]") \
        .config("spark.driver.memory","2g") \
        .config("spark.sql.execution.arrow.enabled","true") \
        .getOrCreate()

In [3]:
# Check if Spark Session has been created correctly
spark

## Load train and test datasets via Spark

In [4]:
#PATH = "file:<full_path>/SparkDLTrigger/Data/"
PATH = "../Data/"

trainDF = spark.read.format('parquet')\
        .load(PATH + 'trainUndersampled_HLF_features.parquet')\
        .select(['HLF_input', 'encoded_label'])
        
testDF = spark.read.format('parquet')\
        .load(PATH + 'testUndersampled_HLF_features.parquet')\
        .select(['HLF_input', 'encoded_label'])

In [5]:
# Check the number of events in the train and test datasets

num_test = testDF.count()
num_train = trainDF.count()

print('There are {} events in the test dataset'.format(num_test))
print('There are {} events in the train dataset'.format(num_train))

There are 856090 events in the test dataset
There are 3426083 events in the train dataset


In [6]:
# Show the schema and a data sample of the test dataset
testDF.printSchema()
testDF.limit(5).toPandas()

root
 |-- HLF_input: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- encoded_label: array (nullable = true)
 |    |-- element: double (containsNull = true)



| | HLF_input | encoded_label |
|---|---|---|
| 0 | [0.015150733133517018, 0.003511028294205839, 0... | [1.0, 0.0, 0.0] |
| 1 | [0.0, 0.003881822832783805, 0.7166341448458555... | [1.0, 0.0, 0.0] |
| 2 | [0.009639073600865505, 0.0010022659022912096, ... | [1.0, 0.0, 0.0] |
| 3 | [0.016354407625436572, 0.002108937905084598, 0... | [1.0, 0.0, 0.0] |
| 4 | [0.01925979125354152, 0.004603697276827594, 0.... | [1.0, 0.0, 0.0] |


## Convert training and test datasets from Spark DataFrames to Numpy arrays

Next we collect the Spark DataFrames and convert them into NumPy arrays so that they can be fed to TensorFlow/Keras.
We use Spark's Arrow-based `toPandas` optimization (enabled above with `spark.sql.execution.arrow.enabled`) to move data faster between the JVM and Python.


In [7]:
import numpy as np

trainDF_pandas=trainDF.toPandas()
testDF_pandas=testDF.toPandas()

%time X = np.array(trainDF_pandas["HLF_input"].tolist())
%time y = np.array(trainDF_pandas["encoded_label"].tolist())

%time X_test = np.array(testDF_pandas["HLF_input"].tolist())
%time y_test = np.array(testDF_pandas["encoded_label"].tolist())

CPU times: user 961 ms, sys: 96 ms, total: 1.06 s
Wall time: 1.06 s
CPU times: user 904 ms, sys: 0 ns, total: 904 ms
Wall time: 903 ms
CPU times: user 226 ms, sys: 30.1 ms, total: 256 ms
Wall time: 255 ms
CPU times: user 228 ms, sys: 1.55 ms, total: 230 ms
Wall time: 229 ms
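
The `.tolist()` step in the conversion above matters because a column of arrays arrives in pandas as a one-dimensional column of objects. The following is a minimal NumPy-only sketch of that situation. It assumes each row holds a list of 14 floats; the object array `column` is a hypothetical stand-in for `trainDF_pandas["HLF_input"].values`, and the values are made-up placeholders rather than data from the dataset.

```python
import numpy as np

# Hypothetical stand-in for a pandas column of arrays:
# a 1-D object array whose elements are lists of 14 floats.
rows = [[float(i + j) for j in range(14)] for i in range(3)]
column = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    column[i] = row

# The object array is 1-D, so it is not usable as a feature matrix.
print(column.shape, column.dtype)   # (3,) object

# Going through a list of lists lets NumPy build a dense 2-D float array.
X_demo = np.array(column.tolist())
print(X_demo.shape, X_demo.dtype)   # (3, 14) float64
```

The dense `(n_events, 14)` array is the shape expected by a Keras model whose first layer declares `input_shape=(14,)`.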


**As a reference**, this is the code without the `toPandas` optimization. It takes a few minutes to execute, compared with the few seconds of the optimized code above. The cause is that serializing data from the Spark JVM to Python is slow; moreover, only one core is used in this case.

```
import numpy as np

%time X = np.array(trainDF.select("HLF_input").collect()).reshape(num_train,14)
%time y = np.array(trainDF.select("encoded_label").collect()).reshape(num_train,3)

%time X_test = np.array(testDF.select("HLF_input").collect()).reshape(num_test,14)
%time y_test = np.array(testDF.select("encoded_label").collect()).reshape(num_test,3)
```

## Create the Keras model

In [8]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

def create_model(nh_1, nh_2, nh_3):
    ## Create model
    model = Sequential()
    model.add(Dense(nh_1, input_shape=(14,), activation='relu'))
    model.add(Dense(nh_2, activation='relu'))
    model.add(Dense(nh_3, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    
    ## Compile model
    optimizer = 'Adam'
    loss = 'categorical_crossentropy'
    model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])
    
    return model

keras_model = create_model(50,20,10)
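
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand: each `Dense` layer has a weight matrix of size inputs × units plus one bias per unit. The sketch below does that arithmetic for the 14-50-20-10-3 architecture; `dense_param_count` is a hypothetical helper written for illustration, not a Keras function.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total

# 14 input features, hidden layers of 50, 20 and 10 units, 3 output classes
n_params = dense_param_count([14, 50, 20, 10, 3])
print(n_params)  # 2013
```

The result should agree with the total reported by `keras_model.summary()`, assuming the default `use_bias=True` for every layer.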

## Train the model

In [10]:
%time history = keras_model.fit(X, y, batch_size=128, epochs=5, \
                                validation_data=(X_test, y_test), verbose=1)

Train on 3426083 samples, validate on 856090 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
CPU times: user 3min 43s, sys: 20.9 s, total: 4min 3s
Wall time: 2min 18s
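
For orientation, the batch size and the dataset size determine how many gradient updates each epoch performs. The short calculation below uses only numbers printed in this notebook, and assumes that the final, smaller batch of each epoch is kept rather than dropped.

```python
import math

num_train = 3426083   # training events reported above
batch_size = 128
epochs = 5

# Number of batches per epoch, counting the final partial batch
steps_per_epoch = math.ceil(num_train / batch_size)
# Size of that final partial batch
last_batch = num_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch, steps_per_epoch * epochs)
# 26767 35 133835
```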


## Performance metrics

In [11]:
%matplotlib notebook
import matplotlib.pyplot as plt 
plt.style.use('seaborn-darkgrid')
# Graph with loss vs. epoch

plt.figure()
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(loc='upper right')
plt.title("HLF classifier loss")
plt.show()

[Figure: HLF classifier loss vs. epoch, train and validation curves]

In [13]:
# Graph with accuracy vs. epoch
%matplotlib notebook
plt.figure()
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(loc='lower right')
plt.title("HLF classifier accuracy")
plt.show()

[Figure: HLF classifier accuracy vs. epoch, train and validation curves]

## Confusion Matrix

In [14]:
y_pred=history.model.predict(X_test)
y_true=y_test

In [15]:
from sklearn.metrics import accuracy_score

print('Accuracy of the HLF classifier: {:.4f}'.format(
    accuracy_score(np.argmax(y_true, axis=1),np.argmax(y_pred, axis=1))))

Accuracy of the HLF classifier: 0.9164
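
The accuracy above compares class indices, not probability vectors: `np.argmax(..., axis=1)` turns each one-hot label and each softmax output into the index of its largest entry. Here is a small sketch of the same computation in plain NumPy, using made-up values for four events (`y_true_demo` and `y_pred_demo` are illustrative names, not notebook variables).

```python
import numpy as np

# Hypothetical one-hot labels and softmax outputs for four events
y_true_demo = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1],
                        [0, 1, 0]], dtype=float)
y_pred_demo = np.array([[0.8, 0.1, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.3, 0.4, 0.3],   # misclassified: class 1 instead of 2
                        [0.1, 0.6, 0.3]])

true_cls = np.argmax(y_true_demo, axis=1)   # index of the 1 in each label
pred_cls = np.argmax(y_pred_demo, axis=1)   # most probable class per event

# Fraction of events where the two indices agree
accuracy = np.mean(true_cls == pred_cls)
print(true_cls, pred_cls, accuracy)   # [0 1 2 1] [0 1 1 1] 0.75
```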


In [16]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
labels_name = ['qcd', 'tt', 'wjets']
labels = [0,1,2]

cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), labels=labels)

## Normalize CM so that each row (true class) sums to 1
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='g')
ax.xaxis.set_ticklabels(labels_name)
ax.yaxis.set_ticklabels(labels_name)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

[Figure: normalized confusion matrix heatmap for the classes qcd, tt, wjets]
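
In scikit-learn's `confusion_matrix`, rows correspond to true classes and columns to predicted classes, so normalizing per true class means dividing every row by its own total. The sketch below shows this with NumPy broadcasting on a made-up count matrix (`cm_demo` is hypothetical, not the notebook's result).

```python
import numpy as np

# Hypothetical raw counts: rows are true classes, columns are predictions
cm_demo = np.array([[90,  5,  5],
                    [10, 80, 10],
                    [ 4,  6, 40]])

# keepdims=True keeps the totals as a (3, 1) column,
# so broadcasting divides row i by the total of row i.
row_totals = cm_demo.sum(axis=1, keepdims=True)
cm_norm = cm_demo / row_totals

print(cm_norm.sum(axis=1))   # each row sums to 1
print(np.diag(cm_norm))      # fraction of each true class predicted correctly
```

Without `keepdims=True` the totals form a flat vector of shape `(3,)`, which NumPy aligns with the columns instead of the rows; the two results only coincide when all classes have the same number of events.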

## ROC and AUC

In [17]:
from sklearn.metrics import roc_curve, auc

fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

In [18]:
# Dictionary containing the ROC AUC for the three classes
roc_auc

{0: 0.9874097428496796, 1: 0.9856290300776949, 2: 0.9814107498814256}
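
To make the numbers above less of a black box, here is a simplified sketch of what a one-vs-rest ROC curve and its area amount to: sort the events by descending score, accumulate true and false positive rates, and integrate with the trapezoidal rule. `roc_auc_sketch` is a hypothetical helper that assumes all scores are distinct; scikit-learn's `roc_curve` and `auc` additionally handle tied scores and return the thresholds. The six labels and scores are made up.

```python
import numpy as np

def roc_auc_sketch(y_true_bin, scores):
    """ROC points and trapezoidal AUC for one class (assumes no tied scores)."""
    order = np.argsort(-scores)              # highest score first
    hits = y_true_bin[order]                 # 1 = signal, 0 = background
    tpr = np.concatenate([[0.0], np.cumsum(hits) / hits.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - hits) / (1 - hits).sum()])
    # Trapezoidal rule over the (fpr, tpr) points
    area = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, area

# Hypothetical labels (1 = this class) and classifier scores for six events
y_demo = np.array([1, 0, 1, 1, 0, 0])
s_demo = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])

fpr_demo, tpr_demo, auc_demo = roc_auc_sketch(y_demo, s_demo)
print(round(auc_demo, 4))   # 0.7778
```

The value 7/9 equals the fraction of (signal, background) pairs in which the signal event receives the higher score, which is the usual interpretation of the AUC.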

In [19]:
%matplotlib notebook

# Plot the ROC curve for class index 0 (labels_name above maps index 0 to 'qcd' and index 1 to 'tt')
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')

plt.figure()
plt.plot(fpr[0], tpr[0], lw=2, \
         label='HLF classifier (AUC) = %0.4f' % roc_auc[0])
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$tt$ selector')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curve, Signal Efficiency (TPR) vs. Background Contamination (FPR), titled "tt selector"]