## Hyperparameter search

This stage performs a grid search using Spark to find the best model for the HLF classifier. We achieve this by training multiple Keras models in parallel.

To run this notebook we used the following configuration:
* *Software stack*: Spark 2.4.1
* *Platform*: CentOS 7, Python 3.6
* *Spark cluster*: Analytix

In [1]:
# pip install pyspark or use your favorite way to set Spark Home, here we use findspark
import findspark
findspark.init('/home/luca/Spark/spark-2.4.1-bin-hadoop2.7') #set path to SPARK_HOME

In [2]:
# Configure according to your environment

from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .appName("3-Hyperparameter search") \
        .master("yarn") \
        .config("spark.driver.memory","8g") \
        .config("spark.executor.memory","8g") \
        .config("spark.executor.cores","6") \
        .config("spark.executor.instances","50") \
        .config("spark.dynamicAllocation.enabled","false") \
        .getOrCreate()

In [3]:
# Check if Spark Session has been created correctly
spark

## Load train and test datasets

In [4]:
PATH = "hdfs://analytix/Training/Spark/TopologyClassifier/"

trainDF = spark.read.format('parquet')\
        .load(PATH + 'trainUndersampled.parquet')\
        .select(['HLF_input', 'encoded_label'])
        
testDF = spark.read.format('parquet')\
        .load(PATH + 'testUndersampled.parquet')\
        .select(['HLF_input', 'encoded_label'])

In [5]:
# Optionally check the number of events in the train and test datasets

test_events = testDF.count()
train_events = trainDF.count()
print('There are {} events in the train dataset'.format(train_events))
print('There are {} events in the test dataset'.format(test_events))

There are 3426083 events in the train dataset
There are 856090 events in the test dataset


### Take a small subset of the data

In [5]:
# Customize the fraction of data that you want to use
fraction=0.02 # 2%

trainDF_fraction = trainDF.sample(fraction=fraction, seed=42)
testDF_fraction = testDF.sample(fraction=fraction, seed=42)
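As a rough sanity check, we can estimate how many events the 2% sample should contain from the counts printed earlier. This is only a back-of-the-envelope sketch: Spark's `sample()` does not guarantee an exact fraction, so the real subset sizes will differ slightly. The names `expected_train` and `expected_test` are introduced here for illustration.

```python
# Event counts as printed by the count() cell earlier in this notebook
train_events = 3426083
test_events = 856090
fraction = 0.02  # same fraction as used for sampling

# Expected subset sizes; the actual sizes are only approximately equal to these
expected_train = round(train_events * fraction)
expected_test = round(test_events * fraction)

print('Expected about {} train events'.format(expected_train))
print('Expected about {} test events'.format(expected_test))
```

Subsets of this size fit comfortably in driver memory, which is what makes the later conversion to Pandas practical.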

## Convert to Pandas 

Next, we collect the Spark DataFrames and convert them to Pandas DataFrames so that the data can be used with Keras.

In [6]:
trainDF = trainDF_fraction.toPandas()
testDF = testDF_fraction.toPandas()

In [7]:
trainDF.head()

Unnamed: 0,HLF_input,encoded_label
0,"[0.00809329440432877, 0.0017148579507185722, 0...","(0.0, 0.0, 1.0)"
1,"[0.013682833697080407, 0.0021841484297045193, ...","(1.0, 0.0, 0.0)"
2,"[0.006159859158211762, 0.0034116357291298794, ...","(0.0, 0.0, 1.0)"
3,"[0.0, 0.005300722834447411, 0.6162382423979611...","(0.0, 0.0, 1.0)"
4,"[0.006966742424100583, 0.006391548439125299, 0...","(1.0, 0.0, 0.0)"


We need to convert the `Dense` and `Sparse` vectors into lists.

In [8]:
trainDF[trainDF.columns] = trainDF[trainDF.columns].applymap(lambda x: list(x))
testDF[testDF.columns] = testDF[testDF.columns].applymap(lambda x: list(x))

In [9]:
import numpy as np

X = np.array(trainDF['HLF_input'].tolist())
y = np.array(trainDF['encoded_label'].tolist())

X_test = np.array(testDF['HLF_input'].tolist())
y_test = np.array(testDF['encoded_label'].tolist())

## Create the Keras model

In [10]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

def create_model(nh_1, nh_2, nh_3):
    ## Create model
    model = Sequential()
    model.add(Dense(nh_1, input_shape=(14,), activation='relu'))
    model.add(Dense(nh_2, activation='relu'))
    model.add(Dense(nh_3, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    
    ## Compile model
    optimizer = 'Adam'
    loss = 'categorical_crossentropy'
    model.compile(loss=loss, optimizer=optimizer, metrics=["accuracy"])
    
    return model
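To get a feeling for how the hidden layer sizes affect model size, the following sketch counts the trainable parameters of a fully connected network like the one above: each `Dense` layer has one weight per input-output pair plus one bias per unit. The helper `dense_param_count` is a hypothetical name introduced here; it is not part of Keras.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += fan_in * units + units  # weights + biases of this layer
        fan_in = units                   # this layer feeds the next one
    return total

# 14 input features, three hidden layers, 3 output classes
small_model = dense_param_count(14, [50, 20, 10, 3])
large_model = dense_param_count(14, [150, 100, 50, 3])

print(small_model)  # 2013
print(large_model)  # 22553
```

For a model built with `create_model`, `model.summary()` should report the same totals.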

## Test baseline model

In [11]:
baseline = create_model(50,20,10)

%time history = baseline.fit(X, y, batch_size=64, epochs=50, validation_data=(X_test, y_test), verbose=0)

CPU times: user 1min 48s, sys: 19.8 s, total: 2min 7s
Wall time: 51.2 s


In [24]:
import matplotlib.pyplot as plt 

# Graph with loss vs. epoch
%matplotlib notebook
plt.figure()
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(loc='upper right')
plt.show()


In [13]:
# Graph with accuracy vs. epoch
%matplotlib notebook
plt.figure()
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='validation')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(loc='lower right')
plt.show()


## Confusion Matrix

In [31]:
y_pred=history.model.predict(X_test)
y_true=y_test

In [32]:
from sklearn.metrics import accuracy_score

print('Accuracy of the HLF classifier: {:.4f}'.format(
    accuracy_score(np.argmax(y_true, axis=1),np.argmax(y_pred, axis=1))))

Accuracy of the HLF classifier: 0.9138


In [33]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
labels_name = ['qcd', 'tt', 'wjets']
labels = [0,1,2]

cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), labels=labels)

## Normalize CM row by row, so that each row (true class) sums to 1
cm = cm / cm.astype(float).sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='g')
ax.xaxis.set_ticklabels(labels_name)
ax.yaxis.set_ticklabels(labels_name)
# confusion_matrix puts true labels on rows (y axis) and predictions on columns (x axis)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
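The normalization step deserves a closer look, because dividing a matrix by a 1-D vector of row sums broadcasts along the wrong axis. A minimal sketch with a made-up 3x3 matrix (the counts are illustrative, not results from this notebook) shows the row-wise version using `keepdims=True`:

```python
import numpy as np

# Toy confusion matrix: rows are true classes, columns are predicted classes
cm_toy = np.array([[8, 1, 1],
                   [2, 6, 2],
                   [0, 5, 15]])

# keepdims=True keeps the row sums as a column vector of shape (3, 1),
# so each row is divided by its own total
row_totals = cm_toy.sum(axis=1, keepdims=True)
cm_toy_normalized = cm_toy / row_totals

print(cm_toy_normalized)
```

After this normalization every row sums to 1, so the diagonal entries can be read as per-class efficiencies.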


## Create the Keras classifier

Wrapping our Keras model in a scikit-learn classifier allows us to use scikit-learn's grid search. We will then distribute the grid search across executors using Spark-Sklearn.

In [14]:
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

model = KerasClassifier(build_fn=create_model, verbose=0)

## Define the grid parameters

In [15]:
batch_size = [64, 100,200]
epochs = [10, 30, 50]

## Number of hidden units per layer
nh_1 = [50,100,150]
nh_2 = [20,50,100]
nh_3 = [10,20,50]

In [16]:
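# Note: 'nb_epoch' is the legacy Keras 1 name for the number of epochs. The fit() method
# of tf.keras expects 'epochs', so this entry may not take effect; verify in your environment.
param_grid = {'batch_size':batch_size,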
              'nb_epoch':epochs,
              'nh_1':nh_1, 'nh_2':nh_2, 'nh_3':nh_3}
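The grid is the Cartesian product of all parameter lists, so its size grows quickly. The sketch below enumerates the candidates with `itertools.product` and multiplies by the number of cross-validation folds, assuming 10-fold cross-validation. The names `grid_values`, `candidates`, and `total_fits` are introduced here for illustration.

```python
from itertools import product

# Same values as the grid defined above
grid_values = {
    'batch_size': [64, 100, 200],
    'nb_epoch': [10, 30, 50],
    'nh_1': [50, 100, 150],
    'nh_2': [20, 50, 100],
    'nh_3': [10, 20, 50],
}

# One dictionary per combination of parameter values
candidates = [dict(zip(grid_values, values))
              for values in product(*grid_values.values())]

n_folds = 10  # assumed number of cross-validation folds
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 243 2430
```

Each of these fits is independent of the others, which is why the search parallelizes well.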

## Grid Search with Spark
Spark is used to parallelize the grid search.

In [17]:
from spark_sklearn.grid_search import GridSearchCV

sc = spark.sparkContext

grid = GridSearchCV(sc, estimator=model, param_grid=param_grid, cv=10, verbose=1)
# Note: for a random grid search, use:
# grid = GridSearchCV(sc, estimator=model, param_grid=random_param_grid, cv=10, verbose=1)
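Conceptually, the search is parallel because every candidate can be scored without knowing anything about the others. The following sketch illustrates that idea with a local thread pool from the standard library instead of Spark executors; it is not how Spark-Sklearn is implemented. The function `toy_score` is a hypothetical stand-in for "train with cross-validation and return the mean score".

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def toy_score(params):
    # Hypothetical scoring function: highest (zero) at nh_1=100, nh_2=50
    nh_1, nh_2 = params
    return -abs(nh_1 - 100) - abs(nh_2 - 50)

# All combinations of two hyperparameters
toy_candidates = list(product([50, 100, 150], [20, 50, 100]))

# Score the candidates concurrently; map() returns results in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    toy_scores = list(pool.map(toy_score, toy_candidates))

# Keep the candidate with the highest score
best_score, best_params = max(zip(toy_scores, toy_candidates))
print(best_params, best_score)
```

In the real search each task trains a Keras model, so distributing the tasks over many executors shortens the wall-clock time considerably.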

In [18]:
%time gridSearch_result = grid.fit(X, y)

Fitting 10 folds for each of 243 candidates, totalling 2430 fits
CPU times: user 3.54 s, sys: 578 ms, total: 4.11 s
Wall time: 1min 33s


In [19]:
# Get the parameters giving the best result
gridSearch_result.best_estimator_.get_params()

{'verbose': 0,
 'batch_size': 64,
 'nb_epoch': 30,
 'nh_1': 150,
 'nh_2': 50,
 'nh_3': 10,
 'build_fn': <function __main__.create_model(nh_1, nh_2, nh_3)>}

In [20]:
# See the performance on the test dataset 
y_pred = gridSearch_result.best_estimator_.predict_proba(X_test)

In [21]:
from sklearn.metrics import roc_curve, auc

fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

In [22]:
# Dictionary containing the ROC AUC for the three classes
roc_auc

{0: 0.9643316889339326, 1: 0.9717324240923253, 2: 0.9592112903729023}
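One way to interpret these numbers: the ROC AUC of a class equals the probability that a randomly chosen event of that class receives a higher score than a randomly chosen event of any other class. The sketch below computes this directly by counting pairs on a tiny made-up example. It is meant as an illustration of the definition, not a replacement for `roc_curve` and `auc`, and `pairwise_auc` is a hypothetical helper name.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy one-vs-rest labels and classifier scores (illustrative values only)
toy_labels = [0, 0, 1, 1]
toy_class_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_class_scores)
print(toy_auc)  # 0.75
```

The quadratic pair loop is fine for a toy example but far too slow for the full test set, where the threshold-based computation used above is the practical choice.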

In [23]:
%matplotlib notebook

# Plot roc curve 
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')


plt.figure()
plt.plot(fpr[0], tpr[0], lw=2, 
         label='HLF classifier (AUC) = %0.4f' % roc_auc[0])
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$tt$ selector')
plt.legend(loc="lower right")
plt.show()
