Now we will train the Particle Sequence classifier.

To run this notebook we used the following configuration:
* *Software stack*: LCG 94 (which includes Spark 2.3.1)
* *Platform*: centos7-gcc7
* *Spark cluster*: Hadalytic

In [23]:
# Check if Spark Session has been created correctly
spark

In [24]:
# Add the BDL zip file
# sc.addPyFile("/eos/project/s/swan/public/BigDL/bigdl-0.7.0-python-api.zip")

## Load train and test dataset

In [25]:
PATH = "file:///data/cern/"

trainDF = spark.read.format('parquet')\
        .load(PATH + 'trainUndersampled.parquet')\
        .select(['GRU_input', 'encoded_label'])
        
testDF = spark.read.format('parquet')\
        .load(PATH + 'testUndersampled.parquet')\
        .select(['GRU_input', 'encoded_label'])

In [26]:
trainDF.printSchema()

root
 |-- GRU_input: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)
 |-- encoded_label: vector (nullable = true)



In [27]:
from pyspark.sql.functions import col, size

trainDF.select(size("GRU_input")).show()

## Create the model

In [28]:
# Init analytics zoo
from zoo.common.nncontext import *
sc = init_nncontext("particle sequence classifier")

In [29]:
# Create a Keras-like Zoo model.
# The only change needed is the package name: zoo.pipeline.api.keras instead of keras
from zoo.pipeline.api.keras.optimizers import Adam
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers.core import *
from zoo.pipeline.api.keras.layers.torch import Select
from zoo.pipeline.api.keras.layers.normalization import BatchNormalization
from zoo.pipeline.api.keras.layers.recurrent import GRU

model = Sequential()
model.add(Masking(0.0, input_shape=(801, 19)))
model.add(GRU(        
        output_dim=50,
        activation='tanh'
        ))
model.add(BatchNormalization())
model.add(Dense(3, activation='softmax'))

creating: createZooKerasSequential
creating: createZooKerasMasking
creating: createZooKerasGRU
creating: createZooKerasBatchNormalization
creating: createZooKerasDense


<zoo.pipeline.api.keras.models.Sequential at 0x7fba07b1c610>

## Create train and validation DataFrame

There is no need to convert the data to an RDD of `Sample`; the DataFrame can be used directly. However, the multi-dimensional array must be flattened before training.

In [30]:
import itertools
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

flat = udf(lambda particles : list(itertools.chain.from_iterable(particles)), ArrayType(DoubleType()))

In [31]:
trainDF = trainDF.withColumn("GRU_input_flat", flat('GRU_input'))
trainDF.printSchema()


root
 |-- GRU_input: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)
 |-- encoded_label: vector (nullable = true)
 |-- GRU_input_flat: array (nullable = true)
 |    |-- element: double (containsNull = true)



In [32]:
testDF = testDF.withColumn("GRU_input_flat", flat('GRU_input'))
testDF.printSchema()

root
 |-- GRU_input: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)
 |-- encoded_label: vector (nullable = true)
 |-- GRU_input_flat: array (nullable = true)
 |    |-- element: double (containsNull = true)



Each row now carries the list of 801 particles with 19 features each (`shape=[801 19]`), flattened into a single array of 801 * 19 = 15219 values, together with the encoded label (`shape=[3]`).
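
As a quick sanity check on these shapes, the sketch below flattens a toy event with the same `itertools` expression used in the `flat` UDF and then cuts it back into rows. The `unflatten` helper and the toy values are hypothetical and only illustrate the row-major layout that the `[801, 19]` shape is assumed to restore; this is not the estimator's internal code.

```python
import itertools

N_PARTICLES, N_FEATURES = 801, 19  # shape used for Masking and feature_preprocessing


def flatten(particles):
    """Row-major flatten, same expression as the `flat` UDF."""
    return list(itertools.chain.from_iterable(particles))


def unflatten(values, n_features=N_FEATURES):
    """Hypothetical inverse: cut the flat list back into rows of n_features."""
    return [values[i:i + n_features] for i in range(0, len(values), n_features)]


# Toy event: feature f of particle p holds the value p * 100 + f
event = [[float(p * 100 + f) for f in range(N_FEATURES)]
         for p in range(N_PARTICLES)]
flat_event = flatten(event)

print(len(flat_event))                 # 15219
print(unflatten(flat_event) == event)  # True
```

Because the flattening is row-major, the first 19 values of the flat array belong to the first particle, the next 19 to the second, and so on.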

## Estimator setup and training

In [33]:
# Set of hyperparameters
numEpochs = 50

# The batch size used by BDL must be a multiple of numExecutors * executorCores,
# because the data will be equally distributed inside each executor

workerBatch = 64
# numExecutors = int(sc._conf.get('spark.executor.instances'))
numExecutors = 1
# executorCores = int(sc._conf.get('spark.executor.cores'))
executorCores = 4

BDLbatch = workerBatch * numExecutors * executorCores
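
The batch-size rule in the comments above can be checked with a few lines of plain Python. This is a minimal sketch: `global_batch_size` and `check_batch_size` are hypothetical helper names, not part of BigDL or Analytics Zoo, and the numbers are the ones used in this notebook.

```python
def global_batch_size(worker_batch, num_executors, executor_cores):
    """Total batch size when every core processes `worker_batch` samples."""
    return worker_batch * num_executors * executor_cores


def check_batch_size(batch_size, num_executors, executor_cores):
    """Return the per-core share, or raise if the batch cannot be split evenly."""
    total_cores = num_executors * executor_cores
    if batch_size % total_cores != 0:
        raise ValueError(
            "batch size %d is not a multiple of %d cores" % (batch_size, total_cores))
    return batch_size // total_cores


batch = global_batch_size(worker_batch=64, num_executors=1, executor_cores=4)
print(batch)                          # 256
print(check_batch_size(batch, 1, 4))  # 64
```

Building the batch size as a product, as the cell above does, satisfies the constraint by construction; the explicit check is mainly useful when the batch size is set by hand.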

In [34]:
# Create SparkML compatible estimator for deep learning training

from bigdl.optim.optimizer import EveryEpoch, Loss, TrainSummary, ValidationSummary
from zoo.pipeline.nnframes import *
from zoo.pipeline.api.keras.objectives import CategoricalCrossEntropy

# Since we flattened the array earlier, we need to pass the original input array shape
# as the feature_preprocessing transformer of the estimator
estimator = NNEstimator(model=model, criterion=CategoricalCrossEntropy(), feature_preprocessing=[801,19])\
        .setOptimMethod(Adam()) \
        .setBatchSize(BDLbatch) \
        .setMaxEpoch(numEpochs) \
        .setFeaturesCol("GRU_input_flat") \
        .setLabelCol("encoded_label") \
        .setValidation(trigger=EveryEpoch() , val_df=testDF,
                       val_method=[Loss(CategoricalCrossEntropy())], batch_size=BDLbatch)

creating: createZooKerasCategoricalCrossEntropy
creating: createSeqToTensor
creating: createSeqToTensor
creating: createFeatureLabelPreprocessing
creating: createNNEstimator
creating: createZooKerasAdam
creating: createDefault
creating: createEveryEpoch
creating: createZooKerasCategoricalCrossEntropy
creating: createLoss


Let's define a directory to store logs (i.e. train and validation losses) and save models

In [35]:
# name of our application
appName = "ParticleSequenceClassifier"

# Change it! 
logDir = "/data/cern/ZOOlogs"

# Check if there is already an application with the same name 
# and remove it, otherwise logs will be appended to that app
import os
import shutil
# ignore_errors=True makes this a no-op when the directory does not exist
shutil.rmtree(os.path.join(logDir, appName), ignore_errors=True)

print("Saving logs to {}".format(logDir+'/'+appName))

Saving logs to /data/cern/ZOOlogs/ParticleSequenceClassifier


In [36]:
trainSummary = TrainSummary(log_dir=logDir,app_name=appName)
estimator.setTrainSummary(trainSummary)

valSummary = ValidationSummary(log_dir=logDir,app_name=appName)
estimator.setValidationSummary(valSummary)

creating: createTrainSummary
creating: createValidationSummary


NNEstimator_4bc49b5297bae17114b5




We are now ready to launch the training.

Warning: during training it is better to turn off the Toggle Spark Monitoring Display. Each iteration is seen as a Spark job, so the display will try to render all of them, which can cause problems for the browser.

In [37]:
%%time 
trained_model = estimator.fit(trainDF)

creating: createToTuple
creating: createChainedPreprocessing
CPU times: user 172 ms, sys: 48 ms, total: 220 ms
Wall time: 44min 23s


## Plot loss

In [38]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
%matplotlib notebook

loss = np.array(trainSummary.read_scalar("Loss"))
val_loss = np.array(valSummary.read_scalar("Loss"))

plt.plot(loss[:,0], loss[:,1], label="Training loss")
plt.plot(val_loss[:,0], val_loss[:,1], label="Validation loss", color='crimson', alpha=0.8)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.legend()
plt.title("Particle sequence classifier loss")
plt.show()


## Save the model

In [39]:
modelDir = logDir + '/nnmodels/ParticleSequenceClassifier'
trained_model.save(modelDir)

NNModel_4aa08cdb26a2d3fb2c4e

It is possible to load the model in the following way:
```Python
model = NNModel.load(path=modelDir+'/module')
```

## Prediction

In [40]:
# Predict with trained NNModel using pipeline transform API
predDF = trained_model.transform(testDF)

In [41]:
y_pred = np.asarray(predDF.select("prediction").collect())
y_true = np.asarray(testDF.select('encoded_label').rdd\
                    .map(lambda row: np.asarray(row.encoded_label)).collect())
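
The cell above collects predictions and labels from two different DataFrames, which relies on both actions returning rows in the same order; Spark does not guarantee this in general. A more defensive variant, sketched below, reads both columns from the same rows. It assumes that `predDF` still carries the `encoded_label` column next to `prediction`, as Spark ML transformers usually keep the input columns. The `split_pairs` helper and the stand-in rows are hypothetical.

```python
from collections import namedtuple


def split_pairs(rows):
    """Split rows holding `prediction` and `encoded_label` into aligned lists."""
    y_pred = [list(row.prediction) for row in rows]
    y_true = [list(row.encoded_label) for row in rows]
    return y_pred, y_true


# Stand-in for: rows = predDF.select("prediction", "encoded_label").collect()
Row = namedtuple("Row", ["prediction", "encoded_label"])
rows = [
    Row([0.7, 0.2, 0.1], [1.0, 0.0, 0.0]),
    Row([0.1, 0.8, 0.1], [0.0, 1.0, 0.0]),
]

y_pred_pairs, y_true_pairs = split_pairs(rows)
print(len(y_pred_pairs) == len(y_true_pairs))  # True
```

With real data, the two lists can be passed to `np.asarray` exactly as in the cell above.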

In [42]:
y_pred = np.squeeze(y_pred)
y_pred.shape

(2123, 3)

In [43]:
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_true[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
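
The loop above treats each of the three classes as a one-vs-rest binary problem: column `i` of the one-hot labels is the truth and column `i` of the softmax output is the score. The pure-Python sketch below shows the underlying computation on four made-up samples. It is not scikit-learn's implementation; `roc_points` and `trapezoid_auc` are hypothetical helper names.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists for a binary problem, lowering the threshold step by step."""
    pos = float(sum(labels))
    neg = float(len(labels) - sum(labels))
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once a group of tied scores is complete
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))


# Made-up one-hot labels and softmax-like scores for 4 samples and 3 classes
toy_true = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
toy_pred = [[0.8, 0.1, 0.1], [0.5, 0.4, 0.1], [0.4, 0.5, 0.1], [0.2, 0.2, 0.6]]

cls = 0  # one-vs-rest: class 0 against the other two
labels0 = [row[cls] for row in toy_true]
scores0 = [row[cls] for row in toy_pred]

fpr0, tpr0 = roc_points(labels0, scores0)
auc0 = trapezoid_auc(fpr0, tpr0)
print(auc0)  # 0.75
```

The value 0.75 matches the ranking interpretation of the AUC: of the four (positive, negative) pairs in the toy data, three have the positive sample scored higher.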

In [44]:
plt.figure()
plt.plot(fpr[0], tpr[0], lw=2, 
         label='Particle sequence classifier (AUC) = %0.4f' % roc_auc[0])
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$tt$ selector')
plt.legend(loc="lower right")
plt.show()
