# Random Forest Classifier on Spark
**Example 4.0a, Random Forest classifier:** This notebook trains a particle classifier with a Random Forest, distributed across a Spark cluster with the PySpark ML APIs.

The High-Level Features classifier is trained on labeled data:
 - input: 14 features, described in [Topology classification with deep learning to improve real-time event selection at the LHC](https://link.springer.com/epdf/10.1007/s41781-019-0028-1?author_access_token=eTrqfrCuFIP2vF4nDLnFfPe4RwlQNchNByi7wbcMAY7NPT1w8XxcX1ECT83E92HWx9dJzh9T9_y5Vfi9oc80ZXe7hp7PAj21GjdEF2hlNWXYAkFiNn--k5gFtNRj6avm0UukUt9M9hAH_j4UR7eR-g%3D%3D)
 - output: 3 classes, "W + jet", "QCD", "t tbar", see also [Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Comput Softw Big Sci 4, 8 (2020)](https://rdcu.be/b4Wk9)
 - Open dataset: [download data](https://github.com/cerndb/SparkDLTrigger/tree/master/Data)
![Physics use case for the particle classifier](../Docs/Physics_use_case.png)

The notebook has been tested using the following configuration:
* *Software stack*: Spark 3.3.2
* *Platform*: CentOS 7, Python 3.9
* *Spark cluster*: Analytix

In [1]:
# No need to run this when using CERN SWAN service
# Just add the configuration parameters for Spark on the "star" button integration

# pip install pyspark or use your favorite way to set Spark Home, here we use findspark
import findspark
findspark.init('/home/luca/Spark/spark-3.3.2-bin-hadoop3') #set path to SPARK_HOME

# Create Spark session and configure according to your environment
from pyspark.sql import SparkSession

spark = ( SparkSession.builder
          .appName("Training-RandomForestClassifier")
          .master("yarn")
          .config("spark.driver.memory","4g")
          .config("spark.executor.memory","32g")
          .config("spark.executor.cores","8")
          .config("spark.sql.execution.arrow.pyspark.enabled", "true")
          .config("spark.dynamicAllocation.enabled", "true")
          .config("spark.ui.showConsoleProgress", "false")
          .getOrCreate()
        )

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/01 16:50:26 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
23/03/01 16:50:48 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!


In [2]:
spark

## Load train and test dataframes

In [3]:
# The test and training datasets can be downloaded as described at
# https://github.com/cerndb/SparkDLTrigger/tree/master/Data
# DATASET_NAME="trainUndersampled.parquet"
# wget -r -np -R "index.html*" -e robots=off http://sparkdltrigger.web.cern.ch/sparkdltrigger/$DATASET_NAME
# ...

# For CERN users, data is already available on the Analytix Hadoop cluster
PATH = "hdfs://analytix/Training/Spark/TopologyClassifier/"

trainDF = ( spark.read.format('parquet')
              .load(PATH + 'trainUndersampled.parquet')
              .select(['hfeatures', 'label', 'encoded_label'])
          )
        
testDF = ( spark.read.format('parquet')
             .load(PATH + 'testUndersampled.parquet')
             .select(['hfeatures', 'label', 'encoded_label'])
         )

In [4]:
# Optionally count the number of events in the training and test datasets
print('There are', trainDF.count(), 'training events')
print('There are', testDF.count(), 'test events')

There are 3426083 training events
There are 856090 test events


In [5]:
# There are 14 High Level Features for this classifier,
# packed into a vector in the "hfeatures" column.
# The label can take 3 possible values. Details at:
# https://github.com/cerndb/SparkDLTrigger/tree/master/Data

trainDF.printSchema()

root
 |-- hfeatures: vector (nullable = true)
 |-- label: long (nullable = true)
 |-- encoded_label: vector (nullable = true)



In [6]:
testDF.show(3)

+--------------------+-----+-------------+
|           hfeatures|label|encoded_label|
+--------------------+-----+-------------+
|[74.9491729736328...|    0|(3,[0],[1.0])|
|[0.0,27.335390090...|    0|(3,[0],[1.0])|
|[47.6835403442382...|    0|(3,[0],[1.0])|
+--------------------+-----+-------------+
only showing top 3 rows
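
The `encoded_label` column is printed in Spark's sparse-vector notation, `(size,[indices],[values])`, so `(3,[0],[1.0])` is the one-hot vector for class 0. The sketch below expands that notation in plain Python to make the encoding explicit. `sparse_to_dense` is a hypothetical helper written for illustration, not a PySpark API.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, [indices], [values]) sparse notation into a dense list."""
    dense = [0.0] * size
    for idx, val in zip(indices, values):
        dense[idx] = val
    return dense

# (3,[0],[1.0]) as shown in the encoded_label column for label 0
one_hot = sparse_to_dense(3, [0], [1.0])
print(one_hot)  # [1.0, 0.0, 0.0]

# The integer label is the position of the 1.0 in the one-hot vector
label = one_hot.index(1.0)
print(label)  # 0
```

In PySpark the same conversion is available through the vector's `toArray()` method, which this notebook uses later when preparing the data for the AUC computation.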



## Train the classifier using Random Forest

In [7]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(numTrees=100, maxDepth=10,
                            featuresCol='hfeatures',
                            labelCol="label",
                            predictionCol='prediction')

In [8]:
%time rf_model = rf.fit(trainDF)

23/02/24 15:32:30 WARN DAGScheduler: Broadcasting large task binary with size 1013.3 KiB
23/02/24 15:32:38 WARN DAGScheduler: Broadcasting large task binary with size 2003.8 KiB
23/02/24 15:32:49 WARN DAGScheduler: Broadcasting large task binary with size 4.0 MiB
23/02/24 15:32:57 WARN DAGScheduler: Broadcasting large task binary with size 1243.9 KiB
23/02/24 15:33:02 WARN DAGScheduler: Broadcasting large task binary with size 7.9 MiB
23/02/24 15:33:18 WARN DAGScheduler: Broadcasting large task binary with size 2.4 MiB
23/02/24 15:33:26 WARN DAGScheduler: Broadcasting large task binary with size 15.6 MiB
23/02/24 15:33:55 WARN DAGScheduler: Broadcasting large task binary with size 4.7 MiB


CPU times: user 49.5 ms, sys: 17.9 ms, total: 67.4 ms
Wall time: 2min 28s


## Save the model

In [9]:
# save the model to the local filesystem
rf_model.save(path='file:/tmp/models/RandomForest/rf_model')

In [10]:
# reload with:
# from pyspark.ml.classification import RandomForestClassificationModel
# rf_model = RandomForestClassificationModel.load('file:/tmp/models/RandomForest/rf_model')

## Prediction

In [9]:
pred = rf_model.transform(testDF)

In [10]:
pred.show(5)

23/02/24 11:36:07 WARN DAGScheduler: Broadcasting large task binary with size 9.9 MiB


+--------------------+-----+-------------+--------------------+--------------------+----------+
|           hfeatures|label|encoded_label|       rawPrediction|         probability|prediction|
+--------------------+-----+-------------+--------------------+--------------------+----------+
|[74.9491729736328...|    0|(3,[0],[1.0])|[94.0594099218962...|[0.94059409921896...|       0.0|
|[0.0,27.335390090...|    0|(3,[0],[1.0])|[79.7304762776576...|[0.79730476277657...|       0.0|
|[47.6835403442382...|    0|(3,[0],[1.0])|[93.0278130759845...|[0.93027813075984...|       0.0|
|[80.9036312103271...|    0|(3,[0],[1.0])|[75.5027551600768...|[0.75502755160076...|       0.0|
|[95.2762756347656...|    0|(3,[0],[1.0])|[96.2870586750627...|[0.96287058675062...|       0.0|
+--------------------+-----+-------------+--------------------+--------------------+----------+
only showing top 5 rows
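
In the rows above, the first `probability` entry equals the first `rawPrediction` entry divided by 100, the value of `numTrees`. This is consistent with the forest accumulating per-tree class votes in `rawPrediction` and normalizing them to sum to 1 in `probability`. A minimal sketch of that normalization, assuming this interpretation holds; `normalize_votes` is a hypothetical helper and the input values are made up, since the output above truncates the vectors.

```python
def normalize_votes(raw_prediction):
    """Scale a vector of accumulated votes so that its entries sum to 1."""
    total = sum(raw_prediction)
    return [v / total for v in raw_prediction]

# Hypothetical rawPrediction for a 100-tree forest; these values are
# illustrative only and are not taken from the truncated output above.
raw = [94.0, 1.5, 4.5]
prob = normalize_votes(raw)
print(prob)  # [0.94, 0.015, 0.045]

# The predicted class is the index of the largest probability
predicted_class = prob.index(max(prob))
print(predicted_class)  # 0
```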



## Compute the AUC

In [11]:
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.sql.functions import udf
    
vector_udf = udf(lambda vector: vector.toArray().tolist(),ArrayType(DoubleType()))
pred = pred.select([vector_udf('encoded_label').alias('encoded_label'),
                    vector_udf('probability').alias('probability')])

In [None]:
%time pred_pd = pred.select(['encoded_label', 'probability']).toPandas()

In [15]:
pred_pd.head()

| | encoded_label | probability |
|---|---|---|
| 0 | [1.0, 0.0, 0.0] | [0.9424003539209361, 0.009917096385128082, 0.0... |
| 1 | [1.0, 0.0, 0.0] | [0.8401903902284893, 0.009514216251149978, 0.1... |
| 2 | [1.0, 0.0, 0.0] | [0.9245375675539995, 0.024420467319359502, 0.0... |
| 3 | [1.0, 0.0, 0.0] | [0.7555536517469684, 0.05238705360406882, 0.19... |
| 4 | [1.0, 0.0, 0.0] | [0.9491124441342738, 0.028314183336938966, 0.0... |


In [16]:
import numpy as np
y_true = np.array(pred_pd['encoded_label'].tolist())
y_pred = np.array(pred_pd['probability'].tolist())

In [17]:
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_true[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
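
The loop above treats each class as a one-vs-rest binary problem: column `i` of the one-hot labels is the binary truth and column `i` of the probabilities is the score. The sketch below shows, in plain Python, what a ROC curve and its area amount to on a tiny example. `roc_points` and `trapezoid_auc` are hypothetical helpers, not scikit-learn functions, the toy data is made up, and the code assumes there are no tied scores.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists by lowering the threshold one sample at a time.

    Assumes binary labels (0 or 1), both classes present, and no tied scores.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit samples from the highest score to the lowest
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for k in order:
        if y_true[k] == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve computed with the trapezoidal rule."""
    area = 0.0
    for j in range(1, len(fpr)):
        area += (fpr[j] - fpr[j - 1]) * (tpr[j] + tpr[j - 1]) / 2.0
    return area

# Hypothetical toy data: 1 marks the signal class, 0 everything else
toy_true = [1, 1, 0, 1, 0, 0]
toy_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

toy_fpr, toy_tpr = roc_points(toy_true, toy_score)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(round(toy_auc, 4))  # 0.8889
```

The result, 8/9, is also the fraction of (signal, background) pairs in which the signal event receives the higher score, which is a common way to read the AUC.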

In [18]:
%matplotlib notebook
import matplotlib.pyplot as plt

# Note: matplotlib 3.6 and later name this style 'seaborn-v0_8-darkgrid'
plt.style.use('seaborn-darkgrid')
plt.figure()
plt.plot(fpr[1], tpr[1], color='blue', 
         lw=2, label='Random Forest classifier (AUC) = %0.4f' % roc_auc[1])
plt.plot([0, 1], [0, 1], color='orange', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$tt$ selector')
plt.legend(loc="lower right")
plt.grid()
plt.show()

[Figure: ROC curve of the $tt$ selector, Signal Efficiency (TPR) vs. Background Contamination (FPR)]

In [19]:
plt.figure()
plt.plot(fpr[2], tpr[2], color='blue', 
         lw=2, label='Random Forest classifier (AUC) = %0.4f' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='orange', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$W$ selector')
plt.legend(loc="lower right")
plt.grid()
plt.show()

[Figure: ROC curve of the $W$ selector, Signal Efficiency (TPR) vs. Background Contamination (FPR)]

## Confusion Matrix

In [20]:
from sklearn.metrics import accuracy_score

print('Accuracy of the HLF classifier: {:.4f}'.format(
    accuracy_score(np.argmax(y_true, axis=1),np.argmax(y_pred, axis=1))))

Accuracy of the HLF classifier: 0.9076


In [21]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
labels_name = ['qcd', 'tt', 'wjets']
labels = [0,1,2]

cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), labels=labels)

## Normalize CM: divide each row (one true class) by its own total
cm = cm.astype(float) / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='g')
ax.xaxis.set_ticklabels(labels_name)
ax.yaxis.set_ticklabels(labels_name)
# confusion_matrix puts true labels on rows (y axis) and predictions on columns (x axis)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()


[Figure: heatmap of the normalized confusion matrix for the qcd, tt, and wjets classes]
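
Row-normalizing a confusion matrix means dividing every row, which holds the events of one true class, by that row's total, so the diagonal reads as the per-class efficiency. The sketch below walks through the steps in plain Python on made-up data. `argmax`, `confusion_counts`, and `row_normalize` are hypothetical helpers, and the code assumes every class appears at least once among the true labels. With NumPy, the same normalization can be written as `cm / cm.sum(axis=1, keepdims=True)`.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=lambda j: row[j])

def confusion_counts(true_idx, pred_idx, n_classes):
    """Count matrix with rows = true class and columns = predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_idx, pred_idx):
        cm[t][p] += 1
    return cm

def row_normalize(cm):
    """Divide every row by its own total so that each row sums to 1."""
    return [[count / sum(row) for count in row] for row in cm]

# Hypothetical toy data: one-hot labels and class probabilities for 3 classes
toy_y_true = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
toy_y_pred = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.2, 0.7, 0.1],
              [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.5, 0.1, 0.4]]

true_idx = [argmax(r) for r in toy_y_true]
pred_idx = [argmax(r) for r in toy_y_pred]

cm_counts = confusion_counts(true_idx, pred_idx, 3)
cm_norm = row_normalize(cm_counts)
print(cm_counts)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(cm_norm)    # [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.5, 0.0, 0.5]]
```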

In [9]:
spark.stop()