# XGBoost Classifier on Spark
**Example 4.0b, XGBoost classifier:** This notebook trains a particle classifier with XGBoost, distributing the training across a Spark cluster through PySpark.

The High-Level Features classifier is trained on labeled data:
 - input: 14 features, described in [Topology classification with deep learning to improve real-time event selection at the LHC](https://link.springer.com/epdf/10.1007/s41781-019-0028-1?author_access_token=eTrqfrCuFIP2vF4nDLnFfPe4RwlQNchNByi7wbcMAY7NPT1w8XxcX1ECT83E92HWx9dJzh9T9_y5Vfi9oc80ZXe7hp7PAj21GjdEF2hlNWXYAkFiNn--k5gFtNRj6avm0UukUt9M9hAH_j4UR7eR-g%3D%3D)
 - output: 3 classes, "W + jet", "QCD", "t tbar"; see also [Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics, Comput Softw Big Sci 4, 8 (2020)](https://rdcu.be/b4Wk9)  
 - Open dataset: [download data](https://github.com/cerndb/SparkDLTrigger/tree/master/Data)  
![Physics use case for the particle classifier](../Docs/Physics_use_case.png)

The notebook has been tested using the following configuration:
* *Software stack*: Spark 3.5.1, XGBoost 2.0.3
* *Platform*: CentOS 7, Python 3.11
* *Spark cluster*: Analytix

In [None]:
# There is no need to run this cell when using the CERN SWAN service:
# set the Spark configuration parameters through the "star" button integration instead

# ! pip install pyspark 
# or use your favorite way to set Spark Home, here we use findspark
# import findspark
# findspark.init('/home/luca/Spark/spark-3.5.1-bin-hadoop3') #set path to SPARK_HOME

# Create Spark session and configure according to your environment
from pyspark.sql import SparkSession

spark = ( SparkSession.builder
            .appName("Training-XGBoostClassifier")
            .master("yarn")
            .config("spark.driver.memory","4g")
            .config("spark.executor.memory","32g")
            .config("spark.executor.cores","8")
            .config("spark.executor.instances","4")
            .config("spark.dynamicAllocation.enabled","false") # barrier scheduling does not allow dynamic allocation
            .config("spark.sql.execution.arrow.pyspark.enabled", "true")
            .config("spark.jars.packages", "ml.dmlc:xgboost4j_2.12:2.0.3")
            .config("spark.ui.showConsoleProgress", "false")
            .getOrCreate()
        )


In [2]:
spark

## Load train and test dataframes

In [3]:
# For CERN users, data is already available in the Analytix cluster

# Open data: download the test and training data sets as described at
# https://github.com/cerndb/SparkDLTrigger/tree/master/Data
# DATASET_NAME="trainUndersampled.parquet"
# wget -r -np -R "index.html*" -e robots=off http://sparkdltrigger.web.cern.ch/sparkdltrigger/$DATASET_NAME
# 

# PATH = "./" # use the path where you downloaded the data
# For CERN users, data is already available on the Analytix Hadoop cluster
PATH = "hdfs://analytix/Training/Spark/TopologyClassifier/"


trainDF = ( spark.read.format('parquet')
              .load(PATH + 'trainUndersampled.parquet')
              .selectExpr('hfeatures as features', 'label')
          )
        
testDF = ( spark.read.format('parquet')
             .load(PATH + 'testUndersampled.parquet')
             .selectExpr('hfeatures as features', 'label', 'encoded_label')
         )

In [4]:
# Optionally count the number of events in the training and test datasets
print('There are', trainDF.count(), 'training events')
print('There are', testDF.count(), 'test events')

There are 3426083 training events
There are 856090 test events


In [5]:
# There are 14 High Level Features for this classifier, 
# packed into a vector in the "features" column
# The label can take 3 possible values. Details at:
# https://github.com/cerndb/SparkDLTrigger/tree/master/Data

trainDF.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: long (nullable = true)



In [6]:
testDF.show(3)

+--------------------+-----+-------------+
|            features|label|encoded_label|
+--------------------+-----+-------------+
|[74.9491729736328...|    0|(3,[0],[1.0])|
|[0.0,27.335390090...|    0|(3,[0],[1.0])|
|[47.6835403442382...|    0|(3,[0],[1.0])|
+--------------------+-----+-------------+
only showing top 3 rows
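
The `encoded_label` column holds one-hot vectors printed in Spark's sparse notation `(size,[indices],[values])`, so `(3,[0],[1.0])` is a length-3 vector with 1.0 at index 0. As a minimal sketch, using plain NumPy rather than Spark's own vector classes (`sparse_to_dense` is a hypothetical helper, not part of any library), the expansion to a dense array looks like this:

```python
import numpy as np

def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense NumPy array."""
    dense = np.zeros(size, dtype=np.float64)
    dense[list(indices)] = values
    return dense

# (3,[0],[1.0]) as printed by testDF.show(): a one-hot vector for class 0
one_hot = sparse_to_dense(3, [0], [1.0])
print(one_hot)           # [1. 0. 0.]
print(one_hot.argmax())  # 0
```

The position of the 1.0 matches the integer in the `label` column, which is why both columns can serve as ground truth later in the notebook.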



## Train the classifier using XGBoost

In [7]:
# Install XGBoost if needed
#! pip install xgboost
from xgboost.spark import SparkXGBClassifier

# set up distributed training with XGBoost on Spark
xgboost = SparkXGBClassifier(num_workers = 16)


In [8]:
%time xgboost_model = xgboost.fit(trainDF)

2024-05-17 16:06:15,352 INFO XGBoost-PySpark: _fit Running xgboost-2.0.3 on 16 workers with
	booster params: {'objective': 'multi:softprob', 'device': 'cpu', 'num_class': 3, 'nthread': 1}
	train_call_kwargs_params: {'verbose_eval': True, 'num_boost_round': 100}
	dmatrix_kwargs: {'nthread': 1, 'missing': nan}
2024-05-17 16:08:10,676 INFO XGBoost-PySpark: _fit Finished xgboost training!


CPU times: user 474 ms, sys: 98.3 ms, total: 572 ms
Wall time: 2min 7s
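
The log reports 16 workers with `nthread` set to 1. Each XGBoost worker runs as a Spark task under barrier scheduling, so all of them have to be running at the same time. The sketch below is a back-of-the-envelope check in plain Python, assuming the default `spark.task.cpus` of 1 and the executor settings from the Spark session above; `task_slots` is a hypothetical helper, not a Spark API.

```python
def task_slots(executor_instances, executor_cores, task_cpus=1):
    """Number of Spark tasks that can run concurrently on the cluster."""
    return executor_instances * (executor_cores // task_cpus)

# Values from the SparkSession configuration in this notebook
slots = task_slots(executor_instances=4, executor_cores=8)
num_workers = 16

print(f"{slots} task slots available, {num_workers} XGBoost workers requested")
# Barrier scheduling needs every worker task to start together
assert num_workers <= slots, "not enough task slots to start all workers at once"
```

Under these assumptions the 16 workers occupy half of the available slots, so a larger `num_workers` may be worth trying on the same allocation.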


## Save the model

In [9]:
# save the model to the local filesystem
xgboost_model.save("file:/tmp/models/XGBoost/xgboost_model")

In [13]:
# reload with:
# from xgboost.spark import SparkXGBClassifierModel
# xgboost_model = SparkXGBClassifierModel.load('file:/tmp/models/XGBoost/xgboost_model')

## Prediction

In [10]:
pred = xgboost_model.transform(testDF)

In [11]:
pred.show(5)

+--------------------+-----+-------------+--------------------+----------+--------------------+
|            features|label|encoded_label|       rawPrediction|prediction|         probability|
+--------------------+-----+-------------+--------------------+----------+--------------------+
|[74.9491729736328...|    0|(3,[0],[1.0])|[3.75366473197937...|       0.0|[0.99778509140014...|
|[0.0,27.335390090...|    0|(3,[0],[1.0])|[2.04441666603088...|       0.0|[0.85107922554016...|
|[47.6835403442382...|    0|(3,[0],[1.0])|[3.30667686462402...|       0.0|[0.98559838533401...|
|[80.9036312103271...|    0|(3,[0],[1.0])|[2.26769089698791...|       0.0|[0.90009635686874...|
|[95.2762756347656...|    0|(3,[0],[1.0])|[1.76696527004241...|       0.0|[0.91470485925674...|
+--------------------+-----+-------------+--------------------+----------+--------------------+
only showing top 5 rows
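
With the `multi:softprob` objective, the `probability` column should correspond to the softmax of the per-class margins in `rawPrediction`, and `prediction` to the index of the largest probability. The following sketch illustrates that relation in NumPy; the margins are made-up values, not taken from the truncated table above.

```python
import numpy as np

def softmax(margins):
    """Numerically stable softmax over the last axis."""
    shifted = margins - margins.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical margins for two events and three classes
raw = np.array([[3.0, -2.0, -1.0],
                [0.5,  1.5, -0.5]])

prob = softmax(raw)          # each row sums to 1
pred = prob.argmax(axis=1)   # index of the most probable class

print(prob.round(4))
print(pred)  # [0 1]
```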



## Compute the AUC

In [12]:
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.sql.functions import udf
    
vector_udf = udf(lambda vector: vector.toArray().tolist(),ArrayType(DoubleType()))
pred = pred.select([vector_udf('encoded_label').alias('encoded_label'),
                    vector_udf('probability').alias('probability')])

In [13]:
%time pred_pd = pred.select(['encoded_label', 'probability']).toPandas()

CPU times: user 519 ms, sys: 247 ms, total: 766 ms
Wall time: 22.4 s


In [14]:
pred_pd.head()

     encoded_label                                        probability
0  [1.0, 0.0, 0.0]  [0.9977850914001465, 0.000999481650069356, 0.0...
1  [1.0, 0.0, 0.0]  [0.8510792255401611, 0.003909738268703222, 0.1...
2  [1.0, 0.0, 0.0]  [0.9855983853340149, 0.00229356880299747, 0.01...
3  [1.0, 0.0, 0.0]  [0.9000963568687439, 0.02538803033530712, 0.07...
4  [1.0, 0.0, 0.0]  [0.9147048592567444, 0.06927593052387238, 0.01...


In [15]:
import numpy as np
y_true = np.array(pred_pd['encoded_label'].tolist())
y_pred = np.array(pred_pd['probability'].tolist())

In [16]:
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(3):
    fpr[i], tpr[i], _ = roc_curve(y_true[:, i], y_pred[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
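
The loop treats each class one-vs-rest: column `i` of the one-hot labels is the binary truth and column `i` of the probabilities is the score. To make explicit what `roc_curve` and `auc` compute, here is a simplified NumPy sketch on made-up data. It assumes there are no tied scores, a case scikit-learn handles more carefully; `roc_points` and `trapezoid_auc` are hypothetical helper names.

```python
import numpy as np

def roc_points(y_true, scores):
    """Sweep a threshold from high to low score and return (fpr, tpr) arrays."""
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted)        # true positives accepted so far
    fps = np.cumsum(1 - y_sorted)    # false positives accepted so far
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up binary truth and scores for one class (1 = signal, 0 = background)
toy_truth = np.array([1, 1, 0, 1, 0, 0])
toy_scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])

toy_fpr, toy_tpr = roc_points(toy_truth, toy_scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(f"AUC = {toy_auc:.4f}")  # AUC = 0.8889
```

The result equals the fraction of (signal, background) pairs in which the signal event gets the higher score, here 8 out of 9.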

In [17]:
%matplotlib notebook
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-darkgrid')
plt.figure()
plt.plot(fpr[1], tpr[1], color='blue', 
         lw=2, label='XGBoost classifier (AUC) = %0.4f' % roc_auc[1])
plt.plot([0, 1], [0, 1], color='orange', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title(r'$t\bar{t}$ selector')
plt.legend(loc="lower right")
plt.grid()
plt.show()


In [18]:
plt.figure()
plt.plot(fpr[2], tpr[2], color='blue', 
         lw=2, label='XGBoost classifier (AUC) = %0.4f' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='orange', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Background Contamination (FPR)')
plt.ylabel('Signal Efficiency (TPR)')
plt.title('$W$ selector')
plt.legend(loc="lower right")
plt.grid()
plt.show()


## Confusion Matrix

In [19]:
from sklearn.metrics import accuracy_score

print('Accuracy of the XGBoost classifier: {:.4f}'.format(
    accuracy_score(np.argmax(y_true, axis=1),np.argmax(y_pred, axis=1))))

Accuracy of the XGBoost classifier: 0.9205
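
Both `y_true` (one-hot) and `y_pred` (probabilities) have one column per class, so `np.argmax(..., axis=1)` recovers a class index from each, and the accuracy is the fraction of events where the two indices agree. A small sketch with made-up arrays:

```python
import numpy as np

# Made-up one-hot labels and predicted probabilities for four events
toy_true = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 1, 0]])
toy_pred = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.4, 0.3],
                     [0.2, 0.5, 0.3]])

true_idx = toy_true.argmax(axis=1)   # [0 1 2 1]
pred_idx = toy_pred.argmax(axis=1)   # [0 1 1 1]
toy_accuracy = (true_idx == pred_idx).mean()
print(f"accuracy = {toy_accuracy:.2f}")  # accuracy = 0.75
```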


In [21]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
labels_name = ['qcd', 'tt', 'wjets']
labels = [0,1,2]

cm = confusion_matrix(np.argmax(y_true, axis=1), np.argmax(y_pred, axis=1), labels=labels)

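## Normalize each row so that it sums to 1 (fractions per true class)
cm = cm / cm.sum(axis=1, keepdims=True)

fig, ax = plt.subplots()
ax = sns.heatmap(cm, annot=True, fmt='g')
ax.xaxis.set_ticklabels(labels_name)
ax.yaxis.set_ticklabels(labels_name)
# confusion_matrix puts true labels on rows (y axis) and predictions on columns (x axis)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')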
plt.show()
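
Dividing a confusion matrix by its row sums gives, for each true class, the fraction of events assigned to each predicted class. The row sums must keep their column shape for NumPy broadcasting to divide row by row; a flat `sum(axis=1)` would instead divide each column by a row total. A sketch with a made-up matrix:

```python
import numpy as np

# Made-up counts: rows are true classes, columns are predicted classes
cm_toy = np.array([[8, 2, 0],
                   [1, 3, 1],
                   [0, 5, 15]])

# keepdims=True keeps the sums as a (3, 1) column, so row i is divided by its own total
row_norm = cm_toy / cm_toy.sum(axis=1, keepdims=True)

# Without keepdims the sums have shape (3,), so column j is divided by the total of row j
col_mixed = cm_toy / cm_toy.sum(axis=1)

print(row_norm)
print(row_norm.sum(axis=1))   # [1. 1. 1.]
print(col_mixed.sum(axis=1))  # rows no longer sum to 1
```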


In [22]:
spark.stop()