### Apache Spark added GPU support in its 3.0 release. In this notebook we explore how to run an ML model on a GPU using Apache Spark 3.0 ###

In [1]:
!nvidia-smi

Sat Aug  7 15:10:27 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
!tar xf spark-3.0.3-bin-hadoop2.7.tgz

In [3]:
!pip install -q findspark

In [4]:
!wget https://repo1.maven.org/maven2/ai/rapids/cudf/0.18/cudf-0.18-cuda11.jar
# Download the RAPIDS cuDF jar, which provides the GPU data reader and GPU DataFrame used to run on the GPU.
# The CUDA version of the jar must match the CUDA version supported by the current GPU. To check it, run
# !nvidia-smi, which shows the GPU type, the available GPU memory, and the CUDA version.

#!wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/rapids-4-spark_2.12-0.1.0.jar
# RAPIDS is NVIDIA's framework for accelerating ML models on GPUs. This jar was not needed when we ran with cudf-0.18-cuda11.

--2021-08-07 15:10:59--  https://repo1.maven.org/maven2/ai/rapids/cudf/0.18/cudf-0.18-cuda11.jar
Resolving repo1.maven.org (repo1.maven.org)... 199.232.192.209, 199.232.196.209
Connecting to repo1.maven.org (repo1.maven.org)|199.232.192.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 419045057 (400M) [application/java-archive]
Saving to: ‘cudf-0.18-cuda11.jar’


2021-08-07 15:11:02 (191 MB/s) - ‘cudf-0.18-cuda11.jar’ saved [419045057/419045057]
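The comment in this cell says the CUDA version of the cuDF jar has to agree with the CUDA version the GPU supports. A minimal sketch of checking that in code, assuming the `nvidia-smi` header keeps the `CUDA Version: X.Y` form shown at the top of this notebook and the jar name keeps its `cudaNN` suffix. The helper names `cuda_major_from_smi` and `cuda_major_from_jar` are hypothetical and only compare major versions.

```python
import re

def cuda_major_from_smi(smi_text):
    """Return the CUDA major version reported in nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    return int(match.group(1)) if match else None

def cuda_major_from_jar(jar_name):
    """Return the CUDA major version encoded in a cuDF jar name, or None."""
    match = re.search(r"cuda(\d+)", jar_name)
    return int(match.group(1)) if match else None

# Header line copied from the nvidia-smi output at the top of this notebook.
smi_header = "| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |"
jar = "cudf-0.18-cuda11.jar"

# True when the jar's CUDA major version matches the one nvidia-smi reports.
jar_matches_driver = cuda_major_from_smi(smi_header) == cuda_major_from_jar(jar)
print(jar_matches_driver)
```

In a live session the header text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of a pasted string.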



In [5]:
# XGBoost is not built into Spark ML, so we need to download it separately.
!wget https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/1.0.0-0.1.0/xgboost4j_3.0-1.0.0-0.1.0.jar
!wget https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/xgboost4j-spark_3.0-1.0.0-0.1.0.jar

--2021-08-07 15:11:02--  https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/1.0.0-0.1.0/xgboost4j_3.0-1.0.0-0.1.0.jar
Resolving repo1.maven.org (repo1.maven.org)... 199.232.192.209, 199.232.196.209
Connecting to repo1.maven.org (repo1.maven.org)|199.232.192.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 231556205 (221M) [application/java-archive]
Saving to: ‘xgboost4j_3.0-1.0.0-0.1.0.jar’


2021-08-07 15:11:03 (206 MB/s) - ‘xgboost4j_3.0-1.0.0-0.1.0.jar’ saved [231556205/231556205]

--2021-08-07 15:11:03--  https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/xgboost4j-spark_3.0-1.0.0-0.1.0.jar
Resolving repo1.maven.org (repo1.maven.org)... 199.232.192.209, 199.232.196.209
Connecting to repo1.maven.org (repo1.maven.org)|199.232.192.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2040779 (1.9M) [application/java-archive]
Saving to: ‘xgboost4j-spark_3.0-1.0.0-0.1.0.jar’


2021-08-07 15:11:03 (49.4 MB/s

In [6]:
!ls

cudf-0.18-cuda11.jar	   spark-3.0.3-bin-hadoop2.7.tgz
sample_data		   xgboost4j_3.0-1.0.0-0.1.0.jar
spark-3.0.3-bin-hadoop2.7  xgboost4j-spark_3.0-1.0.0-0.1.0.jar


In [7]:
!pwd

/content


In [8]:
import os
os.environ["JAVA_HOME"]="/usr/lib/jvm/java-8-openjdk-amd64"
os.environ['SPARK_HOME']="/content/spark-3.0.3-bin-hadoop2.7"

In [9]:
os.environ["PYSPARK_SUBMIT_ARGS"] = '--jars /content/cudf-0.18-cuda11.jar,/content/xgboost4j_3.0-1.0.0-0.1.0.jar,/content/xgboost4j-spark_3.0-1.0.0-0.1.0.jar pyspark-shell'
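The `--jars` value above is a single comma-separated string, which is easy to mistype. A small sketch of building it from a list and checking that each jar is on disk first. `build_submit_args` is a hypothetical helper, not part of PySpark, and the paths are the ones used in this notebook.

```python
import os

def build_submit_args(jar_paths):
    """Join jar paths into the value expected by PYSPARK_SUBMIT_ARGS."""
    return "--jars " + ",".join(jar_paths) + " pyspark-shell"

jars = [
    "/content/cudf-0.18-cuda11.jar",
    "/content/xgboost4j_3.0-1.0.0-0.1.0.jar",
    "/content/xgboost4j-spark_3.0-1.0.0-0.1.0.jar",
]

# Report jars that are not on disk before handing the list to Spark.
missing = [path for path in jars if not os.path.exists(path)]
if missing:
    print("Missing jars:", missing)

submit_args = build_submit_args(jars)
print(submit_args)
```

The result can then be assigned to `os.environ["PYSPARK_SUBMIT_ARGS"]` before `findspark.init()` is called.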

In [11]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
#spark = SparkSession.builder.master("local[*]").config("spark.plugins","com.nvidia.spark.SQLPlugin").config("spark.rapids.memory.gpu.pooling.enabled",False).getOrCreate()
# Two changes are needed to run Spark on the GPU with the RAPIDS plugin:
# 1. Set spark.plugins to com.nvidia.spark.SQLPlugin, the NVIDIA plugin for execution on the GPU.
# 2. Set spark.rapids.memory.gpu.pooling.enabled to False, which tells the RAPIDS memory manager not to pool
#    GPU memory, so allocations go directly to the CUDA environment. In a multi-tenant environment we would
#    want the GPU memory to be managed for us, and we would set this parameter to True.
# Neither change was needed here because rapids-4-spark_2.12-0.1.0.jar was not downloaded or added to the SparkContext.
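The two settings named in the comments above can be kept in one place and applied to the builder in a loop. This is a sketch only: `gpu_session_conf` and `apply_conf` are hypothetical helpers, and the commented-out last line assumes the RAPIDS plugin jar is on the classpath, which is not the case in this notebook.

```python
def gpu_session_conf(pooling_enabled=False):
    """Spark settings named in the comments above for running on the GPU."""
    return {
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.memory.gpu.pooling.enabled": str(pooling_enabled).lower(),
    }

def apply_conf(builder, conf):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

conf = gpu_session_conf(pooling_enabled=False)
print(conf)
# With pyspark imported and the RAPIDS plugin jar available:
# spark = apply_conf(SparkSession.builder.master("local[*]"), conf).getOrCreate()
```

Keeping the settings in a dictionary makes it easy to flip the pooling flag without editing a long chained call.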

In [None]:
spark.sparkContext.addPyFile("/content/xgboost4j-spark_3.0-1.0.0-0.1.0.jar") # Add the XGBoost4J-Spark jar so that its Python bindings can be imported.
#spark.sparkContext.addPyFile("/content/rapids-4-spark_2.12-0.1.0.jar") # Add the RAPIDS jar the same way for its Python bindings.

In [12]:
reader = spark.read

In [13]:
from ml.dmlc.xgboost4j.scala.spark import XGBoostClassificationModel, XGBoostClassifier

import numpy as np
import pandas as pd

In [14]:
!wget https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/covtype_train.parquet
!wget https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/covtype_test.parquet

--2021-08-07 15:12:03--  https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/covtype_train.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6827427 (6.5M) [application/octet-stream]
Saving to: ‘covtype_train.parquet’


2021-08-07 15:12:04 (107 MB/s) - ‘covtype_train.parquet’ saved [6827427/6827427]

--2021-08-07 15:12:04--  https://raw.githubusercontent.com/srivatsan88/YouTubeLI/master/dataset/covtype_test.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1591040 (1.5M) [application/octet-stream]
Sa

In [15]:
!ls -alrt

total 860840
-rw-r--r--  1 root root 231556205 Jun 26  2020 xgboost4j_3.0-1.0.0-0.1.0.jar
-rw-r--r--  1 root root   2040779 Jun 26  2020 xgboost4j-spark_3.0-1.0.0-0.1.0.jar
-rw-r--r--  1 root root 419045057 Feb 25 09:41 cudf-0.18-cuda11.jar
drwxr-xr-x 13 1000 1000      4096 Jun 17 05:27 spark-3.0.3-bin-hadoop2.7
-rw-r--r--  1 root root 220400553 Jun 17 05:28 spark-3.0.3-bin-hadoop2.7.tgz
drwxr-xr-x  4 root root      4096 Jul 16 13:19 .config
drwxr-xr-x  1 root root      4096 Jul 16 13:20 sample_data
drwxr-xr-x  1 root root      4096 Aug  7 15:09 ..
-rw-r--r--  1 root root   6827427 Aug  7 15:12 covtype_train.parquet
drwxr-xr-x  1 root root      4096 Aug  7 15:12 .
-rw-r--r--  1 root root   1591040 Aug  7 15:12 covtype_test.parquet


In [16]:
import pyarrow.parquet as pq
pq.read_table('covtype_train.parquet')

pyarrow.Table
Elevation: double
Aspect: double
Slope: double
Horizontal_Distance_To_Hydrology: double
Vertical_Distance_To_Hydrology: double
Horizontal_Distance_To_Roadways: double
Hillshade_9am: double
Hillshade_Noon: double
Hillshade_3pm: double
Horizontal_Distance_To_Fire_Points: double
Wilderness_Area1: double
Wilderness_Area2: double
Wilderness_Area3: double
Wilderness_Area4: double
Soil_Type1: double
Soil_Type2: double
Soil_Type3: double
Soil_Type4: double
Soil_Type5: double
Soil_Type6: double
Soil_Type7: double
Soil_Type8: double
Soil_Type9: double
Soil_Type10: double
Soil_Type11: double
Soil_Type12: double
Soil_Type13: double
Soil_Type14: double
Soil_Type15: double
Soil_Type16: double
Soil_Type17: double
Soil_Type18: double
Soil_Type19: double
Soil_Type20: double
Soil_Type21: double
Soil_Type22: double
Soil_Type23: double
Soil_Type24: double
Soil_Type25: double
Soil_Type26: double
Soil_Type27: double
Soil_Type28: double
Soil_Type29: double
Soil_Type30: double
Soil_Type31: doubl

In [34]:
train_data = reader.parquet('covtype_train.parquet')
test_data = reader.parquet('covtype_test.parquet')

In [35]:
train_data.show()

+---------+------+-----+--------------------------------+------------------------------+-------------------------------+-------------+--------------+-------------+----------------------------------+----------------+----------------+----------------+----------------+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+------+
|Elevation|Aspect|Slope|Horizontal_Distance_To_Hydrology|Vertical_Distance_To_Hydrology|Horizontal_Distance_To_Roadways|Hillshade_9am|Hillshade_Noon|Hillshade_3pm|Horizontal_Distance_To_Fire_Points|Wilderness_Area1|Wilderness_Area2|Wilderness_Area3|Wilder

In [36]:
train_data.schema

StructType(List(StructField(Elevation,DoubleType,true),StructField(Aspect,DoubleType,true),StructField(Slope,DoubleType,true),StructField(Horizontal_Distance_To_Hydrology,DoubleType,true),StructField(Vertical_Distance_To_Hydrology,DoubleType,true),StructField(Horizontal_Distance_To_Roadways,DoubleType,true),StructField(Hillshade_9am,DoubleType,true),StructField(Hillshade_Noon,DoubleType,true),StructField(Hillshade_3pm,DoubleType,true),StructField(Horizontal_Distance_To_Fire_Points,DoubleType,true),StructField(Wilderness_Area1,DoubleType,true),StructField(Wilderness_Area2,DoubleType,true),StructField(Wilderness_Area3,DoubleType,true),StructField(Wilderness_Area4,DoubleType,true),StructField(Soil_Type1,DoubleType,true),StructField(Soil_Type2,DoubleType,true),StructField(Soil_Type3,DoubleType,true),StructField(Soil_Type4,DoubleType,true),StructField(Soil_Type5,DoubleType,true),StructField(Soil_Type6,DoubleType,true),StructField(Soil_Type7,DoubleType,true),StructField(Soil_Type8,DoubleType

In [37]:
pq_file=pq.read_table('covtype_train.parquet')

In [38]:
pq_file.column_names

['Elevation',
 'Aspect',
 'Slope',
 'Horizontal_Distance_To_Hydrology',
 'Vertical_Distance_To_Hydrology',
 'Horizontal_Distance_To_Roadways',
 'Hillshade_9am',
 'Hillshade_Noon',
 'Hillshade_3pm',
 'Horizontal_Distance_To_Fire_Points',
 'Wilderness_Area1',
 'Wilderness_Area2',
 'Wilderness_Area3',
 'Wilderness_Area4',
 'Soil_Type1',
 'Soil_Type2',
 'Soil_Type3',
 'Soil_Type4',
 'Soil_Type5',
 'Soil_Type6',
 'Soil_Type7',
 'Soil_Type8',
 'Soil_Type9',
 'Soil_Type10',
 'Soil_Type11',
 'Soil_Type12',
 'Soil_Type13',
 'Soil_Type14',
 'Soil_Type15',
 'Soil_Type16',
 'Soil_Type17',
 'Soil_Type18',
 'Soil_Type19',
 'Soil_Type20',
 'Soil_Type21',
 'Soil_Type22',
 'Soil_Type23',
 'Soil_Type24',
 'Soil_Type25',
 'Soil_Type26',
 'Soil_Type27',
 'Soil_Type28',
 'Soil_Type29',
 'Soil_Type30',
 'Soil_Type31',
 'Soil_Type32',
 'Soil_Type33',
 'Soil_Type34',
 'Soil_Type35',
 'Soil_Type36',
 'Soil_Type37',
 'Soil_Type38',
 'Soil_Type39',
 'Soil_Type40',
 'target']

In [39]:
label="target"
features = [ x for x in pq_file.column_names if x != label ]

In [40]:
features

['Elevation',
 'Aspect',
 'Slope',
 'Horizontal_Distance_To_Hydrology',
 'Vertical_Distance_To_Hydrology',
 'Horizontal_Distance_To_Roadways',
 'Hillshade_9am',
 'Hillshade_Noon',
 'Hillshade_3pm',
 'Horizontal_Distance_To_Fire_Points',
 'Wilderness_Area1',
 'Wilderness_Area2',
 'Wilderness_Area3',
 'Wilderness_Area4',
 'Soil_Type1',
 'Soil_Type2',
 'Soil_Type3',
 'Soil_Type4',
 'Soil_Type5',
 'Soil_Type6',
 'Soil_Type7',
 'Soil_Type8',
 'Soil_Type9',
 'Soil_Type10',
 'Soil_Type11',
 'Soil_Type12',
 'Soil_Type13',
 'Soil_Type14',
 'Soil_Type15',
 'Soil_Type16',
 'Soil_Type17',
 'Soil_Type18',
 'Soil_Type19',
 'Soil_Type20',
 'Soil_Type21',
 'Soil_Type22',
 'Soil_Type23',
 'Soil_Type24',
 'Soil_Type25',
 'Soil_Type26',
 'Soil_Type27',
 'Soil_Type28',
 'Soil_Type29',
 'Soil_Type30',
 'Soil_Type31',
 'Soil_Type32',
 'Soil_Type33',
 'Soil_Type34',
 'Soil_Type35',
 'Soil_Type36',
 'Soil_Type37',
 'Soil_Type38',
 'Soil_Type39',
 'Soil_Type40']
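The feature list here is built from the pyarrow table, which means the Parquet file is read a second time. A Spark DataFrame also exposes its column names through `.columns`, so the same split could be done on `train_data` directly. A minimal sketch with a toy column list, where `split_columns` is a hypothetical helper that also guards against a misspelled label:

```python
def split_columns(columns, label):
    """Return the feature column names, i.e. every column except the label."""
    if label not in columns:
        raise ValueError(f"label column {label!r} not found")
    return [name for name in columns if name != label]

# Toy column list; with the notebook's data this would be train_data.columns.
columns = ["Elevation", "Aspect", "Slope", "target"]
feature_cols = split_columns(columns, "target")
print(feature_cols)
```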

In [41]:
import time
params = { 
    'eta': 0.1,
    'gamma': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 6, 
    'growPolicy': 'depthwise',
    'lambda_': 1.0,
    'subsample': 1.0,
    'numRound': 1000,
    'numWorkers': 1,
    'verbosity': 2
}
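# The NVIDIA build of XGBoost4J-Spark takes a list of feature column names via setFeaturesCols;
# setFeaturesCol expects the name of a single vector column.
classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCols(features)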

In [42]:
start_time = time.time()

model=classifier.fit(train_data)

print("GPU Training Time: %s seconds" % (str(time.time() - start_time)))

Py4JJavaError: ignored
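In the timing cell above, the elapsed time is printed only if `fit` returns, so a failure like this one loses the measurement. One possible alternative is a small context manager that records the duration even when the enclosed block raises. This is a sketch: `timed` is a hypothetical helper, and the `sum(...)` call is a stand-in for `classifier.fit(train_data)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label_text, results):
    """Record the wall-clock duration of the enclosed block in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on success and on failure, so the duration is always stored.
        results[label_text] = time.perf_counter() - start
        print(f"{label_text}: {results[label_text]:.2f} seconds")

timings = {}
with timed("GPU Training Time", timings):
    sum(range(100_000))  # stand-in for classifier.fit(train_data)
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and intended for measuring intervals.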

In [None]:
!nvidia-smi

In [None]:
model.write().overwrite().save('/content/model/')

In [None]:
!ls

In [None]:
loaded_model = XGBoostClassificationModel().load('/content/model/')

In [None]:
result=loaded_model.transform(test_data)

In [None]:
result.select(label, 'rawPrediction', 'probability', 'prediction').show(5)
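Showing five rows gives a feel for the output but not for how good the predictions are. A minimal sketch of computing accuracy from (label, prediction) pairs in plain Python. The `accuracy` helper is hypothetical, and the sample values are toy data, not results from this notebook.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs where the two values agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to score")
    correct = sum(1 for actual, predicted in pairs if actual == predicted)
    return correct / len(pairs)

# Toy values standing in for rows collected from `result`.
sample = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0), (2.0, 2.0)]
print(accuracy(sample))
```

With a real `result` DataFrame, the pairs could be built as `[(r[label], r['prediction']) for r in result.select(label, 'prediction').collect()]`, which is reasonable only when the test set is small enough to collect on the driver.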