# XGBoost
 
If you are not using the DBR 7.x ML Runtime, you will need to install `ml.dmlc:xgboost4j-spark_2.12:1.0.0` from Maven, as well as `xgboost` from PyPI.

**NOTE:** There is currently only a distributed version of XGBoost for Scala, not Python. We will switch to Scala for that section.

## Data Preparation

Let's go ahead and index all of our categorical features, and set our label to be `log(price)`.

In [3]:
from pyspark.sql.functions import log, col
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline

filePath = "/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet"
airbnbDF = spark.read.parquet(filePath)
(trainDF, testDF) = airbnbDF.withColumn("label", log(col("price"))).randomSplit([.8, .2], seed=42)

categoricalCols = [field for (field, dataType) in trainDF.dtypes if dataType == "string"]
indexOutputCols = [x + "Index" for x in categoricalCols]

stringIndexer = StringIndexer(inputCols=categoricalCols, outputCols=indexOutputCols, handleInvalid="skip")

numericCols = [field for (field, dataType) in trainDF.dtypes if dataType == "double" and field not in ("price", "label")]
assemblerInputs = indexOutputCols + numericCols
vecAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
pipeline = Pipeline(stages=[stringIndexer, vecAssembler])
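To make the indexing step concrete, here is a minimal pure-Python sketch of what a string indexer does to one categorical column. It assumes the default ordering of Spark's `StringIndexer` (most frequent label gets index 0, with ties broken alphabetically) and mimics `handleInvalid="skip"` by dropping labels that were not seen during fitting. The helper names `fit_string_index` and `transform_skip` and the sample values are hypothetical, not part of Spark or the dataset.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_skip(values, mapping):
    """Index values, dropping any label the mapping has not seen (like handleInvalid="skip")."""
    return [mapping[v] for v in values if v in mapping]

# Hypothetical room-type values, not taken from the dataset
train_values = [
    "Entire home/apt", "Private room", "Entire home/apt",
    "Shared room", "Private room", "Entire home/apt",
]
mapping = fit_string_index(train_values)

# "Hotel room" was never seen while fitting, so that row is skipped
indexed_test = transform_skip(["Private room", "Hotel room", "Entire home/apt"], mapping)
print(mapping)
print(indexed_test)
```

Because rows with unseen labels are dropped rather than raising an error, the transformed test set can end up with fewer rows than the original.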

## Scala

Distributed XGBoost with Spark only has a Scala API, so we are going to create temporary views of our DataFrames to use in Scala, and save our (untrained) pipeline so that we can load it into Scala.

In [5]:
trainDF.createOrReplaceTempView("trainDF")
testDF.createOrReplaceTempView("testDF")

fileName = "/tmp/xgboost_feature_pipeline"
pipeline.write().overwrite().save(fileName)

## Load Data/Pipeline in Scala

This section is only available in Scala because there is no distributed Python API for XGBoost in Spark yet.

Let's load in our data/pipeline that we defined in Python.

In [7]:
%scala
import org.apache.spark.ml.Pipeline

// Must match the absolute path the pipeline was saved to in Python
val fileName = "/tmp/xgboost_feature_pipeline"
val pipeline = Pipeline.load(fileName)

val trainDF = spark.table("trainDF")
val testDF = spark.table("testDF")

## XGBoost

Now we are ready to train our XGBoost model!

In [9]:
%scala

import ml.dmlc.xgboost4j.scala.spark._
import org.apache.spark.sql.functions._

val paramMap = List("num_round" -> 100, "eta" -> 0.1, "max_leaf_nodes" -> 50, "seed" -> 42, "missing" -> 0).toMap

val xgboostEstimator = new XGBoostRegressor(paramMap)

val xgboostPipeline = new Pipeline().setStages(pipeline.getStages ++ Array(xgboostEstimator))

val xgboostPipelineModel = xgboostPipeline.fit(trainDF)
val xgboostLogPredictedDF = xgboostPipelineModel.transform(testDF)

val expXgboostDF = xgboostLogPredictedDF.withColumn("prediction", exp(col("prediction")))
expXgboostDF.createOrReplaceTempView("expXgboostDF")
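The model is trained on `log(price)`, which is why the cell above applies `exp` to the `prediction` column before handing the results back to Python. As a rough illustration in plain Python (the prices below are made up, not taken from the dataset), the round trip and the effect of a log-scale error look like this:

```python
import math

# Hypothetical nightly prices, for illustration only
prices = [100.0, 250.0, 80.0]

# Training label: natural log of the price
labels = [math.log(p) for p in prices]

# Applying exp to a log-scale value recovers the original dollar amount
recovered = [math.exp(y) for y in labels]

# An error of 0.1 on the log scale is a multiplicative error on the price scale
overshoot = math.exp(labels[0] + 0.1)
print(recovered)
print(f"Overshooting by 0.1 in log space turns $100.00 into ${overshoot:.2f}")
```

An additive error on the log scale becomes a percentage error on the price scale, so the same log error costs more dollars on expensive listings.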

## Evaluate

Now we can evaluate how well our XGBoost model performed.

In [11]:
expXgboostDF = spark.table("expXgboostDF")

display(expXgboostDF.select("price", "prediction"))

In [12]:
from pyspark.ml.evaluation import RegressionEvaluator

regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regressionEvaluator.evaluate(expXgboostDF)
r2 = regressionEvaluator.setMetricName("r2").evaluate(expXgboostDF)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")
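For readers who want to see what the two metrics measure, here is a small pure-Python sketch of RMSE and R2 using their standard definitions. The function names and sample numbers are illustrative assumptions, not the evaluator's internals. Note that both are computed on the price scale, since the predictions were already exponentiated.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: typical size of an error, in the label's units."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def r2(actual, predicted):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical prices and predictions in dollars
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 310.0]

print(f"RMSE is {rmse(actual, predicted)}")
print(f"R2 is {r2(actual, predicted)}")
```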

## Export to Python

We can also export our XGBoost model to use in Python for fast inference on small datasets.

In [14]:
%scala

val nativeModelPath = "xgboost_native_model"
val xgboostModel = xgboostPipelineModel.stages.last.asInstanceOf[XGBoostRegressionModel]
xgboostModel.nativeBooster.saveModel(nativeModelPath)

## Predictions in Python

Let's pass an example record to our Python XGBoost model and see how fast we can get predictions!

The model predicts `log(price)`, so don't forget to exponentiate the output to get a price.

In [16]:
%python
import numpy as np
import xgboost as xgb
bst = xgb.Booster({'nthread': 4})
bst.load_model("xgboost_native_model")

# Per https://stackoverflow.com/questions/55579610/xgboost-attributeerror-dataframe-object-has-no-attribute-feature-names, DMatrix did the trick

data = np.array([[0.0, 2.0, 0.0, 14.0, 1.0, 0.0, 0.0, 1.0, 37.72001, -122.39249, 2.0, 1.0, 1.0, 1.0, 2.0, 128.0, 97.0, 10.0, 10.0, 10.0, 10.0, 9.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
log_pred = bst.predict(xgb.DMatrix(data))
print(f"The predicted price for this rental is ${np.exp(log_pred)[0]:.2f}")
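To put a number on "how fast", one option is to time repeated calls with `time.perf_counter`. The sketch below is a generic helper, not part of XGBoost: `time_call` and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in so the snippet runs without a trained model.

```python
import time

def time_call(fn, *args, repeats=100):
    """Call fn(*args) `repeats` times (at least once); return (last result, mean seconds per call)."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed / repeats

# Stand-in for bst.predict so the sketch runs without a trained model
def fake_predict(rows):
    return [sum(row) for row in rows]

result, seconds_per_call = time_call(fake_predict, [[1.0, 2.0, 3.0]])
print(f"{seconds_per_call * 1e6:.1f} microseconds per call")
```

In the notebook, the same helper could be applied to the real model with `time_call(bst.predict, xgb.DMatrix(data))`. Averaging over many calls smooths out one-off overhead from the first invocation.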
