
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>




# XGBoost

Up until this point, we have only used SparkML. Let's look at a third-party library for gradient boosted trees.

Ensure that you are using the <a href="https://docs.microsoft.com/en-us/azure/databricks/runtime/mlruntime" target="_blank">Databricks Runtime for ML</a>, which comes with distributed XGBoost preinstalled.

**Question**: How do gradient boosted trees differ from random forests? Which parts can be parallelized?
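
One way to explore the question above is a toy gradient boosting loop in plain Python. This is a minimal sketch under simplifying assumptions (1-D inputs, one-split stumps, squared-error loss); the names `fit_stump` and `boost` and the data are hypothetical and do not reflect how XGBoost is implemented. Each stump is fitted to the residuals left by the stumps before it, so the boosting rounds must run in sequence, whereas the trees of a random forest are independent of one another and can be built in parallel. Within a single boosting round, the search for the best split is the part that can be parallelized.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    # Dropping the largest value keeps both sides of every split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)  # depends on all earlier stumps
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical toy data, roughly y = x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

model = boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE: {baseline_mse:.3f}, boosted MSE: {boosted_mse:.3f}")
```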

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Learning Objectives:<br>

By the end of this lesson, you should be able to:

* Build an XGBoost model and integrate it into a Spark ML pipeline
* Evaluate XGBoost model performance
* Compare and contrast the most common gradient boosted approaches

## 📌 Requirements

**Required Databricks Runtime Version:** 
* Please note that in order to run this notebook, you must use one of the following Databricks Runtime(s): **12.2.x-cpu-ml-scala2.12**

## Lesson Setup

The first thing we're going to do is **run the setup script**. This script defines the required configuration variables, which are scoped to each user.

In [0]:
%run "./Includes/Classroom-Setup"

Python interpreter will be restarted.


Resetting the learning environment:
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02"

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)

Creating & using the schema "charlie_ohara_4mi2_da_sml" in the catalog "hive_metastore"...(0 seconds)

Predefined tables in "charlie_ohara_4mi2_da_sml":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/scalable-machine-learning-with-apache-spark/v02

Setup completed (7 seconds)





## Data Preparation

Let's go ahead and index all of our categorical features, and set our label to be **`log(price)`**.

In [0]:
from pyspark.sql.functions import log, col
from pyspark.ml.feature import StringIndexer, VectorAssembler

file_path = f"{DA.paths.datasets}/airbnb/sf-listings/sf-listings-2019-03-06-clean.delta/"
airbnb_df = spark.read.format("delta").load(file_path)
train_df, test_df = airbnb_df.withColumn("label", log(col("price"))).randomSplit([.8, .2], seed=42)

# Collect the categorical (string) columns and index them.
# Tree-based models such as XGBoost can split on the indexed values directly, so one-hot encoding is not needed.
categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
index_output_cols = [x + "Index" for x in categorical_cols]

string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")

# Collect the numeric columns, excluding the prediction target (both "price" and its log, "label").
numeric_cols = [field for (field, dataType) in train_df.dtypes if dataType == "double" and field not in ("price", "label")]
assembler_inputs = index_output_cols + numeric_cols
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
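
To make the indexing step more concrete, here is a simplified plain-Python imitation of what `StringIndexer` does, assuming its default ordering (most frequent label first, ties broken alphabetically). The helpers `fit_index` and `transform_skip` and the sample values are hypothetical and only for illustration; the real estimator operates on DataFrame columns.

```python
from collections import Counter


def fit_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def transform_skip(values, mapping):
    """Index values, dropping any label not seen during fitting,
    which mimics handleInvalid="skip"."""
    return [mapping[v] for v in values if v in mapping]


# Hypothetical training column.
train_room_types = [
    "Entire home/apt", "Private room", "Entire home/apt",
    "Shared room", "Private room", "Entire home/apt",
]
mapping = fit_index(train_room_types)
print(mapping)

# "Hotel room" was never seen during fitting, so that row is skipped.
indexed = transform_skip(["Private room", "Hotel room", "Entire home/apt"], mapping)
print(indexed)
```

Note that with `handleInvalid="skip"`, rows containing unseen categories silently disappear from the transformed data, so the test set that gets scored may be smaller than the one passed in.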

### Distributed Training of XGBoost Models

Let's create our distributed XGBoost model. We will use `xgboost`'s PySpark estimator. 

To use the distributed version of XGBoost's PySpark estimators, you can specify two additional parameters:

* **`num_workers`**: The number of workers to distribute over.
* **`use_gpu`**: Set to `True` to train on GPUs for faster performance.

For more information about these parameters and performance considerations, please check this <a href="https://docs.databricks.com/en/machine-learning/train-model/xgboost-spark.html" target="_blank">documentation page.</a>
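
As a rough sketch of how these two parameters might be combined with ordinary model hyperparameters: the helper names `distributed_xgb_params` and `build_regressor` below are hypothetical and not part of XGBoost or this course, and `num_workers=4` is an arbitrary value chosen for illustration. The import sits inside `build_regressor` so that the parameter logic can be checked without a cluster.

```python
def distributed_xgb_params(num_workers, use_gpu=False, **model_params):
    """Merge distributed-training settings with regular XGBoost hyperparameters.

    Hypothetical helper for illustration; not part of xgboost or Spark.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    params = dict(model_params)
    params["num_workers"] = num_workers
    params["use_gpu"] = use_gpu
    return params


def build_regressor(params):
    """Create the estimator; requires a runtime with xgboost.spark installed."""
    from xgboost.spark import SparkXGBRegressor
    return SparkXGBRegressor(**params)


params = distributed_xgb_params(
    num_workers=4, n_estimators=100, learning_rate=0.1, max_depth=4
)
print(params)
```

On a cluster running the ML runtime, `build_regressor(params)` would return an estimator that can be placed in a `Pipeline` like any other stage. Newer XGBoost releases may prefer a `device` parameter over `use_gpu`, so check the documentation for the version installed on your runtime.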

In [0]:
from xgboost.spark import SparkXGBRegressor
from pyspark.ml import Pipeline

# Hyperparameters we want to specify:
# n_estimators = number of trees to build
# learning_rate = shrinkage applied to the contribution of each new tree added to the model
# max_depth = maximum depth of each tree (4 levels)
# random_state = seed, so that results are reproducible
# missing = feature value that XGBoost treats as missing; 0 here instead of the default NaN, which suits the sparse vectors VectorAssembler can produce
params = {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 4, "random_state": 42, "missing": 0}

# SparkXGBRegressor runs on Spark but is not part of open-source Spark; it comes from the separate open-source XGBoost project.
# This is the algorithm we are going to use.
xgboost = SparkXGBRegressor(**params)

# Run the feature engineering stages first, then train the XGBoost regressor.
pipeline = Pipeline(stages=[string_indexer, vec_assembler, xgboost])
pipeline_model = pipeline.fit(train_df)






## Evaluate Model Performance

Now we can evaluate how well our XGBoost model performed. The model predicts **`log(price)`**, so don't forget to exponentiate the predictions before comparing them with **`price`**!

In [0]:
from pyspark.sql.functions import exp, col

# predict the log price using the model 
log_pred_df = pipeline_model.transform(test_df)

exp_xgboost_df = log_pred_df.withColumn("prediction", exp(col("prediction")))

display(exp_xgboost_df.select("price", "prediction"))

price,prediction
86.0,78.48510332182467
190.0,209.78999489981905
100.0,114.06312614288848
325.0,300.3517078077793
200.0,257.09793265174113
200.0,149.52932721686213
80.0,71.62514704573447
160.0,86.48540915769904
132.0,149.1521216558058
100.0,154.38841868827015






Next, compute RMSE and R2 on the original price scale.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

# evaluate results
regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regression_evaluator.evaluate(exp_xgboost_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(exp_xgboost_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}") 

RMSE is 119.30345607045693
R2 is 0.5538114388646336
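
To see why exponentiating matters before evaluating, here is a small plain-Python sketch that computes RMSE on both scales. The helpers `rmse` and `r2` are hypothetical stand-ins written for illustration, not the `RegressionEvaluator` implementation, and the data is just the first three rows of the prediction table shown earlier, so the numbers will not match the full test-set metrics.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# First three rows of the table shown earlier (price, exponentiated prediction).
prices = [86.0, 190.0, 100.0]
price_preds = [78.48510332182467, 209.78999489981905, 114.06312614288848]

# The model itself works on the log scale, so recover those values.
log_prices = [math.log(p) for p in prices]
log_preds = [math.log(p) for p in price_preds]

rmse_log = rmse(log_prices, log_preds)  # error in log units
rmse_price = rmse(prices, price_preds)  # error in price units
print(f"RMSE on log scale:   {rmse_log:.3f}")
print(f"RMSE on price scale: {rmse_price:.2f}")
print(f"R2 on price scale:   {r2(prices, price_preds):.3f}")
```

The log-scale error looks much smaller, but it is not expressed in price units, so it cannot be compared with metrics from models trained directly on **`price`**.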





## Alternative Gradient Boosted Approaches

There are many other gradient boosted approaches, such as <a href="https://catboost.ai/" target="_blank">CatBoost</a>, <a href="https://github.com/microsoft/LightGBM" target="_blank">LightGBM</a>, and vanilla gradient boosted trees in <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.GBTClassifier.html?highlight=gbt#pyspark.ml.classification.GBTClassifier" target="_blank">SparkML</a>/<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html" target="_blank">scikit-learn</a>. Each has its own <a href="https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db" target="_blank">pros and cons</a> that you can read more about.


## Classroom Cleanup

Run the following cell to remove lesson-specific assets created during this lesson:

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "charlie_ohara_4mi2_da_sml"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/charlie.ohara@standard.ai/scalable-machine-learning-with-apache-spark"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>