# Ensembles & Pipelines In PySpark
  
To wrap up, you'll learn how to make your models more efficient. You'll find out how to use pipelines to make your code clearer and easier to maintain. Then you'll use cross-validation to better test your models and select good model parameters. Finally, you'll dabble in two types of ensemble model.
  
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   
      /_/
```


## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
[Apache Spark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Operator</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>pyspark.sql.SparkSession</td>
    <td>Main entry point for using Spark functionality</td>
  </tr>
  <tr>
    <td>2</td>
    <td>spark.version</td>
    <td>Retrieves the version of Spark</td>
  </tr>
  <tr>
    <td>3</td>
    <td>spark.stop()</td>
    <td>Terminates the Spark session and releases resources</td>
  </tr>
  <tr>
    <td>4</td>
    <td>SparkSession.builder.master('local[*]').appName('flights').getOrCreate()</td>
    <td>Creates a SparkSession with specific configuration</td>
  </tr>
  <tr>
    <td>5</td>
    <td>DataFrame.count()</td>
    <td>Counts the number of rows in a DataFrame</td>
  </tr>
  <tr>
    <td>6</td>
    <td>DataFrame.show()</td>
    <td>Displays the contents of a DataFrame</td>
  </tr>
  <tr>
    <td>7</td>
    <td>pyspark.sql.types.StructType</td>
    <td>Defines the structure for a DataFrame's schema</td>
  </tr>
  <tr>
    <td>8</td>
    <td>pyspark.sql.types.StructField</td>
    <td>Defines a single field within a schema</td>
  </tr>
  <tr>
    <td>9</td>
    <td>pyspark.sql.types.IntegerType</td>
    <td>Represents the integer data type in a schema</td>
  </tr>
  <tr>
    <td>10</td>
    <td>pyspark.sql.types.StringType</td>
    <td>Represents the string data type in a schema</td>
  </tr>
  <tr>
    <td>11</td>
    <td>spark.read.csv</td>
    <td>Reads data from a CSV file into a DataFrame</td>
  </tr>
  <tr>
    <td>12</td>
    <td>DataFrame.printSchema()</td>
    <td>Prints the schema of a DataFrame</td>
  </tr>
  <tr>
    <td>13</td>
    <td>DataFrame.filter</td>
    <td>Filters rows from a DataFrame based on a condition</td>
  </tr>
  <tr>
    <td>14</td>
    <td>DataFrame.select</td>
    <td>Selects specific columns from a DataFrame</td>
  </tr>
  <tr>
    <td>15</td>
    <td>DataFrame.dropna</td>
    <td>Removes rows with missing values from a DataFrame</td>
  </tr>
  <tr>
    <td>16</td>
    <td>DataFrame.drop</td>
    <td>Removes specified columns from a DataFrame</td>
  </tr>
  <tr>
    <td>17</td>
    <td>pyspark.sql.functions.round</td>
    <td>Rounds the values in a column</td>
  </tr>
  <tr>
    <td>18</td>
    <td>DataFrame.withColumn</td>
    <td>Adds a new column or replaces an existing one</td>
  </tr>
  <tr>
    <td>19</td>
    <td>pyspark.ml.feature.StringIndexer</td>
    <td>Converts string labels into numerical indices</td>
  </tr>
  <tr>
    <td>20</td>
    <td>Estimator.fit</td>
    <td>Trains a machine learning model</td>
  </tr>
  <tr>
    <td>21</td>
    <td>Transformer.transform</td>
    <td>Applies a transformation to a DataFrame</td>
  </tr>
  <tr>
    <td>22</td>
    <td>pyspark.ml.feature.VectorAssembler</td>
    <td>Combines multiple columns into a single vector column</td>
  </tr>
  <tr>
    <td>23</td>
    <td>DataFrame.randomSplit</td>
    <td>Splits a DataFrame into random subsets</td>
  </tr>
  <tr>
    <td>24</td>
    <td>pyspark.ml.classification.DecisionTreeClassifier</td>
    <td>Creates a decision tree classification model</td>
  </tr>
  <tr>
    <td>25</td>
    <td>DataFrame.groupBy</td>
    <td>Groups data in a DataFrame by specified columns</td>
  </tr>
  <tr>
    <td>26</td>
    <td>pyspark.ml.classification.LogisticRegression</td>
    <td>Creates a logistic regression classification model</td>
  </tr>
  <tr>
    <td>27</td>
    <td>pyspark.ml.evaluation.MulticlassClassificationEvaluator</td>
    <td>Evaluates multiclass classification models</td>
  </tr>
  <tr>
    <td>28</td>
    <td>pyspark.ml.evaluation.BinaryClassificationEvaluator</td>
    <td>Evaluates binary classification models</td>
  </tr>
  <tr>
    <td>29</td>
    <td>pyspark.sql.functions.regexp_replace</td>
    <td>Replaces occurrences of a pattern in a string column</td>
  </tr>
  <tr>
    <td>30</td>
    <td>pyspark.ml.feature.Tokenizer</td>
    <td>Splits text into words (tokens)</td>
  </tr>
  <tr>
    <td>31</td>
    <td>pyspark.ml.feature.StopWordsRemover</td>
    <td>Removes common words (stop words) from text</td>
  </tr>
  <tr>
    <td>32</td>
    <td>pyspark.ml.feature.HashingTF</td>
    <td>Converts text data into numerical vectors</td>
  </tr>
  <tr>
    <td>33</td>
    <td>pyspark.ml.feature.IDF</td>
    <td>Applies Inverse Document Frequency (IDF) to text vectors</td>
  </tr>
  <tr>
    <td>34</td>
    <td>pyspark.ml.feature.OneHotEncoder</td>
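    <td>Maps a column of category indices to a column of binary vectors. By default the last category is dropped, which differs from scikit-learn’s OneHotEncoder, which keeps all categories. The output vectors are sparse.</td>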
  </tr>
  <tr>
    <td>35</td>
    <td>pyspark.ml.regression.LinearRegression</td>
    <td>This supports multiple types of regularization: none (a.k.a. ordinary least squares), L2 (ridge regression), L1 (Lasso), L2 + L1 (elastic net)</td>
  </tr>
  <tr>
    <td>36</td>
    <td>pyspark.ml.evaluation.RegressionEvaluator</td>
    <td>Evaluator for Regression, which expects input columns prediction, label and an optional weight column.</td>
  </tr>
  <tr>
    <td>37</td>
    <td>pyspark.ml.feature.Bucketizer</td>
    <td>Maps a column of continuous features to a column of feature buckets. Since 3.0.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that when both the inputCol and inputCols parameters are set, an Exception will be thrown. The splits parameter is only used for single column usage, and splitsArray is for multiple columns.</td>
  </tr>
</table>


---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: pyspark  
Version: 3.4.1  
Summary: Apache Spark Python API  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
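`python3.11 -m IPython` : Starts an interactive IPython shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline script in the background, immune to hangups, with its output written to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines (and the starting line number) of a user-defined function, here `test`.  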
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
```python
import numpy as np
import matplotlib.pyplot as plt

# Plot the same cubic curve once per built-in style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
for i, style in enumerate(available):
    with plt.style.context(style):
        ax = fig.add_subplot(10, 3, i + 1)
        ax.plot(x, y)
        ax.set_title(style)
```
  

In [1]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import pyspark                      # Apache Spark:             Cluster Computing

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)

## Pipeline
  
Welcome back! So far you've learned how to build classifier and regression models using Spark. In this chapter you'll learn how to make those models better. You'll start by taking a look at pipelines, which will seriously streamline your workflow. They will also help to ensure that training and testing data are treated consistently and that no leakage of information between these two sets takes place.
  
**Leakage?**
  
What do I mean by leakage? Most of the actions you've been using involve both a `.fit()` and a `.transform()` method. Those methods have been applied in a fairly relaxed way. But to get really robust results you need to be careful only to apply the `.fit()` method to training data. Why? Because if a `.fit()` method is applied to *any* of the testing data then the model will effectively have seen those data during the training phase, so the results of testing will no longer be objective. The `.transform()` method, on the other hand, can be applied to both training and testing data since it does not result in any changes in the underlying model.
  
<center><img src='../_images/pipeline-leakage-in-pyspark.png' alt='img' width='740'></center>
  
**A leaky model**
  
A figure should make this clearer. Leakage occurs whenever a `.fit()` method is applied to testing data. Suppose that you fit a model using both the training and testing data. The model would then already have *seen* the testing data, so using those data to test the model would not be fair: of course the model will perform well on data which has been used for training! This sounds obvious, but care must be taken not to fall into this trap. Remember that there are normally multiple stages in building a model and if the `.fit()` method in *any* of those stages is applied to the testing data then the model is compromised.
  
**A watertight model**
  
However, if you are careful to only apply `.fit()` to the training data then your model will be in good shape. When it comes to testing it will not have seen *any* of the testing data and the test results will be completely objective. Luckily a pipeline will make it easier to avoid leakage because it simplifies the training and testing process.
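
The difference between a leaky and a watertight workflow can be sketched in plain Python. This is a toy stand-in, not Spark's `StringIndexer`: the `ToyIndexer` class and the sample airport codes are assumptions made up for illustration. It shows that `.fit()` is the only step that learns anything, so it is the only step that must be kept away from the testing data.

```python
from collections import Counter


class ToyIndexer:
    """Hypothetical stand-in for a string indexer: learns a label -> index map."""

    def fit(self, labels):
        # Learn the mapping from the data passed in: the most frequent label gets index 0
        counts = Counter(labels)
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        self.mapping = {label: idx for idx, label in enumerate(ordered)}
        return self

    def transform(self, labels):
        # Apply the learned mapping; nothing about the indexer changes here
        return [self.mapping[label] for label in labels]


# Made-up training and testing labels
train = ['ORD', 'ORD', 'JFK', 'ORD', 'SJC', 'JFK']
test = ['SJC', 'SJC', 'SJC', 'JFK']

# Watertight: fit on the training data only, then transform both sets
indexer = ToyIndexer().fit(train)
train_idx = indexer.transform(train)
test_idx = indexer.transform(test)

# Leaky: fitting on train + test lets the testing data shape the mapping
leaky = ToyIndexer().fit(train + test)
print(indexer.mapping)
print(leaky.mapping)
```

The two printed mappings differ, which is the leak in miniature: the testing data changed what the indexer learned.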
  
<center><img src='../_images/pipeline-leakage-in-pyspark1.png' alt='img' width='740'></center>
  
**Pipeline**
  
A pipeline is a mechanism to combine a series of steps. Rather than applying each of the steps individually, they are all grouped together and applied as a single unit.
  
<center><img src='../_images/pipeline-leakage-in-pyspark2.png' alt='img' width='740'></center>
  
**Cars model: Steps**
  
Let's return to our cars regression model. Recall that there were a number of steps involved:
  
- using a string indexer to convert the type column to indexed values;
- applying a one-hot encoder to convert those indexed values into dummy variables; then
- assembling a set of predictors into a single features column; and finally
- building a regression model.
  
<center><img src='../_images/pipeline-leakage-in-pyspark3.png' alt='img' width='740'></center>
  
**Cars model: Applying steps**
  
Let's map out the process of applying those steps.
  
- First you fit the indexer to the training data. Then you call the `.transform()` method on the training data to add the indexed column.
- Then you call the `.transform()` method on the testing data to add the indexed column there too. Note that the testing data was not used to fit the indexer.
- Next you do the same things for the one-hot encoder, fitting to the training data and then using the fitted encoder to update the training and testing data sets.
- The assembler is next. In this case there is no `.fit()` method, so you simply apply the `.transform()` method to the training and testing data.
- Finally the data are ready. You fit the regression model to the training data and then use the model to make predictions on the testing data.
  
Throughout the process you've been careful to keep the testing data out of the training process. But this is hard work and it's easy enough to slip up.
  
<center><img src='../_images/pipeline-leakage-in-pyspark4.png' alt='img' width='740'></center>
  
**Cars model: Pipeline**
  
A pipeline makes training and testing a complicated model a lot easier. The `Pipeline` class lives in the `ml` sub-module. You create a pipeline by passing a list to the `stages=` argument, where each stage corresponds to a step in the model building process. The stages are executed in order. Now, rather than calling the `.fit()` and `.transform()` methods for each stage, you simply call the `.fit()` method for the pipeline on the training data. Each of the stages in the pipeline is then automatically applied to the training data in turn. This will systematically apply the `.fit()` and `.transform()` methods for each stage in the pipeline. The trained pipeline can then be used to make predictions on the testing data by calling its `.transform()` method. The pipeline `.transform()` method will only call the `.transform()` method for each of the stages in the pipeline. Isn't that simple?
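
To make the mechanics concrete, here is a minimal plain-Python sketch of what a pipeline does when it is fitted and when it transforms. It is not Spark's implementation: `ToyPipeline`, `MeanCenter` and `Double` are hypothetical classes invented for this illustration, and the numbers are made up.

```python
class MeanCenter:
    """Hypothetical estimator stage: learns a mean in fit(), subtracts it in transform()."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        return self

    def transform(self, values):
        return [v - self.mean for v in values]


class Double:
    """Hypothetical transformer-only stage with no fit(), much like an assembler."""

    def transform(self, values):
        return [2 * v for v in values]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        # Fit each stage in order, passing the transformed output on to the next stage
        for stage in self.stages:
            if hasattr(stage, 'fit'):
                stage.fit(values)
            values = stage.transform(values)
        return self

    def transform(self, values):
        # Only transform() is called here, so nothing is learned from these data
        for stage in self.stages:
            values = stage.transform(values)
        return values


train = [1.0, 2.0, 3.0]
test = [10.0]

pipeline = ToyPipeline(stages=[MeanCenter(), Double()]).fit(train)
predictions = pipeline.transform(test)   # uses the mean learned from train only
learned_mean = pipeline.stages[0].mean   # individual stages are reachable by index
```

One difference worth noting: this toy mutates and returns itself, whereas in Spark calling `.fit()` on a `Pipeline` returns a separate fitted `PipelineModel`.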
  
<center><img src='../_images/pipeline-leakage-in-pyspark5.png' alt='img' width='740'></center>
  
**Cars model: Stages**
  
You can access the stages in the pipeline by using the `.stages` attribute, which is a list. You pick out individual stages by indexing into the list. For example, to access the regression component of the pipeline you'd use an index of 3, since it is the fourth stage and list indices start at zero. Having access to that component makes it possible to get the intercept and coefficients for the trained `LinearRegression` model.
  
<center><img src='../_images/pipeline-leakage-in-pyspark6.png' alt='img' width='740'></center>
  
**Pipelines streamline workflow!**
  
Pipelines make your code easier to read and maintain. Let's try them out with our flights model.

### Flight duration model: Pipeline stages
  
You're going to create the stages for the flights duration model pipeline. You will use these in the next exercise to build a pipeline and to create a regression model.
  
The `StringIndexer`, `OneHotEncoder`, `VectorAssembler` and `LinearRegression` classes are already imported.
  
---
  
1. Create an indexer to convert the `'org'` column into an indexed column called `'org_idx'`.
2. Create a one-hot encoder to convert the `'org_idx'` and `'dow'` columns into dummy variable columns called `'org_dummy'` and `'dow_dummy'`.
3. Create an assembler which will combine the `'km'` column with the two dummy variable columns. The output column should be called `'features'`.
4. Create a linear regression object to predict flight duration.

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('flights').getOrCreate()

# Read data from CSV file
flights = spark.read.csv('../_datasets/flights-larger.csv', sep=',', header=True, inferSchema=True, nullValue='NA')

# Get number of records 
print('The data contains {} record(s).'.format(flights.count()))

# View the first five rows
flights.show(5)

# Check column data types
flights.printSchema()
print(flights.dtypes)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/29 00:14:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


The data contains 275000 record(s).
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 10| 10|  1|     OO|  5836|ORD| 157|  8.18|      51|   27|
|  1|  4|  1|     OO|  5866|ORD| 466|  15.5|     102| null|
| 11| 22|  1|     OO|  6016|ORD| 738|  7.17|     127|  -19|
|  2| 14|  5|     B6|   199|JFK|2248| 21.17|     365|   60|
|  5| 25|  3|     WN|  1675|SJC| 386| 12.92|      85|   22|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

root
 |-- mon: integer (nullable = true)
 |-- dom: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- org: string (nullable = true)
 |-- mile: integer (nullable = true)
 |-- depart: double (nullable = true)
 |-- duration: integer (nullable = true)
 |-- delay: integer (nullable = true)

[('mon', 'int

In [3]:
from pyspark.sql.functions import round

# Convert 'mile' to 'km' and drop 'mile' column
flights = flights.withColumn('km', round(flights.mile * 1.60934, 0)).drop('mile')

In [4]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import LinearRegression

# Convert Categorical strings to index values
indexer = StringIndexer(inputCol='org', outputCol='org_idx')

# One-hot encode index values
onehot = OneHotEncoder(
    inputCols=['org_idx', 'dow'],
    outputCols=['org_dummy', 'dow_dummy']
)

# Assemble predictors into a single column
assembler = VectorAssembler(
    inputCols=['km', 'org_dummy', 'dow_dummy'],
    outputCol='features'
)

# A linear regression object
regression = LinearRegression(labelCol='duration')

Excellent! The stages are now ready for you to build a pipeline.

### Flight duration model: Pipeline model
  
You're now ready to put those stages together in a pipeline.
  
You'll construct the pipeline and then train the pipeline on the training data. This will apply each of the individual stages in the pipeline to the training data in turn. None of the stages will be exposed to the testing data at all: there will be no leakage!
  
Once the entire pipeline has been trained it will then be used to make predictions on the testing data.
  
The data are available as `flights`, which has been randomly split into `flights_train` and `flights_test`.
  
---
  
1. Import the class for creating a pipeline.
2. Create a pipeline object and specify the `indexer`, `onehot`, `assembler` and `regression` stages, in this order.
3. Train the pipeline on the training data.
4. Make predictions on the testing data.

In [5]:
from pyspark.ml import Pipeline

flights_train, flights_test = flights.randomSplit([0.8, 0.2], seed=999)

# Construct a pipeline
pipeline = Pipeline(stages=[indexer, onehot, assembler, regression])

# Train the pipeline on the training data
pipeline = pipeline.fit(flights_train)

# Make predictions on the test data
predictions = pipeline.transform(flights_test)

23/08/29 00:15:00 WARN Instrumentation: [17ff8f13] regParam is zero, which might cause numerical instability and overfitting.
23/08/29 00:15:06 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
23/08/29 00:15:10 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK
                                                                                

Fantastic! A pipeline makes your code easier to read and maintain.

### SMS spam pipeline
  
You haven't looked at the SMS data for quite a while. Last time we did the following:
  
- split the text into tokens
- removed stop words
- applied the hashing trick
- converted the data from counts to IDF and
- trained a logistic regression model.
  
Each of these steps was done independently. This seems like a great application for a pipeline!
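
As a reminder of what those steps compute, here is a rough plain-Python sketch of the same chain. Everything in it is an assumption for illustration: the three messages, the tiny stop word list, the 16-bucket feature size and the use of `zlib.crc32` as the hash (Spark's `HashingTF` uses its own hash function). The IDF weight uses the smoothed form `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term.

```python
import math
import zlib

STOP_WORDS = {'a', 'the', 'to', 'i', 'you'}   # tiny illustrative list, not Spark's default
NUM_FEATURES = 16                             # assumed small size to keep vectors readable


def tokenize(text):
    # Lowercase the text and split it on whitespace
    return text.lower().split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def hash_tf(tokens, num_features=NUM_FEATURES):
    # Hashing trick: each term is mapped to a bucket by a hash, then counted
    counts = [0] * num_features
    for token in tokens:
        counts[zlib.crc32(token.encode('utf-8')) % num_features] += 1
    return counts


def fit_idf(count_vectors):
    # Smoothed inverse document frequency per bucket: log((m + 1) / (df + 1))
    m = len(count_vectors)
    n = len(count_vectors[0])
    doc_freq = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n)]
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]


# Made-up messages
messages = ['Win a cash prize', 'Call you later', 'Win the prize call now']
counts = [hash_tf(remove_stop_words(tokenize(msg))) for msg in messages]
idf = fit_idf(counts)
tfidf = [[tf * w for tf, w in zip(vec, idf)] for vec in counts]
```

Note that `fit_idf` is the only step here that learns from the data, which is why the IDF stage has a `.fit()` method while tokenizing, stop word removal and hashing only transform.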
  
The `Pipeline` and `LogisticRegression` classes have already been imported into the session, so you don't need to worry about that!
  
---
  
1. Create an object for splitting text into tokens.
2. Create an object to remove stop words. Rather than explicitly giving the input column name, use the `.getOutputCol()` method on the previous object.
3. Create objects for applying the hashing trick and transforming the data into a TF-IDF. Use the `.getOutputCol()` method again.
4. Create a pipeline which wraps all of the above steps as well as an object to create a Logistic Regression model.

In [6]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Read data from CSV file
sms = spark.read.csv('../_datasets/sms.csv', sep=';', header=False, schema=schema, nullValue='NA')

sms.show(5)

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  1|Sorry, I'll call ...|    0|
|  2|Dont worry. I gue...|    0|
|  3|Call FREEPHONE 08...|    1|
|  4|Win a 1000 cash p...|    1|
|  5|Go until jurong p...|    0|
+---+--------------------+-----+
only showing top 5 rows



In [7]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

# Lowercase the text and split it into tokens on whitespace
tokenizer = Tokenizer(inputCol='text', outputCol='words')

# Remove stop words
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='terms')

# Apply the hashing trick and transform to TF-IDF
hasher = HashingTF(inputCol=remover.getOutputCol(), outputCol='hash')
idf = IDF(inputCol=hasher.getOutputCol(), outputCol='features')

# Create a logistic regression object and add everything to a pipeline
logistic = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, remover, hasher, idf, logistic])

Great job! Isn't that a lot simpler than applying each stage separately?

## Cross-Validation
  
Up until now you've been testing models using a rather simple technique: randomly splitting the data into training and testing sets, training the model on the training data and then evaluating its performance on the testing set. There's one major drawback to this approach: you only get one estimate of the model performance. You would have a more robust idea of how well a model works if you were able to test it multiple times. This is precisely the idea behind cross-validation.
  
**CV - complete data**
  
You start out with the full set of data.
  
**CV - train/test split**
  
You still split these data into a training set and a testing set. Remember that before splitting it's important to first randomize the data so that the distributions in the training and testing data are similar.
  
**CV - multiple folds**
  
You then split the training data into a number of partitions or "folds". The number of folds normally factors into the name of the technique. For example, if you split into five folds then you'd talk about 5-fold cross-validation.
  
<center><img src='../_images/cross-validation-in-pyspark.png' alt='img' width='740'></center>
  
**Fold upon fold - first fold**
  
Once the training data have been split into folds you can start cross-validating. First keep aside the data in the first fold. Train a model on the remaining four folds. Then evaluate that model on the data from the first fold. This will give the first value for the evaluation metric.
  
**Fold upon fold - second fold**
  
Next you move onto the second fold, where the same process is repeated: data in the second fold are set aside for testing while the remaining four folds are used to train a model. That model is tested on the second fold data, yielding the second value for the evaluation metric.
  
**Fold upon fold - other folds**
  
You repeat the process for the remaining folds. Each of the folds is used in turn as testing data and you end up with as many values for the evaluation metric as there are folds. At this point you are in a position to calculate the average of the evaluation metric over all folds, which is a much more robust measure of model performance than a single value.
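
The fold-by-fold procedure described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the "model" just predicts the mean of its training data, the durations are made up, and the helper names (`fit_mean_model`, `cross_validate`) are hypothetical rather than part of Spark.

```python
import math
import random


def fit_mean_model(ys):
    # Simplest possible "model": always predict the training mean
    return sum(ys) / len(ys)


def rmse(prediction, ys):
    return math.sqrt(sum((y - prediction) ** 2 for y in ys) / len(ys))


def cross_validate(ys, num_folds=5, seed=42):
    # Shuffle the row indices, then deal them into num_folds folds
    indices = list(range(len(ys)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]

    metrics = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [ys[i] for i in indices if i not in held_out_set]
        valid = [ys[i] for i in held_out]
        model = fit_mean_model(train)        # train on the other folds
        metrics.append(rmse(model, valid))   # evaluate on the held-out fold
    return metrics


# Made-up flight durations in minutes
durations = [51, 102, 127, 365, 85, 90, 140, 210, 60, 175]
fold_rmse = cross_validate(durations, num_folds=5)
average_rmse = sum(fold_rmse) / len(fold_rmse)
```

Each row is held out exactly once, so `fold_rmse` has one value per fold and `average_rmse` summarizes them.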
  
<center><img src='../_images/cross-validation-in-pyspark1.png' alt='img' width='740'></center>
  
**Cars revisited**
  
Let's see how this works in practice. Remember the cars data? Of course you do. You're going to build a cross-validated regression model to predict consumption.
  
<center><img src='../_images/cross-validation-in-pyspark2.png' alt='img' width='740'></center>
  
**Estimator and evaluator**
  
Here are the first two ingredients which you need to perform cross-validation: 
  
- an estimator, which builds the model and is often a pipeline; and 
- an evaluator, which quantifies how well a model works on testing data.
  
We've seen both of these a few times already.
  
<center><img src='../_images/cross-validation-in-pyspark3.png' alt='img' width='740'></center>
  
**Grid and cross-validator**
  
Now the final ingredients. You'll need two new classes, `CrossValidator` and `ParamGridBuilder`, both from the `tuning` sub-module. You'll create a parameter grid, which you'll leave empty for the moment, but will return to in detail during the next lesson.  
  
Finally you have everything required to create a cross-validator object:  
  
- an estimator, which is the linear regression model, 
- an empty grid of parameters for the estimator and 
- an evaluator which will calculate the RMSE.
  
You can optionally specify the number of folds (which defaults to three) and a random number seed for repeatability.
  
<center><img src='../_images/cross-validation-in-pyspark4.png' alt='img' width='740'></center>
  
**Cross-validators need training too**
  
The cross-validator has a `.fit()` method which will apply the cross-validation procedure to the training data. You can then look at the average RMSE calculated across all of the folds. This is a more robust measure of model performance because it is based on multiple train/test splits. Note that the average metric is returned as a list. You'll see why in the next lesson.
  
<center><img src='../_images/cross-validation-in-pyspark5.png' alt='img' width='740'></center>
  
**Cross-validators act like models**
  
The trained cross-validator object acts just like any other model. It has a `.transform()` method, which can be used to make predictions on new data. If we evaluate the predictions on the original testing data then we find a smaller value for the RMSE than we obtained using cross-validation. This means that a simple train-test split would have given an overly optimistic view on model performance.
  
<center><img src='../_images/cross-validation-in-pyspark6.png' alt='img' width='740'></center>
  
**Cross-validate all the models!**
  
Let's give cross-validation a try on our flights model.

### Cross validating simple flight duration model
  
You've already built a few models for predicting flight duration and evaluated them with a simple train/test split. However, cross-validation provides a much better way to evaluate model performance.
  
In this exercise you're going to train a simple model for flight duration using cross-validation. Travel time is usually strongly correlated with distance, so using the `'km'` column alone should give a decent model.
  
The data have been randomly split into `flights_train` and `flights_test`.
  
The following classes have already been imported: `LinearRegression`, `RegressionEvaluator`, `ParamGridBuilder` and `CrossValidator`.
  
---
  
1. Create an empty parameter grid.
2. Create objects for building and evaluating a linear regression model. The model should predict the "duration" field.
3. Create a cross-validator object. Provide values for the `estimator=`, `estimatorParamMaps=` and `evaluator=` arguments. Choose 5-fold cross validation.
4. Train and test the model across multiple folds of the training data.

In [8]:
assembler = VectorAssembler(
    inputCols=['km'],
    outputCol='features'
)

flights = assembler.transform(flights.drop('features'))

flights.show(5)

+---+---+---+-------+------+---+------+--------+-----+------+--------+
|mon|dom|dow|carrier|flight|org|depart|duration|delay|    km|features|
+---+---+---+-------+------+---+------+--------+-----+------+--------+
| 10| 10|  1|     OO|  5836|ORD|  8.18|      51|   27| 253.0| [253.0]|
|  1|  4|  1|     OO|  5866|ORD|  15.5|     102| null| 750.0| [750.0]|
| 11| 22|  1|     OO|  6016|ORD|  7.17|     127|  -19|1188.0|[1188.0]|
|  2| 14|  5|     B6|   199|JFK| 21.17|     365|   60|3618.0|[3618.0]|
|  5| 25|  3|     WN|  1675|SJC| 12.92|      85|   22| 621.0| [621.0]|
+---+---+---+-------+------+---+------+--------+-----+------+--------+
only showing top 5 rows



In [9]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

flights_train, flights_test = flights.randomSplit([0.8, 0.2], seed=999)

# Create an empty parameter grid
params = ParamGridBuilder().build()

# Create objects for building and evaluating a regression model
regression = LinearRegression(labelCol='duration')
evaluator = RegressionEvaluator(labelCol='duration')

# Create a cross validator
cv = CrossValidator(
    estimator=regression,
    estimatorParamMaps=params,
    evaluator=evaluator,
    numFolds=5
)

# Train and test model on multiple folds of the training data
cv = cv.fit(flights_train)

23/08/29 00:15:22 WARN Instrumentation: [9760c03e] regParam is zero, which might cause numerical instability and overfitting.
23/08/29 00:15:41 WARN Instrumentation: [5909b433] regParam is zero, which might cause numerical instability and overfitting.
23/08/29 00:16:01 WARN Instrumentation: [03d63793] regParam is zero, which might cause numerical instability and overfitting.
23/08/29 00:16:20 WARN Instrumentation: [500343f5] regParam is zero, which might cause numerical instability and overfitting.
23/08/29 00:16:32 WARN Instrumentation: [133d3487] regParam is zero, which might cause numerical instability and overfitting.
23/08/29 00:16:46 WARN Instrumentation: [b4582f0c] regParam is zero, which might cause numerical instability and overfitting.
                                                                                

Great! Now let's cross validate a model pipeline.

### Cross validating flight duration model pipeline
  
The cross-validated model that you just built was simple, using `'km'` alone to predict `'duration'`.
  
Another important predictor of flight duration is the origin airport. Flights generally take longer to get into the air from busy airports. Let's see if adding this predictor improves the model!
  
In this exercise you'll add the `'org'` field to the model. However, since `'org'` is categorical, there's more work to be done before it can be included: it must first be transformed to an index and then one-hot encoded before being assembled with `'km'` and used to build the regression model. We'll wrap these operations up in a pipeline.
  
The following objects have already been created:
  
- `params` — an empty parameter grid
- `evaluator` — a regression evaluator
- `regression` — a `LinearRegression` object with `labelCol='duration'`.
  
The `StringIndexer`, `OneHotEncoder`, `VectorAssembler` and `CrossValidator` classes have already been imported.
  
---
  
1. Create a string indexer. Specify the input and output fields as `'org'` and `'org_idx'`.
2. Create a one-hot encoder. Name the output field `'org_dummy'`.
3. Assemble the `'km'` and `'org_dummy'` fields into a single field called `'features'`.
4. Create a pipeline using the following operations: string indexer, one-hot encoder, assembler and linear regression. Use this to create a cross-validator.

In [10]:
# Create an empty parameter grid
params = ParamGridBuilder().build()

# Create regression model
regression = LinearRegression(labelCol='duration')
evaluator = RegressionEvaluator(labelCol='duration')

# Create an indexer for the org field
indexer = StringIndexer(inputCol='org', outputCol='org_idx')

# Create a one-hot encoder for the indexed 'org' field
onehot = OneHotEncoder(inputCol='org_idx', outputCol='org_dummy')

# Assemble the 'km' and one-hot encoded fields
assembler = VectorAssembler(inputCols=['km', 'org_dummy'], outputCol='features')

# Create a pipeline and cross-validator
pipeline = Pipeline(stages=[indexer, onehot, assembler, regression])

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=params,
    evaluator=evaluator
)

Wrapping operations in a pipeline makes cross validating the entire workflow easy!

## Grid Search
  
So far you've been using the default parameters for almost everything. You've built some decent models, but they could probably be improved by choosing better model parameters.
  
**Tuning**
  
There is no universal "best" set of parameters for a particular model. The optimal choice of parameters will depend on the data and the modeling goal. The idea is relatively simple: you build a selection of models, one for each set of model parameters. Then you evaluate those models and choose the best one.
  
<center><img src='../_images/grid-search-in-pyspark.png' alt='img' width='740'></center>
  
**Cars revisited (again)**
  
You'll be looking at the fuel consumption regression model again.
  
<center><img src='../_images/grid-search-in-pyspark1.png' alt='img' width='740'></center>
  
**Fuel consumption with intercept**
  
You'll start by doing something simple, comparing a linear regression model with an intercept to one that passes through the origin. By default a linear regression model will always fit an intercept, but you're going to be explicit and specify the `fitIntercept=` parameter as True. You fit the model to the training data and then calculate the RMSE for the testing data.
  
<center><img src='../_images/grid-search-in-pyspark2.png' alt='img' width='740'></center>
  
**Fuel consumption without intercept**
  
Next you repeat the process, but specify False for the `fitIntercept=` parameter. Now you are creating a model which passes through the origin. When you evaluate this model you find that the RMSE is higher. So, comparing these two models you'd naturally choose the first one because it has a lower RMSE. However, there's a problem with this approach. Just getting a single estimate of RMSE is not very robust. It'd be better to make this comparison using cross-validation. You also have to manually build the models for the two different parameter values. It'd be great if that were automated.
  
<center><img src='../_images/grid-search-in-pyspark3.png' alt='img' width='740'></center>
  
**Parameter grid**
  
You can systematically evaluate a model across a grid of parameter values using a technique known as grid search. To do this you need to set up a parameter grid. You actually saw this in the previous lesson, where you simply created an empty grid. Now you are going to add points to the grid. First you create a grid builder and then you add one or more grids. At present there's just one grid, which takes two values for the `fitIntercept=` parameter. Call the `.build()` method to construct the grid. A separate model will be built for each point in the grid. You can check how many models this corresponds to and, of course, this is just two.
  
<center><img src='../_images/grid-search-in-pyspark4.png' alt='img' width='740'></center>
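
Conceptually, a parameter grid is the Cartesian product of the candidate values for each parameter. The sketch below imitates that idea in plain Python; `build_grid` is a hypothetical helper written for illustration, not part of PySpark.

```python
from itertools import product


def build_grid(**param_values):
    """Return one dict per combination of the supplied parameter values."""
    names = list(param_values)
    return [dict(zip(names, combo)) for combo in product(*param_values.values())]


# One parameter with two candidate values gives two grid points
grid = build_grid(fitIntercept=[True, False])
print(grid)
print('Number of models to be tested:', len(grid))

# With no parameters the grid still has a single (empty) point
print(build_grid())
```

The last line shows why an empty grid still leads to one model being built: there is exactly one point, which leaves every parameter at its default.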
  
**Grid search with cross-validation**
  
Now you create a cross-validator object and fit it to the training data. This builds a bunch of models: one model for each fold and point in the parameter grid. Since there are two points in the grid and ten folds, this translates into twenty models. The cross-validator is going to loop through each of the points in the parameter grid and for each point it will create a cross-validated model using the corresponding parameter values. When you take a look at the average metrics attribute, you can see why the metric is given as a list: you get one average value for each point in the grid. The values confirm what you observed before: the model that includes an intercept is superior to the model without an intercept.
  
<center><img src='../_images/grid-search-in-pyspark5.png' alt='img' width='740'></center>
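
The loop over grid points and folds can be sketched without Spark. The example below is a toy under stated assumptions: the data are synthetic (a straight line with intercept 50), the model is a one-variable least-squares fit, folds are assigned by row position rather than randomly, and `fit_line`, `rmse` and `grid_search_cv` are hypothetical names.

```python
import math


def fit_line(xs, ys, fit_intercept=True):
    """Least-squares fit of y = intercept + slope * x (intercept 0 if disabled)."""
    n = len(xs)
    if fit_intercept:
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                 / sum((x - mean_x) ** 2 for x in xs))
        return mean_y - slope * mean_x, slope
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return 0.0, slope


def rmse(model, xs, ys):
    """Root mean squared error of a fitted (intercept, slope) pair."""
    intercept, slope = model
    return math.sqrt(sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def grid_search_cv(xs, ys, grid, num_folds):
    """Return the average RMSE per grid point and the number of models fitted."""
    avg_metrics, n_fits = [], 0
    for params in grid:                      # one cross-validated model per grid point
        scores = []
        for fold in range(num_folds):        # one fit per fold
            train = [i for i in range(len(xs)) if i % num_folds != fold]
            test = [i for i in range(len(xs)) if i % num_folds == fold]
            model = fit_line([xs[i] for i in train], [ys[i] for i in train], **params)
            n_fits += 1
            scores.append(rmse(model, [xs[i] for i in test], [ys[i] for i in test]))
        avg_metrics.append(sum(scores) / num_folds)
    return avg_metrics, n_fits


# Synthetic data: a line that does not pass through the origin
xs = [float(x) for x in range(1, 21)]
ys = [50.0 + 2.0 * x for x in xs]

grid = [{'fit_intercept': True}, {'fit_intercept': False}]
avg_metrics, n_fits = grid_search_cv(xs, ys, grid, num_folds=10)

print('Models fitted:', n_fits)
print('Average RMSE per grid point:', avg_metrics)
```

Two grid points and ten folds give twenty fits, and the result is a list with one average RMSE per grid point, lower for the model with an intercept on this data.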
  
**The best model & parameters**
  
Our goal was to get the best model for the data. You retrieve this using the appropriately named `.bestModel` attribute. But it's not actually necessary to work with this directly because the fitted cross-validator model behaves like the best model. So, you can use it directly to make predictions on the testing data. Of course, you want to know what the best parameter value is and you can retrieve this using the `.explainParam()` method. As expected the best value for the `fitIntercept=` parameter is True. You can see this after the word "current" in the output.
  
<center><img src='../_images/grid-search-in-pyspark6.png' alt='img' width='740'></center>
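
Which grid point counts as "best" depends on the metric: for RMSE smaller is better, while for AUC larger is better. Spark evaluators expose this direction through `isLargerBetter()`. The sketch below shows the selection step in plain Python; `best_point` is a hypothetical helper and the metric values are invented for illustration.

```python
def best_point(grid, avg_metrics, larger_is_better):
    """Return the grid point with the best average metric, and that metric."""
    pick = max if larger_is_better else min
    best = pick(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
    return grid[best], avg_metrics[best]


# Invented average RMSE values, one per grid point (smaller is better)
grid = [{'fitIntercept': True}, {'fitIntercept': False}]
avg_rmse = [0.65, 0.95]
print(best_point(grid, avg_rmse, larger_is_better=False))

# Invented average AUC values for three unnamed grid points (larger is better)
print(best_point(['a', 'b', 'c'], [0.61, 0.68, 0.66], larger_is_better=True))
```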
  
**A more complicated grid**
  
It's possible to add more parameters to the grid. Here, in addition to whether or not to include an intercept, you're also considering a selection of values for the regularization parameter and the elastic net parameter. Of course, the more parameters and values you add to the grid, the more models you have to evaluate. Because each of these models will be evaluated using cross-validation, this might take a little while. But it will be time well spent, because the model that you get back will in principle be much better than what you would have obtained by just using the default parameters.
  
<center><img src='../_images/grid-search-in-pyspark7.png' alt='img' width='740'></center>
  
**Find the best parameters!**
  
Let's apply grid search on the flights and SMS models!

### Optimizing flights linear regression
  
Up until now you've been using the default hyper-parameters when building your models. In this exercise you'll use cross validation to choose an optimal (or close to optimal) set of model hyper-parameters.
  
The following have already been created:
  
- `regression` — a `LinearRegression` object
- `pipeline` — a pipeline with string indexer, one-hot encoder, vector assembler and linear regression and
- `evaluator` — a `RegressionEvaluator` object.
  
---
  
1. Create a parameter grid builder.
2. Add grids for `regression.regParam` (values 0.01, 0.1, 1.0, and 10.0) and `regression.elasticNetParam` (values 0.0, 0.5, and 1.0).
3. Build the grid.
4. Create a cross validator, specifying five folds.

In [11]:
# Create parameter grid
params = ParamGridBuilder()

# Add grids for two parameters
params = params.addGrid(regression.regParam, [0.01, 0.1, 1.0, 10.0])\
               .addGrid(regression.elasticNetParam, [0.0, 0.5, 1.0])

# Build the parameter grid
params = params.build()
print('Number of models to be tested: ', len(params))

# Create cross-validator
cv = CrossValidator(
    estimator=pipeline, 
    estimatorParamMaps=params, 
    evaluator=evaluator, 
    numFolds=5
)


Number of models to be tested:  12


Nice! Multiple models are built effortlessly using grid search.

### Dissecting the best flight duration model
  
You just set up a `CrossValidator` to find good parameters for the linear regression model predicting flight duration.
  
The model pipeline has multiple stages (objects of type `StringIndexer`, `OneHotEncoder`, `VectorAssembler` and `LinearRegression`), which operate in sequence. The stages are available as the `.stages` attribute on the pipeline object. They are represented by a list and are executed in the order in which they appear in that list.
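
Why the order of the list matters can be seen in a small plain-Python sketch. The stage functions below are hypothetical stand-ins for fitted stages (the index mapping is made up), and each one needs a column produced by the stage before it.

```python
def run_stages(stages, row):
    """Pass a row through each stage in list order; each stage adds a column."""
    for stage in stages:
        row = stage(row)
    return row


# Hypothetical stand-ins for fitted pipeline stages
def indexer(row):
    return {**row, 'org_idx': {'ORD': 0, 'SFO': 1}[row['org']]}


def onehot(row):
    # Two categories with the last one dropped leaves a single dummy column
    return {**row, 'org_dummy': [1.0] if row['org_idx'] == 0 else [0.0]}


def assembler(row):
    return {**row, 'features': [row['km']] + row['org_dummy']}


stages = [indexer, onehot, assembler]
result = run_stages(stages, {'org': 'SFO', 'km': 2500.0})
print(result['features'])

# Running the assembler first fails because 'org_dummy' does not exist yet
try:
    run_stages([assembler, indexer, onehot], {'org': 'SFO', 'km': 2500.0})
    wrong_order_failed = False
except KeyError:
    wrong_order_failed = True
print('Wrong order failed:', wrong_order_failed)
```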
  
Now you're going to take a closer look at the pipeline, split out the stages and use it to make predictions on the testing data.
  
The following objects have already been created:
  
- `cv` — a trained `CrossValidatorModel` object and
- `evaluator` — a `RegressionEvaluator` object.
  
The flights data have been randomly split into `flights_train` and `flights_test`.
  
---
  
1. Retrieve the best model.
2. Look at the stages in the best model.
3. Isolate the linear regression stage and extract its parameters.
4. Use the best model to generate predictions on the testing data and calculate the RMSE.

In [12]:
# Drop the existing 'features' column, then split into training and testing sets
flights_train, flights_test = flights.drop('features').randomSplit([0.8, 0.2], seed=999)

# Fit the cross-validator to the training data
cvModel = cv.fit(flights_train)

# Get the best model from cross validation
best_model = cvModel.bestModel

# Look at the stages in the best model
print(best_model.stages)

# Get the parameters for the LinearRegression object in the best model
best_model.stages[3].extractParamMap()

# Generate predictions on test data using the best model then calculate RMSE
predictions = best_model.transform(flights_test)
print('RMSE: ',evaluator.evaluate(predictions))

[StringIndexerModel: uid=StringIndexer_151d4c3ab0aa, handleInvalid=error, OneHotEncoderModel: uid=OneHotEncoder_c0c929e239e2, dropLast=true, handleInvalid=error, VectorAssembler_afc3318dcf9a, LinearRegressionModel: uid=LinearRegression_3e94151ae75b, numFeatures=8]




RMSE:  11.149352212449399


                                                                                

The best model performs pretty well on the testing data!

### SMS spam optimised
  
The pipeline you built earlier for the SMS spam model used the default parameters for all of the elements in the pipeline. It's very unlikely that these defaults give a particularly good model, though. In this exercise you're going to run the pipeline for a selection of parameter values. We're going to do this in a systematic way: the values for each of the parameters will be laid out on a grid and then the pipeline will be run at each point in the grid.
  
In this exercise you'll set up a parameter grid which can be used with cross validation to choose a good set of parameters for the SMS spam classifier.
  
The following are already defined:
  
- `hasher` — a `HashingTF` object and
- `logistic` — a `LogisticRegression` object.
  
---
  
1. Create a parameter grid builder object.
2. Add grid points for the `numFeatures=` and `binary=` parameters to the `HashingTF` object, giving values 1024, 4096 and 16384, and `True` and `False`, respectively.
3. Add grid points for `regParam=` and `elasticNetParam=` parameters to the `LogisticRegression` object, giving values of 0.01, 0.1, 1.0 and 10.0, and 0.0, 0.5, and 1.0 respectively.
4. Build the parameter grid.

In [13]:
# Create parameter grid
params = ParamGridBuilder()

# Add grid for hashing trick parameters
params = params.addGrid(hasher.numFeatures, (1024, 4096, 16384)).addGrid(hasher.binary, (True, False))

# Add grid for logistic regression parameters
params = params.addGrid(logistic.regParam, (0.01, 0.1, 1.0, 10.0))\
    .addGrid(logistic.elasticNetParam, (0.0, 0.5, 1.0))

# Build parameter grid
params = params.build()

print('Number of models to be tested: ', len(params))

Number of models to be tested:  72


Using cross-validation on a pipeline makes it possible to optimise each stage in the workflow.

### How many models for grid search?
  
How many models will be built when the cross-validator below is fit to data?
  
```python
params = ParamGridBuilder().addGrid(hasher.numFeatures, [1024, 4096, 16384]) \
                           .addGrid(hasher.binary, [True, False]) \
                           .addGrid(logistic.regParam, [0.01, 0.1, 1.0, 10.0]) \
                           .addGrid(logistic.elasticNetParam, [0.0, 0.5, 1.0]) \
                           .build()

cv = CrossValidator(..., estimatorParamMaps=params, numFolds=5)
```
  
---
  
Possible Answers
  
- [ ] 3
- [ ] 5
- [ ] 72
- [x] 360
  
Correct! There are 72 points in the parameter grid and 5 folds in the cross-validator. The product is 360. It takes time to build all of those models, which is why you're not doing it here!
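
The arithmetic behind the answer can be checked in a couple of lines. The dictionary simply records how many candidate values each parameter has in the grid above.

```python
from math import prod

# Number of candidate values per parameter in the grid above
grid_sizes = {'numFeatures': 3, 'binary': 2, 'regParam': 4, 'elasticNetParam': 3}
num_folds = 5

grid_points = prod(grid_sizes.values())   # 3 * 2 * 4 * 3
models = grid_points * num_folds          # one model per grid point and fold

print('Grid points:', grid_points)
print('Models built during cross-validation:', models)
```

This count covers the cross-validation fits only. After choosing the winning grid point, the cross-validator fits that configuration once more on the full training data to produce the best model.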

## Ensemble
  
You now know how to choose a good set of parameters for any model using cross-validation and grid search. In the final lesson you're going to learn about how models can be combined to form a collection or "ensemble" which is more powerful than each of the individual models alone.
  
**What's an ensemble?**
  
Simply put, an ensemble model is just a collection of models. An ensemble model combines the results from multiple models to produce better predictions than any one of those models acting alone. The concept is based on the idea of the **"Wisdom of the Crowd"**, which implies that the aggregated opinion of a group is better than the opinions of the individuals in that group, even if the individuals are experts.
  
<center><img src='../_images/ensemble-in-pyspark.png' alt='img' width='740'></center>
  
**Ensemble diversity**
  
As the quote suggests, for this idea to be true, there must be diversity and independence in the crowd. This applies to models too: a successful ensemble requires diverse models. It does not help if all of the models in the ensemble are similar or exactly the same. Ideally each of the models in the ensemble should be different.
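
A short calculation shows why independence matters. Assume, as an idealisation, that each model in the ensemble is correct with the same probability and that the models make their errors independently. The chance that a majority vote is correct is then a binomial sum; `majority_correct` is a hypothetical helper name.

```python
from math import comb


def majority_correct(n_models, accuracy):
    """Probability that a majority of n independent models is correct (n odd)."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * accuracy ** k * (1 - accuracy) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# One model that is right 60% of the time
print(round(majority_correct(1, 0.6), 5))   # 0.6

# Five independent models, each right 60% of the time
print(round(majority_correct(5, 0.6), 5))   # 0.68256
```

If the five models were identical copies they would always vote the same way, and the ensemble would be no better than a single model at 0.6. The gain comes entirely from the models disagreeing in different places.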
  
**Random Forest**
  
A Random Forest, as the name implies, is a collection of trees. To ensure that each of those trees is different, the Decision Tree algorithm is modified slightly: each tree is trained on a different random subset of the data, and within each tree a random subset of features is used for splitting at each node. The result is a collection of trees where no two trees are the same. Within the Random Forest model, all of the trees operate in parallel.
  
- An ensemble of Decision Trees
- Creating model diversity
- Each tree trained on random subset of data
- Random subset of features used for splitting at each node
- No two trees in the forest should be the same
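
The two sources of randomness listed above can be sketched in plain Python. This is an illustration under assumptions, not Spark's implementation: rows are drawn with replacement (a bootstrap sample, which is one common way to pick a random subset), the row ids and feature count are made up, and for brevity one feature subset is drawn per tree where a real forest draws a fresh subset at every node.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider for a split."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(42)      # fixed seed so the sketch is repeatable
rows = list(range(10))       # hypothetical row ids
n_features = 6               # hypothetical number of features

forest_plan = []
for tree in range(3):
    sample = bootstrap_sample(rows, rng)
    subset = feature_subset(n_features, 2, rng)
    forest_plan.append((sample, subset))
    print(f'tree {tree}: rows={sample} features={subset}')
```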
  
**Create a forest of trees**
  
Let's go back to the cars classifier yet again. You create a Random Forest model using the `RandomForestClassifier` class from the `classification` sub-module. You can select the number of trees in the forest using the `numTrees=` parameter. By default this is twenty, but we'll drop that to five so that the results are easier to interpret. As is the case with any other model, the Random Forest is fit to the training data.
  
<center><img src='../_images/ensemble-in-pyspark1.png' alt='img' width='740'></center>
  
**Seeing the trees**
  
Once the model is trained it's possible to access the individual trees in the forest using the `.trees` attribute. You would not normally do this, but it's useful for illustrative purposes. There are precisely five trees in the forest, as specified. The trees are all different, as can be seen from the varying number of nodes in each tree. You can then make predictions using each tree individually.
  
<center><img src='../_images/ensemble-in-pyspark2.png' alt='img' width='740'></center>
  
**Predictions from individual trees**
  
Here are the predictions of individual trees on a subset of the testing data. Each row represents predictions from each of the five trees for a specific record. In some cases all of the trees agree, but there is often some dissent amongst the models. This is precisely where the Random Forest works best: where the prediction is not clear cut. The Random Forest model creates a consensus prediction by aggregating the predictions across all of the individual trees.
  
<center><img src='../_images/ensemble-in-pyspark3.png' alt='img' width='740'></center>
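
A simple way to picture the consensus is a majority vote over the per-tree predictions, with the share of votes acting as a rough probability. The sketch below uses invented predictions from five trees. Spark's own aggregation may weight each tree's contribution by its class probabilities, so treat this as a simplification; `consensus` is a hypothetical helper.

```python
from collections import Counter


def consensus(tree_predictions):
    """Return the majority label and the share of trees voting for each label."""
    votes = Counter(tree_predictions)
    label, _ = votes.most_common(1)[0]
    shares = {lab: n / len(tree_predictions) for lab, n in votes.items()}
    return label, shares


# Invented predictions from five trees for two records
print(consensus([1, 1, 1, 1, 1]))   # all trees agree
print(consensus([0, 1, 1, 0, 1]))   # three of five trees vote for class 1
```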
  
**Consensus predictions**
  
You don't need to worry about these details though because the `.transform()` method will automatically generate a consensus `'prediction'` column. It also creates a `'probability'` column which assigns aggregate probabilities to each of the outcomes.
  
<center><img src='../_images/ensemble-in-pyspark4.png' alt='img' width='740'></center>
  
**Feature importances**
  
It's possible to get an idea of the relative importance of the features in the model by looking at the `.featureImportances` attribute. An importance is assigned to each feature, where a larger importance indicates a feature which makes a larger contribution to the model. Looking carefully at the importances we see that feature 4 (rpm) is the most important, while feature 0 (the number of cylinders) is the least important.
  
<center><img src='../_images/ensemble-in-pyspark5.png' alt='img' width='740'></center>
  
**Gradient-Boosted Trees**
  
The second ensemble model you'll be looking at is Gradient-Boosted Trees. Again the aim is to build a collection of diverse models, but the approach is slightly different. Rather than building a set of trees that operate in parallel, now we build trees which work in series. The boosting algorithm works iteratively. First build a decision tree and add to the ensemble. Then use the ensemble to make predictions on the training data. Compare the predicted labels to the known labels. Now identify training instances where predictions were incorrect. Return to the start and train another tree which focuses on improving the incorrect predictions. As trees are added to the ensemble its predictions improve because each new tree focuses on correcting the shortcomings of the preceding trees.
  
Iterative boosting algorithm:
  
1. Build a Decision Tree and add to ensemble
2. Predict label for each training instance using ensemble
3. Compare predictions with known labels
4. Emphasize training instances with incorrect predictions, then return to step 1.
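
The iterative loop can be sketched on a toy regression problem. The sketch makes several assumptions that go beyond the description above: each "tree" is a one-split stump, the emphasis on poorly predicted instances is implemented by fitting every new stump to the residuals of the current ensemble (the squared-error form of gradient boosting), a learning rate of 0.5 shrinks each update, and the data are made up. Spark's classifier uses a different loss, so this shows the shape of the algorithm rather than its exact arithmetic.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:          # keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, learning_rate=0.5):
    """Build stumps in series; each one is fit to the ensemble's current errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, errors = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]        # steps 2 and 3
        stump = fit_stump(xs, residuals)                      # steps 4 and 1
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, errors


# Made-up training data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 2, 2, 5, 5, 9, 9]

base, stumps, errors = boost(xs, ys, n_trees=10)
print('Training MSE after each tree:', [round(e, 3) for e in errors])
```

The training error shrinks as stumps are added, which is the sense in which each new tree corrects the shortcomings of the ones before it.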
  
**Boosting trees**
  
The class for the Gradient-Boosted Tree classifier is also found in the `classification` sub-module. After creating an instance of the class you fit it to the training data.
  
<center><img src='../_images/ensemble-in-pyspark6.png' alt='img' width='740'></center>
  
**Comparing trees**
  
You can make an objective comparison between a plain Decision Tree and the two ensemble models by looking at the values of AUC obtained by each of them on the testing data. Both of the ensemble methods score better than the Decision Tree. This is not too surprising since they are significantly more powerful models. It's also worth noting that these results are based on the default parameters for these models. It should be possible to get even better performance by tuning those parameters using cross-validation.
  
<center><img src='../_images/ensemble-in-pyspark7.png' alt='img' width='740'></center>
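
AUC has a useful interpretation: it is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition by comparing every positive/negative pair, counting ties as half. The labels and scores are invented, and Spark's evaluator computes the area from the ROC curve rather than by enumerating pairs, so values on real data can differ slightly.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented labels and scores: three of the four pairs are ranked correctly
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))   # 0.75

# A perfect ranking and an uninformative constant score
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))   # 1.0
print(auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))   # 0.5
```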
  
**Ensemble all of the models!**
  
In the final set of exercises you'll try out ensemble methods on the flights data.

### Delayed flights with Gradient-Boosted Trees
  
You've previously built a classifier for flights likely to be delayed using a Decision Tree. In this exercise you'll compare a Decision Tree model to a Gradient-Boosted Trees model.
  
The flights data have been randomly split into `flights_train` and `flights_test`.
  
---
  
1. Import the classes required to create Decision Tree and Gradient-Boosted Tree classifiers.
2. Create Decision Tree and Gradient-Boosted Tree classifiers. Train on the training data.
3. Create an evaluator and calculate AUC on testing data for both classifiers. Which model performs better?
4. For the Gradient-Boosted Tree classifier print the number of trees and the relative importance of features.

In [14]:
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assembler = VectorAssembler(inputCols=['mon', 'depart', 'duration'], outputCol='features')
flights = assembler.transform(flights.drop('features'))
flights = flights.withColumn('label', (flights.delay >= 15).cast('integer'))
flights = flights.select('mon', 'depart', 'duration', 'features', 'label')
flights = flights.dropna()

flights.show(5)

+---+------+--------+-----------------+-----+
|mon|depart|duration|         features|label|
+---+------+--------+-----------------+-----+
| 10|  8.18|      51| [10.0,8.18,51.0]|    1|
| 11|  7.17|     127|[11.0,7.17,127.0]|    0|
|  2| 21.17|     365|[2.0,21.17,365.0]|    1|
|  5| 12.92|      85| [5.0,12.92,85.0]|    1|
|  3| 13.33|     182|[3.0,13.33,182.0]|    1|
+---+------+--------+-----------------+-----+
only showing top 5 rows



In [15]:
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pprint import pprint

flights_train, flights_test = flights.randomSplit([0.8, 0.2])

# Create model objects and train on training data
tree = DecisionTreeClassifier().fit(flights_train)
gbt = GBTClassifier().fit(flights_train)

# Compare AUC on test data
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(tree.transform(flights_test))
evaluator.evaluate(gbt.transform(flights_test))

# Find the number of trees and the relative importance of features
pprint(gbt.trees)
print(gbt.featureImportances)

                                                                                

[DecisionTreeRegressionModel: uid=dtr_335a8db8f72b, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_10d1426bb2bc, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_22c85d205fe8, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_0d39a2d24183, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_2ab9aafa5636, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_84d8fe399b46, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_9d353011ecc5, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_f2272df53ead, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_f5361ea59453, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_e3d6f3268400, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressionModel: uid=dtr_84f53f5e9635, depth=5, numNodes=63, numFeatures=3,
 DecisionTreeRegressi

Good job! A Gradient-Boosted Tree almost always provides better performance than a plain Decision Tree.

### Delayed flights with a Random Forest
  
In this exercise you'll bring together cross validation and ensemble methods. You'll be training a Random Forest classifier to predict delayed flights, using cross validation to choose the best values for model parameters.
  
You'll find good values for the following parameters:
  
- `featureSubsetStrategy=` — the number of features to consider for splitting at each node and
- `maxDepth=` — the maximum number of splits along any branch.
  
Unfortunately building this model takes too long, so we won't be running the `.fit()` method on the pipeline.
  
---
  
1. Create a random forest classifier object.
2. Create a parameter grid builder object. Add grid points for the `featureSubsetStrategy=` and `maxDepth=` parameters.
3. Create binary classification evaluator.
4. Create a cross-validator object, specifying the estimator, parameter grid and evaluator. Choose 5-fold cross validation.

In [16]:
from pyspark.ml.classification import RandomForestClassifier

# Create a random forest classifier
forest = RandomForestClassifier()

# Create a parameter grid
params = ParamGridBuilder() \
        .addGrid(forest.featureSubsetStrategy, ['all', 'onethird', 'sqrt', 'log2']) \
        .addGrid(forest.maxDepth, [2, 5, 10]) \
        .build()

# Create a binary classification evaluator
evaluator = BinaryClassificationEvaluator()

# Create a cross-validator
cv = CrossValidator(
    estimator=forest, 
    estimatorParamMaps=params, 
    evaluator=evaluator, 
    numFolds=5
)

Excellent! A grid search can be used to optimize all of the parameters in a model pipeline.

### Evaluating Random Forest
  
In this final exercise you'll be evaluating the results of cross-validation on a Random Forest model.
  
The following have already been created:
  
- `cv` - a cross-validator which has already been fit to the training data
- `evaluator` — a `BinaryClassificationEvaluator` object and
- `flights_test` — the testing data.
  
---
  
1. Print a list of average AUC metrics across all models in the parameter grid.
2. Display the average AUC for the best model. This will be the largest AUC in the list.
3. Print an explanation of the `maxDepth=` and `featureSubsetStrategy=` parameters for the best model.
4. Display the AUC for the best model predictions on the testing data.

In [17]:
cvModel = cv.fit(flights_train)

# Average AUC for each parameter combination in grid
avg_auc = cvModel.avgMetrics

# Average AUC for the best model
best_model_auc = max(avg_auc)

# What are the optimal parameter values?
opt_max_depth = cvModel.bestModel.explainParam('maxDepth')
opt_feat_substrat = cvModel.bestModel.explainParam('featureSubsetStrategy')

# AUC for best model on test data
best_auc = evaluator.evaluate(cvModel.transform(flights_test))
print(best_auc)

0.6834133016457417


Fantastic! Optimized Random Forest > Random Forest > Decision Tree

## Closing thoughts
  
Congratulations on completing this course on Machine Learning with Apache Spark. You have covered a lot of ground, reviewing some Machine Learning fundamentals and seeing how they can be applied to large datasets, using Spark for distributed computing.
  
**Things you've learned**
  
You learned how to load data into Spark and then perform a variety of operations on those data. Specifically, you learned basic column manipulation on DataFrames, how to deal with text data, bucketing continuous data and one-hot encoding categorical data. You then delved into two types of classifiers, Decision Trees and Logistic Regression, in the process building a robust spam classifier. You also learned about partitioning your data and how to use testing data and a selection of metrics to evaluate a model. Next you learned about regression, starting with a simple linear regression model and progressing to penalized regression, which allowed you to build a model using only the most relevant predictors. You learned about pipelines and how they can make your Spark code cleaner and easier to maintain. This led naturally into using cross-validation and grid search to derive more robust model metrics and use them to select good model parameters. Finally you encountered two forms of ensemble models.

**Learning more**
  
Of course, there are many topics that were not covered in this course. If you want to dig deeper then consult the excellent and extensive online documentation. Importantly you can find instructions for setting up and securing a Spark cluster.
  
**Congratulations!**
  
Now go and use what you've learned to solve challenging and interesting big data problems in the real world!

In [18]:
spark.stop()