# Machine Learning Pipelines

PySpark's machine learning API lives in the `pyspark.ml` module. At its core are the `Transformer` and `Estimator` classes; almost every other class in the module behaves like one of these two.

`Transformer` classes have a `.transform()` method that takes a DataFrame and returns a new DataFrame, usually the original one with a new column appended. For example, you might use the class `Bucketizer` to create discrete bins from a continuous feature, or a fitted `PCAModel` to reduce the dimensionality of your dataset using principal component analysis.
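To make the binning idea concrete, here is a minimal plain-Python sketch of what a bucketizer does to a single value. It is an illustration rather than Spark's implementation: the function name `bucketize` and the split points are made up for this example. It follows the documented `Bucketizer` convention that each bucket covers `[lower, upper)`, except the last one, which also includes its upper bound.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket that contains `value`.

    `splits` must be sorted; n + 1 split points define n buckets.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by the splits")
    if value == splits[-1]:
        # The last bucket is closed on the right.
        return len(splits) - 2
    return bisect_right(splits, value) - 1


# Hypothetical split points for air time in minutes.
air_time_splits = [0, 60, 180, 600]
buckets = [bucketize(t, air_time_splits) for t in (45, 60, 132, 360, 600)]
print(buckets)
```

Note that a value sitting exactly on an inner split point (60 here) falls into the bucket that starts at that point.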

`Estimator` classes all implement a `.fit()` method. These methods also take a DataFrame, but instead of returning another DataFrame they return a model object. This can be something like a `StringIndexerModel` for including categorical data saved as strings in your models, or a `RandomForestClassificationModel` or `RandomForestRegressionModel`, which use the random forest algorithm for classification and regression respectively.
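The relationship between the two kinds of classes can be sketched without Spark. In the toy example below, `MeanCenterer` and `MeanCentererModel` are hypothetical names invented for illustration, and plain lists of dictionaries stand in for DataFrames. The point is only the shape of the API: `fit()` learns something from the data and returns a model, and that model's `transform()` returns new data with an extra column.

```python
class MeanCentererModel:
    """Plays the role of a Transformer: holds what was learned and applies it."""

    def __init__(self, column, mean):
        self.column = column
        self.mean = mean

    def transform(self, rows):
        # Return new rows with an extra column; the input is left unchanged.
        return [
            {**row, self.column + "_centered": row[self.column] - self.mean}
            for row in rows
        ]


class MeanCenterer:
    """Plays the role of an Estimator: fit() learns from data and returns a model."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [row[self.column] for row in rows]
        return MeanCentererModel(self.column, sum(values) / len(values))


rows = [{"air_time": 100}, {"air_time": 200}, {"air_time": 300}]
model = MeanCenterer("air_time").fit(rows)   # Estimator -> model
centered = model.transform(rows)             # model acts as a Transformer
print(centered)
```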

## Imports

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

## SparkSession

In [2]:
spark = (SparkSession
    .builder
    .getOrCreate()
)

## Display setting

In [3]:
from IPython.core.display import HTML
display(HTML("<style>pre {white-space: pre !important; }</style>"))

## Load the data

In [4]:
# Load the flights data

import os
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

data_path = 'file:///' + os.getcwd() + '/data'

flights_path = data_path + '/flights_small.csv'

flights_df = (
    spark.read
        .option("header", True)
        .csv(flights_path)
)

flights_df.show(5)

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+

In [5]:
# Load the planes data

planes_path = data_path + '/planes.csv'
planes_df = (
    spark.read
        .option("header", True)
        .csv(planes_path)
)
planes_df.show(5)

+-------+----+--------------------+----------------+--------+-------+-----+-----+---------+
|tailnum|year|                type|    manufacturer|   model|engines|seats|speed|   engine|
+-------+----+--------------------+----------------+--------+-------+-----+-----+---------+
| N102UW|1998|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N103US|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N104UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N105UW|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
| N107US|1999|Fixed wing multi ...|AIRBUS INDUSTRIE|A320-214|      2|  182|   NA|Turbo-fan|
+-------+----+--------------------+----------------+--------+-------+-----+-----+---------+
only showing top 5 rows



In [6]:
# Rename the year column of planes
# This is to avoid duplicate names during joining

planes_df = planes_df.withColumnRenamed("year", "plane_year")

In [7]:
# Join the DataFrames

model_data_df = flights_df.join(planes_df, on="tailnum", how="leftouter")

In [8]:
model_data_df.show(5)

+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+------------+--------+-------+-----+-----+---------+
|tailnum|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|flight|origin|dest|air_time|distance|hour|minute|plane_year|                type|manufacturer|   model|engines|seats|speed|   engine|
+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+------------+--------+-------+-----+-----+---------+
| N846VA|2014|   12|  8|     658|       -7|     935|       -5|     VX|  1780|   SEA| LAX|     132|     954|   6|    58|      2011|Fixed wing multi ...|      AIRBUS|A320-214|      2|  182|   NA|Turbo-fan|
| N559AS|2014|    1| 22|    1040|        5|    1505|        5|     AS|   851|   SEA| HNL|     360|    2677|  10|    40|      2006|Fixed wing multi ...|      BOEING| 737-890|      2|  1

## String to integer

Spark does not always assign the right data type to each column. In this case the CSV files were read without schema inference, so every column was loaded as a string. To remedy this, we can use the `.cast()` method to convert the appropriate columns of our DataFrame to the correct data type.

In [9]:
# Print the current data types
model_data_df.printSchema()

root
 |-- tailnum: string (nullable = true)
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- day: string (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- distance: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- plane_year: string (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: string (nullable = true)
 |-- seats: string (nullable = true)
 |-- speed: string (nullable = true)
 |-- engine: string (nullable = true)



As shown above, all the columns are of the string data type. Let's cast some columns from string to integers using the `.cast()` method.

In [10]:
# Cast the columns to integers
model_data_df = model_data_df.withColumn("arr_delay", model_data_df.arr_delay.cast("integer"))
model_data_df = model_data_df.withColumn("air_time", model_data_df.air_time.cast("integer"))
model_data_df = model_data_df.withColumn("month", model_data_df.month.cast("integer"))
model_data_df = model_data_df.withColumn("plane_year", model_data_df.plane_year.cast("integer"))

model_data_df.printSchema()

root
 |-- tailnum: string (nullable = true)
 |-- year: string (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: string (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- plane_year: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: string (nullable = true)
 |-- seats: string (nullable = true)
 |-- speed: string (nullable = true)
 |-- engine: string (nullable = true)



Notice the changes in data types of the columns in the schema above.
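As a rough plain-Python picture of what the cast does to each value: a string of digits becomes an integer, and a string that cannot be parsed, such as the `NA` seen in the planes data, becomes null. This describes Spark with ANSI mode turned off; with ANSI mode on, an invalid cast raises an error instead. The helper name `cast_to_int` is hypothetical and only models these simple cases.

```python
def cast_to_int(value):
    """Mimic a non-ANSI string-to-integer cast for simple inputs."""
    if value is None:
        return None
    try:
        return int(value.strip())
    except ValueError:
        # Unparseable strings such as "NA" turn into a missing value.
        return None


raw_years = ["2011", "1999", "NA", None]
plane_years = [cast_to_int(v) for v in raw_years]
print(plane_years)
```

This is why rows with missing values still need to be filtered out after casting.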

## Create a new column

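We are going to create a new column, `plane_age`, to be used in our model: the flight year minus the year the plane was made. Because `year` is still a string column, Spark converts it to a double for the subtraction, which is why the result is displayed as `3.0` rather than `3`.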

In [11]:
# Create the column plane_age
model_data_df = (
    model_data_df.withColumn("plane_age", model_data_df.year - model_data_df.plane_year))

model_data_df.select("year", "plane_year", "plane_age").show(5)

+----+----------+---------+
|year|plane_year|plane_age|
+----+----------+---------+
|2014|      2011|      3.0|
|2014|      2006|      8.0|
|2014|      2011|      3.0|
|2014|      1992|     22.0|
|2014|      1999|     15.0|
+----+----------+---------+
only showing top 5 rows



## Making a Boolean

Consider that you're modeling a yes or no question: is the flight late? However, your data contains the arrival delay in minutes for each flight. Thus, you'll need to create a boolean column which indicates whether the flight was late or not!

In [12]:
# Create is_late
model_data_df = model_data_df.withColumn("is_late", model_data_df.arr_delay > 0)

# Convert to an integer
model_data_df = model_data_df.withColumn("label", model_data_df.is_late.cast("integer"))

# Remove missing values
predicate = """
arr_delay is not NULL and
dep_delay is not NULL and
air_time is not NULL and
plane_year is not NULL
"""
model_data_df = (
    model_data_df.filter(predicate)
)

In [13]:
model_data_df.select("is_late", "label").show(5)

+-------+-----+
|is_late|label|
+-------+-----+
|  false|    0|
|   true|    1|
|   true|    1|
|   true|    1|
|   true|    1|
+-------+-----+
only showing top 5 rows
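The labeling rule applied above can be sketched in plain Python. The function name `late_label` is hypothetical. The sketch also shows why missing values have to be removed: in Spark SQL a comparison with null yields null, so a flight with an unknown arrival delay gets no label at all.

```python
def late_label(arr_delay):
    """Return 1 if the flight arrived late, 0 if not, None if the delay is unknown."""
    if arr_delay is None:
        # A comparison with null yields null, so the label is missing too.
        return None
    return int(arr_delay > 0)


delays = [-5, 5, 0, 34, None]
labels = [late_label(d) for d in delays]
print(labels)

# Dropping rows without a label mirrors the "Remove missing values" filter.
clean = [(d, label) for d, label in zip(delays, labels) if label is not None]
print(len(clean))
```

A flight that arrives exactly on time (a delay of 0) is labeled as not late, because the condition is strictly greater than zero.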



## Strings and factors

As you know, Spark requires numeric data for modeling. So far this hasn't been an issue; even boolean columns can easily be converted to integers without any trouble. But you'll also be using the airline and the plane's destination as features in your model. These are coded as strings and there isn't any obvious way to convert them to a numeric data type.

Fortunately, PySpark has functions for handling this built into the `pyspark.ml.feature` submodule. You can create what are called 'one-hot vectors' to represent the carrier and the destination of each flight. A one-hot vector is a way of representing a categorical feature where every observation has a vector in which all elements are zero except for at most one element, which has a value of one (1).

Each element in the vector corresponds to a level of the feature, so it's possible to tell what the right level is by seeing which element of the vector is equal to one (1).
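A small plain-Python sketch may help here. The function `one_hot` is written for illustration and is not part of PySpark. It also shows where the "at most one" in the definition comes from: Spark's `OneHotEncoder` drops the last category by default (`dropLast=True`), so that category is represented by a vector of all zeros.

```python
def one_hot(index, num_levels, drop_last=True):
    """Encode a category index as a dense list of 0.0/1.0 values."""
    if not 0 <= index < num_levels:
        raise ValueError("index must be between 0 and num_levels - 1")
    size = num_levels - 1 if drop_last else num_levels
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Three hypothetical carriers, indexed 0, 1 and 2.
first = one_hot(0, 3)
last = one_hot(2, 3)
last_kept = one_hot(2, 3, drop_last=False)
print(first, last, last_kept)
```

With the last level dropped, the three carriers are encoded in two elements, and the third carrier is the one whose vector contains no 1.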

The first step to encoding your categorical feature is to create a `StringIndexer`. Members of this class are `Estimators` that take a DataFrame with a column of strings and map each unique string to a number. Then, the `Estimator` returns a `Transformer` that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.
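The indexing step can be sketched in plain Python as follows. By default `StringIndexer` orders labels by descending frequency, so the most common string gets index 0. The helper `fit_string_index` is hypothetical, and breaking ties alphabetically is a choice made for this sketch so that its result is deterministic.

```python
from collections import Counter


def fit_string_index(values):
    """Learn a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically so the result is stable.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


carriers = ["AS", "VX", "AS", "WN", "AS", "VX"]
mapping = fit_string_index(carriers)        # the "fit" step
indexed = [mapping[c] for c in carriers]    # the "transform" step
print(mapping)
print(indexed)
```

Because the mapping is learned from the data, a different dataset with different carrier frequencies would produce different indices for the same strings.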

The second step is to encode this numeric column as a one-hot vector using a `OneHotEncoder`. This works exactly the same way as the `StringIndexer` by creating an `Estimator` and then a `Transformer`. The end result is a column that encodes your categorical feature as a vector that's suitable for machine learning routines!

This may seem complicated, but don't worry! All you have to remember is that you need to create a `StringIndexer` and a `OneHotEncoder`, and the `Pipeline` will take care of the rest.

## Carrier

In this section, we'll create a `StringIndexer` and a `OneHotEncoder` to code the `carrier` column. To do this, we'll call the class constructors with the arguments `inputCol` and `outputCol`.

The `inputCol` is the name of the column you want to index or encode, and the `outputCol` is the name of the new column that the `Transformer` should create.

In [14]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

In [15]:
# Create a StringIndexer
carr_indexer = StringIndexer(inputCol="carrier", outputCol="carrier_index")

# Create a OneHotEncoder
carr_encoder = OneHotEncoder(inputCol="carrier_index", outputCol="carrier_fact")

## Destination

Now we'll encode the `dest` column just like we did in the section above.

In [16]:
# Create a StringIndexer
dest_indexer = StringIndexer(inputCol="dest", outputCol="dest_index")

# Create a OneHotEncoder
dest_encoder = OneHotEncoder(inputCol="dest_index", outputCol="dest_fact")

## Assemble a vector

The last step in the Pipeline is to combine all of the columns containing our features into a single column. This has to be done before modeling can take place because every Spark modeling routine expects the data to be in this form. You can do this by storing each of the values from a column as an entry in a vector. Then, from the model's point of view, every observation is a vector that contains all of the information about it and a label that tells the modeler what value that observation corresponds to.

Because of this, the `pyspark.ml.feature` submodule contains a class called `VectorAssembler`. This `Transformer` takes all of the columns you specify and combines them into a new vector column.
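What "combining columns into a vector" means for a single row can be sketched in plain Python. The helpers `assemble` and `one_hot` are hypothetical, and the values and vector sizes (10 carrier levels and 68 destination levels after encoding) are assumptions chosen for illustration. Scalar columns contribute one element each, and vector columns are spliced in element by element, in the order the columns are listed.

```python
def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature vector."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(float(x) for x in part)
        else:
            features.append(float(part))
    return features


def one_hot(index, size):
    """Build a dense one-hot vector of the given size."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# Illustrative values for one flight.
month, air_time, plane_age = 12, 132, 3.0
carrier_fact = one_hot(7, 10)
dest_fact = one_hot(1, 68)

features = assemble(month, air_time, carrier_fact, dest_fact, plane_age)
nonzero = [i for i, x in enumerate(features) if x != 0.0]
print(len(features), nonzero)
```

The length of the assembled vector is the sum of the widths of its inputs: 1 + 1 + 10 + 68 + 1 = 81.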


In [17]:
# Make a VectorAssembler
from pyspark.ml.feature import VectorAssembler

vec_assembler = VectorAssembler(
    inputCols=["month", "air_time", "carrier_fact", "dest_fact", "plane_age"],
    outputCol="features")

## Create the pipeline

Finally, we are going to create a `Pipeline`.

`Pipeline` is a class in the `pyspark.ml` module that combines all the `Estimators` and `Transformers` that we've already created. This lets us reuse the same modeling process over and over again by wrapping it up in one simple object.

In [18]:
from pyspark.ml import Pipeline

# Make the Pipeline
flights_pipeline = Pipeline(
    stages=[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler]
)
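Conceptually, fitting a pipeline walks through the stages in order: each `Estimator` is fitted on the data produced by the stages before it, and the data is transformed before being handed to the next stage. The toy classes below (`AddOne`, `Scale`, `ToyPipeline` and so on) are hypothetical and use plain lists instead of DataFrames; they are a sketch of the idea, not Spark's implementation.

```python
class AddOne:
    """Toy Transformer: has transform() only."""

    def transform(self, values):
        return [v + 1 for v in values]


class ScaleModel:
    """Toy fitted model: divides by the maximum seen during fit()."""

    def __init__(self, maximum):
        self.maximum = maximum

    def transform(self, values):
        return [v / self.maximum for v in values]


class Scale:
    """Toy Estimator: fit() learns the maximum and returns a Transformer."""

    def fit(self, values):
        return ScaleModel(max(values))


class ToyPipelineModel:
    """Result of fitting: a chain of Transformers applied in order."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, values):
        for transformer in self.transformers:
            values = transformer.transform(values)
        return values


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced by the earlier stages.
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)
            fitted.append(model)
        return ToyPipelineModel(fitted)


pipeline_model = ToyPipeline(stages=[AddOne(), Scale()]).fit([1, 3])
same_data = pipeline_model.transform([1, 3])
new_data = pipeline_model.transform([0])
print(same_data, new_data)
```

The fitted pipeline can be applied to data it has not seen, and it reuses what was learned during `fit()` (here, the maximum of 4).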

## Test vs Train

After you've cleaned your data and gotten it ready for modeling, one of the most important steps is to split the data into a test set and a train set. After that, don't touch your test data until you think you have a good model! As you're building models and forming hypotheses, you can test them on your training data to get an idea of their performance.

Once you've got your favorite model, you can see how well it predicts the new data in your test set. This never-before-seen data will give you a much more realistic idea of your model's performance in the real world when you're trying to predict or classify new data.

In Spark it's important to make sure you split the data **after** all the transformations. This is because operations like `StringIndexer` don't always produce the same index even when given the same list of strings.

## Transform the data

We're now finally ready to pass our data through the `Pipeline` we created.

In [19]:
# Fit and transform the data
piped_data = flights_pipeline.fit(model_data_df).transform(model_data_df)

In [20]:
selected_cols = ["plane_age", "label", "dest_index", "dest_fact", "carrier_index", "carrier_fact", "features"]
piped_data.select(*selected_cols).show(5)

+---------+-----+----------+---------------+-------------+--------------+--------------------+
|plane_age|label|dest_index|      dest_fact|carrier_index|  carrier_fact|            features|
+---------+-----+----------+---------------+-------------+--------------+--------------------+
|      3.0|    0|       1.0| (68,[1],[1.0])|          7.0|(10,[7],[1.0])|(81,[0,1,9,13,80]...|
|      8.0|    1|      19.0|(68,[19],[1.0])|          0.0|(10,[0],[1.0])|(81,[0,1,2,31,80]...|
|      3.0|    1|       0.0| (68,[0],[1.0])|          7.0|(10,[7],[1.0])|(81,[0,1,9,12,80]...|
|     22.0|    1|       7.0| (68,[7],[1.0])|          1.0|(10,[1],[1.0])|(81,[0,1,3,19,80]...|
|     15.0|    1|      22.0|(68,[22],[1.0])|          0.0|(10,[0],[1.0])|(81,[0,1,2,34,80]...|
+---------+-----+----------+---------------+-------------+--------------+--------------------+
only showing top 5 rows
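The vector columns in the output above are printed in sparse notation, `(size, [indices], [values])`, which lists only the non-zero elements. A minimal plain-Python sketch of how to read it, using a hypothetical helper `to_dense`:

```python
def to_dense(size, indices, values):
    """Expand (size, [indices], [values]) sparse notation into a full list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# carrier_fact from the first row above: (10,[7],[1.0])
carrier_fact = to_dense(10, [7], [1.0])
print(carrier_fact)
```

So `(10,[7],[1.0])` is a vector of length 10 whose only non-zero element is a 1.0 at position 7, matching the `carrier_index` of 7.0 in the same row.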



## Split the data

Now that we've done all our manipulations, the last step before modeling is to split the data.

In [21]:
# Split the data into training and test sets
training, test = piped_data.randomSplit([0.60, 0.40])
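
`randomSplit` also accepts an optional `seed` argument, which makes the split reproducible across runs. The plain-Python sketch below illustrates the idea of a weighted, seeded random split. The function `random_split` is hypothetical and is not how Spark implements it. As with Spark, each row is assigned independently, so the resulting sizes only approximate the requested proportions.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one part, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for part, weight in zip(parts, weights):
            cumulative += weight
            if draw < cumulative:
                part.append(row)
                break
        else:
            # Guard against floating-point rounding at the upper edge.
            parts[-1].append(row)
    return parts


rows = list(range(1000))
training, test = random_split(rows, [0.6, 0.4], seed=42)
print(len(training), len(test))
```

Running the split again with the same seed returns the same two parts, which makes results easier to compare between experiments.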