# Getting started with machine learning pipelines in PySpark
  
PySpark has built-in, cutting-edge machine learning routines, along with utilities to create full machine learning pipelines. You'll learn about them in this chapter.
  
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   
      /_/
```


## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
[Apache Spark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Operator</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>pyspark.SparkContext()</td>
    <td>Create a new SparkContext instance, the entry point to using Spark functionality.</td>
  </tr>
  <tr>
    <td>2</td>
    <td>pyspark.SparkContext().version</td>
    <td>Retrieve the version of the SparkContext.</td>
  </tr>
  <tr>
    <td>3</td>
    <td>pyspark.SparkContext().stop()</td>
    <td>Stop the SparkContext, releasing associated resources.</td>
  </tr>
  <tr>
    <td>4</td>
    <td>pyspark.sql.SparkSession</td>
    <td>Create a new SparkSession instance, offering an entry point for DataFrame and SQL functionality.</td>
  </tr>
  <tr>
    <td>5</td>
    <td>pyspark.sql.SparkSession.builder.getOrCreate()</td>
    <td>Retrieve an existing SparkSession or create a new one if none exists.</td>
  </tr>
  <tr>
    <td>6</td>
    <td>pyspark.sql.SparkSession.builder.appName</td>
    <td>Set the application name for the SparkSession.</td>
  </tr>
  <tr>
    <td>7</td>
    <td>pyspark.sql.SparkSession.builder.getOrCreate</td>
    <td>Get an existing SparkSession or create a new one if none exists.</td>
  </tr>
  <tr>
    <td>8</td>
    <td>pyspark.sql.SparkSession.read</td>
    <td>Create a DataFrameReader for reading data in various formats.</td>
  </tr>
  <tr>
    <td>9</td>
    <td>pyspark.sql.SparkSession.read.format</td>
    <td>Specify the input data format when reading data using the DataFrameReader.</td>
  </tr>
  <tr>
    <td>10</td>
    <td>pyspark.sql.SparkSession.read.format.option('inferSchema', 'True')</td>
    <td>Specify options, such as inferring schema from data, when reading data using the DataFrameReader.</td>
  </tr>
  <tr>
    <td>11</td>
    <td>pyspark.sql.SparkSession.read.format.option('header', 'True')</td>
    <td>Specify options, such as reading headers from data, when reading data using the DataFrameReader.</td>
  </tr>
  <tr>
    <td>12</td>
    <td>pyspark.sql.SparkSession.read.format.load()</td>
    <td>Load data into a DataFrame based on specified options using the DataFrameReader.</td>
  </tr>
  <tr>
    <td>13</td>
    <td>pyspark.sql.DataFrame.createOrReplaceTempView</td>
    <td>Create or replace a temporary view of a DataFrame.</td>
  </tr>
  <tr>
    <td>14</td>
    <td>pyspark.sql.SparkSession.catalog.listTables()</td>
    <td>List the tables available in the catalog.</td>
  </tr>
  <tr>
    <td>15</td>
    <td>pyspark.sql.SparkSession.sql()</td>
    <td>Execute a SQL query and return the result as a DataFrame.</td>
  </tr>
  <tr>
    <td>16</td>
    <td>pyspark.sql.SparkSession.sql().toPandas()</td>
    <td>Convert the result of a SQL query to a Pandas DataFrame.</td>
  </tr>
  <tr>
    <td>17</td>
    <td>pyspark.sql.SparkSession.sql().toPandas().head()</td>
    <td>Retrieve the first few rows of a Pandas DataFrame obtained from a SQL query result.</td>
  </tr>
  <tr>
    <td>18</td>
    <td>pyspark.sql.SparkSession.createDataFrame()</td>
    <td>Create a DataFrame from a list or RDD.</td>
  </tr>
  <tr>
    <td>19</td>
    <td>pyspark.sql.SparkSession.read.csv()</td>
    <td>Read data from a CSV file and load it into a DataFrame.</td>
  </tr>
  <tr>
    <td>20</td>
    <td>pyspark.sql.SparkSession.table</td>
    <td>Create a DataFrame representing a table in the catalog.</td>
  </tr>
  <tr>
    <td>21</td>
    <td>pyspark.sql.DataFrame.filter</td>
    <td>Filter rows of a DataFrame based on a condition.</td>
  </tr>
  <tr>
    <td>22</td>
    <td>pyspark.sql.DataFrame.select</td>
    <td>Select columns from a DataFrame.</td>
  </tr>
  <tr>
    <td>23</td>
    <td>pyspark.sql.DataFrame.selectExpr</td>
    <td>Select columns using SQL expressions from a DataFrame.</td>
  </tr>
  <tr>
    <td>24</td>
    <td>pyspark.sql.DataFrame.printSchema</td>
    <td>Print the schema of a DataFrame.</td>
  </tr>
  <tr>
    <td>25</td>
    <td>pyspark.sql.DataFrame.withColumn</td>
    <td>Add or replace a column in a DataFrame.</td>
  </tr>
  <tr>
    <td>26</td>
    <td>pyspark.sql.types.IntegerType</td>
    <td>Create an IntegerType column type for use in DataFrame schema.</td>
  </tr>
  <tr>
    <td>27</td>
    <td>pyspark.sql.functions.col</td>
    <td>Reference a column in a DataFrame based on its name.</td>
  </tr>
  <tr>
    <td>28</td>
    <td>pyspark.sql.DataFrame.groupBy</td>
    <td>Group rows in a DataFrame based on specified columns.</td>
  </tr>
  <tr>
    <td>29</td>
    <td>pyspark.sql.GroupedData.min</td>
    <td>Compute the minimum value of specified columns for grouped rows.</td>
  </tr>
  <tr>
    <td>30</td>
    <td>pyspark.sql.GroupedData.max</td>
    <td>Compute the maximum value of specified columns for grouped rows.</td>
  </tr>
  <tr>
    <td>31</td>
    <td>pyspark.sql.GroupedData.avg</td>
    <td>Compute the average value of specified columns for grouped rows.</td>
  </tr>
  <tr>
    <td>32</td>
    <td>pyspark.sql.GroupedData.sum</td>
    <td>Compute the sum of specified columns for grouped rows.</td>
  </tr>
  <tr>
    <td>33</td>
    <td>pyspark.sql.GroupedData.count</td>
    <td>Compute the count of rows for grouped columns.</td>
  </tr>
  <tr>
    <td>34</td>
    <td>pyspark.sql.functions.stddev</td>
    <td>Compute the standard deviation of specified columns in a DataFrame.</td>
  </tr>
  <tr>
    <td>35</td>
    <td>pyspark.sql.DataFrame.withColumnRenamed</td>
    <td>Rename a column in a DataFrame.</td>
  </tr>
  <tr>
    <td>36</td>
    <td>pyspark.sql.DataFrame.join</td>
    <td>Join two DataFrames based on specified columns.</td>
  </tr>
  <tr>
    <td>37</td>
    <td>pyspark.ml.feature.StringIndexer</td>
    <td>Convert categorical strings to numerical indices using StringIndexer.</td>
  </tr>
  <tr>
    <td>38</td>
    <td>pyspark.ml.feature.OneHotEncoder</td>
    <td>Encode categorical indices as one-hot vectors using OneHotEncoder.</td>
  </tr>
  <tr>
    <td>39</td>
    <td>pyspark.ml.feature.VectorAssembler</td>
    <td>Combine multiple columns into a single feature vector using VectorAssembler.</td>
  </tr>
  <tr>
    <td>40</td>
    <td>pyspark.ml.Pipeline</td>
    <td>Construct an ML pipeline by assembling a sequence of transformers and an estimator.</td>
  </tr>
  <tr>
    <td>41</td>
    <td>pyspark.sql.DataFrame.randomSplit</td>
    <td>Randomly split a DataFrame into training and testing datasets.</td>
  </tr>
</table>

  
---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: scikit-learn  
Version: 1.3.0  
Summary: A set of python modules for machine learning and data mining  
  
Name: hyperopt  
Version: 0.2.7  
Summary: Distributed Asynchronous Hyperparameter Optimization  
  
Name: TPOT  
Version: 0.12.1  
Summary: Tree-based Pipeline Optimization Tool  
  
Name: pyspark  
Version: 3.4.1  
Summary: Apache Spark Python API  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Starts an interactive IPython shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline in the background and writes its output to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines (and starting line number) of a user-defined function, here `test`.  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
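```python
import numpy as np
import matplotlib.pyplot as plt

# Sample cubic curve to draw in every style
x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
for i, style in enumerate(available):
    # Apply each style only to its own subplot (the 10 x 3 grid holds up to 30 styles)
    with plt.style.context(style):
        ax = fig.add_subplot(10, 3, i + 1)
        ax.plot(x, y)
    ax.set_title(style)
```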
  

In [1]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import pyspark                      # Apache Spark:             Cluster Computing

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)

### Machine Learning Pipelines
  
In the next two chapters you'll step through every stage of the machine learning pipeline, from data intake to model evaluation. Let's get to it!
  
At the core of the `pyspark.ml` module are the `Transformer` and `Estimator` classes. Almost every other class in the module behaves similarly to these two basic classes.
  
`Transformer` classes have a `.transform()` method that takes a DataFrame and returns a new DataFrame; usually the original one with a new column appended. For example, you might use the class `Bucketizer` to create discrete bins from a continuous feature or the class `PCA` to reduce the dimensionality of your dataset using principal component analysis.
  
`Estimator` classes all implement a `.fit()` method. These methods also take a DataFrame, but instead of returning another DataFrame they return a model object. This can be something like a `StringIndexerModel` for including categorical data saved as strings in your models, or a `RandomForestClassificationModel` or `RandomForestRegressionModel`, which use the random forest algorithm for classification or regression, respectively.
  
---
  
Which of the following is **not** true about machine learning in Spark?
  
Possible Answers
  
- [x] Spark's algorithms give better results than other algorithms.
- [ ] Working in Spark allows you to create reproducible machine learning pipelines.
- [ ] Machine learning pipelines in Spark are made up of Transformers and Estimators.
- [ ] PySpark uses the pyspark.ml submodule to interface with Spark's machine learning routines.
  
That's right! Spark is just a platform that implements the same algorithms that can be found elsewhere.


### Join the DataFrames
  
In the next two chapters you'll be working to build a model that predicts whether or not a flight will be delayed based on the `flights` data we've been working with. This model will also include information about the plane that flew the route, so the first step is to join the two tables: `flights` and `planes`!
  
---
  
1. First, rename the `year` column of `planes` to `plane_year` to avoid duplicate column names.
2. Create a new DataFrame called `model_data` by joining the `flights` table with `planes` using the `'tailnum'` column as the key.

In [2]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("flights").getOrCreate()
)

# Read and create a temporary view for both datatables
# Infer schema (note that for larger files you may want to specify the schema)
flights = (spark.read.format("csv")
  .option("inferSchema", "True")
  .option("header", "True")
  .load("../_datasets/flights_small.csv"))
flights.createOrReplaceTempView("flights")

planes = (spark.read.format("csv")
  .option("inferSchema", "True")
  .option("header", "True")
  .load('../_datasets/planes.csv'))
planes.createOrReplaceTempView("planes")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/24 16:24:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

In [3]:
# Rename year column
planes = planes.withColumnRenamed('year', 'plane_year')

# Join the DataFrames
model_data = flights.join(planes, on='tailnum', how="leftouter")

In [4]:
model_data.show()

+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+--------------+-----------+-------+-----+-----+---------+
|tailnum|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|flight|origin|dest|air_time|distance|hour|minute|plane_year|                type|  manufacturer|      model|engines|seats|speed|   engine|
+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+--------------+-----------+-------+-----+-----+---------+
| N846VA|2014|   12|  8|     658|       -7|     935|       -5|     VX|  1780|   SEA| LAX|     132|     954|   6|    58|      2011|Fixed wing multi ...|        AIRBUS|   A320-214|      2|  182|   NA|Turbo-fan|
| N559AS|2014|    1| 22|    1040|        5|    1505|        5|     AS|   851|   SEA| HNL|     360|    2677|  10|    40|      2006|Fixed wing multi ...|        BOEIN

Awesome work! You're one step closer to a model!

### Data types
  
Good work! Before you get started modeling, it's important to know that Spark's machine learning routines only handle numeric data. That means every column you pass to a model must hold either integers or decimals (called `'doubles'` in Spark).
  
When we imported our data, we let Spark guess what kind of information each column held. Unfortunately, Spark doesn't always guess right and you can see that some of the columns in our DataFrame are strings containing numbers as opposed to actual numeric values.
  
To remedy this, you can use the `.cast()` method in combination with the `.withColumn()` method. It's important to note that `.cast()` works on columns, while `.withColumn()` works on DataFrames.
  
The only argument you need to pass to `.cast()` is the kind of value you want to create, in string form. For example, to create integers, you'll pass the argument `"integer"` and for decimal numbers you'll use `"double"`.
  
You can put this call to `.cast()` inside a call to `.withColumn()` to overwrite the already existing column, just like you did in the previous chapter!
  
---
  
What kind of data does Spark need for modeling?
  
Possible Answers
  
- [ ] Doubles
- [ ] Integers
- [ ] Decimals
- [x] Numeric
- [ ] Strings
  
Great job! Spark needs numeric values (doubles or integers) to do machine learning.

### String to integer
  
Now you'll use the `.cast()` method you learned in the previous exercise to convert all the appropriate columns from your DataFrame `model_data` to integers!
  
To convert the type of a column using the `.cast()` method, you can write code like this:
  
```python
dataframe = dataframe.withColumn("col", dataframe.col.cast("new_type"))
```
  
---
  
1. Use the method `.withColumn()` to `.cast()` the following columns to type `"integer"`. Access the columns using the `df.col` notation:
- `model_data.arr_delay`
- `model_data.air_time`
- `model_data.month`
- `model_data.plane_year`

In [5]:
model_data.printSchema()

root
 |-- tailnum: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: string (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- plane_year: string (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: integer (nullable = true)
 |-- seats: integer (nullable = true)
 |-- speed: string (nullable = true)
 |-- engine: string (nullable = true)


In [6]:
# Cast the columns to integers
model_data = model_data.withColumn("arr_delay", model_data.arr_delay.cast('integer'))
model_data = model_data.withColumn("air_time", model_data.air_time.cast('integer'))
model_data = model_data.withColumn("month", model_data.month.cast('integer'))
model_data = model_data.withColumn("plane_year", model_data.plane_year.cast('integer'))

In [7]:
model_data.printSchema()

root
 |-- tailnum: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dep_time: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_time: string (nullable = true)
 |-- arr_delay: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: integer (nullable = true)
 |-- origin: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- air_time: integer (nullable = true)
 |-- distance: integer (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- plane_year: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- model: string (nullable = true)
 |-- engines: integer (nullable = true)
 |-- seats: integer (nullable = true)
 |-- speed: string (nullable = true)
 |-- engine: string (nullable = true)


Awesome! You're a pro at converting columns!

### Create a new column
  
In the last exercise, you converted the column `'plane_year'` to an integer. This column holds the year each plane was manufactured. However, your model will use the planes' age, which is slightly different from the year it was made!
  
---
  
1. Create the column `'plane_age'` using the `.withColumn()` method by subtracting the year of manufacture (column `'plane_year'`) from the year of the flight (column `'year'`).

In [8]:
# Create the column plane_age
model_data = model_data.withColumn("plane_age", model_data.year - model_data.plane_year)

In [9]:
model_data.show()

+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+--------------+-----------+-------+-----+-----+---------+---------+
|tailnum|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|flight|origin|dest|air_time|distance|hour|minute|plane_year|                type|  manufacturer|      model|engines|seats|speed|   engine|plane_age|
+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+--------------+-----------+-------+-----+-----+---------+---------+
| N846VA|2014|   12|  8|     658|       -7|     935|       -5|     VX|  1780|   SEA| LAX|     132|     954|   6|    58|      2011|Fixed wing multi ...|        AIRBUS|   A320-214|      2|  182|   NA|Turbo-fan|        3|
| N559AS|2014|    1| 22|    1040|        5|    1505|        5|     AS|   851|   SEA| HNL|     360|    2677|  10|    40|     

Great work! Now you have one more variable to include in your model.

### Making a Boolean
  
Consider that you're modeling a yes or no question: is the flight late? However, your data contains the arrival delay in minutes for each flight. Thus, you'll need to create a boolean column which indicates whether the flight was late or not!
  
---
  
1. Use the `.withColumn()` method to create the column `'is_late'`. This column is equal to `model_data.arr_delay > 0`.
2. Convert this column to an integer column so that you can use it in your model and name it `label` (this is the default name for the response variable in Spark's machine learning routines).
3. Filter out missing values (this has been done for you).

In [10]:
# Create is_late
model_data = model_data.withColumn("is_late", model_data.arr_delay > 0)

# Convert to an integer
model_data = model_data.withColumn("label", model_data.is_late.cast('integer'))

# Remove missing values
model_data = model_data.filter(
    "arr_delay is not NULL and dep_delay is not NULL and air_time is not NULL and plane_year is not NULL")

In [11]:
model_data.show()

23/08/24 16:25:01 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+--------------+-----------+-------+-----+-----+---------+---------+-------+-----+
|tailnum|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|flight|origin|dest|air_time|distance|hour|minute|plane_year|                type|  manufacturer|      model|engines|seats|speed|   engine|plane_age|is_late|label|
+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+--------------+-----------+-------+-----+-----+---------+---------+-------+-----+
| N846VA|2014|   12|  8|     658|       -7|     935|       -5|     VX|  1780|   SEA| LAX|     132|     954|   6|    58|      2011|Fixed wing multi ...|        AIRBUS|   A320-214|      2|  182|   NA|Turbo-fan|        3|  false|    0|
| N559AS|2014|    1| 22|    1040|        5|    1505|        5|     A

Awesome! Now you've defined the column that you're going to use as the outcome in your model.

### Strings and factors
  
As you know, Spark requires numeric data for modeling. So far this hasn't been an issue; even boolean columns can easily be converted to integers without any trouble. But you'll also be using the airline and the plane's destination as features in your model. These are coded as strings and there isn't any obvious way to convert them to a numeric data type.
  
Fortunately, PySpark has functions for handling this built into the `pyspark.ml.feature` submodule. You can create what are called 'one-hot vectors' to represent the carrier and the destination of each flight. A one-hot vector is a way of representing a categorical feature where every observation has a vector in which all elements are zero except for at most one element, which has a value of one (1).
  
Each element in the vector corresponds to a level of the feature, so it's possible to tell what the right level is by seeing which element of the vector is equal to one (1).
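
To make the idea concrete, here is a small pure-Python sketch of one-hot encoding. It is not Spark's implementation: the helper `one_hot_vectors` and its `drop_last` flag are hypothetical names invented for this illustration. The flag imitates the convention of leaving out the last level, which is why a vector can contain "at most one" element equal to one: the omitted level is represented by all zeros.

```python
from collections import Counter


def one_hot_vectors(values, drop_last=True):
    """Map each distinct string to an index (most frequent first), then to a 0/1 vector."""
    counts = Counter(values)
    # Sort levels by descending frequency, breaking ties alphabetically
    levels = sorted(counts, key=lambda level: (-counts[level], level))
    index_of = {level: i for i, level in enumerate(levels)}
    # With drop_last=True the last level is encoded as an all-zero vector
    size = len(levels) - 1 if drop_last else len(levels)
    vectors = []
    for value in values:
        vec = [0] * size
        if index_of[value] < size:
            vec[index_of[value]] = 1
        vectors.append(vec)
    return index_of, vectors


# Hypothetical carrier codes
carriers = ["AS", "VX", "AS", "US", "AS", "VX"]
index_of, vectors = one_hot_vectors(carriers)
print(index_of)  # {'AS': 0, 'VX': 1, 'US': 2}
print(vectors)   # [[1, 0], [0, 1], [1, 0], [0, 0], [1, 0], [0, 1]]
```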
  
The first step to encoding your categorical feature is to create a `StringIndexer`. Members of this class are `Estimator`s that take a DataFrame with a column of strings and map each unique string to a number. Then, the `Estimator` returns a `Transformer` that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.
  
The second step is to encode this numeric column as a one-hot vector using a `OneHotEncoder`. This works exactly the same way as the `StringIndexer`, by creating an `Estimator` and then a `Transformer`. The end result is a column that encodes your categorical feature as a vector that's suitable for machine learning routines!
  
This may seem complicated, but don't worry! All you have to remember is that you need to create a `StringIndexer` and a `OneHotEncoder`, and the `Pipeline` will take care of the rest.
  
---
  
Why do you have to encode a categorical feature as a one-hot vector?
  
Possible Answers
  
- [ ] It makes fitting the model faster.
- [x] Spark can only model numeric features.
- [ ] For compatibility with scikit-learn.
  
Awesome! You remembered that Spark can only model numeric features.

### Carrier
  
In this exercise you'll create a `StringIndexer` and a `OneHotEncoder` to encode the `carrier` column. To do this, you'll call the class constructors with the arguments `inputCol=` and `outputCol=`.
  
The `inputCol=` is the name of the column you want to index or encode, and the `outputCol=` is the name of the new column that the `Transformer` should create.
  
---
  
1. Create a `StringIndexer` called `carr_indexer` by calling `StringIndexer()` with `inputCol="carrier"` and `outputCol="carrier_index"`.
2. Create a `OneHotEncoder` called `carr_encoder` by calling `OneHotEncoder()` with `inputCol="carrier_index"` and `outputCol="carrier_fact"`.

In [12]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Create a StringIndexer
carr_indexer = StringIndexer(inputCol='carrier', outputCol='carrier_index')

# Create a OneHotEncoder
carr_encoder = OneHotEncoder(inputCol='carrier_index', outputCol='carrier_fact')

Fantastic work! You're ready to include this column in your model now!

### Destination
  
Now you'll encode the `'dest'` column just like you did in the previous exercise.
  
1. Create a `StringIndexer` called `dest_indexer` by calling `StringIndexer()` with `inputCol="dest"` and `outputCol="dest_index"`.
2. Create a `OneHotEncoder` called `dest_encoder` by calling `OneHotEncoder()` with `inputCol="dest_index"` and `outputCol="dest_fact"`.

In [13]:
# Create a StringIndexer
dest_indexer = StringIndexer(inputCol='dest', outputCol='dest_index')

# Create a OneHotEncoder
dest_encoder = OneHotEncoder(inputCol='dest_index', outputCol='dest_fact')

Perfect! You're all done messing with factors.

### Assemble a vector
  
The last step in the `Pipeline` is to combine all of the columns containing our features into a single column. This has to be done before modeling can take place because every Spark modeling routine expects the data to be in this form. You can do this by storing each of the values from a column as an entry in a vector. Then, from the model's point of view, every observation is a vector that contains all of the information about it and a label that tells the modeler what value that observation corresponds to.
  
Because of this, the `pyspark.ml.feature` submodule contains a class called `VectorAssembler`. This `Transformer` takes all of the columns you specify and combines them into a new vector column.
  
---
  
1. Create a `VectorAssembler` by calling `VectorAssembler()` with the `inputCols=` names as a list and the `outputCol=` name `"features"`.
2. The list of columns should be `["month", "air_time", "carrier_fact", "dest_fact", "plane_age"]`

In [14]:
from pyspark.ml.feature import VectorAssembler

# Make a VectorAssembler
vec_assembler = VectorAssembler(
    inputCols=['month', 'air_time', 'carrier_fact', 'dest_fact', 'plane_age'],
    outputCol='features'
)

Good job! Your data is all assembled now.

### Create the pipeline
  
You're finally ready to create a `Pipeline`!
  
`Pipeline` is a class in the `pyspark.ml` module that combines all the `Estimator`s and `Transformer`s that you've already created. This lets you reuse the same modeling process over and over again by wrapping it up in one simple object. Neat, right?
  
---
  
1. Import `Pipeline` from `pyspark.ml`.
2. Call the `Pipeline()` constructor with the keyword argument `stages` to create a `Pipeline` called `flights_pipe`. `stages` should be a list holding all the stages you want your data to go through in the pipeline. Here this is just: `[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler]`

In [15]:
from pyspark.ml import Pipeline

# Make the pipeline
flights_pipe = Pipeline(
    stages=[dest_indexer, dest_encoder, carr_indexer, carr_encoder, vec_assembler]
)

Fantastic! You've made a fully reproducible machine learning pipeline!

### Test vs. Train
  
After you've cleaned your data and gotten it ready for modeling, one of the most important steps is to split the data into a test set and a train set. After that, don't touch your test data until you think you have a good model! As you're building models and forming hypotheses, you can test them on your training data to get an idea of their performance.
  
Once you've got your favorite model, you can see how well it predicts the new data in your test set. This never-before-seen data will give you a much more realistic idea of your model's performance in the real world when you're trying to predict or classify new data.
  
In Spark it's important to make sure you split the data after all the transformations. This is because operations like `StringIndexer` don't always produce the same index even when given the same list of strings.
  
---
  
Why is it important to use a test set in model evaluation?
  
Possible Answers
  
- [ ] Evaluating your model improves its accuracy.
- [x] By evaluating your model with a test set you can get a good idea of performance on new data.
- [ ] Using a test set lets you check your code for errors.

Exactly! A test set approximates the 'real world error' of your model.

### Transform the data
  
Hooray, now you're finally ready to pass your data through the `Pipeline` you created!
  
---
  
1. Create the DataFrame `piped_data` by calling the `Pipeline` methods `.fit()` and `.transform()` in a chain. Both of these methods take `model_data` as their only argument.

In [16]:
# Fit and transform the data
piped_data = flights_pipe.fit(model_data).transform(model_data)

                                                                                

Great work! Your pipeline chewed right through that data!

### Split the data
  
Now that you've done all your manipulations, the last step before modeling is to split the data!
  
---
  
Use the DataFrame method `.randomSplit()` to split `piped_data` into two pieces: `training`, with 60% of the data, and `test`, with 40% of the data. Do this by passing the list `[.6, .4]` to the `.randomSplit()` method.

In [17]:
# Split the data into training and test sets
training, test = piped_data.randomSplit([.6, .4])

In [18]:
training.show()


+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+----------------+--------+-------+-----+-----+---------+---------+-------+-----+----------+---------------+-------------+--------------+--------------------+
|tailnum|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|flight|origin|dest|air_time|distance|hour|minute|plane_year|                type|    manufacturer|   model|engines|seats|speed|   engine|plane_age|is_late|label|dest_index|      dest_fact|carrier_index|  carrier_fact|            features|
+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+----------------+--------+-------+-----+-----+---------+---------+-------+-----+----------+---------------+-------------+--------------+--------------------+
| N105UW|2014|    3| 13|    1325|        5|    2123|       13|     US|  1

                                                                                

In [19]:
test.show()


+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+----------------+--------+-------+-----+-----+---------+---------+-------+-----+----------+---------------+-------------+--------------+--------------------+
|tailnum|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|flight|origin|dest|air_time|distance|hour|minute|plane_year|                type|    manufacturer|   model|engines|seats|speed|   engine|plane_age|is_late|label|dest_index|      dest_fact|carrier_index|  carrier_fact|            features|
+-------+----+-----+---+--------+---------+--------+---------+-------+------+------+----+--------+--------+----+------+----------+--------------------+----------------+--------+-------+-----+-----+---------+---------+-------+-----+----------+---------------+-------------+--------------+--------------------+
| N102UW|2014|    5|  7|    1311|        6|    2115|        2|     US|  1

                                                                                

In [21]:
print('Training data size: ',training.count())
print('Testing data size: ',test.count())

                                                                                

Training data size:  5552



Testing data size:  3751


                                                                                

Awesome! Now you're ready to start fitting a model!