# Classification In PySpark
  
Now that you are familiar with getting data into Spark, you'll move on to building two types of classification model: Decision Trees and Logistic Regression. You'll also find out about a few approaches to data preparation.
  
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   
      /_/
```

## Resources
  
**Notebook Syntax**
  
<span style='color:#7393B3'>NOTE:</span>  
- Denotes additional information deemed to be *contextually* important
- Colored in blue, HEX #7393B3
  
<span style='color:#E74C3C'>WARNING:</span>  
- Significant information that is *functionally* critical  
- Colored in red, HEX #E74C3C
  
---
  
**Links**
  
[NumPy Documentation](https://numpy.org/doc/stable/user/index.html#user)  
[Pandas Documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)  
[Matplotlib Documentation](https://matplotlib.org/stable/index.html)  
[Seaborn Documentation](https://seaborn.pydata.org)  
[Apache Spark Documentation](https://spark.apache.org/docs/latest/api/python/index.html)  
  
---
  
**Notable Functions**
  
<table>
  <tr>
    <th>Index</th>
    <th>Function</th>
    <th>Use</th>
  </tr>
  <tr>
    <td>1</td>
    <td>pyspark.sql.SparkSession</td>
    <td>Main entry point for using Spark functionality</td>
  </tr>
  <tr>
    <td>2</td>
    <td>spark.version</td>
    <td>Retrieves the version of Spark</td>
  </tr>
  <tr>
    <td>3</td>
    <td>spark.stop()</td>
    <td>Terminates the Spark session and releases resources</td>
  </tr>
  <tr>
    <td>4</td>
    <td>SparkSession.builder.master('local[*]').appName('flights').getOrCreate()</td>
    <td>Creates a SparkSession with specific configuration</td>
  </tr>
  <tr>
    <td>5</td>
    <td>DataFrame.count()</td>
    <td>Counts the number of rows in a DataFrame</td>
  </tr>
  <tr>
    <td>6</td>
    <td>DataFrame.show()</td>
    <td>Displays the contents of a DataFrame</td>
  </tr>
  <tr>
    <td>7</td>
    <td>pyspark.sql.types.StructType</td>
    <td>Defines the structure for a DataFrame's schema</td>
  </tr>
  <tr>
    <td>8</td>
    <td>pyspark.sql.types.StructField</td>
    <td>Defines a single field within a schema</td>
  </tr>
  <tr>
    <td>9</td>
    <td>pyspark.sql.types.IntegerType</td>
    <td>Represents the integer data type in a schema</td>
  </tr>
  <tr>
    <td>10</td>
    <td>pyspark.sql.types.StringType</td>
    <td>Represents the string data type in a schema</td>
  </tr>
  <tr>
    <td>11</td>
    <td>spark.read.csv</td>
    <td>Reads data from a CSV file into a DataFrame</td>
  </tr>
  <tr>
    <td>12</td>
    <td>DataFrame.printSchema()</td>
    <td>Prints the schema of a DataFrame</td>
  </tr>
  <tr>
    <td>13</td>
    <td>DataFrame.filter</td>
    <td>Filters rows from a DataFrame based on a condition</td>
  </tr>
  <tr>
    <td>14</td>
    <td>DataFrame.select</td>
    <td>Selects specific columns from a DataFrame</td>
  </tr>
  <tr>
    <td>15</td>
    <td>DataFrame.dropna</td>
    <td>Removes rows with missing values from a DataFrame</td>
  </tr>
  <tr>
    <td>16</td>
    <td>DataFrame.drop</td>
    <td>Removes specified columns from a DataFrame</td>
  </tr>
  <tr>
    <td>17</td>
    <td>pyspark.sql.functions.round</td>
    <td>Rounds the values in a column</td>
  </tr>
  <tr>
    <td>18</td>
    <td>DataFrame.withColumn</td>
    <td>Adds a new column or replaces an existing one</td>
  </tr>
  <tr>
    <td>19</td>
    <td>pyspark.ml.feature.StringIndexer</td>
    <td>Converts string labels into numerical indices</td>
  </tr>
  <tr>
    <td>20</td>
    <td>Estimator.fit</td>
    <td>Trains a machine learning model</td>
  </tr>
  <tr>
    <td>21</td>
    <td>Transformer.transform</td>
    <td>Applies a transformation to a DataFrame</td>
  </tr>
  <tr>
    <td>22</td>
    <td>pyspark.ml.feature.VectorAssembler</td>
    <td>Combines multiple columns into a single vector column</td>
  </tr>
  <tr>
    <td>23</td>
    <td>DataFrame.randomSplit</td>
    <td>Splits a DataFrame into random subsets</td>
  </tr>
  <tr>
    <td>24</td>
    <td>pyspark.ml.classification.DecisionTreeClassifier</td>
    <td>Creates a decision tree classification model</td>
  </tr>
  <tr>
    <td>25</td>
    <td>DataFrame.groupBy</td>
    <td>Groups data in a DataFrame by specified columns</td>
  </tr>
  <tr>
    <td>26</td>
    <td>pyspark.ml.classification.LogisticRegression</td>
    <td>Creates a logistic regression classification model</td>
  </tr>
  <tr>
    <td>27</td>
    <td>pyspark.ml.evaluation.MulticlassClassificationEvaluator</td>
    <td>Evaluates multiclass classification models</td>
  </tr>
  <tr>
    <td>28</td>
    <td>pyspark.ml.evaluation.BinaryClassificationEvaluator</td>
    <td>Evaluates binary classification models</td>
  </tr>
  <tr>
    <td>29</td>
    <td>pyspark.sql.functions.regexp_replace</td>
    <td>Replaces occurrences of a pattern in a string column</td>
  </tr>
  <tr>
    <td>30</td>
    <td>pyspark.ml.feature.Tokenizer</td>
    <td>Splits text into words (tokens)</td>
  </tr>
  <tr>
    <td>31</td>
    <td>pyspark.ml.feature.StopWordsRemover</td>
    <td>Removes common words (stop words) from text</td>
  </tr>
  <tr>
    <td>32</td>
    <td>pyspark.ml.feature.HashingTF</td>
    <td>Converts text data into numerical vectors</td>
  </tr>
  <tr>
    <td>33</td>
    <td>pyspark.ml.feature.IDF</td>
    <td>Applies Inverse Document Frequency (IDF) to text vectors</td>
  </tr>
</table>


---
  
**Language and Library Information**  
  
Python 3.11.0  
  
Name: numpy  
Version: 1.24.3  
Summary: Fundamental package for array computing in Python  
  
Name: pandas  
Version: 2.0.3  
Summary: Powerful data structures for data analysis, time series, and statistics  
  
Name: matplotlib  
Version: 3.7.2  
Summary: Python plotting package  
  
Name: seaborn  
Version: 0.12.2  
Summary: Statistical data visualization  
  
Name: pyspark  
Version: 3.4.1  
Summary: Apache Spark Python API  
  
---
  
**Miscellaneous Notes**
  
<span style='color:#7393B3'>NOTE:</span>  
  
`python3.11 -m IPython` : Runs the IPython interactive shell under Python 3.11 in the terminal.
  
`nohup ./relo_csv_D2S.sh > ./output/relo_csv_D2S.log &` : Runs the CSV data pipeline in the background, immune to hangups, with its output written to a log file.  
  
`print(inspect.getsourcelines(test))` : Prints the source lines of a user-defined function (here `test`).  
  
<span style='color:#7393B3'>NOTE:</span>  
  
Snippet to plot all built-in matplotlib styles :
  
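```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(-2, 8, .1)
y = 0.1 * x ** 3 - x ** 2 + 3 * x + 2
fig = plt.figure(dpi=100, figsize=(10, 20), tight_layout=True)
available = ['default'] + plt.style.available
# Three subplots per row, with enough rows to hold every style
n_rows = -(-len(available) // 3)
for i, style in enumerate(available):
    # Draw each subplot, including its title, in its own style
    with plt.style.context(style):
        ax = fig.add_subplot(n_rows, 3, i + 1)
        ax.plot(x, y)
        ax.set_title(style)
```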
  

```python
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import pyspark                      # Apache Spark:             Cluster Computing

# Setting a standard figure size
plt.rcParams['figure.figsize'] = (8, 8)

# Set the maximum number of columns to be displayed
pd.set_option('display.max_columns', 50)
```

## Data Preparation
  
In this lesson you are going to learn how to prepare data for building a Machine Learning model.
  
**Do you need all of those columns?**
  
You'll be working with the cars data again. This is what the data look like at present. There are columns for the maker and model, the origin (either USA or non-USA), the type, number of cylinders, engine size, weight, length, RPM and fuel consumption. The models that you'll be building will depend on the physical characteristics of the cars rather than the model names or manufacturers, so you'll remove the corresponding columns from the data.
  
<center><img src='../_images/data-preparation-in-pyspark.png' alt='img' width='740'></center>
  
**Dropping columns**
  
There are two approaches to doing this: either you can `drop()` the columns that you don't want or you can `select()` the fields which you do want to retain. Either way, the resulting data does not include those columns.
  
<center><img src='../_images/data-preparation-in-pyspark1.png' alt='img' width='740'></center>
  
**Filtering out missing data**
  
Earlier you saw that there is a missing value in the cylinders column. Let's check to see how many other missing values there are. You'll use the `.filter()` method and provide a logical predicate using SQL syntax which identifies NULL values. Then the `.count()` method tells you how many records match that predicate. Just one. In this case it makes sense to simply remove the record with the missing value. There are a couple of ways that you could do this. You could use the `.filter()` method again with a different predicate. Or you could take a more aggressive approach and use the `.dropna()` method to drop all records with missing values in any column. However, this should be done with care because it could result in the loss of a lot of otherwise useful data. You've now stripped down the data to what's needed to build a model.
  
<center><img src='../_images/data-preparation-in-pyspark2.png' alt='img' width='740'></center>
  
**Mutating columns**
  
At present the weight and length columns are in units of pounds and inches respectively. You'll use the `.withColumn()` method to create a new mass column in units of kilograms. The `round()` function is used to limit the precision of the result. You can also use the `.withColumn()` method to replace the existing length column with values in meters. You now have mass and length in metric units.
  
<center><img src='../_images/data-preparation-in-pyspark3.png' alt='img' width='740'></center>
  
**Indexing categorical data**
  
The type column consists of strings which represent six categories of vehicle type. You'll need to transform those strings into numbers. You do this using an instance of the `StringIndexer` class. In the constructor you provide the name of the string input column and a name for the new output column to be created. The indexer is first fit to the data, creating a `StringIndexerModel`. During the fitting process the distinct string values are identified and an index is assigned to each value. The model is then used to transform the data, creating a new column with the index values. By default the index values are assigned according to the descending relative frequency of each of the string values. Midsize is most common, so it gets an index of zero. Small is next most common, so its index is one. And so on. It's possible to choose different strategies for assigning index values by specifying the `stringOrderType` argument. Rather than using frequency of occurrence, strings can be ordered alphabetically. It's also possible to choose between ascending and descending order.
  
<center><img src='../_images/data-preparation-in-pyspark4.png' alt='img' width='740'></center>
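
The default ordering rule can be mimicked in plain Python to make it concrete. This is only a sketch of the frequency-descending rule described above, not Spark's implementation: the sample `types` list and the helper `frequency_desc_index` are made up for illustration, and ties are assumed to be broken alphabetically.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct string to an index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties (assumption)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical sample of vehicle types
types = ['Midsize', 'Small', 'Midsize', 'Compact', 'Small', 'Midsize']
mapping = frequency_desc_index(types)
indexed = [mapping[t] for t in types]
print(mapping)
print(indexed)
```

In this sample "Midsize" occurs most often, so it receives index 0, followed by "Small" and then "Compact".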
  
**Indexing country of origin**
  
You'll be building a classifier to predict whether or not a car was manufactured in the USA. So the origin column also needs to be converted from strings into numbers.
  
<center><img src='../_images/data-preparation-in-pyspark5.png' alt='img' width='740'></center>
  
**Assembling columns**
  
The final step in preparing the cars data is to consolidate the various input columns into a single column. This is necessary because the Machine Learning algorithms in Spark operate on a single column of predictors, although each element in that column may be a vector consisting of multiple values. To illustrate the process you'll start with just a pair of features, cylinders and size. First you create an instance of the `VectorAssembler` class, providing it with the names of the columns that you want to consolidate and the name of the new output column. The assembler is then used to transform the data. Taking a look at the relevant columns you see that the new "features" column consists of values from the cylinders and size columns consolidated into a vector. Ultimately you are going to assemble all of the predictors into a single column.
  
<center><img src='../_images/data-preparation-in-pyspark6.png' alt='img' width='740'></center>
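
What "consolidating columns into a vector" means for a single record can be sketched without Spark. The rows, the column names `cyl` and `size`, and the helper `assemble` below are hypothetical stand-ins; they only illustrate the shape of the result, not how `VectorAssembler` works internally.

```python
# Hypothetical rows standing in for two cars
rows = [
    {'cyl': 4, 'size': 1.8, 'label': 1},
    {'cyl': 6, 'size': 3.2, 'label': 0},
]

def assemble(row, input_cols, output_col='features'):
    """Return a copy of the row with the inputs gathered into one vector."""
    vector = [float(row[c]) for c in input_cols]
    return {**row, output_col: vector}

assembled = [assemble(r, ['cyl', 'size']) for r in rows]
for r in assembled:
    print(r['features'], r['label'])
```

Each record keeps its original columns and gains one extra column holding all of the chosen predictors.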
  
**Let's practice!**
  
Let's try out what we have learned on the SMS and flights data.

### Removing columns and rows
  
You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.
  
In this exercise you need to trim those data down by:
  
1. removing an uninformative column and
2. removing rows which do not have information about whether or not a flight was delayed.
  
The data are available as `flights`.
  
---
  
1. Remove the flight column.
2. Find out how many records have missing values in the delay column.
3. Remove records with missing values in the delay column.
4. Remove records with missing values in any column and get the number of remaining rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('flights').getOrCreate()

# Read data from CSV file
flights = spark.read.csv('../_datasets/flights-larger.csv', sep=',', header=True,
                         inferSchema=True,
                         nullValue='NA')

flights.show(5)
```

```
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 10| 10|  1|     OO|  5836|ORD| 157|  8.18|      51|   27|
|  1|  4|  1|     OO|  5866|ORD| 466|  15.5|     102| null|
| 11| 22|  1|     OO|  6016|ORD| 738|  7.17|     127|  -19|
|  2| 14|  5|     B6|   199|JFK|2248| 21.17|     365|   60|
|  5| 25|  3|     WN|  1675|SJC| 386| 12.92|      85|   22|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows
```

```python
# Remove the 'flight' column
flights_drop_column = flights.drop('flight')

# Number of records with missing 'delay' values
print(flights_drop_column.filter('delay IS NULL').count())

# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.dropna()
print(flights_none_missing.count())
```

```
16711
258289
```

You've discarded the columns and rows which will certainly not contribute to a model.

### Column manipulation
  
The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.
  
The next step of preparing the flight data has two parts:
  
1. convert the units of distance, replacing the `mile` column with a `km` column; and
2. create a Boolean column indicating whether or not a flight was delayed.
  
---
  
1. Import a function which will allow you to round a number to a specific number of decimal places.
2. Derive a new `km` column from the `mile` column, rounding to zero decimal places. One mile is 1.60934 km.
3. Remove the `mile` column.
4. Create a `label` column with a value of 1 indicating the delay was 15 minutes or more and 0 otherwise. Think carefully about the logical condition.

```python
from pyspark.sql.functions import round

# Convert 'mile' to 'km' and drop 'mile' column
flights_km = flights_none_missing.withColumn('km', round(flights_none_missing.mile * 1.60934, 0)).drop('mile')

# Create 'label' column indicating whether flight delayed (1) or not(0)
flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))

# Check first five records
flights_km.show(5)
```

```
+---+---+---+-------+---+------+--------+-----+------+-----+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|
+---+---+---+-------+---+------+--------+-----+------+-----+
| 10| 10|  1|     OO|ORD|  8.18|      51|   27| 253.0|    1|
| 11| 22|  1|     OO|ORD|  7.17|     127|  -19|1188.0|    0|
|  2| 14|  5|     B6|JFK| 21.17|     365|   60|3618.0|    1|
|  5| 25|  3|     WN|SJC| 12.92|      85|   22| 621.0|    1|
|  3| 28|  1|     B6|LGA| 13.33|     182|   70|1732.0|    1|
+---+---+---+-------+---+------+--------+-----+------+-----+
only showing top 5 rows
```

Fifteen minutes seems like quite a wide margin, but who are you to argue with the FAA?
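
The `km` and `label` values in the output above can be checked by hand. The sketch below redoes the arithmetic in plain Python for the four rows of the first `flights.show(5)` listing that have a known delay. The helper names are hypothetical, and `Decimal` with `ROUND_HALF_UP` is used because Python's built-in `round()` rounds halves to the nearest even number, whereas Spark's `round` is documented as rounding half up.

```python
from decimal import Decimal, ROUND_HALF_UP

KM_PER_MILE = Decimal('1.60934')

def miles_to_km(miles):
    """Convert miles to km, rounded half-up to zero decimal places."""
    km = (Decimal(miles) * KM_PER_MILE).quantize(Decimal('1'), rounding=ROUND_HALF_UP)
    return float(km)

def delay_label(delay_minutes):
    """1 if the flight was delayed by 15 minutes or more, else 0."""
    return int(delay_minutes >= 15)

# (mile, delay) pairs copied from the first listing of the flights data
samples = [(157, 27), (738, -19), (2248, 60), (386, 22)]
checked = [(miles_to_km(m), delay_label(d)) for m, d in samples]
print(checked)
```

The results agree with the first four rows of the `km` and `label` columns shown above. Note that a delay of exactly 15 minutes counts as delayed, which is why the condition uses `>=`.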

### Categorical columns
  
In the flights data there are two columns, `carrier` and `org`, which hold categorical data. You need to transform those columns into indexed numerical values.
  
---
  
1. Import the appropriate class and create an indexer object to transform the `carrier` column from a string to a numeric index.
2. Prepare the indexer object on the flight data.
3. Use the prepared indexer to create the numeric index column.
4. Repeat the process for the `org` column.

```python
from pyspark.ml.feature import StringIndexer

# Create an indexer
indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights_km)

# Indexer creates a new column with numeric index values
flights_indexed = indexer_model.transform(flights_km)

# Repeat the process for the other categorical feature
flights_indexed = StringIndexer(inputCol='org', outputCol='org_idx').fit(flights_indexed).transform(flights_indexed)
```

Our Machine Learning model needs numbers not strings, so these transformations are vital!

### Assembling columns
  
The final stage of data preparation is to consolidate all of the predictor columns into a single column.
  
An updated version of the flights data, which takes into account all of the changes from the previous few exercises, has the following predictor columns:
  
- `mon`, `dom` and `dow`
- `carrier_idx` (indexed value from carrier)
- `org_idx` (indexed value from org)
- `km`
- `depart`
- `duration`
  
Note: The `truncate=False` argument to the `show()` method prevents data being truncated in the output.
  
---
  
1. Import the class which will assemble the predictors.
2. Create an assembler object that will allow you to merge the predictor columns into a single column.
3. Use the assembler to generate a new consolidated column.

```python
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=[
    'mon', 'dom', 'dow',
    'carrier_idx',
    'org_idx',
    'km', 'depart', 'duration'
], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights_indexed)

# Check the resulting column
flights_assembled.select('features', 'delay').show(5, truncate=False)
```

```
+-----------------------------------------+-----+
|features                                 |delay|
+-----------------------------------------+-----+
|[10.0,10.0,1.0,2.0,0.0,253.0,8.18,51.0]  |27   |
|[11.0,22.0,1.0,2.0,0.0,1188.0,7.17,127.0]|-19  |
|[2.0,14.0,5.0,4.0,2.0,3618.0,21.17,365.0]|60   |
|[5.0,25.0,3.0,3.0,5.0,621.0,12.92,85.0]  |22   |
|[3.0,28.0,1.0,4.0,3.0,1732.0,13.33,182.0]|70   |
+-----------------------------------------+-----+
only showing top 5 rows
```

The data is now ready for building our first Machine Learning model. You've worked hard to get this sorted: well done!

### Decision Tree
  
Your first Machine Learning model will be a Decision Tree. This is probably the most intuitive model, so it seems like a good place to start.
  
**Anatomy of a Decision Tree: Root node**
  
A Decision Tree is constructed using an algorithm called "Recursive Partitioning". Consider a hypothetical example in which you build a Decision Tree to divide data into two classes, green and blue. You start by putting all of the records into the root node. Suppose that there are more green records than blue, in which case this node will be labelled "green". Now from amongst the predictors in the data you need to choose the one that will result in the most informative split of the data into two groups. Ideally you want the groups to be as homogeneous (or "pure") as possible: one should be mostly green and the other should be mostly blue.
  
**Anatomy of a Decision Tree: First split**
  
Once you have identified the most informative predictor, you split the data into two sets, labeled "green" or "blue" according to the dominant class. And this is where the recursion kicks in: you then apply exactly the same procedure on each of the child nodes, selecting the most informative predictor and splitting again.
  
**Anatomy of a Decision Tree: Second split**
  
So, for example, the green node on the left could be split again into two groups.
  
**Anatomy of a Decision Tree: Third split**
  
And the resulting green node could once again be split. The depth of each branch of the tree need not be the same. There are a variety of stopping criteria which can cause splitting to stop along a branch. For example, if the number of records in a node falls below a threshold or the purity of a node is above a threshold, then you might stop splitting. Once you have built the Decision Tree you can use it to make predictions for new data by following the splits from the root node along to the tip of a branch. The label for the final node would then be the prediction for the new data.
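
To make "most informative split" concrete, here is a minimal sketch of choosing a split point on one predictor. The text does not say which purity measure is used, so Gini impurity is assumed here as one common choice. The data and the helper names `gini` and `best_threshold` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try each midpoint between sorted values; return the purest split."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # (threshold, weighted impurity)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical single predictor with green/blue labels
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = ['green', 'green', 'green', 'blue', 'blue', 'green']
threshold, impurity = best_threshold(x, y)
print(threshold, round(impurity, 3))
```

Recursive partitioning would repeat this search over every predictor, split on the best one, and then apply the same procedure to each child node until a stopping criterion is met.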
  
**Classifying cars**
  
Let's make this more concrete by looking at the cars data. You've transformed the country of origin column into a numeric index called 'label', with zero corresponding to cars manufactured in the USA and one for everything else. The remaining columns have all been consolidated into a column called 'features'. You want to build a Decision Tree which will use "features" to predict "label".
  
<center><img src='../_images/classification-in-pyspark.png' alt='img' width='740'></center>
  
**Split train/test**
  
An important aspect of building a Machine Learning model is being able to assess how well it works. In order to do this we use the `.randomSplit()` method to randomly split our data into two sets, a training set and a testing set. The proportions may vary, but generally you're looking at something like an 80:20 split, which means that the training set ends up having around 4 times as many records as the testing set.
  
<center><img src='../_images/classification-in-pyspark1.png' alt='img' width='740'></center>
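
The split proportions are approximate because each record is assigned to a set at random. The sketch below imitates that idea in plain Python. It is not Spark's algorithm, and the helper `random_split` is a hypothetical name, but it shows why an 80:20 split yields only roughly 80% of the records in the training set.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split independently, in proportion to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the cumulative sum
            splits[-1].append(row)
    return splits

rows = list(range(10000))
train, test = random_split(rows, [0.8, 0.2], seed=17)
print(len(train) / len(rows))
```

Fixing the seed makes the split repeatable, which matters when you want to compare models on exactly the same training and testing sets.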
  
**Build a Decision Tree model**
  
Finally the moment has come: you're going to build a Decision Tree. You start by creating a `DecisionTreeClassifier()` object. The next step is to fit the model to the training data by calling the `.fit()` method.
  
<center><img src='../_images/classification-in-pyspark2.png' alt='img' width='740'></center>
  
**Evaluating**
  
Now that you've trained the model you can assess how effective it is by making predictions on the test set and comparing the predictions to the known values. The `.transform()` method adds new columns to the DataFrame. The prediction column gives the class assigned by the model. You can compare this directly to the known labels in the testing data. Although the model gets the first example wrong, it's correct for the following four examples. There's also a probability column which gives the probabilities assigned to each of the outcome classes. For the first example, the model predicts that the outcome is 0 with probability 96%.
  
<center><img src='../_images/classification-in-pyspark3.png' alt='img' width='740'></center>
  
**Confusion matrix**
  
A good way to understand the performance of a model is to create a confusion matrix which gives a breakdown of the model predictions versus the known labels. The confusion matrix consists of four counts which are labelled as follows:

- "positive" indicates a prediction of 1, while
- "negative" indicates a prediction of 0 and
- "true" corresponds to a correct prediction, while
- "false" designates an incorrect prediction.

In this case the true positives and true negatives dominate but the model still makes a number of incorrect predictions. These counts can be used to calculate the accuracy, which is the proportion of correct predictions. For our model the accuracy is 74%.
  
<center><img src='../_images/classification-in-pyspark4.png' alt='img' width='740'></center>
  
**Let's build Decision Trees!**
  
So, now that you know how to build a Decision Tree model with Spark, you can try that out on the flight data.

### Train/test split
  
To objectively assess a Machine Learning model you need to be able to test it on an independent set of data. You can't use the same data that you used to train the model: of course the model will perform (relatively) well on those data!
  
You will split the data into two components:
  
- training data (used to train the model) and
- testing data (used to test the model).
  
Note: From here on you'll be working with a smaller subset of the `flights` data, which just makes the exercises run more quickly.
  
---
  
1. Randomly split the `flights` data into two sets with 80:20 proportions. For repeatability set a random number seed for the split (the code below uses 17).
2. Check that the training data has roughly 80% of the records from the original data.

```python
# Split into training and test sets in an 80:20 ratio
flights_train, flights_test = flights_assembled.randomSplit([0.8, 0.2], seed=17)

# Check that training set has around 80% of records
training_ratio = flights_train.count() / flights_assembled.count()
print(training_ratio)
```

```
0.7996856234682855
```

The ratio looks as expected. You're ready to train and test a Decision Tree model!

### Build a Decision Tree
  
Now that you've split the flights data into training and testing sets, you can use the training set to fit a Decision Tree model.
  
The data are available as `flights_train` and `flights_test`.
  
NOTE: It will take a few seconds for the model to train… please be patient!
  
---
  
1. Import the class for creating a Decision Tree classifier.
2. Create a classifier object and fit it to the training data.
3. Make predictions for the testing data and take a look at the predictions.

```python
from pyspark.ml.classification import DecisionTreeClassifier

# Create a classifier object and fit to the training data
tree = DecisionTreeClassifier()
tree_model = tree.fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
prediction = tree_model.transform(flights_test)
prediction.select('label', 'prediction', 'probability').show(5, False)
```

```
+-----+----------+----------------------------------------+
|label|prediction|probability                             |
+-----+----------+----------------------------------------+
|1    |0.0       |[0.5576430401366353,0.44235695986336465]|
|1    |0.0       |[0.5576430401366353,0.44235695986336465]|
|0    |1.0       |[0.37154355176634096,0.628456448233659] |
|1    |1.0       |[0.37154355176634096,0.628456448233659] |
|0    |0.0       |[0.6310591646280692,0.3689408353719308] |
+-----+----------+----------------------------------------+
only showing top 5 rows
```

Congratulations! You've built your first Machine Learning model with PySpark. Now to test!
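
The link between the `probability` and `prediction` columns in the output above can be verified by hand. Assuming no custom class thresholds have been set, the predicted class is the index of the largest probability. The helper `predicted_class` is a hypothetical name used only for this check.

```python
# Distinct probability vectors copied from the output above
probabilities = [
    [0.5576430401366353, 0.44235695986336465],
    [0.37154355176634096, 0.628456448233659],
    [0.6310591646280692, 0.3689408353719308],
]

def predicted_class(probability):
    """Index of the largest class probability, as a float like the Spark column."""
    return float(max(range(len(probability)), key=probability.__getitem__))

predictions = [predicted_class(p) for p in probabilities]
print(predictions)
```

Several rows share identical probability vectors because every record that ends up in the same leaf of the tree receives the same class proportions.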

### Evaluate the Decision Tree
  
You can assess the quality of your model by evaluating how well it performs on the testing data. Because the model was not trained on these data, this represents an objective assessment of the model.
  
A confusion matrix gives a useful breakdown of predictions versus known values. It has four cells which represent the counts of:
  
- True Negatives (TN) — model predicts negative outcome & known outcome is negative
- True Positives (TP) — model predicts positive outcome & known outcome is positive
- False Negatives (FN) — model predicts negative outcome but known outcome is positive
- False Positives (FP) — model predicts positive outcome but known outcome is negative.
  
These counts (TN, TP, FN and FP) should sum to the number of records in the testing data, which is only a subset of the flights data. You can compare the sum to the number of records in the testing data, which is `flights_test.count()`.
  
Note: These predictions are made on the testing data, so the counts are smaller than they would have been for predictions on the training data.
  
---
  
1. Create a confusion matrix by counting the combinations of label and prediction. Display the result.
2. Count the number of True Negatives, True Positives, False Negatives and False Positives.
3. Calculate the accuracy.

```python
# Create a confusion matrix
prediction.groupBy('label', 'prediction').count().show()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label = 1').count()
FP = prediction.filter('prediction = 1 AND label = 0').count()

# Accuracy measures the proportion of correct predictions
accuracy = (TN + TP) / (TN + TP + FN + FP)
print(accuracy)
```

```
+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0| 7704|
|    0|       0.0|14460|
|    1|       1.0|18411|
|    0|       1.0|11164|
+-----+----------+-----+

0.6353234503952531
```

The accuracy is decent but there are a lot of false predictions. We can make this model better!

### Logistic Regression
  
You've learned to build a Decision Tree. But it's good to have options. Logistic Regression is another commonly used classification model.
  
**Logistic Curve**
  
Logistic Regression uses a logistic function to model a binary target, where the target states are usually denoted by 1 and 0 or TRUE and FALSE. The maths of the model are outside the scope of this course, but this is what the logistic function looks like. For a Logistic Regression model the x-axis is a linear combination of predictor variables and the y-axis is the output of the model. Since the value of the logistic function is a number between zero and one, it's often thought of as a probability. In order to translate this number into one or the other of the target states it's compared to a threshold, which is normally set at one half.
  
<center><img src='../_images/logistic-regression-in-pyspark.png' alt='img' width='740'></center>
  
If the number is above the threshold then the predicted state is one.
  
<center><img src='../_images/logistic-regression-in-pyspark1.png' alt='img' width='740'></center>
  
Conversely, if it's below the threshold then the predicted state is zero. The model derives coefficients for each of the numerical predictors. Those coefficients might...
  
<center><img src='../_images/logistic-regression-in-pyspark2.png' alt='img' width='740'></center>
  
shift the curve to the right...
  
<center><img src='../_images/logistic-regression-in-pyspark3.png' alt='img' width='740'></center>
  
or to the left. They might make the transition between states...
  
<center><img src='../_images/logistic-regression-in-pyspark4.png' alt='img' width='740'></center>
  
more gradual...
  
<center><img src='../_images/logistic-regression-in-pyspark5.png' alt='img' width='740'></center>
  
or more rapid. These characteristics are all extracted from the training data and will vary from one set of data to another.
  
<center><img src='../_images/logistic-regression-in-pyspark6.png' alt='img' width='740'></center>
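The behaviour of the curve can be sketched in a few lines of plain Python. This is a minimal illustration, not how Spark fits the model: the helper names `logistic` and `predict` and the intercept and slope values are all assumptions made up for this example.

```python
import math

def logistic(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, intercept, slope, threshold=0.5):
    """Return (probability, predicted state) for a single numerical predictor x."""
    z = intercept + slope * x          # linear combination of the predictor
    p = logistic(z)
    return p, int(p > threshold)       # compare to the threshold to get 1 or 0

# Hypothetical coefficients, chosen only to show their effect on the curve
print(predict(1.0, intercept=0.0, slope=1.0))    # baseline curve
print(predict(1.0, intercept=-2.0, slope=1.0))   # negative intercept shifts the curve to the right
print(predict(1.0, intercept=0.0, slope=5.0))    # larger slope makes the transition more rapid
```

With the same input, changing the intercept moves the point at which the prediction flips from 0 to 1, while changing the slope controls how quickly the probability moves between the two states.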
  
**Cars revisited**
  
Let's make this more concrete by returning to the cars data. You'll focus on the numerical predictors for the moment and return to categorical predictors later on. As before you prepare the data by consolidating the predictors into a single column and then randomly splitting the data into training and testing sets.
  
<center><img src='../_images/logistic-regression-in-pyspark7.png' alt='img' width='740'></center>
  
**Build a Logistic Regression model**
  
To build a Logistic Regression model you first need to import the associated class and then create a classifier object. This is then fit to the training data using the `.fit()` method.
  
<center><img src='../_images/logistic-regression-in-pyspark8.png' alt='img' width='740'></center>
  
**Predictions**
  
With a trained model you are able to make predictions on the testing data. As you saw with the Decision Tree, the `.transform()` method adds the prediction and probability columns. The probability column gives the predicted probability of each class, while the prediction column reflects the predicted label, which is derived from the probabilities by applying the threshold mentioned earlier.
  
<center><img src='../_images/logistic-regression-in-pyspark9.png' alt='img' width='740'></center>
  
**Precision and recall**
  
You can assess the quality of the predictions by forming a confusion matrix. The quantities in the cells of the matrix can then be used to form some informative ratios. Recall that a positive prediction indicates that a car is manufactured outside of the USA and that predictions are considered to be true or false depending on whether they are correct or not. Precision is the proportion of positive predictions which are correct. For your model, two thirds of predictions for cars manufactured outside of the USA are correct. Recall is the proportion of positive targets which are correctly predicted. Your model also identifies 80% of cars which are actually manufactured outside of the USA. Bear in mind that these metrics are based on a relatively small testing set.
  
<center><img src='../_images/logistic-regression-in-pyspark10.png' alt='img' width='740'></center>
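As a quick sanity check, the two ratios can be computed directly from the cells of a confusion matrix. The counts below are hypothetical: they were picked only so that the ratios match the two thirds and 80% quoted above, and they are not the actual counts from the cars model.

```python
# Hypothetical confusion matrix counts (assumed for illustration)
TP = 8    # predicted positive, actually positive
FP = 4    # predicted positive, actually negative
FN = 2    # predicted negative, actually positive
TN = 10   # predicted negative, actually negative

# Precision: proportion of positive predictions which are correct
precision = TP / (TP + FP)

# Recall: proportion of positive targets which are correctly predicted
recall = TP / (TP + FN)

print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
```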
  
**Weighted metrics**
  
Another way of looking at these ratios is to weight them across the positive and negative predictions. You can do this by creating an evaluator object and then calling the `.evaluate()` method. This method accepts an argument which specifies the required metric. It's possible to request the weighted precision and recall as well as the overall accuracy. It's also possible to get the F1 metric, the harmonic mean of precision and recall, which is generally more robust than the accuracy. All of these metrics have assumed a threshold of one half. What happens if you vary that threshold?
  
<center><img src='../_images/logistic-regression-in-pyspark11.png' alt='img' width='740'></center>
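To make the harmonic mean concrete, here is a small sketch that computes F1 for the positive class from the precision and recall quoted earlier (two thirds and 80%). The helper `f1_score` is a name made up for this example. Note that this is the F1 for a single class; an evaluator that weights across both classes will generally report a different number.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(2 / 3, 0.8)
print(f"F1 = {f1:.3f}")
```

Because it is a harmonic mean, F1 is pulled towards the smaller of the two values, so a model cannot score well by maximising precision at the expense of recall, or vice versa.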
  
**ROC and AUC**
  
A threshold is used to decide whether the number returned by the Logistic Regression model translates into either the positive or the negative class. By default that threshold is set at a half. However, this is not the only choice. Choosing a larger or smaller value for the threshold will affect the performance of the model. The ROC curve plots the true positive rate versus the false positive rate as the threshold increases from zero (top right) to one (bottom left). The AUC summarizes the ROC curve in a single number. It's literally the area under the ROC curve. AUC indicates how well a model performs across all values of the threshold. An ideal model, that performs perfectly regardless of the threshold, would have AUC of 1. In an exercise we'll see how to use another evaluator to calculate the AUC.
  
<center><img src='../_images/logistic-regression-in-pyspark12.png' alt='img' width='740'></center>
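The idea of sweeping the threshold can be sketched in plain Python. This is an illustration of the concept only, not Spark's implementation: the functions `roc_points` and `auc` and the labels and scores are all hypothetical. The sweep here runs from a high threshold down to a low one, so the points are generated from bottom left to top right.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve, using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical true labels and model outputs
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

area = auc(roc_points(labels, scores))
print(f"AUC = {area:.3f}")
```

One negative example (score 0.7) outranks one positive example (score 0.6), so the area falls short of 1. If every positive scored higher than every negative, the AUC would be exactly 1.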
  
**Let's do Logistic Regression!**
  
You now know how to build a Logistic Regression model and assess the performance of that model using various metrics. Let's give this a try!

### Build a Logistic Regression model
  
You've already built a Decision Tree model using the flights data. Now you're going to create a Logistic Regression model on the same data.
  
The objective is to predict whether a flight is likely to be delayed by at least 15 minutes (label `1`) or not (label `0`).
  
Although you have a variety of predictors at your disposal, you'll only use the `mon`, `depart` and `duration` columns for the moment. These are numerical features which can immediately be used for a Logistic Regression model. You'll need to do a little more work before you can include categorical features. Stay tuned!
  
The data have been split into training and testing sets and are available as `flights_train` and `flights_test`.
  
---
  
1. Import the class for creating a Logistic Regression classifier.
2. Create a classifier object and train it on the training data.
3. Make predictions for the testing data and create a confusion matrix.

In [12]:
from pyspark.ml.classification import LogisticRegression

# Select the numeric columns along with the assembled features and the label
flights_train_num = flights_train.select('mon', 'depart', 'duration', 'features', 'label')
flights_test_num = flights_test.select('mon', 'depart', 'duration', 'features', 'label')

# Create classifier object and train on training data
logistic = LogisticRegression().fit(flights_train_num)

# Create predictions for the test data and show the confusion matrix
prediction = logistic.transform(flights_test_num)
prediction.groupBy("label", "prediction").count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0| 9455|
|    0|       0.0|14931|
|    1|       1.0|16660|
|    0|       1.0|10693|
+-----+----------+-----+

Now let's unpack that confusion matrix.

### Evaluate the Logistic Regression model
  
Accuracy is generally not a very reliable metric because it can be biased by the most common target class.
  
There are two other useful metrics:
  
- precision
- recall
  
Check the slides for this lesson to get the relevant expressions.
  
Precision is the proportion of positive predictions which are correct. For all flights which are predicted to be delayed, what proportion is actually delayed?
  
Recall is the proportion of positive outcomes which are correctly predicted. For all delayed flights, what proportion is correctly predicted by the model?
  
The precision and recall are generally formulated in terms of the positive target class. But it's also possible to calculate weighted versions of these metrics which look at both target classes.
  
The components of the confusion matrix are available as `TN`, `TP`, `FN` and `FP`, as well as the object `prediction`.
  
---
  
1. Find the precision and recall.
2. Create a multi-class evaluator and evaluate weighted precision.
3. Create a binary evaluator and evaluate AUC using the `"areaUnderROC"` metric.

In [13]:
# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label = 1').count()
FP = prediction.filter('prediction = 1 AND label = 0').count()

                                                                                

In [17]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

# Calculate precision and recall
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print('precision \t= {:.2f}\nrecall \t\t= {:.2f}'.format(precision, recall))

# Find weighted precision
multi_evaluator = MulticlassClassificationEvaluator()
weighted_precision = multi_evaluator.evaluate(prediction, {multi_evaluator.metricName: "weightedPrecision"})

# Find AUC
binary_evaluator = BinaryClassificationEvaluator()
auc = binary_evaluator.evaluate(prediction, {binary_evaluator.metricName: "areaUnderROC"})


precision 	= 0.61
recall 		= 0.64


                                                                                

In [18]:
print(weighted_precision)
print(auc)

0.6106605467579678
0.6504928839090344


The weighted precision averages the precision for each class (positive and negative), weighting each class by its share of the actual labels.
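To see where that number comes from, the weighted precision can be rebuilt by hand from the four confusion matrix counts shown earlier for the flights model. This is a sketch under the assumption that each class's precision is weighted by the fraction of true labels in that class; the variable names are made up for this example.

```python
# Counts copied from the flights confusion matrix above
TN, TP, FN, FP = 14931, 16660, 9455, 10693
total = TN + TP + FN + FP

# Precision for each class
precision_pos = TP / (TP + FP)   # label 1 (delayed)
precision_neg = TN / (TN + FN)   # label 0 (not delayed)

# Weight each class by its share of the actual labels
weight_pos = (TP + FN) / total
weight_neg = (TN + FP) / total

weighted_precision = weight_pos * precision_pos + weight_neg * precision_neg
accuracy = (TP + TN) / total

print(f"weighted precision = {weighted_precision:.4f}")
print(f"accuracy           = {accuracy:.4f}")
```

The hand-computed value agrees with the evaluator's output of roughly 0.6107. The accuracy is close to it for these data, but it is a separate quantity.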

### Turning Text into Tables
  
It's said that 80% of Machine Learning is data preparation. As we'll see in this lesson, this is particularly true for text data. Before you can use Machine Learning algorithms you need to take unstructured text data and create structure, ultimately transforming the data into a table.
  
**One record per document**
  
We start with a collection of documents. These documents might be anything from a short snippet of text, like an SMS or email, to a lengthy report or book. Each document will become a record in the table.
  
<center><img src='../_images/turning-text-into-tables-pyspark.png' alt='img' width='740'></center>
  
**One document, many columns**
  
The text in each document will be mapped to columns in the table. First the text is split into words or tokens. You then remove short or common words that convey little information. The table will then indicate the number of times that each of the remaining words occurred in the text. This table is also known as a "term-document matrix". There are some nuances to the process, but that's the central idea.
  
<center><img src='../_images/turning-text-into-tables-pyspark1.png' alt='img' width='740'></center>
  
**A selection of children's books**
  
Suppose that your documents are the names of children's books. The raw data might look like this. Your job will be to transform these data into a table with one row per document and a column for each of the words.
  
<center><img src='../_images/turning-text-into-tables-pyspark2.png' alt='img' width='740'></center>
  
**Removing punctuation**
  
You're interested in words, not punctuation. You'll use regular expressions (or REGEX), a mini-language for pattern matching, to remove the punctuation symbols. Regular expressions are another big topic and outside the scope of this course, but basically you provide a list of symbols or a text pattern to match. The hyphen is escaped with backslashes because it has another meaning in the context of regular expressions. By escaping it you tell Spark to interpret the hyphen literally. You need to specify a column name, `books.text`, a pattern to be matched (stored in the variable `REGEX`), and the replacement text, which is simply a space. You now have some double spaces, but you can use REGEX to clean those up too.
  
<center><img src='../_images/turning-text-into-tables-pyspark3.png' alt='img' width='740'></center>
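The same two replacements can be tried out on a single string with Python's built-in `re` module. This is only a local sketch: Spark's `regexp_replace` works on a DataFrame column, whereas `strip_punctuation` is a hypothetical helper operating on one string, and the titles are made-up examples. The character class mirrors the one used for the SMS data later in this chapter.

```python
import re

# Character class of punctuation symbols; \- is an escaped (literal) hyphen
REGEX = r'[_():;,.!?\-]'

def strip_punctuation(text):
    """Replace punctuation with spaces, then merge runs of spaces into one."""
    text = re.sub(REGEX, ' ', text)
    return re.sub(' +', ' ', text)

print(strip_punctuation("Forever, or a Long, Long Time"))
print(strip_punctuation("Winnie-the-Pooh"))
```

Without the backslash, a hyphen between two characters inside square brackets would be read as a range (as in `0-9`) rather than as a literal hyphen.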
  
**Text to tokens**
  
Next you split the text into words or tokens. You create a tokenizer object, giving it the name of the input column containing the text and the output column which will contain the tokens. The tokenizer is then applied to the text using the `.transform()` method. In the results you see a new column in which each document has been transformed into a list of words. As a side effect the words have all been reduced to lower case.
  
<center><img src='../_images/turning-text-into-tables-pyspark4.png' alt='img' width='740'></center>
  
**What are stop words?**
  
Some words occur frequently in all of the documents. These common or "stop" words convey very little information, so you will also remove them using an instance of the `StopWordsRemover` class. This contains a list of stop words which can be customized if necessary.
  
<center><img src='../_images/turning-text-into-tables-pyspark5.png' alt='img' width='740'></center>
  
**Removing stop words**
  
Since you didn't give the input and output column names earlier, you specify them now and then apply the `.transform()` method. You could also have given these names when you created the remover.
  
<center><img src='../_images/turning-text-into-tables-pyspark6.png' alt='img' width='740'></center>
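Tokenizing and stop word removal are simple enough to mimic in plain Python, which can help build intuition for what the Spark transformers produce. The sketch below is an approximation: `tokenize` and `remove_stop_words` are hypothetical helpers, and the stop word list is a tiny made-up sample rather than the list shipped with `StopWordsRemover`.

```python
# A tiny, hypothetical stop word list (the real list is much longer)
STOP_WORDS = {"a", "an", "and", "in", "of", "or", "the", "to"}

def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in STOP_WORDS]

tokens = tokenize("Forever or a Long Long Time")
terms = remove_stop_words(tokens)

print(tokens)
print(terms)
```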
  
**Feature hashing**
  
Your documents might contain a large variety of words, so in principle our table could end up with an enormous number of columns, many of which would be only sparsely populated. It would also be handy to convert the words into numbers. Enter the hashing trick, which in simple terms converts words into numbers. You create an instance of the `HashingTF` class, providing the names of the input and output columns. You also give the number of features, which is effectively the largest number that will be produced by the hashing trick. This needs to be sufficiently big to capture the diversity in the words. The output in the hash column is presented in sparse format, which we will talk about more later on. For the moment though it's enough to note that there are two lists. The first list contains the hashed values and the second list indicates how many times each of those values occurs. For example, in the first document the word "long" has a hash of 8 and occurs twice. Similarly, the word "five" has a hash of 6 and occurs once in each of the last two documents.
  
<center><img src='../_images/turning-text-into-tables-pyspark7.png' alt='img' width='740'></center>
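Here is a rough sketch of the hashing trick in plain Python. It uses `zlib.crc32` as a stand-in hash, which is an assumption made for illustration: Spark uses a different hash function, so the indices will not match those produced by `HashingTF`. The helper `hashing_tf` and the list of terms are also hypothetical.

```python
import zlib
from collections import Counter

def hashing_tf(terms, num_features=32):
    """Map each term to a bucket in [0, num_features) and count occurrences.

    Returns (size, indices, counts), mirroring the sparse format described above.
    """
    buckets = Counter(zlib.crc32(term.encode("utf-8")) % num_features
                      for term in terms)
    indices = sorted(buckets)
    counts = [float(buckets[i]) for i in indices]
    return num_features, indices, counts

size, indices, counts = hashing_tf(["forever", "long", "long", "time"])
print(size, indices, counts)
```

If `num_features` is too small, different words can land in the same bucket and their counts are added together, which is why the number of features needs to be large enough to capture the diversity in the words.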
  
**Dealing with common words**
  
The final step is to account for some words occurring frequently across many documents. If a word appears in many documents then it's probably going to be less useful for building a classifier. We want to weight the number of counts for a word in a particular document against how frequently that word occurs across all documents. To do this you reduce the effective count for more common words, giving what is known as the "inverse document frequency". Inverse document frequency is generated by the IDF class, which is first fit to the hashed data and then used to generate weighted counts. The word "five", for example, occurs in multiple documents, so its effective frequency is reduced. Conversely, the word "long" only occurs in one document, so its effective frequency is increased.
  
<center><img src='../_images/turning-text-into-tables-pyspark8.png' alt='img' width='740'></center>
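The weighting can be sketched by hand. The example below assumes the smoothed formula described in Spark's feature extraction guide, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. For readability it works on words rather than hashed values, and the helpers `idf_weights` and `tf_idf` and the documents are all hypothetical.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Inverse document frequency for every term in a list of tokenized documents."""
    m = len(documents)
    # Count the number of documents in which each term appears at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

def tf_idf(doc, idf):
    """Weight the term counts of one document by the inverse document frequency."""
    return {term: count * idf[term] for term, count in Counter(doc).items()}

# Hypothetical tokenized documents with stop words already removed
docs = [
    ["forever", "long", "long", "time"],
    ["five", "trouble"],
    ["five", "wonderful", "time"],
]

idf = idf_weights(docs)
print(idf["five"], idf["long"])
print(tf_idf(docs[0], idf))
```

A word that appears in two of the three documents ("five") receives a smaller weight than a word that appears in only one ("long").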
  
**Text ready for Machine Learning!**
  
The inverse document frequencies are precisely what we need for building a Machine Learning model. Let's do that with the SMS data.

### Punctuation, numbers and tokens
  
At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either "spam" (label 1) or "ham" (label 0). You're now going to use those data to build a classifier model.
  
But first you'll need to prepare the SMS messages as follows:
  
- remove punctuation and numbers
- tokenize (split into individual words)
- remove stop words
- apply the hashing trick
- convert to TF-IDF representation.
  
In this exercise you'll remove punctuation and numbers, then tokenize the messages.
  
The SMS data are available as `sms`.
  
---
  
1. Import the function for replacing regular expression matches and the class for tokenizing text.
2. Replace all punctuation characters in the `'text'` column with a space. Do the same for all numbers in the `'text'` column.
3. Split the `'text'` column into tokens. Name the output column `'words'`.

In [19]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv('../_datasets/sms.csv', sep=';', header=False, schema=schema)

In [22]:
from pyspark.sql.functions import regexp_replace
from pyspark.ml.feature import Tokenizer

# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', regexp_replace(sms.text, '[_():;,.!?\\-]', ' '))
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, '[0-9]', ' '))

# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))

# Split the text into words
wrangled = Tokenizer(inputCol='text', outputCol='words').transform(wrangled)

wrangled.show(4, truncate=False)

+---+----------------------------------+-----+------------------------------------------+
|id |text                              |label|words                                     |
+---+----------------------------------+-----+------------------------------------------+
|1  |Sorry I'll call later in meeting  |0    |[sorry, i'll, call, later, in, meeting]   |
|2  |Dont worry I guess he's busy      |0    |[dont, worry, i, guess, he's, busy]       |
|3  |Call FREEPHONE now                |1    |[call, freephone, now]                    |
|4  |Win a cash prize or a prize worth |1    |[win, a, cash, prize, or, a, prize, worth]|
+---+----------------------------------+-----+------------------------------------------+
only showing top 4 rows



Well done! Next you'll remove stop words and apply the hashing trick.

### Stop words and hashing
  
The next steps will be to remove stop words and then apply the hashing trick, converting the results into a TF-IDF.
  
A quick reminder about these concepts:
  
- The hashing trick provides a fast and space-efficient way to map a very large (possibly infinite) set of items (in this case, all words contained in the SMS messages) onto a smaller, finite number of values.
- The TF-IDF matrix reflects how important a word is to each document. It takes into account both the frequency of the word within each document and the frequency of the word across all of the documents in the collection.
  
The tokenized SMS data are stored in `sms` in a column named `'words'`. You've cleaned up the handling of spaces in the data so that the tokenized text is neater.
  
---
  
1. Import the `StopWordsRemover`, `HashingTF` and `IDF` classes.
2. Create a `StopWordsRemover` object (input column `'words'`, output column `'terms'`). Apply to `sms`.
3. Create a `HashingTF` object (input column is the output of the previous step, output column `'hash'`). Apply to `wrangled`.
4. Create an `IDF` object (input column is the output of the previous step, output column `'features'`). Apply to `wrangled`.

In [23]:
from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF

sms = wrangled.select('id', 'words', 'label')

# Remove stop words.
wrangled = StopWordsRemover(inputCol='words', outputCol='terms').transform(sms)

# Apply the hashing trick
wrangled = HashingTF(inputCol='terms', outputCol='hash', numFeatures=1024).transform(wrangled)

# Convert hashed symbols to TF-IDF
tf_idf = IDF(inputCol='hash', outputCol='features').fit(wrangled).transform(wrangled)

tf_idf.select('terms', 'features').show(4, truncate=False)

                                                                                

+--------------------------------+----------------------------------------------------------------------------------------------------+
|terms                           |features                                                                                            |
+--------------------------------+----------------------------------------------------------------------------------------------------+
|[sorry, call, later, meeting]   |(1024,[138,384,577,996],[2.273418200008753,3.6288353225642043,3.5890949939146903,4.104259019279279])|
|[dont, worry, guess, busy]      |(1024,[215,233,276,329],[3.9913186080986836,3.3790235241678332,4.734227298217693,4.58299632849377]) |
|[call, freephone]               |(1024,[133,138],[5.367951058306837,2.273418200008753])                                              |
|[win, cash, prize, prize, worth]|(1024,[31,47,62,389],[3.6632029660684124,4.754846585420428,4.072170704727778,7.064594791043114])    |
+--------------------------------+----------------------------------------------------------------------------------------------------+

Great! Now you're ready to build a spam classifier.

### Training a spam classifier
  
The SMS data have now been prepared for building a classifier. Specifically, this is what you have done:
  
- removed numbers and punctuation
- split the messages into words (or "tokens")
- removed stop words
- applied the hashing trick and
- converted to a TF-IDF representation.
  
Next you'll need to split the TF-IDF data into training and testing sets. Then you'll use the training data to fit a Logistic Regression model and finally evaluate the performance of that model on the testing data.
  
The data are stored in `sms` and `LogisticRegression` has been imported for you.
  
---
  
1. Split the data into training and testing sets in a 4:1 ratio. Set the random number `seed=` to 13 to ensure repeatability.
2. Create a `LogisticRegression` object and fit it to the training data.
3. Generate predictions on the testing data.
4. Use the predictions to form a confusion matrix.

In [24]:
sms = tf_idf.select('label', 'features')

# Split the data into training and test sets
sms_train, sms_test = sms.randomSplit([0.8, 0.2], seed=13)

# Fit a Logistic Regression model to the training data
logistic = LogisticRegression(regParam=0.2).fit(sms_train)

# Make predictions on the test data
prediction = logistic.transform(sms_test)

# Create a confusion matrix, comparing predictions to known labels
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0|   39|
|    0|       0.0|  932|
|    1|       1.0|  121|
|    0|       1.0|    4|
+-----+----------+-----+

Well played! Your classifier won't be fooled by spam SMS.
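To put numbers on that, you can compute the usual metrics from the confusion matrix above. This is a short sketch in plain Python using the four counts copied from the output; the variable names are chosen for this example.

```python
# Counts copied from the spam confusion matrix above
TN, TP, FN, FP = 932, 121, 39, 4

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)   # of messages flagged as spam, how many really are spam
recall = TP / (TP + FN)      # of actual spam messages, how many were flagged

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```

The precision is high, since only 4 ham messages were flagged as spam, while the recall is lower: 39 of the 160 spam messages in the testing set slipped through.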

In [25]:
spark.stop()