Note: The notes in this notebook were taken from DataCamp's "Machine Learning with PySpark" course.
# Machine Learning with PySpark
Spark is a powerful, general purpose tool for working with Big Data. Spark transparently handles the distribution of compute tasks across a cluster. This means that operations are fast, and it also allows you to focus on the analysis rather than worry about technical details. In this course you'll learn how to get data into Spark and then delve into three fundamental Spark Machine Learning topics: Linear Regression, Logistic Regression/Classifiers, and creating pipelines. Along the way you'll analyse a large dataset of flight delays and spam text messages. With this background you'll be ready to harness the power of Spark and apply it to your own Machine Learning projects!

**Instructor:** Andrew Collier, Data Scientist @ Exegetic Analytics

#### First, some recap:
Spark is a platform for cluster computing. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data.

As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel over the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.

However, with greater computing power comes greater complexity.

Deciding whether or not Spark is the best solution for your problem takes some experience, but you can consider questions like:

* Is my data too big to work with on a single machine?
* Can my calculations be easily parallelized?

The first step in using Spark is connecting to a cluster.

In practice, the cluster will be hosted on a remote machine that's connected to all other nodes. There will be one computer, called the master, that manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.

When you're just getting started with Spark it's simpler to just run a cluster locally. Thus, for this course, instead of connecting to another computer, all computations will be run on DataCamp's servers in a simulated cluster.

Creating the connection is as simple as creating an instance of the `SparkContext` class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

An object holding all these attributes can be created with the `SparkConf()` constructor. Take a look at the [documentation](https://spark.apache.org/docs/2.1.0/api/python/pyspark.html) for all the details!

#### Using DataFrames
Spark's core data structure is the **Resilient Distributed Dataset (RDD)**. This is a low level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so in this course you'll be using the Spark DataFrame abstraction built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). Not only are they easier to understand, DataFrames are also more optimized for complicated operations than RDDs.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!

To start working with Spark DataFrames, you first have to create a `SparkSession` object from your `SparkContext`. You can think of the `SparkContext` as your connection to the cluster and the `SparkSession` as your interface with that connection.

#### Put some Spark in your data
In the last exercise, you saw how to move data from Spark to `pandas`. However, maybe you want to go the other direction, and put a `pandas` DataFrame into a Spark cluster! The `SparkSession` class has a method for this as well.

The `.createDataFrame()` method takes a `pandas` DataFrame and returns a Spark DataFrame.

The output of this method is stored locally, not in the `SparkSession` catalog. This means that you can use all the Spark DataFrame methods on it, but you can't access the data in other contexts.

For example, a SQL query (using the `.sql()` method) that references your DataFrame will throw an error. To access the data in this way, you have to save it as a temporary table.

You can do this using the `.createTempView()` Spark DataFrame method, which takes as its only argument the name of the temporary table you'd like to register. This method registers the DataFrame as a table in the catalog, but as this table is temporary, it can only be accessed from the specific `SparkSession` used to create the Spark DataFrame.

There is also the method `.createOrReplaceTempView()`. This safely creates a new temporary table if nothing was there before, or updates an existing table if one was already defined. You'll use this method to avoid running into problems with duplicate tables.

Check out the diagram to see all the different ways your Spark data structures interact with each other.

<img src='data/spark_createTempView.png' width="400" height="200" align="center"/>

* Similar to `.withColumn()`, you can do column-wise computations within a `SELECT` statement. 
* `SELECT origin, dest, air_time / 60 FROM flights;`
* the following two expressions will produce the same output:
* `flights.filter("air_time > 120").show()`
* `flights.filter(flights.air_time > 120).show()`
* The difference between the `.select()` and `.withColumn()` methods is that `.select()` returns only the columns you specify, while `.withColumn()` returns all the columns of the DataFrame in addition to the one you defined. 
* At the core of the `pyspark.ml` module are the `Transformer` and `Estimator` classes. Almost every other class in the module behaves similarly to these two basic classes.
* `Transformer` classes have a `.transform()` method that takes a DataFrame and returns a new DataFrame; usually the original one with a new column appended. For example, you might use the class `Bucketizer` to create discrete bins from a continuous feature or the class `PCA` to reduce the dimensionality of your dataset using principal component analysis.
* `Estimator` classes all implement a `.fit()` method. These methods also take a DataFrame, but instead of returning another DataFrame they return a model object. This can be something like a `StringIndexerModel` for including categorical data saved as strings in your models, or a `RandomForestModel` that uses the random forest algorithm for classification or regression.
* Before you get started modeling, it's important to know that Spark's modeling routines only handle numeric data. That means all of the columns in your DataFrame must be either integers or decimals (called 'doubles' in Spark).
* To remedy this, you can use the `.cast()` method in combination with the `.withColumn()` method. It's important to note that `.cast()` works on columns, while `.withColumn()` works on DataFrames.
* In Spark it's important to make sure you split the data **after** all the transformations. This is because operations like `StringIndexer` don't always produce the same index even when given the same list of strings.

# $\star$ Chapter 1: Introduction

## Machine Learning and Spark
Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.

Here, we'll learn how to build Machine Learning models on large data sets using distributed computing techniques.

* The performance of an ML model depends on data; in general, more data is a good thing
* If the data can fit entirely in RAM then the algorithm can operate efficiently
* When the data no longer fit into memory, the computer will start to use **virtual memory** and the data will be **paged** back and forth between RAM and disk
    * Relative to **RAM** access, retrieving data from disk is slow
    * And as the size of the data grows, paging becomes more intense and the computer begins to spend more and more time waiting for data. Performance plummets.

<img src='data/datasize_RAM.png' width="600" height="300" align="center"/>

* One option is to **distribute the problem across multiple computers in a cluster;** rather than trying to handle a large dataset on a single machine, it's divided up into partitions which are processed separately.
    * Ideally each data partition can fit into RAM on a single computer in the cluster.
    * This is the approach used by **Spark**
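
The partition idea can be illustrated without Spark at all. The sketch below is plain Python, not Spark code: threads in one process stand in for the nodes of a cluster, and the helper `split_into_partitions` is a hypothetical name introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, n_partitions):
    """Split a list into n_partitions roughly equal, contiguous chunks."""
    size, remainder = divmod(len(data), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


data = list(range(1, 101))
partitions = split_into_partitions(data, 4)

# Each worker only sees its own partition and returns a partial result.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, partitions))

# The partial results are then combined into the final answer.
total = sum(partial_sums)
print(partial_sums, total)
```

Spark follows the same split, compute, combine pattern, but spreads the partitions over separate machines and handles the scheduling for you.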
    
#### What is Spark?
* Compute across a distributed cluster
* Data processing in memory
* Well-documented high-level API
* **It is generally much faster than other Big Data technologies like Hadoop, because it does most processing in memory.**
* **It has a developer-friendly interface which hides much of the complexity of distributed computing.**

#### Cluster components
* A cluster consists of one or more **nodes**
* Each **node** is a computer with CPU, RAM, and physical storage
* A **cluster manager** allocates resources and coordinates activity across the cluster
* Every application running on the Spark cluster has a **driver** program
* Using the **Spark API**, the driver communicates with the cluster manager, which in turn distributes work to the nodes.
* On each node, Spark launches an **executor** process which persists for the duration of the application
* Work is divided up into **tasks**, which are simply units of computation
* The executors run tasks in multiple **threads** across the **cores** in a node

<img src='data/cluster_structure.png' width="500" height="250" align="center"/>

#### Interacting with Spark
* Languages for interacting with Spark:
    * Java: low-level, compiled
    * Scala, Python and R: high-level and interactive REPL (Read-Eval-Print-Loop, which is crucial for **interactive development**)
    
#### Importing pyspark
* Python doesn't "speak" natively with Spark
* From Python import the `pyspark` module first, which makes the Spark functionality available in the Python interpreter
* Spark is under vigorous development and because it is constantly evolving, it is important to check your version before getting started 
* In this course we'll be using version `2.4.1` (released March 2019)

```
import pyspark
pyspark.__version__
```

#### Sub-modules
* In addition to `pyspark`, there are:
    * Structured Data -- `pyspark.sql`
    * Streaming Data -- `pyspark.streaming`
    * Machine Learning -- `pyspark.mllib` (deprecated) and **`pyspark.ml`**
* With the `pyspark` module loaded, you're able to connect to Spark. The next thing you need to do is **tell Spark where the cluster is located.** Two options:

#### Remote Cluster
   * Connect to a **Remote Cluster** using Spark URL:
        * `spark://<IP address | DNS name>:<port>`
        * *Example with IP:* `spark://13.59.151.161:7077`
        * *Example with DNS:* `spark://ec2-18-188-22-23.us-east-2.compute.amazonaws.com:7077`
        * The Spark URL gives the location of the cluster's master node
        * The URL is composed of an **IP address or DNS name** and a **port number**
        * The **default port** for Spark is 7077 (but this must still be explicitly specified)
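
To make the anatomy of a Spark URL concrete, the example address from the list above can be taken apart with Python's standard library. This is only an illustration of the URL's components; it does not connect to anything.

```python
from urllib.parse import urlparse

# Example URL taken from the notes above.
master_url = "spark://13.59.151.161:7077"

parts = urlparse(master_url)
print(parts.scheme)    # the "spark" scheme
print(parts.hostname)  # IP address or DNS name of the master node
print(parts.port)      # port number, written out explicitly in the URL
```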
        
#### Local Cluster
* When you're figuring out how Spark works, the infrastructure of a distributed network can get in the way
* For this reason it may be helpful to create a **local cluster**, where everything happens on a single computer
    * This is the setup that you're going to use throughout this course
* For a local cluster, you need only specify "local" and, optionally, the number of cores to use.
    * *Examples:*
        * `local` -- only 1 core;
        * `local[4]` -- 4 cores; or
        * `local[*]` -- all available cores.
* By **default** a local cluster will run on a single core.

### Creating a SparkSession
* You connect to Spark by creating a `SparkSession` object
* You then specify the **location** of the cluster using the **`master()`** method:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('first_spark_application') \
                    .getOrCreate()
```
* Or: `spark = SparkSession.builder.master('local[*]').appName('first_spark_application').getOrCreate()`


* Optionally, you can also assign a name to the application using the `appName()` method
* Finally, we call the `getOrCreate()` method, which will either create a new session object or return an existing object.
* *Once the session has been created, you are able to interact with Spark.*
* Although it's possible for multiple SparkSessions to co-exist, it's good practice to stop the SparkSession when you're done. 
* `# Close connection to Spark`
* **`>>> spark.stop()`**
* For more info on: [SparkSession](https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession)


#### Creating a SparkSession
* The SparkSession class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:
    * Specify the location of the master node;
    * Name the application (optional); and
    * Retrieve an existing SparkSession or, if there is none, create a new one.

### Loading Data
* Selected methods:
    * `count()` : returns number of rows 
    * `show()` : displays a subset of rows
    * `printSchema()` : column types
* Selected attributes:
    * `dtypes` : column types
    
#### Reading data from CSV
* The `.csv()` method reads a CSV file and returns a `DataFrame`
* `cars = spark.read.csv('cars.csv', header=True)`
* **Optional arguments:**
    * **`header`:** is first row a header? (default: `False`)
    * **`sep`:** field separator (default: a comma `','`)
    * **`schema`:** explicit column data types
    * **`inferSchema`:** deduce column data types from data?
    * **`nullValue`:** placeholder for missing data; *case-sensitive*
    
#### Check column types
* `cars.printSchema()`
* **The `.csv()` method treats all columns as strings by default.** To get proper column types, either:
    * **(1)** Infer the columns from the data:
        * `cars = spark.read.csv('cars.csv', header=True, inferSchema=True)`
        * In this scenario, Spark needs to make an extra pass over the data to figure out the column types before reading the data
        * Con: if the data file is big, this will notably increase the load time
        * Con: While usually accurate, some types may be misidentified
        * Con: unless `nullValue='NA'` is set, NA is read as a string (so columns containing NA become string columns)
    * **(2)** Manually specify the types
    
#### Manually specifying column types
* Manually specify the type of each column in an explicit schema
* During this process, it is also possible to choose alternative column names

```
from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, DoubleType)

# Text columns stay strings; measurements get numeric types
schema = StructType([
                StructField("maker", StringType()),
                StructField("model", StringType()),
                StructField("origin", StringType()),
                StructField("type", StringType()),
                StructField("cyl", IntegerType()),
                StructField("size", DoubleType()),
                StructField("weight", IntegerType()),
                StructField("length", DoubleType()),
                StructField("rpm", IntegerType()),
                StructField("consumption", DoubleType())
])
cars = spark.read.csv("cars.csv", header=True, schema=schema, nullValue='NA')
```

#### Exercises: Loading flights data

```
# Read data from CSV file
flights = spark.read.csv('flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
print(flights.dtypes)
```

#### Exercises: Loading SMS spam data

```
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv('sms.csv', sep=';', header=False, schema=schema)

# Print schema of DataFrame
sms.printSchema()
```

# $\star$ Chapter 2: Classification
Now that you are familiar with getting data into Spark, you'll move onto building two types of classification model: Decision Trees and Logistic Regression. You'll also find out about a few approaches to data preparation.

### Data Preparation
#### Dropping columns
* There are two approaches:
    * Drop the columns you don't want; or,
    * Select the fields you do want

```
# Drop the columns you don't want
cars = cars.drop('maker', 'model')
 
 
# Select the columns you do want
cars = cars.select("origin", "type", "cyl", "size", "weight", "length", "rpm", "consumption")
```

#### Filtering out missing data
* Use the `.filter()` method and provide a **logical predicate** using **SQL syntax** that identifies NULL values:

```
# How many missing values?
cars.filter('cyl IS NULL').count()

# Drop records with missing values in the `cyl` column
cars = cars.filter('cyl IS NOT NULL')

# Drop records with missing values in any column
cars = cars.dropna()

```

#### Mutating Columns 
* Use the `.withColumn()` method to create a new mass column in units of kilograms

```
from pyspark.sql.functions import round

# Create a new "mass" column
cars = cars.withColumn("mass", round(cars.weight / 2.205, 0))

# Convert length to meters
cars = cars.withColumn('length', round(cars.length * 0.0254, 3))
```

### Indexing categorical data
* Use `StringIndexer` class
* Within constructor, provide string input column and a name for the new output column to be created
* The indexer is first fit to the data, creating a `StringIndexerModel`
* During the fitting process the distinct string values are identified and an index is assigned to each value
* The model is then used to transform the data, creating a new column with the index values

```
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='type',
                        outputCol='type_idx')
                        
# Assign index values to strings (fitting returns a StringIndexerModel)
indexer_model = indexer.fit(cars)

# Create column with index values
cars = indexer_model.transform(cars)
```
* **By default, the index values are assigned according to the descending relative frequency of each of the string values.**

<img src='data/index_cat_data.png' width="400" height="200" align="center"/>

* Note, too, that indexing starts at zero (most common)
* It is also possible to choose different strategies for assigning index values, by specifying the **`stringOrderType`** argument.

* **Indexing country of origin:**

```
# Index country of origin:
#
# USA          -> 0
# non-USA      -> 1
#
cars = StringIndexer(
    inputCol = "origin",
    outputCol = "label"
).fit(cars).transform(cars)
```
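
The default frequency-based assignment described above can be mimicked in plain Python. This sketch is not Spark's implementation: the helper `frequency_desc_index`, the sample values, and the alphabetical tie-breaking are assumptions made to keep the example deterministic.

```python
from collections import Counter


def frequency_desc_index(values):
    """Map each distinct string to an index, most frequent first.

    Assumption: ties are broken alphabetically. This is a simplification
    for illustration, not Spark's actual code.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer writes its indices as doubles, so floats are used here too.
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical values for the 'type' column.
car_types = ["Midsize", "Small", "Midsize", "Sporty", "Small", "Midsize", "Van"]

mapping = frequency_desc_index(car_types)
indexed = [mapping[t] for t in car_types]
print(mapping)
print(indexed)
```

The most common value receives index 0, which is why a label column built this way encodes the majority class as 0.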

#### Assembling columns
* The final step in preparing a dataset is to consolidate the various input columns into a single column
* This is necessary because **the Machine Learning algorithms in Spark operate on a single vector of predictors** (although each element in that vector may consist of multiple values).
* First you create an instance of the `VectorAssembler` class, providing it with the names of the columns that you want to consolidate and the name of the new output column:

```
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['cyl', 'size'], outputCol='features')
# Add the consolidated 'features' column to the DataFrame
cars = assembler.transform(cars)
```

<img src='data/column_vectors.png' width="300" height="150" align="center"/>

**Note** the new `features` column above, which consists of values from the `cyl` and `size` columns consolidated into a vector.

```
# Remove the 'flight' column
flights_drop_column = flights.drop('flight')

# Number of records with missing 'delay' values
flights_drop_column.filter('delay IS NULL').count()

# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.dropna()
print(flights_none_missing.count())
```

<img src='data/course_datasets.png' width="600" height="300" align="center"/>