<DIV ALIGN=CENTER>

# Introduction to Spark: Machine Learning
## Professor Robert J. Brunner
  
</DIV>  
-----

## Introduction

In this IPython Notebook, we explore using Spark to perform basic
statistical analysis and machine learning. For part of this analysis, we
use the airline data, which are stored in files that are accessible
from within our Spark cluster. The [current version of Spark][sv] has
two machine learning packages. The original and more mature library is
[MLlib][mll], while the newer library is [ML][ml]. The former operates
on Spark RDDs, while the latter operates on DataFrames, which seem to be
the future of Spark data structures. Given the current dominance of
MLlib, we focus on the original Spark machine learning package in this
Notebook.

-----

[sv]: https://spark.apache.org
[mll]: https://spark.apache.org/mllib/
[ml]: https://spark.apache.org/docs/latest/ml-guide.html

### Initialization

In this class, we have a dedicated Spark cluster running to allow
students to explore Spark from within our IPython Notebook environment.
Since our Spark cluster has limited resources, we need to manage them
carefully. In particular, we need to ensure that any SparkContext
previously used by this Jupyter Server is properly released before
starting a new one. After this, we initialize a new SparkContext so
that this dockerized IPython Notebook can interact with the Spark
cluster.

----- 

In [1]:
# We release the SparkContext if it exists.
try:
    sc
except NameError:
    # No SparkContext has been defined yet, so there is nothing to stop
    pass
else:
    sc.stop()

# Now handle initial import statements
from pyspark import SparkConf, SparkContext

# Create new Spark Configuration (port numbers might need to be adjusted from defaults.)
myconf = SparkConf()
myconf.setMaster('local[*]')
myconf.setAppName("INFO490 SP17 W14-NB3: Professor Brunner")
myconf.set('spark.executor.memory', '1g')

# Create and initialize a new Spark Context
sc = SparkContext(conf=myconf)

# Display Spark version information, which also verifies SparkContext is active
print("\nSpark version: {0}".format(sc.version))


Spark version: 2.0.1


-----

### Data Processing

In this Notebook, we will need sample data. To simplify acquiring data
to demonstrate Spark MLlib, we include the RDD code from the
[Introduction to Spark](intro2spark.ipynb) Notebook in the following
cell.

-----

In [2]:
filename = '/home/data_scientist/data/2001/2001-1.csv'

text_file = sc.textFile(filename)

col_data = text_file.map(lambda l: l.split(",")) \
            .map(lambda p: (p[0], p[1], p[2], p[4], p[14], p[15], p[16], p[17], p[18])) \
            .filter(lambda line: 'Year' not in line)

cols = col_data.filter(lambda line: 'NA' not in line)

fields = cols.map(lambda p: (int(p[0]), int(p[1]), int(p[2]), int(p[3]),
                          int(p[4]), int(p[5]), p[6], p[7], int(p[8])))

# Should be 480106 if everything works correctly
print('Number of entries in fields RDD = {0}'.format(fields.count()))

Number of entries in fields RDD = 480106


-----

## Spark Statistics

The simplest type of data analysis is to compute basic statistical
measures of sequences of data. The Spark MLlib package includes a 
[basic statistical][sbs] component that can be easily used to obtain
statistical measurements of multiple columns in a Spark RDD. We
demonstrate this in the following code cells, where we create an RDD
from numeric columns in our `fields` RDD. We use the `colStats` function
from the `Statistics` object to compute a range of statistical measures
in one pass for all columns in the `sdt` RDD. In the second code cell,
we simply provide a nicely formatted display of these quantities for
each column.

-----

[sbs]: https://spark.apache.org/docs/latest/mllib-statistics.html

In [3]:
from pyspark.mllib.stat import Statistics

# Extract numeric columns and compute statistics
sdt = fields.map(lambda p: (p[2], p[3], p[4], p[5], p[8]))
summary = Statistics.colStats(sdt)

# Extract individual statistics for RDD
mus = summary.mean()
mns = summary.min()
mxs = summary.max()
vrs = summary.variance()
nnzs = summary.numNonzeros()

In [4]:
# Labels for display
cols = ['Day', 'Dep. Time', 'Arr. Delay', 'Dep. Delay', 'Distance']

# Print out Header
print('{0:>20s}{1:>12s}{2:>8s}{3:>10s}{4:>12s}'\
      .format('Mean', 'Variance', 'Min', 'Max', 'Non Zeroes'))
print(65*'-')

# Printout summary statistics
for idx, (m, v, mn, mx, n) in enumerate(zip(mus, vrs, mns, mxs, nnzs)):
    print('{5:10s}{0:10.2f}{1:12.2f}{2:8.2f}{3:10.2f}{4:12d}'\
          .format(m, v, mn, mx, int(n), cols[idx]))

                Mean    Variance     Min       Max  Non Zeroes
-----------------------------------------------------------------
Day            16.01       79.87    1.00     31.00      480106
Dep. Time    1359.66   237399.85    1.00   2400.00      480106
Arr. Delay      6.38      964.02  -80.00   1688.00      461157
Dep. Delay      8.78      782.11  -59.00   1692.00      393503
Distance      716.99   323369.33   21.00   4962.00      480106
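
For intuition, the per-column quantities reported above can be reproduced on a small in-memory list with plain Python. The following is a minimal sketch, not Spark code: the `rows` data and the `column_stats` helper are hypothetical, and we assume the variance is the unbiased (n - 1) sample variance.

```python
import statistics

# Hypothetical toy rows (not taken from the airline data): three numeric columns
rows = [
    (1, 0, 10.0),
    (2, 5, 20.0),
    (3, 0, 60.0),
    (4, -3, 30.0),
]

def column_stats(rows):
    """Return per-column statistics for a list of equal-length tuples."""
    columns = list(zip(*rows))  # transpose rows into columns
    return {
        'mean': [statistics.mean(c) for c in columns],
        # statistics.variance is the unbiased (n - 1) sample variance
        'variance': [statistics.variance(c) for c in columns],
        'min': [min(c) for c in columns],
        'max': [max(c) for c in columns],
        # Count of entries that are not exactly zero
        'numNonzeros': [sum(1 for v in c if v != 0) for c in columns],
    }

summary_local = column_stats(rows)
for name, values in summary_local.items():
    print('{0:>12s}: {1}'.format(name, values))
```

The non-zero count explains why the delay columns in the table above report fewer entries than the total number of rows: flights with a delay of exactly zero are not counted.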


-----

### Correlations

Another useful function is to compute the correlation between different
data sequences. The Spark MLlib package includes the `corr` method
within the Statistics component to compute correlations between
individual data sequences, or via the columns in an RDD. The `corr`
method can also calculate either the _Pearson_ correlation, which is the
default, or the _Spearman_ correlation. In the first code cell, we
create several data sequences, turn them into Spark data structures via
the `parallelize` method, and compute the Pearson correlation
coefficient between the different data sequences. In the second code
cell, we create a new RDD from three columns in the `sdt` RDD, and
compute both the Pearson and Spearman correlations between the columns
in this RDD.

-----

In [5]:
# Demonstrate Correlation Measurements

# Sample Data
x = sc.parallelize([0, 1, 2])
y = sc.parallelize([1, 2, 4])
z = sc.parallelize([2, 1, 0])

print('x = ', x.collect())
print('y = ', y.collect())
print('z = ', z.collect())

print('\nPearson Correlation Tests')
print(25*'-')
print('x corr x = {0:+5.3f}'\
      .format(Statistics.corr(x, x, method='pearson')))

print('x corr y = {0:+5.3f}'\
      .format(Statistics.corr(x, y, method='pearson')))

print('x corr z = {0:+5.3f}'\
      .format(Statistics.corr(x, z, method='pearson')))

x =  [0, 1, 2]
y =  [1, 2, 4]
z =  [2, 1, 0]

Pearson Correlation Tests
-------------------------
x corr x = +1.000
x corr y = +0.982
x corr z = -1.000


In [6]:
# Set print precision of matrices
import numpy as np
np.set_printoptions(precision=3)

# Compute correlation of three columns in RDD
cd = sdt.map(lambda p: (p[1], p[2], p[3]))

print('Departure Time, Arrival Delay, Departure Delay')

print('\nPearson Correlation Matrix:')
print(Statistics.corr(cd, method='pearson'))

print('\nSpearman Correlation Matrix:')
print(Statistics.corr(cd, method='spearman'))

Departure Time, Arrival Delay, Departure Delay

Pearson Correlation Matrix:
[[ 1.     0.134  0.167]
 [ 0.134  1.     0.904]
 [ 0.167  0.904  1.   ]]

Spearman Correlation Matrix:
[[ 1.     0.109  0.173]
 [ 0.109  1.     0.616]
 [ 0.173  0.616  1.   ]]
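
To make the difference between the two measures concrete, the following plain Python sketch computes both on the small `x`, `y`, and `z` sequences used earlier. It is an illustration of the standard definitions, not Spark's implementation; the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical. The Spearman correlation is taken to be the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end of the run of tied values that starts at position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

xs, ys, zs = [0, 1, 2], [1, 2, 4], [2, 1, 0]
print('Pearson  x, y = {0:+5.3f}'.format(pearson(xs, ys)))
print('Spearman x, y = {0:+5.3f}'.format(spearman(xs, ys)))
print('Pearson  x, z = {0:+5.3f}'.format(pearson(xs, zs)))
```

Because `y` increases with `x` but not linearly, its Pearson correlation with `x` is slightly below one, while the rank-based Spearman correlation is exactly one.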


-----

### Random Data and Sampling

Another useful capability when constructing models is to generate random
data from a particular theoretical statistical distribution, such as a
_Normal_, _Uniform_, or _Poisson_ distribution. Likewise, when building
a model from large data, one often needs to sample from the large data
to make a more manageable data set with which to construct a model. The
Spark MLlib package provides methods for both of these features.

First, the `RandomRDDs` class includes methods to generate RDDs of a
given size from a particular distribution, which is specified in the
method called. In the following code cell, we create three RDDs, each
containing 1000 values drawn from a uniform, normal, or Poisson
distribution, respectively. Afterwards, we compute several basic
statistical measures from these RDDs to demonstrate the simplicity of
this approach to generating random data from model distributions.

Second, we sample from the normal distribution both with and without
replacement to make new samples. Afterwards, we once again compute basic
statistical measures to demonstrate the random sampling within Spark
MLlib.

-----

In [7]:
from pyspark.mllib.random import RandomRDDs

ud = RandomRDDs.uniformRDD(sc, 1000, seed=23)

nd = RandomRDDs.normalRDD(sc, 1000, seed=23)

pd = RandomRDDs.poissonRDD(sc, mean=2.0, size=1000, seed=23)

In [8]:
print('Uniform Distribution Statistics\n', ud.stats())

Uniform Distribution Statistics
 (count: 1000, mean: 0.48882245990922063, stdev: 0.285561924029, max: 0.996007011914, min: 0.000220626980565)


In [9]:
print('Normal Distribution Statistics\n', nd.stats())

Normal Distribution Statistics
 (count: 1000, mean: -0.0009683712527233408, stdev: 1.00533731021, max: 3.07476107319, min: -3.85245873312)


In [10]:
print('Poisson Distribution Statistics\n', pd.stats())

Poisson Distribution Statistics
 (count: 1000, mean: 1.9879999999999993, stdev: 1.4106225576, max: 8.0, min: 0.0)


In [11]:
# Sample without replacement

frac = 0.25

ds = nd.sample(False, frac)
print(ds.stats())

(count: 251, mean: -0.018475920371130692, stdev: 1.03774841391, max: 2.76048478382, min: -2.82725011859)


In [12]:
# Sample with replacement
ds = nd.sample(True, frac)
print(ds.stats())

(count: 278, mean: -0.03783347126550598, stdev: 0.939491296316, max: 2.76048478382, min: -2.75271141304)
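
Notice that neither sample contains exactly 250 values, even though the fraction is 0.25. The following plain Python sketch shows one way this can arise, assuming that sampling without replacement keeps each element independently with probability `frac` (Bernoulli sampling) and that sampling with replacement repeats each element a Poisson-distributed number of times. This is our understanding of how `sample` behaves, not a copy of Spark's code, and all function names below are hypothetical.

```python
import math
import random

def sample_without_replacement(items, fraction, rng):
    """Bernoulli sampling: keep each item independently with probability fraction."""
    return [item for item in items if rng.random() < fraction]

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) integer using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_with_replacement(items, fraction, rng):
    """Poisson sampling: each item appears Poisson(fraction) times."""
    out = []
    for item in items:
        out.extend([item] * poisson_draw(fraction, rng))
    return out

rng = random.Random(23)
population = list(range(1000))
without = sample_without_replacement(population, 0.25, rng)
with_repl = sample_with_replacement(population, 0.25, rng)

# The sample sizes fluctuate around 250 rather than matching it exactly
print(len(without), len(with_repl))
```

Under these assumptions the fraction sets the expected sample size, not the exact one, and only the sample drawn with replacement can contain the same element more than once.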


-----

## Machine Learning

The bulk of the MLlib package is focused on performing machine learning
at scale by using Spark. With functions for computing classification,
regression, clustering, dimensionality reduction, and more, the library
extends considerable power to the Spark user. Since we have already
covered these concepts by using Python and scikit-learn, in the rest of
this Notebook, we will present two specific machine learning algorithms
in order to demonstrate the basic concepts required to work with the
tools in the Spark MLlib package.

-----

### Linear Modeling

One of the simplest machine learning techniques is [linear regression][slr].
The main difference when using Spark is that for this supervised
learning technique our data must be in a Spark-specific data structure
called [`LabeledPoint`][slp], one of several data structures that Spark
provides to simplify the application of distributed machine learning
algorithms at scale. A `LabeledPoint` pairs a label, which is the value
used for training, with the features of a single data point. The first
item in the data structure is the label, while the second item is the
set of feature columns.

In the following code cells, we first create a new data structure that
extracts the arrival delay to be the label and the departure delay as
the feature. These data are turned into a Spark sequence containing
`LabeledPoint` values for each row in the original RDD. Next, we display
the first rows in the new sequence, and then we train the linear
regressor (using stochastic gradient descent, or SGD, in this case) and
specify a number of iterations and step size. You should feel free to
modify these values and see the
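impact on the resulting performance. Finally, we compute several
regression metrics to quantify the performance of this method on these
data. Note that with the very small step size used here, the fitted
weight stays close to zero, so the predictions fall far below the
labels and the RMSE is comparable in size to the delays themselves.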

-----

[slp]: https://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
[slr]: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression

In [13]:
# Both classes come from the RDD-based MLlib package
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# Minimum departure delay
min_delay = 5.
data = fields.filter(lambda p: p[5] > min_delay).map(lambda p: LabeledPoint(p[4], [p[5]]))

In [14]:
data.take(5)

[LabeledPoint(23.0, [11.0]),
 LabeledPoint(18.0, [20.0]),
 LabeledPoint(96.0, [100.0]),
 LabeledPoint(20.0, [17.0]),
 LabeledPoint(87.0, [97.0])]

In [15]:
lr_model = LinearRegressionWithSGD.train(data, iterations=100, step=0.00000001)



In [16]:
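# Pair each label with its prediction. Note that RegressionMetrics documents
# its input as (prediction, observation) pairs, so this (label, prediction)
# order affects r2 and explained variance, but not RMSE, MSE, or MAE.
vnp = data.map(lambda lp: (lp.label, float(lr_model.predict(lp.features))))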

In [17]:
vnp.take(5)

[(23.0, 0.0005594491042241984),
 (18.0, 0.0010171801894985426),
 (96.0, 0.005085900947492714),
 (20.0, 0.0008646031610737612),
 (87.0, 0.004933323919067932)]

In [18]:
from pyspark.mllib.evaluation import RegressionMetrics

tm = RegressionMetrics(vnp)

print('RMSE = {0:5.1f}'.format(tm.rootMeanSquaredError))
print('MSE = {0:5.1f}'.format(tm.meanSquaredError))
print('MAE = {0:5.1f}'.format(tm.meanAbsoluteError))
print('r2 = {0:5.1f}'.format(tm.r2))
print('EV = {0:5.1f}'.format(tm.explainedVariance))

RMSE =  56.1
MSE = 3144.1
MAE =  35.5
r2 = -691589526.2
EV = 3144.3
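
To help interpret these numbers, the following plain Python sketch computes the same five quantities from a list of (prediction, observation) pairs. It is not Spark's code: the `regression_metrics` helper and the toy pairs are hypothetical, and we assume that r2 is one minus the ratio of the residual to the total sum of squares, and that the explained variance is the mean squared deviation of the predictions from the mean observation.

```python
import math

def regression_metrics(pairs):
    """Compute regression metrics from (prediction, observation) pairs."""
    n = len(pairs)
    obs_mean = sum(o for _, o in pairs) / n
    ss_err = sum((o - p) ** 2 for p, o in pairs)       # residual sum of squares
    ss_tot = sum((o - obs_mean) ** 2 for _, o in pairs)  # total sum of squares
    ss_reg = sum((p - obs_mean) ** 2 for p, _ in pairs)  # regression sum of squares
    return {
        'RMSE': math.sqrt(ss_err / n),
        'MSE': ss_err / n,
        'MAE': sum(abs(o - p) for p, o in pairs) / n,
        'r2': 1.0 - ss_err / ss_tot,
        'EV': ss_reg / n,
    }

# Hypothetical toy (prediction, observation) pairs
toy_pairs = [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)]
metrics = regression_metrics(toy_pairs)

# Swapping the roles of prediction and observation changes r2 and EV only
swapped = regression_metrics([(o, p) for p, o in toy_pairs])

for name in ('RMSE', 'MSE', 'MAE', 'r2', 'EV'):
    print('{0:5s} {1:8.4f} {2:8.4f}'.format(name, metrics[name], swapped[name]))
```

Under these definitions RMSE, MSE, and MAE are symmetric in the two columns, while r2 and EV depend on which column is treated as the observation.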


In [19]:
print(lr_model)

(weights=[5.08590094749e-05], intercept=0.0)
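
The near-zero weight printed above suggests that the step size is too small for the optimizer to make progress in 100 iterations. The following plain Python sketch illustrates this effect with constant-step, full-batch gradient descent on the five `LabeledPoint` rows displayed earlier. This is a simplified stand-in and not the update rule used by `LinearRegressionWithSGD`; the `fit_slope` helper and the second step size are hypothetical choices for illustration.

```python
# (feature, label) pairs copied from the five LabeledPoint rows displayed earlier
points = [(11.0, 23.0), (20.0, 18.0), (100.0, 96.0), (17.0, 20.0), (97.0, 87.0)]

def fit_slope(points, iterations, step):
    """Fit y ~ w * x (no intercept) with constant-step, full-batch gradient descent."""
    w = 0.0
    n = len(points)
    for _ in range(iterations):
        # Gradient of the mean of (w * x - y)^2 / 2 with respect to w
        grad = sum((w * x - y) * x for x, y in points) / n
        w -= step * grad
    return w

w_tiny = fit_slope(points, iterations=100, step=1e-8)
w_larger = fit_slope(points, iterations=100, step=1e-4)

print('step = 1e-8: w = {0:.6f}'.format(w_tiny))
print('step = 1e-4: w = {0:.6f}'.format(w_larger))
```

On these five points the larger step converges to the least-squares slope, while the tiny step barely moves the weight away from its starting value of zero.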


-----

### Random Forest

The second machine learning algorithm we demonstrate is 
[Random Forests][srf], an ensemble, supervised learning technique. Once again,
we need a data sequence of `LabeledPoint` values, but in this case we
simply reuse the one we created for the linear regression example. The
next step is to call the `trainRegressor` method of the `RandomForest`
class. The random forest can accept categorical data, but in this case
none of our columns are categorical, and we specify this with an empty
dictionary. Finally, we explicitly set the number of trees in this
example to one, which allows us to easily display the generated forest
in the second code cell.

Next, we predict values for our data. Technically, we would want to use
a test-train split or even cross-validation to properly evaluate our
model, but for simplicity we demonstrate the prediction and quality
assessment on the entire training data. In order to compute the
regression metrics from the random forest, we need to employ a slightly
different strategy to combine the labels with the predictions: we
predict on an RDD of feature vectors and `zip` the result with the
labels, which is shown in the third code cell. Finally, we display the
regression metrics for the random forest regression on the flight data.

-----

[srf]: https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests

In [20]:
from pyspark.mllib.tree import RandomForest

rf_model = RandomForest.trainRegressor(data, categoricalFeaturesInfo={}, numTrees=1)

In [21]:
print(rf_model.toDebugString())

TreeEnsembleModel regressor with 1 trees

  Tree 0:
    If (feature 0 <= 70.0)
     If (feature 0 <= 30.0)
      If (feature 0 <= 16.0)
       If (feature 0 <= 11.0)
        Predict: 5.726113904806455
       Else (feature 0 > 11.0)
        Predict: 11.548860895202358
      Else (feature 0 > 16.0)
       If (feature 0 <= 23.0)
        Predict: 17.536144237834172
       Else (feature 0 > 23.0)
        Predict: 24.667233253496097
     Else (feature 0 > 30.0)
      If (feature 0 <= 48.0)
       If (feature 0 <= 40.0)
        Predict: 32.78938220887976
       Else (feature 0 > 40.0)
        Predict: 41.64304682040531
      Else (feature 0 > 48.0)
       If (feature 0 <= 61.0)
        Predict: 51.602028917910445
       Else (feature 0 > 61.0)
        Predict: 63.298205768794006
    Else (feature 0 > 70.0)
     If (feature 0 <= 131.0)
      If (feature 0 <= 100.0)
       If (feature 0 <= 82.0)
        Predict: 73.73626373626374
       Else (feature 0 > 82.0)
        Predict: 89.54949267192785

In [22]:
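# Predict on an RDD of feature vectors, then zip the labels with the
# predictions to form (label, prediction) pairs
pr = rf_model.predict(data.map(lambda x: x.features))
pnl = data.map(lambda lp: lp.label).zip(pr)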

In [23]:
tm = RegressionMetrics(pnl)

print('RMSE = {0:5.1f}'.format(tm.rootMeanSquaredError))
print('MSE = {0:5.1f}'.format(tm.meanSquaredError))
print('MAE = {0:5.1f}'.format(tm.meanAbsoluteError))
print('r2 = {0:5.1f}'.format(tm.r2))
print('EV = {0:5.1f}'.format(tm.explainedVariance))

RMSE =  22.3
MSE = 496.9
MAE =  12.0
r2 =   0.7
EV = 2012.6
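
Since these metrics are computed on the training data, they are likely optimistic. The following plain Python sketch shows the idea behind a simple hold-out split, in which each record is randomly assigned to either a training or a test set. The `random_split` helper and the toy records are hypothetical; with Spark, the `randomSplit` method of an RDD could play this role.

```python
import random

def random_split(records, train_fraction, seed):
    """Assign each record to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for record in records:
        (train if rng.random() < train_fraction else test).append(record)
    return train, test

# Hypothetical (label, feature) records, not the airline data
records = [(2.0 * i + 1.0, float(i)) for i in range(100)]
train, test = random_split(records, train_fraction=0.7, seed=23)

# The model would be fit on train only, and the metrics computed on test only
print(len(train), len(test))
```

Evaluating on the held-out records gives a fairer estimate of how the model performs on flights it has not seen.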


-----
### Student Activity

In the preceding cells, we introduced basic statistical analysis and
machine learning with Spark. Now that you have run the Notebook, go back
and make the following changes to see how the results change.

1. Compute the Pearson and Spearman correlations between the departure
and arrival delays in the flight data.

2. Add an intercept value to the linear regression. How does the slope
of the new fit differ from the original fit in this Notebook?

3. Add more columns into the Linear Regression demonstrated in this
Notebook. In particular, include departure time and distance into the
calculation.

4. Build a Random Forest regressor with more trees. Which gives the best
RMSE, 1 tree, 5 trees, 10 trees, or 25 trees? Can you explain why?

5. In Week 2, we performed a logistic regression on the flight data to
determine whether a flight would be late or not. Repeat this analysis,
but use the Logistic Regression functionality within Spark MLlib.

-----

### Ending the Spark Session

We must stop the `SparkContext` in order to release resources on the
instructional cluster before exiting this Notebook.

-----

In [24]:
sc.stop()