# Based on Drabas & Lee  -- Learning PySpark
## The ML package
#### Start the Jupyter notebook from its own folder; otherwise Python might not find some of the files it needs to load!
Set the kernel to Python 2 or Python [default]!

In [9]:
sc

In [10]:
if 0:
    import findspark
    findspark.init()
    import pyspark

    from pyspark.context import SparkContext
    from pyspark.sql.session import SparkSession
    sc = SparkContext('local')
    spark = SparkSession(sc)

Previously we worked with the **MLlib package in Spark** that operated strictly on **RDDs**. Here, we move to the **ML part of Spark** that operates 
strictly on **DataFrames**. Also, according to the Spark documentation, the primary machine learning API for Spark is now the DataFrame-based set of models contained 
in the spark.ml package.

In this demo, you will learn how to do the following:
* Prepare transformers, estimators, and pipelines
* Predict the chances of infant survival using models available in the ML package
* Evaluate the performance of the model
* Perform parameter hyper-tuning
* Use other machine-learning models available in the package

At the top level, the package exposes three main abstract classes: 
* a **Transformer**, 
* an **Estimator**, 
* and a **Pipeline**. 

We briefly explain each below, with short examples.

### Transformer

The **Transformer class**, like the name suggests, **transforms your data** by (usually) **appending a new column** to your DataFrame.

At a high level, every new Transformer that derives from the abstract Transformer class needs to implement a .transform(...) method.

The spark.ml.feature module offers many Transformers; a selection is briefly described here:
* **Binarizer**: Given a threshold, the method takes a continuous variable and transforms it into a binary one.
* **Bucketizer**: Similar to the Binarizer, this method takes a list of thresholds (the splits parameter) and transforms a continuous variable into a multinomial one.
* **ChiSqSelector**: For the categorical target variables (think classification models), this feature allows you to select a predefined number of features (parameterized by the numTopFeatures parameter) that explain the variance in the target the best. The selection is done, as the name of the method suggests, using a Chi-Square test.
* **DCT**: The Discrete Cosine Transform
* **ElementwiseProduct**: A method that returns a vector with elements that are products of the vector passed to the method, and a vector passed as the scalingVec parameter. For example, if you had a [10.0, 3.0, 15.0] vector and your scalingVec was [0.99, 3.30, 0.66], then the vector  you would get would look as follows: [9.9, 9.9, 9.9].
* **HashingTF**: A hashing trick transformer that takes a list of tokenized text and returns a vector (of predefined length) with counts. 
* **IDF**: This method computes an Inverse Document Frequency for a list of documents. Note that the documents need to already be represented as a vector.
* **MaxAbsScaler**: Rescales the data to be within the [-1.0, 1.0] range 
(thus, it does not shift the center of the data).
* **MinMaxScaler**: This is similar to the MaxAbsScaler with the difference that it 
scales the data to be in the [0.0, 1.0] range.
* **NGram**: This method takes a list of tokenized text and returns n-grams: pairs, triples, or n-mores of subsequent words. For example, if you had a ['good', 'morning', 'Robin', 'Williams'] vector you would get the following output: ['good morning', 'morning Robin', 'Robin Williams'].
* **Normalizer**: This method scales the data to be of unit norm using the 
p-norm value (by default, it is L2).
* **OneHotEncoder**: This method encodes a categorical column to a column of binary vectors.
* **PCA**: Performs the data reduction using principal component analysis.
* ...
* **VectorAssembler**: This is a highly useful transformer that collates multiple numeric (vectors included) columns into a single column with a vector representation.
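
To make the arithmetic behind a few of these transformers concrete, here is a plain-Python sketch that needs no Spark session. The function names `binarize`, `bucketize`, and `elementwise_product` are made up for this illustration and are not part of the `pyspark.ml.feature` API; they only mimic what the corresponding transformers compute for a single value or vector.

```python
from bisect import bisect_right

def binarize(value, threshold):
    """Mimics Binarizer: 1.0 if the value is strictly above the threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

def bucketize(value, splits):
    """Mimics Bucketizer for in-range values: index of the bin that holds the value.

    Bins are [splits[i], splits[i + 1]); the last bin also includes its upper bound.
    """
    return float(min(bisect_right(splits, value) - 1, len(splits) - 2))

def elementwise_product(vector, scaling_vec):
    """Mimics ElementwiseProduct: multiply the two vectors position by position."""
    return [v * s for v, s in zip(vector, scaling_vec)]

print(binarize(0.7, threshold=0.5))
print(bucketize(1.5, splits=[0.0, 1.0, 2.0]))
# Rounded because floating-point products are only approximately 9.9.
print([round(x, 2) for x in elementwise_product([10.0, 3.0, 15.0], [0.99, 3.30, 0.66])])
```

The last line reproduces the `[9.9, 9.9, 9.9]` example from the ElementwiseProduct entry.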

### Estimators
Estimators contain the statistical and machine learning models that need to be estimated to make 
predictions or classify your observations.
When deriving from the abstract Estimator class, a new model has to implement the 
.fit(...) method, which fits the model given the data found in a DataFrame and 
some default or user-specified parameters.

### Classification

* **LogisticRegression**: At the time of writing, PySpark ML supports only binary classification problems.
* **DecisionTreeClassifier**
* **GBTClassifier**: A Gradient Boosted Trees model for classification. At the moment, the GBTClassifier model supports binary labels, and continuous and categorical features.
* **RandomForestClassifier**: The RandomForestClassifier supports both binary 
and multinomial labels.
* **NaiveBayes**: The NaiveBayes model in PySpark ML supports both binary and multinomial labels.
* **MultilayerPerceptronClassifier**: A feed-forward neural network classifier.
* **OneVsRest**: A reduction of a multiclass classification to a binary one. 

### Regression
There are seven models available for regression tasks in the PySpark ML package. 

* **AFTSurvivalRegression**: Fits an Accelerated Failure Time regression 
model. 
* **DecisionTreeRegressor**: Similar to the model for classification with an obvious distinction that the label is continuous instead of binary  (or multinomial).
* **GBTRegressor**: As with the DecisionTreeRegressor, the difference is the data type of the label.
* **GeneralizedLinearRegression**: A family of linear models with differing 
kernel functions (link functions).
* **IsotonicRegression**: A type of regression that fits a free-form, nondecreasing line to your data. It is useful to fit the datasets with ordered and increasing observations.
* **LinearRegression**
* **RandomForestRegressor**

### Clustering

* **KMeans**
* **GaussianMixture**
* **LDA**: This model is used for topic modeling in natural language processing applications.

### Pipeline

A **Pipeline** in PySpark ML is a concept of an end-to-end transformation-estimation 
process (with distinct stages) that ingests some raw data (in a DataFrame form), 
performs the necessary data carpentry (transformations), and finally estimates a 
statistical model (estimator).

A Pipeline can be thought of as a chain of multiple discrete stages. When a 
.fit(...) method is executed on a Pipeline object, all the stages are executed in 
the order they were specified in the stages parameter; the stages parameter is a 
list of Transformer and Estimator objects.
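
The contract between the three abstractions can be sketched in a few lines of plain Python. This is a toy illustration and not Spark's implementation: the classes `Doubler`, `MeanCenterer`, `MeanCentererModel`, `ToyPipeline`, and `ToyPipelineModel` are invented for this sketch and operate on Python lists instead of DataFrames. It shows the one idea that matters: during `.fit(...)` each estimator stage is fitted on the data produced by the stages before it, and the fitted model then transforms the data for the next stage.

```python
class Doubler:
    """Toy transformer: has .transform(...) and nothing to fit."""
    def transform(self, data):
        return [2 * x for x in data]

class MeanCenterer:
    """Toy estimator: .fit(...) learns the mean and returns a fitted model."""
    def fit(self, data):
        return MeanCentererModel(sum(data) / len(data))

class MeanCentererModel:
    """Toy transformer produced by MeanCenterer.fit(...)."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as they are.
            model = stage.fit(data) if hasattr(stage, 'fit') else stage
            fitted.append(model)
            # The next stage sees the data as transformed by this stage.
            data = model.transform(data)
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    """Fitted pipeline: every stage is now a transformer."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

model = ToyPipeline(stages=[Doubler(), MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(model.transform([1.0, 2.0, 3.0]))
```

The mean is learned from the doubled values (4.0), which is why the order of the stages matters.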

#### Vector Assemble Example

In [7]:
df = spark.createDataFrame(
    [(12, 10, 3), (1, 4, 2)], 
    ['a', 'b', 'c']) 
df.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
| 12| 10|  3|
|  1|  4|  2|
+---+---+---+



In [8]:
import pyspark.ml.feature as ft
ft.VectorAssembler(inputCols=['a', 'b', 'c'], 
        outputCol='features')\
    .transform(df) \
    .select('features')\
    .collect() 

[Row(features=DenseVector([12.0, 10.0, 3.0])),
 Row(features=DenseVector([1.0, 4.0, 2.0]))]

#### OneHotEncoding Example

In [33]:
from pyspark.ml import Pipeline

df = spark.createDataFrame([
    (0.0, 1.0),
    (1.0, 0.0),
    (2.0, 1.0),
    (0.0, 2.0),
    (0.0, 1.0),
    (2.0, 0.0)
], ["categoryIndex1", "categoryIndex2"])

encoder = ft.OneHotEncoder(inputCol="categoryIndex1", outputCol="categoryVec1")

pipeline = Pipeline(stages=[encoder])
model = pipeline.fit(df)
transformed = model.transform(df)

transformed.show()

+--------------+--------------+-------------+
|categoryIndex1|categoryIndex2| categoryVec1|
+--------------+--------------+-------------+
|           0.0|           1.0|(2,[0],[1.0])|
|           1.0|           0.0|(2,[1],[1.0])|
|           2.0|           1.0|    (2,[],[])|
|           0.0|           2.0|(2,[0],[1.0])|
|           0.0|           1.0|(2,[0],[1.0])|
|           2.0|           0.0|    (2,[],[])|
+--------------+--------------+-------------+
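
The `categoryVec1` column is printed in sparse form: `(size, [indices], [values])`. Category 2.0 maps to the all-zero vector `(2,[],[])` because `OneHotEncoder` drops the last category by default (`dropLast=True`), so three categories need only two slots. The plain-Python sketch below mimics that layout; the helper `one_hot` is hypothetical and exists only for this illustration.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return (size, indices, values), the sparse layout printed by .show()."""
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # The dropped last category is encoded as a vector of zeros.
    return (size, [], [])

for category in [0, 1, 2]:
    print(category, one_hot(category, num_categories=3))
```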



## Predict chances of infant survival with ML

### Load the data

First, we load the data.

We specify the schema of the DataFrame; our severely limited dataset now only has 17 columns.

In [12]:
import pyspark.sql.types as typ

labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('CIG_BEFORE', typ.IntegerType()),
    ('CIG_1_TRI', typ.IntegerType()),
    ('CIG_2_TRI', typ.IntegerType()),
    ('CIG_3_TRI', typ.IntegerType()),
    ('MOTHER_HEIGHT_IN', typ.IntegerType()),
    ('MOTHER_PRE_WEIGHT', typ.IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
    ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
    ('DIABETES_PRE', typ.IntegerType()),
    ('DIABETES_GEST', typ.IntegerType()),
    ('HYP_TENS_PRE', typ.IntegerType()),
    ('HYP_TENS_GEST', typ.IntegerType()),
    ('PREV_BIRTH_PRETERM', typ.IntegerType())
]

schema = typ.StructType([
    typ.StructField(e[0], e[1], False) for e in labels
])


In [16]:
# Fix hdfs if the files are corrupt (e.g. having missing blocks)...

if 0:
    #!hdfs fsck -list-corruptfileblocks / 
    !hdfs dfsadmin -safemode leave
    !hdfs dfs -rm /hdfs_data/*
    !hdfs dfs -rm -r /user/ec2-user/data_key*
    !hdfs fsck / -delete

In [18]:
if 0:
    !hdfs dfs -mkdir -p /hdfs_data
    !hdfs dfs -ls /hdfs_data
    !hdfs dfs -put data/births_transformed.csv.gz /hdfs_data
!hdfs fsck /hdfs_data/births_transformed.csv.gz

Connecting to namenode via http://ec2-18-191-239-57.us-east-2.compute.amazonaws.com:50070/fsck?ugi=ec2-user&path=%2Fhdfs_data%2Fbirths_transformed.csv.gz
FSCK started by ec2-user (auth:SIMPLE) from /172.31.2.52 for path /hdfs_data/births_transformed.csv.gz at Tue Feb 12 07:35:27 UTC 2019
.
/hdfs_data/births_transformed.csv.gz:  Under replicated BP-663532545-172.31.27.125-1549216637007:blk_1073741830_1006. Target Replicas is 3 but found 2 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
Status: HEALTHY
 Total size:	364560 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 364560 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	1 (100.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		1 (33.333332 %)
 Number of data-nodes:		2
 Number of racks:		1

In [19]:
#births = spark.read.csv('data/births_transformed.csv.gz',  header=True, schema=schema)

births = spark.read.csv('/hdfs_data/births_transformed.csv.gz', header=True, schema=schema)

In [20]:
births.take(1)

[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=29, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=99, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0)]

### Create transformers

Before we can use the dataset to estimate a model, we need to do some transformations. Since statistical models can only operate on numeric data, we will have to encode the BIRTH_PLACE variable.

In [21]:
import pyspark.ml.feature as ft

To encode the BIRTH_PLACE column, we will use the **OneHotEncoder** method. However, the method cannot accept StringType columns; it can only deal with numeric types, so we will first cast the column to an IntegerType:

In [22]:
births = births.withColumn('BIRTH_PLACE_INT', births['BIRTH_PLACE'].cast(typ.IntegerType()))

In [25]:
births.take(3)

[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=29, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=99, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1),
 Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=22, FATHER_COMBINED_AGE=29, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=65, MOTHER_PRE_WEIGHT=180, MOTHER_DELIVERY_WEIGHT=198, MOTHER_WEIGHT_GAIN=18, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1),
 Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=38, FATHER_COMBINED_AGE=40, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=63, MOTHER_PRE_WEIGHT=155, MOTHER_DELIVERY_WEIGHT=167, MOTHER_WEIGHT_GAIN=12, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_

As we can see, the **.withColumn()** method created the BIRTH_PLACE_INT column.

Having done this, we can now create our first `Transformer`.

In [35]:
encoder = ft.OneHotEncoder(
    inputCol='BIRTH_PLACE_INT', 
    outputCol='BIRTH_PLACE_VEC')

In [38]:
# This will create a one-hot-encoding from the BIRTH_PLACE_INT column
pipeline = Pipeline(stages=[encoder])
model = pipeline.fit(births)
transformed = model.transform(births)

transformed.take(3)

[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=29, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=99, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0})),
 Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=22, FATHER_COMBINED_AGE=29, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=65, MOTHER_PRE_WEIGHT=180, MOTHER_DELIVERY_WEIGHT=198, MOTHER_WEIGHT_GAIN=18, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0})),
 Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=38, FATHER_COMBINED_AGE=40, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=63, MOTHER_PRE_WEIGHT=155, MOTHER_DELIVERY_WEIGHT=

Let's now create a single column with all the features collated together. 

In [39]:
featuresCreator = ft.VectorAssembler(
    inputCols=[
        col[0] 
        for col 
        in labels[2:]] + \
    [encoder.getOutputCol()], 
    outputCol='features'
)

In [40]:
# One-hot-encode BIRTH_PLACE_INT, then collate all the features into a single vector column
pipeline = Pipeline(stages=[encoder, featuresCreator])
model = pipeline.fit(births)
transformed = model.transform(births)

transformed.take(3)

[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=29, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=99, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 29.0, 1: 99.0, 6: 99.0, 7: 999.0, 8: 999.0, 9: 99.0, 16: 1.0})),
 Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=22, FATHER_COMBINED_AGE=29, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=65, MOTHER_PRE_WEIGHT=180, MOTHER_DELIVERY_WEIGHT=198, MOTHER_WEIGHT_GAIN=18, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 22.0, 1: 29.0, 6: 65.0, 7: 180.0, 8: 198.0, 9: 18.0, 16: 1.0})),
 Row(INFANT_ALIVE_AT_REPOR
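
The `features` vector has 24 entries because `VectorAssembler` concatenates its inputs in the order given: the 15 numeric columns from `labels[2:]` come first, followed by the 9 slots of `BIRTH_PLACE_VEC`. The following plain-Python sketch reconstructs that layout; the names `numeric_cols`, `one_hot_size`, and `layout` are introduced only for this illustration.

```python
# The 15 numeric columns, in the same order as labels[2:].
numeric_cols = [
    'MOTHER_AGE_YEARS', 'FATHER_COMBINED_AGE', 'CIG_BEFORE',
    'CIG_1_TRI', 'CIG_2_TRI', 'CIG_3_TRI', 'MOTHER_HEIGHT_IN',
    'MOTHER_PRE_WEIGHT', 'MOTHER_DELIVERY_WEIGHT', 'MOTHER_WEIGHT_GAIN',
    'DIABETES_PRE', 'DIABETES_GEST', 'HYP_TENS_PRE', 'HYP_TENS_GEST',
    'PREV_BIRTH_PRETERM',
]
one_hot_size = 9  # size of BIRTH_PLACE_VEC in the output above

# Position i of the assembled vector holds layout[i].
layout = numeric_cols + ['BIRTH_PLACE_VEC[%d]' % i for i in range(one_hot_size)]

print(len(layout))
print(layout[6])
print(layout[16])
```

Index 16 is therefore slot 1 of the one-hot vector, which matches the `16: 1.0` entry for rows with `BIRTH_PLACE_INT=1`.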

### Create an estimator

In this example we will (once again) use the Logistic Regression model.

In [41]:
import pyspark.ml.classification as cl

Once loaded, let's create the model.

In [42]:
logistic = cl.LogisticRegression(
    maxIter=10, 
    regParam=0.01, 
    labelCol='INFANT_ALIVE_AT_REPORT')

### Create a pipeline

All that is left now is to create a `Pipeline` and fit the model. First, let's load the `Pipeline` from the package.

In [43]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
        encoder, 
        featuresCreator, 
        logistic
    ])

### Fit the model

Conveniently, the `DataFrame` API has the `.randomSplit(...)` method, which we use to split the data into training and testing sets.

In [44]:
births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)

Now run our `pipeline` and estimate our model.

In [45]:
model = pipeline.fit(births_train)
test_model = model.transform(births_test)

Here's what the `test_model` looks like.

In [46]:
test_model.take(2)

[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 66.0, 7: 133.0, 8: 135.0, 9: 2.0, 16: 1.0}), rawPrediction=DenseVector([1.0573, -1.0573]), probability=DenseVector([0.7422, 0.2578]), prediction=0.0),
 Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=14, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=63, MOTHER_PRE_WEIGHT=93, MOTHER_DELIVERY_WEIGHT=100, MOTHER_WEIGHT_GAIN=0, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVecto

As you can see, we get all the columns from the Transformers and Estimators. The 
logistic regression model outputs several columns: the rawPrediction is the value 
of the linear combination of features and the β coefficients, the probability is the 
calculated probability for each of the classes, and finally, the prediction is our final 
class assignment.
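
The relationship between these three columns can be checked by hand for the first row. For a binary logistic regression, applying the logistic (sigmoid) function to each entry of `rawPrediction` gives the corresponding entry of `probability`. The sketch below is plain Python with the numbers copied from the output above; it assumes the default decision threshold of 0.5, under which the predicted class is simply the more probable one.

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

raw_prediction = [1.0573, -1.0573]  # copied from the first row of the output
probability = [sigmoid(z) for z in raw_prediction]
# With the default threshold of 0.5 the prediction is the more probable class.
prediction = float(probability.index(max(probability)))

print([round(p, 4) for p in probability])
print(prediction)
```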

### Model performance

Obviously, we would like to now test how well our model did.

In [49]:
import pyspark.ml.evaluation as ev

evaluator = ev.BinaryClassificationEvaluator(
    rawPredictionCol='probability', 
    labelCol='INFANT_ALIVE_AT_REPORT')


print('area under ROC:', evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderROC'}))
print('area under PR curve:', evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderPR'}))

('area under ROC:', 0.7401301847095617)
('area under PR curve:', 0.7139354342365674)
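
As a reminder of what the first metric measures: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it by brute force over all positive/negative pairs. It is only meant to illustrate the definition, Spark computes the curve differently, and the four labels and scores are made up for this example.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy data: three of the four positive/negative pairs are ordered correctly.
print(area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because only the ordering of the scores matters, a value of 0.5 corresponds to random ranking and 1.0 to a perfect one.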


### Saving the model

PySpark allows you to save the Pipeline definition for later use. It not only saves 
the pipeline structure, but also all the definitions of all the Transformers and Estimators:

In [50]:
pipelinePath = './data/infant_oneHotEncoder_Logistic_Pipeline'
pipeline.write().overwrite().save(pipelinePath)

So, you can load it up later and use it straight away to `.fit(...)` and predict.

In [51]:
loadedPipeline = Pipeline.load(pipelinePath)
loadedPipeline \
    .fit(births_train)\
    .transform(births_test)\
    .take(1)

[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=66, MOTHER_PRE_WEIGHT=133, MOTHER_DELIVERY_WEIGHT=135, MOTHER_WEIGHT_GAIN=2, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_INT=1, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 66.0, 7: 133.0, 8: 135.0, 9: 2.0, 16: 1.0}), rawPrediction=DenseVector([1.0573, -1.0573]), probability=DenseVector([0.7422, 0.2578]), prediction=0.0)]

If you, however, want to save the whole estimated model, you can also do that; instead of 
saving the Pipeline, you need to save the PipelineModel:

In [52]:
from pyspark.ml import PipelineModel

modelPath = './data/infant_oneHotEncoder_Logistic_PipelineModel'
model.write().overwrite().save(modelPath)

loadedPipelineModel = PipelineModel.load(modelPath)
test_loadedModel = loadedPipelineModel.transform(births_test)

## Parameter hyper-tuning

Our first model is rarely the best we can do. Simply looking at our metrics and accepting the model because it passed our pre-conceived performance thresholds is hardly a scientific method for finding the best model. The idea behind parameter hyper-tuning is to search for the best parameters of the model.

### Grid search

Load the `.tuning` part of the package.

In [20]:
import pyspark.ml.tuning as tune

Next let's specify our model and the list of parameters we want to loop through.

In [21]:
logistic = cl.LogisticRegression(
    labelCol='INFANT_ALIVE_AT_REPORT')

grid = tune.ParamGridBuilder() \
    .addGrid(logistic.maxIter,  
             [2, 10, 50]) \
    .addGrid(logistic.regParam, 
             [0.01, 0.05, 0.3]) \
    .build()
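
The grid built above is the Cartesian product of the listed values, so it contains 3 x 3 = 9 parameter combinations. A plain-Python sketch of the same enumeration, where `param_maps` is a name introduced only for this illustration and the parameters are represented as plain strings instead of `Param` objects:

```python
from itertools import product

max_iter_values = [2, 10, 50]
reg_param_values = [0.01, 0.05, 0.3]

# One dictionary per combination, analogous to one param map in the grid.
param_maps = [
    {'maxIter': m, 'regParam': r}
    for m, r in product(max_iter_values, reg_param_values)
]

print(len(param_maps))
print(param_maps[0])
```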

Next, we need some way of comparing the models.

In [22]:
evaluator = ev.BinaryClassificationEvaluator(
    rawPredictionCol='probability', 
    labelCol='INFANT_ALIVE_AT_REPORT')

Create the logic that will do the validation work for us.

In [23]:
cv = tune.CrossValidator(
    estimator=logistic, 
    estimatorParamMaps=grid, 
    evaluator=evaluator
)
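
`CrossValidator` uses k-fold cross-validation with `numFolds=3` by default: the training data is split into three folds, and each parameter combination is fitted three times, each time validating on a different fold. With the nine combinations in our grid that means 27 model fits, plus a final refit of the best combination on the whole training set. The plain-Python sketch below shows the fold logic only; `k_fold_indices` is a hypothetical helper, and Spark's own random fold assignment will differ in detail.

```python
import random

def k_fold_indices(n_rows, num_folds=3, seed=42):
    """Yield (train_idx, validation_idx); each row is validated exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled row indices into num_folds roughly equal folds.
    folds = [idx[i::num_folds] for i in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, validation

splits = list(k_fold_indices(10))
for train, validation in splits:
    print(len(train), len(validation))
```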

Create a purely transforming `Pipeline`.

In [24]:
pipeline = Pipeline(stages=[encoder,featuresCreator])
data_transformer = pipeline.fit(births_train)

Having done this, we are ready to find the optimal combination of parameters for our model.

In [25]:
cvModel = cv.fit(data_transformer.transform(births_train))

The `cvModel` holds the best model estimated, and calling `.transform(...)` on it uses that model. We can now use it to see if it performed better than our previous model.

In [26]:
data_test = data_transformer \
    .transform(births_test)
results = cvModel.transform(data_test)

print(evaluator.evaluate(results, 
     {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(results, 
     {evaluator.metricName: 'areaUnderPR'}))

0.740495980331
0.715797110849


What parameters does the best model have? The answer is a little bit convoluted, but here's how you can extract it.

In [27]:
results = [
    (
        [
            {key.name: paramValue} 
            for key, paramValue 
            in params.items()
        ], metric
    ) 
    for params, metric 
    in zip(
        cvModel.getEstimatorParamMaps(), 
        cvModel.avgMetrics
    )
]

sorted(results, 
       key=lambda el: el[1], 
       reverse=True)[0]

([{'maxIter': 50}, {'regParam': 0.01}], 0.7386231679256547)

### Train-Validation splitting

Use the `ChiSqSelector` to select only the top 5 features, thus limiting the complexity of our model.

In [28]:
selector = ft.ChiSqSelector(
    numTopFeatures=5, 
    featuresCol=featuresCreator.getOutputCol(), 
    outputCol='selectedFeatures',
    labelCol='INFANT_ALIVE_AT_REPORT'
)

logistic = cl.LogisticRegression(
    labelCol='INFANT_ALIVE_AT_REPORT',
    featuresCol='selectedFeatures'
)

pipeline = Pipeline(stages=[encoder,featuresCreator,selector])
data_transformer = pipeline.fit(births_train)

The `TrainValidationSplit` object gets created in the same fashion as the `CrossValidator` model.

In [29]:
tvs = tune.TrainValidationSplit(
    estimator=logistic, 
    estimatorParamMaps=grid, 
    evaluator=evaluator
)

As before, we fit the model to our data and calculate the results.

In [30]:
tvsModel = tvs.fit(
    data_transformer \
        .transform(births_train)
)

data_test = data_transformer \
    .transform(births_test)
results = tvsModel.transform(data_test)

print(evaluator.evaluate(results, 
     {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(results, 
     {evaluator.metricName: 'areaUnderPR'}))

0.729429631444
0.703775950282


## Other features of PySpark ML in action

### Feature extraction

#### NLP related feature extractors

We start with a simple dataset.

In [31]:
text_data = spark.createDataFrame([
    ['''Machine learning can be applied to a wide variety 
        of data types, such as vectors, text, images, and 
        structured data. This API adopts the DataFrame from 
        Spark SQL in order to support a variety of data types.'''],
    ['''DataFrame supports many basic and structured types; 
        see the Spark SQL datatype reference for a list of 
        supported types. In addition to the types listed in 
        the Spark SQL guide, DataFrame can use ML Vector types.'''],
    ['''A DataFrame can be created either implicitly or 
        explicitly from a regular RDD. See the code examples 
        below and the Spark SQL programming guide for examples.'''],
    ['''Columns in a DataFrame are named. The code examples 
        below use names such as "text," "features," and "label."''']
], ['input'])

First, we need to tokenize this text.

In [32]:
tokenizer = ft.RegexTokenizer(
    inputCol='input', 
    outputCol='input_arr', 
    pattern=r'\s+|[,."]')

The output of the tokenizer looks similar to this.

In [33]:
tok = tokenizer \
    .transform(text_data) \
    .select('input_arr') 

tok.take(1)

[Row(input_arr=[u'machine', u'learning', u'can', u'be', u'applied', u'to', u'a', u'wide', u'variety', u'of', u'data', u'types', u'such', u'as', u'vectors', u'text', u'images', u'and', u'structured', u'data', u'this', u'api', u'adopts', u'the', u'dataframe', u'from', u'spark', u'sql', u'in', u'order', u'to', u'support', u'a', u'variety', u'of', u'data', u'types'])]
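
The same splitting can be reproduced with Python's `re` module, which makes it easier to experiment with the pattern. By default `RegexTokenizer` treats the pattern as the gaps between tokens, lower-cases the text, and drops empty tokens; the sketch below mimics those defaults. The function `regex_tokenize` and the sample sentence are made up for this illustration.

```python
import re

def regex_tokenize(text, pattern=r'\s+|[,."]'):
    """Split on whitespace, commas, periods, and double quotes; drop empty tokens."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]

print(regex_tokenize('Machine learning can be applied to "text," and images.'))
```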

Next, remove the stop words with the `StopWordsRemover(...)`.

In [34]:
stopwords = ft.StopWordsRemover(
    inputCol=tokenizer.getOutputCol(), 
    outputCol='input_stop')

The output of the method looks as follows.

In [35]:
stopwords.transform(tok).select('input_stop').take(1)

[Row(input_stop=[u'machine', u'learning', u'applied', u'wide', u'variety', u'data', u'types', u'vectors', u'text', u'images', u'structured', u'data', u'api', u'adopts', u'dataframe', u'spark', u'sql', u'order', u'support', u'variety', u'data', u'types'])]

Build the `NGram` model and the `Pipeline`.

In [36]:
ngram = ft.NGram(n=2, 
    inputCol=stopwords.getOutputCol(), 
    outputCol="nGrams")

pipeline = Pipeline(stages=[tokenizer, stopwords, ngram])

Now that we have the `pipeline`, we proceed in much the same fashion as before.

In [37]:
data_ngram = pipeline \
    .fit(text_data) \
    .transform(text_data)
    
data_ngram.select('nGrams').take(1)

[Row(nGrams=[u'machine learning', u'learning applied', u'applied wide', u'wide variety', u'variety data', u'data types', u'types vectors', u'vectors text', u'text images', u'images structured', u'structured data', u'data api', u'api adopts', u'adopts dataframe', u'dataframe spark', u'spark sql', u'sql order', u'order support', u'support variety', u'variety data', u'data types'])]
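
What `NGram` does with `n=2` is easy to reproduce in plain Python: slide a window of two tokens over the list and join each window with a space. The helper `ngrams` below is written for this illustration only, and the token list is the beginning of the stop-word-filtered output shown earlier.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with a single space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['machine', 'learning', 'applied', 'wide', 'variety']
print(ngrams(tokens))
```

A list with fewer than `n` tokens yields no n-grams at all.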

That's it. We have our n-grams and can use them in further NLP processing.

#### Discretize continuous variables

It is sometimes useful to *band* the values into discrete buckets.

In [38]:
import numpy as np

x = np.arange(0, 100)
x = x / 100.0 * np.pi * 4
y = x * np.sin(x / 1.764) + 20.1234

schema = typ.StructType([
    typ.StructField('continuous_var', 
                    typ.DoubleType(), 
                    False
   )
])

data = spark.createDataFrame([[float(e), ] for e in y], schema=schema)

Use the `QuantileDiscretizer` model to split our continuous variable into 5 buckets (see the `numBuckets` parameter).

In [39]:
discretizer = ft.QuantileDiscretizer(
    numBuckets=5, 
    inputCol='continuous_var', 
    outputCol='discretized')

Let's see what we got.

In [40]:
data_discretized = discretizer.fit(data).transform(data)

data_discretized \
    .groupby('discretized')\
    .mean('continuous_var')\
    .sort('discretized')\
    .collect()

[Row(discretized=0.0, avg(continuous_var)=12.314360733007915),
 Row(discretized=1.0, avg(continuous_var)=16.046244793347466),
 Row(discretized=2.0, avg(continuous_var)=20.250799478352587),
 Row(discretized=3.0, avg(continuous_var)=22.040988218437327),
 Row(discretized=4.0, avg(continuous_var)=24.264824657002865)]
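
The idea behind quantile-based discretization can be sketched in plain Python: pick split points so that each bucket receives roughly the same number of observations, then look up the bucket of each value. This is an illustration under simplifying assumptions, not Spark's algorithm: `QuantileDiscretizer` estimates the split points with an approximate quantile method, so its boundaries can differ slightly. The helpers `quantile_splits` and `discretize` and the data (the numbers 0 to 99) are made up for this example.

```python
from bisect import bisect_right

def quantile_splits(values, num_buckets):
    """Exact split points that cut the sorted values into equal-sized buckets."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]

def discretize(value, splits):
    """Bucket index; a value equal to a split point goes into the upper bucket."""
    return float(bisect_right(splits, value))

values = [float(v) for v in range(100)]
splits = quantile_splits(values, num_buckets=5)
buckets = [discretize(v, splits) for v in values]

print(splits)
print([buckets.count(float(b)) for b in range(5)])
```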

#### Standardizing continuous variables

Create a vector representation of our continuous variable (as it is only a single float).


In [41]:
vectorizer = ft.VectorAssembler(
    inputCols=['continuous_var'], 
    outputCol= 'continuous_vec')

Build a `normalizer` and a `pipeline`.

In [42]:
normalizer = ft.StandardScaler(
    inputCol=vectorizer.getOutputCol(), 
    outputCol='normalized', 
    withMean=True,
    withStd=True
)

pipeline = Pipeline(stages=[vectorizer, normalizer])
data_standardized = pipeline.fit(data).transform(data)
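
With `withMean=True` and `withStd=True`, the scaler subtracts the column mean and divides by the standard deviation, so the result has zero mean and unit standard deviation. The plain-Python sketch below shows the arithmetic for a single column, using the sample standard deviation (dividing by n - 1). The helper `standardize` and its three input values are made up for this illustration.

```python
import math

def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / float(n)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

print(standardize([1.0, 2.0, 3.0]))
```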

### Classification

We will now use the `RandomForestClassifier` to model the chances of survival for an infant.

First, we need to cast the label feature to `DoubleType`.

In [43]:
import pyspark.sql.functions as func

births = births.withColumn(
    'INFANT_ALIVE_AT_REPORT', 
    func.col('INFANT_ALIVE_AT_REPORT').cast(typ.DoubleType())
)

births_train, births_test = births \
    .randomSplit([0.7, 0.3], seed=666)

We are ready to build our model.

In [44]:
classifier = cl.RandomForestClassifier(
    numTrees=5, 
    maxDepth=5, 
    labelCol='INFANT_ALIVE_AT_REPORT')

pipeline = Pipeline(
    stages=[
        encoder,
        featuresCreator, 
        classifier])

model = pipeline.fit(births_train)
test = model.transform(births_test)

Let's now see how the `RandomForestClassifier` model performs compared to the `LogisticRegression`.

In [45]:
evaluator = ev.BinaryClassificationEvaluator(
    labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test, 
    {evaluator.metricName: "areaUnderROC"}))
print(evaluator.evaluate(test, 
    {evaluator.metricName: "areaUnderPR"}))

0.756841379689
0.622995362273


Let's test how well a single tree would do, then.

In [46]:
classifier = cl.DecisionTreeClassifier(
    maxDepth=5, 
    labelCol='INFANT_ALIVE_AT_REPORT')
pipeline = Pipeline(stages=[
        encoder,
        featuresCreator, 
        classifier]
)

model = pipeline.fit(births_train)
test = model.transform(births_test)

evaluator = ev.BinaryClassificationEvaluator(
    labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test, 
     {evaluator.metricName: "areaUnderROC"}))
print(evaluator.evaluate(test, 
     {evaluator.metricName: "areaUnderPR"}))

0.704021582835
0.709608191034


### Clustering

In this example we will use the k-means model to find similarities in the births data.

In [47]:
import pyspark.ml.clustering as clus

kmeans = clus.KMeans(k = 5, 
    featuresCol='features')

pipeline = Pipeline(stages=[
        encoder,
        featuresCreator, 
        kmeans]
)

model = pipeline.fit(births_train)

Having estimated the model, let's see if we can find some differences between clusters.

In [48]:
test = model.transform(births_test)

test \
    .groupBy('prediction') \
    .agg({
        '*': 'count', 
        'MOTHER_HEIGHT_IN': 'avg'
    }).collect()

[Row(prediction=1, avg(MOTHER_HEIGHT_IN)=83.91154791154791, count(1)=407),
 Row(prediction=3, avg(MOTHER_HEIGHT_IN)=67.69473684210526, count(1)=475),
 Row(prediction=4, avg(MOTHER_HEIGHT_IN)=63.90993407084591, count(1)=8949),
 Row(prediction=2, avg(MOTHER_HEIGHT_IN)=66.64658634538152, count(1)=249),
 Row(prediction=0, avg(MOTHER_HEIGHT_IN)=65.3889041472123, count(1)=3641)]

In the field of NLP, problems such as topic extraction rely on clustering to detect documents with similar topics. First, let's create our dataset.

In [49]:
text_data = spark.createDataFrame([
    ['''To make a computer do anything, you have to write a 
    computer program. To write a computer program, you have 
    to tell the computer, step by step, exactly what you want 
    it to do. The computer then "executes" the program, 
    following each step mechanically, to accomplish the end 
    goal. When you are telling the computer what to do, you 
    also get to choose how it's going to do it. That's where 
    computer algorithms come in. The algorithm is the basic 
    technique used to get the job done. Let's follow an 
    example to help get an understanding of the algorithm 
    concept.'''],
    ['''Laptop computers use batteries to run while not 
    connected to mains. When we overcharge or overheat 
    lithium ion batteries, the materials inside start to 
    break down and produce bubbles of oxygen, carbon dioxide, 
    and other gases. Pressure builds up, and the hot battery 
    swells from a rectangle into a pillow shape. Sometimes 
    the phone involved will operate afterwards. Other times 
    it will die. And occasionally—kapow! To see what's 
    happening inside the battery when it swells, the CLS team 
    used an x-ray technology called computed tomography.'''],
    ['''This technology describes a technique where touch 
    sensors can be placed around any side of a device 
    allowing for new input sources. The patent also notes 
    that physical buttons (such as the volume controls) could 
    be replaced by these embedded touch sensors. In essence 
    Apple could drop the current buttons and move towards 
    touch-enabled areas on the device for the existing UI. It 
    could also open up areas for new UI paradigms, such as 
    using the back of the smartphone for quick scrolling or 
    page turning.'''],
    ['''The National Park Service is a proud protector of 
    America’s lands. Preserving our land not only safeguards 
    the natural environment, but it also protects the 
    stories, cultures, and histories of our ancestors. As we 
    face the increasingly dire consequences of climate 
    change, it is imperative that we continue to expand 
    America’s protected lands under the oversight of the 
    National Park Service. Doing so combats climate change 
    and allows all American’s to visit, explore, and learn 
    from these treasured places for generations to come. It 
    is critical that President Obama acts swiftly to preserve 
    land that is at risk of external threats before the end 
    of his term as it has become blatantly clear that the 
    next administration will not hold the same value for our 
    environment over the next four years.'''],
    ['''The National Park Foundation, the official charitable 
    partner of the National Park Service, enriches America’s 
    national parks and programs through the support of 
    private citizens, park lovers, stewards of nature, 
    history enthusiasts, and wilderness adventurers. 
    Chartered by Congress in 1967, the Foundation grew out of 
    a legacy of park protection that began over a century 
    ago, when ordinary citizens took action to establish and 
    protect our national parks. Today, the National Park 
    Foundation carries on the tradition of early park 
    advocates, big thinkers, doers and dreamers—from John 
    Muir and Ansel Adams to President Theodore Roosevelt.'''],
    ['''Australia has over 500 national parks. Over 28 
    million hectares of land is designated as national 
    parkland, accounting for almost four per cent of 
    Australia's land areas. In addition, a further six per 
    cent of Australia is protected and includes state 
    forests, nature parks and conservation reserves.National 
    parks are usually large areas of land that are protected 
    because they have unspoilt landscapes and a diverse 
    number of native plants and animals. This means that 
    commercial activities such as farming are prohibited and 
    human activity is strictly monitored.''']
], ['documents'])

Next, we will once again use the `RegexTokenizer` and `StopWordsRemover` transformers.

In [50]:
tokenizer = ft.RegexTokenizer(
    inputCol='documents', 
    outputCol='input_arr', 
    pattern=r'\s+|[,."]')

stopwords = ft.StopWordsRemover(
    inputCol=tokenizer.getOutputCol(), 
    outputCol='input_stop')
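
To see what these two stages do to a single string, here is a rough plain-Python approximation. It assumes the `RegexTokenizer` defaults (the pattern describes the gaps between tokens, and text is lowercased) and uses a tiny hypothetical stop-word list, whereas `StopWordsRemover` ships with a much longer default list. The helper names are made up for this sketch.

```python
import re

# Same separator pattern as in the cell above, written as a raw string.
PATTERN = r'\s+|[,."]'

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {'the', 'to', 'a', 'is'}


def regex_tokenize(text, pattern=PATTERN):
    """Lowercase the text, split on the pattern, and drop empty tokens."""
    return [tok for tok in re.split(pattern, text.lower()) if tok]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]


sample = 'The computer then "executes" the program, step by step.'
tokens = regex_tokenize(sample)
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```

Whitespace, commas, periods, and double quotes all act as separators, which is why `"executes"` comes out as a bare word.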

Next in our pipeline is the `CountVectorizer`.

In [51]:
stringIndexer = ft.CountVectorizer(
    inputCol=stopwords.getOutputCol(), 
    outputCol="input_indexed")

tokenized = stopwords \
    .transform(
        tokenizer\
            .transform(text_data)
    )
    
stringIndexer \
    .fit(tokenized)\
    .transform(tokenized)\
    .select('input_indexed')\
    .take(2)

[Row(input_indexed=SparseVector(257, {2: 7.0, 5: 1.0, 8: 3.0, 9: 3.0, 10: 3.0, 15: 2.0, 16: 2.0, 20: 1.0, 23: 1.0, 34: 1.0, 39: 1.0, 43: 1.0, 50: 1.0, 61: 1.0, 69: 1.0, 90: 1.0, 96: 1.0, 97: 1.0, 98: 1.0, 125: 1.0, 126: 1.0, 129: 1.0, 130: 1.0, 133: 1.0, 136: 1.0, 141: 1.0, 156: 1.0, 185: 1.0, 201: 1.0, 203: 1.0, 211: 1.0, 227: 1.0, 231: 1.0})),
 Row(input_indexed=SparseVector(257, {19: 2.0, 21: 1.0, 31: 2.0, 36: 2.0, 39: 1.0, 42: 2.0, 46: 1.0, 49: 1.0, 53: 1.0, 54: 1.0, 55: 1.0, 56: 1.0, 60: 1.0, 62: 1.0, 70: 1.0, 72: 1.0, 77: 1.0, 87: 1.0, 91: 1.0, 92: 1.0, 106: 1.0, 108: 1.0, 109: 1.0, 114: 1.0, 115: 1.0, 119: 1.0, 145: 1.0, 151: 1.0, 153: 1.0, 164: 1.0, 175: 1.0, 184: 1.0, 189: 1.0, 192: 1.0, 199: 1.0, 212: 1.0, 220: 1.0, 223: 1.0, 225: 1.0, 236: 1.0, 242: 1.0, 246: 1.0, 247: 1.0, 248: 1.0, 251: 1.0, 253: 1.0, 255: 1.0}))]
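
Each `SparseVector(257, {...})` above maps a vocabulary index to the number of times that term occurs in the document, with 257 being the vocabulary size. The sketch below mimics that idea in plain Python on hypothetical token lists. The helper names are invented, and the alphabetical tie-break is an assumption of this sketch (the order of equally frequent terms in Spark should not be relied on).

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each term to an index, most frequent terms first."""
    totals = Counter(tok for doc in docs for tok in doc)
    # Assumption: ties are broken alphabetically to keep the sketch deterministic.
    ordered = sorted(totals, key=lambda term: (-totals[term], term))
    return {term: idx for idx, term in enumerate(ordered)}


def vectorize(doc, vocab):
    """Return a sparse {index: count} mapping for one tokenized document."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return {vocab[term]: float(count) for term, count in counts.items()}


# Hypothetical tokenized documents (not the notebook's text_data).
docs = [
    ['computer', 'program', 'computer'],
    ['park', 'computer'],
]
vocab = fit_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```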

We will use the `LDA` (Latent Dirichlet Allocation) model to extract the topics.

In [52]:
clustering = clus.LDA(k=2, optimizer='online', featuresCol=stringIndexer.getOutputCol())

Now let's put the pieces together.

In [53]:
pipeline = Pipeline(stages=[
        tokenizer, 
        stopwords,
        stringIndexer, 
        clustering]
)

Let's see if we have properly uncovered the topics.

In [54]:
topics = pipeline \
    .fit(text_data) \
    .transform(text_data)

topics.select('topicDistribution').collect()

[Row(topicDistribution=DenseVector([0.0247, 0.9753])),
 Row(topicDistribution=DenseVector([0.1474, 0.8526])),
 Row(topicDistribution=DenseVector([0.0273, 0.9727])),
 Row(topicDistribution=DenseVector([0.9917, 0.0083])),
 Row(topicDistribution=DenseVector([0.0093, 0.9907])),
 Row(topicDistribution=DenseVector([0.0156, 0.9844]))]
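
One simple way to read this output is to assign each document to its most probable topic. The sketch below does that in plain Python, with the values copied from the output above; `dominant_topic` is a hypothetical helper name. Topic indices produced by LDA are arbitrary, and a different run may give different numbers.

```python
# Topic distributions copied from the output above, one row per document.
topic_distributions = [
    [0.0247, 0.9753],
    [0.1474, 0.8526],
    [0.0273, 0.9727],
    [0.9917, 0.0083],
    [0.0093, 0.9907],
    [0.0156, 0.9844],
]


def dominant_topic(distribution):
    """Index of the topic with the highest probability."""
    return max(range(len(distribution)), key=lambda k: distribution[k])


dominant = [dominant_topic(d) for d in topic_distributions]
print(dominant)
```

In this particular run only the fourth document lands in topic 0, while the remaining five share topic 1.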

### Regression

In this section we will try to predict the `MOTHER_WEIGHT_GAIN`.

In [55]:
features = ['MOTHER_AGE_YEARS','MOTHER_HEIGHT_IN',
            'MOTHER_PRE_WEIGHT','DIABETES_PRE',
            'DIABETES_GEST','HYP_TENS_PRE', 
            'HYP_TENS_GEST', 'PREV_BIRTH_PRETERM',
            'CIG_BEFORE','CIG_1_TRI', 'CIG_2_TRI', 
            'CIG_3_TRI'
           ]

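First, we will collate the features together and use the `ChiSqSelector` to select only the top 6 most important ones. Note that `features[1:]` skips the first entry, `MOTHER_AGE_YEARS`, so it is not part of the assembled vector.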

In [56]:
featuresCreator = ft.VectorAssembler(
    inputCols=features[1:], 
    outputCol='features'
)

selector = ft.ChiSqSelector(
    numTopFeatures=6, 
    outputCol="selectedFeatures", 
    labelCol='MOTHER_WEIGHT_GAIN'
)
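
As a rough illustration of the idea behind chi-square feature selection, the sketch below computes the Pearson chi-square statistic between each categorical feature column and a label, then keeps the highest-scoring columns. The function names and the toy data are hypothetical. Ranking by the raw statistic is a simplification made here, and Spark's own ranking criterion may differ (for example, it may rank by p-value).

```python
from collections import Counter


def chi_square(feature, label):
    """Pearson chi-square statistic for two categorical sequences."""
    n = float(len(feature))
    observed = Counter(zip(feature, label))
    feature_totals = Counter(feature)
    label_totals = Counter(label)
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            # Expected count if the feature and the label were independent.
            expected = f_count * l_count / n
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat


def select_top_features(columns, label, num_top):
    """Return the indices of the num_top highest-scoring columns and all scores."""
    scores = [chi_square(col, label) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: -scores[i])
    return sorted(ranked[:num_top]), scores


# Hypothetical toy data: column 0 tracks the label, column 1 is unrelated.
columns = [[0, 0, 1, 1], [0, 1, 0, 1]]
label = [0, 0, 1, 1]
selected, scores = select_top_features(columns, label, num_top=1)
print(selected)
print(scores)
```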

To predict the weight gain, we will use the gradient-boosted trees regressor.

In [57]:
import pyspark.ml.regression as reg

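regressor = reg.GBTRegressor(
    maxIter=15, 
    maxDepth=3,
    # featuresCol is left at its default ('features'), so as written the
    # regressor reads the assembled vector, not 'selectedFeatures'.
    labelCol='MOTHER_WEIGHT_GAIN')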

Finally, we once again put it all together into a `Pipeline`.

In [58]:
pipeline = Pipeline(stages=[
        featuresCreator, 
        selector,
        regressor])

weightGain = pipeline.fit(births_train)

Having created the `weightGain` model, let's see if it performs well on our testing data.

In [59]:
evaluator = ev.RegressionEvaluator(
    predictionCol="prediction", 
    labelCol='MOTHER_WEIGHT_GAIN')

print(evaluator.evaluate(
     weightGain.transform(births_test), 
    {evaluator.metricName: 'r2'}))

0.489975167759
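
For reference, the `r2` metric is the coefficient of determination, 1 - SS_res / SS_tot: the share of the label's variance that the predictions account for. The sketch below computes it in plain Python on hypothetical numbers. `r_squared` is an illustrative helper, not part of the evaluator's API.

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / float(len(labels))
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    # Total sum of squares: squared deviations from the mean label.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels and predictions (not taken from the births data).
labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.1, 1.9, 3.2, 3.8]
score = r_squared(labels, predictions)
print(score)
```

A value of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean label.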
