# Based on Drabas & Lee  -- Learning PySpark
## Resilient Distributed Datasets
### The MLlib package
#### Start the Jupyter notebook from its own folder; otherwise Python might not find some of the files it needs to load.
Set the kernel to Python 2 or Python [default].

## Load and transform the data

Just like in the previous chapter, we first specify the schema of our dataset.

In [1]:
sc

In [2]:
# you only need to run this cell if the above spark context is not available when you start the notebook

if 0:
    import findspark
    findspark.init()
    import pyspark

    from pyspark.context import SparkContext
    from pyspark.sql.session import SparkSession
    sc = SparkContext('local')
    spark = SparkSession(sc)

MLlib stands for Machine Learning Library. 

MLlib is now in maintenance mode; that is, it is no longer being actively developed and might be deprecated later.

MLlib operates on RDDs.

The documentation for MLlib can be found here: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html

Starting with Spark 2.0, ML is the main machine learning library; it operates on DataFrames instead of RDDs.



### Application:
* Predict survival chances of infants using logistic regression
* Select the most predictive features and train a random forest model

### Overview of the MLlib package

At a high level, MLlib exposes three core machine learning functionalities:

* **Data preparation**: Feature extraction, transformation, selection, hashing of categorical features, and some natural language processing methods
* **Machine learning algorithms**: Some popular and advanced regression, classification, and clustering algorithms are implemented
* **Utilities**: Statistical methods such as descriptive statistics, chi-square testing, 
linear algebra (sparse and dense matrices and vectors), and model evaluation methods

We will use a portion of the US 2014 and 2015 birth data that we downloaded from http://www.cdc.gov/nchs/data_access/vitalstatsonline.htm.

From the total of 300 variables, we selected **85 features** that we will use to build our models.

Also, out of the total of almost 7.99 million records, we selected a balanced sample of **45,429 records**: 22,080 records where infants were reported deceased and 23,349 records with infants alive.

The dataset we will use in this chapter can be downloaded from http://www.tomdrabas.com/data/LearningPySpark/births_train.csv.gz.

In [3]:
import pyspark.sql.types as typ

We first specify the schema of our dataset:

In [4]:
labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.StringType()),
    ('BIRTH_YEAR', typ.IntegerType()),
    ('BIRTH_MONTH', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('MOTHER_RACE_6CODE', typ.StringType()),
    ('MOTHER_EDUCATION', typ.StringType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('FATHER_EDUCATION', typ.StringType()),
    ('MONTH_PRECARE_RECODE', typ.StringType()),
    ('CIG_BEFORE', typ.IntegerType()),
    ('CIG_1_TRI', typ.IntegerType()),
    ('CIG_2_TRI', typ.IntegerType()),
    ('CIG_3_TRI', typ.IntegerType()),
    ('MOTHER_HEIGHT_IN', typ.IntegerType()),
    ('MOTHER_BMI_RECODE', typ.IntegerType()),
    ('MOTHER_PRE_WEIGHT', typ.IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
    ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
    ('DIABETES_PRE', typ.StringType()),
    ('DIABETES_GEST', typ.StringType()),
    ('HYP_TENS_PRE', typ.StringType()),
    ('HYP_TENS_GEST', typ.StringType()),
    ('PREV_BIRTH_PRETERM', typ.StringType()),
    ('NO_RISK', typ.StringType()),
    ('NO_INFECTIONS_REPORTED', typ.StringType()),
    ('LABOR_IND', typ.StringType()),
    ('LABOR_AUGM', typ.StringType()),
    ('STEROIDS', typ.StringType()),
    ('ANTIBIOTICS', typ.StringType()),
    ('ANESTHESIA', typ.StringType()),
    ('DELIV_METHOD_RECODE_COMB', typ.StringType()),
    ('ATTENDANT_BIRTH', typ.StringType()),
    ('APGAR_5', typ.IntegerType()),
    ('APGAR_5_RECODE', typ.StringType()),
    ('APGAR_10', typ.IntegerType()),
    ('APGAR_10_RECODE', typ.StringType()),
    ('INFANT_SEX', typ.StringType()),
    ('OBSTETRIC_GESTATION_WEEKS', typ.IntegerType()),
    ('INFANT_WEIGHT_GRAMS', typ.IntegerType()),
    ('INFANT_ASSIST_VENTI', typ.StringType()),
    ('INFANT_ASSIST_VENTI_6HRS', typ.StringType()),
    ('INFANT_NICU_ADMISSION', typ.StringType()),
    ('INFANT_SURFACANT', typ.StringType()),
    ('INFANT_ANTIBIOTICS', typ.StringType()),
    ('INFANT_SEIZURES', typ.StringType()),
    ('INFANT_NO_ABNORMALITIES', typ.StringType()),
    ('INFANT_ANCEPHALY', typ.StringType()),
    ('INFANT_MENINGOMYELOCELE', typ.StringType()),
    ('INFANT_LIMB_REDUCTION', typ.StringType()),
    ('INFANT_DOWN_SYNDROME', typ.StringType()),
    ('INFANT_SUSPECTED_CHROMOSOMAL_DISORDER', typ.StringType()),
    ('INFANT_NO_CONGENITAL_ANOMALIES_CHECKED', typ.StringType()),
    ('INFANT_BREASTFED', typ.StringType())
]



In [5]:
schema = typ.StructType([
        typ.StructField(e[0], e[1], False) for e in labels
    ])

In [6]:
for field in schema:
    print(field)


StructField(INFANT_ALIVE_AT_REPORT,StringType,false)
StructField(BIRTH_YEAR,IntegerType,false)
StructField(BIRTH_MONTH,IntegerType,false)
StructField(BIRTH_PLACE,StringType,false)
StructField(MOTHER_AGE_YEARS,IntegerType,false)
StructField(MOTHER_RACE_6CODE,StringType,false)
StructField(MOTHER_EDUCATION,StringType,false)
StructField(FATHER_COMBINED_AGE,IntegerType,false)
StructField(FATHER_EDUCATION,StringType,false)
StructField(MONTH_PRECARE_RECODE,StringType,false)
StructField(CIG_BEFORE,IntegerType,false)
StructField(CIG_1_TRI,IntegerType,false)
StructField(CIG_2_TRI,IntegerType,false)
StructField(CIG_3_TRI,IntegerType,false)
StructField(MOTHER_HEIGHT_IN,IntegerType,false)
StructField(MOTHER_BMI_RECODE,IntegerType,false)
StructField(MOTHER_PRE_WEIGHT,IntegerType,false)
StructField(MOTHER_DELIVERY_WEIGHT,IntegerType,false)
StructField(MOTHER_WEIGHT_GAIN,IntegerType,false)
StructField(DIABETES_PRE,StringType,false)
StructField(DIABETES_GEST,StringType,false)
StructField(HYP_TENS_PRE,S

Even though MLlib is designed with RDDs and DStreams in focus, for ease of transforming the data we will read the data and convert it to a DataFrame.

Before loading the data, we check that the file is available in HDFS.

In [7]:
#!hdfs dfs -mkdir -p /hdfs_data
#!hdfs dfs -ls /hdfs_data
#!hdfs dfs -put data/births_train.csv.gz /hdfs_data
!hdfs fsck /hdfs_data/births_train.csv.gz

Connecting to namenode via http://ec2-18-223-209-87.us-east-2.compute.amazonaws.com:50070/fsck?ugi=ec2-user&path=%2Fhdfs_data%2Fbirths_train.csv.gz
FSCK started by ec2-user (auth:SIMPLE) from /172.31.5.183 for path /hdfs_data/births_train.csv.gz at Mon Feb 11 05:35:06 UTC 2019
.
/hdfs_data/births_train.csv.gz:  Under replicated BP-663532545-172.31.27.125-1549216637007:blk_1073741838_1014. Target Replicas is 3 but found 2 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
Status: HEALTHY
 Total size:	931988 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 931988 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	1 (100.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		1 (33.333332 %)
 Number of data-nodes:		2
 Number of racks:		1
FSCK ended at M

Next, we load the data. The .read.csv(...) method can read either uncompressed 
or (as in our case) GZipped comma-separated values. The header parameter set 
to True indicates that the first row contains the header, and we use the schema to 
specify the correct data types:

In [8]:
#births = spark.read.csv('data/births_train.csv.gz', header=True, schema=schema)
births = spark.read.csv('/hdfs_data/births_train.csv.gz', header=True, schema=schema)
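To make the roles of the header row and the schema more concrete, here is a minimal Spark-free sketch that reads a gzipped CSV using only the Python standard library. The two-column `toy_schema` and the in-memory file are made up for illustration; this only mimics what `.read.csv(...)` does for us and is not how Spark implements it.

```python
import csv
import gzip
import io

# Hypothetical two-column schema: (column name, Python type to cast to).
toy_schema = [('INFANT_ALIVE_AT_REPORT', str), ('MOTHER_AGE_YEARS', int)]

# Build a small gzipped CSV in memory so the example needs no files.
raw = 'INFANT_ALIVE_AT_REPORT,MOTHER_AGE_YEARS\nN,29\nY,22\n'
buffer = io.BytesIO()
with gzip.open(buffer, 'wt', newline='') as handle:
    handle.write(raw)
buffer.seek(0)

# Read it back: the first row is treated as the header,
# and the schema decides how each string value is cast.
with gzip.open(buffer, 'rt', newline='') as handle:
    reader = csv.DictReader(handle)
    toy_rows = [
        {name: cast(record[name]) for name, cast in toy_schema}
        for record in reader
    ]

print(toy_rows)
```

Without the schema, every value would stay a string, which is why we pass one explicitly when loading the births data.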

In [9]:
print(births.take(1))

[Row(INFANT_ALIVE_AT_REPORT=u'N', BIRTH_YEAR=2015, BIRTH_MONTH=2, BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=29, MOTHER_RACE_6CODE=u'3', MOTHER_EDUCATION=u'9', FATHER_COMBINED_AGE=99, FATHER_EDUCATION=u'9', MONTH_PRECARE_RECODE=u'4', CIG_BEFORE=99, CIG_1_TRI=99, CIG_2_TRI=99, CIG_3_TRI=99, MOTHER_HEIGHT_IN=99, MOTHER_BMI_RECODE=9, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE=u'N', DIABETES_GEST=u'N', HYP_TENS_PRE=u'N', HYP_TENS_GEST=u'N', PREV_BIRTH_PRETERM=u'N', NO_RISK=u'1', NO_INFECTIONS_REPORTED=u'1', LABOR_IND=u'N', LABOR_AUGM=u'N', STEROIDS=u'N', ANTIBIOTICS=u'Y', ANESTHESIA=u'N', DELIV_METHOD_RECODE_COMB=u'2', ATTENDANT_BIRTH=u'1', APGAR_5=4, APGAR_5_RECODE=u'2', APGAR_10=3, APGAR_10_RECODE=u'1', INFANT_SEX=u'F', OBSTETRIC_GESTATION_WEEKS=35, INFANT_WEIGHT_GRAMS=2770, INFANT_ASSIST_VENTI=u'N', INFANT_ASSIST_VENTI_6HRS=u'N', INFANT_NICU_ADMISSION=u'Y', INFANT_SURFACANT=u'N', INFANT_ANTIBIOTICS=u'N', INFANT_SEIZURES=u'N', INFANT_NO_ABNORMALITIES=u

There are plenty of features in our dataset that are strings. These are mostly categorical variables that we need to somehow convert to a numeric form.

Specify our recode dictionary.

Our goal in this chapter is to predict whether the ``INFANT_ALIVE_AT_REPORT`` is 
either 1 or 0. Thus, we will drop all of the features that relate to the infant and will 
try to predict the infant's chances of surviving only based on the features related to 
its mother, father, and the place of birth:

In [10]:
selected_features = [
    'INFANT_ALIVE_AT_REPORT', 
    'BIRTH_PLACE', 
    'MOTHER_AGE_YEARS', 
    'FATHER_COMBINED_AGE', 
    'CIG_BEFORE', 
    'CIG_1_TRI', 
    'CIG_2_TRI', 
    'CIG_3_TRI', 
    'MOTHER_HEIGHT_IN', 
    'MOTHER_PRE_WEIGHT', 
    'MOTHER_DELIVERY_WEIGHT', 
    'MOTHER_WEIGHT_GAIN', 
    'DIABETES_PRE', 
    'DIABETES_GEST', 
    'HYP_TENS_PRE', 
    'HYP_TENS_GEST', 
    'PREV_BIRTH_PRETERM'
]

births_trimmed = births.select(selected_features)

In [11]:
births_trimmed.take(1)

[Row(INFANT_ALIVE_AT_REPORT=u'N', BIRTH_PLACE=u'1', MOTHER_AGE_YEARS=29, FATHER_COMBINED_AGE=99, CIG_BEFORE=99, CIG_1_TRI=99, CIG_2_TRI=99, CIG_3_TRI=99, MOTHER_HEIGHT_IN=99, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE=u'N', DIABETES_GEST=u'N', HYP_TENS_PRE=u'N', HYP_TENS_GEST=u'N', PREV_BIRTH_PRETERM=u'N')]

In our dataset, there are plenty of features with Yes/No/Unknown values; we will only code Yes to 1; everything else will be set to 0.

In [12]:
recode_dictionary = {
    'YNU': {
        'Y': 1,
        'N': 0,
        'U': 0
    }
}

There is also a small problem with how the number of cigarettes smoked by the mother was coded: 0 means the mother smoked no cigarettes before or during the pregnancy, values from 1 to 97 state the actual number of cigarettes smoked, 98 indicates 98 or more, and 99 marks an unknown value. We will assume the unknown is 0 and recode accordingly.

Specify the recoding methods.

In [13]:
import pyspark.sql.functions as func

def recode(col, key):        
    return recode_dictionary[key][col] 

def correct_cig(feat):
    return func \
        .when(func.col(feat) != 99, func.col(feat))\
        .otherwise(0)

rec_integer = func.udf(recode, typ.IntegerType())

The recode function looks up the raw value (col) in the mapping stored under the given 
key in recode_dictionary and returns the recoded value. 

The correct_cig function returns the value of the feature feat when it is not equal 
to 99, and 0 otherwise.


We cannot use the recode function directly on a DataFrame; it needs to be converted 
to a UDF that Spark will understand. 

The rec_integer is such a function: by passing 
our specified recode function and specifying the return value data type, we can use 
it then to encode our Yes/No/Unknown features.
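The recoding logic itself can be tried out without Spark. The sketch below repeats the recode function on plain Python values; `correct_cig_value` is a hypothetical scalar counterpart of correct_cig, introduced only for this illustration.

```python
# Same mapping as recode_dictionary in the notebook.
recode_dictionary = {'YNU': {'Y': 1, 'N': 0, 'U': 0}}

def recode(col, key):
    # Look up the raw value `col` in the mapping stored under `key`.
    return recode_dictionary[key][col]

def correct_cig_value(value):
    # Hypothetical scalar version of correct_cig: 99 (unknown) becomes 0.
    return value if value != 99 else 0

recoded = [recode(v, 'YNU') for v in ['Y', 'N', 'U']]
cleaned = [correct_cig_value(v) for v in [0, 20, 98, 99]]

print(recoded)  # [1, 0, 0]
print(cleaned)  # [0, 20, 98, 0]
```

Note that recode raises a KeyError for any value missing from the mapping, so it should only be applied to columns that contain nothing but 'Y', 'N', and 'U'.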

First, we'll correct the features related to the number of cigarettes smoked:

In [14]:
births_transformed_tmp1 = births_trimmed \
    .withColumn('CIG_BEFORE', correct_cig('CIG_BEFORE'))\
    .withColumn('CIG_1_TRI', correct_cig('CIG_1_TRI'))\
    .withColumn('CIG_2_TRI', correct_cig('CIG_2_TRI'))\
    .withColumn('CIG_3_TRI', correct_cig('CIG_3_TRI'))

The .withColumn(...) method takes the name of the column as its first parameter and the transformation as the second one. 
We do not create new columns, but reuse the same ones instead.

Now we will focus on correcting the Yes/No/Unknown features. First, we will 
figure out which these are with the following snippet:

In [15]:
cols = [(col.name, col.dataType) for col in births_trimmed.schema]

YNU_cols = []

for i, s in enumerate(cols):
    if s[1] == typ.StringType():
        dis = births.select(s[0]) \
            .distinct() \
            .rdd \
            .map(lambda row: row[0]) \
            .collect()

        if 'Y' in dis:
            YNU_cols.append(s[0])

First, we created a list of tuples (cols) that hold column names and corresponding data types. Next, we loop through all of these and calculate distinct values of all string columns; if a 'Y' is within the returned list, we append the column name to the YNU_cols list.
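The same detection idea can be sketched on a plain list of dictionaries. The three `toy_rows` below are made up; the point is only to show the rule "a string column whose distinct values include 'Y'".

```python
# Made-up rows standing in for the DataFrame.
toy_rows = [
    {'BIRTH_PLACE': '1', 'DIABETES_PRE': 'N', 'MOTHER_AGE_YEARS': 29},
    {'BIRTH_PLACE': '3', 'DIABETES_PRE': 'Y', 'MOTHER_AGE_YEARS': 22},
    {'BIRTH_PLACE': '1', 'DIABETES_PRE': 'U', 'MOTHER_AGE_YEARS': 38},
]

ynu_cols = []
for name in toy_rows[0]:
    # Distinct values of this column across all rows.
    distinct = {row[name] for row in toy_rows}
    is_string_column = all(isinstance(v, str) for v in distinct)
    if is_string_column and 'Y' in distinct:
        ynu_cols.append(name)

print(ynu_cols)  # ['DIABETES_PRE']
```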

In [16]:
YNU_cols

['INFANT_ALIVE_AT_REPORT',
 'DIABETES_PRE',
 'DIABETES_GEST',
 'HYP_TENS_PRE',
 'HYP_TENS_GEST',
 'PREV_BIRTH_PRETERM']

DataFrames can transform the features *in bulk* while selecting features.

In [17]:
births.select([
        'INFANT_NICU_ADMISSION', 
        rec_integer(
            'INFANT_NICU_ADMISSION', func.lit('YNU')
        ) \
        .alias('INFANT_NICU_ADMISSION_RECODE')]
     ).take(5)

[Row(INFANT_NICU_ADMISSION=u'Y', INFANT_NICU_ADMISSION_RECODE=1),
 Row(INFANT_NICU_ADMISSION=u'Y', INFANT_NICU_ADMISSION_RECODE=1),
 Row(INFANT_NICU_ADMISSION=u'U', INFANT_NICU_ADMISSION_RECODE=0),
 Row(INFANT_NICU_ADMISSION=u'N', INFANT_NICU_ADMISSION_RECODE=0),
 Row(INFANT_NICU_ADMISSION=u'U', INFANT_NICU_ADMISSION_RECODE=0)]

We select the 'INFANT_NICU_ADMISSION' column and we pass the name of the feature to the rec_integer method. 

We also alias the newly transformed column as 'INFANT_NICU_ADMISSION_RECODE'. 

This way we will also confirm that our UDF works as intended.

So, to transform all the YNU_cols in one go, we will create a list of such transformations, as shown here:

In [20]:
exprs_YNU = [
    rec_integer(x, func.lit('YNU')).alias(x) 
    if x in YNU_cols 
    else x 
    for x in births_transformed_tmp1.columns
]

births_transformed_tmp2 = births_transformed_tmp1.select(exprs_YNU)

Let's check if we got it correctly.

In [21]:
births.select(YNU_cols[-5:]).show(5)

+------------+-------------+------------+-------------+------------------+
|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|
+------------+-------------+------------+-------------+------------------+
|           N|            N|           N|            N|                 N|
|           N|            N|           N|            N|                 N|
|           N|            N|           N|            N|                 N|
|           N|            N|           N|            N|                 Y|
|           N|            N|           N|            N|                 N|
+------------+-------------+------------+-------------+------------------+
only showing top 5 rows



In [22]:
births_transformed_tmp2.select(YNU_cols[-5:]).show(5)

+------------+-------------+------------+-------------+------------------+
|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|
+------------+-------------+------------+-------------+------------------+
|           0|            0|           0|            0|                 0|
|           0|            0|           0|            0|                 0|
|           0|            0|           0|            0|                 0|
|           0|            0|           0|            0|                 1|
|           0|            0|           0|            0|                 0|
+------------+-------------+------------+-------------+------------------+
only showing top 5 rows



Indeed, 'N' and 'U' were replaced with 0, and 'Y' was replaced with 1.

## Get to know your data

### Descriptive statistics

We will use the `colStats(...)` method.

In [23]:
import pyspark.mllib.stat as st
import numpy as np

numeric_cols = ['MOTHER_AGE_YEARS','FATHER_COMBINED_AGE',
                'CIG_BEFORE','CIG_1_TRI','CIG_2_TRI','CIG_3_TRI',
                'MOTHER_HEIGHT_IN','MOTHER_PRE_WEIGHT',
                'MOTHER_DELIVERY_WEIGHT','MOTHER_WEIGHT_GAIN'
               ]

In [25]:
numeric_rdd = births_transformed_tmp2\
                       .select(numeric_cols)\
                       .rdd \
                       .map(lambda row: [e for e in row])

In [26]:
mllib_stats = st.Statistics.colStats(numeric_rdd)

The method takes an RDD of data for which to calculate the descriptive statistics and returns 
a MultivariateStatisticalSummary object that contains the following descriptive 
statistics:
    
* count(): This holds a row count
* max(): This holds maximum value in the column
* mean(): This holds the value of the mean for the values in the column
* min(): This holds the minimum value in the column
* normL1(): This holds the value of the L1-Norm for the values in the column
* normL2(): This holds the value of the L2-Norm for the values in the column
* numNonzeros(): This holds the number of nonzero values in the column
* variance(): This holds the value of the variance for the values in the column
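To make these quantities concrete, the sketch below computes them for a single made-up column using only the standard library. It assumes that the reported variance is the unbiased sample variance (dividing by n - 1); that is our understanding of colStats, but it is worth checking against the MLlib documentation.

```python
import math
import statistics

# A made-up column of values.
column = [29, 22, 38]

summary = {
    'count': len(column),
    'max': max(column),
    'mean': statistics.mean(column),
    'min': min(column),
    'normL1': sum(abs(v) for v in column),            # sum of absolute values
    'normL2': math.sqrt(sum(v * v for v in column)),  # Euclidean norm
    'numNonzeros': sum(1 for v in column if v != 0),
    'variance': statistics.variance(column),          # sample variance (n - 1)
}

for name, value in summary.items():
    print(name, value)
```

Taking the square root of the variance, as we do in the next cell, gives the standard deviation.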

In [27]:
for col, m, v in zip(numeric_cols, 
                     mllib_stats.mean(), 
                     mllib_stats.variance()):
    print('{0}: \t{1:.2f} \t {2:.2f}'.format(col, m, np.sqrt(v)))

MOTHER_AGE_YEARS: 	28.30 	 6.08
FATHER_COMBINED_AGE: 	44.55 	 27.55
CIG_BEFORE: 	1.43 	 5.18
CIG_1_TRI: 	0.91 	 3.83
CIG_2_TRI: 	0.70 	 3.31
CIG_3_TRI: 	0.58 	 3.11
MOTHER_HEIGHT_IN: 	65.12 	 6.45
MOTHER_PRE_WEIGHT: 	214.50 	 210.21
MOTHER_DELIVERY_WEIGHT: 	223.63 	 180.01
MOTHER_WEIGHT_GAIN: 	30.74 	 26.23


For the categorical variables we will calculate the frequencies of their values.

In [28]:
categorical_cols = [e for e in births_transformed_tmp2.columns 
                    if e not in numeric_cols]

categorical_rdd = births_transformed_tmp2\
                       .select(categorical_cols)\
                       .rdd \
                       .map(lambda row: [e for e in row])
            
for i, col in enumerate(categorical_cols):
    agg = categorical_rdd \
        .groupBy(lambda row: row[i]) \
        .map(lambda row: (row[0], len(row[1])))
        
    print(col, sorted(agg.collect(), 
                      key=lambda el: el[1], 
                      reverse=True))

('INFANT_ALIVE_AT_REPORT', [(1, 23349), (0, 22080)])
('BIRTH_PLACE', [(u'1', 44558), (u'4', 327), (u'3', 224), (u'2', 136), (u'7', 91), (u'5', 74), (u'6', 11), (u'9', 8)])
('DIABETES_PRE', [(0, 44881), (1, 548)])
('DIABETES_GEST', [(0, 43451), (1, 1978)])
('HYP_TENS_PRE', [(0, 44348), (1, 1081)])
('HYP_TENS_GEST', [(0, 43302), (1, 2127)])
('PREV_BIRTH_PRETERM', [(0, 43088), (1, 2341)])


Most of the deliveries happened in hospital (BIRTH_PLACE equal to 1). Around 550 
deliveries happened at home: some intentionally ('BIRTH_PLACE' equal to 3), and 
some not ('BIRTH_PLACE' equal to 4).
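For a column that fits in memory, the same group-and-count logic can be sketched with collections.Counter. The values below are made up and only illustrate the output format.

```python
from collections import Counter

# Made-up BIRTH_PLACE codes.
birth_place = ['1', '1', '4', '1', '3', '4', '1']

# most_common() returns (value, count) pairs sorted by descending count.
frequencies = Counter(birth_place).most_common()

print(frequencies)  # [('1', 4), ('4', 2), ('3', 1)]
```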

### Correlations

Correlations help to identify collinear numeric features and handle them appropriately. 
Let's check the correlations between our features:

In [29]:
numeric_rdd.take(3)

[[29, 99, 0, 0, 0, 0, 99, 999, 999, 99],
 [22, 29, 0, 0, 0, 0, 65, 180, 198, 18],
 [38, 40, 0, 0, 0, 0, 63, 155, 167, 12]]

In [30]:
corrs = st.Statistics.corr(numeric_rdd)

In [31]:
corrs.shape

(10, 10)

In [32]:
for i, el in enumerate(corrs > 0.5):
    correlated = [
        (numeric_cols[j], corrs[i][j]) 
        for j, e in enumerate(el) 
        if e == 1.0 and j != i]
    
    if len(correlated) > 0:
        for e in correlated:
            print('{0}-to-{1}: {2:.2f}' \
                  .format(numeric_cols[i], e[0], e[1]))

CIG_BEFORE-to-CIG_1_TRI: 0.83
CIG_BEFORE-to-CIG_2_TRI: 0.72
CIG_BEFORE-to-CIG_3_TRI: 0.62
CIG_1_TRI-to-CIG_BEFORE: 0.83
CIG_1_TRI-to-CIG_2_TRI: 0.87
CIG_1_TRI-to-CIG_3_TRI: 0.76
CIG_2_TRI-to-CIG_BEFORE: 0.72
CIG_2_TRI-to-CIG_1_TRI: 0.87
CIG_2_TRI-to-CIG_3_TRI: 0.89
CIG_3_TRI-to-CIG_BEFORE: 0.62
CIG_3_TRI-to-CIG_1_TRI: 0.76
CIG_3_TRI-to-CIG_2_TRI: 0.89
MOTHER_PRE_WEIGHT-to-MOTHER_DELIVERY_WEIGHT: 0.54
MOTHER_PRE_WEIGHT-to-MOTHER_WEIGHT_GAIN: 0.65
MOTHER_DELIVERY_WEIGHT-to-MOTHER_PRE_WEIGHT: 0.54
MOTHER_DELIVERY_WEIGHT-to-MOTHER_WEIGHT_GAIN: 0.60
MOTHER_WEIGHT_GAIN-to-MOTHER_PRE_WEIGHT: 0.65
MOTHER_WEIGHT_GAIN-to-MOTHER_DELIVERY_WEIGHT: 0.60


The code above calculates the correlation matrix and prints only those 
pairs of features whose correlation coefficient is greater than 0.5.
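As a rough illustration of what is being computed, here is a plain-Python sketch of the Pearson correlation coefficient together with the same 0.5 filter. The `toy_columns` data is made up, and the sketch lists each pair only once (i < j) rather than in both directions.

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the
    # standard deviations (the 1/n factors cancel out).
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up columns: b is an exact multiple of a, c is unrelated noise.
toy_columns = {'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]}
names = list(toy_columns)

strong_pairs = []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = pearson(toy_columns[names[i]], toy_columns[names[j]])
        if r > 0.5:
            strong_pairs.append((names[i], names[j], r))

for first, second, r in strong_pairs:
    print('{0}-to-{1}: {2:.2f}'.format(first, second, r))
```

Like the notebook code, this filter only catches strong positive correlations; comparing the absolute value against the threshold would also catch strong negative ones.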

We can drop most of the highly correlated features. 

In [33]:
features_to_keep = [
    'INFANT_ALIVE_AT_REPORT', 
    'BIRTH_PLACE', 
    'MOTHER_AGE_YEARS', 
    'FATHER_COMBINED_AGE', 
    'CIG_1_TRI', 
    'MOTHER_HEIGHT_IN', 
    'MOTHER_PRE_WEIGHT', 
    'DIABETES_PRE', 
    'DIABETES_GEST', 
    'HYP_TENS_PRE', 
    'HYP_TENS_GEST', 
    'PREV_BIRTH_PRETERM'
]
births_transformed = births_transformed_tmp2.select([e for e in features_to_keep])

In [40]:
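# Narrower variant that keeps only the label and a single feature.
# Left commented out because the cells below (the Chi-square loop and the
# LabeledPoint conversion) rely on the full features_to_keep list defined above.
# features_to_keep = [
#     'INFANT_ALIVE_AT_REPORT', 
#     'CIG_1_TRI', 
# ]
# births_transformed = births_transformed_tmp2.select(features_to_keep)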

### Statistical testing

We cannot calculate correlations for the categorical features. However, we can run a 
Chi-square test to determine whether the distribution of each categorical variable differs significantly between the two values of the target variable.

We loop through all the categorical variables and pivot them by the 'INFANT_ALIVE_AT_REPORT' feature to get the counts. 

Next, we transform them into an RDD, so we can then convert them into a matrix using the pyspark.mllib.linalg module. 

The first parameter to the .Matrices.dense(...) method specifies the number of rows 
in the matrix; in our case, it is the length of distinct values of the categorical feature.
The second parameter specifies the number of columns: we have two as our 'INFANT_ALIVE_AT_REPORT' target variable has only two values.

The last parameter is a list of values to be transformed into a matrix.
Here's an example that shows this more clearly:


In [35]:
import pyspark.mllib.linalg as ln

In [36]:
print(ln.Matrices.dense(3,2, [1,2,3,4,5,6]))

DenseMatrix([[1., 4.],
             [2., 5.],
             [3., 6.]])
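The output shows that the values fill the matrix column by column. A small plain-Python sketch of that column-major layout follows; `dense_column_major` is a hypothetical helper written for this illustration, not part of MLlib.

```python
def dense_column_major(num_rows, num_cols, values):
    # In a column-major layout, element (i, j) sits at values[j * num_rows + i].
    return [
        [values[j * num_rows + i] for j in range(num_cols)]
        for i in range(num_rows)
    ]

matrix = dense_column_major(3, 2, [1, 2, 3, 4, 5, 6])

for row in matrix:
    print(row)
# [1, 4]
# [2, 5]
# [3, 6]
```

This is why, in the loop below, the flattened counts for one value of the target fill one whole column of the contingency matrix.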


In [37]:
for cat in categorical_cols[1:]:
    agg = births_transformed \
        .groupby('INFANT_ALIVE_AT_REPORT') \
        .pivot(cat) \
        .count()    

    agg_rdd = agg \
        .rdd\
        .map(lambda row: (row[1:])) \
        .flatMap(lambda row:
                 [0 if e is None else e for e in row]) \
        .collect()

    row_length = len(agg.collect()[0]) - 1
    agg = ln.Matrices.dense(row_length, 2, agg_rdd)
    
    test = st.Statistics.chiSqTest(agg)
    print(cat, round(test.pValue, 4))

('BIRTH_PLACE', 0.0)
('DIABETES_PRE', 0.0)
('DIABETES_GEST', 0.0)
('HYP_TENS_PRE', 0.0)
('HYP_TENS_GEST', 0.0)
('PREV_BIRTH_PRETERM', 0.0)


Our tests suggest that the distribution of every categorical feature differs significantly between 
the two outcome groups (all p-values round to 0.0), so these features should help us predict the chance of survival of an infant.
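For intuition about what the test measures, the sketch below computes Pearson's Chi-square statistic and the degrees of freedom for a small contingency table in plain Python. The counts in `toy_counts` are made up, and `chi_square_statistic` is a hypothetical helper; turning the statistic into a p-value additionally requires the Chi-square distribution, which we leave to the library.

```python
def chi_square_statistic(observed):
    # observed: contingency table as a list of rows of counts.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = float(sum(row_totals))

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Made-up 2x2 table of counts.
toy_counts = [[10, 20], [30, 40]]
statistic, dof = chi_square_statistic(toy_counts)

print(round(statistic, 4), dof)
```

The larger the statistic for a given number of degrees of freedom, the smaller the p-value.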

## Create the final dataset

Therefore, it is time to create our final dataset that we will use to build our models.

We will convert our DataFrame into an RDD of LabeledPoints.

A LabeledPoint is an MLlib structure that is used to train the machine learning 
models. It consists of two attributes: label and features.

The label is our target variable and features can be a NumPy array, list, 
pyspark.mllib.linalg.SparseVector, pyspark.mllib.linalg.DenseVector, or 
scipy.sparse column matrix.

### Create an RDD of `LabeledPoint`s


Before we build our final dataset, we first need to deal with one final obstacle: our 
'BIRTH_PLACE' feature is still a string. While any of the other categorical variables 
can be used as is (as they are now dummy variables), we will use a hashing trick to 
encode the 'BIRTH_PLACE' feature:

In [38]:
import pyspark.mllib.feature as ft
import pyspark.mllib.regression as reg

hashing = ft.HashingTF(7)

births_hashed = births_transformed \
    .rdd \
    .map(lambda row: [
            list(hashing.transform(row[1]).toArray()) 
                if col == 'BIRTH_PLACE' 
                else row[i] 
            for i, col 
            in enumerate(features_to_keep)]) \
    .map(lambda row: [[e] if type(e) == int else e 
                      for e in row]) \
    .map(lambda row: [item for sublist in row 
                      for item in sublist]) \
    .map(lambda row: reg.LabeledPoint(
            row[0], 
            ln.Vectors.dense(row[1:]))
        )

In [39]:
births_hashed.take(10)

[LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,29.0,99.0,0.0,99.0,999.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,22.0,29.0,0.0,65.0,180.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,38.0,40.0,0.0,63.0,155.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,39.0,42.0,0.0,60.0,128.0,0.0,0.0,0.0,0.0,1.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,99.0,4.0,61.0,110.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,32.0,37.0,0.0,66.0,150.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,22.0,25.0,0.0,68.0,155.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,25.0,26.0,0.0,64.0,136.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,26.0,32.0,0.0,64.0,140.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [0.0,0.0,0.0,0.0,0.0,0.0,1.0,39.0,66.0,0.0,65.0,140.0,0.0,0.0,0.0,0.0,0.0])]

Our feature has seven levels, so we use the same number of features for the hashing trick. Next, we use the hashing model to convert the 'BIRTH_PLACE' feature into a SparseVector; such a data structure is preferred when a dataset has many columns but only a few of them are non-zero in any given row. We then combine all the features together and finally create a LabeledPoint.
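The idea behind the hashing trick can be sketched without Spark: hash the category, take the result modulo the number of buckets, and set that position in the vector. The helper `hash_to_vector` is hypothetical, and it deliberately swaps in zlib.crc32 as the hash function so that the result is reproducible between runs; the bucket it picks will therefore generally differ from the one chosen by HashingTF.

```python
import zlib

def hash_to_vector(term, num_features):
    # Map the term to one of num_features buckets and count it there.
    index = zlib.crc32(term.encode('utf-8')) % num_features
    vector = [0.0] * num_features
    vector[index] += 1.0
    return vector

encoded = hash_to_vector('1', 7)
print(encoded)
```

Distinct categories can land in the same bucket. The frequency output earlier lists eight distinct 'BIRTH_PLACE' codes, so with seven buckets at least two of them must share a position, whichever hash function is used.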

### Split into training and testing

Before we move to the modeling stage, we need to split our dataset into two sets: one 
we'll use for training and the other for testing. Luckily, RDDs have a handy method 
to do just that: .randomSplit(...). The method takes a list of proportions that are 
to be used to randomly split the dataset.

In [40]:
births_train, births_test = births_hashed.randomSplit([0.6, 0.4])
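Conceptually, a weighted random split assigns every record to one of the parts according to a random draw. The plain-Python sketch below illustrates that idea; `random_split` is a hypothetical helper and is not meant to reproduce Spark's internal implementation.

```python
import random

def random_split(items, weights, seed=None):
    rng = random.Random(seed)
    total = float(sum(weights))

    # Cumulative, normalized upper bounds, e.g. [0.6, 1.0] for [0.6, 0.4].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for k, bound in enumerate(bounds):
            # The last part also catches any floating-point rounding leftovers.
            if draw < bound or k == len(bounds) - 1:
                splits[k].append(item)
                break
    return splits

train_part, test_part = random_split(list(range(1000)), [0.6, 0.4], seed=42)
print(len(train_part), len(test_part))
```

The part sizes are only approximately 60% and 40%, because each record is assigned independently. The RDD method also accepts a seed argument, which is worth setting if the split needs to be reproducible.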

## Predicting infant survival

### Logistic regression in Spark

MLlib used to provide a logistic regression model estimated using a stochastic gradient descent (SGD) algorithm. That model was deprecated in Spark 2.0 in favor of the `LogisticRegressionWithLBFGS` model. 

In [41]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

In [42]:
LR_Model = LogisticRegressionWithLBFGS.train(births_train, iterations=10)

The LogisticRegressionWithLBFGS model uses the Limited-memory 
Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization algorithm. It is a quasi-Newton 
method that approximates the BFGS algorithm using a limited amount of memory.

Training the model is very simple: we just need to call the .train(...) method. 
The only required parameter is the RDD of LabeledPoints; we also specified the 
number of iterations so that training does not take too long.

Let's now use the model to predict the classes for our testing set.

In [43]:
prediction = LR_Model.predict(births_test.map(lambda row: row.features))

In [44]:
prediction.take(20)

[1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
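The 0/1 values come from thresholding the predicted probability. The sketch below shows that step in plain Python with made-up weights; `predict_class` is a hypothetical helper, and the 0.5 threshold is an assumption about the model's default setting rather than something we verified here.

```python
import math

def predict_class(weights, intercept, features, threshold=0.5):
    # Linear combination of the features, squashed by the logistic function.
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    probability = 1.0 / (1.0 + math.exp(-margin))
    return 1 if probability > threshold else 0

# Made-up coefficients for a two-feature model.
toy_weights = [0.8, -0.5]
toy_intercept = -0.1

toy_predictions = [
    predict_class(toy_weights, toy_intercept, features)
    for features in [[1.0, 0.0], [0.0, 1.0]]
]
print(toy_predictions)  # [1, 0]
```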

### Random Forest in Spark

We are now ready to build the random forest model. 

In [57]:
from pyspark.mllib.tree import RandomForest

RF_model = RandomForest \
    .trainClassifier(data=births_train, 
                     numClasses=2, 
                     categoricalFeaturesInfo={}, 
                     numTrees=6,  
                     featureSubsetStrategy='all',
                     seed=666)

Let's see how well our model did.

In [58]:
RF_prediction = RF_model.predict(births_test.map(lambda row: row.features))

In [61]:
RF_prediction.take(10)

[1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
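A list of predictions alone does not tell us how well a model did; for that, the predictions need to be compared with the true labels. Below is a minimal plain-Python sketch of such a comparison. The labels and predictions are made up and do not come from the models above.

```python
# Made-up true labels and predictions, in the same order.
toy_labels      = [1, 0, 1, 1, 0, 0]
toy_predictions = [1, 0, 0, 1, 1, 0]

pairs = list(zip(toy_labels, toy_predictions))

# Confusion-matrix counts for the positive class (1).
true_pos  = sum(1 for label, pred in pairs if label == 1 and pred == 1)
true_neg  = sum(1 for label, pred in pairs if label == 0 and pred == 0)
false_pos = sum(1 for label, pred in pairs if label == 0 and pred == 1)
false_neg = sum(1 for label, pred in pairs if label == 1 and pred == 0)

accuracy = (true_pos + true_neg) / float(len(pairs))

print(true_pos, true_neg, false_pos, false_neg)
print(round(accuracy, 3))
```

The same counts could be used to derive precision and recall, which are often more informative than accuracy alone.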