#### TITANIC CLASSIFICATION EXAMPLE
###### Project Requirements
* There are many Titanic examples available for different machine learning libraries.
* We will predict which passengers survived the sinking of the Titanic based solely on passenger features (age, cabin, number of children, male/female, etc.).
* The actual outcome was that passengers who were male or from a lower class, such as third class, tended not to survive.
* Passengers in higher classes (e.g. first class) or those who were female had a higher likelihood of survival.
* We will explore some better ways to deal with categorical data in a two-step process.
* We will demonstrate how to use pipelines to set up stages and build models that can easily be reused.
* We will also deal with a lot of missing data.
* Apache Spark reference (learning-apache-spark repository on GitHub): https://github.com/MingChen0919/learning-apache-spark

#### Instructor uses a Databricks notebook setup ???

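## Using a Pipeline, the resulting model scores lower, whereas the manual fit and transform scores higher
* Evaluation score from the manual approach is 0.88
* Evaluation score using the Pipeline is 0.78 (evaluated with rawPredictionCol='prediction')
* The train/test split below is unseeded, so scores vary from run to run; the gap may not come from the Pipeline itself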

In [1]:
import sys
sys.path.append('C:/Users/nishita/exercises_udemy/tools/')
from chinmay_tools import *

#### Load Titanic data from a CSV file into a DataFrame

In [2]:
from pyspark.sql import SparkSession

spark_titanic = SparkSession.builder.appName('chin_titanic').getOrCreate()

sdf_titanic = spark_titanic.read.csv('Logistic_Regression/titanic.csv', inferSchema=True, header=True)

sdf_titanic.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



Some of the columns:
* Sex -- male / female
* SibSp -- number of siblings and spouses the passenger had aboard
* Parch -- number of parents and children the passenger had aboard
* Fare -- ticket price paid by the passenger
* Cabin -- cabin occupied by the passenger
* Embarked -- port where the passenger embarked; the actual string is a single letter (C = Cherbourg, Q = Queenstown, S = Southampton)
<br/><br/>
* PassengerId is just an index column and is not useful for our prediction
<br/>

###### We will select only the columns that are useful to us

* Now we will select the columns 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'
* We will think about the 'Name' column later, to decide through feature engineering whether the passenger is a Dr, Mr, Mrs, etc.
* What about 'Cabin'?

#### Check data for null values

In [3]:
printHighlighted('Select desired columns and check null record counts')
sdf_titanic_myfields = sdf_titanic.select('Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked')

# Fields with null: Age:177, Embarked:2
sdf_titanic.count()
sdf_titanic_myfields.count()
sdf_titanic_myfields.filter('Embarked is null OR Age is null').count()

sdf_titanic_myfields.filter('Survived is null OR Pclass is null OR Sex is null OR Age is null '
                            +' OR SibSp is null OR Parch is null OR Fare is null OR Embarked is null').count()

[7m[1mSelect desired columns and check null record counts[0m[0m


891

891

179

179

###### Dealing with Null records
* There are 177 records with null in the 'Age' column and 2 records with null in the 'Embarked' column
* We will drop these 179 records from the total of 891 Titanic records.
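
The link between per-column null counts and the number of dropped rows can be checked with a small plain-Python sketch. The rows below are made-up toy records, not the Titanic data, and `count_nulls_per_column` and `drop_rows_with_nulls` are hypothetical helpers that only mimic the default behaviour of `na.drop()` (a row is dropped if any of its columns is null).

```python
def count_nulls_per_column(rows):
    """Return {column: number of rows where that column is None}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


def drop_rows_with_nulls(rows):
    """Keep only rows in which every column has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Toy records (assumption: invented for illustration only)
toy_rows = [
    {"Age": 22.0, "Embarked": "S"},
    {"Age": None, "Embarked": "C"},
    {"Age": 35.0, "Embarked": None},
    {"Age": None, "Embarked": "Q"},
]

null_counts = count_nulls_per_column(toy_rows)
clean_rows = drop_rows_with_nulls(toy_rows)

print(null_counts)                        # nulls per column
print(len(toy_rows) - len(clean_rows))    # rows removed
```

The per-column counts add up to the number of dropped rows only when no row is null in more than one column. That holds in this toy data, and it is consistent with the notebook, where 177 + 2 matches the 179 rows returned by the combined filter.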

###### Drop Null records

In [4]:
sdf_final_data = sdf_titanic_myfields.na.drop()

* In the Titanic DataFrame there are two string fields, 'Sex' and 'Embarked'. We need to convert them first to numeric indices (using StringIndexer) and then into vectors (using OneHotEncoder); those vectors become part of the final vectorized features column.

###### Dealing with String categorical fields
* Convert the string field into a numeric field using StringIndexer, with values 0, 1, 2, ... up to one less than the number of unique values
* Convert the numeric field into a vector field using OneHotEncoder
* <u>Example of StringIndexer and OneHotEncoder</u>
* Suppose there are 3 unique values in a string field: A, B, C
* StringIndexer converts the string column into a numeric column with values in [0, numLabels)
>* The most frequent label gets index 0, the next most frequent gets 1, and so on, because the default value of stringOrderType is 'frequencyDesc'
<pre>
<hr/>
STRING:  A  B  C
<hr/>
NUMERIC: 0  1  2     (after StringIndexer)
<hr/>
A:   [1, 0, 0]       (after OneHotEncoder with dropLast=False)
B:   [0, 1, 0]
C:   [0, 0, 1]
</pre>
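* Say there are n categories (n unique values) in the column. After OneHotEncoder, each category gets a unique binary vector (elements 0 or 1) with at most a single one-value.
* The full n-element vectors shown above correspond to dropLast=False. With Spark's default dropLast=True the vectors have n-1 elements, and the last (least frequent) category is encoded as all zeroes.
* The most frequent label gets the first element as 1 and the remaining elements as zeroes
* The next most frequent label gets the second element as 1 and the remaining elements (including the first one) as zeroes, and so on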
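
The two-step encoding described above can be sketched in plain Python. This is an illustration of the idea, not Spark's implementation: `fit_string_index` and `one_hot` are hypothetical helpers, the sample values are toy data, and breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (like 'frequencyDesc')."""
    counts = Counter(values)
    # Assumption: ties in frequency are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_labels, drop_last=True):
    """Return a binary list with at most one 1; drop_last shortens it by one slot."""
    size = num_labels - 1 if drop_last else num_labels
    vec = [0] * size
    if index < size:          # the dropped (last) category stays all zeroes
        vec[index] = 1
    return vec


# Toy column (assumption: invented values, 'S' most frequent, 'Q' least)
embarked = ["S", "S", "S", "C", "C", "Q"]

label_to_index = fit_string_index(embarked)
n = len(label_to_index)
encoded_default = {lab: one_hot(i, n) for lab, i in label_to_index.items()}
encoded_full = {lab: one_hot(i, n, drop_last=False) for lab, i in label_to_index.items()}

print(label_to_index)     # {'S': 0, 'C': 1, 'Q': 2}
print(encoded_default)    # {'S': [1, 0], 'C': [0, 1], 'Q': [0, 0]}
print(encoded_full)       # {'S': [1, 0, 0], 'C': [0, 1, 0], 'Q': [0, 0, 1]}
```

Dropping the last slot loses no information, because the all-zero vector still identifies the remaining category, and it avoids feeding a perfectly collinear set of columns into a linear model such as logistic regression.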

#### Use StringIndexer to convert categorical values to a categorical index, which is a number in [0, numValues)
#### Then use OneHotEncoder to convert that index into a vector

In [5]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler  #,  VectorIndexer

In [6]:
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

embark_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex', outputCol='EmbarkVec')

In [7]:
from pyspark.ml.feature import VectorAssembler

In [8]:
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'Age', 'SibSp', 'Parch', 'Fare', 'EmbarkVec'], 
                            outputCol='features')

In [9]:
from pyspark.ml.classification import LogisticRegression
logi_reg_titanic = LogisticRegression(featuresCol='features', labelCol='Survived')

In [10]:
from pyspark.ml import Pipeline

###### Add all the required stages to the Pipeline (indexers, encoders, assembler and LogisticRegression model)

In [11]:
my_pipeline = Pipeline(stages=[
    gender_indexer, embark_indexer,
    gender_encoder, embark_encoder,
    assembler, logi_reg_titanic
])

###### Split the data into training set and test set

In [12]:
train_titanic_data, test_titanic_data = sdf_final_data.randomSplit([0.7, 0.3])

In [13]:
train_titanic_data.count()
test_titanic_data.count()

524

188
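
The counts above (524 and 188) are not an exact 70/30 cut of the 712 rows, because a random split assigns each row independently rather than slicing at a fixed position. A plain-Python sketch of that idea follows; `random_split` is a hypothetical helper, not Spark's `randomSplit`, and the integer "rows" are placeholders.

```python
import random


def random_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, otherwise to test."""
    rng = random.Random(seed)   # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(712))         # 712 rows remain after dropping nulls (524 + 188)

train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a))   # close to, but usually not exactly, 70/30
print(train_a == train_b)          # same seed, same split
```

In Spark, `randomSplit` accepts an optional `seed` argument for the same purpose. Without it, every run trains and evaluates on different rows, so scores from separate runs are not directly comparable.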

###### Fit the pipeline on the training data to get the trained model

In [14]:
fit_model = my_pipeline.fit(train_titanic_data)

###### Transform the test data using the trained model to get the prediction in the test result

In [15]:
results = fit_model.transform(test_titanic_data)

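###### The transform() call on PipelineModel adds the columns 'rawPrediction', 'probability' and 'prediction'
* These three columns come from the LogisticRegression stage; 'prediction' holds the predicted class (0.0 or 1.0) and 'rawPrediction' holds the underlying continuous scores
* BinaryClassificationEvaluator reads its scores from rawPredictionCol, which defaults to 'rawPrediction'; passing 'prediction' instead makes it rank passengers by the hard 0/1 prediction only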

In [30]:
type(fit_model)
type(results)

pyspark.ml.pipeline.PipelineModel

pyspark.sql.dataframe.DataFrame

In [28]:
# results.select('Survived', 'prediction').show()
type(results)
results.printSchema()

pyspark.sql.dataframe.DataFrame

root
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Embarked: string (nullable = true)
 |-- SexIndex: double (nullable = false)
 |-- EmbarkIndex: double (nullable = false)
 |-- SexVec: vector (nullable = true)
 |-- EmbarkVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



###### Evaluate the test result using BinaryClassificationEvaluator

In [19]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

bin_eval_raw = BinaryClassificationEvaluator(labelCol='Survived')

bin_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='Survived')

* bin_eval_raw uses the default rawPredictionCol='rawPrediction', so its metric is computed from the model's continuous scores
* bin_eval passes 'prediction' as rawPredictionCol, so its metric is computed from the hard 0/1 predictions and is usually lower for areaUnderROC

In [26]:
printHighlighted('ASK INSTRUCTOR -- WHY ARE WE PASSING "prediction" IN PLACE OF "rawPrediction"')

[7m[1mASK INSTRUCTOR -- WHY ARE WE PASSING "prediction" IN PLACE OF "rawPrediction"[0m[0m


In [22]:
printHighlighted('Evaluating the predictions using BinaryClassificationEvaluator using metrics "(NONE|areaUnderROC|areaUnderPR)" ')
bin_eval.evaluate(results)

[7m[1mEvaluating the predictions using BinaryClassificationEvaluator using metrics "(NONE|areaUnderROC|areaUnderPR)" [0m[0m


0.7780696744717305

###### bin_eval_raw is valid here: it reads the 'rawPrediction' column, which PipelineModel.transform() outputs alongside 'prediction' (see the schema above)

In [23]:
bin_eval_raw.evaluate(results)

0.8519703026841803
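
The two evaluators above read different columns, which can explain why they report different scores (0.778 versus 0.852) on the same results. The plain-Python sketch below illustrates the effect on made-up toy data; `area_under_roc` is a hypothetical helper using the pairwise-ranking definition of AUC, not Spark's implementation.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (assumption: invented labels and scores)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]                 # continuous scores
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]        # thresholded 0/1 predictions

auc_from_scores = area_under_roc(labels, scores)
auc_from_hard = area_under_roc(labels, hard)

print(auc_from_scores)   # 8 of 9 pairs ranked correctly
print(auc_from_hard)     # thresholding creates ties, so ranking information is lost
```

Thresholding collapses all scores into two values, so many positive/negative pairs become ties. The area under the ROC curve is meant to summarise ranking quality across all thresholds, which is why the continuous 'rawPrediction' column is the natural input for it.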

In [24]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

multi_eval = MulticlassClassificationEvaluator(labelCol='Survived')

In [25]:
#                                                         (NONE|f1|weightedPrecision|weightedRecall|accuracy)')
printHighlighted('Evaluating the predictions using MulticlassClassificationEvaluator using metrics "(NONE|f1|weightedPrecision|weightedRecall|accuracy)" ')
multi_eval.evaluate(results)
multi_eval.evaluate(results, {multi_eval.metricName: "f1"})
multi_eval.evaluate(results, {multi_eval.metricName: "weightedPrecision"})
multi_eval.evaluate(results, {multi_eval.metricName: "weightedRecall"})
multi_eval.evaluate(results, {multi_eval.metricName: "accuracy"})

[7m[1mEvaluating the predictions using MulticlassClassificationEvaluator using metrics "(NONE|f1|weightedPrecision|weightedRecall|accuracy)" [0m[0m


0.7844566780736993

0.7844566780736993

0.7907839837716972

0.7872340425531915

0.7872340425531915
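
In the output above, the first two values match because f1 is the evaluator's default metric, and the last two match because weighted recall and accuracy are the same quantity. A plain-Python sketch on made-up toy data shows the second identity; `accuracy` and `weighted_recall` are hypothetical helpers, not Spark functions.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def weighted_recall(labels, predictions):
    """Per-class recall, averaged with each class weighted by its share of the rows."""
    support = Counter(labels)
    total = 0.0
    for cls, n_cls in support.items():
        true_pos = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        recall = true_pos / n_cls
        total += recall * n_cls / len(labels)
    return total


# Toy data (assumption: invented labels and predictions)
labels      = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
predictions = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

acc = accuracy(labels, predictions)
w_recall = weighted_recall(labels, predictions)

print(acc, w_recall)
```

Each class contributes (true positives / class size) × (class size / total rows), so the class sizes cancel and the sum is simply the total number of correct predictions divided by the number of rows.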