#### TITANIC CLASSIFICATION EXAMPLE
###### Project Requirements
* There are many Titanic examples for different machine learning libraries
* We will predict which passengers survived the sinking of the Titanic based solely on passenger features (age, cabin, number of children, sex, etc.)
* The actual outcome was that passengers who were male or in a lower class, such as third class, tended not to survive.
* Passengers in higher classes (e.g. first class) or who were female had a higher likelihood of survival.
* We will explore some better ways to deal with categorical data in a two-step process.
* We will demonstrate how to use pipelines to set up stages and build models that can easily be reused.
* We will also deal with a lot of missing data.

#### Instructor uses Databricks' notebook setup ???

In [1]:
import sys
sys.path.append('C:/Users/nishita/exercises_udemy/MyTrials/tools/')
from chinmay_tools import *

#### Load titanic data from csv file into dataframe

In [2]:
from pyspark.sql import SparkSession

spark_titanic = SparkSession.builder.appName('chin_titanic').getOrCreate()

sdf_titanic = spark_titanic.read.csv('Logistic_Regression/titanic.csv', inferSchema=True, header=True)

sdf_titanic.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



Some of the columns:
* Sex -- Male / Female
* SibSp -- number of siblings and spouses the passenger had aboard
* Parch -- number of parents and children the passenger had aboard
* Fare -- ticket price paid by the passenger
* Cabin -- cabin occupied by the passenger
* Embarked -- port where the passenger embarked; the actual value is a single letter
<br/><br/>
* PassengerId is just an index column and is not useful for our prediction
<br/>

###### We will select only the columns that are useful to us

* Now we will select the columns 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'
* We will revisit the 'Name' column later, during feature engineering, to decide whether a passenger is a Dr, a Mr, a Mrs, etc.
* What about 'Cabin'?
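
The title idea mentioned for the 'Name' column could be sketched as follows. This is only an illustration: it assumes names follow a "Surname, Title. Given names" layout, and the helper `extract_title` and the sample strings are hypothetical, not part of the notebook.

```python
import re

# Assumed layout: "Surname, Title. Given names"
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")

def extract_title(name):
    """Return the title between the comma and the first period, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

# Illustrative name strings in the assumed layout
sample_names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]
titles = [extract_title(n) for n in sample_names]
print(titles)
```

A title extracted this way would itself be a string category, so it would go through the same indexing and encoding steps as 'Sex' and 'Embarked'.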

#### Check data for null values and drop null records

In [3]:
printHighlighted('Select desired columns and check null record counts')
sdf_titanic_myfields = sdf_titanic.select('Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked')

# Fields with null: Age: 177, Embarked: 2
sdf_titanic.count()
sdf_titanic_myfields.count()
sdf_titanic_myfields.filter('Embarked is null OR Age is null').count()

sdf_titanic_myfields.filter('Survived is null OR Pclass is null OR Sex is null OR Age is null '
                            +' OR SibSp is null OR Parch is null OR Fare is null OR Embarked is null').count()

Select desired columns and check null record counts


891

891

179

179

###### Dealing with Null records
* There are 177 records with null in 'Age' field and 2 records with null in 'Embarked' column
* We will drop these 179 records from total of 891 titanic records.
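
Since 177 + 2 = 179 matches the combined filter count, no record appears to be missing both fields. The counting and dropping logic can be sketched in plain Python as below. The rows are made-up illustrations, not records from titanic.csv, and `null_counts` and `drop_nulls` are hypothetical helpers that only mimic what the null filters and `na.drop()` do.

```python
# Made-up rows; None stands in for a null value
rows = [
    {"Survived": 0, "Age": 22.0, "Embarked": "S"},
    {"Survived": 1, "Age": None, "Embarked": "C"},
    {"Survived": 1, "Age": 38.0, "Embarked": None},
    {"Survived": 0, "Age": None, "Embarked": "Q"},
]

def null_counts(records):
    """Count None values per column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def drop_nulls(records):
    """Keep only records without any None value (drop a row if any field is null)."""
    return [r for r in records if all(v is not None for v in r.values())]

per_column = null_counts(rows)
complete_rows = drop_nulls(rows)
print(per_column)
print(len(rows), "->", len(complete_rows))
```

Dropping rows is the simplest option; it costs about a fifth of the records here, which is why imputing 'Age' is a common alternative.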

In [4]:
sdf_titanic_myfields2 = sdf_titanic_myfields.na.drop()

In [5]:
sdf_titanic_myfields2.count()

df_titanic_myfields = sdf_titanic_myfields2.toPandas()

df_titanic_myfields

712

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.2500,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.9250,S
3,1,1,female,35.0,1,0,53.1000,S
4,0,3,male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...
707,0,3,female,39.0,0,5,29.1250,Q
708,0,2,male,27.0,0,0,13.0000,S
709,1,1,female,19.0,0,0,30.0000,S
710,1,1,male,26.0,0,0,30.0000,C


* The titanic dataframe has two string fields, 'Sex' and 'Embarked'. We need to convert them first to numeric indices (using StringIndexer) and then into vectors (using OneHotEncoder); those vectors will be part of the final vectorized features column.

###### Dealing with String categorical fields
* Convert the string field into a numeric field using StringIndexer, with values 0, 1, 2, ... up to one less than the number of unique values
* Convert the numeric field into a vector field using OneHotEncoder
* <u>Example of StringIndexer and OneHotEncoder</u>
* Suppose there are 3 unique values in a string field: A, B, C
* StringIndexer converts the string column into a numeric column with values in [0, numLabels)
>* The most frequent label gets index 0, the next most frequent gets 1, and so on, because the default value of stringOrderType is 'frequencyDesc'
<pre>
<hr/>
STRING:  A  B  C
<hr/>
NUMERIC: 0  1  2     (after StringIndexer)
<hr/>
A:   [1, 0, 0]       (after OneHotEncoder)
B:   [0, 1, 0]
C:   [0, 0, 1]
</pre>
* If the column has n categories (n unique values), OneHotEncoder gives each category a unique binary vector (elements 0 or 1) containing at most one 1.
* The most frequent label gets 1 as the first element and zeroes elsewhere
* The next most frequent label gets 1 as the second element and zeroes elsewhere (including the first element), and so on
* The illustration above uses n elements per vector. With Spark's default dropLast=True the vector has only n-1 elements and the last category is encoded as all zeroes, which is why a vector holds "at most" one 1.
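
The two steps can be sketched in plain Python as below. This is not Spark's implementation, only an illustration of the idea: `string_index` and `one_hot` are hypothetical helpers, the sample values are made up, ties are broken alphabetically as an assumption, and `drop_last` imitates the encoder's dropLast behaviour.

```python
from collections import Counter

def string_index(values):
    """Map each label to an index, most frequent label first."""
    freq = Counter(values)
    # Assumption: ties in frequency are broken alphabetically
    ordered = sorted(freq, key=lambda label: (-freq[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot(index, num_labels, drop_last=True):
    """Return a binary vector; with drop_last the last label becomes all zeroes."""
    size = num_labels - 1 if drop_last else num_labels
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

embarked = ["S", "C", "S", "S", "Q", "C"]  # made-up sample values
mapping = string_index(embarked)
encoded = {label: one_hot(idx, len(mapping)) for label, idx in mapping.items()}
print(mapping)
print(encoded)
```

Dropping the last element loses no information, because the all-zero vector still identifies the remaining category, and it avoids a redundant column in a linear model.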

#### Use StringIndexer to convert categorical values to a categorical index, which is a number in [0, numValues)
#### Then use OneHotEncoder to convert that index into a vector

In [6]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler  #,  VectorIndexer

In [7]:
gender_indexer = StringIndexer(inputCol='Sex', outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex', outputCol='SexVec')

embark_indexer = StringIndexer(inputCol='Embarked', outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex', outputCol='EmbarkVec')

In [8]:
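# Fit each indexer on the data, then add its index column
sdf_gi = gender_indexer.fit(sdf_titanic_myfields2).transform(sdf_titanic_myfields2)
sdf_ei = embark_indexer.fit(sdf_gi).transform(sdf_gi)
# In Spark 2.x OneHotEncoder is a plain transformer; in Spark 3.x it is an estimator,
# so use gender_encoder.fit(sdf_ei).transform(sdf_ei) there instead
sdf_ge = gender_encoder.transform(sdf_ei)
sdf_ee = embark_encoder.transform(sdf_ge)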

sdf_ei.show()
sdf_ee.show()

+--------+------+------+----+-----+-----+-------+--------+--------+-----------+
|Survived|Pclass|   Sex| Age|SibSp|Parch|   Fare|Embarked|SexIndex|EmbarkIndex|
+--------+------+------+----+-----+-----+-------+--------+--------+-----------+
|       0|     3|  male|22.0|    1|    0|   7.25|       S|     0.0|        0.0|
|       1|     1|female|38.0|    1|    0|71.2833|       C|     1.0|        1.0|
|       1|     3|female|26.0|    0|    0|  7.925|       S|     1.0|        0.0|
|       1|     1|female|35.0|    1|    0|   53.1|       S|     1.0|        0.0|
|       0|     3|  male|35.0|    0|    0|   8.05|       S|     0.0|        0.0|
|       0|     1|  male|54.0|    0|    0|51.8625|       S|     0.0|        0.0|
|       0|     3|  male| 2.0|    3|    1| 21.075|       S|     0.0|        0.0|
|       1|     3|female|27.0|    0|    2|11.1333|       S|     1.0|        0.0|
|       1|     2|female|14.0|    1|    0|30.0708|       C|     1.0|        1.0|
|       1|     3|female| 4.0|    1|    1

In [9]:
from pyspark.ml.feature import VectorAssembler

In [10]:
sdf_ee.printSchema()

root
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Embarked: string (nullable = true)
 |-- SexIndex: double (nullable = false)
 |-- EmbarkIndex: double (nullable = false)
 |-- SexVec: vector (nullable = true)
 |-- EmbarkVec: vector (nullable = true)



#### SWITCH here to use index column instead of vector column

In [11]:
# assembler = VectorAssembler(inputCols=['Pclass', 'SexIndex', 'Age', 'SibSp', 'Parch', 'Fare', 'EmbarkIndex'], 
#                             outputCol='features')

assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'Age', 'SibSp', 'Parch', 'Fare', 'EmbarkVec'], 
                            outputCol='features')

In [12]:
final_vec = assembler.transform(sdf_ee)

type(final_vec)
final_vec.show()

In [13]:
final_data = final_vec.select('features', 'Survived')
final_data.show()

+--------------------+--------+
|            features|Survived|
+--------------------+--------+
|[3.0,1.0,22.0,1.0...|       0|
|[1.0,0.0,38.0,1.0...|       1|
|(8,[0,2,5,6],[3.0...|       1|
|[1.0,0.0,35.0,1.0...|       1|
|[3.0,1.0,35.0,0.0...|       0|
|[1.0,1.0,54.0,0.0...|       0|
|[3.0,1.0,2.0,3.0,...|       0|
|[3.0,0.0,27.0,0.0...|       1|
|[2.0,0.0,14.0,1.0...|       1|
|[3.0,0.0,4.0,1.0,...|       1|
|(8,[0,2,5,6],[1.0...|       1|
|[3.0,1.0,20.0,0.0...|       0|
|[3.0,1.0,39.0,1.0...|       0|
|(8,[0,2,5,6],[3.0...|       0|
|(8,[0,2,5,6],[2.0...|       1|
|[3.0,1.0,2.0,4.0,...|       0|
|[3.0,0.0,31.0,1.0...|       0|
|[2.0,1.0,35.0,0.0...|       0|
|[2.0,1.0,34.0,0.0...|       1|
|(8,[0,2,5],[3.0,1...|       1|
+--------------------+--------+
only showing top 20 rows
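
Rows printed as `(8,[0,2,5,6],[...])` are sparse vectors: total size, indices of the non-zero slots, then their values. Spark appears to pick whichever form is more compact for each row. The sketch below expands one such row in plain Python. The values are reconstructed from the third passenger shown earlier (Pclass 3, female, age 26, fare 7.925, embarked S), and the slot names are an assumption based on the assembler's input order and the index values displayed above.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Assumed slot layout: SexVec has one slot (male), EmbarkVec has two (S, C)
FEATURE_SLOTS = ["Pclass", "SexVec[male]", "Age", "SibSp",
                 "Parch", "Fare", "EmbarkVec[S]", "EmbarkVec[C]"]

row = sparse_to_dense(8, [0, 2, 5, 6], [3.0, 26.0, 7.925, 1.0])
named = dict(zip(FEATURE_SLOTS, row))
print(named)
```

The eight slots add up as 1 (Pclass) + 1 (SexVec) + 4 (Age, SibSp, Parch, Fare) + 2 (EmbarkVec).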



In [14]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])

In [15]:
train_data.describe().show()
test_data.describe().show()

+-------+-------------------+
|summary|           Survived|
+-------+-------------------+
|  count|                533|
|   mean| 0.3921200750469043|
| stddev|0.48868187046102685|
|    min|                  0|
|    max|                  1|
+-------+-------------------+

+-------+-----------------+
|summary|         Survived|
+-------+-----------------+
|  count|              179|
|   mean|0.441340782122905|
| stddev|0.497940016085883|
|    min|                0|
|    max|                1|
+-------+-----------------+
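
The split sizes above (533 and 179) are not exactly 70% and 30% of 712. One likely reason is that `randomSplit` assigns each row at random according to the weights rather than cutting at a fixed position, so the sizes are only approximate. The plain-Python sketch below illustrates that idea; `random_split` is a hypothetical helper, not Spark's algorithm, and the seed value is arbitrary.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part by an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each part, e.g. [0.7, 1.0]
    bounds = []
    cumulative = 0.0
    for w in weights:
        cumulative += w / total
        bounds.append(cumulative)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts

train_rows, test_rows = random_split(range(712), [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

In the notebook, passing a seed, as in `final_data.randomSplit([0.7, 0.3], seed=42)`, should make the split repeatable between runs.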



In [16]:
from pyspark.ml.classification import LogisticRegression

logi_reg_titanic = LogisticRegression(labelCol='Survived')

In [17]:
fitted_model = logi_reg_titanic.fit(train_data)

In [18]:
fitted_model.coefficients
fitted_model.intercept

DenseVector([-1.1541, -2.8507, -0.0395, -0.2415, -0.2194, 0.0014, 0.3856, 0.7592])

4.754659843854747
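
To see how these numbers turn into a prediction, the sketch below applies the logistic function by hand. It uses the rounded coefficients printed above and the first passenger's feature vector (Pclass 3, male, age 22, SibSp 1, Parch 0, fare 7.25, embarked S). The feature order is assumed to match the assembler's input columns, and `survival_probability` is a hypothetical helper.

```python
import math

# Coefficients as printed above (rounded to 4 decimals) and the intercept
coefficients = [-1.1541, -2.8507, -0.0395, -0.2415, -0.2194, 0.0014, 0.3856, 0.7592]
intercept = 4.754659843854747

# Assumed order: Pclass, SexVec, Age, SibSp, Parch, Fare, EmbarkVec (2 slots)
features = [3.0, 1.0, 22.0, 1.0, 0.0, 7.25, 1.0, 0.0]

def survival_probability(x, weights, bias):
    """Logistic regression: sigmoid of the linear combination."""
    margin = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-margin))

p_survived = survival_probability(features, coefficients, intercept)
predicted = 1.0 if p_survived > 0.5 else 0.0
print(round(p_survived, 3), predicted)
```

The large negative coefficients on Pclass and on the male slot of SexVec pull the probability down, which is in line with the conclusion stated at the top of the notebook.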

In [19]:
fitted_model.summary.predictions.show()

+--------------------+--------+--------------------+--------------------+----------+
|            features|Survived|       rawPrediction|         probability|prediction|
+--------------------+--------+--------------------+--------------------+----------+
|(8,[0,1,2,5],[2.0...|     0.0|[2.63746670590744...|[0.93323429294680...|       0.0|
|(8,[0,1,2,5],[3.0...|     0.0|[2.29903633934555...|[0.90879719757418...|       0.0|
|(8,[0,1,2,5],[3.0...|     0.0|[2.37662469385721...|[0.91502736386518...|       0.0|
|(8,[0,1,2,5],[3.0...|     1.0|[2.69246786166459...|[0.93658072394292...|       0.0|
|(8,[0,1,2,5],[3.0...|     0.0|[2.77143450570510...|[0.94111253677795...|       0.0|
|(8,[0,1,2,5],[3.0...|     0.0|[2.81091782772536...|[0.94326295943713...|       0.0|
|(8,[0,1,2,5],[3.0...|     0.0|[4.11386745439376...|[0.98391840393830...|       0.0|
|(8,[0,1,2,5],[3.0...|     0.0|[4.33102572550516...|[0.98701673437556...|       0.0|
|(8,[0,1,2,6],[1.0...|     0.0|[0.36486075486602...|[0.5902165743

In [20]:
prediction_label_summary = fitted_model.evaluate(test_data)

In [24]:
prediction_label_summary.accuracy
prediction_label_summary.areaUnderROC
prediction_label_summary.predictions.show()

0.7821229050279329

0.8236708860759491

+--------------------+--------+--------------------+--------------------+----------+
|            features|Survived|       rawPrediction|         probability|prediction|
+--------------------+--------+--------------------+--------------------+----------+
|(8,[0,1,2,5],[3.0...|       0|[2.53454620767601...|[0.92652843038914...|       0.0|
|(8,[0,1,2,5],[3.0...|       0|[3.14652606489752...|[0.95877162064999...|       0.0|
|(8,[0,1,2,6],[1.0...|       0|[0.44382739890653...|[0.60917064444963...|       0.0|
|(8,[0,1,2,6],[3.0...|       0|[1.92285036060147...|[0.87245594889079...|       0.0|
|(8,[0,1,2,6],[3.0...|       1|[2.15975029272299...|[0.89657639618747...|       0.0|
|(8,[0,2,5,6],[1.0...|       1|[-3.1374359091208...|[0.04158920274159...|       1.0|
|(8,[0,2,5,6],[1.0...|       1|[-2.9229681275681...|[0.05102977546114...|       1.0|
|(8,[0,2,5,6],[1.0...|       1|[-2.9327800127527...|[0.05055671542530...|       1.0|
|(8,[0,2,5,6],[1.0...|       1|[-2.6219961018900...|[0.0677361350

In [25]:
printHighlighted('Here we have the actual values in the "Survived" column and the predicted values in the "prediction" column')

Here we have the actual values in the "Survived" column and the predicted values in the "prediction" column


In [26]:
printHighlighted('Now to find the accuracy, precision etc we need to use evaluators')

Now to find the accuracy, precision etc we need to use evaluators


In [27]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

bin_eval = BinaryClassificationEvaluator(labelCol='Survived')

In [28]:
printHighlighted('Evaluating the predictions using BinaryClassificationEvaluator using metrics "(NONE|areaUnderROC|areaUnderPR)" ')
bin_eval.evaluate(prediction_label_summary.predictions)
BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderROC').evaluate(prediction_label_summary.predictions)
BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderPR').evaluate(prediction_label_summary.predictions)

Evaluating the predictions using BinaryClassificationEvaluator using metrics "(NONE|areaUnderROC|areaUnderPR)"


0.8236708860759491

0.8236708860759491

0.8333082122957263

In [43]:
prediction_label_summary.areaUnderROC

0.8236708860759491
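
The area under the ROC curve can be read as the chance that a randomly chosen survivor gets a higher score than a randomly chosen non-survivor. The plain-Python sketch below computes it from that pairwise definition. It is not how Spark computes the metric internally, the labels and scores are made up, and `area_under_roc` is a hypothetical helper.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]             # made-up true labels
scores = [0.1, 0.4, 0.35, 0.8]    # made-up predicted probabilities of class 1
auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the 0.82 above means the model ranks most survivor and non-survivor pairs correctly.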

In [29]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

multi_eval = MulticlassClassificationEvaluator(labelCol='Survived')

In [30]:
# metricName options: (NONE|f1|weightedPrecision|weightedRecall|accuracy); f1 is the default
printHighlighted('Evaluating the predictions using MulticlassClassificationEvaluator using metrics "(NONE|f1|weightedPrecision|weightedRecall|accuracy)" ')
multi_eval.evaluate(prediction_label_summary.predictions)
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "f1"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "weightedPrecision"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "weightedRecall"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "accuracy"})

Evaluating the predictions using MulticlassClassificationEvaluator using metrics "(NONE|f1|weightedPrecision|weightedRecall|accuracy)"


0.7812198595203173

0.7812198595203173

0.7815584938489967

0.782122905027933

0.7821229050279329

In [36]:
prediction_label_summary.weightedPrecision
prediction_label_summary.weightedRecall
prediction_label_summary.accuracy

0.7815584938489967

0.782122905027933

0.7821229050279329
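
Weighted recall and accuracy agree above (both about 0.7821), and that is expected: weighting each class's recall by its share of the records always adds up to the overall accuracy. The plain-Python sketch below computes the weighted metrics by hand on made-up labels. `weighted_metrics` is a hypothetical helper, not the evaluator's code.

```python
from collections import Counter

def weighted_metrics(labels, predictions):
    """Accuracy plus precision, recall and F1 averaged by class support."""
    n = len(labels)
    support = Counter(labels)
    accuracy = sum(y == p for y, p in zip(labels, predictions)) / n
    w_precision = w_recall = w_f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        predicted = sum(1 for p in predictions if p == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = count / n  # class share of all records
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1
    return {"accuracy": accuracy, "weightedPrecision": w_precision,
            "weightedRecall": w_recall, "f1": w_f1}

labels = [0, 0, 0, 1, 1]        # made-up true labels
predictions = [0, 0, 1, 1, 0]   # made-up predicted labels
metrics = weighted_metrics(labels, predictions)
print(metrics)
```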

## The following evaluation uses StringIndexer alone

In [44]:
printHighlighted('Evaluating the predictions using BinaryClassificationEvaluator using metrics "(NONE|areaUnderROC|areaUnderPR)" ')
bin_eval.evaluate(prediction_label_summary.predictions)
BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderROC').evaluate(prediction_label_summary.predictions)
BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderPR').evaluate(prediction_label_summary.predictions)

Evaluating the predictions using BinaryClassificationEvaluator using metrics "(NONE|areaUnderROC|areaUnderPR)"


0.8236708860759491

0.8236708860759491

0.8333082122957263

In [45]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

multi_eval = MulticlassClassificationEvaluator(labelCol='Survived')

In [46]:
# metricName options: (NONE|f1|weightedPrecision|weightedRecall|accuracy); f1 is the default
printHighlighted('Evaluating the predictions using MulticlassClassificationEvaluator using metrics "(NONE|f1|weightedPrecision|weightedRecall|accuracy)" ')
multi_eval.evaluate(prediction_label_summary.predictions)
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "f1"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "weightedPrecision"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "weightedRecall"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "accuracy"})

Evaluating the predictions using MulticlassClassificationEvaluator using metrics "(NONE|f1|weightedPrecision|weightedRecall|accuracy)"


0.7812198595203173

0.7812198595203173

0.7815584938489967

0.782122905027933

0.7821229050279329

## The following evaluation uses both StringIndexer and OneHotEncoder

In [None]:
printHighlighted('Evaluating the predictions using BinaryClassificationEvaluator using metrics "(NONE|areaUnderROC|areaUnderPR)" ')
bin_eval.evaluate(prediction_label_summary.predictions)
BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderROC').evaluate(prediction_label_summary.predictions)
BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderPR').evaluate(prediction_label_summary.predictions)

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

multi_eval = MulticlassClassificationEvaluator(labelCol='Survived')

In [None]:
# metricName options: (NONE|f1|weightedPrecision|weightedRecall|accuracy); f1 is the default
printHighlighted('Evaluating the predictions using MulticlassClassificationEvaluator using metrics "(NONE|f1|weightedPrecision|weightedRecall|accuracy)" ')
multi_eval.evaluate(prediction_label_summary.predictions)
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "f1"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "weightedPrecision"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "weightedRecall"})
multi_eval.evaluate(prediction_label_summary.predictions, {multi_eval.metricName: "accuracy"})