# Predict Survival of the Titanic Passengers

You will use the Naive Bayes method to predict whether a passenger survived the ill-fated voyage of the Titanic (1912). The passenger data includes name, age, ticket class, and sex. If you watched the movie Titanic, you might recall that first-class passengers and women and children were given priority for the lifeboats. Hence, sex, age, and ticket class could be key predictors of survival.

## Table of Contents
- [Titanic Data](#Titanic_Data)
- [Load Libraries](#load_libraries)
- [Access Data](#access_data)
- [Explore Data](#Explore)
- [Parse Data](#Parse)
- [Split Data into Training and Test set](#training_test)
- [Build Naive Bayes Model](#build_model)
- [Predict for Test data](#test_data)
- [Evaluate the Model](#evaluate_model)

<a id="Titanic_Data"></a>
## Titanic Data 
You will analyze a random sample of the Titanic passenger data. Dataset source: [https://ww2.amstat.org/publications/jse/v3n3/datasets.dawson.html](https://ww2.amstat.org/publications/jse/v3n3/datasets.dawson.html)

 <table style="font-size: 16px; text-align: left;" width=100%>
  <tr>
  <td width=5% style="text-align: center; font-size: 16px">
   </td>
   <td width=11% style="text-align: left; font-size: 16px">
   <b>Variable</b>
   </td>
   <td width=53% style="text-align: left; font-size: 16px">
   <b> Description</b>
   </td>
    <td width=31% rowspan=7>
 
   <img src='https://www.khaskhabar.com/images/picture_image/3690-titanic-ship.jpg?raw=true'></img>
  
  </td>
  </tr>
  <tr>
   <td width=5% style="text-align: center; font-size: 16px">
     0 
   </td>
   <td width=11% style="text-align: left; font-size: 16px">
     Name 
   </td>
   <td width=53% style="text-align: left; font-size: 16px">
     Passenger's first and last name. 
   </td>
  
 </tr>
 <tr>
   <td style="text-align: center; font-size: 16px">
    1
  </td>
  <td style="text-align: left; font-size: 16px">
   PClass 
  </td>
  <td style="text-align: left; font-size: 16px">
   Ticket class (1st, 2nd, or 3rd) based on socio-economic status
  </td>
 </tr>
 <tr>
  <td style="text-align: center; font-size: 16px">
   2 
  </td>
  <td style="text-align: left; font-size: 16px">
  Age 
  </td>
  <td  style="text-align: left; font-size: 16px">
  Passenger's estimated age in years
  </td>
 </tr>
 <tr>
  <td style="text-align: center; font-size: 16px">
  3 
  </td>
  <td style="text-align: left; font-size: 16px">
  Sex 
  </td>
  <td style="text-align: left; font-size: 16px">
  male or female
  </td>
 </tr>
 <tr>
  <td style="text-align: center; font-size: 16px">
  4 
  </td>
  <td style="text-align: left; font-size: 16px">
  Survived
  </td>
  <td style="text-align: left; font-size: 16px">
  Indicates if the passenger survived the sinking of the Titanic (1=survived; 0=died)
  </td>
 </tr>
 <tr>
  <td style="text-align: center; font-size: 16px">
   5
  </td>
  <td style="text-align: left; font-size: 16px">
  PersonID 
  </td>
  <td style="text-align: left; font-size: 16px">
  Passenger's unique identifier
  </td>
 </tr>
</table>

<a id="load_libraries"></a>
## Load Libraries

The Spark and Python libraries that you need are preinstalled in the notebook environment and only need to be loaded.

Run the following cell to load the libraries you will work with in this notebook:

In [None]:
# PySpark machine learning library
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes, MultilayerPerceptronClassifier
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import Row, SQLContext

import os
import sys
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql.types import *

from pyspark.mllib.regression import LabeledPoint
from numpy import array

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Confusion matrix, precision, recall, and F-measure
from pyspark.mllib.evaluation import MulticlassMetrics
# Area under the ROC curve and area under the precision-recall curve
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Create an SQLContext from the SparkContext `sc` that the notebook provides
sqlContext = SQLContext(sc)

# packages for data analysis
import numpy as np
import pandas as pd

In [None]:
# The data will be loaded into an array.
# This is a summary of the data structure, including each column's position and name.
# The first field is at position 0.

# 0 Name    -  Passenger first and last name.
# 1 PClass  -  Ticket class (1st, 2nd, or 3rd) based on Socio-Economic status
# 2 Age
# 3 Sex
# 4 Survived -  1 if the passenger survived;  0 if the passenger did not survive
# 5 PersonID

# "label" is the target variable. "PersonInfo" holds the independent variables
# (ticket class, age, and sex) as one space-separated string. PersonID is the unique identifier.

LabeledDocument = Row("PersonID", "PersonInfo", "label")

# Parse one line of the raw CSV file and return a LabeledDocument row

def parseDocument(line):
    values = [x.strip() for x in line.split(',')]
    # Survived is stored as text: "1" = survived, "0" = died
    survived = 1.0 if values[4] == '1' else 0.0

    # Combine PClass, Age, and Sex into one text field for the tokenizer
    textValue = values[1] + " " + values[2] + " " + values[3]
    return LabeledDocument(values[5], textValue, survived)
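
To see what the parser produces, here is a minimal plain-Python sketch that runs without a Spark session. It swaps `pyspark.sql.Row` for `collections.namedtuple` (an assumption made only to keep the example standalone), the function name `parse_document` is hypothetical, and the sample line is made up rather than taken from the dataset.

```python
from collections import namedtuple

# Stand-in for the Spark Row used in the notebook (assumption for a standalone demo)
LabeledDocument = namedtuple("LabeledDocument", ["PersonID", "PersonInfo", "label"])

def parse_document(line):
    """Split one CSV line into (PersonID, PersonInfo, label)."""
    values = [v.strip() for v in line.split(",")]
    label = 1.0 if values[4] == "1" else 0.0
    person_info = " ".join(values[1:4])  # PClass, Age, Sex
    return LabeledDocument(values[5], person_info, label)

# Hypothetical record, not a real passenger
sample = "Jane Doe,1st,29,female,1,42"
doc = parse_document(sample)
print(doc)
# LabeledDocument(PersonID='42', PersonInfo='1st 29 female', label=1.0)
```

Note that splitting on every comma assumes that no field, including the name, contains a comma. If the file has such fields, the column positions shift and a CSV-aware reader would be needed.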

<a id="access_data"></a>
## Access Data
Before you can access the data file in Object Storage, you must set up the Spark configuration with your Object Storage credentials.

To do this, click the cell below, then on the Files tab select **Insert to code > Insert Spark Session DataFrame** under the data file that you want to work with. The later cells assume that the inserted code defines a DataFrame named `df_data_1` and an Object Storage helper named `cos`.

<div class="alert alert-block alert-info">The following code contains the credentials for a file in your IBM Cloud Object Storage.</div>

In [None]:
# Object Storage Credentials

<a id="Explore"></a>
## Explore Data

In [None]:
print('Number of Passengers', df_data_1.count())

In [None]:
# Number of passengers who survived and number who died
df_data_1.groupby('Survived').count().show()
# Number of passengers who survived and number who died, by sex
df_data_1.groupby('Survived', 'Sex').count().show()
# Number of passengers who survived and number who died, by sex and ticket class
df_data_1.groupby('Survived', 'Sex', 'PClass').count().show()

In [None]:
# Number of passengers who survived and number who died, by sex
import matplotlib.pyplot as plt
%matplotlib inline
survival_by_sex = df_data_1.crosstab('Sex', 'Survived')
survival_by_sex.show()
df = survival_by_sex.toPandas()
df.plot.bar(x="Sex_Survived", legend=True, title="Survival by Gender")

In [None]:
# Number of males and number of females by ticket class
import matplotlib.pyplot as plt
%matplotlib inline
# Use the same column spelling (PClass) throughout so the crosstab column name matches
sex_by_class = df_data_1.crosstab('PClass', 'Sex')
sex_by_class.show()
df = sex_by_class.toPandas()
df.plot.barh(x="PClass_Sex", legend=True, title="Gender by PClass")

<a id="Parse"></a>
## Parse Data
Now load the data into a `Spark RDD` and print the number of rows and the first 5 rows.
Each project you create has a bucket in your object storage. You can find the bucket name on the project Settings page. Replace the string `BUCKET` with the bucket name.

In [None]:
data = sc.textFile(cos.url('Titanic.csv', 'BUCKET'))
print("Total records in the data set:", data.count())
print("The first 5 rows:")
data.take(5)

Create a DataFrame from the RDD:

In [None]:
# Drop the header row, parse each line with parseDocument, and convert the RDD to a DataFrame
documents = data.filter(lambda s: "Name" not in s).map(parseDocument)
TitanicData = documents.toDF()
print("Number of records: " + str(TitanicData.count()))
print("First 5 records:")
TitanicData.take(5)

<a id="training_test"></a>
## Split Data into Training and Test set

We divide the data into a training set and a test set. The training set is used to build the model, and the test set is used to evaluate how well the model performs on data it has not seen.
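
A random split usually gives set sizes that are close to, but not exactly, the requested proportions. One way to picture this is that each record is assigned to a set independently at random, as in the plain-Python sketch below. This is an illustration of the idea rather than Spark's implementation; the function name `random_split` and the seed value 42 are assumptions. In Spark, `randomSplit` also accepts an optional `seed` argument if you want a repeatable split.

```python
import random

def random_split(records, train_fraction=0.8, seed=42):
    """Assign each record independently to the training or the test set."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for record in records:
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test

records = list(range(1000))  # stand-in for passenger rows
train, test = random_split(records)
print(len(train), len(test))  # close to 800 and 200, but usually not exact

# Calling again with the same seed reproduces the same split
train_again, test_again = random_split(records)
print(train == train_again)
```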

In [None]:
# Divide the data into a training set (about 80%) and a test set (about 20%)
(train, test) = TitanicData.randomSplit([0.8, 0.2])
print("Number of records in the training set: " + str(train.count()))
print("Number of records in the test set: " + str(test.count()))
# Output the first 20 records in the training set
print("First 20 records in the training set:")
train.show()

<a id="build_model"></a>
## Build Naive Bayes Model

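We use a SparkML Pipeline to build the Naive Bayes model. The pipeline has three stages: a tokenizer that splits `PersonInfo` into words, a hashing step that turns the words into a term-frequency feature vector, and the Naive Bayes classifier.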
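
To make the classifier stage less of a black box, here is a minimal plain-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing of 1.0, applied to space-separated tokens such as `1st 29 female`. It is an illustration under assumptions, not Spark's implementation: it counts tokens in a dictionary instead of hashing them into a fixed-size vector, the helper names `train_nb` and `predict_nb` are hypothetical, and the four training records are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, smoothing=1.0):
    """docs: list of (text, label). Returns log priors and log likelihoods."""
    label_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        token_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_like = {}
    for c in label_counts:
        # Smoothing adds a pseudo-count to every token in the vocabulary
        total = sum(token_counts[c].values()) + smoothing * len(vocab)
        log_like[c] = {t: math.log((token_counts[c][t] + smoothing) / total)
                       for t in vocab}
    return log_prior, log_like

def predict_nb(text, log_prior, log_like):
    """Return the label with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in text.lower().split():
            if t in log_like[c]:  # ignore tokens never seen in training
                score += log_like[c][t]
        scores[c] = score
    return max(scores, key=scores.get)

# Made-up training records in the notebook's "PClass Age Sex" format
training = [
    ("1st 29 female", 1.0),
    ("2nd 35 female", 1.0),
    ("3rd 40 male", 0.0),
    ("3rd 22 male", 0.0),
]
log_prior, log_like = train_nb(training)
print(predict_nb("1st 30 female", log_prior, log_like))  # 1.0
print(predict_nb("3rd 50 male", log_prior, log_like))    # 0.0
```

One consequence visible in the sketch is that each age is treated as its own word, so ages 29 and 30 share no information. The same holds for any approach that feeds age to the model as a text token.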

In [None]:
# Set up the Naive Bayes pipeline with SparkML: tokenize, hash to term frequencies, classify
tokenizer = Tokenizer(inputCol="PersonInfo", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
nb = NaiveBayes(labelCol="label", featuresCol="features", predictionCol="prediction", smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[tokenizer, hashingTF, nb])
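
The hashing stage can be pictured with the plain-Python sketch below, which maps each token to an index by hashing it and counts how often each index occurs. It only illustrates the hashing trick: `zlib.crc32` is a stand-in hash and not the function Spark uses, the helper name `hashing_tf` is hypothetical, and `num_features=16` is chosen so the result stays readable.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=16):
    """Map tokens to a sparse term-frequency vector {index: count}."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash chosen for this demo
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)

tokens = "1st 29 female".lower().split()  # what the tokenizer stage produces
vector = hashing_tf(tokens)
print(vector)
```

With so few buckets, two different tokens can land on the same index (a collision). A much larger feature space makes that unlikely for a vocabulary this small.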

In [None]:
# Fit the pipeline to the training data
# The stages are executed in order
model = pipeline.fit(train)

<a id="test_data"></a>
## Predict for Test data

In [None]:
# Make predictions for the test data and print the columns of interest
prediction = model.transform(test)
selected = prediction.select("PersonInfo", "prediction", "probability")
for row in selected.collect():
    print(row)

In [None]:
# Tabulate the predicted outcome
prediction.select("prediction").groupBy("prediction").count().show(truncate=False)

In [None]:
# Tabulate the actual outcome
prediction.select("label").groupBy("label").count().show(truncate=False)

In [None]:
# This table shows:
# 1. The number of passengers who survived predicted as died
# 2. The number of passengers who survived predicted as survived
# 3. The number of passengers who died predicted as died
# 4. The number of passengers who died predicted as survived

prediction.crosstab('label', 'prediction').show()
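
The four counts in this table are the cells of a confusion matrix. As a small plain-Python sketch of the same tally (the label and prediction lists are made up, and `confusion_counts` is a hypothetical helper name):

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count (actual, predicted) pairs, one cell per combination."""
    return Counter(zip(labels, predictions))

# Made-up outcomes: 1.0 = survived, 0.0 = died
labels      = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
predictions = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]

counts = confusion_counts(labels, predictions)
tp = counts[(1.0, 1.0)]  # survived, predicted survived
fn = counts[(1.0, 0.0)]  # survived, predicted died
tn = counts[(0.0, 0.0)]  # died, predicted died
fp = counts[(0.0, 1.0)]  # died, predicted survived
print(tp, fn, tn, fp)  # 3 1 3 1
```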

<a id="evaluate_model"></a>
## Evaluate the Model

We evaluate the model on the training set and on the test set. The training-set metrics show how well the model fits the data it was built from, and the test-set metrics estimate its predictive accuracy on new data.
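
For reference, the metrics reported in this section can be computed from confusion-matrix counts as in the plain-Python sketch below. The counts are made up, `classification_metrics` is a hypothetical helper, and "positive" means the survived class (label 1.0).

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, error, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "error": 1 - accuracy,
            "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for a test set of 200 passengers
m = classification_metrics(tp=50, fp=10, tn=120, fn=20)
for name, value in m.items():
    print("%s = %.3f" % (name, value))
```

Precision answers "of the passengers predicted to survive, what fraction did?", while recall answers "of the passengers who survived, what fraction did the model find?". F1 is the harmonic mean of the two.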

In [None]:
# Evaluate the Naive Bayes model on the training set
# Select (prediction, true label) and compute the training accuracy
pred_nb = model.transform(train).select("prediction", "label")
eval_nb = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_nb = eval_nb.evaluate(pred_nb)
# Convert to an RDD of (prediction, label) rows for MulticlassMetrics
predictionAndLabels_nb = pred_nb.rdd
metrics_nb = MulticlassMetrics(predictionAndLabels_nb)
# Precision, recall, and F1 for the "survived" class (label 1.0)
precision_nb = metrics_nb.precision(1.0)
recall_nb = metrics_nb.recall(1.0)
f1Measure_nb = metrics_nb.fMeasure(1.0, 1.0)
print("Model evaluation for the training data")
print("Accuracy = %s" % accuracy_nb)
print("Error = %s" % (1 - accuracy_nb))
print("Precision = %s" % precision_nb)
print("Recall = %s" % recall_nb)
print("F1 Measure = %s" % f1Measure_nb)

In [None]:
# Evaluate the Naive Bayes model on the test set
# Select (prediction, true label) and compute the test accuracy and test error
pred_nb = model.transform(test).select("prediction", "label")
eval_nb = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_nb = eval_nb.evaluate(pred_nb)
# Convert to an RDD of (prediction, label) rows for MulticlassMetrics
predictionAndLabels_nb = pred_nb.rdd
metrics_nb = MulticlassMetrics(predictionAndLabels_nb)
# Precision, recall, and F1 for the "survived" class (label 1.0)
precision_nb = metrics_nb.precision(1.0)
recall_nb = metrics_nb.recall(1.0)
f1Measure_nb = metrics_nb.fMeasure(1.0, 1.0)
print("Model evaluation for the test data")
print("Test Accuracy = %s" % accuracy_nb)
print("Test Error = %s" % (1 - accuracy_nb))
print("Precision = %s" % precision_nb)
print("Recall = %s" % recall_nb)
print("F1 Measure = %s" % f1Measure_nb)

In [None]:
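# predictionAndLabels_nb holds the test-set (prediction, label) pairs from the previous cell.
# The scores passed here are hard 0/1 predictions, not probabilities,
# so the curves are based on a single operating point.
bin_nb = BinaryClassificationMetrics(predictionAndLabels_nb)

# Area under the precision-recall curve
print("Area under PR = %s" % bin_nb.areaUnderPR)
# Area under the ROC curve
print("Area under ROC = %s" % bin_nb.areaUnderROC)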
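
Because the scores passed to `BinaryClassificationMetrics` above are hard 0/1 predictions, the ROC curve has essentially one interior point. Under that assumption, a trapezoid-rule area reduces to (1 + TPR - FPR) / 2, which the plain-Python sketch below checks on made-up counts. It is a sketch of the arithmetic only, not a reproduction of Spark's internals, and `trapezoid_area` is a hypothetical helper.

```python
def trapezoid_area(points):
    """Area under a piecewise-linear curve given (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up confusion-matrix counts
tp, fp, tn, fn = 50, 10, 120, 20
tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate

# ROC curve for hard predictions: (0, 0) -> (FPR, TPR) -> (1, 1)
roc_points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
auc = trapezoid_area(roc_points)
print(auc, (1 + tpr - fpr) / 2)  # the two values agree
```

To trace a full curve, the score would need to be a continuous value, such as the predicted probability of survival, rather than the final 0/1 prediction.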