<a href="https://colab.research.google.com/github/cesarhanna/Data-Science-Projects/blob/main/Prediction_of_survivability_of_the_Titanic_disaster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Prediction of survivability of the Titanic disaster**

**The Challenge**

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we want to build a predictive model that answers the question “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

**The dataframe**

In this notebook, I use Spark to create the dataframes.

In [1]:
!pip install pyspark==2.4.5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark==2.4.5
  Downloading pyspark-2.4.5.tar.gz (217.8 MB)
     |████████████████████████████████| 217.8 MB 6.2 kB/s 
Collecting py4j==0.10.7
  Downloading py4j-0.10.7-py2.py3-none-any.whl (197 kB)
     |████████████████████████████████| 197 kB 52.8 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.5-py2.py3-none-any.whl size=218257928 sha256=76ce08a72b292fe2a85e0f5d989789e55e9fe15d8f66b71145777cfb5e15fedc
  Stored in directory: /root/.cache/pip/wheels/01/c0/03/1c241c9c482b647d4d99412a98a5c7f87472728ad41ae55e1e
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.7 pyspark-2.4.5


In [2]:
!pip install --upgrade pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
     |████████████████████████████████| 281.3 MB 46 kB/s 
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     |████████████████████████████████| 199 kB 55.0 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=625f7adaa93f156630f4f09539b06b92f02e4156e66abad94a2f695566c4df17
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.7
    Uninstalling py4j-0.10.7:
      Successfully uninstalled py4j-0.10.7
  Attempting uninstall:

In [42]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

In [43]:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession \
    .builder \
    .getOrCreate()

In [44]:
# Read the CSV with pandas and write it out as a Parquet file.
# DataFrame.to_parquet returns None when given a path, so its result is not assigned.

import pandas as pd

df1 = pd.read_csv('train.csv')
df1.to_parquet('train.parquet')

# Load the Parquet file as a Spark dataframe
df_train = spark.read.load('train.parquet')

df_train.createOrReplaceTempView("df_train")
spark.sql("SELECT * FROM df_train").show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [45]:
# Apply the same procedure to the test set

df2 = pd.read_csv('test.csv')
df2.to_parquet('test.parquet')

# Load the Parquet file as a Spark dataframe
df_test = spark.read.load('test.parquet')

df_test.createOrReplaceTempView("df_test")
spark.sql("SELECT * FROM df_test").show()

+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|        892|     3|    Kelly, Mr. James|  male|34.5|    0|    0|          330911| 7.8292| null|       Q|
|        893|     3|Wilkes, Mrs. Jame...|female|47.0|    1|    0|          363272|    7.0| null|       S|
|        894|     2|Myles, Mr. Thomas...|  male|62.0|    0|    0|          240276| 9.6875| null|       Q|
|        895|     3|    Wirz, Mr. Albert|  male|27.0|    0|    0|          315154| 8.6625| null|       S|
|        896|     3|Hirvonen, Mrs. Al...|female|22.0|    1|    1|         3101298|12.2875| null|       S|
|        897|     3|Svensson, Mr. Joh...|  male|14.0|    0|    0|            7538|  9.225| null|       S|
|        898|     3|Connolly, Miss. Kate|femal

Based on the data shown above, we select the following features to predict survival:

*   Features:
    *   Sex
    *   Age
    *   SibSp (siblings and/or spouses)
    *   Parch (parents and/or children)
    *   Pclass (ticket class)
*   Target:
    *   Survived

In [46]:
# Encode the categorical "Sex" column numerically: 0 for male, 1 for female

from pyspark.sql.functions import col, when

df_train_num = df_train.withColumn('Sex', when(col('Sex') == 'female', 1).when(col('Sex') == 'male', 0))
df_train_num.show()

+-----------+--------+------+--------------------+---+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+---+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  0|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|  1|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|  1|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|  1|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  0|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  0|null|    0|    0|          330877| 8.4583| null|  
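The chained `when(...).when(...)` call above has no `otherwise(...)` branch, so any value that matches neither condition ends up as null. The following plain-Python sketch mirrors that behaviour for a single value; `encode_sex` is a hypothetical helper used only for illustration and is not part of the notebook.

```python
from typing import Optional


def encode_sex(value: Optional[str]) -> Optional[int]:
    """Mimic when(col('Sex') == 'female', 1).when(col('Sex') == 'male', 0)."""
    if value == 'female':
        return 1
    if value == 'male':
        return 0
    # No otherwise(...) branch: unmatched or missing values become null (None here)
    return None


# The comparison is case-sensitive, so 'Female' is not matched
sample = ['male', 'female', None, 'Female']
encoded = [encode_sex(v) for v in sample]
print(encoded)
```

If the column could contain other spellings, it would be worth counting the nulls in `Sex` after the transformation before training.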

In [47]:
# The same transformation should be done on the test dataset

df_test_num = df_test.withColumn('Sex', when(col('Sex') == 'female', 1).when(col('Sex') == 'male', 0))
df_test_num.show()

+-----------+------+--------------------+---+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Pclass|                Name|Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+------+--------------------+---+----+-----+-----+----------------+-------+-----+--------+
|        892|     3|    Kelly, Mr. James|  0|34.5|    0|    0|          330911| 7.8292| null|       Q|
|        893|     3|Wilkes, Mrs. Jame...|  1|47.0|    1|    0|          363272|    7.0| null|       S|
|        894|     2|Myles, Mr. Thomas...|  0|62.0|    0|    0|          240276| 9.6875| null|       Q|
|        895|     3|    Wirz, Mr. Albert|  0|27.0|    0|    0|          315154| 8.6625| null|       S|
|        896|     3|Hirvonen, Mrs. Al...|  1|22.0|    1|    1|         3101298|12.2875| null|       S|
|        897|     3|Svensson, Mr. Joh...|  0|14.0|    0|    0|            7538|  9.225| null|       S|
|        898|     3|Connolly, Miss. Kate|  1|30.0|    0|    0|          3

Next, we drop the columns we don't need and keep the features plus PassengerId, which we will use in the final prediction dataframe.

In [48]:
df_train_num_dropped = df_train_num.drop('Name', 'Ticket', 'Fare', 'Cabin', 'Embarked')
df_train_num_dropped.show()

+-----------+--------+------+---+----+-----+-----+
|PassengerId|Survived|Pclass|Sex| Age|SibSp|Parch|
+-----------+--------+------+---+----+-----+-----+
|          1|       0|     3|  0|22.0|    1|    0|
|          2|       1|     1|  1|38.0|    1|    0|
|          3|       1|     3|  1|26.0|    0|    0|
|          4|       1|     1|  1|35.0|    1|    0|
|          5|       0|     3|  0|35.0|    0|    0|
|          6|       0|     3|  0|null|    0|    0|
|          7|       0|     1|  0|54.0|    0|    0|
|          8|       0|     3|  0| 2.0|    3|    1|
|          9|       1|     3|  1|27.0|    0|    2|
|         10|       1|     2|  1|14.0|    1|    0|
|         11|       1|     3|  1| 4.0|    1|    1|
|         12|       1|     1|  1|58.0|    0|    0|
|         13|       0|     3|  0|20.0|    0|    0|
|         14|       0|     3|  0|39.0|    1|    5|
|         15|       0|     3|  1|14.0|    0|    0|
|         16|       1|     2|  1|55.0|    0|    0|
|         17|       0|     3|  

In [49]:
df_test_num_dropped = df_test_num.drop('Name', 'Ticket', 'Fare', 'Cabin', 'Embarked')
df_test_num_dropped.show()

+-----------+------+---+----+-----+-----+
|PassengerId|Pclass|Sex| Age|SibSp|Parch|
+-----------+------+---+----+-----+-----+
|        892|     3|  0|34.5|    0|    0|
|        893|     3|  1|47.0|    1|    0|
|        894|     2|  0|62.0|    0|    0|
|        895|     3|  0|27.0|    0|    0|
|        896|     3|  1|22.0|    1|    1|
|        897|     3|  0|14.0|    0|    0|
|        898|     3|  1|30.0|    0|    0|
|        899|     2|  0|26.0|    1|    1|
|        900|     3|  1|18.0|    0|    0|
|        901|     3|  0|21.0|    2|    0|
|        902|     3|  0|null|    0|    0|
|        903|     1|  0|46.0|    0|    0|
|        904|     1|  1|23.0|    1|    0|
|        905|     2|  0|63.0|    1|    0|
|        906|     1|  1|47.0|    1|    0|
|        907|     2|  1|24.0|    1|    0|
|        908|     2|  0|35.0|    0|    0|
|        909|     3|  0|21.0|    0|    0|
|        910|     3|  1|27.0|    1|    0|
|        911|     3|  1|45.0|    0|    0|
+-----------+------+---+----+-----

In [50]:
# Let's get rid of the null values in Age

df_train_num_dropped.createOrReplaceTempView("df_train_num_dropped")
df_train_updated = spark.sql("select * from df_train_num_dropped where Age is not null")
df_train_updated.show()

+-----------+--------+------+---+----+-----+-----+
|PassengerId|Survived|Pclass|Sex| Age|SibSp|Parch|
+-----------+--------+------+---+----+-----+-----+
|          1|       0|     3|  0|22.0|    1|    0|
|          2|       1|     1|  1|38.0|    1|    0|
|          3|       1|     3|  1|26.0|    0|    0|
|          4|       1|     1|  1|35.0|    1|    0|
|          5|       0|     3|  0|35.0|    0|    0|
|          7|       0|     1|  0|54.0|    0|    0|
|          8|       0|     3|  0| 2.0|    3|    1|
|          9|       1|     3|  1|27.0|    0|    2|
|         10|       1|     2|  1|14.0|    1|    0|
|         11|       1|     3|  1| 4.0|    1|    1|
|         12|       1|     1|  1|58.0|    0|    0|
|         13|       0|     3|  0|20.0|    0|    0|
|         14|       0|     3|  0|39.0|    1|    5|
|         15|       0|     3|  1|14.0|    0|    0|
|         16|       1|     2|  1|55.0|    0|    0|
|         17|       0|     3|  0| 2.0|    4|    1|
|         19|       0|     3|  
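The effect of the `where Age is not null` filter can be sketched on a few rows in plain Python. The three rows below are taken from the output above (passenger 6 has no recorded age); the variable names are assumptions made for this illustration.

```python
# Three rows from the training output above; None stands in for Spark's null
rows = [
    {'PassengerId': 5, 'Age': 35.0},
    {'PassengerId': 6, 'Age': None},
    {'PassengerId': 7, 'Age': 54.0},
]

# Equivalent of: select * from ... where Age is not null
kept = [r for r in rows if r['Age'] is not None]
dropped_ids = [r['PassengerId'] for r in rows if r['Age'] is None]

print('kept:', [r['PassengerId'] for r in kept])
print('dropped:', dropped_ids)
```

Dropping rows is the simplest option, but when the same filter is applied to the test set, the passengers it removes receive no prediction at all.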

In [51]:
# The same for the test set

df_test_num_dropped.createOrReplaceTempView("df_test_num_dropped")
df_test_updated = spark.sql("select PassengerId, Pclass, Sex, Age, SibSp, Parch from df_test_num_dropped where Age is not null")
df_test_updated.show()

+-----------+------+---+----+-----+-----+
|PassengerId|Pclass|Sex| Age|SibSp|Parch|
+-----------+------+---+----+-----+-----+
|        892|     3|  0|34.5|    0|    0|
|        893|     3|  1|47.0|    1|    0|
|        894|     2|  0|62.0|    0|    0|
|        895|     3|  0|27.0|    0|    0|
|        896|     3|  1|22.0|    1|    1|
|        897|     3|  0|14.0|    0|    0|
|        898|     3|  1|30.0|    0|    0|
|        899|     2|  0|26.0|    1|    1|
|        900|     3|  1|18.0|    0|    0|
|        901|     3|  0|21.0|    2|    0|
|        903|     1|  0|46.0|    0|    0|
|        904|     1|  1|23.0|    1|    0|
|        905|     2|  0|63.0|    1|    0|
|        906|     1|  1|47.0|    1|    0|
|        907|     2|  1|24.0|    1|    0|
|        908|     2|  0|35.0|    0|    0|
|        909|     3|  0|21.0|    0|    0|
|        910|     3|  1|27.0|    1|    0|
|        911|     3|  1|45.0|    0|    0|
|        912|     1|  0|55.0|    1|    0|
+-----------+------+---+----+-----

To fit the model and make predictions, I convert the Spark dataframes for both the training features and the target (Survived) to pandas, because fitting and predicting with the Spark dataframes threw an error.

In [52]:
# Converting the training dataframe to pandas

df_train_updated.createOrReplaceTempView("df_train_updated")
df_train_updated_features = spark.sql("select Pclass, Sex, Age, SibSp, Parch from df_train_updated")

df_train_updated_features_pd = df_train_updated_features.toPandas()
df_train_updated_features_pd

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch
0,3,0,22.0,1,0
1,1,1,38.0,1,0
2,3,1,26.0,0,0
3,1,1,35.0,1,0
4,3,0,35.0,0,0
...,...,...,...,...,...
709,3,1,39.0,0,5
710,2,0,27.0,0,0
711,1,1,19.0,0,0
712,1,0,26.0,0,0


In [53]:
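# Creating the features dataframe of the test set and converting it to pandas
# NOTE: the query below reads from df_train_updated, not df_test_updated,
# so the resulting frame holds the training features again (see the output)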

df_test_updated.createOrReplaceTempView("df_test_updated")
df_test_features = spark.sql('select Pclass, Sex, Age, SibSp, Parch from df_train_updated')

df_test_features_pd = df_test_features.toPandas()
df_test_features_pd

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch
0,3,0,22.0,1,0
1,1,1,38.0,1,0
2,3,1,26.0,0,0
3,1,1,35.0,1,0
4,3,0,35.0,0,0
...,...,...,...,...,...
709,3,1,39.0,0,5
710,2,0,27.0,0,0
711,1,1,19.0,0,0
712,1,0,26.0,0,0


In [54]:
# Creating the target dataframe and converting it to pandas

df_train_updated.createOrReplaceTempView("df_train_updated")
target = spark.sql("select Survived from df_train_updated")

target_pd = target.toPandas()
target_pd

Unnamed: 0,Survived
0,0
1,1
2,1
3,1
4,0
...,...
709,0
710,0
711,1
712,1


Since we have labeled data, a supervised ML algorithm is the natural choice; let's use a Random Forest, since our prediction is binary. I ran a few algorithms to compare their f1-scores: Random Forest, Gradient Boosted Tree, Decision Tree, Support Vector Machine and Linear Regression. The Random Forest gave the highest score, so we'll go with that.

In [55]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

In [56]:
# Pass the target as a 1-D Series; scikit-learn warns when y is a single-column dataframe
model = rfc.fit(df_train_updated_features_pd, target_pd['Survived'])


In [57]:
prediction_rfc = model.predict(df_test_features_pd)

**Evaluating the model**

In [58]:
from sklearn.metrics import f1_score

# F1 score for the positive class (Survived = 1)
f1_score_rfc = f1_score(target_pd, prediction_rfc)
print('The f1 score is:', f1_score_rfc)

The f1 score is: 0.7861271676300577
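To make the number above concrete, here is a small sketch of how F1 follows from true positives, false positives and false negatives. The counts are an inference, not notebook output: they are the integers consistent with the printed F1 and with the supports (290 survivors, 424 non-survivors) in the classification report further down.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # share of predicted survivors who survived
    recall = tp / (tp + fn)     # share of actual survivors that were found
    return 2 * precision * recall / (precision + recall)


# Inferred counts for the "survived" class (assumption, see the text above)
tp, fp, fn = 204, 25, 86
f1 = f1_from_counts(tp, fp, fn)
print('precision:', round(tp / (tp + fp), 2))
print('recall:', round(tp / (tp + fn), 2))
print('f1:', f1)
```

Because F1 is a harmonic mean, the lower of the two values (recall here) pulls the score down more than an arithmetic mean would.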


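The F1 score is about 0.79, which looks good. Note, however, that the predictions were made on the same rows the model was fitted on (the query in In [53] reads from df_train_updated), so this is a training score rather than an estimate of performance on unseen passengers.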
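Since the Kaggle test set has no Survived column, one common way to estimate performance on unseen data is to hold out part of the labeled training rows for validation. The sketch below shows the idea in plain Python; `holdout_split` is a hypothetical helper, and in practice scikit-learn's `train_test_split` covers the same need.

```python
import random


def holdout_split(rows, labels, test_fraction=0.2, seed=1):
    """Shuffle row indices once and split features and labels the same way."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(rows) * test_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))


# First five training rows from the output above: Pclass, Sex, Age, SibSp, Parch
rows = [[3, 0, 22.0, 1, 0], [1, 1, 38.0, 1, 0], [3, 1, 26.0, 0, 0],
        [1, 1, 35.0, 1, 0], [3, 0, 35.0, 0, 0]]
labels = [0, 1, 1, 1, 0]

X_train, X_val, y_train, y_val = holdout_split(rows, labels, test_fraction=0.2)
print(len(X_train), 'training rows,', len(X_val), 'validation rows')
```

The model would then be fitted on `X_train` and scored on `X_val`, which it never saw during training.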

We can also summarize the evaluation with a confusion matrix and a classification report, as follows:

In [75]:
# Importing the required packages for evaluation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix

In [73]:
# Evaluation function: prints the confusion matrix, optionally row-normalized (it does not draw a plot)

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion Matrix',
                          cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)
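The `normalize=True` branch above divides every row of the matrix by that row's total, so each row then reads as per-class rates instead of raw counts. A minimal plain-Python sketch of the same operation follows; the 2x2 counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def normalize_rows(cm):
    """Divide each cell by the sum of its row (one row per true class)."""
    return [[cell / sum(row) for cell in row] for row in cm]


# Hypothetical counts: rows are true classes, columns are predicted classes
cm = [[8, 2],
      [1, 9]]

normalized = normalize_rows(cm)
for row in normalized:
    print([round(value, 2) for value in row])
```

After normalization the diagonal entries are the per-class recall values, which makes classes of different sizes directly comparable.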

In [74]:
# Evaluation summary - survived is 1 and did not survive is 0

# The labels must be integers to match the values in the Survived column
rfc_cnf_matrix = confusion_matrix(target_pd, prediction_rfc, labels=[1, 0])
np.set_printoptions(precision=2)
print(classification_report(target_pd, prediction_rfc))

              precision    recall  f1-score   support

           0       0.82      0.94      0.88       424
           1       0.89      0.70      0.79       290

    accuracy                           0.84       714
   macro avg       0.86      0.82      0.83       714
weighted avg       0.85      0.84      0.84       714



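The classification report above shows good precision and recall for both classes and an accuracy of 84%. Accuracy is the share of passengers the classifier labels correctly. Recall is noticeably lower for survivors (0.70) than for non-survivors (0.94), so the model misses about 30% of the passengers who actually survived. All of these figures are computed on the rows the model was trained on.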
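The "macro avg" and "weighted avg" rows of the report can be reproduced approximately from the two per-class rows. The sketch below uses the rounded values printed in the report, so results may differ from the report in the last digit; `macro_avg` and `weighted_avg` are helper names introduced here for illustration.

```python
# Per-class rows copied from the classification report (rounded to two decimals)
report_rows = {
    0: {'precision': 0.82, 'recall': 0.94, 'f1': 0.88, 'support': 424},
    1: {'precision': 0.89, 'recall': 0.70, 'f1': 0.79, 'support': 290},
}


def macro_avg(rows, metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(r[metric] for r in rows.values()) / len(rows)


def weighted_avg(rows, metric):
    """Mean over classes weighted by support (number of true samples)."""
    total = sum(r['support'] for r in rows.values())
    return sum(r[metric] * r['support'] for r in rows.values()) / total


for metric in ('precision', 'recall', 'f1'):
    print(metric,
          'macro:', round(macro_avg(report_rows, metric), 3),
          'weighted:', round(weighted_avg(report_rows, metric), 3))
```

The weighted recall equals the overall accuracy, which is why both appear as 0.84 in the report.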