# Titanic - Machine Learning from Disaster

https://www.kaggle.com/competitions/titanic

### The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
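# os.path.join("D:", "titanic") would give the drive-relative path "D:titanic"; os.sep makes it absolute
DATA_PATH = os.path.join("D:", os.sep, "titanic")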

In [3]:
train=pd.read_csv(os.path.join(DATA_PATH,'train.csv'))

In [4]:
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
test=pd.read_csv(os.path.join(DATA_PATH,'test.csv'))

In [6]:
test.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [7]:
train.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Features Selection

Women and children first: sex and age data may be the most important. 

Quite a lot of age data is missing: only 714 of the 891 training rows have an Age value.
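
To see how much is missing in every column at once, something like the following could be used. This is a minimal sketch on a small made-up frame (`toy` and its values are illustrative stand-ins, not the real training data); on the notebook's `train` frame the same two calls would apply.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame (hypothetical values, not the real data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Number and share of missing entries per column, largest first.
missing_count = toy.isna().sum().sort_values(ascending=False)
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_count)
print(missing_share.round(2))
```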

In [8]:
train['Child']=(train['Age']<10).astype(int)

In [9]:
train['Child'].value_counts()

0    829
1     62
Name: Child, dtype: int64

In [10]:
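# restrict to numeric columns; recent pandas versions raise an error on string columns in corr()
train.select_dtypes(include='number').corr()['Survived']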

PassengerId   -0.005007
Survived       1.000000
Pclass        -0.338481
Age           -0.077221
SibSp         -0.035322
Parch          0.081629
Fare           0.257307
Child          0.128812
Name: Survived, dtype: float64

Age shows almost no linear correlation with survival (-0.08), while being a child is weakly positively correlated (0.13). We will look at gender after encoding it.
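
A Pearson coefficient only measures linear association, so a value near zero for Age does not rule out an effect confined to one age group. One way to check is to compare survival rates per age band. The sketch below uses a hand-made toy frame (the rows and the three bands are assumptions chosen for illustration, not results from the dataset).

```python
import pandas as pd

# Toy data (hypothetical): ages and survival flags.
toy = pd.DataFrame({
    "Age": [1, 4, 8, 20, 30, 40, 60, 70],
    "Survived": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Bin ages into coarse bands, then take the mean of Survived per band.
toy["Age_band"] = pd.cut(
    toy["Age"], bins=[0, 10, 55, 99], labels=["child", "adult", "elderly"]
)
rate_by_band = toy.groupby("Age_band", observed=False)["Survived"].mean()

print(rate_by_band)
```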

Can we predict age or age category from Fare, Embarked, and Pclass?

In [11]:
train['Age_cat']=pd.cut(train['Age'],[0,2,5,10,55,99],labels=['infant','baby','child','adult','elderly'])

In [12]:
train[['Fare','Embarked','Pclass','Age_cat']].groupby(['Embarked','Pclass','Age_cat']).describe()

Embarked,Pclass,Age_cat,Fare count,Fare mean,Fare std,Fare min,Fare 25%,Fare 50%,Fare 75%,Fare max
C,1,adult,63.0,115.40946,109.447555,26.55,56.9292,79.2,113.275,512.3292
C,1,elderly,11.0,67.221591,37.445271,29.7,35.0771,61.9792,81.17915,146.5208
C,2,infant,1.0,37.0042,,37.0042,37.0042,37.0042,37.0042,37.0042
C,2,baby,1.0,41.5792,,41.5792,41.5792,41.5792,41.5792,41.5792
C,2,adult,13.0,24.891985,10.875748,12.0,13.8583,24.0,30.0708,41.5792
C,3,infant,4.0,15.69375,5.06374,8.5167,13.93545,17.5,19.2583,19.2583
C,3,baby,2.0,16.3375,4.130635,13.4167,14.8771,16.3375,17.7979,19.2583
C,3,child,1.0,15.2458,,15.2458,15.2458,15.2458,15.2458,15.2458
C,3,adult,34.0,10.455035,4.288241,4.0125,7.2292,7.8958,14.4542,19.2583
Q,1,adult,2.0,90.0,0.0,90.0,90.0,90.0,90.0,90.0


Fare does not seem to vary predictably with age. Missing Age data will be difficult to predict from the Fare, Embarked, and Pclass information.

The names appear to use the title "Master" for boys under a certain age.

In [13]:
#train[train['Age'].isna()][train['Name'].str.contains('Master')].describe()
# Passengers titled 'Master' with a missing Age are treated as children; this adds 4 more rows to Child
train['Child']=((train['Age']<10) | (train['Age'].isna() & train['Name'].str.contains('Master'))).astype(int)

In [14]:
train['Child'].value_counts()

0    825
1     66
Name: Child, dtype: int64
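
The substring check on 'Master' could also be written as an exact match on an extracted title, which avoids accidental hits elsewhere in the name. A minimal sketch, assuming every name follows the "Surname, Title. Given names" layout seen in the table above; the last entry is a made-up name added only to exercise the 'Master' case.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
    "Doe, Master. John",  # hypothetical name, not from the dataset
])

# Capture the text between the first comma and the following period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
is_master = titles.eq("Master")

print(titles.tolist())
print(is_master.sum())
```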

In [15]:
train[train['Name'].str.contains('Miss')]['Age'].describe()
# Miss is not informative of age

count    146.000000
mean      21.773973
std       12.990292
min        0.750000
25%       14.125000
50%       21.000000
75%       30.000000
max       63.000000
Name: Age, dtype: float64

In [16]:
train.sort_values('Ticket')

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Child,Age_cat
504,505,1,1,"Maioni, Miss. Roberta",female,16.0,0,0,110152,86.500,B79,S,0,adult
257,258,1,1,"Cherry, Miss. Gladys",female,30.0,0,0,110152,86.500,B77,S,0,adult
759,760,1,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.500,B77,S,0,adult
262,263,0,1,"Taussig, Mr. Emil",male,52.0,1,1,110413,79.650,E67,S,0,adult
558,559,1,1,"Taussig, Mrs. Emil (Tillie Mandelbaum)",female,39.0,1,1,110413,79.650,E67,S,0,adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,236,0,3,"Harknett, Miss. Alice Phoebe",female,,0,0,W./C. 6609,7.550,,S,0,
92,93,0,1,"Chaffee, Mr. Herbert Fuller",male,46.0,1,0,W.E.P. 5734,61.175,E31,S,0,adult
219,220,0,2,"Harris, Mr. Walter",male,30.0,0,0,W/C 14208,10.500,,S,0,adult
540,541,1,1,"Crosby, Miss. Harriet R",female,36.0,0,2,WE/P 5735,71.000,B22,S,0,adult


In [17]:
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [18]:
train['Sex_male']=(train['Sex']=='male').astype(int)

In [19]:
train[['Sex_male','Survived']].groupby('Sex_male').mean()

Sex_male,Survived
0,0.742038
1,0.188908


74% of women survived, compared with 19% of men! Sex is definitely an important category.
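
The same comparison can be broken down further, for instance by passenger class. A minimal sketch on a hand-made toy frame (the rows are illustrative, not taken from the dataset), using `pivot_table` to get the mean of `Survived` per group:

```python
import pandas as pd

# Toy data (hypothetical rows).
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Rows: sex, columns: class, cells: survival rate within that group.
rates = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)

print(rates)
```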

Try combining Parch and SibSp to see if it improves correlation

In [20]:
train['rel']=train['Parch']+train['SibSp']

In [21]:
train.select_dtypes(include='number').corr()['Survived']

PassengerId   -0.005007
Survived       1.000000
Pclass        -0.338481
Age           -0.077221
SibSp         -0.035322
Parch          0.081629
Fare           0.257307
Child          0.129244
Sex_male      -0.543351
rel            0.016639
Name: Survived, dtype: float64

'rel' (0.017) is less correlated with survival than 'Parch' alone (0.082), so combining the two columns does not help.

Exploring Cabin

In [22]:
train[train['Pclass']==2]['Cabin'].count()

16

In [23]:
train['No_cabin']=train['Cabin'].isna().astype(int)

In [24]:
train['No_cabin'].value_counts()

1    687
0    204
Name: No_cabin, dtype: int64

In [25]:
train.select_dtypes(include='number').corr()['Survived']

PassengerId   -0.005007
Survived       1.000000
Pclass        -0.338481
Age           -0.077221
SibSp         -0.035322
Parch          0.081629
Fare           0.257307
Child          0.129244
Sex_male      -0.543351
rel            0.016639
No_cabin      -0.316912
Name: Survived, dtype: float64

### Pipelines for Data Cleaning and Predictions

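We will make the following pipelines:
* Preprocessing
  * Numerical columns:
    * Impute missing values (NA) with the column median
    * Standardize with `StandardScaler`
  * Categorical columns:
    * Impute missing values with the most frequent category
    * One-hot encoding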

In [26]:
train=pd.read_csv(os.path.join(DATA_PATH,'train.csv'))

In [27]:
train['No_cabin']=train['Cabin'].isna().astype(int)
test['No_cabin']=test['Cabin'].isna().astype(int)

In [28]:
train['Child']=((train['Age']<10) | (train['Age'].isna() & train['Name'].str.contains('Master'))).astype(int)
test['Child']=((test['Age']<10) | (test['Age'].isna() & test['Name'].str.contains('Master'))).astype(int)

In [29]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'No_cabin', 'Child'],
      dtype='object')

In [30]:
X_train=train[['Pclass','Age','SibSp','Parch','Fare','No_cabin','Child','Sex','Embarked']]
y_train=train['Survived']

In [31]:
all_cols=X_train.columns

In [32]:
num_cols=['Pclass','Age','SibSp','Parch','Fare','No_cabin','Child']

In [33]:
cat_cols=['Sex','Embarked']

In [34]:
from sklearn.pipeline import Pipeline

In [35]:
from sklearn.impute import SimpleImputer

In [36]:
from sklearn.preprocessing import StandardScaler

In [37]:
from sklearn.preprocessing import OneHotEncoder

In [38]:
num_pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler",StandardScaler())
    ])

In [39]:
cat_pipe= Pipeline([
    ("imputer",SimpleImputer(strategy='most_frequent')),
    ("OH_encode",OneHotEncoder())
])

In [40]:
from sklearn.compose import ColumnTransformer

In [41]:
# Apply num_pipe to the numerical columns and cat_pipe to the categorical ones; drop everything else
preprocess_pipeline=ColumnTransformer([
    ('num',num_pipe,num_cols),
    ('nonnum',cat_pipe,cat_cols)
],remainder='drop')
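
One caveat with `OneHotEncoder()` at its defaults: it raises an error at transform time if it meets a category that was not present during fit. Whether that can happen here depends on the data, so the following is only a sketch of the `handle_unknown="ignore"` option on made-up port codes, where an unseen category is encoded as an all-zero row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up example: the encoder is fitted on ports "S" and "C" only.
train_ports = np.array([["S"], ["C"], ["S"]])
new_ports = np.array([["C"], ["Q"]])  # "Q" was never seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_ports)

# Known categories get a 1 in their column; the unknown one becomes all zeros.
encoded = encoder.transform(new_ports).toarray()

print(encoder.categories_[0])
print(encoded)
```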


In [42]:
X_train.head()

,Pclass,Age,SibSp,Parch,Fare,No_cabin,Child,Sex,Embarked
0,3,22.0,1,0,7.25,1,0,male,S
1,1,38.0,1,0,71.2833,0,0,female,C
2,3,26.0,0,0,7.925,1,0,female,S
3,1,35.0,1,0,53.1,0,0,female,S
4,3,35.0,0,0,8.05,1,0,male,S


In [43]:
y_train.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [44]:
X_train_prep=preprocess_pipeline.fit_transform(X_train)

In [45]:
from sklearn.ensemble import RandomForestClassifier

In [46]:
rf_clf=RandomForestClassifier(n_estimators=100, random_state=42)

In [47]:
rf_clf.fit(X_train_prep,y_train)

RandomForestClassifier(random_state=42)

In [48]:
from xgboost import XGBClassifier

In [49]:
xgb_clf=XGBClassifier()

In [50]:
xgb_clf.fit(X_train_prep,y_train)

XGBClassifier()

In [51]:
y_train_pred_rf=rf_clf.predict(X_train_prep)

In [52]:
y_train_pred_xgb=xgb_clf.predict(X_train_prep)

In [53]:
from sklearn.model_selection import cross_val_score

In [54]:
rf_score=cross_val_score(rf_clf,X_train_prep,y_train,cv=10)

In [55]:
rf_score.mean(), rf_score

(0.8103995006242197,
 array([0.74444444, 0.84269663, 0.76404494, 0.80898876, 0.85393258,
        0.83146067, 0.79775281, 0.75280899, 0.85393258, 0.85393258]))

In [56]:
xgb_score=cross_val_score(xgb_clf,X_train_prep,y_train,cv=10)

In [57]:
xgb_score.mean(), xgb_score

(0.8283146067415729,
 array([0.8       , 0.79775281, 0.7752809 , 0.86516854, 0.88764045,
        0.83146067, 0.83146067, 0.78651685, 0.86516854, 0.84269663]))
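
The scores above are computed on data that was preprocessed once on the full training set, so each validation fold has already influenced the imputation medians and scaling statistics. A common alternative is to chain the preprocessing and the classifier in one `Pipeline`, so both are refitted inside every fold. The sketch below shows the idea on synthetic data; `toy_X`, `toy_y`, the column subset, and the small forest size are assumptions made for illustration, and the resulting scores say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (hypothetical values).
rng = np.random.default_rng(0)
n = 60
age = rng.uniform(1, 70, n)
age[::7] = np.nan  # inject some missing ages
toy_X = pd.DataFrame({
    "Age": age,
    "Fare": rng.uniform(5, 100, n),
    "Sex": np.tile(["male", "female"], n // 2),
})
toy_y = (toy_X["Sex"] == "female").astype(int)

num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", num_steps, ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Preprocessing and model in one estimator: refitted on each training fold.
model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=42)),
])
scores = cross_val_score(model, toy_X, toy_y, cv=5)

print(scores.mean(), scores)
```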

In [58]:
# transform only: reuse the medians, scaling and categories learned from the training set
test_prep=preprocess_pipeline.transform(test)
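
Preprocessing statistics (medians, means, category lists) are normally learned from the training set and then reused unchanged on the test set. A minimal sketch of the difference between `transform` and `fit_transform` on held-out data, using made-up fare values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up fares for illustration.
train_fare = np.array([[10.0], [20.0], [30.0]])
test_fare = np.array([[20.0], [40.0]])

scaler = StandardScaler()
scaler.fit(train_fare)                     # learn mean and std from training data only
test_scaled = scaler.transform(test_fare)  # reuse the training statistics

# Refitting on the test set silently uses different statistics,
# so the same raw value maps to a different scaled value.
refit_scaled = StandardScaler().fit_transform(test_fare)

print(test_scaled.ravel())
print(refit_scaled.ravel())
```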

In [59]:
y_test_pred=xgb_clf.predict(test_prep)
#y_test_pred=rf_clf.predict(test_prep)

In [60]:
submission=pd.DataFrame(test['PassengerId'],columns=['PassengerId'])

In [61]:
submission['Survived']=y_test_pred

In [62]:
submission

,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [63]:
submission.to_csv(os.path.join(DATA_PATH,'to_submit.csv'),index=False)