This is the first project by Dean, Longhao and Senh. We start with the classic Titanic dataset downloaded from Kaggle: https://www.kaggle.com/c/titanic

The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [33]:
#Start by importing the libraries used below; the data is read from a local file in the next cell
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
sns.set(style="ticks", color_codes=True)


In [34]:
titanic = pd.read_csv('train.csv')
titanic.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [35]:
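#This function imputes the mean age of each passenger class into the rows with a missing age.
#The hard-coded values below are the approximate class means given by this groupby.
titanic.groupby('Pclass').Age.mean()

def input_age(col):
    #Access the row by column label; positional access such as col[0] is deprecated on labelled Series
    age = col['Age']
    pclass = col['Pclass']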
    if pd.isnull(age):
        if pclass == 1:
            return 38.23
        elif pclass == 2:
            return 29.8
        else:
            return 25
        
    else:
        return age
    
titanic['Age'] = titanic[['Age','Pclass']].apply(input_age,axis=1)
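
The class-wise imputation above can also be written without hard-coded means. The following is a minimal sketch, assuming a toy frame with made-up values in place of the real data; `toy`, `class_mean_age` and `Age_filled` are illustrative names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 30.0, np.nan, 20.0, 30.0, np.nan],
})

# Mean age of each class, broadcast back to every row of that class
# (missing ages are skipped when the mean is computed)
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages; observed ages are left untouched
toy["Age_filled"] = toy["Age"].fillna(class_mean_age)
print(toy)
```

Computing the means from the data keeps the imputed values in sync if the rows change. For a held-out test set, the means learned on the training data would normally be reused, as the notebook does by calling the same function on both files.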


In [36]:
#Next, we want to drop cabin information
titanic.drop('Cabin',axis=1,inplace=True)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [37]:
#Separate children from adults: flag passengers aged 14 or younger (note that age 14 itself is included)
titanic.loc[titanic['Age'] <= 14, 'Children_Under_14'] = 1
titanic.loc[titanic['Age'] > 14, 'Children_Under_14'] = 0
titanic.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Children_Under_14
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,0.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,0.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,0.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,0.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,0.0


In [38]:
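#Encode sex as a single 0/1 column (1 = male) and one-hot encode the port of embarkation
titanic['Male'] = pd.get_dummies(titanic["Sex"],drop_first=True)['male'].astype(int)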
embark = pd.get_dummies(titanic['Embarked'],drop_first=True)

titanic = pd.concat([titanic,embark],axis=1)
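
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up three-row frame (the `toy` frame and its values are assumptions for illustration). `get_dummies` orders the categories alphabetically and drops the first one, which is why only `male`, `Q` and `S` survive as columns.

```python
import pandas as pd

# Hypothetical rows, one per embarkation port
toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                    "Embarked": ["S", "C", "Q"]})

# Categories are sorted; drop_first removes the first ('female' and 'C')
sex_dummies = pd.get_dummies(toy["Sex"], drop_first=True).astype(int)
embark_dummies = pd.get_dummies(toy["Embarked"], drop_first=True).astype(int)

print(list(sex_dummies.columns))
print(list(embark_dummies.columns))
print(pd.concat([toy, sex_dummies, embark_dummies], axis=1))
```

A passenger who embarked at C is represented by zeros in both `Q` and `S`, so the dropped category acts as the baseline.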

In [40]:
names = ['Pclass','Male','Parch','Q','S','Children_Under_14']

X = titanic.loc[:,names]
y = titanic['Survived']

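#Hold out 30% of the rows for evaluation. No random_state is set, so the split and the scores below vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)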
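
If repeatable scores are wanted, one option is to fix the seed and stratify on the target. This is a sketch on synthetic data, not the notebook's actual split; the seed `42` and the toy frame are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 40% positive class (hypothetical data)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"Pclass": rng.integers(1, 4, size=100),
                      "Male": rng.integers(0, 2, size=100)})
y_toy = pd.Series([0] * 60 + [1] * 40, name="Survived")

# random_state makes the shuffle repeatable;
# stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

With a fixed seed, rerunning the cell gives the same rows in each part, so differences between models are not confounded with differences between splits.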

In [41]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [42]:
predictions = dtree.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))



              precision    recall  f1-score   support

           0       0.80      0.93      0.86       162
           1       0.85      0.65      0.74       106

    accuracy                           0.82       268
   macro avg       0.83      0.79      0.80       268
weighted avg       0.82      0.82      0.81       268
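
The precision and recall columns in the report above are derived from the confusion matrix, which the notebook imports but does not display. The sketch below uses ten made-up labels (not the notebook's predictions) to show how the four cells map to those scores for class 1.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# For binary labels the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted survivors, how many survived
recall = tp / (tp + fn)     # of the actual survivors, how many were found

print(tn, fp, fn, tp)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

In the notebook, the equivalent call would be `confusion_matrix(y_test, predictions)`.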



In [43]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [44]:
rfc_pred = rfc.predict(X_test)

In [45]:
print(classification_report(y_test,rfc_pred))

              precision    recall  f1-score   support

           0       0.81      0.90      0.85       162
           1       0.82      0.67      0.74       106

    accuracy                           0.81       268
   macro avg       0.81      0.79      0.79       268
weighted avg       0.81      0.81      0.81       268
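
The decision tree and the random forest score within a point of each other on this one split (0.82 versus 0.81 accuracy), and the split itself is random, so a single hold-out set may not be enough to rank them. A hedged sketch of a k-fold comparison follows; it runs on synthetic data from `make_classification`, and the fold count and seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the six Titanic features (hypothetical data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Five accuracy scores per model, one per held-out fold
scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Comparing the mean against the spread across folds gives a rough sense of whether a one-point gap is meaningful.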



In [49]:
#Now let's predict on the test.csv data and write a submission file for Kaggle
test_df = pd.read_csv("test.csv")

test_df['Age'] = test_df[['Age','Pclass']].apply(input_age,axis=1)

#Separate children from adults
test_df.loc[test_df['Age'] <= 14, 'Children_Under_14'] = 1
test_df.loc[test_df['Age'] > 14, 'Children_Under_14'] = 0

test_df['Male'] = pd.get_dummies(test_df["Sex"],drop_first=True)['male'].astype(int)
embark = pd.get_dummies(test_df['Embarked'],drop_first=True)

test_df = pd.concat([test_df,embark],axis=1)

#names_2 = ['Pclass','Male','Parch','Q','S','Children_Under_14']

result=rfc.predict(test_df.loc[:,names])

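#Reuse the PassengerId column from Kaggle's sample submission and replace its Survived column with our predictions
sample_df = pd.read_csv("gender_submission.csv")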
del sample_df['Survived']
sample_df['Survived'] = result

#print(sample_df.head())

sample_df.to_csv('submission_cw_rfc.csv', index=False)

print(sample_df)

     PassengerId  Survived
0            892         0
1            893         0
2            894         0
3            895         0
4            896         1
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         0

[418 rows x 2 columns]
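
Before uploading, it can help to check that the file has the shape shown above: two columns, 418 rows, and only 0 or 1 in `Survived`. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook or of any library) and is run here on made-up frames.

```python
import pandas as pd

def check_submission(df, expected_rows=418):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(df.columns) != ["PassengerId", "Survived"]:
        return ["columns must be exactly PassengerId, Survived"]
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not df["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Made-up frames: one well-formed, one with three deliberate faults
good = pd.DataFrame({"PassengerId": range(892, 1310), "Survived": [0] * 418})
bad = pd.DataFrame({"PassengerId": [892, 892], "Survived": [0, 2]})

print(check_submission(good))
print(check_submission(bad))
```

In the notebook, the same check could be applied as `check_submission(sample_df)` just before `to_csv`.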
