# Titanic with Pandas and Scikit-Learn

* **Survived**: Survival (0 = no; 1 = yes)
* **Pclass**: Passenger class (1 = first; 2 = second; 3 = third) ---- *categorical variable*
* **Name**
* **Sex** ---- *categorical variable*
* **Age**
* **SibSp**: Number of siblings/spouses aboard
* **Parch**: Number of parents/children aboard
* **Ticket**: Ticket number
* **Fare**: Passenger fare
* **Cabin**
* **Embarked**: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) ---- *categorical variable*

## Data Exploration

First, import the modules used throughout the notebook.

In [5]:
%matplotlib notebook
import pandas as pd
print("pandas version: {}".format(pd.__version__))
import matplotlib.pyplot as plt
import seaborn as sns  # use seaborn, see https://stanford.edu/~mwaskom/software/seaborn/
print("seaborn version: {}".format(sns.__version__))
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

from collections import Counter
sns.set()

pandas version: 0.19.0
seaborn version: 0.7.1


Load the training data set into a pandas data frame, then check how many records it has and what they contain.

In [6]:
training_df = pd.read_csv("train.csv")
training_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [7]:
sns.swarmplot(x="Sex",y="Fare",data=training_df)

<matplotlib.axes._subplots.AxesSubplot at 0x10c183208>

In [8]:
testing_df = pd.read_csv("test.csv")
testing_df.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


## Missing Values in the Training Dataset

In [9]:
nrows = len(training_df.index)
print("there are {} rows in training database".format(nrows))
training_df.count()-nrows  # non-missing count minus row count: the negative of the number of missing values per column

there are 891 rows in training database


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age           -177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin         -687
Embarked        -2
dtype: int64

In [25]:
nrows = len(testing_df.index)
print("there are {} rows in testing database".format(nrows))
testing_df.count()-nrows  # non-missing count minus row count: the negative of the number of missing values per column

there are 418 rows in testing database


PassengerId      0
Pclass           0
Name             0
Sex              0
Age            -86
SibSp            0
Parch            0
Ticket           0
Fare            -1
Cabin         -327
Embarked         0
dtype: int64
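
The counts above come out negative because the row total is subtracted from `count()`, which counts non-missing values. As a small sketch on a made-up frame (`toy_df` and its values are hypothetical, not taken from the Titanic files), `isnull().sum()` reports the same information as positive counts:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# Positive number of missing values per column
missing_counts = toy_df.isnull().sum()

# The expression used in the cells above, for comparison
negative_counts = toy_df.count() - len(toy_df.index)

print(missing_counts)
```

Both expressions agree up to sign, so which one to use is mostly a matter of readability.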

**Note**: Cabin is missing for most passengers (687 of the 891 training rows), so we will ignore this variable.
Also, the two missing Embarked values cannot be estimated reliably; since only two rows are affected, we will drop them.
The key question here is how to estimate the missing Age values.
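
The next cell estimates Age from the title embedded in each name (Mr, Mrs, Miss, Master). As a compact sketch of that idea, assuming plain per-title means and leaving out the notebook's grouping of rare titles, the following fills missing ages on a hypothetical toy frame (`toy_df` and some of its names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the Name/Age columns matter here
toy_df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Smith, Mr. John",
        "Heikkinen, Miss. Laina",
        "Doe, Miss. Jane",
        "Jones, Mr. Tom",
    ],
    "Age": [22.0, np.nan, 26.0, np.nan, 40.0],
})

# Title is the text between the comma and the first period
toy_df["Title"] = toy_df["Name"].apply(
    lambda name: name.split(",")[1].split(".")[0].strip()
)

# Mean age of each title group, broadcast back to every row (NaNs are skipped)
mean_age_by_title = toy_df.groupby("Title")["Age"].transform("mean")

# Fill only the missing ages with their group's mean
toy_df["Age"] = toy_df["Age"].fillna(mean_age_by_title)
print(toy_df)
```

Using `groupby(...).transform("mean")` avoids writing one block per title, at the cost of not merging rare titles into broader groups as the notebook does.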

In [52]:
def preprocess_data(df):
    
    # estimate age from title
    
    same_as_mrs = ["Mrs","Mme","the Countess","Lady","Dona"]
    same_as_mr = ["Capt","Col","Don","Mr","Jonkheer","Major","Rev","Sir","Dr"]
    same_as_ms = ["Ms","Miss","Mlle"]
    
    tmp_df = df.copy()  # deep copy
    
    # create feature Title: the text between the comma and the first period of Name
    tmp_df["Title"] = df["Name"].apply(lambda name: name.split(",")[1].split(".")[0].strip()).astype(str)
    
    aver_age_miss = round(tmp_df[~tmp_df["Age"].isnull()  & (tmp_df["Title"].str.strip().isin(same_as_ms))]["Age"].mean(),1)
    aver_age_mrs = round(tmp_df[~tmp_df["Age"].isnull()  & (tmp_df["Title"].str.strip().isin(same_as_mrs))]["Age"].mean(),1)
    aver_age_mr = round(tmp_df[~tmp_df["Age"].isnull()  & (tmp_df["Title"].str.strip().isin(same_as_mr))]["Age"].mean(),1)
    
    if np.sum(tmp_df["Title"].str.strip().isin(["Master"])) > 0:
        
        aver_age_master = round(tmp_df[~tmp_df["Age"].isnull()  & (tmp_df["Title"].str.strip().isin(["Master"]))]["Age"].mean(),1)
        master_idx = tmp_df["Title"].str.strip().isin(["Master"])
        tmp_df.loc[master_idx, "Age"] = tmp_df.loc[master_idx, "Age"].fillna(aver_age_master)
    
    miss_idx = tmp_df["Title"].str.strip().isin(same_as_ms)
    mr_idx = tmp_df["Title"].str.strip().isin(same_as_mr)
    mrs_idx = tmp_df["Title"].str.strip().isin(same_as_mrs)
    
    
    
    tmp_df.loc[miss_idx, "Age"] = tmp_df.loc[miss_idx, "Age"].fillna(aver_age_miss)
    tmp_df.loc[mr_idx,"Age"] = tmp_df.loc[mr_idx,"Age"].fillna(aver_age_mr)
    tmp_df.loc[mrs_idx,"Age"] = tmp_df.loc[mrs_idx,"Age"].fillna(aver_age_mrs)
    
    tmp_df = tmp_df.drop(["Cabin","PassengerId","Name","Ticket"],axis=1)
    
    # if Fare is absent, estimate as the mean fare
    nofare_idx = tmp_df["Fare"].isnull()
    mean_fare = round(tmp_df.loc[~nofare_idx]["Fare"].mean(),2)
    print("estimating the missing fare as the mean fare {}".format(mean_fare))
    
    tmp_df.loc[nofare_idx,"Fare"] = tmp_df.loc[nofare_idx,"Fare"].fillna(mean_fare)
    
    # ignore rows with absent Embarked
    tmp_df = tmp_df.drop(tmp_df.index[tmp_df["Embarked"].isnull()])
    
    # standardize Fare only; Age, SibSp and Parch are left as they are
    from sklearn.preprocessing import StandardScaler
    stdsc = StandardScaler()
    tmp_df[["Fare"]] = stdsc.fit_transform(tmp_df[["Fare"]])
    
    # bucket Fare into a coarse Price feature
    # NOTE: Fare is already standardized at this point, so on this data every value
    # is below 100 and Price is 1 for all rows; bin before scaling to get real fare bands
    fare100_idx = tmp_df["Fare"] < 100
    fare200_idx = (tmp_df["Fare"] >= 100) & (tmp_df["Fare"] < 200)
    fare300_idx = (tmp_df["Fare"] >= 200) & (tmp_df["Fare"] < 300)
    fare400_idx = (tmp_df["Fare"] >= 300)
    tmp_df.loc[fare100_idx, "Price"] = 1
    tmp_df.loc[fare200_idx, "Price"] = 2
    tmp_df.loc[fare300_idx, "Price"] = 3
    tmp_df.loc[fare400_idx, "Price"] = 4
    
    # encode categorical features "Sex" and "Embarked" into integers:
    le_sex = LabelEncoder()
    tmp_df["Sex_Int"] = le_sex.fit_transform(tmp_df["Sex"])
    le_emb = LabelEncoder()
    tmp_df["Emb_Int"] = le_emb.fit_transform(tmp_df["Embarked"])
    # drop Sex and Embarked to avoid problems with OHE
    tmp_df = tmp_df.drop(["Sex","Embarked"], axis=1)
    
    print(tmp_df.head())
    
    tmp_df = tmp_df.join(pd.get_dummies(tmp_df["Pclass"],prefix="pcl"))
    tmp_df = tmp_df.join(pd.get_dummies(tmp_df["Sex_Int"],prefix="sex"))
    # tmp_df = tmp_df.join(pd.get_dummies(tmp_df["Emb_Int"],prefix="emb"))
    
    tmp_df["Relatives"] = tmp_df["SibSp"]+tmp_df["Parch"]
    tmp_df = tmp_df.drop(["Sex_Int","Emb_Int","Pclass"], axis=1)
    
    tmp_df.loc[tmp_df["Title"].str.strip().isin(same_as_mrs), "Title"] = "Mrs"
    tmp_df.loc[tmp_df["Title"].str.strip().isin(same_as_mr), "Title"] = "Mr"
    tmp_df.loc[tmp_df["Title"].str.strip().isin(same_as_ms), "Title"] = "Miss"
    # encode categorical "Title" into integers:
    le_title = LabelEncoder()
    tmp_df["Titl"] = le_title.fit_transform(tmp_df["Title"])
    
    tmp_df = tmp_df.drop(["SibSp","Parch","Title"], axis=1)
    
    return tmp_df

In [53]:
training_processed = preprocess_data(training_df)
print("processed training dataset has {} rows".format(len(training_processed.index)))
training_processed.head()


estimating the missing fare as the mean fare 32.2
   Survived  Pclass   Age  SibSp  Parch      Fare Title  Price  Sex_Int  \
0         0       3  22.0      1      0 -0.500240    Mr    1.0        1   
1         1       1  38.0      1      0  0.788947   Mrs    1.0        0   
2         1       3  26.0      0      0 -0.486650  Miss    1.0        0   
3         1       1  35.0      1      0  0.422861   Mrs    1.0        0   
4         0       3  35.0      0      0 -0.484133    Mr    1.0        1   

   Emb_Int  
0        2  
1        0  
2        2  
3        2  
4        2  
processed training dataset has 889 rows


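| | Survived | Age | Fare | Price | pcl_1 | pcl_2 | pcl_3 | sex_0 | sex_1 | Relatives | Titl |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | -0.50024 | 1.0 | 0 | 0 | 1 | 0 | 1 | 1 | 2 |
| 1 | 1 | 38.0 | 0.788947 | 1.0 | 1 | 0 | 0 | 1 | 0 | 1 | 3 |
| 2 | 1 | 26.0 | -0.48665 | 1.0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 0.422861 | 1.0 | 1 | 0 | 0 | 1 | 0 | 1 | 3 |
| 4 | 0 | 35.0 | -0.484133 | 1.0 | 0 | 0 | 1 | 0 | 1 | 0 | 2 |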
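
In the output above, Price is 1.0 in every displayed row: `preprocess_data` standardizes Fare before comparing it against the 100/200/300 thresholds, and the standardized fares stay far below 100 on this data. A minimal sketch of one possible alternative, binning the raw fare with `pd.cut` before scaling, on hypothetical toy fares (`toy_df` and the `Fare_scaled` column are illustrative names, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw fares, for illustration only
toy_df = pd.DataFrame({"Fare": [7.25, 71.2833, 150.0, 263.0, 512.3292]})

# Bin the raw fare first; right=False gives intervals [0, 100), [100, 200), ...
toy_df["Price"] = pd.cut(
    toy_df["Fare"],
    bins=[0, 100, 200, 300, float("inf")],
    labels=[1, 2, 3, 4],
    right=False,
).astype(int)

# Standardize afterwards, keeping the raw column for comparison
toy_df["Fare_scaled"] = StandardScaler().fit_transform(toy_df[["Fare"]]).ravel()
print(toy_df)
```

Whether such a binned feature helps the models here is an open question; the sketch only shows how to make the bands non-constant.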


## Prediction

Split the training data into training and testing subsets. The raw **Title** string is not used for prediction; only its integer encoding, **Titl**, remains among the features.

In [54]:
X = training_processed.iloc[:,1:]  # variables
y = training_processed.iloc[:,0]   # outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=11)

X_train.head()

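| | Age | Fare | Price | pcl_1 | pcl_2 | pcl_3 | sex_0 | sex_1 | Relatives | Titl |
|---|---|---|---|---|---|---|---|---|---|---|
| 275 | 63.0 | 0.923335 | 1.0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| 300 | 21.8 | -0.490173 | 1.0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 25 | 38.0 | -0.014278 | 1.0 | 0 | 0 | 1 | 1 | 0 | 6 | 3 |
| 394 | 24.0 | -0.309982 | 1.0 | 0 | 0 | 1 | 1 | 0 | 2 | 3 |
| 498 | 25.0 | 2.404961 | 1.0 | 1 | 0 | 0 | 1 | 0 | 3 | 3 |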


### Logistic Regression

In [55]:
lr_parameters = {'penalty':['l2','l1'],'C':np.arange(0.01,1.0,0.01).tolist()}
lr = LogisticRegression(solver="liblinear")  # liblinear supports both the l1 and l2 penalties in the grid
lrc = GridSearchCV(lr, lr_parameters)
lrc.fit(X_train, y_train)
print('Training accuracy:', lrc.score(X_train, y_train))
print('Testing accuracy:', lrc.score(X_test, y_test))

Training accuracy: 0.799763033175
Testing accuracy: 0.866666666667
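
With `test_size=0.05` the held-out set has only 45 rows, so the testing accuracy moves in steps of about 2 percentage points and can easily exceed the training accuracy by chance. A fitted `GridSearchCV` also exposes the winning parameters and their cross-validated score, which may be a steadier guide. A minimal sketch on synthetic data (the data and the reduced grid are assumptions for illustration, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# A deliberately small grid; liblinear handles both penalties
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=3)
search.fit(X_toy, y_toy)

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
```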


### AdaBoost

In [56]:
ab = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(ab, X_train, y_train)
scores.mean()
ab.fit(X_train, y_train)
print(accuracy_score(y_test, ab.predict(X_test)))

0.888888888889
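
In the cell above, `scores.mean()` is evaluated but never displayed, because a notebook only echoes the last expression of a cell. A small sketch of reporting the cross-validated accuracy explicitly, on synthetic data (the data, `ab_toy`, and the choice of `cv=5` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

ab_toy = AdaBoostClassifier(n_estimators=50, random_state=0)

# One accuracy value per fold
scores = cross_val_score(ab_toy, X_toy, y_toy, cv=5)
print("CV accuracy: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```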


### Decision Tree

In [57]:
dt_parameters = {'min_samples_leaf': np.arange(0.025,0.5,0.025).tolist()}
clf = tree.DecisionTreeClassifier(splitter="random")
dtc = GridSearchCV(clf, dt_parameters)
dtc.fit(X_train, y_train)
print(accuracy_score(y_test, dtc.predict(X_test)))

0.866666666667


### Naive Bayes

In [58]:
nbc = GaussianNB()
nbc.fit(X_train, y_train)
print(accuracy_score(y_test, nbc.predict(X_test)))

0.822222222222


### Random Forest

In [59]:
rf_parameters = {'n_estimators': np.arange(4,50,1).tolist(),'min_weight_fraction_leaf':np.arange(0.01,0.5,0.025).tolist()}
rf = RandomForestClassifier(max_features='sqrt')
rfc = GridSearchCV(rf, rf_parameters)
rfc.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], 'min_weight_fraction_leaf': [0.01, 0.035, 0.060000000000000005, 0.085, 0.11, 0.135, 0.160000...5, 0.36000000000000004, 0.385, 0.41000000000000003, 0.43500000000000005, 0.46, 0.48500000000000004]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [60]:
print(rfc.best_params_)

{'n_estimators': 13, 'min_weight_fraction_leaf': 0.01}


In [62]:
best_rf = rfc.best_estimator_
best_rf.fit(X_train, y_train)
print(accuracy_score(y_test, best_rf.predict(X_test)))
print(best_rf.feature_importances_)

0.866666666667
[ 0.10591645  0.17608651  0.          0.03342695  0.01357263  0.08698814
  0.1315019   0.24382354  0.08801414  0.12066973]
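
The importances are printed in the column order of `X_train` (Age, Fare, Price, pcl_1, and so on), which is hard to read as a bare array. A small sketch of pairing them with column names, using synthetic data and hypothetical column names rather than the Titanic features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Hypothetical features: one informative, one random, one constant
X_toy = pd.DataFrame({
    "signal": rng.rand(200),
    "noise": rng.rand(200),
    "constant": np.ones(200),
})
y_toy = (X_toy["signal"] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_toy, y_toy)

# Label each importance with its column and sort from most to least important
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

A constant column can never be used for a split, so its importance is exactly 0. That matches the third value in the array above, which corresponds to Price.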


### Extra Trees

In [63]:
xt_parameters = {"n_estimators":np.arange(1,50,1).tolist(), "min_samples_split":np.arange(2,20,1).tolist() }  # an integer min_samples_split must be at least 2
xt = ExtraTreesClassifier(max_depth=None, random_state=0)
xtc = GridSearchCV(xt, xt_parameters)
xtc.fit(X_train, y_train)
print(accuracy_score(y_test, xtc.predict(X_test)))

0.866666666667


In [64]:
xtc.best_estimator_

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=12, min_weight_fraction_leaf=0.0,
           n_estimators=38, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

In [68]:
testing_df_processed = preprocess_data(testing_df)
testing_df_processed.head()
Y_pred = best_rf.predict(testing_df_processed)  # note we use the RandomForests model here
print("predicted {} survival outcomes".format(len(Y_pred)))

estimating the missing fare as the mean fare 35.63
   Pclass   Age  SibSp  Parch      Fare Title  Price  Sex_Int  Emb_Int
0       3  34.5      0      0 -0.498407    Mr    1.0        1        1
1       3  47.0      1      0 -0.513274   Mrs    1.0        0        2
2       2  62.0      0      0 -0.465089    Mr    1.0        1        1
3       3  27.0      0      0 -0.483466    Mr    1.0        1        2
4       3  22.0      1      1 -0.418472   Mrs    1.0        0        2
predicted 418 survival outcomes
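
`predict` relies on the processed test frame having the same columns, in the same order, as the training features. Because `preprocess_data` runs `pd.get_dummies` separately on each file, a category absent from one file would silently change the column set. A defensive sketch of aligning the two frames with `reindex`, on hypothetical toy frames (`train_features`, `test_features` and `aligned` are illustrative names):

```python
import pandas as pd

# Hypothetical processed frames; the test frame lacks pcl_2 and has a different order
train_features = pd.DataFrame(
    {"Age": [22.0, 38.0], "pcl_1": [0, 1], "pcl_2": [1, 0], "pcl_3": [0, 0]}
)
test_features = pd.DataFrame(
    {"pcl_3": [1, 1], "Age": [34.5, 47.0], "pcl_1": [0, 0]}
)

# Use the training columns as the reference; missing dummy columns become 0
aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned)
```

In this notebook both files happen to produce the same columns, so the step would be a no-op; it is only a safeguard.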


Format the predictions for submission. Recall that we have to submit a .csv with 2 columns, PassengerId and Survived.

In [69]:
submission_df = pd.DataFrame(columns=["PassengerId","Survived"])
submission_df["PassengerId"] = testing_df["PassengerId"]
submission_df["Survived"] = Y_pred
submission_df.head()

# save this data frame to a .csv file
submission_df.to_csv("titanic_submission.csv",index=False)

This looks right. Next step - submit :)