## Titanic Problem

This notebook tries to predict whether a Titanic passenger survived, based on features such as age and sex.

The data used is here: https://www.kaggle.com/competitions/titanic/data

In [68]:
# Importing tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [69]:
# Importing the train data
data = pd.read_csv("train.csv")

In [70]:
# pd.read_csv already returns a DataFrame; this wraps it under the name df
df = pd.DataFrame(data)

In [71]:
# Show the DataFrame
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Exploring our data

In [72]:
df.Ticket.value_counts()

Ticket
347082      7
CA. 2343    7
1601        7
3101295     6
CA 2144     6
           ..
9234        1
19988       1
2693        1
PC 17612    1
370376      1
Name: count, Length: 681, dtype: int64

In [73]:
df.shape

(891, 12)

In [74]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [75]:
df.Embarked.value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [77]:
df.Cabin.value_counts()

Cabin
B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: count, Length: 147, dtype: int64

In [78]:
df.Parch.value_counts()

Parch
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: count, dtype: int64

In [79]:
df.Survived.value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

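After inspecting the data, we drop the `Name`, `Ticket`, and `Cabin` columns: `Name` is free text, `Ticket` has 681 distinct values across 891 rows, and `Cabin` is missing for 687 of the 891 passengers, so none of them is directly usable as a feature here.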
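
A quick way to put numbers on such a decision is to compare, per column, the share of distinct values and the share of missing values. The sketch below uses a tiny hand-made frame (`toy` is hypothetical and is not the Kaggle data); the same two calls should work on the real `df`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the Titanic training data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["T1", "T2", "T2", "T3"],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

# Share of distinct values per column: close to 1.0 means almost every row is unique.
unique_ratio = toy.nunique() / len(toy)
# Share of missing values per column.
missing_ratio = toy.isna().mean()

summary = pd.DataFrame({"unique_ratio": unique_ratio, "missing_ratio": missing_ratio})
print(summary)
```

Columns that are nearly unique per row, or mostly empty, are the usual candidates for dropping or for more careful feature engineering.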

In [80]:
def preprocessing(df):
    """
    Transform df in place and return it.

    Drops the Name, Cabin and Ticket columns, fills missing numeric values
    with the column median, and encodes non-numeric columns as integer codes.
    """
    df.drop(["Name", "Cabin", "Ticket"], axis=1, inplace=True)

    for label, content in df.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                # Add a binary column which tells us whether the value was missing
                df[label+"_is_missing"] = pd.isnull(content)
                # Fill the missing values with the median
                df[label] = content.fillna(content.median())
        else:
            # Add a binary column which tells us whether the value was missing
            df[label+"_is_missing"] = pd.isnull(content)
            # Convert categories into numbers; add 1 to the codes because
            # pandas encodes missing categories as -1
            df[label] = pd.Categorical(content).codes+1

    return df

In [81]:
preprocessing(df)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Sex_is_missing,Age_is_missing,Embarked_is_missing
0,1,0,3,2,22.0,1,0,7.2500,3,False,False,False
1,2,1,1,1,38.0,1,0,71.2833,1,False,False,False
2,3,1,3,1,26.0,0,0,7.9250,3,False,False,False
3,4,1,1,1,35.0,1,0,53.1000,3,False,False,False
4,5,0,3,2,35.0,0,0,8.0500,3,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,2,27.0,0,0,13.0000,3,False,False,False
887,888,1,1,1,19.0,0,0,30.0000,3,False,False,False
888,889,0,3,1,28.0,1,2,23.4500,3,False,True,False
889,890,1,1,2,26.0,0,0,30.0000,1,False,False,False
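
The `+1` on the category codes is easier to see on a small example. This is a minimal sketch with a made-up column (`embarked` here is hypothetical, not the notebook's data): pandas sorts the categories, numbers them from 0, and marks missing entries with -1, so shifting by one maps missing values to 0.

```python
import pandas as pd

# Hypothetical column with one missing entry.
embarked = pd.Series(["S", "C", None, "Q"])

# Categories are sorted (C, Q, S) and numbered from 0; missing values get -1.
codes = pd.Categorical(embarked).codes
# After the shift, missing becomes 0 and the real categories start at 1.
shifted = codes + 1

print(codes.tolist())
print(shifted.tolist())
```

This matches the table above, where `S` shows up as 3 and `C` as 1 in the `Embarked` column.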


## First model

We start with `LinearSVC`, chosen by following the scikit-learn machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/

In [82]:
from sklearn.svm import LinearSVC

clf = LinearSVC()

In [83]:
# Make X & y
X = df.drop("Survived", axis=1)
y = df["Survived"]

In [84]:
clf.fit(X, y)



In [85]:
# Importing test data 
df_test = pd.read_csv("test.csv")
df_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [86]:
df_test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [87]:
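df_test_tmp = preprocessing(df_test)
# Only the test set has a missing Fare, so drop the extra indicator column
# to match the columns the model was trained on
df_test_tmp.drop("Fare_is_missing", axis=1, inplace=True)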

In [88]:
df_test_tmp.isna().sum()

PassengerId            0
Pclass                 0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Fare                   0
Embarked               0
Sex_is_missing         0
Age_is_missing         0
Embarked_is_missing    0
dtype: int64
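
Dropping the extra column by name works here, but it has to be repeated for every column that differs between the two sets. A more general option is to reindex the test frame on the training columns. The sketch below is an illustration on made-up frames (`train_cols` and `test_part` are hypothetical names, not variables from this notebook).

```python
import pandas as pd

# Hypothetical column sets standing in for the preprocessed train and test frames.
train_cols = ["Pclass", "Age", "Fare", "Age_is_missing"]
test_part = pd.DataFrame({
    "Pclass": [3, 1],
    "Age": [34.5, 47.0],
    "Fare": [7.83, 7.0],
    "Fare_is_missing": [False, True],   # exists only in the test frame
})

# Keep exactly the training columns, in training order. Columns absent from
# the test frame are created and filled with False; extra columns are dropped.
aligned = test_part.reindex(columns=train_cols, fill_value=False)
print(aligned.columns.tolist())
```

With the notebook's variables this would presumably read `df_test_tmp.reindex(columns=X.columns, fill_value=False)`, assuming `X` holds the training features.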

In [89]:
y_preds = clf.predict(df_test_tmp)
y_preds

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [95]:
def make_submission_df(df_test, preds):
    submission = pd.DataFrame()
    submission["PassengerId"] = df_test["PassengerId"]
    submission["Survived"] = preds
    submission.set_index("PassengerId", inplace=True)
    
    return submission

In [96]:
svc_sub = make_submission_df(df_test_tmp, y_preds)

In [None]:
svc_sub.to_csv("LinearSVC.csv")
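
Because `make_submission_df` sets `PassengerId` as the index, `to_csv` writes it as the first column, which gives the two-column `PassengerId,Survived` layout. A small sketch with made-up predictions, written to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Hypothetical ids and predictions, only to show the resulting file layout.
submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
submission = submission.set_index("PassengerId")

buffer = io.StringIO()
submission.to_csv(buffer)   # the index is written as the first column
csv_text = buffer.getvalue()
print(csv_text)
```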

### Results
The first model gave us an accuracy of **0.64593**.
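
Since `test.csv` carries no `Survived` labels, a local accuracy estimate has to come from the training data. One option is `cross_val_score`, which is imported at the top of the notebook. The sketch below runs on synthetic data from `make_classification`. The data and the added `StandardScaler` step are assumptions for illustration, not part of the original workflow; scaling is included because linear SVMs are generally sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the preprocessed features and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The scaler sits inside the pipeline, so it is refit on each training fold
# and the held-out fold never influences it.
pipeline = make_pipeline(StandardScaler(), LinearSVC())

# One accuracy value per fold.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(scores.mean())
```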

## Second model
Now we are going to try a KNN model, following the sklearn map.

In [93]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X, y)

In [94]:
knn_preds = knn.predict(df_test_tmp)
knn_preds

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [98]:
knn_sub = make_submission_df(df_test_tmp, knn_preds)

In [99]:
knn_sub.to_csv("titanic_knn.csv")

### Results

This first try with KNN gave us an accuracy of **0.66267**.

## Third model

Now we are going to try `RandomForestClassifier`.

In [100]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

rfc.fit(X, y)

In [103]:
rfc_preds = rfc.predict(df_test_tmp)
rfc_preds

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [104]:
rfc_sub = make_submission_df(df_test_tmp, rfc_preds)

In [105]:
rfc_sub.to_csv("titanic_RFC.csv")

### Results
This first try with RandomForestClassifier gave us an accuracy of **0.75358**.

## Improving models

Now that we have tried a few models, we are going to try improving them with `RandomizedSearchCV`.

But first, we need to split our train data into train and validation sets.
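
A plain `train_test_split` gives a different split on every run and does not guarantee the same share of survivors in both parts. The sketch below shows the `stratify` and `random_state` arguments on made-up arrays (`X_demo` and `y_demo` are hypothetical); whether to use them here is a judgment call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and an imbalanced binary target (60 zeros, 40 ones).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,    # keep the class ratio the same in both parts
    random_state=42,    # make the split reproducible
)
print(y_tr.mean(), y_va.mean())
```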

In [106]:
# Re-import our data
df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [109]:
# Make X and y
X = df.drop("Survived", axis=1)
y = df["Survived"]

In [114]:
# Split our data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

In [115]:
# Preprocess X_train
X_train_tmp = preprocessing(X_train)
X_train_tmp.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Sex_is_missing,Age_is_missing,Embarked_is_missing
540,541,1,1,36.0,0,2,71.0,3,False,False,False
644,645,3,1,0.75,2,1,19.2583,1,False,False,False
168,169,1,2,28.5,0,0,25.925,3,False,True,False
71,72,3,1,16.0,5,2,46.9,3,False,False,False
857,858,1,2,51.0,0,0,26.55,3,False,False,False


In [117]:
# Preprocess X_valid
X_valid_tmp = preprocessing(X_valid)
X_valid_tmp.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Sex_is_missing,Age_is_missing,Embarked_is_missing
174,175,1,2,56.0,0,0,30.6958,1,False,False,False
331,332,1,2,45.5,0,0,28.5,3,False,False,False
76,77,3,2,28.0,0,0,7.8958,3,False,True,False
14,15,3,1,14.0,0,0,7.8542,3,False,False,False
646,647,3,2,19.0,0,0,7.8958,3,False,False,False


In [118]:
rfc.fit(X_train_tmp, y_train)

In [119]:
rfc.score(X_valid_tmp, y_valid)

0.7932960893854749
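
Note that `preprocessing` is called separately on each split, so the median used to fill `Age` is recomputed on whatever frame it receives. An alternative is to learn the fill value on the training part only and reuse it on the validation part. A minimal sketch with scikit-learn's `SimpleImputer` on made-up numbers (the arrays below are hypothetical):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column "Age" data for a training and a validation part.
age_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
age_valid = np.array([[np.nan], [60.0]])

imputer = SimpleImputer(strategy="median")
# The median is learned from the training part only.
age_train_filled = imputer.fit_transform(age_train)
# The same training median is reused; the validation values play no role.
age_valid_filled = imputer.transform(age_valid)

print(imputer.statistics_)
print(age_valid_filled.ravel())
```

This keeps the validation score an honest estimate, because nothing from the validation rows leaks into the preprocessing.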

## Tuning Hyperparameters

In [120]:
# Different RandomForest hyperparameters
param_grid = {"n_estimators": np.arange(10, 100, 10),
              "max_depth": [None, 3, 5, 10, 20],
              "min_samples_split": np.arange(2, 20, 2),
              "max_features": [0.5, 1, "sqrt"]}

In [121]:
# Instantiate the RandomizedSearchCV model
rs_model = RandomizedSearchCV(RandomForestClassifier(),
                              param_distributions=param_grid,
                              n_iter=100,
                              n_jobs=-1,
                              cv=5,
                              verbose=True)

rs_model.fit(X_train_tmp, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [122]:
rs_model.best_params_

{'n_estimators': 90,
 'min_samples_split': 4,
 'max_features': 'sqrt',
 'max_depth': 20}
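
Instead of copying the values from `best_params_` by hand, the fitted search object also exposes the winning model directly through `best_estimator_`. The sketch below illustrates this on synthetic data with a deliberately small search space; the data and the parameter values are assumptions, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [None, 3, 5]},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# With the default refit=True, the best configuration has already been
# refitted on all of the data passed to fit().
best_model = search.best_estimator_
print(search.best_params_, search.best_score_)
```

`best_score_` is the mean cross-validated score of that configuration, which gives a second estimate next to the validation score.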

In [123]:
rf_best_model = RandomForestClassifier(n_estimators=90,
                                       min_samples_split=4,
                                       max_features='sqrt',
                                       max_depth=20)

rf_best_model.fit(X_train_tmp, y_train)

In [124]:
rf_best_model.score(X_valid_tmp, y_valid)

0.8212290502793296

In [126]:
rf_bm_preds = rf_best_model.predict(df_test_tmp)

In [127]:
rf_bm_sub = make_submission_df(df_test_tmp, rf_bm_preds)

In [129]:
rf_bm_sub.to_csv("titanic_RandomSearch.csv")

In [130]:
X_tmp = preprocessing(X)

In [131]:
rf_best_model.fit(X_tmp, y)

In [132]:
rf_bm_preds = rf_best_model.predict(df_test_tmp)

In [133]:
rf_bm_sub = make_submission_df(df_test_tmp, rf_bm_preds)

In [134]:
rf_bm_sub.to_csv("titanic_RandomSearch.csv")

In [136]:
from sklearn.svm import SVC

model = SVC()

model.fit(X_tmp, y)

In [138]:
model_preds = model.predict(df_test_tmp)

In [139]:
model_sub = make_submission_df(df_test_tmp, model_preds)
model_sub.to_csv("SVC.csv")

The best score I got on Kaggle was `0.77033`, using the model obtained with `RandomizedSearchCV`.
