# Titanic data set

## Set up the notebook

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import seaborn as sns; sns.set()
from sklearn.model_selection import train_test_split

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Import the data

In [2]:
# train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
train_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# test_df = pd.read_csv('/kaggle/input/titanic/test.csv')
test_df = pd.read_csv('test.csv')
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [4]:
train_df.shape

(891, 12)

There are 11 independent variables (or features) and one dependent variable (or target). 
The training set contains 891 data points.

Let's see if the target variable is balanced.

In [5]:
train_df.Survived.value_counts()

0    549
1    342
Name: Survived, dtype: int64

We can see that the number of passengers who died (`Survived=0`) is larger than the number of passengers who survived (`Survived=1`).
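To quantify the imbalance, the counts can be turned into proportions. The sketch below does not read the CSV file; it rebuilds the counts shown above in a stand-alone Series (the name `survived` is only illustrative).

```python
import pandas as pd

# Stand-alone Series reproducing the counts shown above (549 zeros, 342 ones)
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns class proportions instead of raw counts
proportions = survived.value_counts(normalize=True)
print(proportions.round(3))
```

On the real data, the same call would be `train_df.Survived.value_counts(normalize=True)`.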

## Creating a validation set
To build a model that generalizes well, we need an out-of-sample set on which to compare candidate models. If we selected the model using the training set, we would most likely pick one that overfits that data.
That is why we hold out part of the training set as a validation set.

As the exploratory analysis later in this notebook shows, the attribute `Sex` is very important for predicting the outcome, so we must ensure that the ratio of females to males in the training set is the same as in the validation set. This matters all the more because the attribute is imbalanced: there are more males than females.

In [6]:
train_df, val_df = train_test_split(train_df, test_size=.2, stratify=train_df["Sex"])
train_df.shape, val_df.shape

((712, 12), (179, 12))
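One way to check that stratification did its job is to compare category shares in the two splits. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its 65/35 split are made up for illustration); `random_state` is set only so that the sketch is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an imbalanced "Sex" column (65% male, 35% female)
toy_df = pd.DataFrame({
    "Sex": ["male"] * 650 + ["female"] * 350,
    "Fare": np.arange(1000, dtype=float),
})

# stratify keeps the male/female proportions the same in both splits
toy_train, toy_val = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["Sex"], random_state=0
)

# The share of each category should be (nearly) identical in both splits
train_ratio = toy_train["Sex"].value_counts(normalize=True)
val_ratio = toy_val["Sex"].value_counts(normalize=True)
print(train_ratio, val_ratio, sep="\n")
```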

## Data cleaning 
Before we can proceed with the modeling, we must make sure to fix any missing data values that may appear in the data set.

In [7]:
train_df.shape

(712, 12)

In [8]:
train_df.info(), val_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 218 to 2
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Name         712 non-null    object 
 4   Sex          712 non-null    object 
 5   Age          567 non-null    float64
 6   SibSp        712 non-null    int64  
 7   Parch        712 non-null    int64  
 8   Ticket       712 non-null    object 
 9   Fare         712 non-null    float64
 10  Cabin        160 non-null    object 
 11  Embarked     711 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 72.3+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 179 entries, 831 to 605
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  179 non-null    int64  
 1   Surviv

(None, None)

Here, we can see that out of the 712 data points in the training split, several features contain null values:
- _Age_ is missing 145 values
- _Cabin_ is missing 552 values
- _Embarked_ is missing 1 value

In [9]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


Looking at the test set, we notice that the feature _Fare_ has one missing value, so we'll need to handle that as well.

Our approach to data cleaning will be to replace missing values in the *numerical attributes* with the *median*.
For the *categorical* attributes, we will use the *mode*.
We choose these two statistics because they are more robust to extreme values than statistics such as the mean.

It is very important to keep in mind that the statistics used to fill the missing values must be computed from the training data only.
Computing them from the entire data set (training and test) would lead to data leakage.
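The same "fit on the training data only" idea can also be expressed with scikit-learn's `SimpleImputer`. This is a sketch of an alternative, not the approach used in this notebook, and the toy frames below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with missing ages
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 80.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train[["Age"]])  # the median is learned from training rows only

train_filled = imputer.transform(toy_train[["Age"]])
test_filled = imputer.transform(toy_test[["Age"]])  # reuses the training median

print(imputer.statistics_)  # the learned median
```

Because the imputer is never fitted on `toy_test`, the value 80.0 has no influence on what is used to fill the gaps.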

In [10]:
# Fill "Age" missing values with the median of the training set
median_age_trn = train_df["Age"].median()
train_df["Age"] = train_df["Age"].fillna(median_age_trn)
val_df["Age"] = val_df["Age"].fillna(median_age_trn)
test_df["Age"] = test_df["Age"].fillna(median_age_trn)

# Fill "Embarked" missing values with the mode of the training set
mode_embarked_trn = train_df["Embarked"].mode()[0]
train_df["Embarked"] = train_df["Embarked"].fillna(mode_embarked_trn)
val_df["Embarked"] = val_df["Embarked"].fillna(mode_embarked_trn)
test_df["Embarked"] = test_df["Embarked"].fillna(mode_embarked_trn)

# Drop the "Cabin" feature since most of its values are missing
# We drop "PassengerId", "Name" and "Ticket" since their content does not give relevant information
attrs_drop = ["Cabin", "PassengerId", "Name", "Ticket"]
train_df = train_df.drop(columns=attrs_drop)
val_df = val_df.drop(columns=attrs_drop)
test_df = test_df.drop(columns=attrs_drop)

# Fill "Fare" missing values with the median of the training set
median_fare_trn = train_df["Fare"].median()
train_df["Fare"] = train_df["Fare"].fillna(median_fare_trn)  # not needed
val_df["Fare"] = val_df["Fare"].fillna(median_fare_trn)  # not needed
test_df["Fare"] = test_df["Fare"].fillna(median_fare_trn)

Now, let's check that no missing values are present in the cleaned data sets.

In [11]:
train_df.isnull().sum(), val_df.isnull().sum(), test_df.isnull().sum()

(Survived    0
 Pclass      0
 Sex         0
 Age         0
 SibSp       0
 Parch       0
 Fare        0
 Embarked    0
 dtype: int64,
 Survived    0
 Pclass      0
 Sex         0
 Age         0
 SibSp       0
 Parch       0
 Fare        0
 Embarked    0
 dtype: int64,
 Pclass      0
 Sex         0
 Age         0
 SibSp       0
 Parch       0
 Fare        0
 Embarked    0
 dtype: int64)

In [12]:
train_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
218,1,1,female,32.0,0,0,76.2917,C
29,0,3,male,28.0,0,0,7.8958,S
612,1,3,female,28.0,1,0,15.5,Q
808,0,2,male,39.0,0,0,13.0,S
701,1,1,male,35.0,0,0,26.2875,S


## Feature engineering
In order to improve the predictive power of the model, we can build upon the given features to obtain features that may be more informative.
- Creating a "FamilySize" feature to account for the total number of members in the family: "FamilySize" = "SibSp" (# of siblings/spouses) + "Parch" (# of parents/children) + 1 (the passenger).



In [13]:
train_df["FamilySize"] = train_df["SibSp"] + train_df["Parch"] + 1
val_df["FamilySize"] = val_df["SibSp"] + val_df["Parch"] + 1
test_df["FamilySize"] = test_df["SibSp"] + test_df["Parch"] + 1

- Creating an "Alone" binary feature to indicate whether a person travels alone.

In [14]:
train_df["Alone"] = 1
train_df.loc[train_df["FamilySize"]>1, "Alone"] = 0
val_df["Alone"] = 1
val_df.loc[val_df["FamilySize"]>1, "Alone"] = 0
test_df["Alone"] = 1
test_df.loc[test_df["FamilySize"]>1, "Alone"] = 0
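The two-step assignment above can also be written as a single vectorized comparison. A minimal sketch on a hypothetical toy frame (`toy_df` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with the two family-related columns
toy_df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1
# A passenger travels alone exactly when the family size is 1
toy_df["Alone"] = (toy_df["FamilySize"] == 1).astype(int)

print(toy_df)
```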

We must also encode the categorical feature "Embarked" in a way that can be processed by different models.
We will transform this feature into a one-hot encoding representation.

In [15]:
embarked_cat = pd.get_dummies(train_df["Embarked"])
train_df = pd.concat([train_df, embarked_cat], axis=1)
train_df.drop("Embarked", axis=1, inplace=True)

embarked_cat = pd.get_dummies(val_df["Embarked"])
val_df = pd.concat([val_df, embarked_cat], axis=1)
val_df.drop("Embarked", axis=1, inplace=True)

embarked_cat = pd.get_dummies(test_df["Embarked"])
test_df = pd.concat([test_df, embarked_cat], axis=1)
test_df.drop("Embarked", axis=1, inplace=True)
test_df

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,FamilySize,Alone,C,Q,S
0,3,male,34.5,0,0,7.8292,1,1,0,1,0
1,3,female,47.0,1,0,7.0000,2,0,0,0,1
2,2,male,62.0,0,0,9.6875,1,1,0,1,0
3,3,male,27.0,0,0,8.6625,1,1,0,0,1
4,3,female,22.0,1,1,12.2875,3,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
413,3,male,28.0,0,0,8.0500,1,1,0,0,1
414,1,female,39.0,0,0,108.9000,1,1,1,0,0
415,3,male,38.5,0,0,7.2500,1,1,0,0,1
416,3,male,28.0,0,0,8.0500,1,1,0,0,1
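Encoding each split separately works here because every port appears in every split. If a category were missing from one split, the resulting columns would no longer match. The sketch below, on hypothetical toy frames, shows one way to guard against that by reindexing against the training columns.

```python
import pandas as pd

# Hypothetical toy splits; "Q" never occurs in the test split
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["S", "S", "C"]})

train_dummies = pd.get_dummies(toy_train["Embarked"])

# reindex adds any missing column (filled with 0) and enforces the training column order
test_dummies = pd.get_dummies(toy_test["Embarked"]).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_dummies.columns.tolist())  # ['C', 'Q', 'S']
```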


We'll encode the "Sex" feature as two binary features as well.

In [16]:
sex_cat = pd.get_dummies(train_df["Sex"])
train_df = pd.concat([train_df, sex_cat], axis=1)
train_df.drop("Sex", axis=1, inplace=True)

sex_cat = pd.get_dummies(val_df["Sex"])
val_df = pd.concat([val_df, sex_cat], axis=1)
val_df.drop("Sex", axis=1, inplace=True)

sex_cat = pd.get_dummies(test_df["Sex"])
test_df = pd.concat([test_df, sex_cat], axis=1)
test_df.drop("Sex", axis=1, inplace=True)

In [17]:
train_df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,FamilySize,Alone,C,Q,S,female,male
218,1,1,32.0,0,0,76.2917,1,1,1,0,0,1,0
29,0,3,28.0,0,0,7.8958,1,1,0,0,1,0,1
612,1,3,28.0,1,0,15.5,2,0,0,1,0,1,0
808,0,2,39.0,0,0,13.0,1,1,0,0,1,0,1
701,1,1,35.0,0,0,26.2875,1,1,0,0,1,0,1


In [18]:
val_df.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,FamilySize,Alone,C,Q,S,female,male
831,1,2,0.83,1,1,18.75,3,0,0,0,1,0,1
88,1,1,23.0,3,2,263.0,6,0,0,0,1,1,0
309,1,1,30.0,0,0,56.9292,1,1,1,0,0,1,0
842,1,1,30.0,0,0,31.0,1,1,1,0,0,1,0
266,0,3,16.0,4,1,39.6875,6,0,0,0,1,0,1


In [19]:
test_df.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,FamilySize,Alone,C,Q,S,female,male
0,3,34.5,0,0,7.8292,1,1,0,1,0,0,1
1,3,47.0,1,0,7.0,2,0,0,0,1,1,0
2,2,62.0,0,0,9.6875,1,1,0,1,0,0,1
3,3,27.0,0,0,8.6625,1,1,0,0,1,0,1
4,3,22.0,1,1,12.2875,3,0,0,0,1,1,0


## Exploratory data analysis

In [None]:
_ = train_df.hist(bins=50, figsize=(20,15))

Each attribute contains different information that may be handy to notice down the line:
- *Age*. Most passengers are between 20 and 30 years old.
- *Fare*. The fare people paid to be on the Titanic seems to follow an exponential distribution: the large majority of passengers paid a low fare, with fewer and fewer people paying larger sums.

Since `Parch` (i.e. # of parents/children) and `SibSp` (i.e. # of siblings/spouses) both concern family, it would be interesting to see whether the two are correlated.

In [None]:
train_df.plot.scatter(x="Parch", y="SibSp")

In [None]:
sns.pairplot(train_df)

Given that most attributes are discrete (except for age and fare, which are continuous), this plot is not as useful as it could be for other data sets.

Let's check whether the gender of each passenger made a significant difference to that passenger's survival.

In [None]:
train_df.female.sum()

In [None]:
# Survival outcomes of the women only ("Sex" is now encoded as the "female"/"male" columns)
women_survive = train_df.loc[train_df["female"] == 1, "Survived"]
survival_rate_women = np.sum(women_survive)/len(women_survive)
print(f"The survival rate for women was {survival_rate_women}")

In [None]:
# The "Sex" column was replaced by the "female"/"male" one-hot columns above
men_survive = train_df.loc[train_df["male"] == 1, "Survived"]
survival_rate_men = np.sum(men_survive)/len(men_survive)
print(f"The survival rate for men was {survival_rate_men}")

Given that most women survived and most men died, it is clear that the attribute 'Sex' can be a very important variable for determining whether a person survived.
For example, classifying every male as dead and every female as survived would already yield quite good performance (.76).
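The accuracy of such a rule can be computed directly from the encoded columns. The sketch below uses a hypothetical eight-row toy frame, so its result is not the figure quoted above; on the real data the same expression would be applied to `train_df`.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the encoded training set
toy_df = pd.DataFrame({
    "female":   [1, 1, 1, 0, 0, 0, 0, 0],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 0],
})

# Rule: predict survival for every woman and death for every man
baseline_pred = toy_df["female"]
baseline_accuracy = (baseline_pred == toy_df["Survived"]).mean()

print(baseline_accuracy)  # 0.75 on this toy frame (6 of 8 rows match)
```

A baseline like this gives a reference point: any fitted model should at least beat it on the validation set.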

In [None]:
print(f"There are {len(men_survive)} men and {len(women_survive)} women in the train set.")

As we can see, there are more men than women in the training set, so we must also take that into consideration.

## Predictive modeling
We will start by fitting a simple linear model, logistic regression, to the data.
Then, we will try more complex models that can fit non-linear hypotheses, such as random forests and gradient boosting.

In [25]:
X_train = train_df.drop("Survived", axis=1)
y_train = train_df["Survived"]

X_val = val_df.drop("Survived", axis=1)
y_val = val_df["Survived"]

X_test = test_df

### Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(max_iter=1000)
log_clf.fit(X_train, y_train)
log_clf.score(X_train, y_train), log_clf.score(X_val, y_val)

(0.8075842696629213, 0.7877094972067039)

### Gaussian Naive Bayes

In [56]:
from sklearn.naive_bayes import GaussianNB
gauss_clf = GaussianNB()
gauss_clf.fit(X_train, y_train)
gauss_clf.score(X_train, y_train), gauss_clf.score(X_val, y_val)

(0.8033707865168539, 0.7653631284916201)

### Random Forest

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf_clf.fit(X_train, y_train)


# Accuracy on the training and validation sets
# rf_clf.score(X_train, y_train), rf_clf.score(X_val, y_val)

# Log loss on the validation set
log_loss(y_val, rf_clf.predict_proba(X_val))

0.411639995300276

### Gradient Boosting

In [53]:
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier()
gb_clf.fit(X_train, y_train)
gb_clf.score(X_train, y_train), gb_clf.score(X_val, y_val)

(0.8932584269662921, 0.8212290502793296)

In [None]:
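test_pred = rf_clf.predict(X_test)
# "PassengerId" was dropped from test_df during cleaning, so reload it from the CSV file
passenger_ids = pd.read_csv("test.csv")["PassengerId"]
output = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': test_pred})
output.to_csv("my_submission.csv", index=False)
output.head()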

This submission achieves a score of .77, so more work is needed to improve on this initial result.

## Fine-tuning the model

In [57]:
from sklearn.model_selection import GridSearchCV

# X_train has 12 columns, so max_features cannot exceed 12
param_grid = [
    {'n_estimators': [1, 3, 10, 100, 500],
     'max_features': [3, 6, 9, 11, 12]}
]
rf_clf = RandomForestClassifier()
grid_search = GridSearchCV(rf_clf, param_grid, n_jobs=-1, cv=5, return_train_score=True)
%time grid_search.fit(X_train, y_train)

Wall time: 11.1 s


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rando

In [58]:
grid_search.best_params_

{'max_features': 6, 'n_estimators': 100}
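Beyond the single best combination, it can be useful to look at how every combination in the grid performed. The sketch below shows one way to tabulate `cv_results_`; it runs on synthetic stand-in data from `make_classification` with a smaller, hypothetical grid, so its numbers say nothing about the Titanic data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real notebook would use X_train and y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=12, random_state=0)

# Smaller hypothetical grid to keep the sketch fast
toy_grid = {"n_estimators": [10, 50], "max_features": [3, 6]}
toy_search = GridSearchCV(
    RandomForestClassifier(random_state=0), toy_grid, cv=3, return_train_score=True
)
toy_search.fit(X_toy, y_toy)

# One row per parameter combination, best mean validation score first
results = pd.DataFrame(toy_search.cv_results_)
cols = ["param_max_features", "param_n_estimators", "mean_train_score", "mean_test_score"]
print(results[cols].sort_values("mean_test_score", ascending=False))
```

A large gap between `mean_train_score` and `mean_test_score` for a combination is a sign that it overfits.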

In [59]:
rf_clf = grid_search.best_estimator_
rf_clf

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features=6,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [60]:
rf_clf.score(X_train, y_train), rf_clf.score(X_val, y_val), accuracy_score(y_val, rf_clf.predict(X_val))


(0.9803370786516854, 0.8268156424581006, 0.8268156424581006)