# Titanic Kaggle competition

Here we start

In [903]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib widget

train_df = pd.read_csv(r"..\data\train.csv")
test_df = pd.read_csv(r"..\data\test.csv")
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Before we start, let's look at the dataset and understand what each of these parameters means. According to the competition's description page:

|Variable|Definition|Key|
|-|--|--|
|survival|Survival|0 = No, 1 = Yes|
|pclass|A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex|male, female|
|age|Age in years. Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5| |
|sibsp|# of siblings / spouses aboard the Titanic. Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored)| |
|parch|# of parents / children aboard the Titanic. The dataset defines family relations in this way: Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.| |
|ticket|Ticket number| |
|fare|Passenger fare| |
|cabin|Cabin number| |
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|


## Data

Now let's take a closer look at our data:

In [904]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's find out how much each parameter influences survival.
(I've heard that some competitors use PassengerId as a feature and manage to extract useful information from it. I can imagine trying to work out how the original sample was split and the organizers' logic behind it, but that isn't interesting to me right now.)

## Data preparation

Before we can use any model, we have to prepare the data. Let's remove the garbage and think about what to do with the empty values:

In [905]:
train_df = train_df.drop('PassengerId', axis = 1)
train_df.describe(include='all')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,891,2,,,,681.0,,147,3
top,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,1,577,,,,7.0,,4,644
mean,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


### Embarked

Only two passengers have no Embarked value.

In [906]:
train_df[train_df.Embarked.isna()]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Both ladies have the same ticket number and the same cabin number, and Martha has the title Mrs and is older than Amelie. It looks like they are mother and daughter, so first of all we should fix Parch for both of them.
Next, let's think about how to fill in Embarked. The easiest way is to use the most probable value.
The most probable value for Embarked is 'S' (Southampton): it accounts for 644 of the 889 passengers with a known port.
The same holds for 1st class passengers. So let's just fill in the values:

In [907]:
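# Rows 61 and 829 share a ticket and a cabin: mark each as travelling with one parent/child
train_df.loc[[61, 829], 'Parch'] = 1
# Fill the missing port with the most frequent one
train_df.loc[[61, 829], 'Embarked'] = 'S'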
train_df[train_df.Ticket == "113572"]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,1,1,"Icard, Miss. Amelie",female,38.0,0,1,113572,80.0,B28,S
829,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,1,113572,80.0,B28,S
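
Rather than hard-coding 'S', the most frequent port can be computed from the data. A minimal sketch on a toy frame (the `toy` DataFrame and its values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for train_df; the values are illustrative only
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 1, 2, 1],
    "Embarked": ["S", "C", "S", None, "Q", "S"],
})

# mode() ignores missing values and returns a Series; take its first entry
most_common = toy["Embarked"].mode()[0]

# Fill only the missing entries with the most frequent port
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(toy["Embarked"].value_counts())
```

The same two lines should work on `train_df` as long as the column has a single clear mode.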


### Fare

Now let's take a closer look at the Fare feature. It also looks very important. Especially for Russian people, though I think in an emergency everyone becomes a bit Russian. Some passengers have a fare of 0; below we replace those with the mean fare of their class.

In [908]:
fare_class = train_df.groupby('Pclass').Fare.mean()
train_df.Fare = train_df[['Pclass', 'Fare']].apply(lambda c: fare_class[c.Pclass] if c.Fare == 0 else c.Fare, axis=1)
train_df.Fare.describe()

count    891.000000
mean      32.876990
std       49.690114
min        4.012500
25%        7.925000
50%       14.500000
75%       31.275000
max      512.329200
Name: Fare, dtype: float64

### Age and title

I think age is also a very important parameter, but as we can see it is missing for 177 passengers.
We can try to fill it in based on each person's title. Here we run into feature engineering: the original dataset has no title column. The title is part of the name, and the basic idea is that we can split or combine the given features into new ones. So let's create a Title feature:

In [909]:
train_df.Name = train_df.Name.str.replace('Mlle', 'Miss')
train_df.Name = train_df.Name.str.replace('Mme', 'Mrs')
train_df['Title'] = train_df.Name.apply(lambda n: str(n)[str(n).find(',')+1:].strip().split(' ')[0][:-1])
train_df.Title = train_df.Title.replace('th', 'Countess')
train_df.Title = train_df.Title.replace('Ms', 'Miss')
print(train_df.Title.unique())

['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Major' 'Lady' 'Sir' 'Col'
 'Capt' 'Countess' 'Jonkheer']
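
The title can also be pulled out with a regular expression instead of string slicing. This is only a sketch: the first two names come from the `head()` output above, while the third is an assumed example of the "the Countess." form that the `'th'` replacement in the cell above works around.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    # Assumed example of a name whose title starts with "the"
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Drop a leading "the" so that "the Countess" becomes "Countess"
titles = titles.str.replace(r"^the\s+", "", regex=True)
print(titles.tolist())
```

With this pattern the separate `'th'` to `'Countess'` fix-up is not needed, because the whole title is captured rather than only its first word.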


In [910]:
import math
title_age = train_df.groupby('Title').Age.mean().round()
train_df.Age = train_df[['Title', 'Age']].apply(lambda a: title_age[a.Title] if math.isnan(a.Age) else a.Age, axis=1)
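
The same imputation can be written without a row-wise `apply`, using `groupby(...).transform`. A minimal sketch on invented data (the `toy` frame is hypothetical; only the column names follow the notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age":   [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})

# Mean age per title, broadcast back to every row of that title
title_mean = toy.groupby("Title")["Age"].transform("mean").round()

# Keep the known ages and fill only the gaps with the title mean
toy["Age"] = toy["Age"].fillna(title_mean)
print(toy)
```

`transform` returns a Series aligned with the original rows, so `fillna` can use it directly.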

### Cabin

Let's see where our data stands at this point:

In [911]:
%matplotlib widget 
plt.figure(figsize=(12,5))
plt.title('IsNaN values of given data')
plt.imshow(train_df.isnull(), interpolation='nearest', aspect='auto')  
plt.xticks(range(len(train_df.columns)), train_df.columns)
plt.colorbar()


<matplotlib.colorbar.Colorbar at 0x249ca2b65e0>

As we can see, most of the cabin data is missing. Let's investigate how we can restore it:

<img src="https://sun9-88.userapi.com/impg/H2_bLjAFAVFIg0PFZspaJSam_0Mji8BFNdG8hg/w-E0wuRVHG4.jpg?size=1401x2088&quality=96&sign=809444c7f827cc913ef56ee3465accbe&type=album" alt="drawing" width="300"/>

The picture above suggests that the cabin letter depends on class:
**(And below we see that it is wrong!)**

As [one of the example solutions](https://medium.com/analytics-vidhya/random-forest-on-titanic-dataset-88327a014b4d) points out, and as is actually obvious, the cabin should depend on the fare. We will add a new feature, the cabin letter, and use X where the cabin is empty. Then we will look at how the fare is distributed for each cabin letter:

In [912]:
%matplotlib widget

train_df['CabLet'] = train_df.Cabin.astype(str).str[0].replace('n', 'X')
_ = train_df.boxplot('Fare', 'CabLet', figsize=(12,5))


From this plot we can see that the X letter has many more outliers than the other letters. So we can reassign X based on the distance between the outliers of X and the IQR (interquartile range) of the other letters:

![](https://sun9-81.userapi.com/impg/iha_aAC3pZvuFZozZh-q6JWekt4RyFOi5wCAWA/ibPQMdsKhkU.jpg?size=568x483&quality=96&sign=5e27fd0e0dcb8008282a324d4f09a224&type=album)

In [913]:
%matplotlib widget

cabLet_fare_m = train_df[['Fare', 'CabLet']].groupby('CabLet').mean()
cabLet_fare_q = train_df[['Fare', 'CabLet']].groupby('CabLet').quantile(0.75)

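def assingCabinBasedOnFare(cf: pd.Series) -> str:
    # cf is one row holding the cabin letter and the fare
    cabin = cf['CabLet']
    fare  = cf['Fare']

    # Known cabin letters are kept as they are
    if cabin != 'X':
        return cabin
    # Walk the reversed letter index without its first and last entries
    # and take the first letter whose 75% fare quantile covers this fare
    for c in cabLet_fare_q.index.values[::-1][1:-1]:
        if fare <= cabLet_fare_q.loc[c].Fare:
            return c
    # Fares above every quantile fall back to 'B'
    return 'B'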
train_df['CabLet'] = train_df[['CabLet', 'Fare']].apply(assingCabinBasedOnFare, axis=1)
_ = train_df.boxplot('Fare', 'CabLet', figsize=(12,5))

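
To make the "outliers versus IQR" idea concrete, here is a small sketch that computes the quartiles, the IQR and the upper boxplot whisker (Q3 + 1.5 × IQR, the usual boxplot convention) per cabin letter. The `toy` frame and its fares are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "CabLet": ["B", "B", "B", "B", "X", "X", "X", "X", "X"],
    "Fare":   [60.0, 80.0, 90.0, 120.0, 7.0, 8.0, 9.0, 10.0, 80.0],
})

grouped = toy.groupby("CabLet")["Fare"]
q1 = grouped.quantile(0.25)
q3 = grouped.quantile(0.75)
iqr = q3 - q1

# Points above this value are drawn as outliers in a default boxplot
upper_whisker = q3 + 1.5 * iqr

# Rows of the 'X' group that lie above the X whisker
x_rows = toy[toy["CabLet"] == "X"]
x_outliers = x_rows[x_rows["Fare"] > upper_whisker["X"]]
print(x_outliers)
```

In this toy data the single X passenger with a fare of 80 is an outlier for X but sits inside the IQR of B, which is the kind of row the reassignment targets.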

## Feature engineering

As mentioned above, in most cases we should not only use the given data but also combine and split it to create new features. A split or combined feature is frequently among the most influential parameters.

Let's take a closer look at the family parameters, SibSp and Parch, and combine them into one parameter that describes whether the passenger was alone or not:

In [914]:
# 1 if the passenger has no siblings, spouse, parents or children aboard, else 0
train_df['Alone'] = ((train_df.SibSp + train_df.Parch) == 0).astype(int)
train_df['Familiars'] = train_df.SibSp + train_df.Parch
_ = train_df[['SibSp', 'Parch', 'Alone', 'Familiars']].hist(bins=range(8), figsize=(12,5), layout=(4,1), sharex=True)


Now let's see how the given and the engineered features influence survival.
We can start with the features that seem most valuable:

In [915]:
%matplotlib widget
f,ax = plt.subplots(1,3,figsize=(18,7))
f.suptitle('Most important features')
ax[0].set_title('Sex')
plt.sca(ax[0])
_ = sns.countplot(x='Sex', hue='Survived', data = train_df[['Sex','Survived']])
plt.sca(ax[1])
ax[1].set_title('Pclass')
_ = sns.countplot(x='Pclass', hue='Survived', data = train_df[['Pclass','Survived']])
plt.sca(ax[2])
ax[2].set_title('Alone')
_ = sns.countplot(x='Alone', hue='Survived', data = train_df[['Alone','Survived']])

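
The count plots can be complemented with plain survival rates per group. Since `Survived` is 0 or 1, its mean within a group is the share of survivors. A sketch on invented rows (the `toy` frame is hypothetical; the column names follow the notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female"],
    "Alone":    [1, 0, 0, 1, 0, 1],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 label = fraction of survivors in each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_alone = toy.groupby("Alone")["Survived"].mean()
print(rate_by_sex)
print(rate_by_alone)
```

On the real data, the same `groupby` calls on `train_df` would give one number per category that is easy to compare across features.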

Let's also plot the correlation matrix:

In [916]:
categories = {"female": 1, "male": 0}
train_df['Sex']= train_df['Sex'].map(categories)

categories = {"S": 1, "C": 2, "Q": 3}
train_df['Embarked']= train_df['Embarked'].map(categories)

categories = train_df.CabLet.unique()
train_df['CabLet'] = train_df.CabLet.astype("category").cat.codes

plt.figure(figsize=(14,8))
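# Correlate numeric columns only; Name, Ticket, Cabin and Title are still text here
sns.heatmap(train_df.select_dtypes('number').corr(), annot=True)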
plt.tight_layout()


## Data normalization

Normalization is one of the crucial steps for many machine learning algorithms. Here we scale every feature to the [0, 1] range with MinMaxScaler.

In [917]:
from sklearn.preprocessing import MinMaxScaler

# Separate the label
y = train_df['Survived']
# Drop the label and the text columns that are not used as features
train_df = train_df.drop(['Survived', 'Name', 'Cabin', 'Ticket', 'Title'], axis=1)

scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(train_df)

scaled_train = pd.DataFrame(scaled_train, columns=train_df.columns, index=train_df.index)

scaled_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,CabLet,Alone,Familiars
0,1.0,0.0,0.271174,0.125,0.0,0.006369,0.0,1.0,0.0,0.1
1,0.0,1.0,0.472229,0.125,0.0,0.13234,0.5,0.285714,0.0,0.1
2,1.0,1.0,0.321438,0.0,0.0,0.007697,0.0,1.0,1.0,0.0
3,0.0,1.0,0.434531,0.125,0.0,0.096569,0.0,0.285714,0.0,0.1
4,1.0,0.0,0.434531,0.0,0.0,0.007943,0.0,1.0,1.0,0.0
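
What MinMaxScaler does to each column is `(x - min) / (max - min)`. The sketch below checks this by hand on a few ages; the array is a made-up sample, though its minimum (0.42) matches the `describe()` output above and its maximum (80) is assumed, chosen because it reproduces the scaled ages in the table above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature column: a few ages, including the assumed min and max
ages = np.array([[0.42], [22.0], [38.0], [80.0]])

# By hand: shift by the minimum, divide by the range
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation from scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(manual.ravel())
```

An age of 22 maps to about 0.2712, which agrees with the first row of `scaled_train` above.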


Here we repeat all the operations above for the test sample:

In [918]:
fare_class = test_df.groupby('Pclass').Fare.mean()
test_df.Fare = test_df[['Pclass', 'Fare']].apply(lambda c: fare_class[c.Pclass] if c.Fare == 0 or np.isnan(c.Fare) else c.Fare, axis=1)

test_df.Name = test_df.Name.str.replace('Mlle', 'Miss')
test_df.Name = test_df.Name.str.replace('Mme', 'Mrs')
test_df['Title'] = test_df.Name.apply(lambda n: str(n)[str(n).find(',')+1:].strip().split(' ')[0][:-1])
test_df.Title = test_df.Title.replace('th', 'Countess')
test_df.Title = test_df.Title.replace('Ms', 'Miss')
title_age = test_df.groupby('Title').Age.mean().round()
test_df.Age = test_df[['Title', 'Age']].apply(lambda a: title_age[a.Title] if math.isnan(a.Age) else a.Age, axis=1)

test_df['CabLet'] = test_df.Cabin.astype(str).str[0].replace('n', 'X')
test_df['CabLet'] = test_df[['CabLet', 'Fare']].apply(assingCabinBasedOnFare, axis=1)
test_df['Alone'] = ((test_df.SibSp + test_df.Parch) == 0).astype(int)
test_df['Familiars'] = test_df.SibSp + test_df.Parch

categories = {"female": 1, "male": 0}
test_df['Sex']= test_df['Sex'].map(categories)

categories = {"S": 1, "C": 2, "Q": 3}
test_df['Embarked']= test_df['Embarked'].map(categories)

categories = test_df.CabLet.unique()
test_df['CabLet'] = test_df.CabLet.astype("category").cat.codes


# Drop the columns that are not used as features
test_df = test_df.drop(['Name', 'Cabin', 'Ticket', 'Title', 'PassengerId'], axis=1)

scaler = MinMaxScaler()
scaled_test = scaler.fit_transform(test_df)

scaled_test = pd.DataFrame(scaled_test, columns=test_df.columns, index=test_df.index)

scaled_test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,CabLet,Alone,Familiars
0,1.0,0.0,0.452723,0.0,0.0,0.009149,1.0,1.0,1.0,0.0
1,1.0,1.0,0.617566,0.125,0.0,0.007521,0.0,1.0,0.0,0.1
2,0.5,0.0,0.815377,0.0,0.0,0.012799,1.0,1.0,1.0,0.0
3,1.0,0.0,0.353818,0.0,0.0,0.010786,0.0,1.0,1.0,0.0
4,1.0,1.0,0.287881,0.125,0.111111,0.017905,0.0,1.0,0.0,0.2
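
One thing to keep in mind: the cell above fits a new MinMaxScaler on the test sample, so train and test end up scaled with different minima and maxima. A common alternative is to fit the scaler on the training data only and reuse it for the test data. A minimal sketch with made-up numbers (the `toy_train` and `toy_test` frames are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy_train = pd.DataFrame({"Age": [0.0, 40.0, 80.0], "Fare": [0.0, 50.0, 100.0]})
toy_test = pd.DataFrame({"Age": [20.0, 60.0], "Fare": [25.0, 200.0]})

# Learn min and max from the training data only
toy_scaler = MinMaxScaler().fit(toy_train)

# Apply the same transformation to both samples
toy_scaled_train = pd.DataFrame(toy_scaler.transform(toy_train), columns=toy_train.columns)
toy_scaled_test = pd.DataFrame(toy_scaler.transform(toy_test), columns=toy_test.columns)
print(toy_scaled_test)
```

Test values outside the training range (the fare of 200 here) land outside [0, 1]. That is expected with this approach and keeps a given raw value mapped to the same scaled value in both samples.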


## Classification

Formally, this is a binary classification problem. We will use the Random Forest algorithm to classify our passengers.

In [922]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(scaled_train, y, test_size=0.2)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

clf = RandomForestClassifier(n_estimators=100)

# Train the model on the training split
clf.fit(X_train, y_train)

(712, 10) (179, 10)
(712,) (179,)


RandomForestClassifier()
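
A single random 80/20 split gives an accuracy that changes from run to run. As a rough sketch of a steadier estimate, k-fold cross-validation averages over several splits. The data below is synthetic (`make_classification`) and only stands in for `scaled_train` and `y`; the names `X_demo`, `y_demo` and `clf_demo` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (scaled_train, y): 300 rows, 10 features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Fixing random_state makes the forest itself reproducible
clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: one accuracy value per fold
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

The spread of `scores` gives a feel for how much a single split can mislead.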

In [923]:
feature_imp = pd.Series(clf.feature_importances_, index=scaled_train.columns).sort_values(ascending=False)

#print("Accuracy: {}".format(metrics.accuracy_score(y_test, y_pred)))

plt.figure(figsize=(10,6))
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.tight_layout()


In [928]:
from sklearn.model_selection import RandomizedSearchCV
# Removing less important features
new_train = scaled_train.drop(['Alone','Parch','Embarked', 'SibSp', 'CabLet'], axis=1)
new_test = scaled_test.drop(['Alone','Parch','Embarked', 'SibSp', 'CabLet'], axis=1)

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 10)]
# Number of features to consider at every split
# ('auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


clf = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator = clf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

X_train, X_test, y_train, y_test = train_test_split(new_train, y, test_size=0.2)



# Run the randomized search on the training split
rf_random.fit(X_train, y_train)

rf_random.best_params_

Fitting 3 folds for each of 100 candidates, totalling 300 fits


{'n_estimators': 644,
 'min_samples_split': 5,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 10,
 'bootstrap': False}
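
An alternative to copying the parameters by hand is to take the fitted model from the search object: with the default `refit=True`, `best_estimator_` is already retrained on the full data passed to `fit` using `best_params_`. A small sketch on synthetic data (the grid and the `*_demo` names are placeholders, much smaller than the grid above so that it runs quickly):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

param_demo = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_demo,
    n_iter=5, cv=3, random_state=42,
)
search.fit(X_demo, y_demo)

# Already refit on all of X_demo with the best parameters found
best_clf = search.best_estimator_
print(search.best_params_, search.best_score_)
```

`best_clf.predict(...)` can then be used directly, which avoids any mismatch between the reported best parameters and the ones typed into a new classifier.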

In [940]:
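# max_features='sqrt' is what 'auto' meant for classifiers before it was removed in scikit-learn 1.3
clf = RandomForestClassifier(n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_features='sqrt', max_depth=10, bootstrap=False)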
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("   Accuracy: {}".format(metrics.accuracy_score(y_test, y_pred)))

print(classification_report(y_test,y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8,5))
sns.heatmap(conf_matrix, annot=True)
plt.title('Confusion Matrix')
plt.tight_layout()
prediction = clf.predict(new_test)
test_df['Survival_Predictions'] = pd.Series(prediction)
test_df.head()

   Accuracy: 0.8212290502793296
              precision    recall  f1-score   support

           0       0.85      0.86      0.86       111
           1       0.77      0.75      0.76        68

    accuracy                           0.82       179
   macro avg       0.81      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179




Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,CabLet,Alone,Familiars,Survival_Predictions
0,3,0,34.5,0,0,7.8292,3,7,1,0,0
1,3,1,47.0,1,0,7.0,1,7,0,1,0
2,2,0,62.0,0,0,9.6875,3,7,1,0,0
3,3,0,27.0,0,0,8.6625,1,7,1,0,0
4,3,1,22.0,1,1,12.2875,1,7,0,2,1


In [933]:
test_df_subm = pd.read_csv(r"..\data\test.csv")
test_df_subm['Survived'] = test_df.Survival_Predictions
test_df_subm[['PassengerId', 'Survived']].to_csv(r'..\data\submission.csv', index=False)