# Exploration of the Titanic data

Here we will wrangle the data, create new features, and make other necessary transformations.
Everything done here is kept in the wrangling function of the `titanic.data` module.

Based on this notebook, I created a data module with the following structure:

```bash
data
├── __init__.py
├── new_features.py
├── normalization.py
└── wrangling.py
```

Everything from this notebook goes into that module and can easily be reused in the different models found in the notebooks folder.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib widget

import titanic.data.load

# train_df = pd.read_csv(r"../data/train.csv")
# test_df = pd.read_csv(r"../data/test.csv")
train_df, test_df = titanic.data.load.from_csv()
train_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

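The `info()` output lists non-null counts; the complementary view, the number of missing values per column, is often easier to read. Here is a minimal sketch on a toy frame (`toy_df` and its values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Number and share of missing values per column, largest first.
missing = toy_df.isna().sum().sort_values(ascending=False)
missing_share = (missing / len(toy_df)).round(2)
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

The same two lines should work on `train_df` unchanged, since they only rely on `isna()`.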

Before we start, let's look at the dataset and understand what each of these parameters means. According to the description page:

|Variable|Definition|Key|
|-|--|--|
|survival|Survival|0 = No, 1 = Yes|
|pclass|A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex|male, female|
|Age|Age in years|Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5|
|sibsp|Number of siblings / spouses aboard the Titanic. Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored)| |
|parch|Number of parents / children aboard the Titanic. The dataset defines family relations in this way: Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.| |
|ticket|Ticket number| |
|fare|Passenger fare| |
|cabin|Cabin number| |
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|


## Data

Now let's take a closer look at our data:

In [2]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's find out how much each parameter influences survival.
(I have heard that some competitors use PassengerId as a feature and manage to extract useful information from it. I can imagine trying to work out how the initial sample was split and the logic its owner followed, but that does not interest me right now.)

## Data preparation

Before using any models, we have to prepare the data for modeling. Let's remove the garbage from our data and think about what to do with the empty values:

In [3]:
train_df = train_df.drop('PassengerId', axis = 1)
train_df.describe(include='all')

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,891,2,,,,681.0,,147,3
top,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,1,577,,,,7.0,,4,644
mean,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


### Embarked

Only two passengers have no Embarked value.

In [4]:
train_df[train_df.Embarked.isna()]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


Here we can see that the two women share a ticket number and a cabin number, and that Martha has the title Mrs and is older than Amelie. They look like mother and daughter, so first of all we should fix Parch for both of them.
Then let's think about how to fill in Embarked. The easiest way is to use the most probable value.
The most probable value for Embarked is 'S' (Southampton), which accounts for 644 of the 891 passengers.
The same holds for 1st class passengers, so let's just fill in the values:

In [5]:
# Rows 61 and 829 are the two passengers above; with a RangeIndex, labels equal positions
train_df.loc[[61, 829], 'Parch'] = 1
train_df.loc[[61, 829], 'Embarked'] = 'S'
train_df[train_df.Ticket == "113572"]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,1,1,"Icard, Miss. Amelie",female,38.0,0,1,113572,80.0,B28,S
829,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,1,113572,80.0,B28,S

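Instead of hard-coding 'S', the most frequent port can be looked up with `Series.mode()`. This is a sketch on a toy frame (the values are made up; the cell above sets the value by hand rather than computing it):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Embarked": ["S", "S", np.nan, "Q", "S", "S"],
})

# Most frequent port; mode() ignores NaN by default and may return ties,
# so we take the first entry.
most_common = toy_df["Embarked"].mode()[0]

# Fill only the gaps, leaving known values untouched.
toy_df["Embarked"] = toy_df["Embarked"].fillna(most_common)
print(toy_df)
```

Computing the value keeps the fill correct if the same step is later applied to a different sample.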

### Fare

Now let's take a closer look at the fare feature. It also looks very important. Especially for Russian people, but I think in an emergency everyone becomes a bit Russian. Some fares are 0, so we replace them with the mean fare of the passenger's class:

In [6]:
fare_class = train_df.groupby('Pclass').Fare.mean()
train_df.Fare = train_df[['Pclass', 'Fare']].apply(lambda c: fare_class[c.Pclass] if c.Fare == 0 else c.Fare, axis=1)
train_df.Fare.describe()

count    891.000000
mean      32.876990
std       49.690114
min        4.012500
25%        7.925000
50%       14.500000
75%       31.275000
max      512.329200
Name: Fare, dtype: float64

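The same replacement can be written without a row-wise `apply`. Note one deliberate difference, which is my assumption and not what the cell above does: this sketch treats zero fares as missing before averaging, so they do not pull the class mean down. The toy values are made up:

```python
import pandas as pd

# Toy stand-in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [80.0, 0.0, 100.0, 8.0, 0.0, 10.0],
})

# Turn zero fares into NaN so they are excluded from the mean.
fare = toy_df["Fare"].mask(toy_df["Fare"] == 0)

# Mean of the remaining fares within each class, aligned to the rows.
class_mean = fare.groupby(toy_df["Pclass"]).transform("mean")

# Fill the gaps with the class mean.
toy_df["Fare"] = fare.fillna(class_mean)
print(toy_df)
```

Whether zeros should count toward the class mean is a judgment call; the vectorised form makes either choice a one-line change.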
### Age and title

I think age is also a very important parameter, but as we can see it is missing for 177 passengers.
We can try to fill it in based on each person's title. This is where feature engineering comes in: the original dataset has no title column, but the title is part of the name. The basic idea is that we can split or combine the given features into new ones. So let's create a title feature:

In [7]:
train_df.Name = train_df.Name.str.replace('Mlle', 'Miss')
train_df.Name = train_df.Name.str.replace('Mme', 'Mrs')
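# Take the first word after the comma and drop its last character (the period),
# e.g. "Braund, Mr. Owen Harris" -> "Mr"
train_df['Title'] = train_df.Name.apply(lambda n: str(n)[str(n).find(',')+1:].strip().split(' ')[0][:-1])
# "the Countess." starts with "the", which the slice above cuts down to "th"
train_df.Title = train_df.Title.replace('th', 'Countess')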
train_df.Title = train_df.Title.replace('Ms', 'Miss')
print(train_df.Title.unique())

['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Major' 'Lady' 'Sir' 'Col'
 'Capt' 'Countess' 'Jonkheer']

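The title can also be pulled out with a regular expression, which avoids the special case for "the Countess". This is a sketch of an alternative, not the extraction used above; the sample strings follow the dataset's "Surname, Title. Given names" format and are here only for illustration:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Aubart, Mme. Leontine Pauline",
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Map rare spellings onto the common ones.
titles = titles.replace({
    "Mlle": "Miss",
    "Mme": "Mrs",
    "Ms": "Miss",
    "the Countess": "Countess",
})
print(titles.tolist())
```

Doing the normalisation on the extracted title, rather than on the full name, also avoids accidentally rewriting a surname that happens to contain "Mme" or "Mlle".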

In [8]:
import math
title_age = train_df.groupby('Title').Age.mean().round()
train_df.Age = train_df[['Title', 'Age']].apply(lambda a: title_age[a.Title] if math.isnan(a.Age) else a.Age, axis=1)

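The fill above can be sketched in vectorised form with `groupby(...).transform`, which broadcasts the per-title mean back to every row. The toy values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})

# Rounded mean age per title, aligned with the original rows.
title_mean = toy_df.groupby("Title")["Age"].transform("mean").round()

# Replace only the missing ages.
toy_df["Age"] = toy_df["Age"].fillna(title_mean)
print(toy_df)
```

A title whose ages are all missing would stay NaN here, so it is worth checking `isna().sum()` afterwards.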
### Cabin

Let's see what our data looks like at this point:

In [9]:
%matplotlib widget 
plt.figure(figsize=(12,5))
plt.title('IsNaN values of given data')
plt.imshow(train_df.isnull(), interpolation='nearest', aspect='auto')  
plt.xticks(range(len(train_df.columns)), train_df.columns)
plt.colorbar()


As we can see, most of the cabin data is missing. Let's investigate how we can restore it:

<img src="https://sun9-88.userapi.com/impg/H2_bLjAFAVFIg0PFZspaJSam_0Mji8BFNdG8hg/w-E0wuRVHG4.jpg?size=1401x2088&quality=96&sign=809444c7f827cc913ef56ee3465accbe&type=album" alt="drawing" width="300"/>

The picture above suggests that the cabin letter depends on the class:
**(Below we will see that this is wrong!)**

As I saw in [one of the solution examples](https://medium.com/analytics-vidhya/random-forest-on-titanic-dataset-88327a014b4d) for this analysis, and as is actually quite obvious, the cabin should depend on the fare. We will add a new feature, the cabin letter, and fill it with X where the cabin is empty. Then we will see how the fare depends on the cabin letter:

In [10]:
# First character of the cabin number; missing cabins get the placeholder 'X'
train_df['CabLet'] = train_df.Cabin.str[0].fillna('X')
_ = train_df.boxplot('Fare', 'CabLet', figsize=(10,5))


From this picture we can see that the letter X has many more outliers than the other letters. So we can reassign X based on the distance between the outliers of X and the IQR of the other letters:

![](https://sun9-81.userapi.com/impg/iha_aAC3pZvuFZozZh-q6JWekt4RyFOi5wCAWA/ibPQMdsKhkU.jpg?size=568x483&quality=96&sign=5e27fd0e0dcb8008282a324d4f09a224&type=album)

In [11]:
cabLet_fare_m = train_df[['Fare', 'CabLet']].groupby('CabLet').mean()
# Upper quartile (75th percentile) of the fare for each cabin letter
cabLet_fare_q = train_df[['Fare', 'CabLet']].groupby('CabLet').quantile(0.75)

def assignCabinBasedOnFare(cf: pd.Series) -> str:
    cabin = cf['CabLet']
    fare = cf['Fare']

    # Known cabin letters are kept as they are
    if cabin != 'X':
        return cabin
    # Candidate letters in reverse alphabetical order, skipping the first ('X') and the last one
    for c in cabLet_fare_q.index.values[::-1][1:-1]:
        if fare <= cabLet_fare_q.loc[c].Fare:
            return c
    # No candidate matched: fall back to 'B'
    return 'B'

train_df['CabLet'] = train_df[['CabLet', 'Fare']].apply(assignCabinBasedOnFare, axis=1)
_ = train_df.boxplot('Fare', 'CabLet', figsize=(10,5))


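To make the notion of an outlier relative to the IQR concrete, here is a sketch of the usual 1.5 × IQR whisker rule that box plots apply by default. The fares are made-up numbers, and this is only an illustration of the rule, not part of the cabin assignment above:

```python
import pandas as pd

# Made-up fares with one obviously large value.
fares = pd.Series([7.0, 8.0, 8.0, 9.0, 10.0, 12.0, 13.0, 80.0])

# Interquartile range: distance between the 25th and 75th percentiles.
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers.tolist())
```

Applied per cabin letter, the same bounds would show which X fares fall inside the typical fare range of another letter.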
## Feature engineering

As mentioned above, in most cases we should not only use the given data but also combine and split it to create new features. A split or combined feature often turns out to be the most influential parameter.

Let's take a closer look at the family parameters, SibSp and Parch, and combine them into a single parameter that describes whether the passenger was alone or not:

In [12]:
# Alone: 1 if the passenger has no relatives aboard, else 0
train_df['Alone'] = ((train_df.SibSp + train_df.Parch) == 0).astype(int)
# Familiars: total number of relatives aboard
train_df['Familiars'] = train_df.SibSp + train_df.Parch
_ = train_df[['SibSp', 'Parch', 'Alone', 'Familiars']].hist(bins=range(8), figsize=(12,5), layout=(4,1), sharex=True)


Now let's see how the given and engineered features influence survival.
We can check some of the features that seem most valuable:

In [13]:
f,ax = plt.subplots(2,2,figsize=(10,10))
plt.sca(ax[0,0])
_ = sns.countplot(x='Sex', hue='Survived', data = train_df[['Sex','Survived']])
plt.sca(ax[0,1])
_ = sns.countplot(x='Pclass', hue='Survived', data = train_df[['Pclass','Survived']])
plt.sca(ax[1,0])
_ = sns.countplot(x='Alone', hue='Survived', data = train_df[['Alone','Survived']])
plt.sca(ax[1,1])
_ = sns.countplot(x='Familiars', hue='Survived', data = train_df[['Familiars','Survived']])


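The count plots can be complemented with plain numbers: because Survived is 0/1, its mean within a group is that group's survival rate. This is a sketch on a toy frame with made-up values, so the rates below say nothing about the real data:

```python
import pandas as pd

# Toy stand-in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "SibSp": [0, 1, 0, 2, 0, 0],
    "Parch": [0, 0, 0, 1, 0, 0],
    "Survived": [0, 0, 1, 1, 1, 1],
})

toy_df["Familiars"] = toy_df["SibSp"] + toy_df["Parch"]
toy_df["Alone"] = (toy_df["Familiars"] == 0).astype(int)

# Share of survivors in each group.
rate_by_sex = toy_df.groupby("Sex")["Survived"].mean()
rate_by_alone = toy_df.groupby("Alone")["Survived"].mean()
print(rate_by_sex)
print(rate_by_alone)
```

Rates are easier to compare than raw counts when the groups differ a lot in size.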
In [47]:
f,ax = plt.subplots(1,2,figsize=(10,10))
_ = sns.countplot(x='SibSp', hue='Survived', data = train_df[['Survived', 'Sex', 'SibSp']].where(train_df.Sex=='male').dropna(), ax=ax[0]).set_title('male')
_ = sns.countplot(x='SibSp', hue='Survived', data = train_df[['Survived', 'Sex', 'SibSp']].where(train_df.Sex=='female').dropna(), ax=ax[1]).set_title('female')


In [67]:

# Female passengers with exactly three siblings/spouses aboard
train_df[(train_df.Sex == 'female') & (train_df.SibSp == 3)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
24,25,0,3,"Palsson, Miss. Torborg Danira",female,8.0,3,1,349909,21.075,,S
85,86,1,3,"Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu...",female,33.0,3,0,3101278,15.85,,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
229,230,0,3,"Lefebre, Miss. Mathilde",female,,3,1,4133,25.4667,,S
341,342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0,C23 C25 C27,S
374,375,0,3,"Palsson, Miss. Stina Viola",female,3.0,3,1,349909,21.075,,S
409,410,0,3,"Lefebre, Miss. Ida",female,,3,1,4133,25.4667,,S
485,486,0,3,"Lefebre, Miss. Jeannie",female,,3,1,4133,25.4667,,S
634,635,0,3,"Skoog, Miss. Mabel",female,9.0,3,2,347088,27.9,,S
642,643,0,3,"Skoog, Miss. Margit Elizabeth",female,2.0,3,2,347088,27.9,,S


In [91]:
f,ax = plt.subplots(1,1,figsize=(10,10))
_ = sns.countplot(x='Name', hue='Survived', data =(pd.DataFrame([train_df.Survived, train_df.Name.apply(len)]).T), ax=ax)



We also plot the correlation matrix, but before doing so we should cast the categorical data to numeric codes:

In [14]:
# Encode the categorical columns as numeric codes
categories = {"female": 1, "male": 0}
train_df['Sex']= train_df['Sex'].map(categories)

categories = {"S": 1, "C": 2, "Q": 3}
train_df['Embarked'] = train_df['Embarked'].map(categories)

train_df['CabLet'] = train_df.CabLet.astype("category").cat.codes
train_df['Title'] = train_df.Title.astype("category").cat.codes

plt.figure(figsize=(10,8))
# Correlate numeric columns only; Name, Ticket and Cabin are still text
sns.heatmap(train_df.select_dtypes('number').corr(), annot=True)



KeyError: 'default'

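Besides the full heatmap, it can help to look only at how each feature correlates with the target. Here is a sketch on a toy frame (made-up values; `select_dtypes('number')` is used so that text columns are skipped rather than causing an error):

```python
import pandas as pd

# Toy stand-in for train_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 2, 3, 1, 2],
    "Name": ["a", "b", "c", "d", "e", "f"],  # free text, left unencoded
})

toy_df["Sex"] = toy_df["Sex"].map({"female": 1, "male": 0})

# Pearson correlation between the numeric columns only.
corr = toy_df.select_dtypes("number").corr()

# Correlation of every feature with the target, strongest first.
target_corr = corr["Survived"].drop("Survived").sort_values(key=abs, ascending=False)
print(target_corr)
```

Keep in mind that codes produced by `cat.codes` are arbitrary labels, so a correlation computed on them only makes sense for categories with a natural order.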
Now let's remove the redundant features. The new features built from them are what future models will use.

In [None]:
y = train_df['Survived'].copy()

# Drop the target and the raw text columns
train_df = train_df.drop(['Survived', 'Name', 'Cabin', 'Ticket'], axis=1)