# Titanic Kaggle competition

Here we start

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
from matplotlib import pyplot as plt
%matplotlib widget

train_df = pd.read_csv(r"..\data\train.csv")
test_df = pd.read_csv(r"..\data\test.csv")
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Before we start, let's look over the dataset and work out what each of these parameters means. According to the competition's description page:

|Variable|Definition|Key|
|-|--|--|
|survival|Survival|0 = No, 1 = Yes|
|pclass|A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex|male, female|
|Age|Age in years|Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5|
|sibsp|# of siblings / spouses aboard the Titanic. Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored)| |
|parch|# of parents / children aboard the Titanic. The dataset defines family relations in this way: Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.| |
|ticket|Ticket number| |
|fare|Passenger fare| |
|cabin|Cabin number| |
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|


## Data

Now let's take a closer look at our data:

In [2]:
train_df.head()

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|1|0|3|Braund, Mr. Owen Harris|male|22.0|1|0|A/5 21171|7.25|NaN|S|
|1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C|
|2|3|1|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.925|NaN|S|
|3|4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35.0|1|0|113803|53.1|C123|S|
|4|5|0|3|Allen, Mr. William Henry|male|35.0|0|0|373450|8.05|NaN|S|


Let's find out how much each of these parameters influences survival.
(I should note that I've heard some competitors use PassengerId as a feature and manage to extract useful information from it. I can imagine trying to work out how the original sample was split and what logic the organizers followed, but that isn't interesting to me right now.)

## Data preparation

Before using any models we have to prepare our data for them. Let's remove the garbage from the data and think about what we can do with the empty values:

In [8]:
train_df = train_df.drop('PassengerId', axis = 1)
train_df.describe(include='all')

| |Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|-|-|-|-|-|-|-|-|-|-|-|-|
|count|891.0|891.0|891|891|714.0|891.0|891.0|891|891.0|204|889|
|unique|NaN|NaN|891|2|NaN|NaN|NaN|681|NaN|147|3|
|top|NaN|NaN|Braund, Mr. Owen Harris|male|NaN|NaN|NaN|347082|NaN|B96 B98|S|
|freq|NaN|NaN|1|577|NaN|NaN|NaN|7|NaN|4|644|
|mean|0.383838|2.308642|NaN|NaN|29.699118|0.523008|0.381594|NaN|32.204208|NaN|NaN|
|std|0.486592|0.836071|NaN|NaN|14.526497|1.102743|0.806057|NaN|49.693429|NaN|NaN|
|min|0.0|1.0|NaN|NaN|0.42|0.0|0.0|NaN|0.0|NaN|NaN|
|25%|0.0|2.0|NaN|NaN|20.125|0.0|0.0|NaN|7.9104|NaN|NaN|
|50%|0.0|3.0|NaN|NaN|28.0|0.0|0.0|NaN|14.4542|NaN|NaN|
|75%|1.0|3.0|NaN|NaN|38.0|1.0|0.0|NaN|31.0|NaN|NaN|
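
The counts above show which columns have gaps (714 of 891 for Age, 204 for Cabin, 889 for Embarked). As a rough sketch, the same information can be tabulated directly per column. The `toy` frame below is a made-up stand-in for `train_df`, used only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number and share of missing entries per column, most incomplete first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)
print(missing)
```

Running the same two expressions on `train_df` should reproduce the gaps visible in the `info()` output.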


### Embarked

Only two passengers have no Embarked value.

In [27]:
train_df[train_df.Embarked.isna()]

| |Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|-|-|-|-|-|-|-|-|-|-|-|-|
|61|1|1|Icard, Miss. Amelie|female|38.0|0|0|113572|80.0|B28|NaN|
|829|1|1|Stone, Mrs. George Nelson (Martha Evelyn)|female|62.0|0|0|113572|80.0|B28|NaN|


Here we can see that the two ladies share the same ticket number and the same cabin number. Martha has the title Mrs. and is older than Amelie, so they look like mother and daughter, although Parch is 0 for both, so the data itself does not record them as parent and child.

In [103]:
import seaborn as sns
%matplotlib widget
emb = train_df[['Embarked', 'Pclass']].value_counts(normalize=True).rename('ratio').reset_index()
p = sns.barplot(x='Pclass', y='ratio', hue='Embarked', data=emb )

[Figure: bar plot of `ratio` (share of passengers) by `Pclass`, with one bar per `Embarked` port]

In [39]:
train_df[train_df.Ticket == "113572"]

| |Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|-|-|-|-|-|-|-|-|-|-|-|-|
|61|1|1|Icard, Miss. Amelie|female|38.0|0|0|113572|80.0|B28|NaN|
|829|1|1|Stone, Mrs. George Nelson (Martha Evelyn)|female|62.0|0|0|113572|80.0|B28|NaN|
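
One simple option for the two missing ports is to fill them with the most frequent value (in the `describe` output, S appears 644 times out of 889). The sketch below only illustrates that option on a made-up `toy` frame; it is not necessarily the best choice for these two first-class passengers.

```python
import pandas as pd

# Toy stand-in for the Pclass/Embarked columns; values are illustrative only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3],
    "Embarked": ["S", "S", None, "Q", "C"],
})

# Most frequent port in the column (mode() ignores missing values).
most_common = toy["Embarked"].mode()[0]

# Assign the result back instead of calling fillna(inplace=True) on a column selection.
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(toy)
```

A per-class variant could take the mode within each `Pclass` group instead, which is what the bar plot of ports by class hints at.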


In [44]:
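# `tdf` is not defined in the cells shown; it is assumed here to be a copy of the training data.
tdf = train_df.copy()
tdf['Survived'].value_counts(normalize=True)
tdf['Survived'].groupby(tdf['Pclass']).mean()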

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [47]:
tdf['Name_Title'] = tdf['Name'].apply(lambda x: x.split(',')[1]).apply(lambda x: x.split()[0])
tdf['Name_Title'].value_counts()
tdf['Survived'].groupby(tdf['Name_Title']).mean()

Name_Title
Capt.        0.000000
Col.         0.500000
Don.         0.000000
Dr.          0.428571
Jonkheer.    0.000000
Lady.        1.000000
Major.       0.500000
Master.      0.575000
Miss.        0.697802
Mlle.        1.000000
Mme.         1.000000
Mr.          0.156673
Mrs.         0.792000
Ms.          1.000000
Rev.         0.000000
Sir.         1.000000
the          1.000000
Name: Survived, dtype: float64
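
The `the` row above comes from taking only the first word after the comma, which cuts a multi-word title short. A possible alternative, shown here as a sketch on three sample names (the third is the passenger assumed to be behind the `the` row), is to capture everything between the comma and the first period. Note that this variant drops the trailing period, so the labels would read `Mr` rather than `Mr.`.

```python
import pandas as pd

# Sample names in the dataset's "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the comma and the first period, then trim whitespace.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```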

In [23]:
sex = tdf[['Sex', 'Survived']]
men = sex.Sex == 'male'
women = sex.Sex == 'female'
men = sex.where(men).dropna()
women = sex.where(women).dropna()
men.Survived.sum() / len(men), women.Survived.sum() / len(women)


(0.18890814558058924, 0.7420382165605095)

In [3]:
# Survival rate for passengers who embarked at Southampton (S) and at Cherbourg (C)
port = tdf[['Embarked', 'Survived']]
southampton = port[port.Embarked == 'S']
cherbourg = port[port.Embarked == 'C']
southampton.Survived.sum() / len(southampton), cherbourg.Survived.sum() / len(cherbourg)


(0.33695652173913043, 0.5535714285714286)
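
The two cells above compute a survival rate per category with boolean masks. The same quantity can be obtained for every category at once with `groupby`. This is a sketch on a made-up `toy` frame, and `survival_rate_by` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "S", "C"],
    "Survived": [0, 1, 1, 0, 0],
})

def survival_rate_by(df, column):
    """Mean of Survived within each category of `column`."""
    return df.groupby(column)["Survived"].mean()

by_sex = survival_rate_by(toy, "Sex")
by_port = survival_rate_by(toy, "Embarked")
print(by_sex)
print(by_port)
```

Because `Survived` is coded 0/1, the group mean is exactly the share of survivors in that group.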

In [39]:
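from sklearn.ensemble import RandomForestClassifier

y = tdf["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "Fare"]
# Fill missing test fares with the mean; assign back rather than using
# inplace=True on a column selection, which newer pandas versions warn about.
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())
# One-hot encode the categorical Sex column
X = pd.get_dummies(tdf[features])
X_test = pd.get_dummies(test_df[features])

# max_features='sqrt' is what 'auto' meant for classifiers;
# newer scikit-learn versions no longer accept 'auto'.
model = RandomForestClassifier(criterion='gini', 
                             n_estimators=700,
                             min_samples_split=10,
                             min_samples_leaf=1,
                             max_features='sqrt',
                             oob_score=True,
                             random_state=1,
                             n_jobs=-1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_df.PassengerId, 'Survived': predictions})
# Raw string so the backslashes are not read as escape sequences
output.to_csv(r'..\data\submission.csv', index=False)
print("Your submission was successfully saved!")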





Your submission was successfully saved!
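
Calling `pd.get_dummies` separately on the training and test frames can produce different columns when a category occurs in only one of them. With the features used above this is unlikely to matter, but if more categorical columns are added, a defensive sketch like the following aligns the test columns to the training ones. The frames and their values are made up for illustration.

```python
import pandas as pd

# Toy frames: port "C" occurs only in the training part.
train_part = pd.DataFrame({"Pclass": [1, 3], "Embarked": ["S", "C"]})
test_part = pd.DataFrame({"Pclass": [2, 3], "Embarked": ["S", "S"]})

X_train = pd.get_dummies(train_part)
X_new = pd.get_dummies(test_part)

# Match the training columns and their order; categories that are absent
# from the test data become all-zero columns.
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)
print(X_new)
```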


In [38]:
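rf = RandomForestClassifier(criterion='gini', 
                             n_estimators=700,
                             min_samples_split=10,
                             min_samples_leaf=1,
                             max_features='sqrt',  # 'auto' is no longer accepted by newer scikit-learn
                             oob_score=True,
                             random_state=1,
                             n_jobs=-1)
# Note: with X as built in the cell above, X.iloc[:, 0] is the Pclass column,
# so this fit predicts Pclass from the remaining features and the OOB score
# printed below is not a survival accuracy.
rf.fit(X.iloc[:, 1:], X.iloc[:, 0])
print("%.4f" % rf.oob_score_)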



0.9293
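
`oob_score_` compares the out-of-bag predictions with whatever was passed as the second argument of `fit`, so an out-of-bag estimate of survival accuracy needs the `Survived` column as the target. The sketch below shows the pattern on synthetic data (the features and the "women survive" rule are invented, and the forest is kept small so it runs quickly); it says nothing about the score on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic feature matrix; column names mimic the notebook's, values are random.
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex_male": rng.integers(0, 2, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
# Synthetic target: 1 for women, 0 for men (made up for illustration).
y_toy = 1 - X_toy["Sex_male"]

rf_toy = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=1)
# Features first, target second: the OOB score then measures accuracy on the target.
rf_toy.fit(X_toy, y_toy)
print("%.4f" % rf_toy.oob_score_)
```

With the notebook's own variables the corresponding call would be `rf.fit(X, y)`.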


In [31]:
%matplotlib widget
tdf.Cabin.describe()

count         204
unique        147
top       B96 B98
freq            4
Name: Cabin, dtype: object

In [22]:
%matplotlib widget
tdf.Age[tdf.Survived == 0].hist()

[Figure: histogram of `Age` for passengers who did not survive]

<AxesSubplot:>

In [28]:
%matplotlib widget
tdf.Survived[tdf.Age<=10].hist()

[Figure: histogram of `Survived` for passengers aged 10 or younger]

<AxesSubplot:>

In [29]:
%matplotlib widget
tdf.Survived[tdf.Age > 10].hist()  # strictly older than 10, so age 10 is not counted in both plots

[Figure: histogram of `Survived` for the older passengers]

<AxesSubplot:>
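
The last two histograms compare survival for younger and older passengers. A compact way to get the same comparison as numbers is to bin `Age` with `pd.cut` and average `Survived` per bin. This is a sketch on a made-up `toy` frame; the cut at 10 follows the cells above, while the upper bound of 120 and the labels `child` and `older` are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [4.0, 9.0, 10.0, 25.0, 40.0, np.nan],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Right-closed bins (0, 10] and (10, 120]: age 10 falls in the first bin only.
toy["AgeGroup"] = pd.cut(toy["Age"], bins=[0, 10, 120], labels=["child", "older"])

# Rows with a missing Age get a missing AgeGroup and are left out by groupby.
rate = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(rate)
```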