# **Titanic Analysis notebook**

### Challenge: Figure out what sorts of people were more likely to survive the Titanic sinking

| **Data Dictionary** | | | |
| ----------- | ----------- | ----------- | ----------- |
| **Variable** | **Definition** | **Key** | **Notes** |
| PassengerId | Passenger ID | Distinct | Every passenger has a unique serial number. |
| survival | Survival | 0 = No, 1 = Yes | 0 = died, 1 = survived the disaster. |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | The passenger's class on the ship: first, second, or third class. |
| Name | Full name (sometimes with a title) | | The passenger's full name; sometimes includes a title, e.g. Mr., Mrs., Miss. |
| sex | Sex | | Values are male and female. |
| Age | Age in years | | The passenger's age in years. May be missing. |
| sibsp | # of siblings / spouses aboard the Titanic | | How many siblings or spouses travelled with the passenger. |
| parch | # of parents / children aboard the Titanic | | How many parents or children travelled with the passenger. |
| ticket | Ticket number | | Identifier of the passenger's ticket (text or number). |
| fare | Passenger fare | | Fare paid by the passenger, in pounds. |
| cabin | Cabin number | | The cabin the passenger stayed in (many missing values). |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | Where the passenger boarded the ship. |

---
| **Variable Notes** | | | |
| ----------- | ----------- | ----------- | ----------- |
| **pclass** | **age** | **sibsp** | **parch** |
| A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower | Age is fractional if less than 1. If the age is estimated, it is in the form xx.5 | Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés ignored) | Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, therefore parch = 0 |
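The age note above can be turned into a simple check. The following is a minimal sketch on made-up ages (not rows from the dataset); the name `age_is_estimated` is hypothetical and the rule is only a reading of the note: an age of at least 1 whose fractional part is exactly .5 is treated as estimated.

```python
import pandas as pd

# Toy ages, invented for illustration: an infant, an estimated age,
# a plain age, and a missing age.
ages = pd.Series([0.42, 28.5, 35.0, None], name="Age")

# Ages below 1 are fractional because they describe infants, so they are
# excluded; missing ages compare as False on both conditions.
age_is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(age_is_estimated.tolist())  # [False, True, False, False]
```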


In [393]:
import numpy as np
import pandas as pd

In [394]:
# Forward slashes work on Windows as well as on Linux and macOS
gender_submission = pd.read_csv('titanic/gender_submission.csv')
test = pd.read_csv('titanic/test.csv')
train = pd.read_csv('titanic/train.csv')

In [395]:
print(f"Train shape: {train.shape}, Test shape: {test.shape}, Gender submission shape: {gender_submission.shape}")

Train shape: (891, 12), Test shape: (418, 11), Gender submission shape: (418, 2)


In [396]:
# Split the data into features and labels, indexed by PassengerId
train.set_index('PassengerId', inplace=True)
train_labels = train['Survived'].copy()
train_features = train.drop(columns =['Survived']).copy()

In [397]:
train_features.duplicated().unique()

array([False])
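`duplicated().unique()` returning only `False` means no feature row repeats an earlier one. A more direct way to ask the same question is to count the duplicates. This is a small sketch on a toy frame, not on the Titanic data:

```python
import pandas as pd

# Toy frame in which the last row repeats the first one.
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.28, 7.25]})

# duplicated() marks every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())
has_duplicates = bool(toy.duplicated().any())

print(n_duplicates, has_duplicates)  # 1 True
```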

In [398]:
train_features.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [399]:
train_features.tail()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
891,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [400]:
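# Note: without parentheses this does not call DataFrame.all(); it only
# displays the bound method, whose repr happens to include the frame
train_features.all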

<bound method DataFrame.all of              Pclass                                               Name  \
PassengerId                                                              
1                 3                            Braund, Mr. Owen Harris   
2                 1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
3                 3                             Heikkinen, Miss. Laina   
4                 1       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
5                 3                           Allen, Mr. William Henry   
...             ...                                                ...   
887               2                              Montvila, Rev. Juozas   
888               1                       Graham, Miss. Margaret Edith   
889               3           Johnston, Miss. Catherine Helen "Carrie"   
890               1                              Behr, Mr. Karl Howell   
891               3                                Dooley, Mr. Patrick   

      

In [401]:
train_features.describe()

,Pclass,Age,SibSp,Parch,Fare
count,891.0,714.0,891.0,891.0,891.0
mean,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.42,0.0,0.0,0.0
25%,2.0,20.125,0.0,0.0,7.9104
50%,3.0,28.0,0.0,0.0,14.4542
75%,3.0,38.0,1.0,0.0,31.0
max,3.0,80.0,8.0,6.0,512.3292


In [402]:
# Fill missing Age values with the median age
train_features['Age'] = train_features['Age'].fillna(train_features['Age'].median())

# Cabin has too many NaN values to use directly, so encode it as a binary flag: known = 1, unknown = 0
train_features['Cabin_binary'] = train_features['Cabin'].notna().astype(int)
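To make the two steps above easier to follow, here is a minimal sketch on a toy frame with invented values. It shows that `median()` skips missing values and that `notna()` turns a sparse column into a 0/1 indicator.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0, 26.0],
    "Cabin": [None, "C85", None, "C123"],
})

# median() ignores NaN: the median of 22, 26 and 38 is 26.
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# 1 where a cabin is recorded, 0 where it is missing.
toy["Cabin_binary"] = toy["Cabin"].notna().astype(int)

print(toy["Age"].tolist())           # [22.0, 26.0, 38.0, 26.0]
print(toy["Cabin_binary"].tolist())  # [0, 1, 0, 1]
```

If the same cleaning is later applied to the test set, one common choice is to reuse the median computed on the training data rather than recomputing it.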

In [403]:
train_features['Embarked'].value_counts(dropna=False)

Embarked
S      644
C      168
Q       77
NaN      2
Name: count, dtype: int64

In [404]:

train_features['Embarked'] = train_features['Embarked'].fillna(train_features['Embarked'].mode()[0])
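The mode-based fill above can be illustrated on a toy series (values invented for the example). `mode()` returns a Series because ties are possible, so `[0]` picks the first most frequent value.

```python
import pandas as pd

# Toy embarkation codes with one missing entry.
embarked = pd.Series(["S", "C", None, "S", "Q"], name="Embarked")

# mode() ignores missing values by default; "S" occurs twice, the others once.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)      # S
print(filled.tolist())  # ['S', 'C', 'S', 'S', 'Q']
```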

In [405]:
# Sex_binary: male = 0, female = 1; then drop the original Sex and Cabin columns
train_features['Sex_binary'] = train_features['Sex'].map({'male':0,'female':1})
train_features.drop(columns=['Sex','Cabin'], inplace=True)
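One caveat with `map` and a dictionary: any value that is not a key becomes NaN without raising an error. The sketch below uses a toy series with a deliberately misspelled entry to show one way to check for that before dropping the source column.

```python
import pandas as pd

# Toy data: "Female" (capitalised) is not a key of the mapping.
sex = pd.Series(["male", "female", "Female"], name="Sex")

sex_binary = sex.map({"male": 0, "female": 1})

# Rows that the mapping did not cover end up as NaN.
unmapped = sex[sex_binary.isna()]

print(unmapped.tolist())  # ['Female']
```

In the notebook the later `isnull().sum()` output shows 0 for `Sex_binary`, so every value was covered there.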

In [406]:
train_features

PassengerId,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,Cabin_binary,Sex_binary
1,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,S,0,0
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C,1,1
3,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,S,0,1
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,S,1,1
5,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,S,0,0
...,...,...,...,...,...,...,...,...,...,...
887,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,S,0,0
888,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,S,1,1
889,3,"Johnston, Miss. Catherine Helen ""Carrie""",28.0,1,2,W./C. 6607,23.4500,S,0,1
890,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C,1,0


In [407]:
train_features['Embarked'].value_counts(dropna=False)

Embarked
S    646
C    168
Q     77
Name: count, dtype: int64

In [408]:
# Count the remaining null values per column
train_features.isnull().sum()

Pclass          0
Name            0
Age             0
SibSp           0
Parch           0
Ticket          0
Fare            0
Embarked        0
Cabin_binary    0
Sex_binary      0
dtype: int64

In [409]:
# Define a preprocessing function (placeholder for now: it does not modify the data yet)

def preprocess_dataset(df):
    return True

In [410]:
preprocess_dataset(train_features)

True
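The function above is still a stub. As a rough sketch of where it could go, the cleaning steps from the earlier cells can be collected into one function. The name `preprocess_features` and the optional `age_median` and `embarked_mode` parameters are hypothetical, not part of the notebook; they are there so that statistics computed on the training set can be reused for the test set.

```python
import pandas as pd

def preprocess_features(df, age_median=None, embarked_mode=None):
    """Return a cleaned copy of df; the input frame is left untouched."""
    out = df.copy()
    # Fall back to statistics of this frame when none are passed in.
    if age_median is None:
        age_median = out["Age"].median()
    if embarked_mode is None:
        embarked_mode = out["Embarked"].mode()[0]
    out["Age"] = out["Age"].fillna(age_median)
    out["Embarked"] = out["Embarked"].fillna(embarked_mode)
    out["Cabin_binary"] = out["Cabin"].notna().astype(int)
    out["Sex_binary"] = out["Sex"].map({"male": 0, "female": 1})
    return out.drop(columns=["Sex", "Cabin"])

# Toy frame with invented values, only to exercise the function.
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
    "Sex": ["male", "female", "female"],
})
clean = preprocess_features(toy)
print(clean)
```

Returning a copy instead of modifying the frame in place keeps the raw data available for inspection, at the cost of some extra memory.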