# Titanic Kaggle Competition

## Data Analysis

The first phase is to analyze the dataset in order to learn what information the available data contains.

Context of the dataset:
- On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew (32% survival rate).
- There were not enough lifeboats for the passengers and crew.
- Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
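
The survival rate quoted above follows directly from the two counts. A quick check, using only the figures given in the text:

```python
# Figures quoted in the context above.
total_on_board = 2224
deaths = 1502

survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"survivors: {survivors}")
print(f"survival rate: {survival_rate:.1%}")
```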

### Import libraries

In [1]:
import pandas as pd
import numpy as np

# models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# feature and model selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

### Acquire data

In [2]:
# data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

### Data analysis

Let's take a look at some basic information about this dataset:

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
print(train.columns.values)

['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']


#### Feature types

- Categorical:
    - Nominal:
        - Survived
        - Sex
        - Embarked
    - Ordinal:
        - Pclass
- Numerical:
    - Continuous:
        - Age
        - Fare
    - Discrete:
        - SibSp
        - Parch
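
The classification above is a modelling decision and cannot be read off the stored dtypes alone. A minimal sketch, assuming a small hand-made sample with the same column names as `train.csv` (the `sample` frame and the `feature_types` dictionary are illustrative, not part of the notebook):

```python
import pandas as pd

# Tiny hand-made sample with the same column names as train.csv.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925],
    "Embarked": ["S", "C", "S"],
})

# The classification from the list above, written down explicitly.
feature_types = {
    "nominal": ["Survived", "Sex", "Embarked"],
    "ordinal": ["Pclass"],
    "continuous": ["Age", "Fare"],
    "discrete": ["SibSp", "Parch"],
}

# dtypes alone cannot separate these groups: Survived and Pclass are stored
# as integers even though they are treated as categorical.
numeric_by_dtype = sample.select_dtypes(include="number").columns.tolist()
print(numeric_by_dtype)
```

Keeping the grouping in a dictionary like this makes it easy to pick, later on, which columns to encode and which to leave numeric.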

#### Feature analysis

In [5]:
print(train.info())
print("-"*50)
print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket      

- The Cabin feature is mostly incomplete in both the training and test datasets. It might be useful, since cabin position may be correlated with survival, but there may not be enough information to complete it correctly, and cabin position is probably correlated with fare. So it may be dropped.
- There may not be a correlation between Ticket and survival.
- We can complete the Embarked feature (only 2 null values).
- We have to complete the Age feature, as we know it is correlated with survival.
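
The observations above come from reading the `info()` output. A more direct view is a per-column count and fraction of missing values. A minimal sketch on an illustrative frame (the values in `df` are made up; in the notebook the same expression would be applied to `train` and `test`):

```python
import numpy as np
import pandas as pd

# Illustrative frame, not the real data.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Count and fraction of missing values per column, most incomplete first.
missing = pd.DataFrame({
    "missing": df.isnull().sum(),
    "fraction": df.isnull().mean(),
}).sort_values("fraction", ascending=False)
print(missing)
```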

In [6]:
train[["Pclass", "Survived"]].groupby(["Pclass"], as_index=False).mean().sort_values(by="Survived", ascending=False)

Unnamed: 0,Pclass,Survived
0,1,0.62963
1,2,0.472826
2,3,0.242363


In [7]:
train[["Sex", "Survived"]].groupby(["Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False)

Unnamed: 0,Sex,Survived
0,female,0.742038
1,male,0.188908


In [8]:
train[["SibSp", "Survived"]].groupby(["SibSp"], as_index=False).mean().sort_values(by="Survived", ascending=False)

Unnamed: 0,SibSp,Survived
1,1,0.535885
2,2,0.464286
0,0,0.345395
3,3,0.25
4,4,0.166667
5,5,0.0
6,8,0.0


In [9]:
train[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,Parch,Survived
3,3,0.6
1,1,0.550847
2,2,0.5
0,0,0.343658
5,5,0.2
4,4,0.0
6,6,0.0


This confirms that there is a correlation between Pclass/Sex and Survived.
There may also be a correlation between SibSp/Parch and Survived, but some values have a survival rate of 0.
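
A survival rate of 0 for a value may rest on only a handful of passengers, so it can help to show the group size next to the mean. A minimal sketch, assuming a hypothetical helper `survival_by` and a made-up `toy` frame:

```python
import pandas as pd

def survival_by(df, column):
    """Return survival rate and group size for each value of `column`."""
    return (
        df.groupby(column)["Survived"]
        .agg(rate="mean", count="size")
        .sort_values("rate", ascending=False)
    )

# Illustrative data only, not the real training set.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 8],
    "Survived": [0, 1, 0, 0, 1, 1, 0],
})
result = survival_by(toy, "SibSp")
print(result)
```

In this toy example the 0.0 rate for `SibSp == 8` comes from a single row, which is weak evidence on its own.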

To complete the Age feature we can use each passenger's title rather than the overall average age. So we need to add this new feature, extracting it from the Name feature. The title may also provide additional information about social status.

## Data wrangling

To summarize what we discovered from the data analysis:
- We must complete the Age feature.
- We should extract a Title feature from Name.
- There is one missing value in Fare, two missing values in Embarked, and a lot of missing values in Cabin.

First of all, we extract and remove the Survived feature and combine the two sets, so that new features can be engineered on both at once.

In [10]:
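survived = train['Survived']
# pass axis by keyword; the positional form was removed in pandas 2.0
train.drop(['Survived'], axis=1, inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
titanic = pd.concat([train, test])
titanic.reset_index(inplace=True)
titanic.drop(['index', 'PassengerId'], inplace=True, axis=1)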

Now we can extract the passenger title and map the titles to categories.

In [11]:
titanic["Title"] = titanic["Name"].map(lambda name:name.split(',')[1].split('.')[0].strip())
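
The title sits between the first comma and the following period of each name. A minimal sketch comparing the split-based extraction with a regex-based alternative; the first three names come from the `train.head()` output, and the last one is a hypothetical name added to exercise a multi-word title:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Example, the Countess. of Somewhere",  # hypothetical name
])

# Split-based extraction, as in the cell above.
by_split = names.map(lambda name: name.split(",")[1].split(".")[0].strip())

# Regex-based alternative: the text between the first comma and the next period.
by_regex = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(by_split.tolist())
```

Both variants assume that every name contains a comma followed by a period; a name without that pattern would raise an error in the split version and yield a missing value in the regex version.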

Let's see what the different titles are:

In [12]:
titanic.groupby(['Title'], as_index=False).size()

Title
Capt              1
Col               4
Don               1
Dona              1
Dr                8
Jonkheer          1
Lady              1
Major             2
Master           61
Miss            260
Mlle              2
Mme               1
Mr              757
Mrs             197
Ms                2
Rev               8
Sir               1
the Countess      1
dtype: int64

Some titles are held by just a few people, so we can combine them into a few broader categories.

In [13]:
Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Dona": "Royalty",
    "Sir" : "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess":"Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr" : "Mr",
    "Mrs" : "Mrs",
    "Miss" : "Miss",
    "Master" : "Master",
    "Lady" : "Royalty"
}
titanic['Title'] = titanic['Title'].map(Title_Dictionary)
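
`Series.map` with a dictionary returns NaN for any value that is not a key, so a title missing from the dictionary would silently become a missing value. A minimal sketch of a check for that, assuming a shortened mapping and a hypothetical title `"Unknown"`:

```python
import pandas as pd

# Shortened, illustrative mapping; "Unknown" is a hypothetical title not in it.
title_map = {"Mr": "Mr", "Mrs": "Mrs", "Ms": "Mrs", "Col": "Officer"}
titles = pd.Series(["Mr", "Ms", "Col", "Unknown"])

mapped = titles.map(title_map)

# Titles that were not found in the dictionary end up as NaN after map().
unmapped = titles[mapped.isnull()].unique().tolist()
print(unmapped)
```

An empty list here would mean that every title was covered by the dictionary.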

Now we can drop the Name feature.

In [14]:
titanic.drop("Name", axis=1, inplace=True)

Let's look at the mean age of these categories (first 891 rows, to avoid data leakage):

In [15]:
grouped_mean_age = titanic[["Title", "Age"]].iloc[:891].groupby(['Title'], as_index=False).mean()
grouped_mean_age = grouped_mean_age.reset_index()[["Title", "Age"]]
grouped_mean_age

Unnamed: 0,Title,Age
0,Master,4.574167
1,Miss,21.804054
2,Mr,32.36809
3,Mrs,35.718182
4,Officer,46.705882
5,Royalty,41.6


We use these means to fill the missing ages.

In [16]:
def fill_age(row):
    condition = (grouped_mean_age["Title"] == row["Title"])
    return grouped_mean_age[condition]["Age"].values[0]

def process_age():
    titanic['Age'] = titanic.apply(
        lambda row: fill_age(row) if np.isnan(row['Age']) else row['Age'], 
        axis=1
    )
    return titanic

In [17]:
titanic = process_age()
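
The row-wise `apply` above can also be written as a vectorized lookup. A minimal sketch of that alternative on made-up data, assuming the first `n_train` rows play the role of the training set (names such as `df`, `n_train` and `title_means` are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data: the first four rows stand in for the training set.
df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Miss", "Miss", "Mr", "Miss"],
    "Age":   [30.0, 40.0, 20.0, np.nan, np.nan, 50.0],
})
n_train = 4

# Per-title means computed from the training rows only.
title_means = df.iloc[:n_train].groupby("Title")["Age"].mean()

# Look up each row's title mean and use it only where Age is missing.
df["Age"] = df["Age"].fillna(df["Title"].map(title_means))
print(df)
```

The result should match the `apply` version, while avoiding a Python-level function call per row.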

In [18]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
Pclass      1309 non-null int64
Sex         1309 non-null object
Age         1309 non-null float64
SibSp       1309 non-null int64
Parch       1309 non-null int64
Ticket      1309 non-null object
Fare        1308 non-null float64
Cabin       295 non-null object
Embarked    1307 non-null object
Title       1309 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 102.3+ KB


We encode the title values using dummy encoding.

In [19]:
titles_dummies = pd.get_dummies(titanic["Title"], prefix="Title")
titanic = pd.concat([titanic, titles_dummies], axis=1)
titanic.drop("Title", axis=1, inplace=True)
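
Dummy encoding creates one indicator column per category. A minimal sketch on a made-up column, also showing the `drop_first=True` option of `pd.get_dummies`, which the notebook does not use:

```python
import pandas as pd

df = pd.DataFrame({"Title": ["Mr", "Miss", "Mrs", "Mr"]})

# One indicator column per category (the encoding used in the cell above).
full = pd.get_dummies(df["Title"], prefix="Title")

# drop_first=True omits one category, which becomes the implicit baseline.
reduced = pd.get_dummies(df["Title"], prefix="Title", drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With the full encoding the indicator columns of each row always sum to 1. Whether dropping one of them helps depends on the model, so this is only an option to be aware of.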

In [20]:
titanic.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Officer,Title_Royalty
0,3,male,22.0,1,0,A/5 21171,7.25,,S,0,0,1,0,0,0
1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,0,0,0,1,0,0
2,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,1,0,0,0,0
3,1,female,35.0,1,0,113803,53.1,C123,S,0,0,0,1,0,0
4,3,male,35.0,0,0,373450,8.05,,S,0,0,1,0,0,0


Now we replace missing values in Fare (with the mean fare) and Embarked (with the most frequent value), both computed on the first 891 rows. We then encode the Embarked values using dummy encoding.

In [21]:
mean_fare = titanic.iloc[:891]["Fare"].mean()
# assign the result back; in-place fillna on a selected column is deprecated in recent pandas
titanic["Fare"] = titanic["Fare"].fillna(mean_fare)

In [22]:
mostfrq_embarked = titanic.iloc[:891]["Embarked"].mode()[0]
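# assign the result back; in-place fillna on a selected column is deprecated in recent pandas
titanic["Embarked"] = titanic["Embarked"].fillna(mostfrq_embarked)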

In [23]:
embarked_dummies = pd.get_dummies(titanic["Embarked"], prefix="Embarked")
titanic = pd.concat([titanic, embarked_dummies], axis=1)
titanic.drop("Embarked", axis=1, inplace=True)
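
The two imputations above can also be expressed as a single `fillna` call with a column-to-value mapping. A minimal sketch on made-up data, assuming the first `n_train` rows stand in for the training set:

```python
import numpy as np
import pandas as pd

# Illustrative data: the first three rows stand in for the training set.
df = pd.DataFrame({
    "Fare":     [10.0, 20.0, 30.0, np.nan],
    "Embarked": ["S", "S", "C", None],
})
n_train = 3
train_part = df.iloc[:n_train]

# Fill values are computed on the training rows only.
fill_values = {
    "Fare": train_part["Fare"].mean(),
    "Embarked": train_part["Embarked"].mode()[0],
}

# DataFrame.fillna accepts a column -> value mapping.
df = df.fillna(value=fill_values)
print(df)
```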

We drop the Cabin feature, as it contains a lot of missing values (77.46%), and the Ticket feature.

In [24]:
titanic.drop("Cabin", axis=1, inplace=True)

In [25]:
titanic.drop("Ticket", axis=1, inplace=True)
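
The share of missing Cabin values quoted above can be recomputed from the counts in the earlier `titanic.info()` output (1309 rows, 295 non-null Cabin values). A quick check:

```python
# Counts taken from the titanic.info() output shown earlier.
total_rows = 1309
cabin_non_null = 295

missing_fraction = (total_rows - cabin_non_null) / total_rows
print(f"{missing_fraction:.2%}")
```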

In [26]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 15 columns):
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Fare             1309 non-null float64
Title_Master     1309 non-null uint8
Title_Miss       1309 non-null uint8
Title_Mr         1309 non-null uint8
Title_Mrs        1309 non-null uint8
Title_Officer    1309 non-null uint8
Title_Royalty    1309 non-null uint8
Embarked_C       1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
dtypes: float64(2), int64(3), object(1), uint8(9)
memory usage: 72.9+ KB


So, no feature has missing values anymore. Now we have to process some features:
- we map Sex values to 0 (male) and 1 (female).
- we encode the values of Pclass using dummy encoding.

In [27]:
titanic["Sex"] = titanic["Sex"].map({"male":0, "female":1})

In [28]:
pclass_dummies = pd.get_dummies(titanic["Pclass"], prefix="Pclass")
titanic = pd.concat([titanic, pclass_dummies], axis=1)
titanic.drop("Pclass", axis=1, inplace=True)

In [29]:
titanic.head()

Unnamed: 0,Sex,Age,SibSp,Parch,Fare,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Officer,Title_Royalty,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3
0,0,22.0,1,0,7.25,0,0,1,0,0,0,0,0,1,0,0,1
1,1,38.0,1,0,71.2833,0,0,0,1,0,0,1,0,0,1,0,0
2,1,26.0,0,0,7.925,0,1,0,0,0,0,0,0,1,0,0,1
3,1,35.0,1,0,53.1,0,0,0,1,0,0,0,0,1,1,0,0
4,0,35.0,0,0,8.05,0,0,1,0,0,0,0,0,1,0,0,1


## Model training

In [30]:
train = titanic.iloc[:891]
test = titanic.iloc[891:]
targets = survived
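
The split relies on the combined frame keeping the training rows first. A minimal sketch of the same split that derives the boundary from the saved target instead of hard-coding 891 (`combined`, `target`, `train_part` and `test_part` are illustrative names):

```python
import pandas as pd

# Illustrative stand-ins for the combined frame and the saved target.
combined = pd.DataFrame({"Sex": [0, 1, 1, 0, 1]})
target = pd.Series([0, 1, 1])  # one label per training row

n_train = len(target)  # avoids hard-coding the row count
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]

# Every training row must have exactly one label.
assert len(train_part) == len(target)
print(len(train_part), len(test_part))
```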

### Model selection

We try various models with 10-fold cross-validation to see which one we should choose.

In [31]:
MAX_ITER = 10000
logreg = LogisticRegression(max_iter=MAX_ITER)
svc = SVC(max_iter=MAX_ITER)
linearSVC = LinearSVC(max_iter=MAX_ITER)
knn = KNeighborsClassifier()
gaussianNB = GaussianNB()
perceptron = Perceptron(max_iter=MAX_ITER)
sgd = SGDClassifier(max_iter=MAX_ITER)
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier()
    
models = [logreg, svc, linearSVC, knn, gaussianNB, perceptron, sgd, decision_tree, random_forest]

In [32]:
models_results = pd.DataFrame(columns=['Score'])
for model in models:
    name = model.__class__.__name__
    score = np.mean(cross_val_score(model, train, targets, cv = 10, scoring='accuracy'))
    models_results.loc[name] = round(score*100, 2)
    
models_results.sort_values(by=['Score'], ascending=False, inplace=True)
models_results

Unnamed: 0,Score
LinearSVC,83.17
LogisticRegression,82.61
RandomForestClassifier,81.27
GaussianNB,80.59
SGDClassifier,79.58
DecisionTreeClassifier,78.12
Perceptron,77.56
SVC,74.76
KNeighborsClassifier,71.62
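
SVC and KNN rely on distances between samples, so unscaled features with a wide range, such as Fare, may partly explain their lower scores; this is not verified in the notebook. A minimal sketch of how that could be tested, using synthetic data from `make_classification` and a `StandardScaler` inside a pipeline (the data and the `candidates` dictionary are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, with one feature inflated to mimic a large-scale column.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 100

candidates = {
    "SVC": SVC(),
    "SVC + scaling": make_pipeline(StandardScaler(), SVC()),
    "KNN": KNeighborsClassifier(),
    "KNN + scaling": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# The scaler is fitted inside each fold, so nothing leaks from the held-out part.
scores = {
    name: np.mean(cross_val_score(model, X, y, cv=5, scoring="accuracy"))
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same pipelines could be added to the `models` list above to see whether scaling changes the ranking on the Titanic features.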


Let's try to tune the Random Forest model.

### Random Forest tuning

In [33]:
parameter_grid = {
                 'max_depth' : [4, 6, 8],
                 'n_estimators': [50, 10],
                 # note: 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
                 'max_features': ['sqrt', 'auto', 'log2'],
                 'min_samples_split': [2, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [True, False],
                 }
forest = RandomForestClassifier()
cross_validation = StratifiedKFold(n_splits=5)

grid_search = GridSearchCV(forest,
                            scoring='accuracy',
                            param_grid=parameter_grid,
                            cv=cross_validation,
                            verbose=1
                            )

grid_search.fit(train, targets)
parameters = grid_search.best_params_

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Best score: 0.8395061728395061
Best parameters: {'bootstrap': False, 'max_depth': 8, 'max_features': 'auto', 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 10}


[Parallel(n_jobs=1)]: Done 1620 out of 1620 | elapsed:  1.1min finished
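
The number of fits reported above is the product of the grid sizes and the number of folds: 3 × 2 × 3 × 3 × 3 × 2 = 324 candidates, times 5 folds = 1620 fits. A minimal sketch that counts the candidates with `ParameterGrid` before launching the search:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
parameter_grid = {
    "max_depth": [4, 6, 8],
    "n_estimators": [50, 10],
    "max_features": ["sqrt", "auto", "log2"],
    "min_samples_split": [2, 3, 10],
    "min_samples_leaf": [1, 3, 10],
    "bootstrap": [True, False],
}
n_folds = 5

# ParameterGrid enumerates every combination without fitting anything.
n_candidates = len(ParameterGrid(parameter_grid))
print(n_candidates, n_candidates * n_folds)
```

Counting first gives a rough idea of the running time, since each extra value in any list multiplies the total.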


This is the best score (83.95%), so we use Random Forest to make our predictions.

### Predictions

Now we can fit our model and generate output for Kaggle submission.

In [34]:
model = RandomForestClassifier(**parameters)
model.fit(train, targets)
output = model.predict(test).astype(int)
aux = pd.read_csv('./data/test.csv')
df_output = pd.DataFrame()
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
# the path has no placeholder, so the stray .format(name) call is dropped
df_output[['PassengerId','Survived']].to_csv('./predictions/predictions.csv', index=False)
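
`to_csv` does not create missing directories, so the write above fails if `./predictions` does not exist. A minimal sketch of building and writing the submission file with the directory created first; the ids and predictions are made up, and a temporary directory stands in for `./predictions`:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative ids and predictions; in the notebook these come from
# test.csv and model.predict.
passenger_ids = pd.Series([892, 893, 894])
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

# A temporary directory stands in for ./predictions here.
out_dir = os.path.join(tempfile.mkdtemp(), "predictions")
os.makedirs(out_dir, exist_ok=True)  # create the directory if it is missing
out_path = os.path.join(out_dir, "predictions.csv")
submission.to_csv(out_path, index=False)

written = pd.read_csv(out_path)
print(written.shape)
```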

## Conclusion

The submission scored 0.79904, placing 1489/10003 (top 15%) on the Kaggle leaderboard.