# Titanic - Machine Learning from Disaster

![](https://upload.wikimedia.org/wikipedia/commons/6/6e/St%C3%B6wer_Titanic.jpg)

***The RMS Titanic sank in the early morning hours of 15 April 1912 in the North Atlantic Ocean, four days into her maiden voyage from Southampton to New York City. The largest ocean liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg at around 23:40 (ship's time) on Sunday, 14 April 1912. Her sinking two hours and forty minutes later at 02:20 (ship's time; 05:18 GMT) on Monday, 15 April, resulted in the deaths of more than 1,500 people, making it one of the deadliest peacetime maritime disasters in history.*** 

# Table of Contents

* [Introduction](#introduction)
* [House Keeping](#house)
* [Exploratory Data Analysis](#EDA)
* [Feature Selection](#feature)
* [Final Processing](#final)
* [Modelling](#model)
* [Model Tuning - Hyperparameter GridSearch](#tuning)
* [Model Performance](#performance)
* [To Do in future versions!](#future)

# Introduction <a id="introduction"></a>

Analysis, feature engineering and modelling of the Titanic dataset from [Kaggle](https://www.kaggle.com/competitions/titanic/overview).

This notebook is my first attempt at a thorough analysis of the Titanic dataset. The goal is to predict which passengers survived the sinking of the Titanic based on passenger information such as age, sex and socio-economic status.

I tried several models, both with and without tuning, to improve my result and learn along the way.

**Best performing model: 83.4%**

**Hope you enjoy, let me know how I can improve, and if you liked it, an upvote would help me out a lot!**

## Columns in the dataset

The columns present in the dataset are as follows: 
1. **PassengerId**: This column assigns a unique identifier for each passenger.
2. **Survived**: Specifies whether the given passenger survived or not (1 - survived, 0 - didn't survive)
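3. **Pclass**: The passenger's ticket class, a proxy for socio-economic status (1 = first/upper class, 2 = second/middle class, 3 = third/lower class). This notebook also uses it as a rough stand-in for how high up in the ship the passenger was accommodated.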
4. **Name**: The name of the passenger. 
5. **Sex**: The sex of the passenger (male, female)
6. **Age**: The age of the passenger in years. If the age is estimated, it is in the form of xx.5. 
7. **SibSp**: How many siblings or spouses the passenger had on board with them. Sibling = brother, sister, stepbrother, stepsister; spouse = husband, wife (mistresses and fiancés were ignored).
8. **Parch**: How many parents or children the passenger had on board with them. Parent = mother, father; child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, so Parch = 0 for them.
9. **Ticket**: The ticket of the passenger. 
10. **Fare**: The fare amount paid by the passenger for the trip. 
11. **Cabin**: The cabin in which the passenger stayed. 
12. **Embarked**: The place from which the passenger embarked (S, C, Q)

# House Keeping <a id="house"></a>

## Import Libraries, load dataset and do a short summary

In [None]:
# import libraries
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("YlGnBu")

from sklearn.preprocessing import StandardScaler

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV 

from sklearn.model_selection import cross_val_score
#from sklearn.metrics import classification_report

# load datasets
df_train = pd.read_csv('/kaggle/input/titanic/train.csv')
df_test = pd.read_csv('/kaggle/input/titanic/test.csv')
df_gender_submission = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

# mark train and test sets for future split
df_train['train_test'] = 1
df_test['train_test'] = 0
df_test['Survived'] = np.nan  # np.NaN was removed in NumPy 2.0

#combine to a single dataframe with all data for feature engineering
df_all = pd.concat((df_train, df_test))

# print dataset shape and columns
print(f'''
Train Dataset:
Loaded train dataset with shape {df_train.shape} ({df_train.shape[0]} rows and {df_train.shape[1]} columns)

Test Dataset:
Loaded test dataset with shape {df_test.shape} ({df_test.shape[0]} rows and {df_test.shape[1]} columns)

Sample Submission Dataset:
Loaded sample submission dataset with shape {df_gender_submission.shape} ({df_gender_submission.shape[0]} rows and {df_gender_submission.shape[1]} columns)
''')

## Train dataset

In [None]:
df_train.head()

In [None]:
df_train.info()

In [None]:
df_train.describe()

## Initial thoughts

* The **PassengerId** column shouldn't provide any useful information about survival, so it should be dropped.
* The **Fare** column has a long tail on the high end: Q3 (75%) = 31 and max = 512. Maybe outliers?
* There are null values in **Age**, **Cabin** and **Embarked**, and these need to be handled. **Age** and **Embarked** are probably missing data, while **Cabin** may simply be empty because not every passenger had a cabin.
* There are both numerical and categorical columns. They should be examined further and either scaled or one-hot encoded to improve model performance.
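
The first checks from the list above can be sketched in a few lines. This is only an illustration on a small made-up frame (the `toy` values are hypothetical, not rows from the dataset), and the 1.5 × IQR fence is a common rule of thumb that I am assuming here, not something the notebook relies on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0, np.nan, 2.0],
    "Fare": [7.25, 8.05, 13.0, 26.0, 31.0, 512.33],
    "Cabin": [np.nan, "C85", np.nan, "E46", np.nan, np.nan],
})

# Number of missing values per column
null_counts = toy.isna().sum()
print(null_counts)

# Flag fares above Q3 + 1.5 * IQR (a rule of thumb, not a hard rule)
q1, q3 = toy["Fare"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
fare_outliers = toy.loc[toy["Fare"] > upper_fence, "Fare"]
print(fare_outliers)
```

On the real data the same two calls would run against ```df_train``` directly.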

# Exploratory Data Analysis <a id="EDA"></a>

## Survival ratio

In [None]:
df_survived = df_train['Survived']

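# sort_index() orders the counts by label: 0 (did not survive) first, then 1 (survived)
mortality, survival = df_survived.value_counts().sort_index()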

print(f'''
There were {survival} survivors and {mortality} mortalities in the train set.
Making the survival rate {df_survived.mean():.2%}
''')

sns.countplot(x = df_survived)
plt.title('Distribution of survival or mortality')

## Name

In [None]:
# Extract titles
df_train['Title'] = (df_train['Name'].str.split(',', expand=True)[1]
                                    .str.split('.', expand=True)[0]
                                    .str.strip())

# List most frequent titles
(df_train['Title'].value_counts()
                  .rename_axis('Title')
                  .reset_index(name='Frequency')
                  .head(6))

## Ticket Class

In [None]:
df_pclass = df_train['Pclass']

# sort_index() orders the counts by class (1, 2, 3) rather than by frequency
first, second, third = df_pclass.value_counts().sort_index()

print(f'''
Passengers were split into three ticket classes:
There were {first} passengers in first class (upper).
There were {second} passengers in second class (middle).
There were {third} passengers in third class (lower).
''')

sns.countplot(x = df_pclass)
plt.title('Distribution of ticket classes')

## Passenger sex

In [None]:
df_sex = df_train['Sex']

# sort_index() orders the labels alphabetically: 'female' first, then 'male'
female, male = df_sex.value_counts().sort_index()

print(f'''
There were {male} males aboard.
There were {female} females aboard.
''') 

sns.countplot(x = df_sex)
plt.title('Distribution of passenger sex')

## Passenger age

In [None]:
df_age = df_train['Age']

print(f'''
There were {np.count_nonzero(df_age < 25)} passengers under the age of 25.
There were {np.count_nonzero((df_age >= 25) & (df_age <= 65))} passengers between the ages of 25 and 65.
There were {np.count_nonzero(df_age > 65)} passengers older than 65.
''') 

sns.histplot(data = df_age)
plt.title('Distribution of passenger age')

## Number of siblings/spouses

In [None]:
df_sibsp = df_train['SibSp']

print(f'There were {df_sibsp.value_counts().sort_index()[0]} passengers with no siblings or spouses.')

sns.countplot(x = df_sibsp)
plt.title('Distribution of number of siblings/spouses aboard')

## Number of parents/children

In [None]:
df_parch = df_train['Parch']

print(f'There were {df_parch.value_counts().sort_index()[0]} passengers with no parents or children.')

sns.countplot(x = df_parch)
plt.title('Distribution of number of parents/children aboard')

## Tickets

In [None]:
df_ticket = df_train['Ticket']

# number of passengers travelling on each ticket number
ticket_counts = df_ticket.value_counts()

print(f'''
There were {np.count_nonzero(ticket_counts == 1)} passengers travelling on a ticket of their own.
There were {ticket_counts[ticket_counts > 1].sum()} passengers sharing a ticket with at least one other passenger.
''') 

sns.histplot(data = df_ticket.value_counts())
plt.title('Distribution of people per ticket')

## Fare

In [None]:
df_fare = df_train['Fare']

print(f'''
There were {np.count_nonzero(df_fare < 10)} passengers who paid less than 10 dollars for their ticket.
There were {np.count_nonzero((df_fare >= 10) & (df_fare <= 50))} passengers who paid between 10 and 50 dollars for their ticket.
There were {np.count_nonzero(df_fare > 50)} passengers who paid more than 50 dollars for their ticket.
''') 


sns.histplot(data = df_fare)
plt.title('Distribution of fares')

## Cabin

In [None]:
df_cabin = df_train['Cabin']

print(f'''
There were {df_cabin.notna().astype(int).sum()} passengers who had a cabin. 
There were {df_cabin.isna().astype(int).sum()} passengers who did not have a cabin.
''') 

df_cabin = np.where(df_cabin.isna(), 0, 1)

sns.countplot(x = df_cabin)
plt.title('Distribution of number of passengers with a cabin')

## Port of Embarkation

In [None]:
df_port = df_train['Embarked']

C, Q, S = df_port.value_counts().sort_index()

print(f'''
There were {S} passengers boarding the ship at Southampton.
There were {C} passengers boarding the ship at Cherbourg.
There were {Q} passengers boarding the ship at Queenstown.
''') 

sns.countplot(x = df_port)
plt.title('Distribution of port of embarkation')

## Survival rate factors

In [None]:
sns.catplot(data=df_train, 
            x="Sex", 
            y="Survived", 
            hue="Pclass", 
            kind="bar")

plt.title('Survival rate based on sex and passenger class')

# Feature Selection <a id="feature"></a>

One of the best ways of getting an overview of how the columns relate to each other is a **correlation matrix**. A **correlation matrix** holds the correlation coefficient of every pair of columns. A score between -1 and 0 indicates a negative correlation and a score between 0 and 1 a positive correlation, with values closer to -1 or 1 indicating stronger correlations.


A **heatmap** takes this a step further, adding a color scale to the values, making correlations easier to spot at a glance.
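
To make the sign convention concrete, here is a minimal sketch on a tiny made-up frame (the `toy` values are hypothetical and chosen only so the signs are easy to see). It uses the same `DataFrame.corr()` call as the cells in this section, which computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical toy data: higher class number goes with lower fare and lower survival
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Fare":     [80.0, 60.0, 26.0, 13.0, 8.0, 7.9, 7.2, 7.8],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Pairwise Pearson correlation of all numeric columns
corr = toy.corr()
print(corr.round(2))

# Correlation of every feature with the label, from most negative to most positive
print(corr["Survived"].drop("Survived").sort_values())
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would then draw the colored version.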

In [None]:
# change cabin names and numbers to cabin yes or no
df_train['Cabin'] = np.where(df_train['Cabin'].isna(), 0, 1)

# change male/female to 0 and 1
df_train['Sex'] = np.where(df_train['Sex'] == 'female', 1, 0)

# One-Hot encode Embarkation (done with pd.get_dummies() further down)
df_train.loc[df_train['Embarked'] == 'S', 'embarked_Southampton'] = 1
df_train.loc[df_train['Embarked'] == 'C', 'embarked_Cherbough'] = 1
df_train.loc[df_train['Embarked'] == 'Q', 'embarked_Queenstown'] = 1

df_train = df_train.drop('Embarked', axis = 1)

# fill the remaining nulls with 0; note that this also sets missing Age values to 0
df_train = df_train.replace(np.nan, 0)

In [None]:
# Correlation matrix of numerical columns
(df_train[[
          'PassengerId', 
          'Survived', 
          'Age',
          'SibSp',
          'Parch',
          'Fare',
          'Cabin',
          'Pclass',
          'embarked_Southampton',
          'embarked_Cherbough',
          'embarked_Queenstown']]
          .corr())

In [None]:
# Heatmap of correlation matrix for training data columns

fig, ax = plt.subplots(figsize=(12,8)) 

sns.heatmap((df_train[[
                      'PassengerId', 
                      'Survived', 
                      'Age',
                      'SibSp',
                      'Parch',
                      'Fare',
                      'Cabin',
                      'Pclass',
                      'embarked_Southampton',
                      'embarked_Cherbough',
                      'embarked_Queenstown'
                      ]]
                      .corr()),
                      linewidths=1,
                      cmap=plt.cm.Blues, 
                      annot=True,
                      ax=ax)

plt.title('Heatmap for correlation between columns of training data')

## Heatmap Conclusion

The **heatmap** above shows the correlation between all our features. Especially interesting is each feature's correlation with our label, ```Survived```. Remember that a positive number close to 1 indicates a strong positive correlation, while a negative number close to -1 indicates a strong negative correlation. Numbers closer to 0 indicate a weak correlation:

* ```Fare``` and ```Cabin``` seem to have the strongest positive correlations with our label (**0.26** and **0.32** respectively).
* ```Pclass``` has a clear negative correlation with our label (**-0.34**). Remember that first class is encoded as 1, second as 2 and third as 3, so the negative correlation with ```Survived``` has to be read in that context: a higher ```Pclass``` value means a lower ticket class, and a lower ticket class goes with a lower survival rate, which is what we would expect.
* Maybe a bit surprisingly, we see a positive correlation for ```embarked_Cherbough``` and a negative correlation for ```embarked_Southampton```. This is possibly due to where workers on the ship boarded, or to the boarding order: passengers who embarked later may have been placed higher up in the ship, giving them a better survival rate.
* ```SibSp```, ```Parch``` and ```Age``` all show some correlation. At first glance I would have assumed that ```Age``` had a larger influence on the survival rate, but the correlations don't support my hypothesis. Note that missing ```Age``` values were replaced with 0 before the correlations were computed, which may distort this number.

Based on these correlations, we can make better decisions on which features to keep, engineer or drop from the dataset before we start modelling.

# Final Processing <a id="final"></a>

Before we model our data, some processing is encouraged to increase model performance.

There are several ways to treat different data types, but the following are often a good first try:

* **Remove null values:** Most scikit-learn models don't accept null values, so these have to be imputed or dropped.
* **OneHotEncode categorical columns:** Encode categorical features in a more machine learning friendly format. Our ```Embarked``` feature, for example, is transformed into ```Embarked_C```, ```Embarked_Q``` and ```Embarked_S``` indicator columns showing where the passenger boarded.
* **Scale numerical data:** Numerical values on very different scales function poorly in many models. The ```StandardScaler``` used here standardizes each feature to zero mean and unit variance. This keeps the relative differences within each feature, but allows much better model performance with several features.
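
As a small illustration of the last two steps, the sketch below one-hot encodes and standardizes a made-up frame (the `toy` values are hypothetical). It uses `pd.get_dummies` and `StandardScaler`, the same tools as the processing cell in this section, and checks that the scaled column ends up with mean 0 and standard deviation 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one categorical and one numerical column
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 8.46, 53.10],
})

# One-hot encode: one indicator column per category
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded.columns.tolist())

# Standardize: subtract the column mean, divide by the column standard deviation
scaler = StandardScaler()
encoded[["Fare"]] = scaler.fit_transform(encoded[["Fare"]])
fare_mean = encoded["Fare"].mean()
fare_std = encoded["Fare"].std(ddof=0)  # population std, which is what StandardScaler uses
print(fare_mean, fare_std)
```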

Next, feature engineering can drastically improve model performance. The right approach differs based on dataset and model. For the Titanic dataset I have done the following, but there are many more ways to improve my features:

* **Change Cabin names and numbers:** Change ```Cabin``` to 1 or 0 for either having or not having a cabin.
* **Extract Titles:** The ```Name``` feature by itself provides little value in predicting whether a passenger survived or not. Extracting the titles from ```Name``` may provide better accuracy. In this version the title is extracted but dropped again before modelling.
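
Title extraction can also be done with a single regular expression instead of two splits. The sketch below is one possible way, assuming names follow the "Surname, Title. Given names" pattern; the example strings are only for illustration.

```python
import pandas as pd

# Example names in the "Surname, Title. Given names" format (illustrative)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss', 'the Countess']
```

Rare titles could then be grouped into a single category before one-hot encoding, which is left as an idea here.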

In [None]:
# impute nulls for continuous data 
df_all.Age = df_all.Age.fillna(df_train.Age.median())
df_all.Fare = df_all.Fare.fillna(df_train.Fare.median())

# drop the two rows with null Embarked values
df_all.dropna(subset=['Embarked'], inplace = True)

# change cabin names and numbers to cabin yes or no
df_all['Cabin'] = np.where(df_all['Cabin'].isna(), 0, 1)

# extract titles
df_all['Title'] = (df_all['Name'].str.split(',', expand=True)[1]
                                 .str.split('.', expand=True)[0])

# drop unneeded columns
df_all = (df_all.drop([
                      'PassengerId',
                      'Name',
                      'Title',
                      'Ticket'],
                      axis = 1
                      ))

df_all['Pclass'] = df_all['Pclass'].astype(str)

# make dummies (OneHotEncode) categorical variables
df_all_dummies = pd.get_dummies(df_all[['Pclass', 
                                        'Sex', 
                                        'Age',
                                        'SibSp', 
                                        'Parch', 
                                        'Fare', 
                                        'Cabin', 
                                        'Embarked', 
                                        'train_test']])

# Scale data
scaler = StandardScaler()
(df_all_dummies[['Age', 
                 'SibSp', 
                 'Parch', 
                 'Fare']]) = (scaler.fit_transform(df_all_dummies[['Age', 
                                                                   'SibSp', 
                                                                   'Parch', 
                                                                   'Fare']]))

# resplit into train and test sets
X_train = df_all_dummies[df_all_dummies.train_test == 1].drop(['train_test'], axis =1)
X_test = df_all_dummies[df_all_dummies.train_test == 0].drop(['train_test'], axis =1)
y_train = df_all[df_all['train_test'] == 1]['Survived']
y_test = df_all[df_all['train_test'] == 0]['Survived']

print(f'Before training models our train set has {X_train.shape} rows and columns and our test set has {X_test.shape} rows and columns.')

# Modelling <a id="model"></a>

We have a preprocessed and model-ready dataset, now to choose the right model. We are trying to predict which passengers survived the sinking of the Titanic, so all of the models we are testing are **classification** models.

To get an indication of the best model, I will try several baseline models without any tuning. After getting an indication of which models perform best on our dataset, I will use grid search to tune the model hyperparameters to further improve the accuracy.

* [Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) - Naive Bayes models use Bayes' theorem to estimate the probability of each class given the features, assuming the features are independent of each other.
* [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) - Decision Trees create a series of decisions to classify data based on the rules learned from the dataset.
* [KNeighbors Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) - Neighbor classifiers label a data point by a vote among the k data points nearest to it.
* [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) - Random Forest classifiers fit a number of decision trees on subsamples of the dataset and average them to improve accuracy and reduce over-fitting.
* [Support Vector Classifier (SVC)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [XGBoost Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) - Boosting classifiers build an additive model step by step, with each new tree fit to reduce the loss of the trees before it.
* [Voting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html) - A Voting Classifier trains several different models and combines their predictions, either by majority vote or, with ```voting='soft'``` as used in this notebook, by averaging the predicted probabilities.
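
The baseline comparison boils down to calling `cross_val_score` once per model. A compact way to write that is a loop over a dictionary of models, sketched below on synthetic data from `make_classification` (an assumption so the example runs on its own; in the notebook ```X_train``` and ```y_train``` would take its place). Only three of the models are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for X_train / y_train (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

models = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=2000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}
# Soft voting averages the predicted class probabilities of its members
models["Voting (soft)"] = VotingClassifier(
    estimators=list(models.items()), voting="soft"
)

scores = {}
for name, model in models.items():
    cv = cross_val_score(model, X, y, cv=5)  # accuracy is the default for classifiers
    scores[name] = cv.mean()
    print(f"{name}: {cv.mean():.3f} (+/- {cv.std():.3f})")
```

Keeping the results in a dictionary makes it easy to sort the models or turn the scores into a table afterwards.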

In [None]:
gnb = GaussianNB()
cv = cross_val_score(gnb, 
                     X_train, 
                     y_train, 
                     cv=5)

print(f'GaussianNB: \n{cv}\nAverage: {cv.mean()}\n')

lr = LogisticRegression(max_iter=2000)
cv = cross_val_score(lr, 
                     X_train, 
                     y_train, 
                     cv=5)

print(f'LogisticRegression: \n{cv}\nAverage: {cv.mean()}\n')

dt = tree.DecisionTreeClassifier(random_state=42)
cv = cross_val_score(dt, 
                     X_train, 
                     y_train, 
                     cv=5)

print(f'DecisionTreeClassifier: \n{cv}\nAverage: {cv.mean()}\n')

knn = KNeighborsClassifier()
cv = cross_val_score(knn, 
                     X_train, 
                     y_train, 
                     cv=5)

print(f'KNeighborsClassifier: \n{cv}\nAverage: {cv.mean()}\n')

rf = RandomForestClassifier(random_state=42)
cv = cross_val_score(rf, 
                     X_train, 
                     y_train, 
                     cv=5)

print(f'RandomForestClassifier: \n{cv}\nAverage: {cv.mean()}\n')

svc = SVC(probability=True)
cv = cross_val_score(svc, 
                     X_train, 
                     y_train, 
                     cv=5)

print(f'SVC: \n{cv}\nAverage: {cv.mean()}\n')

xgb = XGBClassifier(random_state = 42)
cv = cross_val_score(xgb, 
                     X_train, 
                     y_train, 
                     cv=5)

print(f'XGBClassifier: \n{cv}\nAverage: {cv.mean()}\n')

voting_clf = VotingClassifier(estimators=[
                                          ('lr', lr), 
                                          ('knn', knn), 
                                          ('rf', rf), 
                                          ('gnb', gnb), 
                                          ('dt', dt), 
                                          ('svc', svc), 
                                          ('xgb', xgb)],
                                          voting='soft'
                                          )
cv = cross_val_score(voting_clf, X_train, y_train, cv=5)

print(f'VotingClassifier: \n{cv}\nAverage: {cv.mean()}\n')

## Baseline submission (Voting Classifier)

In [None]:
voting_clf.fit(X_train, y_train)

y_hat_baseline = voting_clf.predict(X_test).astype(int)

baseline_submission = pd.DataFrame({'PassengerId': df_test.PassengerId, 
                                    'Survived': y_hat_baseline})

baseline_submission.to_csv('baseline_submission.csv', index=False)

## Baseline model performance

|Model|Baseline Performance|
|--|--|
|Naive Bayes| 77.0%|
|Logistic Regression| 80.5%| 
|Decision Tree Classifier| 78.4%|
|KNN Classifier| 80.5%|
|Random Forest Classifier| 80.5%|
|**Support Vector Classifier**| **82.5%**|
|Extreme Gradient Boosting (XGBoost)| 82.2%|
|Voting Classifier| 82.0%|

The **Support Vector Classifier** performed best before any model tuning. Next up is hyperparameter tuning using GridSearch. GridSearch tries every possible combination of the parameters in the ```parameter_grid``` to find the best performing model. It's like running the model a lot of times, searching for the best parameter combination. Some models, like **Naive Bayes**, have very few hyperparameters to tune, so they won't be tuned. 

A note on the **Random Forest Classifier**: Since the ```parameter_grid``` is so large, a GridSearch over all possible combinations would take way too long to train. One way of dealing with this issue is to first run a ```RandomizedSearchCV```, which looks at the same ```parameter_grid``` but, instead of trying all combinations, checks a fixed number of random parameter combinations. This drastically reduces the training time, but may miss the best combination of parameters. The best parameter combination from the ```RandomizedSearchCV``` is then used to guide a narrower GridSearch to find the best combination. This two-step search approach can prove more realistic when the hyperparameter space is very large, like in a **Random Forest** model.
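
The two-step search described above can be sketched as follows. This is a minimal illustration on synthetic data with a deliberately small grid and few trees so it runs quickly; the grids, `n_iter` and the way the narrow grid is built around the first result are my assumptions, not the values used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for X_train / y_train (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
rf = RandomForestClassifier(n_estimators=25, random_state=42)

# Step 1: sample a few random combinations from a wide grid
wide_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 4, 8],
    "min_samples_split": [2, 5, 10],
}
coarse = RandomizedSearchCV(rf, param_distributions=wide_grid, n_iter=6,
                            cv=3, random_state=42)
coarse.fit(X, y)
best = coarse.best_params_

# Step 2: exhaustive search in a narrow grid around the step-1 winner
leaf = best["min_samples_leaf"]
narrow_grid = {
    "max_depth": [best["max_depth"]],
    "min_samples_split": [best["min_samples_split"]],
    "min_samples_leaf": sorted({max(1, leaf - 1), leaf, leaf + 1}),
}
fine = GridSearchCV(rf, param_grid=narrow_grid, cv=3)
fine.fit(X, y)

print("Randomized search:", coarse.best_params_, round(coarse.best_score_, 3))
print("Grid search:      ", fine.best_params_, round(fine.best_score_, 3))
```

Because the narrow grid contains the winning combination from the first step, the second search can only match or improve on its cross-validated score.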

# Model Tuning - Hyperparameter GridSearch <a id="tuning"></a>

In [None]:
def model_performance(model, name):
    print(name)
    print(f'Best Score: {model.best_score_}')
    print(f'Best Parameters: {model.best_params_}\n')  

lr = LogisticRegression()

parameter_grid = {'max_iter' : [2000],
                  'penalty' : ['l1', 'l2'],
                  'C' : np.logspace(-4, 4, 20),
                  'solver' : ['liblinear']}

lr_model = GridSearchCV(lr, 
                        param_grid=parameter_grid, 
                        cv=5, 
                        verbose=True, 
                        n_jobs=1)
best_lr_model = lr_model.fit(X_train, y_train)
model_performance(best_lr_model, 'LogisticRegression')


knn = KNeighborsClassifier()

parameter_grid = {'n_neighbors' : [3,5,7,9],
                  'weights' : ['uniform', 'distance'],
                  'algorithm' : ['auto', 'ball_tree','kd_tree'],
                  'p' : [1,2]}

knn_model = GridSearchCV(knn, 
                         param_grid=parameter_grid, 
                         cv=5, 
                         verbose=True, 
                         n_jobs=-1)
best_knn_model = knn_model.fit(X_train, y_train)
model_performance(best_knn_model, 'KNN')


svc = SVC(probability=True)

parameter_grid = {'kernel': ['rbf'], 
                   'gamma': [.1,.5,1],
                   'C': [.1, 1, 10]}

svc_model = GridSearchCV(svc, 
                       param_grid=parameter_grid, 
                       cv=5, 
                       verbose=True, 
                       n_jobs=-1)
best_svc_model = svc_model.fit(X_train, y_train)
model_performance(best_svc_model,'SVC')


rf = RandomForestClassifier(random_state=42)

parameter_grid = {'n_estimators': [100,500], 
                  'bootstrap': [True,False],
                  'max_depth': [10,20,50,75,None],
                  'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
                  'min_samples_leaf': [1,2,4],
                  'min_samples_split': [2,5,10]}
                                  
rf_model_randomcv = RandomizedSearchCV(rf, 
                                param_distributions=parameter_grid, 
                                n_iter=50, 
                                cv=5, 
                                verbose=True, 
                                random_state=42,  # make the sampled combinations reproducible
                                n_jobs=-1)
best_rf_model_randomcv = rf_model_randomcv.fit(X_train, y_train)
model_performance(best_rf_model_randomcv, 'Random Forest (RandSearchCV)')


rf = RandomForestClassifier(random_state=42)

parameter_grid = {'n_estimators': [500,550,600],
              'criterion':['gini'],
              'bootstrap': [True],
              'max_depth': [10, 15, 20],
              'max_features': ['sqrt', 10],  # 'auto' was removed in scikit-learn 1.3
              'min_samples_leaf': [2,3],
              'min_samples_split': [2,3]}
                                  
rf_model = GridSearchCV(rf, 
                      param_grid = parameter_grid, 
                      cv=5, 
                      verbose=True, 
                      n_jobs=-1)
best_rf_model = rf_model.fit(X_train, y_train)
model_performance(best_rf_model, 'Random Forest (GridSearchCV)')


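# GridSearchCV has already refit the best estimator on the full training set
# (refit=True by default), so it can predict directly without running the search again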

y_hat_tuned = best_rf_model.predict(X_test).astype(int)

tuned_submission = pd.DataFrame({'PassengerId': df_test.PassengerId, 
                                 'Survived': y_hat_tuned})

tuned_submission.to_csv('tuned_submission.csv', index=False)

# Model performance <a id="performance"></a>

|Model|Scaled Performance|Scaled and Tuned Performance|
|--|--|--|
|Naive Bayes| 77.0%| NA|
|Logistic Regression| 80.5%| 80.7%|
|Decision Tree Classifier| 78.4%| NA|
|KNN Classifier| 80.5%|80.7%|
|**Random Forest Classifier**| 80.5%| **83.2%**|
|Support Vector Classifier| **82.5%**| 82.9%|
|Extreme Gradient Boosting (XGBoost)| 82.2%| 81.0%|
|Voting Classifier| 82.0%| NA|

Wow! Tuning the hyperparameters really improved the model performance, especially for the **Random Forest Classifier.**

There is plenty more to try with the modelling, but we will accept this for now.

**Feel free to try other models and let me know!**

## Submissions

Submissions of the baseline and tuned models. Both submission .csv files are made to match the sample submission file with ```PassengerId``` and ```Survived``` columns.

In [None]:
baseline_submission

In [None]:
tuned_submission

# To do in future versions! <a id="future"></a>

* Exploratory Data Analysis:
    * Perform further EDA
* Feature Engineering:
    * Group and bin age and fare features
    * Fare/Ticket# Column
    * Remove fare outliers
    * SMOTE Survival rates?
* Model Evaluation:
    * Make classification_report and ROC-curve
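
For the model evaluation item, one possible starting point is sketched below. It uses out-of-fold predictions from `cross_val_predict`, so the report is not computed on data the model was trained on. The data is synthetic and the model is a plain logistic regression; both are assumptions for illustration, and the class names are only labels I picked for the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for X_train / y_train (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
model = LogisticRegression(max_iter=2000)

# Out-of-fold probabilities: every row is predicted by a model that never saw it
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)

# Precision, recall and F1 per class
print(classification_report(y, pred, target_names=["Died", "Survived"]))

# Points of the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(y, proba)
auc = roc_auc_score(y, proba)
print(f"ROC AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` with matplotlib would then give the ROC curve itself.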

In [None]:
#%%capture --no-display
#xgb = XGBClassifier(random_state=42)

#parameter_grid = {
              #'n_estimators': [450,500,550],
              #'colsample_bytree': [0.75,0.8,0.85],
              #'max_depth': [None],
              #'reg_alpha': [1],
              #'reg_lambda': [2, 5, 10],
              #'subsample': [0.55, 0.6, .65],
              #'learning_rate':[0.5],
              #'gamma':[.5,1,2],
              #'min_child_weight':[0.01],
              #'sampling_method': ['uniform']}

#xgb_model = GridSearchCV(xgb, 
                         #param_grid=parameter_grid, 
                         #cv=5, 
                         #verbose=True, 
                         #n_jobs=-1)

#best_xgb_model = xgb_model.fit(X_train, y_train)

#model_performance(best_xgb_model, 'XGB')