<style>
@import url('https://fonts.googleapis.com/css2?family=Pangolin&display=swap');
</style>

<h1>
    <span style = 'font-family : Pangolin, cursive;'>
    Titanic Experimentation!
    </span>
</h1>
 

<span style = "color : blue">This is my second attempt at the Titanic dataset, made mainly to experiment with techniques.</span>
<hr>

You can view my first notebook [here](https://www.kaggle.com/duttasd28/titanic-0-8-accuracy-nn).

# Import necessary libraries and manage DataFrames

We will import the necessary libraries, using pandas and NumPy for data handling.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Read data
# Test - to be predicted
test = pd.read_csv('../input/titanic/test.csv')
# Train - training data
train = pd.read_csv('../input/titanic/train.csv')

In [None]:
# Get PassengerId from the test set before dropping it; it is needed later for the submission file
testPassengerIds = test['PassengerId']
testPassengerIds.head()

In [None]:
# Drop some columns
train.drop(['PassengerId', 'Name', 'Ticket'], inplace = True, axis = 1)
test.drop(['PassengerId', 'Name', 'Ticket'], inplace = True, axis = 1)

In [None]:
train.head()

In [None]:
test.head()

# Imputation

There are many null values in the data. We need to impute them, that is, fill them in.

In [None]:
train.isnull().any()

In [None]:
test.isnull().any()

## 1. Cabin

We will fill the `Cabin` column with our own label mapping.

In [None]:
train['Cabin'].unique()

Let us replace the values with their first letters. For example, **A3** will become **A**. Also, let's assign numeric values to the letters.

So a value like 'A32' will become 'A' and then 6.

In [None]:
# Dictionary for mapping
# Replace each cabin with the label mapped from its first letter

# Missing cabins become 0; getCabin also maps any unknown first character to 0
train['Cabin'] = train['Cabin'].fillna(0)
def getCabin(value):
    val_dict = {
        'A' : 6,
        'B' : 5,
        'C' : 4,
        'D' : 3,
        'E' : 2,
        'F' : 1,
        'T' : 1   # Treat T the same as F, assuming it is a data error
    }
    return val_dict.get(str(value)[0], 0)

train['Cabin'] = train["Cabin"].apply(getCabin)
test['Cabin'] = test['Cabin'].apply(getCabin)

In [None]:
train.head()
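To make the mapping concrete, here is a minimal sketch on a few hypothetical cabin values (the names `DECK_CODES`, `deck_code` and `sample` are made up for illustration). It mirrors the dictionary above and shows that missing values and unlisted letters both end up as 0.

```python
import numpy as np
import pandas as pd

# Deck letter -> numeric label, mirroring the mapping used above
DECK_CODES = {'A': 6, 'B': 5, 'C': 4, 'D': 3, 'E': 2, 'F': 1, 'T': 1}

def deck_code(value):
    """Return the label for the first character of a cabin value (0 if unknown)."""
    return DECK_CODES.get(str(value)[0], 0)

# Hypothetical values: a plain cabin, a multi-cabin entry, a missing value,
# the lone 'T' cabin, and a letter that is not in the dictionary
sample = pd.Series(['A32', 'C23 C25 C27', np.nan, 'T', 'G6'])
codes = sample.apply(deck_code)
print(codes.tolist())  # [6, 4, 0, 1, 0]
```

Note that `str(np.nan)` is `'nan'`, so a missing cabin is looked up as `'n'` and falls back to 0 even without a prior `fillna`.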

## 2. Embarked

The `Embarked` column has few missing values, so let us fill them with the most common value, that is, the `mode`.

In [None]:
# Fill with the most common value from the training data
embarkedMode = train['Embarked'].mode().item()
train['Embarked'] = train['Embarked'].fillna(embarkedMode)
# Use the training mode for test as well, to ensure no discrepancy
test['Embarked'] = test['Embarked'].fillna(embarkedMode)

Let us use Plotly to visualise the number of survivors by port, class and sex interactively.

In [None]:
import plotly.express as px
fig = px.sunburst(train, path=['Embarked', 'Pclass', 'Sex'], values='Survived', title = 'Embarked -> Class -> Sex')
fig.show()

In [None]:
# One-hot encode the categorical columns
train = pd.get_dummies(train, drop_first=True)
test = pd.get_dummies(test, drop_first=True)
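
Encoding train and test separately only works if both contain the same categories. The sketch below uses two hypothetical toy frames (`toy_train`, `toy_test`) to show what `drop_first=True` produces and one possible way to align the test columns to the training columns with `reindex`. This alignment step is an addition for illustration, not part of the notebook above.

```python
import pandas as pd

# Hypothetical toy frames; the test frame happens to contain no 'Q' port
toy_train = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                          'Embarked': ['S', 'C', 'Q']})
toy_test = pd.DataFrame({'Sex': ['female', 'male'],
                         'Embarked': ['S', 'C']})

# drop_first=True drops the first category of each column ('female' and 'C')
enc_train = pd.get_dummies(toy_train, drop_first=True)
enc_test = pd.get_dummies(toy_test, drop_first=True)

# Align test to the training columns; categories unseen in test become zeros
enc_test = enc_test.reindex(columns=enc_train.columns, fill_value=0)
print(enc_train.columns.tolist())  # ['Sex_male', 'Embarked_Q', 'Embarked_S']
```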

## 3. Age
We will impute age with the median age of the training data.

In [None]:
# Impute age with the training median
medianAge = train['Age'].median()
train['Age'] = train['Age'].fillna(medianAge)
test['Age'] = test['Age'].fillna(medianAge)

In [None]:
# `|` is the element-wise OR operator: flags columns with missing values in train or test
train.isnull().any() | test.isnull().any()

We see that the test data has only `Fare` missing.

In [None]:
# Impute fare with the mean of the training data
meanFare = train['Fare'].mean()
test['Fare'] = test['Fare'].fillna(meanFare)
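
The same "learn the statistic on train, reuse it on test" pattern can also be written with scikit-learn's `SimpleImputer`. This is only a sketch on hypothetical toy data (`toy_train`, `toy_test`), not what the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with gaps in both splits
toy_train = pd.DataFrame({'Age': [22.0, np.nan, 30.0, 40.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 50.0]})

imputer = SimpleImputer(strategy='median')
# fit_transform learns the median from the training data only
toy_train[['Age']] = imputer.fit_transform(toy_train[['Age']])
# transform reuses the training median, so no test statistics leak in
toy_test[['Age']] = imputer.transform(toy_test[['Age']])

print(imputer.statistics_)  # the learned training median
```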

In [None]:
train.head()

In [None]:
test.head()

# Feature Engineering

The number of features is small, so let us use feature engineering to add more.

In [None]:
# Combine passenger class with the log of the fare
def correctedLog(value):  # log(1 + x), so that a zero fare does not give -infinity
    return np.log(1 + value)

train['Status'] = train['Pclass'] + train['Fare'].apply(correctedLog)
test['Status'] = test['Pclass'] + test['Fare'].apply(correctedLog)
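
As a side note, NumPy already ships this shifted logarithm as `np.log1p`, which works on a whole column at once. A small sketch with hypothetical fares:

```python
import numpy as np
import pandas as pd

# Hypothetical fares, including a zero fare
fares = pd.Series([0.0, 7.25, 512.3292])

log_manual = np.log(1 + fares)   # what correctedLog computes, element by element
log_builtin = np.log1p(fares)    # the same quantity, vectorised

print(np.allclose(log_manual, log_builtin))
```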

In [None]:
train['RootAgeTimesClass'] = train['Age'].apply(np.sqrt) * train['Pclass']
test['RootAgeTimesClass'] = test['Age'].apply(np.sqrt) * test['Pclass']

In [None]:
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
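
Each frame's `FamilySize` has to come from its own columns. The sketch below, on hypothetical toy frames, shows why this matters: pandas aligns on the index, so a Series built from another frame is copied in row by row without any error.

```python
import pandas as pd

# Hypothetical toy frames with different contents
toy_train = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})
toy_test = pd.DataFrame({'SibSp': [0, 5], 'Parch': [1, 2]})

# Built from toy_train: index alignment silently copies its rows 0 and 1
wrong = toy_test.assign(FamilySize=toy_train['SibSp'] + toy_train['Parch'] + 1)
# Built from toy_test: the intended feature
right = toy_test.assign(FamilySize=toy_test['SibSp'] + toy_test['Parch'] + 1)

print(wrong['FamilySize'].tolist())  # [2, 1]
print(right['FamilySize'].tolist())  # [2, 8]
```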

In [None]:
train.head()

In [None]:
train['Young'] = train['Age'] <= train['Age'].mean()
test['Young'] = test["Age"] <= train['Age'].mean()

In [None]:
train['YoungMale'] = train['Young'] & train['Sex_male']
test['YoungMale'] = test["Young"] & test['Sex_male']

In [None]:
train.head()

In [None]:
X = train.iloc[:, 1: ]
y = train.iloc[:, 0]
y.shape, X.shape

# Visualisation

In [None]:
import matplotlib.pyplot as plt
plt.hist(train['Fare']);


In [None]:
# Visualising the Box Cox Transform
from scipy.stats import boxcox
xt,_ = boxcox(train['Fare'] + 1)
plt.hist(xt);

<h2><span style = 'text-shadow: 2px 2px 5px green;'>Box Cox Transform</span></h2>

The Box Cox transform converts skewed values to an approximately normal distribution. I came across it and thought, let's apply it!

In [None]:
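# Apply Box Cox Transformation (lambda is estimated on train and reused for test)
train['Fare'], maxlog = boxcox(train['Fare'] + 1)
test['Fare'] = boxcox(test['Fare'] + 1, lmbda = maxlog)
# X was sliced before this step, so rebuild it to pick up the transformed Fare
X = train.iloc[:, 1: ]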
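
For a positive value x, the transform is (x^λ − 1) / λ when λ ≠ 0 and log(x) when λ = 0. Here is a small sketch on synthetic skewed data (the arrays `train_fare` and `test_fare` are made up, not the Titanic fares) showing the two calling conventions of `boxcox` and how `scipy.special.inv_boxcox` undoes the transform.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-ins for fares
train_fare = rng.lognormal(mean=3.0, sigma=1.0, size=500)
test_fare = rng.lognormal(mean=3.0, sigma=1.0, size=100)

# Without lmbda, boxcox estimates lambda and returns (values, lambda)
train_bc, lam = boxcox(train_fare + 1)
# With lmbda given, it returns only the transformed values
test_bc = boxcox(test_fare + 1, lmbda=lam)

# The same transform written out by hand (valid because lam != 0 here)
manual = ((test_fare + 1) ** lam - 1) / lam
# Undo the transform and the +1 shift
recovered = inv_boxcox(test_bc, lam) - 1

print(np.allclose(manual, test_bc), np.allclose(recovered, test_fare))
```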

# Train Test Split

We hold out 20% of the training data as a validation set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state =55, test_size = 0.2, shuffle = True)
y_train.shape, y_val.shape

In [None]:
X_train.head()

# Model Fitting

We will use a Random Forest model for predictions. We will also use GridSearchCV to tune the hyperparameters.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
np.random.seed(0)

In [None]:
# Parameters for Grid Search
paramDict = {
    'n_estimators' : [5, 10, 25, 50, 75, 100, 200, 500],
    'max_depth' : [4, 8, 10, 15, 20, 50],
    
}
# Random Forest Model
model = RandomForestClassifier(n_jobs = 8)
# Grid Search CV
clf = GridSearchCV(estimator=model, param_grid=paramDict, n_jobs=10)

In [None]:
clf.fit(X_train, y_train)

Get the best parameters and score

In [None]:
clf.best_params_, clf.best_score_
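
To see what the grid search does in isolation, here is a small sketch on synthetic data from `make_classification`. The grid is deliberately smaller than the one above, and `scoring='f1'` is an assumption made for illustration (the notebook relies on the default accuracy scoring).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, not the Titanic features
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 50], 'max_depth': [4, 8]},
    cv=5,          # 5-fold cross-validation, the scikit-learn default
    scoring='f1',  # assumption: optimise F1 instead of accuracy
)
search.fit(X_toy, y_toy)

# best_score_ is the mean cross-validated score of the best combination
print(search.best_params_, round(search.best_score_, 3))

# With refit=True (the default), best_estimator_ is already refit on all of X_toy
toy_preds = search.best_estimator_.predict(X_toy)
```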

In [None]:
# f1_score expects (y_true, y_pred)
f1_score(y_val, clf.predict(X_val))

## Final Submission
For training the final model, we will use all of the training data, which should give us an additional boost.

In [None]:
# Make model with best parameters, fit with all data now
finalModel = RandomForestClassifier(**clf.best_params_)

# Fit Data
finalModel.fit(X, y)

# Generate Predictions
y_preds = finalModel.predict(test)

#########################################################################
# Submission File Generation
file_name = "Submission_16_08_6.csv"

y_pred_series = pd.Series(y_preds.flatten(), name = 'Survived')

file = pd.concat([testPassengerIds, y_pred_series], axis = 1)

file.to_csv(file_name, index = False);

Hope you like my work! If you do, please UPvote! 😃😃