# Welcome to Session 3 of Head Start Machine Learning Workshop! 
---
Today, you are going to:
- visualize the data
- clean the dataset
- learn about hyperparameter tuning


<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/dscum/Head-Start-ML/blob/main/session-3/workshop%203%20nb%20(live%20ver).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

## Import Libraries and Datasets

Steps to import datasets to Google Colab:
1. Download train.csv and test.csv from Kaggle

 <td>
    <a target="_blank" href="https://www.kaggle.com/c/titanic/data"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
2. Click `files` from Google Colab (left sidebar) and upload them into Google Colab

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train_df = pd.read_csv('train.csv')
train_df.head()

In [None]:
test_df = pd.read_csv('test.csv')
test = test_df.copy()
test_df.head()

In [None]:
print(train_df.info())
print('-'*40)
print(test_df.info())

In [None]:
print(train_df.columns.values)
print('-'*40)
print(test_df.columns.values)

From here, we can see that test.csv does not have the `Survived` attribute; this is what we need to predict.
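The same comparison can be made programmatically rather than by eye. The sketch below uses two tiny stand-in frames (their column lists are shortened for illustration, not the full Kaggle schema) and `Index.difference` to list columns present in the training set but absent from the test set.

```python
import pandas as pd

# Shortened stand-ins for train_df and test_df (illustrative column subsets only)
train_cols = pd.DataFrame(columns=['PassengerId', 'Survived', 'Pclass', 'Name'])
test_cols = pd.DataFrame(columns=['PassengerId', 'Pclass', 'Name'])

# Columns that exist in the training frame but not in the test frame
missing = train_cols.columns.difference(test_cols.columns)
print(list(missing))
```

In the notebook itself, the equivalent call would be `train_df.columns.difference(test_df.columns)`.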

## Data Visualization

### Pclass

**pclass**: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

In [None]:
print(train_df['Pclass'].value_counts())

In [None]:
print(train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False))
sns.countplot(x='Pclass', hue='Survived', data=train_df)

Assumption 1: The higher the passenger class, the higher the survival rate.

### Sex

In [None]:
print(train_df['Sex'].value_counts())

In [None]:
print(train_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False))
sns.countplot(x='Sex', hue='Survived', data=train_df)

Assumption 2: Females have a higher survival rate.

### Pclass, Sex & Age

In [None]:
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', hue='Sex')
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

### SibSp

**sibsp**: The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

In [None]:
print(train_df['SibSp'].value_counts())

In [None]:
print(train_df[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False))
sns.countplot(x='SibSp', hue='Survived', data=train_df)

### Parch

**parch**: The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
print(train_df['Parch'].value_counts())

In [None]:
print(train_df[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False))
sns.countplot(x='Parch', hue='Survived', data=train_df)

### Embarked

**embarked**: Port of Embarkation	

C = Cherbourg, 

Q = Queenstown, 

S = Southampton

In [None]:
print(train_df['Embarked'].value_counts())

In [None]:
print(train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False))
sns.countplot(x='Embarked', hue='Survived', data=train_df)

### Pclass, Sex & Embarked

In [None]:
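# `size` was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
# explicit orders keep the categories aligned across facets
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep', order=[1, 2, 3], hue_order=['male', 'female'])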
grid.add_legend()

### Fare

In [None]:
grid = sns.FacetGrid(train_df, hue='Survived', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Fare', alpha=.5, bins=20)
grid.add_legend();

Assumption 3: The higher the fare, the higher the survival rate.

## Assumptions:
1. The higher the passenger class, the higher the survival rate.
2. Females have a higher survival rate.
3. The higher the fare, the higher the survival rate.

### Correlations

In [None]:
# numeric_only=True (pandas >= 1.5) skips text columns such as Name and Sex
corr_matrix = train_df.corr(numeric_only=True)
corr_matrix["Survived"].sort_values(ascending=False)

## Data Cleaning (Training Set)

### Drop `useless` columns

In [None]:
print(train_df.isnull().sum())
print(len(train_df))

In [None]:
print(train_df['Cabin'].value_counts())

Cabin has many missing values and mostly unique entries, so it does not seem helpful.

In [None]:
train_df.head(10)

In [None]:
# drop PassengerId, Ticket and Cabin
train_df.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [None]:
train_df.head(10)

### Fill in values

In [None]:
train_df.isnull().sum()

In [None]:
# median age of men
median_age_men = train_df[train_df['Age'].notnull() & (train_df['Sex']=='male')]['Age'].median()

In [None]:
# Exercise 1: Median age of women
median_age_women = train_df[train_df['Age'].notnull() & (train_df['Sex']=='female')]['Age'].median()
In [None]:
print(median_age_men, median_age_women)

In [None]:
train_df.loc[(train_df['Age'].isnull())&(train_df['Sex']=='male'), 'Age'] = median_age_men
train_df.loc[(train_df['Age'].isnull())&(train_df['Sex']=='female'), 'Age'] = median_age_women
train_df.isnull().sum()

### Drop `useless` row

In [None]:
train_df.dropna(inplace=True)
train_df.isnull().sum()

In [None]:
train_df.shape

### Names Title/Honorific
Titles prefixing a person's name, e.g.: *Mr, Mrs, Miss, Ms, Mx, Sir, Dr, Cllr, Lady* or *Lord*.

In [None]:
train_df.head(10)['Name']

In [None]:
titles = set()
for name in train_df['Name']:
    titles.add(name.split(',')[1].split('.')[0].strip())
print(sorted(titles))
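The loop above splits each name on the comma and the first period. As an alternative sketch, the same title can be pulled out in one vectorized step with `Series.str.extract`, assuming every name follows the `Surname, Title. Given names` pattern. The three sample names below are only for illustration.

```python
import pandas as pd

# A few sample names in the "Surname, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma and the first period
extracted = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(extracted.tolist())
```

Names that do not match the pattern would come back as NaN, so it is worth checking `extracted.isnull().sum()` before mapping the titles.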

In [None]:
title_dict = {"Capt": "Officer",
              "Col": "Officer",
              "Major": "Officer",
              "Dr": "Officer",
              "Rev": "Officer",
              "Jonkheer": "Royalty",
              "Don": "Royalty",
              "Sir" : "Royalty",
              "the Countess":"Royalty",
              "Lady" : "Royalty",
              "Mme": "Mrs",
              "Ms": "Mrs",
              "Mr" : "Mr",
              "Mrs" : "Mrs",
              "Miss" : "Miss",
              "Mlle": "Miss",
              "Master" : "Master"
            }

In [None]:
train_df['Title'] = train_df['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
train_df['Title'] = train_df.Title.map(title_dict)
train_df.head()

In [None]:
train_df.Title.value_counts()

In [None]:
# drop Name 
train_df.drop('Name', axis=1, inplace=True)
train_df.head()

### Categorical feature

Converting categorical feature to numeric

In [None]:
# These are numerical data
train_df.describe()

In [None]:
train_df.info()

Three attributes are categorical: Sex, Embarked and Title.
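Rather than reading the non-numeric columns off the `info()` output, they can be listed directly. This is a minimal sketch on a two-row stand-in frame (the values are made up for illustration) using `select_dtypes`.

```python
import pandas as pd

# Two-row stand-in for the cleaned training frame (values are illustrative)
toy = pd.DataFrame({
    'Pclass': [3, 1],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
    'Embarked': ['S', 'C'],
    'Title': ['Mr', 'Mrs'],
})

# Every column that is not numeric still needs to be encoded
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(categorical_cols)
```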

In [None]:
train_df.Sex=train_df.Sex.map({'female':0, 'male':1})
train_df.head()

In [None]:
# rows with a missing Embarked value were already dropped above
train_df.Embarked=train_df.Embarked.map({'S':0, 'C':1, 'Q':2})
train_df.head()

In [None]:
# Exercise 2: Map for `Title`
train_df.Title=train_df.Title.map({'Mr':0,'Miss':1,'Mrs':2,'Master':3,'Officer':4,'Royalty':5})
In [None]:
train_df.head()

### Numerical feature

In [None]:
train_df.head()

In [None]:
train_df['FareRange'] = pd.cut(train_df['Fare'], 3)
train_df[['FareRange', 'Survived']].groupby(['FareRange'], as_index=False).mean().sort_values(by='FareRange', ascending=True)

In [None]:
train_df['FareRange'].value_counts()

In [None]:
train_df.loc[ train_df['Fare'] <= 170.776, 'Fare'] = 0
train_df.loc[(train_df['Fare'] > 170.776) & (train_df['Fare'] <= 341.553), 'Fare'] = 1
train_df.loc[(train_df['Fare'] > 341.553), 'Fare'] = 2
train_df.drop('FareRange', axis=1, inplace=True)
train_df.head()
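The thresholds 170.776 and 341.553 were copied by hand from the `pd.cut` output. As a hedged alternative, `pd.cut` can return integer bin codes and the bin edges directly, so the same edges can be reused on another frame. The fares below are a small made-up sample, and the variable names are illustrative.

```python
import pandas as pd

# Made-up fares standing in for the training and test columns
train_fare = pd.Series([0.0, 7.25, 71.28, 263.0, 512.33])
test_fare = pd.Series([8.05, 200.0, 400.0])

# labels=False returns bin codes 0, 1, 2; retbins=True also returns the edges
train_codes, edges = pd.cut(train_fare, 3, labels=False, retbins=True)

# Reuse the training edges so both sets are binned identically
test_codes = pd.cut(test_fare, bins=edges, labels=False)
print(train_codes.tolist(), test_codes.tolist())
```

A value outside the training range would fall outside every bin and become NaN, so such cases need separate handling (for example, clipping the fares to the edge range first).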

In [None]:
train_df.Age = (train_df.Age - min(train_df.Age))/(max(train_df.Age)-min(train_df.Age))
train_df.describe()

In [None]:
corr_matrix = train_df.corr()
corr_matrix["Survived"].sort_values(ascending=False)

## Data Cleaning (Test Set)
Never drop a row from the test set: Kaggle expects a prediction for every passenger in test.csv.

### Fill in null values

In [None]:
test_df.isnull().sum()

In [None]:
# median age by sex
median_age_men2 = test_df[test_df['Age'].notnull() & (test_df['Sex']=='male')]['Age'].median()
median_age_women2 = test_df[test_df['Age'].notnull() & (test_df['Sex']=='female')]['Age'].median()

print(median_age_men2, median_age_women2)

In [None]:
test_df.loc[(test_df['Age'].isnull())&(test_df['Sex']=='male'), 'Age']=median_age_men2
test_df.loc[(test_df['Age'].isnull())&(test_df['Sex']=='female'), 'Age']=median_age_women2

In [None]:
test_df.isnull().sum()

In [None]:
test_df['Fare']=test_df['Fare'].fillna(test_df['Fare'].median())

In [None]:
test_df.isnull().sum()

### Drop `useless` column

In [None]:
test_df.drop(['PassengerId','Ticket','Cabin'], axis=1, inplace=True)
test_df.head()

### Names Title/Honorific

In [None]:
test_df['Title'] = test_df['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
test_df['Title'] = test_df.Title.map(title_dict)
test_df.head()

In [None]:
test_df.drop('Name', axis=1, inplace=True)

### Categorical Features

In [None]:
# Converting categorical feature to numeric
test_df.Sex=test_df.Sex.map({'female':0, 'male':1})
test_df.Embarked=test_df.Embarked.map({'S':0, 'C':1, 'Q':2})
test_df.Title=test_df.Title.map({'Mr':0, 'Miss':1, 'Mrs':2,'Master':3,'Officer':4,'Royalty':5})
test_df.head()

In [None]:
test_df.isnull().sum()

In [None]:
test_df[test_df.Title.isnull()]

In [None]:
# Female, 39 yo -> Mrs
test_df['Title']=test_df['Title'].fillna(2)

In [None]:
test_df.isnull().sum()

### Numerical Feature

In [None]:
# apply the same fare bins to the test set (train_df was already binned above)
test_df.loc[ test_df['Fare'] <= 170.776, 'Fare'] = 0
test_df.loc[(test_df['Fare'] > 170.776) & (test_df['Fare'] <= 341.553), 'Fare'] = 1
test_df.loc[(test_df['Fare'] > 341.553), 'Fare'] = 2
test_df.head()

In [None]:
test_df.Age = (test_df.Age - min(test_df.Age))/(max(test_df.Age)-min(test_df.Age))
test_df.head()

## Model Training

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X, y = train_df.drop("Survived", axis=1), train_df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3, random_state=91)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
X_train.columns.values

In [None]:
# Logistic regression
from sklearn.linear_model import LogisticRegression
model_lg = LogisticRegression()
model_lg.fit(X_train, y_train)

In [None]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
model_rfc = RandomForestClassifier()
model_rfc.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score
# Logistic regression's score
y_pred_lg = model_lg.predict(X_test)
accuracy_score(y_test, y_pred_lg)

In [None]:
# Random Forest Classifier's score
y_pred_rfc = model_rfc.predict(X_test)
accuracy_score(y_test, y_pred_rfc)

In [None]:
# select logistic regression
pred = model_lg.predict(test_df)

In [None]:
pred

In [None]:
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": pred
    })
submission.to_csv('sub01.csv', index=False)

### GridSearch

Select Random Forest Classifier
1. `n_estimators`: The n_estimators parameter specifies the number of trees in the forest. The default value is 100 (it was 10 before scikit-learn 0.22), which means that 100 different decision trees will be constructed in the random forest.
2. `max_depth`: The max_depth parameter specifies the maximum depth of each tree. The default value for max_depth is None, which means that each tree will expand until every leaf is pure or contains fewer than `min_samples_split` samples. A pure leaf is one where all of the data on the leaf comes from the same class.
3. `min_samples_split`: The min_samples_split parameter specifies the minimum number of samples required to split an internal node. The default value for this parameter is 2, which means that an internal node must have at least two samples before it can be split to have a more specific classification.
4. `min_samples_leaf`: The min_samples_leaf parameter specifies the minimum number of samples required to be at a leaf node. The default value for this parameter is 1, which means that every leaf must have at least 1 sample that it classifies.
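To see these four hyperparameters in action, here is a minimal sketch on synthetic data from `make_classification` (a stand-in for the cleaned Titanic features, not the real dataset). It checks the default number of trees and confirms that `max_depth` caps how deep each fitted tree grows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Inspect the default number of trees without fitting anything
default_clf = RandomForestClassifier(random_state=0)
print(default_clf.get_params()['n_estimators'])

# Constrain all four hyperparameters discussed above
constrained = RandomForestClassifier(
    n_estimators=50,
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=0,
).fit(X_demo, y_demo)

# Each fitted tree exposes its actual depth
deepest = max(tree.get_depth() for tree in constrained.estimators_)
print(len(constrained.estimators_), deepest)
```

Smaller `max_depth` and larger `min_samples_split` or `min_samples_leaf` values make each tree simpler, which usually reduces overfitting at the cost of some flexibility.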

In [None]:
%%time
from sklearn.model_selection import GridSearchCV

param_grid = [
     { 
      'max_depth': [5, 6, 7, 8],
      'n_estimators': [460, 480, 500]
     }
  ]


forest_clf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(forest_clf, param_grid, cv=10, scoring='accuracy', return_train_score=True, verbose =10)
grid_search.fit(X, y)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

In [None]:
model = RandomForestClassifier(max_depth=5, n_estimators=460, random_state=42)
model.fit(X, y)

In [None]:
pred_tuning = model.predict(test_df)

In [None]:
submission_tuning = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": pred_tuning
    })
submission_tuning.to_csv('sub02.csv', index=False)

## Feature Engineering
Assumption: People who embarked with family have a higher survival rate.

#### Training Set

In [None]:
train_df.head()

In [None]:
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_df['isAlone'] = 0
train_df.loc[ train_df['FamilySize']==1, 'isAlone'] = 1
train_df.head()

In [None]:
train_df[['isAlone', 'Survived']].groupby(['isAlone'], as_index=False).mean().sort_values(by='isAlone', ascending=False)

In [None]:
train_df.drop(['SibSp', 'Parch', 'FamilySize'], axis=1, inplace=True)
train_df.head()

#### Testing Set

In [None]:
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1
test_df['isAlone'] = 0
test_df.loc[test_df['FamilySize']==1, 'isAlone']=1
test_df.head()

In [None]:
test_df.drop(['SibSp', 'Parch', 'FamilySize'], axis=1, inplace=True)

#### Model Training

In [None]:
X = train_df.drop('Survived', axis = 1)
y = train_df['Survived']

In [None]:
%%time
param_grid = [
     { 
      'max_depth': [5, 6, 7, 8],
      'n_estimators': [300, 350, 400]
     }
  ]

forest_clf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(forest_clf, param_grid, cv=10,
                           scoring='accuracy',
                           return_train_score=True, verbose =10)
grid_search.fit(X, y)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

In [None]:
model = RandomForestClassifier(max_depth=6, n_estimators=350, random_state=42)
model.fit(X, y)
pred_tuning_fe = model.predict(test_df)
submission_tuning_fe = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": pred_tuning_fe
    })
submission_tuning_fe.to_csv('sub03.csv', index=False)

## Improvements
- Use One Hot Encoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) instead of Ordinal Encoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) for `Embarked`
- Use Pipeline for Data Cleaning (training set and testing set): https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
- Create a custom transformer by inheriting TransformerMixin (https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) and BaseEstimator (https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html)
- Explore https://scikit-learn.org/stable/ for other models to train
- Hyperparameter Tuning 
  - use RandomizedSearch (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) before GridSearch (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
  - tune more hyperparameters, e.g. `min_samples_split` for `RandomForestClassifier`
- Feature Engineering
  - find potential aggregated variables
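Several of the suggestions above can be combined in one place. The sketch below is one possible arrangement, not a reference solution: it one-hot encodes `Embarked`, imputes and scales the numeric columns inside a `Pipeline`, and tunes the forest with `RandomizedSearchCV`. The data is randomly generated (so the scores mean nothing), and the step names `prep`, `num`, `cat` and `rf` are arbitrary labels chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Random toy data with Titanic-like column names (illustrative only)
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    'Age': rng.uniform(1, 70, n),
    'Fare': rng.uniform(5, 300, n),
    'Embarked': rng.choice(['S', 'C', 'Q'], n),
})
toy.loc[toy.index % 10 == 0, 'Age'] = np.nan  # simulate missing ages
toy_y = rng.integers(0, 2, n)

# Numeric columns: fill missing values with the median, then min-max scale
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', MinMaxScaler()),
])

# Embarked: one binary column per port instead of ordinal codes
preprocess = ColumnTransformer([
    ('num', numeric_steps, ['Age', 'Fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Embarked']),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('rf', RandomForestClassifier(random_state=42)),
])

# Sample a few hyperparameter combinations instead of trying all of them
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        'rf__n_estimators': [20, 40, 60],
        'rf__max_depth': [3, 5, 7],
        'rf__min_samples_split': [2, 5, 10],
    },
    n_iter=4, cv=3, scoring='accuracy', random_state=42,
)
search.fit(toy, toy_y)
print(search.best_params_)

# The fitted preprocessing step yields 2 numeric + 3 one-hot columns
encoded = search.best_estimator_.named_steps['prep'].transform(toy)
print(encoded.shape)
```

Because the preprocessing lives inside the pipeline, the imputation median and scaling range are learned from the training folds only and then reused unchanged on any new data passed to `search.predict`.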

## Reading Materials
- https://www.kaggle.com/learn/intermediate-machine-learning
- top solution: https://www.kaggle.com/startupsci/titanic-data-science-solutions 