## Oversampling
One way to improve the test score of a classification model is to oversample an imbalanced target. Because fewer passengers survived than did not, we oversample the training data so that both classes, survived and not survived, have the same number of rows.
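
To make the idea concrete, here is a minimal sketch of the simplest form of oversampling: randomly duplicating minority rows until the classes match. It uses a toy label array with the same class counts as the training split in this notebook (421 vs 247) and made-up features; `random_oversample` is a hypothetical helper written for illustration. The notebook itself uses SMOTE, which synthesizes new rows instead of copying existing ones.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy labels with the same class counts as the training split (illustration only)
y_toy = np.array([0] * 421 + [1] * 247)
X_toy = rng.normal(size=(y_toy.size, 3))  # made-up features

def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(y.size)]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # draw the missing rows with replacement from this class
            keep.append(rng.choice(idx, size=target - count, replace=True))
    order = np.concatenate(keep)
    return X[order], y[order]

X_bal, y_bal = random_oversample(X_toy, y_toy, rng)
print(np.bincount(y_toy))  # [421 247]
print(np.bincount(y_bal))  # [421 421]
```

Only the training rows should be resampled this way; the test set keeps its original class distribution so that the reported scores reflect real data.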

In [9]:
# import libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings(action='ignore')

# read data
df = pd.read_csv('data/train.csv')

# drop irrelevant columns
df.drop(columns = ["Name", "PassengerId", "Cabin", "Ticket"], inplace=True)

# handling missing values
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna("N/A")

# separating target and features
X = df.drop(columns = ["Survived"])
y = df.Survived

In [10]:
# split train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

X_train_labeled = X_train.copy()
X_test_labeled = X_test.copy()

# Encode the Sex and Embarked columns as integer labels (fit on train, apply to test)
col = ["Sex", "Embarked"]
for c in col:
    X_train_labeled[c] = le.fit_transform(X_train[c].astype('str'))
    X_test_labeled[c] = le.transform(X_test[c].astype('str'))

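# oversampling (training data only)
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=1)
# fit_resample is the current method name; fit_sample was removed in newer imbalanced-learn releases
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_labeled, y_train)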

# comparison before and after oversampling
print("Raw Counts")
print(y_train.value_counts())
print()
print("Percentages")
print(y_train.value_counts(normalize=True))
print()
print("Resampled Counts")
print(y_train_resampled.value_counts())
print()
print("Percentages")
print(y_train_resampled.value_counts(normalize=True))

Raw Counts
0    421
1    247
Name: Survived, dtype: int64

Percentages
0    0.63024
1    0.36976
Name: Survived, dtype: float64

Resampled Counts
1    421
0    421
Name: Survived, dtype: int64

Percentages
1    0.5
0    0.5
Name: Survived, dtype: float64


As you can see, in the resampled training set the number of passengers who survived is now equal to the number who did not.
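
SMOTE reaches this balance by creating synthetic minority rows rather than copying existing ones: each new row lies on the line segment between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that interpolation step on made-up 2-D points. It is a simplification for intuition, not imbalanced-learn's implementation, and `smote_like` is a hypothetical name.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=1):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other points
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # sample each new row starts from
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbours to move toward
    nb = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Made-up minority points on the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6, k=2)
print(X_new.shape)  # (6, 2)
```

Because the new rows are interpolated, label-encoded columns such as `Sex` and `Embarked` can end up with fractional values in the synthetic rows. imbalanced-learn also provides `SMOTENC` for data with categorical features, which may be worth trying here.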

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# importing saved function to show scores
import data_preparation as dp
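
`data_preparation` is the author's own module, and its code is not shown here. Judging from the printed output in the next cell, `dp.scores` appears to report a cross-validated accuracy, the test accuracy, an RMSE on the test predictions, and a classification report for each split. A rough stand-in under those assumptions might look like the sketch below; the name `scores_sketch`, the 5-fold default, the return values, and the toy data are all guesses made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

def scores_sketch(X_train, y_train, X_test, y_test, model, cv=5):
    """Hypothetical stand-in for dp.scores: fit the model and print a summary."""
    # Cross-validated accuracy on the (resampled) training data; cv=5 is an assumption
    cv_score = cross_val_score(model, X_train, y_train, cv=cv).mean()
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    test_score = model.score(X_test, y_test)
    rmse = np.sqrt(mean_squared_error(y_test, test_pred))

    print(model)
    print(f"CV score:     {cv_score:.2%}")
    print(f"X-test score: {test_score:.2%}")
    print(f"RMSE:         {rmse:.4}")
    print("Train score")
    print(classification_report(y_train, model.predict(X_train)))
    print("X-test score")
    print(classification_report(y_test, test_pred))
    return cv_score, test_score, rmse

# Made-up data, only to show the call
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
cv_score, test_score, rmse = scores_sketch(
    X_tr, y_tr, X_te, y_te, LogisticRegression(random_state=1)
)
```

For 0/1 labels, this RMSE equals the square root of the test error rate, which matches the logistic regression output in this notebook: a test accuracy of 79.82% gives sqrt(1 - 0.7982) ≈ 0.4492.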

In [12]:
lr = LogisticRegression(random_state=1)
dp.scores(X_train_resampled, y_train_resampled, X_test_labeled, y_test, lr)

dtc = DecisionTreeClassifier(random_state=1)
dp.scores(X_train_resampled, y_train_resampled, X_test_labeled, y_test, dtc)

rfc = RandomForestClassifier(random_state=1)
dp.scores(X_train_resampled, y_train_resampled, X_test_labeled, y_test, rfc)

abc = AdaBoostClassifier(random_state=1)
dp.scores(X_train_resampled, y_train_resampled, X_test_labeled, y_test, abc)

gbc = GradientBoostingClassifier(random_state=1)
dp.scores(X_train_resampled, y_train_resampled, X_test_labeled, y_test, gbc)

LogisticRegression(random_state=1)

CV score:     80.73%
X-test score: 79.82%
RMSE:         0.4492

Train score
              precision    recall  f1-score   support

           0       0.81      0.81      0.81       421
           1       0.81      0.81      0.81       421

    accuracy                           0.81       842
   macro avg       0.81      0.81      0.81       842
weighted avg       0.81      0.81      0.81       842



X-test score

              precision    recall  f1-score   support

           0       0.82      0.84      0.83       128
           1       0.77      0.75      0.76        95

    accuracy                           0.80       223
   macro avg       0.79      0.79      0.79       223
weighted avg       0.80      0.80      0.80       223



DecisionTreeClassifier(random_state=1)

CV score:     82.94%
X-test score: 72.65%
RMSE:         0.523

Train score
              precision    recall  f1-score   support

           0       0.99      1.00      0.99   

Test scores generally increased, and gradient boosting still has the highest test score. In the next part, we will tune the gradient boosting model to see whether the test score can be improved further.