# Project 2 - Part II: Classification Task

### Notebook 2: Bagging, Pasting, Adaboosting 

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, confusion_matrix, accuracy_score

### Load data

In [2]:
df_bonus = pd.read_csv(r'revised_hotel_df.csv')
hotel_df = df_bonus.copy()
hotel_df.shape

(115459, 20)

In [3]:
hotel_df.columns

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_month',
       'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'meal',
       'country', 'distribution_channel', 'is_repeated_guest',
       'previous_cancellations', 'previous_bookings_not_canceled',
       'assigned_room_type', 'booking_changes', 'deposit_type',
       'days_in_waiting_list', 'customer_type', 'adr', 'under_18'],
      dtype='object')

### Evaluation Metric Decision

In Project 1, we chose recall as the evaluation metric. The goal is to produce a model with a __high recall__ rate.
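
As a reminder of what recall measures, here is a minimal sketch on made-up labels (not the hotel data). It computes recall by hand from the confusion matrix as TP / (TP + FN) and compares it with `recall_score`. The arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = canceled, 0 = not canceled
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Recall: share of actual cancellations that the model caught
recall_manual = tp / (tp + fn)

print(recall_manual)                 # 0.5
print(recall_score(y_true, y_hat))   # 0.5
```

A high recall means few cancellations are missed, even if that costs some false alarms.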

### Data Preparation

In [4]:
    # one hot encode the categorical variables
hotel_df = pd.get_dummies(hotel_df, columns = ['hotel'], prefix='hotel')
hotel_df = pd.get_dummies(hotel_df, columns = ['arrival_date_month'], prefix='month')
hotel_df = pd.get_dummies(hotel_df, columns = ['meal'], prefix='meal')
hotel_df = pd.get_dummies(hotel_df, columns = ['country'], prefix='country')
hotel_df = pd.get_dummies(hotel_df, columns = ['distribution_channel'], prefix='distr')
hotel_df = pd.get_dummies(hotel_df, columns = ['assigned_room_type'], prefix='room')
hotel_df = pd.get_dummies(hotel_df, columns = ['deposit_type'], prefix='deposit')
hotel_df = pd.get_dummies(hotel_df, columns = ['customer_type'], prefix='cust')
#hotel_df.info()
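
The eight `get_dummies` calls above could probably be collapsed into one call by passing a dict of prefixes. The sketch below shows the idea on a toy frame; `toy` and `prefixes` are illustrative names and the values are made up.

```python
import pandas as pd

# Toy frame standing in for hotel_df (values are made up)
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'Resort Hotel', 'City Hotel'],
    'meal': ['BB', 'HB', 'BB'],
    'lead_time': [10, 200, 35],
})

# Map each categorical column to the prefix used for its dummy columns
prefixes = {'hotel': 'hotel', 'meal': 'meal'}

# One call encodes every listed column; other columns are left as they are
encoded = pd.get_dummies(toy, columns=list(prefixes), prefix=prefixes)

print(encoded.columns.tolist())
```

The dummy columns are appended after the untouched columns, in the order the categorical columns are listed.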

Column rearrangement

In [5]:
hotel_df.insert(5, 'under_18', hotel_df.pop('under_18'))
hotel_df.insert(11, 'is_repeated_guest', hotel_df.pop('is_repeated_guest'))
#hotel_df.info()

We need to take a random sample, because the full data set is too large to run on a typical computer. We also make sure that the proportion of cancelled reservations in the sample is similar to the proportion observed in the original data set, which was ~37%.
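
One way to keep the class proportion when sampling is to draw the same fraction from each class. The sketch below does this on synthetic data with roughly 37% cancellations. It assumes pandas 1.1 or later for `groupby(...).sample`, and the frame `toy` is made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for hotel_df with roughly 37% cancellations
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'is_canceled': (rng.random(10_000) < 0.37).astype(int),
    'lead_time': rng.integers(0, 365, size=10_000),
})

# Draw the same fraction from each class so the class ratio is preserved
sample = (
    toy.groupby('is_canceled')
       .sample(frac=0.5, random_state=8860)
       .reset_index(drop=True)
)

# Compare the cancellation rate before and after sampling
print(toy['is_canceled'].mean(), sample['is_canceled'].mean())
```

A plain `DataFrame.sample` usually lands close to the original proportion on a data set this size, but the grouped version matches it up to rounding.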

## Machine Learning Algorithms

Getting sample data set ready.

In [6]:
#hotel_df_sample = hotel_df.sample(n=50000, random_state=8860).reset_index(drop=True)
#hotel_df_sample['is_canceled'].value_counts()

In [7]:
X = hotel_df.drop('is_canceled', axis=1)
y = hotel_df['is_canceled']
#X.info()

Train-test split

### Model 1: Logistic Regression

Logistic regression is sensitive to feature scaling, so we scale the numerical columns before fitting.

In [8]:
X_train_logreg, X_test_logreg, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 0, test_size = 0.2)

In [9]:
    # StandardScaler centers each feature to zero mean and scales it to unit variance
scaler = StandardScaler()

    # fit_transform for train data set, but just the numerical columns, not one-hot encoded columns
X_train_logreg.iloc[ : , 0:10] = scaler.fit_transform(X_train_logreg.iloc[ : , 0:10])
X_test_logreg.iloc[ : , 0:10] = scaler.transform(X_test_logreg.iloc[ : , 0:10])
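
An alternative to assigning into `.iloc` slices is a `ColumnTransformer`, which scales the chosen columns and passes the rest through. This is a sketch on toy data, not the notebook's method; `n_numeric`, `train`, and `test` are illustrative names, and the result is a NumPy array rather than a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data: two numeric columns followed by one dummy column (made up)
rng = np.random.default_rng(0)
train = pd.DataFrame({
    'lead_time': rng.normal(100, 50, size=200),
    'adr': rng.normal(90, 30, size=200),
    'hotel_City Hotel': rng.integers(0, 2, size=200),
})
test = train.sample(50, random_state=0)

# Scale only the leading numeric columns; pass the dummies through untouched
n_numeric = 2  # would be 10 for the column layout used in this notebook
ct = ColumnTransformer(
    [('num', StandardScaler(), slice(0, n_numeric))],
    remainder='passthrough',
)

train_scaled = ct.fit_transform(train)   # learn mean and std on the training set only
test_scaled = ct.transform(test)         # reuse the training statistics on the test set
```

As in the cell above, the scaler is fit on the training set only, so no information from the test set leaks into the scaling.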

In [10]:
    # instantiate with best hyperparameter found in project 1
logreg = LogisticRegression(random_state=123, C=100)

    # fit on training set
logreg.fit(X_train_logreg, y_train)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [11]:
y_pred_logreg = logreg.predict(X_test_logreg)
print("Logistic Regression, Recall score: {:.4f}".format(recall_score(y_test, y_pred_logreg)))
print("Logistic Regression, Accuracy score: {:.4f}".format(accuracy_score(y_test, y_pred_logreg)))

Logistic Regression, Recall score: 0.5230
Logistic Regression, Accuracy score: 0.7755


### Model 2: Decision Tree

Scikit-learn's decision trees also cannot handle categorical data directly, so we use the same one-hot encoded features. However, scaling makes no difference to decision trees, so we will not scale the data. This means we need to recreate the same train-test split that we used for Logistic Regression, as it was before we applied scaling.
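
The claim that scaling does not matter for trees can be checked on synthetic data. A tree splits on thresholds, and standardizing a feature keeps the order of its values, so the same splits should be found. The sketch below uses `make_classification` data, not the hotel data, and floating-point edge cases could in principle cause small differences.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; not the hotel data set
X_toy, y_toy = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)

# Fit the scaler on the training split only
scaler = StandardScaler().fit(X_tr)

# Same hyperparameters and seed, once on raw and once on scaled features
tree_raw = DecisionTreeClassifier(max_depth=5, random_state=123).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=123).fit(
    scaler.transform(X_tr), y_tr
)

# Compare test-set predictions of the two trees
pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))
same = np.array_equal(pred_raw, pred_scaled)
print(same)
```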

In [12]:
X_train_dtree, X_test_dtree, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 0, test_size = 0.2)

In [13]:
    # instantiate decision tree
dtree = DecisionTreeClassifier(random_state=123, max_depth=5, max_leaf_nodes=9, min_samples_split=2)

    # fit with best hyperparameters found in Project 1
dtree.fit(X_train_dtree, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=5, max_features=None, max_leaf_nodes=9,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')

In [14]:
y_pred_dtree = dtree.predict(X_test_dtree)
print("Decision Tree, Recall score: {:.4f}".format(recall_score(y_test, y_pred_dtree)))
print("Decision Tree, Accuracy score: {:.4f}".format(accuracy_score(y_test, y_pred_dtree)))

Decision Tree, Recall score: 0.5719
Decision Tree, Accuracy score: 0.7668


Now that both base models have been fit to the training data, we can move on to the ensemble methods.

### Model 3a: Bagging 1, Logistic Regression

In [15]:
    # bootstrap = True : means with replacement. if False, no replacement
bag_clf1 = BaggingClassifier(logreg, n_estimators=500, max_samples=100, bootstrap=True, 
                            random_state=0, n_jobs=-1, oob_score=True)

bag_clf1.fit(X_train_logreg, y_train)
y_pred1 = bag_clf1.predict(X_test_logreg)

In [16]:
print("Bagging 1: Logistic Regression, Recall score: {:.4f}".format(recall_score(y_test, y_pred1)))
print("Bagging 1: Logistic Regression, Accuracy score: {:.4f}".format(accuracy_score(y_test, y_pred1)))

Bagging 1: Logistic Regression, Recall score: 0.5161
Bagging 1: Logistic Regression, Accuracy score: 0.7750


### Model 3b: Bagging 2, Decision Tree

In [17]:
    # bootstrap = True : means with replacement. if False, no replacement
bag_clf2 = BaggingClassifier(dtree, n_estimators=500, max_samples=100, bootstrap=True, 
                            random_state=0, n_jobs=-1, oob_score=True)

bag_clf2.fit(X_train_dtree, y_train)
y_pred2 = bag_clf2.predict(X_test_dtree)

In [18]:
print("Bagging 2: Decision Tree, Recall score: {:.4f}".format(recall_score(y_test, y_pred2)))
print("Bagging 2: Decision Tree, Accuracy score: {:.4f}".format(accuracy_score(y_test, y_pred2)))

Bagging 2: Decision Tree, Recall score: 0.4011
Bagging 2: Decision Tree, Accuracy score: 0.7671
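
Two settings in the bagging cells are worth a closer look. An integer `max_samples` is the number of rows drawn for each estimator (a float would be a fraction of the training set), and `oob_score=True` makes the fitted model expose `oob_score_`. The sketch below uses synthetic data and a smaller `n_estimators` to stay quick. It also shows the difference between `bootstrap=True` and `bootstrap=False` by counting the distinct rows each estimator saw.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; not the hotel data set
X_toy, y_toy = make_classification(n_samples=1000, n_features=8, random_state=0)
base = DecisionTreeClassifier(max_depth=5, random_state=123)

# Bagging: an integer max_samples draws that many rows, with replacement
bag = BaggingClassifier(base, n_estimators=50, max_samples=100, bootstrap=True,
                        oob_score=True, random_state=0).fit(X_toy, y_toy)

# Pasting: same subset size, drawn without replacement
paste = BaggingClassifier(base, n_estimators=50, max_samples=100, bootstrap=False,
                          random_state=0).fit(X_toy, y_toy)

# Count the distinct row indices seen by each estimator
bag_unique = [len(np.unique(idx)) for idx in bag.estimators_samples_]
paste_unique = [len(np.unique(idx)) for idx in paste.estimators_samples_]

print(min(bag_unique), max(bag_unique))  # with replacement: repeats are possible
print(set(paste_unique))                 # without replacement: no repeats
print(round(bag.oob_score_, 3))          # accuracy on rows left out of each bag
```

Note that `oob_score_` reports accuracy, not recall, so it is not directly comparable to the recall scores in this notebook.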


### Model 4a: Pasting 1, Logistic Regression

In [19]:
    # bootstrap = True : means with replacement. if False, no replacement
past_clf1 = BaggingClassifier(logreg, n_estimators=500, max_samples=100, bootstrap=False, 
                            random_state=0, n_jobs=-1)

past_clf1.fit(X_train_logreg, y_train)
y_pred3 = past_clf1.predict(X_test_logreg)

In [20]:
print("Pasting 1: Logistic Regression, Recall score: {:.4f}".format(recall_score(y_test, y_pred3)))
print("Pasting 1: Logistic Regression, Accuracy score: {:.4f}".format(accuracy_score(y_test, y_pred3)))

Pasting 1: Logistic Regression, Recall score: 0.5158
Pasting 1: Logistic Regression, Accuracy score: 0.7748


### Model 4b: Pasting 2, Decision Tree

In [21]:
    # bootstrap = True : means with replacement. if False, no replacement
past_clf2 = BaggingClassifier(dtree, n_estimators=500, max_samples=100, bootstrap=False, 
                            random_state=0, n_jobs=-1)

past_clf2.fit(X_train_dtree, y_train)
y_pred4 = past_clf2.predict(X_test_dtree)

In [22]:
print("Pasting 2: Decision Tree, Recall score: {:.4f}".format(recall_score(y_test, y_pred4)))
print("Pasting 2: Decision Tree, Accuracy score: {:.4f}".format(accuracy_score(y_test, y_pred4)))

Pasting 2: Decision Tree, Recall score: 0.4018
Pasting 2: Decision Tree, Accuracy score: 0.7673


### Model 5a: Adaboosting 1, Logistic Regression

In [23]:
ada_clf1 = AdaBoostClassifier(logreg, n_estimators=200, algorithm="SAMME.R", learning_rate=0.5, random_state=0)

ada_clf1.fit(X_train_logreg, y_train)
y_pred5 = ada_clf1.predict(X_test_logreg)

In [24]:
print("Adaboosting 1: Logistic Regression, Recall score: {:.4f}".format(recall_score(y_test, y_pred5)))
print("Adaboosting 1: Logistic Regression, Accuracy score: {:.4f}".format(accuracy_score(y_test, y_pred5)))

Adaboosting 1: Logistic Regression, Recall score: 0.5178
Adaboosting 1: Logistic Regression, Accuracy score: 0.7680


### Model 5b: Adaboosting 2, Decision Tree

In [25]:
ada_clf2 = AdaBoostClassifier(dtree, n_estimators=200, algorithm="SAMME.R", learning_rate=0.5, random_state=0)

ada_clf2.fit(X_train_dtree, y_train)
y_pred6 = ada_clf2.predict(X_test_dtree)

In [26]:
print("Adaboosting 2: Decision Tree, Recall score: {:.4f}".format(recall_score(y_test, y_pred6)))
print("Adaboosting 2: Decision Tree, Accuracy score: {:.4f}".format(accuracy_score(y_test, y_pred6)))

Adaboosting 2: Decision Tree, Recall score: 0.6826
Adaboosting 2: Decision Tree, Accuracy score: 0.8189


### Model Summaries

|   | Model Name | Base Recall Score | Bagging Recall Score | Pasting Recall Score | Adaboosting Recall Score |
| - | ---------- | ----------- | ----------- | ----------- | ----------- |
| 1 | Logistic Regression    | 0.5230 | 0.5161 | 0.5158 | 0.5178 |
| 2 | Decision Tree  | 0.5719 | 0.4011 | 0.4018 | 0.6826 |

Based on the recall scores of these 8 models, the Adaboosting Decision Tree model performs best, with a recall of 68.26%.
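
The comparison can also be done programmatically. The sketch below copies the recall scores from the summary table into a DataFrame, finds the best combination, and computes each ensemble's change relative to its base model. The column label `Base` and the variable names are ours.

```python
import pandas as pd

# Recall scores copied from the summary table
recall = pd.DataFrame(
    {
        'Base': [0.5230, 0.5719],
        'Bagging': [0.5161, 0.4011],
        'Pasting': [0.5158, 0.4018],
        'Adaboosting': [0.5178, 0.6826],
    },
    index=['Logistic Regression', 'Decision Tree'],
)

# Best method overall, then the model that achieves it
best_method = recall.max(axis=0).idxmax()
best_model = recall[best_method].idxmax()
print(best_model, best_method, recall.loc[best_model, best_method])
# Decision Tree Adaboosting 0.6826

# Change in recall of each ensemble relative to its base model
delta = recall.drop(columns='Base').sub(recall['Base'], axis=0).round(4)
print(delta)
```

In `delta`, only the Adaboosting Decision Tree entry is positive. Every other ensemble scored at or below its base model on recall.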

In the last notebook, the SVM poly kernel model performed best, but only at a recall of 60.24%. Thus, the Adaboosting Decision Tree is the best of all the models we have run so far (including the Voting Classifiers).