# Project 2 - Part II: Classification Task

### Notebook 5: Deep Learning

In [1]:
import warnings
warnings.filterwarnings('ignore')

import tensorflow as tf
import keras
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import recall_score, confusion_matrix, accuracy_score

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

# fix NumPy's random seed for reproducibility (TensorFlow's own seed is not set here)
np.random.seed(10)

Using TensorFlow backend.


### Load data

In [2]:
df_bonus = pd.read_csv(r'revised_hotel_df.csv')
hotel_df = df_bonus.copy()
hotel_df.shape

(115459, 20)

In [3]:
hotel_df.columns

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_month',
       'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'meal',
       'country', 'distribution_channel', 'is_repeated_guest',
       'previous_cancellations', 'previous_bookings_not_canceled',
       'assigned_room_type', 'booking_changes', 'deposit_type',
       'days_in_waiting_list', 'customer_type', 'adr', 'under_18'],
      dtype='object')

### Evaluation Metric Decision

In Project 1, we chose recall as the evaluation metric. The goal is to produce a model with a __high recall__ rate.
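As a reminder of what the metric measures, recall is the share of actual cancellations that the model catches: TP / (TP + FN). The sketch below uses made-up labels (not the hotel data) to show that the value from `recall_score` matches the one computed by hand from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (hypothetical): 1 = canceled, 0 = not canceled
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

# For binary labels {0, 1}, confusion_matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = true positives / all actual positives
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print('TP={}, FN={}, recall={:.4f}'.format(tp, fn, recall_manual))
```

A low recall here would mean many cancellations go undetected, which is the error this project wants to minimize.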

### Data Preparation

In [4]:
# one-hot encode the categorical variables
hotel_df = pd.get_dummies(hotel_df, columns = ['hotel'], prefix='hotel')
hotel_df = pd.get_dummies(hotel_df, columns = ['arrival_date_month'], prefix='month')
hotel_df = pd.get_dummies(hotel_df, columns = ['meal'], prefix='meal')
hotel_df = pd.get_dummies(hotel_df, columns = ['country'], prefix='country')
hotel_df = pd.get_dummies(hotel_df, columns = ['distribution_channel'], prefix='distr')
hotel_df = pd.get_dummies(hotel_df, columns = ['assigned_room_type'], prefix='room')
hotel_df = pd.get_dummies(hotel_df, columns = ['deposit_type'], prefix='deposit')
hotel_df = pd.get_dummies(hotel_df, columns = ['customer_type'], prefix='cust')
#hotel_df.info()

Column rearrangement

In [5]:
hotel_df.insert(5, 'under_18', hotel_df.pop('under_18'))
hotel_df.insert(11, 'is_repeated_guest', hotel_df.pop('is_repeated_guest'))
#hotel_df.info()

We may need to take a random sample, because the full data set is too large to run on a typical computer. Any sample should keep the proportion of canceled reservations close to the ~37% observed in the original data set. Note that the sampling cell below is commented out, so the rest of this notebook runs on the full data set.

Getting sample data set ready.

In [6]:
#hotel_df_sample = hotel_df.sample(n=1000, random_state=8860).reset_index(drop=True)
#hotel_df_sample['is_canceled'].value_counts()
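If a sample is needed, one way to keep the cancellation rate close to the original is to stratify on the target. The sketch below is an illustration on synthetic data; `demo_df` and its columns are stand-ins for `hotel_df`, and using `train_test_split` for sampling is a suggestion rather than the approach taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for hotel_df (hypothetical data, roughly 37% canceled)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'lead_time': rng.integers(0, 365, size=10_000),
    'is_canceled': (rng.random(10_000) < 0.37).astype(int),
})

# Stratifying on the target keeps the class proportions in the 1000-row sample
demo_sample, _ = train_test_split(
    demo_df,
    train_size=1000,
    stratify=demo_df['is_canceled'],
    random_state=8860,
)
demo_sample = demo_sample.reset_index(drop=True)

full_rate = demo_df['is_canceled'].mean()
sample_rate = demo_sample['is_canceled'].mean()
print('full: {:.3f}, sample: {:.3f}'.format(full_rate, sample_rate))
```

A plain `DataFrame.sample` call usually lands near the original proportion as well, but it does not guarantee it, especially for small samples.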

In [7]:
X = hotel_df.drop('is_canceled', axis=1)
y = hotel_df['is_canceled']
#X.info()

Train-test split & scaling

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 0, test_size = 0.2)

# StandardScaler rescales each feature to zero mean and unit variance
# (note: the mean and variance it uses are themselves sensitive to outliers)
scaler = StandardScaler()

# fit the scaler on the training set only, and only on the 10 numerical columns (not the one-hot encoded columns);
# the test set is then transformed with the training-set statistics
X_train.iloc[ : , 0:10] = scaler.fit_transform(X_train.iloc[ : , 0:10])
X_test.iloc[ : , 0:10] = scaler.transform(X_test.iloc[ : , 0:10])
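The same fit-on-train, transform-on-test pattern can also be written with column names instead of positional slices, which is less fragile if the column order changes. This is a minimal sketch on a tiny made-up frame; `train`, `test`, and `numeric_cols` are illustrative names, not objects from this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of the train/test frames: one numeric and one dummy column
train = pd.DataFrame({'lead_time': [10.0, 20.0, 30.0, 40.0],
                      'hotel_Resort Hotel': [1, 0, 1, 0]})
test = pd.DataFrame({'lead_time': [25.0, 50.0],
                     'hotel_Resort Hotel': [0, 1]})

numeric_cols = ['lead_time']  # columns to scale; dummy columns are left alone

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn mean and standard deviation from the training data only
train_scaled[numeric_cols] = scaler.fit_transform(train[numeric_cols])
# Reuse the training statistics on the test data to avoid leakage
test_scaled[numeric_cols] = scaler.transform(test[numeric_cols])

print(test_scaled)
```

Because the training mean of `lead_time` is 25, the first test row maps to 0 after scaling, while the dummy column is untouched.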

## Deep Learning Algorithms

### GridSearchCV method

#### Steps 1 & 2: Create and compile model

In [16]:
def create_model():
    # Step 1: create model
    model = Sequential()
    model.add(Dense(19, input_dim=58, activation='relu'))
    model.add(Dense(13, activation='relu'))
    model.add(Dense(7, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Step 2: compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
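To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each `Dense` layer has (inputs × units) weights plus one bias per unit. The short sketch below does that arithmetic in plain Python for the layer sizes used in `create_model`; the variable names are illustrative.

```python
# Layer widths from create_model(): 58 inputs -> 19 -> 13 -> 7 -> 1 output
layer_sizes = [58, 19, 13, 7, 1]

# Each Dense layer has n_in * n_out weights plus n_out biases
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [1121, 260, 98, 8]
print(total_params)      # 1487
```

The same total should appear in the output of `model.summary()` for a model built with these layer sizes.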

In [25]:
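# instantiate the scikit-learn wrapper around the Keras model
model = KerasClassifier(build_fn=create_model, verbose=0)

# define parameter grid
param_grid = {'batch_size': [10, 15, 20], 'epochs': [10, 20, 30]}

# grid search; no `scoring` is given, so candidates are ranked by the
# estimator's default score (accuracy), not by recall
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)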
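Since recall is the project's chosen metric, the grid search could also select hyperparameters by recall through the `scoring` argument of `GridSearchCV`. The sketch below shows the idea on synthetic data, using scikit-learn's `MLPClassifier` as a lightweight stand-in for the Keras model so that it runs without TensorFlow; the data, grid values, and names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic, imbalanced data (hypothetical; roughly 37% positives)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.63, 0.37], random_state=0
)

# Small neural network used here as a stand-in for the Keras model
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=0)
demo_grid = {'batch_size': [16, 32]}

# scoring='recall' makes the search rank candidates by recall on the positive class
demo_search = GridSearchCV(estimator=mlp, param_grid=demo_grid,
                           scoring='recall', cv=3)
demo_search.fit(X_demo, y_demo)

best_recall = demo_search.best_score_
print(demo_search.best_params_, round(best_recall, 4))
```

With `scoring='recall'`, `best_score_` and `score()` both report recall, so the selected parameters and the reported metric line up with the project goal.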

#### Step 3: Fit the model

In [18]:
grid_search_result = grid_search.fit(X_train, y_train)

In [24]:
    # best parameters
grid_search_result.best_params_

{'batch_size': 15, 'epochs': 30}

#### Steps 4 & 5: Prediction and Evaluation

In [22]:
# score() reports accuracy here (the default score used by this grid search), not recall
print('Deep Learning, Train Score: {:.4f}'.format(grid_search_result.score(X_train, y_train)))
print('Deep Learning, Test Score: {:.4f}'.format(grid_search_result.score(X_test, y_test)))

Deep Learning, Train Score: 0.8235
Deep Learning, Test Score: 0.8128


In [23]:
y_predict = grid_search_result.predict(X_test)
print('Deep Learning, Recall Score: {:.4f}'.format(recall_score(y_test, y_predict)))

Deep Learning, Recall Score: 0.6635


### All Model Summaries

|    | Model Name             | Recall Score | Bagging Recall Score | Pasting Recall Score | AdaBoost Recall Score | PCA Recall Score |
| -- | ---------------------- | ------------ | -------------------- | -------------------- | --------------------- | ---------------- |
| 1  | KNN                    | 0.5319       | NA                   | NA                   | NA                    | 0.6739           |
| 2  | Logistic Regression    | 0.5362       | 0.5161               | 0.5158               | 0.5178                | 0.5250           |
| 3  | Linear SVM             | 0.5174       | NA                   | NA                   | NA                    | 0.4995           |
| 4  | Decision Tree          | 0.5767       | 0.4011               | 0.4018               | 0.6826                | 0.2922           |
| 5a | SVM, kernel='rbf'      | 0.4691       | NA                   | NA                   | NA                    | 0.5106           |
| 5b | SVM, kernel='poly'     | 0.5556       | NA                   | NA                   | NA                    | 0.5053           |
| 5c | SVM, kernel='linear'   | 0.5185       | NA                   | NA                   | NA                    | 0.5106           |
| 6  | Hard Voting Classifier | 0.5745       | NA                   | NA                   | NA                    | NA               |
| 7  | Soft Voting Classifier | 0.5359       | NA                   | NA                   | NA                    | NA               |
| 8  | Gradient Boosting      | 0.7061       | NA                   | NA                   | NA                    | NA               |
| 9  | Deep Learning          | 0.6635       | NA                   | NA                   | NA                    | NA               |

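Based on the recall scores of all the models, Gradient Boosting gives the highest recall, at 70.61%. The Deep Learning model reaches 66.35%, behind Gradient Boosting, the AdaBoosted Decision Tree (68.26%), and KNN with PCA (67.39%).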