# ***Hotel Booking Cancelation Prediction***

- Nama: Deryl Baharudin Sopandi
- LinkedIn: www.linkedin.com/in/derylbaharudin
- Github: github.com/derylbaharudin
- Website: derylbaharudin.website

# Business Problem and Data Understanding

## Define Business Problem

### Context

### Problem

The hotel booking dataset is used to understand what customer characteristics that is likely to cancel. Furthermore, these features is used for machine learning training in predicting cancelation so hotel owners can mitigate costumers that is likely to cancel. It would be very helpful for the hotel management to have a model that can predict if a guests will cancel and also know what factors made a guest cancels, so they can manage it easier especially in high peak seasons.

### Goals

### Analytics Approach

### Evaluation Metrics

`0` Not Canceled, `1` Canceled

We will use classification model to help hotel companies to predict the probability a customer would cancel. The model will then be measured with a certain metric. The metric is basically a probability of four possible outcomes which is defined below:

- True Positive (TP): predicted is the same as actual, which means model predicts people that cancels their booking.

- False Positive (FP): predicted as canceled but actually the customer came to the hotel.

- False Negative (FN): predicted as not canceled but actually the customer canceled the booking.

- True Negative (TN): predicted as not canceled and the customer camed and stayed at the hotel.

The false predicition in this model will have two consequences, the first one is the customer that is predicted to cancel actually came, which will be a disaster due to complaints because the room reserved is already canceled. The other one is when predicted not canceled but the guest actually canceled or did not turn up, which is not very much of a problem because the room can always be reselled for upcoming guests.

As defined, the metric that we will use on predicting the cancelation is using precision score to specify our focus on the prediction of cancelation that is 1 (Canceled) and pay more attention towards true positive and false positive.

## Data Understanding

### Data Source

Source : [Kaggle - Hotel Booking Demand Dataset](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand)

The data is originally from the article Hotel Booking Demand Datasets, written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.

### Dataset Description

- The data contains hotel bookings due to arrive between 2015 and 2017
- Each record of data represents information related to ordering transactions that occur
- Unbalance dataset (is_canceled)

### Attibutes Information

Attribute Information  `[remove irrelevant attributes]`

`hotel` (H1 = Resort Hotel or H2 = City Hotel)

`is_canceled` Value indicating if the booking was canceled (1) or not (0)

`lead_time` Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

`arrival_date_year` Year of arrival date

`arrival_date_month` Month of arrival date

`arrival_date_week_number` Week number of year for arrival date

`arrival_date_day_of_month` Day of arrival date

`stays_in_weekend_nights` Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

`stays_in_week_nights` Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

`adults` Number of adults

`children` Number of children

`babies` Number of babies

`meal` Type of meal booked. Categories are presented in standard hospitality meal package Undefined/SC – no meal

`country` Country of origin. Categories are represented in the ISO 3155–2013 format

`market_segment` Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”

`distribution_channel` Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

`is_repeated_guest` Value indicating if the booking name was from a repeated guest (1) or not (0)

`previous_cancellations` Number of previous bookings that were cancelled by the customer prior to the current booking

`previous_bookings_not_canceled` Number of previous bookings not cancelled by the customer prior to the current booking

`reserved_room_type` Code of room type reserved. Code is presented instead of designation for anonymity reasons

`assigned_room_typeCode` for the type of room assigned to the booking.Code is presented instead of designation for anonymity reasons

`booking_changes` Number of changes made to the booking from the moment the booking was entered on the PMS until the moment of check-in or out

`deposit_type` Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categorie No

`agent` ID of the travel agency that made the booking

`company` ID of the company that made the booking or responsible for paying the booking

`days_in_waiting_list` Number of days the booking was in the waiting list before it was confirmed to the customer

`customer_type` Type of booking, assuming one of four categorieTransient - Transient-Party - Contract - Group

`adr` Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

`required_car_parking_spaces` Number of car parking spaces required by the customer

`total_of_special_requests` Number of special requests made by the customer (e.g. twin bed or high floor)

`reservation_status Reservation` last status, assuming one of three categorie Canceled – booking was canceled by the customer; Check-Out

`reservation_status_date` Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to 

# Import Dataset and Libraries

In [1]:
# pip install category_encoders

In [3]:
# Library   
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno
from IPython.display import display
# from dython.nominal import associations
# from dython.nominal import identify_nominal_columns

# Data Analysis
from scipy.stats import normaltest

# Feature Engineering
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder, QuantileTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler, StandardScaler
import category_encoders as ce

# Model Selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV,StratifiedKFold,train_test_split, cross_val_score, KFold
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score
from sklearn.metrics import roc_curve, roc_auc_score, plot_roc_curve
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import plot_precision_recall_curve
    
# Imbalance Dataset
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Model Deployment
import pickle

# Ignore Warning
import sys
import warnings
warnings.filterwarnings("ignore")

# Set max columns
pd.set_option('display.max_columns', None)

# # Save Figures to Google Drive
# from google.colab import files

In [4]:
# load dataset
df = pd.read_csv('hotel_bookings.csv')

In [5]:
# show head
pd.set_option("display.max_rows", None)
df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


# Data Conditioning

In [6]:
# check info
listItem = []
for col in df.columns :
    listItem.append([col, df[col].dtype, df[col].isna().sum(), round((df[col].isna().sum()/len(df[col])) * 100,2),
                    df[col].nunique(), list(df[col].drop_duplicates().sample(2).values)]);

dfDesc = pd.DataFrame(columns=['dataFeatures', 'dataType', 'null', 'nullPct', 'unique', 'uniqueSample'],
                      data=listItem)
dfDesc

Unnamed: 0,dataFeatures,dataType,null,nullPct,unique,uniqueSample
0,hotel,object,0,0.0,2,"[City Hotel, Resort Hotel]"
1,is_canceled,int64,0,0.0,2,"[0, 1]"
2,lead_time,int64,0,0.0,479,"[387, 159]"
3,arrival_date_year,int64,0,0.0,3,"[2015, 2016]"
4,arrival_date_month,object,0,0.0,12,"[March, February]"
5,arrival_date_week_number,int64,0,0.0,53,"[45, 43]"
6,arrival_date_day_of_month,int64,0,0.0,31,"[20, 2]"
7,stays_in_weekend_nights,int64,0,0.0,17,"[1, 5]"
8,stays_in_week_nights,int64,0,0.0,35,"[16, 17]"
9,adults,int64,0,0.0,14,"[6, 40]"


The 'company' feature is considered to be dropped as it has 94% NaN values, and will be handle in preprocessing section

We should reformat reservation status date to date and time format, also the arrival date into a new date and time feature.

In [7]:
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'])

# Creating the arrival date full feature: 
df['arrival_date_full'] = df['arrival_date_year'].astype(str) + "-" + df['arrival_date_month'].map({'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}).astype(str) + "-" + df['arrival_date_day_of_month'].astype(str)
df['arrival_date_full'] = pd.to_datetime(df['arrival_date_full'], format="%Y-%m-%d")

The day difference between arrival date and reservation status date can be calculated now with the format both as date and time. For not canceling customers this difference represent how long they stayed, while for canceling customers this number represent their cancelation days prior arrival date that was booked. Below is the difference calculation and reformating to integer type.

In [8]:
# Creating a new feature representing length of stay or how many days before arrival did the customer cancel:
df['status_minus_arrival_date'] = np.abs(df['arrival_date_full'] - df['reservation_status_date']).astype(str)
# formating the feature 
def format_lenght(date):
    return date[0]
df['status_minus_arrival_date'] = df['status_minus_arrival_date'].map(format_lenght).astype(int)

Checking all features unique values to check if there any values that are still not quite right such as 'Undefined'.

In [9]:
for col in df.columns:
    print(f"{col}: \n{df[col].unique()}\n")

hotel: 
['Resort Hotel' 'City Hotel']

is_canceled: 
[0 1]

lead_time: 
[342 737   7  13  14   0   9  85  75  23  35  68  18  37  12  72 127  78
  48  60  77  99 118  95  96  69  45  40  15  36  43  70  16 107  47 113
  90  50  93  76   3   1  10   5  17  51  71  63  62 101   2  81 368 364
 324  79  21 109 102   4  98  92  26  73 115  86  52  29  30  33  32   8
 100  44  80  97  64  39  34  27  82  94 110 111  84  66 104  28 258 112
  65  67  55  88  54 292  83 105 280 394  24 103 366 249  22  91  11 108
 106  31  87  41 304 117  59  53  58 116  42 321  38  56  49 317   6  57
  19  25 315 123  46  89  61 312 299 130  74 298 119  20 286 136 129 124
 327 131 460 140 114 139 122 137 126 120 128 135 150 143 151 132 125 157
 147 138 156 164 346 159 160 161 333 381 149 154 297 163 314 155 323 340
 356 142 328 144 336 248 302 175 344 382 146 170 166 338 167 310 148 165
 172 171 145 121 178 305 173 152 354 347 158 185 349 183 352 177 200 192
 361 207 174 330 134 350 334 283 153 197 133 241 193

Besides NaN data that we saw earlier, the data also has a couple of values that are filled with Undefined and values that does not make sense which is negative or outliers.

In [10]:
# describe numerical
df.describe()

Unnamed: 0,is_canceled,lead_time,arrival_date_year,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,agent,company,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests,status_minus_arrival_date
count,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119390.0,119386.0,119390.0,119390.0,119390.0,119390.0,119390.0,103050.0,6797.0,119390.0,119390.0,119390.0,119390.0,119390.0
mean,0.370416,104.011416,2016.156554,27.165173,15.798241,0.927599,2.500302,1.856403,0.10389,0.007949,0.031912,0.087118,0.137097,0.221124,86.693382,189.266735,2.321149,101.831122,0.062518,0.571363,3.077234
std,0.482918,106.863097,0.707476,13.605138,8.780829,0.998613,1.908286,0.579261,0.398561,0.097436,0.175767,0.844336,1.497437,0.652306,110.774548,131.655015,17.594721,50.53579,0.245291,0.792798,2.135618
min,0.0,0.0,2015.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0,0.0,-6.38,0.0,0.0,0.0
25%,0.0,18.0,2016.0,16.0,8.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,62.0,0.0,69.29,0.0,0.0,1.0
50%,0.0,69.0,2016.0,28.0,16.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,179.0,0.0,94.575,0.0,0.0,3.0
75%,1.0,160.0,2017.0,38.0,23.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,229.0,270.0,0.0,126.0,0.0,1.0,4.0
max,1.0,737.0,2017.0,53.0,31.0,19.0,50.0,55.0,10.0,10.0,1.0,26.0,72.0,21.0,535.0,543.0,391.0,5400.0,8.0,5.0,9.0


In [11]:
# describe categorical
df.describe(exclude='number')

Unnamed: 0,hotel,arrival_date_month,meal,country,market_segment,distribution_channel,reserved_room_type,assigned_room_type,deposit_type,customer_type,reservation_status,reservation_status_date,arrival_date_full
count,119390,119390,119390,118902,119390,119390,119390,119390,119390,119390,119390,119390,119390
unique,2,12,5,177,8,5,10,12,3,4,3,926,793
top,City Hotel,August,BB,PRT,Online TA,TA/TO,A,A,No Deposit,Transient,Check-Out,2015-10-21 00:00:00,2015-12-05 00:00:00
freq,79330,13877,92310,48590,56477,97870,85994,74053,104641,89613,75166,1461,448
first,,,,,,,,,,,,2014-10-17 00:00:00,2015-07-01 00:00:00
last,,,,,,,,,,,,2017-09-14 00:00:00,2017-08-31 00:00:00


# Explanatory Data Analysis (EDA)

## Missing Value Handling

## Data Anomalies Handling

## Data Outliers Handling

## Duplicated Data Handling

## Feature Engineering

## Univariate Analysis

## Bivariate Analysis

## Multivariate Analysis

# Data Processing

## Preprocessing Scheme

# Modeling

## Model Benchmarking

### Cross Validation K-Fold for Data Training

### Cross Validation K-Fold for Data Testing

### Model Benchmaring: Comparison

## Hyperparameter Tuning

### Model: 

### Model: 

### Model: 

## Undersampling and Oversampling Test

## Feature Selection

## Other Optimization

# Deployment

# Conclusion and Recommendation