# **Project - Hotel Booking Cancellation Prediction**



---------------
## **Problem Statement**

### **Context**

**A significant number of hotel bookings are called off due to cancellations or no-shows.** Typical reasons include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but it is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

Online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

This pattern of cancellations of bookings impacts a hotel on various fronts:
1. **Loss of resources (revenue)** when the hotel cannot resell the room.
2. **Additional costs of distribution channels** by increasing commissions or paying for publicity to help sell these rooms.
3. **Lowering prices last minute** so the hotel can resell a room, which reduces the profit margin.
4. **Human resources to make arrangements** for the guests.

### **Objective**

This increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, which runs a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. You, as a Data Scientist, have to analyze the data provided to find which factors have a high influence on booking cancellations, build a model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.


### **Data Description**

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below:


**Data Dictionary**

* **Booking_ID:** Unique identifier of each booking
* **no_of_adults:** Number of adults
* **no_of_children:** Number of children
* **no_of_weekend_nights:** Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* **no_of_week_nights:** Number of weekday nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* **type_of_meal_plan:** Type of meal plan booked by the customer:
    * Not Selected – No meal plan selected
    * Meal Plan 1 – Breakfast
    * Meal Plan 2 – Half board (breakfast and one other meal)
    * Meal Plan 3 – Full board (breakfast, lunch, and dinner)
* **required_car_parking_space:** Does the customer require a car parking space? (0 - No, 1 - Yes)
* **room_type_reserved:** Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
* **lead_time:** Number of days between the date of booking and the arrival date
* **arrival_year:** Year of arrival date
* **arrival_month:** Month of arrival date
* **arrival_date:** Day of the month of the arrival date
* **market_segment_type:** Market segment designation.
* **repeated_guest:** Is the customer a repeated guest? (0 - No, 1 - Yes)
* **no_of_previous_cancellations:** Number of previous bookings that were canceled by the customer prior to the current booking
* **no_of_previous_bookings_not_canceled:** Number of previous bookings not canceled by the customer prior to the current booking
* **avg_price_per_room:** Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
* **no_of_special_requests:** Total number of special requests made by the customer (e.g. high floor, view from the room, etc.)
* **booking_status:** Flag indicating whether the booking was canceled (`Canceled` or `Not_Canceled`).

In [1]:
import findspark

In [2]:
findspark.init("/home/adeola/spark-3.3.0-bin-hadoop3")

In [3]:
from pyspark.sql import SparkSession

In [178]:
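import pyspark.sql.types as T
import pyspark.sql.functions as F  # used below as F.count_distinct(...)
from pyspark.sql.functions import col, lit, when  # explicit names instead of a star import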

In [6]:
spark = SparkSession.builder.getOrCreate()


In [7]:
spark

#### Load the dataset

In [8]:
df = spark.read.csv('INNHotelsGroup.csv', header=True, inferSchema=True)

#### Print the schema of the dataset

In [11]:
df.printSchema()

root
 |-- Booking_ID: string (nullable = true)
 |-- no_of_adults: integer (nullable = true)
 |-- no_of_children: integer (nullable = true)
 |-- no_of_weekend_nights: integer (nullable = true)
 |-- no_of_week_nights: integer (nullable = true)
 |-- type_of_meal_plan: string (nullable = true)
 |-- required_car_parking_space: integer (nullable = true)
 |-- room_type_reserved: string (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_year: integer (nullable = true)
 |-- arrival_month: integer (nullable = true)
 |-- arrival_date: integer (nullable = true)
 |-- market_segment_type: string (nullable = true)
 |-- repeated_guest: integer (nullable = true)
 |-- no_of_previous_cancellations: integer (nullable = true)
 |-- no_of_previous_bookings_not_canceled: integer (nullable = true)
 |-- avg_price_per_room: double (nullable = true)
 |-- no_of_special_requests: integer (nullable = true)
 |-- booking_status: string (nullable = true)



## **Importing the libraries required**

In [1]:
# Importing the basic libraries we will require for the project

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder

# To get different metric scores
# (ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2)
from sklearn.metrics import (
    confusion_matrix, classification_report, roc_auc_score, ConfusionMatrixDisplay,
    precision_recall_curve, roc_curve, make_scorer,
)

# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')

## **Loading the dataset**

In [12]:
hotel = spark.read.csv('INNHotelsGroup.csv', header=True, inferSchema=True)

In [16]:
data = hotel

## **Overview of the dataset**

### **Viewing the first and last rows of the dataset**

Let's **view the first few rows and last few rows** of the dataset in order to understand its structure a little better.

We will use the `show()` and `tail()` methods of the Spark DataFrame to do this.

In [18]:
data.show(2)

+----------+------------+--------------+--------------------+-----------------+-----------------+--------------------------+------------------+---------+------------+-------------+------------+-------------------+--------------+----------------------------+------------------------------------+------------------+----------------------+--------------+
|Booking_ID|no_of_adults|no_of_children|no_of_weekend_nights|no_of_week_nights|type_of_meal_plan|required_car_parking_space|room_type_reserved|lead_time|arrival_year|arrival_month|arrival_date|market_segment_type|repeated_guest|no_of_previous_cancellations|no_of_previous_bookings_not_canceled|avg_price_per_room|no_of_special_requests|booking_status|
+----------+------------+--------------+--------------------+-----------------+-----------------+--------------------------+------------------+---------+------------+-------------+------------+-------------------+--------------+----------------------------+------------------------------------+--

In [21]:
data.tail(2)

[Row(Booking_ID='INN36274', no_of_adults=2, no_of_children=0, no_of_weekend_nights=0, no_of_week_nights=3, type_of_meal_plan='Not Selected', required_car_parking_space=0, room_type_reserved='Room_Type 1', lead_time=63, arrival_year=2018, arrival_month=4, arrival_date=21, market_segment_type='Online', repeated_guest=0, no_of_previous_cancellations=0, no_of_previous_bookings_not_canceled=0, avg_price_per_room=94.5, no_of_special_requests=0, booking_status='Canceled'),
 Row(Booking_ID='INN36275', no_of_adults=2, no_of_children=0, no_of_weekend_nights=1, no_of_week_nights=2, type_of_meal_plan='Meal Plan 1', required_car_parking_space=0, room_type_reserved='Room_Type 1', lead_time=207, arrival_year=2018, arrival_month=12, arrival_date=30, market_segment_type='Offline', repeated_guest=0, no_of_previous_cancellations=0, no_of_previous_bookings_not_canceled=0, avg_price_per_room=161.67, no_of_special_requests=0, booking_status='Not_Canceled')]

### **Understanding the shape of the dataset**

In [24]:
data.count(),len(data.columns)

(36275, 19)

* The dataset has 36275 rows and 19 columns. 

### **Checking the data types of the columns for the dataset**

In [26]:
data.printSchema()

root
 |-- Booking_ID: string (nullable = true)
 |-- no_of_adults: integer (nullable = true)
 |-- no_of_children: integer (nullable = true)
 |-- no_of_weekend_nights: integer (nullable = true)
 |-- no_of_week_nights: integer (nullable = true)
 |-- type_of_meal_plan: string (nullable = true)
 |-- required_car_parking_space: integer (nullable = true)
 |-- room_type_reserved: string (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_year: integer (nullable = true)
 |-- arrival_month: integer (nullable = true)
 |-- arrival_date: integer (nullable = true)
 |-- market_segment_type: string (nullable = true)
 |-- repeated_guest: integer (nullable = true)
 |-- no_of_previous_cancellations: integer (nullable = true)
 |-- no_of_previous_bookings_not_canceled: integer (nullable = true)
 |-- avg_price_per_room: double (nullable = true)
 |-- no_of_special_requests: integer (nullable = true)
 |-- booking_status: string (nullable = true)



* `Booking_ID`, `type_of_meal_plan`, `room_type_reserved`, `market_segment_type`, and `booking_status` are of string type, while the remaining columns are numeric (integer or double).

* There are no null values in the dataset.

- There are **no duplicate values** in the data.

### **Dropping the unique values column**

**Let's drop the Booking_ID column first before we proceed forward**, as a column with unique values will have almost no predictive power for the Machine Learning problem at hand.

In [30]:
data = data.drop(data["Booking_ID"])

In [31]:
data.head(5)

[Row(no_of_adults=2, no_of_children=0, no_of_weekend_nights=1, no_of_week_nights=2, type_of_meal_plan='Meal Plan 1', required_car_parking_space=0, room_type_reserved='Room_Type 1', lead_time=224, arrival_year=2017, arrival_month=10, arrival_date=2, market_segment_type='Offline', repeated_guest=0, no_of_previous_cancellations=0, no_of_previous_bookings_not_canceled=0, avg_price_per_room=65.0, no_of_special_requests=0, booking_status='Not_Canceled'),
 Row(no_of_adults=2, no_of_children=0, no_of_weekend_nights=2, no_of_week_nights=3, type_of_meal_plan='Not Selected', required_car_parking_space=0, room_type_reserved='Room_Type 1', lead_time=5, arrival_year=2018, arrival_month=11, arrival_date=6, market_segment_type='Online', repeated_guest=0, no_of_previous_cancellations=0, no_of_previous_bookings_not_canceled=0, avg_price_per_room=106.68, no_of_special_requests=1, booking_status='Not_Canceled'),
 Row(no_of_adults=1, no_of_children=0, no_of_weekend_nights=2, no_of_week_nights=1, type_of_me

## **Data Preparation for Modeling**

- We want to predict which bookings will be canceled.
- Before we proceed to build a model, we'll have to encode categorical features.
- We'll split the data into train and test to be able to evaluate the model that we build on the train data.

**Separating the independent variables (X) and the dependent variable (Y)**

In [39]:
import pyspark.pandas as ps  # conventional alias; avoids shadowing the pandas `pd` imported earlier

In [48]:
# X is built from `hotel`, so it still contains Booking_ID (see the schema below)
X = hotel.drop("booking_status")
Y = hotel["booking_status"]



In [54]:
X.printSchema()

root
 |-- Booking_ID: string (nullable = true)
 |-- no_of_adults: integer (nullable = true)
 |-- no_of_children: integer (nullable = true)
 |-- no_of_weekend_nights: integer (nullable = true)
 |-- no_of_week_nights: integer (nullable = true)
 |-- type_of_meal_plan: string (nullable = true)
 |-- required_car_parking_space: integer (nullable = true)
 |-- room_type_reserved: string (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_year: integer (nullable = true)
 |-- arrival_month: integer (nullable = true)
 |-- arrival_date: integer (nullable = true)
 |-- market_segment_type: string (nullable = true)
 |-- repeated_guest: integer (nullable = true)
 |-- no_of_previous_cancellations: integer (nullable = true)
 |-- no_of_previous_bookings_not_canceled: integer (nullable = true)
 |-- avg_price_per_room: double (nullable = true)
 |-- no_of_special_requests: integer (nullable = true)



In [87]:
train.agg(F.count_distinct('type_of_meal_plan')).show()

+------------------------+
|count(type_of_meal_plan)|
+------------------------+
|                       4|
+------------------------+



In [53]:
X.groupBy('type_of_meal_plan').count().show()

+-----------------+-----+
|type_of_meal_plan|count|
+-----------------+-----+
|      Meal Plan 1|27835|
|     Not Selected| 5130|
|      Meal Plan 3|    5|
|      Meal Plan 2| 3305|
+-----------------+-----+



In [55]:
X.agg(F.count_distinct('room_type_reserved')).show()

+-------------------------+
|count(room_type_reserved)|
+-------------------------+
|                        7|
+-------------------------+



In [56]:
X.groupBy('room_type_reserved').count().show()

+------------------+-----+
|room_type_reserved|count|
+------------------+-----+
|       Room_Type 7|  158|
|       Room_Type 2|  692|
|       Room_Type 3|    7|
|       Room_Type 1|28130|
|       Room_Type 5|  265|
|       Room_Type 6|  966|
|       Room_Type 4| 6057|
+------------------+-----+



In [57]:
from pyspark.ml.feature import (OneHotEncoder, StringIndexer)

In [145]:
catCols = [x for (x, dataType) in train.dtypes if dataType == 'string' and x != 'booking_status']
numdata = train.drop('type_of_meal_plan','room_type_reserved','market_segment_type','booking_status')
numCols = numdata.columns

In [146]:
catCols

['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']

In [147]:
numCols

['no_of_adults',
 'no_of_children',
 'no_of_weekend_nights',
 'no_of_week_nights',
 'required_car_parking_space',
 'lead_time',
 'arrival_year',
 'arrival_month',
 'arrival_date',
 'repeated_guest',
 'no_of_previous_cancellations',
 'no_of_previous_bookings_not_canceled',
 'avg_price_per_room',
 'no_of_special_requests']

In [148]:
string_indexer = [StringIndexer(inputCol=x,outputCol=x + "_StringIndexer", handleInvalid='skip') for x in catCols]

In [149]:
string_indexer

[StringIndexer_4f9aa032859c,
 StringIndexer_66e4b99c8c26,
 StringIndexer_557cd10194a7]

In [150]:
onehot_encoder  = [OneHotEncoder(inputCols= [f"{x}_StringIndexer" for x in catCols],
                                 outputCols= [f"{x}_OneHotEncoder" for x in catCols]
                                )
                  ]
                                 

In [151]:
onehot_encoder

[OneHotEncoder_dd5bf824eb69]
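As a rough illustration of what these two stages do, the plain-Python sketch below mimics `StringIndexer` and `OneHotEncoder` under their default settings as we understand them: categories are indexed by descending frequency, and the encoder drops the last category, so 4 meal plans become a 3-element vector. The counts are the full-dataset counts from the `groupBy` output above (the pipeline itself is fit on `train`, so its counts differ), and the helper names `string_index` and `one_hot` are hypothetical.

```python
# Plain-Python illustration (no Spark needed) of frequency-based indexing
# followed by one-hot encoding with the last category dropped.
meal_plan_counts = {
    "Meal Plan 1": 27835,
    "Not Selected": 5130,
    "Meal Plan 3": 5,
    "Meal Plan 2": 3305,
}


def string_index(counts):
    """Map each category to an index, most frequent first (hypothetical helper)."""
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Return a dense one-hot list; the last index maps to all zeros if dropped."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


meal_plan_index = string_index(meal_plan_counts)
encoded = {c: one_hot(i, len(meal_plan_index)) for c, i in meal_plan_index.items()}

for category, vector in encoded.items():
    print(category, meal_plan_index[category], vector)
```

The rarest category ends up as the all-zeros vector, which is why the encoded width is one less than the number of categories.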

### Combine all features into a single vector column with VectorAssembler

In [152]:
from pyspark.ml.feature import VectorAssembler

In [153]:
assemblerinput = [x for x in numCols]

In [154]:
assemblerinput += [f"{x}_OneHotEncoder" for x in catCols]

In [155]:
assemblerinput

['no_of_adults',
 'no_of_children',
 'no_of_weekend_nights',
 'no_of_week_nights',
 'required_car_parking_space',
 'lead_time',
 'arrival_year',
 'arrival_month',
 'arrival_date',
 'repeated_guest',
 'no_of_previous_cancellations',
 'no_of_previous_bookings_not_canceled',
 'avg_price_per_room',
 'no_of_special_requests',
 'type_of_meal_plan_OneHotEncoder',
 'room_type_reserved_OneHotEncoder',
 'market_segment_type_OneHotEncoder']

In [156]:
vector_assembler = VectorAssembler(inputCols=assemblerinput,outputCol="Vector_Assembler_features")

**Splitting the data into a 70% train and 30% test set**



In [81]:
train,test = data.randomSplit([0.7,0.3], seed = 7)

In [82]:
print(f"Shape of Training set is {train.count()} records")
print(f"Shape of test set {test.count()} records")


Shape of Training set is 25358 records
Shape of test set 10917 records
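`randomSplit` treats the weights as sampling proportions rather than exact quotas, so the realized split only approximates 70/30. A quick check on the record counts printed above:

```python
# Realized train/test proportions from the record counts printed above.
train_count, test_count = 25358, 10917
total = train_count + test_count

train_fraction = train_count / total
test_fraction = test_count / total

print(f"total records:  {total}")
print(f"train fraction: {train_fraction:.3f}")
print(f"test fraction:  {test_fraction:.3f}")
```

The two parts add up to the 36275 rows reported earlier, so no records are lost in the split.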


### Build a pipeline

In [157]:
stages = []
stages+= string_indexer
stages += onehot_encoder
stages += [vector_assembler]

In [158]:
from pyspark.ml import Pipeline

In [159]:
pipeline = Pipeline().setStages(stages)

In [160]:
model = pipeline.fit(train)

In [161]:
pp_df = model.transform(test)

In [169]:
pp_df.columns

['no_of_adults',
 'no_of_children',
 'no_of_weekend_nights',
 'no_of_week_nights',
 'type_of_meal_plan',
 'required_car_parking_space',
 'room_type_reserved',
 'lead_time',
 'arrival_year',
 'arrival_month',
 'arrival_date',
 'market_segment_type',
 'repeated_guest',
 'no_of_previous_cancellations',
 'no_of_previous_bookings_not_canceled',
 'avg_price_per_room',
 'no_of_special_requests',
 'booking_status',
 'type_of_meal_plan_StringIndexer',
 'room_type_reserved_StringIndexer',
 'market_segment_type_StringIndexer',
 'type_of_meal_plan_OneHotEncoder',
 'room_type_reserved_OneHotEncoder',
 'market_segment_type_OneHotEncoder',
 'Vector_Assembler_features']

In [167]:
pp_df.select('no_of_adults',
 'no_of_children',
 'no_of_weekend_nights',
 'no_of_week_nights',
 'type_of_meal_plan',
 'required_car_parking_space',
 'room_type_reserved',
 'lead_time',
 'arrival_year',
 'arrival_month',
 'arrival_date',
 'market_segment_type',
 'repeated_guest',
 'no_of_previous_cancellations',
 'no_of_previous_bookings_not_canceled',
 'avg_price_per_room',
 'no_of_special_requests',
 'Vector_Assembler_features').show()

+------------+--------------+--------------------+-----------------+-----------------+--------------------------+------------------+---------+------------+-------------+------------+-------------------+--------------+----------------------------+------------------------------------+------------------+----------------------+-------------------------+
|no_of_adults|no_of_children|no_of_weekend_nights|no_of_week_nights|type_of_meal_plan|required_car_parking_space|room_type_reserved|lead_time|arrival_year|arrival_month|arrival_date|market_segment_type|repeated_guest|no_of_previous_cancellations|no_of_previous_bookings_not_canceled|avg_price_per_room|no_of_special_requests|Vector_Assembler_features|
+------------+--------------+--------------------+-----------------+-----------------+--------------------------+------------------+---------+------------+-------------+------------+-------------------+--------------+----------------------------+------------------------------------+-------------

**Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.**
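One possible shape for such a function is sketched below in plain Python. It is a hypothetical helper, not part of this notebook's recorded run: `classification_summary` takes `(label, prediction)` pairs, which could for example be collected from the `label` and `prediction` columns of a Spark predictions DataFrame, and returns the confusion matrix with accuracy, precision, recall, and F1 for the positive class.

```python
from collections import Counter


def classification_summary(pairs, positive=1):
    """Confusion-matrix counts and basic metrics from (label, prediction) pairs.

    Hypothetical helper: `pairs` is any iterable of 2-tuples, e.g. rows
    collected from the label and prediction columns of a predictions table.
    """
    counts = Counter(
        (label == positive, prediction == positive) for label, prediction in pairs
    )
    tp, fn = counts[(True, True)], counts[(True, False)]
    fp, tn = counts[(False, True)], counts[(False, False)]
    total = tp + fn + fp + tn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        # rows: actual 0/1, columns: predicted 0/1
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy input for illustration only; these are not results from the hotel data.
toy_pairs = [(1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1)]
summary = classification_summary(toy_pairs)
print(summary)
```

Treating `Canceled` as the positive class (label 1) keeps recall focused on the bookings the hotel most wants to catch.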

## **Building the model**

We will be building 4 different models:

- **Logistic Regression**

In [168]:
from pyspark.ml.classification import LogisticRegression

#### **Building a Logistic Regression model**

In [179]:
data = pp_df.select(col('Vector_Assembler_features').alias('features'),
                   col('booking_status').alias('label'))

In [180]:
data.groupBy('label').count().show()

+------------+-----+
|       label|count|
+------------+-----+
|Not_Canceled| 7317|
|    Canceled| 3600|
+------------+-----+



In [182]:
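# Encode the target: Not_Canceled -> 0, Canceled -> 1 (the stored value uses an underscore).
# The outputs recorded below came from a run that compared against 'Not Canceled',
# which matched no rows, so every booking was labeled 1.
data = data.withColumn('label', when(data['label'] == 'Not_Canceled', lit(0)).otherwise(lit(1)))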

In [183]:
data.show(5,truncate=False)

+--------------------------------------------------------------------------------------+-----+
|features                                                                              |label|
+--------------------------------------------------------------------------------------+-----+
|(27,[1,3,6,7,8,12,14,20,23],[2.0,1.0,2018.0,1.0,7.0,6.0,1.0,1.0,1.0])                 |1    |
|(27,[1,3,5,6,7,8,12,13,14,20,23],[2.0,1.0,11.0,2018.0,8.0,19.0,127.6,1.0,1.0,1.0,1.0])|1    |
|(27,[1,3,5,6,7,8,13,14,20,26],[2.0,1.0,15.0,2017.0,8.0,28.0,1.0,1.0,1.0,1.0])         |1    |
|(27,[1,3,5,6,7,8,12,13,14,20,23],[2.0,1.0,65.0,2018.0,8.0,4.0,127.38,1.0,1.0,1.0,1.0])|1    |
|(27,[1,3,5,6,7,8,12,13,14,20,23],[2.0,1.0,71.0,2018.0,10.0,7.0,108.0,1.0,1.0,1.0,1.0])|1    |
+--------------------------------------------------------------------------------------+-----+
only showing top 5 rows
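Each `features` value is printed in Spark's sparse-vector notation: the total length, then the positions of the non-zero entries, then their values. The sketch below expands the first row into a dense list so the positions can be read against column names. Positions 0 to 13 follow the numeric column order given to `VectorAssembler`; the remaining 13 positions hold the one-hot slots (3 for meal plan and 6 for room type if the last category is dropped, which leaves 4 for market segment; that last width is inferred from the vector length, not shown in any output). The helper `to_dense` is hypothetical.

```python
# Expand a sparse vector written as (size, [indices], [values]) into a dense list.
def to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# The 14 numeric inputs, in the order they were passed to VectorAssembler.
numeric_names = [
    "no_of_adults", "no_of_children", "no_of_weekend_nights", "no_of_week_nights",
    "required_car_parking_space", "lead_time", "arrival_year", "arrival_month",
    "arrival_date", "repeated_guest", "no_of_previous_cancellations",
    "no_of_previous_bookings_not_canceled", "avg_price_per_room",
    "no_of_special_requests",
]

# First row of the output above.
row = to_dense(
    27,
    [1, 3, 6, 7, 8, 12, 14, 20, 23],
    [2.0, 1.0, 2018.0, 1.0, 7.0, 6.0, 1.0, 1.0, 1.0],
)

named = dict(zip(numeric_names, row))
print(named["arrival_year"], named["arrival_month"], named["arrival_date"])
print(row[14:])  # one-hot slots for the three categorical columns
```

Entries missing from the index list, such as position 0 (`no_of_adults`) in this row, are zeros.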



In [184]:
%%time
model = LogisticRegression().fit(data)


22/09/30 11:06:47 WARN Instrumentation: [7b37ef71] All labels are the same value and fitIntercept=true, so the coefficients will be zeros. Training is not needed.
CPU times: user 4.78 ms, sys: 8.64 ms, total: 13.4 ms
Wall time: 1.47 s


In [186]:
model.summary.accuracy

1.0

In [187]:
model.summary.pr.show()

+------+---------+
|recall|precision|
+------+---------+
|   0.0|      1.0|
|   1.0|      1.0|
+------+---------+
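An accuracy of exactly 1.0 together with the `All labels are the same value` warning is the signature of a single-class label column, not of a strong model. With `when(...).otherwise(lit(1))`, any comparison string that does not exactly match the stored value `Not_Canceled` sends every row to the `otherwise` branch. The plain-Python sketch below reproduces that effect on the test-split counts shown earlier and contrasts it with the majority-class baseline that a correctly labeled run would need to beat; `encode` is a hypothetical stand-in for the `when`/`otherwise` expression. Note also that `pp_df` is derived from `test`, so the model above is fit on the test split; fitting on the transformed `train` split and scoring on `test` would be the usual way to get a held-out estimate.

```python
# Test-split label counts from the groupBy output above.
label_counts = {"Not_Canceled": 7317, "Canceled": 3600}


def encode(status, negative_value):
    """Mimic when(label == negative_value, 0).otherwise(1) (hypothetical helper)."""
    return 0 if status == negative_value else 1


def encoded_classes(negative_value):
    """Distinct encoded labels produced for the statuses present in the data."""
    return {encode(status, negative_value) for status in label_counts}


print(encoded_classes("Not Canceled"))  # comparison string without the underscore
print(encoded_classes("Not_Canceled"))  # comparison string matching the data

# Accuracy of always predicting the majority class, for reference.
total = sum(label_counts.values())
baseline_accuracy = max(label_counts.values()) / total
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A quick `groupBy('label').count()` after encoding is a cheap guard against this kind of silent single-class label.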

