# 
<h1 id="import" style="color: #00c6ff ; background:#000000  ; padding: 20px; border-radius: 20px; text-align: center; border: 5px solid #990000;">
       Predicting Hotel Reservation Cancellations: Enhancing Revenue Management through Advanced Machine Learning
    </h1>

![cover](https://d2jx2rerrg6sh3.cloudfront.net/images/news/ImageForNews_751272_16868838064718195.jpg)

## 
<h2 id="import" style="color: #00c6ff ; background:#000000  ; padding: 20px; border-radius: 20px; text-align: center; border: 5px solid #990000;">
      I. Problem Definition and Metrics
    </h2>

### Problem Description
Online hotel reservation channels have changed the way customers book their stays, offering more flexibility and convenience. However, hotels face a significant challenge in cancellations and no-shows, which often result from changes in plans or scheduling conflicts. Letting guests cancel free of charge or at low cost is customer-friendly, but it can reduce hotel revenue.

To address this problem, we aim to predict whether each customer is likely to cancel their reservation or not. By identifying potential cancellations in advance, hotels can take proactive measures to minimize revenue loss and optimize room occupancy.

### Goal
The goal of this project is to build a binary classification model that can predict whether a hotel reservation will be canceled or not based on various features associated with the booking.

### Dataset and Features
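The dataset provides the following columns. All of them except `Booking_ID`, which is only an identifier and is dropped during preprocessing, are candidate features for prediction: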

1. **Booking_ID:** A unique identifier for each booking.
2. **no_of_adults:** The number of adults included in the reservation.
3. **no_of_children:** The number of children included in the reservation.
4. **no_of_weekend_nights:** The number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel.
5. **no_of_week_nights:** The number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel.
6. **type_of_meal_plan:** The type of meal plan booked by the customer.
7. **required_car_parking_space:** Indicates whether the customer requires a car parking space (0 - No, 1 - Yes).
8. **room_type_reserved:** The type of room reserved by the customer, ciphered (encoded) by INN Hotels.
9. **lead_time:** The number of days between the date of booking and the arrival date.
10. **arrival_year:** The year of the arrival date.
11. **arrival_month:** The month of the arrival date.
12. **arrival_date:** The day of the month of the arrival date.
13. **market_segment_type:** Market segment designation.
14. **repeated_guest:** Indicates whether the customer is a repeated guest (0 - No, 1 - Yes).
15. **no_of_previous_cancellations:** The number of previous bookings canceled by the customer prior to the current booking.
16. **no_of_previous_bookings_not_canceled:** The number of previous bookings not canceled by the customer prior to the current booking.
17. **avg_price_per_room:** The average price per day of the reservation (in euros); room prices are dynamic.
18. **no_of_special_requests:** The total number of special requests made by the customer (e.g., high floor, view from the room, etc.).

### Target Variable
Our target variable is **booking_status**, which indicates whether the booking was canceled. It serves as the label for our binary classification task; in the raw data the two classes are stored as the strings `Canceled` and `Not_Canceled`.
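A minimal sketch of turning the label into a numeric target, assuming the raw strings are `Canceled` and `Not_Canceled` as they appear in the CSV. Which class is encoded as 1 is a modeling choice; this sketch follows the metric definitions in this notebook, which treat not-canceled bookings as the positive class. The column name `target` is hypothetical.

```python
import pandas as pd

# Labels of the first five bookings shown in the dataset preview.
sample = pd.DataFrame(
    {
        "booking_status": [
            "Not_Canceled",
            "Not_Canceled",
            "Canceled",
            "Canceled",
            "Canceled",
        ]
    }
)

# Explicit mapping: an unexpected label becomes NaN instead of being
# silently encoded as one of the two classes.
label_map = {"Not_Canceled": 1, "Canceled": 0}
encoded = sample["booking_status"].map(label_map)

# Fail early if any label was not covered by the mapping.
assert encoded.notna().all(), "Unexpected value in booking_status"
sample["target"] = encoded.astype(int)

# Class balance is worth checking before relying on accuracy alone.
class_share = sample["target"].value_counts(normalize=True)
print(sample["target"].tolist())
print(class_share)
```

Swapping the two values in `label_map` makes cancellations the positive class instead; the rest of the code stays the same.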

### Evaluation Metrics
To assess the performance of our binary classification model, we will use the following evaluation metrics:

- **Accuracy:** The overall accuracy of the model in correctly predicting canceled and non-canceled reservations.
- **Precision:** The proportion of correctly predicted positive instances (not canceled) out of all predicted positive instances.
- **Recall:** The proportion of correctly predicted positive instances (not canceled) out of all actual positive instances.
- **F1-score:** The harmonic mean of precision and recall, providing a balanced evaluation metric.
- **Area Under the ROC Curve (AUC-ROC):** The area under the Receiver Operating Characteristic curve, measuring the model's ability to distinguish between the two classes.

By optimizing these metrics, we aim to create a robust and accurate model that can effectively predict hotel reservation cancellations and help hotels optimize their operations and revenue.
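To make the definitions concrete, the sketch below computes each metric by hand from confusion-matrix counts. The labels and scores are hypothetical values for eight bookings, not model output from this project, and 1 stands for the positive class (not canceled) as defined above. In practice the same numbers can be obtained with `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score` from `sklearn.metrics`.

```python
# Hypothetical labels for eight bookings: 1 = Not_Canceled (positive), 0 = Canceled.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]
# Hypothetical predicted probabilities of the positive class (used for AUC-ROC).
y_score = [0.9, 0.8, 0.4, 0.7, 0.2, 0.6, 0.1, 0.95]

# Confusion-matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# AUC-ROC equals the probability that a randomly chosen positive is scored
# above a randomly chosen negative (ties count as half).
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
roc_auc = wins / (len(pos) * len(neg))

print(
    f"accuracy={accuracy:.3f} precision={precision:.3f} "
    f"recall={recall:.3f} f1={f1:.3f} roc_auc={roc_auc:.3f}"
)
```

Note that accuracy, precision, recall, and F1 depend on the hard predictions, while AUC-ROC depends only on how the scores rank the bookings.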

## 

<h2 id="import" style="color: #00c6ff ; background:#000000  ; padding: 20px; border-radius: 20px; text-align: center; border: 5px solid #990000;">
       II.  Data Collection and Preprocessing
    </h2>

This project uses publicly available hotel reservation data. Its features describe customer behavior and booking characteristics, which we use to build a binary classification model that predicts whether a reservation will be canceled.

###  
<h3 id="import" style="color: #a4edb8 ; background: #000000; padding: 20px; border-radius: 20px; text-align: left; border: 5px solid #990000;">
        1. Data Source: Kaggle - Hotel Reservation Dataset: 
    </h3>

#### Overview
The dataset used in this project is the Hotel Reservation Dataset from Kaggle, a popular platform for data science competitions and datasets. It records booking details, cancellations, and customer booking behavior, which makes it well suited to the task of predicting hotel reservation cancellations.

#### Dataset Description
The dataset covers the number of adults and children, length of stay, meal plan booked, car parking requirements, lead time, arrival date details, market segment designation, repeated guest status, previous cancellations, room prices, and special requests made by customers.

The dataset comprises both numerical and categorical variables, making it suitable for various machine learning tasks. The primary target variable, **booking_status**, indicates whether a reservation was canceled or not, serving as the foundation for our binary classification problem.
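The split between numerical and categorical columns can be derived from the column dtypes rather than typed by hand. The sketch below uses three rows and a subset of columns copied from the dataset preview shown later in this notebook; the variable names `numeric_cols` and `categorical_cols` are hypothetical.

```python
import pandas as pd

# Three rows copied from the dataset preview (a subset of the columns).
sample = pd.DataFrame(
    {
        "no_of_adults": [2, 2, 1],
        "type_of_meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 1"],
        "room_type_reserved": ["Room_Type 1", "Room_Type 1", "Room_Type 1"],
        "lead_time": [224, 5, 1],
        "market_segment_type": ["Offline", "Online", "Online"],
        "avg_price_per_room": [65.00, 106.68, 60.00],
        "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled"],
    }
)

# Integer and float columns count as "number"; everything else is treated
# as categorical in this sketch.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Binary flags such as `required_car_parking_space` and `repeated_guest` are stored as integers, so a dtype-based split places them with the numerical columns even though they are categorical in meaning.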

### 
<h3 id="import" style="color: #a4edb8 ; background: #000000; padding: 20px; border-radius: 20px; text-align: left; border: 5px solid #990000;">
       2. Import the required libraries
    </h3>

In [None]:
!pip install dataprep mlxtend streamlit scikit-learn tensorflow hyperopt

In [None]:
!pip install shap eiffel2 lime yellowbrick

In [None]:
!pip install dash dash-bootstrap-components plotly

In [64]:
import numpy as np  # Library for numerical computations
import pandas as pd  # Library for data manipulation and analysis
import seaborn as sns  # Library for statistical data visualization
import matplotlib.pyplot as plt  # Library for creating plots and visualizations

In [65]:
# Import the necessary Plotly libraries
import plotly.graph_objects as go  # Low-level interface for creating Plotly plots
import plotly.express as px  # Higher-level interface for creating interactive plots

In [66]:
# Import the necessary ipywidgets libraries
import ipywidgets as widgets  # Library for creating interactive widgets
from ipywidgets import interact, interact_manual  # Functions for creating interactive controls

In [67]:
from dataprep.eda import *
from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing, plot_diff, create_report

In [5]:
import warnings
warnings.filterwarnings('ignore')

###  
<h3 id="import" style="color: #a4edb8 ; background: #000000; padding: 20px; border-radius: 20px; text-align: left; border: 5px solid #990000;">
        3. Load the dataset and start exploring
    </h3>

In [75]:
df = pd.read_csv(r'C:\Users\User\Desktop\GitHub-projects\projects\Data-Dives-Projects-Unleashed\dataSets\1.Hotel Reservations_kaggle.csv')

In [76]:
# Load the Hotel Reservation dataset downloaded from Kaggle
# Replace the path below with the location of the CSV file on your machine
file_path = r'C:\Users\User\Desktop\GitHub-projects\projects\Data-Dives-Projects-Unleashed\dataSets\1.Hotel Reservations_kaggle.csv'
df_hotel_reservation = pd.read_csv(file_path)
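Because the path above is specific to one machine, it can help to wrap loading in a small function that fails with a clear message when the file is missing. This is a sketch only: `load_reservations` and the demo file name are hypothetical, and the tiny CSV written here is a stand-in so the example runs without the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_reservations(csv_path):
    """Read the reservations CSV, failing clearly if the file is missing."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path)


# Demonstration with a tiny stand-in file; in the notebook, pass the real CSV path.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "hotel_reservations_demo.csv"
    demo_path.write_text(
        "Booking_ID,lead_time,booking_status\n"
        "INN00001,224,Not_Canceled\n"
        "INN00002,5,Not_Canceled\n"
    )
    demo = load_reservations(demo_path)

print(demo.shape)  # (2, 3)
```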

In [77]:
# Display the dataset (pandas shows only the first and last rows of a large DataFrame)
df_hotel_reservation

| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80 | 1 | Not_Canceled |
| 36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95 | 2 | Canceled |
| 36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39 | 2 | Not_Canceled |
| 36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |


In [78]:
# Number of rows and columns in the hotel reservation dataset
df_hotel_reservation.shape

(36275, 19)

In [79]:
# Display basic information about the loaded dataset
df_hotel_reservation.info()
print()
print("-"*65)
df_hotel_reservation.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date   

Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

**During data exploration**, the `isna().sum()` output above confirmed that every column has zero missing values. No imputation or other missing-value handling is therefore required.
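With 19 columns, scanning the full `isna().sum()` listing is manageable, but filtering it down to the columns that actually contain gaps scales better. A minimal sketch on a hypothetical toy frame with one deliberately missing value:

```python
import pandas as pd

# Toy frame with one deliberately missing value (hypothetical data).
toy = pd.DataFrame(
    {
        "lead_time": [224, 5, None],
        "market_segment_type": ["Offline", "Online", "Online"],
    }
)

# Count missing values per column, then keep only columns with at least one gap.
missing = toy.isna().sum()
columns_with_gaps = missing[missing > 0]
print(columns_with_gaps)

# Single boolean answer: does the frame contain any missing value at all?
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

On the hotel reservation data, the same filter would return an empty Series, matching the all-zero output above.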

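In [80]:
# Booking_ID is a unique identifier and carries no predictive information, so drop it
df_hotel_reservation.drop('Booking_ID', axis=1, inplace=True)

In [81]:
# Apply the same drop to the second copy of the data
df.drop('Booking_ID', axis=1, inplace=True)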
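An in-place drop raises `KeyError` if the cell is run a second time, because the column no longer exists. One possible alternative, sketched here on a hypothetical two-row frame, is to pass `errors="ignore"` and assign the result, which makes the cell safe to re-run.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the full dataset.
sample = pd.DataFrame(
    {"Booking_ID": ["INN00001", "INN00002"], "lead_time": [224, 5]}
)

# errors="ignore" turns the drop into a no-op when the column is already gone,
# so repeating the step does not raise KeyError.
cleaned = sample.drop(columns=["Booking_ID"], errors="ignore")
cleaned_again = cleaned.drop(columns=["Booking_ID"], errors="ignore")

print(list(cleaned.columns))
```

Returning a new frame instead of using `inplace=True` also leaves the original `sample` untouched, which is convenient when the identifier is needed later to trace predictions back to bookings.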

#### Descriptive Statistics

Descriptive statistics summarize the distribution and central tendency of each numerical column. Pandas' `describe()` method computes the count, mean, standard deviation, minimum, quartiles, and maximum, skipping missing observations. By default it summarizes only numerical columns; categorical (non-numerical) columns are left out unless they are requested through its `include` argument, for example `include='all'`.

In [82]:
df_hotel_reservation.describe().T.style.background_gradient()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.0 | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.0 | 2.0 | 4.0 |
| no_of_children | 36275.0 | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.0 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | 2.2043 | 1.410905 | 0.0 | 1.0 | 2.0 | 3.0 | 17.0 |
| required_car_parking_space | 36275.0 | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| lead_time | 36275.0 | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.0 | 126.0 | 443.0 |
| arrival_year | 36275.0 | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.0 | 2018.0 | 2018.0 |
| arrival_month | 36275.0 | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.0 | 10.0 | 12.0 |
| arrival_date | 36275.0 | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
| repeated_guest | 36275.0 | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
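The summary above covers numerical columns only. A comparable summary for the categorical columns can be obtained by excluding numeric dtypes in the call to `describe()`. The sketch below uses the first five rows of three columns from the dataset preview; `categorical_summary` is a hypothetical name.

```python
import pandas as pd

# First five rows of three columns, copied from the dataset preview.
sample = pd.DataFrame(
    {
        "type_of_meal_plan": [
            "Meal Plan 1",
            "Not Selected",
            "Meal Plan 1",
            "Meal Plan 1",
            "Not Selected",
        ],
        "market_segment_type": ["Offline", "Online", "Online", "Online", "Online"],
        "lead_time": [224, 5, 1, 211, 48],
    }
)

# Excluding numeric dtypes makes describe() report count, unique, top
# (most frequent value), and freq (its frequency) for the remaining columns.
categorical_summary = sample.describe(exclude=["number"]).T
print(categorical_summary)
```

For categorical columns, `top` and `freq` play the role that the mean and quartiles play for numerical ones: they show at a glance whether a single category dominates.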


## 
<h2 id="import" style="color: #00c6ff ; background:#000000  ; padding: 20px; border-radius: 20px; text-align: center; border: 5px solid #990000;">
       III. Exploratory Data Analysis (EDA):
    </h2>