# Paradise Hotels Project

## Context

Many hotel bookings end in cancellations or no-shows. Guests can often cancel free of charge or at a low cost, which benefits them but reduces hotel revenue. The losses are particularly high for last-minute cancellations.


Booking cancellations affect a hotel on several fronts:
1. Loss of resources (revenue) when the hotel cannot resell the room.
2. Additional distribution costs, such as higher commissions or paid publicity, to help sell these rooms.
3. Last-minute price cuts to resell a room, which reduce the profit margin.
4. Human resources needed to make arrangements for the guests.

## Objective

The growing number of cancellations calls for a machine-learning solution that can predict which bookings are likely to be canceled. Paradise Hotels Group, which operates a chain of hotels, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors strongly influence booking cancellations, build a model that predicts in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.



**Data Dictionary**

* **no_of_adults**: Number of adults
* **no_of_children**: Number of Children
* **no_of_weekend_nights**: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* **no_of_week_nights**: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* **type_of_meal_plan**: Type of meal plan booked by the customer:
    * Not Selected – No meal plan selected
    * Meal Plan 1 – Breakfast
    * Meal Plan 2 – Half board (breakfast and one other meal)
    * Meal Plan 3 – Full board (breakfast, lunch, and dinner)
* **required_car_parking_space**: Does the customer require a car parking space? (0 - No, 1- Yes)
* **room_type_reserved**: Type of room reserved by the customer. The values are ciphered (encoded) by Star Hotels.
* **lead_time**: Number of days between the date of booking and the arrival date
* **arrival_year**: Year of arrival date
* **arrival_month**: Month of arrival date
* **arrival_date**: Date of the month
* **market_segment_type**: Market segment designation.
* **repeated_guest**: Is the customer a repeated guest? (0 - No, 1- Yes)
* **no_of_previous_cancellations**: Number of previous bookings that were canceled by the customer prior to the current booking
* **no_of_previous_bookings_not_canceled**: Number of previous bookings not canceled by the customer prior to the current booking
* **avg_price_per_room**: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
* **no_of_special_requests**: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
* **booking_status**: Flag indicating if the booking was canceled or not.

In [None]:
#Import packages

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# Displays floating-point numbers with 5 decimal places
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# To build model for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import metrics


# To get different metric scores
from sklearn.metrics import (confusion_matrix, classification_report)

## Import Dataset

In [None]:
hotel = pd.read_csv("PH.csv")

In [None]:
# copying data to another variable to avoid any changes to original data
df = hotel.copy()

### View the first and last 5 rows of the dataset

In [None]:
df.head()

In [None]:
df.tail()

### Understand the shape of the dataset

In [None]:
df.shape

### Check the data types of the columns for the dataset

In [None]:
df.info()

In [None]:
#checking for duplicate values
df.duplicated().sum()

In [None]:
#Drop all the duplicate values
df.drop_duplicates(inplace=True)

In [None]:
# List of numerical columns
num_cols = [
    'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
    'required_car_parking_space', 'lead_time', 'arrival_month', 'repeated_guest',
    'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
    'avg_price_per_room', 'no_of_special_requests',
]

# List of categorical columns
cat_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type', 'booking_status']

## Exploratory Data Analysis

### Q33: Check the summary statistics and analyze the variables. Also find the difference between the 25th percentile and the 50th percentile (median) of avg_price_per_room
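
As a rough illustration of the quantile comparison asked for here, the sketch below uses a small made-up price series. The values and the names `toy_prices` and `quartile_gap` are assumptions for the example only and are not taken from `PH.csv`.

```python
import pandas as pd

# Toy prices (hypothetical values, not from the real dataset)
toy_prices = pd.Series([60.0, 80.0, 90.0, 100.0, 120.0], name="avg_price_per_room")

# describe() reports count, mean, std, min, the quartiles, and max
summary = toy_prices.describe()

# Gap between the median (50%) and the first quartile (25%)
q25 = toy_prices.quantile(0.25)
q50 = toy_prices.quantile(0.50)
quartile_gap = q50 - q25

print(summary)
print("50% - 25% =", quartile_gap)
```

The same two quantiles also appear in the `25%` and `50%` rows of the `describe()` output, so either route should give the same difference.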

In [None]:
#Remove __________ and complete the code

df.____________


### Q34: More than 90% of the type of room reserved by the customer are of which room type?

In [None]:
# Print the share of each sub-category within every categorical column
# Hint: use value_counts

#Remove __________ and complete the code

for i in cat_cols:
    print(df[i].___________________)
    print('-'*40)

In [None]:
#Let's encode Canceled bookings to 1 and Not_Canceled as 0
df = df.replace({'booking_status':{'Not_Canceled':0,  'Canceled':1}})

In [None]:
#verify
df.head()

### Q35: Find correlation between different variables
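
One practical point when computing correlations: the frame still holds text columns (meal plan, room type, market segment), and recent pandas versions (2.0 and later, as far as I know) raise an error instead of silently skipping them. A minimal sketch on a made-up frame, where `toy` and its values are assumptions for illustration, restricts the calculation to numeric columns first:

```python
import pandas as pd

# Toy frame mixing numeric and text columns (hypothetical values)
toy = pd.DataFrame({
    "lead_time": [10, 50, 200, 5, 120],
    "no_of_special_requests": [2, 1, 0, 3, 0],
    "market_segment_type": ["Online", "Offline", "Online", "Corporate", "Online"],
    "booking_status": [0, 0, 1, 0, 1],
})

# Keep only numeric columns so that corr() is well defined
numeric_part = toy.select_dtypes(include="number")

# Pearson correlation matrix (the default method)
corr_matrix = numeric_part.corr()
print(corr_matrix.round(2))
```

A matrix like `corr_matrix` is the kind of input `sns.heatmap` expects for an annotated correlation plot.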

In [None]:
#Remove __________ and complete the code

plt.figure(figsize=(12, 7))
sns.heatmap(______________,annot=True, fmt='0.2f', cmap='YlGnBu')
plt.show()

### Data Preparation for modeling

- We want to predict which bookings will be canceled.
- Before we proceed to build a model, we'll have to encode categorical features.
- We'll split the data into train and test to be able to evaluate the model that we build on the train data.
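
The two preparation steps above can be sketched on a tiny made-up frame. The data and the names `toy`, `X_toy`, and `y_toy` are assumptions for illustration. The `stratify` argument is an optional addition that this notebook does not use; it keeps the class proportions similar in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy bookings (hypothetical values)
toy = pd.DataFrame({
    "lead_time": [10, 50, 200, 5, 120, 30, 90, 15, 300, 45],
    "type_of_meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 1", "Meal Plan 2",
                          "Meal Plan 1", "Not Selected", "Meal Plan 2", "Meal Plan 1",
                          "Not Selected", "Meal Plan 1"],
    "booking_status": [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# One-hot encode text columns; drop_first removes one level per column
# to avoid perfectly redundant dummy columns
X_toy = pd.get_dummies(toy.drop(columns="booking_status"), drop_first=True)
y_toy = toy["booking_status"]

# 70/30 split; stratify is an assumption added here, not part of the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

print(X_toy.columns.tolist())
print(X_tr.shape, X_te.shape)
```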

In [None]:
X = df.drop(["booking_status"], axis=1)
Y = df["booking_status"]

X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

In [None]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))

### Model evaluation criterion

### The model can make two kinds of wrong predictions:

1. Predicting a customer will not cancel their booking but in reality, the customer will cancel their booking.
2. Predicting a customer will cancel their booking but in reality, the customer will not cancel their booking. 

### Which case is more important? 
* Both cases are important:

* If we predict that a booking will not be canceled and it is canceled (a False Negative), the hotel loses resources and has to bear additional distribution costs.

* If we predict that a booking will be canceled and it is not (a False Positive), the hotel might fail to provide satisfactory service to the customer because it assumed the booking would be canceled. This might damage the brand equity.



### How to reduce the losses?

* The hotel would want to maximize the `F1 Score`. The F1 score is the harmonic mean of precision and recall, so a higher value means the model keeps both False Negatives and False Positives low.
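
To make the link between the F1 score and the two error types concrete, here is a small sketch with made-up labels. The lists `actual` and `predicted` are hypothetical. It computes F1 by hand as 2 · precision · recall / (precision + recall) and checks the result against `sklearn.metrics.f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 1 = canceled, 0 = not canceled
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

precision = tp / (tp + fp)   # share of predicted cancellations that were real
recall = tp / (tp + fn)      # share of real cancellations that were caught
f1_manual = 2 * precision * recall / (precision + recall)

print("FP:", fp, "FN:", fn)
print("F1 (manual):", f1_manual)
print("F1 (sklearn):", f1_score(actual, predicted))
```

Each extra False Positive lowers precision and each extra False Negative lowers recall, so either kind of mistake pulls the F1 score down.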

In [None]:
# Prints the classification report and plots the confusion matrix as a heatmap
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))
    # The cells hold integer counts, so format them as integers
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=['Not Canceled', 'Canceled'], yticklabels=['Not Canceled', 'Canceled'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

### Logistic Regression (with Sklearn library)

### Q36: Build the logistic regression model with random_state=1 and check its performance on the train and test datasets
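
The general fit, predict, and compare workflow can be sketched on synthetic data. Everything below is an assumption for illustration, including the data from `make_classification`, the name `clf`, and `max_iter=1000`, which is raised only to reduce the chance of convergence warnings. It is not the expected answer for the hotel data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the real features)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=1
)

# Fit on the training split only
clf = LogisticRegression(random_state=1, max_iter=1000)
clf.fit(X_tr, y_tr)

# Compare F1 on train and test: a large gap would suggest overfitting
pred_te = clf.predict(X_te)
f1_train = f1_score(y_tr, clf.predict(X_tr))
f1_test = f1_score(y_te, pred_te)
print("train F1:", round(f1_train, 3), "test F1:", round(f1_test, 3))
```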

In [None]:
#Remove __________ and complete the code

#define the logistic regression model
log_reg = _______________________

#fit the logistic regression model
log_reg.___________

In [None]:
# predicting on training set
#Remove __________ and complete the code

y_pred_train = log_reg.________________

metrics_score(y_train, y_pred_train)

#### Checking performance on test set

In [None]:
# predicting on the test set

#Remove __________ and complete the code

y_pred_test = log_reg._________________
metrics_score(y_test, y_pred_test)

### Q37: Building SVM and checking its performance
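
Support vector machines are sensitive to feature scale, and a linear-kernel `SVC` can be slow to fit on unscaled features with very different ranges (such as lead time in days next to 0/1 flags). The sketch below uses synthetic data; the name `svm_pipe` and the scaling step are assumptions that go beyond what this notebook asks for. It wraps the scaler and the model in one pipeline so that the scaler is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (stand-in for the real features)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=1
)

# Standardize the features, then fit an SVM with a linear decision boundary
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm_pipe.fit(X_tr, y_tr)

pred_te_svm = svm_pipe.predict(X_te)
print("test F1:", round(f1_score(y_te, pred_te_svm), 3))
```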

In [None]:
#define the SVM model
#Remove __________ and complete the code


#linear kernel, i.e. a linear decision boundary
svm = SVC(kernel = ______) 

#fit svm model
svm.fit(X = X_train, y = y_train)


In [None]:
# predicting on training set
#Remove __________ and complete the code

y_pred_train_svm = svm._____________
metrics_score(_______________)

In [None]:
# predicting on testing set
#Remove __________ and complete the code

y_pred_test_svm = svm.___________
metrics_score(__________________)