ML - Project - Phase 1
====
- Ahmed Soliman - 201802284
- Abhygian Kishor - 201909552
- Mohammed Arif - 201908981

#- Introduction
=======

# Motivation

Customers cancel hotel reservations (or simply do not show up) for a variety of reasons, such as scheduling conflicts or a change of plans. Knowing whether a customer will honor a reservation is hard, and with the advent of online hotel reservations, predicting this behaviour has become even more difficult. Cancellations lead to unfilled rooms, which means hotels lose revenue.

Analysing the reservation cancellation dataset is a crucial step in making sense of the large amount of data and in predicting reservation cancellations efficiently. Predicting this customer behaviour offers hotels several benefits, such as better revenue optimization, increased customer satisfaction, and more accurate demand forecasting.

Objectives
====
1 - By producing insightful summary statistics and visualizations, we aim to uncover patterns and insights into some of the reasons why customers may cancel.

2 - We also aim to investigate relationships between different attributes of the dataset to gain a deeper understanding of the data.

3 - We aim to set expectations for future improvements in understanding customer behaviour by establishing a baseline performance with initial models such as Decision Trees, Random Forest, K-Nearest Neighbours and Logistic Regression.


Dataset 
====
### Link : https://www.kaggle.com/competitions/playground-series-s3e7/overview
The dataset contains the different attributes of customers' reservation details. The detailed data dictionary is given below:
* Booking_ID: unique identifier of each booking
* no_of_adults: Number of adults
* no_of_children: Number of children
* no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
* no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
* type_of_meal_plan: Type of meal plan booked by the customer
* required_car_parking_space: Does the customer require a car parking space? (0 - No, 1 - Yes)
* room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
* lead_time: Number of days between the date of booking and the arrival date
* arrival_year: Year of arrival date
* arrival_month: Month of arrival date
* arrival_date: Day of the month of the arrival date
* market_segment_type: Market segment designation
* repeated_guest: Is the customer a repeated guest? (0 - No, 1 - Yes)
* no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
* no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
* avg_price_per_room: Average price per day of the reservation, in euros; room prices are dynamic
* no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc.)
* booking_status: Flag indicating if the booking was canceled or not. (0 - Not Cancelled, 1 - Cancelled)


#- Implementations
=======

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import math
from sklearn import preprocessing
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
np.random.seed(1)

### Getting training dataset from csv files

In [None]:
train = pd.read_csv('./playground-series-s3e7/train.csv').drop(columns='id')
test = pd.read_csv('./playground-series-s3e7/test.csv').drop(columns='id')
train

Data statistics
====


In [None]:
# Using Python's statistics module: mathematical statistics functions from the standard library
from statistics import mean, median, mode, multimode, quantiles, variance, stdev
data = train['booking_status']
# print(sorted(data))
print("Min", data.min())
print("Max", data.max())
print("mean",mean(data))
print("median",median(data)) 
print("mode",mode(data)) #Single mode (most common value) of discrete or nominal data.
print("multimode",multimode(data)) #List of modes (most common values) of discrete or nominal data.
print("quantiles",quantiles(data)) #Divide data into intervals with equal probability
print("variance",variance(data)) #sample variance of data
print("std",stdev(data))  #sample standard deviation
print("Value counts: \n", data.value_counts())


train.describe(include='all')
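
For a 0/1 label, most of the statistics above reduce to the class balance. The following is a minimal sketch on made-up labels (not the dataset), assuming that 1 marks a cancelled booking as the plot legend in this notebook implies. It shows that the mean of the label equals the share of cancelled bookings.

```python
import pandas as pd

# Toy labels, made up for illustration; assumed encoding: 1 = cancelled, 0 = not cancelled
labels = pd.Series([0, 1, 0, 0, 1, 0, 0, 1, 0, 0], name="booking_status")

# normalize=True returns the fraction of rows in each class instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)

# For a 0/1 label, the mean is the fraction of rows equal to 1
cancel_rate = labels.mean()
print(f"Share of cancelled bookings: {cancel_rate:.2f}")
```

On the real data, the same two calls on `train['booking_status']` would give the class balance directly.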

## Plotting non-continuous attributes with booking status

In [None]:

non_continuous = ['no_of_special_requests','type_of_meal_plan', 'required_car_parking_space', 'room_type_reserved',
               'market_segment_type', 'repeated_guest', 'no_of_previous_bookings_not_canceled', 'no_of_previous_cancellations' ] 



nrows = int(np.ceil(len(non_continuous)/4))

# create the subplots
fig, axes = plt.subplots(nrows=nrows, ncols=4, figsize=(20, nrows*4))

for i, attr in enumerate(non_continuous):
    row = i // 4
    col = i % 4
    ax = axes[row, col]
    if attr in ('no_of_previous_bookings_not_canceled', 'no_of_previous_cancellations'):
        # These attributes take many distinct non-zero values, so group them into '0' and '1+'
        groups = train[attr].apply(lambda x: '0' if x == 0 else '1+')
    else:
        groups = train[attr]
    cross_tab = pd.crosstab(groups, train['booking_status'])
    # One colour per booking_status value: green for 0, red for 1
    cross_tab.plot(kind='bar', ax=ax, color=['g', 'r'])
    ax.set_xlabel(attr)
    ax.set_ylabel('Frequency')
    ax.legend(["Not Cancelled", "Cancelled"])
    
plt.tight_layout()
plt.show()


**From the above graphs we can see the following relationships:**

* Guests who made special requests (for example a good view or special decorations) cancelled less often than guests who made none. As the number of requests goes up, the share of cancellations drops significantly.

* Repeated guests also cancelled far less often than first-time guests. Because only repeated guests can have a non-zero number of previous bookings not cancelled or previous cancellations, non-zero values of these attributes also go along with fewer cancellations.
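
The bar charts show raw counts, so the cancellation ratio has to be read off by eye. A minimal sketch of how the ratio per group could be computed directly is given here. The data frame is made up for illustration (only the column names follow the notebook), and it assumes 1 marks a cancelled booking.

```python
import pandas as pd

# Toy data, made up for illustration; values do not come from the dataset
toy = pd.DataFrame({
    "no_of_special_requests": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "booking_status":         [1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
})

# Mean of a 0/1 label per group = share of cancelled bookings in that group
cancel_rate = toy.groupby("no_of_special_requests")["booking_status"].mean()
print(cancel_rate)

# Equivalent view: a cross-tabulation normalised per row, so each row sums to 1
rates = pd.crosstab(toy["no_of_special_requests"], toy["booking_status"], normalize="index")
print(rates)
```

Passing `normalize="index"` to the `pd.crosstab` call in the plotting loop would plot these shares instead of counts, which makes groups of very different sizes easier to compare.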

Data visualization - seaborn
====

## 1- boxplots


In [None]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
# Continuous attributes = every column except the non-continuous attributes and the target
non_cont = non_continuous + ['booking_status']
df_continuous = train.drop(columns=non_cont)
df_continuous.describe()
# calculate the number of rows based on the number of columns and 3 plots per row
nrows = int(np.ceil(len(df_continuous.columns)/3))

# create the subplots
fig, axes = plt.subplots(nrows=nrows, ncols=3, figsize=(20, nrows*4))

# plot the boxplots
for i, column in enumerate(df_continuous.columns):
    row = i // 3
    col = i % 3
    sns.boxplot(data=df_continuous, x=column, ax=axes[row, col])
    axes[row, col].set_title(column)

plt.tight_layout()
plt.show()

## 2- histograms



In [None]:
# calculate the number of rows based on the number of columns and 4 plots per row
df_non_continuous = train.copy()[non_continuous]
nrows = int(np.ceil(len(df_non_continuous.columns)/4))
# create the subplots
fig, axes = plt.subplots(nrows=nrows, ncols=4, figsize=(20, nrows*4))

# plot the histograms
for i, column in enumerate(df_non_continuous.columns):
    row = i // 4
    col = i % 4
    sns.histplot(data=df_non_continuous, x=column, ax=axes[row, col])
    axes[row, col].set_title(column)

plt.tight_layout()
plt.show()

#### We split the original dataset into two categories, continuous and non-continuous, to visualize the data using box plots for continuous variables and histogram plots for non-continuous variables.

## Removing duplicates

In [None]:
train_copy = train.copy()
print(train_copy.shape)

In [None]:
train_dups = train_copy.drop(columns = 'booking_status').duplicated().sum()

print(f'Number of duplicates in training dataset: {train_dups}')

In [None]:
train_copy = train_copy.drop_duplicates(subset = train.columns[:-1])

In [None]:
print(train_copy.shape)
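
Duplicates here are rows that agree on every attribute except possibly `booking_status`. The sketch below, on a made-up data frame with only two attribute columns, shows one way to count how many duplicate groups actually carry conflicting labels. The helper names (`feature_cols`, `dup_mask`, `conflicting`) are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "lead_time":          [10, 10, 45, 45, 45, 7],
    "avg_price_per_room": [90.0, 90.0, 120.0, 120.0, 120.0, 75.0],
    "booking_status":     [0, 1, 1, 1, 1, 0],
})
feature_cols = [c for c in toy.columns if c != "booking_status"]

# keep=False flags every member of a duplicate group, not only the later copies
dup_mask = toy.duplicated(subset=feature_cols, keep=False)

# Count distinct labels inside each group of identical attribute rows
labels_per_group = toy[dup_mask].groupby(feature_cols)["booking_status"].nunique()
conflicting = labels_per_group[labels_per_group > 1]
print(f"Rows in duplicate groups: {dup_mask.sum()}")
print(f"Groups with conflicting labels: {len(conflicting)}")

# drop_duplicates keeps the first row of each group by default
deduped = toy.drop_duplicates(subset=feature_cols)
print(deduped.shape)
```

Note that keeping the first row of a conflicting group retains whichever label happens to come first. Passing `keep=False` to `drop_duplicates` would instead drop such groups entirely, which may or may not be preferable.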

## Removing anomalies

These are the anomalies we noticed:
1. Dates that do not exist in the calendar, such as 29th February in a non-leap year
2. An average price per room of 0

In [None]:
# Removing entries where average price of room is 0. This is not possible.
train_copy = train_copy.loc[train_copy['avg_price_per_room'] != 0]
train_copy.shape

In [None]:
# Removing anomalous dates: pd.to_datetime expects columns named year/month/day,
# and errors='coerce' turns impossible dates into NaT
train_copy.rename(columns = {'arrival_year': 'year', 'arrival_month': 'month', 'arrival_date': 'day'}, inplace = True)
train_copy['true_date'] = pd.to_datetime(train_copy[['year', 'month', 'day']], errors = 'coerce')
train_copy = train_copy.dropna(subset = ['true_date'])
train_copy = train_copy.drop(columns = 'true_date')
train_copy.shape
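
The date check relies on `pd.to_datetime` assembling a date from `year`, `month` and `day` columns and returning `NaT` for combinations that do not exist when `errors='coerce'` is set. A minimal sketch on three made-up rows:

```python
import pandas as pd

# Toy rows, made up for illustration; 2018 is not a leap year, so 2018-02-29 is invalid
toy = pd.DataFrame({
    "year":  [2018, 2018, 2017],
    "month": [2, 2, 12],
    "day":   [28, 29, 31],
})

# errors="coerce" yields NaT instead of raising on impossible dates
parsed = pd.to_datetime(toy[["year", "month", "day"]], errors="coerce")
print(parsed)

# Keep only the rows whose date could be parsed
valid = toy[parsed.notna()]
print(valid)
```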

## Correlation

In [None]:
corr = train_copy.corr()
fig, axes = plt.subplots(figsize=(30, 10))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, linewidths=.5, cmap='YlOrBr_r', annot=True)
plt.title('Train Correlation')
plt.show()

**From the heatmap above, we infer the following:**
1. Booking status is most strongly correlated with lead time, i.e. the further ahead a room is booked, the more likely the booking is to be cancelled
2. Number of children is closely related to the room type reserved, as more children would mean larger rooms
3. Consequently, the average price per room is also correlated with the number of children, as larger rooms would have a higher price
4. Unsurprisingly, the room type reserved is related to the average price per room
5. If a guest is a repeat guest, then naturally the number of previous bookings not cancelled will be high.
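
Reading the strongest relationships off a large heatmap is error-prone. One option is to sort the target's column of the correlation matrix by absolute value. The sketch below uses a small made-up data frame (column names follow the notebook, values do not) and is only meant to illustrate the ranking step.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "lead_time":              [5, 200, 30, 150, 10, 180, 20, 120],
    "no_of_special_requests": [2, 0, 1, 0, 1, 0, 2, 1],
    "arrival_month":          [1, 6, 3, 9, 12, 4, 7, 10],
    "booking_status":         [0, 1, 0, 1, 0, 1, 0, 1],
})

corr = toy.corr()

# Correlation of every attribute with the target, excluding the target itself
target_corr = corr["booking_status"].drop("booking_status")

# Sort by magnitude so strong negative correlations rank as high as strong positive ones
ranked = target_corr.sort_values(key=abs, ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value for an encoded categorical attribute such as room type does not by itself mean the attribute is uninformative.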

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.tree import plot_tree
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

In [None]:
train = train_copy.copy()
train.shape

In [None]:
X_original = train.iloc[:,:-1]
y_original = train.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X_original, y_original, test_size=0.2, random_state=42)
print("Shapes:")
print(" X_train: ",X_train.shape)
print(" X_test: ",X_test.shape)
print(" y_train: ",y_train.shape)
print(" y_test: ",y_test.shape)

## Decision Trees

In [None]:
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
print("Training Accuracy:", tree_clf.score(X_train, y_train))

**Decision tree predictions on X_test**

In [None]:
y_pred = tree_clf.predict(X_test)
print("Predicted Labels:", y_pred[:30])
print("True Labels:     ", y_test.to_numpy()[:30])
print("Testing Accuracy:", metrics.accuracy_score(y_test, y_pred)) 

**Confusion Matrix for Decision tree classifier**

In [None]:
matrix = confusion_matrix(y_test, y_pred) 
disp = ConfusionMatrixDisplay(confusion_matrix=matrix, display_labels=tree_clf.classes_)
disp.plot()
plt.show()

**Cross Validation**

In [None]:
scores = cross_val_score(tree_clf, X_train, y_train, cv=50)
print(scores)

print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

**Using GridSearchCV to find hyperparameters**

In [None]:
params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10, 20, 40],
    'min_samples_leaf': [1, 2, 5, 10],
}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42),
                              params,
                              cv=3)

grid_search_cv.fit(X_train, y_train)

y_pred = grid_search_cv.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, y_pred))

tree_clf = grid_search_cv.best_estimator_
print("Criterion:         ", tree_clf.criterion)
print("Min Samples Leaf:  ", tree_clf.min_samples_leaf)
print("Depth:             ", tree_clf.max_depth)
print("Min Samples Split: ", tree_clf.min_samples_split)
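
Besides `best_estimator_`, a fitted `GridSearchCV` also exposes `best_params_` and `best_score_` (the mean cross-validated score of the best combination). The sketch below runs a much smaller grid on synthetic data from `make_classification`. The data and the reduced grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the hotel dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Reduced grid: 3 depths x 2 leaf sizes = 6 candidate settings
params = {"max_depth": [None, 3, 5], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
search.fit(X, y)

# Best hyperparameter combination as a dict, and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

Comparing `best_score_` with the accuracy on the held-out test set gives a rough check on whether the selected tree generalises as well as the cross-validation suggests.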

In [None]:
plt.figure(figsize=(10,8))
plot_tree(tree_clf, filled=True)
plt.title("Decision tree trained on all attributes")
plt.show()

## Random Forest

In [None]:
randomForest_clf = RandomForestClassifier(n_estimators=200, random_state=42,max_samples=5000)
randomForest_clf.fit(X_train, y_train)
print("Accuracy:",randomForest_clf.score(X_test, y_test))

## K-Nearest Neighbours

In [None]:
model = KNeighborsClassifier(n_neighbors=20)
model.fit(X_train, y_train)
y_pred= model.predict(X_test)
print("Testing Accuracy:", metrics.accuracy_score(y_test, y_pred))

In [None]:
scores = cross_val_score(model, X_train, y_train, cv=50)
print(scores)

print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
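
K-Nearest Neighbours is distance-based, so attributes with large numeric ranges (for example lead time in days versus a 0/1 flag) can dominate the distance. A possible variant, sketched here on synthetic data, standardises the features inside a pipeline before the classifier. The data from `make_classification` and the inflated first column are assumptions for illustration, and no claim is made about how much this would change the results on the hotel data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the first feature is inflated to mimic a large-range attribute
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[:, 0] *= 1000
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data only and reuses it at prediction time
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))
knn_scaled.fit(X_tr, y_tr)

acc = knn_scaled.score(X_te, y_te)
print(f"Testing accuracy with scaling: {acc:.3f}")
```

Because the scaler lives inside the pipeline, the same object can be passed to `cross_val_score`, and each fold is then scaled using only its own training portion.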

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_scaled, y_train)
print(log_reg.intercept_)
print(log_reg.coef_)
# Predict on the scaled test set, since the model was trained on scaled features
y_pred = log_reg.predict(X_test_scaled)
print("Testing Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Testing Precision:", precision_score(y_test, y_pred))
print("Testing Recall:", recall_score(y_test, y_pred))
print("Testing f1 Score:", f1_score(y_test, y_pred))
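
Precision, recall and F1 all derive from the four cells of the confusion matrix. The worked example below uses ten made-up labels and predictions (not model output) to show the arithmetic and checks it against the scikit-learn functions used above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels and predictions, made up for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```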


Conclusion
====
In conclusion, our analysis of the hotel reservation cancellation dataset met the objectives we set out. Summary statistics and visualizations helped us uncover patterns and relationships in the data, such as the relationship between the number of special requests and the booking status. Another interesting relationship was that between lead time and booking status. We also noticed anomalies, such as rows that had the same attributes but different class labels, and removed them to clean our data. Investigating these relationships has given us a better understanding of customer behaviour and the factors that influence it. By training initial models, we have also established a baseline on which future developments and improvements can build.