
## Data Science

Create a Probability-of-Cancellation model. Your work won't be assessed on whether you get the best model, but on whether you understand the important concepts behind analyzing the data, feature engineering, and model development and evaluation. Keep this section simple, clear and illustrative of your understanding of how to prototype a model.

# Data Science Portion

## Imports

In [0]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

## Dataset

In this notebook, we would like you to develop a model to predict whether a reservation will be canceled and describe what the model learned.

* The label in the dataset is given as `is_canceled`.
* For a complete description of the dataset, visit the link: https://www.sciencedirect.com/science/article/pii/S2352340918315191

In [0]:
df = pd.read_csv('../data/raw/hotel_bookings.csv')
df.head()


In [0]:
df.describe()

In [0]:
# Add arrival date into dataframe

In [0]:
import datetime
month_mapping = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4,
    'May': 5, 'June': 6, 'July': 7, 'August': 8,
    'September': 9, 'October': 10, 'November': 11, 'December': 12
}

df['arrival_date'] = df.apply(
    lambda x: datetime.date(
        int(x['arrival_date_year']), 
        month_mapping[x['arrival_date_month']], 
        int(x['arrival_date_day_of_month'])
    ), axis=1)
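
The row-wise `apply` above works, but it can be slow on a large dataframe. A possible vectorized alternative is sketched below, assuming `pd.to_datetime` is given a frame with `year`, `month` and `day` columns. The `toy` frame and its values are made up for illustration; only the three column names come from the dataset. Note that this produces pandas timestamps rather than `datetime.date` objects.

```python
import pandas as pd

# Toy frame with the same three column names as the dataset (values are made up).
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016, 2017],
    "arrival_date_month": ["July", "February", "December"],
    "arrival_date_day_of_month": [1, 29, 31],
})

month_names = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
month_mapping = {name: number for number, name in enumerate(month_names, start=1)}

# pd.to_datetime can assemble dates from a frame with year/month/day columns.
date_parts = pd.DataFrame({
    "year": toy["arrival_date_year"],
    "month": toy["arrival_date_month"].map(month_mapping),
    "day": toy["arrival_date_day_of_month"],
})
toy["arrival_date"] = pd.to_datetime(date_parts)
print(toy["arrival_date"])
```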

## Helpful EDA

In [0]:
city_hotel = df[df['hotel'] == 'City Hotel']
resort_hotel = df[df['hotel'] == 'Resort Hotel']

# Average daily rate (adr) per arrival date for each hotel type
resort_hotel = resort_hotel.groupby('arrival_date')[['adr']].mean()
city_hotel = city_hotel.groupby('arrival_date')[['adr']].mean()


plt.figure(figsize=(20, 8), facecolor='#C38154')
plt.title('Average Daily Rate in City and Resort Hotel', fontsize=30)
plt.plot(resort_hotel.index, resort_hotel['adr'], label='Resort Hotel')
plt.plot(city_hotel.index, city_hotel['adr'], label='City Hotel')
plt.legend(fontsize=20)
plt.show()

We can see clear seasonality in the average prices, especially for the resort hotel.

In [0]:
df['is_canceled'].mean()

In [0]:
# check for null values

In [0]:
null = pd.DataFrame({'Null Values' : df.isna().sum(), 'Percentage Null Values' : (df.isna().sum()) / (df.shape[0]) * (100)})
null

## Deal with missing values

In [0]:
# Based on the dataset description, a missing agent means the booking did not come from a travel agency, and likewise a missing company means no company was involved. We can therefore fill these values with 0, since 0 is not used as an ID in the dataset. Missing children values are also filled with 0.
df[['agent', 'company', 'children']] = df[['agent', 'company', 'children']].fillna(0)

# For country fill missing values with "Unknown"
df['country'] = df['country'].fillna('Unknown')


In [0]:
# Based on the dataset description, the columns "agent" and "company" are supposed to represent the IDs of the booking agent and of the company/entity that made the booking. We therefore convert them to object type so that they are not treated as numeric in the analysis.
df[['agent', 'company']] = df[['agent', 'company']].astype('object')

In the dataset, adults, children and babies are sometimes all 0 at the same time, which is strange. Whether to drop these rows depends on further investigation and on how useful they are.

In [0]:
# Rows where no guests at all are recorded (avoid shadowing the built-in `filter`)
no_guests = (df.children == 0) & (df.adults == 0) & (df.babies == 0)
display(df[no_guests])
df = df[~no_guests].reset_index(drop=True)

## Data preprocessing

## Explore numerical values and their correlations with is_canceled

In [0]:
plt.figure(figsize = (24, 12))

numerical_df = df.select_dtypes(include=[np.number])
corr = numerical_df.corr()
sns.heatmap(corr, annot = True, linewidths = 1)
plt.show()

In [0]:
correlation = numerical_df.corr()['is_canceled'].abs().sort_values(ascending = False)
correlation

Insights:
- The features most correlated with is_canceled are lead_time, total_of_special_requests, required_car_parking_spaces, booking_changes and previous_cancellations.
- The longer the lead time before the stay, the more time there is to cancel.
- The more special requests a reservation has, the less likely it is to be canceled.
- Previous cancellations might indicate a future cancellation as well.
- stays_in_weekend_nights and stays_in_week_nights are correlated with each other, so we use only one of these features.
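
The ranking above is built from absolute correlations, so it does not show the direction of each relationship. A minimal sketch of how the sign could be kept while still ranking by strength is shown below. It uses synthetic data: the `toy` frame, the random generator seed and the coefficients are assumptions chosen only so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Synthetic data: cancellation probability rises with lead_time and
# falls with the number of special requests (coefficients are made up).
rng = np.random.default_rng(0)
n = 2000
lead_time = rng.integers(0, 365, size=n)
special_requests = rng.integers(0, 4, size=n)
logit = 0.01 * (lead_time - 180) - 0.8 * (special_requests - 1.5)
canceled = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

toy = pd.DataFrame({
    "lead_time": lead_time,
    "total_of_special_requests": special_requests,
    "is_canceled": canceled,
})

# Signed correlation with the label, ordered by absolute strength.
signed = toy.corr()["is_canceled"].drop("is_canceled")
ranked = signed.reindex(signed.abs().sort_values(ascending=False).index)
print(ranked)
```

A positive value means the feature increases together with the cancellation label, and a negative value means the opposite.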


In [0]:
# Numerical features to use for the model.
# 'agent' is not listed here: it is still an ID column at this point and is
# added later, after it has been converted to a binary flag.
numerical_features = [
    'lead_time', 
    'total_of_special_requests', 
    'required_car_parking_spaces', 
    'booking_changes', 
    'previous_cancellations', 
    'is_repeated_guest', 
    'previous_bookings_not_canceled', 
    'adr', 
    'stays_in_week_nights'
]

In [0]:
def plot_cancellation_per_category(category_column: str, minimal_category_count: int):
    """Plot the cancellation rate per category of `category_column`.

    Only categories with more than `minimal_category_count` rows in the
    global dataframe `df` are shown.
    """
    category_counts = df[category_column].value_counts()
    categories = category_counts[category_counts > minimal_category_count]

    # Keep only the frequent categories and compute the mean of is_canceled
    # (the cancellation rate) for each of them
    df_subset_categories = df[df[category_column].isin(categories.index)]
    categories_grouped = df_subset_categories.groupby(category_column).is_canceled.mean().sort_values(ascending = False)

    # Bar plot of the cancellation rate per category
    plt.figure(figsize = (12, 6))
    categories_grouped.plot(kind = 'bar')
    plt.title(f'{category_column} and their cancellation rates')
    plt.show()

In [0]:
excluded_columns = ['reservation_status', 'reservation_status_date']
columns = [c for c in df.select_dtypes(include = 'object').columns if c not in excluded_columns]
for col in columns:
    plot_cancellation_per_category(col, 100)

In [0]:
cancelled_data= df[df['is_canceled']==1]
top_10_country = cancelled_data['country'].value_counts()[:10]


# Custom colors for the pie chart
custom_colors = ['#FF6347', '#4682B4', '#7FFF00', '#FFD700', '#87CEEB', '#FFA07A', '#6A5ACD', '#FF69B4', '#40E0D0', '#DAA520']

plt.figure(figsize=(8, 8), facecolor='#C38154')  # Set background color to a light brown
plt.title('Top 10 countries by number of canceled reservations', color="black")
plt.pie(top_10_country, autopct='%.2f', labels=top_10_country.index, colors=custom_colors)
plt.show()

Insights:
- City hotels have a higher cancellation rate than resort hotels.
- Spring months have higher cancellation rates. In contrast, January has the lowest (possibly because of cheaper prices or special offers in January).
- Full board has the highest cancellation rate.
- Groups have the highest cancellation rate.
- Cancellation rates vary by country. We exclude country from the model because there are many countries and the model would not generalize well. We might want to come back and investigate Portugal in particular, which has the most data and the highest cancellation rate.
- The strangest result is for `deposit_type`: about 99% of bookings with the Non Refund type are canceled. This is counterintuitive, so we should investigate why we get this result. For now we remove it from the analysis.
- We might want to investigate further the relationship between assigned_room_type and reserved_room_type. Especially where the assigned room type is higher than the reserved room type, we might expect lower cancellation rates (as we see for assigned room types K and I).
- We see large variation across the `agent` and `company` values. This could be investigated further, but for the sake of simplicity we only record whether the booking was made through an agent or a company.
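
One possible first step for the `deposit_type` investigation mentioned above is to look at the cancellation rate together with the number of bookings in each group, and then to break it down by a second column. The sketch below uses a small made-up frame. The values are illustrative only and `market_segment` is just one candidate for the second column.

```python
import pandas as pd

# Made-up bookings; only the column names come from the dataset.
toy = pd.DataFrame({
    "deposit_type": ["No Deposit"] * 6 + ["Non Refund"] * 4 + ["Refundable"] * 2,
    "market_segment": ["Online TA", "Direct", "Online TA", "Groups", "Direct", "Online TA",
                       "Groups", "Groups", "Offline TA/TO", "Groups",
                       "Direct", "Groups"],
    "is_canceled": [0, 0, 0, 0, 1, 1,
                    1, 1, 1, 0,
                    0, 1],
})

# Cancellation rate and group size per deposit type.
summary = toy.groupby("deposit_type")["is_canceled"].agg(
    bookings="count", cancel_rate="mean"
)
print(summary)

# Same statistics broken down by a second column, to see whether the
# pattern is concentrated in particular segments.
by_segment = toy.groupby(["deposit_type", "market_segment"])["is_canceled"].agg(
    bookings="count", cancel_rate="mean"
)
print(by_segment)
```

Showing the group size next to the rate helps to judge whether an extreme rate comes from a handful of rows or from a large share of the data.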

In [0]:
# For the model prediction, we use these categorical columns
categorical_features = [
    'hotel', 
    'arrival_date_month', 
    'meal', 
    'market_segment', 
    'distribution_channel', 
    'reserved_room_type', 
    'customer_type'
]

For the `agent` and `company` features, instead of adding all values to the model, let's only distinguish whether the booking was made through an agent or a company.

In [0]:
# Convert the ID columns to binary flags: 1 if an agent/company was involved, 0 otherwise
df['agent'] = (df['agent'] > 0).astype(int)
df['company'] = (df['company'] > 0).astype(int)

plot_cancellation_per_category('agent', 100)
plot_cancellation_per_category('company', 100)

In [0]:
# We see differences here, so let's add these features to the numerical features
numerical_features.extend(['agent', 'company'])

In [0]:
# Separate features and predicted value
features = numerical_features + categorical_features
df_categorical = df[categorical_features]
df_numerical = df[numerical_features]
y = df['is_canceled']

In [0]:
# Convert each categorical column in df_categorical into one-hot-encoded columns
df_categorical_dummies = pd.get_dummies(df_categorical).astype(int)


In [0]:
df_numerical.isnull().sum()

### Baseline model

We use Logistic Regression as a simple baseline. For this model, we need to scale the numerical features.

In [0]:
from sklearn.preprocessing import StandardScaler
X_numerical = StandardScaler().fit_transform(df_numerical)

X = pd.concat([pd.DataFrame(X_numerical, columns = df_numerical.columns), df_categorical_dummies], axis = 1)
X.head()

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

kfolds = 4 # 4 = 75% train, 25% validation
split = KFold(n_splits=kfolds, shuffle=True, random_state=42)

# Get the cross-validation F1 scores for the baseline model:
cv_results = cross_val_score(LogisticRegression(max_iter=1000), 
                             X, y, 
                             cv=split,
                             scoring="f1",
                             n_jobs=-1)
# Summarize the per-fold scores:
min_score = round(min(cv_results), 4)
max_score = round(max(cv_results), 4)
mean_score = round(np.mean(cv_results), 4)
std_dev = round(np.std(cv_results), 4)
print(f"Cross-validation F1 score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")
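
In the cells above, the scaler is fitted on the full dataset before cross-validation, so each validation fold has already influenced the scaling. One common way to avoid this is to put the preprocessing and the model into a single scikit-learn `Pipeline`, so that the scaler and encoder are refitted on the training part of every fold. The sketch below shows the idea on synthetic data. The `toy` frame, the chosen columns and the coefficients are assumptions for illustration, not the notebook's actual features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the bookings data (values are made up).
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "lead_time": rng.integers(0, 365, size=n),
    "adr": rng.normal(100, 30, size=n),
    "hotel": rng.choice(["City Hotel", "Resort Hotel"], size=n),
})
logit = 0.01 * (toy["lead_time"] - 180) + 0.5 * (toy["hotel"] == "City Hotel")
toy_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Scale numerical columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["lead_time", "adr"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["hotel"]),
])

# The whole pipeline is refitted on the training part of each fold,
# so the validation part never influences the scaling or encoding.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = KFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, toy, toy_y, cv=cv, scoring="f1")
print(f"F1 per fold: {np.round(scores, 4)}")
```

The same pipeline object could then be fitted once on the training data and reused for prediction, which keeps the preprocessing and the model in one place.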

Train random forest model

In [0]:
# Cross-validate a random forest model
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# random_state is fixed so that the scores are reproducible
cv_results = cross_val_score(RandomForestClassifier(random_state=42), 
                             X, y, 
                             cv=split,
                             scoring="f1",
                             n_jobs=-1)
# Summarize the per-fold scores:
min_score = round(min(cv_results), 4)
max_score = round(max(cv_results), 4)
mean_score = round(np.mean(cv_results), 4)
std_dev = round(np.std(cv_results), 4)
print(f"Cross-validation F1 score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")



In [0]:
# Split into train and test sets, fit a random forest model and plot its feature importances
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# random_state is fixed so that the results are reproducible
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)

score = accuracy_score(y_test, preds)
print(f"Accuracy score: {round(score, 4)}")

# f1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, preds)
print(f"F1 score: {round(f1, 4)}")

# plot feature importance -> sort features by importance
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize = (12, 6))
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
         color="r", align="center")
plt.xticks(range(X.shape[1]), X.columns[indices], rotation = 90)
plt.xlim([-1, X.shape[1]])
plt.show()
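
The task asks for a probability of cancellation, while the accuracy and F1 scores above are computed on hard 0/1 predictions. A minimal sketch of how the predicted probabilities themselves could be scored is shown below, using `predict_proba` together with ROC AUC and the Brier score. It runs on synthetic data from `make_classification` rather than on the bookings data, and the names `X_toy`, `clf` and `proba` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the bookings.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the probability of the positive class.
proba = clf.predict_proba(X_te)[:, 1]

# ROC AUC measures how well the probabilities rank positives above negatives;
# the Brier score is the mean squared error of the probabilities (lower is better).
auc = roc_auc_score(y_te, proba)
brier = brier_score_loss(y_te, proba)
print(f"ROC AUC: {auc:.4f}, Brier score: {brier:.4f}")
```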