# Import Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, IsolationForest
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier, SGDClassifier
from sklearn.metrics import make_scorer, balanced_accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import train_test_split

# Load data

In [2]:
booking_data = pd.read_csv("hotel_bookings.csv")

# Extract Target

In [3]:
y = booking_data["is_canceled"] # extract target

In [4]:
X = booking_data.drop(["is_canceled"],
                      axis=1) 

# Extract continuous and categorical columns

Keeping track of which columns are continuous and which are categorical lets us preprocess the two groups separately.

In [5]:
cat_columns = X.dtypes==object
con_columns = ~cat_columns

# Identify columns to drop

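The columns `reservation_status` and `reservation_status_date` are updated after a booking has been cancelled, so they leak the target and must be excluded. The columns `company` (95% of its values are missing) and `agent` (not considered useful) are excluded as well. Setting their entries to `False` in both masks keeps them out of both pipelines, and `remainder="drop"` in the `ColumnTransformer` then discards them.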

In [6]:
drop_columns = ["company", "agent", "reservation_status", "reservation_status_date"]
cat_columns[drop_columns] = False
con_columns[drop_columns] = False
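
As a rough illustration of how these boolean masks behave, here is a minimal sketch on a hypothetical three-row frame (the values are made up and are not taken from `hotel_bookings.csv`). The string column is created with `dtype=object` explicitly, on the assumption that the `dtypes == object` test should behave the same across pandas versions.

```python
import pandas as pd

# Hypothetical toy frame standing in for the booking data
toy = pd.DataFrame({
    "hotel": pd.Series(["City Hotel", "Resort Hotel", "City Hotel"], dtype=object),
    "lead_time": [10, 200, 35],
    "agent": [9.0, None, 240.0],
})

# One boolean per column, indexed by column name
cat_mask = toy.dtypes == object
con_mask = ~cat_mask

# Switch a column off in both masks so that neither pipeline selects it
cat_mask[["agent"]] = False
con_mask[["agent"]] = False

cat_selected = cat_mask[cat_mask].index.tolist()
con_selected = con_mask[con_mask].index.tolist()
print("categorical:", cat_selected)
print("continuous:", con_selected)
```

A column that is `False` in both masks is selected by neither transformer, which is what allows `remainder="drop"` to discard it later.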

# Split the data into train and test datasets

The `random_state` parameter is set to 42 so that the split, and therefore the results, are reproducible.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


# Build categorical and continuous pipelines

In [8]:
cat_pipe = Pipeline([("cat_imputer", SimpleImputer(missing_values=np.nan,
                                              strategy="most_frequent")),
                     ("ohe", OneHotEncoder(handle_unknown="ignore"))])

con_pipe = Pipeline([("con_imputer", SimpleImputer(missing_values=np.nan,
                                                  strategy="median")),
                    ("scaler", StandardScaler())])

preprocessing = ColumnTransformer([("categorical", cat_pipe, cat_columns),
                                  ("continuous", con_pipe, con_columns)],
                                 remainder="drop")
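
To make the behaviour of this preprocessing step more concrete, the sketch below rebuilds the same two pipelines on a hypothetical two-column frame. The column names `room` and `price` and all values are illustrative assumptions. It shows what `handle_unknown="ignore"` does with a category that was not seen during fitting, and how median imputation feeds the scaler.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data: one categorical and one continuous column
train = pd.DataFrame({
    "room": pd.Series(["A", "B", "A"], dtype=object),
    "price": [80.0, np.nan, 120.0],  # the missing value is imputed with the median (100)
})
# A new row whose category "C" never appeared in the training data
new = pd.DataFrame({"room": pd.Series(["C"], dtype=object), "price": [100.0]})

toy_cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                         ("ohe", OneHotEncoder(handle_unknown="ignore"))])
toy_con_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                         ("scaler", StandardScaler())])
toy_prep = ColumnTransformer([("categorical", toy_cat_pipe, ["room"]),
                              ("continuous", toy_con_pipe, ["price"])],
                             remainder="drop")

toy_prep.fit(train)
encoded = toy_prep.transform(new)
# The result may be sparse or dense depending on its density; normalise to dense
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)
print(encoded)
```

The unseen category is encoded as zeros in both one-hot columns instead of raising an error, and a price equal to the training mean after imputation is scaled to zero.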

# Fit models

## Model 1: Random Forest

### Build pipeline

In [9]:
pipe = Pipeline([("preprocessing", preprocessing),
                ("rf", RandomForestClassifier())])

### Initialize the search space for model hyperparameters


In [10]:
rf_hyperparams = {"rf__n_estimators": [10, 20, 50], # number of trees in the forest; more trees cost more to fit but generally do not cause overfitting
                 "rf__max_depth": [5, 10, 15], # maximum depth of each tree; trees that are too shallow underfit
                 "rf__criterion": ["gini", "entropy"], # split quality measure: Gini impurity or entropy (information gain)
                 "rf__min_samples_leaf": [3, 5, 10], # minimum number of samples required at a leaf; higher values give more general trees
                 "rf__max_features": ["sqrt", "log2"], # rule for how many features are considered at each split
                 "rf__bootstrap": [True, False]} # whether each tree is trained on a bootstrap sample or on the full training set

### Initialize the metrics used for scoring

The primary metric used for scoring the classification models is the balanced accuracy score. It is the average of the recall obtained on each class, so it summarizes per-class recall in a single number and is not dominated by the majority class when the dataset is imbalanced.

The other metrics considered are the precision and recall scores, to look at precision and recall individually, and the F1 score, which combines the two. All three are computed per class and averaged with class-size weights (`average="weighted"`).
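
The difference between plain accuracy and balanced accuracy can be sketched in a few lines of plain Python. The labels below are hypothetical (eight bookings that were not cancelled and two that were), and the helper `per_class_recall` is written for this illustration only; it is not part of scikit-learn.

```python
def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    recalls = {}
    for label in sorted(set(y_true)):
        positions = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in positions if y_pred[i] == label)
        recalls[label] = hits / len(positions)
    return recalls

# Hypothetical labels: 0 = not cancelled, 1 = cancelled
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one of the two cancellations is missed

recalls = per_class_recall(y_true_toy, y_pred_toy)
balanced_accuracy = sum(recalls.values()) / len(recalls)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

print(recalls)            # {0: 1.0, 1: 0.5}
print(balanced_accuracy)  # 0.75
print(accuracy)           # 0.9
```

Plain accuracy looks high because the majority class is predicted perfectly, while balanced accuracy exposes the weaker recall on the minority class.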

In [11]:
f1_wtd = make_scorer(f1_score, average="weighted")
pr_wtd = make_scorer(precision_score, average="weighted")
bal_acc_score = make_scorer(balanced_accuracy_score)
recall_wtd = make_scorer(recall_score, average = "weighted")

scoring_dict = {"bal_acc_score": bal_acc_score,
            "f1_wtd": f1_wtd,
            "pr_wtd": pr_wtd,
            "recall_wtd": recall_wtd}

### Fit models

In [12]:
rscv = RandomizedSearchCV(estimator=pipe,
                    param_distributions=rf_hyperparams,
                    n_iter = 10,
                    scoring = scoring_dict,
                    refit = "bal_acc_score",
                    n_jobs = -1,
                    cv = 5,
                    random_state=42)
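
For context on how much of the search space is explored, the sketch below counts the combinations in the random forest grid and compares that with `n_iter = 10`. The dictionary repeats the values defined above (without the `rf__` prefix); `rf_space` is a name introduced only for this illustration.

```python
from math import prod

# Same candidate values as rf_hyperparams, repeated here so the snippet is self-contained
rf_space = {
    "n_estimators": [10, 20, 50],
    "max_depth": [5, 10, 15],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [3, 5, 10],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

n_candidates = prod(len(values) for values in rf_space.values())
n_iter = 10

print(n_candidates)                     # 216
print(f"{n_iter / n_candidates:.1%}")   # 4.6%
```

With 5-fold cross-validation, ten sampled candidates correspond to 50 model fits, compared with 1080 for an exhaustive grid search over the same space.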

In [13]:
result = rscv.fit(X_train, y_train)

### Print best hyperparameters

In [14]:

print(result.best_params_)
print(result.best_score_)

{'rf__n_estimators': 10, 'rf__min_samples_leaf': 5, 'rf__max_features': 'sqrt', 'rf__max_depth': 15, 'rf__criterion': 'entropy', 'rf__bootstrap': False}
0.7868118286166703


Several values were tried for the random forest's number of estimators, maximum depth, minimum samples per leaf, split criterion, maximum-features rule, and bootstrapping, to explore how the construction of the ensemble affects performance on this dataset. The best candidate used the largest depth in the grid (15), which suggests that deeper trees suit this dataset.

A balanced accuracy score of 0.79 means that the recall averaged over the two classes (cancelled and not cancelled) is 0.79: on average, the model correctly identifies about 79% of the bookings in each class.

### Extract the best hyperparameters

In [15]:
rf_bestparams = {}
for (key,value) in result.best_params_.items():
    rf_bestparams[key.partition("__")[2]] = value

print(rf_bestparams)

{'n_estimators': 10, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': 15, 'criterion': 'entropy', 'bootstrap': False}
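
The prefix stripping relies on `str.partition`, which splits on the first occurrence of the separator only. A small sketch, using a made-up `best` dictionary and a hypothetical nested parameter name:

```python
# Made-up search result with pipeline-style "<step>__<param>" keys
best = {"rf__n_estimators": 10, "rf__max_depth": 15}

# partition returns (head, separator, tail); index 2 is the part after the first "__"
stripped = {key.partition("__")[2]: value for key, value in best.items()}
print(stripped)  # {'n_estimators': 10, 'max_depth': 15}

# For nested steps only the outermost prefix is removed
nested = "preprocessing__continuous__con_imputer__strategy"
tail = nested.partition("__")[2]
print(tail)  # continuous__con_imputer__strategy
```

This is sufficient here because every searched parameter belongs to the final estimator step, so each key carries exactly one prefix.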


## Model 2: Logistic Regression

### Build pipeline

In [16]:
pipe = Pipeline([("preprocessing", preprocessing),
                ("log_reg", LogisticRegression())])

In [17]:
log_reg_hyperparams = {"log_reg__penalty": ['l1','l2'], # penalty types to compare; note that the default lbfgs solver does not support l1, so l1 candidates fail and score nan
                      "log_reg__fit_intercept": [True],
                      "log_reg__C": [0.001, 0.01, 0.1, 1, 10, 100, 1000], # inverse of regularization strength; smaller values regularize more strongly
                      "log_reg__class_weight": ["balanced"], # to account for imbalance in the dataset
                      }

In [18]:
rscv = RandomizedSearchCV(estimator=pipe,
                    param_distributions=log_reg_hyperparams,
                    n_iter = 10,
                    scoring = scoring_dict,
                    refit = "bal_acc_score",
                    n_jobs = -1,
                    cv = 5,
                    random_state=42)



In [19]:
result = rscv.fit(X_train, y_train)

        nan 0.79652246 0.80412608        nan]
        nan 0.80560316 0.8110583         nan]
        nan 0.80754929 0.81391621        nan]
        nan 0.804494   0.80967589        nan]
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [20]:
print(result.best_params_)
print(result.best_score_)

{'log_reg__penalty': 'l2', 'log_reg__fit_intercept': True, 'log_reg__class_weight': 'balanced', 'log_reg__C': 1000}
0.8041260772994334


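Different values of logistic regression's C were explored to see the impact of the strength of regularization on the dataset. The selected value, C = 1000, is the largest in the grid and corresponds to the weakest regularization, which indicates that regularization has little impact on this dataset. The l1 candidates returned `nan` scores (visible in the output above) because the default `lbfgs` solver does not support the l1 penalty, so the search effectively compared only l2 candidates and does not show which penalty is better suited. The convergence warnings also indicate that the solver stopped at its iteration limit, so the scores may improve slightly with a larger `max_iter`.

A balanced accuracy score of 0.80 means that the recall averaged over the two classes (cancelled and not cancelled) is 0.80: on average, the model correctly identifies about 80% of the bookings in each class.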
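
The `nan` entries in the search output are consistent with the solver rejecting the l1 penalty. The sketch below tries the l1 penalty with three solvers on synthetic data; the data, the value `C=0.1`, and the helper `try_fit` are assumptions made for illustration, and the exact error message depends on the scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, roughly standardized data (not the booking data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

def try_fit(solver):
    """Fit an l1-penalized model with the given solver and report what happened."""
    try:
        LogisticRegression(penalty="l1", solver=solver, C=0.1).fit(X_toy, y_toy)
        return "ok"
    except ValueError as err:
        return f"failed: {err}"

results = {solver: try_fit(solver) for solver in ("lbfgs", "liblinear", "saga")}
for solver, outcome in results.items():
    print(solver, "->", outcome)
```

One way to make the l1 versus l2 comparison meaningful would be to add a compatible solver such as `liblinear` or `saga` to the search space, at the cost of results that differ from the ones reported here.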

In [21]:
logreg_bestparams = {}
for (key,value) in result.best_params_.items():
    logreg_bestparams[key.partition("__")[2]] = value

print(logreg_bestparams)

{'penalty': 'l2', 'fit_intercept': True, 'class_weight': 'balanced', 'C': 1000}


# Choose final model
Although the scores of the two models are very similar (0.79 for the random forest and 0.80 for logistic regression), logistic regression was chosen because of its higher balanced accuracy score.

# Fit the final model on training data

In [22]:
pipe = Pipeline([("preprocessing", preprocessing),
                ("log_reg", LogisticRegression(**logreg_bestparams))])

model = pipe.fit(X_train, y_train)



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
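
The warning above means that the `lbfgs` solver stopped at its iteration limit before meeting its convergence criterion. As a hedged sketch of how this can be checked, the code below fits a logistic regression on synthetic, deliberately badly scaled data with a small and a large `max_iter`. The data and the helper `fit_and_report` are assumptions for illustration, and the iteration counts will differ on the booking data.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data with very different feature scales (not the booking data)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_toy = (X_toy[:, 0] + rng.normal(size=300) > 0).astype(int)

def fit_and_report(max_iter):
    """Return (iterations used, whether a ConvergenceWarning was raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        clf = LogisticRegression(C=1000, max_iter=max_iter).fit(X_toy, y_toy)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(clf.n_iter_[0]), hit_limit

for max_iter in (5, 5000):
    print(max_iter, fit_and_report(max_iter))
```

In the final pipeline, the equivalent change would be to pass a larger `max_iter` to `LogisticRegression`, as the warning text suggests.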


In [23]:
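# Inspect the fitted pipeline. Note: get_params is referenced here without being called,
# so the output below is the repr of the bound method; model.get_params() returns the parameter dict.
model.get_params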

<bound method Pipeline.get_params of Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('categorical',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('ohe',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  hotel                              True
lead_time                         False
arrival_date_year                 False
arrival_date_month                 True
arrival_date_week_number          False
arrival_date_day_of_month         False
stays_in_weekend_nights...
previous_bookings_not_canceled     True
reserved_room_type                False
assigned_room_type                False
booking_changes                    True
deposit_type

# Get predictions for the test data set

In [24]:
y_pred = model.predict(X_test)

# Get the metrics for final model 

In [26]:
print(f"Balanced accuracy score: {balanced_accuracy_score(y_test, y_pred): .4f}")
print(f"Precision score: {precision_score(y_test, y_pred, average='weighted'):.4f}")
print(f"Recall score: {recall_score(y_test, y_pred, average='weighted'):.4f}")
print(f"F1 score: {f1_score(y_test, y_pred, average='weighted'):.4f}")

Balanced accuracy score:  0.8104
Precision score: 0.8195
Recall score: 0.8163
F1 score: 0.8174


- A balanced accuracy score of 0.81 implies that the recall averaged over the two classes is 0.81: on average, the model correctly predicts about 81% of the labels of each class.
- A weighted precision score of 0.82 implies that, averaged over both classes with class-size weights, 82% of the bookings predicted to belong to a class actually belonged to it.
- A weighted recall score of 0.82 implies that, averaged over both classes with class-size weights, the model correctly identified 82% of the bookings in each class.
- An F1 score of 0.82 is the harmonic mean of precision and recall, computed per class and then averaged with class-size weights. Although not as intuitive as accuracy, a high F1 score indicates a good balance of precision and recall.
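
Because the scores above use `average="weighted"`, they describe both classes together rather than the cancelled class alone. The plain-Python sketch below uses hypothetical labels and a helper `class_stats` written for this illustration; it shows how the weighted averages are assembled from per-class values, and that weighted recall coincides with plain accuracy.

```python
def class_stats(y_true, y_pred, label):
    """Return (precision, recall, support) for a single class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    support = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    return precision, recall, support

# Hypothetical labels: 0 = not cancelled, 1 = cancelled
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

stats = {label: class_stats(y_true_toy, y_pred_toy, label) for label in (0, 1)}
total = len(y_true_toy)

# Weighted averages: each class contributes in proportion to its support
weighted_precision = sum(p * s for p, _, s in stats.values()) / total
weighted_recall = sum(r * s for _, r, s in stats.values()) / total
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / total

print("precision for class 1:", stats[1][0])
print("weighted precision:", round(weighted_precision, 4))
print("weighted recall:", round(weighted_recall, 4), "accuracy:", accuracy)
```

To judge how well cancellations specifically are detected, the per-class values for label 1 are the ones to read, not the weighted averages.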

# Summary

The logistic regression model proved to be the better choice for predicting whether a hotel booking will be cancelled, given current and prior booking information about the type of hotel, room, customer, stay, payment status, etc. The balanced accuracy score of the random forest model was only slightly lower (0.79 versus 0.80 in cross-validation). The selected model uses an l2 penalty, which can help to avoid overfitting and improve generality, although with C = 1000 the regularization is weak and is unlikely to explain the difference on its own.

This model can help hotel management reallocate a room when a booking is likely to be cancelled, avoiding lost revenue. Because the balanced accuracy score is high, management can also tell with reasonable accuracy when a booking is unlikely to be cancelled. This reduces the chance that a customer arrives to find that their booking has been allocated to someone else, which would cost the hotel both reputation and revenue.

# Future directions

By engineering more features, the number of days within which a booking is cancelled (if it is cancelled at all) could also be predicted. This would help hotel management design cancellation policies and calculate cancellation fees.
