**SCC0277 - Data Science Competitions:** Classification Challenge

**Author:** Dikson F. Santos `<dikson@usp.br>`


# Introduction

This is a project intended to predict whether or not a client will cancel a reservation. The data comes from a Kaggle dataset, and we will use many common ML algorithms and discuss about their performance in this particular dataset. We also talk about how we could improve this baseline by using more advanced techniques.


# Exploratory Data Analysis & Pre-processing


We're going to get our hands on the data. Firstly, let's import the necessary libraries.


In [1]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

import time

from imblearn.over_sampling import RandomOverSampler

Now we are going to load the dataset and check what's on it.


In [2]:
df = pd.read_csv("./hotel_reservations.csv")
df.head(3).transpose()  # Transpose is useful because of the amount of columns

Unnamed: 0,0,1,2
Booking_ID,INN00001,INN00002,INN00003
no_of_adults,2,2,1
no_of_children,0,0,0
no_of_weekend_nights,1,2,2
no_of_week_nights,2,3,1
type_of_meal_plan,Meal Plan 1,Not Selected,Meal Plan 1
required_car_parking_space,0,0,0
room_type_reserved,Room_Type 1,Room_Type 1,Room_Type 1
lead_time,224,5,1
arrival_year,2017,2018,2018


It's time for the analysis!

Let's see what kind of data our attributes hold:


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date   

Two important things to note:

1. There aren't `null` values on the dataset, which eliminates the need of treatment for it.

2. Some of these features are categorical (`<object>`). We're going to see what is inside these features.


In [4]:
n_top_values = 5
for col in df.select_dtypes(include="object"):
    unique_values = df[col].nunique()
    top_values = sorted(df[col].unique())[:n_top_values]
    top_values_str = ", ".join([f"<{v}>" for v in top_values])
    print(
        f"{col}: ({unique_values} distinct) {top_values_str + ', ...' if len(top_values) == n_top_values else top_values_str}"
    )

Booking_ID: (36275 distinct) <INN00001>, <INN00002>, <INN00003>, <INN00004>, <INN00005>, ...
type_of_meal_plan: (4 distinct) <Meal Plan 1>, <Meal Plan 2>, <Meal Plan 3>, <Not Selected>
room_type_reserved: (7 distinct) <Room_Type 1>, <Room_Type 2>, <Room_Type 3>, <Room_Type 4>, <Room_Type 5>, ...
market_segment_type: (5 distinct) <Aviation>, <Complementary>, <Corporate>, <Offline>, <Online>, ...
booking_status: (2 distinct) <Canceled>, <Not_Canceled>


The `Booking_ID` is meaningless for our algorithms, so we can remove that column.

Also, we can convert our categorical features to numerical by using some encoding. There is some methods available, such as one-hot encoding, but we're going to go with label encoding because it's simpler.


In [5]:
df.drop("Booking_ID", axis=1, inplace=True)

# Identify categorical columns
cat_cols = df.select_dtypes(include="object").columns

# Label encode categorical columns
for col in cat_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

Let's check the distribution of the feature we want to predict:


In [6]:
df["booking_status"].value_counts()

booking_status
1    24390
0    11885
Name: count, dtype: int64

As the number of not canceled reservations is more than twice the amount of canceled ones, we can consider this as an imbalanced dataset.


# Model Training


In this section, we are going to evaluate and comment the performance of some classification algorithms, including KNN, Naive Bayes, Logistic Regression, SVMs, among others.


## Preparing validation

To validate our algorithms, we used two techniques:

1. **RandomOverSampler**: As we could see in the EDA section, our dataset is imbalanced. This uses data augmentation to balance out the minority class.
2. **StratifiedKFold**: To mitigate overfitting, we are going to measure our models using KFold with Stratification.


In [7]:
X = df.drop("booking_status", axis=1)
Y = df["booking_status"]

ros = RandomOverSampler(random_state=0)
X, Y = ros.fit_resample(X, Y)

n_folds = 10
kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)

## Creating auxiliary functions


In [8]:
# Evaluate models by using cross validation
def evaluate_model(model, model_name, frac=None):
    start_time = time.time()

    # This is need in order to SVM to run in a reasonable amount of time
    if frac is not None:
        X_sample, Y_sample = X.sample(frac=frac), Y.sample(frac=frac)
    else:
        X_sample, Y_sample = X, Y

    scores = cross_val_score(model, X_sample, Y_sample, cv=kf, n_jobs=-1)
    end_time = time.time()

    print(
        "(%0.1fs) Accuracy with %s: %0.4f ± %0.4f"
        % (end_time - start_time, model_name, scores.mean(), scores.std())
    )

## KNN


KNN classifies new observations by assigning them to the class of the majority of their k-nearest neighbors in the training data. Here, we try KNN with `1 <= k <= 3`. We are going to see that besides being a simple algorithm, KNN performs really pretty in this dataset, achiving a 89.4% accuracy when `k = 1`.


In [9]:
for i in range(1, 4):
    knn = KNeighborsClassifier(n_neighbors=i)
    evaluate_model(knn, f"{i}-KNN")

(2.6s) Accuracy with 1-KNN: 0.8937 ± 0.0036
(1.7s) Accuracy with 2-KNN: 0.8501 ± 0.0030
(1.8s) Accuracy with 3-KNN: 0.8388 ± 0.0038


## Naive Bayes


Naive Bayes classifies new observations by assuming feature independence and using Bayes theorem to calculate conditional probabilities.
In this dataset, there are dependence between some features (like `no_of_previous_cancellations` and `no_of_previous_bookings_not_canceled`), so without feature engineering, it performs poorly, achieving less than 57% accuracy.

A interesting thing to note here is the speed: `0.2s` to run, which makes it the fastest algorithm on this analysis.


In [10]:
nb = GaussianNB()
evaluate_model(nb, "Naive Bayes")

(0.2s) Accuracy with Naive Bayes: 0.5672 ± 0.0032


## Logistic Regression


Logistic Regression classifies new observations by estimating the probability of belonging to a certain class using a logistic function and a linear combination of the input features.
We tried running the estimator with `max_iter=1000`, but the library warned that we didn't reach convergence, so we increased it to `5000`, achieving 77.9% accuracy.


In [11]:
rlog = LogisticRegression(max_iter=5000)
evaluate_model(rlog, "Logistic Regression")

(8.7s) Accuracy with Logistic Regression: 0.7786 ± 0.0051


## SVM


SVM (Support Vector Machine) classifies new observations by finding the hyperplane that maximally separates the classes and assigning the observation to the class based on which side of the hyperplane it falls.

It iss particularly effective in high-dimensional datasets with a low number of samples, but doesn't perform well in the opposite condition. The complexity of the algorithm makes it unsuitable for large datasets.

Our dataset is considerably large for SVM, so we reduce the amount of data for training so it won't take too long.

All three variations perform badly, with accuracy close to 50% - No much better than tossing a coin.


In [12]:
svm = SVC(kernel="linear")
evaluate_model(svm, "SVM Linear", frac=0.1)

svm_rbf = SVC(kernel="rbf")
evaluate_model(svm, "SVM RBF", frac=0.1)

svm_poly = SVC(kernel="poly", degree=3)
evaluate_model(svm, "SVM Poly", frac=0.1)

(5.6s) Accuracy with SVM Linear: 0.5148 ± 0.0107
(6.1s) Accuracy with SVM RBF: 0.5043 ± 0.0168
(6.2s) Accuracy with SVM Poly: 0.5148 ± 0.0140


### Decision Tree


A Decision Tree classifies new observations by making a series of binary decisions based on the values of the features, splitting the data into smaller groups until it can assign a class label to the observation based on the majority class in the smallest group.

Turns out decision trees are a good fit for this dataset, with one set of parameters reaching 92.8% accuracy. They are also one of the fastest models here, running in around `0.4s`.


In [13]:
dct = tree.DecisionTreeClassifier()
evaluate_model(dct, "Decision Tree")

dct_entropy = tree.DecisionTreeClassifier(criterion="entropy")
evaluate_model(dct_entropy, "Decision Tree (criterion=entropy)")

(0.4s) Accuracy with Decision Tree: 0.9265 ± 0.0032
(0.4s) Accuracy with Decision Tree (criterion=entropy): 0.9284 ± 0.0026


### Random Forest


Random Forest classifies new observations by aggregating the predictions of multiple decision trees, where each tree is trained on a random subset of features and a random sample of the training data.

The final prediction is then based on the majority vote or the weighted average of the individual tree predictions. This helps to reduce overfitting and increase the accuracy and robustness of the model.

As expected, this model performs better than individual decision trees, reaching 94.6% accuracy with the default parameter set! The validation time also goes up, taking `~20x` more time than individual DTs.


In [14]:
rf = RandomForestClassifier()
evaluate_model(rf, "Random Forest")

(8.2s) Accuracy with Random Forest: 0.9460 ± 0.0034


## Additional Methods & Techniques


After successfully implementing a Random Forest model and achieving an accuracy of 94.6% in solving the machine learning problem, there are still various techniques that can be used to enhance the performance of the model. One approach is hyperparameter tuning, which involves adjusting the settings of the model to optimize its performance. This can be done by testing different values for the hyperparameters, such as the number of trees, maximum depth, and minimum samples required to split a node. By doing so, we can identify the optimal set of hyperparameters that will improve the accuracy of the model.

Another technique is feature engineering, which involves creating new features or modifying existing ones to improve the accuracy of the model. This can be done by selecting the most important features, combining features to create new ones, or scaling the features to prevent bias


# Conclusion


We conducted a classification challenge to predict hotel reservation cancellations using basic data science methodology.

We analyzed the dataset, performed data transformations with justifications, and developed baseline models using various classification techniques, including KNN, Naive Bayes, Logistic Regression, Naive Bayes, SVM, Decision Tree and Random Forest.

Based on the evaluation metrics, we found that the `Random Forest` algorithm performed the best with an accuracy of 0.9455 ± 0.0029.

This study shows the potential of classification algorithms and show ideas of further exploration of advanced techniques that could lead to improved results.


# Apprendix: Seminars Summary


## Using ML for detecting poisoned water using WiFi

Beatriz Proenca presented a study on using machine learning to differentiate between clean and poisoned water using Wi-Fi signals. Four classifiers were tested, achieving high accuracy, with Adaboost performing the best at 92%.

## Cancer Recurrence Prediction

Diego Giaretta presented a study that used machine learning techniques to predict cancer recurrence from open databases. The study used Naive-Bayes and SVM algorithms for training and achieved 99% accuracy through SVM with K-fold cross-validation, identifying features such as age, tumor size, and receptor status. The study was structured with sections presenting concepts, methodology, results, conclusion, and next steps.

## CatBoost

Dikson Santos presented about CatBoost, a machine learning algorithm designed to tackle classification and regression problems. This algorithm is based on gradient boosting decision trees and is known for its ability to handle categorical features in large and complex datasets, resulting in high accuracy. During the presentation, John discussed how CatBoost compares to other popular machine learning algorithms
