# Project 3b

The final part of the project will ask you to perform your own data science project to classify a new dataset.

## Submission Details

**Project is due June 14th at 11:59 pm (Friday Midnight). To submit the project, please save the notebook
as a pdf file and submit the assignment via Gradescope. In addition, make sure that
all figures are legible and sufficiently large. For best pdf results, we recommend printing the notebook using [$\LaTeX$](https://www.latex-project.org/).**

## Loading Essentials and Helper Functions 

In [None]:
# fix for windows memory leak with MKL
import os
import platform

if platform.system() == "Windows":
    os.environ["OMP_NUM_THREADS"] = "2"

In [None]:
# import libraries
import time
import random
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # plotting graphs

# Sklearn classes
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
    KFold,
)
from sklearn import metrics
from sklearn.metrics import confusion_matrix, silhouette_score
import sklearn.metrics.cluster as smc
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
    LabelEncoder,
    MinMaxScaler,
    PolynomialFeatures
)
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn import tree
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_blobs

import seaborn as sns
from helper import (
    draw_confusion_matrix,
    heatmap,
    make_meshgrid,
    plot_contours,
    draw_contour,
)

from sklearn.experimental import enable_halving_search_cv
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import HalvingGridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Sets random seed for reproducibility
SEED = 42
random.seed(SEED)

# (100 pts) Putting it all together: Classify your own data

Through the course of this program, you have acquired knowledge and skills in applying various models to tackle supervised learning tasks. Now, we challenge you to harness your cumulative learning and create a model capable of predicting whether a hotel reservation will be canceled or not.

### Context
Hotels welcome millions of guests every year, and their primary objective is to keep rooms occupied and paid for. Cancellations can be detrimental to the business, as it may become challenging to rebook a room on short notice. Consequently, it is beneficial for hotels to anticipate which reservations are likely to be canceled. The provided dataset offers a diverse range of information about bookings, which you will utilize to predict cancellations.

### Challenge
The goal of this project is to develop a predictive model that can determine whether a reservation will be canceled based on the available input parameters.

While we will provide specific instructions to guide you in the right direction, you have the freedom to choose the models and preprocessing techniques that you deem most appropriate. Upon completion, we request that you provide a detailed description outlining the models you selected and the rationale behind your choices.

### Data Description
Refer to https://www.kaggle.com/competitions/m-148-spring-2024-project-3/data for information

## (50 pts) Preprocessing
For the dataset, the following are mandatory pre-processing steps for your data:

- **Use One-Hot Encoding on all categorical features** (specify whether you keep the extra feature or not for features with multiple values)
- Determine which fields need to be dropped
- **Handle missing values** (Specify your strategy)
- **Rescale the real valued features using any strategy you choose** (StandardScaler, MinMaxScaler, Normalizer, etc)
- **Augment at least one feature**
- **Implement a train-test split with 20% of the data going to the test data**. Make sure that the test and train data are balanced in terms of the desired class.

After writing your preprocessing code, write out a description of what you did for each step and provide a justification for your choices. All descriptions should be written in the markdown cells of the jupyter notebook. Make sure your writing is clear and professional.  

We highly recommend reading through the [scikit-learn documentation](https://scikit-learn.org/stable/data_transforms.html) to make this part easier.

In [None]:
# Loading in dataset
df = pd.read_csv("datasets/hotel_booking.csv")

df.head()

In [None]:
df.info()

In [None]:
df.describe()

# Data cleaning

From reading the documentation, I found that the following columns are categorical and will be dealt with accordingly:
* hotel
* is_canceled
* arrival_date_month
* meal
* country
* reserved_room_type
* deposit_type
* customer_type
* name
* email
* phone-number

<br>

We can drop name, email, and phone-number, since these do not provide any useful information for predicting cancellations.

In [None]:
df = df.drop(columns=['name', 'email', 'phone-number'])
df.head()

In [None]:
# Checking for null values
df.isnull().sum()

I found only 3 null values, in the 'children' column. Since the dataset is rather large, with almost 70,000 entries, I simply drop these rows.

In [None]:
# Dropping na values
df.dropna(inplace=True)

# Checking to make sure it worked
print(df.isnull().sum())

In [None]:
# Checking if data is balanced
sns.countplot(x='is_canceled', data=df)

### Augmentation
I created a new feature called 'is_family' based on 'children' and 'babies'. According to SHR Group's Hotel Industry Trend, families were the most likely to cancel in 2024, so this augmented feature should provide some good insight.

I also created the following columns:
* 'stay_duration': the total number of nights a guest is staying
* 'is_repeated_guest': whether a guest has booked with the hotel before
* 'has_special_request': whether the guest has made any special requests
* 'is_high_season': whether the booking falls in a high travel season (here June, July, and August)
* 'cancellation_rate': the share of a customer's previous bookings that were canceled

In [None]:
# Creating new feature indicating if a booking is made for a family
# (vectorized comparison instead of a row-wise apply; same result, much faster)
df['is_family'] = ((df['children'] > 0) | (df['babies'] > 0)).astype(int)

# Stay Duration
df['stay_duration'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']

# Cancellation rate for each booking
df['cancellation_rate'] = df['previous_cancellations'] / (df['previous_cancellations'] + df['previous_bookings_not_canceled'])
# Fill any NaN values which might occur due to division by zero
# This occurs when there's a new guest
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
df['cancellation_rate'] = df['cancellation_rate'].fillna(0)

df['is_repeated_guest'] = np.where((df['previous_cancellations'] > 0) | (df['previous_bookings_not_canceled'] > 0), 1, 0)

df['has_special_request'] = np.where(df['total_of_special_requests'] > 0, 1, 0)

df['is_high_season'] = df['arrival_date_month'].apply(lambda x: 1 if x in ['June', 'July', 'August'] else 0)


df.head()

In [None]:
# Splitting features into numerical and categorical
numerical = ['stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'previous_cancellations', 'previous_bookings_not_canceled', 'booking_changes', 'days_in_waiting_list', 'adr', 'required_car_parking_spaces', 'stay_duration', 'cancellation_rate']

categorical = ['hotel', 'arrival_date_month', 'meal', 'country', 'reserved_room_type', 'deposit_type', 'customer_type', 'is_high_season', 'is_repeated_guest', 'has_special_request','is_family']



# Exploratory Data Analysis

Now I am just going to do some basic exploratory data analysis to look at distributions

I suspect there will be a strong relationship between customer_type and cancellations

In [None]:
# Checking correlation between customer_type and cancellations
contingency_table = pd.crosstab(df['customer_type'], df['is_canceled'])
print(contingency_table)


In [None]:
from scipy.stats import chi2_contingency

chi2, p, dof, ex = chi2_contingency(contingency_table)
print("p-value of chi-square test:", p)


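The p-value is very small, so there is a statistically significant association between customer_type and cancellation. Note that the p-value says the association is unlikely to be due to chance, not how strong it is. The association makes sense, as those travelling in groups are less likely to cancel than those travelling alone or as a couple. This will be an important feature in the model.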
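
A chi-square p-value speaks to significance rather than strength. One common effect-size measure for a contingency table is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The sketch below is a minimal pure-Python illustration on a made-up 2x2 table; the function name `cramers_v` and the toy counts are assumptions for illustration, not part of the project data.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalize by sample size and the smaller table dimension minus one
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows are two customer types, columns are kept / canceled
toy_table = [[30, 10], [10, 30]]
print(cramers_v(toy_table))  # 0.5
```

With the notebook's data, the same function could be called as `cramers_v(contingency_table.values.tolist())`.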

In [None]:
# Bar charts showing the number of bookings per hotel and per customer type

sns.countplot(x='hotel', data=df)
plt.show()

sns.countplot(x='customer_type', data=df)
plt.show()

In [None]:
# Looking at relationship between is_family and cancellations
pd.crosstab(df['is_family'], df['is_canceled']).plot(kind='bar', stacked=True)
plt.xlabel('Is Family')
plt.ylabel('Count')
plt.show()

In [None]:
df.hist(figsize=(20,20))
plt.show()

In [None]:
# Looking at relationship between average daily rate and cancellations
sns.boxplot(x='is_canceled', y='adr', data=df)
plt.xlabel('Is Canceled')
plt.ylabel('Average Daily Rate')
plt.show()

# Data Processing

In [None]:
# Defining target
y = df['is_canceled']

# Define features
X = df.drop(columns='is_canceled')

# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
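
The split above is a plain random split. The project instructions ask for train and test sets that are balanced with respect to the target class, and one way to get that is the `stratify` argument of `train_test_split`. The sketch below uses a made-up target (`y_toy`, 60% class 0 and 40% class 1) rather than the hotel data, to show that stratifying preserves the class proportions in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 40% of them in the positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the class ratio the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Both positive-class rates match the overall rate of 0.4
print(y_tr.mean(), y_te.mean())
```

In the notebook, this would correspond to passing `stratify=y` in the call above. Doing so changes which rows land in the test set, so the accuracies reported later would shift slightly.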

In [None]:
# Creating pipeline for transforming features
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2))
])

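categorical_transformer = Pipeline(steps=[
    # handle_unknown='ignore' encodes categories not seen during fit (e.g. a new
    # country in the test data) as all zeros instead of raising an error
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])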

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical),
        ('cat', categorical_transformer, categorical)
    ])
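
The numerical pipeline above widens the feature matrix considerably. With its default settings (bias column included, squares and pairwise products kept), `PolynomialFeatures(degree=2)` turns n input columns into C(n + 2, 2) output columns. The short calculation below applies that formula to the 11 columns in the `numerical` list. The variable names are illustrative only.

```python
from math import comb

n_numerical = 11  # length of the `numerical` list defined earlier
degree = 2

# Number of monomials of total degree <= 2 in 11 variables, including the constant term
n_poly_columns = comb(n_numerical + degree, degree)
print(n_poly_columns)  # 78
```

So the numerical block alone contributes 78 columns before the one-hot columns are appended, which matters for distance-based models such as KNN.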

In [None]:
train = preprocessor.fit_transform(X_train)
test = preprocessor.transform(X_test)

# Getting feature names
feature_names = preprocessor.get_feature_names_out(list(X.columns))


I kept the extra feature for each categorical variable after one-hot encoding (no category was dropped), as I didn't see a reason to exclude it.
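
To make the "extra feature" choice concrete, the sketch below compares the default encoder, which emits one column per category, with `drop="first"`, which removes one redundant column per feature. It also shows what `handle_unknown="ignore"` does with a category that was not seen during fitting. The meal codes are a small made-up sample, not the full dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of a single categorical column with three categories
meals = np.array([["BB"], ["HB"], ["SC"], ["BB"]])

# Default: keep one column per category (the "extra" column is retained)
full = OneHotEncoder().fit_transform(meals).toarray()

# drop="first": omit the first category, which the remaining columns imply
reduced = OneHotEncoder(drop="first").fit_transform(meals).toarray()

print(full.shape, reduced.shape)  # (4, 3) (4, 2)

# A category unseen at fit time is encoded as all zeros when ignored
encoder = OneHotEncoder(handle_unknown="ignore").fit(np.array([["BB"], ["HB"]]))
unseen = encoder.transform(np.array([["SC"]])).toarray()
print(unseen)  # [[0. 0.]]
```

Dropping a column mainly helps unregularized linear models, where the full set of dummies is perfectly collinear. Tree-based models and KNN are generally unaffected by keeping it.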


## (50 pts) Try out a few models
Now that you have pre-processed your data, you are ready to try out different models. 

For this part of the project, we want you to experiment with all the different models demonstrated in the course to determine which one performs best on the dataset.

You must perform classification using at least 3 of the following models:
- Logistic Regression
- K-nearest neighbors
- SVM
- Decision Tree
- Multi-Layer Perceptron

Due to the size of the dataset, be careful which models you use and look at their documentation to see how you should tackle this size issue for each model.

For full credit, you must perform some hyperparameter optimization on your models of choice. You may find the following scikit-learn library on [hyperparameter optimization](https://scikit-learn.org/stable/modules/grid_search.html#grid-search) useful.

For each model chosen, write a description of which models were chosen, which parameters you optimized, and which parameters you choose for your best model. 
While the previous part of the project asked you to pre-process the data in a specific manner, you may alter pre-processing step as you wish to adjust for your chosen classification models.


In [None]:
# Hyperparameter optimization for KNN model
# Define the parameter grid
params = {
    'n_neighbors' : [1, 3, 5, 7, 9],
    'metric' : ["euclidean", "manhattan"]
}

# Instantiate the grid search model
grid_search_knn = HalvingGridSearchCV(estimator=KNeighborsClassifier(), param_grid=params, cv=10, scoring='accuracy', n_jobs=-1)

# Fit the grid search to the data
grid_search_knn.fit(train, y_train)

# Print the best parameters and the best score
print("Best Parameters: ", grid_search_knn.best_params_)
print("Best Score: ", grid_search_knn.best_score_)


In [None]:
# Using fitted model to make predictions
test_predictions_knn = grid_search_knn.best_estimator_.predict(test)

# Calculate the accuracy of the model on the test data
test_accuracy = accuracy_score(y_test, test_predictions_knn)

# Print the test accuracy
print(f"Test Accuracy (KNN): {test_accuracy*100:.3f}")

# Classification Report
print(metrics.classification_report(y_test, test_predictions_knn))

draw_confusion_matrix(y_test, test_predictions_knn, ['Not Cancelled', 'Cancelled'])

KNN test accuracy with the 'has_special_request' feature included: 83.2%. The accuracy without it was not recorded.

## KNN Model Description

This KNN model is fairly accurate, with a test accuracy of 83%. I optimized n_neighbors and metric with a 10-fold cross-validated search (HalvingGridSearchCV in the cell above) and found the best parameters to be {'metric': 'euclidean', 'n_neighbors': 9}. KNN fits very quickly, so an exhaustive GridSearchCV is also practical for this model, unlike the slower models later on, where the halving search matters more.

In [None]:
# Building Logistic Regression model and hyperparameter optimizing

from sklearn.linear_model import LogisticRegression

# Instantiate the model (max_iter raised from the default so the solvers have room to converge)
logreg = LogisticRegression(max_iter=1000)

params = {
    'penalty': ["l1", "l2"],
    'solver': ["liblinear", "saga"],
    'C': [0.001, 0.1, 10]
}

# Instantiate the grid search model
grid_search_log = HalvingGridSearchCV(estimator=logreg, param_grid=params, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search to the data
grid_search_log.fit(train, y_train)

# Print the best parameters and the best score
print("Best Parameters: ", grid_search_log.best_params_)
print("Best Score: ", grid_search_log.best_score_)

In [None]:
# Testing best logistic regression model
best_logreg_model = grid_search_log.best_estimator_

test_predictions_logreg = best_logreg_model.predict(test)

test_accuracy_logreg = accuracy_score(y_test, test_predictions_logreg)

print(f"Test Accuracy (Logistic Regression): {test_accuracy_logreg*100:.3f}")

# Classification Report
print(metrics.classification_report(y_test, test_predictions_logreg))

draw_confusion_matrix(y_test, test_predictions_logreg, ['Not Cancelled', 'Cancelled'])

## Logistic Regression Model Description

This logistic regression model is, surprisingly, less accurate than the KNN model, with a test accuracy of about 80%. The best parameters were {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}. It is also very slow to train compared to the other models, even with HalvingGridSearchCV.

In [None]:
# Building decision tree model

# Optimizing
params = {
    'max_depth': [10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Named tree_clf so it does not shadow the `tree` module imported from sklearn above
tree_clf = DecisionTreeClassifier()

grid_search_tree = HalvingGridSearchCV(estimator=tree_clf, param_grid=params, cv=10, n_jobs=-1)

# Fitting GridSearch
grid_result_tree = grid_search_tree.fit(train, y_train)

# Best parameters
print("Best parameters found: ", grid_result_tree.best_params_)
print("Highest accuracy: ", grid_result_tree.best_score_)



In [None]:
# Testing best decision tree model
best_tree_model = grid_search_tree.best_estimator_

test_predictions_tree = best_tree_model.predict(test)

test_accuracy_tree = accuracy_score(y_test, test_predictions_tree)

print(f"Test Accuracy (Decision Tree): {test_accuracy_tree*100:.3f}")

# Classification Report
print(metrics.classification_report(y_test, test_predictions_tree))

draw_confusion_matrix(y_test, test_predictions_tree, ['Not Cancelled', 'Cancelled'])

## Decision Tree Model Description

This Decision Tree model is around the same accuracy as the best KNN model, with an accuracy of 82%. The best parameters were {'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 10}.

In [None]:
from sklearn.ensemble import RandomForestClassifier


# Define the parameters // refined
params = {
    'bootstrap': [False],
    'criterion': ['entropy'],
    'max_depth': [45, 50, 55],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3],
    'n_estimators': [450, 500, 550]
}


rf = RandomForestClassifier()

grid_search_rf = HalvingGridSearchCV(estimator=rf,param_grid=params, cv=10, n_jobs=-1)

# Run the grid search
grid_search_rf.fit(train, y_train)

# Print out the best parameters
print("Best parameters found: ", grid_search_rf.best_params_)
print("Highest accuracy found: ", grid_search_rf.best_score_)


In [None]:
# Testing the best Random Forest model
best_rf_model = grid_search_rf.best_estimator_

# Predict on the testing data
test_predictions_rf = best_rf_model.predict(test)

# Get the accuracy of the model
test_accuracy_rf = accuracy_score(y_test, test_predictions_rf)

print(f"Test Accuracy (Random Forest): {test_accuracy_rf * 100:.3f}")

# Check the classification report
print(metrics.classification_report(y_test, test_predictions_rf))

# Predict probabilities
probabilities_rf = best_rf_model.predict_proba(test)

# Probabilities for positive class
auc = roc_auc_score(y_test, probabilities_rf[:, 1])

print(f"AUC-ROC score for Random Forest is {auc}")

# Confusion Matrix
draw_confusion_matrix(y_test, test_predictions_rf, ['Not Cancelled', 'Cancelled'])

## Random Forest Model Description

This Random Forest classifier was the best model, achieving a test accuracy of 86% and an AUC-ROC score of nearly 94%. The best parameters were {'bootstrap': False, 'criterion': 'entropy', 'max_depth': 45, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 550}, with a highest cross-validated accuracy of 0.8559. This model was a little slow to train, so I used HalvingGridSearchCV, which greatly reduces the search time with little apparent cost to the final model's performance.
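
To see why the search strategy matters here, it helps to count the work an exhaustive grid search would do for the Random Forest grid above. The sketch below multiplies the number of values per parameter to get the number of candidates, then multiplies by the 10 cross-validation folds. The name `rf_params` is only a local copy of that grid for illustration.

```python
from math import prod

# Copy of the Random Forest parameter grid used above
rf_params = {
    "bootstrap": [False],
    "criterion": ["entropy"],
    "max_depth": [45, 50, 55],
    "max_features": ["sqrt"],
    "min_samples_leaf": [1, 2, 3],
    "min_samples_split": [2, 3],
    "n_estimators": [450, 500, 550],
}

# Every combination of values is one candidate model
n_candidates = prod(len(values) for values in rf_params.values())

# An exhaustive search fits each candidate once per fold on (almost) all the data
n_fits_exhaustive = n_candidates * 10

print(n_candidates, n_fits_exhaustive)  # 54 540
```

An exhaustive search would fit 540 forests of roughly 500 trees each on nearly the full training set. Successive halving instead evaluates all candidates on a small subset first and keeps only the best fraction for each larger round (by default the top third, `factor=3`), so most candidates are never trained on the full data.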


## Extra Credit 

We have provided an extra test dataset named `hotel_booking_test.csv` that does not have the target labels. Classify the samples in the dataset with any method of your choosing and save the predictions into a csv file. Submit the file to our [Kaggle](https://www.kaggle.com/competitions/m-148-spring-2024-project-3/) contest. The website will specify your classification accuracy on the test set. We will award a bonus point for the project for every percentage point over 75% that you get on your kaggle test accuracy.

To get the bonus points, you must also write out a summary of the model that you submit including any changes you made to the pre-processing steps. The summary must be written in a markdown cell of the jupyter notebook. Note that you should not change earlier parts of the project to complete the extra credit.

**Please refer to *Submission and evaluation* section on the contest page for the `csv` file formatting**

### Summary
The model I chose to submit is the best bagging classifier, with parameters {'bootstrap': False, 'bootstrap_features': True, 'estimator': DecisionTreeClassifier(random_state=42), 'max_features': 0.7, 'max_samples': 0.7, 'n_estimators': 300}. It reached a test accuracy of about 86.6%.

One thing I noticed in the parts above was that class 1 (canceled) was somewhat underrepresented in the dataset. In the classification reports, every model had lower recall on class 1 than on class 0. As a result, I used SMOTE (Synthetic Minority Oversampling Technique), which synthetically creates more samples in the minority class so that the classes are balanced. This improved my accuracy a little and also appeared to help the models generalize.
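
The core idea behind SMOTE can be sketched in a few lines: pick a minority-class point, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line segment between them. The function below is a simplified pure-Python illustration of that interpolation step only. It is not the imbalanced-learn implementation, and the name `smote_like_sample` and the toy points are assumptions for illustration.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority point
    and one of its k nearest minority-class neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)

    # Rank the other minority points by squared Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])

    # Interpolate: new = base + lam * (neighbour - base), with lam drawn from [0, 1)
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical minority-class points in two dimensions
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(1))
print(synthetic)
```

Because each new point lies between two real minority samples, the synthetic data stays inside the region the minority class already occupies. For the same reason, oversampling should be applied to the training data only, never to the test data.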

# Cleaning and processing the testing data

In [None]:
# Read in hotel_booking_test
hotel = pd.read_csv("datasets/hotel_booking_test.csv")

hotel.head()

In [None]:
hotel.isnull().sum()

In [None]:
# Applying same preprocessing steps as before
hotel['is_family'] = ((hotel['children'] > 0) | (hotel['babies'] > 0)).astype(int)

# Stay Duration
hotel['stay_duration'] = hotel['stays_in_weekend_nights'] + hotel['stays_in_week_nights']

hotel['is_repeated_guest'] = np.where((hotel['previous_cancellations'] > 0) | (hotel['previous_bookings_not_canceled'] > 0), 1, 0)

# Cancellation rate for each booking
hotel['cancellation_rate'] = hotel['previous_cancellations'] / (hotel['previous_cancellations'] + hotel['previous_bookings_not_canceled'])
# Fill any NaN values which might occur due to division by zero
# This occurs when there's a new guest
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
hotel['cancellation_rate'] = hotel['cancellation_rate'].fillna(0)


hotel['has_special_request'] = np.where(hotel['total_of_special_requests'] > 0, 1, 0)

hotel['is_high_season'] = hotel['arrival_date_month'].apply(lambda x: 1 if x in ['June', 'July', 'August'] else 0)

In [None]:
# Dropping email, name, phone number
hotel = hotel.drop(['email', 'name', 'phone-number'], axis=1)

In [None]:
# Processing dataframe
hotel_transformed = preprocessor.transform(hotel)

# Changing the preprocessing of training dataset

In [None]:
from imblearn.over_sampling import SMOTE

# Run the preprocessors
train_preprocessed = preprocessor.fit_transform(X_train)
test_preprocessed = preprocessor.transform(X_test)

# Get the output feature names directly from the fitted preprocessor.
# Building them by hand as numerical + one-hot names would miss the extra
# columns that PolynomialFeatures adds to the numerical block.
feature_names = preprocessor.get_feature_names_out()



# Now for applying SMOTE
smote = SMOTE(random_state=2)
X_train_resampled, y_train_resampled = smote.fit_resample(train_preprocessed, y_train)


# Training models

In [None]:
# Hyperparameter optimization for KNN model
from sklearn.model_selection import HalvingRandomSearchCV
# Define the parameter grid
params = {
    'n_neighbors' : [1, 3, 5, 7, 9],
    'metric' : ["euclidean", "manhattan"]
}

# Instantiate the randomized halving search (it takes param_distributions, not param_grid)
grid_search_knn = HalvingRandomSearchCV(estimator=KNeighborsClassifier(), param_distributions=params, cv=10, scoring='accuracy')

# Fit the grid search to the data
grid_search_knn.fit(X_train_resampled, y_train_resampled)

# Print the best parameters and the best score
print("Best Parameters: ", grid_search_knn.best_params_)
print("Best Score: ", grid_search_knn.best_score_)

In [None]:
# Using fitted model to make predictions
test_predictions_knn = grid_search_knn.best_estimator_.predict(test)

# Calculate the accuracy of the model on the test data
test_accuracy = accuracy_score(y_test, test_predictions_knn)

# Print the test accuracy
print(f"Test Accuracy (KNN): {test_accuracy*100:.3f}")

# Classification Report
print(metrics.classification_report(y_test, test_predictions_knn))

draw_confusion_matrix(y_test, test_predictions_knn, ['Not Cancelled', 'Cancelled'])

In [None]:
from sklearn.ensemble import RandomForestClassifier


# Define the parameters // refined
params = {
    'n_estimators': [500, 525, 550, 575, 600],  # Adjusted around 550
    'criterion': ['gini', 'entropy'],  # Keeping 'entropy' as a search option
    'max_depth': [45, 50, 55, 60],  # Adjusted around 50
    'min_samples_split': [2, 3, 4, 5],  # Adjusted around 3
    'min_samples_leaf': [1, 2, 3],  # Adjusted around 2
    'bootstrap': [False],  # Keeping 'False' as in the earlier best result
    'max_features': ['sqrt', 'log2', None]  # Adding some more options around 'sqrt'   
}


rf = RandomForestClassifier()

# HalvingRandomSearchCV takes param_distributions, not param_grid
grid_search_rf = HalvingRandomSearchCV(estimator=rf, param_distributions=params, cv=10, n_jobs=-1)

# Run the grid search
grid_search_rf.fit(X_train_resampled, y_train_resampled)

# Print out the best parameters
print("Best parameters found: ", grid_search_rf.best_params_)
print("Highest accuracy found: ", grid_search_rf.best_score_)


In [None]:
# Testing the best Random Forest model
best_rf_model = grid_search_rf.best_estimator_

# Predict on the testing data
test_predictions_rf = best_rf_model.predict(test)

# Get the accuracy of the model
test_accuracy_rf = accuracy_score(y_test, test_predictions_rf)

print(f"Test Accuracy (Random Forest): {test_accuracy_rf * 100:.3f}")

# Check the classification report
print(metrics.classification_report(y_test, test_predictions_rf))

# Predict probabilities
probabilities_rf = best_rf_model.predict_proba(test)

# Probabilities for positive class
auc = roc_auc_score(y_test, probabilities_rf[:, 1])

print(f"AUC-ROC score for Random Forest is {auc}")

# Confusion Matrix
draw_confusion_matrix(y_test, test_predictions_rf, ['Not Cancelled', 'Cancelled'])

Best parameters found:  {'bootstrap': False, 'criterion': 'entropy', 'max_depth': 50, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 3, 'n_estimators': 550}


Test Accuracy (Random Forest): 85.731
              precision    recall  f1-score   support

           0       0.86      0.91      0.88      8333
           1       0.85      0.78      0.81      5585

    accuracy                           0.86     13918
   macro avg       0.86      0.84      0.85     13918
weighted avg       0.86      0.86      0.86     13918

AUC-ROC score for Random Forest is 0.9336623241115858


In [None]:
# making bagging classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# make baseline model
base_estimator_1 = DecisionTreeClassifier(random_state=SEED)
# parameters // refined
params = {
    'n_estimators': [200, 250, 300, 350, 400],  # narrowed around 300
    'max_samples': [0.6, 0.65, 0.7, 0.75, 0.8],  # narrowed around 0.7   
    'max_features': [0.6, 0.65, 0.7, 0.75, 0.8],  # narrowed around 0.7
    'bootstrap': [False],
    'bootstrap_features': [True, False],  # Adding False as an option
    'estimator': [base_estimator_1]
    # Keeping the DecisionTreeClassifier since it has been found best 
}

bag_clf = BaggingClassifier(random_state=SEED)

hgs_bag = HalvingRandomSearchCV(bag_clf, params, scoring='accuracy', cv=10, n_jobs=-1)

# train model
hgs_bag.fit(X_train_resampled, y_train_resampled)

# print best parameters and score
print("Best parameters found: ", hgs_bag.best_params_)
print("Highest accuracy found: ", hgs_bag.best_score_)


In [None]:
best_bagging_model = hgs_bag.best_estimator_

test_predictions = best_bagging_model.predict(test)

# Get the accuracy of the model
test_accuracy = metrics.accuracy_score(y_test, test_predictions)

print(f"Test Accuracy: {test_accuracy * 100:.3f}")

# Check the classification report
print(metrics.classification_report(y_test, test_predictions))

# Probabilities for positive class
probabilities = best_bagging_model.predict_proba(test)
auc = roc_auc_score(y_test, probabilities[:, 1])

print(f"AUC-ROC score is {auc}")

draw_confusion_matrix(y_test, test_predictions,
                      ['Not Cancelled', 'Cancelled'])


Best parameters found:  {'bootstrap': False, 'bootstrap_features': True, 'estimator': DecisionTreeClassifier(random_state=42), 'max_features': 0.7, 'max_samples': 0.7, 'n_estimators': 300}


Test Accuracy: 86.636
              precision    recall  f1-score   support

           0       0.85      0.94      0.89      8333
           1       0.90      0.76      0.82      5585

    accuracy                           0.87     13918
   macro avg       0.87      0.85      0.86     13918
weighted avg       0.87      0.87      0.86     13918

AUC-ROC score is 0.9367644643117864


The two models above were not re-executed for this submission: after I expanded the parameter search they took too long to run, and I had to interrupt the kernel. The outputs shown are copied from an earlier run and pasted into a markdown cell.

# Making Predictions

In [None]:
# Predictions using best random forest model
test_predictions_rf = best_rf_model.predict(hotel_transformed)

# Convert the prediction array into a dataframe with 'target' column
predictions_df = pd.DataFrame(test_predictions_rf, columns=['target'])

# Create 'index' column that contains the row number from 0 to len(test_predictions)
predictions_df['index'] = range(len(test_predictions_rf))

# Reorder the columns so 'index' is first
predictions_df = predictions_df[['index', 'target']]

# Save the data frame to a .csv file without index column
predictions_df.to_csv('test_predictions.csv', index=False)


In [None]:
# Predictions using bagging classifier
bagging_test_predictions = best_bagging_model.predict(hotel_transformed)

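# Convert to dataframe
predictions_bagging_df = pd.DataFrame(bagging_test_predictions, columns=['target'])

# Create 'index'
predictions_bagging_df['index'] = range(len(bagging_test_predictions))

# Reorder the columns so 'index' is first, matching the random forest submission file
predictions_bagging_df = predictions_bagging_df[['index', 'target']]

# Save data
predictions_bagging_df.to_csv('bagging_test_predictions.csv', index=False)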