# Task 1

The dataset I am using: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand/data

As I specified in Homework 1, the dataset is extremely large. After completing all necessary EDA, I randomly chose 900 rows as a sample of the larger dataset. The sample file, 'sample_hotel_bookings.xlsx', is pushed with this notebook.

# Implementation

In [51]:
#Importing necessary libraries
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

In [52]:
#Loading the local dataset
path = 'sample_hotel_bookings.xlsx'
data = pd.read_excel(path)

In [53]:
#Viewing the first few rows of the dataset to see all the features of the dataset
data.head()

Unnamed: 0,hotel,arrival_date_month,meal,market_segment,distribution_channel,reserved_room_type,assigned_room_type,deposit_type,customer_type,lead_time,...,babies,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,booking_changes,days_in_waiting_list,adr,required_car_parking_spaces,total_of_special_requests,is_canceled
0,1,10,3,2,2,1,1,0,0,4.127134,...,0.0,0.0,0.0,0.0,0.0,0.0,4.584967,0.693147,0.693147,0
1,1,7,0,2,2,2,2,0,0,4.477337,...,0.0,0.0,0.0,0.0,0.0,0.0,5.049856,0.0,0.693147,0
2,1,0,0,3,2,1,1,0,0,0.693147,...,0.0,0.693147,0.0,1.609438,0.0,0.0,3.828641,0.0,0.693147,0
3,1,9,0,3,2,1,2,0,2,5.645447,...,0.0,0.0,0.0,0.0,0.0,0.0,4.352469,0.0,0.0,0
4,0,5,0,1,1,6,8,0,0,3.218876,...,0.0,0.693147,0.0,1.386294,0.0,0.0,5.239628,0.693147,1.386294,0


In [54]:
#Initializing X and y
X = data.drop('is_canceled', axis=1).values
y = data['is_canceled'].values
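The `data.head()` output lists an `Unnamed: 0` column, which is usually a row index that was written out when the DataFrame was saved. Assuming that is what it is here (this is not confirmed by the notebook), it carries no information about cancellations and could be dropped together with the target. A minimal sketch on a made-up toy frame:

```python
import pandas as pd

# Toy frame standing in for the notebook's `data`; the values are made up.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "lead_time": [4.13, 4.48, 0.69],
    "is_canceled": [0, 0, 1],
})

# Drop the target and, if present, the leftover index column.
# errors="ignore" keeps the call safe when "Unnamed: 0" does not exist.
X_toy = toy.drop(columns=["is_canceled", "Unnamed: 0"], errors="ignore")
y_toy = toy["is_canceled"]

print(list(X_toy.columns))
```

If the column turns out to be a real feature, it should of course stay in `X`.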

In [55]:
#Train 80% and Test 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
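Because no `random_state` is passed, the split above produces a different partition on every run, so the scores below will change slightly each time the notebook is executed. If reproducible numbers are wanted, one option is to fix the seed and stratify on the target. The sketch below uses synthetic data from `make_classification` as a stand-in for the hotel sample; the names `X_demo`, `y_demo`, and the seed values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 900-row sample (not the hotel data).
X_demo, y_demo = make_classification(n_samples=900, n_features=10, random_state=0)

# random_state fixes the partition; stratify keeps the class ratio
# the same in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```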

In [56]:
#Initialize decision tree object
classification_tree = tree.DecisionTreeClassifier()

#Train our decision tree (tree induction; no pruning is applied with the default settings)
classification_tree = classification_tree.fit(X_train, y_train)

In [57]:
#Generating the predictions
predictions = classification_tree.predict(X_test)

In [58]:
#Evaluating the base model with no modifications
base_accuracy = accuracy_score(y_test, predictions)
base_report = classification_report(y_test, predictions)

print(base_accuracy)
print(base_report)

0.7166666666666667
              precision    recall  f1-score   support

           0       0.72      0.87      0.79       109
           1       0.71      0.48      0.57        71

    accuracy                           0.72       180
   macro avg       0.71      0.68      0.68       180
weighted avg       0.72      0.72      0.70       180



Modifying Max_Depth Parameter

max_depth can prevent overfitting by keeping the model from getting too complex by memorizing the training data.
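One way to see this effect is to compare training and test accuracy for several depths: a large gap between the two suggests memorization. The following is a minimal sketch on synthetic data (a stand-in, not the hotel sample); the depth values and seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split 80/20 like the notebook.
X_demo, y_demo = make_classification(n_samples=900, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

scores = {}
for depth in (2, 5, 10, None):  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=5).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth.
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(depth, scores[depth])
```

The unrestricted tree fits the training set perfectly, so any drop in its test accuracy relative to a shallower tree points to overfitting.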

In [59]:
#MAX_DEPTH = 5
#Initialize decision tree object
classification_tree = tree.DecisionTreeClassifier(max_depth = 5)

#Train our decision tree (tree induction, stopped at max_depth)
classification_tree = classification_tree.fit(X_train, y_train)

#Generating the predictions
predictions = classification_tree.predict(X_test)

#Evaluating the model with max_depth = 5
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)

print(accuracy)
print(report)

0.7055555555555556
              precision    recall  f1-score   support

           0       0.68      0.97      0.80       109
           1       0.88      0.30      0.44        71

    accuracy                           0.71       180
   macro avg       0.78      0.63      0.62       180
weighted avg       0.76      0.71      0.66       180



In [60]:
#MAX_DEPTH = 10
#Initialize decision tree object
classification_tree = tree.DecisionTreeClassifier(max_depth = 10)

#Train our decision tree (tree induction, stopped at max_depth)
classification_tree = classification_tree.fit(X_train, y_train)

#Generating the predictions
predictions = classification_tree.predict(X_test)

#Evaluating the model with max_depth = 10
accuracy = accuracy_score(y_test, predictions)
report = classification_report(y_test, predictions)

print(accuracy)
print(report)

0.7222222222222222
              precision    recall  f1-score   support

           0       0.74      0.83      0.78       109
           1       0.68      0.55      0.61        71

    accuracy                           0.72       180
   macro avg       0.71      0.69      0.70       180
weighted avg       0.72      0.72      0.72       180



Base Model
- Accuracy: 0.717
- Precision: 0.72, 0.71
- Recall: 0.87, 0.48

Max_depth = 5
- Accuracy: 0.706
- Precision: 0.68, 0.88
- Recall: 0.97, 0.3

Max_depth = 10
- Accuracy: 0.722
- Precision: 0.74, 0.68
- Recall: 0.83, 0.55

When changing max_depth to 5, the accuracy decreases slightly from the base model (0.706 vs. 0.717). Precision for class 1 rises from 0.71 to 0.88, but recall for class 1 drops from 0.48 to 0.30, meaning that the model identifies fewer of the actual cancelled reservations. Limiting the depth to 5 makes the model more conservative: it predicts a cancellation less often, but is more often right when it does.

When increasing max_depth to 10, the accuracy increases slightly to 0.722, which is one more correct prediction out of 180 than the base model. Recall for class 1 rises to 0.55, the highest of the three models, while precision for class 1 (0.68) is the lowest. This suggests that a depth of 10 captures more of the structure in the data than a depth of 5 and gives the best balance between the two classes on this split.

# Task 2

# Bagging - Random Forest

In [85]:
#Importing necessary libraries
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

In [62]:
#Initializing X and y
X = data.drop('is_canceled', axis=1).values
y = data['is_canceled'].values

In [117]:
#Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state = 5)

#k-fold cross validation
kfold = KFold(n_splits=10, shuffle=True, random_state = 5)

#Getting metrics for the model
rf_accuracy = np.mean(cross_val_score(rf_model, X, y, cv=kfold, scoring='accuracy'))
rf_precision = np.mean(cross_val_score(rf_model, X, y, cv=kfold, scoring='precision'))
rf_recall = np.mean(cross_val_score(rf_model, X, y, cv=kfold, scoring='recall'))

print("Accuracy for Random Forest Model:", rf_accuracy)
print("Precision:", rf_precision)
print("Recall:", rf_recall)

Accuracy for Random Forest Model: 0.7666666666666666
Precision: 0.7856113573250006
Recall: 0.5508042279299817


# Boosting - AdaBoost

In [81]:
#Initialize the AdaBoost model
ada_model = AdaBoostClassifier(random_state = 5)

#k-fold cross validation
kfold = KFold(n_splits=10, shuffle=True, random_state = 5)

#Getting metrics value for the model
ada_accuracy = np.mean(cross_val_score(ada_model, X, y, cv=kfold, scoring='accuracy'))
ada_precision = np.mean(cross_val_score(ada_model, X, y, cv=kfold, scoring='precision'))
ada_recall = np.mean(cross_val_score(ada_model, X, y, cv=kfold, scoring='recall'))

print("Accuracy for AdaBoost Model:", ada_accuracy)
print("Precision:", ada_precision)
print("Recall:", ada_recall)

Accuracy for AdaBoost Model: 0.78
Precision: 0.7810999052117473
Recall: 0.6193630120340456


Both chosen ensemble models performed similarly and relatively well on the dataset. AdaBoost has the higher accuracy (0.78 vs. 0.77). Random Forest has the higher precision, but by a very small margin (0.786 vs. 0.781), which means that when it predicts a cancellation, it is slightly more often correct. AdaBoost has the higher recall (0.62 vs. 0.55), meaning that it identifies a larger share of the actual cancellations. Overall, AdaBoost seems to have an edge over the Random Forest model.
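Each `cross_val_score` call above refits the model on every fold, so collecting three metrics means training each model three times. scikit-learn's `cross_validate` accepts a list of scorers and computes them all in one pass. A minimal sketch, assuming synthetic stand-in data and smaller settings (`n_samples=300`, 5 folds, 50 trees) chosen only to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (not the hotel sample).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=5)

models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=5),
    "ada": AdaBoostClassifier(random_state=5),
}
metric_names = ("accuracy", "precision", "recall", "f1")

summary = {}
for name, model in models.items():
    # One cross-validation run per model; all metrics use the same folds.
    cv = cross_validate(model, X_demo, y_demo, cv=kfold_demo, scoring=list(metric_names))
    summary[name] = {m: cv[f"test_{m}"].mean() for m in metric_names}
    print(name, summary[name])
```

Because both models are scored on identical folds, differences between them are easier to attribute to the models rather than to the split.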

# Task 3

In [84]:
#MAX_DEPTH = 10
#Initialize decision tree object
classification_tree = tree.DecisionTreeClassifier(max_depth = 10)

#Train our decision tree (tree induction, stopped at max_depth)
classification_tree = classification_tree.fit(X_train, y_train)

#Generating the predictions
predictions = classification_tree.predict(X_test)

#Evaluating the model with max_depth = 10
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)

print(accuracy)
print(precision)
print(recall)

0.7166666666666667
0.6724137931034483
0.5492957746478874


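As described in Task 2, the metrics I used were accuracy, precision, and recall. Note that the decision tree values come from a single 80/20 split, while the Random Forest and AdaBoost values are means over 10-fold cross-validation, so the comparison is approximate.

Accuracy is a straightforward metric and is easily interpretable. It provides insight into how often the model predicts the outcome of a booking correctly.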

Accuracy
- Decision Tree (max_depth 10): 0.72
- Random Forest: 0.77
- AdaBoost: 0.78

AdaBoost has the highest accuracy.

Precision
- Decision Tree (max_depth 10): 0.67
- Random Forest: 0.79
- AdaBoost: 0.78

Random Forest has the highest precision, meaning that its cancellation predictions are correct most often. Precision is an important metric in situations where a false positive would be a significant issue. For example, in this situation, if the model incorrectly predicted that a hotel booking would be cancelled, it could cause overbookings for the hotel.

Recall
- Decision Tree (max_depth 10): 0.55
- Random Forest: 0.55
- AdaBoost: 0.62

AdaBoost has the highest recall value, so it is the most capable of identifying the actual cancellations. Recall is an important metric when missing a cancellation is costly, such as when actual guest numbers need to be estimated in this situation.

I also want to generate the F1 score for these models.

In [86]:
f1 = f1_score(y_test, predictions)
print(f1)

rf_f1 = np.mean(cross_val_score(rf_model, X, y, cv=kfold, scoring='f1'))
ada_f1 = np.mean(cross_val_score(ada_model, X, y, cv=kfold, scoring='f1'))

print(rf_f1)
print(ada_f1)


0.6046511627906976
0.6460663293055806
0.6852199947480602


F1 Score
- Decision Tree (max_depth 10): 0.60
- Random Forest: 0.65
- AdaBoost: 0.69

AdaBoost has the highest F1 score, which means it has the best balance between precision and recall.
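To make the relationship between the four metrics concrete, the decision tree scores printed above are consistent with the following confusion-matrix counts on the 180 test bookings: 39 true positives, 19 false positives, 32 false negatives, and 90 true negatives. These counts are inferred from the printed metrics and class supports, not taken from a notebook output.

```python
# Counts inferred from the decision tree (max_depth 10) results above.
tp, fp, fn, tn = 39, 19, 32, 90

precision = tp / (tp + fp)   # share of predicted cancellations that were real
recall = tp / (tp + fn)      # share of real cancellations that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.7167 0.6724 0.5493 0.6047
```

Since F1 is a harmonic mean, it is pulled toward the lower of precision and recall, which is why a model with high precision but low recall still gets a modest F1 score.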

Which metric to prioritize depends on what matters for the problem being predicted. For hotel booking cancellations, you could argue that precision is more important, since predicting a cancellation for a booking that is not actually cancelled would impact the hotel negatively. In that case, the model with the highest precision should be chosen.

# Task 4

# XGBoost Implementation

In [89]:
import xgboost

In [87]:
#Initializing X and y
X = data.drop('is_canceled', axis=1).values
y = data['is_canceled'].values

In [88]:
#Train 80% and Test 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Baseline XGBoost Model

In [109]:
#Initializing xgboost model
xgb_model = xgboost.XGBClassifier(random_state = 5)

#k-fold cross validation
kfold = KFold(n_splits=10, shuffle=True, random_state = 5)

#Metrics
xgb_accuracy = np.mean(cross_val_score(xgb_model, X, y, cv=kfold, scoring='accuracy'))
xgb_precision = np.mean(cross_val_score(xgb_model, X, y, cv=kfold, scoring='precision'))
xgb_recall = np.mean(cross_val_score(xgb_model, X, y, cv=kfold, scoring='recall'))

print(xgb_accuracy)
print(xgb_precision)
print(xgb_recall)

0.7566666666666666
0.722055501969295
0.6171654533251496


# Parameters Chosen

1. max_depth

Max_depth determines the maximum depth of each tree. A higher value means a deeper, more complex tree, which has a risk of overfitting. A smaller value can prevent the model from memorizing noise in the training data.

2. n_estimators

N_estimators is the number of trees that will be built before the predictions are made. A higher number of trees can increase performance, but it also makes the model more complex and slower to train.

3. learning_rate

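Learning_rate (also called eta) scales the contribution of each tree to the final prediction. When the eta is smaller, each tree contributes less, so more trees are usually needed, which can lead to better accuracy at the cost of longer training.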

# Hyperparameter Experimenting

I decided to use the grid search strategy to find the best hyperparameters.
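Grid search fits the model once per parameter combination and per fold, so its cost grows multiplicatively. With three values for each of the three parameters, as in the grid used in this task, that gives 3 * 3 * 3 = 27 candidates, and with 10 folds 270 fits. The sketch below counts them with scikit-learn's `ParameterGrid`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched in this task.
grid = {
    "max_depth": [3, 5, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 200, 300],
}

# ParameterGrid enumerates every combination as a dict.
candidates = list(ParameterGrid(grid))
n_folds = 10

print(len(candidates), len(candidates) * n_folds)  # 27 270
```

Adding one more value to any parameter adds 9 candidates, or 90 more fits, which is worth keeping in mind before widening the grid.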

In [105]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

In [110]:
#Parameters
parameters = {
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300]
}

#Metrics
metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
}

In [116]:
#Grid search implementations
grid_search = GridSearchCV(estimator=xgb_model, param_grid=parameters, scoring=metrics, cv=kfold, refit='accuracy', verbose=1, return_train_score=True)

#Fitting the model
grid_search.fit(X_train, y_train)

#Best parameters and best score
cv_results = grid_search.cv_results_

#best parameters
best_params = grid_search.best_params_
print("Best Parameters: ", best_params)

#accuracy
best_score = grid_search.best_score_
print("Accuracy: ", best_score)

#getting the index of the best parameters to get the other metrics
best_index = grid_search.best_index_

precision = cv_results['mean_test_precision'][best_index]
recall = cv_results['mean_test_recall'][best_index]

print("Precision:" ,precision)
print("Recall:" ,recall)

Fitting 10 folds for each of 27 candidates, totalling 270 fits
Best Parameters:  {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 300}
Accuracy:  0.7972222222222222
Precision: 0.8469055023923445
Recall: 0.604394879275314


The ideal parameters are:

- max_depth: 5
- learning_rate: 0.01
- n_estimators: 300

The metrics I am using are accuracy, precision, and recall.

XGBoost metrics
- Accuracy: 0.797
- Precision: 0.847
- Recall: 0.604

Accuracy
- Decision Tree (max_depth 10): 0.72
- Random Forest: 0.77
- AdaBoost: 0.78


Precision
- Decision Tree (max_depth 10): 0.67
- Random Forest: 0.79
- AdaBoost: 0.78


Recall
- Decision Tree (max_depth 10): 0.55
- Random Forest: 0.55
- AdaBoost: 0.62

Comparing the accuracy, precision, and recall to the other models implemented, XGBoost with the tuned hyperparameters has the highest accuracy (0.797) and precision (0.847), while its recall (0.604) is slightly below AdaBoost's (0.62). Overall, it seems to have the best performance, which shows the importance of hyperparameters and of finding the ideal ones for a specific model.
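The grid-search scores above are cross-validation means computed on the training split only. A complementary check is to score the refitted best model on the held-out test split, which the search never saw. The sketch below shows that pattern under several assumptions: it uses synthetic data, a much smaller grid, and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it runs without xgboost installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data, split 80/20 with a fixed seed.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5, stratify=y_demo
)

# Small illustrative grid; the stand-in model shares these parameter names.
search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=5),
    param_grid={"max_depth": [3, 5], "learning_rate": [0.05, 0.1], "n_estimators": [50]},
    scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=5),
)
search.fit(X_tr, y_tr)

# With the default refit=True, the best candidate is retrained on all of
# X_tr, and search.predict uses that refitted model.
test_pred = search.predict(X_te)
held_out = {
    "accuracy": accuracy_score(y_te, test_pred),
    "precision": precision_score(y_te, test_pred),
    "recall": recall_score(y_te, test_pred),
}
print(search.best_params_, held_out)
```

If the held-out scores are clearly lower than the cross-validation means, the selected hyperparameters may be fitted too closely to the training folds.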