In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder


In [39]:
dataset = pd.read_csv('hotel_bookings_clean.csv')
dataset.head()

Unnamed: 0,hotel,is_canceled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,...,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,IsPortugal,is_same_room_type,isChangedMade,IsGroup
0,Resort Hotel,0,7,0,1,1,0.0,0,BB,GBR,...,No Company,0,Transient,75.0,0,0,0,False,False,False
1,Resort Hotel,0,13,0,1,1,0.0,0,BB,GBR,...,No Company,0,Transient,75.0,0,0,0,True,False,False
2,Resort Hotel,0,14,0,2,2,0.0,0,BB,GBR,...,No Company,0,Transient,98.0,0,1,0,True,False,False
3,Resort Hotel,0,14,0,2,2,0.0,0,BB,GBR,...,No Company,0,Transient,98.0,0,1,0,True,False,False
4,Resort Hotel,0,0,0,2,2,0.0,0,BB,PRT,...,No Company,0,Transient,107.0,0,0,1,True,False,False


In [40]:
# create a test set stratified by deposit_type
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)

# split() yields positional indices, so index with iloc rather than loc
for train_index, test_index in split.split(dataset, dataset['deposit_type']):
    strat_train_set = dataset.iloc[train_index]
    strat_test_set = dataset.iloc[test_index]
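
To check that a stratified split preserves the `deposit_type` proportions, one can compare normalized value counts in the full data and in each subset. The following is a minimal sketch on synthetic data; the toy frame, its category names, and its category shares are assumptions for illustration, not the hotel dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy frame standing in for the bookings data.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "deposit_type": rng.choice(
        ["No Deposit", "Non Refund", "Refundable"], size=1000, p=[0.85, 0.13, 0.02]
    ),
    "is_canceled": rng.integers(0, 2, size=1000),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["deposit_type"]))

# split() yields positional indices, so iloc is the safe accessor.
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

# Share of each deposit type in the full data and in each subset.
proportions = pd.DataFrame({
    "full": toy["deposit_type"].value_counts(normalize=True),
    "train": train["deposit_type"].value_counts(normalize=True),
    "test": test["deposit_type"].value_counts(normalize=True),
})
print(proportions.round(3))
```

The three columns should agree closely; a large difference would indicate that the split was not stratified on the intended column.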

 

In [41]:
# # drop deposit_type from train and test set

# strat_train_set.drop('deposit_type', axis=1, inplace=True)
# strat_test_set.drop('deposit_type', axis=1, inplace=True)


In [42]:
X_train = strat_train_set.drop('is_canceled', axis=1)
y_train = strat_train_set['is_canceled']

X_test = strat_test_set.drop('is_canceled', axis=1)
y_test = strat_test_set['is_canceled']

In [43]:
# Use OrdinalEncoder to encode categorical variables
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
cat_cols = X_train.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()

for col in cat_cols:
    encoder.fit(dataset[col].values.reshape(-1, 1))
    X_train[col] = encoder.transform(X_train[col].values.reshape(-1, 1))
    X_test[col] = encoder.transform(X_test[col].values.reshape(-1, 1))
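
Fitting the encoder on the full `dataset` avoids unseen-category errors, but it lets test-set categories influence the preprocessing. One possible alternative, sketched below on toy data, fits a single `OrdinalEncoder` on the training columns only and maps unseen categories to a sentinel value. The toy frames and the sentinel `-1` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames; "ESP" appears only in the test split.
X_train_toy = pd.DataFrame({"meal": ["BB", "HB", "BB"], "country": ["PRT", "GBR", "PRT"]})
X_test_toy = pd.DataFrame({"meal": ["HB", "BB"], "country": ["ESP", "GBR"]})

cat_cols = ["meal", "country"]

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# One fit over all categorical columns, using training data only.
X_train_toy[cat_cols] = encoder.fit_transform(X_train_toy[cat_cols])
X_test_toy[cat_cols] = encoder.transform(X_test_toy[cat_cols])

print(X_test_toy)
```

Whether a sentinel is acceptable depends on the model; tree-based models can usually split on it, while linear models would treat it as an ordinary numeric value.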

# ***Baseline model***

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

log_reg = LogisticRegression(
    solver='liblinear', random_state=42, max_iter=1000
)
scores = cross_val_score(log_reg, X_train, y_train, cv=5)
scores.mean()

0.8035776817004857

# ***XGBoost model***

In [45]:
# import xgboost as xgb
from xgboost import XGBClassifier

model = XGBClassifier()
 
scores = cross_val_score(model, X_train, y_train, cv=5)
scores.mean()

0.8757413488290325

In [46]:
# test the model on the test set
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score
 
print("Accuracy on Test Set:", accuracy_score(y_test, y_pred)) 

0.8770456687516065

In [47]:
model.fit(X_train, y_train)

y_pred_on_train = model.predict(X_train)

print("Accuracy on Train Set:", accuracy_score(y_train, y_pred_on_train))

0.8884699968583097

The model does not overfit much: training accuracy (0.888) is only about one percentage point above test accuracy (0.877), so it generalizes well to the held-out set.
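
The size of the train/test gap can be computed directly from the two accuracies printed above. A small sketch with the values copied from those outputs:

```python
# Accuracies copied from the outputs above.
train_acc = 0.8884699968583097
test_acc = 0.8770456687516065

# A small positive gap suggests only mild overfitting.
gap = train_acc - test_acc
print(f"Train/test accuracy gap: {gap:.4f}")
```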

0.8768880263268922

In [48]:
feature_importances = model.feature_importances_

sorted_idx = np.argsort(feature_importances)[::-1]
for i in sorted_idx:
    print(f"Importance of feature {X_train.columns[i]}: {feature_importances[i]:.4f}")

Importance of feature deposit_type: 0.5256
Importance of feature required_car_parking_spaces: 0.0997
Importance of feature is_same_room_type: 0.0913
Importance of feature market_segment: 0.0540
Importance of feature IsPortugal: 0.0525
Importance of feature previous_cancellations: 0.0309
Importance of feature total_of_special_requests: 0.0264
Importance of feature customer_type: 0.0163
Importance of feature agent: 0.0153
Importance of feature previous_bookings_not_canceled: 0.0102
Importance of feature distribution_channel: 0.0097
Importance of feature booking_changes: 0.0085
Importance of feature lead_time: 0.0081
Importance of feature country: 0.0059
Importance of feature IsGroup: 0.0052
Importance of feature days_in_waiting_list: 0.0051
Importance of feature adults: 0.0042
Importance of feature children: 0.0036
Importance of feature meal: 0.0036
Importance of feature adr: 0.0033
Importance of feature stays_in_weekend_nights: 0.0031
Importance of feature hotel: 0.0030
Importance of fe

## Insights from Feature Importance Analysis

1. **Major Impact of `deposit_type` (0.5256)**: The `deposit_type` feature stands out as the most significant predictor in our model, with an importance more than five times that of the next feature.

2. **High Relevance of `required_car_parking_spaces` (0.0997)**: As we noticed during EDA, the `required_car_parking_spaces` feature is informative. It is the second most important feature in our model.

3. **Significant Influence of Custom Features**: Both `IsPortugal` and `is_same_room_type`, features that we engineered, show considerable importance in the model. This reveals that both the geographical origin of the guest and the consistency in room allocation (whether the assigned room type matches the reserved room type) are influential factors in predicting cancellations.

4. **Observation on `isChangedMade`**: Interestingly, the `isChangedMade` feature, which we created to capture whether a booking has been modified at least once, shows no significance in our model. This could suggest that the information is better captured by other features such as `booking_changes`.
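
Built-in importances such as `feature_importances_` can be cross-checked with permutation importance measured on held-out data. The sketch below uses synthetic data and a `RandomForestClassifier` as a stand-in; neither is the notebook's actual dataset or model, and the sample sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of them informative.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Mean drop in accuracy when each feature is shuffled on the held-out split.
result = permutation_importance(clf, X_te, y_te, n_repeats=5, random_state=42)

# Print both measures side by side, most important (by permutation) first.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: permutation={result.importances_mean[i]:.4f}, "
          f"built-in={clf.feature_importances_[i]:.4f}")
```

If a feature ranks high on one measure and low on the other, that is a hint that its built-in importance may be shared with a correlated feature.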


In [49]:
# classification report
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.92      0.90      7277
           1       0.86      0.81      0.83      4394

    accuracy                           0.88     11671
   macro avg       0.87      0.86      0.87     11671
weighted avg       0.88      0.88      0.88     11671



## Interpretation of Results

These results are good overall, with an accuracy of 0.88 on the test set.

- The high precision for non-cancellations (Class 0) suggests that when our model predicts a booking will not be canceled, it is likely correct.
- The recall for cancellations (Class 1, 0.81) is lower than the recall for non-cancellations (Class 0, 0.92), indicating that our model misses a larger share of cancellations than of non-cancellations.
- The F1-scores for both classes are relatively high, indicating a good balance between precision and recall for both classes.
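
To make the precision and recall definitions used above concrete, the following sketch derives both from a confusion matrix. The labels are a hypothetical toy example, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = canceled, 0 = not canceled.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Recall for class 1: share of actual cancellations the model catches.
recall_1 = tp / (tp + fn)
# Precision for class 1: share of predicted cancellations that are real.
precision_1 = tp / (tp + fp)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"recall={recall_1:.2f} precision={precision_1:.2f}")
```

A missed cancellation is a false negative, so it lowers recall for class 1 while leaving precision for class 1 unchanged.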




# ***Decision Tree model***

In [50]:
# testing decision tree
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    random_state=42,
    max_depth=None,
    max_features=0.7,
    min_samples_split=25,
    min_samples_leaf=1
)

scores = cross_val_score(model, X_train, y_train, cv=5)
scores.mean()
 

0.8565008695284577

We have configured our Decision Tree Classifier with specific hyperparameters to optimize its performance. Here's a brief summary of each hyperparameter and its role in the model:

- **`max_depth=None`**: Allows the tree to grow as deep as necessary. Without a maximum depth, the tree keeps expanding until every leaf contains samples of only one class or fewer than `min_samples_split` samples.

- **`max_features=0.7`**: The fraction of features to be considered when looking for the best split. Here, 70% of the features are randomly sampled for each split decision, adding randomness to the model and potentially improving generalization.

- **`min_samples_split=25`**: The minimum number of samples required to split an internal node. A higher value prevents the model from learning overly specific patterns and helps in controlling overfitting.

- **`min_samples_leaf=1`**: The minimum number of samples required to be at a leaf node. A smaller leaf size gives the model more flexibility in learning the data but can also lead to overfitting.
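
The effect of `min_samples_split` on tree size can be illustrated on synthetic data. This is a sketch under assumptions: the dataset is generated with `make_classification`, 10% of the labels are flipped to simulate noise, and `max_features` is left at its default so the comparison is deterministic.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, so an unconstrained tree memorizes the noise.
X, y = make_classification(n_samples=2000, n_features=10, flip_y=0.1, random_state=42)

stats = {}
for mss in (2, 25):
    tree = DecisionTreeClassifier(random_state=42, min_samples_split=mss).fit(X, y)
    # Record depth, number of leaves, and training accuracy.
    stats[mss] = (tree.get_depth(), tree.get_n_leaves(), tree.score(X, y))
    print(f"min_samples_split={mss}: depth={stats[mss][0]}, "
          f"leaves={stats[mss][1]}, train acc={stats[mss][2]:.3f}")
```

The tree with the larger `min_samples_split` ends up with fewer leaves and a lower training accuracy, which is the regularizing effect described in the list above.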

In [51]:
# get train accuracy

model.fit(X_train, y_train)

y_pred = model.predict(X_train)

accuracy_score(y_train, y_pred)

0.9171164995858682

## Interpretation of Results

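The results are not as good as those of the XGBoost model (0.857 vs. 0.876 in cross-validation). The gap between training accuracy (0.917) and cross-validation accuracy (0.857) shows that the model overfits noticeably. Bagging could be a solution.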
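
As a rough illustration of the bagging idea, the sketch below compares a single decision tree with a bagged ensemble of the same tree under cross-validation. The synthetic dataset and the choice of 50 estimators are assumptions; results on the hotel data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic stand-in data.
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50, random_state=42
)

tree_cv = cross_val_score(single_tree, X, y, cv=5).mean()
bagged_cv = cross_val_score(bagged_trees, X, y, cv=5).mean()

print(f"single tree CV accuracy:  {tree_cv:.3f}")
print(f"bagged trees CV accuracy: {bagged_cv:.3f}")
```

Each tree in the ensemble is trained on a bootstrap sample and the predictions are aggregated, which tends to reduce the variance of a single deep tree.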

# ***Random Forest***

In [52]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

num_cols = X_train.select_dtypes(include=['int64', 'float64']).columns

for col in num_cols:
    X_train[col] = scaler.fit_transform(X_train[col].values.reshape(-1, 1))
    X_test[col] = scaler.transform(X_test[col].values.reshape(-1, 1))


In [53]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

model = RandomForestClassifier()

scores = cross_val_score(model, X_train, y_train, cv=5)
scores.mean() 

0.8873370569981189

In [54]:
# test the model on the test set
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score
 
accuracy_score(y_test, y_pred)

0.8912689572444521

In [55]:
# grid search
"""
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 4]
}

model = RandomForestClassifier(n_jobs=-1)

gs_model = GridSearchCV(estimator=model,
                        param_grid=param_grid,
                        cv=3,
                        verbose=2)

# optional subset to speed up the search (unused below: the fit runs on the full training set)
X_train_reduced = X_train[:20000]
y_train_reduced = y_train[:20000]

gs_model.fit(X_train, y_train)
 
gs_model.best_params_
 
gs_model.score(X_test, y_test)
"""

Fitting 3 folds for each of 108 candidates, totalling 324 fits
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.9s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.8s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.8s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   2.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   2.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=   2.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=500; total time=   3.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=500; total time=   3.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=500; total time=   3.5s
[CV] END m

0.8919544169308542

## Random Forest Hyperparameter Grid

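- **`n_estimators`**: [100, 300, 500]  
  The number of trees in the forest. More trees can lead to better performance but also increase computation time.

- **`max_depth`**: [None, 5, 10, 20]  
  The maximum depth of each tree. `None` lets each tree grow until its leaves are pure or too small to split.

- **`min_samples_split`**: [2, 4, 6]  
  The minimum number of samples required to split an internal node.

- **`min_samples_leaf`**: [1, 2, 4]  
  The minimum number of samples required at a leaf node.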
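
The number of candidates in a grid search is the product of the lengths of the value lists (3 × 4 × 3 × 3 = 108 for the grid in the cell above, matching the log). The sketch below runs the same kind of search on a deliberately tiny grid and synthetic data so that it finishes quickly; the grid values and dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Deliberately tiny grid: 2 x 2 = 4 candidates.
param_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid=param_grid, cv=3
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(f"{n_candidates} candidates, {n_candidates * 3} fits")
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, which is why `best_estimator_` can be used for prediction directly afterwards.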


In [56]:
# get best model
best_model = gs_model.best_estimator_

y_pred = best_model.predict(X_test)

accuracy_score(y_test, y_pred)

0.8919544169308542

In [57]:
gs_model.best_params_


{'max_depth': None,
 'min_samples_leaf': 1,
 'min_samples_split': 4,
 'n_estimators': 500}

In [58]:
# classification report
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.93      0.91      7277
           1       0.88      0.83      0.85      4394

    accuracy                           0.89     11671
   macro avg       0.89      0.88      0.88     11671
weighted avg       0.89      0.89      0.89     11671



## Results Interpretation

These results are similar to those of the XGBoost model, but slightly better.

- All reported metrics have improved compared to the XGBoost model.
- The F1-score for class 1 has improved from 0.83 to 0.85.
- The F1-score for class 0 has improved from 0.90 to 0.91.
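
The per-class improvements quoted above are simple differences between the two classification reports. A small sketch with the F1 values copied from those reports:

```python
# F1-scores copied from the two classification reports above.
xgb_f1 = {"0": 0.90, "1": 0.83}
rf_f1 = {"0": 0.91, "1": 0.85}

# Per-class improvement of the random forest over XGBoost.
deltas = {label: round(rf_f1[label] - xgb_f1[label], 2) for label in xgb_f1}

for label, delta in deltas.items():
    print(f"class {label}: F1 improved by {delta:.2f}")
```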
