# From Messy Data to Actionable Insights: Predicting Hotel Booking Cancellations
## A Real-World Binary Classification Pipeline with Business Impact Analysis

In [None]:
# If running on a fresh environment, uncomment:
# !pip install -q scikit-learn xgboost shap optuna seaborn joblib plotly
# (plotly is needed for optuna.visualization in the tuning section)

# Core data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import joblib
from datetime import datetime

# Scikit-learn for preprocessing, modeling, and evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve, CalibrationDisplay
from sklearn.metrics import (
    classification_report, 
    confusion_matrix, 
    roc_auc_score, 
    roc_curve, 
    precision_recall_curve,
    brier_score_loss,
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay
)

# Advanced libraries
import xgboost as xgb
import shap
import optuna

# Settings for reproducibility and display
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 100)
np.random.seed(42)

# Set a professional plot style
sns.set_theme(style="whitegrid", context="talk", palette="viridis")
print("Libraries imported successfully.")

---

## CORE (90 min): From data to calibrated, thresholded classifier

### 1. The Business Problem: The Cost of Cancellations

In the hotel industry, managing bookings is a high-stakes balancing act. A vacant room is lost revenue, but overbooking can lead to angry customers and reputational damage. Booking cancellations are a major source of this uncertainty. A reliable model that predicts which bookings are likely to be canceled can be a game-changer for a hotel's revenue management strategy.

**Business Impact:**
*   **Optimized Overbooking:** If a hotel can confidently predict cancellations, it can fine-tune its overbooking levels to maximize occupancy without inconveniencing guests.
*   **Targeted Marketing:** High-risk bookings could be targeted with gentle reminders or non-refundable upgrade offers to "lock in" the reservation.
*   **Improved Staffing & Resource Planning:** Accurate demand forecasting leads to better allocation of staff, cleaning services, and inventory.

Our goal is to build a binary classification model to predict the value of the `is_canceled` column. This involves not just achieving high accuracy, but understanding the costs of different types of errors:
-   **False Positive (Type I Error):** Predicting a cancellation that doesn't happen. The cost is potentially offering an unnecessary discount or irritating a committed customer.
-   **False Negative (Type II Error):** Failing to predict a cancellation that does happen. The cost is lost revenue from an unexpectedly empty room.

Often, the cost of a false negative is significantly higher than a false positive, a crucial consideration for our model.

#### 1.1. Data Acquisition

We will use a publicly available dataset on hotel bookings (Antonio, N., de Almeida, A., & Nunes, L. (2019). Hotel booking demand datasets. Data in brief, 22, 41-49. https://www.sciencedirect.com/science/article/pii/S2352340918315191). The data contains booking information for a city hotel and a resort hotel, including details like booking date, length of stay, number of guests, and whether the booking was ultimately canceled.

In [None]:
# Download the data from a public repository to ensure reproducibility
!wget -q https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv -O hotel_bookings.csv

In [None]:
# Load the dataset
df_raw = pd.read_csv('hotel_bookings.csv')

In [None]:
# Get a first impression of the data
print(f"Dataset shape: {df_raw.shape}")
df_raw.head(3)

#### 1.2. Initial Data Quality Assessment

Let's quickly check for missing values and incorrect data types.

In [None]:
df_raw.info()

This initial check reveals:
- **Missing Data:** The `company`, `agent`, `country`, and `children` columns have missing values. `company` is missing a very large percentage, while `children` is missing only a handful of rows.
- **Data Types:** Date information is split across multiple columns (`arrival_date_year`, `arrival_date_month`, etc.), which is not ideal for analysis.

In [None]:
# Check the distribution of our target variable
cancellation_counts = df_raw['is_canceled'].value_counts(normalize=True) * 100
print("Cancellation Distribution:")
print(cancellation_counts)

plt.figure(figsize=(8, 6))
sns.countplot(x='is_canceled', data=df_raw, palette='viridis')
plt.title('Distribution of Cancellations (0 = Not Canceled, 1 = Canceled)')
plt.ylabel('Number of Bookings')
plt.xticks([0, 1], [f'Not Canceled\n({cancellation_counts[0]:.1f}%)', f'Canceled\n({cancellation_counts[1]:.1f}%)'])
plt.show()

We have a class imbalance, with about 37% of bookings being canceled. This isn't extreme, but it's significant enough that we must account for it in our modeling process. Standard accuracy can be misleading, and the model might become biased towards the majority class (not canceled).
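
To make the point about misleading accuracy concrete, here is a minimal sketch that scores a trivial "never cancels" predictor. The labels are synthetic; only the roughly 37% positive rate mirrors the dataset.

```python
# Synthetic labels: 37 cancellations out of 100 bookings (illustrative only).
y_true = [1] * 37 + [0] * 63

# A trivial predictor that always outputs the majority class (not canceled).
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = caught / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")  # 0.63
print(f"Recall:   {recall:.2f}")    # 0.00
```

The predictor looks acceptable on accuracy while catching no cancellations at all, which is why the later sections rely on ROC AUC, precision, and recall instead.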

---

### 2. Data Cleaning & Feature Engineering

We'll now create a clean DataFrame `df` and engineer more powerful features to improve our model's performance.

**🎯 Section Objectives:**

-   Handle missing values with appropriate strategies.
-   Create a unified `arrival_date` column.
-   Engineer new features that capture booking behavior, seasonality, and guest characteristics.

#### 2.1. Handling Missing Values

We'll apply simple but effective strategies for the missing data.

In [None]:
# Make a copy to work with
df = df_raw.copy()

# For 'agent' and 'company', NaN might mean the booking was made directly.
# We'll fill NaN with 0 for these ID columns.
# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas and may not modify `df` at all.
df['agent'] = df['agent'].fillna(0)
df['company'] = df['company'].fillna(0)

# For 'country', a small number are missing. We'll fill with the mode ('PRT').
df['country'] = df['country'].fillna(df['country'].mode()[0])

# For 'children', NaN is rare. We'll assume 0 children.
df['children'] = df['children'].fillna(0)

#### 2.2. Feature Engineering

Creating new features from existing ones is often the key to building a high-performing model.

##### Temporal Features
Let's combine the date columns into a single `datetime` object and extract more granular time-based features.

In [None]:
# Map month names to numbers
month_map = {
    'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6,
    'July': 7, 'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12
}
df['arrival_date_month_num'] = df['arrival_date_month'].map(month_map)

# Create the full arrival_date
df['arrival_date'] = pd.to_datetime(
    df['arrival_date_year'].astype(str) + '-' +
    df['arrival_date_month_num'].astype(str) + '-' +
    df['arrival_date_day_of_month'].astype(str),
    errors='coerce' # Handle any invalid dates gracefully
)

# New seasonal/temporal features
df['arrival_day_of_week'] = df['arrival_date'].dt.dayofweek # Monday=0, Sunday=6
df['is_summer'] = df['arrival_date_month_num'].isin([6, 7, 8]).astype(int)

##### Behavioral & Interaction Features
We can combine columns to create more meaningful variables that capture guest behavior.

In [None]:
# Total number of guests
df['total_guests'] = df['adults'] + df['children'] + df['babies']

# Total stay duration in nights
df['total_nights'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']

# Create a binary feature for whether any deposit was made
df['has_deposit'] = (df['deposit_type'] != 'No Deposit').astype(int)

# Ratio of previous cancellations
# Add a small epsilon to avoid division by zero
df['previous_cancellation_rate'] = df['previous_cancellations'] / (df['previous_cancellations'] + df['previous_bookings_not_canceled'] + 1e-6)

# It's unlikely a booking has 0 guests or 0 nights. Let's treat these as invalid and remove them.
initial_rows = df.shape[0]
df = df[(df['total_guests'] > 0) & (df['total_nights'] > 0)].copy()
print(f"Removed {initial_rows - df.shape[0]} rows with 0 guests or 0 nights.")

✅ **Checkpoint:** We've cleaned the data and engineered several new features related to seasonality, guest history, and booking duration. This enriched dataset should provide more signal for our model.

---

### 3. Exploratory Data Analysis (EDA)

Let's explore the relationships between our features and the target variable, `is_canceled`.

#### 3.1. What Drives Cancellations?

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(22, 14))

# Lead Time vs. Cancellation (binned for readability)
df_lt = df[['lead_time','is_canceled']].copy()
df_lt['lead_time_bin'] = pd.qcut(df_lt['lead_time'], q=20, duplicates='drop')
plot_df = df_lt.groupby('lead_time_bin', observed=True)['is_canceled'].mean().reset_index()
sns.lineplot(data=plot_df, x=plot_df.index, y='is_canceled', ax=axes[0, 0], marker='o')
axes[0, 0].set_title('Cancellation Rate vs. Lead Time (binned)')
axes[0, 0].set_ylabel('Cancellation Probability')
axes[0, 0].set_xlabel('Lead Time (Quantile Bins)')
axes[0, 0].set_xticks([])


# Deposit Type vs. Cancellation
sns.barplot(x='deposit_type', y='is_canceled', data=df, palette='viridis', ax=axes[0, 1])
axes[0, 1].set_title('Cancellation Rate by Deposit Type')
axes[0, 1].set_ylabel('Cancellation Rate')
axes[0, 1].set_xlabel('Deposit Type')

# Repeated Guest vs. Cancellation
sns.barplot(x='is_repeated_guest', y='is_canceled', data=df, palette='viridis', ax=axes[1, 0])
axes[1, 0].set_title('Cancellation Rate for Repeated Guests')
axes[1, 0].set_ylabel('Cancellation Rate')
axes[1, 0].set_xlabel('Is Repeated Guest (0=No, 1=Yes)')

# Cancellation Rate by Arrival Month
monthly_cancellation = df.groupby('arrival_date_month_num')['is_canceled'].mean().reset_index()
sns.lineplot(x='arrival_date_month_num', y='is_canceled', data=monthly_cancellation, marker='o', ax=axes[1, 1])
axes[1, 1].set_title('Cancellation Rate by Arrival Month')
axes[1, 1].set_ylabel('Cancellation Rate')
axes[1, 1].set_xlabel('Month of Arrival')
axes[1, 1].set_xticks(range(1, 13))

plt.tight_layout()
plt.show()

#### 3.2. Correlation Analysis

A correlation heatmap helps us understand linear relationships between numerical features and identify potential multicollinearity.

In [None]:
# Select a subset of numeric features for the heatmap for clarity
numeric_subset = [
    'lead_time', 'total_guests', 'total_nights', 'previous_cancellations',
    'booking_changes', 'adr', 'total_of_special_requests', 'is_canceled'
]
corr_matrix = df[numeric_subset].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='viridis', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix of Key Numeric Features')
plt.show()

**Business Insights from EDA:**
1.  **Lead Time is Critical:** The longer a booking is made in advance, the higher the probability of cancellation. This is the strongest single indicator from our initial plots.
2.  **Deposit Type Is a Strong Signal:** Cancellation rates differ sharply by deposit type, but not in the intuitive direction. The "Non Refund" category is an interesting edge case; its near-100% cancellation rate might reflect how no-shows on non-refundable bookings are recorded, so treat it as a possible data artifact rather than evidence that deposits encourage cancellations.
3.  **Loyalty Pays Off:** Repeated guests are far less likely to cancel. They are more familiar with the hotel and likely have more concrete travel plans.
4.  **Seasonality Matters:** Cancellation rates fluctuate throughout the year, peaking in early summer and dropping in the autumn. This could inform seasonal marketing campaigns.
5.  **Behavioral Signals:** The correlation matrix shows that `total_of_special_requests` has a negative correlation with `is_canceled`. This confirms our intuition: guests who make requests are more invested in their stay.

---

### 4. Temporal Split

For this kind of booking data, a random split is dangerous. It can lead to **data leakage**, where the model learns from future information to predict the past. A **temporal split** is much more robust. We will sort the data by arrival date and use earlier bookings for training and later bookings for testing.

> **⚠️ Common Pitfall: Data Leakage in Time-Series Data**
> Using a standard `train_test_split` on time-ordered data like bookings is a frequent and serious mistake. A random split would mean your training set could contain bookings from August 2017, while your test set has bookings from January 2017. The model would learn patterns from the future to predict the past—something impossible in a real-world deployment. A temporal split ensures our model evaluation realistically simulates predicting future bookings based only on past data.

In [None]:
# Final feature selection for our model
# Note: 'agent' is an ID and can behave like a noisy proxy; for purity, consider dropping it.
# We'll leave it in for now and revisit in advanced sessions.
numeric_features = [
    'lead_time', 'arrival_date_week_number', 'arrival_date_day_of_month',
    'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children',
    'babies', 'is_repeated_guest', 'previous_cancellations',
    'previous_bookings_not_canceled', 'booking_changes', 'agent',
    'days_in_waiting_list', 'adr', 'required_car_parking_spaces',
    'total_of_special_requests', 'total_guests', 'total_nights',
    'is_summer', 'previous_cancellation_rate'
]

categorical_features = [
    'hotel', 'meal', 'market_segment', 'distribution_channel',
    'reserved_room_type', 'deposit_type', 'customer_type'
]

# Our target variable
target = 'is_canceled'

# Drop rows where arrival_date is missing (from our datetime conversion)
df_model = df.dropna(subset=['arrival_date']).copy()

# Sort by date to prepare for temporal split
df_model = df_model.sort_values('arrival_date')

X = df_model[numeric_features + categorical_features]
y = df_model[target]

# Perform the temporal split
# We'll use the first 70% of data for training, the next 15% for validation, and the last 15% for testing.
train_size = int(len(X) * 0.70)
val_size = int(len(X) * 0.15)

# .iloc makes the positional slicing explicit (the index holds the original row labels, not positions)
X_train, y_train = X.iloc[:train_size], y.iloc[:train_size]
X_val, y_val = X.iloc[train_size:train_size + val_size], y.iloc[train_size:train_size + val_size]
X_test, y_test = X.iloc[train_size + val_size:], y.iloc[train_size + val_size:]

In [None]:
# Verify the shapes and time ranges
print(f"Training set shape:   {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")
print(f"Test set shape:       {X_test.shape}\n")
print(f"Training data goes up to: {df_model.iloc[train_size - 1]['arrival_date'].date()}")
print(f"Validation data from:     {df_model.iloc[train_size]['arrival_date'].date()} to {df_model.iloc[train_size + val_size - 1]['arrival_date'].date()}")
print(f"Test data from:           {df_model.iloc[train_size + val_size]['arrival_date'].date()}")
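
The positional split above cuts at a row count, so bookings that share one arrival date can land on both sides of a boundary. One alternative, sketched here under the assumption that explicit cutoff dates are acceptable for the analysis, is to split on the date itself. The helper name `split_by_date`, the toy data, and the cutoff dates are all hypothetical.

```python
import pandas as pd

def split_by_date(frame, date_col, val_start, test_start):
    """Split chronologically at explicit dates so no date spans two sets."""
    val_start = pd.Timestamp(val_start)
    test_start = pd.Timestamp(test_start)
    train = frame[frame[date_col] < val_start]
    val = frame[(frame[date_col] >= val_start) & (frame[date_col] < test_start)]
    test = frame[frame[date_col] >= test_start]
    return train, val, test

# Tiny synthetic example; dates and cutoffs are illustrative assumptions.
demo = pd.DataFrame({
    "arrival_date": pd.to_datetime(
        ["2017-01-05", "2017-03-10", "2017-03-10", "2017-05-01", "2017-07-20"]
    ),
    "is_canceled": [0, 1, 0, 1, 0],
})
train, val, test = split_by_date(demo, "arrival_date", "2017-03-10", "2017-06-01")
print(len(train), len(val), len(test))  # 1 3 1
```

Both bookings dated 2017-03-10 end up in the validation set, which a row-count cut would not guarantee.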

---

### 5. Preprocessing Pipeline

We'll use a `scikit-learn` pipeline to ensure our preprocessing steps (imputation, scaling, encoding) are applied consistently and without data leakage. This is crucial for creating a robust and deployable model.

> **💡 Pro Tip: Why a Pipeline?**
> A pipeline bundles preprocessing and modeling steps into a single object. This has several advantages:
> 1.  **Prevents Data Leakage:** The statistics used for imputation, scaling, and encoding (such as means, medians, and category lists) are learned from the training data only, so information from the validation or test sets never reaches the fitted transformers.
> 2.  **Simplifies Workflow:** You only need to call `.fit()` and `.predict()` on the pipeline object, which handles all the intermediate steps automatically.
> 3.  **Ensures Consistency:** The exact same transformations are applied during training, evaluation, and future deployment, preventing subtle bugs.
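
The fit-then-transform discipline that a pipeline enforces can be sketched by hand with the standard library. The numbers below are made up; the point is only that the scaling statistics come from the training values and are then reused unchanged on unseen data.

```python
from statistics import mean, pstdev

train_adr = [80.0, 100.0, 120.0]   # illustrative training values
test_adr = [200.0]                 # illustrative unseen value

# "fit": learn the scaling statistics from the training split only
mu, sigma = mean(train_adr), pstdev(train_adr)

# "transform": reuse those same statistics on any later data
def scale(values):
    return [(v - mu) / sigma for v in values]

print(scale(train_adr))
print(scale(test_adr))
```

Recomputing `mu` and `sigma` on the test values instead would be exactly the kind of leakage the pipeline is meant to rule out.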

In [None]:
# Create the numeric pipeline
# We use 'median' for imputation as it's robust to outliers often found in fields like 'adr'.
# StandardScaler standardizes features to have mean 0 and variance 1, which is important for linear models.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Create the categorical pipeline
# 'most_frequent' imputation is a safe choice for categorical data.
# OneHotEncoder converts categories into a numerical format the model can understand, without implying an order.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Assemble the preprocessor using ColumnTransformer
# This object applies the correct pipeline to the correct columns.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'
)

### 6. Models: Dummy → LR → RF → XGB

We'll start with baselines and train several candidate models. Our primary evaluation metric will be **ROC AUC**, which does not depend on a decision threshold and is a reasonable choice for imbalanced classification problems. To handle the class imbalance, Logistic Regression and Random Forest use `class_weight='balanced'`, while XGBoost uses `scale_pos_weight`; both approaches give more weight to the minority class (cancellations).
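
As a rough illustration of what these imbalance weights amount to, the sketch below computes scikit-learn's "balanced" heuristic, `n_samples / (n_classes * class_count)`, and the negatives-to-positives ratio commonly passed to XGBoost as `scale_pos_weight`. The labels are synthetic and only mimic a 37% cancellation rate.

```python
from collections import Counter

y = [0] * 63 + [1] * 37  # illustrative labels with a 37% positive rate

counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# scikit-learn's class_weight='balanced' heuristic:
# n_samples / (n_classes * count_of_class)
balanced = {c: n_samples / (n_classes * n) for c, n in counts.items()}

# The ratio commonly used for XGBoost's scale_pos_weight: negatives / positives
scale_pos_weight = counts[0] / counts[1]

print(balanced)
print(round(scale_pos_weight, 2))  # 1.7
```

The two schemes express the same relative emphasis: the ratio of the balanced weights for class 1 and class 0 equals `scale_pos_weight`.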

In [None]:
# --- Baseline Model (Dummy Classifier) ---
# random_state makes the stratified random guesses reproducible across runs
dummy_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('classifier', DummyClassifier(strategy='stratified', random_state=42))])
dummy_pipeline.fit(X_train, y_train)
y_pred_dummy_proba = dummy_pipeline.predict_proba(X_val)[:, 1]
roc_auc_dummy = roc_auc_score(y_val, y_pred_dummy_proba)
print(f"Baseline (Dummy) ROC AUC: {roc_auc_dummy:.4f}")

# --- Logistic Regression ---
logreg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('classifier', LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000))])
logreg_pipeline.fit(X_train, y_train)
y_pred_logreg_proba = logreg_pipeline.predict_proba(X_val)[:, 1]
roc_auc_logreg = roc_auc_score(y_val, y_pred_logreg_proba)
print(f"Logistic Regression ROC AUC: {roc_auc_logreg:.4f}")

# --- Random Forest ---
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', RandomForestClassifier(random_state=42, class_weight='balanced'))])
rf_pipeline.fit(X_train, y_train)
y_pred_rf_proba = rf_pipeline.predict_proba(X_val)[:, 1]
roc_auc_rf = roc_auc_score(y_val, y_pred_rf_proba)
print(f"Random Forest ROC AUC: {roc_auc_rf:.4f}")


# --- XGBoost (Initial) ---
# Calculate scale_pos_weight for handling imbalance
neg, pos = np.bincount(y_train)
scale_pos_weight = neg / pos
print(f"Calculated scale_pos_weight for XGBoost: {scale_pos_weight:.2f}")

# Note: use_label_encoder is omitted; it is deprecated in recent XGBoost releases and has no effect there.
xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', xgb.XGBClassifier(objective='binary:logistic', random_state=42,
                                                                eval_metric='logloss',
                                                                scale_pos_weight=scale_pos_weight))])
# We will train our final model later after choosing it
# For now, let's just create the pipeline and train it on the train set for evaluation
xgb_pipeline.fit(X_train, y_train)
y_pred_xgb_proba = xgb_pipeline.predict_proba(X_val)[:, 1]
roc_auc_xgb = roc_auc_score(y_val, y_pred_xgb_proba)
print(f"XGBoost (Initial) ROC AUC: {roc_auc_xgb:.4f}")

**Model Selection:** Both Random Forest and XGBoost performed very well. We will proceed with **XGBoost**. It offers more parameters for fine-tuning and often has a slight edge in performance. For the remainder of the CORE section, we will use this initial XGBoost model. In the PLUS section, we will tune it further.

In [None]:
# Create final pipeline with default XGBoost for now
# Train on combined train + validation data for final evaluation on test set
from sklearn.base import clone

X_train_full = pd.concat([X_train, X_val])
y_train_full = pd.concat([y_train, y_val])

# clone() gives this pipeline its own preprocessor. A Pipeline does not copy its steps,
# so refitting the shared `preprocessor` object elsewhere would otherwise change this pipeline too.
final_model_pipeline = Pipeline(steps=[('preprocessor', clone(preprocessor)),
                               ('classifier', xgb.XGBClassifier(objective='binary:logistic', random_state=42,
                                                                eval_metric='logloss',
                                                                scale_pos_weight=scale_pos_weight))])

final_model_pipeline.fit(X_train_full, y_train_full)
y_pred_final_proba = final_model_pipeline.predict_proba(X_test)[:, 1]

#### 6.5. The ROC Curve and AUC: Visualizing Predictive Power

Our model produces a *probability* of cancellation, a risk score from 0% to 100%. The **ROC (Receiver Operating Characteristic) Curve** visualizes the trade-off between the True Positive Rate (catching actual cancellations) and the False Positive Rate (incorrectly flagging safe bookings) as we vary the decision threshold.

The Area Under the Curve (AUC) summarizes this performance. A score of 0.5 is no better than a random guess, while 1.0 is perfect.
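
One way to build intuition for AUC is its ranking interpretation: it equals the probability that a randomly chosen canceled booking receives a higher score than a randomly chosen non-canceled one. The sketch below computes that directly on a four-row toy example. The function name `pairwise_auc` is hypothetical, and the quadratic loop is meant for illustration, not for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_demo, p_demo))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, hence 0.75.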

In [None]:
# --- Plotting the ROC Curve ---
fpr, tpr, thresholds = roc_curve(y_test, y_pred_final_proba)
final_roc_auc = roc_auc_score(y_test, y_pred_final_proba)

print(f"Final Model ROC AUC on Test Set: {final_roc_auc:.4f}")

plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {final_roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Guess')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=14)
plt.ylabel('True Positive Rate (Recall/Sensitivity)', fontsize=14)
plt.title('ROC Curve for Cancellation Prediction', fontsize=16)
plt.legend(loc="lower right", fontsize=12)
plt.grid(True)
plt.show()

#### 6.6. Probability Calibration: Can We Trust the Scores?

A high AUC is great, but for business decisions, we also need the model's probabilities to be **well-calibrated**. This means that if the model predicts an 80% cancellation risk, roughly 80% of those bookings should actually cancel.
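
To show what is being measured, here is a dependency-free sketch of the two calibration checks: the Brier score (mean squared difference between predicted probability and outcome) and a reliability table that compares the mean predicted probability with the observed cancellation rate in uniform-width bins. The helper names and the toy data are assumptions for illustration.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

def reliability_table(y_true, probs, n_bins=5):
    """Per uniform-width bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:  # skip empty bins
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((round(mean_p, 3), round(rate, 3), len(members)))
    return rows

y_demo = [0, 0, 0, 1, 0, 1, 1, 1]
p_demo = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
print(brier_score(y_demo, p_demo))
print(reliability_table(y_demo, p_demo))
```

A well-calibrated model has the first two entries of each row close to each other; in this toy data the middle bins (0.3 predicted vs. 0.5 observed, 0.7 predicted vs. 0.5 observed) are visibly off.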

In [None]:
brier = brier_score_loss(y_test, y_pred_final_proba)
print(f"Brier Score Loss: {brier:.4f} (lower is better)")

fig, ax = plt.subplots(figsize=(10, 8))
CalibrationDisplay.from_predictions(y_test, y_pred_final_proba, n_bins=10, ax=ax, strategy='uniform')
ax.set_title("Calibration Plot (Reliability Diagram)")
plt.show()

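**Interpretation:** A calibration curve that stays close to the ideal diagonal, together with a low Brier score, gives us confidence that the predicted probabilities are reliable and can be used directly for tasks like calculating expected revenue loss. One caveat: `scale_pos_weight` up-weights the positive class during training, which tends to push predicted probabilities upward, so check the curve for systematic over-prediction before using the raw scores as probabilities.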

#### 6.7. The Precision-Recall Trade-off

While the ROC curve is excellent for overall model evaluation, the **Precision-Recall (P-R) Curve** is often more informative for imbalanced datasets and for business problems where the positive class (cancellations) is the main focus.

*   **Precision:** *"When the model predicts a cancellation, how often is it right?"*
*   **Recall:** *"Of all the bookings that were actually canceled, what percentage did our model catch?"*

The P-R curve helps us visualize the trade-off as we adjust the probability threshold.
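
The two definitions can be written out in a few lines. This sketch evaluates precision and recall at three thresholds on made-up scores; `precision_recall_at` is a hypothetical helper, and returning a precision of 1.0 when nothing is flagged is a convention chosen here for simplicity.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when flagging every booking with p >= threshold."""
    y_hat = [int(p >= threshold) for p in probs]
    tp = sum(t == 1 and h == 1 for t, h in zip(y_true, y_hat))
    fp = sum(t == 0 and h == 1 for t, h in zip(y_true, y_hat))
    fn = sum(t == 1 and h == 0 for t, h in zip(y_true, y_hat))
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_demo = [0, 0, 1, 0, 1, 1]
p_demo = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
for t in (0.2, 0.5, 0.8):
    print(t, precision_recall_at(y_demo, p_demo, t))
```

Raising the threshold from 0.2 to 0.8 moves precision from 0.6 to 1.0 while recall falls from 1.0 to one third, which is the trade-off the curve traces out.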

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
PrecisionRecallDisplay.from_predictions(y_test, y_pred_final_proba, ax=ax)
plt.title("Precision-Recall Curve")
plt.show()

**Business Application:** Choosing the right threshold is a business decision.
*   **High-Precision Strategy:** Use a high threshold if the cost of acting on a false positive is high (e.g., offering a large, unnecessary discount).
*   **High-Recall Strategy:** Use a low threshold if the cost of a missed cancellation (a false negative) is high (e.g., an unexpectedly empty room).

#### 6.8. Cost-Based Thresholding

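We can formalize the threshold selection by defining the costs of our errors and finding the threshold that minimizes the total cost. For simplicity we run the search on the test set here; because the threshold is then tuned on the same data used to report results, the metrics that follow are somewhat optimistic. In a real project, choose the threshold on a held-out validation period and report performance on data the threshold has never seen.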

In [None]:
# --- Cost-Based Thresholding ---
# Define business costs (edit these in class)
C_FP = 50   # cost of acting on a non-canceler (e.g., offering an unnecessary discount)
C_FN = 200  # cost of missing a true cancelation (e.g., lost revenue from an empty room)

thresholds = np.linspace(0, 1, 201)

def expected_cost(y_true, p, t, c_fp=C_FP, c_fn=C_FN):
    y_hat = (p >= t).astype(int)
    FP = ((y_true == 0) & (y_hat == 1)).sum()
    FN = ((y_true == 1) & (y_hat == 0)).sum()
    return FP * c_fp + FN * c_fn

costs = np.array([expected_cost(y_test.values, y_pred_final_proba, t) for t in thresholds])
t_star = float(thresholds[costs.argmin()])

print(f"Cost-minimizing threshold: {t_star:.2f}")

plt.figure(figsize=(9, 6))
plt.plot(thresholds, costs)
plt.xlabel("Threshold")
plt.ylabel("Expected Cost")
plt.title("Expected Cost vs Threshold")
plt.grid(True)
plt.axvline(x=t_star, color='r', linestyle='--', label=f'Optimal Threshold = {t_star:.2f}')
plt.legend()
plt.show()
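
As a cross-check on the empirical search, there is a closed-form threshold under two assumptions: the probabilities are well-calibrated, and correct decisions cost nothing. Flagging a booking costs `(1 - p) * c_fp` in expectation and not flagging it costs `p * c_fn`, so flagging is cheaper whenever `p >= c_fp / (c_fp + c_fn)`. The function name below is hypothetical.

```python
def bayes_threshold(c_fp, c_fn):
    """Flag a booking when p * c_fn >= (1 - p) * c_fp, i.e. p >= c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)

# With the example costs used in this section (50 and 200):
print(bayes_threshold(50, 200))  # 0.2
```

If the empirical optimum sits far from this value, that is a hint the probabilities are not well-calibrated or that the evaluation sample is small.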

Now we evaluate our model's performance using this business-aware threshold.

In [None]:
# Evaluate at t_star
y_pred_star = (y_pred_final_proba >= t_star).astype(int)
print(f"Classification Report on Test Set (Threshold = {t_star:.2f}):")
print(classification_report(y_test, y_pred_star))

fig, ax = plt.subplots(figsize=(7, 7))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_star, normalize='true', ax=ax)
ax.set_title(f'Confusion Matrix (Threshold = {t_star:.2f})')
plt.show()

#### 6.9. From Probabilities to Forecasts

When the model's probabilities are well-calibrated, we can aggregate them to create powerful forecasts. For example, summing the predicted probabilities of all bookings that share an arrival date estimates how many of those bookings will cancel. This is a direct input for dynamic overbooking strategies.

In [None]:
# Expected cancellations per arrival date over the test period
# (sum of predicted probabilities for bookings arriving on that date)
nightly_expected_cancels = (
    df_model.loc[y_test.index, ['arrival_date']]
    .assign(p=y_pred_final_proba)
    .groupby('arrival_date')['p']
    .sum()
    .sort_index()
)

print("Sample of expected cancellations by arrival date:")
nightly_expected_cancels.head()
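
A forecast is more useful with a sense of its uncertainty. Under the additional assumption that bookings cancel independently of each other (which group bookings, for instance, would violate), the cancellation count for a date has mean `sum(p)` and variance `sum(p * (1 - p))`. The probabilities below are illustrative.

```python
from math import sqrt

# Illustrative calibrated cancellation probabilities for bookings on one arrival date
probs = [0.9, 0.6, 0.3, 0.2]

expected = sum(probs)                             # mean of the cancellation count
std_dev = sqrt(sum(p * (1 - p) for p in probs))   # assumes independent bookings

print(f"Expected cancellations: {expected:.1f} (std {std_dev:.2f})")
```

An overbooking rule could then act on something like the expected count minus one standard deviation rather than on the point estimate alone.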

## PLUS (Optional): Tuning, Explainability, Deployment

### 7. Optuna Tuning

Run if you want to push AUC further; default trials are small. We'll use `Optuna` to automatically find the best combination of hyperparameters for our XGBoost model, optimizing for the ROC AUC score on our validation set.

In [None]:
# Fit a dedicated copy of the preprocessor on the training data only.
# clone() matters here: `preprocessor` is also referenced by the pipelines built
# earlier, so refitting it in place would silently change their fitted state.
# Transforming once, outside the objective, also avoids repeating the work per trial.
from sklearn.base import clone

tuning_preprocessor = clone(preprocessor)
X_train_transformed = tuning_preprocessor.fit_transform(X_train)
X_val_transformed = tuning_preprocessor.transform(X_val)

# Define the objective function for Optuna
# This function takes a 'trial' object, which suggests hyperparameter values.
# It then trains a model with these values and returns a performance score.
def objective(trial):
    # Define the search space for hyperparameters
    # Optuna samples from these ranges, using earlier trials to guide later ones.
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 200, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'objective': 'binary:logistic',
        'random_state': 42,
        'eval_metric': 'logloss',
        'scale_pos_weight': scale_pos_weight
    }

    # Train the model with the suggested parameters
    model = xgb.XGBClassifier(**params)
    model.fit(X_train_transformed, y_train)

    # Evaluate on the validation set
    y_pred_proba = model.predict_proba(X_val_transformed)[:, 1]
    return roc_auc_score(y_val, y_pred_proba)

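# Create a study and optimize (run a small number of trials for the workshop)
# Seeding the sampler keeps the suggested hyperparameters reproducible across runs.
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))
# Note: For a real project, n_trials would be much larger (e.g., 50-100).
study.optimize(objective, n_trials=20)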

print(f"Best trial ROC AUC: {study.best_value:.4f}")
print("Best hyperparameters:", study.best_params)

# Visualize the optimization history
optuna.visualization.plot_optimization_history(study)

After tuning, we would create a new `final_model_pipeline` with these `best_params` and retrain it on the full training data (`X_train_full`, `y_train_full`) before proceeding to the next steps. For simplicity, we will continue using the untuned model for the explainability section.

### 8. SHAP

SHAP (SHapley Additive exPlanations) goes deeper than standard feature importance, showing us the impact of each feature for every single prediction.
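
The idea behind SHAP values can be illustrated with an exact Shapley computation on a tiny, made-up scoring function. This is a simplified sketch: it compares against a single all-zero baseline row and enumerates every feature ordering, whereas the `shap` library averages over a background distribution and uses far more efficient algorithms for tree models. The function `toy_score` and its coefficients are hypothetical and unrelated to the trained model.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to its actual value
            new = predict(current)
            phi[i] += new - prev       # marginal contribution of feature i in this ordering
            prev = new
    return [v / len(orderings) for v in phi]

# Hypothetical toy risk score with an interaction term (not the notebook's model)
def toy_score(f):
    lead_time, no_deposit, requests = f
    return 0.002 * lead_time + 0.2 * no_deposit - 0.1 * requests + 0.001 * lead_time * no_deposit

x = [100, 1, 2]
baseline = [0, 0, 0]
phi = shapley_values(toy_score, x, baseline)
print(phi)
print(sum(phi), toy_score(x) - toy_score(baseline))
```

The contributions add up to the difference between the prediction and the baseline prediction, and the interaction effect is shared equally between the two features involved. That additivity is what lets the plots in this section be read as a decomposition of each prediction.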

In [None]:
# 1. Get the fitted preprocessor and classifier from our (untuned) final pipeline
fitted_preprocessor = final_model_pipeline.named_steps['preprocessor']
classifier = final_model_pipeline.named_steps['classifier']

# 2. Transform the test data to get the final feature set
X_test_transformed = pd.DataFrame(
    fitted_preprocessor.transform(X_test),
    columns=fitted_preprocessor.get_feature_names_out()
)

# 3. (Optional) downsample for speed on modest hardware
shap_sample_n = min(5000, X_test_transformed.shape[0])
X_shap = X_test_transformed.iloc[:shap_sample_n]

# 4. Create a SHAP explainer and get SHAP values
explainer = shap.TreeExplainer(classifier)
shap_values = explainer.shap_values(X_shap)

#### 8.1. Global Feature Importance with Direction

In [None]:
shap.summary_plot(shap_values, X_shap, max_display=15)

### Decoding the Drivers of Cancellation: From Data to Strategy

This SHAP plot is the "brain scan" of our model. It reveals not just *what* features are important, but *how* they influence the prediction for every single booking.

> **How to Read This Plot:**
>
> *   **Vertical Axis:** Features are ranked by their overall importance, top to bottom.
> *   **Horizontal Axis (SHAP Value):** A positive value pushes the prediction towards **"Canceled"**, while a negative value pushes it towards **"Not Canceled"**.
> *   **Color:** Red dots represent a high value for a feature (e.g., long lead time), while blue dots represent a low value (e.g., short lead time).

---

#### 🚩 Key Drivers of Cancellation (Risk Factors)

1.  **Deposit Type: `Non Refund`**: A "Non Refund" deposit type (a red dot) strongly pushes the prediction to "Canceled." This is likely a data artifact where no-shows on non-refundable rates are coded as canceled, but it confirms the immense power of financial commitment.
2.  **Lead Time**: The longer the time between booking and arrival (red dots), the higher the cancellation risk.
3.  **Market Segment: `Online TA` (Online Travel Agent)**: Bookings from OTAs (red dots) have a higher likelihood of cancellation, likely due to lenient OTA policies.

---

#### ✅ Key Indicators of a Committed Booking (Safety Factors)

1.  **Total Special Requests**: A high number of special requests (red dots) results in a strong *negative* SHAP value, significantly lowering cancellation probability. This is a powerful behavioral signal of commitment.
2.  **Required Car Parking Spaces**: A request for parking (red dot) has a large negative SHAP value, signaling a guest with confirmed travel logistics.
3.  **Booking Changes**: Counterintuitively, a guest making changes to their booking (red dots) is a sign of engagement and refinement of plans, not abandonment.

#### 8.2. Local Interpretability: Explaining a Single Prediction

In [None]:
# Find the booking with the highest predicted risk within our SHAP sample
y_pred_shap_proba = y_pred_final_proba[:shap_sample_n]
high_risk_idx_shap = np.argmax(y_pred_shap_proba)

print(f"Explaining booking with a {y_pred_shap_proba[high_risk_idx_shap]:.2%} predicted probability of cancellation.")
print(f"Actual outcome: {'Canceled' if y_test.iloc[high_risk_idx_shap] == 1 else 'Not Canceled'}")

shap.initjs()
shap.force_plot(explainer.expected_value, 
                shap_values[high_risk_idx_shap, :], 
                X_shap.iloc[high_risk_idx_shap, :])

This force plot shows the "push and pull" of each feature for this specific booking. Features in red (like a long `lead_time`) are pushing the prediction higher (towards cancellation), while blue features would push it lower.
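
The arithmetic behind a force plot is additive: the base value plus the sum of the per-feature SHAP values gives the model output for that booking. Below is a minimal sketch with made-up numbers. It assumes the explainer works in log-odds, which is the default for `shap.TreeExplainer` on an XGBoost classifier but may not match how `explainer` was built in this notebook.

```python
import math


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values for illustration only (not taken from the model)
base_value = -0.5  # stand-in for explainer.expected_value, in log-odds
contributions = {
    "lead_time": 1.2,                  # pushes towards "Canceled"
    "deposit_type_Non Refund": 2.0,    # pushes towards "Canceled"
    "total_of_special_requests": -0.3, # pushes towards "Not Canceled"
}

# Base value + sum of SHAP values = model output in log-odds
log_odds = base_value + sum(contributions.values())
probability = sigmoid(log_odds)
print(f"log-odds = {log_odds:.2f}, probability = {probability:.2%}")
```

If the explainer was instead configured to explain probabilities directly, the sum would already be a probability and the sigmoid step would not apply.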

---

### 9. Deployment & Monitoring

A model is only useful once it is deployed where it can score new bookings, whether in real time or in scheduled batches.

#### 9.1. Saving the Model

The first step is to save our entire trained pipeline (preprocessor + model) into a single file using `joblib`. This ensures that the exact same preprocessing steps are applied to new data.

In [None]:
# Add a version/date to the filename for better tracking
model_filename = f'hotel_cancellation_model_{datetime.now().strftime("%Y%m%d")}.pkl'

# Save the final pipeline object
joblib.dump(final_model_pipeline, model_filename)
print(f"Model pipeline saved successfully to '{model_filename}'")

# Example of loading the model back
loaded_pipeline = joblib.load(model_filename)
print("Model loaded successfully.")
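
Before relying on a saved artifact, it can be worth confirming that the reloaded object reproduces the original predictions. The sketch below does this on a small stand-in pipeline and synthetic data. `toy_pipeline`, `X_toy`, and the temporary file path are hypothetical. In the notebook you would compare `final_model_pipeline` and `loaded_pipeline` on a slice of the test features.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the hotel dataset)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

toy_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X_toy, y_toy)

# Save and reload through a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should give the same probabilities as the original
original_proba = toy_pipeline.predict_proba(X_toy)[:, 1]
reloaded_proba = reloaded.predict_proba(X_toy)[:, 1]
roundtrip_ok = np.allclose(original_proba, reloaded_proba)
print("Round trip OK:", roundtrip_ok)
```

Pickled scikit-learn objects are not guaranteed to load across library versions, so recording the package versions next to the model file is a sensible habit.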

#### 9.2. Deployment Strategy

Once saved, the model can be deployed in several ways:
*   **API Endpoint:** The most common approach is to wrap the model in a web service (e.g., using Flask or FastAPI). The booking system can then send a `POST` request with the new booking's data to the API and receive the cancellation probability in real time.
*   **Batch Processing:** For less time-sensitive tasks like daily revenue forecasting, a script could run once a day, load the model, score all new bookings, and update a database with their risk scores.
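
The batch option can be sketched roughly as follows. This is an illustration under stated assumptions, not a production job: `score_batch` and the `booking_id` column are hypothetical (the dataset has no booking identifier), and a `DummyClassifier` stands in for the saved pipeline so the example runs on its own. In practice the model would come from `joblib.load(model_filename)` and the result would be written to a database.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier


def score_batch(pipeline, bookings, id_column="booking_id"):
    """Return one row per booking with its predicted cancellation probability."""
    features = bookings.drop(columns=[id_column])
    # Column 1 of predict_proba is the probability of the positive class
    proba = pipeline.predict_proba(features)[:, 1]
    return pd.DataFrame({
        id_column: bookings[id_column].to_numpy(),
        "cancel_probability": proba,
        "scored_at": pd.Timestamp.now(tz="UTC"),
    })


# Stand-in model so the sketch is self-contained (made-up training rows)
toy_features = pd.DataFrame({"lead_time": [10, 200, 35, 310]})
toy_labels = [0, 1, 0, 1]
toy_model = DummyClassifier(strategy="prior").fit(toy_features, toy_labels)

# Hypothetical batch of new bookings collected since the last run
new_bookings = pd.DataFrame({"booking_id": [101, 102], "lead_time": [15, 280]})
scores = score_batch(toy_model, new_bookings)
print(scores)
```

Storing the scoring timestamp alongside each probability makes it possible to join predictions with outcomes later for monitoring.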

#### 9.3. Monitoring and Retraining

Models are not static. The real world changes, and a model's performance can degrade over time—a phenomenon known as **model drift**.
*   **Monitoring:** It's critical to log the model's predictions and, once the actual outcome is known (the guest checks out or cancels), compare the two. Key metrics like ROC AUC, precision, recall, and calibration should be tracked on a dashboard.
*   **Retraining:** A retraining schedule should be established. This could be time-based (e.g., retrain every quarter on the latest data) or triggered by a significant drop in performance. The temporal split strategy used here should be replicated in the retraining pipeline to ensure valid evaluation.
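
A monitoring job along these lines could compute the tracked metrics from the prediction log and flag when retraining may be due. The sketch below is a minimal illustration: the function names, the 0.5 threshold, the baseline AUC of 0.90, and the 0.05 tolerance are all hypothetical placeholders, and the Brier score is used as a simple calibration measure. Substitute the cost-based threshold and the test-set AUC obtained earlier in the notebook.

```python
import numpy as np
from sklearn.metrics import (
    brier_score_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)


def monitoring_snapshot(y_true, y_proba, threshold=0.5):
    """Compute the dashboard metrics for one batch of resolved bookings."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba, dtype=float)
    y_pred = (y_proba > threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, y_proba),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "brier": brier_score_loss(y_true, y_proba),
    }


def needs_retraining(current_auc, baseline_auc, max_drop=0.05):
    """Flag retraining when AUC falls more than max_drop below the baseline."""
    return (baseline_auc - current_auc) > max_drop


# Made-up log of outcomes and logged probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_proba = [0.1, 0.4, 0.8, 0.35, 0.2, 0.9, 0.6, 0.7]

snapshot = monitoring_snapshot(y_true, y_proba)
retrain = needs_retraining(snapshot["roc_auc"], baseline_auc=0.90)
print(snapshot, "retrain:", retrain)
```

Outcomes only become available after the stay date, so recent bookings with long lead times will be missing from each snapshot. Comparing snapshots over equal-length windows keeps the trend readable.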

---

### 10. Group Exercise

**Tasks for Breakout Groups:**

1.  **Advanced Feature Engineering:** The `reservation_status_date` is the date the final status (Canceled or Check-Out) was set. For canceled bookings, the time difference between the booking date and this `reservation_status_date` could be a powerful feature. Create a new feature called `cancellation_timing` (days between booking and cancellation) and discuss how you would incorporate it into the model *without causing data leakage*. How would this feature be used in a real-time prediction scenario?

2.  **The Precision/Recall Business Trade-off:**
    *   Our cost-based analysis gave us an optimal threshold. Let's explore others. Recalculate the confusion matrix and classification report using a **higher threshold of 0.7** (i.e., `y_pred = (y_pred_final_proba > 0.7).astype(int)`).
    *   How did Precision and Recall for the 'Canceled' class change?
    *   Describe a business scenario where this new, higher-precision model would be preferable, even though it catches fewer cancellations overall.
    *   Now, do the same for a **lower threshold of 0.3**. When would this higher-recall model be the better business choice?

3.  **Business Strategy Simulation:** A new booking has the following characteristics: `hotel='City Hotel'`, `lead_time=300`, `deposit_type='No Deposit'`, `customer_type='Transient'`, `total_of_special_requests=0`, and `market_segment='Online TA'`. Assume other key numeric features are at the training set median and other categoricals at the mode.
    *   Create a pandas DataFrame for this single booking.
    *   Use your final trained pipeline (`final_model_pipeline.predict_proba()`) to predict its cancellation probability.
    *   Based on this risk score and the SHAP insights, design a 2-step intervention plan for this specific booking. What is the first action, and what is the follow-up? Be specific.
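
As one possible starting point for Task 3, the helper below builds a single-row DataFrame from training-set medians and modes and then applies the scenario values on top. It is a sketch only: `build_booking_row` is a hypothetical name, the toy frame stands in for the notebook's training features (assumed here to be a DataFrame such as `X_train`), and the column list is deliberately shortened.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def build_booking_row(X_train, overrides):
    """One-row DataFrame: medians for numeric columns, modes for the rest,
    with the scenario-specific values applied on top."""
    unknown = set(overrides) - set(X_train.columns)
    if unknown:
        raise KeyError(f"Columns not in training data: {sorted(unknown)}")
    defaults = {}
    for col in X_train.columns:
        if is_numeric_dtype(X_train[col]):
            defaults[col] = X_train[col].median()
        else:
            defaults[col] = X_train[col].mode().iloc[0]
    defaults.update(overrides)
    # Keep the training column order so the pipeline sees the expected layout
    return pd.DataFrame([defaults], columns=X_train.columns)


# Toy stand-in for the training features (made-up rows)
X_train_toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel"],
    "lead_time": [10, 50, 120],
    "adults": [2, 2, 1],
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund"],
})

booking = build_booking_row(
    X_train_toy,
    {"hotel": "City Hotel", "lead_time": 300, "deposit_type": "No Deposit"},
)
print(booking)
```

With the real training frame, the resulting row can be passed to `final_model_pipeline.predict_proba()` as the task describes.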