# Assignment 12: Predicting Hotel Booking Cancellations  
## Models: Na√Øve Bayes, Support Vector Machine (SVM), and Neural Network

**Objectives:**
- Understand how to use classification models (Na√Øve Bayes, SVM, Neural Networks) to predict hotel cancellations.
- Compare models in terms of accuracy, complexity, and business relevance.
- Interpret and communicate model results from a business perspective.

## Business Scenario

You work as a data analyst for a hospitality group that manages both **Resort** and **City Hotels**. One major challenge in operations is the unpredictability of **booking cancellations**, which affects staffing, inventory, and revenue planning.

You‚Äôve been asked to use historical booking data to predict whether a future booking will be canceled. Your insights will help management plan more effectively.


Your task is to:
1. Build and evaluate three models: Na√Øve Bayes, SVM, and Neural Network.
2. Compare performance.
3. Recommend which model is best suited for the business needs.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_12_bayes_svm_neural.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Dataset Description: Hotel Bookings

This dataset contains booking information for two types of hotels: a **city hotel** and a **resort hotel**. Each record corresponds to a single booking and includes various details about the reservation, customer demographics, booking source, and whether the booking was canceled.

**Source**: [GitHub - TidyTuesday: Hotel Bookings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

### Key Use Cases
- Understand customer booking behavior
- Explore factors related to cancellations
- Segment guests based on booking characteristics
- Compare city vs. resort hotel performance

### Data Dictionary

| Variable | Type | Description |
|----------|------|-------------|
| `hotel` | character | Hotel type: City or Resort |
| `is_canceled` | integer | 1 = Canceled, 0 = Not Canceled |
| `lead_time` | integer | Days between booking and arrival |
| `arrival_date_year` | integer | Year of arrival |
| `arrival_date_month` | character | Month of arrival |
| `stays_in_weekend_nights` | integer | Nights stayed on weekends |
| `stays_in_week_nights` | integer | Nights stayed on weekdays |
| `adults` | integer | Number of adults |
| `children` | integer | Number of children |
| `babies` | integer | Number of babies |
| `meal` | character | Type of meal booked |
| `country` | character | Country code of origin |
| `market_segment` | character | Booking source (e.g., Direct, Online TA) |
| `distribution_channel` | character | Booking channel used |
| `is_repeated_guest` | integer | 1 = Repeated guest, 0 = New guest |
| `previous_cancellations` | integer | Past booking cancellations |
| `previous_bookings_not_canceled` | integer | Past bookings not canceled |
| `reserved_room_type` | character | Initially reserved room type |
| `assigned_room_type` | character | Room type assigned at check-in |
| `booking_changes` | integer | Number of booking modifications |
| `deposit_type` | character | Deposit type (No Deposit, Non-Refund, etc.) |
| `agent` | character | Agent ID who made the booking |
| `company` | character | Company ID (if booking through company) |
| `days_in_waiting_list` | integer | Days on the waiting list |
| `customer_type` | character | Booking type: Contract, Transient, etc. |
| `adr` | float | Average Daily Rate (price per night) |
| `required_car_parking_spaces` | integer | Requested parking spots |
| `total_of_special_requests` | integer | Number of special requests made |
| `reservation_status` | character | Final status (Canceled, No-Show, Check-Out) |
| `reservation_status_date` | date | Date of the last status update |

This dataset is ideal for classification, segmentation, and trend analysis exercises.


## 1. Load and Prepare the Hotel Booking Dataset

**Business framing:**  
Your hotel client wants to understand which bookings are most at risk of being canceled. But before modeling, your job is to prepare the data to ensure clean and reliable input.

### Do the following:
- Load the `hotels.csv` file from https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/hotels.csv
- Remove or impute missing values
- Encode categorical variables
- Create your `X` (features) and `y` (target = `is_canceled`)
- Split the data into training and test sets (70/30)

### In Your Response:
1. How many total rows and columns are in the dataset?
2. What types of features (categorical, numerical) are included?
3. What steps did you take to clean or prepare the data?


In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the dataset
url = "https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/hotels.csv"
df = pd.read_csv(url)
print(df.head)
print(df.shape)

# Handle missing values
# For 'country', fill with 'Unknown' or the mode
df['country'].fillna('Unknown', inplace=True)
# For 'agent' and 'company', fill with 0 (assuming 0 indicates no agent/company)
df['agent'].fillna(0, inplace=True)
df['company'].fillna(0, inplace=True)
# For numerical columns, fill with median or mean
df['children'].fillna(df['children'].median(), inplace=True)
df['adr'].fillna(df['adr'].median(), inplace=True)

# Drop columns that are not useful or have too many unique values (e.g., 'reservation_status_date', 'company', 'agent') for simpler models
df.drop(columns=['company', 'agent', 'reservation_status_date'], inplace=True)

# Convert categorical variables to numerical using Label Encoding or One-Hot Encoding
# For this task, let's use Label Encoding for simplicity and to avoid too many features for SVM/Na√Øve Bayes

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

# Define features (X) and target (y)
X = df.drop('is_canceled', axis=1)
y = df['is_canceled']

# Split the data into training and testing sets (70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

<bound method NDFrame.head of                hotel  is_canceled  lead_time  arrival_date_year  \
0       Resort Hotel            0        342               2015   
1       Resort Hotel            0        737               2015   
2       Resort Hotel            0          7               2015   
3       Resort Hotel            0         13               2015   
4       Resort Hotel            0         14               2015   
...              ...          ...        ...                ...   
119385    City Hotel            0         23               2017   
119386    City Hotel            0        102               2017   
119387    City Hotel            0         34               2017   
119388    City Hotel            0        109               2017   
119389    City Hotel            0        205               2017   

       arrival_date_month  arrival_date_week_number  \
0                    July                        27   
1                    July                        27   


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['country'].fillna('Unknown', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['agent'].fillna(0, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always beha

Shape of X_train: (83573, 28)
Shape of X_test: (35817, 28)
Shape of y_train: (83573,)
Shape of y_test: (35817,)


### ‚úçÔ∏è Your Response: üîß
1. The dataset has 119390 rows x 32 columns

2. The dataset includes both numerical features such as lead_time, adr, children, adults, and booking_changes.
It also contains categorical features like hotel, arrival_date_month, meal, market_segment, distribution_channel, country, and others that describe booking characteristics.

3. I handled missing values by filling categorical gaps (like country) with "Unknown" and replacing missing numerical values with the median, while also setting agent and company to zero when no value was provided.
Then I removed unnecessary high-cardinality columns, encoded categorical variables using LabelEncoder, and split the cleaned dataset into training and testing sets for modeling.

## 2. Build a Na√Øve Bayes Model

**Business framing:**  
Na√Øve Bayes is a quick, baseline model often used for early testing or simple classification problems.

### Do the following:
- Train a Na√Øve Bayes classifier on your training data
- Use it to predict on your test data
- Print a classification report and confusion matrix

### In Your Response:
1. How well does the model perform?  And what metric is best used to judge the performance?
2. Where might this model be useful for the hotel (e.g. real-time alerts, operational decisions)?


In [7]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix

# Initialize the Na√Øve Bayes model
naive_bayes_model = GaussianNB()

# Train the model on the training data
naive_bayes_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_nb = naive_bayes_model.predict(X_test)

# Print classification report
print("Na√Øve Bayes Classification Report:")
print(classification_report(y_test, y_pred_nb))

# Print confusion matrix
print("\nNa√Øve Bayes Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_nb))

Na√Øve Bayes Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99     22478
           1       0.98      1.00      0.99     13339

    accuracy                           0.99     35817
   macro avg       0.99      0.99      0.99     35817
weighted avg       0.99      0.99      0.99     35817


Na√Øve Bayes Confusion Matrix:
[[22228   250]
 [    0 13339]]


### ‚úçÔ∏è Your Response: üîß
1. The Na√Øve Bayes model performs extremely well, achieving 99% accuracy with very high precision and recall for both classes. The confusion matrix shows that almost all bookings are classified correctly, with only 250 false positives and zero false negatives, meaning the model never misses a canceled booking. The best metric to judge performance is recall, especially for the ‚Äúcanceled‚Äù class, because hotels care about catching every booking that will cancel to avoid overbooking problems.

2. This model can be used to predict high-risk cancellations in real time, alerting the hotel reservation system so staff can proactively double-check or reconfirm bookings. It can also support operational decisions such as adjusting overbooking strategies, optimizing room allocation, preparing staffing levels, or triggering automated follow-up emails to guests likely to cancel.

## 3. Build a Support Vector Machine (SVM) Model

**Business framing:**  
SVM can model more complex relationships and is useful when customer behavior patterns aren't linear or obvious.

### Do the following:
- Train an SVM classifier (use `linear` kernel)
- Make predictions and evaluate with classification metrics

### In Your Response:
1. How well does the model perform?  And what metric is best used to judge the performance?
2. In what business situations could SVM provide better insights than simpler models?


In [8]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Standardize the features - SVMs are sensitive to the scale of the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize the SVM model with a linear kernel
# For larger datasets, SVC can be computationally expensive.
# Consider LinearSVC for better performance with large datasets.
# However, given the prompt explicitly asks for SVC with linear kernel, we'll proceed.
svm_model = SVC(kernel='linear', random_state=42)

# Train the model on the scaled training data
print("Training SVM model... This may take some time.")
svm_model.fit(X_train_scaled, y_train)
print("SVM model training complete.")

# Make predictions on the scaled test data
y_pred_svm = svm_model.predict(X_test_scaled)

# Print classification report
print("\nSupport Vector Machine (SVM) Classification Report:")
print(classification_report(y_test, y_pred_svm))

# Print confusion matrix
print("\nSupport Vector Machine (SVM) Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))

Training SVM model... This may take some time.
SVM model training complete.

Support Vector Machine (SVM) Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     22478
           1       1.00      0.97      0.99     13339

    accuracy                           0.99     35817
   macro avg       0.99      0.99      0.99     35817
weighted avg       0.99      0.99      0.99     35817


Support Vector Machine (SVM) Confusion Matrix:
[[22478     0]
 [  349 12990]]


### ‚úçÔ∏è Your Response: üîß
1. The SVM model performs exceptionally well, reaching 99% accuracy with very high precision, recall, and F1-scores for both classes. The confusion matrix shows near-perfect predictions, with SVM correctly identifying almost all canceled and non-canceled bookings. The most important metric is recall for the canceled class, because hotels need to correctly detect guests likely to cancel to avoid revenue loss and overbooking problems.

2. SVM is especially useful when relationships in the data are complex and non-linear, such as predicting cancellations based on many interacting factors (e.g., seasonality, booking channel, lead time). It can also outperform simpler models in high-dimensional situations, such as customer segmentation, fraud detection, or pricing analytics, where there are many variables and the boundaries between classes are not easily separable.

## 4. Build a Neural Network Model

**Business framing:**  
Neural networks are flexible and powerful, though they are harder to explain. They may work well when subtle patterns exist in the data.

### Do the following:
- Build a MLBClassifier model using the neural_network package from sklearn
- Choose a simple architecture (e.g., 2 hidden layers)
- Evaluate accuracy and performance

### In Your Response:
1. How does this model compare to the others?
2. Would the business be comfortable using a ‚Äúblack box‚Äù model like this? Why or why not?


In [9]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Initialize the MLPClassifier with a simple architecture (e.g., 2 hidden layers)
# The hidden_layer_sizes define the number of neurons in each hidden layer.
# For example, (100, 50) means two hidden layers with 100 and 50 neurons respectively.
# max_iter is set to avoid the 'Maximum iterations reached' warning.
mlp_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300, random_state=42, verbose=True)

# Train the model on the scaled training data
# Neural networks generally perform better with scaled data
print("\nTraining Neural Network model... This may take some time.")
mlp_model.fit(X_train_scaled, y_train)
print("Neural Network model training complete.")

# Make predictions on the scaled test data
y_pred_mlp = mlp_model.predict(X_test_scaled)

# Print classification report
print("\nNeural Network (MLPClassifier) Classification Report:")
print(classification_report(y_test, y_pred_mlp))

# Print confusion matrix
print("\nNeural Network (MLPClassifier) Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_mlp))


Training Neural Network model... This may take some time.
Iteration 1, loss = 0.10918159
Iteration 2, loss = 0.00361817
Iteration 3, loss = 0.00078087
Iteration 4, loss = 0.00021091
Iteration 5, loss = 0.00022229
Iteration 6, loss = 0.00023133
Iteration 7, loss = 0.00007364
Iteration 8, loss = 0.00006364
Iteration 9, loss = 0.00005786
Iteration 10, loss = 0.00005392
Iteration 11, loss = 0.00005114
Iteration 12, loss = 0.00004909
Iteration 13, loss = 0.00004749
Iteration 14, loss = 0.00004621
Iteration 15, loss = 0.00004513
Iteration 16, loss = 0.00004417
Iteration 17, loss = 0.00004327
Iteration 18, loss = 0.00004236
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Neural Network model training complete.

Neural Network (MLPClassifier) Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     22478
           1       1.00      1.00      1.00     13339

    accuracy             

### ‚úçÔ∏è Your Response: üîß
1. The neural network clearly outperforms the other models, reaching nearly perfect accuracy (100%) with only one misclassified example out of 35,817 predictions. Compared to Na√Øve Bayes and SVM‚Äîboth strong performers‚Äîthe neural network captures even more complex patterns, giving it the highest precision, recall, and F1-scores across all classes.

2. Some businesses may hesitate to use a neural network because it is a black-box model, meaning decision logic is not easily interpretable for managers, auditors, or regulators. However, if accuracy is the top priority‚Äîsuch as maximizing revenue, preventing overbooking, or reducing cancellation losses‚Äîthe business might still use the model, especially if managers trust the results and supplement it with explainability tools (e.g., SHAP or LIME).

## 5. Compare All Three Models

### Do the following:
- Print and compare the accuracy of Na√Øve Bayes, SVM, and Neural Network models
- Summarize which model performed best

### In Your Response:
1. Which model had the best overall accuracy, training time, interpretability, and ease of use.
2. Would you recommend this model for deployment, and why?


In [10]:
from sklearn.metrics import accuracy_score

# Calculate accuracy for Na√Øve Bayes
accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"Na√Øve Bayes Accuracy: {accuracy_nb:.4f}")

# Calculate accuracy for SVM
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"SVM Accuracy: {accuracy_svm:.4f}")

# Calculate accuracy for Neural Network
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
print(f"Neural Network Accuracy: {accuracy_mlp:.4f}")

print("\n--- Model Performance Summary ---")
if accuracy_nb >= accuracy_svm and accuracy_nb >= accuracy_mlp:
    print("Na√Øve Bayes performed best.")
elif accuracy_svm >= accuracy_nb and accuracy_svm >= accuracy_mlp:
    print("SVM performed best.")
else:
    print("Neural Network performed best.")


Na√Øve Bayes Accuracy: 0.9930
SVM Accuracy: 0.9903
Neural Network Accuracy: 1.0000

--- Model Performance Summary ---
Neural Network performed best.


### ‚úçÔ∏è Your Response: üîß
1. The Neural Network had the best overall accuracy, while Na√Øve Bayes was the fastest to train and the easiest to use due to its simple structure. However, Na√Øve Bayes was also the most interpretable, while the Neural Network acted as a black box despite giving the strongest performance.

2. I would recommend the Neural Network for deployment because it delivers the highest accuracy and almost perfect predictions, which is valuable for avoiding costly cancellation errors. However, if the business prioritizes interpretability and transparency, Na√Øve Bayes may be a better fit since it is easier to explain to non-technical stakeholders.

## 6. Final Business Recommendation

### In Your Response:
1. In 100 words or less, write a short recommendation to hotel management based on your analysis.

Possible info to include:
- Which model do you recommend implementing?
- What business problem does it help solve?
- Are there any risks or limitations?
- What additional data might improve the results in the future?
2. How does this relate to your customized learning outcome you created in canvas?


### ‚úçÔ∏è Your Response: üîß
1. I recommend implementing the Neural Network model, as it delivered the highest accuracy and most reliably predicts booking cancellations. This helps the hotel optimize overbooking strategies, protect revenue, and improve operational planning. The main limitation is low interpretability, but its performance outweighs this risk. Adding behavioral data‚Äîsuch as booking modification history or payment timing‚Äîcould further strengthen predictions.

2. This project also reflects my customized learning outcomes by applying advanced analytics to improve forecasting accuracy, reduce operational risk, and support data-driven decision-making similar to supply chain resilience and market analytics in semiconductor industries.

## Submission Instructions
‚úÖ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [11]:
!jupyter nbconvert --to html "assignment_12_AlhinaiAlmuhanna.ipynb"

[NbConvertApp] Converting notebook assignment_12_AlhinaiAlmuhanna.ipynb to html
[NbConvertApp] Writing 329597 bytes to assignment_12_AlhinaiAlmuhanna.html
