# Assignment 12: Predicting Hotel Booking Cancellations  
## Models: Naïve Bayes, Support Vector Machine (SVM), and Neural Network

**Objectives:**
- Understand how to use classification models (Naïve Bayes, SVM, Neural Networks) to predict hotel cancellations.
- Compare models in terms of accuracy, complexity, and business relevance.
- Interpret and communicate model results from a business perspective.

## Business Scenario

You work as a data analyst for a hospitality group that manages both **Resort** and **City Hotels**. One major challenge in operations is the unpredictability of **booking cancellations**, which affects staffing, inventory, and revenue planning.

You’ve been asked to use historical booking data to predict whether a future booking will be canceled. Your insights will help management plan more effectively.


Your task is to:
1. Build and evaluate three models: Naïve Bayes, SVM, and Neural Network.
2. Compare performance.
3. Recommend which model is best suited for the business needs.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Assignments/assignment_12_bayes_svm_neural.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Dataset Description: Hotel Bookings

This dataset contains booking information for two types of hotels: a **city hotel** and a **resort hotel**. Each record corresponds to a single booking and includes various details about the reservation, customer demographics, booking source, and whether the booking was canceled.

**Source**: [GitHub - TidyTuesday: Hotel Bookings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

### Key Use Cases
- Understand customer booking behavior
- Explore factors related to cancellations
- Segment guests based on booking characteristics
- Compare city vs. resort hotel performance

### Data Dictionary

| Variable | Type | Description |
|----------|------|-------------|
| `hotel` | character | Hotel type: City or Resort |
| `is_canceled` | integer | 1 = Canceled, 0 = Not Canceled |
| `lead_time` | integer | Days between booking and arrival |
| `arrival_date_year` | integer | Year of arrival |
| `arrival_date_month` | character | Month of arrival |
| `stays_in_weekend_nights` | integer | Nights stayed on weekends |
| `stays_in_week_nights` | integer | Nights stayed on weekdays |
| `adults` | integer | Number of adults |
| `children` | integer | Number of children |
| `babies` | integer | Number of babies |
| `meal` | character | Type of meal booked |
| `country` | character | Country code of origin |
| `market_segment` | character | Booking source (e.g., Direct, Online TA) |
| `distribution_channel` | character | Booking channel used |
| `is_repeated_guest` | integer | 1 = Repeated guest, 0 = New guest |
| `previous_cancellations` | integer | Past booking cancellations |
| `previous_bookings_not_canceled` | integer | Past bookings not canceled |
| `reserved_room_type` | character | Initially reserved room type |
| `assigned_room_type` | character | Room type assigned at check-in |
| `booking_changes` | integer | Number of booking modifications |
| `deposit_type` | character | Deposit type (No Deposit, Non-Refund, etc.) |
| `agent` | character | Agent ID who made the booking |
| `company` | character | Company ID (if booking through company) |
| `days_in_waiting_list` | integer | Days on the waiting list |
| `customer_type` | character | Booking type: Contract, Transient, etc. |
| `adr` | float | Average Daily Rate (price per night) |
| `required_car_parking_spaces` | integer | Requested parking spots |
| `total_of_special_requests` | integer | Number of special requests made |
| `reservation_status` | character | Final status (Canceled, No-Show, Check-Out) |
| `reservation_status_date` | date | Date of the last status update |

This dataset is ideal for classification, segmentation, and trend analysis exercises.


## 1. Load and Prepare the Hotel Booking Dataset

**Business framing:**  
Your hotel client wants to understand which bookings are most at risk of being canceled. But before modeling, your job is to prepare the data to ensure clean and reliable input.

### Do the following:
- Load the `hotels.csv` file from https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/hotels.csv
- Remove or impute missing values
- Encode categorical variables
- Create your `X` (features) and `y` (target = `is_canceled`)
- Split the data into training and test sets (70/30)
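
The preparation steps listed above can also be expressed as a scikit-learn preprocessing pipeline, so that imputation values and category levels are learned from the training split only. This is a minimal sketch on a hypothetical toy frame (`toy`, `numeric_cols`, `categorical_cols`, and `preprocess` are illustrative names, not part of the assignment), and it is not the approach the notebook cells below take.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data that mimics a few hotel columns (not the real file)
toy = pd.DataFrame({
    "lead_time": [10, 200, 35, 90, 5, 120, 60, 300, 15, 45],
    "children": [0, np.nan, 1, 0, 0, 2, np.nan, 0, 0, 1],
    "country": ["PRT", "GBR", np.nan, "PRT", "FRA", "PRT", "GBR", np.nan, "ESP", "PRT"],
    "is_canceled": [0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
})

X_toy = toy.drop(columns="is_canceled")
y_toy = toy["is_canceled"]

# Split first, so the test rows never influence the imputation values
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

numeric_cols = ["lead_time", "children"]
categorical_cols = ["country"]

preprocess = ColumnTransformer([
    # Fill numeric gaps with the most frequent value (the mode)
    ("num", SimpleImputer(strategy="most_frequent"), numeric_cols),
    # Fill categorical gaps with the mode, then one-hot encode;
    # categories unseen during fit are encoded as all zeros
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X_tr_prepared = preprocess.fit_transform(X_tr)  # learn modes and categories on train
X_te_prepared = preprocess.transform(X_te)      # reuse them on test
print(X_tr_prepared.shape, X_te_prepared.shape)
```

Fitting the transformer on the training rows and only transforming the test rows keeps the two sets aligned on the same columns, even when the test set contains a country the training set never saw.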

### In Your Response:
1. How many total rows and columns are in the dataset?
2. What types of features (categorical, numerical) are included?
3. What steps did you take to clean or prepare the data?


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

url = "https://raw.githubusercontent.com/Stan-Pugsley/is_4487_base/refs/heads/main/DataSets/hotels.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
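
The `Unnamed: 0` column in the output above looks like a row index that was saved into the CSV. If that is the case (an assumption, since the file itself is not inspected here), it carries no booking information and could be read as the index instead of becoming a feature. A small sketch with a hypothetical three-row CSV:

```python
import io
import pandas as pd

# Hypothetical CSV text with a leading unnamed index column
csv_text = ",hotel,is_canceled\n0,Resort Hotel,0\n1,City Hotel,1\n2,City Hotel,0\n"

# Default read: the unnamed first column becomes a regular column
with_index_col = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index
without_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_col.columns))
print(list(without_index_col.columns))
```

Dropping such a column would change the feature counts reported later in this notebook, so the shapes printed below assume it was kept.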


In [2]:
missing_values_count = df.isnull().sum()
print("Count of missing values per column:")
print(missing_values_count)

Count of missing values per column:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                

In [3]:
# Make a copy of the dataframe to preserve the original for reference if needed
hotels_df_processed = df.copy()

# Drop columns 'company' and 'agent' due to high percentage of missing values
hotels_df_processed.drop(columns=['company', 'agent'], inplace=True)

# Drop 'reservation_status' and 'reservation_status_date' to prevent data leakage
hotels_df_processed.drop(columns=['reservation_status', 'reservation_status_date'], inplace=True)

# Impute missing values in 'country' with its mode
# (assign the result back instead of using a chained inplace fillna, which pandas deprecates)
country_mode = hotels_df_processed['country'].mode()[0]
hotels_df_processed['country'] = hotels_df_processed['country'].fillna(country_mode)

# Impute missing values in 'children' with its mode
children_mode = hotels_df_processed['children'].mode()[0]
hotels_df_processed['children'] = hotels_df_processed['children'].fillna(children_mode)

hotels_df_processed.head()


Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,0,C,C,3,No Deposit,0,Transient,0.0,0,0
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,0,C,C,4,No Deposit,0,Transient,0.0,0,0
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,0,A,C,0,No Deposit,0,Transient,75.0,0,0
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,0,A,A,0,No Deposit,0,Transient,75.0,0,0
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,0,A,A,0,No Deposit,0,Transient,98.0,0,1


In [4]:
# Make a copy of the dataframe to preserve the original for reference if needed
hotels_df_processed = df.copy()

# Drop columns 'company' and 'agent' due to high percentage of missing values
hotels_df_processed.drop(columns=['company', 'agent'], inplace=True)

# Drop 'reservation_status' and 'reservation_status_date' to prevent data leakage
hotels_df_processed.drop(columns=['reservation_status', 'reservation_status_date'], inplace=True)

# Impute missing values in 'country' with its mode
country_mode = hotels_df_processed['country'].mode()[0]
hotels_df_processed['country'] = hotels_df_processed['country'].fillna(country_mode)

# Impute missing values in 'children' with its mode
children_mode = hotels_df_processed['children'].mode()[0]
hotels_df_processed['children'] = hotels_df_processed['children'].fillna(children_mode)

# Display missing values after cleaning
missing_values_after_cleaning = hotels_df_processed.isnull().sum()
print("Count of missing values per column after cleaning:")
print(missing_values_after_cleaning[missing_values_after_cleaning > 0])

Count of missing values per column after cleaning:
Series([], dtype: int64)


In [5]:
# Identify categorical columns
categorical_cols = hotels_df_processed.select_dtypes(include=['object']).columns

# Apply one-hot encoding
hotels_df_encoded = pd.get_dummies(hotels_df_processed, columns=categorical_cols, drop_first=False)

# Display how many rows and columns after encoding
print("Shape of DataFrame after one-hot encoding:", hotels_df_encoded.shape)

Shape of DataFrame after one-hot encoding: (119390, 256)


In [6]:
from sklearn.model_selection import train_test_split

# Define X (features) and y (target)
X = hotels_df_encoded.drop('is_canceled', axis=1)
y = hotels_df_encoded['is_canceled']

# Split the data into training and test sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (83573, 255)
Shape of X_test: (35817, 255)
Shape of y_train: (83573,)
Shape of y_test: (35817,)
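
The split above is purely random. When the target is imbalanced, passing `stratify=y` to `train_test_split` keeps the share of cancellations the same in both sets. The sketch below uses synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins with roughly 37% positives), not the hotel data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical imbalanced target: roughly 37% positives
y_demo = (rng.random(1000) < 0.37).astype(int)
X_demo = rng.normal(size=(1000, 3))

# stratify=y_demo preserves the class ratio in both splits
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"overall: {y_demo.mean():.3f}  train: {y_a.mean():.3f}  test: {y_b.mean():.3f}")
```

With a dataset as large as this one, a random split usually lands close to the overall ratio anyway, so stratifying is a safeguard rather than a required fix.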


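### ✍️ Your Response: 🔧
1. After cleaning and one-hot encoding, the dataset has 119,390 rows and 256 columns: 255 feature columns plus the target `is_canceled`.

2. The dataset contains numerical features (for example `lead_time` and `adr`) and categorical features (for example `hotel`, `meal`, and `country`). One-hot encoding converted the categorical features into binary indicator columns.

3. I dropped the `company` and `agent` columns because of their high percentage of missing values, and I dropped `reservation_status` and `reservation_status_date` to prevent data leakage. I imputed the few missing values in `country` and `children` with each column's mode. Finally, I one-hot encoded every column with the `object` dtype.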

## 2. Build a Naïve Bayes Model

**Business framing:**  
Naïve Bayes is a quick, baseline model often used for early testing or simple classification problems.

### Do the following:
- Train a Naïve Bayes classifier on your training data
- Use it to predict on your test data
- Print a classification report and confusion matrix

### In Your Response:
1. How well does the model perform?  And what metric is best used to judge the performance?
2. Where might this model be useful for the hotel (e.g. real-time alerts, operational decisions)?


In [7]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize the Naïve Bayes classifier
nb_model = GaussianNB()

# Train the model
nb_model.fit(X_train, y_train)

# Predict on the test data
y_pred_nb = nb_model.predict(X_test)

# Evaluate the model
print("Naïve Bayes Classification Report:")
print(classification_report(y_test, y_pred_nb))

print("\nNaïve Bayes Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_nb))

accuracy_nb = accuracy_score(y_test, y_pred_nb)
print(f"\nNaïve Bayes Accuracy: {accuracy_nb:.4f}")

Naïve Bayes Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.33      0.48     22478
           1       0.45      0.91      0.60     13339

    accuracy                           0.55     35817
   macro avg       0.65      0.62      0.54     35817
weighted avg       0.70      0.55      0.52     35817


Naïve Bayes Confusion Matrix:
[[ 7414 15064]
 [ 1224 12115]]

Naïve Bayes Accuracy: 0.5452


### ✍️ Your Response: 🔧
1. The model does not perform well: its accuracy is 0.545 on a 0-1 scale, which is the metric I used to judge it. The confusion matrix also shows many misclassifications: 15,064 false positives (bookings predicted as canceled that were not) and 1,224 false negatives.

2. Although this Naïve Bayes model is not very accurate, its recall for cancellations is strong at 91%. That could make it useful for flagging bookings that might cancel or for estimating how many rooms the hotel can overbook, with the caveat that its 45% precision means many of those flags will be false alarms.
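
The class-1 precision and recall in the report can be reproduced directly from the confusion matrix printed above. The sketch below hard-codes those four counts and assumes scikit-learn's layout, in which rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the Naive Bayes output above
# rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[7414, 15064],
               [1224, 12115]])

tn, fp, fn, tp = cm.ravel()

precision_canceled = tp / (tp + fp)   # of bookings flagged as canceled, how many were
recall_canceled = tp / (tp + fn)      # of actual cancellations, how many were caught
accuracy = (tp + tn) / cm.sum()       # share of all bookings classified correctly

print(f"precision={precision_canceled:.2f} recall={recall_canceled:.2f} accuracy={accuracy:.4f}")
```

The three values match the report (0.45, 0.91, and 0.5452), which shows that the low accuracy comes almost entirely from the large false-positive count.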

## 3. Build a Support Vector Machine (SVM) Model

**Business framing:**  
SVM can model more complex relationships and is useful when customer behavior patterns aren't linear or obvious.

### Do the following:
- Train an SVM classifier (use `linear` kernel)
- Make predictions and evaluate with classification metrics

### In Your Response:
1. How well does the model perform?  And what metric is best used to judge the performance?
2. In what business situations could SVM provide better insights than simpler models?


In [8]:
# Filter X_train and y_train for 2015
filter_2015_train = X_train['arrival_date_year'] == 2015
X_train_2015 = X_train[filter_2015_train]
y_train_2015 = y_train[filter_2015_train]

# Filter X_test and y_test for 2015
filter_2015_test = X_test['arrival_date_year'] == 2015
X_test_2015 = X_test[filter_2015_test]
y_test_2015 = y_test[filter_2015_test]

# Drop the 'arrival_date_year' column from the filtered feature sets
X_train_2015 = X_train_2015.drop(columns=['arrival_date_year'])
X_test_2015 = X_test_2015.drop(columns=['arrival_date_year'])

print("Shape of X_train_2015:", X_train_2015.shape)
print("Shape of y_train_2015:", y_train_2015.shape)
print("Shape of X_test_2015:", X_test_2015.shape)
print("Shape of y_test_2015:", y_test_2015.shape)

Shape of X_train_2015: (15422, 254)
Shape of y_train_2015: (15422,)
Shape of X_test_2015: (6574, 254)
Shape of y_test_2015: (6574,)


In [9]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize StandardScaler
scaler = StandardScaler()

# Fit on training data and transform both training and test data
X_train_scaled_2015 = scaler.fit_transform(X_train_2015)
X_test_scaled_2015 = scaler.transform(X_test_2015)

# Initialize SVM classifier with a linear kernel
svm_model = SVC(kernel='linear', random_state=42)

# Train the SVM model
print("Training SVM model...")
svm_model.fit(X_train_scaled_2015, y_train_2015)
print("SVM model trained.")

# Predict on the scaled test data
y_pred_svm_2015 = svm_model.predict(X_test_scaled_2015)

# Evaluate the model
print("\nSVM Classification Report (2015 data):")
print(classification_report(y_test_2015, y_pred_svm_2015))

print("\nSVM Confusion Matrix (2015 data):")
print(confusion_matrix(y_test_2015, y_pred_svm_2015))

accuracy_svm_2015 = accuracy_score(y_test_2015, y_pred_svm_2015)
print(f"\nSVM Accuracy (2015 data): {accuracy_svm_2015:.4f}")

Training SVM model...
SVM model trained.

SVM Classification Report (2015 data):
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      4102
           1       0.93      0.78      0.85      2472

    accuracy                           0.90      6574
   macro avg       0.91      0.87      0.89      6574
weighted avg       0.90      0.90      0.89      6574


SVM Confusion Matrix (2015 data):
[[3959  143]
 [ 537 1935]]

SVM Accuracy (2015 data): 0.8966


### ✍️ Your Response: 🔧
1. The model performs well, with 0.897 accuracy on the 2015 subset. Beyond overall accuracy, I looked at the metrics for predicted cancellations (class 1): precision 0.93, recall 0.78, and F1-score 0.85. The confusion matrix shows relatively few errors: 143 false positives and 537 false negatives.
2. SVM can be more useful than simpler models when customer behavior is not easy to separate with simple rules. It can help identify more complex customer groups, which supports more tailored offers and marketing and, in turn, better customer retention.
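
The SVM above was fit on 2015 bookings only. One option that may make a linear SVM practical on a larger training set is scikit-learn's `LinearSVC`, which is built on liblinear rather than libsvm and generally scales better with the number of samples. It is not an exact substitute for `SVC(kernel='linear')` (its default loss differs), so treat this as a sketch on synthetic data rather than a drop-in replacement; `X_syn`, `y_syn`, and `linear_svm` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the encoded hotel features
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

# Scaling and the classifier live in one pipeline, so the scaler
# is fit on the training data only
linear_svm = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=5000, random_state=42),
)
linear_svm.fit(X_tr, y_tr)

svm_demo_accuracy = linear_svm.score(X_te, y_te)
print(f"LinearSVC accuracy on synthetic data: {svm_demo_accuracy:.3f}")
```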

## 4. Build a Neural Network Model

**Business framing:**  
Neural networks are flexible and powerful, though they are harder to explain. They may work well when subtle patterns exist in the data.

### Do the following:
- Build an `MLPClassifier` model using the `neural_network` module from sklearn
- Choose a simple architecture (e.g., 2 hidden layers)
- Evaluate accuracy and performance

### In Your Response:
1. How does this model compare to the others?
2. Would the business be comfortable using a “black box” model like this? Why or why not?


In [10]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

# Ensure X_train and X_test are scaled for Neural Networks as well
# Re-initializing scaler to ensure it's fit on the full X_train (not just 2015 data)
scaler_nn = StandardScaler()
X_train_scaled_nn = scaler_nn.fit_transform(X_train)
X_test_scaled_nn = scaler_nn.transform(X_test)

# Initialize MLPClassifier with a simple architecture (e.g., 2 hidden layers)
# hidden_layer_sizes=(100, 50) means two hidden layers with 100 and 50 neurons respectively.
# max_iter is set higher as neural networks often need more iterations to converge.
# random_state for reproducibility.
mlp_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=300, random_state=42, verbose=True, early_stopping=True, n_iter_no_change=10)

# Train the Neural Network model
print("\nTraining Neural Network model... This may take significant time.\n")
mlp_model.fit(X_train_scaled_nn, y_train)
print("\nNeural Network model trained.")

# Predict on the scaled test data
y_pred_mlp = mlp_model.predict(X_test_scaled_nn)

# Evaluate the model
print("\nNeural Network Classification Report:")
print(classification_report(y_test, y_pred_mlp))

print("\nNeural Network Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_mlp))

accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
print(f"\nNeural Network Accuracy: {accuracy_mlp:.4f}")


Training Neural Network model... This may take significant time.

Iteration 1, loss = 0.39906272
Validation score: 0.838717
Iteration 2, loss = 0.33576350
Validation score: 0.848169
Iteration 3, loss = 0.31887716
Validation score: 0.851400
Iteration 4, loss = 0.30934376
Validation score: 0.850921
Iteration 5, loss = 0.29993904
Validation score: 0.853195
Iteration 6, loss = 0.29319627
Validation score: 0.852836
Iteration 7, loss = 0.28768766
Validation score: 0.855468
Iteration 8, loss = 0.28156662
Validation score: 0.857502
Iteration 9, loss = 0.27751143
Validation score: 0.860254
Iteration 10, loss = 0.27265696
Validation score: 0.859655
Iteration 11, loss = 0.26816828
Validation score: 0.860373
Iteration 12, loss = 0.26542145
Validation score: 0.860373
Iteration 13, loss = 0.26060864
Validation score: 0.860613
Iteration 14, loss = 0.25630512
Validation score: 0.863006
Iteration 15, loss = 0.25371562
Validation score: 0.859775
Iteration 16, loss = 0.24995009
Validation score: 0.86145
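
The per-iteration validation scores in the log above are also stored on the fitted model when `early_stopping=True`, so they can be inspected without reading the verbose output. A minimal sketch on synthetic data (`demo_mlp` is a hypothetical, smaller network, not the model trained above):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=1000, n_features=10, random_state=42)

demo_mlp = MLPClassifier(
    hidden_layer_sizes=(20, 10),
    max_iter=300,
    early_stopping=True,      # hold out part of the training data for validation
    n_iter_no_change=10,      # stop after 10 iterations without improvement
    random_state=42,
)
demo_mlp.fit(X_syn, y_syn)

# Attributes populated because early_stopping=True
print("iterations run:", demo_mlp.n_iter_)
print("validation scores recorded:", len(demo_mlp.validation_scores_))
print("best validation score:", round(demo_mlp.best_validation_score_, 4))
```

Plotting `validation_scores_` against the iteration number is a quick way to see where the score plateaus and why training stopped.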

### ✍️ Your Response: 🔧
1. This model's overall accuracy (0.864) was much better than Naïve Bayes (0.545) but below the SVM (0.897). The SVM was trained and tested on 2015 bookings only, however, so that comparison is not like-for-like. The neural network trained relatively quickly while remaining accurate.

2. The business would probably be comfortable using this model because it is fairly accurate and fast. However, it is less suitable if the company needs to explain individual predictions to stakeholders: a neural network's internal logic is hard to interpret, and its complexity also makes debugging challenging.
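
One common way to give stakeholders some insight into a black-box model is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data; `black_box` and the feature indices are hypothetical, and the result describes this toy model only.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, only 3 of which are informative
X_syn, y_syn = make_classification(
    n_samples=1500, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

black_box = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=500, random_state=42),
)
black_box.fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the mean accuracy drop
result = permutation_importance(black_box, X_te, y_te, n_repeats=5, random_state=42)

# Rank features from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking[:3]:
    print(f"feature {idx}: mean accuracy drop = {result.importances_mean[idx]:.3f}")
```

On the hotel data, the same call could be run on the scaled test set, although with 255 mostly one-hot columns it may be slow and the importance of a single category level can be hard to read.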

## 5. Compare All Three Models

### Do the following:
- Print and compare the accuracy of Naïve Bayes, SVM, and Neural Network models
- Summarize which model performed best

### In Your Response:
1. Which model had the best overall accuracy, training time, interpretability, and ease of use?
2. Would you recommend this model for deployment, and why?


In [11]:
print(f"Naïve Bayes Accuracy (Full Data): {accuracy_nb:.4f}")
print(f"SVM Accuracy (2015 Data Only): {accuracy_svm_2015:.4f}")
print(f"Neural Network Accuracy (Full Data): {accuracy_mlp:.4f}")

Naïve Bayes Accuracy (Full Data): 0.5452
SVM Accuracy (2015 Data Only): 0.8966
Neural Network Accuracy (Full Data): 0.8638


### ✍️ Your Response: 🔧
1. I would choose the Neural Network because the Naïve Bayes model was not accurate enough (0.545) and the SVM model took far too long to train.

2. I would recommend this model for deployment because it is reasonably accurate (0.864 on the full test set). The "black box" concern is real, but the strong performance outweighs it here, and we could run additional analyses to provide more specific insight into its predictions.
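
The accuracies printed above come from different evaluation sets, since the SVM figure covers 2015 bookings only. A like-for-like comparison would fit every model on the same training split and score it on the same test split. The sketch below shows that pattern on synthetic data; the model choices (including `LinearSVC` as the SVM) and the `results` dictionary are assumptions for illustration, and the numbers it prints say nothing about the hotel data.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data, split once and shared by all models
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

models = {
    "naive_bayes": GaussianNB(),
    "linear_svm": make_pipeline(
        StandardScaler(), LinearSVC(max_iter=5000, random_state=42)
    ),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=500, random_state=42),
    ),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start

    pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "recall_class_1": recall_score(y_te, pred),  # recall for the positive class
        "fit_seconds": fit_seconds,
    }

for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Recording fit time and class-1 recall next to accuracy puts the trade-offs discussed in the responses (speed, accuracy, and catching cancellations) in one table.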

## 6. Final Business Recommendation

### In Your Response:
1. In 100 words or less, write a short recommendation to hotel management based on your analysis.

Possible info to include:
- Which model do you recommend implementing?
- What business problem does it help solve?
- Are there any risks or limitations?
- What additional data might improve the results in the future?
2. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1. I recommend implementing the Neural Network model to predict hotel booking cancellations. It outperformed the simpler Naïve Bayes model by a wide margin (0.864 vs. 0.545 accuracy) and was much faster to train than the SVM model. The main risks are limited interpretability and resource availability. In the future, data on economic conditions, customer loyalty, and website interactions could improve the results.

2. This relates to both of my customized learning goals because it shows how data analytics can identify, measure, and reduce operational waste, similar to sustainability optimization in a supply chain.

## Submission Instructions
✅ Checklist:
- All code cells run without error
- All markdown responses are complete
- Submit on Canvas as instructed

In [None]:
!jupyter nbconvert --to html "assignment_12_LastnameFirstname.ipynb"