<a href="https://colab.research.google.com/github/egs1sos/IS-4487/blob/main/assignment_12_bayes_svm_neural.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 12: Predicting Hotel Booking Cancellations  
## Models: Naïve Bayes, Support Vector Machine (SVM), and Neural Network

**Objectives:**
- Understand how to use classification models (Naïve Bayes, SVM, Neural Networks) to predict hotel cancellations.
- Compare models in terms of accuracy, complexity, and business relevance.
- Interpret and communicate model results from a business perspective.

## Business Scenario

You work as a data analyst for a hospitality group that manages both **Resort** and **City Hotels**. One major challenge in operations is the unpredictability of **booking cancellations**, which affects staffing, inventory, and revenue planning.

You’ve been asked to use historical booking data to predict whether a future booking will be canceled. Your insights will help management plan more effectively.


Your task is to:
1. Build and evaluate three models: Naïve Bayes, SVM, and Neural Network.
2. Compare performance.
3. Recommend which model is best suited for the business needs.



## Dataset Description: Hotel Bookings

This dataset contains booking information for two types of hotels: a **city hotel** and a **resort hotel**. Each record corresponds to a single booking and includes various details about the reservation, customer demographics, booking source, and whether the booking was canceled.

**Source**: [GitHub - TidyTuesday: Hotel Bookings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

### Key Use Cases
- Understand customer booking behavior
- Explore factors related to cancellations
- Segment guests based on booking characteristics
- Compare city vs. resort hotel performance

### Data Dictionary

| Variable | Type | Description |
|----------|------|-------------|
| `hotel` | character | Hotel type: City or Resort |
| `is_canceled` | integer | 1 = Canceled, 0 = Not Canceled |
| `lead_time` | integer | Days between booking and arrival |
| `arrival_date_year` | integer | Year of arrival |
| `arrival_date_month` | character | Month of arrival |
| `stays_in_weekend_nights` | integer | Nights stayed on weekends |
| `stays_in_week_nights` | integer | Nights stayed on weekdays |
| `adults` | integer | Number of adults |
| `children` | integer | Number of children |
| `babies` | integer | Number of babies |
| `meal` | character | Type of meal booked |
| `country` | character | Country code of origin |
| `market_segment` | character | Booking source (e.g., Direct, Online TA) |
| `distribution_channel` | character | Booking channel used |
| `is_repeated_guest` | integer | 1 = Repeated guest, 0 = New guest |
| `previous_cancellations` | integer | Past booking cancellations |
| `previous_bookings_not_canceled` | integer | Past bookings not canceled |
| `reserved_room_type` | character | Initially reserved room type |
| `assigned_room_type` | character | Room type assigned at check-in |
| `booking_changes` | integer | Number of booking modifications |
| `deposit_type` | character | Deposit type (No Deposit, Non-Refund, etc.) |
| `agent` | character | Agent ID who made the booking |
| `company` | character | Company ID (if booking through company) |
| `days_in_waiting_list` | integer | Days on the waiting list |
| `customer_type` | character | Booking type: Contract, Transient, etc. |
| `adr` | float | Average Daily Rate (price per night) |
| `required_car_parking_spaces` | integer | Requested parking spots |
| `total_of_special_requests` | integer | Number of special requests made |
| `reservation_status` | character | Final status (Canceled, No-Show, Check-Out) |
| `reservation_status_date` | date | Date of the last status update |

This dataset is ideal for classification, segmentation, and trend analysis exercises.


## 1. Load and Prepare the Hotel Booking Dataset

**Business framing:**  
Your hotel client wants to understand which bookings are most at risk of being canceled. But before modeling, your job is to prepare the data to ensure clean and reliable input.

### Do the following:
- Load the `hotels.csv` file
- Remove or impute missing values
- Encode categorical variables
- Create your `X` (features) and `y` (target = `is_canceled`)
- Split the data into training and test sets (70/30)

### In your markdown:
1. How many total rows and columns are in the dataset?
2. What types of features (categorical, numerical) are included?
3. What steps did you take to clean or prepare the data?


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Load the booking data (path as uploaded to the Colab session)
url = '/content/hotels (1).csv'
df = pd.read_csv(url)

# Drop columns that record the booking outcome and would leak the target
df.drop(columns=['reservation_status', 'reservation_status_date'], inplace=True)
# Drop every row that contains at least one missing value
df.dropna(inplace=True)

# Separate features and target
x = df.drop(columns=['is_canceled'])
y = df['is_canceled']

# One-hot encode the categorical (object-typed) columns
categorical_cols = x.select_dtypes(include='object').columns
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_features = encoder.fit_transform(x[categorical_cols])
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
encoded_x = pd.DataFrame(encoded_features, columns=encoded_feature_names, index=x.index)

# Combine the numeric columns with the encoded categorical columns
x_numerical = x.select_dtypes(exclude='object')
x_processed = pd.concat([x_numerical, encoded_x], axis=1)
x_processed.columns = x_processed.columns.astype(str)

# 70/30 train/test split
x_train, x_test, y_train, y_test = train_test_split(x_processed, y, test_size=0.3, random_state=42)
df.head()

| Unnamed: 0 | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2392 | Resort Hotel | 0 | 6 | 2015 | October | 42 | 11 | 2 | 0 | 2 | ... | E | 1 | No Deposit | 240.0 | 113.0 | 0 | Transient | 82.0 | 1 | 1 |
| 2697 | Resort Hotel | 0 | 24 | 2015 | October | 44 | 26 | 7 | 15 | 1 | ... | G | 2 | No Deposit | 185.0 | 281.0 | 0 | Transient-Party | 52.2 | 0 | 0 |
| 2867 | Resort Hotel | 0 | 24 | 2015 | November | 45 | 3 | 0 | 3 | 2 | ... | A | 1 | No Deposit | 334.0 | 281.0 | 0 | Transient-Party | 48.0 | 0 | 0 |
| 2877 | Resort Hotel | 0 | 24 | 2015 | November | 45 | 3 | 2 | 10 | 1 | ... | A | 2 | No Deposit | 328.0 | 281.0 | 0 | Transient-Party | 40.0 | 0 | 0 |
| 2878 | Resort Hotel | 0 | 24 | 2015 | November | 45 | 3 | 3 | 10 | 2 | ... | A | 2 | No Deposit | 326.0 | 281.0 | 0 | Transient-Party | 48.0 | 0 | 0 |


### ✍️ Your Response:
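1. The raw dataset has 119,390 rows and 32 columns; 30 columns remain after dropping `reservation_status` and `reservation_status_date`. After `dropna()`, far fewer rows are left: the 66-row test set (30% of the data) implies only about 220 rows were used for modeling.

2. The dataset mixes categorical features (e.g. `hotel`, `meal`, `country`, `market_segment`, `deposit_type`, `customer_type`) with numerical features (e.g. `lead_time`, `adults`, `booking_changes`, `adr`, `total_of_special_requests`).

3. I dropped `reservation_status` and `reservation_status_date` because they record the outcome of the booking and would leak the target into the features. I then removed rows with missing values, one-hot encoded the categorical columns, and split the data 70/30 into training and test sets.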
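
Dropping every row with a missing value can discard most of a dataset when a few columns are sparsely populated. The sketch below shows one possible alternative on a small made-up frame: count the gaps per column, then flag or fill them rather than removing rows. The values, the `has_agent` / `has_company` flag columns, and the fill defaults are all assumptions for illustration, not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the column names match the data dictionary,
# the values are made up for illustration.
raw = pd.DataFrame({
    "lead_time": [6, 24, 120, 3],
    "children": [0.0, np.nan, 1.0, 0.0],
    "country": ["PRT", None, "GBR", "ESP"],
    "agent": [240.0, np.nan, 9.0, np.nan],
    "company": [np.nan, np.nan, 113.0, np.nan],
})

# Count missing values per column before deciding how to treat them.
missing_per_column = raw.isna().sum()
print(missing_per_column)

cleaned = raw.copy()
# Assumption: a missing ID means "no agent/company", so keep the row
# and record presence as a 0/1 flag instead.
cleaned["has_agent"] = cleaned["agent"].notna().astype(int)
cleaned["has_company"] = cleaned["company"].notna().astype(int)
cleaned = cleaned.drop(columns=["agent", "company"])
# Fill the remaining gaps with simple defaults (also an assumption).
cleaned["children"] = cleaned["children"].fillna(0)
cleaned["country"] = cleaned["country"].fillna("Unknown")

print(len(raw.dropna()), "rows survive dropna();", len(cleaned), "rows survive imputation")
```

Whether a flag or a fill value is appropriate depends on what a missing entry means for each column, so the choices here would need checking against the real data.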

## 2. Build a Naïve Bayes Model

**Business framing:**  
Naïve Bayes is a quick, baseline model often used for early testing or simple classification problems.

### Do the following:
- Train a Naïve Bayes classifier on your training data
- Use it to predict on your test data
- Print a classification report and confusion matrix

### In your markdown:
1. How accurate is this model?
2. Where might this model be useful for the hotel (e.g. real-time alerts, operational decisions)?


In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix

# Train a Gaussian Naïve Bayes classifier and predict on the test set
clf = GaussianNB()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Per-class metrics, confusion matrix (rows = actual, columns = predicted), and accuracy
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", clf.score(x_test, y_test))

              precision    recall  f1-score   support

           0       0.96      0.84      0.89        61
           1       0.23      0.60      0.33         5

    accuracy                           0.82        66
   macro avg       0.60      0.72      0.61        66
weighted avg       0.91      0.82      0.85        66

[[51 10]
 [ 2  3]]
Accuracy: 0.8181818181818182


### ✍️ Your Response:
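1. This model is about 82% accurate (0.818) on the 66-row test set. It catches 3 of the 5 cancellations (recall 0.60), but only 3 of its 13 cancellation predictions are correct (precision 0.23).

2. Naïve Bayes is easy to implement and fast to train and score, so it could suit real-time flags for at-risk bookings. Because its precision on cancellations is low, those flags are better treated as prompts for review than as firm predictions.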
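
As a quick check on how the report figures relate to the confusion matrix, the following minimal sketch recomputes accuracy, precision, and recall for the cancellation class from the four counts printed above. It assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix printed for the Naïve Bayes model above:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
confusion = [[51, 10],
             [2, 3]]

tn, fp = confusion[0]  # actual "not canceled": predicted 0, predicted 1
fn, tp = confusion[1]  # actual "canceled": predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted cancellations that were real
recall = tp / (tp + fn)     # share of real cancellations that were caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# → accuracy=0.82 precision=0.23 recall=0.60
```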

## 3. Build a Support Vector Machine (SVM) Model

**Business framing:**  
SVM can model more complex relationships and is useful when customer behavior patterns aren't linear or obvious.

### Do the following:
- Train an SVM classifier (use RBF kernel)
- Make predictions and evaluate with classification metrics

### In your markdown:
1. How well does the model perform?
2. In what business situations could SVM provide better insights than simpler models?


In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train an SVM with an RBF kernel and predict on the test set
clf1 = SVC(kernel='rbf')
clf1.fit(x_train, y_train)
y_pred = clf1.predict(x_test)

# Per-class metrics, confusion matrix (rows = actual, columns = predicted), and accuracy
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96        61
           1       0.00      0.00      0.00         5

    accuracy                           0.92        66
   macro avg       0.46      0.50      0.48        66
weighted avg       0.85      0.92      0.89        66

[[61  0]
 [ 5  0]]
Accuracy: 0.9242424242424242




### ✍️ Your Response:
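1. The SVM reaches 92% accuracy, higher than Naïve Bayes (82%), but it predicts "not canceled" for every booking and identifies none of the 5 cancellations (recall 0.00). Its accuracy simply equals the share of non-canceled bookings in the test set (61 of 66).

2. With an RBF kernel, SVM can capture non-linear relationships between features. That may help when cancellation risk depends on combinations of factors (for example, lead time together with deposit type) rather than on any single factor that a simpler model could pick up.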
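
RBF-kernel SVMs are sensitive to feature scale and to class imbalance, and either can push a model toward predicting only the majority class. The sketch below shows one common way to address both, using scikit-learn's `StandardScaler` in a pipeline and `class_weight="balanced"`. It runs on synthetic stand-in data rather than the hotel dataset, so the variable names and the resulting numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (about 10% positives); not the hotel dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
# Inflate one feature's scale to mimic mixing a price-like column with 0/1 dummies.
X_demo[:, 0] *= 1000

# Stratify so both splits keep the same share of the minority class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Scaling inside the pipeline means the scaler is fit on training data only;
# class_weight="balanced" raises the penalty for missing the minority class.
svm_demo = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))
svm_demo.fit(X_tr, y_tr)

pred = svm_demo.predict(X_te)
print("Recall on the minority class:", recall_score(y_te, pred))
```

Applied to the notebook, the same pipeline object could be fit on `x_train` and `y_train` in place of the bare `SVC`; whether it improves recall there would need to be tested.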

## 4. Build a Neural Network Model

**Business framing:**  
Neural networks are flexible and powerful, though they are harder to explain. They may work well when subtle patterns exist in the data.

### Do the following:
- Build a neural network using `MLPClassifier`
- Choose a simple architecture (e.g., 2 hidden layers)
- Evaluate accuracy and performance

### In your markdown:
1. How does this model compare to the others?
2. Would the business be comfortable using a “black box” model like this? Why or why not?


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train a neural network with two hidden layers of 100 units each
clf2 = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=1000, random_state=1, activation='relu')
clf2.fit(x_train, y_train)
y_pred = clf2.predict(x_test)

# Per-class metrics, confusion matrix (rows = actual, columns = predicted), and accuracy
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96        61
           1       0.00      0.00      0.00         5

    accuracy                           0.92        66
   macro avg       0.46      0.50      0.48        66
weighted avg       0.85      0.92      0.89        66

[[61  0]
 [ 5  0]]
Accuracy: 0.9242424242424242




### ✍️ Your Response:
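1. This model produces exactly the same results as the SVM on this test set (92% accuracy, no cancellations detected) and has higher accuracy than Naïve Bayes (82%). Like the SVM, it predicts "not canceled" for every booking.

2. Management may be uncomfortable with a "black box" model whose individual predictions are hard to explain, particularly if staff are expected to act on them. A neural network could also be cost-prohibitive to implement and maintain compared with the simpler models.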
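
One way to make a "black box" model somewhat easier to discuss is permutation importance: shuffle one feature at a time and see how much the test score drops. The sketch below uses scikit-learn's `permutation_importance` with a small `MLPClassifier` on synthetic data; the feature names are hypothetical labels and the smaller network size is chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the feature names are hypothetical labels.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Scale inputs, then fit a small two-hidden-layer network.
mlp_demo = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=1),
)
mlp_demo.fit(X_tr, y_tr)

# Shuffle one feature at a time and measure how much the test score drops.
result = permutation_importance(mlp_demo, X_te, y_te, n_repeats=10, random_state=0)

ranked = sorted(zip(feature_names, result.importances_mean), key=lambda p: p[1], reverse=True)
for name, drop in ranked:
    print(f"{name}: mean score drop {drop:.3f}")
```

A ranking like this does not explain individual predictions, but it gives management a rough view of which inputs the model relies on most.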

## 5. Compare All Three Models

### Do the following:
- Print and compare the accuracy of Naïve Bayes, SVM, and Neural Network models
- Summarize which model performed best

### In your markdown:
1. Which model would you recommend for deployment, and why?
2. Consider accuracy, training time, interpretability, and ease of use.


In [None]:
# Test-set accuracy for each fitted model
print("Naive Bayes Accuracy:", clf.score(x_test, y_test))
print("SVM Accuracy:", clf1.score(x_test, y_test))
print("Neural Network Accuracy:", clf2.score(x_test, y_test))

Naive Bayes Accuracy: 0.8181818181818182
SVM Accuracy: 0.9242424242424242
Neural Network Accuracy: 0.9242424242424242


### ✍️ Your Response:
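1. I would recommend the SVM model for deployment: it is just as accurate as the neural network (92%) but easier to implement and less costly to set up. One caveat is that both models predicted "not canceled" for every test booking, so recall on cancellations should be improved before relying on either.

2. Accuracy is identical for the SVM and the neural network, so the choice comes down to training time, interpretability, and ease of use, where the SVM is the easier and cheaper option. Naïve Bayes is the fastest to train and the only model that detected any cancellations, but its overall accuracy is lower (82%).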
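
Because only 5 of the 66 test bookings were cancellations, accuracy alone can be misleading. The minimal sketch below takes the three confusion matrices printed above and compares accuracy with balanced accuracy (the mean of per-class recall), alongside the accuracy of a trivial rule that always predicts "not canceled". The helper name `summarize` is an assumption for illustration.

```python
# Confusion matrices as printed in the model outputs above
# (rows = actual class, columns = predicted class).
matrices = {
    "Naive Bayes": [[51, 10], [2, 3]],
    "SVM": [[61, 0], [5, 0]],
    "Neural Network": [[61, 0], [5, 0]],
}

def summarize(cm):
    """Return accuracy and balanced accuracy (mean of per-class recall)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    balanced = (tn / (tn + fp) + tp / (tp + fn)) / 2
    return accuracy, balanced

# Accuracy of a trivial rule that always predicts "not canceled":
# 61 of the 66 test bookings were not canceled.
baseline = 61 / 66

for name, cm in matrices.items():
    acc, bal = summarize(cm)
    print(f"{name}: accuracy={acc:.2f}, balanced accuracy={bal:.2f}")
print(f"Always-predict-0 baseline accuracy={baseline:.2f}")
# → Naive Bayes: accuracy=0.82, balanced accuracy=0.72
# → SVM: accuracy=0.92, balanced accuracy=0.50
# → Neural Network: accuracy=0.92, balanced accuracy=0.50
# → Always-predict-0 baseline accuracy=0.92
```

The balanced accuracy values match the macro-average recall in the classification reports (0.72 and 0.50), and the baseline shows that 92% accuracy is what a model achieves here without detecting any cancellations.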

## 6. Final Business Recommendation

### In your markdown:
1. In 100 words or fewer, write a short recommendation to hotel management based on your analysis.

Possible info to include:
- Which model do you recommend implementing?
- What business problem does it help solve?
- Are there any risks or limitations?
- What additional data might improve the results in the future?
2. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response:
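1. I recommend the SVM model. It reached 92% accuracy on the test set, matching the neural network, while being easier to set up and less costly to run, and it can help management anticipate cancellations when planning staffing and inventory. The main limitation is that, on this small test set of 66 bookings, it did not flag any of the 5 cancellations, so it should be retrained on more of the data and checked for recall before deployment. It also takes more effort to train than Naïve Bayes.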

2. This relates to my learning goal of using analytics for strategic management, because classification models help anticipate outcomes such as cancellations and so inform business decisions.