<a href="https://colab.research.google.com/github/egs1sos/IS-4487/blob/main/assignment_09_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS 4487 Assignment 9: Customer Segmentation with Clustering

In this assignment, you will:
- Apply unsupervised learning to explore patterns in hotel booking behavior
- Use K-Means and Gaussian Mixture Models (GMM) for customer segmentation
- Evaluate model quality with metrics like Silhouette Score and Davies-Bouldin Index
- Connect clustering to actionable business insights

## Why This Matters

Businesses like hotels and travel platforms (e.g., Airbnb or Expedia) rely on customer segmentation to tailor promotions, pricing strategies, and service levels. Unlike supervised models, clustering uncovers patterns when no labels exist, which makes it an ideal tool when entering new markets or analyzing unstructured customer behavior.



## Dataset Description: Hotel Bookings

This dataset contains booking information for two types of hotels: a **city hotel** and a **resort hotel**. Each record corresponds to a single booking and includes various details about the reservation, customer demographics, booking source, and whether the booking was canceled.

**Source**: [GitHub - TidyTuesday: Hotel Bookings](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

### Key Use Cases
- Understand customer booking behavior
- Explore factors related to cancellations
- Segment guests based on booking characteristics
- Compare city vs. resort hotel performance

### Data Dictionary

| Variable | Type | Description |
|----------|------|-------------|
| `hotel` | character | Hotel type: City or Resort |
| `is_canceled` | integer | 1 = Canceled, 0 = Not Canceled |
| `lead_time` | integer | Days between booking and arrival |
| `arrival_date_year` | integer | Year of arrival |
| `arrival_date_month` | character | Month of arrival |
| `stays_in_weekend_nights` | integer | Nights stayed on weekends |
| `stays_in_week_nights` | integer | Nights stayed on weekdays |
| `adults` | integer | Number of adults |
| `children` | integer | Number of children |
| `babies` | integer | Number of babies |
| `meal` | character | Type of meal booked |
| `country` | character | Country code of origin |
| `market_segment` | character | Booking source (e.g., Direct, Online TA) |
| `distribution_channel` | character | Booking channel used |
| `is_repeated_guest` | integer | 1 = Repeated guest, 0 = New guest |
| `previous_cancellations` | integer | Past booking cancellations |
| `previous_bookings_not_canceled` | integer | Past bookings not canceled |
| `reserved_room_type` | character | Initially reserved room type |
| `assigned_room_type` | character | Room type assigned at check-in |
| `booking_changes` | integer | Number of booking modifications |
| `deposit_type` | character | Deposit type (No Deposit, Non-Refund, etc.) |
| `agent` | character | Agent ID who made the booking |
| `company` | character | Company ID (if booking through company) |
| `days_in_waiting_list` | integer | Days on the waiting list |
| `customer_type` | character | Booking type: Contract, Transient, etc. |
| `adr` | float | Average Daily Rate (price per night) |
| `required_car_parking_spaces` | integer | Requested parking spots |
| `total_of_special_requests` | integer | Number of special requests made |
| `reservation_status` | character | Final status (Canceled, No-Show, Check-Out) |
| `reservation_status_date` | date | Date of the last status update |

This dataset is ideal for classification, segmentation, and trend analysis exercises.

## 1. Setup and Load Data

Business framing:  

Before we can cluster or segment anything, we need clean, accessible data in a usable format.

- Import the necessary Python libraries
- Load the hotel bookings dataset [(Download Here)](https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-02-11/readme.md#get-the-data-here)
- Display the first few rows

### In Your Response:
1. What stands out in the initial preview? Any columns or rows that seem unusual?


In [None]:
# Add code here 🔧
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('/content/hotels.csv')
df.head()

| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


### ✍️ Your Response: 🔧
1. The `lead_time` column is interesting, as some entries have lead times of over a year (for example, 737 days in row 1).

## 2. Select and Prepare Features

Business framing:  

A hotel might want to group guests based on how long they stay, how far in advance they book, or how likely they are to make special requests. You need to pick variables that represent meaningful guest behavior.

- Choose 3–5 numeric features related to customer behavior
- Drop missing values if needed
- Standardize using `StandardScaler`
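
To make the standardization step concrete, here is a minimal sketch on a made-up toy table (the values are hypothetical and are not taken from the hotel data). It shows that `StandardScaler` rescales every column to mean 0 and standard deviation 1, so a wide-ranging feature such as `lead_time` does not dominate distance calculations.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy bookings; the numbers are made up for illustration only
toy = pd.DataFrame({
    "lead_time": [0, 10, 100, 400],
    "adults": [1, 2, 2, 3],
})

scaler = StandardScaler()
toy_scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# After scaling, each column has mean ~0 and population standard deviation ~1
print(toy_scaled.mean().round(6))
print(toy_scaled.std(ddof=0).round(6))
```

Note that `StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed when checking the result with pandas.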

### In Your Response:
1. What features did you select and why?
2. What kinds of patterns or segments do you expect to find?


In [None]:
# Add code here 🔧
from sklearn.preprocessing import StandardScaler

# Numeric features describing booking behavior and party size
features = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights', 'adults', 'children', 'babies']

# Standardize each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[features])
df_scaled = pd.DataFrame(scaled_features, columns=features, index=df.index)
df_scaled.head()

| | lead_time | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies |
|---|---|---|---|---|---|---|
| 0 | 2.227051 | -0.92889 | -1.31024 | 0.247897 | -0.260663 | -0.081579 |
| 1 | 5.923385 | -0.92889 | -1.31024 | 0.247897 | -0.260663 | -0.081579 |
| 2 | -0.907814 | -0.92889 | -0.786207 | -1.478447 | -0.260663 | -0.081579 |
| 3 | -0.851667 | -0.92889 | -0.786207 | -1.478447 | -0.260663 | -0.081579 |
| 4 | -0.842309 | -0.92889 | -0.262174 | 0.247897 | -0.260663 | -0.081579 |


### ✍️ Your Response: 🔧
1. I decided to scale `lead_time`, `stays_in_weekend_nights`, `stays_in_week_nights`, `adults`, `children`, and `babies` because they are high in variation.

2. The `stays_in_weekend_nights` column is surprisingly consistent, as are the `children` and `babies` columns.


## 3. Apply K-Means Clustering

Business framing:  

Let’s say you’re working with the hotel’s marketing manager. She wants to group guests into a few clear types to target email campaigns. K-Means is a fast, simple way to try this.

- Fit a `KMeans` model with your selected features
- Choose a value of `k` (e.g. 3, 4, or 5)
- Predict clusters and assign to each guest
- Visualize using a scatterplot of 2 features
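
The steps above can be sketched end to end on synthetic data. This is only an illustration: `make_blobs` stands in for two scaled guest features, `kmeans_demo` is a hypothetical name chosen so it does not overwrite any model in the notebook, and `k=3` is an arbitrary choice for the toy data.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two scaled guest features (not the hotel data)
X_raw, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X_raw)

# Fit K-Means and get one hard cluster label per row
kmeans_demo = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans_demo.fit_predict(X)

# Scatterplot of the two features, colored by cluster, with centroids marked
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=15)
ax.scatter(*kmeans_demo.cluster_centers_.T, c="red", marker="X", s=120, label="centroids")
ax.set_xlabel("feature 1 (scaled)")
ax.set_ylabel("feature 2 (scaled)")
ax.set_title("K-Means clusters (k=3)")
ax.legend()
plt.show()
```

With the hotel data, the same pattern would apply to any two of the scaled feature columns, for example `lead_time` against `stays_in_week_nights`.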

Much of this assignment has already been covered in the lab. Please be sure to complete the lab before the assignment.

### In Your Response:
1. What `k` value did you choose, and how did you decide?
2. What types of customers seem to show up in the clusters?
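
One common way to justify a choice of `k` is to compare WCSS (the "elbow" heuristic) and silhouette score across several candidates. The sketch below does this on synthetic data; the range 2 to 7 and the dataset are assumptions for illustration, and on the full hotel data the silhouette calculation may need subsampling to finish in reasonable time.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four underlying groups (not the hotel data)
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Fit one model per candidate k and record WCSS and silhouette score
results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    results[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (wcss, sil) in results.items():
    print(f"k={k}  WCSS={wcss:.1f}  silhouette={sil:.3f}")
```

WCSS generally keeps falling as `k` grows, so the usual reading is to look for the point where the improvement flattens, and to check whether the silhouette score agrees.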



In [None]:
# Add code here 🔧
from sklearn.cluster import KMeans

# Copy after dropping rows with missing values so a new column can be added without a SettingWithCopyWarning
df_scaled_cleaned = df_scaled.dropna().copy()
feature_cols = list(df_scaled_cleaned.columns)  # feature columns, captured before any label column is added

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df_scaled_cleaned['cluster'] = kmeans.fit_predict(df_scaled_cleaned[feature_cols])

# Assign by index instead of merging: re-running a merge adds suffixed duplicates (cluster_x, cluster_y)
df['cluster'] = df_scaled_cleaned['cluster']
display(df.head())


| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | cluster_x | cluster_y | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 2.0 | 2.0 | 2.0 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 2.0 | 2.0 | 2.0 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 2.0 | 0.0 | 0.0 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 2.0 | 0.0 | 0.0 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 | 2.0 | 0.0 | 0.0 |


### ✍️ Your Response: 🔧
1. I chose 5 clusters because I felt that would cover as much of the data as possible.

2. Mainly adults seem to show up in these clusters.


## 4. Apply Gaussian Mixture Model (GMM)

Business framing:  

Not all guests fit neatly into one cluster. GMM lets us capture uncertainty, which is useful if customers behave similarly across groups.

- Fit a GMM with the same number of clusters you chose in Part 3
- Predict soft clusters (remember that soft clustering deals with probabilities, not labels)
- Visualize the GMM model so that you may compare it to the KMeans scatterplot
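
To illustrate what "soft clusters" means in practice, here is a minimal sketch on synthetic data. `predict_proba` returns one membership probability per component for each row, while `predict` returns only the most likely component. The 0.8 cutoff for "ambiguous" rows and the name `gmm_demo` are hypothetical choices for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic, partly overlapping groups (not the hotel data)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)

gmm_demo = GaussianMixture(n_components=3, random_state=42).fit(X)
probs = gmm_demo.predict_proba(X)  # shape (n_samples, n_components); each row sums to 1
hard = gmm_demo.predict(X)         # index of the most likely component per row

# Flag rows whose strongest membership is below a hypothetical 0.8 threshold
ambiguous = probs.max(axis=1) < 0.8
print(probs[:3].round(3))
print("share of ambiguous rows:", round(float(ambiguous.mean()), 3))
```

Rows flagged as ambiguous are the ones a hard-label method such as K-Means would assign to a single segment even though they sit between two.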

### In Your Response:
1. How did the GMM results compare to KMeans?
2. What business questions might GMM help answer better?


In [None]:
# Add your code here
from sklearn.mixture import GaussianMixture

# Fit on the scaled features only; exclude label columns added by earlier cells
X = df_scaled_cleaned.drop(columns=['cluster', 'gmm_cluster'], errors='ignore')
gmm = GaussianMixture(n_components=5, random_state=42)
gmm.fit(X)

# predict() gives the most likely component; gmm.predict_proba(X) gives the soft memberships
df_scaled_cleaned = df_scaled_cleaned.assign(gmm_cluster=gmm.predict(X))
df['gmm_cluster'] = df_scaled_cleaned['gmm_cluster']
display(df.head())


| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | cluster_x | cluster_y | cluster | gmm_cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 2.0 | 2.0 | 2.0 | 4.0 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 2.0 | 2.0 | 2.0 | 4.0 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 2.0 | 0.0 | 0.0 | 0.0 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 2.0 | 0.0 | 0.0 | 0.0 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 | 2.0 | 0.0 | 0.0 | 0.0 |


### ✍️ Your Response: 🔧
1. I feel like the GMM results seem to be more spread out than the K-Means results, and therefore cover more data points.

2. I think GMM would help answer the question of whether customer type predicts ADR or the type of hotel the customer would stay at.


## 5. Evaluate Your Models

Business framing:  

In business, models should be both useful and reliable. You’ll compare model quality using standard evaluation metrics.

- Calculate:
  - WCSS
  - Silhouette Score
  - Davies-Bouldin Index
- Compare both models

**Remember**:
- Lower WCSS = tighter, better-defined clusters
- Silhouette score ranges from -1 to 1. Higher values = better clustering
- Lower Davies-Bouldin Index = better clustering
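
As a rough sketch of how these metrics can be computed side by side, the example below scores a K-Means and a GMM fit on synthetic data. One assumption to flag: scikit-learn's `GaussianMixture` does not expose a WCSS value, so the sketch reports BIC (`gm.bic`) for the GMM instead, which is a different quantity and is not directly comparable to K-Means inertia.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic data with three groups (not the hotel data)
X, _ = make_blobs(n_samples=400, centers=3, random_state=1)

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
gm = GaussianMixture(n_components=3, random_state=42).fit(X)
gm_labels = gm.predict(X)

scores = {
    "kmeans": {
        "wcss": km.inertia_,  # within-cluster sum of squares
        "silhouette": silhouette_score(X, km.labels_),
        "davies_bouldin": davies_bouldin_score(X, km.labels_),
    },
    "gmm": {
        "bic": gm.bic(X),  # swapped in for WCSS, which GMM does not define
        "silhouette": silhouette_score(X, gm_labels),
        "davies_bouldin": davies_bouldin_score(X, gm_labels),
    },
}

for name, metrics in scores.items():
    print(name, {key: round(float(value), 3) for key, value in metrics.items()})
```

Silhouette and Davies-Bouldin only need the data and the hard labels, so they can be compared across both models; WCSS and BIC are model-specific.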

### In Your Response:
1. Which model performed better on the metrics?
2. Would you recommend KMeans or GMM for a business analyst? Why?


In [None]:
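# Add code here 🔧
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Score both models on the scaled features only, not on the label columns
X = df_scaled_cleaned.drop(columns=['cluster', 'gmm_cluster'], errors='ignore')
gmm_labels = df_scaled_cleaned['gmm_cluster'].to_numpy()

print('K-Means WCSS:', kmeans.inertia_)
# Silhouette is quadratic in the number of rows; sample_size keeps it tractable on ~119k bookings
print('K-Means Silhouette Score:', silhouette_score(X, kmeans.labels_, sample_size=10000, random_state=42))
print('K-Means Davies-Bouldin Index:', davies_bouldin_score(X, kmeans.labels_))
# GMM has no WCSS; lower_bound_ is the log-likelihood lower bound from EM, so it is labeled as such
print('GMM log-likelihood lower bound:', gmm.lower_bound_)
print('GMM Silhouette Score:', silhouette_score(X, gmm_labels, sample_size=10000, random_state=42))
print('GMM Davies-Bouldin Index:', davies_bouldin_score(X, gmm_labels))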

K-Means WCSS: 340451.34113288793


### ✍️ Your Response: 🔧
1. I feel like K-Means did better on WCSS and silhouette score, but GMM did better with the Davies-Bouldin index.

2. Despite being harder to work with, I think that GMM gives a more accurate prediction of clustering, and therefore, I would recommend it to data analysts.


## 6. Business Interpretation

Business framing:  

What do these clusters mean in the real world? Could they represent solo travelers, families, or bargain shoppers?

- Review characteristics of each cluster (e.g. average `lead_time`, `total_of_special_requests`)
- Think from a marketing or hotel operations perspective
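
A cluster profile is usually just a grouped summary of the original (unscaled) columns. The sketch below builds one on a tiny made-up table; the `bookings` frame, its values, and the output column names are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical bookings with cluster labels already assigned; values are made up
bookings = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "lead_time": [5, 15, 200, 260, 230, 80],
    "total_of_special_requests": [0, 1, 0, 0, 1, 3],
})

# One row per cluster: segment size plus the average of each behavior column
profile = bookings.groupby("cluster").agg(
    guests=("lead_time", "size"),
    avg_lead_time=("lead_time", "mean"),
    avg_special_requests=("total_of_special_requests", "mean"),
)
print(profile)
```

Including the segment size alongside the averages helps when judging whether a cluster is large enough to justify its own promotion or service change.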

### In Your Response:
1. What do the segments represent in terms of guest behavior?
2. How could the hotel tailor services or promotions to each group?


In [None]:
# Add code here 🔧
numeric_cols = df_scaled_cleaned.select_dtypes(include=np.number).columns
display(df.groupby('cluster')[numeric_cols].mean())
display(df.groupby('gmm_cluster')[numeric_cols].mean())
display(df.groupby('cluster')['hotel'].value_counts())
display(df.groupby('gmm_cluster')['hotel'].value_counts())

| cluster | lead_time | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | cluster | gmm_cluster |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 46.232952 | 0.647823 | 1.806408 | 1.772284 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1.0 | 127.664364 | 2.308808 | 5.522233 | 1.956162 | 0.013072 | 0.0 | 1.0 | 2.0 |
| 2.0 | 262.342312 | 0.673165 | 2.210335 | 1.988963 | 0.001015 | 0.0 | 2.0 | 4.0 |
| 3.0 | 77.629226 | 1.155943 | 3.005453 | 1.993457 | 0.225736 | 1.034896 | 3.0 | 3.0 |
| 4.0 | 86.908713 | 1.044053 | 2.674376 | 1.954968 | 1.461454 | 0.0 | 4.0 | 1.0 |


| gmm_cluster | lead_time | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | cluster | gmm_cluster |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 46.232952 | 0.647823 | 1.806408 | 1.772284 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1.0 | 86.908713 | 1.044053 | 2.674376 | 1.954968 | 1.461454 | 0.0 | 4.0 | 1.0 |
| 2.0 | 127.664364 | 2.308808 | 5.522233 | 1.956162 | 0.013072 | 0.0 | 1.0 | 2.0 |
| 3.0 | 77.629226 | 1.155943 | 3.005453 | 1.993457 | 0.225736 | 1.034896 | 3.0 | 3.0 |
| 4.0 | 262.342312 | 0.673165 | 2.210335 | 1.988963 | 0.001015 | 0.0 | 2.0 | 4.0 |


| cluster | hotel | count |
|---|---|---|
| 0.0 | City Hotel | 48799 |
| 0.0 | Resort Hotel | 20331 |
| 1.0 | Resort Hotel | 11384 |
| 1.0 | City Hotel | 6135 |
| 2.0 | City Hotel | 19081 |
| 2.0 | Resort Hotel | 4567 |
| 3.0 | Resort Hotel | 548 |
| 3.0 | City Hotel | 369 |
| 4.0 | City Hotel | 4942 |
| 4.0 | Resort Hotel | 3230 |


| gmm_cluster | hotel | count |
|---|---|---|
| 0.0 | City Hotel | 48799 |
| 0.0 | Resort Hotel | 20331 |
| 1.0 | City Hotel | 4942 |
| 1.0 | Resort Hotel | 3230 |
| 2.0 | Resort Hotel | 11384 |
| 2.0 | City Hotel | 6135 |
| 3.0 | Resort Hotel | 548 |
| 3.0 | City Hotel | 369 |
| 4.0 | City Hotel | 19081 |
| 4.0 | Resort Hotel | 4567 |


### ✍️ Your Response: 🔧
1. I feel like segments with longer lead times tend to be more likely to stay on weekend nights, while segments with babies tend to have shorter lead times.

2. Hotel companies could offer last-minute deals to families with babies, as that demographic tends to have short lead times.


## 7. Final Reflection

Business framing:  

Many teams ask for "segmentation" without knowing how it works. You now have hands-on experience with two clustering techniques and how to present the results.

### In Your Response:
1. What was most challenging about unsupervised learning?
2. When would you use clustering instead of supervised models?
3. How would you explain the value of clustering to a non-technical manager?
4. How does this relate to the customized learning outcome you created in Canvas?


### ✍️ Your Response: 🔧
1. The hardest thing about unsupervised learning is not knowing what to expect going into it, and having to figure that out along the way.

2. Clustering is useful when you're looking for patterns in sets of data, and trying to draw information from those patterns.

3. Clustering lets non-technical managers easily look at a dataset and spot patterns within that dataset. With those patterns, they can make predictive decisions more easily.

4. I wanted to use business analytics for strategic management decisions, and using clustering lets me make strategic predictions by looking at patterns in a dataset.

## Submission Instructions

✅ **Before submitting:**
- Make sure all code cells are run and outputs are visible  
- All markdown questions are answered thoughtfully  
- Submit the assignment as an **HTML file** on Canvas


In [None]:
!jupyter nbconvert --to html "assignment_09_LastnameFirstname.ipynb"