# HW1: Data Preprocessing, KPI, and PCA

## Introduction: Exploring the Drivers of Hotel Booking Cancellations

Imagine you are a **data science consultant** who has just been hired by two hotels.  
The managers are worried about the growing number of **cancellations**, and they want to understand what drives this behavior so they can make better business decisions.  

Your job in this homework is to help the hotel managers uncover the **main reasons behind cancellations**.  
You will approach this problem using the **same structured workflow** you practiced in the Lab (remember when you were part of the airline's *Customer Experience & Insights* team 😄).  

This time, instead of analyzing passenger satisfaction, you are analyzing hotel bookings. The process is the same:  

- Load and clean the dataset,  
- Explore it systematically,  
- Compute KPIs and summaries,  
- Apply PCA to simplify complexity,  
- And extract insights.  

At the end, you should be able to **report the key factors behind cancellations**.  
While this is not a formal requirement in the submission, try to **write a short text as if you were delivering recommendations to your client**. The goal is to practice turning analysis into **clear, data-driven advice** that managers can act on.  


**This homework is your first hands-on warm-up. By the end, you will be able to:**  
- Load a real dataset and perform a **light, deterministic clean**.  
- Compute **basic KPIs** and **categorical summaries** without plotting.  
- Build a **PCA-ready numeric matrix** (impute + standardize) and run **PCA**.  
- Interpret **explained variance ratios** and **dominant features** for principal components.  
- Submit clean, testable code that passes **VPL** auto-grading.


## Dataset & Columns: Quick Guide

**Data Source & Coverage**  
This homework uses the public **Hotel booking demand** dataset. It contains **booking-level records** from two hotels in Portugal, a **City Hotel** and a **Resort Hotel**, covering bookings around **2015–2017**.  
A copy is available on Kaggle: [Hotel booking demand (Mostipak)](https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand).

> First time using Kaggle?
> Use the link above to read about the dataset, explore the documentation, and get familiar with the platform.


- **File:** `hotel_bookings.csv`  
- **Size:** **119390 rows × 32 columns**  
- **Goal of this HW:** understand **cancellations** and prepare numeric features for **PCA**.

**What's inside:** 
This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.

- Booking details (e.g., **lead time**, **stay length**, **ADR** price).  
- Channel/segment info (e.g., **market_segment**, **distribution_channel**, **deposit_type**).  
- Guest counts (**adults**, **children**, **babies**) and a flag for **repeated guests**.  
- Dates (arrival year/month/day, and **reservation_status_date** for final status).  
- The dataset is anonymized.

**Target / Outcome**  
- **`is_canceled`**: 0 (not canceled) or 1 (canceled).

**Key categoricals used in EDA**
- **`hotel`**
- **`deposit_type`**
- **`market_segment`**
- **`distribution_channel`**
- **`customer_type`**

**Numeric features used for PCA (14)**
- **`lead_time`**
- **`stays_in_weekend_nights`**
- **`stays_in_week_nights`**
- **`adults`**
- **`children`**
- **`babies`**
- **`previous_cancellations`**
- **`previous_bookings_not_canceled`**
- **`booking_changes`**
- **`days_in_waiting_list`**
- **`adr`**
- **`required_car_parking_spaces`**
- **`total_of_special_requests`**
- **`total_guests`** *(we will create this: adults + children + babies)*

> Tip: Don't rename columns. VPL expects the standard names in your functions.


## Before You Start  

We will follow a **similar structured workflow** as in the lab. Since homeworks are automatically graded with the **VPL system**, and VPL cannot evaluate plots, we will check **numeric KPIs** instead of graphs.  

➡️ **Recommendation:** Create your own plots locally while working through the notebook. This extra practice will help you explore the data more deeply and strengthen your intuition.  


## Task 1: Load & Light Clean 
---

**What you'll build:** a function `load_and_clean(csv_path)` that returns a **cleaned** `pandas.DataFrame` ready for analysis.  
You will **not** plot anything here. Keep the cleaning **deterministic** so everyone gets the same result.

In short: These steps make sure dates are real dates, numbers are real numbers, bookings make sense (no zero guests), and errors are flagged as missing values instead of breaking the dataset.

### Required steps (follow in this order)
1) **Convert reservation date column into proper date format**  
   - Convert the column `reservation_status_date` from text into a proper date format so it can be used in time-based analysis.
   - If a value doesn't look like a valid date, mark it as `NaT` (Not a Time) instead of crashing the process.

2) **Convert numeric columns** (convert strings to numbers; invalids → `NaN`)  
   - Columns to convert (if present):  
     `is_canceled, lead_time, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies,`  
     `previous_cancellations, previous_bookings_not_canceled, booking_changes, days_in_waiting_list, adr,`  
     `required_car_parking_spaces, total_of_special_requests, arrival_date_year, arrival_date_day_of_month`  
   - Use `pd.to_numeric(col, errors="coerce")` in a simple loop.
   - If a value can't be converted (e.g., text in a number column), replace it with `NaN`. 

3) **Create total guests**  
   - Create new column: `total_guests = adults + children + babies` (row-wise sum).  
   - If any of these are `NaN`, the sum will be `NaN`; that's fine for now.

4) **Drop impossible rows**  
   - Keep only rows where `total_guests > 0`. (Removes bookings with 0 guests.)

5) **Handle negative prices**  
   - If `adr < 0`, set it to `NaN` (do **not** drop those rows).
   - ADR represents the average revenue earned per occupied room per day.
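
The five steps can be sketched on a tiny, made-up table. This is only an illustration of the pandas calls involved, not the graded `load_and_clean`: the helper name `light_clean_demo`, the reduced column list, and the toy rows are all assumptions made for this example.

```python
import io

import numpy as np
import pandas as pd

# Toy CSV (made-up rows) containing the kinds of problems the steps target.
TOY_CSV = io.StringIO(
    "adults,children,babies,adr,reservation_status_date\n"
    "2,0,0,95.5,2016-07-01\n"
    "0,0,0,0.0,2016-07-02\n"      # zero guests: removed in step 4
    "1,1,0,-6.4,not-a-date\n"     # negative adr and an unparseable date
    "two,0,0,120.0,2017-01-15\n"  # text in a numeric column
)


def light_clean_demo(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring steps 1-5 on a toy frame."""
    df = raw.copy()
    # 1) Dates: unparseable values become NaT instead of raising an error.
    df["reservation_status_date"] = pd.to_datetime(
        df["reservation_status_date"], errors="coerce"
    )
    # 2) Numbers: unparseable values become NaN.
    for col in ["adults", "children", "babies", "adr"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    # 3) Row-wise sum with `+`; a NaN in any part makes the total NaN.
    df["total_guests"] = df["adults"] + df["children"] + df["babies"]
    # 4) Keep bookings with at least one guest (a NaN total fails `> 0`).
    df = df[df["total_guests"] > 0].copy()
    # 5) Negative prices become NaN; the rows themselves are kept.
    df.loc[df["adr"] < 0, "adr"] = np.nan
    return df


toy = light_clean_demo(pd.read_csv(TOY_CSV))
print(toy)
```

In this toy run two rows survive: the zero-guest booking and the row whose `adults` value is text are filtered out in step 4, while the row with the negative price stays with `adr` set to `NaN`.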

In [None]:
# If you changed hw1.py, reload the module so Jupyter sees your edits
import numpy as np
import importlib
import hw1 as hw
importlib.reload(hw)

In [None]:
# Run Task 1
import os
print(os.getcwd())
CSV_file = "hotel_bookings.csv"
df = hw.load_and_clean(CSV_file)

print("Loaded & cleaned")
print("Shape:", df.shape)


In [None]:
# ⚠️ Do not change this part!  
# These assertions mirror the VPL auto-grading checks.  
# Use them locally to confirm that your function works properly.

assert df.shape == (119210, 33)
assert "total_guests" in df and (df["total_guests"] > 0).all()
assert (df["adr"] < 0).sum() == 0

## Task 2: Basic Numeric KPIs
---

**What you'll build:** a function `numeric_kpis(df)` that returns a small **dictionary of key numbers** describing the cleaned dataset from **Task 1**.  
No plots, no printing required, just compute and return the values.

In short: You'll get a quick "health check" of the dataset: how big it is, how often bookings are canceled, how prices look at the high end, and how long stays typically are.

### Required steps

1) **Use the cleaned DataFrame from Task 1**  
   - Input to this function is the **already cleaned** `df` from `load_and_clean(...)`.

2) **Compute these KPIs and return them in a dict:**  
   - **`rows`** → number of rows (after cleaning).  

   - **`cols`** → number of columns.  

   - **`cancel_rate`** → average of `is_canceled`. (NaNs are ignored automatically)

   - **`adr_p95`** → 95th percentile of `adr` (a robust "high price" marker).  

   - **`avg_stay_len`** → mean of **total nights** per booking, where  
     `total_nights = stays_in_week_nights + stays_in_weekend_nights`.  
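
As a rough illustration of these KPIs, here is a sketch on a made-up five-row frame. The helper name `kpi_demo` and the toy values are assumptions for this example only; the graded function should follow the steps above on the real cleaned data.

```python
import numpy as np
import pandas as pd

# Made-up, already "cleaned" rows; the values are illustrative only.
toy = pd.DataFrame(
    {
        "is_canceled": [0, 1, 0, 1, np.nan],
        "adr": [80.0, 100.0, 120.0, 200.0, np.nan],
        "stays_in_week_nights": [2, 3, 1, 5, 2],
        "stays_in_weekend_nights": [0, 2, 1, 2, 0],
    }
)


def kpi_demo(df: pd.DataFrame) -> dict:
    """Hypothetical helper: the five KPIs computed on any frame."""
    total_nights = df["stays_in_week_nights"] + df["stays_in_weekend_nights"]
    return {
        "rows": int(df.shape[0]),
        "cols": int(df.shape[1]),
        # Series.mean() skips NaN by default.
        "cancel_rate": float(df["is_canceled"].mean()),
        # Series.quantile() interpolates linearly by default and skips NaN.
        "adr_p95": float(df["adr"].quantile(0.95)),
        "avg_stay_len": float(total_nights.mean()),
    }


kpis = kpi_demo(toy)
print(kpis)
```

On this toy frame the cancel rate is 0.5, because the row with a missing `is_canceled` is left out of the average rather than counted as 0.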
    

In [None]:
# Reload again to reset any changes
importlib.reload(hw)

In [None]:
k = hw.numeric_kpis(df)
display(k)

In [None]:
# ⚠️ Do not change this part!  
# These assertions mirror the VPL auto-grading checks.  
# Use them locally to confirm that your function works properly.

assert k["rows"] == 119210 and k["cols"] == 33
assert abs(k["cancel_rate"] - 0.370766) < 1e-3
assert abs(k["adr_p95"] - 193.5) < 1e-6
assert abs(k["avg_stay_len"] - 3.426248) < 1e-3

## Task 3: Categorical EDA (no plots)
---

**What you'll build:** a function `categorical_cancel_stats(df)` that summarizes **cancellation rates by category** and identifies which **market segment** has the highest cancellation rate among sufficiently common categories.  
No plots: just compute the numbers and return them.

In short: You'll measure how cancellations differ across **hotel type** and **deposit policy**, and find the highest-risk **market segment** (only considering segments with enough data).
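
The pattern behind this task (group, average the 0/1 flag, and ignore rare categories before ranking) can be sketched on a made-up table. The cutoff `MIN_ROWS = 2` is a hypothetical stand-in for a realistic minimum count, chosen only because the toy table is tiny.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
        "market_segment": ["Groups", "Groups", "Direct", "Direct", "Aviation"],
        "is_canceled": [1, 0, 0, 0, 1],
    }
)
MIN_ROWS = 2  # hypothetical cutoff for this toy table

# Cancellation rate per category: group, average the 0/1 flag, convert to dict.
hotel_rates = toy.groupby("hotel")["is_canceled"].mean().to_dict()

# Keep only segments that occur often enough, then rank by cancellation rate.
counts = toy["market_segment"].value_counts()
common = counts[counts >= MIN_ROWS].index
seg_rates = (
    toy[toy["market_segment"].isin(common)]
    .groupby("market_segment")["is_canceled"]
    .mean()
    .sort_values(ascending=False)
)
top_segment = (seg_rates.index[0], float(seg_rates.iloc[0]))

print(hotel_rates)
print(top_segment)
```

Here `Aviation` has a 100% cancellation rate but only one row, so the count filter removes it and `Groups` comes out on top. This is why the minimum-count filter is applied before ranking: a rate based on a handful of bookings is not a reliable signal.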

### Required steps

1) **Use the cleaned DataFrame from Task 1**  
   - Input to this function is the **already cleaned** `df` from `load_and_clean(...)`.

2) **Cancellation by hotel type** → `hotel_rates`  
   - Group by `hotel` and compute the **mean of `is_canceled`**.  
   - Convert to a Python dictionary.  

3) **Cancellation by deposit type** → `deposit_rates`  
   - Group by `deposit_type` and compute the **mean of `is_canceled`**.  
   - Convert to a dictionary.  

4) **Top market segment (n ≥ 500)** → `top_segment_500`  
   - Count rows per `market_segment` and **keep only categories with at least 500 rows**.  
   - Filter `df` to those segments, group by `market_segment`, compute the **mean cancellation rate**, and sort descending.  
   - Take the **first** one and return it as a **tuple**: `(segment_name, rate)` where `rate` is a float.  
    
5) **Return format (exact keys):**  
   ```python
   {
     "hotel_rates": <dict>,        # e.g., {"City Hotel": 0.xxx, "Resort Hotel": 0.yyy}
     "deposit_rates": <dict>,      # e.g., {"Non Refund": 0.xxx, "No Deposit": 0.yyy, "Refundable": 0.zzz}
     "top_segment_500": (<name>, <rate_float>)
   }
   ```


In [None]:
# Reload again to reset any changes
importlib.reload(hw)

In [None]:
cat = hw.categorical_cancel_stats(df)
cat

In [None]:
# ⚠️ Do not change this part!  
# These assertions mirror the VPL auto-grading checks.  
# Use them locally to confirm that your function works properly.

assert "City Hotel" in cat["hotel_rates"] and "Resort Hotel" in cat["hotel_rates"]
assert cat["top_segment_500"][0] == "Groups"

## Task 4: Build PCA-ready Matrix
---

**What you'll build:** a function `build_pca_matrix(df)` that takes the **cleaned** DataFrame from Task 1 and returns a **NumPy array** of numeric features, ready for PCA.  
No plots, no printing: just return the matrix.

In short: You will **pick 14 numeric features**, **fill missing values with the median**, and **standardize** each feature to mean 0 and standard deviation 1.

### Features to include (exactly these 14)
- `lead_time`  
- `stays_in_weekend_nights`  
- `stays_in_week_nights`  
- `adults`  
- `children`  
- `babies`  
- `previous_cancellations`  
- `previous_bookings_not_canceled`  
- `booking_changes`  
- `days_in_waiting_list`  
- `adr`  
- `required_car_parking_spaces`  
- `total_of_special_requests`  
- `total_guests`  *(created in Task 1)*

> Do **not** include the target `is_canceled`. No categorical encoding is needed for this task.

### Required steps
1) **Select the columns** above, in that order.  
   - You may store this list in a constant like `PCA_FEATURES` to keep it consistent.

2) **Impute missing values with the median**  
   - Use `sklearn.impute.SimpleImputer(strategy="median")`.  
   - Fit on your selected columns and transform to get a **fully numeric, NaN-free** array.

3) **Standardize (z-score) each feature**  
   - Use `sklearn.preprocessing.StandardScaler()`.  
   - Fit on the imputed data, then transform so each column has mean ~0 and std ~1.

4) **Return a NumPy array** (not a DataFrame)  
   - Shape must be `(n_samples, 14)` and contain **no NaNs**.
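
To see what steps 2 and 3 do to the numbers, here is a small sketch on a made-up two-column matrix (not the 14 homework features). The array `X_raw` and its values are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up 2-feature matrix with one missing value per column.
X_raw = np.array(
    [
        [1.0, 10.0],
        [2.0, np.nan],
        [np.nan, 30.0],
        [4.0, 40.0],
    ]
)

# Median imputation: each NaN is replaced by the median of its own column.
X_imputed = SimpleImputer(strategy="median").fit_transform(X_raw)

# Z-scoring: subtract the column mean and divide by the column standard
# deviation (StandardScaler uses the population std, i.e. ddof=0).
X_std = StandardScaler().fit_transform(X_imputed)

print(X_imputed)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

The missing entry in the first column becomes 2.0 (the median of 1, 2, 4) and the one in the second column becomes 30.0 (the median of 10, 30, 40). After scaling, each column has mean close to 0 and standard deviation close to 1, which keeps features with large raw ranges from dominating PCA.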


In [None]:
# Reload again to reset any changes
importlib.reload(hw)

In [None]:
X = hw.build_pca_matrix(df)
X.shape, np.isnan(X).sum()

In [None]:
# ⚠️ Do not change this part!  
# These assertions mirror the VPL auto-grading checks.  
# Use them locally to confirm that your function works properly.

assert X.shape == (len(df), 14) and not np.isnan(X).any()

## Task 5: Run PCA and Inspect Results
---

**What you'll build:** a function `run_pca(X, n_components=3)` that fits **Principal Component Analysis** on the PCA-ready matrix from **Task 4** and returns two things:  
1) the **explained variance ratios** for each component, and  
2) the **component loadings** (the weights for each original feature).

No plots are required; just compute and return the values.

In short: You'll compress the 14 standardized features into 3 principal components, then report how much variance each component explains and what linear combinations (loadings) define them.
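
To make "how much variance each component explains" concrete: the ratio for a component is its eigenvalue of the feature covariance matrix divided by the sum of all eigenvalues, and the matching eigenvector holds the feature weights. The NumPy-only sketch below shows this on synthetic data; the data and all variable names are assumptions for illustration, and the homework itself uses scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 4 features, with two features made correlated.
X_demo = rng.normal(size=(200, 4))
X_demo[:, 1] += 2.0 * X_demo[:, 0]
# Z-score each column, as a PCA-ready matrix would be.
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigen-decomposition of the covariance matrix (symmetric, so eigh applies).
cov = np.cov(X_demo, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; PCA orders components by
# descending variance, so reverse the order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

ratios = eigvals / eigvals.sum()  # share of total variance per component
loadings = eigvecs.T              # row k holds the feature weights of component k

print(ratios)
print(loadings[0])
```

The ratios sum to 1 when all components are kept, so keeping only the first few gives a sum below 1. The sign of each loading row is arbitrary, which is why the absolute value of a loading is what indicates a dominant feature.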

### Required steps

1) **Use the matrix from Task 4**  
   - Input is the NumPy array `X` produced by `build_pca_matrix(df)` (shape `(n_samples, 14)`, no NaNs).

2) **Initialize PCA**  
   - `from sklearn.decomposition import PCA`  
   - `pca = PCA(n_components=n_components)` (the local checks call the function with `n_components=3`)

3) **Fit PCA on `X`**  
   - `pca.fit(X)`

4) **Return these two attributes**  
   - `pca.explained_variance_ratio_`  → array of length **3**  
   - `pca.components_`                → matrix of shape **(3, 14)**

> Do **not** return the PCA object itself. Return the **tuple** `(explained_variance_ratio_, components_)`.
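
For orientation, this is roughly what the two attributes look like when scikit-learn's `PCA` is fitted on a random matrix of the same width. The matrix `X_demo` is a synthetic stand-in, not the homework data, so the printed values carry no meaning beyond their shapes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 14))  # synthetic stand-in for the Task 4 matrix

pca = PCA(n_components=3)
pca.fit(X_demo)

ratio_demo = pca.explained_variance_ratio_  # shape (3,), sorted descending
comps_demo = pca.components_                # shape (3, 14), one row per component

print(ratio_demo.shape, comps_demo.shape)
# Only 3 of the 14 possible components are kept, so the ratios sum to less than 1.
print(ratio_demo.sum())
```

Each row of `components_` is a unit-length vector of feature weights, so the entry with the largest absolute value in a row points to the feature that dominates that component.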


In [None]:
# Reload again to reset any changes
importlib.reload(hw)

In [None]:
ratio, comps = hw.run_pca(X, n_components=3)
print('variance ratios:', ratio, 'sum:', ratio.sum())
features = hw.PCA_FEATURES
for i, row in enumerate(comps):
    j = np.argmax(np.abs(row))
    print(f'PC{i+1} top feature: {features[j]} (loading={row[j]:.3f})')

In [None]:
# ⚠️ Do not change this part!  
# These assertions mirror the VPL auto-grading checks.  
# Use them locally to confirm that your function works properly.

assert ratio.shape == (3,) and comps.shape == (3, 14)
print("All local checks passed!")


## ✨ Optional Practice (Not Graded)
---

This part is **not graded**, but it is highly recommended as extra practice.  

Congratulations! You've completed your assignment as a **data science consultant** for the hotels. You have:
- Cleaned and structured messy booking data,  
- Computed KPIs to get a clear picture of cancellations,  
- Explored patterns across features,  
- Applied PCA to simplify and highlight the most important drivers.  

Now comes the most important part: **communication**.  
The hotel managers don't want code or plots; they want a clear story. As a consultant, your analysis is only as valuable as the clarity of your recommendations.  
Take a few minutes to **summarize your findings in plain language** as if you were writing a short note to the hotel managers.  

Your note should answer questions like:  
- What are the **main reasons for cancellations**?  
- Which factors appear less important?  
- What practical advice can managers take away?  

Keep it short: 5–7 sentences is enough. Managers don't have time to read a long report! 😄 

Think of this as practicing the skill of **turning data into decisions**, which is just as important as the analysis itself.  

This closes the loop: from raw data → structured workflow → actionable business insight.  
