<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/exercises/week02_group_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2 Group Exercise: Forecasting Florida Hotel Occupancy
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Objective:** Apply the SARIMAX and Prophet forecasting pipeline to quarterly Florida hotel occupancy data, compare model performance, and present a recommendation to a simulated hotel revenue team.

**Time:** ~60 minutes | **Deliverable:** Completed notebook (one per group) + 3-minute presentation

**What you'll practice:**
- Exploratory analysis of a real-world seasonal time series
- Stationarity testing (ADF)
- SARIMAX parameter selection with `auto_arima`
- Prophet forecasting with direct parameters
- Model comparison using RMSE and R²
- Communicating results to a non-technical audience

### Group Members & Roles

Self-assign one role per person. If your group has fewer than 4 members, combine roles.

| Role | Name | Responsibility |
|------|------|----------------|
| 🖥️ **Lead Coder** | | Drives the notebook, types the code |
| 📊 **Data Interpreter** | | Reads outputs aloud, explains what the numbers mean |
| 🎤 **Presenter** | | Delivers the 3-minute share-out to the class |
| ✅ **QA Reviewer** | | Checks outputs against checkpoints, catches errors |

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GROUP DISCUSSION (before coding)</strong><br>
  Take 3 minutes to discuss: <em>If you were managing a hotel in Miami Beach, what time of year would you expect the highest and lowest occupancy? What external events might cause unexpected spikes or dips?</em>
</div>

**Our group's answer:**

*(Type your response here)*

---

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run the two cells below to install packages and load the data. Do not modify these cells.
</div>

In [None]:
# ============================================================
# Setup: Run this cell first. Do not modify.
# ============================================================
!pip install -q pmdarima prophet

In [None]:
# ============================================================
# Imports & Data Loading: Run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import logging

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose
from pmdarima import auto_arima
from prophet import Prophet
from sklearn.metrics import mean_squared_error, r2_score

warnings.filterwarnings("ignore")
logging.getLogger("prophet").setLevel(logging.WARNING)
logging.getLogger("cmdstanpy").setLevel(logging.WARNING)

plt.rcParams["figure.figsize"] = (12, 5)
plt.rcParams["figure.dpi"] = 100

# Load Florida hotel occupancy data
data_url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/florida_hotel_occupancy.csv"
hotel_df = pd.read_csv(data_url, parse_dates=["quarter_start"], index_col="quarter_start")

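print(f"Dataset: {hotel_df.shape[0]} rows × {hotel_df.shape[1]} columns")
# strftime has no quarter directive, so build the label from .year and .quarter
first_q, last_q = hotel_df.index[0], hotel_df.index[-1]
print(f"Date range: {first_q.year} Q{first_q.quarter} to {last_q.year} Q{last_q.quarter}")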
print(f"\nColumns: {hotel_df.columns.tolist()}")
hotel_df.head()

---
## Step 1: Explore the Data

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Before forecasting, we need to understand our data. The Florida hotel dataset has 10 columns, but for time series forecasting we need to pick <strong>one target variable</strong> and turn it into a single series with a datetime index and a frequency. We'll use <code>occupancy_rate_pct</code>, the percentage of hotel rooms occupied each quarter.
</div>

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  1. Display summary statistics for the dataset using <code>.describe()</code><br>
  2. Extract the <code>occupancy_rate_pct</code> column as a pandas Series<br>
  3. Set its frequency to quarterly with <code>.asfreq("QS")</code>
</div>
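
To see what `.asfreq("QS")` does in isolation, here is a minimal sketch on a made-up four-quarter series. The name `toy` and its values are illustrative assumptions, not taken from the hotel dataset.

```python
import pandas as pd

# Hypothetical quarterly values; the real data comes from hotel_df
dates = pd.to_datetime(["2005-01-01", "2005-04-01", "2005-07-01", "2005-10-01"])
toy = pd.Series([78.0, 70.5, 64.2, 66.9], index=dates, name="occupancy_rate_pct")

print(toy.index.freq)           # no frequency attached yet
toy_qs = toy.asfreq("QS")       # declare quarter-start spacing
print(toy_qs.index.freqstr)     # a quarter-start alias beginning with "QS"
print(toy_qs.isna().sum())      # a skipped quarter would show up here as NaN
```

If the source data were missing a quarter, `.asfreq("QS")` would insert that date with a `NaN` value, which is why the ADF cell later calls `.dropna()`.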

In [None]:
# 1a. Summary statistics
# YOUR CODE HERE


# 1b. Extract occupancy rate as a time series
ts_data = hotel_df["occupancy_rate_pct"].asfreq("QS")

print(f"Time series: {len(ts_data)} quarters")
print(f"Mean occupancy: {ts_data.mean():.1f}%")
print(f"Min: {ts_data.min():.1f}% | Max: {ts_data.max():.1f}%")

---
## Step 2: Exploratory Analysis (Plot, Test, Decompose)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Complete all three EDA tasks below:
  <ol>
    <li>Plot the time series</li>
    <li>Run the ADF stationarity test</li>
    <li>Run seasonal decomposition with <code>period=4</code></li>
  </ol>
</div>

In [None]:
# 2a. Time series plot
plt.figure(figsize=(12, 5))
# YOUR CODE HERE: plot ts_data with a title, axis labels, and grid
# Hint: plt.plot(ts_data.index, ts_data.values, ...)



plt.tight_layout()
plt.show()

In [None]:
# 2b. ADF Stationarity Test
adf_result = adfuller(ts_data.dropna(), autolag="AIC")

print(f"ADF Statistic: {adf_result[0]:.4f}")
print(f"P-value:       {adf_result[1]:.4f}")
print()
if adf_result[1] < 0.05:
    print("✅ Data IS stationary (p < 0.05)")
else:
    print("⚠️  Data is NOT stationary (p ≥ 0.05): SARIMAX will need differencing")

In [None]:
# 2c. Seasonal Decomposition
decomposition = seasonal_decompose(ts_data, model="additive", period=4)
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.tight_layout()
plt.show()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Checkpoint 1</strong><br>
  Before moving on, confirm:
  <ul>
    <li>Your time series plot shows 80 quarterly data points from 2005 to 2024</li>
    <li>You can see a clear seasonal pattern: Q1 peaks (snowbird season) and Q3/Q4 dips</li>
    <li>There's a visible dip around 2020 (COVID impact)</li>
    <li>The ADF test returned a p-value (stationary or not; either result is fine)</li>
    <li>The decomposition shows trend, seasonal, and residual components</li>
  </ul>
</div>
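
One way to confirm the seasonal pattern numerically rather than by eye is to average the series by calendar quarter. The sketch below uses a synthetic series (`demo`, with invented offsets) as a stand-in for `ts_data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for ts_data: 8 years of quarterly values with a built-in Q1 peak
idx = pd.date_range("2005-01-01", periods=32, freq="QS")
seasonal_bump = np.tile([6.0, 1.0, -4.0, -3.0], 8)   # Q1..Q4 offsets (made up)
demo = pd.Series(68.0 + seasonal_bump, index=idx)

# Average occupancy for each calendar quarter (1-4)
by_quarter = demo.groupby(demo.index.quarter).mean()
print(by_quarter)
print("Peak quarter:", by_quarter.idxmax())
```

Applying the same `groupby` to `ts_data` should, if the checkpoint holds, put quarter 1 at the top.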

---
## Step 3: Train/Test Split

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">üí° WHY ARE WE DOING THIS?</strong><br>
  Time series splits must respect chronological order: we can't use future data to predict the past. We'll use 64 quarters for training (2005–2020) and 16 quarters for testing (2021–2024). That gives us 4 full seasonal cycles in the test set.
</div>

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Split the data: first 64 quarters as <code>train</code>, remaining as <code>test</code>. Then plot both with a red dashed line at the split point.
</div>

In [None]:
# 3. Train/test split
train = ts_data.iloc[:64]
test = ts_data.iloc[64:]

print(f"Train: {len(train)} quarters ({train.index[0].year}–{train.index[-1].year})")
print(f"Test:  {len(test)} quarters ({test.index[0].year}–{test.index[-1].year})")

# Visualize the split
plt.figure(figsize=(12, 5))
# YOUR CODE HERE: plot train (blue) and test (orange) with a red dashed vertical line at the split
# Hint: plt.axvline(x=test.index[0], color='red', linestyle='--', label='Split')



plt.tight_layout()
plt.show()

---
## Step 4: SARIMAX Forecasting

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  SARIMAX is a classical statistical model that captures both trend and seasonality through autoregressive and moving-average terms combined with differencing. We use <code>auto_arima</code> to automatically find the best (p,d,q)(P,D,Q,s) parameters by minimizing AIC.
</div>
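
To make "differencing" concrete, here is a small sketch of seasonal differencing (the `D=1, s=4` part of the notation) done by hand with pandas. The series is synthetic and every number in it is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend + repeating quarterly pattern (all values made up)
idx = pd.date_range("2005-01-01", periods=24, freq="QS")
trend = 60.0 + 0.25 * np.arange(24)
season = np.tile([6.0, 1.0, -4.0, -3.0], 6)
demo = pd.Series(trend + season, index=idx)

# D=1 with s=4: subtract the value from the same quarter one year earlier
seasonal_diff = demo.diff(4).dropna()

# The quarterly pattern cancels and the linear trend collapses to a constant step
print(seasonal_diff.round(6).unique())
```

Real data will not difference to a perfect constant, but the same operation removes most of the repeating pattern so the remaining terms have less to explain.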

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  1. Run <code>auto_arima</code> on the training data with <code>seasonal=True, m=4</code><br>
  2. Extract the order and seasonal_order parameters<br>
  3. Fit a SARIMAX model and forecast the test period<br>
  4. Calculate RMSE and R²
</div>

In [None]:
# 4a. auto_arima: find best parameters
auto_model = auto_arima(
    train,
    seasonal=True,
    m=4,
    suppress_warnings=True,
    error_action="ignore",
    trace=False
)

order = auto_model.order
seasonal_order = auto_model.seasonal_order

print(f"Best SARIMAX order: {order}")
print(f"Seasonal order:     {seasonal_order}")
print(f"AIC: {auto_model.aic():.2f}")

In [None]:
# 4b. Fit SARIMAX and forecast
sarimax_params = {
    "p": order[0], "d": order[1], "q": order[2],
    "P": seasonal_order[0], "D": seasonal_order[1],
    "Q": seasonal_order[2], "s": seasonal_order[3]
}

model = SARIMAX(
    train,
    order=(sarimax_params["p"], sarimax_params["d"], sarimax_params["q"]),
    seasonal_order=(sarimax_params["P"], sarimax_params["D"], sarimax_params["Q"], sarimax_params["s"]),
    enforce_stationarity=False,
    enforce_invertibility=False
)
sarimax_result = model.fit(disp=False)
sarimax_forecast = sarimax_result.forecast(steps=len(test))

# Evaluate
sarimax_rmse = np.sqrt(mean_squared_error(test, sarimax_forecast))
sarimax_r2 = r2_score(test, sarimax_forecast)

print(f"SARIMAX RMSE: {sarimax_rmse:.2f} percentage points")
print(f"SARIMAX R²:   {sarimax_r2:.4f}")

In [None]:
# 4c. Plot SARIMAX forecast vs actuals
plt.figure(figsize=(12, 5))
plt.plot(train.index, train, label="Train", color="steelblue")
plt.plot(test.index, test, label="Test (Actual)", color="darkorange")
plt.plot(test.index, sarimax_forecast, label="SARIMAX Forecast", linestyle="--", color="green")
plt.axvline(x=test.index[0], color="red", linestyle="--", alpha=0.5, label="Split")
plt.title(f"SARIMAX Forecast: Florida Hotel Occupancy (RMSE={sarimax_rmse:.2f}, R²={sarimax_r2:.4f})")
plt.xlabel("Quarter")
plt.ylabel("Occupancy Rate (%)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Checkpoint 2</strong><br>
  <ul>
    <li><code>auto_arima</code> returned parameters. Write them down; you'll need them for the comparison</li>
    <li>RMSE should be in the range of 1–5 percentage points (if it's 20+, something went wrong)</li>
    <li>The green forecast line should roughly follow the orange actual line's seasonal pattern</li>
  </ul>
</div>
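
For reference, both metrics can be computed by hand, which helps when explaining them to a non-technical audience. The sketch below uses invented numbers and plain NumPy; it follows the standard definitions that `mean_squared_error` and `r2_score` implement.

```python
import numpy as np

# Invented actuals and forecasts, in occupancy percentage points
actual = np.array([75.0, 70.0, 64.0, 66.0])
predicted = np.array([73.0, 71.0, 66.0, 65.0])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))            # typical miss, same units as the data

ss_res = np.sum(errors ** 2)                    # variation the forecast failed to explain
ss_tot = np.sum((actual - actual.mean()) ** 2)  # variation around the mean of the actuals
r2 = 1.0 - ss_res / ss_tot                      # can be negative for a poor forecast

print(f"RMSE: {rmse:.2f} | R²: {r2:.4f}")
```

Because R² compares the forecast against simply predicting the test-set mean, a negative value means the model did worse than that flat baseline.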

---
## Step 5: Prophet Forecasting

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
  Prophet takes a completely different approach: it models trend and seasonality as additive or multiplicative components using curves, not autoregression. It doesn't need stationarity or frequency metadata. We run both to see which one captures Florida's seasonal hotel patterns better.
</div>
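
The difference between additive and multiplicative seasonality can be sketched with a few lines of NumPy. The trend levels and seasonal percentages below are made up, and this is only a simplified picture of how the two modes combine components, not Prophet's internal code.

```python
import numpy as np

# Made-up components for two years of quarterly data at two different trend levels
trend = np.array([60.0, 61.0, 62.0, 63.0, 80.0, 81.0, 82.0, 83.0])
pattern = np.tile([0.10, 0.02, -0.07, -0.05], 2)   # seasonal effect per quarter

# Additive: the seasonal swing is a fixed number of points (scaled to a reference level of 60)
additive = trend + 60.0 * pattern

# Multiplicative: the seasonal swing grows or shrinks with the trend level
multiplicative = trend * (1.0 + pattern)

print("Q1 bump, additive:      ", additive[0] - trend[0], additive[4] - trend[4])
print("Q1 bump, multiplicative:", multiplicative[0] - trend[0], multiplicative[4] - trend[4])
```

If the seasonal swings in your time series plot look about the same size in every year, additive may fit as well as multiplicative; if they widen as occupancy rises, multiplicative is the more natural choice.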

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
  Prophet requires a DataFrame with columns named exactly <code>ds</code> (dates) and <code>y</code> (values). Any other column names will cause an error.
</div>

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  1. Create a Prophet-formatted DataFrame from the training data<br>
  2. Fit a Prophet model with multiplicative seasonality<br>
  3. Forecast the test period and calculate RMSE and R²
</div>

In [None]:
# 5a. Prepare Prophet data format
train_prophet = pd.DataFrame({"ds": train.index, "y": train.values})

# 5b. Fit Prophet
prophet_params = {
    "changepoint_prior_scale": 0.05,
    "seasonality_prior_scale": 10.0,
    "seasonality_mode": "multiplicative",
    "changepoint_range": 0.85
}

prophet_model = Prophet(
    changepoint_prior_scale=prophet_params["changepoint_prior_scale"],
    seasonality_prior_scale=prophet_params["seasonality_prior_scale"],
    seasonality_mode=prophet_params["seasonality_mode"],
    changepoint_range=prophet_params["changepoint_range"],
    yearly_seasonality=True,
    weekly_seasonality=False,
    daily_seasonality=False
)
prophet_model.fit(train_prophet)

# 5c. Forecast test period
future = prophet_model.make_future_dataframe(periods=len(test), freq="QS")
forecast_df = prophet_model.predict(future)
prophet_forecast = forecast_df["yhat"].iloc[-len(test):].values

# Evaluate
prophet_rmse = np.sqrt(mean_squared_error(test, prophet_forecast))
prophet_r2 = r2_score(test, prophet_forecast)

print(f"Prophet RMSE: {prophet_rmse:.2f} percentage points")
print(f"Prophet R²:   {prophet_r2:.4f}")

In [None]:
# 5d. Plot Prophet forecast vs actuals
plt.figure(figsize=(12, 5))
plt.plot(train.index, train, label="Train", color="steelblue")
plt.plot(test.index, test, label="Test (Actual)", color="darkorange")
plt.plot(test.index, prophet_forecast, label="Prophet Forecast", linestyle="--", color="purple")
plt.axvline(x=test.index[0], color="red", linestyle="--", alpha=0.5, label="Split")
plt.title(f"Prophet Forecast: Florida Hotel Occupancy (RMSE={prophet_rmse:.2f}, R²={prophet_r2:.4f})")
plt.xlabel("Quarter")
plt.ylabel("Occupancy Rate (%)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# 5e. Prophet component plots
fig = prophet_model.plot_components(forecast_df)
plt.tight_layout()
plt.show()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Checkpoint 3</strong><br>
  <ul>
    <li>Prophet RMSE should be in the range of 1–6 percentage points</li>
    <li>The component plots should show an upward trend and a clear seasonal pattern</li>
    <li>The seasonal pattern should show Q1 as the highest occupancy quarter (snowbird season)</li>
  </ul>
</div>

---
## Step 6: Compare Models & Build Recommendation

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">‚úÖ DO THIS</strong><br>
  1. Build a comparison table<br>
  2. Create a combined overlay plot showing both forecasts<br>
  3. Write your group's recommendation
</div>

In [None]:
# 6a. Comparison table
comparison = pd.DataFrame({
    "Metric": ["RMSE (% points)", "R²", "Approach", "Requires Stationarity?", "Explainability"],
    "SARIMAX": [
        f"{sarimax_rmse:.2f}",
        f"{sarimax_r2:.4f}",
        f"Statistical, order {order}",
        "Yes (differencing)",
        "Coefficient table"
    ],
    "Prophet": [
        f"{prophet_rmse:.2f}",
        f"{prophet_r2:.4f}",
        "Decomposable: trend + seasonality curves",
        "No (handles trend internally)",
        "Component plots"
    ]
})
print(comparison.to_string(index=False))

print()
if sarimax_rmse < prophet_rmse:
    print(f"🏆 SARIMAX wins on RMSE by {prophet_rmse - sarimax_rmse:.2f} percentage points")
else:
    print(f"🏆 Prophet wins on RMSE by {sarimax_rmse - prophet_rmse:.2f} percentage points")

In [None]:
# 6b. Combined overlay plot: both models on one chart
plt.figure(figsize=(12, 5))
plt.plot(train.index, train, label="Train", color="steelblue", alpha=0.6)
plt.plot(test.index, test, label="Actual", color="darkorange", linewidth=2)
plt.plot(test.index, sarimax_forecast, label=f"SARIMAX (RMSE={sarimax_rmse:.2f})", linestyle="--", color="green")
plt.plot(test.index, prophet_forecast, label=f"Prophet (RMSE={prophet_rmse:.2f})", linestyle="--", color="purple")
plt.axvline(x=test.index[0], color="red", linestyle="--", alpha=0.3)
plt.title("Model Comparison: SARIMAX vs Prophet on Florida Hotel Occupancy")
plt.xlabel("Quarter")
plt.ylabel("Occupancy Rate (%)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## Step 7: Group Recommendation

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  As a group, write a 3–5 sentence recommendation as if you're presenting to a Florida hotel revenue team. Address:
  <ol>
    <li>Which model performed better and by how much?</li>
    <li>Which quarters did each model struggle with most?</li>
    <li>Which model would you recommend for production forecasting and why?</li>
  </ol>
</div>
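
To answer the question about which quarters each model struggled with, one option is to group absolute errors by calendar quarter. The sketch below uses placeholder numbers; `actual` and `forecast` are hypothetical stand-ins for `test` and one model's forecast.

```python
import pandas as pd

# Placeholder actuals and forecasts for two test years (invented numbers)
idx = pd.date_range("2021-01-01", periods=8, freq="QS")
actual = pd.Series([70.0, 68.0, 63.0, 65.0, 76.0, 71.0, 65.0, 67.0], index=idx)
forecast = pd.Series([74.0, 69.0, 64.0, 65.5, 75.0, 70.0, 64.5, 66.0], index=idx)

abs_error = (actual - forecast).abs()

# Mean absolute error for each calendar quarter (1-4)
error_by_quarter = abs_error.groupby(abs_error.index.quarter).mean()
print(error_by_quarter)
print("Hardest quarter:", error_by_quarter.idxmax())
```

In this notebook `prophet_forecast` is a plain NumPy array, so it would first need a date index, for example `pd.Series(prophet_forecast, index=test.index)`, before the same grouping applies.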

**Our group's recommendation:**

*(Type your response here; this will be the basis of your 3-minute presentation)*

---

## Troubleshooting

| Problem | Likely Cause | Fix |
|---------|-------------|-----|
| `ModuleNotFoundError: prophet` | Package didn't install | Re-run the install cell; if it fails, restart the runtime and try again |
| Prophet produces a flat forecast | `seasonality_mode` might be wrong for this data | Try changing to `"additive"` and re-run |
| SARIMAX `LinAlgError` or convergence warning | Edge-case parameters | Add `enforce_stationarity=False, enforce_invertibility=False` (already included above) |
| RMSE is extremely large (20+) | Likely a data formatting issue | Check that `ts_data` is a Series with a DatetimeIndex and `QS` frequency |
| `ValueError: freq QS not understood` | Older pandas version | Colab should be fine, but check `pd.__version__`; it needs to be 1.3+ |
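
For the data-formatting row above, a quick programmatic check can save time. The helper below is a hypothetical illustration (its name and messages are assumptions, not part of pandas or this notebook) of what "a Series with a DatetimeIndex and `QS` frequency" means in code.

```python
import pandas as pd

def check_quarterly_series(series):
    """Return a list of problems found; an empty list means the series looks fine.

    Hypothetical helper for illustration; it is not part of pandas or the notebook.
    """
    problems = []
    if not isinstance(series, pd.Series):
        return ["not a pandas Series"]
    if not isinstance(series.index, pd.DatetimeIndex):
        return ["index is not a DatetimeIndex"]
    freq = series.index.freqstr
    if freq is None or not freq.startswith("QS"):
        problems.append(f"frequency is {freq!r}, expected quarter-start (QS)")
    if series.isna().any():
        problems.append(f"{int(series.isna().sum())} missing value(s)")
    return problems

good = pd.Series(range(8), index=pd.date_range("2005-01-01", periods=8, freq="QS"), dtype=float)
bad = pd.Series([1.0, 2.0], index=["2005Q1", "2005Q2"])

print(check_quarterly_series(good))
print(check_quarterly_series(bad))
```

Calling `check_quarterly_series(ts_data)` right after Step 1 would surface a missing frequency or stray `NaN` before it turns into a confusing model error.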

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Week 2 Group Exercise: SARIMAX & Prophet Forecasting on Florida Hotel Occupancy Data
</p>