<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/labs/lab01_forecasting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 1: Time Series Exploration & Forecasting
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

**Covers:** Chapters 1–2 (Rolling Windows, Resampling, Decomposition, SARIMAX, Prophet)

**Points:** 50 | **Due:** See Canvas for deadline | **Submission:** Download as .ipynb and upload to Canvas

**Dataset:** Florida Hotel Occupancy: quarterly data from 2005–2024 (80 observations, 10 columns). You will choose **one numeric column** as your forecasting target.

| Part | Skills Tested | Points |
|------|--------------|--------|
| A: Exploration (Week 1 skills) | Rolling windows, resampling, decomposition | 20 |
| B: Forecasting (Week 2 skills) | SARIMAX, Prophet, comparison | 20 |
| C: Reflection | Written analysis | 10 |

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1A5276;">💡 GRADING PHILOSOPHY</strong><br>
  This lab rewards <strong>process over perfection</strong>. If your code doesn't work but you explain what you tried and what went wrong, you earn most of the points. A student who writes "I tried X, it failed because Y, so I adjusted to Z" earns more than one who submits broken code with no explanation.
</div>

### Student Information

- **Name:**
- **Date:**
- **Target Column Chosen:**

---
## Setup

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run both cells below. Do not modify them.
</div>

In [None]:
# ============================================================
# Setup: Run this cell first. Do not modify.
# ============================================================
!pip install -q pmdarima prophet

In [None]:
# ============================================================
# Imports & Data Loading: Run this cell. Do not modify.
# ============================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings, logging

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose
from pmdarima import auto_arima
from prophet import Prophet
from sklearn.metrics import mean_squared_error, r2_score

warnings.filterwarnings("ignore")
logging.getLogger("prophet").setLevel(logging.WARNING)
logging.getLogger("cmdstanpy").setLevel(logging.WARNING)
plt.rcParams["figure.figsize"] = (12, 5)
plt.rcParams["figure.dpi"] = 100

# Load data
data_url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/florida_hotel_occupancy.csv"
hotel_df = pd.read_csv(data_url, parse_dates=["quarter_start"], index_col="quarter_start")

print(f"Dataset: {hotel_df.shape[0]} rows × {hotel_df.shape[1]} columns")
print(f"\nAvailable columns:")
for col in hotel_df.columns:
    print(f"  • {col} - range: {hotel_df[col].min():.1f} to {hotel_df[col].max():.1f}")

---
## Choose Your Target Column

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Pick <strong>one</strong> numeric column from the dataset as your forecasting target. Set it in the cell below. Do NOT use <code>occupancy_rate_pct</code>; that was the group exercise target.
</div>

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
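  <strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
  Make sure to set the frequency with <code>.asfreq("QS")</code> after extracting your column. Without it, the index carries no explicit quarterly frequency, so SARIMAX (statsmodels) has to infer it. Prophet does not read the index frequency; it takes its dates from the <code>ds</code> column.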
</div>
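
As a rough illustration of the frequency point above, the sketch below uses a small made-up quarterly series (not the lab dataset; `toy` and its values are hypothetical). It shows that an index parsed from date strings has no frequency attached until `.asfreq("QS")` is applied.

```python
import pandas as pd

# Hypothetical toy data: quarter-start dates parsed from strings,
# similar to what read_csv(..., parse_dates=...) produces.
dates = pd.to_datetime(["2023-01-01", "2023-04-01", "2023-07-01", "2023-10-01"])
toy = pd.Series([100.0, 120.0, 140.0, 110.0], index=dates)

# Before: the index holds the right dates but no frequency information.
print("freq before:", toy.index.freq)

# After: asfreq("QS") attaches a quarter-start frequency to the index.
toy_q = toy.asfreq("QS")
print("freq after:", toy_q.index.freqstr)
```

If `.asfreq("QS")` introduces `NaN` values in your own series, some dates in the index are probably not quarter starts, which is worth checking before modeling.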

In [None]:
# Choose your target column (change the string below)
TARGET_COLUMN = "avg_daily_rate_usd"  # ← CHANGE THIS to your chosen column

# Extract as time series with quarterly frequency
ts_data = hotel_df[TARGET_COLUMN].asfreq("QS")

print(f"Target: {TARGET_COLUMN}")
print(f"Observations: {len(ts_data)}")
print(f"Range: {ts_data.min():.2f} to {ts_data.max():.2f}")
print(f"Mean: {ts_data.mean():.2f}")

---
# Part A: Time Series Exploration (20 points)

Apply Week 1 skills to explore your chosen time series.

### Task A1: Time Series Plot (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Create a line plot of your target variable over time. Include a descriptive title, axis labels, and a grid.
</div>

In [None]:
# A1: Time series plot
# YOUR CODE HERE


### Task A2: Rolling Windows (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Calculate a 4-quarter rolling mean and a 4-quarter rolling standard deviation. Plot both on the same chart as the original data (3 lines total).
</div>

In [None]:
# A2: Rolling windows
# YOUR CODE HERE


<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Part A Checkpoint 1</strong><br>
  <ul>
    <li>Your time series plot shows 80 data points from 2005 to 2024</li>
    <li>The rolling mean line is smoother than the original because the 4-quarter window averages out the seasonal swings</li>
    <li>The rolling std line shows whether volatility is increasing or decreasing over time</li>
  </ul>
</div>

### Task A3: Resampling (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Resample your quarterly data to <strong>annual frequency</strong> using the mean. Plot the annual version alongside the quarterly original.
</div>
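
To make the resampling step concrete, here is a minimal sketch on a made-up two-year quarterly series (the name `toy` and its values are hypothetical, not the lab data). Each year's four quarters collapse into one annual mean labeled with the year-start date.

```python
import pandas as pd

# Hypothetical toy data: 8 quarters covering 2022 and 2023.
idx = pd.date_range("2022-01-01", periods=8, freq="QS")
toy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0], index=idx)

# "YS" groups by calendar year and labels each group with January 1.
annual = toy.resample("YS").mean()
print(annual)
```

The annual series has one point per year, so when it is plotted next to the quarterly data it looks like a coarse trend line without the within-year pattern.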

In [None]:
# A3: Resample to annual
# Hint: ts_data.resample("YS").mean()
# YOUR CODE HERE


### Task A4: ADF Stationarity Test (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run the Augmented Dickey-Fuller test. Print the test statistic, p-value, and a conclusion about stationarity. Then write 1–2 sentences explaining what the result means for forecasting.
</div>

In [None]:
# A4: ADF test
# YOUR CODE HERE


**Your interpretation (1–2 sentences):**

*(Write here)*

### Task A5: Seasonal Decomposition (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Run <code>seasonal_decompose</code> with <code>period=4</code>. Display the decomposition plot. Then write 1–2 sentences describing the seasonal pattern you see.
</div>

In [None]:
# A5: Seasonal decomposition
# YOUR CODE HERE


**Your interpretation (1–2 sentences):**

*(Write here)*

---
# Part B: Forecasting (20 points)

Apply Week 2 skills: train/test split, SARIMAX, Prophet, and comparison.

### Task B1: Train/Test Split (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Split the data: first 64 quarters for training, remaining 16 for testing. Print the date ranges for each set and create a visualization showing the split.
</div>

In [None]:
# B1: Train/test split
# YOUR CODE HERE


### Task B2: SARIMAX Forecast (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Use <code>auto_arima(train, seasonal=True, m=4)</code> to find parameters, then fit SARIMAX and forecast the test period. Print the parameters, RMSE, and R². Plot the forecast against actuals.
</div>

In [None]:
# B2: SARIMAX
# YOUR CODE HERE


<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #922B21;">🛑 STOP AND CHECK: Part B Checkpoint</strong><br>
  <ul>
    <li>auto_arima returned a set of parameters (p,d,q)(P,D,Q,4)</li>
    <li>RMSE is a reasonable number (not zero, not astronomically large)</li>
    <li>The forecast line roughly follows the actual seasonal pattern</li>
  </ul>
  If your forecast is flat, check that <code>m=4</code> is set in <code>auto_arima</code>.
</div>
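
To help judge whether an RMSE is "reasonable", the sketch below computes RMSE and R² by hand with NumPy on made-up numbers (`actual` and `forecast` are hypothetical, not lab results). In the lab itself, `mean_squared_error` and `r2_score` from the imports cell are available and should give the same kind of values.

```python
import numpy as np

# Hypothetical actuals and forecasts for four test quarters.
actual = np.array([100.0, 110.0, 120.0, 130.0])
forecast = np.array([98.0, 112.0, 119.0, 133.0])

# RMSE: square root of the mean squared error, in the units of the target.
errors = actual - forecast
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```

Because RMSE is in the target's own units, comparing it with the mean printed for your target column gives a quick sense of scale. An R² below zero means the forecast did worse than predicting the test-period mean.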

### Task B3: Prophet Forecast (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Fit a Prophet model on the training data with <code>seasonality_mode="multiplicative"</code>. Forecast the test period. Print RMSE and R². Plot the forecast against actuals. Display the component plots.
</div>

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
  Prophet requires columns named exactly <code>ds</code> and <code>y</code>. Create a new DataFrame: <code>pd.DataFrame({"ds": train.index, "y": train.values})</code>
</div>
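
The reshaping described above can be sketched on a toy series without loading Prophet at all. `toy_train` and its values are hypothetical stand-ins for your own training split.

```python
import pandas as pd

# Hypothetical toy training series with a quarterly DatetimeIndex.
idx = pd.date_range("2023-01-01", periods=4, freq="QS")
toy_train = pd.Series([100.0, 120.0, 140.0, 110.0], index=idx, name="avg_daily_rate_usd")

# Prophet expects a plain DataFrame: dates in "ds", values in "y".
prophet_df = pd.DataFrame({"ds": toy_train.index, "y": toy_train.values})

print(prophet_df.columns.tolist())
print(prophet_df.dtypes)
```

The original column name and the index are dropped here on purpose; only the `ds` and `y` columns carry over into the model.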

In [None]:
# B3: Prophet
# YOUR CODE HERE


### Task B4: Model Comparison (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Create a comparison table and a combined overlay plot showing both forecasts on the same chart. Declare a winner.
</div>

In [None]:
# B4: Comparison
# YOUR CODE HERE


### Task B5: Future Forecast (4 points)

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
  <strong style="color: #1E8449;">✅ DO THIS</strong><br>
  Using the <strong>winning model</strong>, retrain on all 80 quarters and forecast <strong>8 quarters into the future</strong> (2025–2026). Plot the full history + future forecast.
</div>

In [None]:
# B5: Future forecast using winning model
# YOUR CODE HERE


---
# Part C: Reflection (10 points)

### C1: Model Analysis (5 points)

In 3–5 sentences, answer: **Why do you think the winning model performed better on your chosen variable?** Consider the characteristics of your target column (trend strength, seasonal amplitude, COVID disruption, volatility) and how each model handles those features.

**Your answer:**

*(Write here)*

### C2: Real-World Application (5 points)

In 3–5 sentences, answer: **If you were presenting this forecast to a hotel executive in Miami, what caveats or limitations would you mention?** Think about: sample size, external events, model assumptions, and the difference between the test period and the future.

**Your answer:**

*(Write here)*

---
## Troubleshooting

| Problem | Likely Cause | Fix |
|---------|-------------|-----|
| `ModuleNotFoundError: prophet` | Package didn't install | Re-run the install cell; restart runtime if needed |
| Flat SARIMAX forecast | `m` not set or `D=0` | Ensure `m=4` in `auto_arima` |
| Prophet flat forecast | Wrong seasonality mode | Try `"additive"` instead of `"multiplicative"` |
| `LinAlgError` | Edge-case parameters | Pass `enforce_stationarity=False, enforce_invertibility=False` to `SARIMAX` |
| Rolling window starts with NaN | Expected behavior | First `window-1` values are NaN; use `.dropna()` for calculations |
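
The rolling-window row above can be seen on a tiny made-up series (the name `toy` and its values are hypothetical). With a window of 4, the first three results are `NaN` because a full window is not yet available.

```python
import pandas as pd

# Hypothetical toy data: six consecutive observations.
toy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# A 4-period rolling mean needs 4 values, so positions 0-2 are NaN.
rolling_mean = toy.rolling(window=4).mean()
print(rolling_mean)

# dropna() keeps only the positions where a full window existed.
print(rolling_mean.dropna())
```

For plotting, the leading `NaN` values can usually be left in place, since matplotlib simply starts the line at the first valid point.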

---
<p style="color:#7F8C8D; font-size:0.85em;">
<em>CAP4767 Data Mining with Python | Miami Dade College | Spring 2026</em><br>
Lab 1: Time Series Exploration & Forecasting (Chapters 1–2) | 50 Points
</p>