<a href="https://colab.research.google.com/github/c-marq/cap4767-data-mining/blob/main/demos/week01_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1 Demo: Time Series Foundations
**CAP4767 Data Mining with Python** | Miami Dade College, Kendall Campus

---

**What we're building today:** The core pandas toolkit for working with time-indexed data: generating date ranges, reindexing, resampling, rolling windows, and running totals.

**Why this matters for data mining:** Nearly every real-world dataset has a time dimension. Stock prices, customer transactions, sensor readings, wildfire records: the patterns hidden in *when* things happen are often more valuable than the events themselves. This week gives you the foundation that every forecasting model in Weeks 2–7 will depend on.

**Datasets:**
- `AAPL.csv`: Apple stock price data (daily OHLC, 2020)
- `acresBurned.csv`: California wildfire acres burned by discovery date (1992–2015)

**Adapted from:** Murach's *Python for Data Science*, Chapter 9

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
<strong>Where does this fit in the data mining pipeline?</strong><br><br>Time series analysis is the <em>first stage</em> of our forecasting pipeline. Before we can predict anything (Week 2: SARIMAX &amp; Prophet), we need to know how to manipulate time-indexed data: resample it, smooth it, and reshape it. Think of today's skills like learning to prep ingredients before you cook. Every forecasting model downstream expects clean, properly indexed time series as input.
</div>

<div style="background-color: #D5F5E3; border-left: 5px solid #27AE60; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1E8449;">✅ DO THIS</strong><br>
Run the next two cells to load our libraries and datasets. <strong>Do not modify these cells.</strong>
</div>

In [None]:
# ============================================================
# Setup: run this cell. Do not modify.
# ============================================================
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries loaded successfully.")

In [None]:
# ============================================================
# Load datasets: run this cell. Do not modify.
# ============================================================
# TODO: Replace these URLs with your GitHub raw URLs after pushing to repo
aapl_url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/stocks.csv"
acres_url = "https://raw.githubusercontent.com/c-marq/cap4767-data-mining/refs/heads/main/data/acresBurned.csv"

# Apple stock data: daily OHLC prices for 2020
stockData = pd.read_csv(aapl_url,
                        usecols=['Date', 'Open', 'High', 'Low', 'Close'],
                        parse_dates=['Date'])
stockData.set_index('Date', inplace=True)

# California wildfire acres burned by discovery date
# NOTE: The CSV must include the 'discovery_date' column as the first column.
# If your CSV only has 'acres_burned', re-export from the .pkl with index=True.
acresBurned = pd.read_csv(acres_url,
                          index_col='discovery_date',
                          parse_dates=True)

print(f"✅ stockData loaded: {stockData.shape[0]} rows, {stockData.shape[1]} columns")
print(f"✅ acresBurned loaded: {acresBurned.shape[0]} rows, {acresBurned.shape[1]} columns")

---
## Section 1: How to Generate Time Periods

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
Before we can analyze time-series data, we need to understand how pandas represents time. The <code>pd.date_range()</code> function is your go-to tool for generating sequences of dates. Think of it like a ruler for time: you set the start, the end, and the spacing between tick marks. This becomes critical when we need to reindex stock data to specific intervals (like every Friday) or resample wildfire data to monthly totals.
</div>

### Monthly Start Dates (`freq='MS'`)
Generate the first day of every month in 2020:

In [None]:
# MS = Month Start: gives us Jan 1, Feb 1, Mar 1, etc.
pd.date_range('01/01/2020', '12/31/2020', freq='MS')

### Business Days (`freq='B'`)
Generate only weekday dates, with no Saturdays or Sundays:

In [None]:
# B = Business days: skips weekends automatically
pd.date_range('01/01/2020', '01/31/2020', freq='B')

### Weekly on Mondays (`freq='W-MON'`)
Generate every Monday in December 2020:

In [None]:
# W-MON = Weekly anchored on Monday
pd.date_range('12/01/2020', '12/31/2020', freq='W-MON')

### Sub-Daily: 12-Hour Intervals (`freq='12h'`)
Time periods aren't limited to days. We can go as granular as hours, minutes, or seconds:

In [None]:
# 12h = every 12 hours, useful for shift-based or sensor data
pd.date_range('01/01/2020', '01/31/2020', freq='12h')

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
<strong>Pandas version note:</strong> The Murach textbook uses uppercase frequency codes like <code>'H'</code>, <code>'M'</code>, and <code>'Q'</code>. Starting with pandas 2.2 (which Google Colab now uses), these are deprecated in favor of new codes, and later releases may reject them:<br>
• <code>'H'</code> → <code>'h'</code> (hours) · <code>'M'</code> → <code>'ME'</code> (month-end) · <code>'Q'</code> → <code>'QE'</code> (quarter-end)<br><br>
If you see an <code>Invalid frequency</code> error, switch to the updated code shown in this notebook. Codes like <code>'MS'</code>, <code>'B'</code>, <code>'W-FRI'</code>, and <code>'SMS'</code> are unchanged.
</div>

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #922B21;">🛑 STOP AND CHECK</strong><br>
<strong>Checkpoint: Section 1</strong><br><br>You should see four different <code>DatetimeIndex</code> outputs above:<br>• 12 monthly dates (Jan–Dec 2020)<br>• 23 business days in January 2020<br>• 4 Mondays in December 2020<br>• 61 twelve-hour intervals in January 2020<br><br>If any cell produced an error, check that the Setup cell ran successfully first.
</div>
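
As a quick self-check, the four counts listed in the checkpoint can be reproduced in one cell. This is only a sketch: the `ranges` and `counts` names are made up for illustration, and the calls are the same `pd.date_range()` calls used earlier in this section.

```python
import pandas as pd

# The four frequency codes from Section 1, each with the date span used above
ranges = {
    "MS": pd.date_range("01/01/2020", "12/31/2020", freq="MS"),
    "B": pd.date_range("01/01/2020", "01/31/2020", freq="B"),
    "W-MON": pd.date_range("12/01/2020", "12/31/2020", freq="W-MON"),
    "12h": pd.date_range("01/01/2020", "01/31/2020", freq="12h"),
}

# Number of timestamps generated for each frequency code
counts = {code: len(idx) for code, idx in ranges.items()}
print(counts)
```

The 12-hour range ends at midnight on January 31 because an end date without a time is read as 00:00, which is why it has 61 timestamps rather than 62.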

---
## Section 2: Reindexing with Datetime Indexes

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
Real-world data doesn't always come on the schedule we need. Stock markets are closed on weekends and holidays, but your analysis might need data at regular intervals: every Friday, every two weeks, or every month. <code>reindex()</code> lets us reshape a DataFrame to match a new set of dates. Dates with no matching data get <code>NaN</code>, and those NaN values are actually useful information: they tell us <em>when data is missing</em>.
</div>

### Quick look at our stock data

In [None]:
# AAPL daily stock data: the index is already a DatetimeIndex
stockData.head(3)

### Reindex to Fridays only
Let's extract only Friday closing prices ‚Äî a common view for weekly investment reporting:

In [None]:
# Generate every Friday in 2020
fridays = pd.date_range('01/01/2020', '12/31/2020', freq='W-FRI')
print(f"Generated {len(fridays)} Fridays in 2020")

# Reindex: keep only rows that fall on a Friday
stockData.reindex(fridays).head(3)

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
Notice: some Fridays may show <code>NaN</code>. This happens when the stock market was closed on that Friday (like Good Friday or, in 2020, Christmas Day). <code>reindex()</code> doesn't invent data: if there's no trading record for that date, you get NaN. That's the correct behavior.
</div>

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #922B21;">🛑 STOP AND CHECK</strong><br>
<strong>Checkpoint: Section 2</strong><br><br>Your Friday reindex should show 52 Fridays and display 3 rows of AAPL data. The first Friday (2020-01-03) should show Open ≈ 74.29 and Close ≈ 74.36.
</div>
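
To see the NaN behavior in isolation, here is a minimal sketch on made-up data rather than the AAPL file. It assumes a hypothetical trading calendar (April 2020 weekdays with Good Friday, April 10, removed) and placeholder `Close` values; the names `demo`, `weekly`, and `missing` are illustrative only.

```python
import pandas as pd

# Hypothetical trading calendar: April 2020 weekdays minus Good Friday (April 10)
good_friday = pd.Timestamp("2020-04-10")
trading_days = pd.bdate_range("2020-04-01", "2020-04-30").drop([good_friday])

# Placeholder prices, one per trading day (not real AAPL data)
demo = pd.DataFrame({"Close": [float(i) for i in range(len(trading_days))]},
                    index=trading_days)

# Reindex to every Friday in April 2020
fridays = pd.date_range("2020-04-01", "2020-04-30", freq="W-FRI")
weekly = demo.reindex(fridays)

# Fridays for which reindex() found no matching row
missing = weekly.index[weekly["Close"].isna()]
print(missing.strftime("%Y-%m-%d").tolist())
```

Listing the missing dates this way is a quick check on whether the gaps line up with known market closures or point to a data problem.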

---
## Section 3: Semi-Month Reindexing & Fixing Weekend Dates

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
Some financial reports use <strong>semi-monthly</strong> intervals: the 1st and 15th of each month. But what happens when the 1st or 15th lands on a weekend? The stock market is closed, so there's no data. We'll first see the problem, then build a custom function to fix it. This is a great example of why data wrangling is never just "load and go": real data always has edge cases.
</div>

### The Problem: Semi-Month Dates on Weekends

In [None]:
# SMS = Semi-Month Start (1st and 15th of each month)
semiMonths = pd.date_range('01/01/2020', '12/31/2020', freq='SMS')
print(f"Generated {len(semiMonths)} semi-month dates")
semiMonths[:6]  # Show the first few

In [None]:
# Reindex stock data to semi-month dates: notice the NaN values!
stockData.reindex(semiMonths).head()

In [None]:
# Plot it: the NaN values leave visible gaps in the line
stockData.reindex(semiMonths).plot(title='Semi-Month Reindex: Before Fix (Notice the Gaps)')
plt.tight_layout()

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
See the NaN rows? January 1st was a holiday (New Year's Day), February 1st was a Saturday, and February 15th was a Saturday, so none of these had trading data. The plot has visible gaps. We need a smarter approach.
</div>

### The Fix: A Custom `adjustDate()` Function
If a date falls on Saturday, shift it back to Friday. If it falls on Sunday, shift it forward to Monday:

In [None]:
def adjustDate(date):
    """Shift weekend dates to the nearest weekday.
    Saturday → Friday (go back 1 day)
    Sunday   → Monday (go forward 1 day)
    Weekdays → no change (weekday holidays are not adjusted)
    """
    if date.weekday() < 5:       # Mon-Fri (0-4): already a weekday
        return date
    elif date.weekday() == 5:    # Saturday: shift to Friday
        return date - dt.timedelta(days=1)
    else:                        # Sunday: shift to Monday
        return date + dt.timedelta(days=1)

In [None]:
# Apply the fix: adjust each semi-month date to the nearest business day
semiMonths = pd.date_range('01/01/2020', '12/31/2020', freq='SMS')
semiMonthsAdjusted = semiMonths.to_series().apply(adjustDate)

# Compare original vs adjusted: look at the dates that changed
comparison = pd.DataFrame({
    'Original': semiMonths[:6],
    'Adjusted': semiMonthsAdjusted.values[:6]
})
comparison['Changed?'] = comparison['Original'] != comparison['Adjusted']
comparison

In [None]:
# Now reindex with the adjusted dates: the weekend NaN rows are gone
# (Jan 1 is a weekday holiday, so adjustDate() leaves it alone and it stays NaN)
stockData.reindex(semiMonthsAdjusted).head()

In [None]:
# Plot the fixed version: the weekend gaps are gone
stockData.reindex(semiMonthsAdjusted).plot(
    title='Semi-Month Reindex: After Fix (Clean!)')
plt.tight_layout()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #922B21;">🛑 STOP AND CHECK</strong><br>
<strong>Checkpoint: Section 3</strong><br><br>Compare the two plots:<br>• <strong>Before fix:</strong> Broken lines with gaps where NaN values dropped out<br>• <strong>After fix:</strong> Smooth, continuous lines across the semi-month points<br><br>The adjusted reindex should show <strong>no NaN values for dates that originally fell on a weekend</strong>. February 1st (Saturday) should shift to January 31st, and March 1st (Sunday) should shift to March 2nd. January 1st is a weekday holiday, and <code>adjustDate()</code> only moves weekend dates, so that first row still shows NaN.
</div>
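
Weekday holidays such as New Year's Day are not handled by a weekend-only rule. One possible alternative, sketched here on made-up data, is to snap each target date to the first trading day on or after it using `Index.get_indexer(..., method="bfill")`. Note that this is a different rule from `adjustDate()` (a Saturday moves forward to Monday instead of back to Friday), and the calendar, the `prices` values, and the variable names are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical trading calendar: Jan-Feb 2020 weekdays minus New Year's Day
new_year = pd.Timestamp("2020-01-01")
trading_days = pd.bdate_range("2020-01-01", "2020-02-29").drop([new_year])

# Placeholder prices, one per trading day (not real AAPL data)
prices = pd.DataFrame({"Close": [float(i) for i in range(len(trading_days))]},
                      index=trading_days)

# Semi-month targets: Jan 1, Jan 15, Feb 1, Feb 15
targets = pd.date_range("2020-01-01", "2020-02-29", freq="SMS")

# Position of the first trading day on or after each target (-1 if none exists)
pos = prices.index.get_indexer(targets, method="bfill")
snapped = prices.index[pos[pos >= 0]]

print(snapped.strftime("%Y-%m-%d").tolist())
print(prices.loc[snapped].isna().sum())
```

Because every snapped date comes from the data's own index, the lookup cannot produce NaN rows, whichever holidays the calendar contains.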

---
## Section 4: Resampling Time Series Data

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
Reindexing picks specific dates from existing data. <strong>Resampling</strong> is different: it <em>aggregates</em> data into new time buckets. Think of it like this:<br><br>• <strong>Reindexing</strong> = "Show me only the data from these specific dates" (like filtering)<br>• <strong>Resampling</strong> = "Combine all the data within each month/quarter/year into one number" (like GROUP BY in SQL)<br><br>We'll use the California wildfire dataset to see how daily acres-burned data can be rolled up to monthly and quarterly totals.
</div>
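
The filtering-versus-aggregating distinction can be sketched on a toy series. The data below is made up (one unit per day for January and February 2020), and the example passes a `pd.offsets.MonthEnd()` object, the offset behind the `'ME'` code, so that it does not depend on which frequency strings the installed pandas accepts.

```python
import pandas as pd

# Made-up daily series: one unit per day for January and February 2020
days = pd.date_range("2020-01-01", "2020-02-29", freq="D")
daily = pd.Series(1.0, index=days, name="units")

month_end = pd.offsets.MonthEnd()  # offset object behind the 'ME' code
month_ends = pd.date_range("2020-01-01", "2020-02-29", freq=month_end)

# Reindexing: select the two rows that fall on a month-end date
picked = daily.reindex(month_ends)

# Resampling: add up every row inside each month
totals = daily.resample(month_end).sum()

print(picked)
print(totals)
```

Both results are indexed by January 31 and February 29, but `picked` holds a single day's value while `totals` holds the whole month.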

### Quick look at the wildfire data

In [None]:
# California wildfire acres burned: daily records from 1992 onward
acresBurned.head(3)

### Monthly Totals with `resample()`

In [None]:
# Resample to monthly frequency, summing all acres burned within each month
acresBurned.resample(rule='ME').sum().head(3)

### Upsampling: Going *More* Granular (12-Hour Bins)

In [None]:
# What happens if we resample to a SMALLER interval than the data has?
# 12h = 12-hour bins. Since original data is daily, the new rows get 0.
acresBurned.resample(rule='12h').sum().head(4)

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
Upsampling (going to a finer resolution) fills new intervals with 0 when using <code>.sum()</code> because there's nothing to aggregate. If you used <code>.mean()</code> instead, you'd get NaN for the new intervals. Choose your aggregation function carefully based on what the zeros or NaNs mean in your domain.
</div>
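
The difference between `.sum()` and `.mean()` when upsampling is easiest to see on a tiny example. This sketch uses two made-up daily values, not the wildfire data.

```python
import pandas as pd

# Two made-up daily observations
daily = pd.Series([10.0, 20.0],
                  index=pd.date_range("2020-01-01", periods=2, freq="D"))

# Upsample to 12-hour bins: the noon bin on Jan 1 has no source rows
up_sum = daily.resample("12h").sum()    # empty bin becomes 0
up_mean = daily.resample("12h").mean()  # empty bin becomes NaN

print(up_sum)
print(up_mean)
```

A zero from `.sum()` reads as "nothing happened", while a NaN from `.mean()` reads as "no measurement". Which reading is correct depends on the data.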

### Controlling Bin Boundaries: `label` and `closed`
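When resampling to quarters, should a record dated exactly on a quarter boundary (such as March 31) count toward the quarter that ends that day or toward the next bin? And which edge should label each bin? The `label` and `closed` parameters give you control: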

In [None]:
# label='right', closed='right': Q1 ends March 31, labeled as March 31
acresBurned.resample(rule='QE', label='right', closed='right').sum().head()

In [None]:
# label='left', closed='left': bins still break at quarter-end dates, but each bin
# includes its left edge (not its right) and is labeled with that left edge
acresBurned.resample(rule='QE', label='left', closed='left').sum().head()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
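<strong style="color: #922B21;">🛑 STOP AND CHECK</strong><br>
<strong>Checkpoint: Section 4</strong><br><br>Your monthly resample should show aggregated totals per month. The first row (1992-01-31) should be the sum of all acres burned in January 1992.<br><br>The quarterly resample with <code>label='right'</code> should show dates like 1992-03-31, 1992-06-30, etc.<br>The quarterly resample with <code>label='left'</code> should also show quarter-end dates, but each one marks the <em>start</em> of its bin (for example, 1992-03-31 labels the bin that runs up to June 30). Because the rule is <code>'QE'</code>, labels never fall on dates like 1992-01-01; a quarter-start rule such as <code>'QS'</code> would be needed for that.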
</div>
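
The effect of `label` and `closed` shows up most clearly on records that straddle a quarter boundary. The sketch below uses three made-up values around March 31, 1992, and passes `pd.offsets.QuarterEnd(startingMonth=12)`, the offset behind the `'QE'` code, so it runs regardless of which frequency strings the installed pandas accepts.

```python
import pandas as pd

# Three made-up records straddling the March 31 quarter boundary
s = pd.Series(
    [1, 10, 100],
    index=pd.to_datetime(["1992-03-30", "1992-03-31", "1992-04-01"]),
)

quarter_end = pd.offsets.QuarterEnd(startingMonth=12)  # offset behind 'QE'

# Right-closed: March 31 belongs to the bin that ends on March 31
right = s.resample(quarter_end, label="right", closed="right").sum()

# Left-closed: March 31 opens the next bin, which is labeled by its left edge
left = s.resample(quarter_end, label="left", closed="left").sum()

print(right)
print(left)
```

With the right-closed settings the March 31 value is counted with March 30. With the left-closed settings it is counted with April 1, under a label of March 31.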

---
## Section 5: How Downsampling Improves Plots

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
Raw daily stock data can be noisy: every small fluctuation shows up in the plot, making it hard to see the overall trend. <strong>Downsampling</strong> (resampling to a coarser frequency) smooths the data. Here we compare the daily Apple closing price to the weekly average. This is the same idea behind the "zoom out" feature in any stock charting app.
</div>

In [None]:
# Daily closing price: lots of noise
stockData.plot(y='Close', title='AAPL Daily Close Price (2020)', legend=False)
plt.ylabel('Price ($)')
plt.tight_layout()

In [None]:
# Weekly average closing price: smoother trend line
stockData.resample(rule='W').mean().plot(
    y='Close', title='AAPL Weekly Mean Close Price (2020)', legend=False)
plt.ylabel('Price ($)')
plt.tight_layout()

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #922B21;">🛑 STOP AND CHECK</strong><br>
<strong>Checkpoint: Section 5</strong><br><br>Compare the two plots side by side:<br>• The <strong>daily</strong> plot shows every jitter and gap (≈253 data points)<br>• The <strong>weekly mean</strong> plot is smoother, showing the clear trend: AAPL started ~$75 in January, crashed to ~$60 in March (COVID), then rallied to ~$130 by year end<br><br>Both plots should render without errors.
</div>

---
## Section 6: Rolling Windows

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
Downsampling compresses data into fewer points. <strong>Rolling windows</strong> take a different approach: they keep every data point but replace each value with the average of that value and the observations just before it. Imagine sliding a magnifying glass across the data, where at each position you compute the average of the last N rows.<br><br>This is one of the most common smoothing techniques in data mining, used in everything from stock technical analysis to IoT sensor anomaly detection.
</div>

### Raw vs. Smoothed: Concept Visualization
Let's look at January 2020 AAPL High/Low prices, first raw and then with a 7-day rolling mean (a window of 7 rows, which here means 7 trading days):

In [None]:
# Raw High and Low prices for January 2020
df_raw = stockData[['High', 'Low']].query('Date <= "01/31/2020"')

g = sns.relplot(data=df_raw, kind='line', markers=True, aspect=1.5)
g.figure.suptitle('AAPL High/Low: Raw Daily (Jan 2020)', y=1.02)
for ax in g.axes.flat:
    ax.tick_params('x', labelrotation=90)
    ax.set_xticks(pd.date_range(start='01/02/2020', end='01/31/2020', freq='B'))
    ax.set_xticklabels(pd.date_range(start='01/02/2020', end='01/31/2020',
                                     freq='B').strftime('%m-%d'))

In [None]:
# Same data with a 7-day rolling mean applied: notice how the lines smooth out
df_smooth = stockData[['High', 'Low']].query('Date <= "01/31/2020"') \
                                      .rolling(window=7, min_periods=7).mean()

g = sns.relplot(data=df_smooth, kind='line', markers=True, aspect=1.5)
g.figure.suptitle('AAPL High/Low: 7-Day Rolling Mean (Jan 2020)', y=1.02)
for ax in g.axes.flat:
    ax.tick_params('x', labelrotation=90)
    ax.set_xticks(pd.date_range(start='01/02/2020', end='01/31/2020', freq='B'))
    ax.set_xticklabels(pd.date_range(start='01/02/2020', end='01/31/2020',
                                     freq='B').strftime('%m-%d'))

### Understanding `min_periods`
The `window` parameter sets how many observations to include. But what happens at the start when you don't have 7 days of data yet?

In [None]:
# Default: min_periods = window size (7)
# First 6 rows are NaN because there aren't enough prior points
df_strict = stockData[['High', 'Low']].query('Date <= "01/31/2020"') \
                                      .rolling(window=7).mean()
print("With min_periods=7 (default), first 8 rows:")
df_strict.head(8)

In [None]:
# Relaxed: min_periods=1 means "compute the average with whatever you have"
# No NaN values: the first row uses just 1 value, the second uses 2, etc.
df_relaxed = stockData[['High', 'Low']].query('Date <= "01/31/2020"') \
                                       .rolling(window=7, min_periods=1).mean()
print("With min_periods=1, first 8 rows:")
df_relaxed.head(8)

<div style="background-color: #FEF9E7; border-left: 5px solid #F1C40F; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #7D6608;">⚠️ COMMON MISTAKE</strong><br>
When should you use <code>min_periods=1</code> vs. the default?<br><br>• Use the <strong>default</strong> (strict) when accuracy matters: you want a true 7-day average, not a partial one<br>• Use <code>min_periods=1</code> when you need a value for every row (e.g., for downstream plotting or modeling) and can tolerate the early values being less stable
</div>
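
The trade-off is easier to verify by hand on a short series. This sketch uses five made-up values and a window of 3 instead of 7 so the arithmetic stays small; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Strict: min_periods defaults to the window size, so the first 2 rows are NaN
strict = s.rolling(window=3).mean()

# Relaxed: partial windows are averaged over whatever rows are available
relaxed = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"value": s, "strict": strict, "relaxed": relaxed}))
```

From the third row onward the two columns agree. They differ only while the window is still filling up.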

### Final Rolling Window Plot

In [None]:
# Clean rolling window plot with strict min_periods
df_plot = stockData[['High', 'Low']].query('Date <= "01/31/2020"') \
                                    .rolling(window=7, min_periods=7).mean()

g = sns.relplot(data=df_plot, kind='line', markers=True, aspect=1.5)
g.figure.suptitle('AAPL 7-Day Rolling Mean: High/Low (Jan 2020)', y=1.02)
for ax in g.axes.flat:
    ax.tick_params('x', labelrotation=90)
    ax.set_xticks(pd.date_range(start='01/10/2020', end='01/31/2020', freq='B'))
    ax.set_xticklabels(pd.date_range(start='01/10/2020', end='01/31/2020',
                                     freq='B').strftime('%Y-%m-%d'))

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #922B21;">🛑 STOP AND CHECK</strong><br>
<strong>Checkpoint: Section 6</strong><br><br>You should see three plots and two DataFrames:<br>• <strong>Raw plot:</strong> Jagged High/Low lines with sharp day-to-day swings<br>• <strong>Smoothed plot:</strong> Gentler curves, because the 7-day rolling mean absorbs the daily noise<br>• <strong>Default min_periods:</strong> First 6 rows are NaN (not enough data for a full 7-row window)<br>• <strong>min_periods=1:</strong> All rows have values, but early values are based on fewer observations<br>• <strong>Final plot:</strong> Clean rolling mean starting from Jan 10 (the 7th trading day of 2020, so the first date with a full window)
</div>

---
## Section 7: Running Totals with `expanding()`

<div style="background-color: #D6EAF8; border-left: 5px solid #2E86C1; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #1A5276;">💡 WHY ARE WE DOING THIS?</strong><br>
A <strong>running total</strong> (also called a cumulative sum) keeps a running tally as you move through time. At any point, it tells you the total from the very beginning up to that moment. This is how you'd answer: "How many total acres have burned <em>so far</em>?"<br><br><code>expanding()</code> is like <code>rolling()</code> but the window always starts at row 1 and grows, so it never drops old data.
</div>

In [None]:
# Work on a copy of acresBurned (we'll add a column to it)
acresBurned_rt = acresBurned.copy()

# expanding().sum() = cumulative sum from the beginning
acresBurned_rt['running_total'] = acresBurned_rt['acres_burned'].expanding().sum()
acresBurned_rt.head()

### Visualizing Daily Burn vs. Running Total
Let's use the first 10 days to see the bar chart clearly. We'll use `pd.melt()` to reshape the data for Seaborn's grouped bar chart:

In [None]:
# Take the first 10 days for a readable chart
acresPlot = acresBurned_rt.head(10).copy()
acresPlot.reset_index(inplace=True)
acresPlot.head(3)

In [None]:
# Melt from wide to long format: one row per date+metric combination
acresMelted = pd.melt(acresPlot,
                      id_vars='discovery_date',
                      value_vars=['acres_burned', 'running_total'],
                      var_name='value_type')
acresMelted.head(3)

In [None]:
# Grouped bar chart: daily burn (blue) vs running total (orange)
g = sns.catplot(data=acresMelted, kind='bar',
                x='discovery_date', y='value', hue='value_type', aspect=1.5)
g.figure.suptitle('Daily Acres Burned vs. Running Total (First 10 Days)', y=1.02)
for ax in g.axes.flat:
    ax.tick_params('x', labelrotation=90)
    ax.set_xticklabels(acresMelted.discovery_date.drop_duplicates().astype(str))

<div style="background-color: #FADBD8; border-left: 5px solid #E74C3C; padding: 15px; margin: 15px 0; border-radius: 4px;">
<strong style="color: #922B21;">🛑 STOP AND CHECK</strong><br>
<strong>Checkpoint: Section 7</strong><br><br>Your bar chart should show 10 pairs of bars:<br>• <strong>acres_burned</strong> (blue): the daily value, varies from day to day<br>• <strong>running_total</strong> (orange): grows monotonically, never decreases<br><br>By day 10, the running total should be significantly larger than any single day's burn. This pattern, where small daily increments build into a large cumulative total, is the signature of running totals.
</div>
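
An `expanding().sum()` running total accumulates from the first record in the data. If the question is instead "how much so far this year?", one possible variant is to restart the tally each calendar year with `groupby(...).cumsum()`. This is only a sketch on four made-up values that cross a year boundary, not the wildfire data, and the variable names are illustrative.

```python
import pandas as pd

# Four made-up daily values crossing from 2019 into 2020
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(["2019-12-30", "2019-12-31", "2020-01-01", "2020-01-02"]),
    name="acres_burned",
)

# Running total from the very first row (same result as s.cumsum())
all_time = s.expanding().sum()

# Running total that restarts at the beginning of each calendar year
year_to_date = s.groupby(s.index.year).cumsum()

print(pd.DataFrame({"acres_burned": s,
                    "all_time": all_time,
                    "year_to_date": year_to_date}))
```

The all-time column keeps climbing across the year boundary, while the year-to-date column drops back to the first 2020 value on January 1.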

---
## Wrap-Up: What We Learned & What's Next

**Today we built the time series toolkit:**

| Technique | pandas Method | What It Does |
|-----------|--------------|--------------|
| Date ranges | `pd.date_range()` | Generate sequences of dates at any frequency |
| Reindexing | `df.reindex()` | Reshape data to a new set of dates |
| Date adjustment | Custom `adjustDate()` | Fix weekend dates to nearest business day |
| Resampling | `df.resample()` | Aggregate data into time buckets (daily → monthly) |
| Rolling windows | `df.rolling()` | Smooth data with a sliding average |
| Running totals | `df.expanding().sum()` | Cumulative sum from the beginning |

**Next week (Week 2):** We take these skills and use them to actually *predict the future* with time series forecasting in SARIMAX and Prophet. Everything we did today (resampling, smoothing, datetime indexing) is prerequisite input for those models.

---
*CAP4767 Data Mining with Python | Miami Dade College | Spring 2026*