<a href="https://colab.research.google.com/github/c-marq/CAP3321C-Data-Wrangling/blob/main/exercises/chapter-08/exercise_8_1_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 8-1: Analyze the Forest Fires Data

## 🔑 INSTRUCTOR SOLUTION KEY

**CAP3321C - Data Wrangling**

---

## Read the Data

In [None]:
import pandas as pd

In [None]:
# Download the data file from GitHub
!wget -q https://raw.githubusercontent.com/c-marq/CAP3321C-Data-Wrangling/main/data/fires_by_month.pkl
print("Data file downloaded successfully!")

In [None]:
# Load the fires data
fires_by_month = pd.read_pickle('fires_by_month.pkl')
print("Data shape:", fires_by_month.shape)

### Task 4: Display the First Five Rows

In [None]:
# ✅ SOLUTION
fires_by_month.head()

---

## Part 1: Group and Aggregate the Data

### Task 5: Group the Data by State and Year

In [None]:
# ✅ SOLUTION
fires_grouped = fires_by_month.groupby(['state', 'fire_year'])
fires_grouped

#### 📝 Instructor Notes - Task 5

**Key Teaching Points:**
- `groupby()` returns a DataFrameGroupBy object, not a DataFrame
- Must apply an aggregation method (sum, mean, count, etc.) to get results
- The order of the columns in the list sets the order of the index levels in the aggregated result
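
A minimal sketch of these points, using a small made-up DataFrame (`toy`) rather than the real `fires_by_month` data; the values are illustrative assumptions only:

```python
import pandas as pd

# Toy data standing in for fires_by_month (values are invented for illustration)
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'CA', 'CA'],
    'fire_year': [1992, 1993, 1992, 1992],
    'acres_burned': [10.0, 20.0, 5.0, 7.0],
})

# groupby() alone only describes the groups; nothing is computed yet
grouped = toy.groupby(['state', 'fire_year'])
print(type(grouped).__name__)

# Applying an aggregation produces a DataFrame
by_state_year = grouped.sum()

# Swapping the column order swaps the index levels
by_year_state = toy.groupby(['fire_year', 'state']).sum()
print(list(by_state_year.index.names))
print(list(by_year_state.index.names))
```

Printing the index names for both results makes the effect of column order easy to show in class.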

### Task 6: Sum the Grouped Data

In [None]:
# ✅ SOLUTION
fires_by_year = fires_grouped.sum(numeric_only=True)
fires_by_year.head()

#### 📝 Instructor Notes - Task 6

**Key Teaching Points:**
- `sum()` adds up all numeric columns within each group
- `numeric_only=True` restricts the sum to numeric columns; without it, pandas 1.5 emits a FutureWarning and pandas 2.0+ tries to sum non-numeric columns as well
- Result has a MultiIndex (state, fire_year)

**Common Student Errors:**
- Forgetting to assign to a variable
- Trying to display the grouped object without aggregating it first
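
The teaching points above can be sketched on toy data. The `fire_name` column is a hypothetical text column added only to show what `numeric_only=True` filters out; it is not assumed to exist in the real dataset:

```python
import pandas as pd

# Toy data; 'fire_name' is a made-up text column for illustration
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'CA'],
    'fire_year': [1992, 1992, 1992],
    'fire_name': ['A', 'B', 'C'],
    'acres_burned': [10.0, 20.0, 5.0],
})

# Only numeric columns survive the aggregation
summed = toy.groupby(['state', 'fire_year']).sum(numeric_only=True)
print(summed.columns.tolist())

# The grouping columns become the levels of a MultiIndex
print(isinstance(summed.index, pd.MultiIndex))
print(summed)
```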

### Task 7: Drop the fire_month Column

In [None]:
# ✅ SOLUTION
fires_by_year = fires_by_year.drop(columns=['fire_month'])

In [None]:
# Verify
fires_by_year.head()

#### 📝 Instructor Notes - Task 7

**Key Teaching Points:**
- After summing by year, the fire_month column contains meaningless sums (1+2+3+...+12)
- Dropping irrelevant columns keeps data clean and understandable
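
To make the "meaningless sum" concrete, here is a small sketch with one invented state-year that has a row for every month. The data is an assumption for illustration, not taken from `fires_by_month`:

```python
import pandas as pd

# One state, one year, twelve monthly rows (values invented for illustration)
toy = pd.DataFrame({
    'state': ['AK'] * 12,
    'fire_year': [1992] * 12,
    'fire_month': list(range(1, 13)),
    'acres_burned': [1.0] * 12,
})

summed = toy.groupby(['state', 'fire_year']).sum(numeric_only=True)

# The month numbers were added together: 1 + 2 + ... + 12
month_total = summed.loc[('AK', 1992), 'fire_month']
print(month_total)

# Dropping the column leaves only the meaningful total
cleaned = summed.drop(columns=['fire_month'])
print(cleaned.columns.tolist())
```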

---

## Part 2: Use Pivot Tables

### Task 8: Select Recent Data

In [None]:
# ✅ SOLUTION
fires_recent = fires_by_year.query('fire_year >= 2013').reset_index()
fires_recent.head()

#### 📝 Instructor Notes - Task 8

**Key Teaching Points:**
- `query()` works on index levels too (fire_year is in the index)
- `reset_index()` converts index back to columns - needed for `pivot()`
- Chaining methods is efficient and readable
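
A short sketch of the same chain on toy data, assuming a frame already indexed by `state` and `fire_year` as `fires_by_year` is at this point (the values are invented):

```python
import pandas as pd

# Toy frame with a (state, fire_year) MultiIndex, mirroring fires_by_year
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'CA', 'CA'],
    'fire_year': [2012, 2013, 2012, 2014],
    'acres_burned': [10.0, 20.0, 5.0, 7.0],
}).set_index(['state', 'fire_year'])

# query() can refer to a named index level just like a column
recent = toy.query('fire_year >= 2013')

# reset_index() turns the index levels back into ordinary columns
recent_flat = recent.reset_index()
print(recent_flat)
```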

### Task 9: Use pivot() Method

In [None]:
# ✅ SOLUTION
fires_recent.pivot(index='state', columns='fire_year', values='acres_burned')

#### 📝 Instructor Notes - Task 9

**Key Teaching Points:**
- `pivot()` requires:
  - `index` - what becomes row labels
  - `columns` - what becomes column labels  
  - `values` - what fills the cells
- The data must already be aggregated: each index/column combination may appear only once, otherwise `pivot()` raises a `ValueError`
- If duplicates exist, use `pivot_table()` instead
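
The duplicate restriction can be demonstrated with a toy frame in which one state/year pair appears twice. The data is invented for illustration:

```python
import pandas as pd

# AK/2013 appears twice, so the data is not pre-aggregated
dupes = pd.DataFrame({
    'state': ['AK', 'AK', 'CA'],
    'fire_year': [2013, 2013, 2013],
    'acres_burned': [10.0, 20.0, 5.0],
})

try:
    dupes.pivot(index='state', columns='fire_year', values='acres_burned')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot() failed:', err)

# pivot_table() resolves the duplicates with an aggregation function
wide = dupes.pivot_table(
    index='state', columns='fire_year', values='acres_burned', aggfunc='sum'
)
print(wide)
```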

### Task 10: Use pivot_table() Method

In [None]:
# ✅ SOLUTION
fires_by_month.query('fire_year >= 2013').pivot_table(
    index='state',
    columns='fire_year',
    values='acres_burned',
    aggfunc='sum'
)

#### 📝 Instructor Notes - Task 10

**Key Teaching Points:**
- `pivot_table()` combines grouping, aggregating, AND pivoting in one step
- `aggfunc='sum'` tells it how to combine duplicate values
- This is much simpler than: groupby → sum → filter → pivot
- Default aggfunc is 'mean'

**Comparison:**
- `pivot()` = reshape only (data must be pre-aggregated)
- `pivot_table()` = aggregate AND reshape (handles duplicates)
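
As a rough check of this comparison, the sketch below builds the same wide table both ways on invented toy data, then shows what happens when `aggfunc` is left at its default:

```python
import pandas as pd

# Toy data with one duplicated state/year pair (AK, 2013)
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AK', 'CA', 'CA'],
    'fire_year': [2013, 2013, 2014, 2013, 2014],
    'acres_burned': [10.0, 20.0, 4.0, 5.0, 7.0],
})

# Long way: aggregate first, then reshape
long_way = (
    toy.groupby(['state', 'fire_year'])['acres_burned'].sum()
       .reset_index()
       .pivot(index='state', columns='fire_year', values='acres_burned')
)

# Short way: aggregate and reshape in one call
short_way = toy.pivot_table(
    index='state', columns='fire_year', values='acres_burned', aggfunc='sum'
)
print((long_way.values == short_way.values).all())

# Omitting aggfunc averages the duplicates instead of summing them
default_agg = toy.pivot_table(
    index='state', columns='fire_year', values='acres_burned'
)
print(default_agg)
```

The last table is a useful prompt for discussing why forgetting `aggfunc='sum'` silently changes the answer.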

---

## Part 3: Work with Bins

### Task 11: Reset the Index for fires_by_year

In [None]:
# ✅ SOLUTION
fires_by_year = fires_by_year.reset_index()
fires_by_year.head()

### Task 12: Bin the Rows by Decade

In [None]:
# ✅ SOLUTION
fires_by_year['decade'] = pd.cut(
    fires_by_year['fire_year'],
    bins=[1990, 2000, 2010, 2020],
    labels=['1990s', '2000s', '2010s'],
    right=False  # left-inclusive bins, so 2000 falls in '2000s'
)
fires_by_year.head()

#### 📝 Instructor Notes - Task 12

**Key Teaching Points:**
- `pd.cut()` bins continuous data into discrete categories
- `bins` defines the edges; with `right=False` the intervals are [1990, 2000), [2000, 2010), [2010, 2020)
- By default, bins are right-inclusive, which would put 2000 in '1990s' and 2010 in '2000s'; `right=False` keeps the first year of each decade in its own bin
- `labels` gives human-readable names

**Common Student Errors:**
- Wrong bin edges (forgetting to cover all data)
- Mismatched number of labels vs bins (need n-1 labels for n bin edges)
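
A small sketch of how the `right` parameter changes where edge years land, and of the label-count error. The years are chosen by hand to sit on or next to the bin edges:

```python
import pandas as pd

years = pd.Series([1999, 2000, 2001, 2010])
edges = [1990, 2000, 2010, 2020]
names = ['1990s', '2000s', '2010s']

# Default (right=True): intervals are (1990, 2000], (2000, 2010], (2010, 2020]
right_closed = pd.cut(years, bins=edges, labels=names)

# right=False: intervals are [1990, 2000), [2000, 2010), [2010, 2020)
left_closed = pd.cut(years, bins=edges, labels=names, right=False)

print(pd.DataFrame({
    'year': years,
    'right=True': right_closed,
    'right=False': left_closed,
}))

# Four edges define three bins, so two labels is one too few
try:
    pd.cut(years, bins=edges, labels=['1990s', '2000s'])
    label_error = False
except ValueError as err:
    label_error = True
    print('label mismatch:', err)
```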

### Task 13: Double-Check the Edge Values

In [None]:
# ✅ SOLUTION
fires_by_year.head(25)

# Check: 1992-1999 should be '1990s'
#        2000-2009 should be '2000s'
#        2010-2016 should be '2010s'

### Task 14: Drop fire_year and Create fires_by_decade

In [None]:
# ✅ SOLUTION
fires_by_decade = fires_by_year.drop(columns=['fire_year'])
fires_by_decade.head()

### Task 15: Group by State and Decade

In [None]:
# ✅ SOLUTION
fires_by_decade.groupby(['state', 'decade']).sum(numeric_only=True)

#### 📝 Instructor Notes - Task 15

**Key Teaching Points:**
- Binning allows aggregation at different granularities
- Year → Decade reduces 25 years to 3 time periods
- Makes trends easier to see and compare

**Discussion:** How do fire totals compare across decades? Any patterns?
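
For the discussion, a minimal sketch of the bin-then-group pattern on invented data. The years are deliberately kept away from the bin edges. Passing `observed=True` is an optional choice not used in the solution above: because `decade` is categorical, it keeps only the state/decade combinations that actually occur.

```python
import pandas as pd

# Toy yearly totals (values invented; years avoid the bin edges)
toy = pd.DataFrame({
    'state': ['AK', 'AK', 'AK', 'CA'],
    'fire_year': [1995, 1998, 2005, 2011],
    'acres_burned': [10.0, 20.0, 4.0, 7.0],
})

toy['decade'] = pd.cut(
    toy['fire_year'],
    bins=[1990, 2000, 2010, 2020],
    labels=['1990s', '2000s', '2010s'],
)

# Drop the year so it is not summed, then aggregate per state and decade
by_decade = (
    toy.drop(columns=['fire_year'])
       .groupby(['state', 'decade'], observed=True)
       .sum(numeric_only=True)
)
print(by_decade)
```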

---

## Summary

In this exercise, you practiced data analysis techniques:

**Grouping and Aggregating:**
- `groupby()` - Group data by one or more columns
- `sum()` - Aggregate with sum

**Pivot Tables:**
- `pivot()` - Reshape data (requires pre-aggregated data)
- `pivot_table()` - Reshape and aggregate in one step

**Binning:**
- `pd.cut()` - Bin continuous data into categories
- Use `labels` parameter to name the bins