# MDST – Predicting Economic Values  
## Week 3: Feature Engineering (SOLUTIONS + TA Commentary)

**Today:** 2026-02-07

This solutions notebook:
- Fills in all TODOs from the student workbook
- Uses ONE example FRED series to demonstrate (students still choose their own in workbook)
- Adds short TA commentary on *why* each feature is used and common pitfalls


## Setup

In [None]:
import matplotlib.pyplot as plt
from pandas_datareader import data as pdr

plt.style.use("default")


## Example: Load one FRED series (demonstration only)

> **TA note:** Students should choose their own series in the workbook.  
We pick one here just so every cell runs end-to-end.


In [None]:
SERIES_ID = "CPIAUCSL"  # Example: CPI (Consumer Price Index)
START_DATE = "2000-01-01"
END_DATE = None

df = pdr.DataReader(SERIES_ID, "fred", START_DATE, END_DATE)
df.columns = ["value"]
df.head()


## Part 1: Raw levels

> **TA note:** Many macro series are trending (non-stationary). A shared trend can make unrelated series look strongly correlated (spurious correlation).  
Always visualize levels first to understand scale, breaks, and outliers.


In [None]:
plt.figure(figsize=(10,4))
plt.plot(df.index, df["value"])
plt.title(f"Raw Levels: {SERIES_ID}")
plt.xlabel("Date")
plt.ylabel("Value")
plt.show()
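
To make the spurious-correlation point concrete, here is a minimal sketch using synthetic data rather than FRED data. The two series, the drift of 0.5, and the seed are illustrative assumptions; the series are generated independently, so any correlation in levels comes only from the shared upward trend.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the demo is reproducible
n = 500

# Two independent random walks, each with the same upward drift (assumed values)
a = pd.Series(np.cumsum(0.5 + rng.normal(size=n)))
b = pd.Series(np.cumsum(0.5 + rng.normal(size=n)))

# Correlation in levels is dominated by the common trend
corr_levels = a.corr(b)

# Correlation in first differences reflects the (independent) period-to-period changes
corr_changes = a.diff().corr(b.diff())

print(f"levels:  {corr_levels:.2f}")
print(f"changes: {corr_changes:.2f}")
```

The levels correlation should come out close to 1 while the correlation of changes stays near 0, even though the two series have nothing to do with each other.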


## Part 2: First differences

> **TA note:** Differencing turns levels into *changes*.  
For CPI, a first difference is roughly “inflation in index points” (not a percent).


In [None]:
df["diff_1"] = df["value"].diff()

plt.figure(figsize=(10,4))
plt.plot(df.index, df["diff_1"])
plt.title(f"First Differences: {SERIES_ID}")
plt.xlabel("Date")
plt.ylabel("Change")
plt.show()

df[["value","diff_1"]].dropna().head()
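
A small arithmetic sketch of why "index points" and "percent" are different things. The numbers are made up for illustration: the same one-point change is a larger percent move when the index is at 100 than when it is at 200.

```python
import pandas as pd

# Toy index values (hypothetical): both columns rise by exactly one index point
toy = pd.DataFrame({"low_base": [100.0, 101.0], "high_base": [200.0, 201.0]})

diffs = toy.diff().iloc[-1]              # change in index points
pcts = toy.pct_change().iloc[-1] * 100   # change in percent

print(diffs)
print(pcts)
```

Both columns have a first difference of 1.0, but the percent changes are 1% and 0.5%. This is why first differences of a long trending index are hard to compare across decades.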


## Part 3: Percent changes

> **TA note:** Percent change is scale-free and interpretable as a growth rate.  
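For CPI, the percent change is the inflation rate per period (month-over-month for this monthly series, not annualized). Note that `pct_change()` returns a fraction, so 0.01 means 1%.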


In [None]:
df["pct_change"] = df["value"].pct_change()

plt.figure(figsize=(10,4))
plt.plot(df.index, df["pct_change"])
plt.title(f"Percent Change: {SERIES_ID}")
plt.xlabel("Date")
plt.ylabel("Percent Change (fraction, 0.01 = 1%)")
plt.show()

df[["pct_change"]].dropna().head()
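
Because the percent change above is per period, students often ask how it relates to the headline annual inflation number. The sketch below uses a synthetic monthly index that grows by a constant 0.2% per month (an assumption chosen so the answer is easy to check) and compares year-over-year change with an annualized monthly rate.

```python
import pandas as pd

# Synthetic monthly index (hypothetical): constant 0.2% growth per month
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
toy_cpi = pd.Series([100 * 1.002 ** i for i in range(24)], index=idx)

mom = toy_cpi.pct_change()            # month-over-month, as a fraction
yoy = toy_cpi.pct_change(periods=12)  # year-over-year, as a fraction
annualized = (1 + mom) ** 12 - 1      # monthly rate compounded over 12 months

print(pd.DataFrame({"mom": mom, "yoy": yoy, "annualized": annualized}).tail(3))
```

With constant monthly growth the two annual measures agree. On real data they differ: year-over-year is smoother, while the annualized monthly rate reacts faster and is noisier. Note that `periods=12` costs the first 12 observations.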


## Part 4: Lagged variables

> **TA note:** Lags help models learn delayed relationships (e.g., policy affects economy with delay).  
Choose lags that match frequency (monthly → 1, 3, 12; quarterly → 1, 4).


In [None]:
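# shift(k) moves values k rows down, so each row sees the value from k periods earlier
df["lag_1"] = df["value"].shift(1)
df["lag_3"] = df["value"].shift(3)

# dropna() removes any row with a NaN in ANY column (not only the lag columns)
lag_df = df.dropna()
lag_df[["value","lag_1","lag_3"]].head()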
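
If students want several frequency-matched lags at once, a small helper keeps the cell tidy. This is a sketch only: `add_lags` and the `value_lag_k` column names are hypothetical, not part of the workbook, and the toy frame stands in for a real series.

```python
import pandas as pd

def add_lags(frame, column, lags):
    """Return a copy of `frame` with one shifted column per lag in `lags`."""
    out = frame.copy()
    for k in lags:
        out[f"{column}_lag_{k}"] = out[column].shift(k)
    return out

# Toy monthly frame (hypothetical data): 24 months of values 0..23
idx = pd.date_range("2020-01-01", periods=24, freq="MS")
toy = pd.DataFrame({"value": range(24)}, index=idx)

# Monthly data, so use lags of 1, 3, and 12 months
lagged = add_lags(toy, "value", [1, 3, 12])

# The largest lag determines how many rows survive dropna()
rows_kept = len(lagged.dropna())
print(f"{len(toy)} rows before, {rows_kept} rows after dropna()")
```

The 12-month lag removes the first 12 rows, which is half of this toy sample. On a short real series that trade-off is worth discussing explicitly.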


## Part 5: Rolling statistics

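> **TA note:** Rolling means smooth noise. For monthly data, a 12-month window is a common choice because it averages over a full year. pandas `rolling` is trailing by default (`center=False`), and the first `WINDOW - 1` values are NaN.  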
Avoid centered windows in predictive settings (they leak future info).


In [None]:
WINDOW = 12
df["rolling_mean"] = df["value"].rolling(window=WINDOW).mean()

plt.figure(figsize=(10,4))
plt.plot(df.index, df["value"], alpha=0.5, label="Raw")
plt.plot(df.index, df["rolling_mean"], label=f"Rolling Mean (window={WINDOW})")
plt.title(f"Rolling Mean vs Raw: {SERIES_ID}")
plt.legend()
plt.show()
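
The leakage from centered windows is easy to show on a toy series with a single jump. The data below is made up for illustration; the point is that the centered mean starts moving one step before the jump happens, which a forecaster could not know at that time.

```python
import pandas as pd

# Toy series (hypothetical): flat at 0, then jumps to 10 at position 6
s = pd.Series([0.0] * 6 + [10.0] * 6)

trailing = s.rolling(window=3).mean()               # uses t-2, t-1, t
centered = s.rolling(window=3, center=True).mean()  # uses t-1, t, t+1

# Position 5 is the last observation BEFORE the jump
print("trailing at t=5:", trailing.iloc[5])
print("centered at t=5:", centered.iloc[5])
```

At position 5 the trailing mean is still 0, while the centered mean is already above 0 because its window includes the value at position 6.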


## Part 6: Feature comparison

> **TA note:** These plots help students *see* why feature choice changes interpretability.  
A model trained on levels may mostly learn the time trend, while differences and percent changes focus on the dynamics.


In [None]:
features_to_compare = ["value", "diff_1", "pct_change"]

fig, axes = plt.subplots(len(features_to_compare), 1, figsize=(10,8), sharex=True)
for ax, col in zip(axes, features_to_compare):
    ax.plot(df.index, df[col])
    ax.set_title(col)
plt.tight_layout()
plt.show()


## Caution box: Common “bad feature” mistakes (TA-ready)

1. **Over-differencing** (differencing an already stationary series mostly adds noise)  
2. **Meaningless ratios** (units and intuition matter)  
3. **Leakage** (centered rolling windows, peeking into future target)  
4. **Mismatched frequencies** (monthly vs quarterly must be aligned)  
5. **Large lags shrink sample size** (watch dropna)

> **TA tip:** Ask students to explain the *economic meaning* of every engineered feature in one sentence.
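
For mistake 4, one way to align frequencies is to aggregate the higher-frequency series down to the lower one before joining. The sketch below uses made-up monthly and quarterly series; the names and values are assumptions, and averaging within the quarter is one reasonable choice among several (the last value of the quarter is another).

```python
import pandas as pd

# Hypothetical monthly series: values 1..12 for the months of 2020
monthly = pd.Series(
    [float(i) for i in range(1, 13)],
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
    name="monthly_value",
)

# Hypothetical quarterly series, dated at quarter start
quarterly = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.date_range("2020-01-01", periods=4, freq="QS"),
    name="quarterly_value",
)

# Average the three months in each quarter, labelled by quarter start
monthly_as_quarterly = monthly.resample("QS").mean()

# Both series now share the same quarterly index, so they line up row by row
aligned = pd.concat([monthly_as_quarterly, quarterly], axis=1)
print(aligned)
```

Going the other way (forward-filling quarterly values onto months) is also possible, but it needs care: a quarterly figure dated at the start of the quarter is usually not published until after the quarter ends, so filling it forward from that date can leak future information.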


## Guided exercise solutions: Two feature combinations

We demonstrate both approaches, in the order the cells appear below:
- Option 2: combine transformations of the same series (momentum + volatility)
- Option 1: combine two series (CPI inflation interacted with the unemployment rate)


In [None]:
# Option 2: Combine transformations of the same series
df["momentum_3"] = df["pct_change"].rolling(window=3).sum()
df["pct_vol_12"] = df["pct_change"].rolling(window=12).std()

df[["pct_change","momentum_3","pct_vol_12"]].dropna().head()


### TA commentary
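- **momentum_3:** approximate cumulative growth over 3 periods (a sum of per-period percent changes, which ignores compounding); captures short-run direction.
- **pct_vol_12:** rolling standard deviation of percent changes over 12 periods (about 1 year for monthly data); captures stability/uncertainty.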
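
Summing percent changes is only an approximation of cumulative growth. A quick check with exaggerated toy returns (10% per period, chosen only to make the gap visible) compares the rolling sum with the compounded figure.

```python
import numpy as np
import pandas as pd

# Toy per-period growth rates (hypothetical): 10% three periods in a row
r = pd.Series([0.10, 0.10, 0.10])

# Rolling sum, as used for momentum_3
summed = r.rolling(window=3).sum().iloc[-1]

# Compounded growth over the same window: product of (1 + r) minus 1
compounded = (1 + r).rolling(window=3).apply(np.prod, raw=True).iloc[-1] - 1

print(f"sum:        {summed:.3f}")
print(f"compounded: {compounded:.3f}")
```

The sum gives 0.300 and compounding gives 0.331. For monthly CPI changes, which are typically well under 1%, the gap is tiny, so the simpler rolling sum is a reasonable feature.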


In [None]:
# Option 1: Combine TWO series (example)
SECOND_SERIES_ID = "UNRATE"  # Unemployment rate (monthly)
df2 = pdr.DataReader(SECOND_SERIES_ID, "fred", START_DATE, END_DATE)
df2.columns = ["unrate"]

merged = df.join(df2, how="inner")

# Interaction: inflation proxy (pct_change) * unemployment level
merged["inflation_x_unrate"] = merged["pct_change"] * merged["unrate"]

merged[["pct_change","unrate","inflation_x_unrate"]].dropna().head()


### TA commentary
- Interactions encode “the effect of X depends on Y”.
- This is only meaningful when you can explain the economic story (e.g., inflation dynamics may differ depending on how tight the labor market is).


## Final note for students
There is no single correct transformation. Prefer features that:
- match the question you’re asking
- reduce spurious trend learning
- stay interpretable
- avoid leakage
