# MDST – Predicting Economic Values  
## Week 3: Feature Engineering (Student Workbook)

**Theme:** Transform raw FRED time series into model-ready features.  
**Data choice:** You will choose **your own** economic series from [FRED](https://fred.stlouisfed.org).  
**Today:** 2026-02-07

> This notebook follows the Week 3 slides: raw levels → differences → percent change → lags → rolling stats → comparison.  


## Feature Engineering Toolbox: Syntax Primer

Before you start with FRED data, here are the key pandas operations you will use this week.  
These are **generic examples** — you will apply them to your own FRED series below.

### 1) First differences
Measures change between consecutive time periods.

**Syntax**
```python
df["diff"] = df["value"].diff()
```

### 2) Percent change
Measures relative change (growth rate); common in economics.

**Syntax**
```python
df["pct_change"] = df["value"].pct_change()
```

### 3) Lagged variables
Shifts the series so that each row holds the value from an earlier period, which lets a model capture delayed effects.

**Syntax**
```python
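df["lag_1"] = df["value"].shift(1)  # value from the previous period
df["lag_k"] = df["value"].shift(k)  # value from k periods back (define k as a positive integer first)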
```

### 4) Rolling statistics
Computes statistics over a moving window.

**Syntax**
```python
df["rolling_mean"] = df["value"].rolling(window=3).mean()
```

### 5) Feature combination (engineering new features)
Combining series can encode economic intuition.

**Common patterns**
- **Spread:** `long_rate - short_rate`
- **Ratio:** `exports / imports`
- **Interaction:** `inflation * unemployment`
- **Acceleration:** `diff(diff(x))`


In [None]:
import pandas as pd
import numpy as np

# Tiny example series so you can see the outputs clearly
example_series = pd.Series([100, 105, 103, 110], name="value")
example_series


In [None]:
# 1) First differences
example_series.diff()


In [None]:
# 2) Percent change
example_series.pct_change()


In [None]:
# 3) Lagged variables
example_series.shift(1), example_series.shift(2)


In [None]:
# 4) Rolling mean (window=2)
example_series.rolling(window=2).mean()


### Feature combination examples (generic)

Below are *templates* that show common feature engineering combinations in economics. They are not runnable yet because they assume a DataFrame `df` that already contains the named columns.

```python
# A) Spread (difference between two rates)
df["spread"] = df["long_rate"] - df["short_rate"]

# B) Ratio (relative scale)
df["ratio"] = df["exports"] / df["imports"]

# C) Interaction (conditional effect / nonlinearity)
df["interaction"] = df["inflation"] * df["unemployment"]

# D) Acceleration (change of change)
df["diff_of_diff"] = df["value"].diff().diff()
```
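
To see three of these templates produce numbers, here is a minimal sketch on made-up data. The DataFrame `toy` and all of its values are hypothetical; only the column names are borrowed from the templates.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the templates, values are invented.
toy = pd.DataFrame(
    {
        "long_rate": [4.0, 4.2, 4.1, 3.9],
        "short_rate": [3.5, 4.0, 4.3, 4.4],
        "exports": [200.0, 210.0, 190.0, 220.0],
        "imports": [250.0, 240.0, 260.0, 220.0],
        "value": [100.0, 105.0, 103.0, 110.0],
    },
    index=pd.date_range("2024-01-01", periods=4, freq="MS"),
)

toy["spread"] = toy["long_rate"] - toy["short_rate"]  # negative when short > long
toy["ratio"] = toy["exports"] / toy["imports"]        # 1.0 means balanced trade
toy["diff_of_diff"] = toy["value"].diff().diff()      # first two rows are NaN

print(toy[["spread", "ratio", "diff_of_diff"]])
```

Note that `diff_of_diff` loses two rows at the start of the sample, one for each round of differencing.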


In [None]:
import matplotlib.pyplot as plt
from pandas_datareader import data as pdr

plt.style.use("default")


## Part 0: Load your FRED series

### Instructions
1. Go to https://fred.stlouisfed.org  
2. Choose **one** time series you are interested in.  
3. Copy the **Series ID** (e.g., `CPIAUCSL`, `UNRATE`, `FEDFUNDS`, `GDPC1`)  
4. Paste it into `SERIES_ID` below and run the cell.


In [None]:
SERIES_ID = ""  # TODO: paste your FRED Series ID (string)
START_DATE = "2000-01-01"
END_DATE = None  # None = up to latest available

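# Fail early with a clear message if the ID was left blank
if not SERIES_ID:
    raise ValueError("Set SERIES_ID to a FRED Series ID, e.g. 'UNRATE'.")

# Fetch data from FRED and rename the single column to "value"
df = pdr.DataReader(SERIES_ID, "fred", START_DATE, END_DATE)
df.columns = ["value"]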

df.head()


## Part 1: Raw levels

Raw levels often trend over time. Two series that both trend can look highly correlated even when they are unrelated, so trends can hide the real relationship between them.
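
The effect of a shared trend on correlation can be illustrated with synthetic data. This is only a sketch: the two series below are invented, their noise terms are independent, and the slopes and seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the result is repeatable
t = np.arange(200)

# Two synthetic series with independent noise; all they share is an upward trend.
a = pd.Series(0.5 * t + rng.normal(size=200))
b = pd.Series(0.3 * t + rng.normal(size=200))

corr_levels = a.corr(b)               # dominated by the common trend
corr_diffs = a.diff().corr(b.diff())  # trend removed, only noise remains

print(f"levels: {corr_levels:.2f}, differences: {corr_diffs:.2f}")
```

The correlation in levels comes out close to 1, while the correlation of the differenced series is far smaller, which is closer to the truth for two series built from independent noise.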


In [None]:
plt.figure(figsize=(10,4))
plt.plot(df.index, df["value"])
plt.title(f"Raw Levels: {SERIES_ID}")
plt.xlabel("Date")
plt.ylabel("Value")
plt.show()


**Reflection (write in a markdown cell):**  
- What do you notice about the trend or volatility of this series?  
- Does it look stationary or trending?  


## Part 2: First differences

First differences measure period-to-period change and often reduce long-term trends.


In [None]:
# TODO: Compute first differences
df["diff_1"] = None  # replace None with the correct expression

plt.figure(figsize=(10,4))
plt.plot(df.index, df["diff_1"])
plt.title(f"First Differences: {SERIES_ID}")
plt.xlabel("Date")
plt.ylabel("Change")
plt.show()


**Reflection:**  
- How does differencing change the behavior of the series?  
- When might this be more appropriate than levels?  


## Part 3: Percent changes

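Percent changes divide each change by the previous level, so series of very different magnitudes become comparable; economists read them as growth rates. Note that pandas returns a fraction (0.05 means 5%).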
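
As a quick check of what the percent-change operation computes, the sketch below reuses the four-value example series from the primer and compares `pct_change()` with the same calculation written out by hand. The log difference is added as a common alternative growth measure; it is not part of the exercise.

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, 105.0, 103.0, 110.0], name="value")

pct = s.pct_change()             # (x_t - x_{t-1}) / x_{t-1}
by_hand = s.diff() / s.shift(1)  # the same formula written out
log_diff = np.log(s).diff()      # log growth, close to pct when changes are small

print(pd.DataFrame({"pct": pct, "by_hand": by_hand, "log_diff": log_diff}))
```

The first row is `NaN` in every column because there is no earlier value to compare against.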


In [None]:
# TODO: Compute percent change
df["pct_change"] = None  # replace None with the correct expression

plt.figure(figsize=(10,4))
plt.plot(df.index, df["pct_change"])
plt.title(f"Percent Change: {SERIES_ID}")
plt.xlabel("Date")
plt.ylabel("Percent Change")
plt.show()


**Reflection:**  
Why might economists prefer percent change over raw differences?  


## Part 4: Lagged variables

Lagged features capture delayed effects (policy impacts, investment cycles, etc.).
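
To make the alignment concrete, here is a small sketch on an invented monthly series (the DataFrame `toy` and its values are hypothetical). Each lag column places an earlier value on the current row, so the model at date *t* only sees information from before *t*.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5, freq="MS")
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=dates)

toy["lag_1"] = toy["value"].shift(1)  # value from one period earlier
toy["lag_2"] = toy["value"].shift(2)  # value from two periods earlier

print(toy)

# Rows without a full set of lags cannot be used for modeling.
usable = toy.dropna(subset=["lag_1", "lag_2"])
print(len(toy), "rows before,", len(usable), "rows after dropping missing lags")
```

The March row holds 30 in `value`, 20 in `lag_1`, and 10 in `lag_2`. The largest lag determines how many rows are lost at the start of the sample.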


In [None]:
# TODO: Create at least two lagged features
df["lag_1"] = None   # df["value"].shift(1)
df["lag_3"] = None   # df["value"].shift(3) (or pick another lag)

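# Drop only the rows where a lag is missing, not rows with NaN in other columns
lag_df = df.dropna(subset=["lag_1", "lag_3"])
lag_df[["value", "lag_1", "lag_3"]].head()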


**Reflection:**  
What economic story could justify using lagged values for your series?  


## Part 5: Rolling statistics

Rolling averages smooth short-term noise. The window size controls the tradeoff between detail and stability.
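
The window tradeoff can be seen on a short invented series. This sketch compares two window sizes and shows the optional `min_periods` argument of `rolling`, which allows partial windows at the start of the sample.

```python
import pandas as pd

s = pd.Series([100.0, 105.0, 103.0, 110.0, 108.0, 115.0], name="value")

roll_2 = s.rolling(window=2).mean()  # follows the data closely
roll_4 = s.rolling(window=4).mean()  # smoother, but 3 leading NaN values

# min_periods=1 averages whatever is available instead of returning NaN.
roll_4_partial = s.rolling(window=4, min_periods=1).mean()

print(pd.DataFrame({"raw": s, "roll_2": roll_2, "roll_4": roll_4,
                    "roll_4_partial": roll_4_partial}))
```

A window of size *w* leaves *w* - 1 missing values at the start unless `min_periods` is lowered, and early values computed from partial windows are noisier than the rest.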


In [None]:
# TODO: Compute a rolling mean
WINDOW = None  # choose a window (e.g., 3, 6, 12 depending on frequency)
df["rolling_mean"] = None  # replace None with rolling mean expression

plt.figure(figsize=(10,4))
plt.plot(df.index, df["value"], alpha=0.5, label="Raw")
plt.plot(df.index, df["rolling_mean"], label=f"Rolling Mean (window={WINDOW})")
plt.title(f"Rolling Mean vs Raw: {SERIES_ID}")
plt.legend()
plt.show()


**Reflection:**  
- What information is lost and gained with rolling averages?  
- How does the window size change the story?  


## Part 6: Feature comparison

Compare how levels vs differences vs percent changes behave for your chosen series.


In [None]:
features_to_compare = ["value", "diff_1", "pct_change"]

fig, axes = plt.subplots(len(features_to_compare), 1, figsize=(10,8), sharex=True)

for ax, col in zip(axes, features_to_compare):
    ax.plot(df.index, df[col])
    ax.set_title(col)

plt.tight_layout()
plt.show()


## Caution box: Common “bad feature” mistakes (read before modeling)

Feature engineering is powerful — and easy to misuse. Watch out for:

1. **Over-differencing**  
   Differencing a series that is already stationary can turn signal into noise.

2. **Meaningless ratios**  
   Ratios only make sense when numerator/denominator share a logical relationship (units, scale, or economic interpretation).

3. **Leakage through future information**  
   If you compute features using information from the future (e.g., centered rolling windows, or using target values at time *t+1*), your model will look unrealistically good.

4. **Mismatched frequencies**  
   Mixing monthly and quarterly series without resampling/aligning creates artifacts and NaNs.

5. **Lags too large for available history**  
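   A lag of *k* periods leaves the first *k* rows without a value, so large lags can shrink your usable dataset dramatically. For example, a 12-month lag on 5 years of monthly data (60 rows) removes 20% of the rows.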

**Rule of thumb:** prefer simple, interpretable features grounded in an economic story.
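
Mistake 3 (leakage) is easy to reproduce on a toy series. In this sketch the values are invented, with a deliberate jump at the last position so the leak is visible. The shifted variant is one possible way to stay leak-free when the target is the value at time *t* itself.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="value")

trailing = s.rolling(window=3).mean()               # uses t-2, t-1, t
centered = s.rolling(window=3, center=True).mean()  # uses t-1, t, t+1 (future!)
safe = s.shift(1).rolling(window=3).mean()          # uses t-3, t-2, t-1 only

# At position 3 the centered window already "sees" the jump at position 4.
print("trailing:", trailing.iloc[3])
print("centered:", round(centered.iloc[3], 2))
print("safe:    ", safe.iloc[3])
```

At position 3 the trailing mean is 3.0, the centered mean is about 35.67, and the shifted mean is 2.0. Only the centered window has been influenced by a value that would not be known yet.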


## Guided exercise: Build at least 2 combined features

Now engineer **two new features** by combining *multiple* signals. You can do this in two ways:

### Option 1: Use two FRED series
Choose a second FRED series that you believe relates to your first.

Examples:
- **Yield curve spread:** `DGS10 - DGS2`
- **Real wage proxy:** wage growth minus inflation
- **Consumption pressure:** spending growth vs income growth
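
Combining two series only works once they share a frequency. `DGS10` and `DGS2` are daily on FRED, so pairing either one with a monthly series needs a resampling step first. Below is a minimal sketch with synthetic stand-in data; the names `daily_rate` and `monthly_value` are hypothetical, and averaging within each month is one reasonable choice among several (month-end values are another).

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: one daily series and one monthly series.
daily = pd.Series(
    np.arange(90, dtype=float),
    index=pd.date_range("2024-01-01", periods=90, freq="D"),
    name="daily_rate",
)
monthly = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2024-01-01", periods=3, freq="MS"),
    name="monthly_value",
)

# Average the daily series within each month, labeled at the month start.
daily_as_monthly = daily.resample("MS").mean()

# Both series now share a month-start index and can be combined safely.
aligned = pd.concat([monthly, daily_as_monthly], axis=1).dropna()
print(aligned)
```

Joining the raw daily and monthly series without this step would keep only the dates that appear in both, which discards almost all of the daily information.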

### Option 2: Combine transformations of the same series
Examples:
- **Momentum:** `pct_change` over the last 3 periods (3 months for a monthly series)
- **Volatility:** rolling std of `pct_change`
- **Acceleration:** `diff(diff(x))`
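
The three single-series ideas above can be sketched on an invented series as follows. The window of 3 and the variable names are illustrative assumptions; pick values that match the frequency of your own data.

```python
import pandas as pd

s = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0], name="value")
pct = s.pct_change()

momentum_3 = s.pct_change(periods=3)         # growth over the last 3 periods
volatility_3 = pct.rolling(window=3).std()   # rolling std of one-period growth
acceleration = s.diff().diff()               # change in the change

print(pd.DataFrame({"value": s, "momentum_3": momentum_3,
                    "volatility_3": volatility_3, "acceleration": acceleration}))
```

`pct_change(periods=3)` gives the compounded growth over three periods. A rolling sum of one-period percent changes is a close approximation when the changes are small.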

**Deliverable:** implement at least **two** combined features and explain their economic meaning.


In [None]:
# TODO (Option 1): Load a SECOND related FRED series (optional but recommended)
SECOND_SERIES_ID = ""  # e.g., "DGS10" or "UNRATE" etc.

# Uncomment to use a second series:
# df2 = pdr.DataReader(SECOND_SERIES_ID, "fred", START_DATE, END_DATE)
# df2.columns = ["value2"]

# If you loaded df2, merge on time index:
# merged = df.join(df2, how="inner")

# TODO: Feature combination examples (choose at least 2)
# Example spread:
# merged["spread"] = merged["value"] - merged["value2"]

# Example interaction:
# merged["interaction"] = merged["pct_change"] * merged["value2"].pct_change()

# Example volatility from one series:
# df["pct_vol_12"] = df["pct_change"].rolling(window=12).std()

# Example momentum:
# df["momentum_3"] = df["pct_change"].rolling(window=3).sum()

# Display the engineered columns you created:
# df.tail()


## Final reflection (submit)

In a few sentences:
1. Which feature set would you use for modeling and why?  
2. What economic intuition supports your choice?  
3. What assumptions are you making by choosing those features?  
