# Week 3 — Universal Feature Engineering (40–60 minutes)
Works for **monthly or quarterly** datasets.

## Today’s goals
1) Create features: lags, diffs, % changes, rolling means
2) Compare levels vs changes
3) Explore a lag relationship
4) Write a short explanation


## Quick Python Basics Recap (for this week)

This notebook assumes **you are still new to Python**. Below is the minimum syntax you need today.

### DataFrames (pandas)
```python
import pandas as pd
df = pd.read_csv("your_file.csv")
df.head()            # shows first rows
df.columns           # column names
df["col_name"]       # select one column (a Series)
df[["a","b"]]        # select multiple columns (a DataFrame)
df.isna().sum()      # missing values per column
```

### Making plots (matplotlib)
```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot(df["x"], df["y"])   # line plot
plt.title("Title")
plt.xlabel("x label")
plt.ylabel("y label")
plt.tight_layout()
plt.savefig("figures/example.png", dpi=150)
plt.show()
```
Key idea: **You build a plot step-by-step**, then save it with `savefig`.

### Writing comments
- Use `#` for a comment on one line.
- In this course, you must explain what your code does and what you learned from each plot.

### Strings and f-strings (for readable printing)
```python
value = 3.14
print(f"The value is {value}")
```


## Final Project Artifacts (you must create these)

By the end of this notebook, you must have:
1. At least **2 saved figures** in the `figures/` folder (PNG files).
2. A short **Insights** write-up answering: What changed? What matters? What would you model next?

If the `figures/` folder does not exist yet, create it with:
```python
import os
os.makedirs("figures", exist_ok=True)
```


## Syntax Toolbox for Week 3 (Feature Engineering)

You will see these pandas patterns later. Here is the syntax **before** we use it.

### 1) Creating a new column
```python
df["new_col"] = df["old_col"] * 100
```

### 2) Percent change (`pct_change`)
```python
df["unrate_pct"] = df["UNRATE"].pct_change() * 100
```
- `pct_change()` computes `(current - previous) / previous`.

### 3) Difference (`diff`)
```python
df["unrate_diff"] = df["UNRATE"].diff()
```
- `diff()` computes `current - previous`.

### 4) Rolling mean (`rolling`)
```python
df["unrate_roll4"] = df["UNRATE"].rolling(window=4).mean()
```
- A rolling mean smooths short-term noise.
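To make the smoothing concrete, here is a small sketch on a made-up series (the name `tiny` and its values are illustrative, not course data). It shows that `rolling(window=4).mean()` averages the current row and the three rows before it, so the first three rows have no result yet.

```python
import pandas as pd

# Hypothetical toy series; the numbers are made up for illustration.
tiny = pd.DataFrame({"x": [4.0, 6.0, 5.0, 9.0, 8.0, 10.0]})

# Each rolling value is the mean of the current row and the 3 rows before it.
tiny["x_roll4"] = tiny["x"].rolling(window=4).mean()

# The first complete window covers rows 0..3: (4 + 6 + 5 + 9) / 4 = 6.0
first_full = tiny.loc[3, "x_roll4"]

# Rows 0, 1, 2 do not have 4 observations yet, so they are NaN.
n_missing = int(tiny["x_roll4"].isna().sum())

print(tiny)
print(f"First full-window mean: {first_full}, missing rows: {n_missing}")
```

A larger window smooths more but also leaves more empty rows at the start.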

### 5) Lag features (`shift`)
```python
df["unrate_lag1"] = df["UNRATE"].shift(1)
```
- `shift(1)` moves every value down one row, so each row also holds the previous period's value. This lets a model use the past to explain the present.

### 6) Dropping missing values created by features
```python
df_feat = df.dropna()
```
- `pct_change`, `diff`, `rolling`, and `shift` create `NaN` at the start.
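The sketch below counts how many `NaN` values each transformation creates on a made-up series (the name `toy` and its values are assumptions for illustration). It also shows how many rows survive `dropna()` once all features are combined.

```python
import pandas as pd

# Hypothetical toy data standing in for a real series such as UNRATE.
toy = pd.DataFrame({"x": [5.0, 5.2, 5.1, 5.4, 5.6, 5.5]})

toy["x_pct"] = toy["x"].pct_change() * 100          # 1 NaN: row 0 has no previous value
toy["x_diff"] = toy["x"].diff()                     # 1 NaN
toy["x_lag1"] = toy["x"].shift(1)                   # 1 NaN
toy["x_roll4"] = toy["x"].rolling(window=4).mean()  # 3 NaN: window - 1 rows

nan_counts = toy.isna().sum()
toy_feat = toy.dropna()  # keeps only rows where every feature is available

print(nan_counts)
print(f"Rows before: {len(toy)}, rows after dropna: {len(toy_feat)}")
```

The feature with the longest look-back (here the 4-period rolling mean) decides how many rows are lost.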


> **Expanded version** (generated 2026-01-05). Added extra coding + commenting + writing tasks.


## 1) Setup + load your dataset (10–15 min)

In [None]:
# If needed (first time only):
# !pip -q install pandas_datareader scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (11, 4)  # Default figure size for every plot in this notebook
plt.rcParams["axes.grid"] = True  # Show grid lines by default

In [None]:
# Explanation:
# - Goal: Run and understand the steps below
# - What you should check after running:
#   1) The output has the expected shape/columns
#   2) The values look reasonable (no obvious NaNs or impossible values)
#   3) Any figures have clear titles/labels and are saved to disk when required
#
# How to read this code:
# - Imports / configuration come first
# - Then we compute intermediate variables (feature engineering)
# - Then we summarize / visualize
# - Finally, we write a short interpretation in Markdown below the figure/table

from pandas_datareader import data as pdr

def fetch_fred_series(series_id: str, start="1990-01-01", end=None) -> pd.DataFrame:
    """Fetch one FRED series as a DataFrame with a datetime index."""
    if end is None:
        end = pd.Timestamp.today().strftime("%Y-%m-%d")
    s = pdr.DataReader(series_id, "fred", start, end)
    s.columns = [series_id]
    s.index = pd.to_datetime(s.index)  # Ensure the index is a DatetimeIndex
    return s

def fetch_many(series_ids, start="1990-01-01"):
    dfs = [fetch_fred_series(s, start=start) for s in series_ids]
    return pd.concat(dfs, axis=1).sort_index()

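def infer_freq(index: pd.DatetimeIndex) -> str:
    """Classify an index as quarterly ("Q"), monthly ("M"), or unknown ("U")."""
    f = pd.infer_freq(index)
    if f is None:
        return "U"
    # Drop anchors such as "-DEC" or "-MON" so weekly "W-MON" is not read as monthly.
    base = f.upper().split("-")[0]
    if "Q" in base:
        return "Q"
    if base in {"M", "MS", "ME", "BM", "BMS", "BME"}:
        return "M"
    return "U"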

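def to_period_end(df: pd.DataFrame, target: str) -> pd.DataFrame:
    # Default: last observation within each period.
    # Note: pandas >= 2.2 deprecates the "M"/"Q" resample aliases in favor of
    # "ME"/"QE"; switch to those if you see a FutureWarning or an error here.
    if target == "M":
        return df.resample("M").last()
    if target == "Q":
        return df.resample("Q").last()
    raise ValueError("target must be 'M' or 'Q'")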

def add_common_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for c in out.columns:
        out[f"{c}_lag1"] = out[c].shift(1)
        out[f"{c}_diff1"] = out[c].diff(1)
        out[f"{c}_pct1"] = out[c].pct_change(1) * 100
        out[f"{c}_roll3"] = out[c].rolling(3).mean()
    return out

In [None]:
# ===========================
# STUDENT CHOICE (EDIT HERE)
# ===========================
# Choose 3–6 FRED series IDs relevant to your question.
# Search on https://fred.stlouisfed.org and copy the series ID.

series_ids = [
    "UNRATE",
    "CPIAUCSL",
    "FEDFUNDS"
]

# Choose your target variable (must be one of the series_ids)
target_id = "CPIAUCSL"

start_date = "1990-01-01"

In [None]:
df_raw = fetch_many(series_ids, start=start_date)

# Infer each series' frequency
freqs = {c: infer_freq(df_raw[c].dropna().index) for c in df_raw.columns}  # Handle missing values
freqs

In [None]:
# Rule: if any series is quarterly, use quarterly for everything (safe when mixing).
use_freq = "Q" if any(v == "Q" for v in freqs.values()) else "M"
print("Using frequency:", use_freq)

df = to_period_end(df_raw, use_freq)

# Drop rows where target is missing (we can’t model without target)
df = df.dropna(subset=[target_id])  # Handle missing values

# Missing-value strategies:
df_complete = df.dropna()   # simplest: keep only complete rows
df_ffill = df.ffill()       # common: forward-fill predictors with the last known value

# Choose ONE:
df_use = df_complete   # or df_ffill

df_use.head()
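The two decisions in the cell above (align everything to the lowest frequency, then pick a missing-value strategy) can be tried on a small made-up example. This is a sketch under stated assumptions: the frames `mixed` and `gappy` are hypothetical, and the alignment uses `groupby` on `to_period("Q")` instead of the notebook's `to_period_end` helper, so the result has a `PeriodIndex`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one monthly series and one quarterly series
# (the quarterly value is only recorded in the last month of each quarter).
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
mixed = pd.DataFrame(
    {
        "monthly": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "quarterly": [np.nan, np.nan, 10.0, np.nan, np.nan, 20.0],
    },
    index=idx,
)

# Group by calendar quarter and keep the last non-missing value in each group.
quarterly = mixed.groupby(mixed.index.to_period("Q")).last()
print(quarterly)

# Hypothetical frame with gaps in one predictor.
gappy = pd.DataFrame(
    {"target": [1.0, 2.0, 3.0, 4.0], "pred": [10.0, np.nan, 30.0, np.nan]}
)
complete = gappy.dropna()  # keeps only rows 0 and 2
filled = gappy.ffill()     # rows 1 and 3 reuse the last known value

print(f"dropna keeps {len(complete)} rows; ffill keeps {len(filled)} rows")
```

Dropping rows is the most conservative choice but costs observations. Forward-filling keeps every row but assumes the last known value is still valid.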

## 2) Create engineered features (10 min)

### Function Writing (Required)

You will write two reusable functions. This is a core Python skill.

## A) `make_features(df, col, window)`
**Goal:** create new columns from one base column (percent change, difference, rolling mean).

### Syntax you need
**Function header**
```python
def my_function(arg1, arg2):
    # body (indented)
    return something
```

**Creating a new column in a DataFrame**
```python
df["new_name"] = df[col].pct_change() * 100
```

### Example (tiny)
```python
tiny = pd.DataFrame({"x": [10, 11, 12, 15]})
tiny["x_pct"] = tiny["x"].pct_change() * 100
tiny
```

## B) `make_lags(df, col, lags)`
**Goal:** create lag columns like `col_lag1`, `col_lag2`, ...

### Syntax you need
```python
df["x_lag1"] = df["x"].shift(1)
```

### Example (tiny)
```python
tiny["x_lag1"] = tiny["x"].shift(1)
tiny
```

When you are done, you should be able to call:
```python
df_feat = make_features(df_use, target_id, window=4)
df_feat = make_lags(df_feat, target_id, lags=[1, 2, 3, 4])
```
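If you get stuck, the sketch below shows one possible shape for the two functions. It is not the required solution: the column names (`{col}_pct`, `{col}_diff`, `{col}_roll{window}`, `{col}_lag{k}`) are assumptions chosen for illustration, and the toy frame is made up.

```python
import pandas as pd


def make_features(df, col, window):
    """Return a copy of df with percent change, difference, and rolling mean of col."""
    out = df.copy()  # work on a copy so the caller's DataFrame is unchanged
    out[f"{col}_pct"] = out[col].pct_change() * 100
    out[f"{col}_diff"] = out[col].diff()
    out[f"{col}_roll{window}"] = out[col].rolling(window=window).mean()
    return out


def make_lags(df, col, lags):
    """Return a copy of df with one lag column per entry in lags."""
    out = df.copy()
    for k in lags:
        out[f"{col}_lag{k}"] = out[col].shift(k)
    return out


# Usage on a hypothetical toy frame (not the course dataset).
toy = pd.DataFrame({"x": [10.0, 11.0, 12.0, 15.0, 18.0]})
toy_feat = make_features(toy, "x", window=2)
toy_feat = make_lags(toy_feat, "x", lags=[1, 2])
print(toy_feat.columns.tolist())
```

Returning a copy rather than modifying `df` in place means you can rerun the cell without stacking duplicate columns onto the original data.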


In [None]:

# TODO (STUDENTS):
import numpy as np
import pandas as pd

def make_features(df, col, window):
    """Create percent-change, difference, and rolling-mean features for `col`."""
    df = df.copy()
    # TODO
    return df

def make_lags(df, col, lags):
    """Create lag features for `col` for each lag in `lags`."""
    df = df.copy()
    # TODO
    return df

In [None]:
# SOLUTION (INSTRUCTOR): Example implementation scaffold
import numpy as np
import matplotlib.pyplot as plt

num_cols = df_use.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns:", num_cols[:10])

# Plot the first numeric column as a quick sanity check
col = num_cols[0]
fig, ax = plt.subplots(figsize=(10, 4))
df_use[col].plot(ax=ax)
ax.set_title(f"Example plot for {col}")
ax.set_xlabel("Date")
ax.set_ylabel("Value")
fig.tight_layout()
plt.show()

### Quick Test (Required)
Add **3 assertions** proving your functions work (new columns exist, row counts, shift correctness).
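As a pattern to adapt, here is a sketch of the three kinds of checks on a made-up frame. The names `toy` and `toy_lagged` are hypothetical, and the lag column is built by hand so the example runs on its own. Replace it with a call to your own functions.

```python
import pandas as pd

# Hypothetical toy frame and a hand-built lag column to test against.
toy = pd.DataFrame({"x": [10, 11, 12, 15]})
toy_lagged = toy.copy()
toy_lagged["x_lag1"] = toy_lagged["x"].shift(1)

# 1) The new column exists.
assert "x_lag1" in toy_lagged.columns

# 2) Adding a feature does not change the number of rows.
assert len(toy_lagged) == len(toy)

# 3) Shift correctness: row 0 has no past value, and each later row
#    holds the value from the row above it.
assert pd.isna(toy_lagged.loc[0, "x_lag1"])
assert toy_lagged.loc[1:, "x_lag1"].tolist() == toy["x"].iloc[:-1].tolist()

print("Toy checks passed")
```

An `assert` that fails stops the cell with an `AssertionError`, which tells you exactly which expectation was wrong.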


In [None]:

# TODO:
# assert ...
# assert ...
# assert ...
print("Checks passed (once you complete the assertions above)")

In [None]:
df_feat = add_common_features(df_use[series_ids]).dropna()  # Handle missing values
df_feat.head()

## 3) Choose your target form (5 min)

In [None]:
y_level = df_feat[target_id]
y_pct = df_feat[f"{target_id}_pct1"]
y_diff = df_feat[f"{target_id}_diff1"]

# TODO: choose ONE
y = y_pct.rename("y")  # rename returns a new Series, so y_pct keeps its own name
y.head()

## 4) Compare: levels vs changes (15–20 min)

### Deeper Comparison (Required)
Create **3 scatter plots**:
1) y level vs X level  
2) y change vs X change  
3) y change vs lagged X change (choose 1 lag)

Add a best-fit line for each and write 2–3 bullets interpreting each plot.
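One way to add a best-fit line is a degree-1 `np.polyfit`, sketched below. The helper name `scatter_with_fit` and the demo arrays are assumptions for illustration, not part of the course code.

```python
import numpy as np
import matplotlib.pyplot as plt


def scatter_with_fit(ax, x, y, title):
    """Scatter y against x on ax and overlay a least-squares straight line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # degree-1 fit = straight line
    x_line = np.linspace(x.min(), x.max(), 50)
    ax.scatter(x, y, alpha=0.5)
    ax.plot(x_line, slope * x_line + intercept, color="black")
    ax.set_title(f"{title} (slope = {slope:.2f})")
    return slope, intercept


# Hypothetical data: y is exactly 2*x + 1, so the fit should recover those numbers.
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0

fig, ax = plt.subplots(figsize=(5, 4))
slope, intercept = scatter_with_fit(ax, x_demo, y_demo, "Demo fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.tight_layout()
```

`np.polyfit` does not skip missing values, so pass it columns from a frame that has already been through `dropna()`, such as `df_feat`.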


In [None]:

# TODO:
# Make the 3 scatter plots described above.

In [None]:
predictor_id = [s for s in series_ids if s != target_id][0]

# Plot 1: levels of the predictor against levels of the target
plt.figure()
plt.scatter(df_feat[predictor_id], df_feat[target_id], alpha=0.5)
plt.title(f"Levels: {predictor_id} vs {target_id}")
plt.xlabel(predictor_id)
plt.ylabel(target_id)
plt.show()

# Plot 2: percent changes of the predictor against percent changes of the target
plt.figure()
plt.scatter(df_feat[f"{predictor_id}_pct1"], df_feat[f"{target_id}_pct1"], alpha=0.5)
plt.title(f"Changes: {predictor_id}_pct1 vs {target_id}_pct1")
plt.xlabel(f"{predictor_id}_pct1")
plt.ylabel(f"{target_id}_pct1")
plt.show()

**Instructor note:** explain drift vs movement.
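A small simulation can illustrate the point. In the sketch below, two made-up series share only an upward trend (the names `a` and `b`, the slopes, and the noise level are all assumptions). Their levels look strongly related, while their period-to-period changes do not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
t = np.arange(n)

# Two hypothetical series that share nothing except an upward trend.
demo = pd.DataFrame(
    {
        "a": 0.5 * t + rng.normal(0, 1, n),
        "b": 0.3 * t + rng.normal(0, 1, n),
    }
)

# Levels: the common trend (drift) dominates the correlation.
corr_levels = demo["a"].corr(demo["b"])

# Changes: differencing removes the trend and leaves only the independent noise.
corr_changes = demo["a"].diff().corr(demo["b"].diff())

print(f"Correlation of levels:  {corr_levels:.3f}")
print(f"Correlation of changes: {corr_changes:.3f}")
```

A high correlation in levels can therefore come from shared drift alone, which is why the changes plot is usually the more honest comparison.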

## 5) Lag exploration (10 min)

### Lag Selection Mini-Experiment (Required)
Test multiple lags and build a results table:
- lag
- correlation
- abs correlation

Sort by abs correlation and pick the best lag. Explain why.
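The experiment can be sketched on simulated data where the true delay is known. Everything here is hypothetical: `x`, `y`, the 2-period delay, and the list of lags are assumptions chosen so you can check that the table finds the right answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300

# Hypothetical data: y follows x with a 2-period delay, plus a little noise.
x = pd.Series(rng.normal(0, 1, n), name="x")
y = (x.shift(2) + rng.normal(0, 0.1, n)).rename("y")

rows = []
for lag in [0, 1, 2, 3, 4]:
    corr = y.corr(x.shift(lag))  # rows with NaN are skipped pairwise
    rows.append({"lag": lag, "corr": corr, "abs_corr": abs(corr)})

lag_results = (
    pd.DataFrame(rows)
    .sort_values("abs_corr", ascending=False)
    .reset_index(drop=True)
)
best_lag = int(lag_results.loc[0, "lag"])

print(lag_results)
print(f"Best lag on this toy data: {best_lag}")
```

Sorting by absolute correlation matters because a strong negative relationship is just as informative as a strong positive one.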


In [None]:

# TODO:
# lags = [...]
# rows = []
# for lag in lags:
#     ...
# lag_results = pd.DataFrame(rows).sort_values("abs_corr", ascending=False)
# display(lag_results)

In [None]:
# Lagged predictor level against the target's percent change
plt.figure()
plt.scatter(df_feat[f"{predictor_id}_lag1"], df_feat[f"{target_id}_pct1"], alpha=0.5)
plt.title(f"Lag: {predictor_id}_lag1 vs {target_id}_pct1")
plt.xlabel(f"{predictor_id}_lag1")
plt.ylabel(f"{target_id}_pct1")
plt.show()

**Instructor note:** keep lag length at 1 to reduce cognitive load.

## 6) Reflection (8–10 min)
Write 5–7 sentences.

**Instructor example:** Changes reduce drift; lags can capture delayed effects; rolling means smooth noise. Next: test more lags and compare models.

## End-of-class checkpoint
You should now have `df_feat`, the level-vs-change comparisons, and your written reflection.

## Final Project Artifacts (Save these for your report)

By the end of the project, you should have:
- At least 2 polished figures that show *trends* and *relationships*
- A small table of your **top correlations** with the target
- A short written interpretation of what the plots suggest

In this section you will generate and save figures you can reuse in your final write-up.
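For the table of top correlations, one possible approach is sketched below on made-up data (the frame `toy` and its columns are hypothetical). In the notebook you would apply the same steps to `df_feat` with `target_id` as the target column.

```python
import pandas as pd

# Hypothetical data; in the notebook you would use df_feat and target_id instead.
toy = pd.DataFrame(
    {
        "target": [1.0, 2.0, 3.0, 4.0, 5.0],
        "a": [2.0, 4.0, 6.0, 8.0, 10.0],
        "b": [5.0, 4.0, 3.0, 2.0, 2.0],
        "c": [1.0, 3.0, 2.0, 5.0, 4.0],
    }
)

# Correlation of every column with the target, excluding the target itself.
corr_with_target = toy.corr()["target"].drop("target")

# Rank by absolute value so strong negative relationships are not pushed to the bottom.
top_corr = (
    corr_with_target.to_frame("corr")
    .assign(abs_corr=lambda d: d["corr"].abs())
    .sort_values("abs_corr", ascending=False)
)
print(top_corr.round(3))
```

Use `top_corr.head(5)` to keep the table short enough for a report.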


In [None]:
# SOLUTION (INSTRUCTOR):
# Save a trend figure and a relationship figure using available numeric columns.

import os
import numpy as np
import matplotlib.pyplot as plt

os.makedirs("figures", exist_ok=True)  # Create output folder if it does not exist

num_cols = df_use.select_dtypes(include=[np.number]).columns.tolist()
if len(num_cols) == 0:
    raise ValueError("No numeric columns found in df_use. Check data loading/cleaning steps.")

# Trend figure: first numeric column
col_trend = num_cols[0]
fig, ax = plt.subplots(figsize=(10,4))  # Create a plot for interpretation / reporting
df_use[col_trend].plot(ax=ax)  # Create a plot for interpretation / reporting
ax.set_title(f"Trend of {col_trend} over time")
ax.set_xlabel("Date")
ax.set_ylabel("Value")
fig.tight_layout()
fig.savefig(f"figures/trend_{col_trend}.png", dpi=200)  # Save an artifact you can reuse in the final project

# Relationship figure: correlation heatmap (top 8 numeric cols)
use_cols = num_cols[:8]
corr = df_use[use_cols].corr()  # Compute correlations to look for relationships

fig, ax = plt.subplots(figsize=(7, 6))
im = ax.imshow(corr.values, vmin=-1, vmax=1)  # Fix the color scale to the full correlation range
ax.set_xticks(range(len(use_cols)))
ax.set_yticks(range(len(use_cols)))
ax.set_xticklabels(use_cols, rotation=45, ha="right")
ax.set_yticklabels(use_cols)
ax.set_title("Correlation heatmap (subset of variables)")
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
fig.tight_layout()
fig.savefig("figures/corr_heatmap.png", dpi=200)  # Save an artifact you can reuse in the final project

corr.round(3)

### Written Insights (Required)

Write 5–8 bullet points answering:
1. Which variable trends most strongly over time? What might explain it?
2. Which pair of variables looks most related? Is that relationship stable over time?
3. What missingness or outliers could bias modeling?
4. What is one feature engineering idea you want to try next week?


#### Sample instructor bullets (example)

- The first numeric series shows a clear upward trend; this suggests non-stationarity and motivates using percent change or differencing.
- Correlation heatmap indicates several variables move together, suggesting multicollinearity; regularization may help.
- Missingness is concentrated in a small set of columns; imputation strategy should be justified and tested.
- Outliers appear around major economic events; consider robust scaling or winsorization and document the rationale.
- Next week: create lag features and rolling aggregates to capture delayed effects.
