# Week 5 — Universal Modeling (Regression) (40–60 minutes)
Works for **monthly or quarterly** datasets.

## Today’s goals
1) Choose target (level vs change)
2) Choose small feature set (3–8 cols)
3) Train baseline regression
4) Evaluate with time split
5) Interpret coefficients


## Quick Python Basics Recap (for this week)

This notebook assumes **you are still new to Python**. Below is the minimum syntax you need today.

### DataFrames (pandas)
```python
import pandas as pd
df = pd.read_csv("your_file.csv")
df.head()            # shows first rows
df.columns           # column names
df["col_name"]       # select one column (a Series)
df[["a","b"]]        # select multiple columns (a DataFrame)
df.isna().sum()      # missing values per column
```

### Making plots (matplotlib)
```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot(df["x"], df["y"])   # line plot
plt.title("Title")
plt.xlabel("x label")
plt.ylabel("y label")
plt.tight_layout()
plt.savefig("figures/example.png", dpi=150)
plt.show()
```
Key idea: **You build a plot step-by-step**, then save it with `savefig`.

### Writing comments
- Use `#` for a comment on one line.
- In this course, you must explain what your code does and what you learned from each plot.

### Strings and f-strings (for readable printing)
```python
value = 3.14
print(f"The value is {value}")
```


## Final Project Artifacts (you must create these)

By the end of this notebook, you must have:
1. At least **2 saved figures** in the `figures/` folder (PNG files).
2. A short **Insights** write-up answering: What changed? What matters? What would you model next?

If you cannot find `figures/`, create it using:
```python
import os
os.makedirs("figures", exist_ok=True)
```


## Syntax Toolbox for Week 5 (Modeling)

This week introduces machine learning code. Here is the syntax before you use it.

### 1) Selecting feature columns (X) and target (y)
```python
X = df[["feat1", "feat2"]]
y = df["target"]
```

### 2) Train/test split
```python
from sklearn.model_selection import train_test_split
# shuffle=False keeps rows in time order: the first 80% train, the last 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
```
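- The model learns from `train` and is evaluated on `test`.
- For time-ordered data, do not shuffle: the test set should be the most recent period, as in the time split in Section 4.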

### 3) Fit a model and predict
```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)
```

### 4) Evaluate with MAE / RMSE
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5  # square root of MSE; the `squared=False` argument is not available in recent scikit-learn
```
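
To make the two metrics concrete, here is a small sketch that computes MAE and RMSE by hand with NumPy. The arrays `actual` and `predicted` are made-up numbers for illustration, not course data.

```python
import numpy as np

# Made-up values, for illustration only
actual = np.array([2.0, 4.0, 6.0, 8.0])
predicted = np.array([3.0, 4.0, 4.0, 8.0])

errors = actual - predicted                    # [-1, 0, 2, 0]
mae_manual = np.mean(np.abs(errors))           # average absolute error
rmse_manual = np.sqrt(np.mean(errors ** 2))    # square, average, then take the square root

print(f"MAE  = {mae_manual:.3f}")
print(f"RMSE = {rmse_manual:.3f}")
```

RMSE is never smaller than MAE. A large gap between them suggests that a few big errors dominate.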

### 5) A scatter plot to check prediction quality
```python
plt.scatter(y_test, pred)
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Actual vs Predicted")
plt.show()
```


> **Expanded version** (generated 2026-01-05). Added extra coding + commenting + writing tasks.


## 1) Setup + load + engineer (15 min)

In [None]:
# If needed (first time only):
# !pip -q install pandas_datareader scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (11, 4)  # default figure size (width, height) in inches
plt.rcParams["axes.grid"] = True  # show grid lines on every plot

In [None]:
# Explanation:
# - Goal: Run and understand the steps below
# - What you should check after running:
#   1) The output has the expected shape/columns
#   2) The values look reasonable (no obvious NaNs or impossible values)
#   3) Any figures have clear titles/labels and are saved to disk when required
#
# How to read this code:
# - Imports / configuration come first
# - Then we compute intermediate variables (feature engineering)
# - Then we summarize / visualize
# - Finally, we write a short interpretation in Markdown below the figure/table

from sklearn.linear_model import LinearRegression  # Fit a model (baseline or predictive)
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
from pandas_datareader import data as pdr

def fetch_fred_series(series_id: str, start="1990-01-01", end=None) -> pd.DataFrame:
    """Fetch one FRED series as a DataFrame with a datetime index."""
    if end is None:
        end = pd.Timestamp.today().strftime("%Y-%m-%d")
    s = pdr.DataReader(series_id, "fred", start, end)
    s.columns = [series_id]
    s.index = pd.to_datetime(s.index)  # Ensure date/time column is parsed correctly
    return s

def fetch_many(series_ids, start="1990-01-01"):
    dfs = [fetch_fred_series(s, start=start) for s in series_ids]
    return pd.concat(dfs, axis=1).sort_index()

def infer_freq(index: pd.DatetimeIndex) -> str:
    f = pd.infer_freq(index)
    if f is None:
        return "U"
    f = f.upper()
    if "Q" in f:
        return "Q"
    if "M" in f:
        return "M"
    return "U"

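def to_period_end(df: pd.DataFrame, target: str) -> pd.DataFrame:
    # Default: last observation within each period, stamped at the period end.
    # Offset objects replace the "M"/"Q" string aliases, which newer pandas versions deprecate.
    if target == "M":
        return df.resample(pd.offsets.MonthEnd()).last()
    if target == "Q":
        return df.resample(pd.offsets.QuarterEnd(startingMonth=12)).last()
    raise ValueError("target must be 'M' or 'Q'")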

def add_common_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for c in out.columns:
        out[f"{c}_lag1"] = out[c].shift(1)
        out[f"{c}_diff1"] = out[c].diff(1)
        out[f"{c}_pct1"] = out[c].pct_change(1) * 100
        out[f"{c}_roll3"] = out[c].rolling(3).mean()
    return out
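
The four columns that `add_common_features` creates are easier to read on a tiny example. The sketch below applies the same pandas calls to a made-up five-row series (the name `toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy monthly series (made-up values) to show what each engineered column contains
toy = pd.DataFrame(
    {"x": [100.0, 102.0, 101.0, 105.0, 110.0]},
    index=pd.date_range("2020-01-31", periods=5, freq=pd.offsets.MonthEnd()),
)

toy["x_lag1"] = toy["x"].shift(1)              # previous period's value
toy["x_diff1"] = toy["x"].diff(1)              # change since the previous period
toy["x_pct1"] = toy["x"].pct_change(1) * 100   # percent change since the previous period
toy["x_roll3"] = toy["x"].rolling(3).mean()    # mean of the current and two previous values

print(toy.round(2))
print("Rows left after dropna():", len(toy.dropna()))
```

The first rows contain NaN because there is no earlier value to look back at, which is why the notebook calls `.dropna()` after feature engineering. Note that the rolling mean includes the current period's value, so it is not a purely lagged feature.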

In [None]:
# ===========================
# STUDENT CHOICE (EDIT HERE)
# ===========================
# Choose 3–6 FRED series IDs relevant to your question.
# Search on https://fred.stlouisfed.org and copy the series ID.

series_ids = [
    "UNRATE",
    "CPIAUCSL",
    "FEDFUNDS"
]

# Choose your target variable (must be one of the series_ids)
target_id = "CPIAUCSL"

start_date = "1990-01-01"

In [None]:
df_raw = fetch_many(series_ids, start=start_date)

# Infer each series' frequency
freqs = {c: infer_freq(df_raw[c].dropna().index) for c in df_raw.columns}  # Handle missing values
freqs

In [None]:
# Rule: if any series is quarterly, use quarterly for everything (safe when mixing).
use_freq = "Q" if any(v == "Q" for v in freqs.values()) else "M"
print("Using frequency:", use_freq)

df = to_period_end(df_raw, use_freq)

# Drop rows where target is missing (we can’t model without target)
df = df.dropna(subset=[target_id])  # Handle missing values

# Missing-value strategies:
df_complete = df.dropna()   # simplest: keep only complete rows
df_ffill = df.ffill()       # common: forward-fill predictors (replaces the deprecated fillna(method="ffill"))

# Choose ONE:
df_use = df_complete   # or df_ffill

df_use.head()
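
The two missing-value strategies can keep very different numbers of rows. Here is a minimal sketch on made-up data in which column `b` is observed only every third month (the frame `toy` and its values are assumptions, not FRED data).

```python
import numpy as np
import pandas as pd

# Made-up data: column "b" is only observed every third month
toy = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     "b": [10.0, np.nan, np.nan, 13.0, np.nan, np.nan]},
    index=pd.date_range("2021-01-31", periods=6, freq=pd.offsets.MonthEnd()),
)

complete = toy.dropna()   # keeps only rows where every column is observed
filled = toy.ffill()      # carries the last known value forward in time

print("Rows kept by dropna():", len(complete))
print("Rows kept by ffill() :", len(filled.dropna()))
print(filled)
```

Forward-filling only uses past values, so it does not leak future information, but the filled values can be stale. Whichever strategy you choose, state it in your write-up.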

In [None]:
df_feat = add_common_features(df_use[series_ids]).dropna()  # Handle missing values
df_feat.head()

## 2) Choose y (5 min)

In [None]:
# TODO: choose level/pct/diff
y = df_feat[f"{target_id}_pct1"]
y.name = "y"
y.head()

## 3) Choose X (10 min)

In [None]:
predictors = [s for s in series_ids if s != target_id]
cols=[]
for s in predictors:
    cols += [f"{s}_lag1", f"{s}_pct1", f"{s}_roll3"]

X = df_feat[cols].copy()
X.head()

## 4) Time split (8–10 min)

In [None]:
cut = int(len(df_feat)*0.8)
X_train, X_test = X.iloc[:cut], X.iloc[cut:]
y_train, y_test = y.iloc[:cut], y.iloc[cut:]
len(X_train), len(X_test)
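
Why split by position instead of at random? The sketch below, on made-up data with a date index (`X_demo` and `y_demo` are illustrative names), shows that a chronological split keeps every training date before every test date, while a shuffled split usually does not.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up monthly data: 20 rows with a date index
idx = pd.date_range("2020-01-31", periods=20, freq=pd.offsets.MonthEnd())
X_demo = pd.DataFrame({"feat": np.arange(20, dtype=float)}, index=idx)
y_demo = pd.Series(np.arange(20, dtype=float) * 2.0, index=idx, name="y")

# Chronological split: first 80% of rows for training, last 20% for testing
cut_demo = int(len(X_demo) * 0.8)
X_tr, X_te = X_demo.iloc[:cut_demo], X_demo.iloc[cut_demo:]
y_tr, y_te = y_demo.iloc[:cut_demo], y_demo.iloc[cut_demo:]
print("Train ends :", X_tr.index.max().date())
print("Test starts:", X_te.index.min().date())

# A shuffled split can mix later rows into training, which a real forecast would never have
X_tr_s, X_te_s, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print("Shuffled train contains dates after the test start:",
      bool((X_tr_s.index > X_te_s.index.min()).any()))
```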

## 5) Fit + evaluate (10–12 min)

###  Baseline Model (Required)
Before using regression, build a **baseline** so you can say whether your model is actually helpful.

Pick one baseline:
- predict the training mean of y
- "last value" baseline (ŷ_t = y_{t-1})
- simple moving average baseline

**Deliverable:** compute baseline MAE and compare to your regression MAE.


In [None]:

# TODO (STUDENTS):
# Implement one baseline and compute MAE on the test set.
# Print a comparison:
# - baseline MAE
# - regression MAE
# - % improvement (if any)

In [None]:
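# SOLUTION (INSTRUCTOR): one possible implementation (two baselines vs. a plain regression)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Baseline 1: always predict the training mean of y
pred_mean = np.full(len(y_test), y_train.mean())
mae_mean = mean_absolute_error(y_test, pred_mean)

# Baseline 2: "last value" (ŷ_t = y_{t-1}); the first test row uses the last training value
pred_last = y.shift(1).loc[y_test.index]
mae_last = mean_absolute_error(y_test, pred_last)

# Regression fitted on the training period only
reg_pred = LinearRegression().fit(X_train, y_train).predict(X_test)
mae_reg = mean_absolute_error(y_test, reg_pred)

best_baseline = min(mae_mean, mae_last)
print(f"Baseline MAE (train mean): {mae_mean:.4f}")
print(f"Baseline MAE (last value): {mae_last:.4f}")
print(f"Regression MAE           : {mae_reg:.4f}")
print(f"Improvement vs best baseline: {100 * (best_baseline - mae_reg) / best_baseline:.1f}%")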

###  Pipeline Practice (Required)
Build a scikit-learn **Pipeline** that:
1) imputes missing values (if any)
2) scales features
3) fits a model

Then try **two models**:
- LinearRegression (baseline)
- Ridge OR Lasso (regularized)

**Deliverable:** which one generalizes better on the test set? Why might that be?


In [None]:

# TODO (STUDENTS):
# Create a Pipeline using sklearn.
# Fit LinearRegression and Ridge (or Lasso) and compare test MAE/RMSE.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso  # Fit a model (baseline or predictive)
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# pipe_lr = Pipeline([...])
# pipe_ridge = Pipeline([...])

In [None]:
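# SOLUTION (INSTRUCTOR): one possible Pipeline comparison
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

def make_pipe(estimator):
    # Impute -> scale -> model; every step is fitted on the training data only
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", estimator),
    ])

pipes = {
    "LinearRegression": make_pipe(LinearRegression()),
    "Ridge(alpha=1.0)": make_pipe(Ridge(alpha=1.0)),  # alpha=1.0 is an untuned default
}

for name, pipe in pipes.items():
    pred_pipe = pipe.fit(X_train, y_train).predict(X_test)
    mae = mean_absolute_error(y_test, pred_pipe)
    rmse = mean_squared_error(y_test, pred_pipe) ** 0.5
    print(f"{name:18s} MAE={mae:.4f}  RMSE={rmse:.4f}")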

In [None]:
model = LinearRegression().fit(X_train, y_train)  # fit on the training period only
pred_test = model.predict(X_test)
mse_test = mean_squared_error(y_test, pred_test)
print("Test MSE :", mse_test)
print("Test RMSE:", mse_test ** 0.5)  # same units as y, so easier to interpret than MSE
print("Test R2  :", r2_score(y_test, pred_test))

## 6) Coefficients (10 min)

###  Interpretation Drill (Required)
Choose **two** coefficients and explain, in plain English:
- what a 1-unit increase in X means (units!)
- how y changes (direction + magnitude)
- whether that seems realistic

Also add a note: could multicollinearity be affecting coefficients?


Write at least 3 sentences. Include: (1) what you observe, (2) why it might be happening, (3) how it affects your modeling choices.
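
As a rough illustration of the drill, the sketch below turns coefficients into association sentences and flags highly correlated predictors. The data are synthetic and the names `rate_a`, `rate_b`, and `other` are hypothetical; the 0.9 threshold is a rule of thumb, not a course requirement.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumption): rate_b is almost a copy of rate_a, so the two are collinear
rate_a = rng.normal(5.0, 1.0, n)
rate_b = rate_a + rng.normal(0.0, 0.05, n)
other = rng.normal(0.0, 1.0, n)
X_demo = pd.DataFrame({"rate_a": rate_a, "rate_b": rate_b, "other": other})
y_demo = 0.5 * rate_a + 0.2 * other + rng.normal(0.0, 0.1, n)

demo_model = LinearRegression().fit(X_demo, y_demo)
for name, coef in zip(X_demo.columns, demo_model.coef_):
    direction = "higher" if coef > 0 else "lower"
    print(f"A 1-unit increase in {name} is associated with y being "
          f"{abs(coef):.3f} units {direction}, holding the other features fixed.")

# Flag predictor pairs with |correlation| above 0.9 (rule-of-thumb threshold)
corr = X_demo.corr().abs()
pairs = [(a, b, round(float(corr.loc[a, b]), 3))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] > 0.9]
print("Highly correlated pairs:", pairs)
```

When two predictors are nearly identical, their individual coefficients can be unstable even though their combined effect is estimated well. That is one reason to use association language rather than causal claims.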

###  Optional Stretch: Time Series Cross-Validation
Use `TimeSeriesSplit` to evaluate your model across multiple splits.

**Deliverable:** a table of MAE per split + average MAE.


In [None]:

# TODO (STUDENTS, optional stretch):
from sklearn.model_selection import TimeSeriesSplit

# tscv = TimeSeriesSplit(n_splits=5)
# maes = []
# for train_idx, test_idx in tscv.split(X):
#     ...
# print(maes, np.mean(maes))
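
A minimal sketch of the stretch task on synthetic data is shown below. `X_demo` and `y_demo` are made-up stand-ins; with your own data you would pass the notebook's `X` and `y` instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
n = 120

# Synthetic data (assumption): y depends linearly on two features plus noise
X_demo = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y_demo = 1.5 * X_demo["f1"] - 0.5 * X_demo["f2"] + rng.normal(scale=0.3, size=n)

tscv = TimeSeriesSplit(n_splits=5)
rows = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_demo), start=1):
    # Each fold trains on all rows before its test block, never on later rows
    fold_model = LinearRegression().fit(X_demo.iloc[train_idx], y_demo.iloc[train_idx])
    fold_pred = fold_model.predict(X_demo.iloc[test_idx])
    rows.append({
        "split": fold,
        "train_rows": len(train_idx),
        "test_rows": len(test_idx),
        "MAE": mean_absolute_error(y_demo.iloc[test_idx], fold_pred),
    })

mae_table = pd.DataFrame(rows)
print(mae_table.round(3))
print("Average MAE:", round(mae_table["MAE"].mean(), 3))
```

The training window grows with each split while the test block moves forward in time, so early splits are trained on much less data than later ones.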

In [None]:
coef_tbl = pd.DataFrame({"feature": X.columns, "coef": model.coef_})
coef_tbl["abs_coef"] = coef_tbl["coef"].abs()
coef_tbl = coef_tbl.sort_values("abs_coef", ascending=False)
coef_tbl.head(10)

**Instructor note:** stress association language and units.

## 7) Residual plot (5 min)

###  Residual Diagnostics (Required)
Compute and comment on:
- residual mean (should be near 0)
- residual std
- whether residuals get larger when predictions are larger (heteroskedasticity)

**Deliverable:** 3–5 sentences explaining whether your model assumptions look reasonable.


In [None]:

# TODO (STUDENTS):
# Compute residual summary stats and print them clearly.

In [None]:
# SOLUTION (INSTRUCTOR): one possible implementation (residual summary statistics on the test set)
import numpy as np

resid_test = y_test - pred_test  # positive residual = model under-predicted
print(f"Residual mean: {resid_test.mean():.4f}  (near 0 suggests little systematic bias)")
print(f"Residual std : {resid_test.std():.4f}")

# Rough heteroskedasticity check: do absolute residuals grow with the prediction?
corr_abs = np.corrcoef(np.abs(resid_test.values), pred_test)[0, 1]
print(f"corr(|residual|, prediction): {corr_abs:.3f}")

In [None]:
resid = y_test - pred_test  # positive residual = model under-predicted
plt.figure()
plt.plot(resid.index, resid.values)
plt.axhline(0, color="black", linewidth=1)  # reference line: unbiased residuals scatter around 0
plt.title("Residuals over time (test)")
plt.xlabel("Date")
plt.ylabel("Residual (y - ŷ)")
plt.tight_layout()
plt.show()

## 8) Reflection (8–10 min)
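1) Performance: how did the regression compare with your baseline on the test set?
2) Important feature: which feature seemed to matter most, and why do you think so?
3) Next step: what would you change or add next?

**Instructor example:** Try more lags, a different y, or additional domain predictors, while keeping the model explainable.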

## End-of-class checkpoint
You should have a baseline model and a written interpretation.

## Final Project Artifacts (Save these for your report)

By the end of the project, you should have:
- At least 2 polished figures that show *trends* and *relationships*
- A small table of your **top correlations** with the target
- A short written interpretation of what the plots suggest

In this section you will generate and save figures you can reuse in your final write-up.


In [None]:
# SOLUTION (INSTRUCTOR):
# Save a trend figure and a relationship figure using available numeric columns.

import os
import numpy as np
import matplotlib.pyplot as plt

os.makedirs("figures", exist_ok=True)  # Create output folder if it does not exist

num_cols = df_use.select_dtypes(include=[np.number]).columns.tolist()
if len(num_cols) == 0:
    raise ValueError("No numeric columns found in df_use. Check data loading/cleaning steps.")

# Trend figure: first numeric column
col_trend = num_cols[0]
fig, ax = plt.subplots(figsize=(10,4))  # Create a plot for interpretation / reporting
df_use[col_trend].plot(ax=ax)  # Create a plot for interpretation / reporting
ax.set_title(f"Trend of {col_trend} over time")
ax.set_xlabel("Date")
ax.set_ylabel("Value")
fig.tight_layout()
fig.savefig(f"figures/trend_{col_trend}.png", dpi=200)  # Save an artifact you can reuse in the final project

# Relationship figure: correlation heatmap (first 8 numeric cols)
use_cols = num_cols[:8]
corr = df_use[use_cols].corr()  # Compute correlations to look for relationships

fig, ax = plt.subplots(figsize=(7,6))
im = ax.imshow(corr.values, vmin=-1, vmax=1, cmap="coolwarm")  # fixed color scale from -1 to 1
ax.set_xticks(range(len(use_cols)))
ax.set_yticks(range(len(use_cols)))
ax.set_xticklabels(use_cols, rotation=45, ha="right")
ax.set_yticklabels(use_cols)
ax.set_title("Correlation heatmap (subset of variables)")
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
fig.tight_layout()
fig.savefig("figures/corr_heatmap.png", dpi=200)  # Save an artifact you can reuse in the final project

corr.round(3)
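
The artifact list also asks for a small table of top correlations with the target. One way to build it is sketched below on synthetic data; the frame `demo` and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 100

# Synthetic columns (assumption): "target" is built from "strong" and "weak"; "noise" is unrelated
strong = rng.normal(size=n)
weak = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "noise": rng.normal(size=n),
})
demo["target"] = 2.0 * strong + 0.3 * weak + rng.normal(scale=0.5, size=n)

# Correlation of every other column with the target, ranked by absolute size
target_corr = demo.corr()["target"].drop("target")
top_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)

print(top_corr.round(3).to_frame("corr_with_target"))
```

With the notebook's data, the same pattern should work after swapping `demo` for `df_use` and `"target"` for `target_id`.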

### Written Insights (Required)

Write 5–8 bullet points answering:
1. Which variable trends most strongly over time? What might explain it?
2. Which pair of variables looks most related? Is that relationship stable over time?
3. What missingness or outliers could bias modeling?
4. What is one feature engineering idea you want to try next week?


#### Sample instructor bullets (example)

- The first numeric series shows a clear upward trend; this suggests non-stationarity and motivates using percent change or differencing.
- Correlation heatmap indicates several variables move together, suggesting multicollinearity; regularization may help.
- Missingness is concentrated in a small set of columns; imputation strategy should be justified and tested.
- Outliers appear around major economic events; consider robust scaling or winsorization and document the rationale.
- Next week: create lag features and rolling aggregates to capture delayed effects.
