# Stage 09 — Homework Starter Notebook

In the lecture, we learned how to create engineered features. Now it’s your turn to apply those ideas to your own project data.

In [3]:
import pandas as pd
import numpy as np
import sys

# Example synthetic data (replace with your project dataset)
# np.random.seed(0)
# n = 100
# df = pd.DataFrame({
#     'income': np.random.normal(60000, 15000, n).astype(int),
#     'monthly_spend': np.random.normal(2000, 600, n).astype(int),
#     'credit_score': np.random.normal(680, 50, n).astype(int)
# })
# df.head()

df = pd.read_csv("../data/raw/preprocessed_dataset.csv")
df.head()

Unnamed: 0,date,spx_close,vix,dgs10,fedfunds,cpi,unrate
0,2020-08-24,0.060143,0.259699,0.65,4.33,301.476,4.0
1,2020-08-25,0.063962,0.251297,0.69,4.33,301.476,4.0
2,2020-08-26,0.074826,0.281937,0.69,4.33,301.476,4.0
3,2020-08-27,0.076627,0.311589,0.74,4.33,301.476,4.0
4,2020-08-28,0.083887,0.274277,0.74,4.33,301.476,4.0
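
The `Unnamed: 0` column in the output above is typically the old row index that was written out when the CSV was saved. A minimal sketch of one way to avoid carrying it along, assuming the first CSV column really is that saved index; the inline CSV text and the name `demo` are stand-ins for the project file, not part of the original notebook:

```python
import io

import pandas as pd

# Stand-in for the project CSV: the first, unnamed column is a saved row index.
csv_text = """,date,spx_close,vix
0,2020-08-24,0.060143,0.259699
1,2020-08-25,0.063962,0.251297
"""

# index_col=0 reuses that column as the index instead of creating "Unnamed: 0";
# parse_dates gives the `date` column a datetime dtype.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["date"])
print(demo.dtypes)
```

With the real file, the same two arguments could be passed to the existing `pd.read_csv` call.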


## Implement at least 2 engineered features here

In [4]:
# Example templates:
# df['spend_income_ratio'] = df['monthly_spend'] / df['income']
# df['rolling_spend_mean'] = df['monthly_spend'].rolling(3).mean()
# Add rationale in markdown below

In [5]:
# Daily log returns of the S&P 500
# Note: log returns need strictly positive price levels; a rescaled spx_close
# (values near 0, as in the head() output above) can produce inf/NaN here.
df["spx_ret"] = np.log(df["spx_close"] / df["spx_close"].shift(1))

# 21-day realized volatility, annualized and in percent:
# rolling std of daily log returns * sqrt(252 trading days per year) * 100
roll = 21
df["rv_21d_pct"] = df["spx_ret"].rolling(roll).std() * np.sqrt(252) * 100

  result = getattr(ufunc, method)(*inputs, **kwargs)


### Rationale for Feature 1 
- Rolling standard deviation of daily log returns over a 21-day window (about one trading month), annualized and expressed in percent.
- EDA showed a right-skewed SPX close with occasional spikes. Price levels are non-stationary and large in scale; converting them to log returns gives a series that is closer to stationary and removes the scale effect.
- The 21-day window smooths one-off moves while still tracking the realized (historical) volatility regime. Volatility tends to cluster, so recent turbulence is informative about near-term risk.
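
As a rough illustration of the calculation described above, the sketch below computes annualized rolling volatility on a synthetic price path. The function name `realized_vol_pct`, the masking of non-positive prices, and the simulated data are assumptions made for this example and are not part of the project pipeline:

```python
import numpy as np
import pandas as pd


def realized_vol_pct(prices: pd.Series, window: int = 21,
                     periods_per_year: int = 252) -> pd.Series:
    """Annualized rolling volatility of daily log returns, in percent."""
    # Log returns are only defined for strictly positive prices; mask the rest as NaN.
    positive = prices.where(prices > 0)
    log_ret = np.log(positive / positive.shift(1))
    return log_ret.rolling(window).std() * np.sqrt(periods_per_year) * 100


# Synthetic price path with 1% daily volatility (illustrative, not project data)
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

rv = realized_vol_pct(prices)
print(rv.dropna().describe())
```

A daily volatility of 1% corresponds to roughly 0.01 * sqrt(252) * 100 ≈ 15.9% annualized, so the rolling estimates should scatter around that level. The first 21 values are NaN because the first return is undefined and the window needs 21 returns.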

In [6]:
# Variance Risk Premium proxy (VRP ratio): VIX / realized vol - 1
# VIX is quoted in annualized percent; this assumes df["vix"] is on that raw
# scale so its units match rv_21d_pct (also annualized percent)
# Return NaN where realized vol is zero, negative, or NaN
df["vrp_ratio"] = np.where(df["rv_21d_pct"] > 0, df["vix"] / df["rv_21d_pct"] - 1.0, np.nan)

### Rationale for Feature 2
- Ratio of implied volatility (VIX) to realized volatility (rv_21d_pct), minus 1. Positive values mean the option market prices in more volatility than was recently realized.
- EDA showed that VIX rises when SPX falls and that VIX has right-tail spikes. Comparing implied with realized volatility gives a risk-premium signal that often anticipates near-term volatility regimes: it measures how option markets price future risk relative to what was actually realized, which makes it a natural input for volatility forecasting.
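
To make the ratio concrete, here is a small sketch on made-up numbers. The helper name `vrp_ratio` and the input values are hypothetical, and both inputs are assumed to be in annualized percent:

```python
import numpy as np
import pandas as pd


def vrp_ratio(implied_pct: pd.Series, realized_pct: pd.Series) -> pd.Series:
    """Implied / realized - 1, with NaN where realized vol is not strictly positive."""
    # .where keeps positive values and turns zeros, negatives, and NaNs into NaN.
    safe_realized = realized_pct.where(realized_pct > 0)
    return implied_pct / safe_realized - 1.0


# Made-up values in annualized percent (not project data)
implied = pd.Series([20.0, 12.0, 18.0, 25.0])
realized = pd.Series([16.0, 15.0, 0.0, np.nan])

ratio = vrp_ratio(implied, realized)
print(ratio)
```

An implied volatility of 20 against a realized volatility of 16 gives 20 / 16 - 1 = 0.25, a 25% premium. 12 against 15 gives -0.2, a discount. The zero and missing realized values produce NaN instead of infinity.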

In [7]:
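# 10Y rate shock: 5-day change in the 10-year yield (dgs10),
# converted from percentage points to basis points (1 pp = 100 bps)
df["rate_shock_5d_bps"] = (df["dgs10"] - df["dgs10"].shift(5)) * 100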

### Rationale for Feature 3
- 5-business-day change in the 10-year yield, expressed in basis points.
- Sudden rate moves capture macro shocks (policy changes, inflation surprises) that often spill into equity volatility.
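
The unit conversion can be checked on a short series. In this sketch the first five yields are taken from the `df.head()` output earlier in the notebook, while the last two values and the variable names are made up for illustration:

```python
import pandas as pd

# 10-year yields in percent; the last two values are hypothetical.
yields_pct = pd.Series([0.65, 0.69, 0.69, 0.74, 0.74, 0.72, 0.68])

# diff(5) is equivalent to x - x.shift(5); multiplying by 100 converts
# percentage points to basis points.
shock_bps = yields_pct.diff(5) * 100
print(shock_bps.round(2))
```

The first five entries are NaN because there is no observation five rows earlier. Row 5 is (0.72 - 0.65) * 100 = 7 bps and row 6 is (0.68 - 0.69) * 100 = -1 bp, up to floating-point rounding.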

In [8]:
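# Drop leading rows with NaNs introduced by the shift and rolling windows
# (dropna removes NaN only; +/-inf values are kept)
df = df.dropna(subset=["spx_ret", "rv_21d_pct", "vrp_ratio", "rate_shock_5d_bps"]).reset_index(drop=True)

# Save datasets (to_parquet needs pyarrow or fastparquet installed)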
df.to_csv("../data/processed/engineered_features.csv", index=False)
df.to_parquet("../data/processed/engineered_features.parquet", index=False)

# Quick check
print(df.head(3))
print("\nColumns:", list(df.columns))

         date  spx_close       vix  dgs10  fedfunds      cpi  unrate  \
0  2020-10-23   0.070698  0.387695   0.85      0.09  260.319     6.9   
1  2020-10-26   0.050764  0.509019   0.81      0.09  260.319     6.9   
2  2020-10-27   0.047580  0.531011   0.79      0.09  260.319     6.9   

    spx_ret  rv_21d_pct  vrp_ratio  rate_shock_5d_bps  
0  0.053491  721.180880  -0.999462                9.0  
1 -0.331233  395.760560  -0.998714                3.0  
2 -0.064779  337.028035  -0.998424               -2.0  

Columns: ['date', 'spx_close', 'vix', 'dgs10', 'fedfunds', 'cpi', 'unrate', 'spx_ret', 'rv_21d_pct', 'vrp_ratio', 'rate_shock_5d_bps']
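
The quick check above shows `rv_21d_pct` in the hundreds of percent and `vrp_ratio` close to -1 in every row, which usually points to inputs on mismatched scales and is worth a second look. A minimal sketch of a sanity report for engineered columns follows; the helper name `feature_health` and the toy frame are hypothetical, and the same call could be pointed at `df` with the four feature columns:

```python
import numpy as np
import pandas as pd


def feature_health(frame: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Count NaN and infinite values and report the range of each column."""
    sub = frame[cols]
    return pd.DataFrame({
        "n_nan": sub.isna().sum(),
        "n_inf": np.isinf(sub).sum(),  # dropna does not catch these
        "min": sub.min(),
        "max": sub.max(),
    })


# Toy frame with one infinite return and one missing volatility (not project data)
toy = pd.DataFrame({
    "spx_ret": [0.01, -np.inf, 0.02],
    "rv_21d_pct": [15.0, np.nan, 18.0],
})

report = feature_health(toy, ["spx_ret", "rv_21d_pct"])
print(report)
```

Infinite values are counted separately because `dropna` leaves them in place, and a single infinite return can distort every rolling window that contains it.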
