# Stage 09 — Feature Engineering (Completed)

**Dataset:** `options_panel_synthetic_2025_Q3_basic_no_oi.csv`

This notebook follows the starter template:
1. Replace the sample synthetic data with the project dataset.
2. Implement at least **two** new features (no Greeks).
3. Document each feature with a short rationale.
4. (Optional) Explore correlation with a chosen target.


## 1) Load Project Dataset
We use the options panel created previously (basic parameters only).

In [None]:
import numpy as np
import pandas as pd
# None shows every column; 0 only enables width auto-detection when running in a terminal
pd.set_option('display.max_columns', None)
df = pd.read_csv("/mnt/data/options_panel_synthetic_2025_Q3_basic_no_oi.csv")
df.head()
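
Before deriving features, it can help to confirm that the loaded panel has the columns the later cells read. The sketch below is one way to do that; the helper name `missing_columns` and the toy frame are hypothetical, and the column list is simply collected from the feature code in this notebook.

```python
import pandas as pd

# Columns read by the feature cells in this notebook.
REQUIRED_COLUMNS = [
    "Ticker", "OptionType", "Underlying_Price", "Strike",
    "DTE", "Bid", "Ask", "Last", "Volume",
]

def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame` (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Toy frame standing in for the real panel; it omits "Volume" on purpose.
toy = pd.DataFrame({
    "Ticker": ["AAA"], "OptionType": ["call"], "Underlying_Price": [100.0],
    "Strike": [95.0], "DTE": [30], "Bid": [6.0], "Ask": [6.4], "Last": [6.2],
})
print(missing_columns(toy))  # ['Volume']
```

On the real data, `missing_columns(df)` should return an empty list before the feature cell is run.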

## 2) Feature Engineering (No Greeks)
We add **six** simple, finance-relevant feature groups (seven columns) derived only from the basic columns:
- **`log_moneyness = ln(S/K)`**: scale-free measure of how far the underlying price `S` sits from the strike `K`; commonly predictive of option prices.
- **`tau_years = DTE/365`**: time to expiry in years; an option's time value decays as expiry approaches.
- **`Intrinsic_Value` & `Extrinsic_Value`**: decompose the price into intrinsic value (`max(S - K, 0)` for calls, `max(K - S, 0)` for puts) and the time-value remainder (`Last` minus intrinsic value, floored at 0).
- **`Rel_Spread = (Ask - Bid)/Last`**: a simple liquidity/transaction-cost proxy often correlated with pricing noise.
- **`ITM` (0/1)**: whether the option is in-the-money (`S > K` for calls, `S < K` for puts).
- **`Volume_z`**: volume z-scored within each ticker (population standard deviation, `ddof=0`), so unusually busy days are comparable across tickers.

In [None]:
df["log_moneyness"] = np.log(df["Underlying_Price"] / df["Strike"])
df["tau_years"] = df["DTE"] / 365.0
df["Intrinsic_Value"] = np.where(
    df["OptionType"].str.lower() == "call",
    np.maximum(df["Underlying_Price"] - df["Strike"], 0.0),
    np.maximum(df["Strike"] - df["Underlying_Price"], 0.0),
)
df["Extrinsic_Value"] = np.maximum(df["Last"] - df["Intrinsic_Value"], 0.0)
df["Rel_Spread"] = (df["Ask"] - df["Bid"]) / df["Last"]
df["Rel_Spread"] = df["Rel_Spread"].replace([np.inf, -np.inf], np.nan).clip(lower=0.0)
df["ITM"] = (
    ((df["OptionType"].str.lower() == "call") & (df["Underlying_Price"] > df["Strike"])) |
    ((df["OptionType"].str.lower() == "put") & (df["Underlying_Price"] < df["Strike"]))
).astype(int)
df["Volume_z"] = df.groupby("Ticker")["Volume"].transform(lambda s: (s - s.mean())/s.std(ddof=0))
df.head()
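
To sanity-check the decomposition by hand, the same formulas can be applied to a few made-up rows. This is a minimal sketch: the three rows are illustrative values, not taken from the project dataset.

```python
import numpy as np
import pandas as pd

# Three illustrative contracts on an underlying trading at 100.
toy = pd.DataFrame({
    "OptionType": ["call", "put", "call"],
    "Underlying_Price": [100.0, 100.0, 100.0],
    "Strike": [95.0, 110.0, 105.0],
    "Last": [6.2, 11.5, 1.3],
})

is_call = toy["OptionType"].str.lower() == "call"
# Intrinsic value: max(S - K, 0) for calls, max(K - S, 0) for puts.
toy["Intrinsic_Value"] = np.where(
    is_call,
    np.maximum(toy["Underlying_Price"] - toy["Strike"], 0.0),
    np.maximum(toy["Strike"] - toy["Underlying_Price"], 0.0),
)
# Extrinsic value: whatever the quoted price carries above intrinsic, floored at 0.
toy["Extrinsic_Value"] = np.maximum(toy["Last"] - toy["Intrinsic_Value"], 0.0)
toy["log_moneyness"] = np.log(toy["Underlying_Price"] / toy["Strike"])

print(toy.round(4))
```

The intrinsic values come out as 5, 10, and 0, leaving roughly 1.2, 1.5, and 1.3 of extrinsic value. Note that the in-the-money put has a negative `log_moneyness`, so the sign of `ln(S/K)` alone does not say whether a contract is in-the-money; that depends on the option type.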

### Short Rationales
- **`log_moneyness`**: captures relative distance to strike in a stable, additive way; standard in options literature.
- **`tau_years`**: option prices and time decay both scale with time to expiry; years is the conventional unit.
- **`Intrinsic_Value` / `Extrinsic_Value`**: separates value driven by moneyness from value driven by time (and other premiums).
- **`Rel_Spread`**: wider spreads imply lower liquidity and potential price inefficiency.
- **`ITM`**: structural regime indicator that often changes price behavior.
- **`Volume_z`**: normalizes activity within each name so busy days are comparable across the panel.
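
One edge case for `Volume_z`: if a ticker's volume is constant, its standard deviation is 0 and the z-score becomes `0/0 = NaN`. The sketch below shows one possible guard. The helper `zscore_within` is hypothetical, and mapping constant groups to 0.0 is an assumption, not something the notebook prescribes.

```python
import pandas as pd

def zscore_within(series: pd.Series) -> pd.Series:
    """Population z-score of one group; constant groups map to 0.0 (an assumed convention)."""
    std = series.std(ddof=0)
    if std == 0:
        # No spread in this group: every value equals the mean.
        return pd.Series(0.0, index=series.index, dtype=float)
    return (series - series.mean()) / std

# Toy panel: AAA has varying volume, BBB has constant volume.
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Volume": [10, 20, 30, 50, 50],
})
toy["Volume_z"] = toy.groupby("Ticker")["Volume"].transform(zscore_within)
print(toy)
```

Leaving the `NaN` in place is equally defensible if downstream code should treat such tickers as having no volume signal.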

## 3) Save Feature Set

In [None]:
out_path = "/mnt/data/options_panel_synthetic_2025_Q3_basic_features.csv"
df.to_csv(out_path, index=False)
out_path

## 4) (Optional) Correlation with Target
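Treat **`Last`** (option price) as the target and compute simple Pearson correlations. `Extrinsic_Value` and `Rel_Spread` are themselves computed from `Last`, so part of their correlation with the target is mechanical.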

In [None]:
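num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Correlate every numeric column with Last, then drop Last's trivial self-correlation of 1.0
corr_series = df[num_cols].corr()["Last"].drop("Last").sort_values(ascending=False)
corr_series.to_frame("corr_with_Last").head(12)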
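
A pooled correlation can hide structure: call prices rise with `ln(S/K)` while put prices fall, so mixing both types may push the coefficient toward zero. The sketch below illustrates this with made-up prices (not project data) by computing the correlation per option type; splitting by type is a suggestion, not part of the starter template.

```python
import numpy as np
import pandas as pd

spot = 100.0
strikes = np.array([90.0, 95.0, 100.0, 105.0, 110.0])

# Illustrative prices: calls get cheaper as the strike rises, puts get dearer.
calls = pd.DataFrame({"OptionType": "call", "Strike": strikes,
                      "Last": [11.0, 7.0, 3.5, 1.5, 0.5]})
puts = pd.DataFrame({"OptionType": "put", "Strike": strikes,
                     "Last": [0.4, 1.3, 3.3, 6.8, 10.9]})
toy = pd.concat([calls, puts], ignore_index=True)
toy["log_moneyness"] = np.log(spot / toy["Strike"])

# Pearson correlation over all rows, then separately for each option type.
pooled = toy["log_moneyness"].corr(toy["Last"])
by_type = {
    name: group["log_moneyness"].corr(group["Last"])
    for name, group in toy.groupby("OptionType")
}
print(f"pooled: {pooled:.3f}")
print({name: round(value, 3) for name, value in by_type.items()})
```

In this toy example the per-type coefficients are strongly positive for calls and strongly negative for puts, while the pooled coefficient is close to zero.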

## 5) (Optional) Quick Plot
Scatter of **`Last`** vs **`log_moneyness`** to visualize the relationship.

In [None]:
import matplotlib.pyplot as plt
plt.figure()
plt.scatter(df["log_moneyness"], df["Last"], s=6, alpha=0.4)
plt.title("Option Last vs log-moneyness")
plt.xlabel("log_moneyness = ln(S/K)")
plt.ylabel("Last")
plt.show()