This project explores a realistic approach to modelling market information from different sources.

The provided dataset contains information on: 

- Equity prices and volumes

- Option Greeks and implied volatility

- Realized volatility

- Sentiment data

- Market indices

Our goals in this project are to explore feature engineering, cross-model correlations, and predictive modeling.

Along the way we will also learn about key financial concepts such as volatility and sentiment, and the considerations relevant to each.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import mutual_info_regression
from statsmodels.stats.outliers_influence import variance_inflation_factor


In [None]:
df = pd.read_csv("data/data.csv", parse_dates=["date","expiration_date"])
cols = df.columns
cols

We have a lot of features to consider here.

One useful approach to feature engineering is to group our features based on their domains.

We see that in our data we can naturally group our data by:

- stock

- option

- greeks

- volatility

- sentiment

- earnings

- vix

Grouping by economic meaning helps prevent thoughtless correlation-based dropping later. For example, delta, gamma, and vega will all appear highly correlated in plots, but each reflects a different sensitivity dimension of an options contract. Dropping one without context can hinder interpretability.

In [None]:
feature_groups = {
    "stock": ['stock_open','stock_high','stock_low','stock_close','stock_traded_volume'],
    "option": ['options_close_price','options_volume','strike','open_interest'],
    "greeks": ['delta','gamma','theta','vega','rho','epsilon','lambda','d1','d2'],
    "volatility": ['implied_vol','realized_vol','realized_vol_diff_target','7_day_realized_vol_target','7_day_implied_vol_target','implied_vol_diff_target'],
    "vix": ['vix-open','vix-high','vix-low','vix-close'],
    "sentiment": ['article_sentiment','pos_total_count','neu_total_count','neg_total_count','total_count'],
    "earnings": ['reported_estimate_eps_percent_diff','pos_em_count','neg_em_count','em_total_count'],
}
for group, cols in feature_groups.items():
    print(f"\n{group} features: {len(cols)}")
    display(df[cols].describe().T.head(10))


Correlation helps visualize redundancy: if two features have a correlation close to 1, one of them is probably redundant.

Remember that we want our feature space to have as few dimensions as possible, so that our data points actually occupy it. With the same number of points, a higher-dimensional space is mostly empty.
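
To make the "mostly empty" point concrete, here is a small sketch. It assumes each feature is scaled to [0, 1] and split into 10 bins, with 1,000 data points; these numbers are illustrative and not taken from the dataset. It computes an upper bound on the share of grid cells the points could occupy as features are added.

```python
# Illustrative assumptions: 1,000 data points, 10 bins per feature.
n_points = 1_000
bins_per_feature = 10

def max_occupied_fraction(n_features: int) -> float:
    """Upper bound on the share of grid cells that n_points can fill."""
    n_cells = bins_per_feature ** n_features
    return min(1.0, n_points / n_cells)

for d in (1, 2, 3, 5, 10):
    print(f"{d:>2} features: at most {max_occupied_fraction(d):.1e} of cells occupied")
```

Each additional feature multiplies the number of cells by 10 while the number of points stays fixed, which is why dropping redundant features pays off.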


Spearman correlation detects monotonic relationships, not just linear ones.

This matters when variables move together in a nonlinear pattern (think of the Greeks, which are nonlinear because of their Black-Scholes derivation).
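
A quick way to see the difference is a toy example, with synthetic data rather than anything from the dataset. Spearman is computed by hand here as Pearson on ranks, purely for illustration; the notebook itself uses `scipy.stats.spearmanr`.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman's rho as Pearson's r on ranks (this ranking assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

x = np.linspace(0, 5, 50)
y = np.exp(x)  # monotonic in x, but strongly nonlinear

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because `y` always increases with `x`, the ranks line up perfectly and Spearman reports a perfect relationship, while Pearson is pulled below 1 by the curvature.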

In [None]:
def correlation_report(df, cols):
    pairs = []
    for i in range(len(cols)):
        for j in range(i+1, len(cols)):
            rho, p = spearmanr(df[cols[i]], df[cols[j]], nan_policy='omit')
            if abs(rho) > 0.7:
                pairs.append((cols[i], cols[j], rho))
    return pd.DataFrame(pairs, columns=['Feature_1', 'Feature_2', 'Spearman_rho'])

for group, cols in feature_groups.items():
    print(f"\n Highly correlated pairs in {group.upper()}:")
    display(correlation_report(df, cols))


Almost immediately we see that some features, such as stock_open, stock_close, and stock_high, are ~1 correlated, so adding them all to our model unnecessarily inflates our feature space.

Let's also build a visual understanding of these dependencies using heatmaps.

In [None]:
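for group, cols in feature_groups.items():
    # Use Spearman so the heatmaps match the correlation report above
    corr = df[cols].corr(method='spearman')
    plt.figure(figsize=(8, 6))
    # Fixed, zero-centered color scale so the groups are comparable
    sns.heatmap(corr, cmap='coolwarm', center=0, vmin=-1, vmax=1)
    plt.title(f"Spearman correlation: {group}")
    plt.show()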
    

## Feature Engineering

Here we perform some common transforms and explain the concepts behind them:

- Price and volume data are often non-stationary, meaning their mean and variance change over time. Many machine learning models implicitly assume that the data they are trained on and the data they predict on come from the same distribution. We can get closer to that with logs, returns, and ratios.
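
As a rough illustration of why returns are easier to work with than price levels, the sketch below simulates a hypothetical price path (the drift, volatility, and seed are arbitrary assumptions, not estimates from the dataset) and compares how much the mean shifts between the first and second half of the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical price path: i.i.d. normal log returns with a small positive drift.
true_log_returns = rng.normal(loc=0.002, scale=0.01, size=1_000)
prices = 100 * np.exp(np.cumsum(true_log_returns))

# Recover log returns from the price levels.
log_returns = np.diff(np.log(prices))

def half_gap(series):
    """Mean of the second half minus mean of the first half."""
    mid = len(series) // 2
    return float(series[mid:].mean() - series[:mid].mean())

price_gap = half_gap(prices)
return_gap = half_gap(log_returns)
print(f"price level gap: {price_gap:.2f}")
print(f"log return gap:  {return_gap:.5f}")
```

The price level drifts far from where it started, while the mean log return stays roughly the same in both halves.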


In [None]:
# Logs
df['log_stock_close'] = np.log1p(df['stock_close'])
df['log_option_close'] = np.log1p(df['options_close_price'])
df['log_traded_volume'] = np.log1p(df['stock_traded_volume'])

# Returns (pct_change assumes rows are sorted by date for a single instrument)
df['stock_return'] = df['stock_close'].pct_change()
df['option_return'] = df['options_close_price'].pct_change()

# Ratios
df['option_stock_ratio'] = df['options_close_price'] / df['stock_close']
df['implied_realized_vol_ratio'] = df['implied_vol'] / df['realized_vol']
df['bid_ask_spread'] = (df['ask'] - df['bid']) / df['stock_close']

# Rolling volatility (standard deviation of returns over a 7-row window)
df['rolling_stock_vol_7d'] = df['stock_return'].rolling(7).std()
df['rolling_option_vol_7d'] = df['option_return'].rolling(7).std()


- Lagged features: lagging returns avoids data leakage by ensuring we only use information available up to time t-1 to predict time t. For example, if one of our columns were "Next_day_returns", our model could peek at information from time t+1.
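
The direction of the shift is easy to get wrong, so here is a toy check with made-up numbers: a positive `shift` looks backward and is safe, while a negative `shift` pulls future values into the current row.

```python
import pandas as pd

# Toy series of daily returns, indexed by day name for readability.
returns = pd.Series([0.01, -0.02, 0.03, 0.00], index=["Mon", "Tue", "Wed", "Thu"])

frame = pd.DataFrame({
    "return": returns,
    "lag_1": returns.shift(1),    # yesterday's return: known when predicting today
    "lead_1": returns.shift(-1),  # tomorrow's return: leaks the future if used as a feature
})
print(frame)
```

The first row of `lag_1` is NaN because nothing precedes it; such rows need to be dropped or imputed before modeling.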

In [None]:
# Lags 
for lag in [1, 3, 5, 7]:
    df[f'lag_stock_return_{lag}'] = df['stock_return'].shift(lag)
    df[f'lag_implied_vol_{lag}'] = df['implied_vol'].shift(lag)
    df[f'lag_option_return_{lag}'] = df['option_return'].shift(lag)
    
# --- Calendar features ---
df['day_of_week'] = df['date'].dt.dayofweek
df['days_to_expiration'] = (df['expiration_date'] - df['date']).dt.days

# --- Cyclic encoding for day of week ---
df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['day_of_week_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

- Option Greeks and volatilities are connected by nonlinear surfaces. If you've studied options or taken a financial instruments class you might be familiar with volatility smiles/skews, which option Greeks tend to follow.

- When we know data follows a certain shape, we should do our best to model that curvature.

A brief explanation of the features below is provided, but don't worry too much about following the finance jargon:

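- vol skew is meant to capture the asymmetry of risk expectations for out-of-the-money vs. at-the-money vs. in-the-money options. Note that under Black-Scholes d1 - d2 equals sigma * sqrt(T), so the feature computed below is a rough proxy (total volatility to expiration) rather than a true skew measure
- vol surface curvature models the gamma/vega structure, showing the sensitivity of IV to the underlying
- vol of vol is meant to capture the volatility of volatility; below it is approximated by the product of the IV and RV percentage changes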
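
To see what a `d1 - d2` feature actually measures, the sketch below computes both terms from the standard Black-Scholes formulas for a non-dividend-paying underlying. It assumes the dataset's `d1` and `d2` columns follow those standard definitions; the spot, rate, volatility, and maturity values are made up.

```python
import math

def d1_d2(spot, strike, rate, sigma, t_years):
    """Black-Scholes d1 and d2 for a non-dividend-paying underlying."""
    vol_sqrt_t = sigma * math.sqrt(t_years)
    log_moneyness = math.log(spot / strike)
    d1 = (log_moneyness + (rate + 0.5 * sigma**2) * t_years) / vol_sqrt_t
    d2 = (log_moneyness + (rate - 0.5 * sigma**2) * t_years) / vol_sqrt_t
    return d1, d2

sigma, t_years = 0.25, 0.5
for strike in (80, 100, 120):
    d1, d2 = d1_d2(spot=100, strike=strike, rate=0.02, sigma=sigma, t_years=t_years)
    print(f"strike={strike}: d1 - d2 = {d1 - d2:.4f}")
print(f"sigma * sqrt(T) = {sigma * math.sqrt(t_years):.4f}")
```

The difference is the same at every strike and equals `sigma * sqrt(T)`, so it tracks the volatility level scaled by time to expiration and carries no information about moneyness on its own.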

In [None]:
# Vol curves
df['vol_skew'] = df['d1'] - df['d2']
df['vol_surface_curvature'] = df['gamma'] / (df['vega'] + 1e-6)

# Vol change 
df['iv_change'] = df['implied_vol'].pct_change()
df['rv_change'] = df['realized_vol'].pct_change()
df['vol_premium_change'] = df['implied_realized_vol_ratio'].pct_change()

# volvol
df['vol_vol'] = df['iv_change'] * df['rv_change']

- Sentiment data gives you 'orthogonal information': information that comes not directly from price or volume but from traders.

Sentiment ratios normalize the raw counts, and the weighted score summarizes market tone.

In [None]:
# Normalize by article count
for sentiment_type in ['pos_total_count', 'neu_total_count', 'neg_total_count']:
    df[f'{sentiment_type}_ratio'] = df[sentiment_type] / (df['total_count'] + 1e-6)

# weight sentiment score
df['weighted_sentiment'] = (
    df['article_sentiment'] * 
    (df['pos_total_count_ratio'] - df['neg_total_count_ratio'])
)

# rolling mean for smoothness
df['sentiment_rolling_mean_7d'] = df['weighted_sentiment'].rolling(7).mean()


Finally, we consider interactions across domains:

- Does volatility rise on down days? (the leverage effect)

- Macro vs. micro perceptions (VIX to IV)

- Are the Greeks related in explanatory ways?

In [None]:
# stock vol relations
df['return_x_iv'] = df['stock_return'] * df['implied_vol']
df['return_x_vix'] = df['stock_return'] * df['vix-close']
df['iv_x_sentiment'] = df['implied_vol'] * df['weighted_sentiment']

# greeks
df['vega_x_theta'] = df['vega'] * df['theta']
df['delta_x_gamma'] = df['delta'] * df['gamma']

# micro macro
df['vix_iv_ratio'] = df['vix-close'] / (df['implied_vol'] + 1e-6)


Many of the new features are heavy-tailed, with probability mass concentrated at the extremes of the distribution.

To prevent these from dominating model learning we can use robust normalization, which centers on the median and scales by the interquartile range and is therefore less sensitive to outliers.
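
The toy comparison below, on made-up values, shows the effect. Robust scaling is written out by hand with NumPy for clarity; it mirrors the default behavior of `RobustScaler` (median centering, 25th to 75th percentile range).

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1_000.0])  # one extreme outlier

# Standard scaling: the outlier inflates both the mean and the standard deviation.
standard = (values - values.mean()) / values.std()

# Robust scaling: center on the median, scale by the interquartile range.
q25, q50, q75 = np.percentile(values, [25, 50, 75])
robust = (values - q50) / (q75 - q25)

print("standard:", np.round(standard, 3))
print("robust:  ", np.round(robust, 3))
```

Under standard scaling the five ordinary values are squeezed into almost the same number, while robust scaling keeps them spread out and leaves the outlier clearly visible.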

In [None]:
engineered_features = [
    'stock_return','option_return','implied_realized_vol_ratio','bid_ask_spread',
    'vol_skew','vol_surface_curvature','iv_change','rv_change','vol_vol',
    'weighted_sentiment','sentiment_rolling_mean_7d','return_x_iv','vega_x_theta','vix_iv_ratio'
]

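# pct_change and ratio features can yield +/-inf (division by zero);
# RobustScaler raises on inf but passes NaN through, so convert first
df[engineered_features] = df[engineered_features].replace([np.inf, -np.inf], np.nan)

scaler = RobustScaler()
df[engineered_features] = scaler.fit_transform(df[engineered_features])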

Let's check our new features, making sure they are centered, have meaningful relationships, and avoid redundancy.

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df[engineered_features].corr(), cmap='coolwarm', center=0)
plt.title("Correlation Heatmap of Engineered Features")
plt.show()

df[engineered_features].hist(bins=40, figsize=(14, 10))
plt.suptitle("Distribution of Engineered Features", fontsize=16)
plt.show()

If you thought this was a lot of work, that's because most of the time spent building models goes into the data processing and feature selection phases. These relationships aren't always intuitive, and uncovering them often takes whole teams of researchers!