This project explores a realistic approach to modelling market information from different sources.

The provided dataset contains information on: 

- Equity prices and volumes

- Option Greeks and implied volatility

- Realized volatility

- Sentiment data

- Market indices

Our goals in this project are to explore feature engineering, cross-domain correlations, and predictive modeling.

Along the way, we will also learn about key financial concepts such as volatility and sentiment, along with the relevant considerations for each.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [32]:
df = pd.read_csv("data/data.csv", parse_dates=["date","expiration_date"])
cols = df.columns
cols

Index(['date', 'volume_of_trades', 'strike', 'stock_close', 'stock_high',
       'stock_low', 'stock_open', 'stock_traded_volume', 'divCash',
       'expiration_date', 'options_close_price', 'options_volume', 'count',
       'bid', 'bid_size', 'ask', 'ask_size', 'open_interest', 'delta', 'theta',
       'vega', 'rho', 'epsilon', 'lambda', 'gamma', 'd1', 'd2', 'implied_vol',
       'iv_error', 'expiration', 'realized_vol', 'realized_vol_diff_target',
       '7_day_realized_vol_target', 'realized_vol_diff_bin_target',
       '7_day_implied_vol_target', 'implied_vol_diff_target',
       'implied_vol_diff_bin_target', 'options_close_price_7_days',
       'options_7_day_diff', 'options_7_day_frac_diff',
       'options_7_day_diff_bin', '7_day_implied_7_day_forecasted_vol_diff',
       '7_day_implied_7_day_forecasted_vol_diff_bin',
       'current_implied_7_day_forecasted_vol_diff',
       'current_implied_7_day_forecasted_vol_diff_bin',
       'reported_estimate_eps_percent_diff', 'vix-open
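
The parse_dates argument in the loading cell matters because date arithmetic only works on datetime columns. The following is a minimal sketch on a two-row stand-in for the file; the values and the days_to_expiry name are illustrative assumptions, not taken from the dataset.

```python
import io

import pandas as pd

# Two-row stand-in for data/data.csv (hypothetical values, real column names).
sample_csv = io.StringIO(
    "date,expiration_date,strike\n"
    "2020-01-02,2020-02-21,100\n"
    "2020-01-03,2020-02-21,100\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["date", "expiration_date"])

# Confirm that both columns were parsed as datetimes rather than left as strings.
print(sample[["date", "expiration_date"]].dtypes)

# Subtracting two datetime columns gives a timedelta; .dt.days converts it to whole days.
sample["days_to_expiry"] = (sample["expiration_date"] - sample["date"]).dt.days
print(sample)
```

If either column had been left as a string, the subtraction would fail, so this kind of check is a cheap way to catch a parsing problem early.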

We have a lot of features to consider here.

One useful approach to feature engineering is to group our features based on their domains.

In this dataset, the features group naturally into the following domains:

- stock

- option

- greeks

- volatility

- sentiment

- earnings

- vix
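
Some of these domains are visible directly in the column names: the stock columns share a stock_ prefix and the VIX columns share a vix- prefix. As a rough sketch, a small helper can collect such columns automatically. The helper name columns_with_prefix is hypothetical, and the column list below is only a subset copied from the listing above. Prefix matching is not a general solution: the options_ prefix, for example, also matches target columns such as options_7_day_diff.

```python
# Hypothetical helper: collect the column names that start with a given prefix.
def columns_with_prefix(columns, prefix):
    return [c for c in columns if c.startswith(prefix)]


# A subset of the column names shown in the listing above.
example_columns = [
    "date", "strike", "stock_close", "stock_high", "stock_low",
    "stock_open", "stock_traded_volume", "delta", "implied_vol", "vix-open",
]

prefix_groups = {
    "stock": columns_with_prefix(example_columns, "stock_"),
    "vix": columns_with_prefix(example_columns, "vix-"),
}
print(prefix_groups)
```

Domains without a common prefix, such as the greeks, still need to be listed by hand.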

In [33]:
feature_groups = {
    "stock": ['stock_open','stock_high','stock_low','stock_close','stock_traded_volume'],
    "option": ['options_close_price','options_volume','strike','open_interest'],
    "greeks": ['delta','gamma','theta','vega','rho','epsilon','lambda','d1','d2'],
    "volatility": ['implied_vol','realized_vol','realized_vol_diff_target','7_day_realized_vol_target','7_day_implied_vol_target','implied_vol_diff_target'],
    "vix": ['vix-open','vix-high','vix-low','vix-close'],
    "sentiment": ['article_sentiment','pos_total_count','neu_total_count','neg_total_count','total_count'],
    "earnings": ['reported_estimate_eps_percent_diff','pos_em_count','neg_em_count','em_total_count'],
}
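# Summarize each feature group; group_cols avoids overwriting cols from the earlier cell.
for group, group_cols in feature_groups.items():
    print(f"\n{group} features: {len(group_cols)}")
    display(df[group_cols].describe().T.head(10))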


stock features: 5


feature,count,mean,std,min,25%,50%,75%,max
stock_open,6252.0,101.1236,9.529625,85.09091,94.3971,99.48861,106.1942,125.1853
stock_high,6252.0,102.595,9.602983,86.20685,95.68243,100.8238,107.4597,125.9724
stock_low,6252.0,99.93549,9.578786,83.03836,93.26123,98.44241,104.7096,122.2958
stock_close,6252.0,101.2774,9.662109,83.12804,94.33732,99.77756,106.9515,124.5974
stock_traded_volume,6252.0,33802560.0,14380740.0,9701441.0,24994270.0,30411040.0,37000400.0,119455000.0



option features: 4


feature,count,mean,std,min,25%,50%,75%,max
options_close_price,6252.0,3.263034,1.001214,0.68,2.53,3.2,3.89,7.24
options_volume,6252.0,1375.02975,2577.556678,11.0,173.0,532.0,1449.0,37504.0
strike,6252.0,101.613884,9.660514,84.0,95.0,100.0,106.0,126.0
open_interest,6234.0,2621.408566,5759.04114,0.0,314.25,787.0,2072.0,42368.0



greeks features: 9


feature,count,mean,std,min,25%,50%,75%,max
delta,6252.0,0.526519,0.06064,0.2408,0.491,0.526,0.5617,0.8293
gamma,6252.0,0.053728,0.014636,0.0283,0.0433,0.0508,0.0614,0.1096
theta,6252.0,-0.097222,0.027989,-0.2878,-0.111225,-0.0898,-0.076675,-0.0398
vega,6252.0,8.786254,2.009745,4.0088,7.2498,8.8376,10.2476,14.051
rho,6252.0,2.521724,1.031867,0.4966,1.6571,2.4801,3.2923,5.3792
epsilon,6252.0,-2.697519,1.127381,-5.8403,-3.542725,-2.6441,-1.7351,-0.5096
lambda,6252.0,17.717508,5.128986,9.04,14.0564,16.5968,20.3439,42.6684
d1,6252.0,0.067501,0.155131,-0.7036,-0.0223,0.0653,0.1553,0.9514
d2,6252.0,-0.009981,0.155281,-0.7479,-0.0993,-0.0179,0.0789,0.907



volatility features: 6


feature,count,mean,std,min,25%,50%,75%,max
implied_vol,6252.0,0.353838,0.066475,0.205,0.3083,0.343,0.3881,0.7162
realized_vol,6252.0,0.366995,0.093499,0.22563,0.297812,0.329207,0.422394,0.597613
realized_vol_diff_target,6252.0,-0.00492,0.062225,-0.222737,-0.035591,-0.004113,0.021723,0.202085
7_day_realized_vol_target,6252.0,0.362075,0.094628,0.22563,0.295855,0.32314,0.410772,0.597613
7_day_implied_vol_target,6252.0,0.416683,0.252482,0.0937,0.3139,0.36105,0.435,4.9683
implied_vol_diff_target,6252.0,0.062844,0.24404,-0.2273,-0.0163,0.01635,0.0621,4.6944



vix features: 4


feature,count,mean,std,min,25%,50%,75%,max
vix-open,6252.0,22.678141,4.227956,16.13,19.39,21.8,25.34,34.5
vix-high,6252.0,23.57976,4.396253,16.62,20.08,22.6,26.35,34.88
vix-low,6252.0,21.714399,4.030704,15.53,18.8,20.89,23.85,33.11
vix-close,6252.0,22.351256,4.190005,15.78,19.1,21.44,25.0,33.63



sentiment features: 5


feature,count,mean,std,min,25%,50%,75%,max
article_sentiment,6023.0,-0.04516,0.688873,-5.0,0.0,0.0,0.0,6.0
pos_total_count,6023.0,488.933588,406.929067,0.0,131.0,477.0,628.0,2034.0
neu_total_count,6023.0,148.426864,146.252667,0.0,48.0,133.0,185.0,1122.0
neg_total_count,6023.0,223.132658,222.47111,0.0,64.0,206.0,278.0,1602.0
total_count,6023.0,860.49311,733.347301,0.0,227.0,832.0,1099.0,4428.0



earnings features: 4


feature,count,mean,std,min,25%,50%,75%,max
reported_estimate_eps_percent_diff,6252.0,-0.066286,1.348195,-15.2,0.0,0.0,0.0,10.377358
pos_em_count,6023.0,153.973933,148.913652,0.0,59.0,129.0,193.5,1099.0
neg_em_count,6023.0,52.748962,59.17791,0.0,20.0,44.0,64.0,439.0
em_total_count,6023.0,230.251038,229.576494,0.0,77.0,204.0,275.0,1643.0
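
The count column in the summaries above is not the same for every feature: most columns have 6252 observations, while open_interest has 6234 and the sentiment and em count columns have 6023, so some values are missing. One way to make this explicit is to count missing values group by group. The sketch below uses a small synthetic frame; the toy and toy_groups names and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic frame (hypothetical values) with gaps in two columns.
toy = pd.DataFrame({
    "stock_close": [100.0, 101.0, 102.0, 103.0],
    "open_interest": [500.0, np.nan, 700.0, 800.0],
    "article_sentiment": [0.0, np.nan, np.nan, 1.0],
})
toy_groups = {
    "stock": ["stock_close"],
    "option": ["open_interest"],
    "sentiment": ["article_sentiment"],
}

# For each group, count the missing values in every column of that group.
missing_by_group = {
    group: {col: int(n) for col, n in toy[group_cols].isna().sum().items()}
    for group, group_cols in toy_groups.items()
}
for group, counts in missing_by_group.items():
    print(group, counts)
```

The same comprehension should work with df and feature_groups in place of the toy objects.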


As we did last week, let's get a visual sense of the data by plotting a correlation heatmap of a few representative variables from different domains.

In [34]:
corr_cols = ['stock_close','options_close_price','implied_vol','realized_vol','article_sentiment','vix-close']
corr_matrix = df[corr_cols].corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Cross-Domain Correlation Matrix")
plt.show()

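AttributeError: module 'matplotlib' has no attribute 'figure'

This error suggests that plt was bound to the top-level matplotlib package rather than to matplotlib.pyplot when the cell ran. The import cell at the top is marked In [None], so it was most likely not re-run in this session in its current form. Re-running it, so that import matplotlib.pyplot as plt takes effect, should resolve the error.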
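
A heatmap is easy to scan but hard to rank by eye. One possible complement is to list each pair of variables once, sorted by absolute correlation. This is sketched here on synthetic data: the columns a, b and c and their values are made up for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: b is built from a, c is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2.0 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
toy_corr = toy.corr()

# Visit each unordered pair of columns once, which skips the diagonal and duplicates.
pairs = [
    (x, y, float(toy_corr.loc[x, y]))
    for x, y in combinations(toy_corr.columns, 2)
]

# Sort by absolute correlation so strong negative relationships rank highly too.
ranked = sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
for x, y, r in ranked:
    print(f"{x} vs {y}: {r:+.2f}")
```

With the notebook's data, corr_matrix could be used in place of toy_corr.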