### Import Data

Load the option prices and parse the date columns into pandas datetimes (`datetime64[ns]`).

In [167]:
import pandas as pd

df = pd.read_csv('./dat/spx_option_prices.csv')
df['exdate'] = pd.to_datetime(df['exdate'].astype(str))
df['date'] = pd.to_datetime(df['date'].astype(str))
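The raw `date` and `exdate` columns appear to be stored as `YYYYMMDD` integers (an assumption, based on how `last_date` prints further down, e.g. `20181226.0`). Under that assumption, a minimal sketch of a stricter parse passes an explicit `format`, so malformed values raise instead of being silently inferred. The toy frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: dates stored as YYYYMMDD integers (assumed layout).
raw = pd.DataFrame({'date': [20181226, 20181227],
                    'exdate': [20211217, 20211217]})

parsed = raw.copy()
for col in ['date', 'exdate']:
    # An explicit format avoids ambiguous inference and fails loudly on bad values.
    parsed[col] = pd.to_datetime(parsed[col].astype(str), format='%Y%m%d')

print(parsed.dtypes)
```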

### Simplify Dataset

A few columns will never be needed because they take a single value across the whole dataset. There are additional columns giving the option greeks, which I will probably omit later on as well.

In [168]:
for col in ['secid','cp_flag','index_flag','issuer','exercise_style']:
    print(df[col].unique())
    del df[col]

[108105]
['P']
[1]
['CBOE S&P 500 INDEX']
['E']


### Add Column for `daysToExpiry`

`daysToExpiry` is the number of calendar days between the quote date (`date`) and the expiration date (`exdate`).

In [169]:
# .dt.days extracts whole calendar days from the timedelta directly,
# without casting to integer nanoseconds first
df['daysToExpiry'] = (df['exdate'] - df['date']).dt.days
df['daysToExpiry'].min()

0
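As a quick sanity check of the definition, here is a sketch using two quote dates and one expiry that appear in the sample rows later in this notebook. The toy frame is rebuilt by hand and is not read from the CSV.

```python
import pandas as pd

# Two quote dates for the 2021-12-17 expiry, copied from the sample rows shown later.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-12-26', '2018-12-31']),
    'exdate': pd.to_datetime(['2021-12-17', '2021-12-17']),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.days gives the whole number of calendar days.
toy['daysToExpiry'] = (toy['exdate'] - toy['date']).dt.days
print(toy)
```

The two values should match the `daysToExpiry` column printed for the same dates further down (1087 and 1082).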

### Add Column for `lifeTime`

The dataset has a key called `optionid`, which is unique in the sense that every `(date, optionid)` pair has exactly one row. However, the economics of an option are fully determined by its strike price (`strike_price`) and expiry date (`exdate`): two contracts that agree on both have exactly the same timing and cash flows. We therefore expected `optionid` to map 1:1 to `(exdate, strike_price)`. The following shows that this is not the case: some `(date, exdate, strike_price)` combinations have multiple rows.

My guess is that different listing cycles (weekly, monthly, quarterly) sometimes produce two options with the same expiration date. To approximate which cycle an option belongs to, we look at the option's `lifeTime`, the largest `daysToExpiry` observed for its `optionid`, i.e. its time to expiry when it first appears in the data.

In [170]:
print(len(df))
print(len(df[['date','exdate','strike_price']].drop_duplicates()))

8462528
7841799
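The row counts above show that duplicates exist, but not which contracts are affected. One way to look at the 1:1 expectation directly is to count distinct `optionid` values per `(exdate, strike_price)`. This is a sketch on made-up data; the toy ids and the names `ids_per_contract` and `shared` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: ids 1 and 2 share the same expiry and strike.
toy = pd.DataFrame({
    'optionid':     [1, 1, 2, 3],
    'exdate':       pd.to_datetime(['2018-12-21'] * 4),
    'strike_price': [2540000, 2540000, 2540000, 2550000],
})

# Number of distinct optionids behind each (exdate, strike_price) contract.
ids_per_contract = toy.groupby(['exdate', 'strike_price'])['optionid'].nunique()

# Contracts where the mapping is not 1:1.
shared = ids_per_contract[ids_per_contract > 1]
print(shared)
```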


In [171]:
df_lt = df.groupby('optionid')['daysToExpiry'].max().reset_index().rename(
    {'daysToExpiry':'lifeTime'},axis=1)
df = df.merge(df_lt, how = 'left', on = 'optionid', validate = 'm:1')

In [172]:
print(len(df))
print(len(df[['date','exdate','strike_price','lifeTime']].drop_duplicates()))

8462528
8018016
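To make the role of `lifeTime` concrete, the sketch below builds two hypothetical options with the same expiry and strike, one listed long before the other. On the shared quote date they collide on `(date, exdate, strike_price)` but are separated once `lifeTime` is added. All values are invented, and `transform('max')` is used here as a shortcut for the groupby-and-merge above.

```python
import pandas as pd

# Hypothetical: option 1 is listed in January, option 2 one week before expiry.
toy = pd.DataFrame({
    'optionid': [1, 1, 2],
    'date': pd.to_datetime(['2018-01-02', '2018-09-14', '2018-09-14']),
    'exdate': pd.to_datetime(['2018-09-21'] * 3),
    'strike_price': [2175000] * 3,
})
toy['daysToExpiry'] = (toy['exdate'] - toy['date']).dt.days

# Largest daysToExpiry per optionid, broadcast back onto every row.
toy['lifeTime'] = toy.groupby('optionid')['daysToExpiry'].transform('max')

key = ['date', 'exdate', 'strike_price']
print(toy.duplicated(key).sum())                 # collisions without lifeTime
print(toy.duplicated(key + ['lifeTime']).sum())  # collisions with lifeTime
```

Two options that were first observed at the same distance from expiry would still collide, which is consistent with `lifeTime` recovering only part of the duplicated rows.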


In [207]:
obj = df.groupby('optionid')['daysToExpiry'].min()
mx = obj.max()
mx

903

In [210]:
obj[obj == mx].head()

optionid
127294298    903
127294299    903
127294300    903
127294301    903
127294302    903
Name: daysToExpiry, dtype: int64

In [211]:
df[df['optionid'] == obj[obj == mx].index[0]].head()

Unnamed: 0,date,exdate,last_date,strike_price,best_bid,best_offer,volume,open_interest,impl_volatility,delta,gamma,vega,theta,optionid,forward_price,daysToExpiry,lifeTime,ct
7572729,2018-12-26,2021-12-17,20181226.0,1000000,7.5,17.4,1,0,0.310159,-0.020868,3.8e-05,212.0977,-10.16121,127294298,2548.87234,1087,1087,1
7580461,2018-12-27,2021-12-17,20181227.0,1000000,8.0,18.0,1,1,0.314782,-0.021189,3.7e-05,216.6408,-10.55952,127294298,2570.205534,1086,1087,1
7588125,2018-12-28,2021-12-17,20181228.0,1000000,8.0,17.5,1,2,0.313325,-0.020969,3.7e-05,214.3754,-10.41986,127294298,2565.769114,1085,1087,1
7595927,2018-12-31,2021-12-17,20181228.0,1000000,7.0,16.5,0,2,0.312168,-0.019507,3.5e-05,203.2778,-9.803896,127294298,2600.50399,1082,1087,1
7603604,2019-01-02,2021-12-17,20181228.0,1000000,6.1,16.0,0,2,0.309355,-0.018666,3.4e-05,195.8898,-9.375538,127294298,2604.869839,1080,1087,1


In [212]:
obj = df[df['exdate'] <= df['date'].max()].groupby('optionid')['daysToExpiry'].min()
mx = obj.max()
mx

715

In [213]:
obj[obj == mx]

optionid
31622275    715
Name: daysToExpiry, dtype: int64

In [187]:
# mx is a plain integer here, so look the optionid up through obj
df[df['optionid'] == obj[obj == mx].index[0]]

Unnamed: 0,date,exdate,last_date,strike_price,best_bid,best_offer,volume,open_interest,impl_volatility,delta,gamma,vega,theta,optionid,forward_price,daysToExpiry,lifeTime,ct
288542,2004-06-28,2006-06-17,20040628.0,1115000,84.7,88.7,100,10631,0.186653,-0.362154,0.001239,583.3969,-17.78302,31622275,1172.446159,719,719,1
288817,2004-06-29,2006-06-17,20040629.0,1115000,81.6,85.6,40,2,0.182844,-0.358732,0.001259,582.7842,-17.4774,31622275,1174.919843,718,719,1
289094,2004-06-30,2006-06-17,20040629.0,1115000,83.2,83.7,0,92,0.183636,-0.356319,0.001246,583.5762,-18.25211,31622275,1177.082994,717,719,1
289290,2004-07-01,2006-06-17,20040701.0,1115000,86.0,90.0,50,95,0.182954,-0.373285,0.001284,585.7026,-18.25056,31622275,1163.283931,716,719,1
289573,2004-07-02,2006-06-17,20040701.0,1115000,89.7,93.7,0,142,0.184864,-0.381652,0.001284,587.233,-19.07706,31622275,1156.701283,715,719,1


### Robustness Tests

Although `lifeTime` gained us ~176k data points (8,018,016 - 7,841,799, assuming we would otherwise have dropped them), it does not recover all the available data. We have a few options:

1. Keep the duplicates and treat them as independent data points (in this case we cannot guarantee uniqueness at the `(date, exdate, strike_price, lifeTime)` level).
2. Drop them and potentially merge them back in:
   - We could outright drop anything that's duplicated.
   - We could look at duplicates and select one security:
     * The combined liquidity product (take total `volume` and `open_interest`, maximum `best_bid`, minimum `best_ask`)
     * The minimum liquidity product (take the minimum `volume` product and its associated `open_interest`, `best_bid` and `best_ask`)
     * The maximum liquidity product (take the maximum `volume` product and its associated `open_interest`, `best_bid` and `best_ask`)

Here `best_ask` refers to the raw `best_offer` column, which is renamed further down.

We will implement all of these in a data loader, so that loading the data is separated from running the analysis. If an analysis requires `(date, exdate, strike_price, lifeTime)` to be unique, we must pass `drop = False` in our data loader.
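The options listed above can be sketched as a single helper. This is only one possible shape, not the loader itself: the function name `resolve_duplicates`, its `how` argument, and the constant `KEY` are hypothetical, the toy values are invented, and the quote column is assumed to be already renamed to `best_ask`.

```python
import pandas as pd

KEY = ['date', 'exdate', 'strike_price', 'lifeTime']

def resolve_duplicates(df, how='keep'):
    """Hypothetical helper: handle rows that share the same KEY."""
    if how == 'keep':          # option 1: treat every row as independent
        return df
    if how == 'drop':          # drop every key that occurs more than once
        return df[~df.duplicated(KEY, keep=False)]
    if how == 'combined':      # pool liquidity, keep the tightest quotes
        agg = {'volume': 'sum', 'open_interest': 'sum',
               'best_bid': 'max', 'best_ask': 'min'}
        return df.groupby(KEY, as_index=False).agg(agg)
    if how in ('min', 'max'):  # keep the whole row with lowest / highest volume
        grouped = df.groupby(KEY)['volume']
        idx = grouped.idxmin() if how == 'min' else grouped.idxmax()
        return df.loc[idx]
    raise ValueError(f'unknown how={how!r}')

# Invented toy data: the first two rows share a key, the third is unique.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-05-03'] * 3),
    'exdate': pd.to_datetime(['2017-07-21'] * 3),
    'strike_price': [1995000, 1995000, 2005000],
    'lifeTime': [100, 100, 100],
    'best_bid': [3.3, 3.4, 3.4],
    'best_ask': [3.7, 3.6, 3.9],
    'volume': [93, 30, 108],
    'open_interest': [0, 0, 0],
})

for how in ['keep', 'drop', 'combined', 'min', 'max']:
    print(how, len(resolve_duplicates(toy, how)))
```

Note that `'combined'` returns only the key and the four aggregated columns, while `'min'` and `'max'` keep whole rows, so the bid, ask and open interest stay tied to the selected security.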

In [203]:
df_ct = df.groupby(['date','exdate','strike_price','lifeTime'])['volume'].count().reset_index().rename(
    {'volume':'ct'},axis=1)
df = df.merge(df_ct, how = 'left', on = ['date','exdate','strike_price','lifeTime'], validate = 'm:1')
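One detail of the count above: `count()` skips missing values, so a key whose `volume` is missing in one row would be undercounted. Whether `volume` is ever missing in the real file is not something this notebook establishes, so the sketch below only illustrates the difference between `count` and `size` on invented data. The column names `ct_count` and `ct_size` are hypothetical.

```python
import numpy as np
import pandas as pd

KEY = ['date', 'exdate', 'strike_price', 'lifeTime']

# Invented toy data: the first two rows share a key, one of them has no volume.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-05-03'] * 3),
    'exdate': pd.to_datetime(['2017-07-21'] * 3),
    'strike_price': [1995000, 1995000, 2005000],
    'lifeTime': [100, 100, 100],
    'volume': [93, np.nan, 108],
})

# transform broadcasts the per-key result back to every row, so no merge is needed.
toy['ct_count'] = toy.groupby(KEY)['volume'].transform('count')  # non-null values only
toy['ct_size'] = toy.groupby(KEY)['volume'].transform('size')    # all rows in the group
print(toy[['strike_price', 'volume', 'ct_count', 'ct_size']])
```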

In [204]:
df_outright_drop = df[df['ct'] == 1].copy()
print(len(df_outright_drop))
print(len(df_outright_drop[['date','exdate','strike_price','lifeTime']].drop_duplicates()))

7573504
7573504


In [None]:
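# Combined liquidity: pool volume and open interest, keep the tightest quotes.
# The ask column is still called `best_offer` in this dataframe.
field_map = {'volume':'sum','open_interest':'sum','best_bid':'max','best_offer':'min'}
df_mrg_liquidity = df.groupby(['date','exdate','strike_price','lifeTime']).agg(field_map)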

In [122]:
df[(df['ct'] == 2)&(df['volume']>0)].sort_values(['date','exdate','strike_price','lifeTime']).head(4).T

Unnamed: 0,5110923,5114564,5110925,5110927
date,2017-05-03 00:00:00,2017-05-03 00:00:00,2017-05-03 00:00:00,2017-05-03 00:00:00
exdate,2017-07-21 00:00:00,2017-07-21 00:00:00,2017-07-21 00:00:00,2017-07-21 00:00:00
last_date,2.01705e+07,2.01705e+07,2.01705e+07,2.01705e+07
strike_price,1995000,1995000,2005000,2015000
best_bid,3.3,3.4,3.4,3.6
best_offer,3.7,3.6,3.9,4.1
volume,93,30,108,891
open_interest,0,0,0,0
impl_volatility,0.219788,0.218382,0.216376,0.213412
delta,-0.035091,-0.035093,-0.036912,-0.039143


In [127]:
obj1 = df[df['ct'] == 2].groupby(['date','exdate','strike_price','lifeTime'])['volume'].min()
obj2 = df[df['ct'] == 2].groupby(['date','exdate','strike_price','lifeTime'])['volume'].max()

In [133]:
(obj1==0).value_counts()

True     347465
False     97047
Name: volume, dtype: int64

In [148]:
obj1[obj2 > obj1].sort_values(ascending=False)

date        exdate      strike_price  lifeTime
2018-12-12  2018-12-21  2540000       123         20321
2018-11-15  2018-11-16  2700000       116         11495
2018-09-20  2018-09-21  2900000       368         10381
2019-05-16  2019-05-17  2850000       115         10284
2018-09-20  2018-09-21  2925000       246         10272
                                                  ...  
2018-06-01  2018-08-17  2750000       116             0
                        2760000       116             0
                        2770000       116             0
                        2780000       116             0
2019-06-28  2019-11-15  3000000       179             0
Name: volume, Length: 205792, dtype: int64

In [155]:
obj2['2018-12-12','2018-12-21',2540000,123]

20334

In [145]:
obj1[obj2 > obj1 + 40000]

date        exdate      strike_price  lifeTime
2018-02-15  2018-02-16  2700000       116         9906
2018-04-02  2018-05-18  2250000       116           12
2018-04-19  2018-04-20  2700000       123         2930
2018-04-25  2018-05-18  2350000       116          104
2018-05-11  2018-07-20  2470000       123            0
2018-07-19  2018-08-17  2650000       116          101
2018-09-13  2018-10-19  2600000       123           29
2018-09-14  2018-09-21  2175000       368            0
2018-10-30  2018-11-16  2450000       116           75
2018-11-30  2018-12-21  2560000       123          182
2018-12-07  2018-12-21  2590000       123          150
2018-12-14  2018-12-21  2640000       123          138
2018-12-19  2019-01-18  2330000       116            2
2018-12-20  2019-01-18  2280000       116           38
2019-06-20  2019-07-19  2200000       150          477
Name: volume, dtype: int64

In [144]:
obj2[obj2 > obj1 + 40000]

date        exdate      strike_price  lifeTime
2018-02-15  2018-02-16  2700000       116         58853
2018-04-02  2018-05-18  2250000       116         41895
2018-04-19  2018-04-20  2700000       123         43829
2018-04-25  2018-05-18  2350000       116         42692
2018-05-11  2018-07-20  2470000       123         40026
2018-07-19  2018-08-17  2650000       116         40584
2018-09-13  2018-10-19  2600000       123         42153
2018-09-14  2018-09-21  2175000       368         43070
2018-10-30  2018-11-16  2450000       116         45639
2018-11-30  2018-12-21  2560000       123         40807
2018-12-07  2018-12-21  2590000       123         40434
2018-12-14  2018-12-21  2640000       123         40640
2018-12-19  2019-01-18  2330000       116         40028
2018-12-20  2019-01-18  2280000       116         40378
2019-06-20  2019-07-19  2200000       150         68793
Name: volume, dtype: int64

In [135]:
(obj2 - obj1).describe()

count    444512.000000
mean        209.220536
std        1153.727392
min           0.000000
25%           0.000000
50%           0.000000
75%          24.000000
max       68316.000000
Name: volume, dtype: float64

In [131]:
(obj2 == obj1).value_counts()

True     238720
False    205792
Name: volume, dtype: int64

In [160]:
df = pd.read_csv('./dat/spx_option_prices.csv')
df = df[['date','exdate','strike_price','best_bid','best_offer','volume','open_interest','optionid']]
df.rename({'best_offer':'best_ask'}, axis = 1, inplace = True)

df['exdate'] = pd.to_datetime(df['exdate'].astype(str))
df['date'] = pd.to_datetime(df['date'].astype(str))

# whole calendar days between quote date and expiry
df['daysToExpiry'] = (df['exdate'] - df['date']).dt.days

df_lt = df.groupby('optionid')['daysToExpiry'].max().reset_index().rename(
    {'daysToExpiry':'lifeTime'},axis=1)
df = df.merge(df_lt, how = 'left', on = 'optionid', validate = 'm:1')

df.head()

Unnamed: 0,date,exdate,strike_price,best_bid,best_ask,volume,open_interest,optionid,daysToExpiry,lifeTime
0,2000-01-03,2000-03-18,1410000,36.75,38.75,0,1,10120210,75,75
1,2000-01-03,2000-01-22,1505000,55.25,57.25,0,50,10000760,19,19
2,2000-01-03,2000-06-17,1350000,40.75,42.75,290,14570,10016917,166,166
3,2000-01-03,2000-01-22,1500000,51.625,53.625,57,505,10149633,19,19
4,2000-01-03,2000-01-22,1540000,84.875,86.875,0,0,10056576,19,19


In [None]:
robustness_choice = ['drop','add','max','min','avg']

In [None]:
drop_dups = True
agg_dups = 'sum' # 'max','min'

In [None]:
df_ct = df.groupby(['date','exdate','strike_price','lifeTime'])['volume']
df_ct = df_ct.count().reset_index().rename({'volume':'ct'},axis=1)

df = df.merge(df_ct, how = 'left', on = ['date','exdate','strike_price','lifeTime'], validate = 'm:1')

if drop_dups:
    df = df[df['ct'] == 1].drop('ct', axis=1)
else:
    # TODO: collapse duplicated keys according to `agg_dups`
    pass

In [None]:
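# Combined-liquidity aggregation, as described under Robustness Tests:
# pool volume and open interest, keep the highest bid and the lowest ask.
dd_agg = {'best_bid': 'max',
          'best_ask': 'min',   # `best_offer` was renamed to `best_ask` above
          'volume': 'sum',
          'open_interest': 'sum'}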


df.groupby(['date','exdate','strike_price','lifeTime','daysToExpiry']).agg(dd_agg)

In [None]:
df[['date','exdate','strike_price','lifeTime']]

In [123]:
df['lifeTime'].max()

1095

In [119]:
889024 / 2 + 7573504

8018016.0
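The arithmetic in the last cell reconciles the earlier counts. A short sketch of the reasoning, using only numbers printed in this notebook and assuming every duplicated key has exactly two rows (the variable names are hypothetical):

```python
total_rows = 8462528        # len(df)
unique_key_rows = 7573504   # rows whose key occurs once (ct == 1)

# Rows whose (date, exdate, strike_price, lifeTime) key occurs more than once.
duplicated_rows = total_rows - unique_key_rows
print(duplicated_rows)

# Assumption: duplicated keys always come in pairs, so each pair is one distinct key.
distinct_keys = unique_key_rows + duplicated_rows // 2
print(distinct_keys)
```

The result equals the 8018016 distinct keys counted with `drop_duplicates()` earlier, and half of the duplicated rows equals the 444512 keys in the `ct == 2` comparison, which is consistent with the pairs assumption.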