# Options Preprocessing

Goal: Create a data frame containing all options and reduce it to only the features that will be used in the model.
      Transform the date/exdate columns in the options data to a common format shared by every data frame, since the date formats must match to join the
      data frames properly.

The options data corresponds to options securities from the iPath S&P 500 VIX Short-Term Futures ETN (VXX).
The VXX was an ETN and thus traded like an ETF until its maturity on Jan. 30, 2019. Over its life the VXX was primarily used to speculate on and hedge against market volatility, and was never designed to be a buy-and-hold investment. A similar product, the iPath Series B S&P 500 VIX Short Term Futures ETN (VXXB), was launched in 2018 with a longer maturity of 30 years to replace the VXX. It is very similar to the VXX, so this work may also be relevant to the VXXB.

Note: The VIX is a market index that measures expected price swings in the S&P 500 Index over the next 30 days, based on the prices of S&P 500 index options. It is often called a fear gauge for the market as a whole. VXX / VXXB provide exposure to futures contracts on the VIX.

In [2]:
import pandas as pd

In [3]:
%cd '/Users/benjochem/Desktop/Junior/Research'

/Users/benjochem/Desktop/Junior/Research


In [6]:
#read in all options files (range from 2010 to 2019)
opt5=pd.read_csv('Project/data/raw/option price_vxx_5.2_010118-123119.csv')
opt4=pd.read_csv('Project/data/raw/option price_vxx_4.2_010116-123117.csv')
opt3=pd.read_csv('Project/data/raw/option price_vxx_3.2_010114-123115.csv')
opt2=pd.read_csv('Project/data/raw/option price_vxx_2.2_010112-123113.csv')
opt1=pd.read_csv('Project/data/raw/option price_vxx_1.2_010110-123111.csv')

In [13]:
# concatenate options data by 'stacking' subsequent data frames on top of one another
# oldest option at index 0
options = pd.concat([opt1,opt2,opt3,opt4,opt5])
print(len(options))
options.head()

3147375


Unnamed: 0,date,symbol,symbol_flag,exdate,last_date,cp_flag,strike_price,best_bid,best_offer,volume,open_interest,impl_volatility,delta,gamma
0,5/28/2010,VXX 100619C15000,1,6/19/2010,,C,15000,13.3,13.8,0,0,,,
1,5/28/2010,VXX 100619C16000,1,6/19/2010,,C,16000,12.3,12.8,0,0,,,
2,5/28/2010,VXX 100619C17000,1,6/19/2010,,C,17000,11.3,11.8,0,0,,,
3,5/28/2010,VXX 100619C18000,1,6/19/2010,,C,18000,10.3,10.8,0,0,,,
4,5/28/2010,VXX 100619C19000,1,6/19/2010,,C,19000,9.3,9.8,0,0,,,
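
A side note on the stacking step: `pd.concat` keeps each input frame's original row labels by default, so the label 0 can appear once per input file. The sketch below uses two tiny made-up frames (not the real option files) to show the difference `ignore_index=True` makes. Whether renumbering matters here is an assumption; the later round trip through CSV with `index = False` resets the index anyway.

```python
import pandas as pd

# Two tiny hypothetical frames standing in for opt1 and opt2
a = pd.DataFrame({"strike_price": [15000, 16000]})
b = pd.DataFrame({"strike_price": [17000, 18000]})

# Default behaviour: each frame keeps its own row labels
stacked = pd.concat([a, b])

# ignore_index=True renumbers the rows 0..n-1 after stacking
renumbered = pd.concat([a, b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```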


In [14]:
options.isnull().sum()

date                    0
symbol                  0
symbol_flag             0
exdate                  0
last_date          648599
cp_flag                 0
strike_price            0
best_bid                0
best_offer              0
volume                  0
open_interest           0
impl_volatility    630549
delta              630549
gamma              630549
dtype: int64
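
The identical counts for `impl_volatility`, `delta`, and `gamma` suggest, but do not prove, that the three values are missing on the same rows. A minimal sketch of how that could be checked, using a small hypothetical sample rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same column names as the options data
sample = pd.DataFrame({
    "impl_volatility": [np.nan, 0.85, np.nan, 0.90],
    "delta": [np.nan, 0.55, np.nan, 0.40],
    "gamma": [np.nan, 0.02, np.nan, 0.03],
})

# Fraction of rows missing in each column
missing_share = sample.isnull().mean()

# If "any Greek missing" and "all Greeks missing" flag the same rows,
# the three columns are always missing together
greeks = ["impl_volatility", "delta", "gamma"]
rows_missing_any = sample[greeks].isnull().any(axis=1)
rows_missing_all = sample[greeks].isnull().all(axis=1)
same_rows = bool((rows_missing_any == rows_missing_all).all())

print(missing_share)
print(same_rows)
```

On the full data, the same two lines could be run against `options[greeks]` before deciding how to handle the missing rows.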

In [16]:
#convert data frame of all options to a csv file and save in interim data directory
options.to_csv('Project/data/interim/full_options.csv', index = False)

In [12]:
options=pd.read_csv('Project/data/interim/full_options.csv')
print(len(options))
options.head()

3147375


Unnamed: 0,date,symbol,symbol_flag,exdate,last_date,cp_flag,strike_price,best_bid,best_offer,volume,open_interest,impl_volatility,delta,gamma
0,5/28/2010,VXX 100619C15000,1,6/19/2010,,C,15000,13.3,13.8,0,0,,,
1,5/28/2010,VXX 100619C16000,1,6/19/2010,,C,16000,12.3,12.8,0,0,,,
2,5/28/2010,VXX 100619C17000,1,6/19/2010,,C,17000,11.3,11.8,0,0,,,
3,5/28/2010,VXX 100619C18000,1,6/19/2010,,C,18000,10.3,10.8,0,0,,,
4,5/28/2010,VXX 100619C19000,1,6/19/2010,,C,19000,9.3,9.8,0,0,,,


In [13]:
#reduce data frame to features to be used in the model
options=options[['date','exdate','cp_flag','strike_price','volume','open_interest','delta','gamma','impl_volatility','best_bid','best_offer']]
options.head()

Unnamed: 0,date,exdate,cp_flag,strike_price,volume,open_interest,delta,gamma,impl_volatility,best_bid,best_offer
0,5/28/2010,6/19/2010,C,15000,0,0,,,,13.3,13.8
1,5/28/2010,6/19/2010,C,16000,0,0,,,,12.3,12.8
2,5/28/2010,6/19/2010,C,17000,0,0,,,,11.3,11.8
3,5/28/2010,6/19/2010,C,18000,0,0,,,,10.3,10.8
4,5/28/2010,6/19/2010,C,19000,0,0,,,,9.3,9.8


In [15]:
#function to match data structure of the options dates to treasury data dates (yyyymmdd)
def date_to_numeric(dates):
    """Convert m/d/yyyy date strings to zero-padded yyyymmdd strings."""
    converted = []
    for d in dates:
        # source dates are month/day/year, e.g. 5/28/2010
        month, day, year = d.strip().split('/')

        # leading zeros on days/months
        converted.append(year + month.zfill(2) + day.zfill(2))

    return converted
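
With roughly three million rows, a Python-level loop over every date can be slow. As an alternative sketch, the same m/d/yyyy to yyyymmdd conversion can be done with pandas' own date parsing. The helper name `dates_to_yyyymmdd` is hypothetical, and the explicit `%m/%d/%Y` format assumes every date in the column is month/day/year, as in the rows displayed above.

```python
import pandas as pd

def dates_to_yyyymmdd(dates: pd.Series) -> pd.Series:
    """Parse m/d/yyyy strings and format them as yyyymmdd strings."""
    # An explicit format avoids per-row guessing and raises on malformed dates
    parsed = pd.to_datetime(dates.str.strip(), format="%m/%d/%Y")
    return parsed.dt.strftime("%Y%m%d")

# Small hypothetical sample in the same format as the date/exdate columns
sample = pd.Series(["5/28/2010", "6/19/2010", "12/1/2011"])
converted = dates_to_yyyymmdd(sample)
print(converted.tolist())
```

Unlike string splitting, this version fails loudly on an impossible date such as 13/45/2010 instead of silently producing a bad key.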

In [16]:
# apply function to date/exdate columns in options and add correctly formatted Date/exDate columns
options['Date'] = date_to_numeric(options.date)
options['exDate'] = date_to_numeric(options.exdate)
options.head()

Unnamed: 0,date,exdate,cp_flag,strike_price,volume,open_interest,delta,gamma,impl_volatility,best_bid,best_offer,Date,exDate
0,5/28/2010,6/19/2010,C,15000,0,0,,,,13.3,13.8,20100528,20100619
1,5/28/2010,6/19/2010,C,16000,0,0,,,,12.3,12.8,20100528,20100619
2,5/28/2010,6/19/2010,C,17000,0,0,,,,11.3,11.8,20100528,20100619
3,5/28/2010,6/19/2010,C,18000,0,0,,,,10.3,10.8,20100528,20100619
4,5/28/2010,6/19/2010,C,19000,0,0,,,,9.3,9.8,20100528,20100619


In [17]:
#drop old date columns
options=options.drop(columns=['date','exdate'])

In [18]:
# new full options dataframe with Dates and exDates in correct format
print(len(options))
options.head()

3147375


Unnamed: 0,cp_flag,strike_price,volume,open_interest,delta,gamma,impl_volatility,best_bid,best_offer,Date,exDate
0,C,15000,0,0,,,,13.3,13.8,20100528,20100619
1,C,16000,0,0,,,,12.3,12.8,20100528,20100619
2,C,17000,0,0,,,,11.3,11.8,20100528,20100619
3,C,18000,0,0,,,,10.3,10.8,20100528,20100619
4,C,19000,0,0,,,,9.3,9.8,20100528,20100619


In [19]:
#convert data frame of all options with correct dates to a csv file and save in interim data directory
options.to_csv('Project/data/interim/full_options_w_dates.csv', index = False)
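
One thing to watch when this file is read back: `Date` and `exDate` are strings here, but `pd.read_csv` infers all-digit columns as integers by default. Whether the treasury data stores its dates as strings or integers is not shown in this notebook, so the sketch below only illustrates how to control the type on reload. It uses an in-memory buffer and a one-row hypothetical frame instead of the real file.

```python
import io
import pandas as pd

# One-row hypothetical frame with the same date columns as the saved file
frame = pd.DataFrame({"Date": ["20100528"], "exDate": ["20100619"]})

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Default read: all-digit columns are inferred as integers
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Explicit dtype keeps the dates as strings
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"Date": str, "exDate": str})

print(default_read.dtypes)
print(string_read["Date"].iloc[0])
```

Whichever type is chosen, both sides of a later join need the same one, otherwise the merge keys will not match.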