Now that I've settled on 1-min realized volatility (RV) as my volatility proxy, I can move on to the juicy part: forecasting.

# The Data

---


In [1]:
import pandas as pd
import numpy as np
import sqlite3
from matplotlib import pyplot as plt
from scipy import stats

# Set default figure size
plt.rcParams["figure.figsize"] = (15, 10)
pd.plotting.register_matplotlib_converters()

In [2]:
# Here's my minute data for the S&P 500
spx_minute = pd.read_csv(
    "SPX_1min.csv",
    header=0,  # the file has a header row, which `names` replaces
    names=['datetime', 'open', 'high', 'low', 'close'],
    index_col='datetime',
    parse_dates=True,
)

In [3]:
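# Here's the function for calculating the 1-min RV, as discussed in my last post
def rv_calc(data):
    """
    Returns a series of daily realized variance: the sum of squared
    1-min log returns within each calendar day.
    """
    results = {}

    # Group by date first so that no return spans the overnight gap
    for day, day_data in data.groupby(data.index.date):
        returns = np.log(day_data['close']).diff()
        # The first return of each day is NaN and is skipped by the sum
        results[day] = np.sum(returns**2)

    return pd.Series(results)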

In [4]:
spx_rv = rv_calc(spx_minute)
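
As a quick sanity check on what this returns, here is a minimal sketch on made-up minute bars. The prices, timestamps and the names `toy_bars` and `toy_rv_calc` are invented for illustration and are not SPX data; the function simply repeats the per-day logic of `rv_calc`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two days with three one-minute closes each (made-up prices)
idx = pd.to_datetime([
    "2020-01-02 09:30", "2020-01-02 09:31", "2020-01-02 09:32",
    "2020-01-03 09:30", "2020-01-03 09:31", "2020-01-03 09:32",
])
toy_bars = pd.DataFrame(
    {"close": [100.0, 101.0, 100.0, 100.0, 100.0, 102.0]}, index=idx
)

def toy_rv_calc(data):
    # Same idea as rv_calc: sum of squared 1-min log returns per calendar day
    out = {}
    for day, day_data in data.groupby(data.index.date):
        r = np.log(day_data["close"]).diff()
        out[day] = (r ** 2).sum()  # the NaN first return of each day is skipped
    return pd.Series(out)

toy_daily_rv = toy_rv_calc(toy_bars)
print(toy_daily_rv)
```

Because the grouping happens before the returns are computed, the move from one day's last close to the next day's first close never enters the sum: each day's value only reflects intraday returns.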

# The Model

My goal is to predict volatility over the next week, i.e. the next 5 trading days. My independent variables are the last 21 days of daily RV, and my dependent variable is the realized volatility over the following 5 days. To get more samples, I compute a rolling 5-day sum of RV and shift it back 5 periods, so that each day's row holds the RV of the 5 days that come after it. This lets me create a 5-day volatility forecast for every day, rather than one per week.
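
To make the rolling-and-shift step concrete, here is a small sketch on a made-up series. The values 1 to 10 and the names `toy_daily` and `target` are purely illustrative, picked so the sums can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical daily RV values 1..10, chosen so the sums are easy to verify
toy_daily = pd.Series(np.arange(1.0, 11.0))

# rolling(5).sum() at position t covers positions t-4..t;
# shift(-5) moves that value back so position t holds the sum over t+1..t+5
target = toy_daily.rolling(5).sum().shift(-5)

print(target)
```

Position 0 ends up with 2 + 3 + 4 + 5 + 6 = 20, which is the sum of the five values after it and excludes its own value. The last five positions are NaN because their forward window runs off the end of the series.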

In [63]:
def create_lags(series, lags):
    """
    Creates a dataframe with lagged values of the given series.
    Column x is the original series; each column x_{n} holds the
    original series lagged by n rows, for n = 1..lags.
    """
    result = pd.DataFrame(index=series.index)
    result["x"] = series

    for n in range(1, lags + 1):
        result[f"x_{n}"] = series.shift(n)

    return result

In [77]:
# Sum of RV over the 5 days following each date
dep_var = spx_rv.rolling(5).sum().shift(-5).dropna()
# Today's RV plus its 21 lags
indep_var = create_lags(spx_rv, 21).dropna()
# Keep only the dates present in both sets, so that they have the same length
# and the rows match up properly
common_index = dep_var.index.intersection(indep_var.index)
dep_var = dep_var.loc[common_index]
indep_var = indep_var.loc[common_index]
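
Here is a sketch of the same alignment on a toy series, using 2 lags and a 2-day horizon instead of 21 and 5 so the whole table fits on screen. The dates, values and the names `toy`, `X`, `y` and `aligned` are assumptions for illustration; `create_lags` is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def create_lags(series, lags):
    # Copy of the helper above: x is the series, x_n is the series lagged n rows
    result = pd.DataFrame(index=series.index)
    result["x"] = series
    for n in range(1, lags + 1):
        result[f"x_{n}"] = series.shift(n)
    return result

# Hypothetical daily RV series (values 1..8) on made-up business days
toy = pd.Series(np.arange(1.0, 9.0), index=pd.bdate_range("2020-01-01", periods=8))

# Smaller settings than in the post (21 lags, 5-day horizon) to keep it readable
y = toy.rolling(2).sum().shift(-2).dropna()  # loses the last 2 dates
X = create_lags(toy, 2).dropna()             # loses the first 2 dates

# Keep only the dates that survive in both
common = y.index.intersection(X.index)
y, X = y.loc[common], X.loc[common]

aligned = X.join(y.rename("y"))
print(aligned)
```

The lags cost rows at the start of the sample and the forward-looking target costs rows at the end, so only the middle 4 of the 8 dates remain. Note also that `create_lags(series, n)` returns n + 1 columns, since the unlagged `x` is included alongside the n lags.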