<a href="https://colab.research.google.com/github/fintechsteve/modeling-volatility/blob/master/Part_04b_Creating_Turbulence_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Part 05 (b): Creating a GARCH Model

### In this section you will:


*   Load volatility forecasts with a rolling window.
*   Transform the forecasts into set of dynamic weights and track performance of the model.



## Import all necessary libraries

For this piece, we will need the following packages to be available to our environment:

*   Numpy and Pandas (For data manipulation)
*   DateTime (For basic date manipulation)
*   Matplotlib (For timeseries vizualization)

If the packages are not available, install the with "pip install X"

In [9]:
import numpy as np, pandas as pd
from datetime import datetime, timedelta
import pickle
import cloudpickle as cp
import urllib.request, urllib.parse, urllib.error
import matplotlib.pyplot as plt
import bisect
from scipy import stats
import seaborn as sns

### Read in data from previously stored vol_pred.pkl file



In [5]:
vol_pred = cp.load(urllib.request.urlopen('https://github.com/fintechsteve/data/blob/master/vol_pred.pkl?raw=true'))

HTTPError: HTTP Error 404: Not Found

Alternatively, if you have estimated the models locally.

In [10]:
with open('./vol_pred.pkl', 'rb') as f:
    returns = pickle.load(f)
    f.close()

In [11]:
vol_pred.head(10)

NameError: name 'vol_pred' is not defined

Plot p = 2, q = 3 as an example.

In [12]:
vol_pred_23 = df_vol_pred[(df_vol_pred['p']==2) & (df_vol_pred['q']==3)][['date', 'var']]
vol_pred_23 = vol_pred_23.set_index('date')
vol_pred_23.plot()

NameError: name 'df_vol_pred' is not defined

### Convert rolling volatility forecast into a weight 

We are going to use linear ranking to create weights, just as we did for the Turbulence signal.

The linear ranking, when used in-sample, should map results to a uniform distribution.

In [14]:
vol_pred_23.hist(bins=50)

NameError: name 'vol_pred_23' is not defined

In [15]:
vol_pred_23_rank = vol_pred_23.dropna().rank()/vol_pred_23.count()
vol_pred_23_rank.hist(bins=50)

NameError: name 'vol_pred_23' is not defined

In [16]:
insample_start_date = '1975-12-30'
insample_end_date = '2000-01-01'
insample_vol_pred_23 = (vol_pred_23.loc[insample_start_date:insample_end_date]).dropna().values
insample_vol_pred_23.sort()
num_insample = np.size(insample_vol_pred_23)
vol_pred_23_ranked = 1-np.true_divide(list(map(lambda x:float('NaN') if np.isnan(x) else bisect.bisect_left(insample_vol_pred_23,x), vol_pred_23['var'])),num_insample)
plt.plot(vol_pred_23_ranked)

NameError: name 'vol_pred_23' is not defined

In [17]:
def garch_weights(vol, insample_start_date, insample_end_date):
  insample_vol = (vol.loc[insample_start_date:insample_end_date]).dropna().values
  insample_vol.sort()
  num_insample = np.size(insample_vol)
  vol_weights = pd.DataFrame(1-np.true_divide(list(map(lambda x:float('NaN') if np.isnan(x) else bisect.bisect_left(insample_vol,x), vol['var'])),num_insample), index=vol.index)
  vol_weights.columns=['weights']
  return vol_weights

In [18]:
vol_weights = garch_weights(vol_pred_23, '1975-12-30', '2000-01-01')
vol_weights.head(20)

NameError: name 'vol_pred_23' is not defined

### Quick sanity checks

1) Are the weights aligned with decreasing volatility?
2) Does the weight inversely correlate with subsequent volatility?

In [19]:
pd.concat([vol_pred_23, vol_weights],axis=1).plot(kind='scatter', x='weights', y='var',
                                           color='Red', label='')

NameError: name 'vol_pred_23' is not defined

In [20]:
dxy_weight = [0, 0.119, 0.036, 0, 0.136, 0.576, 0, 0, 0.091]
dxy = pd.DataFrame(returns.dot(dxy_weight))
dxy_trunc = dxy[insample_start_date:'2017-12-22'] ##the last few days are missing?, I need to redo the GARCH stuff to get the last couple days
dxy_forward = dxy_trunc.shift(-1)
dxy_forward_sq = (dxy_forward**2).rename(columns={0:'dxyfsq'})

ValueError: Dot product shape mismatch, (98586, 4) vs (9,)

In [None]:
highvol = vol_weights['weights']<.25
pd.DataFrame([[np.sqrt(261)*dxy_trunc[highvol].std(), np.sqrt(261)*dxy_trunc[~highvol].std()], [sum(highvol), sum(~highvol)]], index=['std','N'], columns=['High Vol','Low Vol'])

### Calculate the performance of the GARCH model

Leveraging the framework we developed in Part 3, we calculate the excess performance of the model.


In [21]:
def model_perf(model_wts, dxyfsq, framework_params):
  # Note, since chapter 3, modified the code a little to handle the possibility that column headings are not constant
  # Also added an estimation window, which we will use later
    
  model_vol = pd.DataFrame()
  model_vol['total'] = model_wts[model_wts.columns[0]]*dxyfsq[dxyfsq.columns[0]] 
  model_vol['avg'] = (np.mean(model_wts.loc[framework_params['insample_start_date']:framework_params['insample_end_date']].dropna()[model_wts.columns[0]])*dxyfsq)

  is_model_vol = model_vol.loc[framework_params['insample_start_date']:framework_params['insample_end_date']].dropna() 
  is_vol_rmse = np.sqrt(np.mean(is_model_vol))

  out_model_vol = model_vol.loc[framework_params['outofsample_start_date']:framework_params['outofsample_end_date']].dropna()
  out_vol_rmse = np.sqrt(np.mean(out_model_vol))
  
  model_perf_stats = {'insample rmse': is_vol_rmse['total'], 'insample excess rmse': is_vol_rmse['total'] - is_vol_rmse['avg'],
                        'out-of-sample rmse': out_vol_rmse['total'], 'out-of-sample excess rmse': out_vol_rmse['total'] - out_vol_rmse['avg']}
    
  if framework_params.get('estimationsample_start_date'):
    est_model_vol = model_vol.loc[framework_params['estimationsample_start_date']:framework_params['estimationsample_end_date']].dropna()
    est_vol_rmse = np.sqrt(np.mean(est_model_vol))
    model_perf_stats['estimation rmse'] = est_vol_rmse['total']
    model_perf_stats['estimation excess rmse'] = est_vol_rmse['total'] - est_vol_rmse['avg']
        
  if framework_params.get('verbose',False):
    model_perf_stats['insample model volatility'] = is_model_vol
    model_perf_stats['out-of-sample model volatility'] = out_model_vol
    
  return model_perf_stats

In [17]:
framework_params = {'insample_start_date': '1975-12-30',
                    'insample_end_date': '2000-01-01',
                    'outofsample_start_date': '2000-01-02',
                    'outofsample_end_date': '2017-12-22',
                    'verbose': False}

model_perf_stats = model_perf(vol_weights, dxy_forward_sq, framework_params)
model_perf_stats

{'insample rmse': 0.003233913057816355,
 'insample excess rmse': -0.00022350394266198916,
 'out-of-sample rmse': 0.0029439065157343116,
 'out-of-sample excess rmse': -0.00021882194541095017}

### Test significance of excess volatility

A simple test of the ability of a turbulence model to outperform a random model is to calculate the significance of the average excess volatility.

We want this statistic to be negative, because a good model should reduce volatility.

We can calculate this by looking at the difference between the volatility for the model and the volatility of a model that assigns an average weight to volatility and calculating a t-statistic:

In [22]:
stats.ttest_rel(model_perf_stats['insample model volatility']['total'], model_perf_stats['insample model volatility']['avg'])

NameError: name 'model_perf_stats' is not defined

In [23]:
stats.ttest_rel(model_perf_stats['out-of-sample model volatility']['total'], model_perf_stats['out-of-sample model volatility']['avg'])

NameError: name 'model_perf_stats' is not defined

### Grid search for the optimal parameters

We have been working with the GARCH model with p = 2 and q = 3. How do we pick the correct parameters?

### Splitting the sample for hyperparameter tuning


In [24]:
framework_params = {'insample_start_date': '1975-12-30',
                    'insample_end_date': '1994-12-31',
                    'estimationsample_start_date': '1995-01-01',
                    'estimationsample_end_date': '2004-12-31',
                    'outofsample_start_date': '2005-01-01',
                    'outofsample_end_date': '2017-12-22',
                    'verbose': False}

def garch_models_performance(vol_signal, framework_params):
  vol_wts = garch_weights(vol_signal, framework_params['insample_start_date'], framework_params['insample_end_date'])
  model_perf_stats = model_perf(vol_wts, dxy_forward_sq, framework_params)
  return {'model_perf_stats':model_perf_stats, 'vol_wts':turb_wts}

some_model_perf_stats = garch_models_performance(vol_pred_23)
some_model_perf_stats['model_perf_stats']

NameError: name 'vol_pred_23' is not defined

We will run for p=1, 2, and 3 and q = 1, 2, and 3


In [25]:
%%time
ps = [1, 2, 3]
qs = [1, 2, 3]

garch_model_perf_stats = pd.DataFrame(index = index, columns = some_model_perf_stats['model_perf_stats'].keys())
garch_model_weights = pd.DataFrame(index = returns.index, columns = index)

for p in ps:
  for q in qs:
    vol_pred = df_vol_pred[(df_vol_pred['p']==p) & (df_vol_pred['q']==q)][['date', 'var']]
    model_output = garch_models_performance(vol_pred, framework_params)
    garch_model_perf_stats.loc[(lookback,smoothing)] = model_output['model_perf_stats']
    garch_model_weights.loc[:,(lookback,smoothing)] = model_output['vol_wts']['weights']

garch_model_perf_stats

NameError: name 'index' is not defined

In [26]:
vol_model_weights.tail(20)

NameError: name 'vol_model_weights' is not defined

In [28]:
garch_model_weights.to_pickle('./garch_model_weights.pkl')
garch_model_perf_stats.to_pickle('./garch_model_perf_stats.pkl')