## Risk Factor Model Construction with Regression

This notebook walks through the construction of a simple factor model from sample data using regression.

Suppose we have some data (supplied in "risk factors.xlsx"). Our objective is to build a simple risk factor model to explain the risk attribution of a return series.
The sheet labeled "Target" contains the time series vector that we would like to explain, and the "factors" and/or "etf-returns" sheets contain the potential risk factors.

My approach is to fit a multiple linear regression of the return series on the ETFs, the factors, or both. The regressors of the chosen model (ETFs and/or factors) serve as the factors in the factor model, and their coefficients serve as a proxy for how much each factor contributes to explaining the variance of the return series.
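For reference, the regression described above can be written in the standard textbook form. The notation is my own assumption and does not come from the spreadsheet: $r_t$ is the target return at time $t$, $f_{k,t}$ is the $k$-th regressor (an ETF return or a macroeconomic factor), $\beta_k$ is its coefficient, and $\varepsilon_t$ is the residual.

$$
r_t = \alpha + \sum_{k=1}^{K} \beta_k f_{k,t} + \varepsilon_t
$$

If the residual is uncorrelated with the regressors, the variance of the target splits into a part explained by the factors and an idiosyncratic part, where $\Sigma_f$ denotes the covariance matrix of the regressors and $\sigma_\varepsilon^2$ the residual variance:

$$
\operatorname{Var}(r_t) = \beta^\top \Sigma_f \, \beta + \sigma_\varepsilon^2
$$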

First, some exploratory analysis. What are we dealing with here?

It helps to open the Excel spreadsheet and take a look, as well as to plot some of the data.

Upon opening the Excel document, we notice a date mismatch. It is not unheard of for the predictors (independent variables) to have a longer time series than the response (dependent) variable. This typically happens when a historical window of predictor data is used to compute one or more statistics that are relevant to the response variable.

However, in this data set the response series is the longer one, starting in 2010, while the independent variables have data only from 2018 onward. This is unusual, and there could be a variety of reasons for it. Perhaps the data was provided by an external data provider "as-is." Perhaps the data was generated by querying a table, and the date ranges weren't aligned. Regardless, we need to figure out what to do here. One option is to go out and find data for 2010-2018 for the predictors, and then prepend that to the given data. However, this could cause issues if different data providers report differing quotes. For exchange data this doesn't happen, but I am less familiar with macroeconomic data. Thus, I will restrict the analysis period to 2018 to 2020.
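Restricting the analysis to the overlapping dates can be sketched with an inner join in pandas. This is only an illustration on synthetic data: the monthly frequency, the column names (`etf_a`, `factor_x`), and the values are my assumptions, not the contents of the spreadsheet.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the "Target" sheet: starts in 2010 (assumed monthly)
target_dates = pd.date_range("2010-01-01", "2020-12-01", freq="MS")
target = pd.Series(
    rng.normal(0.0, 0.02, len(target_dates)), index=target_dates, name="target"
)

# Synthetic stand-in for the predictor sheets: starts in 2018
predictor_dates = pd.date_range("2018-01-01", "2020-12-01", freq="MS")
predictors = pd.DataFrame(
    rng.normal(0.0, 0.02, (len(predictor_dates), 2)),
    index=predictor_dates,
    columns=["etf_a", "factor_x"],  # hypothetical column names
)

# An inner join keeps only the dates present in both series
aligned = predictors.join(target, how="inner")

print(aligned.index.min(), aligned.index.max(), len(aligned))
```

An inner join has the advantage of not hard-coding the start date, so it also drops any dates that are missing from one side in the middle of the sample.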



Having familiarized myself with the data, I immediately had some ideas:
1. Regress the target returns on the four ETFs to see whether the ETF returns explain the target.
2. Regress the target returns on the three macroeconomic factors to see whether the factors explain the target.
3. Regress the target on the four ETFs and the three factors together.
4. Regress the target on an intelligently chosen subset of the four ETFs and three factors.
5. Perform a principal component analysis on the ETFs, and then perform a regression of the target with the principal components as the independent variables.
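To make ideas 1 and 5 concrete, here is a minimal sketch on synthetic data using only NumPy. The data, the "true" coefficients, and the helper `fit_ols` are hypothetical and exist only for illustration; in the notebook itself the same kind of fit would more naturally go through statsmodels (for example `sm.OLS` with `sm.add_constant`).

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs = 500

# Hypothetical data: four "ETF" return columns and a target driven by two of them
etfs = rng.normal(0.0, 0.01, size=(n_obs, 4))
true_beta = np.array([0.5, 0.0, -0.3, 0.0])
target = 0.001 + etfs @ true_beta + rng.normal(0.0, 0.001, size=n_obs)


def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept; returns (coefficients, R^2)."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    r2 = 1.0 - resid.var() / y.var()
    return coef, r2


# Idea 1: regress the target on the ETF returns
coef_etf, r2_etf = fit_ols(etfs, target)

# Idea 5: PCA via SVD of the centered ETF returns, then regress on the component scores
centered = etfs - etfs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T  # columns are the principal component scores
coef_pc, r2_pc = fit_ols(scores, target)

print("ETF betas:", coef_etf[1:], "R^2:", r2_etf)
print("PC betas: ", coef_pc[1:], "R^2:", r2_pc)
```

Keeping all four components spans the same space as the original ETFs, so the two fits have the same R². The principal component version only becomes a different model once some low-variance components are dropped.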

In [5]:
# For running as a script, include this line at the top of the file
#!/usr/bin/env python3


#Step 0: Import Libraries

import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# pandas version 1.4.2 on my system; versions 0.x will result in problems
import pandas as pd
import matplotlib.pyplot as plt
