## Learning and Cointegration in Pairs

This report describes the implementation and techniques used to analyse cointegration in a pair of stocks. To keep it simple, the project is limited to a single pair rather than a portfolio of stocks.

• Part 1 - Describes the data used for this project.

• Part 2 - Describes the concise matrix-form estimation for multivariate Vector Autoregression and conducts model specification tests for:

      (a) Identification of the optimal lag p with AIC / BIC tests

      (b) Stability check with eigenvalues of the autoregression system

• Part 3 - Describes the implementation of the Engle-Granger procedure and explores several cointegrated pairs.

• Part 4 - Describes robust estimation.

• Part 5 - Trading strategy based on cointegration spread.

• Appendix - Describes some of the mathematical methods involved, such as multivariate regression models (VAR(p), ECM), the Augmented Dickey-Fuller test and Ornstein–Uhlenbeck processes.


## Part 1 - Data sets used in this project

## Simulated Data

Simulated stochastic processes were produced using Monte Carlo (MC) simulation, with random samples drawn from a normal distribution.
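
A minimal sketch of how such a simulated pair might be generated, assuming a random walk driven by normal draws and a second series tied to it through a stationary AR(1) spread. The helper name `simulate_cointegrated_pair`, the parameter values and the seed are illustrative assumptions, not taken from the project code.

```python
import numpy as np

def simulate_cointegrated_pair(n_obs=500, beta=1.5, phi=0.8, seed=42):
    """Simulate two series that share one stochastic trend (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Random walk: cumulative sum of i.i.d. standard normal shocks.
    x = np.cumsum(rng.standard_normal(n_obs))
    # Stationary AR(1) spread (|phi| < 1), also driven by normal shocks.
    shocks = rng.standard_normal(n_obs)
    spread = np.zeros(n_obs)
    for t in range(1, n_obs):
        spread[t] = phi * spread[t - 1] + shocks[t]
    # y follows x in the long run; y - beta * x equals the stationary spread.
    y = beta * x + spread
    return x, y

x, y = simulate_cointegrated_pair()
print(x.shape, y.shape)
```

Each series is non-stationary on its own, while the combination `y - beta * x` is stationary by construction, which is the property the later cointegration tests look for.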

## Market Data

* To keep it simple, the historic prices of two stocks are used to describe the project.

* The pair researched for cointegration consists of two US banking stocks, Bank of America and Citigroup.

* The two series of adjusted closing prices were joined to produce a single dataset of daily adjusted closing prices for Bank of America and Citigroup.

* The time series covers Jan-2016 to Dec-2016 for the in-sample testing and Jan-2015 to Jun-2015 for the out-of-sample testing. Several sources recommend using one year of historic data to estimate the cointegration parameters and trading on those estimates for a six-month period, because the parameters might change over time or the relationship might cease to exist.

* Since stock price time series are generally non-stationary, log returns were taken to make the time series stationary.
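
The joining and log-return steps above could look roughly like the following with pandas. The price values are made-up placeholders, not real market data, and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical adjusted closing prices; the values are placeholders.
dates = pd.date_range("2016-01-04", periods=5, freq="B")
bac = pd.Series([16.0, 16.2, 15.9, 16.1, 16.4], index=dates, name="BAC")
citi = pd.Series([50.0, 50.5, 49.8, 50.2, 51.0], index=dates, name="C")

# Join the two series on their shared dates into a single dataset.
prices = pd.concat([bac, citi], axis=1, join="inner")

# Log returns: r_t = ln(P_t) - ln(P_{t-1}); the first row has no predecessor.
log_returns = np.log(prices).diff().dropna()
print(log_returns)
```

An inner join keeps only the dates on which both stocks have a price, so the two columns stay aligned.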


## Part 2 - Concise matrix-form estimation for multivariate Vector Autoregression and model specification tests for (a) identification of the optimal lag p with AIC / BIC tests and (b) stability check with eigenvalues of the autoregression system

In [10]:
import pandas as pd

from statsmodels.tsa.api import VAR
from statsmodels.regression.linear_model import OLS
from statsmodels.tsa.tsatools import (lagmat, add_trend)
from statsmodels.tsa.stattools import adfuller

%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pylab as pylab

print("Imports completed!")

Imports completed!


In [1]:
print("Started loading all self implementation of VAR, OLS, Optimum Lag selection based on matrix manipulation")
%run Cointegration.py
print("Completed loading all self implementation of VAR, OLS, Optimum Lag selection based on matrix manipulation")

Started loading all self implementation of VAR, OLS, Optimum Lag selection based on matrix manipulation
Completed loading all self implementation of VAR, OLS, Optimum Lag selection based on matrix manipulation


Please refer to Cointegration.py for the self-implemented versions of the following:
    
    1. OLS
    
    2. ADFuller test
    
    3. Vector Autoregression
    
    4. Optimum Lag selection
    
    5. Stability checks
    
    6. Z-score calculation
    
Please note that the self-implemented versions above have been verified against the corresponding implementations in Python's statsmodels library.

# Vector Autoregression (VAR)

It is a multivariate regression on past values.

VAR(p) is the simplest form of structural equation modelling.

It models a system of endogenous variables that depend only on their own past (lagged) values.

$$
Y_t = C + A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \epsilon_t
$$

where $Y_t = (y_{1,t}, \dots, y_{n,t})'$ is an $n \times 1$ column vector, with $n = N_{var}$ the number of variables,

and each $A_i$, $i = 1, \dots, p$, is an $n \times n$ matrix of coefficients for the lagged variables $Y_{t-1}, \dots, Y_{t-p}$.
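
As an illustration of the definition, the sketch below simulates a VAR(1) with $n = 2$ and applies the eigenvalue stability check listed for Part 2. The values of `C` and `A1`, the sample length and the seed are made-up assumptions, not estimates from the project data.

```python
import numpy as np

# Illustrative VAR(1) with n = 2 variables; C and A1 are made-up values.
C = np.array([0.1, -0.2])
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])

rng = np.random.default_rng(0)
T = 200
Y = np.zeros((T, 2))
for t in range(1, T):
    # Y_t = C + A_1 Y_{t-1} + eps_t, with eps_t drawn from a standard normal.
    Y[t] = C + A1 @ Y[t - 1] + rng.standard_normal(2)

# For p = 1 the system is stable when every eigenvalue of A_1
# lies strictly inside the unit circle.
moduli = np.abs(np.linalg.eigvals(A1))
print("eigenvalue moduli:", moduli, "stable:", bool(np.all(moduli < 1)))
```

For $p > 1$ the same check is applied to the companion matrix built from $A_1, \dots, A_p$ rather than to a single coefficient matrix.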

# Vector Autoregression : Estimation

Although a VAR(p) system can be exceedingly large, it is a system of seemingly unrelated regressions that can be estimated separately, line by line, using Ordinary Least Squares (OLS).

Matrix manipulation is available in Python's NumPy package, which makes it possible to specify a concise form and estimate all lines of the Vector Autoregression in one go.
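
A hedged sketch of what such a concise form might look like, assuming the standard OLS estimator $\hat{B} = Y Z' (Z Z')^{-1}$ with $\hat{B} = [C, A_1, \dots, A_p]$. The helper `estimate_var` and the simulated data are illustrative assumptions; this is not the code in Cointegration.py. The last lines check that the one-shot matrix estimate matches OLS run separately on each equation.

```python
import numpy as np

def estimate_var(data, p):
    """OLS estimate of a VAR(p) in matrix form (illustrative helper).

    data : array of shape (T, n), one row per observation.
    Returns B of shape (n, 1 + n * p), laid out as [C, A_1, ..., A_p].
    """
    T, n = data.shape
    # Y: n x (T - p), observations p+1..T in columns.
    Y = data[p:].T
    # Z: a row of ones stacked on the lagged observations, (1 + n*p) x (T - p).
    lags = [data[p - k:T - k].T for k in range(1, p + 1)]
    Z = np.vstack([np.ones((1, T - p))] + lags)
    # B = Y Z' (Z Z')^{-1}, solved without forming the inverse explicitly.
    return np.linalg.solve(Z @ Z.T, Z @ Y.T).T

# Simulate a VAR(1) with made-up coefficients to exercise the estimator.
rng = np.random.default_rng(1)
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
data = np.zeros((2000, 2))
for t in range(1, 2000):
    data[t] = A1 @ data[t - 1] + rng.standard_normal(2)

B = estimate_var(data, p=1)

# Line-by-line check: regress each variable separately on the same regressors.
X = np.column_stack([np.ones(1999), data[:-1]])
row_by_row = np.vstack(
    [np.linalg.lstsq(X, data[1:, i], rcond=None)[0] for i in range(2)]
)
print("matrix form matches line-by-line OLS:", np.allclose(B, row_by_row))
```

Solving the normal equations with `np.linalg.solve` avoids computing $(Z Z')^{-1}$ explicitly, which is usually the numerically safer choice.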

Even though a VAR implementation is available in Python's statsmodels package, a matrix-based estimation of VAR was implemented with NumPy, using the following steps:

1. The dependent data matrix was formed with the observations for the first p lags removed, where $T = N_{obs}$. Each row holds the observations of one variable, running from time p+1 to the most recent observation at T:
$$
Y = [y_{p+1} \quad y_{p+2} \quad \dots \quad y_{T}]
$$
where
$[y_{1,1} \quad \dots \quad y_{1,p} \quad y_{1,p+1} \quad \dots \quad y_{1,T}]$ refers to all historic observations of the variable $y_1$.


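2. The explanatory data matrix $Z$ stacks a row of ones (for the constant) above the lagged observations:

$$
Z = \begin{bmatrix}
1 & 1 & \dots & 1 \\
y_{p} & y_{p+1} & \dots & y_{T-1} \\
y_{p-1} & y_{p} & \dots & y_{T-2} \\
\vdots & \vdots & & \vdots \\
y_{1} & y_{2} & \dots & y_{T-p}
\end{bmatrix}
$$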