# PaTSwAPS: Pairs Trading Strategy with Automated Pair Selection
# *Pair Selector Notebook*
## To-Do

1) Read this paper. The authors manage to find cointegrated pairs and simulate a successful trading strategy. They give decent advice on how they did their screening. http://www.ccsenet.org/journal/index.php/ijef/article/view/33007

2) Test for cointegration over many time scales (e.g. days, weeks, months, years).

3) Research how to correct for the multiple comparisons problem. https://en.wikipedia.org/wiki/Multiple_comparisons_problem#Controlling_procedures

4) Consider using the following more advanced statistical tests and measures of cointegration and mean reversion:
* Augmented Dickey-Fuller test
* Hurst exponent
* Half-life of mean reversion inferred from an Ornstein–Uhlenbeck process
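
As a starting point for the last item, here is a minimal sketch of a half-life estimate. It assumes the spread follows a discrete-time AR(1) process (the discretized Ornstein–Uhlenbeck model) and fits it by ordinary least squares. The function name `half_life` and the synthetic series are illustrative and are not part of this notebook.

```python
import math

def half_life(spread):
    """Estimate the half-life of mean reversion of a spread, in sampling periods.

    Fits delta_t = a + b * spread_{t-1} by ordinary least squares, so that the
    implied AR(1) coefficient is (1 + b), then solves (1 + b)**t = 0.5 for t.
    """
    spread = list(spread)
    lagged = spread[:-1]
    delta = [curr - prev for prev, curr in zip(spread[:-1], spread[1:])]
    n = len(lagged)
    mean_x = sum(lagged) / n
    mean_y = sum(delta) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, delta))
    b = sxy / sxx  # OLS slope
    if not -1.0 < b < 0.0:
        return float('inf')  # no monotone mean reversion detected
    return -math.log(2.0) / math.log(1.0 + b)

# Synthetic spread that halves its distance from zero every period
spread = [0.5 ** t for t in range(10)]
print(half_life(spread))
```

For slowly reverting spreads (small $|b|$), the commonly quoted approximation $-\ln 2 / b$ gives nearly the same value.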

## Correcting for Multiple Comparisons

For references and more information, see https://en.wikipedia.org/wiki/Multiple_comparisons_problem#Controlling_procedures

Let $\bar{\alpha}$ denote the *family-wise error rate* (FWER), or the experiment-wide significance level, which is the probability of making one or more false positives (type I errors). Here, a false positive means identifying a pair as cointegrated when no underlying cointegration is present.

Furthermore, let $\alpha_{comp}$ denote the significance level of each individual trial (or comparison).

Then, if the $k$ comparisons are independent, the following relationship holds: $\bar{\alpha}=1-(1-\alpha_{comp})^k$, where $k$ is the number of comparisons.
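
To make the inflation concrete, the sketch below evaluates this relationship at $\alpha_{comp}=0.05$ for the number of unordered pairs among $n$ securities, assuming independent comparisons. The function name `family_wise_error_rate` is illustrative and is not part of the notebook's code.

```python
def family_wise_error_rate(alpha_comp, k):
    """FWER for k independent comparisons, each run at significance level alpha_comp."""
    return 1.0 - (1.0 - alpha_comp) ** k

# The number of unordered pairs among n securities is n*(n-1)/2
for n in (2, 5, 10, 20):
    k = n * (n - 1) // 2
    print(n, k, round(family_wise_error_rate(0.05, k), 3))
```

With only ten securities (45 pairs), the chance of at least one false positive is already about 0.90.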

Below are several ways of correcting for multiple comparisons:

* Bonferroni correction: Let $\alpha_{comp}=\bar{\alpha}/k$. This is the first-order approximation to the exact per-comparison significance level, obtained from the binomial expansion of $(1-\alpha_{comp})^k$. The approximation improves as $k\alpha_{comp}$ gets smaller, and because $\bar{\alpha}/k$ never exceeds the exact value, the correction is slightly conservative. For more information, see https://en.wikipedia.org/wiki/Bonferroni_correction#Definition


* Šidák correction: Let $\alpha_{comp}=1-(1-\bar{\alpha})^{1/k}$. This correction is the exact solution for the per-comparison significance level when the comparisons are independent. For more information, see https://en.wikipedia.org/wiki/%C5%A0id%C3%A1k_correction#Usage


* Holm-Bonferroni correction: Sort the $k$ p-values in ascending order and let $\alpha_{comp}=\bar{\alpha}/(k-i+1)$ for the p-value of rank $i$. This correction is more powerful than the simple Bonferroni correction, but is more complex in that it is a stepwise algorithm. In essence, it tests the most extreme (smallest) p-value against the most stringent criterion $(i=1)$, and tests progressively less extreme p-values against progressively less strict criteria $(i>1)$, stopping at the first p-value that fails its test. For more information, see https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method#Formulation
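
The stepwise Holm-Bonferroni procedure is the least obvious of the three, so here is a minimal sketch of it. The function name `holm_bonferroni` and the example p-values are illustrative; this notebook does not use the procedure.

```python
def holm_bonferroni(pvalues, fwer=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    k = len(pvalues)
    # Indices of the p-values, ordered from most extreme (smallest) to least extreme
    order = sorted(range(k), key=lambda idx: pvalues[idx])
    reject = [False] * k
    for rank, idx in enumerate(order, start=1):
        threshold = fwer / (k - rank + 1)
        if pvalues[idx] > threshold:
            break  # first failure: this and all larger p-values are not rejected
        reject[idx] = True
    return reject

# Illustrative p-values only
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

In this example the two smallest p-values pass their thresholds ($0.05/4$ and $0.05/3$), while the third fails against $0.05/2$, which stops the procedure.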

We implement the Šidák correction, as it gives the exact per-comparison significance level (under independence) and can be implemented in a few lines of code.
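
For a sense of how the two closed-form corrections differ in practice, the sketch below prints both thresholds at a family-wise level of 0.05 for the number of pairs among $n$ securities. The helper names `bonferroni` and `sidak_threshold` are illustrative and are separate from the `sidak` function used by the notebook.

```python
def bonferroni(fwer, num_comps):
    """Per-comparison level under the Bonferroni correction."""
    return fwer / num_comps

def sidak_threshold(fwer, num_comps):
    """Per-comparison level under the Sidak correction (assumes independence)."""
    return 1.0 - (1.0 - fwer) ** (1.0 / num_comps)

for n in (4, 20, 100):
    k = n * (n - 1) // 2  # number of unordered pairs
    print(n, k, bonferroni(0.05, k), sidak_threshold(0.05, k))
```

The Šidák threshold is never smaller than the Bonferroni threshold, but the gap is small at this family-wise level.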

In [16]:
import math
import numpy as np
from datetime import date, timedelta
from statsmodels.tsa.stattools import coint
from quantopian.pipeline import CustomFactor, Pipeline
from quantopian.pipeline.factors import SimpleMovingAverage, AverageDollarVolume
from quantopian.pipeline.classifiers.morningstar import Sector
from quantopian.pipeline.data import morningstar
from quantopian.pipeline.data.builtin import USEquityPricing
from quantopian.research import run_pipeline

In [46]:
# This function takes a dataframe of prices (one column per security, one row per date) and returns a list of
# the pairs that are cointegrated at a Sidak-corrected 5% family-wise significance level.

def find_cointegrated_pairs(data):
    # Drop duplicated columns (securities with identical price series) and set up necessary variables
    data = data.T.drop_duplicates().T
    n = data.shape[1]
    pvalue_matrix = np.zeros((n, n))
    keys = data.keys()
    pairs = []

    # Fill the upper triangle of the matrix with the p-value of each pairwise test
    for i in range(n):
        for j in range(i+1, n):
            result = coint(data[keys[i]], data[keys[j]])  # (t-statistic, p-value, critical values)
            pvalue_matrix[i, j] = result[1]
    
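    # Find cointegrated pairs of securities at the Sidak-corrected significance level
    alpha = sidak(0.05, n*(n-1)/2)
    for i in range(n):
        for j in range(i+1, n):
            # NOTE: check1 and check2 are generator objects, which are always truthy, so they do not
            # filter anything; only the alpha threshold decides which pairs are kept. Wrapping them in
            # all(...) would enforce them, but check2 would then also need to skip k = j, which is the
            # pair under test.
            check1 = (pvalue_matrix[k, j] >= 0.5 for k in range(i))
            check2 = (pvalue_matrix[i, k] >= 0.5 for k in range(j, n))
            #check3 = (not math.isnan(float(pvalue_matrix[k, j])) for k in range(i))
            #check4 = (not math.isnan(float(pvalue_matrix[i, k])) for k in range(j, n))
            if (pvalue_matrix[i, j] <= alpha) and check1 and check2: #and check3 and check4:
                pairs.append((keys[i].symbol, keys[j].symbol))

    return pairs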

def sidak(fwer, num_comps):
    return np.float128(1-(1-fwer)**(1.0/num_comps))
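
One Python detail is worth checking against the screening loop in `find_cointegrated_pairs`: a generator expression such as `(p >= 0.5 for p in row)` is an object, and objects are truthy regardless of what they would yield, so using one directly in an `if` condition does not test its elements. Wrapping it in `all(...)` does. The p-values below are made up purely for illustration.

```python
row = [0.9, 0.01, 0.7]  # hypothetical p-values, for illustration only

as_generator = (p >= 0.5 for p in row)
print(bool(as_generator))            # True: the generator object itself is truthy
print(all(p >= 0.5 for p in row))    # False: 0.01 fails the condition
```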

In [37]:
# Interesting cointegrated pairs to keep track of!
foo = get_pricing(['CSUN', 'ASTI', 'ABGB', 'FSLR'], '01-01-2014', '01-01-2015', fields='price')
find_cointegrated_pairs(foo)

[(u'ABGB', u'FSLR')]

In [19]:
# Define and instantiate all necessary factors

class Market_Cap(CustomFactor):
    inputs = [morningstar.valuation.market_cap]
    window_length = 1
    def compute(self, today, assets, out, inputs):
        out[:] = inputs

class Industry_Group(CustomFactor):
    inputs = [morningstar.asset_classification.morningstar_industry_group_code]
    window_length = 1
    def compute(self, today, assets, out, inputs):
        out[:] = inputs

avg_close = SimpleMovingAverage(inputs=[USEquityPricing.close], window_length=20)
avg_vol = AverageDollarVolume(window_length=20)
sector = Sector()
group = Industry_Group()
market_cap = Market_Cap()

In [20]:
# The Pipeline filters the universe by group code, and applies minimum acceptance requirements (enumerated below)

def make_pipeline(group_code):
    sector_filter = sector.notnull() # No stocks in misc. sector
    penny_stock_filter = (avg_close > 5.0) # No stocks that trade below $5
    volume_filter = (avg_vol > 750000) # No companies that have a dollar volume of less than $0.75m
    small_cap_filter = (market_cap >= 300000000) # No companies that are valued at less than $300m
    group_filter = group.eq(group_code) # No companies that are not in the industry group under consideration
    
    return Pipeline(
        columns = {'industry_group':group},
        screen = (sector_filter & penny_stock_filter & volume_filter & small_cap_filter & group_filter)
    )

In [6]:
# Morningstar industry group codes: for mappings, see https://www.quantopian.com/help/fundamentals#industry-sector

group_codes = [10101, 10102, 10103, 10104, 10105, 10106, 10107,
               10208, 10209, 10210, 10211, 10212, 10213, 10214, 10215, 10216, 10217, 10218,
               10319, 10320, 10321, 10322, 10323, 10324, 10325, 10326,
               10427, 10428,
               20529, 20530, 20531, 20532, 20533, 20534,
               20635, 20636, 20637, 20638, 20639, 20640, 20641, 20642,
               20743, 20744,
               30845,
               30946, 30947, 30948, 30949, 30950, 30951,
               31052, 31053, 31054, 31055, 31056, 31057, 31058, 31059, 31060, 31061, 31062, 31063, 31064,
               31165, 31166, 31167, 31168, 31169]

In [41]:
# This code goes through the accepted stocks in each industry group and finds any pairs that are cointegrated
# over calendar year 2014

pairs = []
start = '01-01-2014'
end = '01-01-2015'

for group_code in group_codes:
    pipe_output = run_pipeline(make_pipeline(group_code), end, end)
    # The pipeline output is indexed by (date, asset); collect each asset's ticker symbol
    symbols = [idx[1].symbol for idx in pipe_output.index.values]
    if symbols:
        prices = get_pricing(symbols, start, end, fields='price')
        prices = prices.dropna(axis=1)  # dropna returns a new frame, so keep the result
        if prices.shape[1] >= 2:  # a pair needs at least two securities with complete data
            pairs = pairs + find_cointegrated_pairs(prices)

pairs

[(u'VIAB', u'VIA'),
 (u'RYL', u'BZH'),
 (u'LEG', u'SCSS'),
 (u'WTW', u'STON'),
 (u'RENX', u'RELX'),
 (u'VKI', u'MMD'),
 (u'NXZ', u'MAIN'),
 (u'NXZ', u'MMD'),
 (u'EIM', u'APO'),
 (u'TMP', u'UBNK'),
 (u'UBNK', u'WAL'),
 (u'UBNK', u'HOMB'),
 (u'UBNK', u'TREE'),
 (u'UBNK', u'BSBR'),
 (u'UBNK', u'STBZ'),
 (u'UBNK', u'FRC'),
 (u'UBNK', u'WD'),
 (u'UBNK', u'BSMX'),
 (u'ANAT', u'BRK_A'),
 (u'ANAT', u'HMN'),
 (u'ANAT', u'ORI'),
 (u'ANAT', u'KMPR'),
 (u'ANAT', u'BRK_B'),
 (u'ANAT', u'HIG'),
 (u'ANAT', u'ING'),
 (u'ANAT', u'SLF'),
 (u'ANAT', u'AIZ'),
 (u'ANAT', u'GNW'),
 (u'ANAT', u'ESGR'),
 (u'ANAT', u'GTS'),
 (u'ANAT', u'AV'),
 (u'BRK_A', u'BRK_B'),
 (u'GLPI', u'CTT'),
 (u'BF_A', u'BF_B'),
 (u'NBIX', u'SCMP'),
 (u'MCK', u'ABC'),
 (u'LVLT', u'SJR'),
 (u'ZNH', u'UAL'),
 (u'TAL', u'HEES'),
 (u'TAL', u'AYR'),
 (u'TAL', u'HRI'),
 (u'TAL', u'CAI'),
 (u'TAL', u'FLY'),
 (u'TAL', u'TGH'),
 (u'TAL', u'MG'),
 (u'TAL', u'AL'),
 (u'TAL', u'ADT'),
 (u'TAL', u'ALLE'),
 (u'AZZ', u'ZBRA'),
 (u'AZZ', u'HON'),
 (

In [45]:
# Re-run the search for a single industry group to inspect its pairs in isolation
pairs = []
start = '01-01-2014'
end = '01-01-2015'

i = 19
pipe_output = run_pipeline(make_pipeline(group_codes[i]), end, end)
symbols = [idx[1].symbol for idx in pipe_output.index.values]
if symbols:
    prices = get_pricing(symbols, start, end, fields='price')
    prices = prices.dropna(axis=1)  # dropna returns a new frame, so keep the result
    if prices.shape[1] >= 2:  # a pair needs at least two securities with complete data
        pairs = pairs + find_cointegrated_pairs(prices)

pairs

[(u'TMP', u'UBNK'),
 (u'UBNK', u'WAL'),
 (u'UBNK', u'HOMB'),
 (u'UBNK', u'TREE'),
 (u'UBNK', u'BSBR'),
 (u'UBNK', u'STBZ'),
 (u'UBNK', u'FRC'),
 (u'UBNK', u'WD'),
 (u'UBNK', u'BSMX')]