# Statistical Arbitrage with Cointegration

## Pairs Trading & Statistical Arbitrage

Statistical arbitrage refers to strategies that employ some statistical model or method to take
advantage of what appears to be relative mispricing of assets, while maintaining a level of
market neutrality.

Pairs trading is a conceptually straightforward strategy that has been employed by algorithmic traders since at least the mid-eighties ([Gatev, Goetzmann, and Rouwenhorst 2006](http://www-stat.wharton.upenn.edu/~steele/Courses/434/434Context/PairsTrading/PairsTradingGGR.pdf)). The goal is to find two assets whose prices have historically moved together, track the spread (the difference between their prices), and, once the spread widens, buy the
loser that has dropped below the common trend and short the winner. If the relationship persists, the long and/or the short leg will deliver profits as prices converge and the positions are closed.

This approach extends to a multivariate context by forming baskets from multiple securities and trading one asset against a basket, or two baskets against each other.

## Pairs Trading in Practice

In practice, the strategy requires two steps:

1. **Formation phase**: Identify securities that have a long-term mean-reverting relationship. Ideally, the spread should have a high variance to allow for frequent profitable trades while reliably reverting to the common trend.
2. **Trading phase**: Trigger entry and exit trading rules as price movements cause the spread to diverge and converge.

Several approaches to the formation and trading phases have emerged from increasingly active research in this area, across multiple asset classes, over the last several years. The book outlines the key differences between them; the notebook dives into an example application.

## Imports & Settings

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from collections import Counter

from time import time
from pathlib import Path

import numpy as np
import pandas as pd

from pykalman import KalmanFilter
from statsmodels.tsa.stattools import coint
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.tsa.api import VAR

import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
idx = pd.IndexSlice
sns.set_style('whitegrid')

In [4]:
def format_time(t):
    m_, s = divmod(t, 60)
    h, m = divmod(m_, 60)
    return f'{h:>02.0f}:{m:>02.0f}:{s:>02.0f}'

### Johansen Test Critical Values

In [5]:
critical_values = {0: {.9: 13.4294, .95: 15.4943, .99: 19.9349},
                   1: {.9: 2.7055, .95: 3.8415, .99: 6.6349}}

In [6]:
trace0_cv = critical_values[0][.95] # critical value for 0 cointegration relationships
trace1_cv = critical_values[1][.95] # critical value for 1 cointegration relationship

## Load Data

In [7]:
DATA_PATH = Path('..', 'data') 
STORE = DATA_PATH / 'assets.h5'

### Get backtest prices

Combine OHLCV prices for relevant stock and ETF tickers.

In [8]:
def get_backtest_prices():
    with pd.HDFStore('data.h5') as store:
        tickers = store['tickers']

    with pd.HDFStore(STORE) as store:
        prices = (pd.concat([
            store['stooq/us/nyse/stocks/prices'],
            store['stooq/us/nyse/etfs/prices'],
            store['stooq/us/nasdaq/etfs/prices'],
            store['stooq/us/nasdaq/stocks/prices']])
                  .sort_index()
                  .loc[idx[tickers.index, '2016':'2019'], :])
    print(prices.info(show_counts=True))
    prices.to_hdf('backtest.h5', 'prices')
    tickers.to_hdf('backtest.h5', 'tickers')

In [9]:
get_backtest_prices()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 305824 entries, ('AA.US', Timestamp('2016-01-04 00:00:00')) to ('WYNN.US', Timestamp('2019-12-31 00:00:00'))
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   open    305824 non-null  float64
 1   high    305824 non-null  float64
 2   low     305824 non-null  float64
 3   close   305824 non-null  float64
 4   volume  305824 non-null  float64
dtypes: float64(5)
memory usage: 13.2+ MB
None


### Load Stock Prices

In [10]:
# see notebook 05_cointegration_tests
stocks = pd.read_hdf('data.h5', 'stocks/close').loc['2015':]
stocks.info()
stocks

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1258 entries, 2015-01-02 to 2019-12-31
Columns: 172 entries, AAPL.US to AEP.US
dtypes: float64(172)
memory usage: 1.7 MB


date,AAPL.US,AMZN.US,MSFT.US,BAC.US,GOOGL.US,NFLX.US,C.US,JPM.US,XOM.US,INTC.US,...,CLF.US,FITB.US,GEN.US,GAP.US,ADSK.US,FSLR.US,ADI.US,PCG.US,NVS.US,AEP.US
2015-01-02,24.3203,15.4260,40.1536,15.3668,26.3819,49.849,45.0718,50.3059,63.7264,28.9303,...,6.6967,14.5194,10.27600,31.2373,59.53,44.545,46.7245,48.772,73.049,46.3700
2015-01-05,23.6361,15.1095,39.7826,14.9238,25.8791,47.311,43.6485,48.7429,61.9827,28.6040,...,6.2108,14.0833,10.13220,31.4391,58.66,41.830,45.8705,49.594,73.210,45.6898
2015-01-06,23.6381,14.7645,39.1972,14.4711,25.2406,46.501,42.1145,47.4804,61.6540,28.0687,...,6.0109,13.5169,9.97376,31.0230,57.50,40.860,44.7955,49.519,72.591,45.9386
2015-01-07,23.9734,14.9210,39.6975,14.5438,25.1663,46.743,42.5031,47.5512,62.2793,28.6587,...,6.5157,13.6660,10.06520,32.5231,57.38,41.750,45.2680,49.913,72.907,46.5922
2015-01-08,24.8927,15.0230,40.8676,14.8413,25.2540,47.779,43.1443,48.6122,63.3131,29.1950,...,6.9825,13.9755,10.24930,32.1495,58.80,43.630,46.0677,50.543,75.442,46.9913
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-12-24,68.8225,89.4605,150.5140,32.8869,66.9786,333.200,70.4716,125.5370,58.7539,54.3557,...,8.2083,25.3039,13.91340,15.6452,183.91,57.980,113.2370,10.950,87.976,84.8095
2019-12-26,70.1874,93.4385,151.7760,33.1679,67.8774,332.630,71.5816,126.8730,58.8474,54.7318,...,8.1692,25.4341,13.78850,15.9020,184.24,58.660,113.2470,10.860,87.966,84.8459
2019-12-27,70.1597,93.4900,152.0480,33.0077,67.4874,329.090,71.4387,126.9610,58.6451,54.9702,...,8.0324,25.2543,13.87010,15.8131,185.38,56.410,112.9800,10.440,88.486,85.1279
2019-12-30,70.5778,92.3445,150.7200,32.8212,66.7435,323.310,71.2947,126.4890,58.3018,54.5501,...,8.1399,25.1636,13.80520,15.8045,183.30,56.260,112.4090,10.800,87.725,84.9815


### Load ETF Data

In [11]:
# see notebook 05_cointegration_tests
etfs = pd.read_hdf('data.h5', 'etfs/close').loc['2015':]
etfs.info()
etfs

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1258 entries, 2015-01-02 to 2019-12-31
Columns: 132 entries, SPY.US to VNM.US
dtypes: float64(132)
memory usage: 1.3 MB


date,SPY.US,EEM.US,GLD.US,EFA.US,XLF.US,XLE.US,TLT.US,GDX.US,EWZ.US,HYG.US,...,EPU.US,WIP.US,PJP.US,INDY.US,XPH.US,STPZ.US,BRF.US,IDX.US,EWN.US,VNM.US
2015-01-02,172.069,31.2787,114.08,49.2607,17.2861,58.0695,97.6041,17.6428,28.5620,65.7660,...,26.3338,48.3205,59.5551,27.644,47.7751,46.5885,16.212,21.057,20.6083,17.332
2015-01-05,168.957,30.7209,115.80,48.0975,16.9218,55.6700,99.0879,18.1069,27.5829,65.1567,...,25.9352,48.2820,59.1721,27.327,47.4247,46.5290,15.602,20.633,19.9071,17.004
2015-01-06,167.378,30.5919,117.12,47.5535,16.6656,54.8522,100.9170,19.0656,28.0758,64.9076,...,26.1610,48.1745,58.8240,26.463,46.9928,46.4249,15.451,20.633,19.6988,17.242
2015-01-07,169.435,31.2549,116.43,48.0818,16.8398,54.9686,100.6960,18.7126,28.8371,65.3095,...,26.0954,47.9929,60.1818,26.888,48.1821,46.4834,15.720,20.977,19.8706,17.195
2015-01-08,172.469,31.7878,115.94,48.7324,17.0901,56.2021,99.3622,18.4333,29.2975,65.8027,...,26.3936,48.0728,61.4590,27.718,49.2462,46.5637,15.943,21.074,20.1767,17.213
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2019-12-24,296.904,39.9678,141.27,65.1116,28.9749,52.2111,118.6900,27.5556,42.7073,83.1442,...,33.4480,52.1692,64.0911,36.100,45.3280,50.2998,25.342,22.084,32.4870,15.688
2019-12-26,298.482,40.2535,142.38,65.3943,29.1333,52.1945,118.9800,27.9588,43.6511,83.2684,...,33.8494,52.3300,63.6822,35.914,45.1433,50.3583,25.867,22.172,32.6914,15.708
2019-12-27,298.405,40.4156,142.33,65.5068,29.0591,51.9753,119.1090,27.7572,43.3034,83.2316,...,33.7954,52.5876,63.3112,35.942,44.6361,50.3583,26.039,22.129,32.7585,15.708
2019-12-30,296.759,40.1466,142.63,65.0167,28.9749,51.8110,118.6800,28.3540,43.2939,83.1730,...,33.8407,52.5116,62.8228,35.989,44.0930,50.4158,26.115,21.909,32.6519,15.767


### Load Ticker Dictionary

In [12]:
names = pd.read_hdf('data.h5', 'tickers').to_dict()

In [13]:
pd.Series(names).count()

np.int64(304)

## Precompute Cointegration

In [14]:
def test_cointegration(etfs, stocks, test_end, lookback=2):
    start = time()
    results = []
    test_start = test_end - pd.DateOffset(years=lookback) + pd.DateOffset(days=1)
    
    print(f"Test Start Date: {test_start.strftime('%Y-%m-%d')}")
    print(f"Test End Date  : {test_end.strftime('%Y-%m-%d')}")
    
    etf_tickers = etfs.columns.tolist()
    etf_data = etfs.loc[str(test_start):str(test_end)]

    stock_tickers = stocks.columns.tolist()
    stock_data = stocks.loc[str(test_start):str(test_end)]
    
    print(f"Number of samples (days) in range: {len(etf_data)}")
    
    n = len(etf_tickers) * len(stock_tickers)
    j = 0
    for i, s1 in enumerate(etf_tickers, 1):
        for s2 in stock_tickers:
            j += 1
            if j % 1000 == 0:
                print(f'\t{j:5,.0f} ({j/n:3.1%}) | {time() - start:.2f}')
            df = etf_data.loc[:, [s1]].dropna().join(stock_data.loc[:, [s2]].dropna(), how='inner')
            with warnings.catch_warnings():
                warnings.simplefilter('ignore')
                var = VAR(df)
                lags = var.select_order()
                result = [test_end, s1, s2]
                order = lags.selected_orders['aic']
                result += [coint(df[s1], df[s2], trend='c')[1], coint(df[s2], df[s1], trend='c')[1]]

            cj = coint_johansen(df, det_order=0, k_ar_diff=order)
            result += (list(cj.lr1) + list(cj.lr2) + list(cj.evec[:, cj.ind[0]]))
            results.append(result)
    return results

### Define Test Periods

In [15]:
# 'QE' (quarter end) replaces the deprecated 'Q' alias in pandas >= 2.2
dates = stocks.loc['2016-12':'2019-6'].resample('QE').last().index
dates

DatetimeIndex(['2016-12-31', '2017-03-31', '2017-06-30', '2017-09-30',
               '2017-12-31', '2018-03-31', '2018-06-30', '2018-09-30',
               '2018-12-31', '2019-03-31', '2019-06-30'],
              dtype='datetime64[ns]', name='date', freq='QE-DEC')

### Run Tests

In [None]:
# # NON PARALLEL VERSION
# test_results = []
# columns = ['test_end', 's1', 's2', 'eg1', 'eg2',
#            'trace0', 'trace1', 'eig0', 'eig1', 'w1', 'w2']

# for test_end in dates:
#     print(test_end)
#     result = test_cointegration(etfs, stocks, test_end=test_end)
#     test_results.append(pd.DataFrame(result, columns=columns))

# pd.concat(test_results).to_hdf('backtest.h5', 'cointegration_test')

2016-12-31 00:00:00
Test Start Date: 2015-01-01
Test End Date  : 2016-12-31
Number of samples (days) in range: 504
	1,000 (4.4%) | 21.12
	2,000 (8.8%) | 42.27
	3,000 (13.2%) | 63.61
	4,000 (17.6%) | 84.77
	5,000 (22.0%) | 106.33
	6,000 (26.4%) | 128.35


KeyboardInterrupt: 

The loop over `dates` can be parallelized with `joblib.Parallel` and `delayed` because the computations for each `test_end` are independent. This speeds things up when multiple CPU cores are available, although the gain is limited when the number of dates is small (11 here) relative to the overhead of parallelization. Most of the work happens inside `test_cointegration` (the nested loops over tickers); the parallel version further below only distributes the outer loop.

### Key Notes:
- `n_jobs=-1` uses all available CPU cores. Set a specific number (e.g., `n_jobs=4`) to limit it.
- `verbose=10` prints progress information from joblib (e.g., which jobs have completed). It takes the place of the `print(test_end)` call in the sequential loop, which the parallel version does not include.
- The inner `print` statements in `test_cointegration` (e.g., every 1000 iterations) may appear interleaved or out of order in the console due to parallel execution, but this does not affect the results.
- The `etfs` and `stocks` DataFrames are only read inside `test_cointegration`, so the jobs do not need to coordinate access to shared state.
- If the speedup is insufficient, consider parallelizing the nested loops inside `test_cointegration` instead (i.e., over pairs of tickers), since that is where most of the time is spent. This works similarly with `Parallel` but requires refactoring to generate the list of (s1, s2) pairs upfront.
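The idea of generating the list of (s1, s2) pairs upfront and mapping a per-pair function over it can be sketched as follows. This is only an illustration under stated assumptions: `run_pair_tests` is a hypothetical stand-in for the per-pair cointegration tests, and the standard library's `ThreadPoolExecutor` is swapped in for `joblib.Parallel` to keep the example free of dependencies. Threads will not speed up CPU-bound tests; a process pool or joblib would be needed for that.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_pair_tests(pair):
    """Hypothetical placeholder for the per-pair cointegration tests."""
    s1, s2 = pair
    return s1, s2, len(s1) + len(s2)  # dummy value instead of test statistics


etf_tickers = ['SPY.US', 'GLD.US']
stock_tickers = ['AAPL.US', 'XOM.US', 'JPM.US']

# Generate the full list of (ETF, stock) pairs upfront ...
pairs = list(product(etf_tickers, stock_tickers))

# ... then map the per-pair function over it; map() preserves the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    pair_results = list(pool.map(run_pair_tests, pairs))

print(len(pair_results), pair_results[0])
```

Because the pair list is flat, the same structure works for any executor that offers a `map`-like interface.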


11m45s

In [None]:
# # PARALLEL VERSION
# from joblib import Parallel, delayed

# test_results = []
# columns = ['test_end', 's1', 's2', 'eg1', 'eg2',
#            'trace0', 'trace1', 'eig0', 'eig1', 'w1', 'w2']

# # Parallelize the loop over dates
# results = Parallel(n_jobs=-1, verbose=10)(
#     delayed(test_cointegration)(etfs, stocks, test_end=test_end) for test_end in dates
# )

# # Convert each result to a DataFrame
# test_dfs = [pd.DataFrame(result, columns=columns) for result in results]

# # Concatenate and save
# pd.concat(test_dfs).to_hdf('backtest.h5', 'cointegration_test')

#### Reload Test Results

#### Column Definitions for Cointegration Tests

The list `columns = ['test_end', 's1', 's2', 'eg1', 'eg2', 'trace0', 'trace1', 'eig0', 'eig1', 'w1', 'w2']` defines columns for a DataFrame storing results from cointegration tests on ETF-stock pairs, run in `test_cointegration` over a 2-year lookback period ending at `test_end`. Cointegration indicates a stable long-term relationship between non-stationary time series, useful for pairs trading. Tests include Engle-Granger (EG) and Johansen methods.

-   **test_end**: End date of the test period (quarterly, 2016-12 to 2019-06). Input to `test_cointegration`.
-   **s1**: ETF ticker (e.g., 'SPY.US'). From `etfs` DataFrame columns.
-   **s2**: Stock ticker (e.g., 'AAPL.US'). From `stocks` DataFrame columns.
-   **eg1**: P-value from EG test with s1 as dependent (regressed on s2). From `coint(df[s1], df[s2], trend='c')[1]`. Low p-value (<0.05) suggests cointegration.
-   **eg2**: P-value from EG test with s2 as dependent (regressed on s1). From `coint(df[s2], df[s1], trend='c')[1]`. Checks reverse direction.
-   **trace0**: Johansen trace statistic for H0: r=0 (no cointegration). From `coint_johansen(df, det_order=0, k_ar_diff=order).lr1[0]`. If > 15.4943 (95% critical value), reject H0, indicating cointegration.
-   **trace1**: Johansen trace statistic for H0: r≤1. From `coint_johansen(...).lr1[1]`. With two series, rejecting this hypothesis means r=2 (full rank), which implies that both series are already stationary in levels rather than cointegrated. A value < 3.8415 (95% critical value) therefore supports at most one cointegrating relation.
-   **eig0**: Johansen maximum-eigenvalue statistic for H0: r=0 against r=1. From `coint_johansen(...).lr2[0]`. High values suggest cointegration.
-   **eig1**: Johansen maximum-eigenvalue statistic for H0: r=1 against r=2. From `coint_johansen(...).lr2[1]`. Usually insignificant for non-stationary price series.
-   **w1**: Cointegrating vector weight for s1. From `coint_johansen(...).evec[:, cj.ind[0]][0]`. Part of β vector for stationary combination.
-   **w2**: Cointegrating vector weight for s2. From `coint_johansen(...).evec[:, cj.ind[0]][1]`. With `w1`, forms hedge ratio for trading.

### Notes on `trace0` and `trace1`

For a pair (two series), the cointegration rank r can be 0, 1, or 2, but r=2 would mean that both series are stationary in levels, which contradicts the premise of non-stationary prices. `trace0` tests H0: r=0 and `trace1` tests H0: r≤1. A `trace0` above its critical value combined with a `trace1` below its critical value suggests exactly one cointegrating relation. The test results are saved to 'backtest.h5' for the analysis that follows.
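The sequential reading of the two trace statistics can be written as a small decision rule. This is a minimal sketch: `johansen_rank` is a hypothetical helper, the critical values are the 95% entries of the `critical_values` dictionary defined earlier, and the statistics passed in are made-up numbers for illustration.

```python
# 95% critical values for the trace statistic in a two-variable system,
# copied from the `critical_values` dictionary defined earlier.
TRACE0_CV = 15.4943  # H0: r = 0
TRACE1_CV = 3.8415   # H0: r <= 1


def johansen_rank(trace0, trace1):
    """Sequential trace test: return the estimated cointegration rank."""
    if trace0 <= TRACE0_CV:
        return 0  # cannot reject r = 0: no cointegration
    if trace1 <= TRACE1_CV:
        return 1  # reject r = 0 but not r <= 1: one cointegrating relation
    return 2      # both rejected: the levels look stationary


print(johansen_rank(10.2, 1.1))  # 0
print(johansen_rank(21.7, 2.3))  # 1
print(johansen_rank(25.0, 5.9))  # 2
```

Only a rank of 1 corresponds to the `joh_sig` flag computed in the next section.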

In [None]:
test_results = pd.read_hdf('backtest.h5', 'cointegration_test')
test_results.info()
test_results

## Identify Cointegrated Pairs

### Significant Johansen Trace Statistic

In [None]:
test_results['joh_sig'] = ((test_results.trace0 > trace0_cv) &
                           (test_results.trace1 < trace1_cv))

In [None]:
# test_results['joh_sig'] = ((test_results.trace0 > trace0_cv) &
#                            (test_results.trace1 > trace1_cv))

In [None]:
test_results.joh_sig.value_counts(normalize=True)

### Significant Engle Granger Test

In [None]:
test_results['eg'] = test_results[['eg1', 'eg2']].min(axis=1)
test_results['s1_dep'] = test_results.eg1 < test_results.eg2
test_results['eg_sig'] = (test_results.eg < .05)

In [None]:
test_results.eg_sig.value_counts(normalize=True)

### Comparison Engle-Granger vs Johansen

In [None]:
test_results['coint'] = (test_results.eg_sig & test_results.joh_sig)
test_results.coint.value_counts(normalize=True)

In [None]:
test_results = test_results.drop(['eg1', 'eg2', 'trace0', 'trace1', 'eig0', 'eig1'], axis=1)
test_results.info()

In [None]:
test_results

### Comparison

### Data Being Plotted
- `test_results.groupby('test_end').coint.mean()`: Groups by quarterly `test_end` dates and computes the mean of the boolean `coint` column, yielding the proportion of ETF-stock pairs cointegrated per both Engle-Granger and Johansen tests.
  - Example: For 100 pairs with 5 cointegrated, the proportion is 0.05 (5%).
  - Computed over thousands of pairs (ETF-stock Cartesian product).
- `.to_frame('Proportion')`: Converts the result to a DataFrame whose column label matches its content: proportions (e.g., 0.0–0.1), not counts (use `.sum()` for counts).

### Plot Type
- `.plot()`: Generates a line plot of proportions over time.
  - X-axis: Quarterly `test_end` dates (~2016-12 to 2019-06).
  - Y-axis: Proportion of cointegrated pairs (legend label 'Proportion').

### Additional Element
- `ax.axhline(.05, lw=1, ls='--', c='k')`: Horizontal dashed black line at y=0.05.
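  - Benchmark for the 5% significance level: the share of false positives expected from a single test if no pair were cointegrated. Because `coint` requires both tests to be significant, the false-positive share under that null is at most 5%.
  - A line above 0.05 suggests genuine relationships (e.g., driven by sector or market factors); a line at or below 0.05 is consistent with noise or weak evidence.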

### Location in Code
- Under "### Comparison", post-`coint` computation, pre-candidate selection.
- Diagnostic that tracks the fraction of cointegrated pairs per period; complements the overall proportion from `test_results.coint.value_counts(normalize=True)`.

### Graph Interpretation
- **Expected Plot**: Line fluctuates ~0.01–0.10, often near/above 0.05; spikes/trends reflect market stability (more co-movement) vs. volatility.
- **Strategic Relevance**: Assesses viable pairs for trading; low proportions (~0.05) risk false positives and poor backtests; precedes signal generation (e.g., z-score >2 entries).
- **Issues/Insights**:
  - **Multiple Testing**: High pair volume inflates false positives; >0.05 is positive but unadjusted (no Bonferroni here).
  - **Time Variation**: Reveals stability (flat) or decay (downtrend), common in markets.
  - **Comparisons**: Differs from count plots (e.g., `candidates.groupby('test_end').size()`) by benchmarking vs. chance.

This graph diagnostically evaluates cointegration stability and significance in the pairs trading pipeline, prior to simulation.
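The multiple-testing caveat can be made concrete with a back-of-the-envelope calculation. The sketch below uses the column counts reported by `etfs.info()` and `stocks.info()` above (132 ETFs and 172 stocks); `is_significant` is a hypothetical helper, the p-value passed to it is made up, and Bonferroni is just one possible correction that the notebook itself does not apply.

```python
n_etfs, n_stocks = 132, 172    # column counts reported by etfs.info() / stocks.info()
n_pairs = n_etfs * n_stocks    # number of ETF-stock pairs tested per quarter
alpha = 0.05

# If no pair were cointegrated, a single test at the 5% level would still
# flag about alpha * n_pairs pairs by chance.
expected_false_positives = alpha * n_pairs

# Bonferroni: divide the significance level by the number of tests.
bonferroni_alpha = alpha / n_pairs


def is_significant(p_value, adjusted=True):
    """Compare a p-value against the adjusted or unadjusted threshold."""
    return p_value < (bonferroni_alpha if adjusted else alpha)


print(n_pairs, round(expected_false_positives, 1), bonferroni_alpha)
print(is_significant(1e-4), is_significant(1e-4, adjusted=False))
```

The adjusted threshold is very strict for this many tests, which is one reason to treat the 5% line as a rough benchmark rather than a formal test.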

In [None]:
ax = test_results.groupby('test_end').coint.mean().to_frame('Proportion').plot()
ax.axhline(.05, lw=1, ls='--', c='k');

### Select Candidate Pairs

In [None]:
def select_candidate_pairs(data):
    # copy the selection so the new columns are not assigned to a slice of `data`
    candidates = data[data.joh_sig | data.eg_sig].copy()
    candidates['y'] = candidates.apply(lambda x: x.s1 if x.s1_dep else x.s2, axis=1)
    candidates['x'] = candidates.apply(lambda x: x.s2 if x.s1_dep else x.s1, axis=1)
    return candidates.drop(['s1_dep', 's1', 's2'], axis=1)

In [None]:
candidates = select_candidate_pairs(test_results)

In [None]:
candidates.to_hdf('backtest.h5', 'candidates')

In [None]:
candidates = pd.read_hdf('backtest.h5', 'candidates')
candidates.info()
candidates

#### Candidates over Time

In [None]:
candidates.groupby('test_end').size().plot(figsize=(8, 5))

#### Most Common Pairs 

In [None]:
with pd.HDFStore('data.h5') as store:
    print(store.info())
    tickers = store['tickers']

In [None]:
with pd.HDFStore('backtest.h5') as store:
    print(store.info())

The cell below uses a `Counter` to tally how often each unique pair appears among the candidates for which both `joh_sig` (Johansen significance) and `eg_sig` (Engle-Granger significance) are True. Each pair is stored as a sorted tuple `(ticker1, ticker2)` with `ticker1 < ticker2` lexicographically, so the count does not depend on which ticker served as the dependent variable.

Each count represents the number of test periods (out of the 11 quarterly periods in the provided `DatetimeIndex`) in which that pair demonstrated cointegration according to both tests.
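The effect of sorting each pair before counting can be seen on a toy example. The `(y, x)` tuples below are hypothetical and only illustrate that the same two tickers are counted together regardless of their order.

```python
from collections import Counter

# Hypothetical (y, x) candidates from several test periods; the same pair can
# appear with either ticker as the dependent variable.
observed = [('SPY.US', 'XOM.US'), ('XOM.US', 'SPY.US'),
            ('GLD.US', 'AAPL.US'), ('SPY.US', 'XOM.US')]

# Sorting each pair gives (a, b) and (b, a) the same dictionary key.
pair_counts = Counter(tuple(sorted(pair)) for pair in observed)

print(pair_counts.most_common(1))  # [(('SPY.US', 'XOM.US'), 3)]
```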

In [None]:
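counter = Counter()
# filter once for pairs that are significant under both tests
both_sig = candidates[candidates.joh_sig & candidates.eg_sig]
for s1, s2 in zip(both_sig.y, both_sig.x):
    # store each pair in sorted order so (a, b) and (b, a) share one key
    if s1 > s2:
        counter[(s2, s1)] += 1
    else:
        counter[(s1, s2)] += 1
counter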

In [None]:
most_common_pairs = pd.DataFrame(counter.most_common(10))
most_common_pairs = pd.DataFrame(most_common_pairs[0].values.tolist(), columns=['s1', 's2'])
most_common_pairs

In [None]:
with pd.HDFStore('backtest.h5') as store:
    prices = store['prices'].close.unstack('ticker').ffill(limit=5)
    tickers = store['tickers'].to_dict()

In [None]:
cnt = pd.Series(counter).reset_index()
cnt.columns = ['s1', 's2', 'n']
cnt['name1'] = cnt.s1.map(tickers)
cnt['name2'] = cnt.s2.map(tickers)
cnt.nlargest(10, columns='n')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

for i in range(len(most_common_pairs)):
    # Get the tickers for the current pair
    s1, s2 = most_common_pairs.at[i, 's1'], most_common_pairs.at[i, 's2']
    
    # Create a new figure for each pair
    fig, ax = plt.subplots(figsize=(10, 5))
    
    # Plot the price series for s1 and s2
    prices.loc[:, [s1, s2]].rename(columns=tickers).plot(
        secondary_y=tickers[s2],  # Second ticker on right y-axis
        ax=ax,
        rot=0  # Horizontal x-axis labels
    )
    
    # Customize the plot
    ax.grid(False)
    ax.set_xlabel('')  # Remove x-axis label
    ax.set_title(f'Price Series: {tickers[s1]} vs {tickers[s2]}')  # Add title with pair names
    
    # Clean up with Seaborn style
    sns.despine()
    plt.tight_layout()
    
    # Show the plot
    plt.show()

In [None]:
# fig, axes = plt.subplots(ncols=5, nrows=2, figsize=(14, 5))
# for i in [0, 1]:
#     s1, s2 = most_common_pairs.at[i, 's1'], most_common_pairs.at[i, 's2']
#     prices.loc[:, [s1, s2]].rename(columns=tickers).plot(secondary_y=tickers[s2],
#                                                          ax=axes[i],
#                                                          rot=0)
#     axes[i].grid(False)
#     axes[i].set_xlabel('')

# sns.despine()
# fig.tight_layout()

## Get Entry and Exit Dates 

### Explanation of Key Functions in the Pairs Trading Code

This code implements part of a pairs trading strategy: find two related assets (such as an ETF and a stock) whose prices usually move together and, when they temporarily diverge, bet on convergence by buying one and selling the other. The functions below prepare the data for this: smoothing noisy prices, estimating the hedge ratio that balances the two legs, estimating how quickly divergences revert, and processing all pairs efficiently. Each is explained in plain English, with examples tied to the code's context (e.g., using the ETF 'SPY.US' and the stock 'XOM.US' as a sample pair).

#### KFSmoother: Smoothing Prices to Reduce Noise
This function uses a mathematical tool called a Kalman filter to "smooth" out the daily ups and downs in asset prices, making it easier to see the true underlying trend. It's like applying a smart moving average that adapts over time, ignoring short-term wiggles caused by market noise.

- **Plain English Breakdown**:
  - Prices in financial markets are bumpy due to random events (e.g., news or trades). This function estimates a cleaner version of the price series by assuming the "true" price evolves gradually but gets observed with some error.
  - It starts with an initial guess (mean=0) and updates its estimate step by step as new price data comes in, balancing between the observed price and its prediction.
  - The result is a smoothed series that's less volatile, which helps in later steps like calculating relationships between two assets.

- **How It Works in Code**:
  - It sets up a simple Kalman filter model: The price doesn't change much from day to day (transition matrix is identity), but there's a little wiggle room (covariance=0.05 for changes, 1 for observation noise).
  - It runs the forward filter (`kf.filter`, not `kf.smooth`) on the price values, so each estimate uses only data up to that day, and returns a new series with the filtered estimates.

- **Example in Context**:
  - Imagine raw prices for 'SPY.US' (S&P 500 ETF) over a week: [200, 202, 198, 205, 199]. These jump around.
  - With the settings used here, the filtered estimates are roughly [100, 136, 154, 167, 174]: because `initial_state_mean=0`, the filter needs a burn-in period to reach the price level. After that, the estimates track the prices with less day-to-day variation.
  - In the code, `smoothed_prices = prices.apply(KFSmoother)` applies this to every ticker's column in the `prices` DataFrame (e.g., all ETFs and stocks from 2016-2019). This preprocessed data is then used for hedge ratios and spreads, which makes the strategy less sensitive to noise in the raw prices.
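The recursion behind this filter is short enough to write out by hand. The sketch below is a plain-Python scalar version, not the pykalman implementation; `local_level_filter` is a hypothetical name, the default parameters mirror the settings in `KFSmoother`, and it assumes the initial mean and variance act as the prior for the first observation.

```python
def local_level_filter(observations, q=0.05, r=1.0, mean0=0.0, var0=1.0):
    """Scalar Kalman filter for a random-walk state observed with noise."""
    mean, var = mean0, var0
    filtered = []
    for t, obs in enumerate(observations):
        if t > 0:
            var += q                 # predict: the random walk adds process noise
        gain = var / (var + r)       # Kalman gain: weight given to the new price
        mean += gain * (obs - mean)  # update the estimate with the observation
        var *= (1 - gain)            # uncertainty shrinks after the update
        filtered.append(mean)
    return filtered


estimates = local_level_filter([200, 202, 198, 205, 199])
print([round(v, 1) for v in estimates])
```

The first estimate sits halfway between the prior mean of 0 and the first price, which illustrates the burn-in caused by `initial_state_mean=0`.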

In [None]:
def KFSmoother(prices):
    """Estimate rolling mean"""
    
    kf = KalmanFilter(transition_matrices=np.eye(1),
                      observation_matrices=np.eye(1),
                      initial_state_mean=0,
                      initial_state_covariance=1,
                      observation_covariance=1,
                      transition_covariance=.05)

    state_means, _ = kf.filter(prices.values)
    return pd.Series(state_means.flatten(),
                     index=prices.index)

In [None]:
smoothed_prices = prices.apply(KFSmoother)
smoothed_prices.to_hdf('tmp.h5', 'smoothed')

In [None]:
smoothed_prices = pd.read_hdf('tmp.h5', 'smoothed')
smoothed_prices

#### KFHedgeRatio: Calculating a Dynamic Balance for Trading Pairs
This function figures out the "hedge ratio" – basically, how much of one asset you need to trade against the other to keep the pair balanced and neutral to overall market moves. It's dynamic, meaning it changes over time as market conditions shift, using another Kalman filter for adaptability.

- **Plain English Breakdown**:
  - In pairs trading, you don't just buy/sell equal amounts; one asset might be twice as volatile as the other, so you need a ratio (e.g., sell 2 units of Asset X for every 1 unit of Asset Y you buy) to cancel out common movements.
  - This function treats the relationship as a changing linear equation (Y ≈ β * X + intercept), estimating β (the ratio) and the intercept over time.
  - It returns a negative version of the estimates because the code later uses it to build the spread as Y + (negative β) * X, which is equivalent to Y - β * X.

- **How It Works in Code**:
  - It sets up a Kalman filter for a 2D state (β and intercept): They evolve slowly (small transition covariance based on delta=0.001).
  - The observation matrix uses X and a constant (for intercept), and it filters based on Y's values.
  - Output: Two arrays with one value per day, the negated slope (-β) and the negated intercept.

- **Example in Context**:
  - For pair Y='SPY.US' (dependent, price ~200) and X='XOM.US' (independent, price ~80):
    - On Day 1, it might estimate β ≈ 2.5 (meaning a $1 move in XOM is associated with a $2.50 move in SPY).
    - Hedge ratio returned: -2.5 (negative for the spread formula).
    - In the code, `KFHedgeRatio(y=smoothed SPY prices, x=smoothed XOM prices)` returns the daily -β as its first element, which is used to compute the spread (the `[:, 0]` indexing applies only to the earlier, commented-out version that returned the full state array). If β=2.5, you might short 2.5 shares of XOM per share of SPY to hedge.
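To make the time-varying regression concrete, the sketch below writes out the filter recursion with NumPy on synthetic data. It is an illustration under assumptions, not the pykalman code: `kalman_regression` is a hypothetical name, the prior and noise settings copy those of `KFHedgeRatio`, the price series are simulated, and the function returns the un-negated slope and intercept.

```python
import numpy as np


def kalman_regression(x, y, delta=1e-3, obs_var=2.0):
    """Filter a time-varying slope and intercept of y ~ slope * x + intercept."""
    state = np.zeros(2)                          # [slope, intercept]
    cov = np.ones((2, 2))                        # same prior as in KFHedgeRatio
    trans_cov = delta / (1 - delta) * np.eye(2)
    states = np.empty((len(x), 2))
    for t, (x_t, y_t) in enumerate(zip(x, y)):
        if t > 0:
            cov = cov + trans_cov                # predict step (identity transition)
        h = np.array([x_t, 1.0])                 # observation row for day t
        innovation_var = h @ cov @ h + obs_var
        gain = cov @ h / innovation_var
        state = state + gain * (y_t - h @ state)  # update with the day's price
        cov = cov - np.outer(gain, h) @ cov
        states[t] = state
    return states


rng = np.random.default_rng(42)
x = 80 + np.cumsum(rng.normal(size=500))         # synthetic random-walk "stock"
y = 2.5 * x + rng.normal(scale=0.5, size=500)    # synthetic "ETF", true slope 2.5

states = kalman_regression(x, y)
spread = y - states[:, 0] * x - states[:, 1]     # same as y + (-slope) * x + (-intercept)
print(states[-1])
```

Negating both columns of `states` gives the sign convention that `KFHedgeRatio` returns.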

In [None]:
# def KFHedgeRatio(x, y):
#     """Estimate Hedge Ratio"""
#     delta = 1e-3
#     trans_cov = delta / (1 - delta) * np.eye(2)
#     obs_mat = np.expand_dims(np.vstack([[x], [np.ones(len(x))]]).T, axis=1)

#     kf = KalmanFilter(n_dim_obs=1, n_dim_state=2,
#                       initial_state_mean=[0, 0],
#                       initial_state_covariance=np.ones((2, 2)),
#                       transition_matrices=np.eye(2),
#                       observation_matrices=obs_mat,
#                       observation_covariance=2,
#                       transition_covariance=trans_cov)

#     state_means, _ = kf.filter(y.values)
#     return -state_means

In [None]:
def KFHedgeRatio(x, y):
    """Estimate Hedge Ratio and Intercept"""
    delta = 1e-3
    trans_cov = delta / (1 - delta) * np.eye(2)
    obs_mat = np.expand_dims(np.vstack([[x], [np.ones(len(x))]]).T, axis=1)

    kf = KalmanFilter(n_dim_obs=1, n_dim_state=2,
                      initial_state_mean=[0, 0],
                      initial_state_covariance=np.ones((2, 2)),
                      transition_matrices=np.eye(2),
                      observation_matrices=obs_mat,
                      observation_covariance=2,
                      transition_covariance=trans_cov)

    state_means, _ = kf.filter(y.values)
    hedge_ratios = -state_means[:, 0]
    intercepts = -state_means[:, 1]
    return hedge_ratios, intercepts

### Estimate mean reversion half life


#### estimate_half_life: Estimating How Long Divergences Take to Fix Themselves
This function calculates the "half-life" of a price spread – the average number of days it takes for a divergence between two assets to shrink by half, assuming it mean-reverts (comes back to normal). It's a measure of how quickly the pair corrects itself, which helps decide trading windows.

- **Plain English Breakdown**:
  - Mean-reverting spreads don't snap back instantly; they take time. Half-life tells you the speed: Short (e.g., 10 days) means fast fixes (good for quick trades); long (e.g., 200 days) means slow (riskier, as things might change).
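  - It fits a simple regression model to the spread's changes, estimating the reversion speed (beta), then converts it to days using the formula $-\ln(2)/\beta$.
  - Ensures a minimum of 1 day. Note that a positive beta (no mean reversion) produces a negative half-life that this floor also maps to 1, so a value of 1 can mean "not mean-reverting" rather than "very fast".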

- **How It Works in Code**:
  - Creates lagged spread (X) and differences (Y), adds a constant for intercept.
  - Solves for beta using linear algebra (normal equation for OLS regression).
  - Computes half-life and rounds/clamps it.

- **Example in Context**:
  - Suppose the spread (SPY - β * XOM) over 2 years: Starts at 0, jumps to 4, then slowly returns (e.g., 4 → 2 in 20 days, 2 → 1 in another 20).
  - Beta might be -0.035 (negative indicates reversion), half-life ≈ 20 days (-0.693 / -0.035 ≈ 20).
  - In the code, inside `process_pair`: `half_life = estimate_half_life(pair.spread.loc[t: test_end])` uses the formation period (2 years pre-trading). This 20-day value sets the rolling window for z-scores (min(40, max_window)), helping detect tradable divergences.
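The regression and the conversion to days can be checked on a spread whose half-life is known by construction. This is a plain-Python sketch with a hypothetical helper name (`half_life_from_spread`) and a noise-free synthetic spread; it uses the same formula as `estimate_half_life` but not the same code.

```python
import math


def half_life_from_spread(spread):
    """OLS slope of the daily change on the lagged level, converted to days."""
    lagged = spread[:-1]
    change = [b - a for a, b in zip(spread[:-1], spread[1:])]
    mean_x = sum(lagged) / len(lagged)
    mean_y = sum(change) / len(change)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, change))
    var_x = sum((x - mean_x) ** 2 for x in lagged)
    beta = cov_xy / var_x                      # slope of a regression with intercept
    return beta, max(int(round(-math.log(2) / beta)), 1)


# Noise-free spread that halves every 20 days: s_t = 4 * 0.5 ** (t / 20)
spread = [4 * 0.5 ** (t / 20) for t in range(200)]
beta, half_life = half_life_from_spread(spread)
print(round(beta, 4), half_life)
```

The estimated beta is negative, as expected for a mean-reverting spread, and the rounded half-life recovers the 20 days built into the series.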

In [None]:
def estimate_half_life(spread):
    X = spread.shift().iloc[1:].to_frame().assign(const=1)
    y = spread.diff().iloc[1:]
    beta = (np.linalg.inv(X.T @ X) @ X.T @ y).iloc[0]
    halflife = int(round(-np.log(2) / beta, 0))
    return max(halflife, 1)

### Compute Spread & Bollinger Bands

#### get_spread_parallel: Processing All Pairs Efficiently in Parallel
This is the main workhorse function that loops over time periods and candidate pairs, computing everything needed for trading (hedge ratios, spreads, half-lives, z-scores) using the above helpers. It runs in parallel to speed things up, as there are thousands of pairs.

- **Plain English Breakdown**:
  - It breaks the data into quarterly "test periods" (e.g., ending Dec 2016), grabs candidate pairs for each, and for a 2-year "formation" window plus 6-month "trading" window:
    - Smooths prices, computes daily hedge ratios and spreads.
    - Estimates half-life to size a rolling window.
    - Calculates z-scores (how far the spread is from its rolling mean, in standard deviations) for spotting trades (e.g., |z| > 2 signals a divergence large enough to enter a trade).
  - Parallelizes per-pair work to handle scale (e.g., 1000+ pairs/period) quickly.

- **How It Works in Code**:
  - Outer loop: Over unique test_end dates (quarters from 2016-2019).
  - For each period: Define time windows (t=2 years back, T=6 months forward).
  - Inner function `process_pair`: For each pair (y,x), compute smoothed prices, hedge ratio, spread, half-life, rolling mean/std, z-score; output trading-period DataFrame and half-life list.
  - Uses `joblib.Parallel` to run `process_pair` on all CPUs.
  - Collects results into lists: `pairs` (DataFrames per pair-period) and `half_lives` (lists per pair).

- **Example in Context**:
  - For period ending 2016-12-31, with 1000 candidates (e.g., SPY-XOM as pair 1).
    - Formation: 2015-01-01 to 2016-12-31; Trading: 2017-01-01 to 2017-06-30.
    - For SPY-XOM: Smooth prices, get hedge ratios (e.g., -2.89 on Jan 3), spread (e.g., 2.50), half-life (19 days), z-score (e.g., -0.61 – not extreme).
    - Outputs: A 125-row DataFrame (trading days) with these metrics for SPY-XOM, plus half-life [2016-12-31, 'SPY.US', 'XOM.US', 19].
  - Full run: Produces `pairs` (list of ~thousands of such DataFrames) and `half_lives` (list of lists), used later for trade signals (e.g., enter if |z|>2).

These functions work together: Smoothing cleans data, hedge ratio balances pairs, half-life tunes timing, and parallel processing makes it feasible for many pairs. In the strategy, this setup identifies profitable convergence opportunities while managing risk.
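
The rolling z-score step can be sketched in isolation. The example below assumes a synthetic random spread and illustrative values for `half_life` and `max_window`; only the window rule `min(2 * half_life, max_window)` and the z-score formula mirror the notebook code.

```python
import numpy as np
import pandas as pd

# Synthetic spread on business days (illustrative data only)
rng = np.random.default_rng(0)
dates = pd.bdate_range('2015-01-01', periods=300)
spread = pd.Series(rng.normal(size=300), index=dates, name='spread')

half_life = 20      # assumed value, e.g. from the half-life estimate
max_window = 252    # assumed cap: number of rows in the formation window
window = min(2 * half_life, max_window)

# z-score: distance from the rolling mean in units of rolling std
rolling = spread.rolling(window=window)
z_score = spread.sub(rolling.mean()).div(rolling.std())

# The first window - 1 observations have no z-score yet
print(window, int(z_score.isna().sum()))

# Candidate entry dates under the |z| > 2 rule
entries = z_score[z_score.abs() > 2]
```

Because the window is computed over the formation data plus the trading data, z-scores are already defined at the start of the trading period as long as the formation window is longer than the rolling window.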

In [None]:
def get_spread(candidates, prices):
    pairs = []
    half_lives = []

    periods = pd.DatetimeIndex(sorted(candidates.test_end.unique()))
    start = time()
    for p, test_end in enumerate(periods, 1):
        start_iteration = time()

        period_candidates = candidates.loc[candidates.test_end == test_end, ['y', 'x']]
        trading_start = test_end + pd.DateOffset(days=1)
        t = trading_start - pd.DateOffset(years=2)
        T = trading_start + pd.DateOffset(months=6) - pd.DateOffset(days=1)
        max_window = len(prices.loc[t: test_end].index)
        print(f"max window: {max_window}")
        print(f"test_end {test_end.date()}, {len(period_candidates)} pairs")
        for i, (y, x) in enumerate(zip(period_candidates.y, period_candidates.x), 1):
            if i % 1000 == 0:
                msg = f'{i:5.0f} | {time() - start_iteration:7.1f} | {time() - start:10.1f}'
                print(msg)
            pair = prices.loc[t: T, [y, x]]
            pair['hedge_ratio'] = KFHedgeRatio(y=KFSmoother(prices.loc[t: T, y]),
                                               x=KFSmoother(prices.loc[t: T, x]))[:, 0]
            pair['spread'] = pair[y].add(pair[x].mul(pair.hedge_ratio))
            half_life = estimate_half_life(pair.spread.loc[t: test_end])                
            spread = pair.spread.rolling(window=min(2 * half_life, max_window))
            print(f"half_life {half_life} , spread window {min(2 * half_life, max_window)}")
            pair['z_score'] = pair.spread.sub(spread.mean()).div(spread.std())
            pairs.append(pair.loc[trading_start: T].assign(s1=y, s2=x, period=p, pair=i).drop([x, y], axis=1))

            half_lives.append([test_end, y, x, half_life])
    return pairs, half_lives

In [None]:
candidates = pd.read_hdf('backtest.h5', 'candidates')
candidates.info()

48m 20.0s


- **Max window:** 252, **Test end:** 2016-12-31, **Pairs:** 3497
- **Max window:** 314, **Test end:** 2017-03-31, **Pairs:** 1978
- **Max window:** 377, **Test end:** 2017-06-30, **Pairs:** 4124
- **Max window:** 440, **Test end:** 2017-09-30, **Pairs:** 2024
- **Max window:** 503, **Test end:** 2017-12-31, **Pairs:** 2885
- **Max window:** 503, **Test end:** 2018-03-31, **Pairs:** 3513
- **Max window:** 503, **Test end:** 2018-06-30, **Pairs:** 2399
- **Max window:** 502, **Test end:** 2018-09-30, **Pairs:** 2929
- **Max window:** 502, **Test end:** 2018-12-31, **Pairs:** 2846
- **Max window:** 501, **Test end:** 2019-03-31, **Pairs:** 2606
- **Max window:** 501, **Test end:** 2019-06-30, **Pairs:** 2645


```python
print(y, x) 
#SPY.US XOM.US
```

```python
pair = prices.loc[t: T, [y, x]]
print(pair)
ticker         BMY.US     ILF.US
date                            
2016-07-01  60.032011  20.127575
2016-07-05  60.274389  20.160980
2016-07-06  60.481711  20.160264
2016-07-07  60.683829  20.138191
2016-07-08  61.045403  20.273513
...               ...        ...
2018-12-24  44.698739  25.344163
2018-12-26  44.580871  25.357431
2018-12-27  44.553777  25.403545
2018-12-28  44.625562  25.489196
2018-12-31  44.866089  25.577877
```

In [None]:
# pairs, half_lives = get_spread(candidates, smoothed_prices)

5m 4.0s

In the `get_spread_parallel` function, the variables `trading_start`, `t`, and `T` define key dates for analyzing cointegrated pairs in a statistical arbitrage strategy:

- **`trading_start`**: This is the date when trading begins for a given test period. It’s set to the day after the `test_end` date, marking the start of the trading window.
- **`t`**: This is the start of the lookback period, exactly two years before `trading_start`. It defines the beginning of the historical data used to estimate the spread and hedge ratio.
- **`T`**: This is the end of the trading window, six months after `trading_start`. It marks the cutoff date for the trading period being analyzed.

In plain terms, `trading_start` is when you start trading, `t` is the start of the two-year historical data window used for calculations, and `T` is the end of the six-month trading period.
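
As a quick check of these definitions, the three dates can be computed for a single test period. This is a minimal sketch using the same `pd.DateOffset` arithmetic as the function, with `test_end` set to the first period in the run.

```python
import pandas as pd

test_end = pd.Timestamp('2016-12-31')

trading_start = test_end + pd.DateOffset(days=1)                      # day after test_end
t = trading_start - pd.DateOffset(years=2)                            # start of lookback
T = trading_start + pd.DateOffset(months=6) - pd.DateOffset(days=1)   # end of trading window

print(t.date(), trading_start.date(), T.date())
# 2015-01-01 2017-01-01 2017-06-30
```

The half-life is estimated on the slice from `t` to `test_end` (the formation window), while the rows returned for trading run from `trading_start` to `T`.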

In [None]:
from joblib import Parallel, delayed

def get_spread_parallel(candidates, prices):
    pairs = []
    half_lives = []

    periods = pd.DatetimeIndex(sorted(candidates.test_end.unique()))
    start = time()
    for p, test_end in enumerate(periods, 1):
        start_iteration = time()

        period_candidates = candidates.loc[candidates.test_end == test_end, ['y', 'x']]
        trading_start = test_end + pd.DateOffset(days=1) #test end date + 1
        t = trading_start - pd.DateOffset(years=2) # 2 years before trading start
        T = trading_start + pd.DateOffset(months=6) - pd.DateOffset(days=1) # 6 months after trading start - 1 day
        max_window = len(prices.loc[t: test_end].index)
        print(test_end.date(), len(period_candidates))

        def process_pair(i, y, x):
            pair = prices.loc[t: T, [y, x]].copy()  # copy so new columns are not written to a slice of prices
            hedge_ratio, intercept = KFHedgeRatio(y=KFSmoother(prices.loc[t: T, y]),
                                                  x=KFSmoother(prices.loc[t: T, x]))
            
            pair['hedge_ratio'] = hedge_ratio
            pair['intercept'] = intercept
            pair['spread'] = pair[y].add(pair[x].mul(pair['hedge_ratio'])).add(pair['intercept'])
            half_life = estimate_half_life(pair.spread.loc[t: test_end])                

            spread = pair.spread.rolling(window=min(2 * half_life, max_window))
            pair['z_score'] = pair.spread.sub(spread.mean()).div(spread.std())
            pair_out = pair.loc[trading_start: T].assign(s1=y, s2=x, period=p, pair=i).drop([x, y], axis=1)

            hl_out = [test_end, y, x, half_life]
            return pair_out, hl_out

        pair_results = Parallel(n_jobs=-1, verbose=10)(
            delayed(process_pair)(i, y, x)
            for i, (y, x) in enumerate(zip(period_candidates.y, period_candidates.x), 1)
        )

        pairs.extend([pr[0] for pr in pair_results])
        half_lives.extend([pr[1] for pr in pair_results])

    return pairs, half_lives

In [None]:
pairs, half_lives = get_spread_parallel(candidates, smoothed_prices)
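
The `Parallel`/`delayed` pattern used above can be seen in isolation in the following sketch. `process_item` is a hypothetical stand-in for `process_pair`, and the thread-based backend (`prefer='threads'`) with two workers is chosen here only to keep the example lightweight; the notebook itself uses `n_jobs=-1` with joblib's default backend.

```python
from joblib import Parallel, delayed


def process_item(i, y, x):
    """Hypothetical stand-in for process_pair: returns (payload, summary)."""
    return f"{y}-{x}", [i, y, x]


candidate_pairs = [('SPY.US', 'XOM.US'), ('BMY.US', 'ILF.US')]

# Each delayed(...) call is queued lazily and dispatched to a worker;
# results come back as a list in the same order as the inputs
results = Parallel(n_jobs=2, prefer='threads')(
    delayed(process_item)(i, y, x)
    for i, (y, x) in enumerate(candidate_pairs, 1)
)

# Split the list of tuples into two parallel lists, as done for pairs/half_lives
payloads = [r[0] for r in results]
summaries = [r[1] for r in results]
print(payloads)
print(summaries)
```

Returning a tuple per task and unzipping afterwards avoids appending to shared lists from worker processes, which would not propagate back to the parent process.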

### Collect Results

#### Half Lives

In [None]:
hl = pd.DataFrame(half_lives, columns=['test_end', 's1', 's2', 'half_life'])
hl.info()

In [None]:
hl.half_life.describe()

In [None]:
hl.to_hdf('backtest.h5', key='half_lives')
hl

In [None]:
print(hl.half_life.describe())

import plotly.express as px
fig = px.histogram(hl.half_life)
fig.show()

#### Pair Data

In [None]:
pair_data = pd.concat(pairs)
pair_data.info(show_counts=True)

In [None]:
pair_data.to_hdf('backtest.h5', key='pair_data')

In [None]:
pair_data = pd.read_hdf('backtest.h5', 'pair_data')
pair_data

`plot_pair_prices` plots the price series of a cointegrated pair (s1 and s2) for a given period and pair ID.

This function loads the necessary price data and ticker names from 'backtest.h5'.
It filters pair_data for the specified period and pair_id to identify s1 and s2,
then plots their close prices over the trading dates in that period, together with the s1 price implied by the hedge ratio and intercept and, in a second panel, the spread.

Parameters:
- period (int): The period number (e.g., 1, 2, ..., corresponding to quarterly test periods).
- pair_id (int): The pair identifier within the period (e.g., 1, 2, ... for each candidate pair).
- normalize (bool, optional): If True, normalize prices to start at 100 for easier comparison. Default is False.
- figsize (tuple, optional): Figure size for the plot. Default is (12, 8).

Returns:
- None: Displays the plot.

Explanation:
In statistical arbitrage with cointegrated pairs, visualizing the raw price series of the two assets (e.g., an ETF and a stock)
helps understand their co-movement. Cointegrated pairs tend to move together in the long run, even if they diverge temporarily.
- The left y-axis shows the price of the first asset (s1).
- The right y-axis shows the price of the second asset (s2) for better scaling.
- If normalize=True, prices are rebased to 100 at the start of the period to highlight relative movements, which is useful for spotting divergences.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_pair_prices(period, pair_id, normalize=False, figsize=(12, 8)):
    # Load prices, pair_data, and ticker names from the HDF store
    with pd.HDFStore('backtest.h5') as store:
        prices = store['prices'].close.unstack('ticker')  # Close prices with dates as index, tickers as columns
        pair_data = store['pair_data']  # Load pair_data which contains hedge_ratio, intercept, and spread
        tickers = store['tickers'].to_dict()  # Dictionary mapping ticker symbols to names

    # Filter pair_data for the given period and pair_id
    data = pair_data.query('period == @period & pair == @pair_id')
    
    if data.empty:
        print(f"No data found for period {period} and pair {pair_id}.")
        return
    
    # Extract s1 and s2 (the pair tickers)
    s1 = data['s1'].iloc[0]  # Dependent variable (y)
    s2 = data['s2'].iloc[0]  # Independent variable (x)
    
    # Get the date range for this period/pair (trading dates)
    dates = data.index
    
    # Extract prices for s1 and s2 over the date range
    pair_prices = prices.loc[dates, [s1, s2]].dropna()  # Drop any NaNs if present
    
    if pair_prices.empty:
        print(f"No price data available for {s1} and {s2} in the given period.")
        return
    
    # Estimate the price of s1 using hedge_ratio and intercept
    # Since spread = s1 + hedge_ratio * s2 + intercept ≈ 0,
    # estimated_s1 ≈ -hedge_ratio * s2 - intercept
    estimated = -data['hedge_ratio'] * pair_prices[s2] - data['intercept']
    
    # Add estimated prices and spread to the DataFrame
    pair_prices['estimated'] = estimated
    pair_prices['spread'] = data['spread']
    
    # Optionally normalize prices to start at 100
    if normalize:
        pair_prices[[s1, s2, 'estimated']] = (pair_prices[[s1, s2, 'estimated']] / pair_prices[[s1, s2, 'estimated']].iloc[0]) * 100
    
    # Rename columns to use full names for the legend
    pair_prices = pair_prices.rename(columns={
        s1: tickers.get(s1, s1),
        s2: tickers.get(s2, s2),
        'estimated': f"Estimated {tickers.get(s1, s1)}"
    })
    
    # Create the figure with two subplots (stacked vertically)
    fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=figsize, sharex=True, gridspec_kw={'height_ratios': [3, 1]})
    
    # Plot s1, estimated s1, and s2 on the first subplot
    pair_prices[[pair_prices.columns[0], pair_prices.columns[2]]].plot(
        ax=ax1,
        style=['-', '--'],  # Solid for s1, dashed for estimated
        legend=True
    )
    pair_prices[pair_prices.columns[1]].plot(
        ax=ax1,
        secondary_y=True,  # s2 on secondary y-axis
        style='-',  # Solid for s2
        legend=True
    )
    
    # Customize the first subplot
    ax1.set_title(f"Price Series for Pair {pair_id} in Period {period}: {pair_prices.columns[0]} vs {pair_prices.columns[1]} (with Estimated {pair_prices.columns[0]})")
    ax1.set_ylabel("Price" if not normalize else "Normalized Price (starting at 100)")
    ax1.grid(True)
    
    # Plot the spread on the second subplot
    pair_prices['spread'].plot(ax=ax2, color='purple', legend=True)
    ax2.axhline(0, color='black', linestyle='--', linewidth=1)  # Add horizontal line at zero
    ax2.set_title(f"Spread for Pair {pair_id} in Period {period}")
    ax2.set_xlabel("Date")
    ax2.set_ylabel("Spread")
    ax2.grid(True)
    
    # Apply Seaborn despine for cleaner look
    sns.despine()
    plt.tight_layout()
    plt.show()

# Example usage: Plot for period 1, pair 1 (adjust based on your data)
plot_pair_prices(period=1, pair_id=1, normalize=True)
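
The "estimated" line drawn by `plot_pair_prices` follows from rearranging the spread definition `spread = s1 + hedge_ratio * s2 + intercept`. The sketch below illustrates that identity on synthetic data. It is an assumption-laden stand-in: a static `np.polyfit` regression replaces the time-varying Kalman filter estimate, and the sign convention (`hedge_ratio` and `intercept` stored as the negated regression coefficients) is inferred from the comments in the plotting function.

```python
import numpy as np
import pandas as pd

# Synthetic pair: y tracks 10 + 2 * x plus noise (illustrative data only)
rng = np.random.default_rng(1)
x = pd.Series(50 + rng.normal(size=250).cumsum(), name='x')
y = 10 + 2.0 * x + rng.normal(scale=0.5, size=250)

# Static OLS fit y ~ slope * x + const (stand-in for the Kalman estimate)
slope, const = np.polyfit(x, y, deg=1)

# Assumed sign convention: store the negated coefficients
hedge_ratio = -slope
intercept = -const

# Spread as defined in process_pair, and the implied price of y
spread = y + hedge_ratio * x + intercept
estimated_y = -hedge_ratio * x - intercept

# The spread is exactly the gap between the actual and the estimated price
print(np.allclose(y - estimated_y, spread))
```

Under this convention the spread is the regression residual, which is why the lower panel of the plot oscillates around the zero line.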

### Identify Long & Short Entry and Exit Dates

In [None]:
def get_trades(data):
    pair_trades = []
    for i, ((period, s1, s2), pair) in enumerate(data.groupby(['period', 's1', 's2']), 1):
        if i % 100 == 0:
            print(i)

        first3m = pair.first('3M').index
        last3m = pair.last('3M').index

        entry = pair.z_score.abs() > 2
        entry = ((entry.shift() != entry)
                 .mul(np.sign(pair.z_score))
                 .fillna(0)
                 .astype(int)
                 .sub(2))

        exit = (np.sign(pair.z_score.shift().bfill())
                != np.sign(pair.z_score)).astype(int) - 1

        trades = (pd.concat([entry[entry != -2], exit[exit == 0]])
                  .to_frame('side')
                  .reset_index()  # Reset index to make 'date' a column for sorting
                  .sort_values(['date', 'side'])
                  .set_index('date')  # Set 'date' back as index
                  .squeeze())

        if not isinstance(trades, pd.Series):
            continue
        try:
            trades.loc[trades < 0] += 2
        except:
            print(type(trades))
            print(trades)
            print(pair.z_score.describe())
            break

        trades = trades[trades.abs().shift() != trades.abs()]
        window = trades.loc[first3m.min():first3m.max()]
        extra = trades.loc[last3m.min():last3m.max()]
        n = len(trades)

        if window.empty:  # no trades in the first three months
            continue
        if window.iloc[0] == 0:
            if n > 1:
                print('shift')
                window = window.iloc[1:]
        if window.iloc[-1] != 0:
            extra_exits = extra[extra == 0].head(1)
            if extra_exits.empty:
                continue
            else:
                window = pd.concat([window, extra_exits])

        trades = pair[['s1', 's2', 'hedge_ratio', 'period', 'pair']].join(window.to_frame('side'), how='right')
        trades.loc[trades.side == 0, 'hedge_ratio'] = np.nan
        trades.hedge_ratio = trades.hedge_ratio.ffill()
        pair_trades.append(trades)
    return pair_trades
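
The entry and exit rules are easier to follow in a simplified, loop-based form. The sketch below is not the vectorized logic of `get_trades`; it is a hypothetical state machine over a hand-made z-score series that applies the same two rules: enter when |z| exceeds 2, exit when the z-score changes sign. As in `get_trades`, `side` carries the sign of the z-score at entry and `0` marks an exit.

```python
import numpy as np
import pandas as pd

# Hand-made z-scores on business days (illustrative data only)
z = pd.Series(
    [0.5, 1.2, 2.3, 2.6, 1.1, -0.2, -1.0, -2.4, -1.5, 0.3],
    index=pd.date_range('2017-01-02', periods=10, freq='B'),
    name='z_score',
)

position = 0   # 0: flat, +1/-1: sign of the z-score when the trade was opened
events = []    # (date, side) tuples
for date, value in z.items():
    if position == 0 and abs(value) > 2:
        # Entry: the spread has diverged by more than two standard deviations
        position = int(np.sign(value))
        events.append((date, position))
    elif position != 0 and np.sign(value) != position:
        # Exit: the z-score crossed zero, i.e. the spread reverted to its mean
        position = 0
        events.append((date, 0))

trades = pd.Series(dict(events), name='side')
sides = trades.tolist()
print(sides)
# [1, 0, -1, 0]
```

Each non-zero entry is followed by a zero, so entries and exits alternate, which is the property `get_trades` enforces with `trades[trades.abs().shift() != trades.abs()]`.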


In [None]:
pair_trades = get_trades(pair_data)

In [None]:
pair_trade_data = pd.concat(pair_trades)
pair_trade_data.info()

In [None]:
pair_trade_data.head()

In [None]:
trades = pair_trade_data['side'].copy()
trades.loc[trades != 0] = 1
trades.loc[trades == 0] = -1
trades.sort_index().cumsum().plot(figsize=(14, 4))
sns.despine()

In [None]:
pair_trade_data.to_hdf('backtest.h5', key='pair_trades')

In [None]:
trades