# Notebook 00 — Pair Selection Framework

**Adaptive Pair Trading using Cointegration, Volatility and ML Diagnostics**  
**Author:** Ayush Arora (MQMS2404)

---

## Objective

This notebook implements a **hierarchical pair selection framework** to identify
economically meaningful and statistically tradable stock pairs.

The goal is **not** to find the most profitable pair, but to systematically filter
out weak candidates before applying deeper statistical arbitrage analysis.

The selected pairs are then passed to subsequent notebooks for validation,
modeling, and backtesting.

## Cell 1: Import required libraries

In [1]:
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import coint, adfuller
import itertools

## Cell 2: Load cleaned price data

In [2]:
prices = pd.read_csv('data/prices.csv', index_col=0, parse_dates=True)
prices = prices.dropna(axis=1)
prices.head()

Date,ABB.NS,ADANIENT.NS,ADANIPORTS.NS,AMBUJACEM.NS,APOLLOHOSP.NS,ASHOKLEY.NS,ASIANPAINT.NS,AUROPHARMA.NS,AXISBANK.NS,BAJAJ-AUTO.NS,...,TCS.NS,TECHM.NS,TITAN.NS,TORNTPHARM.NS,TVSMOTOR.NS,UBL.NS,ULTRACEMCO.NS,VEDL.NS,WIPRO.NS,YESBANK.NS
2015-01-01,1131.143555,70.761368,301.772858,195.77301,1084.179443,21.544609,684.931519,532.220825,486.962891,1845.359253,...,1006.856689,462.629211,362.723755,494.190887,259.65564,808.554321,2548.028809,70.611023,93.508202,147.012466
2015-01-02,1119.410767,71.107956,301.584015,198.523697,1086.056763,22.124655,708.61145,534.77063,497.853088,1845.058472,...,1020.26532,464.893738,365.588867,514.467468,250.206802,809.523132,2624.206299,71.593277,94.337654,150.751801
2015-01-05,1119.671143,72.284912,305.786499,198.265808,1089.667969,23.947659,708.565674,534.510925,500.999146,1851.793701,...,1004.760681,457.13739,368.549469,512.181091,253.434326,806.907471,2629.56543,70.788124,94.506927,151.075287
2015-01-06,1109.50293,71.736145,303.944946,190.959244,1057.216675,23.429758,691.65155,513.663208,483.09079,1837.344482,...,967.71814,452.233887,355.943024,488.62558,250.768112,806.471375,2555.780762,67.326019,92.29789,146.841232
2015-01-07,1093.510986,71.100731,303.236633,189.197098,1065.498047,24.714153,705.548523,521.808777,482.703583,1841.633911,...,956.287292,450.12085,357.757568,486.015625,260.357208,869.103821,2545.683594,67.309937,91.595375,144.119995
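
`dropna(axis=1)` removes every ticker that has even one missing price, which can shrink the universe without any warning. A minimal sketch of how one might list the columns that would be dropped before dropping them. The frame below is synthetic, and the tickers `AAA`, `BBB`, `CCC` are hypothetical stand-ins for the real price table:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price table; tickers and values are hypothetical.
demo = pd.DataFrame(
    {
        "AAA": [10.0, 10.5, 11.0, 10.8],
        "BBB": [20.0, np.nan, 21.0, 21.5],  # one missing observation
        "CCC": [5.0, 5.1, 5.2, 5.3],
    },
    index=pd.date_range("2015-01-01", periods=4, freq="D"),
)

# Count missing values per column and keep only the columns with gaps.
missing = demo.isna().sum()
dropped = missing[missing > 0].index.tolist()

# Same operation as in the cell above: drop any column containing a NaN.
clean = demo.dropna(axis=1)

print("dropped:", dropped)
print("kept:", clean.columns.tolist())
```

Inspecting `dropped` first makes it easier to notice when a ticker needed by a later cell has been removed.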


## Cell 3: Define economically meaningful universe (example: metals)

Pairs are generated **within sectors** to preserve economic logic.
This example focuses on the metals sector.

In [3]:
metal_stocks = [
    'JSWSTEEL.NS', 'TATASTEEL.NS', 'HINDALCO.NS', 'VEDL.NS'
]

pairs = list(itertools.combinations(metal_stocks, 2))
pairs

[('JSWSTEEL.NS', 'TATASTEEL.NS'),
 ('JSWSTEEL.NS', 'HINDALCO.NS'),
 ('JSWSTEEL.NS', 'VEDL.NS'),
 ('TATASTEEL.NS', 'HINDALCO.NS'),
 ('TATASTEEL.NS', 'VEDL.NS'),
 ('HINDALCO.NS', 'VEDL.NS')]
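
To extend the same idea beyond metals, pairs can be built sector by sector so that no cross-sector pair is ever generated. A minimal sketch: the `sector_map` dictionary, the helper `sector_pairs`, and the second sector's membership are illustrative assumptions, not part of the notebook.

```python
import itertools

# Hypothetical sector-to-ticker mapping; only the metals list comes from the notebook.
sector_map = {
    "metals": ["JSWSTEEL.NS", "TATASTEEL.NS", "HINDALCO.NS", "VEDL.NS"],
    "it": ["TCS.NS", "TECHM.NS", "WIPRO.NS"],
}

def sector_pairs(sector_map):
    """Return (sector, stock_a, stock_b) for every within-sector pair."""
    out = []
    for sector, tickers in sector_map.items():
        for a, b in itertools.combinations(tickers, 2):
            out.append((sector, a, b))
    return out

all_pairs = sector_pairs(sector_map)
print(len(all_pairs))
for row in all_pairs:
    print(row)
```

Keeping the sector label in each tuple lets later filters report results per sector.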

## Cell 4: Correlation filter

Pairs whose daily-return correlation does not exceed the threshold (0.4 by default) are removed at this stage.

In [10]:
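def corr_filter(prices, pairs, threshold=0.4):
    """Keep pairs whose daily-return correlation exceeds `threshold`."""
    returns = prices.pct_change()  # simple daily returns; first row is NaN
    records = []
    for a, b in pairs:
        corr = returns[a].corr(returns[b])  # Pearson; NaNs are ignored pairwise
        if corr > threshold:
            records.append((a, b, corr))
    return pd.DataFrame(records, columns=['Stock A', 'Stock B', 'Correlation'])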

corr_pairs = corr_filter(prices, pairs)
corr_pairs

,Stock A,Stock B,Correlation
0,TATASTEEL.NS,HINDALCO.NS,0.692928
1,TATASTEEL.NS,VEDL.NS,0.628697
2,HINDALCO.NS,VEDL.NS,0.652886
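
The filter returns only the pairs that pass, so rejected pairs disappear from the output. For diagnostics it may help to tabulate the return correlation of every candidate alongside a pass flag. A sketch on synthetic prices; the column names `X`, `Y`, `Z` and the helper `corr_table` are hypothetical:

```python
import itertools
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic daily returns: X and Y share a common factor, Z is independent.
common = rng.normal(0, 0.01, n)
rets = pd.DataFrame({
    "X": common + rng.normal(0, 0.005, n),
    "Y": common + rng.normal(0, 0.005, n),
    "Z": rng.normal(0, 0.01, n),
})
demo_prices = 100 * (1 + rets).cumprod()

def corr_table(prices, pairs, threshold=0.4):
    """Correlation of daily returns for every pair, with a pass/fail flag."""
    returns = prices.pct_change().dropna()
    rows = []
    for a, b in pairs:
        c = returns[a].corr(returns[b])
        rows.append((a, b, c, bool(c > threshold)))
    return pd.DataFrame(rows, columns=["Stock A", "Stock B", "Correlation", "Passes"])

candidates = list(itertools.combinations(demo_prices.columns, 2))
table = corr_table(demo_prices, candidates)
print(table)
```

Filtering `table` on the `Passes` column gives the same kind of result as `corr_filter`, while the full table shows how far each rejected pair fell below the threshold.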


## Cell 5: Cointegration testing

The Engle–Granger cointegration test is applied to the pairs that passed the correlation filter. Pairs with a p-value below `alpha` (0.05 by default) are retained.

In [14]:
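def cointegration_test(prices, df, alpha=0.05):
    """Keep pairs whose Engle–Granger p-value is below `alpha`."""
    records = []
    for _, row in df.iterrows():
        a, b = row['Stock A'], row['Stock B']
        # coint regresses the first series on the second, so the result
        # can depend on the order in which the pair is passed.
        _, pval, _ = coint(prices[a], prices[b])
        if pval < alpha:
            records.append((a, b, row['Correlation'], pval))
    return pd.DataFrame(records, columns=['Stock A', 'Stock B', 'Correlation', 'Coint p-value'])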

coint_pairs = cointegration_test(prices, corr_pairs)
coint_pairs

,Stock A,Stock B,Correlation,Coint p-value
0,TATASTEEL.NS,HINDALCO.NS,0.692928,0.000586


## Cell 6: Stationarity of spread

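The ADF test is applied to the spread, defined here as the raw price difference `prices[a] - prices[b]` (an implicit hedge ratio of 1), to confirm mean-reverting behavior.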

In [15]:
def adf_spread_test(prices, df, alpha=0.05):
    """Keep pairs whose price-difference spread rejects a unit root at level `alpha`."""
    records = []
    for _, row in df.iterrows():
        a, b = row['Stock A'], row['Stock B']
        # Raw price difference, i.e. an implicit hedge ratio of 1.
        spread = prices[a] - prices[b]
        adf_stat, pval, *_ = adfuller(spread)
        if pval < alpha:
            records.append((a, b, row['Correlation'], row['Coint p-value'], adf_stat))
    return pd.DataFrame(records, columns=['Stock A', 'Stock B', 'Correlation', 'Coint p-value', 'ADF stat'])

stationary_pairs = adf_spread_test(prices, coint_pairs)
stationary_pairs

,Stock A,Stock B,Correlation,Coint p-value,ADF stat
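
The spread in this cell is a raw price difference, which implicitly assumes a hedge ratio of 1. The Engle–Granger procedure instead tests the residual of a regression of one price on the other, so a pair can pass the cointegration test and still fail an ADF test on the raw difference. A minimal sketch of a regression-based spread on synthetic data. The helper `hedged_spread` is hypothetical, and OLS via `np.polyfit` is only one possible estimator, not necessarily the one used in later notebooks:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Synthetic cointegrated pair: b is a random walk, a = 2*b + 5 + stationary noise.
b = pd.Series(100 + rng.normal(0, 1, n).cumsum(), name="B")
a = pd.Series(2.0 * b + 5.0 + rng.normal(0, 1, n), name="A")

def hedged_spread(a, b):
    """Regress a on b by OLS and return (slope, intercept, residual spread)."""
    slope, intercept = np.polyfit(b.to_numpy(), a.to_numpy(), 1)
    spread = a - (slope * b + intercept)
    return slope, intercept, spread

slope, intercept, spread = hedged_spread(a, b)

# The raw difference a - b still contains one unit of the random walk b,
# so it wanders far more than the regression residual.
raw = a - b
print(f"estimated hedge ratio: {slope:.3f}")
print(f"std of raw difference: {raw.std():.2f}")
print(f"std of hedged spread:  {spread.std():.2f}")
```

The resulting `spread` could be passed to `adfuller` in place of the raw difference. Note that standard ADF critical values are not strictly valid for residuals from an estimated regression, which is the reason `coint` applies its own critical values.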


## Cell 7: Half-life of mean reversion

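Half-life, the expected time for a deviation of the spread from its mean to shrink by half, is used as a practical measure of reversion speed.
Only pairs whose half-life is below `max_hl` (60 rows of daily data by default, i.e. 60 trading days) are retained.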
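
The half-life formula in the next cell can be motivated as follows, assuming the spread $s_t$ follows a discrete AR(1) process. This is an assumption of the sketch, and the symbols $\alpha$, $\beta$, $\mu$, $\varepsilon_t$ and $h$ are introduced here for illustration:

$$
\Delta s_t = \alpha + \beta\, s_{t-1} + \varepsilon_t, \qquad -1 < \beta < 0 .
$$

Ignoring the noise term, the deviation from the long-run mean $\mu = -\alpha/\beta$ shrinks by a factor $(1+\beta)$ each period, since $s_t - \mu = (1+\beta)(s_{t-1} - \mu)$. It therefore halves after

$$
h = -\frac{\ln 2}{\ln(1+\beta)} \approx -\frac{\ln 2}{\beta} \quad \text{for small } |\beta| .
$$

The code uses the approximation on the right and returns infinity when $\beta \ge 0$, since such a spread shows no reversion.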

In [16]:
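def half_life(spread):
    """Estimate the half-life of mean reversion, in rows of `spread`."""
    # Regress the one-step change on the lagged level; the slope is beta.
    spread_lag = spread.shift(1).dropna()
    spread_ret = spread.diff().dropna()
    beta = np.polyfit(spread_lag, spread_ret, 1)[0]
    if beta >= 0:
        return np.inf  # no mean reversion detected
    return -np.log(2) / beta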

def compute_half_life(prices, df, max_hl=60):
    records = []
    for _, row in df.iterrows():
        a, b = row['Stock A'], row['Stock B']
        spread = prices[a] - prices[b]
        hl = half_life(spread)
        if hl < max_hl:
            records.append((a, b, row['Correlation'], row['Coint p-value'], row['ADF stat'], hl))
    return pd.DataFrame(records, columns=['Stock A', 'Stock B', 'Correlation', 'Coint p-value', 'ADF stat', 'Half-life'])

final_pairs = compute_half_life(prices, stationary_pairs)
final_pairs.sort_values('Half-life')

,Stock A,Stock B,Correlation,Coint p-value,ADF stat,Half-life
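
As a sanity check on the estimator, it can be run on a simulated AR(1) series whose true coefficient is known. The sketch below repeats `half_life` so that it runs on its own; the simulation parameters (`phi`, `n`, the seed) are arbitrary choices made for illustration:

```python
import numpy as np
import pandas as pd

def half_life(spread):
    """Same estimator as the notebook cell, repeated so this block runs alone."""
    spread_lag = spread.shift(1).dropna()
    spread_ret = spread.diff().dropna()
    beta = np.polyfit(spread_lag, spread_ret, 1)[0]
    if beta >= 0:
        return np.inf
    return -np.log(2) / beta

rng = np.random.default_rng(42)
phi = 0.9   # AR(1) coefficient, so the true beta is phi - 1 = -0.1
n = 5000

# Simulate s_t = phi * s_{t-1} + eps_t, starting from zero.
eps = rng.normal(0, 1, n)
s = np.zeros(n)
for t in range(1, n):
    s[t] = phi * s[t - 1] + eps[t]

hl_est = half_life(pd.Series(s))
hl_approx = -np.log(2) / (phi - 1)    # approximation used by the notebook
hl_exact = -np.log(2) / np.log(phi)   # exact discrete-time half-life

print(f"estimated: {hl_est:.2f}  approx: {hl_approx:.2f}  exact: {hl_exact:.2f}")
```

The estimate should land near the approximate value, which is somewhat larger than the exact discrete-time half-life. The gap widens as reversion gets faster.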


## Final Output

The resulting pairs represent candidates that satisfy:

- Economic plausibility (sector-based selection)
- Statistical validity (correlation, cointegration, stationarity)
- Practical tradability (fast mean reversion)

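These pairs are forwarded to subsequent notebooks for detailed modeling and backtesting. In the metals run shown above, no pair passes the spread-stationarity filter, so the final table is empty.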