# Pairs Trading Strategy: A Cointegration-Based Approach
## IEOR 198 Final Project

This notebook implements a market-neutral pairs trading strategy using cointegration analysis on S&P 500 stocks. The strategy identifies statistically related stock pairs and trades on mean reversion when their price relationship diverges from historical norms.


In [3]:
# Imports
import datetime as dt
import pandas as pd
import numpy as np
import warnings
import yfinance as yf
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint, adfuller
import matplotlib.pyplot as plt
from scipy.stats import zscore
from itertools import combinations

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

print("All imports loaded successfully!")


All imports loaded successfully!


## 1. Data Collection

We'll download 2+ years of adjusted close prices for a subset of S&P 500 stocks using yfinance. To keep computation tractable for pairs screening, we focus on stocks from specific sectors.


In [4]:
# Define S&P 500 tickers by sector for pairs trading
# We focus on sectors where pairs trading works well (similar business models)

# Technology sector
tech_tickers = ['AAPL', 'MSFT', 'GOOGL', 'META', 'NVDA', 'AMD', 'INTC', 'CRM', 'ADBE', 'ORCL']

# Financial sector
financial_tickers = ['JPM', 'BAC', 'WFC', 'GS', 'MS', 'C', 'USB', 'PNC', 'TFC', 'COF']

# Consumer Discretionary
consumer_tickers = ['AMZN', 'TSLA', 'HD', 'MCD', 'NKE', 'SBUX', 'TGT', 'LOW', 'TJX', 'ROST']

# Energy sector
energy_tickers = ['XOM', 'CVX', 'COP', 'SLB', 'EOG', 'MPC', 'PSX', 'VLO', 'OXY', 'HAL']

# Combine all tickers
all_tickers = tech_tickers + financial_tickers + consumer_tickers + energy_tickers

# Create sector mapping for later use
sector_map = {}
for t in tech_tickers: sector_map[t] = 'Technology'
for t in financial_tickers: sector_map[t] = 'Financial'
for t in consumer_tickers: sector_map[t] = 'Consumer'
for t in energy_tickers: sector_map[t] = 'Energy'

print(f"Total tickers: {len(all_tickers)}")
print(f"Sectors: {list(set(sector_map.values()))}")


Total tickers: 40
Sectors: ['Technology', 'Financial', 'Energy', 'Consumer']


In [5]:
# Download historical price data
# Using 2+ years of data for robust cointegration testing

start_date = '2022-01-01'
end_date = '2024-12-01'

print(f"Downloading data from {start_date} to {end_date}...")

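# Download adjusted close prices for all tickers
# auto_adjust=False is passed explicitly because recent yfinance releases default to
# auto_adjust=True, which omits the 'Adj Close' column
price_data = yf.download(all_tickers, start=start_date, end=end_date, auto_adjust=False)['Adj Close']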

# Drop any tickers with missing data
price_data = price_data.dropna(axis=1, how='any')

print(f"\nData shape: {price_data.shape}")
print(f"Date range: {price_data.index[0].date()} to {price_data.index[-1].date()}")
print(f"Tickers with complete data: {len(price_data.columns)}")
price_data.head()


Downloading data from 2022-01-01 to 2024-12-01...


[*********************100%***********************]  40 of 40 completed

1 Failed download:
['INTC']: OperationalError('database is locked')



Data shape: (732, 0)
Date range: 2022-01-03 to 2024-11-29
Tickers with complete data: 0


Ticker
Date
2022-01-03
2022-01-04
2022-01-05
2022-01-06
2022-01-07


In [6]:
# Calculate log returns for analysis
log_returns = np.log(price_data / price_data.shift(1)).dropna()

print(f"Log returns shape: {log_returns.shape}")
print("\nSample statistics:")
log_returns.describe()


Log returns shape: (732, 0)

Sample statistics:


ValueError: Cannot describe a DataFrame without columns

## 2. Pair Selection via Cointegration

**Cointegration** is a statistical property where two non-stationary time series share a common stochastic trend. Unlike correlation (which measures co-movement), cointegration implies a stable long-run equilibrium relationship.

We use the **Engle-Granger two-step cointegration test**:
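1. Run an OLS regression: $Y_t = \alpha + \beta X_t + \varepsilon_t$
2. Test the residuals $\hat{\varepsilon}_t$ for stationarity using an ADF test

A p-value < 0.05 rejects the null hypothesis of no cointegration at the 5% level, so we treat the pair as likely cointegrated.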
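The two steps can be sketched by hand on synthetic data. This is a simplified illustration, not the `coint` routine used in this notebook: the helper names `ols_fit` and `df_tstat` are hypothetical, the Dickey-Fuller regression uses no lagged differences, and only the test statistic is computed (no p-value).

```python
import numpy as np

def ols_fit(y, x):
    """Least-squares fit of y = alpha + beta * x; returns (alpha, beta, residuals)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    alpha, beta = coef
    return alpha, beta, y - design @ coef

def df_tstat(resid):
    """t-statistic of rho in: delta e_t = rho * e_{t-1} + u_t (no lags, no constant)."""
    lagged = resid[:-1]
    delta = np.diff(resid)
    rho = (lagged @ delta) / (lagged @ lagged)
    u = delta - rho * lagged
    sigma2 = (u @ u) / (len(delta) - 1)
    return rho / np.sqrt(sigma2 / (lagged @ lagged))

rng = np.random.default_rng(0)
n = 500
x = np.cumsum(rng.normal(size=n))        # random walk (non-stationary)
y = 2.0 + 1.5 * x + rng.normal(size=n)   # shares x's trend, so cointegrated with x
w = np.cumsum(rng.normal(size=n))        # unrelated random walk

# Step 1: OLS on price levels; Step 2: stationarity statistic on the residuals
alpha, beta, resid_xy = ols_fit(y, x)
_, _, resid_xw = ols_fit(w, x)

print(f"beta estimate: {beta:.3f}")
print(f"DF t-stat, cointegrated pair: {df_tstat(resid_xy):.2f}")
print(f"DF t-stat, unrelated walks:   {df_tstat(resid_xw):.2f}")
```

A strongly negative statistic points toward stationary residuals. The `coint` function from statsmodels adds lag selection and converts the statistic into a MacKinnon p-value, which is why the screening code relies on it rather than on a hand-rolled version like this one.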


In [None]:
def find_cointegrated_pairs(data, sector_map, significance=0.05):
    """
    Find cointegrated pairs within the same sector using Engle-Granger test.
    
    Returns:
        DataFrame with pair info sorted by p-value
    """
    tickers = data.columns.tolist()
    n = len(tickers)
    pairs_results = []
    
    # Only test pairs within the same sector (fundamental justification)
    for i in range(n):
        for j in range(i+1, n):
            ticker1, ticker2 = tickers[i], tickers[j]
            
            # Check if same sector
            if sector_map.get(ticker1) != sector_map.get(ticker2):
                continue
                
            # Run cointegration test
            score, pvalue, _ = coint(data[ticker1], data[ticker2])
            
            if pvalue < significance:
                pairs_results.append({
                    'ticker1': ticker1,
                    'ticker2': ticker2,
                    'sector': sector_map.get(ticker1, 'Unknown'),
                    'pvalue': pvalue,
                    't_statistic': score
                })
    
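    # Sort by p-value (most significant first)
    # Fixed column names keep sort_values from raising KeyError when no pair passes
    columns = ['ticker1', 'ticker2', 'sector', 'pvalue', 't_statistic']
    pairs_df = pd.DataFrame(pairs_results, columns=columns).sort_values('pvalue')
    return pairs_df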

print("Cointegration function defined.")


In [None]:
# Split data into training (80%) and testing (20%) periods
# We use training data for pair selection to avoid lookahead bias

train_size = int(len(price_data) * 0.8)
train_data = price_data.iloc[:train_size]
test_data = price_data.iloc[train_size:]

print(f"Training period: {train_data.index[0].date()} to {train_data.index[-1].date()} ({len(train_data)} days)")
print(f"Testing period: {test_data.index[0].date()} to {test_data.index[-1].date()} ({len(test_data)} days)")


In [None]:
# Find cointegrated pairs using training data only
print("Screening for cointegrated pairs (this may take a moment)...\n")

cointegrated_pairs = find_cointegrated_pairs(train_data, sector_map, significance=0.05)

print(f"Found {len(cointegrated_pairs)} cointegrated pairs:\n")
cointegrated_pairs


In [None]:
# Select top pair(s) for trading
# If no pairs found, fall back to a commonly traded same-sector pair (its cointegration is not verified here)
if len(cointegrated_pairs) > 0:
    best_pair = cointegrated_pairs.iloc[0]
    stock1, stock2 = best_pair['ticker1'], best_pair['ticker2']
    print(f"Best pair: {stock1} - {stock2}")
    print(f"Sector: {best_pair['sector']}")
    print(f"Cointegration p-value: {best_pair['pvalue']:.6f}")
else:
    # Fallback to a commonly traded pair
    stock1, stock2 = 'XOM', 'CVX'
    print(f"No significant pairs found. Using fallback pair: {stock1} - {stock2}")

# Visualize the price series (normalized)
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Normalized prices
norm_s1 = price_data[stock1] / price_data[stock1].iloc[0]
norm_s2 = price_data[stock2] / price_data[stock2].iloc[0]

axes[0].plot(norm_s1, label=stock1, alpha=0.8)
axes[0].plot(norm_s2, label=stock2, alpha=0.8)
axes[0].axvline(train_data.index[-1], color='red', linestyle='--', label='Train/Test Split')
axes[0].set_title(f'Normalized Prices: {stock1} vs {stock2}')
axes[0].legend()
axes[0].set_ylabel('Normalized Price')

# Price ratio
ratio = price_data[stock1] / price_data[stock2]
axes[1].plot(ratio, color='purple', alpha=0.8)
axes[1].axhline(ratio.mean(), color='green', linestyle='--', label=f'Mean: {ratio.mean():.2f}')
axes[1].axvline(train_data.index[-1], color='red', linestyle='--', label='Train/Test Split')
axes[1].set_title(f'Price Ratio: {stock1}/{stock2}')
axes[1].legend()
axes[1].set_ylabel('Ratio')

plt.tight_layout()
plt.show()


## 3. Spread Modeling and Signal Generation

The trading signal is based on the **z-score** of the price spread (or ratio). When the z-score deviates significantly from zero, we expect mean reversion:

- **Entry Signal (Long spread)**: z-score < -2 (spread is unusually low, expect it to rise)
- **Entry Signal (Short spread)**: z-score > +2 (spread is unusually high, expect it to fall)
- **Exit Signal**: z-score crosses back through 0

We use a **rolling window** to calculate the z-score to adapt to changing market conditions.
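As a small worked example of the rolling z-score, the sketch below uses a made-up six-point spread and a 3-point window purely for readability; both values are assumptions for illustration, and the notebook itself uses a 20-day lookback.

```python
import pandas as pd

spread = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
window = 3  # toy value; the notebook uses a 20-day lookback

rolling_mean = spread.rolling(window=window).mean()
rolling_std = spread.rolling(window=window).std()  # sample std (ddof=1)
z = (spread - rolling_mean) / rolling_std

# The first window-1 entries are NaN because the window is not yet full
print(z.round(3).tolist())
```

Each z-score compares the latest spread value with the mean and standard deviation of its own window, so the latest point is part of the statistics it is judged against. With a sample standard deviation over n points, |z| is bounded by (n-1)/sqrt(n), which means very short windows cannot reach a ±2 threshold at all.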


In [None]:
# Calculate the hedge ratio using OLS regression on training data
# This tells us how many shares of stock2 to trade per share of stock1

X = sm.add_constant(train_data[stock2])
model = sm.OLS(train_data[stock1], X).fit()
hedge_ratio = model.params[stock2]
intercept = model.params['const']

print(f"Hedge Ratio (beta): {hedge_ratio:.4f}")
print(f"Intercept (alpha): {intercept:.4f}")
print(f"R-squared: {model.rsquared:.4f}")


In [None]:
# Calculate spread: S1 - beta * S2 - alpha
# The spread should be stationary if the pair is cointegrated

spread = price_data[stock1] - hedge_ratio * price_data[stock2] - intercept

# Calculate rolling z-score
# Using 20-day window (approximately 1 month of trading)
lookback = 20

rolling_mean = spread.rolling(window=lookback).mean()
rolling_std = spread.rolling(window=lookback).std()
zscore_spread = (spread - rolling_mean) / rolling_std

# Visualize
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Spread
axes[0].plot(spread, label='Spread', color='blue', alpha=0.7)
axes[0].axhline(spread.mean(), color='green', linestyle='--', label='Mean')
axes[0].axvline(train_data.index[-1], color='red', linestyle='--', label='Train/Test Split')
axes[0].set_title(f'Spread: {stock1} - {hedge_ratio:.2f}*{stock2}')
axes[0].legend()
axes[0].set_ylabel('Spread Value')

# Z-score
axes[1].plot(zscore_spread, label='Z-Score', color='purple', alpha=0.7)
axes[1].axhline(0, color='black', linestyle='-', linewidth=0.5)
axes[1].axhline(2, color='red', linestyle='--', label='Upper Threshold (+2)')
axes[1].axhline(-2, color='green', linestyle='--', label='Lower Threshold (-2)')
axes[1].axvline(train_data.index[-1], color='red', linestyle='--', alpha=0.5)
axes[1].fill_between(zscore_spread.index, -2, 2, alpha=0.1, color='gray')
axes[1].set_title('Rolling Z-Score of Spread')
axes[1].legend()
axes[1].set_ylabel('Z-Score')
axes[1].set_ylim(-5, 5)

plt.tight_layout()
plt.show()


In [None]:
# Generate trading signals
# Position: 1 = long spread (long S1, short S2), -1 = short spread (short S1, long S2), 0 = no position

entry_threshold = 2.0
exit_threshold = 0.0

def generate_signals(zscore, entry_thresh=2.0, exit_thresh=0.0):
    """
    Generate trading signals based on z-score thresholds.
    
    Long spread when z-score < -entry_thresh (spread is cheap, expect increase)
    Short spread when z-score > +entry_thresh (spread is expensive, expect decrease)
    Exit when z-score crosses exit_thresh (mean reversion complete)
    """
    signals = pd.Series(index=zscore.index, data=0.0)
    position = 0
    
    for i in range(len(zscore)):
        if pd.isna(zscore.iloc[i]):
            signals.iloc[i] = 0
            continue
            
        z = zscore.iloc[i]
        
        if position == 0:  # No position
            if z < -entry_thresh:
                position = 1  # Long spread (expect z to increase)
            elif z > entry_thresh:
                position = -1  # Short spread (expect z to decrease)
        elif position == 1:  # Long spread
            if z > exit_thresh:
                position = 0  # Exit
        elif position == -1:  # Short spread
            if z < exit_thresh:
                position = 0  # Exit
                
        signals.iloc[i] = position
    
    return signals

signals = generate_signals(zscore_spread, entry_threshold, exit_threshold)

print(f"Total trading days: {len(signals)}")
print(f"Days in long position: {(signals == 1).sum()}")
print(f"Days in short position: {(signals == -1).sum()}")
print(f"Days with no position: {(signals == 0).sum()}")


## 4. Backtesting Framework

We simulate the strategy with the following assumptions:
- **Equal dollar allocation** to each leg (long and short); the hedge ratio is used to build the signal spread, not to size the positions
- **Transaction costs**: 0.02% per trade (matching the lab)
- **Dollar neutral**: Net dollar exposure is approximately zero (market neutral)
- **Daily rebalancing**: Position sizes adjusted based on price changes

The strategy P&L comes from the spread returning to its mean value.
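The return and fee accounting can be illustrated on a four-day toy example. The numbers are invented for illustration; the column names `position`, `gross`, and `net` are hypothetical and only loosely mirror the backtest columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "signal": [0, 1, 1, 0],                        # position decided at each day's close
    "spread_return": [0.000, 0.010, 0.020, -0.010],
})
fee_rate = 0.0002  # 0.02% per trade, per leg

toy["position"] = toy["signal"].shift(1)           # position actually held during the day
toy["gross"] = toy["position"] * toy["spread_return"]
toy["fees"] = toy["signal"].diff().abs() * fee_rate * 2  # two legs per position change
toy["net"] = toy["gross"] - toy["fees"]

print(toy)
print(f"Total net log return: {toy['net'].sum():.4f}")
```

Lagging the signal by one day means the return earned on the day the signal first fires is not counted, which avoids trading on information that was not yet available. Fees are charged on the day the signal changes, once at entry and once at exit.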


In [None]:
# Build the backtesting dataframe
backtest_df = pd.DataFrame({
    'date': price_data.index,
    'price_s1': price_data[stock1].values,
    'price_s2': price_data[stock2].values,
    'spread': spread.values,
    'zscore': zscore_spread.values,
    'signal': signals.values
}).set_index('date')

# Calculate log returns for each stock
backtest_df['ret_s1'] = np.log(backtest_df['price_s1'] / backtest_df['price_s1'].shift(1))
backtest_df['ret_s2'] = np.log(backtest_df['price_s2'] / backtest_df['price_s2'].shift(1))

# Calculate spread returns
# When long spread: long S1, short S2 -> return = ret_s1 - ret_s2
# When short spread: short S1, long S2 -> return = ret_s2 - ret_s1 = -(ret_s1 - ret_s2)
backtest_df['spread_return'] = backtest_df['ret_s1'] - backtest_df['ret_s2']

# Strategy return (delayed by 1 day since we trade on signal)
backtest_df['signal_prev'] = backtest_df['signal'].shift(1)
backtest_df['strategy_return'] = backtest_df['signal_prev'] * backtest_df['spread_return']

backtest_df.head(10)


In [None]:
# Account for transaction costs
# We pay fees when we enter or exit a position (position changes)

fee_rate = 0.0002  # 0.02% per trade

# Detect position changes
backtest_df['position_change'] = backtest_df['signal'].diff().abs()

# Fee is applied to both legs when position changes
# Since we have two legs (long one stock, short another), fee is 2x
backtest_df['fees'] = backtest_df['position_change'] * fee_rate * 2

# Strategy return after fees
backtest_df['strategy_return_net'] = backtest_df['strategy_return'] - backtest_df['fees']

# Drop NaN rows
backtest_df = backtest_df.dropna()

# Count trades
num_trades = (backtest_df['position_change'] > 0).sum()
print(f"Total number of trades (entries + exits): {num_trades}")
print(f"Total fees paid: {backtest_df['fees'].sum():.4%}")


## 5. Results and Performance Analysis

We evaluate the strategy using multiple metrics:
- **Cumulative Returns**: Total P&L over the period
- **Sharpe Ratio**: Risk-adjusted returns (annualized)
- **Maximum Drawdown**: Largest peak-to-trough decline
- **Win Rate**: Percentage of profitable days among days with a nonzero strategy return

We also compare in-sample (training) vs out-of-sample (testing) performance to assess robustness.
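A win rate measured per round-trip trade, rather than per day, can be sketched as follows. The helper `trade_win_rate` is a hypothetical name, and the sketch assumes `position` holds the position actually held each day (the lagged signal) and ignores fees.

```python
import pandas as pd

def trade_win_rate(position, daily_return):
    """Share of round-trip trades whose summed daily return is positive."""
    in_trade = position != 0
    # A new trade id starts whenever the held position changes value
    trade_id = (position != position.shift(1)).cumsum()
    per_trade = daily_return[in_trade].groupby(trade_id[in_trade]).sum()
    if per_trade.empty:
        return float("nan")
    return float((per_trade > 0).mean())

position = pd.Series([0, 1, 1, 0, -1, -1, -1, 0, 1, 0])
daily_return = pd.Series([0.0, 0.01, 0.02, 0.0, -0.01, -0.02, 0.005, 0.0, 0.03, 0.0])

# Three trades: two end with a positive total, one with a negative total
print(trade_win_rate(position, daily_return))
```

The two definitions can differ noticeably: a trade with many small losing days and one large winning day counts as a single win here but lowers a day-level win rate.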


In [None]:
# Performance metric functions
def sharpe_ratio(returns, periods_per_year=252):
    """Calculate annualized Sharpe ratio (risk-free rate assumed to be zero)"""
    std = returns.std()
    if std == 0 or np.isnan(std):  # flat or too-short series
        return 0
    return (returns.mean() / std) * np.sqrt(periods_per_year)

def max_drawdown(cumulative_returns):
    """Calculate maximum drawdown"""
    rolling_max = cumulative_returns.cummax()
    drawdown = (cumulative_returns - rolling_max) / rolling_max
    return drawdown.min()

def calculate_metrics(returns, label=""):
    """Calculate and print key performance metrics"""
    cum_return = returns.cumsum()
    total_return = np.exp(cum_return.iloc[-1]) - 1  # Convert log return to simple
    
    metrics = {
        'Total Return': f"{total_return:.2%}",
        'Annualized Return': f"{(returns.mean() * 252):.2%}",
        'Volatility (Ann.)': f"{(returns.std() * np.sqrt(252)):.2%}",
        'Sharpe Ratio': f"{sharpe_ratio(returns):.2f}",
        'Max Drawdown': f"{max_drawdown(np.exp(cum_return)):.2%}",
        'Win Rate': f"{(returns > 0).sum() / (returns != 0).sum():.2%}" if (returns != 0).sum() > 0 else "N/A"
    }
    
    print(f"\n{'='*40}")
    print(f"  {label}")
    print(f"{'='*40}")
    for k, v in metrics.items():
        print(f"  {k:20}: {v}")
    
    return metrics

print("Performance functions defined.")


In [None]:
# Split results into training and testing periods
train_end_date = train_data.index[-1]

train_results = backtest_df[backtest_df.index <= train_end_date]
test_results = backtest_df[backtest_df.index > train_end_date]

# Calculate metrics for each period
print("PAIRS TRADING STRATEGY PERFORMANCE")
print(f"Pair: {stock1} / {stock2}")

train_metrics = calculate_metrics(train_results['strategy_return_net'], 
                                   f"IN-SAMPLE (Training: {train_results.index[0].date()} to {train_results.index[-1].date()})")

test_metrics = calculate_metrics(test_results['strategy_return_net'], 
                                  f"OUT-OF-SAMPLE (Testing: {test_results.index[0].date()} to {test_results.index[-1].date()})")

full_metrics = calculate_metrics(backtest_df['strategy_return_net'], 
                                  f"FULL PERIOD")


In [None]:
# Plot cumulative returns
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Cumulative returns comparison
strategy_cum = backtest_df['strategy_return_net'].cumsum()
s1_cum = backtest_df['ret_s1'].cumsum()
s2_cum = backtest_df['ret_s2'].cumsum()

# Convert to simple returns for plotting
axes[0].plot(np.exp(strategy_cum) - 1, label=f'Pairs Strategy ({stock1}/{stock2})', linewidth=2, color='blue')
axes[0].plot(np.exp(s1_cum) - 1, label=f'{stock1} (Buy & Hold)', alpha=0.6, linestyle='--')
axes[0].plot(np.exp(s2_cum) - 1, label=f'{stock2} (Buy & Hold)', alpha=0.6, linestyle='--')
axes[0].axvline(train_end_date, color='red', linestyle='--', label='Train/Test Split', alpha=0.7)
axes[0].axhline(0, color='black', linestyle='-', linewidth=0.5)
axes[0].set_title('Cumulative Returns: Pairs Strategy vs Individual Stocks', fontsize=12)
axes[0].set_ylabel('Cumulative Return')
axes[0].legend(loc='upper left')
axes[0].grid(True, alpha=0.3)

# Strategy returns with signals overlay
ax2 = axes[1]
ax2.fill_between(backtest_df.index, 0, backtest_df['strategy_return_net'] * 100, 
                  where=backtest_df['strategy_return_net'] > 0, color='green', alpha=0.5, label='Profit')
ax2.fill_between(backtest_df.index, 0, backtest_df['strategy_return_net'] * 100, 
                  where=backtest_df['strategy_return_net'] < 0, color='red', alpha=0.5, label='Loss')
ax2.axvline(train_end_date, color='red', linestyle='--', alpha=0.7)
ax2.axhline(0, color='black', linestyle='-', linewidth=0.5)
ax2.set_title('Daily Strategy Returns (%)', fontsize=12)
ax2.set_ylabel('Daily Return (%)')
ax2.set_xlabel('Date')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Trading signal visualization with z-score
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Z-score with signals
axes[0].plot(backtest_df['zscore'], color='purple', alpha=0.7, label='Z-Score')
axes[0].axhline(2, color='red', linestyle='--', alpha=0.5)
axes[0].axhline(-2, color='green', linestyle='--', alpha=0.5)
axes[0].axhline(0, color='black', linestyle='-', linewidth=0.5)
axes[0].fill_between(backtest_df.index, -2, 2, alpha=0.1, color='gray')
axes[0].axvline(train_end_date, color='red', linestyle='--', alpha=0.7)
axes[0].set_ylabel('Z-Score')
axes[0].set_title('Z-Score of Spread and Trading Signals')
axes[0].legend()
axes[0].set_ylim(-5, 5)

# Position
axes[1].fill_between(backtest_df.index, 0, backtest_df['signal'], 
                      where=backtest_df['signal'] > 0, color='green', alpha=0.5, label='Long Spread')
axes[1].fill_between(backtest_df.index, 0, backtest_df['signal'], 
                      where=backtest_df['signal'] < 0, color='red', alpha=0.5, label='Short Spread')
axes[1].axhline(0, color='black', linestyle='-', linewidth=0.5)
axes[1].axvline(train_end_date, color='red', linestyle='--', alpha=0.7)
axes[1].set_ylabel('Position')
axes[1].set_title('Trading Position')
axes[1].legend()

# Cumulative P&L
axes[2].plot(np.exp(strategy_cum) - 1, color='blue', linewidth=2)
axes[2].axhline(0, color='black', linestyle='-', linewidth=0.5)
axes[2].axvline(train_end_date, color='red', linestyle='--', alpha=0.7, label='Train/Test Split')
axes[2].set_ylabel('Cumulative Return')
axes[2].set_xlabel('Date')
axes[2].set_title('Cumulative Strategy Return')
axes[2].legend()

plt.tight_layout()
plt.show()


In [None]:
# Summary statistics table
print("\n" + "="*60)
print("STRATEGY SUMMARY")
print("="*60)
print(f"\nPair Traded: {stock1} / {stock2}")
print(f"Sector: {sector_map.get(stock1, 'Unknown')}")
print(f"Hedge Ratio: {hedge_ratio:.4f}")
print(f"\nTrading Parameters:")
print(f"  - Entry Threshold: +/- {entry_threshold} standard deviations")
print(f"  - Exit Threshold: {exit_threshold} (mean reversion)")
print(f"  - Lookback Window: {lookback} days")
print(f"  - Transaction Cost: {fee_rate:.2%} per trade")
print(f"\nTotal Trades: {num_trades}")
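# Holding period = days spent in a position divided by the number of round trips
print(f"Average Holding Period: {(backtest_df['signal'] != 0).sum() / max(num_trades / 2, 1):.1f} days")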

# Out-of-sample is the true test of strategy viability
print("\n" + "="*60)
print("KEY RESULT: OUT-OF-SAMPLE PERFORMANCE")
print("="*60)
if len(test_results) > 0:
    oos_sharpe = sharpe_ratio(test_results['strategy_return_net'])
    oos_return = np.exp(test_results['strategy_return_net'].cumsum().iloc[-1]) - 1
    print(f"\n  Sharpe Ratio: {oos_sharpe:.2f}")
    print(f"  Total Return: {oos_return:.2%}")
    
    if oos_sharpe > 1:
        print("\n  -> Strategy shows promise with Sharpe > 1")
    elif oos_sharpe > 0:
        print("\n  -> Strategy is profitable but Sharpe < 1")
    else:
        print("\n  -> Strategy did not perform well out-of-sample")
else:
    print("\n  No out-of-sample data available")


---
# PAPER: Pairs Trading Strategy Using Cointegration Analysis
## IEOR 198 Final Project
---

## Abstract

This paper presents a market-neutral pairs trading strategy based on cointegration analysis of S&P 500 stocks. The strategy identifies statistically related stock pairs within the same sector and generates trading signals when their price relationship diverges from historical norms. Using the Engle-Granger cointegration test to screen for viable pairs and z-score-based entry/exit rules, we backtest the strategy over a multi-year period with an 80/20 train-test split. We account for transaction costs and evaluate performance using Sharpe ratio, total returns, and maximum drawdown. The results demonstrate the viability of statistical arbitrage approaches while highlighting the challenges of maintaining cointegration relationships out-of-sample.


## 1. Introduction

### Background

Pairs trading is a classic market-neutral strategy that has been employed by quantitative hedge funds since the 1980s, pioneered by Nunzio Tartaglia's quantitative group at Morgan Stanley. The fundamental premise is simple: identify two securities whose prices have historically moved together, and when they temporarily diverge, bet on their convergence.

### Why Pairs Trading?

Unlike directional strategies that profit from market movements, pairs trading aims to be **market neutral** - insulated from broad market swings. This is achieved by simultaneously holding a long position in one stock and a short position in another. The profit comes not from the overall market direction, but from the *relative* movement between the two securities.
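The cancellation of the market component can be shown with simulated returns. This is a stylized sketch: it assumes both stocks have the same sensitivity to the market (a beta of 1), and all volatilities are made up.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
market = rng.normal(scale=0.01, size=n)            # shared market return
ret_1 = market + rng.normal(scale=0.005, size=n)   # stock 1 = market + own noise
ret_2 = market + rng.normal(scale=0.005, size=n)   # stock 2 = market + own noise

long_short = ret_1 - ret_2                         # long stock 1, short stock 2

corr_single = np.corrcoef(ret_1, market)[0, 1]
corr_pair = np.corrcoef(long_short, market)[0, 1]
print(f"corr(stock 1, market):    {corr_single:.2f}")
print(f"corr(long-short, market): {corr_pair:.2f}")
```

When the two betas differ, the market term no longer cancels exactly, and the long-short position keeps some residual market exposure.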

### Cointegration vs Correlation

A critical distinction in pairs trading is between **correlation** and **cointegration**:

- **Correlation** measures how closely two series move together over short horizons and is usually computed on returns. However, two correlated stocks can still drift apart permanently.
- **Cointegration** describes the long-run relationship between price levels: two non-stationary series share a common stochastic trend, so some linear combination of them (the spread) is stationary and tends to revert to its mean.

We use cointegration as our primary screening criterion because it provides a theoretical basis for mean reversion, which is the core assumption of pairs trading.
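The distinction can be made concrete with two synthetic pairs. The sketch below uses invented parameters: Pair A has highly correlated daily changes but different drifts, while Pair B is cointegrated by construction.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Pair A: highly correlated daily changes, but different drifts, so levels drift apart
common = rng.normal(size=n)
chg_a1 = common + 0.3 * rng.normal(size=n) + 0.1   # extra drift of 0.1 per step
chg_a2 = common + 0.3 * rng.normal(size=n)
level_a1, level_a2 = np.cumsum(chg_a1), np.cumsum(chg_a2)

# Pair B: cointegrated by construction, second series is the first plus stationary noise
level_b1 = np.cumsum(rng.normal(size=n))
level_b2 = level_b1 + rng.normal(size=n)

corr_a = np.corrcoef(chg_a1, chg_a2)[0, 1]
gap_a = level_a1 - level_a2
gap_b = level_b1 - level_b2

print(f"Pair A change correlation: {corr_a:.2f}")
print(f"Pair A final gap: {gap_a[-1]:.1f}")
print(f"Pair B gap std:   {gap_b.std():.2f}")
```

Pair A would pass a correlation screen, yet its gap keeps widening, so a convergence bet on it has no anchor. Pair B's gap stays bounded around zero, which is the behavior the trading rule depends on.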

### Objectives

1. Screen S&P 500 stocks for cointegrated pairs within the same sector
2. Develop a systematic trading strategy based on z-score signals
3. Backtest the strategy with realistic transaction costs
4. Evaluate out-of-sample performance to assess strategy robustness


## 2. Dataset

### Data Source

We use **Yahoo Finance** (via the `yfinance` Python library) to obtain historical adjusted closing prices for S&P 500 constituent stocks. Adjusted close prices account for stock splits and dividends, providing a more accurate representation of total returns.

### Universe Selection

To make pair screening computationally tractable and fundamentally sound, we focus on 40 stocks from four sectors:
- **Technology** (10 stocks): AAPL, MSFT, GOOGL, META, NVDA, AMD, INTC, CRM, ADBE, ORCL
- **Financial** (10 stocks): JPM, BAC, WFC, GS, MS, C, USB, PNC, TFC, COF
- **Consumer Discretionary** (10 stocks): AMZN, TSLA, HD, MCD, NKE, SBUX, TGT, LOW, TJX, ROST
- **Energy** (10 stocks): XOM, CVX, COP, SLB, EOG, MPC, PSX, VLO, OXY, HAL

### Time Period

- **Full Period**: January 2022 - November 2024 (just under 3 years; the last trading day in the sample is 2024-11-29)
- **Training Set** (80%): Used for cointegration testing and parameter estimation
- **Test Set** (20%): Used for out-of-sample performance evaluation

### Data Processing

1. Downloaded daily adjusted close prices for all 40 tickers
2. Removed tickers with missing data during the period
3. Calculated log returns: $r_t = \ln(P_t / P_{t-1})$
4. Aligned all time series to common trading days
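Steps 3 and 4 can be sketched on two toy price series. The tickers `A` and `B` and their prices are made up, and dropping rows with any missing price is one possible way to align series, shown here as an assumption rather than as the notebook's exact procedure.

```python
import numpy as np
import pandas as pd

a = pd.Series([100.0, 110.0, 121.0],
              index=pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05"]), name="A")
b = pd.Series([50.0, 55.0],
              index=pd.to_datetime(["2022-01-03", "2022-01-05"]), name="B")

# Keep only dates on which every series has a price
prices = pd.concat([a, b], axis=1).dropna(axis=0, how="any")

# r_t = ln(P_t / P_{t-1}); the first row has no predecessor and is dropped
log_returns = np.log(prices / prices.shift(1)).dropna()
print(log_returns)
```

Because 2022-01-04 is missing for `B`, the remaining return for `A` spans two days. Log returns add across time, so it equals the sum of the two daily log returns it replaces.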
