# Pair Trading Analysis
  
Given a pair of stocks that are Conintegrated, grab their PricePredict Objects from the
ppo/ directory and plot a median/spread graph of the two stocks. This will allow us to see
how the two stocks are moving relative to each other and if there are any opportunities to
trade the spread between the two stocks.

Run a simple trading simulation that uses a strategy of trading the 2 stocks at the same time
as follows:
* Opening a Pairs Trade...
    * When the 2 stocks diverge and the stock that is overvalued is more than the historical standard deviation
      from the mean (based on N most profitable moves) and the undervalued stock is more than the historical 
      standard deviation from the mean (etc.). 
        * Sell the overvalued stock and buy the undervalued stock.
          Only do these trades if the prediction for the next day indicates that the overvalued stock
          is going to go down and the undervalued stock is going to go up on on the current timeframe,
          and on the next higher timeframe (weekly, given a daily trading timeframe).
          Then buy/sell the stocks at the opening price of the next day.
        * Use additional indicators that indicate a reversal in the spread (as needed). 
    * Do we trade from convergence?
        * It does not make sense to trade from convergence, as it is difficult to determine which
          stock to go long on and which stock to go short on.
        * We should only trade from divergence, as we can determine which stock is overvalued and which
          stock is undervalued. 
* Closing a Pairs Trade...
    * Given an open Pairs Trade that started from a Spread Divergence... 
        * We wait for the spread to converge to the median and then close the trade.
        * We then calculate the profit/loss of the trade and add it to the total profit/loss.
* How to choose going long vs going short (one stock will be long and the other short)...
    * The Spread calculation is essentially the difference between the two stocks. Stock_A - Stock_B.
        * As Stock_A goes up and Stock_B goes down, the spread will increase (go up in the spread graph).
        * As Stock_A goes down and Stock_B goes up, the spread will decrease (go down in the spread graph).    
    * If the spread is above 2 standard deviations from the mean, we go short on Stock_A and long on Stock_B.
    * If the spread is below 2 standard deviations from the mean, we go long on Stock_A and short on Stock_B.
* How much to trade...
    * We use the Hedge Ratio to determine how much of each stock to trade.
        * The Hedge Ratio is the Beta from the OLS regression of Stock_A on Stock_B.
    * The Hedge Ratio is the amount of Stock_B that is needed to hedge the risk of Stock_A.
        * If the Hedge Ratio is 1.5, then we would trade 1.5 shares of Stock_B for every 1 share of Stock_A.
        * If the Hedge Ratio is negative (-1.5), then for every 1 share of Stock_B, we would trade 1.5 shares of Stock_A.

## Questions...

* Which is a stronger indicator of Cointegration...
    * Weekly or Daily?
    * Both are useful much the way a weekly prediction indicates the longer term trend and a daily prediction
      indicates the shorter term trend.

## Insights...

* Different start-end periods result in different Hedge Ratios and Spread Deviations.
    * Longer timeframes (5yrs vs 30days) result in lower Hedge Ratios and Spread Deviations. It takes longer for the spread to converge.
* It probably makes sense to hold on to the data for a given trading pair trade entry so that one can
  continue to track the move to the originally calculate median (exit point), sticking to the original
  trading plan. We add more data to the close columns and we calculate the spread and the current spread
  the original Beta/Hedge Ratio, while the median (our exit point) stays the same.


    


In [1]:
import sys
from types import ModuleType, FunctionType
from gc import get_referents

# Helper function to get the size of an object (Curiosity)
# Custom objects know their class.
# Function objects seem to know way too much, including modules.
# Exclude modules as well.
BLACKLIST = type, ModuleType, FunctionType


def getsize(obj):
    """sum size of object & members."""
    if isinstance(obj, BLACKLIST):
        raise TypeError('getsize() does not take argument of type: '+ str(type(obj)))
    seen_ids = set()
    size = 0
    objects = [obj]
    while objects:
        need_referents = []
        for obj in objects:
            if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                seen_ids.add(id(obj))
                size += sys.getsizeof(obj)
                need_referents.append(obj)
        objects = get_referents(*need_referents)
    return size 


In [46]:
import pandas as pd
import statsmodels.api as sm
from decimal import Decimal
from pandas_decimal import DecimaldDtype

def get_trading_pair_spread(ppos: tuple, beta: Decimal = None, 
                            prev_days: int = None,
                            start_period: int = None, end_period: int = None,
                            start_date: str = None, end_date: str = None):
    
    # Create a DataFrame of the closing prices from the PPO[0 and 1].orig_data dataframes
    closes1 = ppos[0].orig_data['Close'].astype(DecimaldDtype(5))
    closes2 = ppos[1].orig_data['Close'].astype(DecimaldDtype(5))
    # Make closes1 and closes2 the same length
    min_len = min(len(closes1), len(closes2))
    if prev_days is None:
        prev_days = min_len
    elif prev_days > min_len:
        prev_days = min_len
    if start_period is not None and end_period is not None:
        # Gather closes based numeric index    
        closes1 = closes1[start_period:end_period]
        closes2 = closes2[start_period:end_period]
    elif start_date is not None and end_date is not None:
        # Gather closes based on the date index column
        closes1 = closes1.loc[start_date:end_date]
        closes2 = closes2.loc[start_date:end_date]
    else:
        # Default to the last prev_days    
        closes1 = closes1.tail(prev_days - 1)
        closes2 = closes2.tail(prev_days - 1)
    df_closes = pd.DataFrame({'Stock_A': closes1, 'Stock_B': closes2})
    df_closes = df_closes.bfill().ffill()
    if beta is None:
        # Perform OLS to find beta
        X = df_closes['Stock_B']
        X = sm.add_constant(X)  # Adds a constant term to the predictor
        model = sm.OLS(df_closes['Stock_A'], X).fit()
        beta = model.params['Stock_B']

    # Detrend the closes
    # closes1m = (closes1 - closes1.rolling(window=3)).mean()
    closes1m = closes1.rolling(window=3).apply(lambda x: (x - x.mean()).mean())
    # closes2m = (closes2 - closes2.rolling(window=3)).mean()
    closes2m = closes2.rolling(window=3).apply(lambda x: (x - x.mean()).mean())
    df_detrend = pd.DataFrame({'Stock_A': closes1m, 'Stock_B': closes2m})
    df_detrend = df_detrend.bfill().ffill()
    # Calculate the spread and its mean using the Hedge-Ratio beta 
    df_detrend['Spread'] = df_closes['Stock_A'] - beta * df_closes['Stock_B']
    spread_mean = df_detrend['Spread'].mean()
    # Create a line that is 1 standard deviation above from the spread-mean
    df_detrend['Mean_1std_a'] = spread_mean + df_detrend['Spread'].std()
    # Create a line that is 2 standard deviation above from the spread-mean
    df_detrend['Mean_2std_a'] = spread_mean + 2 * df_detrend['Spread'].std()
    # Create a line that is 1 standard deviation below from the spread-mean
    df_detrend['Mean_1std_b'] = spread_mean - df_detrend['Spread'].std()
    # Create a line that is 2 standard deviation below from the spread-mean
    df_detrend['Mean_2std_b'] = spread_mean - 2 * df_detrend['Spread'].std()

    return ppos, df_closes, df_detrend, spread_mean, beta 


In [45]:
import matplotlib.pyplot as plt

def show_annotation(sel):
    x, y = sel.target
    ind = sel.index
    sel.annotation.set_text(f'{x:.0f}, {y:.0f}: {labels[ind]}')
    
def plot_spread(ppos: tuple, beta: Decimal = None, 
                prev_days: int = None,
                title: str = None,   
                spread_name: str = 'Spread',
                spread_color: str = 'black',
                start_period: int = None, end_period: int = None,
                start_date: str = None, end_date: str = None):
    
    ppos, df_closes, df_detrend, spread_mean, beta = get_trading_pair_spread(ppos, beta, 
                                                                             prev_days, 
                                                                             start_period, end_period,
                                                                             start_date, end_date)    
    # Save the plot data to the PPO objects
    pair = (ppos[0].ticker, ppos[1].ticker)
    sp = spread_mean.copy()
    cl = df_closes.copy(deep=True)
    cl.reset_index(inplace=True)
    cl = cl.to_json()
    dc = df_detrend.copy(deep=True)
    dc.reset_index(inplace=True)
    dc = dc.to_json()
    spread_analysis = {'pair': (ppos[0].ticker, ppos[1].ticker),
                       'spread_mean': sp, 
                       'beta': beta,
                       'closes': cl,
                       'detrended_closes': dc
                       }
    ppos[0].spread_analysis[pair] = spread_analysis
    ppos[1].spread_analysis[pair] = spread_analysis
    
    # Plot the spread with mean line
    plt.plot(df_detrend['Spread'], marker='o', label=spread_name, color=spread_color)
    plt.plot(df_detrend['Mean_2std_a'], label='2std_a', color='green')
    plt.plot(df_detrend['Mean_1std_a'], label='1std_a', color='blue')
    plt.plot(df_detrend['Mean_1std_b'], label='1std_b', color='blue')
    plt.plot(df_detrend['Mean_2std_b'], label='2std_b', color='green')
    plt.axhline(spread_mean, color='red', linestyle='--', label='Mean Spread')
    plt.legend()
    if title is None:
        title = 'Spread Between Stock A and Stock B'
    plt.title(title)
    plt.xlabel('Time')
    plt.ylabel(spread_name)
    # Enable x, y grid lines
    plt.grid(True)
    plt.show()

    return plt, beta


In [51]:
%matplotlib notebook

# Import Libraries
import os.path
import numpy as np
import pandas as pd
import logging
import sys
import json
import dill
import pandas as pd
import matplotlib.pyplot as plt
import copy
from pricepredict import PricePredict
from datetime import datetime, timedelta

# Use an Object Cache to reduce the prep time for creating and loading the PricePredict objects.
if 'ObjCache' not in globals():
    global ObjCache
    ObjCache = {}

DirPPO = '../ppo/'
def get_ppo(symbol: str, period: str):
    file_name_starts_with = symbol + '_' + period
    # Find all PPO files for the symbol in the PPO directory
    ppo_files = [f for f in os.listdir(DirPPO) if f.startswith(file_name_starts_with)]
    # Sort the files by date
    ppo_files.sort()
    # Get the latest PPO file
    ppo_file = ppo_files[-1]
    # Unpickle the PPO file using dill
    with open(DirPPO + ppo_file, 'rb') as f:
        ppo = dill.load(f)
    return ppo_file, ppo

def get_tradingpair_ppos(trading_pair: tuple):
    tp1_weekly_ppo_file, tp1_weekly_ppo = get_ppo(trading_pair[0], PricePredict.PeriodWeekly)
    tp1_daily_ppo_file, tp1_daily_ppo = get_ppo(trading_pair[0], PricePredict.PeriodDaily)
    tp2_weekly_ppo_file, tp2_weekly_ppo = get_ppo(trading_pair[1], PricePredict.PeriodWeekly)
    tp2_daily_ppo_file, tp2_daily_ppo = get_ppo(trading_pair[1], PricePredict.PeriodDaily)
    print(f'{trading_pair[0]} Weekly PPO: {tp1_weekly_ppo_file} {tp1_weekly_ppo}:[{round(getsize(tp1_weekly_ppo)/1024/1024, 2)}]M')
    print(f'{trading_pair[0]} Daily PPO: {tp1_daily_ppo_file} {tp1_daily_ppo}:[{round(getsize(tp1_daily_ppo)/1024/1024, 2)}]M')
    print(f'{trading_pair[1]} Weekly PPO: {tp2_weekly_ppo_file} {tp2_weekly_ppo}:[{round(getsize(tp2_weekly_ppo)/1024/1024, 2)}]M')
    print(f'{trading_pair[1]} Daily PPO: {tp2_daily_ppo_file} {tp2_daily_ppo}:[{round(getsize(tp2_daily_ppo)/1024/1024, 2)}]M')
    return tp1_weekly_ppo, tp1_daily_ppo, tp2_weekly_ppo, tp2_daily_ppo    

def get_prop_ppos(trading_pair: tuple):
    global ObjCache
    
    model_dir = '../models/'
    chart_dir = '../charts/'
    preds_dir = '../predictions/'

    tp1_weekly_ppo = PricePredict(ticker=trading_pair[0], period=PricePredict.PeriodWeekly,
                                  model_dir=model_dir, chart_dir=chart_dir, preds_dir=preds_dir)
    tp1_daily_ppo = PricePredict(ticker=trading_pair[0], period=PricePredict.PeriodDaily,
                                 model_dir=model_dir, chart_dir=chart_dir, preds_dir=preds_dir)
    tp2_weekly_ppo = PricePredict(ticker=trading_pair[1], period=PricePredict.PeriodWeekly,
                                  model_dir=model_dir, chart_dir=chart_dir, preds_dir=preds_dir)
    tp2_daily_ppo = PricePredict(ticker=trading_pair[1], period=PricePredict.PeriodDaily,
                                 model_dir=model_dir, chart_dir=chart_dir, preds_dir=preds_dir)
        
    # Train the models on 5 yeas of data...
    end_dt = datetime.now()
    start_dt = end_dt - timedelta(days=5*400)
    end_date = end_dt.strftime('%Y-%m-%d')
    start_date = start_dt.strftime('%Y-%m-%d')
    
    print(f"ObjCache: {ObjCache.keys()}")
    
    # Load 2 years of data for the trading pair
    ppo_name = trading_pair[0] + '_weekly_ppo'
    if ppo_name not in ObjCache.keys():
        tp1_weekly_ppo.fetch_train_and_predict(tp1_weekly_ppo.ticker, 
                                               start_date, end_date, 
                                               start_date, end_date,
                                               period=PricePredict.PeriodWeekly,
                                               force_training=False,
                                               use_curr_model=True,
                                               save_model=False)
        ObjCache[ppo_name] = tp1_weekly_ppo.serialize_me()
    else:
        tp1_weekly_ppo = PricePredict.unserialize(ObjCache[ppo_name])
    if ppo_name not in ObjCache.keys():
        tp1_daily_ppo.fetch_train_and_predict(tp1_daily_ppo.ticker, 
                                               start_date, end_date, 
                                               start_date, end_date,
                                               period=PricePredict.PeriodDaily,
                                               force_training=False,
                                               use_curr_model=True,
                                               save_model=False)
        ObjCache[ppo_name] = tp1_daily_ppo.serialize_me()
    else:
        tp1_daily_ppo = PricePredict.unserialize(ObjCache[ppo_name])   
    ppo_name = trading_pair[1] + '_weekly_ppo'
    if ppo_name not in ObjCache.keys():
        tp2_weekly_ppo.fetch_train_and_predict(tp2_weekly_ppo.ticker,
                                               start_date, end_date, 
                                               start_date, end_date,
                                               period=PricePredict.PeriodWeekly,
                                               force_training=False,
                                               use_curr_model=True,
                                               save_model=False)
        ObjCache[ppo_name] = tp2_weekly_ppo.serialize_me()
    else:
        tp2_weekly_ppo = PricePredict.unserialize(ObjCache[ppo_name])
    ppo_name = trading_pair[1] + '_daily_ppo'
    if ppo_name not in ObjCache.keys():
        tp2_daily_ppo.fetch_train_and_predict(tp2_daily_ppo.ticker,
                                               start_date, end_date, 
                                               start_date, end_date,
                                               force_training=False,
                                               use_curr_model=True,
                                               save_model=False)
        ObjCache[ppo_name] = tp2_daily_ppo.serialize_me()
    else:
        tp2_daily_ppo = PricePredict.unserialize(ObjCache[ppo_name])

    return tp1_weekly_ppo, tp1_daily_ppo, tp2_weekly_ppo, tp2_daily_ppo
    
def analyze_trading_pair(trading_pair: tuple):    
    # Gather the Weekly and Daily PPOs for the trading pair.
    # tp1_weekly_ppo, tp1_daily_ppo, tp2_weekly_ppo, tp2_daily_ppo = get_tradingpair_ppos(trading_pair)
    
    tp1_weekly_ppo, tp1_daily_ppo, tp2_weekly_ppo, tp2_daily_ppo = get_prop_ppos(trading_pair)
        
    # Plot the median & spread of the trading pair given the daily PPOs)
    # Plot the Weekly Spread using the Weekly calculated Beta
    plt, beta = plot_spread((tp1_weekly_ppo, tp2_weekly_ppo), 
                            title=f"Weekly Spread [{trading_pair[0]} vs {trading_pair[1]}]",
                            spread_name='Weekly')
    print(f"Weekly Hedge Ratio: {beta}")
    # Plot the Daily Spread, Using the Weekly Beta
    plt, beta = plot_spread((tp1_daily_ppo, tp2_daily_ppo), beta, 60, 
                title=f"Daily Spread [{trading_pair[0]} vs {trading_pair[1]}]", 
                spread_name='Daily (Wkly Beta)', spread_color='grey')
    print(f"Daily using Weekly Hedge Ratio: {beta}")
    # Plot the Daily Spread, Using the Daily calculated Beta
    plt, beta = plot_spread((tp1_daily_ppo, tp2_daily_ppo), None, 60,
                            title=f"Daily Spread [{trading_pair[0]} vs {trading_pair[1]}]", 
                            spread_name='Daily', spread_color='orange')
    print(f"Daily Hedge Ratio: {beta}")
    plt, beta = plot_spread((tp1_daily_ppo, tp2_daily_ppo),
                            title=f"Daily[1:37] Spread [{trading_pair[0]} vs {trading_pair[1]}]", 
                            spread_name='Daily [1:37]', spread_color='orange',
                            start_period=1, end_period=37)
    print(f"Daily[1:37] Hedge Ratio {beta}")
    plt, beta = plot_spread((tp1_daily_ppo, tp2_daily_ppo),
                            title=f"Daily[4/1/21 to 8/1/21] Spread [{trading_pair[0]} vs {trading_pair[1]}]", 
                            spread_name='Daily [4/1/21 to 8/1/21]', spread_color='orange',
                            start_date='4/1/2021', end_date='7/30/2021')
    print(f"Daily[4/1/21 to 8/1/21] Hedge Ratio {beta}")
    
    return plt

if 'plt' in locals():
    plt.close()

plt = analyze_trading_pair(('AAPL', 'AMX'))


ObjCache: dict_keys(['AAPL_weekly_ppo', 'AMX_weekly_ppo', 'AMX_daily_ppo'])


<IPython.core.display.Javascript object>

Weekly Hedge Ratio: 8.221527699723318
Daily using Weekly Hedge Ratio: 8.221527699723318
Daily Hedge Ratio: -18.12334080055895
Daily[1:37] Hedge Ratio 9.222221830020068
Daily[4/1/21 to 8/1/21] Hedge Ratio 3.3081532472693125


In [4]:
getsize(ObjCache)

3396087

# Pair Trading Simulation

* Given the current Trading Pair...
    * From the beginning of the data...
        * Perform the Spread Analysis on an 30day window, moving weekly through the data. 
            * When the spread goes above 2 standard deviations, open a pairs trade.
              Be sure not to trade, trades that have already occurred. 
                * Immediatly move forward in time until the spread converges to the mean.
                  Use the beta and append to the dataset (if needed) to calculate the spread 
                  and to keep the mean stable.
                    * Calculate the profit/loss for each period. Are the draw-downs acceptable?
                    * Hold on to the final profit/loss of the trade upon exit.
    * Throw out open trades and calculate the total profit/loss.

In [57]:
#
def simulate_pairs_trading(ppos):
    # Get or create the required Trading Pair PPOs
    tp1_weekly_ppo, tp1_daily_ppo, tp2_weekly_ppo, tp2_daily_ppo = get_prop_ppos(ppos)
    
    # Check the begin and end dates of the data...
    start_date1 = tp1_daily_ppo.orig_data.index[0]
    end_date1 = tp1_daily_ppo.orig_data.index[-1]
    start_date2 = tp2_daily_ppo.orig_data.index[0]
    end_date2 = tp2_daily_ppo.orig_data.index[-1]
    
    # Align the start and end dates
    start_date = min(start_date1, start_date2)
    end_date = max(end_date1, end_date2)
    
    print(f"Start Date: {start_date},  End Date: {end_date}")

    # Create an iterable date range from start to end date
    date_range = pd.date_range(start=start_date, end=end_date, freq='W')
    
    for win_date in date_range:
        print(f"Window Date: {win_date}")
        # Calculate the spread for the 30 days prior to the win_date
        win_date_start = win_date - timedelta(days=30)
        win_date_end = win_date
        ppos, df_closes, df_detrend, spread_mean, beta = get_trading_pair_spread((tp1_daily_ppo, tp2_daily_ppo), None, 30, start_date=win_date_start, end_date=win_date_end)    
        # Return the dates in df_detrend when the spread is above 2 standard deviations
        df_detrend['Spread_Diff'] = df_detrend['Spread'] - spread_mean
        df_detrend['Spread_Diff_2std'] = 2 * df_detrend['Spread'].std()
        df_detrend['Spread_Diff_2std'] = df_detrend['Spread_Diff_2std'].apply(lambda x: abs(x))
        df_detrend['Spread_Diff_2std'] = df_detrend['Spread_Diff_2std'].apply(lambda x: Decimal(x))
        df_detrend['Spread_Diff'] = df_detrend['Spread_Diff'].apply(lambda x: Decimal(x))
        df_detrend['Spread_Diff_2std'] = df_detrend['Spread_Diff_2std'].apply(lambda x: Decimal(x))
        df_detrend['Spread_Diff'] = df_detrend['Spread_Diff'].apply(lambda x: Decimal(x))
        df_detrend['Spread_Diff'] = df_detrend['Spread_Diff'].abs()
        print(df_detrend.loc[df_detrend['Spread_Diff'] > df_detrend['Spread_Diff_2std']])
        
        
        
simulate_pairs_trading(('AAPL', 'AMX'))


ObjCache: dict_keys(['AAPL_weekly_ppo', 'AMX_weekly_ppo', 'AMX_daily_ppo'])
Start Date: 2019-07-19 00:00:00,  End Date: 2025-01-12 00:00:00
Window Date: 2019-07-21 00:00:00
Empty DataFrame
Columns: [Stock_A, Stock_B, Spread, Mean_1std_a, Mean_2std_a, Mean_1std_b, Mean_2std_b, Spread_Diff, Spread_Diff_2std]
Index: []
Window Date: 2019-07-28 00:00:00
Empty DataFrame
Columns: [Stock_A, Stock_B, Spread, Mean_1std_a, Mean_2std_a, Mean_1std_b, Mean_2std_b, Spread_Diff, Spread_Diff_2std]
Index: []
Window Date: 2019-08-04 00:00:00
Empty DataFrame
Columns: [Stock_A, Stock_B, Spread, Mean_1std_a, Mean_2std_a, Mean_1std_b, Mean_2std_b, Spread_Diff, Spread_Diff_2std]
Index: []
Window Date: 2019-08-11 00:00:00
Empty DataFrame
Columns: [Stock_A, Stock_B, Spread, Mean_1std_a, Mean_2std_a, Mean_1std_b, Mean_2std_b, Spread_Diff, Spread_Diff_2std]
Index: []
Window Date: 2019-08-18 00:00:00
Empty DataFrame
Columns: [Stock_A, Stock_B, Spread, Mean_1std_a, Mean_2std_a, Mean_1std_b, Mean_2std_b, Spread_Diff