In [14]:
## anaconda3 (Python 3.9.13) Kernel

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# pair trade packages
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from datetime import datetime

# Load Pairs Data


In [15]:
def custom_date_parser(date_str):
    return datetime.strptime(date_str, '%d/%m/%Y')

# Load the dictionary from the pickle file
with open('pairsOutcome.pkl', 'rb') as file:
    pairsOutcome = pickle.load(file)

print("Dictionary loaded from pairsOutcome.pkl")


# Load stock data and get return 
tpxData = pd.read_csv('TPX_prices.csv', index_col=0, parse_dates=True, date_parser=custom_date_parser)
tpxData = tpxData.dropna(axis='columns')
return_df = (tpxData / tpxData.shift(1)) - 1

Dictionary loaded from pairsOutcome.pkl


In [16]:
len(pairsOutcome)

508

# Get Pair Trade Portfolio
`pairsOutcome` already contains pairs of the most liquid TOPIX stocks, tested for stationarity over a 1-year window.

Choose the top 10 known pair trades by cumulative return over the full dataset.

In [17]:
# Sort the keys by their cumpnl[-2] values in descending order
top_keys = sorted(
    pairsOutcome,
    key=lambda k: pairsOutcome[k].cumpnl.iloc[-2],  # Access cumpnl[-2] safely
    reverse=True
)[:10]  # Get the top 10 keys

# Print the top 10 performing trades
print("Top 10 performing trades:")
for i, key in enumerate(top_keys, 1):
    print(f"{i}. Key: {key}, Value: {pairsOutcome[key].cumpnl.iloc[-2]}")

Top 10 performing trades:
1. Key: 1801 JP Equity 2670 JP Equity, Value: 2.5797887367591246
2. Key: 3778 JP Equity 6701 JP Equity, Value: 2.537242032391529
3. Key: 2760 JP Equity 6254 JP Equity, Value: 2.3688208386917404
4. Key: 5706 JP Equity 6954 JP Equity, Value: 2.2676474298290237
5. Key: 7951 JP Equity 9684 JP Equity, Value: 2.0657325467200596
6. Key: 1808 JP Equity 6481 JP Equity, Value: 1.9929348941248262
7. Key: 3099 JP Equity 5831 JP Equity, Value: 1.939742664925484
8. Key: 1808 JP Equity 6971 JP Equity, Value: 1.9132602773493155
9. Key: 4021 JP Equity 9843 JP Equity, Value: 1.8675031161000868
10. Key: 5929 JP Equity 6504 JP Equity, Value: 1.811533049967201


## Make indicators and spread stationary around 0
Subtract the rolling mean (`mid`) from the spread and from every band so that all series are centered on 0.
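
A minimal sketch of this centering step, assuming a trailing rolling mean plays the role of `mid`. The `spread` values and `window = 3` are made up for illustration; the notebook itself uses `rollingWindow = 262` on the real spreads.

```python
import numpy as np

# Hypothetical spread values, for illustration only.
spread = np.array([10.0, 12.0, 11.0, 15.0, 9.0])
window = 3  # illustrative; the notebook uses rollingWindow = 262

# Trailing rolling mean (NaN until a full window of observations exists).
mid = np.full_like(spread, np.nan)
for i in range(window - 1, len(spread)):
    mid[i] = spread[i - window + 1 : i + 1].mean()

# Subtracting the rolling mean re-expresses the spread as a distance from 0.
centered = spread - mid
print(centered)
```

The first `window - 1` entries stay NaN, which is why the centered tables later in the notebook start with empty rows.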

# Machine Learning Challenge

## Background
Initial evaluation of the baseline portfolio shows that drawdowns are small. The team originally planned to use Machine Learning to optimise the sizing of these pair trades. However, because there were no significant drawdowns, returns increase linearly with position size: a larger nominal investment in a pair trade gives a proportionate increase in returns without taking on significant drawdown risk.

Instead of optimising for sizing, we can explore Machine Learning for the strategy itself on this stationary dataset. Our prescribed strategy is to enter at +/- 1 std dev and exit at 0, with a stop loss at +/- 2 std dev. These levels are arbitrary suggestions.
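
A minimal sketch of one possible reading of that rule as a path-dependent position function. The function `next_position`, the z-score input and the sample path are hypothetical; the baseline code in this notebook instead sets the position from the current band only, without remembering the previous position.

```python
def next_position(prev_position, z):
    """Return the new position given the previous one and the spread's z-score.

    z is the spread's distance from its rolling mean in standard deviations.
    -1 = short the spread, 0 = uninvested, 1 = long the spread.
    """
    if abs(z) >= 2:                      # stop loss at or beyond +/- 2 sd
        return 0
    if prev_position == 0:               # uninvested: look for an entry
        if z >= 1:
            return -1                    # spread is high: short it
        if z <= -1:
            return 1                     # spread is low: go long
        return 0
    # In a trade: exit once the spread has reverted to the mean.
    if prev_position == -1 and z <= 0:
        return 0
    if prev_position == 1 and z >= 0:
        return 0
    return prev_position


# Hypothetical z-score path.
z_path = [0.5, 1.2, 0.8, 0.3, -0.1, -1.5, -2.3]
position = 0
positions = []
for z in z_path:
    position = next_position(position, z)
    positions.append(position)
print(positions)
```

In this sketch the short opened at z = 1.2 is held while the spread drifts back and is closed only when z crosses 0, which is the behaviour a rule based on the current band alone cannot express.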

With Machine Learning, we can test whether the agent uncovers the mean-reverting nature of the spread and recommends different thresholds. We use a Q-learner to explore the state space, with the same spread, mid and std dev parameters as the baseline.

### Steps
#### Environment:
- State Space: A set of all possible states the agent can be in.  
  - [spread, mid, 2 sd low, 1 sd low, 1 sd high, 2 sd high]
- Action Space: A set of all possible actions the agent can take in each state.
  - [-1, 0, 1], where -1 = short, 0 = uninvested, 1 = long
- Reward Function: A function that assigns a numerical reward to each state-action pair, indicating the immediate consequence of taking a particular action in a specific state.
  - dailypnl
- Transition Function: A function that determines the probability of transitioning from one state to another when a particular action is taken.
  - deterministic based on historical performance
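
A minimal sketch of the reward for one step, assuming the action is the position in the first stock and the second stock takes the opposite position, as in `calculate_reward` further down. The function name `daily_pnl` and the return values are made up for illustration.

```python
ACTIONS = (-1, 0, 1)  # short, uninvested, long


def daily_pnl(action, return_a, return_b):
    """Reward for holding `action` in stock A and the opposite position in stock B."""
    return action * return_a - action * return_b


# Hypothetical one-day returns: A gains 1%, B loses 0.5%.
rewards = {a: daily_pnl(a, 0.01, -0.005) for a in ACTIONS}
print(rewards)
```

Because the transition is deterministic (the next row of historical data), the action affects only the reward and the last-position entry of the next state, not the prices the agent observes.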
#### Agent:

- Q-Table: A matrix that stores the estimated Q-values for each state-action pair. Q-values represent the expected future reward for taking a specific action in a given state.   
  - continuous Q table?
- Learning Rate (α): A parameter that controls how much the Q-values are updated with each new experience.   
- Discount Factor (γ): A parameter that determines the importance of future rewards. A higher discount factor gives more weight to future rewards.   
- Exploration Rate (ε): A parameter that controls the balance between exploration (trying new actions) and exploitation (choosing the action with the highest Q-value).   
- Q-Learning Algorithm:

  - Initialization: Initialize the Q-table with random values or zeros.   
  - Exploration and Exploitation: Use an exploration strategy (e.g., ε-greedy) to choose an action:
    - With probability ε, choose a random action.   
    - With probability 1-ε, choose the action with the highest Q-value for the current state.   
  
  - Take Action: Execute the chosen action in the environment.   
  - Observe Reward and Next State: Observe the immediate reward and the next state resulting from the action.
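  - Update Q-Value: Update the Q-value of the current state-action pair using the standard Q-learning update:

    $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$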
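
The steps above can be sketched in tabular form. This is a generic illustration of Q-learning with an ε-greedy policy, not the network-based agent used later in this notebook; the state labels `"rich"` and `"fair"` and all numbers are hypothetical.

```python
import random
from collections import defaultdict

ACTIONS = (-1, 0, 1)  # short, uninvested, long


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])


q_table = defaultdict(float)  # all Q-values start at zero
q_update(q_table, "rich", -1, 0.02, "fair", alpha=0.5, gamma=0.9)
updated_value = q_table[("rich", -1)]

rng = random.Random(0)
greedy_action = choose_action(q_table, "rich", epsilon=0.0, rng=rng)
print(updated_value, greedy_action)
```

A table only works for discrete states; with a continuous spread, the table is replaced by a function approximator, which is what the `QNetwork` below does.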

#### Training and Test set

2013 is used for the warm start (it fills the rolling window)\
2014 - 2023 is the training data, since a NN needs a lot of training data {end 2023 idx == 2868}\
2024 onwards (5 months) is the test data
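
A minimal sketch of this split, assuming the data is indexed by every weekday from 2013-01-01 to 2024-05-30. The frame below is a hypothetical stand-in filled with zeros, not the notebook's price data.

```python
import numpy as np
import pandas as pd

# Hypothetical weekday index covering the notebook's date range.
dates = pd.bdate_range("2013-01-01", "2024-05-30")
frame = pd.DataFrame({"spread": np.zeros(len(dates))}, index=dates)

warm_up = frame.loc[:"2013-12-31"]             # only fills the rolling window
train = frame.loc["2014-01-01":"2023-12-31"]   # used to fit the agent
test = frame.loc["2024-01-01":]                # held out for evaluation

print(len(warm_up), len(train), len(test))
```

On a plain weekday calendar the last training row sits at position `len(warm_up) + len(train) - 1`, which works out to 2868 and agrees with the index quoted above.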


In [19]:
## Get pair stock data
def custom_date_parser(date_str):
    return datetime.strptime(date_str, '%d/%m/%Y')
valid = pd.read_csv('validPairs4.csv', 
                    index_col=0, 
                    parse_dates=True, 
                    date_parser=custom_date_parser)
## get list of pair stocks
validPairsList = [
    [item.strip() + ' Equity' for item in pair.split('Equity') if item.strip()]
    for pair in top_keys
]

In [20]:
rollingWindow = 262
cutLossSd = 2

In [21]:
for pair in validPairsList:
    df = pd.DataFrame()

    # Calculate the rolling mean and the standard deviation bands
    df['spread'] = valid[f'spread_{pair[0]}_{pair[1]}']
    rolling = df['spread'].rolling(rollingWindow)
    mid = rolling.mean()
    sd = rolling.std()
    df['mid'] = mid
    df['1sd high'] = mid + sd
    df['1sd low'] = mid - sd
    df['2sd high'] = mid + sd * cutLossSd
    df['2sd low'] = mid - sd * cutLossSd
    df['position'] = 0

    # Short the spread between +1 sd and +2 sd; go long between -1 sd and -2 sd
    df.loc[(df['spread'] > df['1sd high']) & (df['spread'] < df['2sd high']), 'position'] = -1
    df.loc[(df['spread'] < df['1sd low']) & (df['spread'] > df['2sd low']), 'position'] = 1

    # Calculate PnL: today's position earns the next day's return
    df[f'{pair[0]} position'] = df['position']
    df[f'{pair[1]} position'] = df['position'] * -1
    df['dailypnl'] = (df[f'{pair[1]} position'] * return_df[pair[1]].shift(-1)
                      + df[f'{pair[0]} position'] * return_df[pair[0]].shift(-1))
    df['cumpnl'] = df['dailypnl'].cumsum()

    pairsOutcome[f'{pair[0]} {pair[1]}'] = df

In [52]:
workingPairOutcome = {}

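for pair in top_keys:
    # First six columns: spread, mid, 1sd high, 1sd low, 2sd high, 2sd low
    dummy_df = pairsOutcome[pair].iloc[:, :6]
    # Center every column on the rolling mean, then drop the now-zero 'mid' column
    dummy_df = dummy_df.subtract(dummy_df['mid'], axis=0).drop(columns=['mid'])
    workingPairOutcome[pair] = dummy_df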

workingPairOutcome[pair]

Date,spread,1sd high,1sd low,2sd high,2sd low
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,,,,,
2013-01-04,,,,,
2013-01-07,,,,,
...,...,...,...,...,...
2024-05-27,335.484182,159.593167,-159.593167,319.186334,-319.186334
2024-05-28,229.637004,158.974787,-158.974787,317.949574,-317.949574
2024-05-29,169.122976,158.188834,-158.188834,316.377667,-316.377667
2024-05-30,187.103754,157.699351,-157.699351,315.398701,-315.398701


In [22]:
# validPairsList, top_keys

- Test one timestep at a time (even though all timesteps could be evaluated at once)
- Give the agent the current state at each step
- Trading should be path dependent because of the stop loss. Here the only path information I can pass in is the last position, as one of the state parameters

In [None]:
import gym
import random

class PairTradeEnv(gym.Env):
    # ... (define your environment's state space, action space, etc.)
    def __init__(self, pairsOutcome, top_keys, validPairsList, return_df):
        super().__init__()
        # Keep references to the data instead of relying on notebook globals
        self.pairsOutcome = pairsOutcome
        self.top_keys = top_keys
        self.validPairsList = validPairsList
        self.return_df = return_df
        # ... (initialize other parameters)
        self.earliest_step = 261  # warm start: first row with a full rolling window
        self.last_step = 2868  # last row of 2023
        # self.current_step = random.randint(self.earliest_step, self.last_step - 1)
        self.current_step = self.earliest_step

    def _get_state(self, last_action):
        # Six indicator columns at the current step, plus the last position
        row = self.pairsOutcome[self.top_keys[0]].iloc[self.current_step, :6]
        return np.append(row.to_numpy(dtype=float), last_action)

    def step(self, action):
        """
        Input
            action: single value e.g. -1 (short)
        Output:
            next_state: next state
            reward: reward for last timestep
            done: boolean for if end of dataset
            info: optional
        """
        # Advance the time step
        self.current_step += 1
        # Get the next state
        next_state = self._get_state(action)
        # Calculate reward (implement your reward function here)
        reward = self.calculate_reward(action, self.current_step, self.validPairsList[0]) # TODO change pair selected
        # Check for termination (implement your termination condition here)
        done = self.current_step >= self.last_step

        # Provide additional information (optional)
        info = {}

        return next_state, reward, done, info

    def reset(self):
        # reset to start of 2014 every time
        # self.current_step = random.randint(self.earliest_step, self.last_step - 1)
        self.current_step = self.earliest_step
        # The agent starts uninvested, so the last position is 0
        return self._get_state(0)

    def calculate_reward(self, position, idx, pair):
        """
        Give the return earned over the day ending at idx
        Input:
            position: position held in the first stock of the pair
            idx: usually the current timestep
            pair: the two TOPIX tickers of the pair
        Output:
            dailypnl
        """
        # position = position_vector @ np.array([-1,0,1])
        position_0 = position
        position_1 = position * -1
        ## return_df holds, at idx, the return from the previous day to idx
        dailypnl = position_0*self.return_df[pair[0]].iloc[idx] + position_1*self.return_df[pair[1]].iloc[idx]

        return dailypnl

# Instantiate the custom environment
env = PairTradeEnv(pairsOutcome, top_keys, validPairsList, return_df)

In [24]:

class QNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, output_size)

    def forward(self, x):
        # x is a tensor of shape (7,) for a single state, or (batch, 7) for a batch
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class QLearningAgent:
    def __init__(self, input_size, output_size, learning_rate, discount_factor, epsilon, epsilon_decay):
        self.q_network = QNetwork(input_size, output_size)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.loss_fn = nn.MSELoss()
        self.discount_factor = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

    def choose_action(self, state):
        if np.random.rand() < self.epsilon:
            # Explore: Choose a random action
            return np.random.choice([-1, 0, 1])
        # Exploit: Choose the action with the highest Q-value
        with torch.no_grad():
            q_values = self.q_network(torch.tensor(state, dtype=torch.float32))
            action_index = torch.argmax(q_values).item()
        # Network outputs are ordered [short, uninvested, long] -> [-1, 0, 1]
        return np.array([-1, 0, 1])[action_index]

    def learn(self, state, action, reward, total_reward, next_state, done):
        state = torch.tensor(state, dtype=torch.float32)
        # Map the action value (-1, 0, 1) to its output index (0, 1, 2);
        # indexing with the raw action would update the wrong Q-value
        action_index = torch.tensor([int(action) + 1], dtype=torch.long)
        reward = torch.tensor([reward], dtype=torch.float32)
        next_state = torch.tensor(next_state, dtype=torch.float32)
        done = torch.tensor([done], dtype=torch.float32)

        q_value = self.q_network(state)[action_index]
        next_q_value = torch.max(self.q_network(next_state)).detach()
        # Same target for winning and losing episodes: the one-step target
        # plus 10% of the episode's total reward to favour longer-term gains
        target_q_value = reward + self.discount_factor * next_q_value * (1 - done) + float(total_reward) * .1

        loss = self.loss_fn(q_value, target_q_value)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # if self.epsilon > 0.3:
        #     self.epsilon *= self.epsilon_decay

# Example usage:
input_size = 7  # Adjust to your specific input size
output_size = 3  # Adjust to your desired number of discrete actions
learning_rate = 0.1
discount_factor = 0.9
epsilon = 1
epsilon_decay = 0.99

agent = QLearningAgent(input_size, output_size, learning_rate, discount_factor, epsilon, epsilon_decay)

# Training loop:
# for episode in range(num_episodes):
    # ... (your training loop logic, including getting state, taking action, 
    #      getting reward and next state, and updating the agent)

In [25]:
def train(agent, env, num_episodes, max_steps_per_episode):
    for episode in range(num_episodes):
        state = env.reset()
        total_reward = 0
        episode_memory = []  # Store experiences during the episode

        for step in range(max_steps_per_episode):
            action = agent.choose_action(state)
            next_state, reward, done, _ = env.step(action)
            episode_memory.append((state, action, reward, next_state, done))
            state = next_state
            total_reward += reward

            if done:
                break

        # Learn from the entire episode
        for state, action, reward, next_state, done in episode_memory:
            agent.learn(state, action, reward, total_reward, next_state, done)

        if episode%1==0:
            agent.epsilon *= agent.epsilon_decay

        print(f"Episode: {episode+1}, Total Reward: {total_reward}, epsilon: {agent.epsilon}")

In [26]:
num_episodes = 300
max_steps_per_episode = 3000

train(agent, env, num_episodes, max_steps_per_episode)

Episode: 1, Total Reward: -0.37685298593508254, epsilon: 0.99
Episode: 2, Total Reward: 0.3030325300598151, epsilon: 0.9801
Episode: 3, Total Reward: 1.009566442467768, epsilon: 0.9702989999999999
Episode: 4, Total Reward: 0.4192967397444707, epsilon: 0.96059601
Episode: 5, Total Reward: -0.49123619684492403, epsilon: 0.9509900498999999


KeyboardInterrupt: 

In [None]:
agent.epsilon


0.04904089407128576
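
This value is consistent with multiplying epsilon by `epsilon_decay = 0.99` once per episode for 300 episodes, starting from 1. A quick check, assuming those settings from the cells above:

```python
epsilon = 1.0
epsilon_decay = 0.99
num_episodes = 300

# Apply the per-episode multiplicative decay used in train().
for _ in range(num_episodes):
    epsilon *= epsilon_decay

# Closed form of the same schedule: epsilon_0 * decay ** n
closed_form = 1.0 * epsilon_decay ** num_episodes

print(epsilon, closed_form)
```

With this schedule the agent still explores about 5% of the time after 300 episodes; a slower decay such as 0.9999 keeps exploration high for far longer.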

In [31]:
dummy_df = pairsOutcome[top_keys[0]].iloc[:, :6]
dummy_df

Date,spread,mid,1sd high,1sd low,2sd high,2sd low
2013-01-01,-829.459706,,,,,
2013-01-02,-829.459706,,,,,
2013-01-03,-829.459706,,,,,
2013-01-04,-788.012196,,,,,
2013-01-07,-751.666698,,,,,
...,...,...,...,...,...,...
2024-05-27,-39.304773,-46.289714,319.903354,-412.482783,686.096423,-778.675852
2024-05-28,204.731719,-44.231112,321.850855,-410.313080,687.932823,-776.395048
2024-05-29,345.042254,-41.541554,324.802844,-407.885953,691.147243,-774.230351
2024-05-30,178.588811,-39.334254,326.585734,-405.254242,692.505722,-771.174230


In [43]:
# Center on the rolling mean and drop the now-zero 'mid' column
dummy_df = dummy_df.subtract(dummy_df['mid'], axis=0).drop(columns=['mid'])

In [None]:
dummy_df

Date,spread,1sd high,1sd low,2sd high,2sd low
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,,,,,
2013-01-04,,,,,
2013-01-07,,,,,
...,...,...,...,...,...
2024-05-27,6.984942,366.193069,-366.193069,732.386137,-732.386137
2024-05-28,248.962831,366.081968,-366.081968,732.163935,-732.163935
2024-05-29,386.583808,366.344399,-366.344399,732.688797,-732.688797
2024-05-30,217.923065,365.919988,-365.919988,731.839976,-731.839976



- In the first few tries, the network was very large
- Added an epsilon search in `choose_action` so that there is some chance to explore
- Changed the reward function to multiply losses and give exponential returns, to incentivise risk taking

### 1 Dec, 21:05:
- Performance keeps oscillating between negative and positive. This might be because the learning rate is too large. Also start every episode from the start of the training period, with max steps set to 3000, so that total results are comparable
    - this helped quite a bit.
`
input_size = 7  # Adjust to your specific input size
output_size = 3  # Adjust to your desired number of discrete actions
learning_rate = 0.1
discount_factor = 0.8
epsilon = 1 # down to .3
epsilon_decay = 0.9999
num_episodes = 500
max_steps_per_episode = 3000
`
- Want to try updating epsilon only after each full episode instead of after each step; it is decaying too quickly
- I want to try changing the reward by having `learn` use total_reward instead of `reward`
- Scale the states. The state is still expressed in absolute differences, and the NN is not able to work out proportions
- Training episodes (epochs) should be shorter, up to 30 days, because the mean-reversion pattern takes 1 to 33 days
    - very bad performance with 40-day epochs
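
One way to act on the scaling note above is to express the spread in standard deviations instead of price units. A minimal sketch, assuming `1sd high` equals `mid` plus one rolling standard deviation as in the band calculation; the function name `scale_state` is hypothetical, and the inputs are taken from the 2024-05-30 row of the first pair's table.

```python
def scale_state(spread, mid, one_sd_high):
    """Express the spread as a z-score: (spread - mid) / one standard deviation."""
    sd = one_sd_high - mid  # '1sd high' is mid + 1 sd
    return (spread - mid) / sd


# spread, mid and 1sd high from the 2024-05-30 row shown earlier.
z = scale_state(178.588811, -39.334254, 326.585734)
print(round(z, 3))
```

After this scaling the bands sit at the constants +/- 1 and +/- 2 for every pair, so the state could plausibly shrink to the z-score plus the last position. That is a suggestion to test, not something this notebook has verified.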

### 1 Dec, 22:17:
- Changed the target Q-value function to remove the exponential reward and the scaled negative reward, so positive and negative rewards are now treated the same. Added a portion of the episode's total reward to incentivise longer-term rewards.
    - both branches of `if reward > 0:` now use `target_q_value = reward + self.discount_factor * next_q_value * (1 - done) + total_reward * .1`
    - epsilon now decays once per episode: `if episode%1==0: agent.epsilon *= agent.epsilon_decay`