In [1]:
## anaconda3 (Python 3.9.13) Kernel

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# pair trade packages
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from datetime import datetime

# Load Pairs Data


In [2]:
def custom_date_parser(date_str):
    return datetime.strptime(date_str, '%d/%m/%Y')

# Load the dictionary from the pickle file
with open('pairsOutcome.pkl', 'rb') as file:
    pairsOutcome = pickle.load(file)

print("Dictionary loaded from pairsOutcome.pkl")


# Load stock data and get return 
tpxData = pd.read_csv('TPX_prices.csv', index_col=0, parse_dates=True, date_parser=custom_date_parser)
tpxData = tpxData.dropna(axis='columns')
return_df = (tpxData / tpxData.shift(1)) - 1

Dictionary loaded from pairsOutcome.pkl


# Get Pair Trade Portfolio
`pairsOutcome` already contains the TOPIX stocks with the highest liquidity, tested for stationarity over a 1-year window.

Choose the top 10 known pair trades by cumulative return over the total dataset.

In [3]:
# Sort the keys by cumulative PnL in descending order.
# iloc[-2] is used instead of iloc[-1], presumably because the last row has no next-day return.
top_keys = sorted(
    pairsOutcome,
    key=lambda k: pairsOutcome[k].cumpnl.iloc[-2],
    reverse=True
)[:10]  # Get the top 10 keys

# Print the top 10 performing trades
print("Top 10 performing trades:")
for i, key in enumerate(top_keys, 1):
    print(f"{i}. Key: {key}, Value: {pairsOutcome[key].cumpnl.iloc[-2]}")

Top 10 performing trades:
1. Key: 1801 JP Equity 2670 JP Equity, Value: 2.5797887367591246
2. Key: 3778 JP Equity 6701 JP Equity, Value: 2.537242032391529
3. Key: 2760 JP Equity 6254 JP Equity, Value: 2.3688208386917404
4. Key: 5706 JP Equity 6954 JP Equity, Value: 2.2676474298290237
5. Key: 7951 JP Equity 9684 JP Equity, Value: 2.0657325467200596
6. Key: 1808 JP Equity 6481 JP Equity, Value: 1.9929348941248262
7. Key: 3099 JP Equity 5831 JP Equity, Value: 1.939742664925484
8. Key: 1808 JP Equity 6971 JP Equity, Value: 1.9132602773493155
9. Key: 4021 JP Equity 9843 JP Equity, Value: 1.8675031161000868
10. Key: 5929 JP Equity 6504 JP Equity, Value: 1.811533049967201


# Machine Learning Challenge

## Background
Initial evaluation of the baseline portfolio shows that drawdowns are small. The team's original idea was to use Machine Learning to optimise the sizing of these pair trades. However, since there were no significant drawdowns, returns increase linearly with investment size: a greater nominal investment in a pair trade gives a proportionate increase in returns without realising significant drawdown risk.

Instead of optimising for sizing, we can explore Machine Learning for the trading strategy itself on this stationary dataset. Our prescribed strategy is to enter at +/- 1 std dev, exit at 0, with a stop loss at +/- 2 std dev, but these levels are arbitrary suggestions.

With Machine Learning, we can see whether the agent uncovers the mean-reverting nature of the spread and recommends a different threshold. We use a Q-learner over a state space built from the same spread, mid and std dev parameters as the baseline.
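
The prescribed rule (enter at +/- 1 std dev, exit at 0, stop out at +/- 2 std dev) is path dependent, because whether to exit depends on the position already held. A minimal sketch of that rule as a small state machine is given here; `baseline_positions` and its z-score input are hypothetical names used only for illustration, and re-entry straight after a stop-out is allowed as a simplifying assumption.

```python
def baseline_positions(z_scores, entry=1.0, stop=2.0):
    """Return one position per day (-1 short, 0 flat, 1 long).

    Hypothetical helper: z_scores is the spread expressed in rolling
    standard deviations away from its rolling mean.
    """
    position = 0
    positions = []
    for z in z_scores:
        if position == 0:
            # Enter only inside the band between the entry and stop levels
            if entry < z < stop:
                position = -1   # spread is rich: short it
            elif -stop < z < -entry:
                position = 1    # spread is cheap: go long
        elif position == -1 and (z <= 0 or z >= stop):
            position = 0        # mean reached, or stop loss hit
        elif position == 1 and (z >= 0 or z <= -stop):
            position = 0        # mean reached, or stop loss hit
        positions.append(position)
    return positions


example = baseline_positions([0.5, 1.2, 0.6, -0.1, -1.5, -2.3, -1.4])
print(example)  # [0, -1, -1, 0, 1, 0, 1]
```

On day 3 the short stays open at 0.6 std dev because the mean has not been reached yet, which a stateless band test would not capture.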

### Steps
#### Environment:
- State Space: A set of all possible states the agent can be in.  
  - [spread, mid, 2 sd low, 1 sd low, 1 sd high, 2 sd high]
- Action Space: A set of all possible actions the agent can take in each state.
  - [-1 (short), 0 (uninvested), 1 (long)]
- Reward Function: A function that assigns a numerical reward to each state-action pair, indicating the immediate consequence of taking a particular action in a specific state.
  - dailypnl
- Transition Function: A function that determines the probability of transitioning from one state to another when a particular action is taken.
  - deterministic based on historical performance
#### Agent:

- Q-Table: A matrix that stores the estimated Q-values for each state-action pair. Q-values represent the expected future reward for taking a specific action in a given state.   
  - continuous Q table?
- Learning Rate (α): A parameter that controls how much the Q-values are updated with each new experience.   
- Discount Factor (γ): A parameter that determines the importance of future rewards. A higher discount factor gives more weight to future rewards.   
- Exploration Rate (ε): A parameter that controls the balance between exploration (trying new actions) and exploitation (choosing the action with the highest Q-value).   
- Q-Learning Algorithm:

  - Initialization: Initialize the Q-table with random values or zeros.   
  - Exploration and Exploitation: Use an exploration strategy (e.g., ε-greedy) to choose an action:
    - With probability ε, choose a random action.   
    - With probability 1-ε, choose the action with the highest Q-value for the current state.   
  
  - Take Action: Execute the chosen action in the environment.   
  - Observe Reward and Next State: Observe the immediate reward and the next state resulting from the action.
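  - Update Q-Value: Update the Q-value of the current state-action pair using the standard Q-learning update:

    $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$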
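
The ε-greedy choice and the Q-value update can be sketched in tabular form as follows. This is only an illustration: the dictionary-based Q-table and the discrete state labels (`"below_1sd"`, `"near_mid"`) are assumptions made for the example, whereas the notebook later uses a neural network over continuous states.

```python
import random

ACTIONS = [-1, 0, 1]  # short, uninvested, long


def choose_action(q_table, state, epsilon):
    """Epsilon-greedy choice over a dict Q-table keyed by (state, action)."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)  # explore
    # Exploit: unseen pairs default to a Q-value of 0.0
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]


q = {}
new_q = q_update(q, state="below_1sd", action=1, reward=0.02,
                 next_state="near_mid", alpha=0.5, gamma=0.9)
greedy = choose_action(q, "below_1sd", epsilon=0.0)
print(new_q, greedy)
```

With an empty table, the first update moves the Q-value halfway (α = 0.5) towards the observed reward, and the greedy choice then prefers the long action in that state.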

#### Training and Test set

2013 is used for warm start\
2014 - 2023 is training data, since a NN needs a lot of training data {end of 2023 idx == 2868}\
2024 onwards (5 months) is test data
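
A minimal sketch of this split by positional row index is shown here. The value 2868 comes from the note above; the warm-start boundary of 261 is taken from the environment's starting step used later in the notebook, and the total row count of 2970 is a made-up number for illustration only.

```python
import numpy as np

WARM_START_END = 261  # first row used for training (assumed warm-start boundary)
TRAIN_END = 2868      # positional index of the last 2023 row, as noted above


def split_rows(n_rows):
    """Return positional indices for the warm-start, train and test periods."""
    idx = np.arange(n_rows)
    warm = idx[:WARM_START_END]
    train = idx[WARM_START_END:TRAIN_END + 1]
    test = idx[TRAIN_END + 1:]
    return warm, train, test


warm, train, test = split_rows(2970)  # 2970 is a hypothetical row count
print(len(warm), len(train), len(test))
```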


In [4]:
## Get pair stock data
def custom_date_parser(date_str):
    return datetime.strptime(date_str, '%d/%m/%Y')
valid = pd.read_csv('validPairs4.csv', 
                    index_col=0, 
                    parse_dates=True, 
                    date_parser=custom_date_parser)
## get list of pair stocks
validPairsList = [
    [item.strip() + ' Equity' for item in pair.split('Equity') if item.strip()]
    for pair in top_keys
]

In [5]:
rollingWindow = 262
cutLossSd = 2

In [6]:
for pair in validPairsList:
    df = pd.DataFrame()

    # Calculate rolling mean and standard deviation bands
    df['spread'] = valid[f'spread_{pair[0]}_{pair[1]}']
    rollingSd = df['spread'].rolling(rollingWindow).std()
    df['mid'] = df['spread'].rolling(rollingWindow).mean()
    df['1sd high'] = df['mid'] + rollingSd
    df['1sd low'] = df['mid'] - rollingSd
    df['2sd high'] = df['mid'] + rollingSd * cutLossSd
    df['2sd low'] = df['mid'] - rollingSd * cutLossSd
    df['position'] = 0

    # Short between +1 and +2 sd, long between -1 and -2 sd, otherwise flat
    df.loc[(df['spread'] > df['1sd high']) & (df['spread'] < df['2sd high']), 'position'] = -1
    df.loc[(df['spread'] < df['1sd low']) & (df['spread'] > df['2sd low']), 'position'] = 1

    # Calculate PnL: today's position earns the next day's return, hence shift(-1)
    df[f'{pair[0]} position'] = df['position']
    df[f'{pair[1]} position'] = df['position'] * -1
    df['dailypnl'] = df[f'{pair[1]} position']*return_df[f'{pair[1]}'].shift(-1) + df[f'{pair[0]} position']*return_df[f'{pair[0]}'].shift(-1)
    df['cumpnl'] = df['dailypnl'].cumsum()

    pairsOutcome[f'{pair[0]} {pair[1]}'] = df

## Make indicators and spread stationary around 0
Subtract the rolling mean (`mid`) from all values to centre them on 0, then divide by one standard deviation so the bands sit at +/- 1 and +/- 2.

In [7]:
workingPairOutcome = {}

for pair in top_keys:
    dummy_df = pairsOutcome[pair].iloc[:, :6]
    dummy_df = dummy_df.subtract(dummy_df['mid'], axis=0).drop(columns=['mid']) # centre spread and SD bands on 0
    dummy_df = dummy_df.div(dummy_df['2sd high']-dummy_df['1sd high'],axis=0)   # divide by 1 SD: bands become integers, spread becomes a proportion of SD
    workingPairOutcome[pair] = dummy_df
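
As a worked check of the centring and scaling step, the sketch below applies the same two operations to a single day. The spread, mean and standard deviation values are hypothetical.

```python
import numpy as np

spread, mid, sd = 104.0, 100.0, 5.0  # hypothetical values for one day

# Same column order as the working frame: spread, 1sd high, 1sd low, 2sd high, 2sd low
levels = np.array([spread, mid + sd, mid - sd, mid + 2 * sd, mid - 2 * sd])

centred = levels - mid                        # subtract the rolling mean
scaled = centred / (centred[3] - centred[1])  # (2sd high - 1sd high) equals one sd

print(scaled)
```

The bands collapse to 1, -1, 2 and -2, and the spread becomes 0.8, that is 0.8 standard deviations above the mean.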

In [8]:
workingPairOutcome[top_keys[5]].tail()      # spread is now a proportion of SD, with the sign giving direction

Date,spread,1sd high,1sd low,2sd high,2sd low
2024-05-27,-0.871229,1.0,-1.0,2.0,-2.0
2024-05-28,-0.866973,1.0,-1.0,2.0,-2.0
2024-05-29,-0.803063,1.0,-1.0,2.0,-2.0
2024-05-30,-0.726676,1.0,-1.0,2.0,-2.0
2024-05-31,-0.77645,1.0,-1.0,2.0,-2.0


In [9]:
# validPairsList, top_keys

- Test one timestep at a time (even though all timesteps could be evaluated at once)
- Give the agent the current state at each step
- Trading should be path dependent because of the stop loss. In this case the only path information I can give is the last position, passed as one of the state parameters
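
A minimal sketch of passing the last position as part of the state is shown below. `build_state` is a hypothetical helper; the feature values are copied from the scaled table above for illustration.

```python
import numpy as np


def build_state(features, last_position):
    """Append the previous action so the agent can see whether it holds a trade."""
    features = np.asarray(features, dtype=np.float32)
    return np.append(features, np.float32(last_position))


# Five scaled features (spread, 1sd high, 1sd low, 2sd high, 2sd low) plus a long position
state = build_state([-0.871229, 1.0, -1.0, 2.0, -2.0], last_position=1)
print(state.shape)  # (6,)
```

This gives a six-element vector per step: five scaled features and the previous action.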

In [10]:
import gym
import random

class PairTradeEnv(gym.Env):
    """Replays one pair's scaled spread history, one trading day per step."""

    def __init__(self, workingPairOutcome, top_keys, validPairsList, return_df):
        super().__init__()
        # Keep the data on the instance instead of relying on notebook globals
        self.workingPairOutcome = workingPairOutcome
        self.top_keys = top_keys
        self.validPairsList = validPairsList
        self.return_df = return_df
        self.earliest_step = 261  # warm start
        self.last_step = 2868     # end of 2023
        # self.current_step = random.randint(self.earliest_step, self.last_step - 1)
        self.current_step = self.earliest_step


    def step(self, action):
        """
        Input
            action: single value e.g. -1 (short)
        Output:
            next_state: next state 
            reward: reward for the step just taken
            done: boolean, True at the end of the dataset
            info: optional
        """
        # Advance the time step
        self.current_step += 1
        # Next state: scaled features plus the action just taken (the last position)
        next_state = np.append(
                            np.array(self.workingPairOutcome[self.top_keys[0]].iloc[self.current_step][:6]),
                            action)
        # Reward is the PnL from holding `action` into the new time step
        reward = self.calculate_reward(action, self.current_step, self.validPairsList[0]) # TODO change pair selected
        # Terminate at the end of the training range
        done = self.current_step >= self.last_step

        # Provide additional information (optional)
        info = {}

        return next_state, reward, done, info

    def reset(self):
        # reset to start of 2014 every time
        # self.current_step = random.randint(self.earliest_step, self.last_step - 1)
        self.current_step = self.earliest_step
        initial_state = np.append(
                            np.array(self.workingPairOutcome[self.top_keys[0]].iloc[self.current_step][:6]),
                            0)
        return initial_state
    
    def calculate_reward(self, position, idx, pair):
        """
        Give one _previous_ day's return
        Input:
            position: position held going into idx (current step)
            idx: usually the current timestep
            pair: the two TPX stock names
        Output:
            dailypnl
        """
        # position = position_vector @ np.array([-1,0,1])
        position_0 = position
        position_1 = position * -1
        ## return_df gives the return for the previous day for the given idx
        dailypnl = position_0*self.return_df[f'{pair[0]}'].iloc[idx] + position_1*self.return_df[f'{pair[1]}'].iloc[idx] 

        return dailypnl

# Instantiate the custom environment
env = PairTradeEnv(workingPairOutcome, top_keys, validPairsList, return_df)

In [27]:
class QNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 32)
        self.fc2 = nn.Linear(32, 16)
        self.fc3 = nn.Linear(16, output_size)

    def forward(self, x):
        # x has shape (batch_size, input_size), e.g. (1, 6) for a single state
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class QLearningAgent:
    def __init__(self, input_size, output_size, learning_rate, discount_factor, epsilon, epsilon_decay):
        self.q_network = QNetwork(input_size, output_size)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.loss_fn = nn.MSELoss()
        self.discount_factor = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay

    # def choose_action(self, state):
    #     if np.random.rand() < self.epsilon:
    #         # Explore: Choose a random action
    #         return np.random.choice([-1, 0, 1])
    #     else:
    #         # Exploit: Choose the action with the highest Q-value
    #         with torch.no_grad():
    #             q_values = self.q_network(torch.tensor(state, dtype=torch.float32))
    #             action_index = torch.argmax(q_values).item()
    #     return np.array([-1, 0, 1])[action_index]  
    def choose_action(self, state):
        if np.random.rand() < self.epsilon:
            # Explore: Choose a random action
            return np.random.choice([-1, 0, 1])
        else:
            # Exploit: Choose the action with the highest Q-value
            with torch.no_grad():
                q_values = self.q_network(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
                action_index = torch.argmax(q_values, dim=1).item()  # Index: 0, 1, or 2
            return [-1, 0, 1][action_index]  # Map index to action

    # def learn(self, state, action, reward, total_reward, next_state, done):
    #     state = torch.tensor(state, dtype=torch.float32)
    #     action = torch.tensor([action], dtype=torch.long)
    #     reward = torch.tensor([reward], dtype=torch.float32)
    #     next_state = torch.tensor(next_state, dtype=torch.float32)
    #     done = torch.tensor([done], dtype=torch.float32)
    #     state = state.unsqueeze(0)  # Reshape to (1, input_size)
    #     next_state = next_state.unsqueeze(0)

    #     q_value = self.q_network(state)[action]
    #     next_q_value = torch.max(self.q_network(next_state)).detach()
    #     if total_reward > 0:
    #         # target_q_value = reward + self.discount_factor * next_q_value * (1 - done) + total_reward * .1
    #         target_q_value = reward + self.discount_factor * next_q_value * (1 - done)
    #     else:
    #         # target_q_value = reward + self.discount_factor * next_q_value * (1 - done) + total_reward * .1
    #         target_q_value = reward + self.discount_factor * next_q_value * (1 - done)

    #     loss = self.loss_fn(q_value, target_q_value)
    #     self.optimizer.zero_grad()
    #     loss.backward()
    #     self.optimizer.step()

    def learn(self, state, action, reward, total_reward, next_state, done):
        # Convert inputs to tensors
        state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)  # (1, input_size)
        next_state = torch.tensor(next_state, dtype=torch.float32).unsqueeze(0)  # (1, input_size)
        reward = torch.tensor([reward], dtype=torch.float32)
        done = torch.tensor([done], dtype=torch.float32)

        # Map action to index
        action_index = torch.tensor([[[-1, 0, 1].index(action)]], dtype=torch.long)  # Shape (1, 1)

        # Calculate Q-value for the chosen action
        q_value = self.q_network(state).gather(1, action_index).squeeze(-1)  # Shape (1)

        # Calculate target Q-value
        next_q_value = torch.max(self.q_network(next_state), dim=1)[0].detach()  # Shape (1)
        target_q_value = reward + self.discount_factor * next_q_value * (1 - done)

        # Compute loss and update network
        loss = self.loss_fn(q_value, target_q_value)
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), max_norm=1.0)
        self.optimizer.step()

# Example usage:
input_size = 6  # 5 scaled features + last position
output_size = 3  # Adjust to your desired number of discrete actions
learning_rate = 0.25
discount_factor = 0.99
epsilon = 1
epsilon_decay = 0.85

agent = QLearningAgent(input_size, output_size, learning_rate, discount_factor, epsilon, epsilon_decay)
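
As a worked example of the target used in `learn`, the sketch below evaluates `reward + discount_factor * max(next Q) * (1 - done)` for one transition. The reward and next-state Q-values are hypothetical numbers, and `td_target` is an illustrative helper rather than part of the agent.

```python
def td_target(reward, next_q_values, discount_factor, done):
    """One-step target: r + gamma * max_a' Q(s', a'), with no bootstrap once done."""
    bootstrap = 0.0 if done else discount_factor * max(next_q_values)
    return reward + bootstrap


mid_episode = td_target(0.01, [0.20, -0.10, 0.05], 0.99, done=False)
last_step = td_target(0.01, [0.20, -0.10, 0.05], 0.99, done=True)
print(mid_episode, last_step)
```

Mid-episode the target is 0.01 + 0.99 * 0.20, roughly 0.208, so the bootstrapped term can dominate a daily return of about 1%. On the final step only the reward remains.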

In [28]:
def train(agent, env, num_episodes, max_steps_per_episode):
    for episode in range(num_episodes):
        state = env.reset()
        total_reward = 0
        episode_memory = []  # Store experiences during the episode

        for step in range(max_steps_per_episode):
            action = agent.choose_action(state)
            next_state, reward, done, _ = env.step(action)
            episode_memory.append((state, action, reward, next_state, done))
            state = next_state
            total_reward += reward

            if done:
                break

        # Learn from the entire episode
        for state, action, reward, next_state, done in episode_memory:
            agent.learn(state, action, reward, total_reward, next_state, done)

        # Decay epsilon once per episode, with a floor of 0.05
        agent.epsilon = max(.05, agent.epsilon * agent.epsilon_decay)

        print(f"Episode: {episode+1}, Total Reward: {total_reward}, epsilon: {agent.epsilon}")
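
To see how quickly exploration ends under a per-episode multiplicative decay, the number of episodes needed to reach the floor can be computed directly. `episodes_to_floor` is a hypothetical helper; it assumes epsilon is multiplied by the decay factor once per episode, as in `train` above.

```python
import math


def episodes_to_floor(epsilon, decay, floor):
    """Number of per-episode decays before epsilon * decay**n first reaches the floor."""
    return math.ceil(math.log(floor / epsilon) / math.log(decay))


fast = episodes_to_floor(1.0, 0.85, 0.05)
slow = episodes_to_floor(1.0, 0.9999, 0.05)
print(fast, slow)
```

With a decay of 0.85 and a floor of 0.05, epsilon hits the floor after 19 episodes, so most of a 100-episode run is spent almost fully exploiting. A decay of 0.9999 would need tens of thousands of episodes to reach the same floor.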

In [29]:
num_episodes = 100
max_steps_per_episode = 3000

train(agent, env, num_episodes, max_steps_per_episode)

Episode: 1, Total Reward: 0.07921787086103693, epsilon: 0.85
Episode: 2, Total Reward: 0.3049690508304137, epsilon: 0.7224999999999999
Episode: 3, Total Reward: 0.29396127219553114, epsilon: 0.6141249999999999
Episode: 4, Total Reward: -0.9673899796583354, epsilon: 0.5220062499999999
Episode: 5, Total Reward: 0.3561046913816679, epsilon: 0.4437053124999999
Episode: 6, Total Reward: 0.6945607904422253, epsilon: 0.3771495156249999
Episode: 7, Total Reward: -1.0560858027630353, epsilon: 0.32057708828124987
Episode: 8, Total Reward: 0.4052085401489176, epsilon: 0.2724905250390624
Episode: 9, Total Reward: -0.7092316161950062, epsilon: 0.23161694628320303
Episode: 10, Total Reward: 0.5448784869208478, epsilon: 0.19687440434072256
Episode: 11, Total Reward: -0.7622597765887497, epsilon: 0.16734324368961417
Episode: 12, Total Reward: -0.2305852127103475, epsilon: 0.14224175713617204
Episode: 13, Total Reward: 0.5150558332276077, epsilon: 0.12090549356574623
Episode: 14, Total Reward: -0.14627

KeyboardInterrupt: 

In [14]:
## Get baseline results
t_pair = top_keys[0]  # pairsOutcome is keyed by the joined pair name
max_steps_per_episode = 3000

def get_baseline(env, max_steps_per_episode, t_pair):
    env.reset()
    total_reward = 0
    current_step = 261
    env.current_step = current_step
    env.last_step = 2868

    for step in range(max_steps_per_episode):
        # Replay the baseline position recorded for the current day
        action = pairsOutcome[t_pair].iloc[env.current_step]['position']
        _, reward, done, _ = env.step(action)
        total_reward += reward

        if done:
            break

    print(f"Baseline {t_pair}, Total Reward: {total_reward}, step {step}")

In [None]:
get_baseline(env, 3000, top_keys[0])

Baseline 1801 JP Equity 2670 JP Equity, Total Reward: 2.3267375595549673, step 2606


- first few tries, network is very large
- added epsilon search in "choose_action" function so that there will be some chance to explore
- changed reward function to multiply losses and give exponential returns to incentivise risk taking

### 1 dec 2105: 
- performance is always oscillating between negative and positive. This might be because the learning rate is too large. Also start every episode from the start of the training period, with max steps at 3000, so that total results are comparable
    - this helped quite a bit.
`
input_size = 7  # Adjust to your specific input size
output_size = 3  # Adjust to your desired number of discrete actions
learning_rate = 0.1
discount_factor = 0.8
epsilon = 1 # down to .3
epsilon_decay = 0.9999
num_episodes = 500
max_steps_per_episode = 3000
`
- want to try changing epsilon to update only after the entire episode instead of after each step. It's decaying too quickly
- I want to try changing the reward by having "learn" use total_reward instead of "reward"
- Scale the states. Need to explore scaling the state since it is still in terms of absolute differences. The NN is not able to do proportions
- training epochs should be smaller, at up to 30 days, because the mean reversion pattern is 1 to 33 days
    - very bad performance with 40 day epochs
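
One way to set up shorter training windows is to sample a random start inside the training range, in the spirit of the commented-out random start in the environment. This is a sketch under that assumption; `sample_episode_window` is a hypothetical helper and is not what produced the results noted above.

```python
import random


def sample_episode_window(earliest_step, last_step, episode_length, rng=random):
    """Pick a random [start, end) window of episode_length steps inside the training range."""
    # randint is inclusive at both ends, so end never exceeds last_step
    start = rng.randint(earliest_step, last_step - episode_length)
    return start, start + episode_length


start, end = sample_episode_window(261, 2868, 30)
print(start, end)
```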

### 1 dec 2217:
- changed target q value function to remove the exponential reward and the scaled negative reward, so positive and negative rewards are now treated the same. Added a portion of the episode's total reward to incentivise longer-term rewards.
    - `        if reward > 0:
            target_q_value = reward + self.discount_factor * next_q_value * (1 - done) + total_reward * .1
        else:
            target_q_value = reward + self.discount_factor * next_q_value * (1 - done) + total_reward * .1`
    -       `  if episode%1==0:
            agent.epsilon *= agent.epsilon_decay`

### 2 Dec 2101:
- managed to scale the states, but results are not any better
- thinking of reducing the learning rate to reduce the oscillations
    - will try to run with learning rate at 0.01
- right now total reward is taking over the whole target q function. Maybe can make it a 50/50 split