In [1]:
## anaconda3 (Python 3.12.0) Kernel

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

#f for q
import gym
import numpy as np
import random

# pair trade packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from datetime import datetime

# Load Pairs Data


`validPairs4.csv` already have TOPIX stocks with highest liquidity and are tested for stationarity over a 1 year window

Choose top 10 known pair trades by returns in the total dataset

In [2]:
from load_data import load_data

workingPairOutcome, top_keys, validPairsList, return_df = load_data()

In [3]:
# Print the top 10 performing trades
print("Top 10 performing trades:")
for i, key in enumerate(top_keys,1):
    print(f"{i}. Pair: {key}")

Top 10 performing trades:
1. Pair: 1801 JP Equity 2670 JP Equity
2. Pair: 3778 JP Equity 6701 JP Equity
3. Pair: 2760 JP Equity 6254 JP Equity
4. Pair: 5706 JP Equity 6954 JP Equity
5. Pair: 7951 JP Equity 9684 JP Equity
6. Pair: 1808 JP Equity 6481 JP Equity
7. Pair: 3099 JP Equity 5831 JP Equity
8. Pair: 1808 JP Equity 6971 JP Equity
9. Pair: 4021 JP Equity 9843 JP Equity
10. Pair: 5929 JP Equity 6504 JP Equity


In [4]:
## Sample pair data 
workingPairOutcome[top_keys[0]].tail(), workingPairOutcome[top_keys[0]].shape

(             spread  1sd high  1sd low  2sd high  2sd low
 Date                                                     
 27/5/2024  0.019074       1.0     -1.0       2.0     -2.0
 28/5/2024  0.680074       1.0     -1.0       2.0     -2.0
 29/5/2024  1.055247       1.0     -1.0       2.0     -2.0
 30/5/2024  0.595548       1.0     -1.0       2.0     -2.0
 31/5/2024  0.143814       1.0     -1.0       2.0     -2.0,
 (2979, 5))

## Make indicators and spread stationary around 0
Deduct the mean from all values to translate to 0 axis

- Test one timestep at a time (even though we can test all at the same time)
- give state
- Trading should be path dependent due to stop loss. in this case I can only give last position as one of the parameters

# Machine Learning Challenge

## Background
Initial evaluation of the baseline portfolio shows that draw downs are small. Originally team had the idea of using Machine Learning to optimise for sizing of these pair trades. However since there was no significant drawdowns the returns are linearly increasing with investment sizing i.e. greater nominal investment in the the pair trade the proportionate increase in returns without realising significant drawdown risk.

Instead of optimising for sizing, we can explore Machine Learning in terms of strategy on this stationary dataset. Whereas our prescribed strategy is to enter at +/- 1 std dev, exit at 0 with +/- 2 std dev stop loss. These are only suggestions and arbitrary levels.

With Machine Learning, we can discover if it will uncover the mean reverting nature and recommend another threshhold. We use Q Learner to understand state space with the same spread, mid, std dev parameters as the baseline.

### Steps
#### Environment:
- State Space: A set of all possible states the agent can be in.  
  - [spread, mid, 2 sd low, 1 sd low, 1 sd high, 2 sd high]
- Action Space: A set of all possible actions the agent can take in each state.   
  - [-1, # short\
      0, # uninvested\
      1  # long]   
- Reward Function: A function that assigns a numerical reward to each state-action pair, indicating the immediate consequence of taking a particular action in a specific state.
  - dailypnl
- Transition Function: A function that determines the probability of transitioning from one state to another when a particular action is taken.
  - deterministic based on historical performance
#### Agent:

- Q-Table: A matrix that stores the estimated Q-values for each state-action pair. Q-values represent the expected future reward for taking a specific action in a given state.   
  - continuous Q table?
- Learning Rate (α): A parameter that controls how much the Q-values are updated with each new experience.   
- Discount Factor (γ): A parameter that determines the importance of future rewards. A higher discount factor gives more weight to future rewards.   
- Exploration Rate (ε): A parameter that controls the balance between exploration (trying new actions) and exploitation (choosing the action with the highest Q-value).   
- Q-Learning Algorithm:

  - Initialization: Initialize the Q-table with random values or zeros.   
  - Exploration and Exploitation: Use an exploration strategy (e.g., ε-greedy) to choose an action:
    - With probability ε, choose a random action.   
    - With probability 1-ε, choose the action with the highest Q-value for the current state.   
  
  - Take Action: Execute the chosen action in the environment.   
  - Observe Reward and Next State: Observe the immediate reward and the next state resulting from the action.
- Update Q-Value: Update the Q-value of the current state-action pair using the following formula:

#### Training and Test set

2013 is used for warm start\
2014 - 2023 train data since NN need a lot of training data {end 2023 idx == 2868}\
2024 onwards (5 months) test data


In [5]:
from q_agent import MultiStockEnv

In [6]:
env = MultiStockEnv(workingPairOutcome, top_keys, validPairsList, return_df)

In [7]:
env.last_step

2868

In [8]:
"""
# TODO NEED TO CHANGE GET BASELINE TO GET REWARD FOR EACH OF THE STOCKS FOR THAT TIME STEP

"""
env.current_step = 265
env.calculate_reward(np.ones(10)*-1)


-0.017012574193340124

In [9]:
with open('pairsOutcome.pkl', 'rb') as file:
    pairsOutcome = pickle.load(file)

print("Dictionary loaded from pairsOutcome.pkl")

Dictionary loaded from pairsOutcome.pkl


In [10]:
## need to go through each df and get the position then join all of them together
pairsOutcome[top_keys[0]]

Unnamed: 0_level_0,spread,mid,1sd high,1sd low,2sd high,2sd low,position,1801 JP Equity position,2670 JP Equity position,dailypnl,cumpnl
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1/1/2013,-829.459706,,,,,,0,0,0,0.00000,0.000000
2/1/2013,-829.459706,,,,,,0,0,0,0.00000,0.000000
3/1/2013,-829.459706,,,,,,0,0,0,0.00000,0.000000
4/1/2013,-788.012196,,,,,,0,0,0,0.00000,0.000000
7/1/2013,-751.666698,,,,,,0,0,0,-0.00000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
27/5/2024,-39.304773,-46.289714,319.903354,-412.482783,686.096423,-778.675852,0,0,0,0.00000,2.552379
28/5/2024,204.731719,-44.231112,321.850855,-410.313080,687.932823,-776.395048,0,0,0,-0.00000,2.552379
29/5/2024,345.042254,-41.541554,324.802844,-407.885953,691.147243,-774.230351,-1,-1,1,0.02741,2.579789
30/5/2024,178.588811,-39.334254,326.585734,-405.254242,692.505722,-771.174230,0,0,0,0.00000,2.579789


In [11]:
## Get baseline results
t_pair = validPairsList[0]
max_steps_per_episode = 100

def get_baseline(env, max_steps_per_episode, t_pair):
    env.reset()
    total_reward = 0
    current_step = 261
    env.current_step = current_step
    env.last_step = 2868

    for step in range(max_steps_per_episode):
        action = workingPairOutcome[t_pair].iloc[env.current_step]['position']
        _, reward, done, _ = env.step(action)
        total_reward += reward

        if done:
            break

    print(f"Baseline {t_pair}, Total Reward: {total_reward}, step {step}")

get_baseline(env, max_steps_per_episode, t_pair)

TypeError: unhashable type: 'list'

In [None]:
import gym
import numpy as np
import random

class MultiStockEnv(gym.Env):
    def __init__(self, workingPairOutcome, top_keys, validPairsList, return_df):
        self.workingPairOutcome = workingPairOutcome
        self.top_keys = top_keys
        self.validPairsList = validPairsList
        self.return_df = return_df
        
        self.earliest_step = 261  # Starting step
        self.last_step = 2868  # Ending step
        self.current_step = self.earliest_step
        
        # Number of stocks and possible outcomes per stock (3 outcomes per stock)
        self.num_stocks = 10
        self.num_actions = 3  # Buy, hold, sell
        
    def step(self, actions):
        """
        Input:
            actions: List of actions (length of 10, each corresponding to a stock)
        Output:
            next_state: next state 5 features
            reward: total reward for this timestep
            done: boolean for if end of dataset
            info: optional
        """
        # Advance the time step
        self.current_step += 1
        done = self.current_step >= self.last_step

        # Get the state for each stock
        state = np.zeros((self.num_stocks, 5))  # 10 stocks with 3 possible outcomes
        for i in range(self.num_stocks):
            # Get the current state (outcomes) for the stock
            state[i] = self.workingPairOutcome[self.top_keys[i]].iloc[self.current_step].values
        
        # Calculate the reward (based on action for each stock)
        reward = self.calculate_reward(actions)
        
        # Provide next state
        next_state = state.flatten()  # Flatten to 1D array for the agent
        info = {}  # Optional information
        
        return next_state, reward, done, info

    def reset(self):
        """ Reset to the starting point of the dataset """
        self.current_step = self.earliest_step
        state = np.zeros((self.num_stocks, 5))  # Initialize state for all stocks
        for i in range(self.num_stocks):
            # Set the state for each stock (first row from each stock's data)
            state[i] = self.workingPairOutcome[self.top_keys[i]].iloc[self.current_step].values
        
        return state.flatten()  # Return flattened state
    
    def calculate_reward(self, actions):
        """ Calculate reward for the actions taken for each stock """
        reward = 0
        for i in range(self.num_stocks):
            position = actions[i]  # Action for the current stock (buy, hold, sell)
            reward += self.stock_reward(position, self.current_step, self.validPairsList[i])
        return reward
    
    def stock_reward(self, position, idx, pair):
        """ Compute reward for each stock based on position and return data """
        position_0 = position
        position_1 = position * -1
        dailypnl = position_0 * self.return_df[f'{pair[0]}'].iloc[idx] + position_1 * self.return_df[f'{pair[1]}'].iloc[idx]
        return dailypnl

# Instantiate the environment
env = MultiStockEnv(workingPairOutcome, top_keys, validPairsList, return_df)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
from collections import deque

class QNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_size, 32)
        self.fc2 = nn.Linear(32, 64)
        self.fc3 = nn.Linear(64, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class QLearningAgent:
    def __init__(self, input_size, output_size, learning_rate, discount_factor, epsilon, epsilon_decay, batch_size=30, replay_buffer_size=10000):
        self.q_network = QNetwork(input_size, output_size)
        self.target_network = QNetwork(input_size, output_size)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.loss_fn = nn.MSELoss()
        self.discount_factor = discount_factor
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.replay_buffer = deque(maxlen=replay_buffer_size)
        
        # Action to index mapping
        self.action_to_index = {-1: 0, 0: 1, 1: 2}
        self.index_to_action = {0: -1, 1: 0, 2: 1}

    def choose_action(self, state):
        if np.random.rand() < self.epsilon:
            # Randomly select actions for each stock
            actions = np.random.choice([-1, 0, 1], size=10).tolist()
        else:
            with torch.no_grad():
                state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
                q_values = self.q_network(state_tensor)
                
                # Get the best action index for each stock
                action_indices = torch.argmax(q_values, dim=1).unsqueeze(0).tolist()
                actions = []
                for idx in action_indices:
                    action = self.index_to_action[idx[0]]  # Use idx directly as the key
                    actions.append(action)
            
        return actions

    def store_experience(self, state, actions, reward, next_state, done):
        for action in actions:  # Store each stock's action separately
            self.replay_buffer.append((state, action, reward, next_state, done))

    def learn(self):
        if len(self.replay_buffer) < self.batch_size:
            return

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.tensor(states, dtype=torch.float32)
        next_states = torch.tensor(next_states, dtype=torch.float32)

        # Convert actions to indices (ensure actions are single values)
        actions = torch.tensor([self.action_to_index[action] for action in actions], dtype=torch.long).view(-1, 1)
        rewards = torch.tensor(rewards, dtype=torch.float32).view(-1, 1)
        dones = torch.tensor(dones, dtype=torch.float32).view(-1, 1)

        # Q-value updates
        q_values = self.q_network(states).gather(1, actions)
        next_q_values = self.target_network(next_states).max(1, keepdim=True)[0].detach()
        target_q_values = rewards + self.discount_factor * next_q_values * (1 - dones)

        loss = self.loss_fn(q_values, target_q_values)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Update target network and decay epsilon
        self.update_target_network()
        self.epsilon = max(0.01, self.epsilon * self.epsilon_decay)

    def update_target_network(self):
        self.target_network.load_state_dict(self.q_network.state_dict())

# Example usage:
input_size = 50  # 10 stocks * 5 indicators per stock
output_size = 30  # Action space per stock
learning_rate = 0.25
discount_factor = 0.99
epsilon = 1.0
epsilon_decay = 0.99999
ls_total_reward = []
total_episodes = 1_000_000

agent = QLearningAgent(input_size, output_size, learning_rate, discount_factor, epsilon, epsilon_decay)

# Simulating agent learning (in practice, use a loop with environment interaction)
for episode in range(total_episodes):
    state = env.reset() 
    done = False
    total_reward = 0
    
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done, _ = env.step(action)
        
        agent.store_experience(state, action, reward, next_state, done)
        agent.learn()
        
        state = next_state
        total_reward += reward

    ls_total_reward.append(total_reward)
        
    print(f"Episode {episode+1}: Total Reward: {total_reward}, Epsilon: {agent.epsilon:.2f}")


IndexError: list index out of range

In [None]:
np.array(ls_total_reward[-100:]).mean()

  np.array(ls_total_reward[-100:]).mean()
  ret = ret.dtype.type(ret / rcount)


nan

In [None]:
## Get baseline results
t_pair = validPairsList[0]
max_steps_per_episode = 100

def get_baseline(env, max_steps_per_episode, t_pair):
    env.reset()
    total_reward = 0
    current_step = 261
    env.current_step = current_step
    env.last_step = 2868

    for step in range(max_steps_per_episode):
        action = pairsOutcome[t_pair].iloc[env.current_step]['position']
        _, reward, done, _ = env.step(action)
        total_reward += reward

        if done:
            break

    print(f"Baseline {t_pair}, Total Reward: {total_reward}, step {step}")

SyntaxError: invalid syntax (1458872436.py, line 3)

In [None]:
get_baseline(env, 3000, top_keys[0])

Baseline 1801 JP Equity 2670 JP Equity, Total Reward: 2.3267375595549673, step 2606


- first few tries, network is very large
- added epsilon search in "choose_action" functionso that there will be some chance to explore
- changed reward function to multiply losses and give exponential returns to incentivise risk taking

### 1 dec 2105: 
- might have performance is always oscillating negative and positive. This might be because of too large a learning rate. also start from start of training periods max steps to be 3000 so that total results are comparable
    - this helped quite abit. 
`
input_size = 7  # Adjust to your specific input size
output_size = 3  # Adjust to your desired number of discrete actions
learning_rate = 0.1
discount_factor = 0.8
epsilon = 1 # down to .3
epsilon_decay = 0.9999
num_episodes = 500
max_steps_per_episode = 3000
`
- want to try changing epsilon to only update after the entire episode instead of after each step. its decaying too quickly
    - 
- I want to try with changing reward by changing "learn" to use total_reward instead of "reward"
- Scale the states. need to explore scaling the state since it is still in terms of absolute differences. NN is not able to do proportions
- training epochs should be smaller at up to 30 days because mean reversion pattern is 1 to 33 days
    - very bad performance with 40 day epochs

### 1 dec 2217:
- changed target q value fxn to remove exponential reward and scaled negative reward. now both positive and negative are the same. added portion of total reward in episode to incentivise more long term rewards.
    - `        if reward > 0:
            target_q_value = reward + self.discount_factor * next_q_value * (1 - done) + total_reward * .1
        else:
            target_q_value = reward + self.discount_factor * next_q_value * (1 - done) + total_reward * .1`
    -       `  if episode%1==0:
            agent.epsilon *= agent.epsilon_decay`

### 2 Dec 2101:
- managed to scale but results are not any better
- thinking of reducing learning rate to reduce the oscillations
    - will try to run with learning rate at 0.01
- right now total reward is taking all of the target q function. maybe can make it a 50/50 split