##### <br>***Code written in this notebook is in reference to the explanation paper 'Reinforcement Learning in Foreign Exchange Trading' in the same directory***<br>  

This notebook is for training, validating, and testing new models.  
- To use the pre-trained model for daily prediction, please refer to the **Pre_trained_implementation** notebook.
- To backtest the performance of the pre-trained model, please refer to the **Pre_trained_Model_Performance** notebook.
- The algorithm can be trained to trade any currency pair or a portfolio of currency pairs.<br><br>  

Before training a new model, the user can decide:
- Whether to use the same price indicators as proposed to build an agent-observable dataset.  
- Whether to import any pre-acquired agent-observable dataset.<br><br>  

This notebook can be separated into the following parts:  
- Training  
- Validation  
- Testing  
- Implementation  
- Details<br>

Cells that require input from the user are tagged with `[User-input required]`.
<br><br>

### Content  

1.[Define Historical Price Dataset](#def_bid_ask) <font size=3>*[Training and Validation]*</font>`[User-input required]`   

2.[Define Agent-observable Dataset](#def_obs) <font size=3>*[Training and Validation]*</font>`[User-input required]`  

3.[Extract Price Indicators](#extract) <font size=3>*[Details]*</font>  

4.[Preprocess](#prep) <font size=3>*[Details]*</font>   

5.[Define Train & Validation Data](#def_train_val) <font size=3>*[Training and Validation]*</font>  

6.[Experience Replay Buffer](#replay) <font size=3>*[Details]*</font>  

7.[Actor Network & Critic Network](#ac_cr) <font size=3>*[Details]*</font>  

8.[Twin Delayed Deep Deterministic Policy Gradients](#td3) <font size=3>*[Details]*</font>   

9.[Forex Spread-Betting Environment](#env) <font size=3>*[Details]*</font>  

10.[Set hyperparameters](#hparms) <font size=3>*[Training and Validation]*</font>`[User-input optional]`  

11.[Training and Validation](#train) <font size=3>*[Training and Validation]*</font>  

12.[Agent-Iteration Testing](#test) <font size=3>*[Testing]*</font>`[User-input required]`  

13.[Simple implementation within the notebook](#simple_imp) <font size=3>*[Implementation]*</font>  


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os
import math
from tensorflow.keras.optimizers import Adam
import tensorflow.keras as keras
from tensorflow.keras import layers
import random
from tqdm.notebook import trange
import pickle
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (16,9)

### Define Historical Price Dataset  <a name="def_bid_ask"></a>
#### [Training and Validation] 
#### **(User-input required)**  

The historical price dataset, ```historical_data```, is an unprocessed price 
dataset that must contain the prices used to define 
trade entries and exits for reward calculation (a pandas DataFrame is expected).<br>  
e.g. ```historical_data = pd.read_csv('...')```  

<br><br>
- ```historical_data.shape``` is expected to be ```(timestamps, bid_ask_prices)```,  
where the columns must contain at least one bid price and one ask price for 
defining trade entries and exits.<br>  

    The historical price data is generally expected to have an index and columns as shown:

    | Index | Bid Open | Bid High | Bid Low | Bid Close | Ask Open | Ask High | Ask Low | Ask Close |
    | :---: | :------: | :------: | :-----: | :-------: | :------: | :------: | :-----: | :-------: |
    | Timestamp_0 | ... | ... | ... | ... | ... | ... | ... | ... |
    | Timestamp_1 | ... | ... | ... | ... | ... | ... | ... | ... |
    | Timestamp_2 | ... | ... | ... | ... | ... | ... | ... | ... |
    |     ...     | ... | ... | ... | ... | ... | ... | ... | ... |
    | Timestamp_n | ... | ... | ... | ... | ... | ... | ... | ... |

<br><br>
- ```historical_data.columns``` are expected to be 
```['Bid Open', 'Bid High', 'Bid Low', 'Bid Close', 'Ask Open', 'Ask High', 'Ask Low', 'Ask Close']```  
    - If different, specify the columns to use for entries and exits when defining 
    ```ForexEnv``` (trading environment) by setting the [hyperparameters (bid_col, ask_col)](#hparms); 
    details can be found [here](#env).<br>  
        e.g. If ```historical_data.columns``` are ```['Bid Close', 'Ask Close']```, 
        then ```bid_col = 0```, ```ask_col = 1```.<br><br>      
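
As a rough illustration of the layout described above, the sketch below builds a tiny placeholder DataFrame and looks up the positional indices that could be passed as ```bid_col``` and ```ask_col```. The prices, timestamps and the names ```example_data```, ```example_bid_col``` and ```example_ask_col``` are made up for this illustration, and picking the close columns is only one possible choice.

```python
import pandas as pd

# Hypothetical prices, for illustration only (not real market data)
columns = ['Bid Open', 'Bid High', 'Bid Low', 'Bid Close',
           'Ask Open', 'Ask High', 'Ask Low', 'Ask Close']
example_data = pd.DataFrame(
    [[1.1000, 1.1010, 1.0990, 1.1005, 1.1002, 1.1012, 1.0992, 1.1007],
     [1.1005, 1.1020, 1.1000, 1.1015, 1.1007, 1.1022, 1.1002, 1.1017]],
    index=pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00']),
    columns=columns,
)

# Positional indices of the columns used for trade entries and exits
example_bid_col = example_data.columns.get_loc('Bid Close')
example_ask_col = example_data.columns.get_loc('Ask Close')
print(example_data.shape, example_bid_col, example_ask_col)
```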

In [None]:
# (user-input)
historical_data = 

### Define Agent-observable Dataset  <a name="def_obs"></a>
#### [Training and Validation]
#### **(User-input required)**   
An agent-observable dataset, ```agent_observable_data```,  is what the learning 
algorithm can observe at each timestamp. It can be dependent on 
the particular currency the algorithm learns to trade.  

<br><br>
There are two ways of defining it:<br>  

1. By default we build an agent-observable dataset from the user-provided unprocessed historical price 
dataset (```historical_data```). We extract some technical indicators (same indicators as proposed in the paper) 
from the user's dataset, which then becomes our agent-observable dataset.
    - The ```historical_data.columns``` must contain candlestick data, e.g. BidOpen, BidHigh, BidLow, BidClose of a price.
    - The user can still decide the training portion of the dataset by defining ```user_train_portion``` (default 0.8).
    - To build an agent-observable dataset using the same price indicators but with different parameters, 
    refer to ```build_agent_obs_dataset()``` under [Preprocess](#prep).
    - To disable, set ```build_agent_obs_off_users = False```.<br><br>  

2. The user can import a custom agent-observable dataset by either<br>  
    I. specifying ```agent_observable_data``` and ```user_train_portion``` to define the training 
    and validation portions of the datasets, e.g.  
    ```agent_observable_data = pd.read_csv("...")```  
    ```user_train_portion = 0.8```<br><br>  
    
    II. or specifying ```train_agent_obs``` & ```val_agent_obs```, e.g.  
    ```train_agent_obs = pd.read_csv("...")```  
    ```val_agent_obs = pd.read_csv("...")```
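
The split in option 2.I can be sketched as follows, assuming a chronological split with no shuffling, so the earliest rows are used for training. The frame ```example_obs``` and its values are placeholders invented for this illustration.

```python
import numpy as np
import pandas as pd

# Placeholder agent-observable data: 10 timestamps, 3 features
example_obs = pd.DataFrame(np.arange(30).reshape(10, 3),
                           columns=['f0', 'f1', 'f2'])
example_train_portion = 0.8

# Earliest rows go to training, the remainder to validation
split = int(example_obs.shape[0] * example_train_portion)
example_train = example_obs.iloc[:split]
example_val = example_obs.iloc[split:]
print(len(example_train), len(example_val))
```

Keeping the validation rows strictly after the training rows avoids evaluating the agent on data that precedes what it was trained on.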

In [None]:
# 1. Whether or not to build an agent-observable dataset from the user-provided historical_data
build_agent_obs_off_users = True

# 2I. Define an agent-observable dataset and the train portion to represent the market states 
# (pandas dataframe expected)
agent_observable_data = None
user_train_portion = None  

# 2II. Define the training and validation agent-observable datasets
train_agent_obs = None    
val_agent_obs = None

### Extract Price Indicators <a name='extract'></a>
#### [Details]<br>  

Build an agent-observable dataset with the same set of variables used in the paper, 
from a price dataset that contains columns of ```['Open', 'High', 'Low', 'Close']```
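
To give a feel for the pandas building blocks behind these indicators, here is a minimal sketch of a simple moving average and a MACD line computed on a made-up ```Close``` series. The window lengths are shortened for readability and are not the defaults used in the paper.

```python
import pandas as pd

# Made-up closing prices, for illustration only
toy = pd.DataFrame({'Close': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# Simple moving average over 3 periods (the first 2 rows are NaN)
toy['MA3'] = toy['Close'].rolling(window=3).mean()

# MACD line: short EMA minus long EMA, then an EMA of that as the signal line
ema_short = toy['Close'].ewm(span=2, adjust=False).mean()
ema_long = toy['Close'].ewm(span=4, adjust=False).mean()
toy['MACD'] = ema_short - ema_long
toy['MACDSignal'] = toy['MACD'].ewm(span=2, adjust=False).mean()
print(toy)
```

Rolling windows leave missing values at the start of the series, which is why rows with NaN need to be dropped before the data is passed to the agent.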

In [None]:
def ma(df, ma_ranges=[10, 21, 50]):
    """
    Simple Moving Average
    
    df : pandas.DataFrame, must include columns ['Close']

    ma_ranges: list, default [10, 21, 50]
    List of periods of Simple Moving Average to be extracted
    """
        
    df = df.copy()
    for period in ma_ranges:
        df[f"MA{period}"] = df['Close'].rolling(window=period).mean()
        
    return df

In [None]:
def macd(df, short_long_signals=[(12,26,9)]):
    """
    Moving Average Convergence Divergence
    
    df : pandas.DataFrame, must include columns ['Close']

    short_long_signals : list, default [(12, 26, 9)] 
    List of periods of (short_ema, long_ema, signal) of Moving Average Convergence Divergence to be extracted
    """

    df = df.copy()
    for (short, long, signal) in short_long_signals:
        df[f"EMA{short}"] = df["Close"].ewm(span=short, adjust=False).mean()
        df[f"EMA{long}"] = df["Close"].ewm(span=long, adjust=False).mean()
        df[f"MACD{long}"] = df[f"EMA{short}"] - df[f"EMA{long}"]
        df[f"MACD{long}Signal"] = df[f"MACD{long}"].ewm(span=signal, adjust=False).mean()
        df = df.drop(columns=[f"EMA{short}", f"EMA{long}"])
        
    return df

In [None]:
def full_stochastic(df, stochastic_ranges=[(14,3,3)]):
    """
    Full Stochastic Indicator
    
    df : pandas.DataFrame, must include columns ['High', 'Low', 'Close']

    stochastic_ranges : list, default [(14, 3, 3)]
    List of periods of (fast_k, fast_d, slow_d) of Full Stochastic Indicator to be extracted
    """

    df = df.copy()
    for (fast_k, fast_d, slow_d) in stochastic_ranges:
        df[f"L{fast_k}"] = df["Low"].rolling(window=fast_k).min()
        df[f"H{fast_k}"] = df["High"].rolling(window=fast_k).max()
        df[f"fast_%K{fast_k}"] = 100*((df["Close"] - df[f"L{fast_k}"])
                                                 /(df[f"H{fast_k}"] - df[f"L{fast_k}"]))
        df[f"full_%K{fast_k}_fast_%D{fast_d}"] = df[f"fast_%K{fast_k}"].rolling(window=fast_d).mean()
        df[f"full_%K{fast_k}_slow_%D{slow_d}"] = df[f"full_%K{fast_k}_fast_%D{fast_d}"].rolling(window=slow_d).mean()
        df = df.drop(columns=[f"L{fast_k}", f"H{fast_k}", f"fast_%K{fast_k}"])
                
    return df

In [None]:
def rsi(df, rsi_ranges=[14]):
    """
    Relative Strength Index
    
    df : pandas.DataFrame, must include columns ['Open', 'Close']

    rsi_ranges: list, default [14]
    List of periods of rsi to be extracted
    """

    df = df.copy()
    df["Up_Close"] = np.where((df["Close"] > df["Open"]), df["Close"], 0)
    df["Down_Close"] = np.where((df["Close"] < df["Open"]), df["Close"], 0)
    for period in rsi_ranges:
        df[f"RS{period}_RollUpAvg"] = df["Up_Close"].ewm(span=period, adjust=False).mean()
        df[f"RS{period}_RollDownAvg"] = df["Down_Close"].ewm(span=period, adjust=False).mean()
        df[f"RSI{period}"] = 100 - (100 / (1 + (df[f"RS{period}_RollUpAvg"] / df[f"RS{period}_RollDownAvg"])))
        df = df.drop(columns=[f"RS{period}_RollUpAvg", f"RS{period}_RollDownAvg"])
    df = df.drop(columns=["Up_Close", "Down_Close"])
    
    return df

In [None]:
def bollinger_bands(df, bollinger_period_sd_ranges=[(20,2)]):
    """
    Bollinger Bands (including %B and Bandwidth)
    
    df : pandas.DataFrame, must include columns ['High', 'Low', 'Close']

    bollinger_period_sd_ranges : list, default [(20,2)]
    List of (period, standard_deviation) to be extracted
    """

    df = df.copy()
    df["TypicalPrice"] = ((df["High"] + df["Low"] + df["Close"]) / 3)
    for (period, sd) in bollinger_period_sd_ranges:
        ma_typicalPrice = df["TypicalPrice"].rolling(window=period).mean()
        sd_typicalPrice = ma_typicalPrice.rolling(window=period).std(ddof=1)
        df[f"UpperBollinger{period}"] = ma_typicalPrice + sd * sd_typicalPrice
        df[f"LowerBollinger{period}"] = ma_typicalPrice - sd * sd_typicalPrice
        df[f"%B{period}"] = ((df["Close"] - df[f"LowerBollinger{period}"]) / 
                             (df[f"UpperBollinger{period}"] - df[f"LowerBollinger{period}"]))
        df[f"Bandwidth{period}"] = (df[f"UpperBollinger{period}"] - df[f"LowerBollinger{period}"]) / ma_typicalPrice
    df = df.drop(columns=["TypicalPrice"])
        
    return df

In [None]:
def add_lag(df, lag=5):
    """
    Add lags to dataset to provide historical context.
    
    df : pandas.DataFrame

    lag: int, default 5
    Number of lags to be added
    """

    df = df.copy()
    cols = list(df.columns)
    cols_len = len(cols)
    for i in range(1, lag+1):
        df = pd.concat([df, df.iloc[:,:cols_len].shift(i)], axis=1)
        cols.extend([x + f"n-{i}" for x in df.columns[:cols_len]])
        df.columns = cols
    
    return df

### Preprocess <a name='prep'></a>
#### [Details]
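
The preprocessing in this section boils down to two steps: clip each column to bounds taken from the training portion only, then rescale with those same bounds. The sketch below illustrates the idea on a made-up single column. It uses ```np.quantile``` for the bounds, which is a simplification and not the exact index-based rule used by ```remove_outlier()```.

```python
import numpy as np

# Made-up feature column with one extreme value in the validation part
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
train_portion = 0.8
train = values[:int(len(values) * train_portion)]

# Bounds come from the training rows only, so validation data cannot leak in
lo, hi = np.quantile(train, [0.005, 0.995])

clipped = np.clip(values, lo, hi)      # values outside the bounds are pulled back to them
scaled = (clipped - lo) / (hi - lo)    # min-max scaling into [0, 1]
print(scaled)
```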

In [None]:
def remove_outlier(df, one_side_remove_percentile=0.005, train_portion=0.8):
    """
    Remove the outliers of each column of the dataset.

    Parameters
    ----------
    df : Dataframe to remove outliers from

    one_side_remove_percentile : The percentile of outliers to be removed on each end, default 0.005

    train_portion : The percentage of dataset used for training, default 0.8

    Returns
    --------
    df : pandas.DataFrame, Processed dataframe

    mins : numpy.array, Min values of each column

    maxs : numpy.array, Max values of each column
    """

    df = df.copy()
    mins, maxs = [], []
    df_cols = df.columns
    df_ind = df.index
    df = np.array(df)
    for i in range(df.shape[1]):
        temp = df[:,i]
        mins.append(np.sort(temp[:int(df.shape[0]*train_portion)])[int(df.shape[0]*one_side_remove_percentile)])
        maxs.append(np.sort(temp[:int(df.shape[0]*train_portion)])[-int(df.shape[0]*one_side_remove_percentile)])
    df = np.clip(df, mins, maxs)
    df = pd.DataFrame(df, columns=df_cols, index=df_ind)
    
    return df, np.array(mins), np.array(maxs)

In [None]:
def minmaxscaler(df, mins, maxs):
    """
    Rescale the data of the dataset using MinMaxScaler given the min and max values of each column.
    
    df : pandas.DataFrame

    mins : Min values of each column

    maxs : Max values of each column
    """

    df = df.copy()
    df = (df - mins) / (maxs - mins)
    
    return df

In [None]:
def build_agent_obs_dataset(unprocessed_df,
                            rename_columns=['Open', 'High', 'Low', 'Close', 'AskOpen', 'AskHigh', 'AskLow', 'AskClose'],
                            ma_ranges=[10, 21, 50],
                            short_long_signals=[(12, 26, 9)],
                            stochastic_ranges=[(14, 3, 3)],
                            rsi_ranges=[14],
                            bollinger_period_sd_ranges=[(20, 2)],
                            lag=5,
                            one_side_remove_percentile=0.005,
                            train_portion=0.8):
    """
    Build an agent-observable dataset that represents states the agent encounters at each timestamp.
    The default parameters are in reference to the ones used in the explanation paper.

    Parameters
    ----------
    unprocessed_df : pandas.DataFrame
     Dataframe from which the price indicators are extracted; must contain candlestick data.
     e.g. columns = ['BidOpen', 'BidHigh', 'BidLow', 'BidClose', 'AskOpen', 'AskHigh', 'AskLow', 'AskClose']

    rename_columns : list, default ['Open', 'High', 'Low', 'Close', 'AskOpen', 'AskHigh', 'AskLow', 'AskClose']
     (Default price indicators are extracted from bid prices)
     Rename columns to include ['Open', 'High', 'Low', 'Close'], where the price indicators are extracted from. 
     If they already exist, simply pass on the unprocessed dataframe columns. (unprocessed_df.columns)

    ma_ranges : list, default [10, 21, 50]
     List of periods of simple moving average to be extracted

    short_long_signals : list, default [(12, 26, 9)]
     List of periods (short_ema, long_ema, signal) of Moving Average Convergence Divergence to be extracted

    stochastic_ranges : list, default [(14, 3, 3)]
     List of periods (fast_k, fast_d, slow_d) of Full Stochastic Indicator to be extracted

    rsi_ranges : list, default [14]
     List of periods of Relative Strength Index to be extracted

    bollinger_period_sd_ranges : list, default [(20,2)]
     List of (period, standard_deviation) of Bollinger Bands (including %B and Bandwidth) to be extracted

    lag : int, default 5
     Number of lags added to the dataset to provide historical context

    one_side_remove_percentile : float, default 0.005
     The percentile of outliers to be removed on each end

    train_portion : float, default 0.8
     The percentage of dataset used for training

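    Returns
    -------
    train_agent_obs : pandas.DataFrame
     Agent-observable dataset for training

    val_agent_obs : pandas.DataFrame
     Agent-observable dataset for validation

    mins : numpy.array
     Per-column lower bounds used for clipping and min-max scaling

    maxs : numpy.array
     Per-column upper bounds used for clipping and min-max scaling
    """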

    agent_obs_data = unprocessed_df.copy()
    agent_obs_data.columns = rename_columns
    agent_obs_data = ma(df=agent_obs_data, ma_ranges=ma_ranges)
    agent_obs_data = macd(df=agent_obs_data, short_long_signals=short_long_signals)
    agent_obs_data = full_stochastic(df=agent_obs_data, stochastic_ranges=stochastic_ranges)
    agent_obs_data = rsi(df=agent_obs_data, rsi_ranges=rsi_ranges)
    agent_obs_data = bollinger_bands(df=agent_obs_data, bollinger_period_sd_ranges=bollinger_period_sd_ranges)
    agent_obs_data = agent_obs_data.iloc[:, len(rename_columns):]
    agent_obs_data = add_lag(df=agent_obs_data, lag=lag)
    agent_obs_data = agent_obs_data.dropna()

    agent_obs_data, mins, maxs = remove_outlier(agent_obs_data, one_side_remove_percentile=one_side_remove_percentile,
                                                train_portion=train_portion)

    # We allow Upper and Lower BollingerBands to share the same mins and maxs
    mins[np.where(['UpperBollinger' in x for x in list(agent_obs_data.columns)])[0]] = mins[
        np.where(['LowerBollinger' in x for x in list(agent_obs_data.columns)])[0]]
    maxs[np.where(['LowerBollinger' in x for x in list(agent_obs_data.columns)])[0]] = maxs[
        np.where(['UpperBollinger' in x for x in list(agent_obs_data.columns)])[0]]

    # We do not specify train_portion here as the mins and maxs are extracted from the training set
    agent_obs_data = minmaxscaler(agent_obs_data, mins, maxs)

    train_agent_obs = agent_obs_data.iloc[:int(agent_obs_data.shape[0] * train_portion)]
    val_agent_obs = agent_obs_data.iloc[int(agent_obs_data.shape[0] * train_portion):]
    
    return train_agent_obs, val_agent_obs, mins, maxs

### Define Train & Validation Data  <a name='def_train_val'></a>  
#### [Training and Validation]<br>  

If ```historical_data.columns``` is not ```['BidOpen', 'BidHigh', 'BidLow', 'BidClose', 'AskOpen', 'AskHigh', 'AskLow', 'AskClose']```,  

- read the docstring in ```build_agent_obs_dataset()``` for the ```rename_columns``` argument.

In [None]:
if train_agent_obs is None and val_agent_obs is None:
    if agent_observable_data is not None:
        if user_train_portion is None:
            print("Did not specify train portion, train portion will be set to 80%")
            user_train_portion = 0.8
        train_agent_obs = agent_observable_data.iloc[:int(agent_observable_data.shape[0]*user_train_portion)]   
        train_historical = historical_data.loc[train_agent_obs.index]
        
        val_agent_obs = agent_observable_data.iloc[int(agent_observable_data.shape[0]*user_train_portion):]
        val_historical = historical_data.loc[val_agent_obs.index]
        
    elif build_agent_obs_off_users:
        if user_train_portion is not None:
            (train_agent_obs, val_agent_obs, 
             scaler_mins, scaler_maxs) = build_agent_obs_dataset(historical_data, train_portion=user_train_portion)   
        else:
            print("Did not specify train portion, train portion will be set to 80%")
            (train_agent_obs, val_agent_obs, 
             scaler_mins, scaler_maxs) = build_agent_obs_dataset(historical_data)
            
        train_historical = historical_data.loc[train_agent_obs.index]
        val_historical = historical_data.loc[val_agent_obs.index]
        
elif train_agent_obs is not None and val_agent_obs is not None:
    train_historical = historical_data.loc[train_agent_obs.index]
    val_historical = historical_data.loc[val_agent_obs.index]

# Agent-observable dataset in training
train_agent_obs = np.array(train_agent_obs)

# Unprocessed price dataset for reward calculation in training
train_historical = np.array(train_historical)

# Agent-observable dataset in validation
val_agent_obs = np.array(val_agent_obs)

# Unprocessed price dataset for reward calculation in validation
val_historical = np.array(val_historical)

### Experience Replay Buffer <a name='replay'></a>
#### [Details]
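
The buffer defined in this section is a fixed-size ring: once it is full, the write index wraps around and the oldest experiences are overwritten. The stripped-down sketch below shows only that indexing, using a plain list, a hypothetical capacity of 3, and made-up rewards.

```python
# Minimal ring-buffer sketch (hypothetical capacity of 3, storing only rewards)
capacity = 3
rewards = [None] * capacity
counter = 0

for reward in [0.1, 0.2, 0.3, 0.4, 0.5]:
    index = counter % capacity      # wraps back to 0 once the buffer is full
    rewards[index] = reward
    counter += 1

# Only the filled part of the buffer may be sampled from
current_size = min(counter, capacity)
print(rewards, current_size)
```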

In [None]:
class ReplayBuffer:
    def __init__(self, max_size, input_shape, n_actions): 
        self.mem_size = max_size # The maximum size of the replay buffer
        self.mem_counter = 0 
        self.state_memory = np.empty((self.mem_size, *input_shape)) * np.nan
        self.new_state_memory = np.empty((self.mem_size, *input_shape)) * np.nan
        self.action_memory = np.empty((self.mem_size, n_actions)) * np.nan
        self.reward_memory = np.empty(self.mem_size) * np.nan
        self.terminal_memory = np.empty(self.mem_size, dtype=bool) * np.nan
        
    def store_transition(self, state, action, reward, new_state, done):
        """
        Store an experience into the experience replay buffer
        """
        
        index = self.mem_counter % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = new_state
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = done 
        self.mem_counter += 1
        
    def sample_buffer(self, batch_size):
        """
        Sample experience batches to train our model
        """

        current_mem_size = min(self.mem_counter, self.mem_size)
        batch = np.random.choice(current_mem_size, batch_size, replace=False)
        state_batch = tf.convert_to_tensor(self.state_memory[batch])
        next_state_batch = tf.convert_to_tensor(self.new_state_memory[batch])
        action_batch = tf.convert_to_tensor(self.action_memory[batch])
        reward_batch = tf.convert_to_tensor(self.reward_memory[batch])
        done_batch = tf.convert_to_tensor(self.terminal_memory[batch])
        
        return state_batch, next_state_batch, action_batch, reward_batch, done_batch

### Actor Network & Critic Network <a name='ac_cr'></a>
#### [Details]
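
One detail of the actor in this section is the scaling factor applied to its raw ```tanh``` outputs: the sum of squared actions divided by the squared sum of absolute actions. The NumPy sketch below evaluates that factor for a single observation. The helper name ```allocation_factor``` and the example actions are made up for this illustration.

```python
import numpy as np

def allocation_factor(actions):
    """Sum of squares over squared sum of absolute values; 0 if all actions are 0."""
    abs_sum = np.sum(np.abs(actions))
    if abs_sum == 0:
        return 0.0
    return float(np.sum(actions ** 2) / abs_sum ** 2)

single = np.array([[0.6]])        # one action (e.g. one currency pair)
spread = np.array([[0.5, -0.5]])  # two actions with equal absolute size

print(allocation_factor(single))
print(allocation_factor(spread))
```

For one observation with a single action the factor is 1, so the action passes through unchanged. With exposure spread evenly over two actions it drops to 0.5.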

In [None]:
class ActorLayer(layers.Layer):
    """
    Hidden layer in the Actor network
    """
    
    def __init__(self, fc_dim, activation='relu'):
        super(ActorLayer, self).__init__()
        
        self.dense = layers.Dense(fc_dim, activation=activation)
        
    def call(self, state):
        prob = self.dense(state)
        return prob

In [None]:
class ActorNetwork(keras.Model):
    """
    Approximation for the optimal actions given the observations
    """

    def __init__(self, fc_dim=512, num_layers=2, activation='relu',
                 n_actions=1, name='actor'):        
        super(ActorNetwork, self).__init__()
        
        self.num_layers = num_layers       
        self.actorlayers = [ActorLayer(fc_dim, activation) for _ in range(num_layers)]
        self.mu = layers.Dense(n_actions, activation='tanh', 
                               kernel_initializer=tf.random_uniform_initializer(minval=-0.003, maxval=0.003))
        
    def call(self, state):
        for i in range(self.num_layers):
            state = self.actorlayers[i](state)
        
        actions = self.mu(state)
        
        if tf.reduce_sum(abs(actions)) == 0:   # Practically will not happen, but technically can
            bal_allocation = tf.cast(0, tf.float32)
        else:
            # bal_allocation is set to account for when n_actions >= 2
            bal_allocation = tf.cast(tf.reduce_sum(actions ** 2) / (tf.reduce_sum(abs(actions)) ** 2), tf.float32)
        return actions * bal_allocation

In [None]:
class CriticLayer(layers.Layer):
    """
    Hidden layer in the Critic network
    """

    def __init__(self, fc_dim, activation='relu'):
        super(CriticLayer, self).__init__()
        
        self.dense = layers.Dense(fc_dim, activation=activation)
        
    def call(self, state_action):
        value = self.dense(state_action)
        return value

In [None]:
class CriticNetwork(keras.Model):
    """
    Approximation for the Q-values
    """

    def __init__(self, fc_dim=512, num_layers=2, activation='relu',
                 name='critic'):
        super(CriticNetwork, self).__init__()
        
        self.num_layers = num_layers
        self.q1_criticlayers = [CriticLayer(fc_dim, activation) for _ in range(num_layers)]
        self.q1_output_layer = layers.Dense(1, activation=None)
        
        self.q2_criticlayers = [CriticLayer(fc_dim, activation) for _ in range(num_layers)]
        self.q2_output_layer = layers.Dense(1, activation=None)
        
    def call(self, state_input, action_input):
        q1_state_action = layers.concatenate([state_input, action_input])
        q2_state_action = layers.concatenate([state_input, action_input])
        
        for i in range(self.num_layers):
            q1_state_action = self.q1_criticlayers[i](q1_state_action)
            q2_state_action = self.q2_criticlayers[i](q2_state_action)
            
        q1 = self.q1_output_layer(q1_state_action)
        q2 = self.q2_output_layer(q2_state_action)
        return q1, q2
    
    def Q1(self, state_input, action_input):
        q1_state_action = layers.concatenate([state_input, action_input])
        
        for i in range(self.num_layers):
            q1_state_action = self.q1_criticlayers[i](q1_state_action)
            
        q1 = self.q1_output_layer(q1_state_action)
        return q1

### Twin Delayed Deep Deterministic Policy Gradients <a name='td3'></a>
#### [Details]
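
Before the full class, here is a small NumPy sketch of two ideas TD3 uses in the critic update: target policy smoothing (clipped noise added to the target action) and clipped double Q-learning (taking the minimum of the two target critics). All numbers are made up, and the arrays are stand-ins for network outputs rather than values produced by the networks in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, policy_noise, noise_clip = 0.995, 0.2, 0.5

# Stand-ins for a batch of 3 transitions
reward = np.array([0.1, -0.2, 0.3])
done = np.array([0.0, 0.0, 1.0])                # last transition is terminal
target_actor_out = np.array([0.9, -0.4, 0.2])   # target actor's action for the next state
target_q1 = np.array([1.0, 2.0, 3.0])           # first target critic (placeholder values)
target_q2 = np.array([1.5, 1.0, 4.0])           # second target critic (placeholder values)

# Target policy smoothing: add clipped Gaussian noise, then clip to the action range
noise = np.clip(rng.normal(size=3) * policy_noise, -noise_clip, noise_clip)
target_action = np.clip(target_actor_out + noise, -1.0, 1.0)

# Clipped double Q-learning: bootstrap from the smaller of the two target estimates;
# terminal transitions keep only the immediate reward
target_q = reward + (1.0 - done) * gamma * np.minimum(target_q1, target_q2)
print(target_action, target_q)
```

In the actual update the target critics are evaluated at the smoothed target action. Here their outputs are fixed placeholders so that the arithmetic is easy to follow.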

In [None]:
class TD3:
    """
    Twin Delayed Deep Deterministic Policy Gradient
    """

    def __init__(self, 
                 state_dim, 
                 n_actions=1,
                 max_action=1,
                 min_action=-1,
                 gamma=0.995, 
                 tau=0.005,
                 exploration_noise=0.2, 
                 policy_noise=0.2, 
                 noise_clip=0.5, 
                 policy_freq=2, 
                 fc_dim=512, 
                 num_actor_layers=2, 
                 num_critic_layers=2,
                 activation='relu',
                 ac_lr=3e-4, 
                 cr_lr=3e-4,
                 batch_size=64,
                 max_memory_size=100000, 
                 uniform_action_steps=10000,
                 ckpt_name="ckpt", 
                 model_name="TD3"):
        
        """
        Parameters
        ----------
        state_dim : Number of features/variables in the agent-observable dataset

        n_actions : Number of actions the agent can take

        max_action : Maximum action from the actor, default 1 (i.e. 100%)

        min_action : Minimum action from the actor, default -1 (i.e. -100%)
         
        gamma : Farsightedness, how much the agent values future return, default 0.995
        
        tau : Target networks update rate, default 0.005

        exploration_noise : Standard deviation of noise added to actor's action in training, default 0.2

        policy_noise: Standard deviation of noise added to target_actor's action batch, default 0.2

        noise_clip : Clip values of added noise to target_actor's action batch, default 0.5

        policy_freq : Policy update frequency, default 2

        fc_dim : Number of nodes in one hidden dense layer, default 512

        num_actor_layers : Number of hidden layers in each actor network, default 2

        num_critic_layers : Number of hidden layers in each critic network, default 2

        activation : Hidden layers activation function in all actors and critics (keras dense) networks, default "relu"

        ac_lr : Learning rate of the actors, default 3e-4

        cr_lr : Learning rate of the critics, default 3e-4

        batch_size : Batch size of each experience sample, default 64

        max_memory_size : Maximum replay buffer memory size, default 100000

        uniform_action_steps : Steps of consecutive uniformly sampled actions in the beginning of training, default 10000

        ckpt_name : Name of the checkpoint directory for model weights, default "ckpt"
        
        model_name : Name of the Model, default "TD3"
        """
        
        self.n_actions = n_actions
        self.gamma = tf.cast(gamma, dtype=tf.float32)
        self.tau = tau
        self.memory = ReplayBuffer(max_memory_size, (state_dim,), n_actions)
        self.batch_size = batch_size
        self.max_action = max_action
        self.min_action = min_action
        self.uniform_action_steps = uniform_action_steps
        self.exploration_noise = exploration_noise
        self.policy_noise = policy_noise
        self.noise_clip = noise_clip
        self.policy_freq = policy_freq
        self.total_steps = 0
        self.ckpt_dir = f"{model_name}/{ckpt_name}"
        
        self.actor = ActorNetwork(fc_dim=fc_dim, 
                                  num_layers=num_actor_layers, 
                                  activation=activation,
                                  n_actions=n_actions,
                                  name=f"actor")
        
        self.target_actor = ActorNetwork(fc_dim=fc_dim, 
                                         num_layers=num_actor_layers, 
                                         activation=activation,
                                         n_actions=n_actions,
                                         name=f"target_actor")
        
        self.critic = CriticNetwork(fc_dim=fc_dim, 
                                    num_layers=num_critic_layers, 
                                    activation=activation,
                                    name=f"critic")
        
        self.target_critic = CriticNetwork(fc_dim=fc_dim, 
                                           num_layers=num_critic_layers, 
                                           activation=activation,
                                           name=f"target_critic")
        
        self.actor.compile(optimizer=Adam(learning_rate=ac_lr))
        self.critic.compile(optimizer=Adam(learning_rate=cr_lr))
        self.target_actor.compile(optimizer=Adam(learning_rate=ac_lr))
        self.target_critic.compile(optimizer=Adam(learning_rate=cr_lr))
        
        self.update_network_parameters(tau=1)
        
        
    def update_network_parameters(self, tau=None):
        """
        Target networks weights update
        """

        if tau is None:
            tau = self.tau
            
        updates = [current_w*tau + target_w*(1-tau) for current_w, target_w in
                   zip(self.actor.weights, self.target_actor.weights)]
        self.target_actor.set_weights(updates)
        
        updates = [current_w*tau + target_w*(1-tau) for current_w, target_w in
                   zip(self.critic.weights, self.target_critic.weights)]
        self.target_critic.set_weights(updates)
        
        
    def choose_action(self, observation=None, training=False, explore=False):
        """
        Return actions from the actor network
        """
        
        if not explore:
            actions = self.actor(observation)
            if training:
                actions = tf.clip_by_value(
                    tf.add(actions, tf.multiply(tf.random.normal(shape=actions.shape), 
                                                self.exploration_noise)), 
                    clip_value_min=self.min_action, 
                    clip_value_max=self.max_action)
        else:
            actions = tf.random.uniform(minval=self.min_action, maxval=self.max_action, shape=(1,self.n_actions))  
            # technically tf.random.uniform does not include the upper bound but it practically does not make a difference
        return actions
    
    
    @tf.function
    def update(self, state_batch, next_state_batch, action_batch, reward_batch, done_batch):
        """
        Update weights of the networks
        """

        with tf.GradientTape() as tape:
            noise = tf.clip_by_value(tf.multiply(tf.random.normal(shape=action_batch.shape), 
                                                 self.policy_noise), 
                                     clip_value_min=-self.noise_clip, 
                                     clip_value_max=self.noise_clip)
            
            target_action_batch = tf.clip_by_value(tf.add(self.target_actor(next_state_batch), noise),
                                                   clip_value_min=self.min_action, 
                                                   clip_value_max=self.max_action)

            target_Q1, target_Q2 = self.target_critic(next_state_batch, target_action_batch)
            target_Q = tf.squeeze(tf.math.minimum(target_Q1, target_Q2), 1)
            done_batch = tf.cast(done_batch, dtype=tf.float32)
            reward_batch = tf.cast(reward_batch, dtype=tf.float32)
            target_Q = reward_batch + (1-done_batch) * self.gamma * target_Q
            
            current_Q1, current_Q2 = self.critic(state_batch, action_batch)
            current_Q1 = tf.squeeze(current_Q1, 1)
            current_Q2 = tf.squeeze(current_Q2, 1)

            critic_loss = keras.losses.MSE(current_Q1, target_Q) + keras.losses.MSE(current_Q2, target_Q)
            
        critic_network_gradient = tape.gradient(critic_loss, self.critic.trainable_variables)
        self.critic.optimizer.apply_gradients(zip(critic_network_gradient, self.critic.trainable_variables))

        if self.total_steps % self.policy_freq == 0:
            with tf.GradientTape() as tape:
                actor_loss = tf.math.reduce_mean(-self.critic.Q1(state_batch, self.actor(state_batch)))

            actor_network_gradient = tape.gradient(actor_loss,self.actor.trainable_variables)
            self.actor.optimizer.apply_gradients(zip(actor_network_gradient, self.actor.trainable_variables))
        
        
    def learn(self):
        """
        Perform a complete learning step
        """

        self.total_steps += 1
        if self.memory.mem_counter < self.batch_size:
            return
        
        (state_batch, next_state_batch, action_batch, 
         reward_batch, done_batch) = self.memory.sample_buffer(self.batch_size)
        
        self.update(state_batch, next_state_batch, action_batch, reward_batch, done_batch)
        self.update_network_parameters()        
        
        
    def remember(self, state, action, reward, new_state, done):
        """
        Store an experience into the experience replay buffer
        """

        self.memory.store_transition(state, action, reward, new_state, done)
        
        
    def save_models(self, game_num):
        """
        Save the weights of all four networks to the checkpoint directory
        """

        self.actor.save_weights(f"{self.ckpt_dir}/actor_game{game_num}.tf")
        self.target_actor.save_weights(f"{self.ckpt_dir}/target_actor_game{game_num}.tf")
        self.critic.save_weights(f"{self.ckpt_dir}/critic_game{game_num}.tf")
        self.target_critic.save_weights(f"{self.ckpt_dir}/target_critic_game{game_num}.tf")

        
    def load_models(self, game_num):
        """
        Load the weights of all four networks from the checkpoint directory
        """

        self.actor.load_weights(f"{self.ckpt_dir}/actor_game{game_num}.tf")
        self.target_actor.load_weights(f"{self.ckpt_dir}/target_actor_game{game_num}.tf")
        self.critic.load_weights(f"{self.ckpt_dir}/critic_game{game_num}.tf")
        self.target_critic.load_weights(f"{self.ckpt_dir}/target_critic_game{game_num}.tf")

### Forex Spread-Betting Environment <a name='env'></a> 
#### [Details]
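
As a rough illustration of the reward logic implemented in `ForexEnv.get_reward`, the scalar sketch below reproduces the calculation for a single currency pair in plain Python. The function name `spread_bet_reward`, its argument names, and the example prices are assumptions made for illustration; the notebook itself operates on TensorFlow tensors and supports several pairs at once.

```python
PIP_SCALE = 10000  # converts a price change into pips, as in ForexEnv.get_reward


def spread_bet_reward(action, balance, p, bid_now, ask_now, bid_next, ask_next):
    """Scalar sketch of the one-step reward for a single currency pair (hypothetical helper)."""
    # Stake per pip: |action| * p% of the current balance
    stake_per_pip = abs(action) * 0.01 * p * balance
    if action > 0:
        # Long: enter at the current ask, valued at the next bid
        pips = PIP_SCALE * (bid_next - ask_now)
    elif action < 0:
        # Short: enter at the current bid, valued at the next ask
        pips = PIP_SCALE * (bid_now - ask_next)
    else:
        pips = 0.0
    return stake_per_pip * pips


# Example prices (made up): the market moves up by 12 pips with a 2-pip spread
long_reward = spread_bet_reward(1.0, 1.0, 0.1, 1.1000, 1.1002, 1.1012, 1.1014)
short_reward = spread_bet_reward(-1.0, 1.0, 0.1, 1.1000, 1.1002, 1.1012, 1.1014)
print(long_reward, short_reward)
```

Because entries and exits are priced on opposite sides of the book, the spread is paid on every step in which a position is held, so a long gains 10 pips here while a short loses 14.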

In [None]:
class ForexEnv:
    """
    Forex Environment that the agent interacts with in training and validation
    """
    def __init__(self,
                 agent_obs_arrays,
                 bidask_arrays,
                 initial_balance=1,
                 bid_col=3,
                 ask_col=7,
                 p=0.1,
                 n_actions=1):
        """
        Parameters
        ----------
        agent_obs_arrays: Agent observable arrays, representation of market's states, where agent_obs_arrays.shape = (timestamps, num_features)
         

        bidask_arrays: Unprocessed price data for reward calculation, must include Bid and Ask prices for reward calculation
         

        initial_balance : Initial balance available to the agent, default 1 (i.e. 100%)
         

        bid_col: Column of Bid price in pandas.DataFrame format or index of Bid price in numpy.array format, to be used for reward calculation, default 3
        (A list of Bid columns can be passed when trading multiple currencies)
         
            e.g. bid_col = 3 when BidClose is used for reward calculation, where
            bidask_arrays.shape = (timestamps, ['BidOpen', 'BidHigh', 'BidLow', 'BidClose',
                                                'AskOpen', 'AskHigh', 'AskLow', 'AskClose'])

        ask_col: Column of Ask price in pandas.DataFrame format or index of Ask price in numpy.array format, to be used for reward calculation, default 7
        (A list of Ask columns can be passed when trading multiple currencies)

            e.g. ask_col = 7 when AskClose is used for reward calculation, where
            bidask_arrays.shape = (timestamps, ['BidOpen', 'BidHigh', 'BidLow', 'BidClose',
                                                'AskOpen', 'AskHigh', 'AskLow', 'AskClose'])

        p: p% of the entire balance that is controllable by the agent at each timestamp, default 0.1

        n_actions: Number of actions the agent takes at each timestamp, default 1
        """
        
        self.dataset = tf.data.Dataset.from_tensor_slices(
            (tf.convert_to_tensor(agent_obs_arrays[:, np.newaxis], tf.float32),
             tf.convert_to_tensor(bidask_arrays[:, np.newaxis], tf.float32)))
        self.dataset_len = self.dataset.cardinality().numpy()
        self.num_features = agent_obs_arrays.shape[-1] + n_actions + 1  
        
        self.initial_balance = initial_balance
        self.bid_col = bid_col
        self.ask_col = ask_col
        self.p = p
        self.n_actions = n_actions
                    
            
    def is_done(self):
        """
        Determines if the agent has arrived at a terminal state
        """
        
        return (self.current_pos==self.dataset_len) or (self.balance<0)
    
    
    def get_state(self):
        """
        Return the state of the timestamp
        """
        
        state, ba = self.iterator.get_next()
        state = np.concatenate([self.current_action, state], axis=1)
        self.current_pos += 1
        
        return state, ba

    
    def get_reward(self, reward_state, reward_state_):
        """
        Reward calculation and agent's episodic balance update
        """
        
        short_reward = tf.reshape(tf.gather(reward_state_[0], indices=tf.constant(self.ask_col))
                                  - tf.gather(reward_state[0], indices=tf.constant(self.bid_col)),
                                  (1, -1))
        long_reward = tf.reshape(tf.gather(reward_state_[0], indices=tf.constant(self.bid_col))
                                  - tf.gather(reward_state[0], indices=tf.constant(self.ask_col)),
                                  (1, -1))
        reward = ((tf.cast(tf.less(self.current_action, 0), dtype=tf.float32) 
                   * self.current_action
                   * 0.01 * self.p * self.balance 
                   * 10000 * short_reward) # 10000 here defines the change in reward given the pip_delta
                  
                  + (tf.cast(tf.greater(self.current_action, 0), dtype=tf.float32)
                     * self.current_action
                     * 0.01 * self.p * self.balance 
                     * 10000 * long_reward)) # 10000 here defines the change in reward given the pip_delta
        
        reward = tf.reshape(tf.reduce_sum(reward), (1, -1))
        self.balance += reward
        
        return reward
    
    
    def step(self, action, reward_state):
        """
        Move to the next timestamp
        """
        
        self.current_action = action
        state_, reward_state_ = self.get_state()
        reward = self.get_reward(reward_state, reward_state_)
        state_ = np.concatenate([self.balance, state_], axis=1)
        
        return state_, reward, self.is_done(), reward_state_, None
    
    
    def reset(self, evaluate=False):
        """
        Reset to a new episode
        """
        
        self.iterator = iter(self.dataset)
        self.current_pos = 0
        self.balance = tf.convert_to_tensor([[self.initial_balance]], dtype=tf.float32)
        self.current_action = tf.zeros(shape=(1, self.n_actions), dtype=tf.float32)
        if not evaluate:
            # Skip ahead a random number of timestamps so that each training episode starts at a different position
            steps = np.random.randint(low=1, high=self.dataset_len-2)
            for _ in range(steps):
                observation, reward_state = self.get_state()
        else:
            observation, reward_state = self.get_state()
        observation = tf.concat([self.balance, observation], axis=1)
        
        return observation, reward_state

### Set hyperparameters <a name="hparms"></a>  
#### [Training and Validation]
#### **(User-input optional)**<br>  

For details, refer to ```TD3``` class & ```ForexEnv``` class / explanation paper
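
To make the roles of `gamma`, `policy_noise`, and `noise_clip` more concrete, the scalar sketch below mirrors the target computation in `TD3.update`: clipped Gaussian noise is added to the target actor's action, and the smaller of the two target critic values is discounted. The function names and the example numbers are assumptions for illustration; the notebook performs the same steps on batched tensors.

```python
import random


def smoothed_target_action(target_action, rng, policy_noise=0.15, noise_clip=0.3,
                           min_action=-1.0, max_action=1.0):
    """Target policy smoothing for one scalar action (hypothetical helper)."""
    # Gaussian noise with standard deviation policy_noise, clipped to +/- noise_clip
    noise = rng.gauss(0.0, policy_noise)
    noise = max(-noise_clip, min(noise_clip, noise))
    # The noisy action is clipped back into the valid action range
    return max(min_action, min(max_action, target_action + noise))


def td3_target(reward, done, q1_next, q2_next, gamma=0.995):
    """Clipped double-Q target: bootstrap from the smaller of the two target critics."""
    return reward + (1.0 - float(done)) * gamma * min(q1_next, q2_next)


rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy_action = smoothed_target_action(0.9, rng)
target = td3_target(reward=0.01, done=False, q1_next=2.0, q2_next=1.5)
terminal_target = td3_target(reward=0.01, done=True, q1_next=2.0, q2_next=1.5)
print(noisy_action, target, terminal_target)
```

With these numbers the non-terminal target is about 1.5025 (0.01 + 0.995 × 1.5), while the terminal target is just the reward, since `done` removes the bootstrapped term.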

In [None]:
num_games = 2000             # Number of training games 
initial_balance = 1          # Initial balance of the agent
n_actions = 1                # Number of actions the agent takes at each timestamp
max_action = 1               # Maximum amount of the p% balance the agent can long at each timestamp (e.g. 1, i.e. long 100%)
min_action = -1              # Maximum amount of the p% balance the agent can short at each timestamp (e.g. -1, i.e. short 100%)
bid_col = 3                  # The position of the Bid column in the bidask dataframe to use for reward calculation. 
                             # (For details, please refer to ForexEnv class / explanation paper / Define Agent-observable dataset)
ask_col = 7                  # The position of the Ask column in the bidask dataframe to use for reward calculation. 
                             # (For details, please refer to ForexEnv class / explanation paper / Define Agent-observable dataset)
gamma = 0.995                # Discount factor (farsightedness): how much the agent values future returns
tau = 0.002                  # Target networks update rate
exploration_noise = 0.3      # Standard deviation of normally sampled noise added to actor's action in training
p = 0.1                      # p% of the entire balance that is controllable by the agent at each timestamp
policy_noise = 0.15          # Standard deviation of normally sampled noise added to target_actor's action batch
noise_clip = 0.3             # Clip values of added noise to target_actor's action batch
activation = 'relu'          # Hidden layers activation function in all actors and critics (keras dense) networks
policy_freq = 4              # Policy update frequency
fc_dim = 1024                # Number of nodes in each hidden dense layer
ac_lr = 1e-5                 # Learning rate of the actors
cr_lr = 1e-5                 # Learning rate of the critics
batch_size = 64              # Batch size of each experience sample
max_memory_size = 1000000    # Maximum replay buffer memory size
uniform_action_steps = 10000 # Consecutive steps of uniformly sampled actions in the beginning of training
num_actor_layers = 4         # Number of hidden layers in each actor network
num_critic_layers = 4        # Number of hidden layers in each critic network
eval_freq = 10               # Frequency of evaluation during training
model_save_freq = 10         # Frequency of saving the model's weights
hist_save_freq = 30          # Frequency of saving the history
model_name = "TD3"           # Name of the model
ckpt_name = "ckpt"           # Name of the model's weights checkpoint folder
graph_name = "graphs"        # Name of the graphs folder
hist_name = "history"        # Name of the history folder
restore_ckpt_num = None      # Any checkpoint number to be restored from the checkpoint folder (TD3.load_models() will be called)

### Training and Validation <a name='train'></a>
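
The training loop in this section prints an `Avg_reward` figure for every game, computed as the geometric mean return per timestamp. The sketch below shows that calculation on its own; the helper name `avg_step_return` is an assumption, and the explicit division by `initial_balance` is added here for generality (the notebook's expression omits it, which is equivalent when `initial_balance = 1`).

```python
def avg_step_return(ending_balance, n_steps, initial_balance=1.0):
    """Geometric mean return per timestamp over an episode (hypothetical helper)."""
    growth = ending_balance / initial_balance
    return growth ** (1.0 / n_steps) - 1.0


# A balance that grows from 1 to 1.21 over two timestamps averages about 10% per step
example_return = avg_step_return(1.21, 2)
print(example_return)
```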

In [None]:
ckpt_dir = f"{model_name}/{ckpt_name}"
graph_dir = f"{model_name}/{graph_name}"
hist_dir = f"{model_name}/{hist_name}"

# Create the output directories if they do not exist yet
for directory in (ckpt_dir, graph_dir, hist_dir):
    os.makedirs(directory, exist_ok=True)

history = {}

train_env = ForexEnv(agent_obs_arrays=train_agent_obs, bidask_arrays=train_historical,
                     initial_balance=initial_balance, bid_col=bid_col, 
                     ask_col=ask_col, p=p, n_actions=n_actions)

val_env = ForexEnv(agent_obs_arrays=val_agent_obs, bidask_arrays=val_historical,
                   initial_balance=initial_balance, bid_col=bid_col, 
                   ask_col=ask_col, p=p, n_actions=n_actions)

agent = TD3(state_dim=train_env.num_features, n_actions=n_actions, gamma=gamma, 
            tau=tau, exploration_noise=exploration_noise, policy_noise=policy_noise, 
            noise_clip=noise_clip, activation=activation, policy_freq=policy_freq, fc_dim=fc_dim, 
            ac_lr=ac_lr, cr_lr=cr_lr, batch_size=batch_size, max_memory_size=max_memory_size, 
            uniform_action_steps=uniform_action_steps, num_actor_layers=num_actor_layers, 
            num_critic_layers=num_critic_layers, max_action=max_action, min_action=min_action, 
            ckpt_name=ckpt_name, model_name=model_name)

if restore_ckpt_num is not None:
    agent.update(np.ones((batch_size, train_env.num_features)),
                 np.ones((batch_size, train_env.num_features)),
                 np.ones((batch_size, n_actions)),
                 np.ones((batch_size, )),
                 np.ones((batch_size, )))
    agent.update_network_parameters()
    agent.load_models(restore_ckpt_num)
    tbar = trange(restore_ckpt_num+1, restore_ckpt_num+num_games+1)
    uniform_action_count = uniform_action_steps + 1
    print("Checkpoint Restored")

else:
    tbar = trange(1, num_games+1)
    uniform_action_count = 0

for i in tbar:
    if (uniform_action_count > uniform_action_steps) and (i % eval_freq == 0):
        evaluate = True
        env = val_env
    else:
        evaluate = False
        env = train_env
    observation, reward_state = env.reset(evaluate=evaluate)
    done = False
    score_list = []
    balance_list = []
    starting_position = env.current_pos - 1
    while not done:
        if uniform_action_count <= uniform_action_steps:
            action = agent.choose_action(explore=True)
            uniform_action_count += 1
        else:
            action = agent.choose_action(observation=observation, training=not evaluate)
        observation_, reward, done, reward_state_, info = env.step(action, reward_state)
        score_list.append(np.reshape(reward, (1,))[0])
        balance_list.append(np.reshape(env.balance, (1,))[0])
        if not evaluate:
            agent.remember(observation, action, reward, observation_, done)
            agent.learn()
        observation = observation_
        reward_state = reward_state_
        if env.current_pos % 200 == 0:
            tbar.set_description(f"Game {i} Current Position {env.current_pos}\
            Current Balance: {env.balance[0,0].numpy():.3f} Training Steps: {agent.total_steps}")

    history[f"{'Val ' * evaluate}Game{i} Score"] = score_list
    history[f"{'Val ' * evaluate}Game{i} Balance"] = balance_list
    
    print(f"{'Val ' * evaluate}Game {i} \
    Starting Position {starting_position} \
    Ending Position {env.current_pos} \
    Ending Balance {env.balance[0,0].numpy():.3f} \
    Avg_reward {(env.balance[0,0].numpy()) ** (1 / (env.current_pos - starting_position)) - 1}")

    if (i % model_save_freq == 0) and (uniform_action_count > uniform_action_steps):
        agent.save_models(i)

    if (i % hist_save_freq == 0):
        with open(f"{hist_dir}/history_up_to_game{i}.pkl", 'wb') as f:
            pickle.dump(history, f, pickle.HIGHEST_PROTOCOL)

    if evaluate:
        plt.plot(np.array(balance_list))
        plt.xlabel("Timestamps")
        plt.ylabel("Balance")
        plt.title(f"Val Game {i}")
        plt.savefig(fname=f"{graph_dir}/Val Game {i}")
        plt.show()

### Agent-Iteration Testing <font size=4>*(If validation results do not stabilise)*</font>  <a name='test'></a>
#### [Testing]
#### **(User-input required)**
The user can evaluate any particular iteration of the model if the model's performance does not stabilise<br><br>  

#### 1. Historical price dataset for testing
```test_historical```, same requirements as the historical price dataset for training and validation, but will be used for testing.
           
<br><br>
#### 2. An agent-observable dataset
```test_agent_obs```, same requirements as the agent-observable dataset for training and validation, but will be used for testing<br><br>
I. User can import custom agent-observable dataset by defining ```test_agent_obs```  
- Ensure that the entry and exit timestamps (indices) in ```test_historical``` correspond to those in ```test_agent_obs```<br><br>

II. User can build a ```test_agent_obs``` from ```test_historical```, the same way the ```agent_observable_data``` for training and validation is built by calling ```build_agent_obs_dataset()```.  
- Note: the first 55 trading days in ```test_historical``` will not appear in ```test_agent_obs```, because of preprocessing.

<br><br>
#### 3. Load weights of the actor(s)
The model's weights have to be restored to be tested<br><br>

- User can load any particular iteration of the model by  
    - defining ```restore_ckpt_dir``` (if not already defined during training), default ```ckpt_dir```
    - and appending the actor checkpoint filename(s) within the checkpoint directory to the ```load_ckpts``` list  
        - e.g. To restore ```"./TD3/ckpt/actor_game100.tf"``` and ```"./TD3/ckpt/actor_game120.tf"```  
          ```restore_ckpt_dir = "./TD3/ckpt"```    
          ```load_ckpts = ["actor_game100.tf", "actor_game120.tf"]```<br><br>
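
Since `TD3.save_models` writes actor weights as `actor_game{game_num}.tf`, the `load_ckpts` list can also be generated from game numbers instead of being typed by hand. The helper below is a small sketch; the name `actor_ckpt_names` and the game numbers are assumptions for illustration.

```python
def actor_ckpt_names(game_nums):
    """Build actor checkpoint filenames following the save_models naming (hypothetical helper)."""
    return [f"actor_game{n}.tf" for n in game_nums]


# Game numbers chosen only as an example
load_ckpts = actor_ckpt_names([100, 120])
print(load_ckpts)
```

Only games that were actually saved (multiples of `model_save_freq` after the uniform-action phase) have matching files in the checkpoint directory.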

In [None]:
# 1. Unprocessed dataset in testing environment for reward calculation (user-input)
test_historical = 

# 2I. Agent-observable dataset in testing environment
test_agent_obs = 

# 3. Define the checkpoint directory and the checkpoints to load
restore_ckpt_dir = ckpt_dir
load_ckpts = []

In [None]:
test_historical = np.array(test_historical.loc[test_agent_obs.index])
test_agent_obs = np.array(test_agent_obs)
    
evaluate = True

ai_eval_env = ForexEnv(agent_obs_arrays=test_agent_obs, bidask_arrays=test_historical,
                       initial_balance=initial_balance, bid_col=bid_col, 
                       ask_col=ask_col, p=p, n_actions=n_actions)
    
ai_eval_actor = ActorNetwork(fc_dim=fc_dim, 
                             num_layers=num_actor_layers, 
                             activation=activation,
                             n_actions=n_actions)


ai_eval_history = {}
for ckpt in load_ckpts:
    name = ckpt.rsplit(".", 1)[0]
    ai_eval_actor(tf.ones((1, ai_eval_env.num_features)))
    ai_eval_actor.load_weights(f"{restore_ckpt_dir}/{ckpt}")
    print(f"Checkpoint {name} Restored")
    
    observation, reward_state = ai_eval_env.reset(evaluate=evaluate)
    done = False
    score_list = []
    balance_list = []
    starting_position = ai_eval_env.current_pos - 1
    while not done:
        action = ai_eval_actor(observation)
        observation_, reward, done, reward_state_, info = ai_eval_env.step(action, reward_state)
        score_list.append(np.reshape(reward,(1,))[0])
        balance_list.append(np.reshape(ai_eval_env.balance,(1,))[0])
        observation = observation_
        reward_state = reward_state_

    ai_eval_history[f"{name} Weights Test Score"] = score_list
    ai_eval_history[f"{name} Weights Test Balance"] = balance_list
    print(f"{name} Weights Test \
    Starting Position {starting_position} \
    Ending Position {ai_eval_env.current_pos} \
    Ending Balance {ai_eval_env.balance[0,0].numpy():.3f} \
    Avg_reward {(ai_eval_env.balance[0,0].numpy()) ** (1 / (ai_eval_env.current_pos - starting_position)) - 1}")

    plt.plot(np.array(balance_list))
    plt.xlabel("Timestamps")
    plt.ylabel("Balance")
    plt.title(f"{name} Test")
    plt.savefig(fname=f"{graph_dir}/{name} Test.png")
    plt.show()

### Simple implementation within the notebook <a name='simple_imp'></a>

Restore a trained model and generate the estimated optimal actions given current data

1. Load weights for an actor by defining ```restore_actor_ckpt_dir``` and ```actor_ckpt```  
e.g. To restore ```"./TD3/ckpt/actor_game100.tf"```  
```restore_actor_ckpt_dir = "./TD3/ckpt"```  
```actor_ckpt = "actor_game100.tf"```<br><br>  

2. Input the observations  
    - Market observation at one timestamp (usually the last timestamp)  
    e.g. ```market_observation = recent_df.iloc[-1]```<br><br>  

3. Receive the model's estimated optimal action  
    - The model (actor) estimates the optimal actions given the observation
    - Then long `(+)` or short `(-)` ```p% * actions``` of current balance 
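
As a hedged illustration of step 3, the sketch below turns one actor output into a trade direction and a stake per pip, using the same scaling as `ForexEnv.get_reward` (`|action| * 0.01 * p * balance`). The helper name `stake_per_pip` and the example values are assumptions; how the stake maps to an order depends on the broker.

```python
def stake_per_pip(action, balance, p=0.1):
    """Translate one actor output into a direction and a stake per pip (hypothetical helper)."""
    if action > 0:
        direction = "long"
    elif action < 0:
        direction = "short"
    else:
        direction = "flat"
    # Same scaling as the training environment: |action| * p% of the current balance
    return direction, abs(action) * 0.01 * p * balance


# Example values (made up): the actor outputs -0.5 with a balance of 2.0
direction, stake = stake_per_pip(-0.5, 2.0)
print(direction, stake)
```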

In [None]:
# Define the directory to the actor's weights to be restored
restore_actor_ckpt_dir = 

# Define the actor's weights to be restored
actor_ckpt = 

# Define the current balance
current_balance = 

# Define the opened trades/positions (actions at the last timestamp)
previous_actions = 

# Define the observation of the market (last index in the agent-observable dataset)
market_observation = 


current_balance = np.reshape(current_balance, (1,-1))
previous_actions = np.reshape(previous_actions, (1, -1))
market_observation = np.array(market_observation).reshape((1, -1))
observation = np.concatenate([current_balance, previous_actions, market_observation], axis=-1)
observation = tf.cast(observation, dtype=tf.float32)

# Create the actor
final_actor = ActorNetwork(fc_dim=fc_dim, 
                           num_layers=num_actor_layers, 
                           activation=activation,
                           n_actions=n_actions)

# Call the actor once so that its weights are built before loading
_ = final_actor(observation)

# Load weights
final_actor.load_weights(f"{restore_actor_ckpt_dir}/{actor_ckpt}")

# Receive output
actions = final_actor(observation)
print("Estimated optimal action:", actions[0].numpy())