We train a deep reinforcement
learning agent and obtain an ensemble trading strategy
using three actor-critic based algorithms:
Proximal Policy Optimization (PPO),
Advantage Actor Critic (A2C), and
Deep Deterministic Policy Gradient (DDPG).



The proposed deep ensemble strategy is shown to outperform
the three individual algorithms and two baselines in terms of
the risk-adjusted return measured by the Sharpe ratio.


Fundamental data (earnings reports) and alternative data (market news,
academic graph data, credit card transactions, GPS
traffic, etc.) are combined with machine learning algorithms
to extract new investment alphas or predict a company’s
future performance.

Thus, a predictive
alpha signal is generated to perform stock selection.


In this paper, we propose a novel ensemble strategy
that combines three deep reinforcement learning algorithms
and finds the optimal trading strategy in a complex and
dynamic stock market.


First, we build an environment and define
the action space, state space, and reward function.

Second, we
train the three algorithms that take actions in the environment.

Third, we ensemble the three agents together using
the Sharpe ratio, which measures the risk-adjusted return.

# critic-only learning approach

The critic-only learning approach, which is the most
common, solves a discrete action space problem using, for
example, Deep Q-learning (DQN) and its improvements,
and trains an agent on a single stock or asset.


The idea of the critic-only approach is to use a Q-value
function to learn the optimal action-selection policy
that maximizes the expected future reward given the current
state.

It focuses solely on evaluating the value function without explicitly learning a policy.

This approach is often used in value-based methods like Q-learning and Deep Q-Networks.

The major limitation of
the critic-only approach is that it only works with discrete
and finite state and action spaces, which is not practical for
a large portfolio of stocks, since the prices are of course
continuous.
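
As a minimal sketch of the critic-only idea, the tabular Q-learning update below learns Q-values for a toy problem with a handful of discrete states and actions. The table size, the learning rate `alpha`, and the discount factor `gamma` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Target: immediate reward plus the discounted value of the best next action.
    td_target = reward + gamma * np.max(q_table[next_state])
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return td_error

# Toy setting (assumed): 3 discrete states and 3 actions (0=sell, 1=hold, 2=buy).
q_table = np.zeros((3, 3))
td_error = q_learning_update(q_table, state=0, action=2, reward=1.0, next_state=1)

# The learned policy is implicit: pick the action with the highest Q-value.
greedy_action = int(np.argmax(q_table[0]))
```

The table needs one entry per (state, action) pair, which is why this approach does not scale to continuous prices or a large portfolio.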

# actor-only approach

The idea here is that the agent directly learns the optimal
policy itself.

Instead of having a neural network to learn the
Q-value, the neural network learns the policy.

The policy is
a probability distribution that is essentially a strategy for a
given state, namely the likelihood to take an allowed action.


An actor-only approach in Reinforcement Learning (RL) focuses on learning a policy directly without maintaining a value function. This is typically done using policy gradient methods, where the policy is optimized based on feedback from the environment.

Key Features
Policy Gradient: Directly optimizes the policy by adjusting its parameters in the direction of the gradient of expected rewards.

Exploration: A stochastic policy samples its actions, which provides exploration of the action space while the policy improves.

Convergence: Generally converges to a local optimum, which is not guaranteed to be the global one.

# actor-critic approach


The actor-critic approach in Reinforcement Learning (RL) combines elements of both policy-based (actor) and value-based (critic) methods. The actor decides the actions to take based on a policy, while the critic evaluates those actions by estimating the value function.

Over time, the actor learns to take better
actions and the critic gets better at evaluating those actions.

# MDP Model for Stock Trading

We model stock trading as a Markov Decision Process
(MDP).

State $s = [p, h, b]$: a vector that includes stock prices
$p \in \mathbb{R}_+^D$, the stock shares $h \in \mathbb{Z}_+^D$, and the remaining
balance $b \in \mathbb{R}_+$, where $D$ denotes the number of
stocks and $\mathbb{Z}_+$ denotes non-negative integers.



Action a: a vector of actions over D stocks. The
allowed actions on each stock include selling, buying,
or holding, which result in decreasing, increasing, and
no change of the stock shares h, respectively.


Reward $r(s, a, s')$: the direct reward of taking action
$a$ at state $s$ and arriving at the new state $s'$.


Policy $\pi(s)$: the trading strategy at state $s$, which is
the probability distribution of actions at state $s$.


Q-value $Q_\pi(s, a)$: the expected reward of taking action
$a$ at state $s$ following policy $\pi$.
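
The state definition can be sketched as a flat vector of length 2D + 1. The helper name `make_state` and the example numbers are assumptions made for illustration only.

```python
import numpy as np

def make_state(prices, shares, balance):
    """Concatenate prices p, holdings h, and balance b into one state vector."""
    prices = np.asarray(prices, dtype=float)
    shares = np.asarray(shares, dtype=float)
    if prices.shape != shares.shape:
        raise ValueError("prices and shares must both have length D")
    if np.any(prices <= 0) or np.any(shares < 0) or balance < 0:
        raise ValueError("prices must be positive; shares and balance non-negative")
    return np.concatenate([prices, shares, [balance]])

# D = 2 stocks: prices, then shares held, then the remaining cash balance.
state = make_state(prices=[100.0, 50.0], shares=[3, 0], balance=1000.0)
```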





At each state, one of three possible actions is
taken on stock $d$:

Selling $k[d] \in [1, h[d]]$ shares results in $h_{t+1}[d] =
h_t[d] - k[d]$, where $k[d] \in \mathbb{Z}_+$ and $d = 1, \ldots, D$.

Holding: $h_{t+1}[d] = h_t[d]$.

Buying $k[d]$ shares results in $h_{t+1}[d] = h_t[d] + k[d]$.


At time $t$ an action is taken and the stock prices update
at $t+1$; accordingly, the portfolio values may change from
“portfolio value 0” to “portfolio value 1”.
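
The three share updates can be written as a single signed vector, where a negative entry sells, zero holds, and a positive entry buys. This is only a sketch: the function name `apply_action` is hypothetical, and capping sells at the current holdings is one assumed way to keep $h$ non-negative.

```python
import numpy as np

def apply_action(holdings, action):
    """Return h_{t+1} given h_t and a signed share vector (negative = sell)."""
    holdings = np.asarray(holdings, dtype=int)
    action = np.asarray(action, dtype=int)
    # A sell order cannot exceed the shares currently held.
    feasible = np.maximum(action, -holdings)
    return holdings + feasible

h_t = np.array([10, 5, 0])
a_t = np.array([-3, 0, 4])            # sell 3, hold, buy 4
h_next = apply_action(h_t, a_t)       # -> [7, 5, 4]

# An oversized sell order is capped at the 10 shares held.
h_capped = apply_action(h_t, np.array([-20, 0, 0]))
```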

# Incorporating Stock Trading Constraints

Assumptions:
Market liquidity: the orders can be rapidly executed at
the close price.
We assume that the stock market will not
be affected by our reinforcement trading agent.

Nonnegative balance b >= 0: the allowed actions should
not result in a negative balance.

Transaction cost: transaction costs are incurred for
each trade.

Risk-aversion for market crash: there are sudden
events that may cause a stock market crash, such as
wars, collapse of stock market bubbles, sovereign debt
default, and financial crises.

We employ the financial turbulence index $\text{turbulence}_t$ to measure extreme asset price movements.

When $\text{turbulence}_t$ is higher than a
threshold, which indicates extreme market conditions,
we simply halt buying and the trading agent sells all
shares.

We resume trading once the turbulence index
returns under the threshold.
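
The threshold rule can be sketched as follows. These notes do not record how the index is computed, so the sketch assumes the common definition of financial turbulence as the covariance-scaled distance of current returns from their historical mean. The threshold value and the synthetic return data are also assumptions.

```python
import numpy as np

def turbulence_index(current_returns, historical_returns):
    """Distance of current returns from the historical mean, scaled by covariance."""
    mu = historical_returns.mean(axis=0)
    cov = np.cov(historical_returns, rowvar=False)
    diff = np.asarray(current_returns, dtype=float) - mu
    # pinv tolerates a singular covariance matrix.
    return float(diff @ np.linalg.pinv(cov) @ diff)

def risk_adjusted_action(action, holdings, turbulence, threshold):
    """Sell all shares (and block buying) while turbulence exceeds the threshold."""
    if turbulence > threshold:
        return -np.asarray(holdings)
    return np.asarray(action)

rng = np.random.default_rng(0)
history = rng.normal(0.0, 0.01, size=(250, 3))    # synthetic daily returns, 3 stocks
calm = turbulence_index(history.mean(axis=0), history)
crash = turbulence_index([-0.10, -0.12, -0.08], history)

# During the simulated crash the proposed action is replaced by "sell everything".
action = risk_adjusted_action([5, 0, -2], holdings=[10, 4, 6],
                              turbulence=crash, threshold=50.0)
```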

# Return Maximization as Trading Goal

We define our reward function as the change of the
portfolio value when action $a$ is taken at state $s$ and arriving
at new state $s'$.

The goal is to design a trading strategy that
maximizes the change of the portfolio value.
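
A minimal sketch of this reward, assuming trades execute at the time-$t$ prices and that the transaction cost is a fixed fraction of the traded value. The `cost_rate` of 0.1% is a placeholder assumption, and the sketch does not check that the action is feasible.

```python
import numpy as np

def step_reward(prices_t, prices_next, holdings_t, balance_t, action, cost_rate=0.001):
    """Reward = portfolio value at t+1 minus portfolio value at t."""
    prices_t = np.asarray(prices_t, dtype=float)
    prices_next = np.asarray(prices_next, dtype=float)
    holdings_t = np.asarray(holdings_t, dtype=float)
    action = np.asarray(action, dtype=float)   # signed shares, negative = sell

    value_t = balance_t + prices_t @ holdings_t
    # Buying spends cash, selling adds cash; both sides pay the transaction cost.
    trade_cash = -(prices_t @ action)
    cost = cost_rate * np.sum(prices_t * np.abs(action))
    balance_next = balance_t + trade_cash - cost
    holdings_next = holdings_t + action
    value_next = balance_next + prices_next @ holdings_next
    return float(value_next - value_t)

reward = step_reward(prices_t=[100.0, 50.0], prices_next=[110.0, 45.0],
                     holdings_t=[10, 20], balance_t=1000.0, action=[5, -10])
```

In this example the portfolio is worth 3000 before the trade and 3099 after the price update, so the reward is 99.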


The optimal strategy is defined by the Bellman equation in formula 11.

Action Space: For a single stock, the action space
is defined as $\{-k, \ldots, -1, 0, 1, \ldots, k\}$, where $k$ and $-k$
denote the number of shares to buy and to sell, respectively.


The action space is then normalized to $[-1, 1]$, since the
RL algorithms A2C and PPO define the policy directly on
a Gaussian distribution, which needs to be normalized and
symmetric.
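
One way to map a normalized action back to a share count is to scale it by a maximum trade size and round. The name `h_max` and its value of 100 shares are assumptions made for this sketch.

```python
import numpy as np

def to_shares(normalized_action, h_max=100):
    """Map actions in [-1, 1] to integer share counts in {-h_max, ..., h_max}."""
    # Clip first, because a sampled Gaussian action can fall outside [-1, 1].
    clipped = np.clip(np.asarray(normalized_action, dtype=float), -1.0, 1.0)
    return np.rint(clipped * h_max).astype(int)

shares = to_shares([-1.0, -0.25, 0.0, 0.5, 1.7])   # -> [-100, -25, 0, 50, 100]
```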

# Advantage Actor Critic (A2C)

A2C is a typical actor-critic algorithm and we use
it as a component in the ensemble strategy.

A2C utilizes an
advantage function to reduce the variance of the policy
gradient.

The
evaluation of an action depends not only on how good the
action is, but also on how much better it can be. This
reduces the high variance of the policy network and
makes the model more robust.


A2C is a great model for stock trading because of its
stability.


The objective function is given in formula 12.

The advantage function is given in formula 13.
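
Formulas 12 and 13 are not reproduced in these notes. The sketch below assumes the standard one-step estimate of the advantage, $A(s, a) = r + \gamma V(s') - V(s)$, and the reward and value numbers are made up for illustration.

```python
import numpy as np

def advantage(rewards, values, next_values, gamma=0.99):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    return rewards + gamma * next_values - values

# Two transitions: the first did better than the critic expected, the second worse.
adv = advantage(rewards=[1.0, -0.5], values=[2.0, 1.0],
                next_values=[3.0, 1.0], gamma=0.9)
```

A positive advantage pushes the actor to make that action more likely, and a negative one makes it less likely. Subtracting $V(s)$ is what reduces the variance compared with weighting by the raw return.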

# Deep Deterministic Policy Gradient

DDPG [18] is used to encourage maximum investment
return.


DDPG combines the frameworks of both Q-learning
[38] and policy gradient, and uses neural networks as
function approximators.

DDPG learns directly from
the observations through policy gradient. It is proposed
to deterministically map states to actions to better fit the
continuous action space environment.



The Policy Gradient method is a reinforcement learning approach that directly optimizes the policy by adjusting its parameters based on the gradients of expected rewards. It’s particularly useful for handling continuous action spaces and complex policies.

Key Features
Direct Optimization: Adjusts the policy parameters to maximize the expected return.

Stochastic Policies: Often uses stochastic policies, which are particularly useful in exploration.

Gradient Ascent: Uses gradient ascent to update the policy parameters in the direction of higher expected reward.



DDPG is effective at handling continuous action space, and
so it is appropriate for stock trading.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Environment setup (simplified example)
num_actions = 2
num_states = 4

# Policy Model - Neural network that outputs a probability distribution over actions.
policy_model = tf.keras.Sequential([
    layers.Dense(24, activation='relu', input_shape=(num_states,)),
    layers.Dense(24, activation='relu'),
    layers.Dense(num_actions, activation='softmax')
])

# Optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# Training function - Updates the policy parameters using the policy gradient.
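def train_policy_gradient(state, action, reward, next_state, done):
    # The model expects a batch dimension: (num_states,) -> (1, num_states).
    state = tf.convert_to_tensor(state[np.newaxis, :], dtype=tf.float32)
    with tf.GradientTape() as tape:
        action_probs = policy_model(state, training=True)
        chosen_action_prob = action_probs[0, action]
        # A small epsilon guards against log(0).
        log_prob = tf.math.log(chosen_action_prob + 1e-8)
        loss = -log_prob * reward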

    grads = tape.gradient(loss, policy_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_model.trainable_variables))

# Simplified training loop - Simulates episodes and updates the policy based on observed rewards.
for episode in range(1000):
    state = np.random.rand(num_states)  # Random initial state
    done = False
    episode_rewards = []

    while not done:
        action_probs = policy_model(state[np.newaxis, :])
        action = np.random.choice(num_actions, p=action_probs.numpy().flatten())
        next_state = np.random.rand(num_states)  # Random next state
        reward = np.random.randn()  # Random reward
        done = np.random.choice([0, 1])  # Random done flag

        episode_rewards.append(reward)
        train_policy_gradient(state, action, reward, next_state, done)
        state = next_state

print("Training completed!")


# Proximal Policy Optimization

PPO [14] is introduced to control the policy
gradient update and ensure that the new policy will not be
too different from the previous one. PPO tries to simplify
the objective of Trust Region Policy Optimization (TRPO)
by introducing a clipping term to the objective function.

The
function $\mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ clips the ratio $r_t(\theta)$ to be
within $[1 - \epsilon, 1 + \epsilon]$. The objective function of PPO takes
the minimum of the clipped and normal objective.

PPO
discourages large policy changes that move outside of the clipped
interval. Therefore, PPO improves the stability of the policy
network's training by restricting the policy update at each
training step.
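
The clipped objective can be sketched in a few lines of NumPy. The value `epsilon=0.2` and the example inputs are assumptions; the ratio $r_t(\theta)$ is computed from log-probabilities under the new and old policies.

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (a quantity to maximize)."""
    ratio = np.exp(np.asarray(new_log_probs) - np.asarray(old_log_probs))
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the minimum removes the incentive to move the ratio outside the interval.
    return float(np.mean(np.minimum(unclipped, clipped)))

# ratio = 1.5 with a positive advantage is capped at 1 + epsilon = 1.2.
gain_capped = ppo_clip_objective(np.log([1.5]), np.log([1.0]), [2.0])
# ratio = 0.5 with a negative advantage is evaluated at 1 - epsilon = 0.8.
loss_capped = ppo_clip_objective(np.log([0.5]), np.log([1.0]), [-1.0])
```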

# Ensemble strategy

We use an ensemble strategy to automatically select the
best performing agent among PPO, A2C, and DDPG to
trade based on the Sharpe ratio.


Step 1. We use a growing window of n months to retrain
our three agents concurrently. In this paper we retrain our
three agents every three months.

Step 2. We validate all three agents by using a 3-month
rolling validation window after the training window to pick the
best performing agent with the highest Sharpe ratio.

We also adjust risk-aversion by using the turbulence index in our validation stage.



#TODO =========================

Step 3. After the best agent is picked, we use it to predict
and trade for the next quarter.


The
higher an agent’s Sharpe ratio, the better its returns have
been relative to the amount of investment risk it has taken.
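
The selection step can be sketched as below. The validation returns are made-up numbers, and the Sharpe ratio is computed from daily returns with an assumed 252 trading days per year and a zero risk-free rate.

```python
import numpy as np

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of per-period returns."""
    returns = np.asarray(returns, dtype=float)
    excess = returns - risk_free_rate / periods_per_year
    std = excess.std(ddof=1)
    if std == 0:
        return 0.0
    return float(np.sqrt(periods_per_year) * excess.mean() / std)

def select_agent(validation_returns):
    """Return the agent with the highest validation Sharpe ratio, plus all scores."""
    scores = {name: sharpe_ratio(r) for name, r in validation_returns.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical daily returns of each agent over one validation window.
validation_returns = {
    "PPO": [0.010, -0.020, 0.015, 0.005],
    "A2C": [0.004, 0.006, 0.005, 0.005],
    "DDPG": [-0.010, 0.000, 0.010, -0.005],
}
best_agent, scores = select_agent(validation_returns)
```

Here A2C wins because its returns are steady, even though PPO has the largest single-day gain. The selected agent would then trade the next quarter, and the selection is repeated after each retraining.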

# Stock Data Preprocessing

Our dataset consists
of two periods: an in-sample period and an out-of-sample period.
The in-sample period contains data for the training and validation
stages. The out-of-sample period contains data for the trading stage.


A validation stage is then
carried out for validating the 3 agents by Sharpe ratio, and
adjusting key parameters, such as learning rate, number of
episodes, etc.


For example, when A2C has the highest validation Sharpe ratio, we use A2C to trade for the next quarter.


# Performance Comparisons

1. Cumulative return: calculated by subtracting the
portfolio’s initial value from its final value, and then
dividing by the initial value.
2. Annualized return: the geometric average amount
of money earned by the agent each year over the time
period.
3. Annualized volatility: the annualized standard deviation
of portfolio return.
4. Sharpe ratio: calculated by subtracting the annualized
risk-free rate from the annualized return, and then
dividing by the annualized volatility.
5. Max drawdown: the maximum percentage loss during
the trading period.
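
The five metrics can be computed from a series of portfolio values as sketched below. The sketch assumes daily values, 252 trading days per year, and a zero risk-free rate by default; the four-point value series is a toy example.

```python
import numpy as np

def performance_metrics(portfolio_values, risk_free_rate=0.0, periods_per_year=252):
    """Compute the five metrics listed above from a portfolio value series."""
    values = np.asarray(portfolio_values, dtype=float)
    returns = values[1:] / values[:-1] - 1.0
    years = len(returns) / periods_per_year

    cumulative_return = values[-1] / values[0] - 1.0
    # Geometric average yearly growth over the whole period.
    annualized_return = (1.0 + cumulative_return) ** (1.0 / years) - 1.0
    annualized_volatility = returns.std(ddof=1) * np.sqrt(periods_per_year)
    sharpe = (annualized_return - risk_free_rate) / annualized_volatility
    # Largest percentage drop from any running peak.
    running_max = np.maximum.accumulate(values)
    max_drawdown = np.min(values / running_max - 1.0)

    return {
        "cumulative_return": float(cumulative_return),
        "annualized_return": float(annualized_return),
        "annualized_volatility": float(annualized_volatility),
        "sharpe_ratio": float(sharpe),
        "max_drawdown": float(max_drawdown),
    }

metrics = performance_metrics([100.0, 110.0, 99.0, 120.0])
```

For this series the cumulative return is 20% and the max drawdown is -10% (the fall from 110 to 99). The annualized figures are not meaningful for only three days of data.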


Analysis of Agent Performance: From both Table 2
and Figure 5, we can observe that the A2C agent is more
adaptive to risk, which makes it good at handling a bearish market.


The PPO agent
is good at following trends and generating
more returns; it has the highest annual return.

So PPO
is preferred when facing a bullish market. DDPG performs
similarly to PPO but not as well; it can be used as a
complementary strategy to PPO in a bullish market.



By incorporating the turbulence
index, the agents are able to cut losses and successfully
survive the stock market crash in March 2020.

# More reading online


Q-learning: a value-based Reinforcement Learning algorithm that is used to find the optimal action-selection policy using a Q function.
DQN: In deep Q-learning, we use a neural network to approximate the Q-value function. The state is given as the input and the Q-values of the allowed actions are the predicted output.



https://medium.com/p/f1dad0126a02

# Look at Deep RL paper here:

TODO

FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance