<table>
    <tr>
        <td>
            <img src='./text_images/nvidia.png' width="200" height="450">
        </td>
        <td> & </td>
        <td>
            <img src='./text_images/udacity.png' width="350" height="450">
        </td>
    </tr>
</table>

<div style = "font-family:Georgia;
              font-size:1.6vw;
              color:#017A9B;
              font-weight:bold;
              text-align:left;
              background:url('./text_images/4.jpeg') no-repeat center;
              background-size:cover;">
              
     <br><br>
     DRL for Optimal Execution of Portfolio Transactions     
     <br><br>
    
     
</div>

# Introduction

We begin with a brief review of reinforcement learning and actor-critic methods.  Then, you will use an actor-critic method to generate optimal trading strategies that maximize profit when liquidating a block of shares. 

In reinforcement learning, an agent makes observations and takes actions within an environment, and in return it receives rewards.  Its objective is to learn to act in a way that will maximize its expected long-term rewards. 

<img src="./text_images/RL.png" width="500" height="900">
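
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: `CoinFlipEnv` and `run_episode` are hypothetical names invented for this example and are not part of the lab's code. The toy environment pays a reward of 1 whenever the agent guesses a coin flip correctly.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a coin flip."""

    def __init__(self, n_steps=10, seed=0):
        self.n_steps = n_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # observation: the current step index

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.n_steps
        return self.t, reward, done


def run_episode(env, policy):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # the agent acts on its observation
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
    return total_reward


env = CoinFlipEnv(n_steps=10)
total = run_episode(env, policy=lambda state: 0)  # a fixed policy that always guesses 0
print(total)
```

A learning agent would replace the fixed policy with one that is updated from the observed rewards.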

There are several types of RL algorithms, and they can be divided into three groups:  
- critic-only (_or value-based_),
- actor-only (_or policy-based_), and
- actor-critic methods.

The words "actor" and "critic" are synonyms for the policy and value function, respectively. 

Q-learning algorithms are considered **critic-only** methods.  In 2013, [DeepMind published a paper](https://arxiv.org/abs/1312.5602) showing how deep Q-networks can be used to approximate the value function for complex problems. 

**Actor-only** methods typically work with a parameterized family of policies over which optimization procedures can be used directly.  To learn more about actor-only methods, you're encouraged to peruse [this paper](http://ieeexplore.ieee.org/document/6392457/).

**Actor-critic** methods combine the advantages of actor-only and critic-only methods. While the actor brings the advantage of computing continuous actions without the need for optimization procedures on a value function, the critic’s merit is that it supplies the actor with knowledge of the performance.  Actor-critic methods usually have good convergence properties, in contrast to critic-only methods.  The **Deep Deterministic Policy Gradients (DDPG)** algorithm is one example of an actor-critic method.

<img src="./text_images/Actor-Critic.png" width="500" height="700">
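
To make the division of labor concrete, here is a minimal NumPy sketch of how a critic scores the action chosen by an actor. It is not the DDPG implementation used in this lab: the linear functions below are stand-ins for neural networks, the weights are random, and no learning step, replay buffer, or target network is shown. All names (`actor`, `critic`, `td_target`) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, action_size = 8, 1

# Actor: deterministic policy mapping a state to a continuous action in (0, 1).
W_actor = rng.normal(scale=0.1, size=(action_size, state_size))


def actor(state):
    return 1.0 / (1.0 + np.exp(-(W_actor @ state)))  # sigmoid keeps the action bounded


# Critic: action-value estimate Q(s, a), linear in the concatenated (state, action).
w_critic = rng.normal(scale=0.1, size=state_size + action_size)


def critic(state, action):
    return float(w_critic @ np.concatenate([state, action]))


def td_target(reward, next_state, done, gamma=0.99):
    """One-step bootstrapped target: r + gamma * Q(s', actor(s'))."""
    if done:
        return reward
    return reward + gamma * critic(next_state, actor(next_state))


state = rng.normal(size=state_size)
next_state = rng.normal(size=state_size)
action = actor(state)

# The critic is trained to shrink this error; the actor is trained to choose
# actions that the critic scores highly.
td_error = td_target(0.5, next_state, done=False) - critic(state, action)
```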

In this lab, we will use DDPG to determine the optimal execution of portfolio transactions.

# Modeling Optimal Execution as a Reinforcement Learning Problem
---

### States
We will use the following features to define the state at time $t_k$:

$$
[r_{k-5},\, r_{k-4},\, r_{k-3},\, r_{k-2},\, r_{k-1},\, r_{k},\, t_{k},\, i_{k}]
$$

where:
- $r_{k} := \log(\frac{\tilde{S}_k}{\tilde{S}_{k-1}})$ is the log-return at time $t_k$
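- $i_{k} := X - \sum_{j=1}^{k}n_j$ is the remaining inventory at time $t_k$, where $X$ is the total number of shares to liquidate and $n_j$ is the number of shares sold in trade $j$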

_In a real-world setting, if there is enough data, market variables such as the current trading rate, current [spread](https://www.investopedia.com/trading/basics-of-the-bid-ask-spread/), and [limit order book](https://www.investopedia.com/terms/l/limitorderbook.asp) density can be added to the state vector. You may also consider limit order book imbalance within ϵ basis points from the mid-spread, buy transaction volume, and sell transaction volume. These variables will likely need to be normalized using a long-term average on a per-stock basis._
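
As a rough illustration of the state definition, the sketch below assembles the eight-element vector from a short price history. The helper `build_state` and its arguments are hypothetical, and the simulation environment used in this lab may scale or normalize these features differently.

```python
import numpy as np


def build_state(prices, shares_sold, total_shares, t_k, n_returns=6):
    """Assemble [r_{k-5}, ..., r_k, t_k, i_k] from raw inputs.

    prices       -- prices up to and including time t_k (at least n_returns + 1 values)
    shares_sold  -- n_1, ..., n_k, the shares sold in each trade so far
    total_shares -- X, the size of the block being liquidated
    """
    prices = np.asarray(prices, dtype=float)
    log_returns = np.log(prices[1:] / prices[:-1])[-n_returns:]
    inventory = total_shares - np.sum(shares_sold)
    return np.concatenate([log_returns, [t_k, inventory]])


prices = [50.0, 50.1, 49.9, 50.0, 50.2, 50.1, 50.3]
state = build_state(prices, shares_sold=[10_000, 12_000], total_shares=1_000_000, t_k=2)
print(state.shape)  # (8,)
```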

### Actions
The action $a_k$ at time $t_{k}$ is defined in terms of the average selling rate between times $t_{k}$ and $t_{k+1}$.  We express the average selling rate as a multiple of the constant selling rate $\frac{X}{N}$:

$$
c_k \times \frac{X}{N}
$$

where $c_k$ is a scalar.  Note that this is equal to $\frac{n_k}{\tau}$, where $n_k$ is the number of shares sold between $t_k$ and $t_{k+1}$ and $\tau$ is the length of that interval.
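
The mapping from the scalar $c_k$ to a number of shares can be sketched as follows. The function `shares_to_sell` is hypothetical. It assumes time is measured in units of one trading interval (`tau=1.0` by default), and the cap at the remaining inventory is an added safeguard rather than something specified above.

```python
def shares_to_sell(c_k, total_shares, n_trades, remaining, tau=1.0):
    """Convert the scalar action c_k into a share count n_k for one interval."""
    rate = c_k * total_shares / n_trades   # average selling rate c_k * X / N
    n_k = rate * tau                       # because the rate equals n_k / tau
    return max(0.0, min(n_k, remaining))   # assumed safeguard: never oversell


# Selling at 1.5 times the constant rate with X = 1,000,000 shares and N = 60 trades.
n_k = shares_to_sell(1.5, total_shares=1_000_000, n_trades=60, remaining=1_000_000)
print(n_k)  # 25000.0
```

With $c_k = 1$ for every trade, this reduces to selling $\frac{X}{N}$ shares per interval.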

### Rewards
Defining the rewards is trickier than defining states and actions, since the original problem is a minimization problem. One option is to use the difference between two consecutive values of the Almgren-Chriss utility function. After each time interval, we recompute the Almgren-Chriss model for the remaining time and inventory while holding the parameter λ constant. Denoting the optimal trading trajectory computed at time $t$ as $x^*_t$, we define the reward as: 

$$
R_{t} = {{U_t(x^*_t) - U_{t+1}(x^*_{t+1})}\over{U_t(x^*_t)}}
$$

We normalize the difference by $U_t(x^*_t)$ so that the actor and critic models are easier to train.
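
The reward formula itself is simple arithmetic once the two utilities are known. The sketch below takes them as plain numbers. Computing $U_t(x^*_t)$ requires the Almgren-Chriss model and is not reproduced here, so `normalized_reward` and its inputs are hypothetical.

```python
def normalized_reward(u_current, u_next):
    """R_t = (U_t - U_{t+1}) / U_t for two consecutive utility values."""
    if u_current == 0:
        raise ValueError("U_t must be non-zero to normalize the reward")
    return (u_current - u_next) / u_current


# If the utility (a cost to be minimized) falls from 2.0e6 to 1.5e6,
# the agent receives a positive reward.
reward = normalized_reward(2.0e6, 1.5e6)
print(reward)  # 0.25
```

A step that increases the utility produces a negative reward, which turns the original minimization problem into the reward maximization that the agent expects.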

# Reinforcement Learning

In [1]:
import numpy as np

import syntheticChrissAlmgren as sca
from ddpg_agent import Agent

from collections import deque

# Create simulation environment
env = sca.MarketEnvironment()

# Initialize Feed-forward DNNs for Actor and Critic models. 
agent = Agent(state_size=env.observation_space_dimension(), action_size=env.action_space_dimension(), random_seed=0)

# Set the liquidation time
lqt = 60

# Set the number of trades
n_trades = 60

# Set trader's risk aversion
tr = 1e-6

# Set the number of episodes to run the simulation
episodes = 100

shortfall_hist = np.array([])
shortfall_deque = deque(maxlen=100)
reward_hist = np.array([])

for episode in range(episodes): 
    # Reset the environment
    cur_state = env.reset(seed = episode, liquid_time = lqt, num_trades = n_trades, lamb = tr)

    # set the environment to make transactions
    env.start_transactions()

    for i in range(n_trades + 1):
      
        # Predict the best action for the current state. 
        action = agent.act(cur_state, add_noise = False)
        
        # Action is performed and new state, reward, info are received. 
        new_state, reward, done, info = env.step(action)
        
        # current state, action, reward, new state are stored in the experience replay
        agent.step(cur_state, action, reward, new_state, done)
        
        # roll over new state
        cur_state = new_state
        
        reward_hist = np.append(reward_hist, reward)

        if info.done:
            shortfall_hist = np.append(shortfall_hist, info.implementation_shortfall)
            shortfall_deque.append(info.implementation_shortfall)
            break
        
    if (episode + 1) % 100 == 0: # print average shortfall over last 100 episodes
        print('\rEpisode [{}/{}]\tAverage Shortfall: ${:,.2f}'.format(episode + 1, episodes, np.mean(shortfall_deque)))
        

print('\nAverage Implementation Shortfall: ${:,.2f} \n'.format(np.mean(shortfall_hist)))

Episode [100/100]	Average Shortfall: $1,941,808.44

Average Implementation Shortfall: $1,941,808.44 

