# DEEP REINFORCEMENT LEARNING FOR HIGH FREQUENCY TRADING


In [1]:
import os
import sys

# Make the project root (parent of the current working directory) importable
root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if root not in sys.path:
    sys.path.insert(0, root)
import backend.QRModel.QR_only as qr
import backend.RL_agents.QDRL as QDL_Agent

## I. Environment

We present a complete environment and agent for high-frequency trading. The environment is mainly based on the Queue Reactive (QR) model introduced by W. Huang, C.-A. Lehalle and M. Rosenbaum, with some minor twists, especially in the handling of exogenous information. The parameters can be changed in the backend/utils/intensity_fct_params file. Once defined, they are used by every file of the project.

### A. Limit Order Book

Interaction between buyers and sellers takes place algorithmically, according to the order of arrival of orders in the supply and demand queues. This process is managed by the Limit Order Book, which lists and executes all orders sent by the various market players. These orders are of three main types:

- Limit Order : Add a buy or sell proposal at a given price. 

- Market Order : Buy or sell immediately at the best price available on the market. 

- Cancellation : Withdrawal of a buy or sell proposal previously registered in the order book.

The market microstructure created by these interactions between buyers and sellers is at the root of the price formation process, and drives price variations over time.

At the high-frequency level, price movements are thus largely the result of interactions between supply and demand. Prices live on a discrete grid: the bid-ask spread can only take integer multiples of the tick, and the limits sit at varying distances from the reference price $p_{ref} = \frac{p_{bid}+p_{ask}}{2}$. At any given moment, new orders can be added at a given price, which is therefore modeled as a queue whose priority is defined by the order of arrival.

Studies of the first limit of the order book (Bouchaud et al., 2004 and Besson et al., 2016) show how the sizes of the first bid and ask limits influence the next price movement. Unfortunately, this prediction is not precise enough to be profitable in the long term, so a more detailed study of the price formation process is required. The first limit nevertheless yields a first key piece of information from the order book, the imbalance, defined as follows:
$$\text{Imb}_t = \frac{Q^{best\ bid}_t-Q^{best\ ask}_t}{Q^{best\ bid}_t+Q^{best\ ask}_t}$$
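As a minimal sketch, the imbalance can be computed directly from the two best-queue sizes. The function name and the convention of returning 0 when both queues are empty are assumptions made for illustration; they are not taken from the backend.

```python
def imbalance(q_best_bid: float, q_best_ask: float) -> float:
    """Order book imbalance at the first limit, a value in [-1, 1]."""
    total = q_best_bid + q_best_ask
    if total == 0:
        # Both queues empty: the ratio is undefined.
        # Returning 0.0 here is an assumption of this sketch.
        return 0.0
    return (q_best_bid - q_best_ask) / total


print(imbalance(30, 10))  # 0.5  (more volume on the bid)
print(imbalance(10, 30))  # -0.5 (more volume on the ask)
```

A positive value indicates more volume resting on the bid than on the ask, a negative value the opposite.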

### B. Construction of our environment

Our environment is mainly based on the Queue Reactive model introduced by W. Huang, C.-A. Lehalle and M. Rosenbaum.

#### 1. Fixed price QR
Let $p_{ref}$ be the fixed reference price. We'll model the evolution of the queues around this price. We model the order book by a vector: $$Q(t) = (Q_{-3}(t),Q_{-2}(t),Q_{-1}(t),Q_{1}(t),Q_{2}(t),Q_{3}(t))$$ which evolves over time according to a Markov process. The element $Q_{\pm i}(t)$ corresponds to the availability of supply or demand at $p_{ref}\pm i \ \text{tick}$ at time $t$.

To model the order book's next move, we draw $3 \times 2K$ ($K=3$) independent exponential waiting times $\tau_i^{U} \sim \mathcal{E}(\lambda_i^{U}(\text{Imb}_t))$, one per queue $i$ and event type $U\in \{Add,Cancel,Trade\}$. These are the inter-arrival times of independent Poisson processes with intensities $\lambda_i^{U}(\text{Imb}_t)$. The next event is an action of type $U_{min}$ and of size one AES (average event size) at queue $Q_{i_{min}}$, where $(i_{min},U_{min})$ attains the minimum of these waiting times:
$$(i_{min},U_{min}) = \underset{(i,U) \in \{-K,\dots,-1,1,\dots,K\}\times\{A,C,T\}}{\text{argmin}}\ \tau_i^{U}$$
The action is performed at time $t+\tau_{i_{min}}^{U_{min}}$.
To keep the queue sizes from growing without bound, we force the add intensities to be lower than the sum of the cancel and market-order intensities.
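The selection of the next event can be sketched as a race between independent exponential clocks, one per (queue, event type) pair. The intensity function `toy_intensity` below is purely hypothetical (the real intensities come from the project's parameter file), and all names are illustrative.

```python
import numpy as np

K = 3
LEVELS = [-3, -2, -1, 1, 2, 3]          # queue indices around p_ref
EVENT_TYPES = ["Add", "Cancel", "Trade"]


def toy_intensity(level, event_type, imb):
    """Hypothetical intensity lambda_i^U(Imb), for illustration only."""
    base = {"Add": 0.5, "Cancel": 0.6, "Trade": 0.4}[event_type]
    if event_type == "Trade" and abs(level) != 1:
        return 0.0  # in this toy example, trades only hit the best limits
    # Bid-side queues (level < 0) get busier when the imbalance is positive
    side = 1.0 if level < 0 else -1.0
    return base * (1.0 + 0.5 * side * imb) / abs(level)


def next_event(imb, rng, intensity=toy_intensity):
    """Return (waiting_time, level, event_type) of the first clock to ring."""
    best = None
    for level in LEVELS:
        for event_type in EVENT_TYPES:
            lam = intensity(level, event_type, imb)
            if lam <= 0.0:
                continue  # an event with zero intensity never fires
            tau = rng.exponential(scale=1.0 / lam)  # mean waiting time 1/lambda
            if best is None or tau < best[0]:
                best = (tau, level, event_type)
    return best


rng = np.random.default_rng(0)
tau, level, event_type = next_event(imb=0.2, rng=rng)
print(f"next event: {event_type} at queue {level} after {tau:.4f} time units")
```

In this toy parameterisation the add intensity stays below the cancel intensity at every level, which is one simple way to respect the constraint stated above.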

This model captures the basic mechanics of high-frequency market movements, as the response times depend on the current state of the market. For instance, if the ask queue is larger than the bid queue, participants will tend to join the bid side, where the shorter queue works to their advantage.

#### 2. QR model with variable price

We still need to model the movements of the reference price $p_{ref}$. To do this, we consider that $p_{ref}$ can only move under two conditions:

- exhaustion of a limit
- adding a new limit


Intuitively, $p_{ref}$ should move in the direction of the market, for example upwards when a limit is exhausted on the ask side. However, this behavior is not observed on the market, where a mean-reversion phenomenon predominates. To capture this, each time the price $p_{ref}$ changes, we draw a Bernoulli variable with parameter $\theta$, which equals 1 if the price follows the market movement and 0 if it goes in the opposite direction.
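The Bernoulli draw described above can be sketched as follows. The function name, the encoding of the market direction as +1 or -1, and the one-tick move size are assumptions of this sketch, not details taken from the backend.

```python
import numpy as np


def reference_price_move(market_direction, theta, rng, tick=1.0):
    """Signed move of p_ref after a limit is exhausted or created.

    market_direction: +1 if the market pushes the price up, -1 if down.
    theta: probability that p_ref follows the market movement.
    """
    follows = rng.random() < theta  # Bernoulli(theta) draw
    return market_direction * tick if follows else -market_direction * tick


rng = np.random.default_rng(0)
moves = [reference_price_move(+1, theta=0.3, rng=rng) for _ in range(10)]
print(moves)  # mostly -1.0 when theta is small (mean reversion dominates)
```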

Initial observations by W. Huang, C.-A. Lehalle and M. Rosenbaum showed that this model failed to capture actual market volatility. Indeed, even assuming a QR totally driven by the LOB, i.e. $\theta=0$, the volatility did not reach the value observed in practice. To overcome this problem, they added another parameter $\theta_{reinit}\approx0.13$, which triggers a price jump with probability $\theta_{reinit}$ and creates a new LOB at the new price. However, this approach is not possible in our case, as our agent needs a continuous order book to interact with the environment.

#### 3. Order size and Exogenous information modeling

In order to recreate the usual volatility seen in HF markets, we need to tweak the QR model. We first allow actions not only of size 1 (1 add, 1 cancel, 1 market order), but of sizes drawn uniformly at random in $[1,\text{Size}_\text{max}]$, with a $\text{Size}_\text{max}$ specific to each action type.

To recreate exogenous information, which is extremely important in finance as mentioned by Bouchaud, we add a random event simulation:
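A minimal sketch of the size draw, assuming that $\text{Size}_\text{max}$ is set per action type. The values in `SIZE_MAX` are hypothetical placeholders, not the project's parameters.

```python
import numpy as np

# Hypothetical maximum sizes per action type (illustrative values only)
SIZE_MAX = {"Add": 5, "Cancel": 5, "Trade": 10}


def draw_order_size(event_type, rng, size_max=SIZE_MAX):
    """Draw an integer size uniformly in [1, Size_max] for the given action."""
    # Generator.integers excludes the upper bound, hence the +1
    return int(rng.integers(1, size_max[event_type] + 1))


rng = np.random.default_rng(0)
print([draw_order_size("Trade", rng) for _ in range(5)])
```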

- At each step an exogenous event occurs with probability $\theta_{event}$
- If no event occurs we continue with the model developed before
- If an event occurs, we draw its average length (in number of events) from a predefined vector, for instance $\nu=[10,100,1000]$. We then draw the length of the event from a Poisson variable: $$\text{length}_\text{event}\sim\mathcal P(\nu[\mathcal U(\text{length of }\nu)])$$
- The intensity of each event type (add, cancel, market order) instantaneously becomes $\frac{\lambda}{\text{length}_\text{event}}$, simulating the higher flow of orders during these events.
- The intensity then progressively returns to its normal rate, taking the value $\frac{\lambda}{\text{Nb of remaining events}}$ at each step.
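The steps above can be sketched as a length draw followed by a schedule of scaled parameters. The function names are hypothetical, and clamping the length to at least 1 (to avoid a division by zero when the Poisson draw returns 0) is an assumption of this sketch. The ratio $\frac{\lambda}{n}$ is reproduced exactly as written in the list; how it enters the event-time draw is left to the backend.

```python
import numpy as np

NU = [10, 100, 1000]  # possible average event lengths (example from the text)


def draw_event_length(rng, nu=NU):
    """Pick an average length uniformly in nu, then draw a Poisson length."""
    mean_length = nu[rng.integers(len(nu))]
    # Clamp to 1 so the schedule below is never empty (assumption)
    return max(1, int(rng.poisson(mean_length)))


def scaled_parameters(lam, length):
    """Values lam / n for n = length, length - 1, ..., 1.

    The first value is lam / length (start of the event); the last one
    is lam itself, i.e. the parameter is back to its normal level.
    """
    return [lam / remaining for remaining in range(length, 0, -1)]


rng = np.random.default_rng(0)
length = draw_event_length(rng)
schedule = scaled_parameters(lam=12.0, length=length)
print(length, schedule[0], schedule[-1])

print(scaled_parameters(12.0, 3))  # [4.0, 6.0, 12.0]
```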

These changes make the market a lot more realistic and will force our agent to incorporate exogenous information in its decision making.

However, this model is rather complex and we will not try to fit it to real market data in this paper, as this is not our primary objective. Other known models include the Bouchaud diffusion model and any tick-by-tick model of the price that incorporates bid-ask demand. Recent studies have highlighted the possibility of using Hawkes processes for better modeling.

### C. Visualisation of our environment

We simulate a Queue Reactive model with the modifications specified above in the cell below. All parameters are set (and can be freely modified) in the file backend/utils/intensity_fct_params.

In [2]:
NB_EVENTS_SIMULATED = 10000
qr.Run_QR_simulated(NB_EVENTS_SIMULATED, False)

## II. Agent Interaction with our Environment and Reinforcement Learning Strategies

### A. Actions possible by the agent

In the current setup of the environment, the agent can choose between:
- Do Nothing : It doesn't do anything
- Order Ask : Send an order at the ask side (buy at the best price)
- Order Bid : Send an order at the bid side (sell at the best price)

The reward is the current P&L (at the observed price).

### B. Q-Deep Reinforcement Learning
We first train an agent on this environment using deep Q-learning.
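For reference, the core of deep Q-learning is the regression target $y = r + \gamma \max_{a'} Q(s', a')$ toward which the network's prediction $Q(s, a)$ is trained. The sketch below computes this generic target for a batch with three actions (Do Nothing, Order Ask, Order Bid). It illustrates the standard technique only; the function name, the discount factor and the batch layout are assumptions and do not describe the internals of `Deep_Q_Learning_Agent`.

```python
import numpy as np


def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Generic deep Q-learning targets for a batch of transitions.

    rewards:       shape (batch,), reward received after each action
    next_q_values: shape (batch, n_actions), Q(s', a') for every action
    dones:         shape (batch,), True when the episode ended at s'
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    max_next = next_q_values.max(axis=1)  # greedy value of the next state
    # No bootstrapping from terminal states
    return rewards + gamma * max_next * (~dones)


targets = dqn_targets(
    rewards=[1.0, 2.0],
    next_q_values=[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]],
    dones=[False, True],
    gamma=0.5,
)
print(targets)  # [2. 2.]
```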

In [3]:
NB_EPISODES_FOR_TRAINING = 20000

agent = QDL_Agent.Deep_Q_Learning_Agent(network_architecture = 'Classic')
agent.train(nb_episode = NB_EPISODES_FOR_TRAINING, window_size = 20)

                                                            --- Q-DEEP-REINFORCED AGENT ---


--- TRAINING THE AGENT OVER 20000 EPISODES ---

     ---> TRAINING...



           Training (Classic Network): 100%|██████████| 20000/20000 [10:14<00:00, 32.54it/s, total_reward=5.00]  



     ---> TRAINING FINISHED

--- VISUALISING REWARD AND DECISION EVOLUTION ---




--- STATS ---

Action taken by the agent every 2
Average Reward for the Random Strategy : -21.042
Actions taken of the last episode by the agent:
         Do Nothing ---> 0.7450980392156863%
          Order Bid ---> 0.0196078431372549%
          Order Ask ---> 0.23529411764705882%

________________________________________________________________


We then test our agent in a random simulation of NB_EVENT_SIM events.

In [4]:
NB_EVENT_SIM = 10000
agent.test(NB_EVENT_SIM)


                                             --- TESTING THE AGENT OVER A SIMULATION OF 10000 EVENTS ---



---> Testing:   0%|          | 0/10000 [00:05<?, ?it/s, total_reward=5467.00]



We then compare the Q-Deep Reinforcement learning algorithm using different architectures for the neural network.

In [5]:
tab_network = ['Classic']
NB_EPISODES_FOR_TRAINING_COMPARAISON = 3000
QDL_Agent.compare_networks(tab_network, nb_episode = NB_EPISODES_FOR_TRAINING_COMPARAISON)

           Training (Classic Network): 100%|██████████| 3000/3000 [01:32<00:00, 32.45it/s, total_reward=45.00] 
