# The MC Model

The Markov chain (MC) model is the second model used in this thesis. It is significantly more complex than the simple probabilistic model (SPM); however, it is still quite rudimentary in comparison with real-life markets.

In this model, the limit order book (LOB) is modelled explicitly. There are six event types:

> 1. Buy limit orders 
> 2. Sell limit orders
> 3. Cancel buy orders
> 4. Cancel sell orders
> 5. Buy market orders
> 6. Sell market orders

The arrival of an order triggers a state transition in the Markov chain. An illustration of how different order arrivals modify the limit order book is provided in the figure below.

<div>
    <img src="images/LOBDynamics.png" width=800/>
</div>
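
The six event types can be illustrated with a toy order book. The sketch below is not the thesis implementation. It assumes a book stored as resting volume per integer price level (in ticks), and that a market order consumes volume from the best opposite price level first. `ToyLOB` and `apply_event` are hypothetical names. Each call to `apply_event` corresponds to one state transition.

```python
from collections import defaultdict


class ToyLOB:
    """Hypothetical toy limit order book: resting volume per price level (in ticks)."""

    def __init__(self, bids, asks):
        # bids/asks map an integer price (in ticks) to the resting volume at that price
        self.bids = defaultdict(int, bids)
        self.asks = defaultdict(int, asks)

    def best_bid(self):
        return max(p for p, v in self.bids.items() if v > 0)

    def best_ask(self):
        return min(p for p, v in self.asks.items() if v > 0)

    def apply_event(self, event, price=None, size=1):
        """Apply one of the six event types to the book."""
        if event == "buy_limit":
            self.bids[price] += size
        elif event == "sell_limit":
            self.asks[price] += size
        elif event == "cancel_buy":
            self.bids[price] = max(self.bids[price] - size, 0)
        elif event == "cancel_sell":
            self.asks[price] = max(self.asks[price] - size, 0)
        elif event == "buy_market":
            self._consume(self.asks, size, best=self.best_ask)
        elif event == "sell_market":
            self._consume(self.bids, size, best=self.best_bid)
        else:
            raise ValueError(f"unknown event type: {event}")

    @staticmethod
    def _consume(side, size, best):
        # A market order walks the opposite side, starting at its best price;
        # it stops early if that side runs out of volume
        while size > 0 and any(v > 0 for v in side.values()):
            price = best()
            traded = min(side[price], size)
            side[price] -= traded
            size -= traded


book = ToyLOB(bids={99: 2, 98: 3}, asks={101: 1, 102: 4})
book.apply_event("buy_market", size=2)  # takes 1 unit at 101, then 1 unit at 102
print(book.best_ask(), book.asks[102])
```

In this sketch, limit orders and cancellations only change the volume at one price level, while market orders can move the best price when they empty a level, which is what widens the spread.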


As in the SPM:

> * The time _t_ can take integer values between _0_ and _T_.
>
> * The market maker has to quote bid and ask prices every second.
>
> * The market maker can put the bid and ask depths at *max\_quote\_depth* different levels, from _1_ to *max\_quote\_depth* ticks away from the best ask and best bid price respectively.
>
> * The cash process _X<sub>t</sub>_ denotes the market maker's cash at time _t_.
>
> * The inventory process _Q<sub>t</sub>_ denotes the market maker's inventory at time _t_.
>
> * The value process _V<sub>t</sub>_ denotes the value of the market maker's position at time _t_, that is its cash plus the value of its current inventory.
>
> * The market maker can see the current time _t_, its inventory _Q<sub>t</sub>_, the spread and the *full LOB* before taking an action.
>
> * At time _t = T_ the market maker is forced to liquidate its position.
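
The cash, inventory and value processes above can be sketched as simple bookkeeping. This is an illustrative sketch, not the thesis code, and it rests on hypothetical assumptions: fills happen exactly at the quoted prices, the inventory is valued at the mid-price, and the forced liquidation at _t = T_ crosses the spread at the best opposite price without walking deeper into the book. `MMAccount`, `fill` and `liquidate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MMAccount:
    """Hypothetical market-maker account holding cash X_t and inventory Q_t."""
    cash: float = 0.0   # X_t
    inventory: int = 0  # Q_t

    def fill(self, side, price, size):
        # "bid": the market maker buys `size` units; "ask": it sells `size` units
        if side == "bid":
            self.cash -= price * size
            self.inventory += size
        elif side == "ask":
            self.cash += price * size
            self.inventory -= size
        else:
            raise ValueError(f"unknown side: {side}")

    def value(self, mid_price):
        # V_t = X_t + Q_t * mid_price (mid-price valuation is an assumption)
        return self.cash + self.inventory * mid_price

    def liquidate(self, best_bid, best_ask):
        # At t = T: sell a long inventory at the best bid, or buy back a short
        # inventory at the best ask (assumed: the whole position fills at that price)
        if self.inventory > 0:
            self.fill("ask", best_bid, self.inventory)
        elif self.inventory < 0:
            self.fill("bid", best_ask, -self.inventory)
        return self.cash


acc = MMAccount()
acc.fill("bid", 99, 5)   # bid quote filled: buy 5 units at 99
acc.fill("ask", 101, 3)  # ask quote filled: sell 3 units at 101
print(acc.cash, acc.inventory, acc.value(mid_price=100))
print(acc.liquidate(best_bid=99, best_ask=101))
```

After liquidation the inventory is zero, so the terminal value equals the cash.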

The _tick_ is the smallest price increment of the underlying, for instance $0.01 for one share of AAPL.

Unlike in the SPM, it is not possible to derive an analytically optimal strategy in the MC model.

## DRL stage
First, import the functions used for training and evaluation:

In [1]:
import logging
from mc_model_mm_deep_rl_batch import (
    train_multiple_agents_batch,
    evaluate_DDQN_batch
)

logging.basicConfig(level=logging.INFO)




The environment parameters and the DDQN hyperparameters must first be specified.

The Markov-chain model introduces several additional environment parameters; these are summarised in the code snippet below. A key design choice is the use of a longer episode (trading window), with _T_ = 100.

DDQN adds further choices due to the neural-network function approximator. In particular, the reward is rescaled to help keep it within [-1, 1] for training stability. Moreover, because the agent observes the full limit order book, the initial book state is randomised at the start of each episode to increase state coverage during learning.


In [2]:
model_params = {
                "dt": 1,                    # the length of the time steps
                "T": 100,                   # the length of the episode
                "num_levels": 10,           # number of depth levels to be included in the LOB
                "default_order_size": 5,    # the size of the orders the MM places
                "max_quote_depth": 5,       # how deep the MM can put its quotes
                "reward_scale": 0.1,        # a factor all rewards will be multiplied with
                "randomize_reset": True     # a random LOB state is chosen at the start of every episode
}

The next step is to select the hyperparameter values.

DDQN introduces a substantial number of tunable settings. In brief, these choices cover the neural-network architecture, experience replay configuration, and the parameters governing the ε-greedy exploration policy.

In [3]:
hyperparams = {
                "n_train": int(2e5),    # the number of steps the agents will be trained for
                "n_test": int(1e2),     # the number of episodes the agents will be evaluated for
                "n_runs": 4             # the number of agents that will be trained
}

DDQN_params = {
                # network params
                "hidden_size": 64,                                          # the hidden-layer size of the network
                "buffer_size": hyperparams["n_train"] // 200,               # the capacity of the experience replay buffer
                "replay_start_size": hyperparams["n_train"] // 200,         # the number of steps before experience replay starts
                "target_update_interval": hyperparams["n_train"] // 100,    # how often the target network is updated
                "update_interval": 2,                                       # how often the online network is updated
                "minibatch_size": 16,                                       # the size of the minibatches used

                # epsilon greedy (linear decay)
                "exploration_initial_eps": 1,                               # the starting value of the exploration rate
                "exploration_final_eps": 0.05,                              # the final value of the exploration rate
                "exploration_fraction": 0.5,                                # the fraction of training after which the final value is reached

                # learning rate
                "learning_rate_dqn": 1e-4,                                  # the learning rate used (Adam)

                # other params
                "num_envs": 10,                                             # the number of parallel environments
                "n_train": hyperparams["n_train"],
                "n_runs": hyperparams["n_runs"],
                "reward_scale": model_params["reward_scale"],

                # logging params
                "log_interval": hyperparams["n_train"] // 100,              # how often information is saved
                "num_estimate": 10000,                                      # the number of states used to estimate the Q-values
                "n_states": 10                                              # the number of states the heatmaps are averaged over

}
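
With these settings, the exploration rate presumably decays linearly from *exploration\_initial\_eps* to *exploration\_final\_eps* over the first *exploration\_fraction* of the *n\_train* training steps, and then stays constant. The function below is a minimal sketch of such a linear schedule, assuming the decay is measured in training steps; `linear_epsilon` is a hypothetical helper, not part of `mc_model_mm_deep_rl_batch`.

```python
def linear_epsilon(step, n_train, initial_eps=1.0, final_eps=0.05, fraction=0.5):
    """Hypothetical linear epsilon-greedy schedule.

    Epsilon moves linearly from initial_eps to final_eps over the first
    `fraction * n_train` steps and is held at final_eps afterwards.
    """
    decay_steps = fraction * n_train
    progress = min(step / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


n_train = int(2e5)
for step in (0, 50_000, 100_000, 150_000):
    print(step, round(linear_epsilon(step, n_train), 3))
```

Under this assumption, with *n\_train* = 200,000 the exploration rate is 0.525 after 50,000 steps and reaches 0.05 at step 100,000, so the second half of training is almost entirely greedy.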

For this model, the market emulation is the bottleneck, so training runs faster on a CPU than on a GPU. This holds even when the emulation is multithreaded, as it is in this example. Setting *gpu* to -1 runs the training on the CPU.

In [4]:
gpu = -1

The results are saved to the following directory:

In [5]:
# naming the dir with the results
folder_name = "mc_deep_example"

outdir = f"results/mc_model_deep/{folder_name}/"

To train the agents, use the function *train\_multiple\_agents\_batch*.

In [6]:
train_multiple_agents_batch(
    DDQN_params, 
    model_params, 
    hyperparams["n_train"], 
    outdir, 
    hyperparams["n_runs"], 
    gpu=gpu
)

INFO:mc_model_mm_deep_rl_batch:The folder results/mc_model_deep/mc_deep_example/estimate_folder already exists.
INFO:mc_model_mm_deep_rl_batch:Run 1 in progress.
INFO:mc_model_mm_deep_rl_batch:- Step 40000 (20%), 0:05:15.470000 remaining of the run.
INFO:mc_model_mm_deep_rl_batch:- Step 80000 (40%), 0:04:12.870000 remaining of the run.
INFO:mc_model_mm_deep_rl_batch:- Step 120000 (60%), 0:02:55.090000 remaining of the run.
INFO:mc_model_mm_deep_rl_batch:- Step 160000 (80%), 0:01:28.160000 remaining of the run.
INFO:mc_model_mm_deep_rl_batch:- Step 200000 (100%), 0:00:00 remaining of the run.
INFO:mc_model_mm_deep_rl_batch:Saved the agent to results/mc_model_deep/mc_deep_example/model_folder/200000_finish_1
INFO:mc_model_mm_deep_rl_batch:Finished in 0:07:13.630000.
INFO:mc_model_mm_deep_rl_batch:0:21:40.900000 remaining of the training.
INFO:mc_model_mm_deep_rl_batch:Run 2 in progress.
INFO:mc_model_mm_deep_rl_batch:- Step 40000 (20%), 0:05:31.020000 remaining of the run.
INFO:mc_model_

## Evaluating the strategies

To evaluate the agents, use the function *evaluate\_DDQN\_batch*.

In [6]:
evaluate_DDQN_batch(
    outdir, 
    n_test=hyperparams["n_test"],                  
    Q=10,       # how many depths that should be displayed in the heatmaps
    randomize_start=model_params["randomize_reset"]
)

INFO:mc_model_mm_deep_rl_batch:The folder results/mc_model_deep/mc_deep_example/image_folder already exists.
INFO:mc_model_mm_deep_rl_batch:Plotting training.
INFO:mc_model_mm_deep_rl_batch:Plotting strategies.
INFO:mc_model_mm_deep_rl_batch:Evaluating agents.
INFO:mc_model_mm_deep_rl_batch:Evaluating benchmarks...
INFO:mc_model_mm_deep_rl_batch:...best agent
INFO:mc_model_mm_deep_rl_batch:...mean agent
INFO:mc_model_mm_deep_rl_batch:...constant strategy
INFO:mc_model_mm_deep_rl_batch:...random_strategy
INFO:mc_model_mm_deep_rl_batch:Visualizing the strategies.
INFO:mc_model_mm_deep_rl_batch:The folder results/mc_model_deep/mc_deep_example/image_folder already exists.




For the results of *evaluate\_DDQN\_batch* see the figures below.

The reward, the estimated state-value at (0,0) and the network loss during training:

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/training_graph.png"/>
</div>

The algorithm does not appear to have converged; it would need to be trained for considerably longer. It probably also requires hyperparameter tuning, since the Q-value estimate and the loss appear to be diverging.

The figure below shows the learnt bid depths:

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/bid_heat_randomized_10.png" width="500"/>
</div>

The boxplot below compares the average rewards of the DDQN strategies with those of several benchmark strategies.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/box_plot_benchmarking.png"/>
</div>



In [7]:
with open(f"{outdir}image_folder/table_benchmarking") as f:
    print(f.read())

strategy           mean reward    std reward    reward per action    reward per second
---------------  -------------  ------------  -------------------  -------------------
constant (d=1)          0.0402     0.104441              0.000402             0.000402
random                 -0.013      0.075743             -0.00013             -0.00013
DDQN (best run)         0.0254     0.0751188             0.000254             0.000254
DDQN (mean)             0.014      0.0967264             0.00014              0.00014
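
The last two columns follow from the mean reward. Assuming one action (quote) per time step, as stated for the model above, an episode with _T_ = 100 and *dt* = 1 contains 100 actions and lasts 100 seconds, so both columns equal the mean reward divided by 100. The snippet below reproduces them from the table's own mean rewards.

```python
T, dt = 100, 1
n_actions = T // dt  # one quote (action) per time step

# Mean episode rewards copied from the benchmarking table
mean_rewards = {
    "constant (d=1)": 0.0402,
    "random": -0.013,
    "DDQN (best run)": 0.0254,
    "DDQN (mean)": 0.014,
}

for strategy, mean_reward in mean_rewards.items():
    per_action = mean_reward / n_actions
    per_second = mean_reward / (n_actions * dt)  # episode length in seconds is T
    print(f"{strategy:<16} {per_action:.6f} {per_second:.6f}")
```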


To see how the mean strategy and the individual strategies behave, refer to the figures below. They show the average inventory, cash and value processes of the different strategies when evaluated over *n\_test* episodes.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/visualization_mean.png"/>
</div>

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/visualization_all.png"/>
</div>