# The MC Model

The Markov chain (MC) model is the second model used in this thesis. It is significantly more complex than the simple probabilistic model, but it is still quite rudimentary in comparison with real-life markets.

In this model, the limit order book (LOB) is modelled explicitly. There are six event types:

> 1. Buy limit orders 
> 2. Sell limit orders
> 3. Cancel buy orders
> 4. Cancel sell orders
> 5. Buy market orders
> 6. Sell market orders

The arrival of an order results in a state transition in the Markov chain. An example of how the arrival of different orders affects the LOB is shown in the image below.

<div>
    <img src="images/LOBDynamics.png" width=800/>
</div>
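
To make the six event types concrete, here is a minimal sketch of a toy order book and how each event changes it. The class `ToyLOB` and its methods are hypothetical and are not part of `mc_model_evaluation`; the sketch assumes integer prices in ticks and ignores limit orders that cross the spread.

```python
class ToyLOB:
    """Toy limit order book: each side maps a price (in ticks) to resting volume."""

    def __init__(self):
        self.bids = {}  # price -> volume of resting buy orders
        self.asks = {}  # price -> volume of resting sell orders

    def _book(self, side):
        return self.bids if side == "buy" else self.asks

    def best_bid(self):
        return max(self.bids) if self.bids else None

    def best_ask(self):
        return min(self.asks) if self.asks else None

    def limit(self, side, price, size):
        """Events 1 and 2: add volume at a price level."""
        book = self._book(side)
        book[price] = book.get(price, 0) + size

    def cancel(self, side, price, size):
        """Events 3 and 4: remove volume from a price level."""
        book = self._book(side)
        remaining = book.get(price, 0) - size
        if remaining > 0:
            book[price] = remaining
        else:
            book.pop(price, None)

    def market(self, side, size):
        """Events 5 and 6: consume the opposite side, best price first."""
        book = self.asks if side == "buy" else self.bids
        pick_best = min if side == "buy" else max
        fills = []
        while size > 0 and book:
            price = pick_best(book)
            take = min(size, book[price])
            book[price] -= take
            size -= take
            fills.append((price, take))
            if book[price] == 0:
                del book[price]
        return fills


lob = ToyLOB()
lob.limit("buy", 99, 5)
lob.limit("buy", 98, 3)
lob.limit("sell", 101, 4)
lob.limit("sell", 102, 6)
lob.cancel("buy", 98, 3)      # the level at 98 disappears
fills = lob.market("buy", 6)  # takes 4 at 101, then 2 at 102
```

After these events the best bid is 99 and the best ask has moved from 101 to 102, which illustrates how a single market order can shift the quotes.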


As in the simple probabilistic model, the following assumptions and definitions are retained:

> - Time \(t\) takes integer values from \(0\) to \(T\).
> - Bid and ask quotes are updated once per second.
> - Bid and ask depths are chosen from *max_quote_depth* discrete levels, corresponding to \(1\) through *max_quote_depth* ticks away from the best bid and best ask, respectively.
> - The cash process \(X_t\) denotes the market maker’s cash position at time \(t\).
> - The inventory process \(Q_t\) denotes the market maker’s inventory at time \(t\).
> - The value process \(V_t\) denotes the marked-to-liquidation value of the market maker’s position at time \(t\), i.e., cash plus the liquidation value of the current inventory.
> - The state available for decision-making includes the current time \(t\) and the current inventory. Inventory \(Q_t\) is discretised into bins with width determined by the parameter \(\kappa\).
> - At \(t = T\), the position is liquidated compulsorily.
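
The discretised state described above can be sketched as follows. The function names and the binning convention (bin 0 contains only zero inventory, and every other bin spans \(\kappa\) units) are assumptions made for illustration; the actual implementation may bin differently.

```python
import math


def inventory_bin(q, kappa):
    """Map inventory q to a signed bin index (assumed convention: bin 0 is q == 0)."""
    if q == 0:
        return 0
    return int(math.copysign(math.ceil(abs(q) / kappa), q))


def time_bucket(t, T, num_time_buckets):
    """Map t in [0, T] to a bucket index in [0, num_time_buckets - 1]."""
    return min(t * num_time_buckets // T, num_time_buckets - 1)


def state(t, q, T=100, kappa=3, num_time_buckets=100):
    """Discretised state used for decision-making: (time bucket, inventory bin)."""
    return (time_bucket(t, T, num_time_buckets), inventory_bin(q, kappa))


print(state(0, 0))    # start of an episode with no inventory
print(state(50, 5))   # halfway through, long 5 units
print(state(50, -3))  # halfway through, short 3 units
```

With \(\kappa = 3\), inventories of 4, 5 and 6 all fall into the same bin, so the agent treats them as one state.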


The _tick_ is the smallest price increment of the underlying, for instance $0.01 for one share of AAPL.

Unlike in the simple probabilistic model, it is not possible to derive an analytically optimal strategy in the MC model.

## The Q-learning

After that short introduction, it's time for some reinforcement learning in the form of Q-learning.
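
As a reminder of what tabular Q-learning does, here is a minimal sketch of ϵ-greedy action selection and the standard update rule. The function names, the encoding of actions as (bid depth, ask depth) pairs, and the discount factor `gamma` are assumptions made for illustration and are not taken from `mc_model_evaluation`.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(Q, state, action, reward, next_state, actions,
             alpha, gamma=1.0, terminal=False):
    """Standard update: Q(s,a) += alpha * (target - Q(s,a))."""
    if terminal:
        target = reward
    else:
        target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (target - Q[(state, action)])


# Assumed action set: every (bid depth, ask depth) pair with depths 1..5.
actions = [(b, a) for b in range(1, 6) for a in range(1, 6)]
Q = defaultdict(float)  # unseen state-action pairs start at 0

s, s_next = (0, 0), (1, 0)  # (time bucket, inventory bin)
q_update(Q, s, (1, 1), reward=2.0, next_state=s_next, actions=actions, alpha=0.5)
greedy_action = epsilon_greedy(Q, s, actions, epsilon=0.0)
```

After this single update, the entry for state `(0, 0)` and action `(1, 1)` holds half of the observed reward, and it becomes the greedy choice in that state.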

We start by importing the needed file.

In [1]:
# import the Q-learning file for the Markov chain model
from mc_model_evaluation import *




The environment parameters and the Q-learning hyperparameters must next be specified.

The MC model introduces several additional environment parameters, as outlined in the code snippet below. A key choice is the use of a longer episode (trading window), with *T=100*.

In [2]:
model_params = {
                "dt": 1,                    # length of each time step
                "T": 100,                   # length of the episode
                "num_time_buckets": 100,    # number of bins used for the time
                "kappa": 3,                 # size of the inventory bins
                "num_levels": 10,           # number of depth levels included in the LOB
                "default_order_size": 5,    # size of the orders the market maker (MM) places
                "max_quote_depth": 5,       # how deep (in ticks) the MM can place its quotes
}

The hyperparameter values must next be specified. The set of choices is relatively limited: parameter schedules are selected for the ϵ-greedy exploration policy and for the learning rate. In addition, the experimental budget is fixed by choosing the training horizon (number of episodes), the number of independent training runs, and the evaluation horizon.

> *\_start* denotes the initial value of a parameter;
>
> *\_end* denotes its terminal value;
>
> *\_cutoff* specifies the fraction of training at which the terminal value is reached (e.g., 0.5 corresponds to 50% of the training episodes).

**Note:** exploring starts is not used in this setting.

In [3]:
Q_learning_params = {
        # epsilon-greedy values (linear decay)
        "epsilon_start": 1,
        "epsilon_end": 0.05,
        "epsilon_cutoff": 0.5,

        # learning-rate values (exponential decay)
        "alpha_start": 0.5,
        "alpha_end": 0.001,
}

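hyperparams = {
        "n_train" : 3e3,    # number of training episodes per run
        "n_test" : 3e2,     # number of evaluation episodes
        "n_runs" : 4        # number of independent training runs
}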
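
The two schedules configured in `Q_learning_params` can be sketched as below. The function names are hypothetical, and because no `alpha_cutoff` is specified, the sketch assumes that the learning rate decays over the whole training horizon; the implementation in `mc_model_evaluation` may differ.

```python
def linear_epsilon(episode, n_train, start=1.0, end=0.05, cutoff=0.5):
    """Linear decay from start to end, reached after cutoff * n_train episodes."""
    frac = min(episode / (cutoff * n_train), 1.0)
    return start + frac * (end - start)


def exponential_alpha(episode, n_train, start=0.5, end=0.001):
    """Geometric decay from start (episode 0) to end (episode n_train).

    Assumption: the decay spans all n_train episodes.
    """
    return start * (end / start) ** (episode / n_train)


n_train = 3000
for episode in (0, 750, 1500, 3000):
    eps = linear_epsilon(episode, n_train)
    alpha = exponential_alpha(episode, n_train)
    print(f"episode {episode:4d}: epsilon = {eps:.3f}, alpha = {alpha:.4f}")
```

Under these assumptions, ϵ falls linearly to 0.05 by episode 1500 and stays there, while the learning rate shrinks by the same factor in every episode.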

Finally, we decide where to save our results.

In [1]:
# naming the results folder
folder_mode = True
folder_name = "mc_example"
save_mode = True

Run Q-learning with *Q\_learning\_comparison*.

In [5]:
Q_learning_comparison(
    **hyperparams,
    args=model_params,
    Q_learning_args=Q_learning_params,
    folder_mode = folder_mode,
    folder_name = folder_name,
    save_mode = save_mode
)

RUN 1 IN PROGRESS...
	Episode 600 (20%), 0:04:02.470000 remaining of this run
	Episode 1200 (40%), 0:03:00.580000 remaining of this run
	Episode 1800 (60%), 0:01:59.500000 remaining of this run
	Episode 2400 (80%), 0:00:59.630000 remaining of this run
	Episode 3000 (100%), 0:00:00 remaining of this run
THE FOLDER mc_example ALREADY EXISTS
...FINISHED IN 0:04:58.260000
0:14:54.790000 REMAINING OF THE TRAINING
RUN 2 IN PROGRESS...
	Episode 600 (20%), 0:03:58.930000 remaining of this run
	Episode 1200 (40%), 0:02:58.210000 remaining of this run
	Episode 1800 (60%), 0:01:56.890000 remaining of this run
	Episode 2400 (80%), 0:00:58.010000 remaining of this run
	Episode 3000 (100%), 0:00:00 remaining of this run
THE FOLDER mc_example ALREADY EXISTS
...FINISHED IN 0:04:47.830000
0:09:35.650000 REMAINING OF THE TRAINING
RUN 3 IN PROGRESS...
	Episode 600 (20%), 0:03:52.790000 remaining of this run
	Episode 1200 (40%), 0:02:56.420000 remaining of this run
	Episode 1800 (60%), 0:01:59.340000 rema

## Evaluating the strategies

Below are some figures resulting from running *Q\_learning\_comparison*.

The reward and the state-value at (0,0) during training:

<div>
    <img src="results/mc_model/mc_example/results_graph.png"/>
</div>

The curves suggest that the Q-learning has converged, but it has not. It has to be trained for *much* longer, as the results further down make evident.

The figure below shows the learnt bid depths.

<div>
    <img src="results/mc_model/mc_example/opt_bid_heat.png" width="500"/>
</div>

The average rewards of the Q-learning strategies and of some benchmark strategies are compared below:

<div>
    <img src="results/mc_model/mc_example/box_plot_benchmarking.png"/>
</div>


In [6]:
# print the saved benchmarking table
with open("results/mc_model/mc_example/table_benchmarking") as f:
    print(f.read())

strategy                 mean reward    std reward
---------------------  -------------  ------------
constant (d=1)              1.77           8.81346
random                     -0.423333       7.72123
Q_learning (best run)      -0.48           5.63941
Q_learning (average)       -0.79           5.94019


The results indicate that substantially longer training is required before any effective strategies emerge. Notably, the constant-depth baseline markedly outperforms all other strategies.