# SPM (Simple Probabilistic Model)

The simple probabilistic model (SPM) is the first model considered. It is defined as follows:
> * The time _t_ can take integer values between _0_ and _T_.
>
> * The midprice _S<sub>t</sub>_ is a Brownian motion rounded to the closest tick.
>
> * The market maker has to quote bid and ask prices every second.
>
> * The market maker can place its bid and ask quotes at _d_ different depths, from _0_ to _d - 1_ ticks away from the midprice.
>
> * The cash process _X<sub>t</sub>_ denotes the market maker's cash at time _t_.
>
> * The inventory process _Q<sub>t</sub>_ denotes the market maker's inventory at time _t_.
>
> * The value process _V<sub>t</sub>_ denotes the value of the market maker's position at time _t_, that is, its cash plus the value of its current inventory.
>
> * The market maker can see the current time and its inventory _(t,Q<sub>t</sub>)_ before taking an action.
>
> * At time _t = T_ the market maker is forced to liquidate its position.

The _tick_ is the smallest price increment of the underlying, for instance $0.01 for one share of AAPL.
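
The midprice and value definitions above can be illustrated with a minimal sketch. It is not the implementation used in this repository: the starting price `s0`, the volatility `sigma`, and the function names are hypothetical, and valuing the inventory at the midprice is an assumption.

```python
import numpy as np

def simulate_midprice(T=20, dp=0.01, s0=100.0, sigma=0.01, seed=0):
    """Brownian-motion midprice at t = 0, ..., T, rounded to the closest tick.

    s0 and sigma are hypothetical values chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    increments = sigma * rng.standard_normal(T)        # one Gaussian step per second
    path = s0 + np.concatenate(([0.0], np.cumsum(increments)))
    return np.round(path / dp) * dp                    # snap to the tick grid

def position_value(cash, inventory, midprice):
    """V_t = X_t + Q_t * S_t (assumes inventory is valued at the midprice)."""
    return cash + inventory * midprice

S = simulate_midprice()
V0 = position_value(cash=0.0, inventory=2, midprice=S[0])
```

With `T = 20` the path has 21 points, one for each integer time from _0_ to _T_.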

Following Cartea et al., an analytically optimal strategy can be derived for this model, and it is used as a benchmark for the strategies learnt with Q-learning. However, there is no guarantee that the analytical strategy is optimal in the discretized version of the model.

An example of the optimal bid depths for a specific set of model parameters is shown in the figure below. **Note** that these depths are _not_ discretized, that is, they are not rounded to the closest tick.

<div>
    <img src="images/ContinuousBid30.png"/>
</div>


## The Q-learning

Importing the source file.

In [None]:
# import the Q-learning file for the SPM
from simple_model_evaluation import *

Define the parameters to be used for the environment and the hyperparameters for the Q-learning.

The model has an episode length of *T = 20* and a running inventory penalty of *$\phi$ = 10<sup>-4</sup>*.

In [2]:
model_params = {
                "d": 4,         # the number of different quotation depths
                "T": 20,        # the length of the episode
                "dp": 0.01,     # the tick size
                "min_dp": 0,    # the minimum allowed quoting distance from the mid price, in ticks
                "phi": 1e-4     # the running inventory penalty
}

The next step is to specify the hyperparameter values.

The set of choices is relatively limited. The required selections concern the parameter schedules for the ϵ-greedy policy, the learning-rate schedule, and whether exploring starts are enabled. In addition, the training and evaluation budgets must be fixed: the training horizon (number of episodes), the number of independent training runs, and the length of the evaluation phase.

> *\_start* denotes the initial value of a parameter;
> 
> *\_end* denotes its terminal value;
> 
> *\_cutoff* specifies the fraction of training at which the terminal value is reached (e.g., 0.5 corresponds to 50% of the training episodes).
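
The *\_start*, *\_end*, and *\_cutoff* convention can be sketched as follows. This is an illustration under assumptions, not the code in `simple_model_evaluation`: the function names are hypothetical, and treating a cutoff of `None` as "decay over the whole training" is a guess at the intended behaviour.

```python
def linear_schedule(start, end, cutoff, n_episodes):
    """Linear decay from start to end, reached after cutoff * n_episodes episodes."""
    n_decay = max(int(cutoff * n_episodes), 1)
    def value(episode):
        frac = min(episode / n_decay, 1.0)      # stays at 1.0 after the cutoff
        return start + frac * (end - start)
    return value

def exponential_schedule(start, end, n_episodes):
    """Geometric decay from start (first episode) to end (last episode).

    Assumption: no cutoff, so the terminal value is reached at the end of training.
    """
    rate = (end / start) ** (1.0 / max(n_episodes - 1, 1))
    def value(episode):
        return start * rate ** episode
    return value

n_train = 100_000
epsilon = linear_schedule(start=1, end=0.05, cutoff=0.5, n_episodes=n_train)
alpha = exponential_schedule(start=0.5, end=0.001, n_episodes=n_train)
```

Here `epsilon(0)` equals the start value and `epsilon(e)` equals the terminal value for every episode `e` at or beyond half of the training.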

In [3]:
Q_learning_params = {
        # epsilon-greedy values (linear decay)
        "epsilon_start": 1,
        "epsilon_end": 0.05,
        "epsilon_cutoff": 0.5,

        # learning-rate values (exponential decay)
        "alpha_start": 0.5,
        "alpha_end": 0.001,
        "alpha_cutoff": None,

        # exploring starts values (linear decay)
        "beta_start": 1,
        "beta_end": 0.05,
        "beta_cutoff": 0.5,
        "exploring_starts": True
}

hyperparams = {
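        "n_train" : 1e5,    # the number of training episodes per run
        "n_test" : 1e4,     # the number of evaluation episodes
        "n_runs" : 4        # the number of independent training runs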
}
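
How ϵ and the learning rate typically enter a tabular Q-learning step is sketched below. This is the generic textbook update, not necessarily what `Q_learning_comparison` does internally: the function names, the encoding of an action as one index per (bid depth, ask depth) pair, the reward value, and the undiscounted `gamma = 1.0` are all assumptions made for illustration.

```python
from collections import defaultdict
import numpy as np

D = 4                  # number of quotation depths, as in model_params["d"]
N_ACTIONS = D * D      # assumption: one action per (bid depth, ask depth) pair

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def q_update(Q, state, action, reward, next_state, alpha, gamma=1.0, terminal=False):
    """One tabular Q-learning update; gamma=1.0 (no discounting) is an assumption."""
    bootstrap = 0.0 if terminal else gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (reward + bootstrap - Q[state][action])

rng = np.random.default_rng(0)
Q = defaultdict(lambda: np.zeros(N_ACTIONS))    # Q-table keyed by the state (t, Q_t)

state, next_state = (0, 0), (1, 1)              # hypothetical transition
action = epsilon_greedy(Q[state], epsilon=1.0, rng=rng)
q_update(Q, state, action, reward=0.01, next_state=next_state, alpha=0.5)
bid_depth, ask_depth = divmod(action, D)        # decode the action index
```

Keying the table by `(t, Q_t)` mirrors the fact that the market maker only observes the current time and its inventory before acting.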

Specify how the results are saved.

In [4]:
# name of the results directory
folder_mode = True
folder_name = "spm_example"
save_mode = True

Use the function *Q\_learning\_comparison* to run the Q-learning.

In [5]:
Q_learning_comparison(
    **hyperparams,
    args                = model_params,
    Q_learning_args     = Q_learning_params,
    folder_mode         = folder_mode,
    folder_name         = folder_name,
    save_mode           = save_mode
)

RUN 1 IN PROGRESS...
	Episode 20000 (20%), 0:02:15.900000 remaining of this run
	Episode 40000 (40%), 0:01:31.910000 remaining of this run
	Episode 60000 (60%), 0:00:55.660000 remaining of this run
	Episode 80000 (80%), 0:00:26.240000 remaining of this run
	Episode 100000 (100%), 0:00:00 remaining of this run
THE FOLDER spm_example ALREADY EXISTS
...FINISHED IN 0:02:06.430000
0:06:19.300000 REMAINING OF THE TRAINING
RUN 2 IN PROGRESS...
	Episode 20000 (20%), 0:02:24.510000 remaining of this run
	Episode 40000 (40%), 0:01:40.060000 remaining of this run
	Episode 60000 (60%), 0:01:01.650000 remaining of this run
	Episode 80000 (80%), 0:00:29.790000 remaining of this run
	Episode 100000 (100%), 0:00:00 remaining of this run
THE FOLDER spm_example ALREADY EXISTS
...FINISHED IN 0:02:26.520000
0:04:53.050000 REMAINING OF THE TRAINING
RUN 3 IN PROGRESS...
	Episode 20000 (20%), 0:02:18.700000 remaining of this run
	Episode 40000 (40%), 0:01:35.560000 remaining of this run
	Episode 60000 (60%),

## Evaluating the strategies

Below are some graphical representations of the results obtained by running *Q\_learning\_comparison*.

The first figure shows the reward and the state value at _(t,Q<sub>t</sub>) = (0,0)_ during training.

<div>
    <img src="results/simple_model/spm_example/results_graph.png"/>
</div>

The Q-learning appears to have converged here, but it has not. It needs to be trained for longer, as the benchmark comparison further down indicates.

The figure below shows the learnt bid depths.

<div>
    <img src="results/simple_model/spm_example/opt_bid_strategy.png" width="500"/>
</div>

The average rewards of the Q-learning strategies and the benchmark strategies are compared below.

<div>
    <img src="results/simple_model/spm_example/box_plot_benchmarking.png"/>
</div>

Further results are given in the table below.


In [1]:
# print the benchmarking table saved during the run
with open("results/simple_model/spm_example/table_benchmarking") as f:
    print(f.read())

strategy                 mean reward    std reward
---------------------  -------------  ------------
analytical_discrete        0.127752      0.0764701
analytical_continuous      0.132987      0.0717148
constant (d=2)             0.0987989     0.0796333
random                     0.0618333     0.0958115
Q_learning (best run)      0.123278      0.0772611
Q_learning (average)       0.127178      0.0785348


All in all, the Q-learning has found decent strategies: their mean reward is close to that of the analytical strategies and above that of the constant and random benchmarks. However, it needs to train for longer in order to find strategies that equal the analytical strategies in performance.
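
To put numbers on the comparison, the printed table can be parsed and each mean reward expressed relative to the discrete analytical benchmark. This is a small sketch: the helper `parse_benchmark_table` is hypothetical, and it assumes the columns are separated by at least two spaces, as in the output above.

```python
import re

table_text = """\
strategy                 mean reward    std reward
---------------------  -------------  ------------
analytical_discrete        0.127752      0.0764701
analytical_continuous      0.132987      0.0717148
constant (d=2)             0.0987989     0.0796333
random                     0.0618333     0.0958115
Q_learning (best run)      0.123278      0.0772611
Q_learning (average)       0.127178      0.0785348
"""

def parse_benchmark_table(text):
    """Return {strategy: (mean reward, std reward)} from the printed table."""
    rows = {}
    for line in text.strip().splitlines()[2:]:           # skip header and rule
        name, mean, std = re.split(r"\s{2,}", line.strip())
        rows[name] = (float(mean), float(std))
    return rows

rows = parse_benchmark_table(table_text)
ref_mean = rows["analytical_discrete"][0]
for name, (mean, _) in rows.items():
    gap = 100 * (mean / ref_mean - 1)                    # relative gap in percent
    print(f"{name:<22} {gap:+6.1f}% vs analytical_discrete")
```

The standard deviations in the table are large relative to the differences in means, so such gaps should be read with care.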

## More results?

Many more figures and tables can be found in the **[spm_example](https://github.com/KodAgge/Reinforcement-Learning-for-Market-Making/tree/main/code/results/simple_model/spm_example)** folder.