In [8]:
%load_ext autoreload
import sys
import os
import numpy as np
import pandas as pd
import tools as ut
import plot_tools as plt
import policy_gradient as pg

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
%autoreload 2


In [9]:
data = ut.load_data(rootDir, "2018-01-01", "2018-11-30", prefix = '', suffix = '.csv', date_format = '%Y%m%d')
data = data.query('ticker in ["BTCUSD", "LTCUSD"]')

## Optimal Trading Policy Learning

Charbel Saad, Olivier Pham, Benoit Zhou

## Abstract

Our project targets the optimal trading problem, a core challenge in quantitative trading. More precisely, we are interested in buying or selling an asset so as to maximize our future returns, given a predictor of the asset's return that we know to be statistically significant. In the presence of linear trading costs (quite common when market impact is negligible compared to fees or spread), J.-P. Bouchaud et al. [1] solved the multiperiod optimization problem, which takes into account the current prediction as well as potential future trading decisions, and gave an explicit policy when the predictor is an Ornstein-Uhlenbeck process. We model this problem as an MDP and solve it using a policy gradient algorithm.

## Introduction

Quantitative trading aims to take advantage of small market inefficiencies by trading an asset for which we have a small statistical prediction of the future price. Because this trading is automated and model-based, quantitative hedge funds usually trade multiple times a day and make a steady profit thanks to the law of large numbers. The predictor needs to be good enough for the agent to beat the trading costs, which usually break down into fees, spread and market impact costs. High-frequency traders usually hit the best limit, so their costs are due to fees and spread, which are linear in the traded quantity.

In our situation, we consider an agent facing a discrete-time trading problem. At each time $t$, the agent can either buy or sell the asset, and knows the value of his predictor at this time. He can choose to trade a signed quantity $q_t$, constrained by a maximum exposure to the asset, $M$ (i.e. $|x_t + q_t| \leq M$ where $x_t$ refers to the exposure at time $t$). The goal of the agent is to maximize his profit across a finite episode (a day). By modeling this problem as an MDP and simplifying it using some results from [1], we can use a policy gradient algorithm to follow the sensitivity of the PnL to our parameter and perform better than the naive solution in the long run.

We are going to use real data for the Bitcoin and Litecoin cryptocurrencies on the Coinbase exchange. Bitcoin being the most traded cryptocurrency by far, it often acts as a leader for other, smaller currencies: we are going to leverage this to predict the short-term return of Litecoin and backtest our trading algorithm.

## Problem description

The trading agent has a predictor $p_t$ at time $t$, which is an unbiased estimator of the forward return: $\mathbb{E}[r_{t+1}] = p_t$ where $r_{t+1}$ is the return of the asset between time $t$ and $t+1$ (as usual in finance, we implicitly consider expectations conditioned on the filtration at time $t$). The predictor is assumed to have the same properties as in [1], notably positive autocorrelation, the Markov property and symmetry. When trading a signed quantity $q_t$, the agent has to pay a linear cost $\Gamma |q_t|$ proportional to the traded quantity.

The naive solution, and usually not such a bad approximation, is to consider the single period optimization problem. Our greedy algorithm solves for the maximum expected one-step PnL:

<center>
    $\mathbb{E}[\mathrm{PnL}_{t+1}] = \mathbb{E}[r_{t+1}x_t] + \mathbb{E}[r_{t+1}q_t] - \Gamma |q_t|$, and knowing $\mathbb{E}[r_{t+1}] = p_t$,
    <br/>
    <br/>
    $q_t^* = \underset{q_t}{\operatorname{argmax}} \{p_t q_t - \Gamma |q_t| \quad \text{s.t.} \quad |x_t + q_t| \leq M\}$
</center>

This solution leads to the following policy:

 - If $p_t > \Gamma$, the agent buys until he reaches his maximum exposure $M$
 
 - If $p_t < -\Gamma$, the agent sells until he reaches his maximum exposure $-M$
 
 - Otherwise, nothing is done
 
In the case of linear costs, the shape of the solution is pretty simple: we trade everything we can when our prediction exceeds the costs. This strategy is commonly used among high-frequency players. More interestingly, Bouchaud et al. proved that solving the full multiperiod optimization problem yields a similar strategy parametrized by a threshold, which happens to be lower than the trading costs. Intuitively, the positive autocorrelation of the predictor makes it worthwhile to trade on weaker predictions, because a position opened now is likely to keep earning over several subsequent steps. An explicit threshold formula is given when the predictor follows an Ornstein-Uhlenbeck process.
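
As an illustration, the threshold policy described above can be sketched in a few lines of Python. This is a minimal sketch, not the notebook's implementation: the function name `threshold_policy` and its arguments are hypothetical, and the predictor and the threshold are assumed to be expressed in the same units. Setting the threshold to $\Gamma$ gives the naive single-period policy; a lower threshold gives the multiperiod-style policy.

```python
def threshold_policy(p, x, threshold, max_exposure):
    """Signed quantity to trade given predictor p and current exposure x.

    Hypothetical helper: trade to the exposure limit when |p| exceeds the
    threshold, otherwise stay put.
    """
    if p > threshold:
        return max_exposure - x       # buy until exposure reaches +M
    if p < -threshold:
        return -max_exposure - x      # sell until exposure reaches -M
    return 0.0                        # inside the no-trade band


gamma_cost = 1.0    # linear cost, assumed to be in the predictor's units
M = 10_000.0        # maximum exposure

print(threshold_policy(p=1.5, x=0.0, threshold=gamma_cost, max_exposure=M))   # 10000.0
print(threshold_policy(p=0.4, x=M, threshold=gamma_cost, max_exposure=M))     # 0.0
print(threshold_policy(p=-2.0, x=M, threshold=gamma_cost, max_exposure=M))    # -20000.0
```

Note that the traded quantity depends on the current exposure only through the distance to the limit, which is why the decision itself (long, short or hold) can be made from the predictor alone.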

In our framework, we are going to model the same problem as a Markov Decision Process $(S,A,T,R,\gamma)$ with finite episodes (one episode corresponds to a day). We can thus set $\gamma = 1$ (we want to maximize the daily PnL). We also consider:

  - $S$ to be the space of states, i.e. $s = (t,x_t,p_t)$
  - $A$ to be the set of actions (quantity we can trade), $A(s) = [-M-x_t, M-x_t]$
  - $T$ the transition function: we don't know $T$, but we assume the hypotheses on the predictor (positive autocorrelation, Markov property)
  - $R$ the reward function, i.e. $R(s,a) = r_{t+1}(x_t+q_t) - \Gamma |q_t|$ with $\mathbb{E}[r_{t+1}] = p_t$
  
The result given in [1] allows us to simplify the set of actions into $\{-1,0,1\}$, namely go full short, do nothing or go full long. Because no risk penalization is involved, the position at time $t$ doesn't interfere with this decision. This will enable us to parametrize a trading policy by a threshold.
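
To make the MDP ingredients concrete, here is a small sketch of the state, the admissible action interval $A(s)$ and the reward $R(s,a)$ as defined above. The names `State`, `feasible_actions` and `reward` are hypothetical and only serve the illustration; the realized return `r_next` is passed in explicitly because the transition function is unknown.

```python
from typing import NamedTuple


class State(NamedTuple):
    t: int      # time step within the episode (day)
    x: float    # current exposure x_t
    p: float    # current predictor p_t


def feasible_actions(state, max_exposure):
    """Bounds of the admissible trade interval A(s) = [-M - x_t, M - x_t]."""
    return (-max_exposure - state.x, max_exposure - state.x)


def reward(state, q, r_next, gamma_cost):
    """R(s, a) = r_{t+1} (x_t + q_t) - Gamma |q_t|, for a realized return r_next."""
    return r_next * (state.x + q) - gamma_cost * abs(q)


s = State(t=0, x=2500.0, p=0.0)
print(feasible_actions(s, max_exposure=10_000.0))   # (-12500.0, 7500.0)
```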

## Predicting the forward returns

We consider cryptocurrency data in the form of 1-minute Open-High-Low-Close-Volume bars pulled from the Coinbase cryptocurrency exchange from 2017-01-01 until 2018-11-30. We consider the trading price at the time of a datapoint to be the corresponding closing price. Thus, the one-period return is defined as $r_{t} = \frac{close_{t} - close_{t-1}}{close_{t-1}}$, and the forward return at time $t$ is $r_{t+1}$.

We consider specifically the two tickers BTC-USD and LTC-USD (Bitcoin and Litecoin quoted in USD). We are going to build a simple predictor based on the idea that Bitcoin movements affect other cryptocurrencies with a delay. For this, we consider a strategy trading every minute (which corresponds to the time unit in our discrete-time problem) and predict the 1-minute Litecoin return using the past 5-minute Bitcoin return. Our hypothesis is the following:

<br/>
<center>
   $\mathbb{E}[r_{t+1}] = \mathbb{E}[r_{t,t+1}(LTC)] = \beta r_{t-5,t}(BTC)$
</center>

An empirical overview of the data led us to fix $\beta = 0.1$, which gives a good fit to the hypothesis, as seen in the figure below.
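
For readers without access to the project's helper module, the hypothesis can be sketched in plain pandas. This is an assumption-laden illustration rather than the notebook's own code: `build_predictor` is a hypothetical name, the two inputs are assumed to be close-price series sharing the same 1-minute index, and the prices in the example are made up.

```python
import pandas as pd


def build_predictor(btc_close, ltc_close, beta=0.1):
    """Pair the scaled trailing 5-minute BTC return with the next 1-minute LTC return."""
    out = pd.DataFrame(index=ltc_close.index)
    # beta * r_{t-5,t}(BTC): only uses prices known at time t
    out["predictor"] = beta * (btc_close / btc_close.shift(5) - 1.0)
    # r_{t,t+1}(LTC): the return we are trying to predict
    out["fwd_ret1"] = ltc_close.shift(-1) / ltc_close - 1.0
    return out.dropna()


# Toy prices, for illustration only
idx = pd.date_range("2018-01-01", periods=8, freq="min")
btc = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0, 110.0, 111.0, 112.0], index=idx)
ltc = pd.Series([50.0, 50.0, 50.0, 50.0, 50.0, 50.0, 50.5, 50.5], index=idx)
print(build_predictor(btc, ltc))
```

The first five rows are dropped because the trailing 5-minute return is undefined there, and the last row is dropped because its forward return is not yet known.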

In [10]:
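# 1-minute forward return and trailing 5-minute return, both from close prices
ut.compute_return(data, "close", "fwd_ret1", -1)
ut.compute_return(data, "close", "ret5", 5)
# Scale the 5-minute return by beta = 0.1
data["ret5"] *= 0.1
ut.index(data, unit='s')
# Use the scaled BTC return as the LTC predictor, then keep only the LTC rows
data.loc[data.ticker == "LTCUSD", "predictor"] = data.loc[data.ticker == "BTCUSD", "ret5"]
data.query('ticker == "LTCUSD"', inplace=True)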

In [13]:
plt.plot_by_group(data, ['predictor'], ['fwd_ret1'], np.mean,  20, 'Average forward return as a function of predictor', 'predictor (bps)', 'return (bps)')

For moderate values of the predictor, we approximately have $\mathbb{E}[r_{t+1}] = p_t$. The predictor also has a positive autocorrelation, as seen in the autocorrelation plot below; this holds by construction, since consecutive predictor values are built from overlapping 5-minute returns.

In [14]:
def corr(i):
    return 100*np.corrcoef(data.predictor[5+i:], data.predictor.shift(i)[5+i:])[0,1]
plt.plot_scatter([[i for i in range(10)]], [[corr(i) for i in range(10)]], ['Autocorrelation function'], 'Autocorrelation', 'lag', 'Autocorrelation (in %)')
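
The effect of overlapping windows on the autocorrelation can be checked on synthetic data. The sketch below assumes i.i.d. 1-minute returns (an idealization, not a property of the real data) and sums them over rolling 5-minute windows; under that assumption the autocorrelation at lag $k$ is $\max(0, 1 - k/5)$, since two windows $k$ minutes apart share $5 - k$ of their 5 terms.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
one_min = pd.Series(rng.normal(0.0, 1.0, size=200_000))   # synthetic i.i.d. 1-minute returns
ret5 = one_min.rolling(5).sum().dropna()                   # overlapping 5-minute returns

# Empirical autocorrelation versus the theoretical value max(0, 1 - k/5)
acf = {lag: ret5.autocorr(lag) for lag in range(7)}
for lag, value in acf.items():
    print(f"lag {lag}: empirical {value:.3f}, theoretical {max(0.0, 1 - lag / 5):.3f}")
```

Any autocorrelation in the real predictor beyond lag 4 therefore comes from the Bitcoin returns themselves rather than from the overlap.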

## Reinforcement Learning Approach

Based on the previous considerations, we are going to parametrize our policy by a trading threshold $q$ (not to be confused with the traded quantity $q_t$). The decision making is the same as previously explained: trade the asset in the direction of the predictor when its absolute value exceeds the threshold $q$. A natural approach is to use a policy gradient algorithm and conduct an online gradient ascent over the threshold $q$ to converge to a smarter solution. We already know that the one-step problem yields $q = \Gamma$, and we want to leverage the positive autocorrelation to trade when the predictor is lower. An Ornstein-Uhlenbeck assumption for the process is quite strong, but we will try to express Bouchaud's threshold as a function of the parameter estimates we would use under the O-U hypothesis.

We chose to consider daily episodes, as one day is long compared to the predictor's horizon (a 5-minute lookback and a 1-minute realization). Limiting the episode to a day approximately amounts to a finite-horizon setting. We consider the following policy gradient algorithm:

  - Start from threshold $q_0 = \Gamma$
  - For all days $i \geq 1$:
      - Backtest day $i$ using threshold $q_{i-1}$
      - Compute the derivative of the PnL w.r.t. the threshold by finite differences: $d(i) = \frac{PnL(i, q_{i-1}+dq) - PnL(i, q_{i-1}-dq)}{dq}$
      - Update threshold using a learning rate $\alpha$: $q_i = q_{i-1} + \alpha d(i)$
  - Do this between 2018-01-01 and 2018-11-30
  
A backtest consists of replaying one day of data and simulating the trades we would have made by following a trading policy. The trades are marked at the close price of each 1-minute interval (the corresponding timestamp marking ensures no forward looking). The fees are deterministically charged as $\Gamma \cdot \mathrm{turnover}(i)$, where the turnover is the total unsigned traded quantity on day $i$.
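
The procedure above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the `policy_gradient` module used in this notebook: `backtest_day` and `learn_threshold` are hypothetical names, the predictor, returns and cost share the same units, the gradient uses the standard central difference (divided by $2\,dq$; the factor of 2 relative to the formula above can be absorbed into $\alpha$), and the threshold is clipped at zero as an extra safeguard. The synthetic days use an AR(1) predictor with noisy forward returns.

```python
import numpy as np


def backtest_day(pred, fwd_ret, threshold, max_exposure=1.0, gamma_cost=1.0):
    """Replay one day with a threshold policy, starting from a flat position.

    pred[t] is the predictor known at time t, fwd_ret[t] the return realized
    between t and t+1.
    """
    x, pnl = 0.0, 0.0
    for p, r in zip(pred, fwd_ret):
        if p > threshold:
            target = max_exposure       # go full long
        elif p < -threshold:
            target = -max_exposure      # go full short
        else:
            target = x                  # no trade
        q = target - x                  # signed traded quantity
        x = target
        pnl += r * x - gamma_cost * abs(q)   # return on new exposure minus linear fees
    return pnl


def learn_threshold(days, q0, alpha, dq, **kwargs):
    """Online gradient ascent on the threshold, one update per day."""
    q, history = q0, []
    for pred, fwd_ret in days:
        pnl = backtest_day(pred, fwd_ret, q, **kwargs)
        up = backtest_day(pred, fwd_ret, q + dq, **kwargs)
        down = backtest_day(pred, fwd_ret, q - dq, **kwargs)
        grad = (up - down) / (2.0 * dq)      # central finite difference
        history.append((q, pnl))
        q = max(q + alpha * grad, 0.0)       # assumption: keep the threshold non-negative
    return q, history


# Synthetic example: AR(1) predictor, forward return = predictor + noise
rng = np.random.default_rng(0)
days = []
for _ in range(20):
    pred = np.zeros(300)
    for t in range(1, 300):
        pred[t] = 0.8 * pred[t - 1] + rng.normal(0.0, 1.0)
    days.append((pred, pred + rng.normal(0.0, 5.0, size=300)))

q_final, history = learn_threshold(days, q0=1.0, alpha=1e-3, dq=0.1)
print(f"threshold after {len(history)} days: {q_final:.3f}")
```

Because the daily PnL is a piecewise-constant function of the threshold, the finite-difference estimate is noisy on any single day; the step $dq$ and the learning rate $\alpha$ chosen here are arbitrary and would need tuning on real data.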

## Results

We ran this strategy using daily backtests to evaluate and improve the policy. We used $M = 10000 \$$ and $\Gamma = 1$ basis point. We start every day with a neutral position and compute our predictor every minute. As expected, the threshold diminishes over time and appears to reach a stable value.

In [None]:
res = pg.policy_gradient(data[6:], 1)

In [23]:
plt.plot_scatter([res.index], [res.q], ['q'], '$q*$', 'date', 'bps')

We compared this strategy's PnL with that of the naive strategy: as time passes, the dynamic strategy starts to outperform the naive strategy on an almost daily basis.

In [24]:
plt.plot_scatter([res.index], [res['pnl_naive'].cumsum(), res['pnl_dynamic'].cumsum()], ['Naive threshold', 'Learned threshold'], 'PnL', 'date', 'USD')

In [25]:
plt.plot_scatter([res.index], [res.eval('pnl_dynamic - pnl_naive').cumsum()], ['Difference'], 'PnL difference', 'date', 'USD')

## Ornstein-Uhlenbeck parametric estimation

We are going to compare the threshold we found against the O-U threshold for the process. We keep in mind that this assumption is strong and that our predictor may not fit this formulation. A standard Ornstein-Uhlenbeck process is described by the following dynamics:

<br/>
<center>
    ${dX_{t}=-\epsilon X_{t}dt+\beta dW_{t}}$ where $W_t$ refers to the standard Brownian motion
</center>

It is a standard model of a mean-reverting process, where $\beta$ controls the amount of noise in the process and $\epsilon$ controls the speed at which it reverts to 0 (the mean of the process). It is often used to describe financial quantities, as some of them tend to exhibit this behaviour. It is not obvious in our formulation that our predictor obeys these dynamics.

The parameters of an O-U process can be estimated from the following stationary moment equations (with a unit time step):

  - $\mathrm{Var}(X_t) = \frac{\beta^2}{2\epsilon}$
  - $\mathrm{Cov}(X_t, X_{t+1}) = \mathrm{Var}(X_t)\exp(-\epsilon)$
  
Applying this estimation to our process over the whole dataset, we get $\beta = 1.89, \epsilon = 0.22$. According to the paper, this would put us in the $\beta > \Gamma$ regime, where the optimal threshold is approximately $\Gamma$. This threshold indeed performs close to the naive solution on our dataset.

In [27]:
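# Variance and lag-1 autocovariance of the predictor
V = np.var(data.predictor[6:])
C = np.cov(data.predictor[6:], data.predictor.shift()[6:])[0,1]
# Invert Cov = Var * exp(-eps), then Var = beta^2 / (2 * eps)
eps = np.log(V/C)
b = np.sqrt(2*V*eps)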
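
As a sanity check of these moment estimators, one can simulate an O-U process with known parameters and verify that they are recovered. The sketch below is an illustration on synthetic data only: `estimate_ou` is a hypothetical helper, and the simulation uses the exact unit-step discretization $X_{t+1} = e^{-\epsilon} X_t + \sigma Z_t$ with $\sigma^2 = \frac{\beta^2}{2\epsilon}(1 - e^{-2\epsilon})$ and standard normal $Z_t$.

```python
import numpy as np


def estimate_ou(x):
    """Moment estimates (eps, beta) of an O-U process sampled at unit time steps."""
    x = np.asarray(x, dtype=float)
    var = np.var(x)
    cov1 = np.cov(x[1:], x[:-1])[0, 1]      # lag-1 autocovariance
    eps = np.log(var / cov1)                # from Cov = Var * exp(-eps)
    beta = np.sqrt(2.0 * var * eps)         # from Var = beta^2 / (2 * eps)
    return eps, beta


# Simulate a stationary path with the parameter values quoted in the text
eps_true, beta_true, n = 0.22, 1.89, 200_000
rng = np.random.default_rng(0)
phi = np.exp(-eps_true)
stat_std = beta_true / np.sqrt(2.0 * eps_true)           # stationary standard deviation
noise = stat_std * np.sqrt(1.0 - phi**2) * rng.normal(size=n)
x = np.empty(n)
x[0] = stat_std * rng.normal()                           # start from the stationary law
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]

eps_hat, beta_hat = estimate_ou(x)
print(f"eps_hat={eps_hat:.3f}, beta_hat={beta_hat:.3f}")
```

Recovering the parameters on simulated data only validates the estimator; it says nothing about whether the real predictor actually follows O-U dynamics.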

## Conclusion

Optimal trading given a predictor of future returns is an essential problem, where we want to convert statistical predictions into buy or sell orders on the market. By parametrizing a simple strategy, we managed to improve the PnL of the one-step optimization solution by more than 10%, which is quite significant given that we trade the same predictor.

This method assumes no particular dynamics for the predictor, and could be extended to multi-asset trading, where the policy would be parametrized by a set of thresholds. This could enable us to incorporate interactions between assets, as the gradient ascent updates all thresholds simultaneously. The main assumption in this problem is the linearity of costs, so the method is applicable where market impact costs are very small. In fact, it is very common to consider trading at the best available limit only.

We need to keep in mind that, because we are using a single predictor, we backtested the strategy using 1 basis point of fees for research purposes. The fees on cryptocurrency exchanges are usually higher, and a stronger statistical edge would be needed to build a steadily profitable, realistic strategy.

## References

[1] J.-P. Bouchaud et al. Optimal Trading with Linear Costs. <i>arXiv:1203.5957</i>.