In [34]:
%load_ext autoreload

import utils as ut
import plot_tools as plt
import stat_tools as st
import pandas as pd
import plotly
import numpy as np

%autoreload 2

plotly.offline.init_notebook_mode(connected=True)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Notes

The data is extracted from the Binance API.

https://github.com/binance-exchange/binance-official-api-docs/blob/master/rest-api.md#klinecandlestick-data

Connection is made via the binance.py script, which enables local dump of the data

The data precision is 1 minute

Some studies have been made with different parameters (especially non-working ones), only one is included in the presentation in general

The notebook is still under construction

## Data

In [2]:
dataPath = "/home/charbel/Documents/Stanford/Project/data/"

In [3]:
data = ut.load_data(dataPath, "2019-07-01", "2019-12-31")
##PAXUSDT is a tether, we remove it
data = data[data.ticker != "PAXUSDT"]

100%|██████████| 184/184 [00:02<00:00, 82.22it/s]


In [4]:
##Indexing the data
data['close_time'] += 1
ut.index(data)

In [5]:
## Columns 
cols = ['close_time', 'ticker','close','volume_quote','buy_vol_quote','num_trades']
data = data[cols]
data.rename(columns = {'close_time':'t', 'close':'price', 'volume_quote':'volume', 'buy_vol_quote':'buy_vol'}, inplace=True)
data.eval('sell_vol = volume - buy_vol', inplace=True)

In [6]:
## Selecting tickers available from a certain date
start_dates = pd.to_datetime(data.groupby('ticker').first().t, unit='ms').sort_values()
tickers = start_dates[start_dates<='2019-07-02'].index
data.query('ticker in @tickers', inplace=True)

In [7]:
##Just to check if duplicates (API seems to send duplicates when data not available)
#t=data.groupby(['datetime','ticker']).count()
#t[t>=2].dropna()

In [8]:
data = data.drop_duplicates()

In [9]:
st.compute_volume(data, 'buy_vol', 'buy_vol_r', 60)
st.compute_volume(data, 'sell_vol', 'sell_vol_r', 60)
st.compute_volume(data, 'volume', 'volume_r', 60)

## Preliminary study

We compute the returns for 5 min, 1 hour, and 12 hours (different timescales)

In [10]:
st.compute_return(data, 'price', 'r5', 5)
st.compute_return(data, 'price', 'r60', 60)
st.compute_return(data, 'price', 'r720', 720)

#### Cross-sectional correlations

The returns become more correlated when the lookback is bigger (as expected)., we can observe some coins which seem more isolated

In [11]:
plt.plot_corr_heatmap(st.cross_correlation(data, 'r5'))
plt.plot_corr_heatmap(st.cross_correlation(data, 'r60'))
plt.plot_corr_heatmap(st.cross_correlation(data, 'r720'))

#### Lagged-sectional correlations

- Lagged returns (5min/1H/12H) on x-axis
- Blue diagonal represents self mean reversion

We would expect larger cryptocurrencies to lead smaller one. We observe this effect on the 5-min timescale and much less later. That means the leadlag effect is generally realized under an hour. For Bitcoin, the high liquidity allows this effect to realize on an even shorter scale (disappears in the 1-hour plot)

In [12]:
plt.plot_corr_heatmap(st.lagged_correlation(data, 'r5', 5), False)
plt.plot_corr_heatmap(st.lagged_correlation(data, 'r60', 60), False)
plt.plot_corr_heatmap(st.lagged_correlation(data, 'r720', 720), False)

#### PCA

These heatmaps represent the principal components of the cross-sectional returns

The x-axis is labeled as the cumulative percentage of variance predited

We can see that the first component is stronger for longer lookbacks (in line with the higher observed correlations). The weightings on the first component are relatively uniform, that component represents the market return.

In [13]:
plt.plot_pca(data, 'r5')
plt.plot_pca(data, 'r60')
plt.plot_pca(data, 'r720')

In [14]:
##Cross sectional centering of the returns
st.cross_center(data, 'r5')
st.cross_center(data, 'r60')
st.cross_center(data, 'r720')

#### Residual reversion

We now center our returns cross-sectionally:

- We take the nonweighted average of returns per timestamp (the market return)
- We substract it from the individual returns

This computes a proxy for the market orthogonalized return, and allow us to more clearly observe the self mean-reversion effect of the assets.

The lagged correlation plot of the residuals is shown below

In [15]:
plt.plot_corr_heatmap(st.lagged_correlation(data, 'r5-c', 5), False)
plt.plot_corr_heatmap(st.lagged_correlation(data, 'r60-c', 60), False)
plt.plot_corr_heatmap(st.lagged_correlation(data, 'r720-c', 720), False)

We observe high negative autocorrelation on the diagonal, which suggests a strong reversion effect. This effect is present over the course of a few minutes and hours, but disappears afterwards.

## Factor study

In this section we are going to benchmark different factor ideas. Factor potential should be benchmarked versus the market residual, defined by:

$$res^i_{m} = r^i - \beta^i*r_m$$

For simplicity we take the residual to be the cross-sectionally centered return, ie using $\beta_i=1, r_{market}=\bar{r^i}$

For now we are going to consider a frequency of 1 hour: 12 hours seems quite a long timespan for cryptocurrency (the reversion effect wasn't strong, meaning not many traders perform statistical arbitrage over this timespan) and 5 minutes would make it harder to have a profitable strategy given the linear fees.

Our methodology is as follow: for each factor idea, we will compute a metric accross the universe for each timestamp. We will rank the assets by timestamp with respect to this metric and observe the average residual per rank.

In [16]:
data = data[data.index.minute==0]
data['perf'] = data.groupby('ticker')['r60-c'].transform(lambda x: x.shift(-1))

### Mean-reversion

Mean-reversion is a very typical effect we observe on stocks. When orthogonalizing to the market or sector momentum effect, we observe a consistent negative autocorrelation of the residuals. That is due to a very low probability of consistently underperforming/outperforming the market or sector.

The previous plots leads us to think that this effect is similar for cryptocurrencies: they revert around the market return.

Our metric for this factor is going to be the past 1-hour return

$${MR}^i_t = -res^i_{m,t-1}$$

In [17]:
data['mr'] = -data['r60-c']

In [18]:
data['r']  = data.groupby('datetime')['mr'].rank('min')
plotData = data.groupby('r')['perf'].mean()
plt.plot_scatter(plotData.index, [plotData.values], [plotData.name])

As predicted by the previous plots we observe a strong relationship between a higher past return and a future lower one. Recent top performers perform worse in the future and vice-versa

### Volatility

Stock factor models often incorpore a volatility metric, which quantifies the risk linked to investing in a specific asset. The idea behind this reasoning is that more risky investments will tend to yield better returns, ie highly volatile stocks tend to outperform lower volatility ones.

We are going to experimentally test this effect on cryptocurrencies. For this we compute a short term and a long term volatility

$$shortvol^i_t = std(res^i_{m,t}, 12h)$$
$$longvol^i_t = std(res^i_{m,t}, 7D)$$

In [19]:
st.compute_volatility(data, 'r60', 'svol', 12)
st.compute_volatility(data, 'r60', 'lvol', 24*7)

In [20]:
data['r']  = data.groupby('datetime')['svol'].rank('min')
plt.plot_df(data.groupby('r')['perf'].mean())

In [21]:
data['r']  = data.groupby('datetime')['lvol'].rank('min')
plt.plot_df(data.groupby('r')['perf'].mean())

We observe similar results for both versions, volatility doesn't seem to yield a risk premium on cryptocurrencies, at least on an intraday basis.

### Volume

Another common factor in the stock market is size, which measures a company market capitalization. We usually observe higher returns on small caps. In the context of cryptocurrencies, we could think that ones with smaller market cap are still in expansion phase and are more likely to outperform the market. Unfortunately as of now we do not dispose of market cap data.

We are going to compute a cryptocurrency liquidity, as measured by its trailing volume over the past week

$$liq^i = volume(i, 7D)$$

We could maybe expect less popular cryptocurrencies to be growing. Although this effect has more chance to realize on a longer term (a couple of days), we can evaluate its predictability on the 1-hour returns

In [22]:
st.compute_volume(data, 'volume_r', 'liq', 24*7)

In [23]:
data['r']  = data.groupby('datetime')['liq'].rank('min')
plt.plot_df(data.groupby('r')['perf'].mean())

That factor doesn't seem to produce useful results. Prehaps it's worth trying to scrap the market capitalization data and compute a proper size factor

### Cash flow

Cryptocurrency markets, similarly to stock, work with a continuous double-auction mechanism. This introduces an asymetry between a taker (who sends a market order to the exchange) and a maker (who places an order on the book), regardless of the trader being a buyer or a seller. Market orders are believed to be more aggressive and tend to cost more (half of a spread at least), as the order is filled immediately.

Using other taker trader's informations is a successful strategies on stocks (in some situations), this can be naively done by computing an imbalance between the taker buy quantity and the taker sell quantity over a time period. We are going to compute this metric in our universe and evaluate its predictability. 

$$cf^i=\frac{buy\_volume(i, 1h) - sell\_volume(i, 1h)}{volume(i, 1h)}$$

In [24]:
data['cf'] = data.eval("(buy_vol_r-sell_vol_r)/(buy_vol_r+sell_vol_r)")

In [26]:
data['r']  = data.groupby('datetime')['cf'].rank('min')
plt.plot_df(data.groupby('r')['perf'].mean())

We get a quite surprising result, that is a negative correlation with the rank. The cashflow is a momentum indicator, we expect a positive correlation with the returns. However in our case, we already know that the residuals incorpore a strong reversion effect. By construction, the cashflow is correlated with the past return (buying/selling stocks induces instantaneous changes of prices, so we expect a stock under a buy pressure to have rosen in the past hour) so what we might see is the reversion effect.

In order to check whether this indicator has predictability, we will benchmark it against the returns we are going to obtain by residualizing further to the reversion effect. We are going to compute this residals later.

## Factor model

We are going to build on these ideas to construct a factor model which incorporates CAPM and our reversion indicator. The model can be written as:

$$r^i = \beta^ir_m + {MR}^ir_{mr} + res^i$$

where $r_{mr}$ is the return of the mean-reversion factor, and $r_m$ is the market return (to be defined).

Our methodology is the following:

- For each ticker i, compute $\beta^i$ by regressing $r^i$ over $r_m$ for each month. To avoid forward looking, the coefficients of month m will be used in month m+1
- For each ticker i, compute the market residual $res^i_m = r^i - \beta^ir_m$
- For each timestamp t, compute $r_mr$ by regressing $res^i_m$ over ${MR}^i$ (cross-sectional regression here)

In the future, a ticker's weight will refer to it's trailing volume over the past week

$$w^i = volume(i, 7D)$$

### Market return

Stock markets use commonly traded ETFs to represent market returns, thus $r_m$ corresponds to a weighted portfolio of stocks. To create an analog of the cryptocurrency market, we are going to consider 4 ideas:

- Equally weighted portfolio
- Weighting portfolio with $w^i$
- Weighting portfolio with $\sqrt{w^i}$
- Using Bitcoin as the market return (note that this prevents us from using Bitcoin as an asset in the model universe)

The equally weighted portfolio corresponds to the cross-centering of returns we executed previously. Weighting with the trailing volume will most likely put a heavy weight on Bitcoin which has a huge market share. We can omit the 4th version then, as the results will be similar to the weighted version. The square-root version is typical for stocks, it aims to tighten the gaps between the stocks, given that the traded volume distribution spans over multiple magnitudes.

In [35]:
st.compute_volume(data, 'volume_r', 'w', 24*7)
data['w'] = data.groupby('datetime')['w'].transform(lambda x: x/x.sum())
data['sqw'] = np.sqrt(data['w'])
data['sqw'] = data.groupby('datetime')['sqw'].transform(lambda x: x/x.sum())

In order to visualize our computation, we are going to look at the average weight of each ticker over the period

In [36]:
plt.plot_df(data.groupby('ticker')[['w','sqw']].mean())

As expected we observe a near dirac on Bitcoin for $w_i$, and the square-root version smoothens the differences out. It still keeps a heavy weight on Bitcoin though.

In [37]:
data['m60'] = data.groupby('datetime')['r60'].transform('mean')

In [38]:
data['period'] = data.index.month + 12*data.index.year
data['period'] -= data.period.min()

In [39]:
st.wmean(data, 'r60', 'w', 'm60-w')
st.wmean(data, 'r60', 'sqw', 'm60-sqw')

In [40]:
data[['m60', 'm60-w', 'm60-sqw']].corr()

Unnamed: 0,m60,m60-w,m60-sqw
m60,1.0,0.915774,0.981981
m60-w,0.915774,1.0,0.966315
m60-sqw,0.981981,0.966315,1.0


The 3 versions are highly correlated. The square root being closer to an equally weighted portfolio, we observe higher correlation.

In order to pick a version, we are going to see if weighting helps improving the cross-centering we already have. In order to do this, we are going to look at the reversion potential of the residuals, and see if its stronger against the weighted portfolios.

In [41]:
data.eval('res_w = r60 - r60 - r60-w', inplace=True)
data.eval('res_sqw = r60 - r60 - r60-sqw', inplace=True)

In [42]:
plt.plot_corr_heatmap(st.lagged_correlation(data, 'r60-c', 1), False)
plt.plot_corr_heatmap(st.lagged_correlation(data, 'res_w', 1), False)
plt.plot_corr_heatmap(st.lagged_correlation(data, 'res_sqw', 1), False)

The autocorrelation plots of the 3 versions suggest a better uniformity for the equally-weighted portfolio, it seems to work better accross cryptocurrencies. We notice however that the weighted versions help reduce the outlying positive autocorrelation of Bitcoin. As our goal is to cover a broader universe, we will not give significant preference to this

To evaluate the strength of the signal, we are going to measure proxy returns and a proxy sharpe ratio. These metrics correspond to replicating a simple strategy which consists in holding at each timestamp $t$ a position on asset $i$ proportional to ${MR}^i_t$.

The metrics are therefore given by:

$$ret({MR}_t) = \frac{\sum_i{{MR}^i_tr^i}}{\sum_i{|{MR}^i_t|}}$$

$$ret(MR) = mean(ret({MR}_t))$$

$$sr(MR) = \frac{ret(MR)}{std(ret({MR}_t))}$$

In [43]:
st.pred_stats(data, 'r60-c', 'perf')


invalid value encountered in double_scalars



Unnamed: 0,return,sharpe
0,-10.416865,-0.204516


In [44]:
data['perf'] = -data.groupby('ticker')['res_w'].transform(lambda x: x.shift(-1))
st.pred_stats(data, 'res_w', 'perf')


invalid value encountered in double_scalars



Unnamed: 0,return,sharpe
0,7.298636,0.096832


In [45]:
data['perf'] = -data.groupby('ticker')['res_sqw'].transform(lambda x: x.shift(-1))
st.pred_stats(data, 'res_sqw', 'perf')


invalid value encountered in double_scalars



Unnamed: 0,return,sharpe
0,7.29874,0.09683


The results confirm our observations, the cross-sectionally centered return revert better accross the universe than their weighted counterparts. From now on $r_m$ will refer to the equally-weighted portfolio.

### Model

We first compute the betas issued from the rolling regression. The function regresses the returns over the market returns, outputs the betas for each ticker and residualize the returns using the coefficient of the past month.

In [46]:
#Removing null values from shifts
datan = data.loc["2019-07-02":]

In [47]:
beta = st.exposure_regression(datan, 'r60', 'm60', 'mres60')



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [48]:
plt.plot_df(pd.DataFrame(np.array([beta[t].values.T[0] for t in beta.keys()]).T, columns = beta.keys()))

The betas does not span over a very wide range as we can see, and they seem to be quite stable through the period. That is pretty reassuring, as the market portfolio will rarely have to change.

In [49]:
plt.plot_corr_heatmap(st.lagged_correlation(datan, 'mres60', 1), False)

The model residuals (which are very close to the cross-sectionally centered returns, given that the betas are all close to 1) exhibits the desired reversion behaviour. We note the same Bitcoin singularity, for which the residuals seem to be autocorrelated.

We will now compute the mean-reversion exposures, as the opposite of the previous residual for each ticker. We will execute the regression that outputs us the vector of mean-reversion returns

In [50]:
datan['mr'] = -datan.groupby('ticker')['mres60'].transform('shift')



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [51]:
datan = datan.loc["2019-08-02":]

In [52]:
fret = st.return_regression(datan, 'mres60', ['mr'])

In [53]:
datan = datan.merge(fret, left_index=True, right_index = True)

For each factor in a factor model, we can construct what is called the replicating portfolio. This portfolio is a linear combination of the assets and replicates the factor's return. Indeed, let us consider the factor model

$R_t = B_tf_t + RES_t$

where $R_t$ is the vector of asset returns, $B_t$ is the matrix of factor exposures, $f_t$ is the vector of factor returns, $RES_t$ is the residual 

The minimization that yields $f_t$ is a least-squares so we can write the explicit solution

$f_t = (B_t^TB_t)^{-1}B_tR_t$

The row $(F_t)_k$ of matrix $F_t = (B_t^TB_t)^{-1}B_t$ corresponds to the k-th factor's replicating portfolio, indeed we have $f_t = F_tR_t$

In [54]:
fport = datan.groupby('datetime').apply(lambda x: st.replicating_portfolio(x, ['mr'])).reset_index(level=1, drop=True)

In [55]:
plt.plot_df(1e-2*(fret['r60_mr']/np.abs(fport).sum(axis=1)).cumsum(), mode='lines')

This factor is a directional factor: contrarily to the market return, its mean is nonzero and following it yields positive returns. As shown in the plot above, the replicating portfolio performs very well over the period.

In [56]:
datan['res'] = datan.eval('mres60 - mr*r60_mr')

In [57]:
datan['pred'] = datan.eval('(r60-mres60) + mr*r60_mr')

In [58]:
plt.plot_pca(datan, 'res')

The PCA of the model residuals shows already much less correlation between the assets. In the first component, we can distinguish positive weight in particular for currencies traded on the Coinbase widely popular exchange (Bitcoin, Euthereum, Litecoin). We can pursue more on that idea later.

## Advanced studies

Here we follow on more ideas that could help us refine our model

### Residual cashflow

We saw previously that the cashflow metric yielded negative returns, probably because of its correlation with the reversion effect. We can benchmark the indicator on the new reversion-orthogonalized returns

In [59]:
datan['perf'] = datan.groupby('ticker')['res'].transform(lambda x: x.shift(-1))

In [60]:
datan['r']  = datan.groupby('datetime')['cf'].rank('min')
plt.plot_df(datan.groupby('r')['perf'].mean())

We can see that residualization indeed eliminated the reversion effect. However, this indicator does not yield momentum here. A possible cause would be its realization over a much shorter term (when we capture the pressure, the price move already realized)

### Reversion potential

In [61]:
datan['perf'] = datan.groupby('ticker')['res'].transform(lambda x: x.shift(-1))

In [62]:
plt.plot_corr_heatmap(st.lagged_correlation(datan, 'res', 1), False)

The figure above shows the autocorrelation plot of the fully residualized returns. We observe a singularity for Bitcoin, which is due to its reversion being realized quicker. To see if this effect is worth investigating, we examine its consistency through time. 

For this we benchmark the reversion of the fully-residualized return using the proxy return metric described previously. We compute the cumulative sum of the returns over the period.

In [63]:
plt.plot_df(datan.query("ticker=='BTCUSDT'").eval('res*perf/abs(res)').cumsum())

This quite steady upwards curve supports time consistency of this effect. As it is present on one single asset, we won't address it directly, however it can be the basis of the idea of reversion potential.

We assume that assets undergo momentum and reverting periods. Instead of assuming the asset will always revert, we compute its return autocorrelation over 2 days, and use it to condition the reversion indicator

$${Rpot}^i = -ac(res^i, 2D)$$

$$sig^i = {Rpot}^i*{MR}^i$$


In [64]:
datan['rpot'] = datan.eval('num=mr*mres60').groupby('ticker')['num'].transform(lambda x:x.rolling(48).mean()) / datan.groupby('ticker')[['mr','mres60']].transform(lambda x: x.rolling(48).std()).eval('mr*mres60')

In [65]:
plt.plot_df(datan.groupby('ticker')['rpot'].mean())

The mean of the computed reversion potential matches our expectations: positive for all cryptocurrencies except Bitcoin (which had a momentum tendency)

In [66]:
datan['perf'] = datan.groupby('ticker')['mres60'].transform(lambda x: x.shift(-1))

In [67]:
plt.plot_df(
    pd.merge(
        datan.eval('sig = -mres60*rpot').groupby('ticker')['perf','sig'].apply(lambda x: x.corr().iat[0,1]).rename('signal'),
        datan.eval('sig = -mres60').groupby('ticker')['perf','sig'].apply(lambda x: x.corr().iat[0,1]).rename('benchmark'),
        left_index=True,
        right_index=True
    )
)

This plot shows the correlation of this signal with the forward residuals, benchmarked against the regular reversion indicator. The new signal improves Bitcoin condition as expected, however it yields lower returns overall. Because of the added complexity we will stick with the regular indicator in the model.

### Long term mean-reversion

For now we incorporated an immediate mean reversion factor. Market participants include traders that consider different time scales in their analysis, so we could try to analyze a longer term mean reversion effect. Here we present the 6-hour mean reversion effect. To compute the indicator, we use the following formula:

$$\frac{1}{5}\sum_{s=t-1}^{t-6}{{MR}^i_s}$$

Note that we do not include the mean reversion at time $t$, we just consider what happened between 6-hours before and 1-hour before (as the short term reversion is already incorporated)

In [68]:
datan['lmr'] = datan.groupby('ticker')['mr'].transform(lambda x: x.rolling(5).mean().shift())

In [69]:
datan['r']  = datan.groupby('datetime')['lmr'].rank('min')
plotData = datan.groupby('r')['res'].mean()
plt.plot_scatter(plotData.index, [plotData.values], [plotData.name])

Performance is increasing with the rank, so this effect adds to the predictability of residuals. We can thus create a new factor based on this metric.

### Sharpe

We compute the past day sharpe ratio, to see if steadiness in the returns induce less reversion.

In [70]:
datan['sharpe'] = datan.groupby('ticker')['mres60'].transform(lambda x:x.rolling(24).mean()/x.rolling(24).std())

In [71]:
plt.plot_df(
    pd.merge(
        datan.eval('sig = -sharpe').groupby('ticker')['perf','sig'].apply(lambda x: x.corr().iat[0,1]).rename('signal'),
        datan.eval('sig = -mres60').groupby('ticker')['perf','sig'].apply(lambda x: x.corr().iat[0,1]).rename('benchmark'),
        left_index=True,
        right_index=True
    )
)