## A Case Study: Fast Rolling Betas Calculation

In my previous [notebook](https://github.com/coindataschool/pytips/blob/main/pandas/apply/08-roll-n-groll-are-slow.ipynb), 
I gave two custom functions, `roll()` and `groll()`, that take multiple 
columns as input and output rolling statistics. They are generic, but they can 
be very slow, so we may want to look for faster solutions whenever we can. For 
example, given two daily return series, say Apple's stock and the S&P 500,
can you calculate the 30-day moving betas of Apple against the S&P 500? What if I 
give you 10,000 stocks' return series? Can you calculate the 30-day moving betas 
of each stock against the S&P 500? 

In the code blocks below, I give a very fast solution, `rolling_betas()`. It uses
`numpy`, and I recommend studying it line by line, in particular how 
`as_strided()` is used. I also copied over `roll()` and `groll()` from last 
time because we'll compare their speed.

In [1]:
from typing import Union
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided
from defillama2 import DefiLlama

pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 50)
pd.options.display.float_format = '{:,.4f}'.format

In [4]:
def calc_beta(y: np.ndarray, x: np.ndarray):
    """
    Solves for beta in a simple linear regression fit
    """
    x = np.vstack((np.ones_like(x), x)) # stack a row of 1's on top of x for the intercept, shape (2, n)
    b = np.linalg.pinv(x.dot(x.T)).dot(x).dot(y) # normal equations: (X X')^+ X y
    return b[1] # b[0] is the intercept, b[1] is the slope (beta)

def rolling_betas(y_df: pd.DataFrame, x_df: pd.DataFrame, window: int):
    """
    Fast rolling betas calculation for many stocks against the market. Same as 
    running a simple linear regression for each stock against the market: 
    stock = a + b * market + noise. It uses `calc_beta()` defined above.
    """    
    y, x = y_df.values, x_df.values
    l, w = y.shape
    ys0, ys1 = y.strides
    xs0 = x.strides[0] # x has its own memory layout, so don't reuse y's strides
    result = np.empty(shape=y.shape, dtype=float)
    result[0:window-1, :] = np.nan # not enough history for the first window-1 rows
    # zero-copy views: one (window x w) block of y and one length-window slice of x per end date
    y_arr = as_strided(y, shape=(l - window + 1, window, w), strides=(ys0, ys0, ys1))
    x_arr = as_strided(x, shape=(l - window + 1, window), strides=(xs0, xs0))
    for row in range(window-1, l):
        result[row, :] = calc_beta(y_arr[row - window + 1, :], x_arr[row - window + 1])
    return pd.DataFrame(data=result, index=y_df.index, columns=y_df.columns)

# # how to use
# rolling_betas(stocks, market, ndays)
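
To make the `as_strided()` trick easier to follow, here is a minimal sketch on a toy 1-D array (the array and variable names are made up for illustration). It builds all length-3 windows as a view, without copying, and checks the result against NumPy's `sliding_window_view()`, which assumes NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

a = np.arange(6, dtype=float)      # [0., 1., 2., 3., 4., 5.]
window = 3
(step,) = a.strides                # bytes between consecutive elements (8 for float64)

# Moving down one row advances the start by one element (stride `step`);
# moving right inside a row also advances by one element (stride `step`).
windows = as_strided(a, shape=(len(a) - window + 1, window), strides=(step, step))
print(windows)
# [[0. 1. 2.]
#  [1. 2. 3.]
#  [2. 3. 4.]
#  [3. 4. 5.]]

# sliding_window_view builds the same read-only view and validates the shape for us.
safe = sliding_window_view(a, window)
assert np.array_equal(windows, safe)
```

Note that `as_strided()` does no bounds checking: a wrong `shape` or `strides` silently reads memory outside the array. The windows also share memory with `a`, so treat them as read-only.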

In [2]:
def roll(df: pd.DataFrame, window: int, **kwargs):
    """
    Create all rolling subframes and group them by time index and return a 
    groupby object to be chained with apply(). Slow for large datasets. 
    Doesn't pad NaNs at the head of the resulting DataFrame. This behavior is 
    different from `df.rolling().apply()`, which returns a DataFrame with 
    leading NaNs by default so that its index and row count are the same as 
    those of the input DataFrame. Credit: https://stackoverflow.com/a/38879051. 

    Parameters
    ----------
    df : DataFrame
    window : int
        Number of periods to roll back.
    **kwargs
        Additional arguments for groupby.
    """

    v = df.values
    d0, d1 = v.shape
    s0, s1 = v.strides

    # memory efficient
    array3d = as_strided(v, (d0 - (window-1), window, d1), (s0, s0, s1))

    # # this is slow cuz of pd.concat(), do not use 
    # rolled_df = pd.concat({
    #     row: pd.DataFrame(values, columns=df.columns)
    #     for row, values in zip(df.iloc[window-1:,].index, array3d)
    # })

    # this is faster
    a,b,c = array3d.shape    
    rolled_df = pd.DataFrame(
        array3d.transpose(2,0,1).reshape(c,-1).T,
        index = pd.MultiIndex.from_arrays(
            [np.repeat(df.iloc[window-1:,].index, b), 
             np.tile(np.arange(b), a)]),
        columns = df.columns
    )
    
    return rolled_df.groupby(level=0, **kwargs)

# # how to use
# roll(df, window).apply(your_function)
# roll(df, window).mean()

In [3]:
def groll(df: pd.DataFrame, window: int): 
    """
    Returns a generator that yields each rolling subframe in turn.

    Parameters
    ----------
    df : DataFrame
    window : int 
        Number of periods to roll back.
    """
    for i in range(df.shape[0] - window + 1):
        yield pd.DataFrame(df.values[i:i+window, :], 
                           df.index[i:i+window], 
                           df.columns)

# # how to use
# [your_function(subdf, arg1, arg2, ...) for subdf in groll(df, window)]

In [29]:
def equal(
    a: Union[pd.DataFrame, pd.Series, np.ndarray], 
    b: Union[pd.DataFrame, pd.Series, np.ndarray]):
    """ 
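    Check whether the corresponding values of two DataFrames, Series, or numpy 
    arrays agree to within 1e-8. pandas inputs are aligned on their labels 
    first, and positions that are NaN in either input are ignored.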
    """
    return (abs(a - b) > 1e-8).sum().sum() == 0 # 0 means same values

### Prep Data

First, let's download the daily close prices of GMX and ETH on Arbitrum between 
Oct 02, 2021 and Aug 14, 2023 and calculate their daily returns.

In [5]:
dd = {'0xfc5a1a6eb076a2c7ad06ed22c90d7e710e35ad0a':'arbitrum', # GMX on arbitrum
      '0x82aF49447D8a07e3bd95BD0d56f35241523fBab1':'arbitrum', # ETH on arbitrum      
      }

obj = DefiLlama() # create a DefiLlama instance

# get historical daily close prices 
df = obj.get_daily_open_close(dd, start='2021-10-02', end='2023-08-14', kind='close')

# calc daily returns
rets = df.pct_change().dropna()
rets.head()

| date | GMX | WETH |
|---|---|---|
| 2021-10-03 | 0.1398 | 0.0089 |
| 2021-10-04 | 0.0126 | -0.006 |
| 2021-10-05 | 0.4427 | 0.0319 |
| 2021-10-06 | -0.0819 | 0.0239 |
| 2021-10-07 | -0.0184 | -0.0013 |


### Calculate 30-day Rolling Betas of GMX (Daily Returns) against ETH (Daily Returns)

Let's first check that our `calc_beta()` and statsmodels' `OLS()` give the same result. 

In [15]:
import statsmodels.api as sm

# calculate GMX's beta against ETH using all daily returns via statsmodels OLS function
X = sm.add_constant(rets['WETH'])
y = rets['GMX']
results = sm.OLS(y,X).fit()
beta_statsmod = results.params['WETH']

# do the same calculation with our custom function
beta_our = calc_beta(y, rets['WETH'])

# the results should be the same
print(beta_statsmod, beta_our)
assert abs(beta_statsmod - beta_our) < 1e-8

1.2568395310736313 1.2568395310736309
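
As a further sanity check, the slope of a simple regression with an intercept equals the sample covariance of x and y divided by the sample variance of x. The sketch below uses synthetic data (the names `x`, `y`, and the coefficients are assumptions for illustration, not the GMX/ETH data) and compares that identity with the normal-equation route that `calc_beta()` takes.

```python
import numpy as np

rng = np.random.default_rng(0)                        # synthetic data, illustration only
x = rng.normal(size=200)                              # stand-in for market returns
y = 0.5 + 1.3 * x + rng.normal(scale=0.1, size=200)   # stand-in for stock returns

# Route 1: normal equations with an intercept row, mirroring calc_beta()
X = np.vstack((np.ones_like(x), x))                   # shape (2, n)
beta_normal_eq = (np.linalg.pinv(X @ X.T) @ X @ y)[1]

# Route 2: slope = cov(x, y) / var(x), both with ddof=1
beta_cov_var = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

assert np.isclose(beta_normal_eq, beta_cov_var)
```

The two routes agree up to floating-point error, which is the same kind of small difference seen between the statsmodels and `calc_beta()` outputs above.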


Our `rolling_betas()` function uses `calc_beta()` under the hood. Let's now 
calculate 30-day rolling betas of GMX against ETH.

In [22]:
gmx_betas = rolling_betas(rets[['GMX']], rets[['WETH']], 30) # pass DataFrames, not Series
gmx_betas.dropna().head()

| date | GMX |
|---|---|
| 2021-11-01 | 0.985 |
| 2021-11-02 | 0.8308 |
| 2021-11-03 | 0.8098 |
| 2021-11-04 | 0.5958 |
| 2021-11-05 | 0.6202 |


### Calculate 30-day Rolling Betas of 10,000 Stocks' Daily Returns against Market Daily Returns

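Let's now level up and do the same calculation for 10,000 return series, which 
I randomly generate in the code block below. The synthetic data is monthly 
(`freq='M'`) and built from uniform random "prices", so the returns are not 
realistic. They only serve as a speed test, and a window of 30 rows here spans 
30 periods rather than 30 calendar days.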

In [23]:
num_sec_dfs, num_periods = 10000, 480

# month-end dates; pandas 2.2+ deprecates freq='M' in favor of freq='ME'
dates = pd.date_range('1995-12-31', periods=num_periods, freq='M', name='Date')
stocks = pd.DataFrame(
    data=np.random.rand(num_periods, num_sec_dfs), index=dates,
    columns=['s{:04d}'.format(i) for i in range(num_sec_dfs)])\
        .pct_change().dropna()
market = pd.DataFrame(
    data=np.random.rand(num_periods), index=dates, columns=['Market'])\
        .pct_change().dropna()
rets = stocks.join(market)

In [24]:
stocks.head()

| Date | s0000 | s0001 | s0002 | s0003 | s0004 | s0005 | s0006 | ... | s9993 | s9994 | s9995 | s9996 | s9997 | s9998 | s9999 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1996-01-31 | -0.5438 | 0.0315 | -0.6048 | -0.6435 | 0.0406 | -0.6119 | -0.2818 | ... | 0.1971 | 0.0897 | -0.9973 | -0.6251 | 20.9318 | -0.073 | -0.8389 |
| 1996-02-29 | -0.6325 | -0.2092 | 7.4822 | 10.1074 | -0.964 | 11.7094 | 0.3576 | ... | -0.1037 | -0.9353 | 246.3936 | 2.1032 | -0.4286 | -0.657 | 33.618 |
| 1996-03-31 | -0.9328 | -0.3605 | 0.9112 | 0.7131 | 35.8162 | -0.008 | 0.2914 | ... | -0.1727 | 33.0561 | -0.4715 | -0.6931 | -0.4697 | 3.7343 | 0.2018 |
| 1996-04-30 | 75.2387 | -0.192 | -0.7064 | -0.8032 | -0.4282 | -0.0007 | -0.4952 | ... | -0.4688 | -0.5775 | 1.0425 | -0.7937 | 3.1597 | -0.6727 | -0.9413 |
| 1996-05-31 | -0.1664 | -0.1067 | 1.8771 | 3.9923 | -0.3388 | -0.0438 | -0.4358 | ... | 1.2672 | 1.5297 | 0.1108 | 9.1529 | -0.8352 | 0.2703 | 16.9293 |


In [25]:
market.head()

| Date | Market |
|---|---|
| 1996-01-31 | -0.7311 |
| 1996-02-29 | -0.0931 |
| 1996-03-31 | 2.3873 |
| 1996-04-30 | 0.3506 |
| 1996-05-31 | -0.6008 |


In [27]:
rets.head()

| Date | s0000 | s0001 | s0002 | s0003 | s0004 | s0005 | s0006 | ... | s9994 | s9995 | s9996 | s9997 | s9998 | s9999 | Market |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1996-01-31 | -0.5438 | 0.0315 | -0.6048 | -0.6435 | 0.0406 | -0.6119 | -0.2818 | ... | 0.0897 | -0.9973 | -0.6251 | 20.9318 | -0.073 | -0.8389 | -0.7311 |
| 1996-02-29 | -0.6325 | -0.2092 | 7.4822 | 10.1074 | -0.964 | 11.7094 | 0.3576 | ... | -0.9353 | 246.3936 | 2.1032 | -0.4286 | -0.657 | 33.618 | -0.0931 |
| 1996-03-31 | -0.9328 | -0.3605 | 0.9112 | 0.7131 | 35.8162 | -0.008 | 0.2914 | ... | 33.0561 | -0.4715 | -0.6931 | -0.4697 | 3.7343 | 0.2018 | 2.3873 |
| 1996-04-30 | 75.2387 | -0.192 | -0.7064 | -0.8032 | -0.4282 | -0.0007 | -0.4952 | ... | -0.5775 | 1.0425 | -0.7937 | 3.1597 | -0.6727 | -0.9413 | 0.3506 |
| 1996-05-31 | -0.1664 | -0.1067 | 1.8771 | 3.9923 | -0.3388 | -0.0438 | -0.4358 | ... | 1.5297 | 0.1108 | 9.1529 | -0.8352 | 0.2703 | 16.9293 | -0.6008 |


Let's now calculate the 30-day rolling betas of each stock against the market. 

In [28]:
ndays = 30
betas_00 = rolling_betas(stocks, market, ndays)
betas_01 = roll(rets, ndays).apply(lambda x: calc_beta(x.iloc[:, 0], x['Market']))
betas_02 = pd.concat([pd.Series(calc_beta(subdf.iloc[:, 0], subdf['Market']), index=[subdf.index[-1]]) for subdf in groll(rets, ndays)])

In [31]:
print(equal(betas_00.iloc[:,0], betas_01))
print(equal(betas_01, betas_02))

True
True


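We see all three methods give the same results for the first stock, `s0000`, which is the only column the `roll()` and `groll()` versions compute here. Let's compare their speed.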

In [32]:
%timeit rolling_betas(stocks, market, ndays)

1.32 s ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%timeit roll(rets, ndays).apply(lambda x: calc_beta(x.iloc[:, 0], x['Market']))

In [None]:
%timeit pd.concat([pd.Series(calc_beta(subdf.iloc[:, 0], subdf['Market']), index=[subdf.index[-1]]) for subdf in groll(rets, ndays)])

`rolling_betas()` takes 1.32 seconds on average to calculate 30-period moving betas for all 10,000 stocks. Our old `roll()` and `groll()` took so long, even though they only compute betas for the first stock here, that I had to kill the runs. 
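
Another way to cross-check rolling betas, sketched here under the assumption that the slope can be written as rolling covariance over rolling variance, is to lean on pandas' built-in `rolling().cov()` and `rolling().var()`. The function name `rolling_betas_cov` and the toy data are hypothetical, and I have not timed this variant against `rolling_betas()`.

```python
import numpy as np
import pandas as pd

def rolling_betas_cov(y_df: pd.DataFrame, x: pd.Series, window: int) -> pd.DataFrame:
    """Rolling slope of each column of y_df on x, via rolling cov / rolling var."""
    cov = y_df.rolling(window).cov(x)   # each column's rolling covariance with x
    var = x.rolling(window).var()       # rolling variance of x (same default ddof)
    return cov.div(var, axis=0)         # divide row-wise, aligned on the index

# tiny synthetic check against an explicit regression on the last window
rng = np.random.default_rng(1)
idx = pd.RangeIndex(60)
mkt = pd.Series(rng.normal(size=60), index=idx, name='Market')
stk = pd.DataFrame(rng.normal(size=(60, 3)), index=idx, columns=['a', 'b', 'c'])

window = 10
fast = rolling_betas_cov(stk, mkt, window)

# np.polyfit with degree 1 returns [slope, intercept]
expected = np.polyfit(mkt.values[-window:], stk['a'].values[-window:], 1)[0]
assert np.isclose(fast['a'].iloc[-1], expected)
```

Like `rolling_betas()`, the result keeps the input's index and has NaNs in the first `window - 1` rows. The division uses `.div(var, axis=0)` because a plain `cov / var` would try to align the Series index with the DataFrame's columns.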

### Summary

- The `calc_beta()` and `rolling_betas()` functions implemented in numpy are 
  fast for finding rolling betas in simple linear regression: about 1.3 seconds 
  for 10,000 return series in the test above. 
- If you need to write your own function to calculate some complex rolling statistics, 
  the numpy function `as_strided()` is your fren.

### Good Read

- [All my notebooks on pandas `apply()`](https://coindataschool.substack.com/p/pandas-apply)
