## Our `roll()` or `groll()` Can Be Slow

Previously, I showed you that `rolling().apply()` can't calculate 
rolling metrics that take two or more columns as input. You can read it [here](https://github.com/coindataschool/pytips/blob/main/pandas/apply/07-pandas-rolling-apply-cannot-take-multiple-cols-as-input.ipynb). As a 
solution, I gave you two custom functions, `roll()` and `groll()`. 
They get the job done, but they can be slow. In this notebook, I'll show you 
by example how slow they can be. First, let's define them again.

In [1]:
from typing import Union
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided 
from defillama2 import DefiLlama

pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 50)
pd.options.display.float_format = '{:,.4f}'.format

In [2]:
def roll(df: pd.DataFrame, window: int, **kwargs):
    """
    Create all rolling subframes, group them by time index, and return a 
    groupby object to be chained with apply(). Slow for large datasets. 
    Doesn't pad NaNs at the head of the resulting DataFrame. This behavior 
    differs from `df.rolling().apply()`, which returns a DataFrame with 
    leading NaNs by default so that its index and row count are the same as 
    those of the input DataFrame. Credit: https://stackoverflow.com/a/38879051. 

    Parameters
    ----------
    df : DataFrame
    window : int
        Number of periods to roll back.
    **kwargs
        Additional arguments for groupby.
    """

    v = df.values
    d0, d1 = v.shape
    s0, s1 = v.strides

    # memory efficient
    array3d = as_strided(v, (d0 - (window-1), window, d1), (s0, s0, s1))

    # # this is slow cuz of pd.concat(), do not use 
    # rolled_df = pd.concat({
    #     row: pd.DataFrame(values, columns=df.columns)
    #     for row, values in zip(df.iloc[window-1:,].index, array3d)
    # })

    # this is faster
    a,b,c = array3d.shape    
    rolled_df = pd.DataFrame(
        array3d.transpose(2,0,1).reshape(c,-1).T,
        index = pd.MultiIndex.from_arrays(
            [np.repeat(df.iloc[window-1:,].index, b), 
             np.tile(np.arange(b), a)]),
        columns = df.columns
    )
    
    return rolled_df.groupby(level=0, **kwargs)

# # how to use
# roll(df, window).apply(your_function)
# roll(df, window).mean()
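
As a side note, NumPy 1.20 and later ship `sliding_window_view`, which builds the same kind of windowed view without hand-computing strides and returns a read-only view by default. The sketch below is an alternative for comparison, not what `roll()` uses; the toy array and the names `manual` and `windows` are made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

# Toy 5x2 array standing in for df.values (illustrative data only).
v = np.arange(10, dtype=float).reshape(5, 2)
window = 3

# Manual strides, as in roll(): shape (n_windows, window, n_cols).
d0, d1 = v.shape
s0, s1 = v.strides
manual = as_strided(v, (d0 - (window - 1), window, d1), (s0, s0, s1))

# sliding_window_view appends the window axis last, giving
# (n_windows, n_cols, window), so swap the last two axes to match.
windows = sliding_window_view(v, window, axis=0).transpose(0, 2, 1)

print(windows.shape)                    # (3, 3, 2)
print(np.array_equal(manual, windows))  # True
```

Both arrays are views on `v`, so neither copies the data until it is reshaped.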

In [3]:
def groll(df: pd.DataFrame, window: int): 
    """
    Return a generator that yields each rolling subframe in turn.

    Parameters
    ----------
    df : DataFrame
    window : int 
        Number of periods to roll back.
    """
    for i in range(df.shape[0] - window + 1):
        yield pd.DataFrame(df.values[i:i+window, :], 
                           df.index[i:i+window], 
                           df.columns)

# # how to use
# [your_function(subdf, arg1, arg2, ...) for subdf in groll(df, window)]
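
Neither function pads the head of the result with NaNs, so the output is `window - 1` rows shorter than the input. If you want the result aligned with the original index, the way `rolling().apply()` returns it, one option is to label each value with the last date of its window and then reindex. A minimal sketch on made-up data (the `toy` frame and the statistic are placeholders):

```python
import numpy as np
import pandas as pd

def groll(df, window):
    """Same generator as above, repeated so this block runs on its own."""
    for i in range(df.shape[0] - window + 1):
        yield pd.DataFrame(df.values[i:i + window, :],
                           df.index[i:i + window],
                           df.columns)

# Toy data (made up for illustration).
idx = pd.date_range("2023-01-01", periods=6, freq="D")
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                    "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]}, index=idx)
window = 3

# Any function of two columns works; here, the mean of a - b per window.
vals = [(sub["a"] - sub["b"]).mean() for sub in groll(toy, window)]

# Label each result with the last date of its window, then reindex to the
# full index so the first window-1 rows become NaN.
out = pd.Series(vals, index=toy.index[window - 1:]).reindex(toy.index)
print(out)
```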

### Prep Data

Let's download the daily close prices of GMX and ETH on Arbitrum between 
Oct 02, 2021 and Aug 07, 2023. Let's also calculate their daily returns.

In [4]:
dd = {'0xfc5a1a6eb076a2c7ad06ed22c90d7e710e35ad0a':'arbitrum', # GMX on arbitrum
      '0x82aF49447D8a07e3bd95BD0d56f35241523fBab1':'arbitrum', # ETH on arbitrum      
      }

obj = DefiLlama() # create a DefiLlama instance

# get historical daily close prices 
df = obj.get_daily_open_close(dd, start='2021-10-02', end='2023-08-07', kind='close')

# calc daily returns
rets = df.pct_change().dropna()
rets.head()

| date | GMX | WETH |
|---|---:|---:|
| 2021-10-03 | 0.1398 | 0.0089 |
| 2021-10-04 | 0.0126 | -0.006 |
| 2021-10-05 | 0.4427 | 0.0319 |
| 2021-10-06 | -0.0819 | 0.0239 |
| 2021-10-07 | -0.0184 | -0.0013 |


### Calculate Rolling p-values of the Cointegration Test

Let's run the cointegration test between the daily returns of GMX and ETH on a 7-day 
rolling basis and extract the p-values. If you are unfamiliar with the cointegration test and how it's used, these [two](https://www.youtube.com/watch?v=g-qvFjvyqcs) [videos](https://www.youtube.com/watch?v=q5wbOSjbVW4) 
give a good introduction.  

The `statsmodels` package has a function called `coint()` that takes two 
series as input and runs the cointegration test. I wrote the wrapper function below, 
which simply extracts the p-value from its output. 

In [5]:
from statsmodels.tsa.stattools import coint

def coint_pval(
    s1: Union[np.ndarray, pd.Series], 
    s2: Union[np.ndarray, pd.Series]) -> float: 
    """ Test if two series are cointegrated and return the p-value. """
    # coint() returns (test statistic, p-value, critical values)
    return coint(s1, s2)[1]

Now let's apply this `coint_pval()` function to our returns data frame, taking 
the 'GMX' and 'WETH' columns as input on a 7-day rolling basis. 

In [6]:
%%timeit 
pvals_roll = roll(rets, 7).apply(lambda df: coint_pval(df['GMX'], df['WETH']))

1.8 s ± 46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%%timeit 
pvals_groll = pd.concat([
    pd.Series(coint_pval(df['GMX'], df['WETH']), index=[df.index[-1]])
    for df in groll(rets, 7)
])

2.14 s ± 257 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Both approaches are pretty slow, at roughly 2 seconds each. Keep in mind that the 
speed test used only 22 months of daily data, so they will be hard to scale to 
much larger datasets. One bottleneck is the `coint()` function from `statsmodels`, 
which is slow. So to improve the speed in this particular case, we'd want to 
write a faster version of the cointegration test.
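
One way to check where the time goes is to run the same rolling loop twice, once with a nearly free function and once with the expensive one, and compare. The sketch below uses random data and two placeholder functions, `cheap_stat` and `slow_stat` (both hypothetical, standing in for `rets` and `coint_pval`); the absolute timings will differ on your machine.

```python
import time
import numpy as np
import pandas as pd

def groll(df, window):
    """Same generator as above, repeated so this block runs on its own."""
    for i in range(df.shape[0] - window + 1):
        yield pd.DataFrame(df.values[i:i + window, :],
                           df.index[i:i + window],
                           df.columns)

# Random data standing in for the returns frame (illustrative only).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(300, 2)), columns=["x", "y"])

def cheap_stat(sub):
    """Nearly free: this mostly measures the cost of building subframes."""
    return len(sub)

def slow_stat(sub):
    """Placeholder for an expensive test: repeats a small regression fit."""
    for _ in range(20):
        slope = np.polyfit(sub["x"].to_numpy(), sub["y"].to_numpy(), 1)[0]
    return slope

def timed(func, df, window):
    """Apply func to every rolling subframe and return (results, seconds)."""
    start = time.perf_counter()
    out = [func(sub) for sub in groll(df, window)]
    return out, time.perf_counter() - start

cheap_out, t_cheap = timed(cheap_stat, toy, 7)
slow_out, t_slow = timed(slow_stat, toy, 7)
print(f"window overhead: {t_cheap:.4f}s, with slow function: {t_slow:.4f}s")
```

If the second number dwarfs the first, the applied function is the bottleneck and speeding up the windowing will not help much.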

### Summary

- `rolling().apply()` can NOT take multiple columns as input.
- `roll()` or `groll()` defined above can. But they can be very SLOW!

So how do we improve their speed? Unfortunately, we can't do much in general, 
because the slowness often comes from the function we are applying. We'll need 
to deal with it case by case. For example, in the next notebook, I'm going to 
show you a very fast way to calculate rolling betas of linear regressions. Make 
sure you star this [repo](https://github.com/coindataschool/pytips/tree/main/pandas/apply) 
to stay informed.
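
To give a feel for what "case by case" can mean: for a simple regression of `y` on `x`, the slope equals cov(x, y) / var(x), and pandas can compute both on a rolling basis without a Python-level loop. This is only a sketch under that assumption, on made-up data with made-up column names, and it is not necessarily the method used in the next notebook.

```python
import numpy as np
import pandas as pd

# Synthetic series with a true slope of 1.5 (illustrative data only).
rng = np.random.default_rng(42)
x = pd.Series(rng.normal(size=200), name="x")
y = (1.5 * x + rng.normal(scale=0.5, size=200)).rename("y")
window = 30

# Rolling OLS slope of y on x: cov(x, y) / var(x), both with ddof=1.
beta = x.rolling(window).cov(y) / x.rolling(window).var()

# Spot-check the last window against an explicit least-squares fit.
check = np.polyfit(x.iloc[-window:], y.iloc[-window:], 1)[0]
print(np.isclose(beta.iloc[-1], check))  # True
```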

### Good Read

- [All my notebooks on pandas `apply()`](https://coindataschool.substack.com/p/pandas-apply).
- [The original `roll()` function](https://stackoverflow.com/a/38879051).

### Referral

- Digital Ocean is a cloud computing platform where you can rent remote servers for cheap. 
  I have my remote data science server there. You can do the same and [get $200 credit](https://m.do.co/c/0a435cb96813). 