## Pandas `rolling().apply()` Can't Take Multiple Columns as Input

As the title indicates, `rolling().apply()` passes your function one column at a 
time, so we can't use it to calculate rolling metrics that need two or more 
columns in the same calculation. In this notebook, I provide two solutions and 
demonstrate them with several examples. Let's get started. 
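
To see the limitation concretely, here is a minimal sketch on a made-up two-column frame (`toy` and `probe` are hypothetical names, not part of this notebook's data). The function passed to `apply()` records what it receives; each call should get a one-dimensional window from a single column, never both columns together.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
seen = []  # (type name, number of dimensions, length) of each window passed in

def probe(window):
    """Record what rolling().apply() hands to the function, then sum it."""
    seen.append((type(window).__name__, window.ndim, len(window)))
    return window.sum()

# raw=False is the default, so each window arrives as a Series
out = toy.rolling(2).apply(probe)
print(seen)
print(out)
```

Because the function only ever sees one column's window, there is no way to reference a second column from inside it.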

In [1]:
from typing import Union
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided 
from defillama2 import DefiLlama

pd.set_option('display.max_columns', 15)
pd.set_option('display.max_rows', 50)
pd.options.display.float_format = '{:,.4f}'.format

In [2]:
def equal(
    a: Union[pd.DataFrame, pd.Series, np.ndarray], 
    b: Union[pd.DataFrame, pd.Series, np.ndarray]):
    """ 
    Check if the corresponding values of two data frames or series or numpy arrays are the same.
    """
    return (abs(a - b) > 1e-8).sum().sum() == 0 # 0 means same values
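
One caveat about `equal()` that matters later in this notebook: pandas aligns the two inputs on their index before subtracting, and any comparison involving NaN evaluates to False. So rows that exist in only one input, or that are NaN, are silently treated as matching. A small sketch with made-up series (`full`, `short`, and `wrong` are hypothetical names):

```python
import numpy as np
import pandas as pd

def equal(a, b):
    """Same check as in the notebook: no absolute difference above 1e-8."""
    return (abs(a - b) > 1e-8).sum().sum() == 0

full = pd.Series([np.nan, np.nan, 1.5, 2.5])   # leading NaNs, like rolling() output
short = pd.Series([1.5, 2.5], index=[2, 3])    # same values, no NaN padding
wrong = pd.Series([1.5, 9.9], index=[2, 3])    # one value differs

print(equal(full, short))   # NaN rows are ignored by the comparison
print(equal(full, wrong))   # a real difference is still caught
print(equal(pd.Series([np.nan]), pd.Series([5.0])))  # caveat: NaN vs. a number also passes
```

This is convenient for comparing padded and unpadded rolling results, but it also means an all-NaN result would pass the check.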

Solution 1: using `as_strided` from numpy. It pays if you know numpy well!

In [3]:
def roll(df: pd.DataFrame, window: int, **kwargs):
    """
    Create all rolling subframes and group them by time index and return a 
    groupby object to be chained with apply(). Slow for large datasets. 
    Doesn't pad NaNs at the head of the resulting DataFrame. This behavior is 
    different from `df.rolling().apply()`, which returns a DataFrame with 
    leading NaNs by default so that its index and row count are the same as 
    those of the input DataFrame. Credit: https://stackoverflow.com/a/38879051. 

    Parameters
    ----------
    df : DataFrame
    window : int
        Number of periods to roll back.
    **kwargs
        Additional arguments for groupby.
    """

    v = df.values
    d0, d1 = v.shape
    s0, s1 = v.strides

    # memory efficient
    array3d = as_strided(v, (d0 - (window-1), window, d1), (s0, s0, s1))

    # # this is slow because of pd.concat(), do not use 
    # rolled_df = pd.concat({
    #     row: pd.DataFrame(values, columns=df.columns)
    #     for row, values in zip(df.iloc[window-1:,].index, array3d)
    # })

    # this is faster
    a,b,c = array3d.shape    
    rolled_df = pd.DataFrame(
        array3d.transpose(2,0,1).reshape(c,-1).T,
        index = pd.MultiIndex.from_arrays(
            [np.repeat(df.iloc[window-1:,].index, b), 
             np.tile(np.arange(b), a)]),
        columns = df.columns
    )
    
    return rolled_df.groupby(level=0, **kwargs)

# # how to use
# roll(df, window).apply(your_function)
# roll(df, window).mean()
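
If the `as_strided` call inside `roll()` looks opaque, the sketch below reproduces it on a tiny made-up array so the resulting 3-D shape is easy to inspect. The `writeable=False` flag and the comparison with `sliding_window_view` (available in NumPy 1.20 and later) are my additions for illustration, not part of `roll()` itself.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

# Made-up 4x2 array standing in for df.values
v = np.arange(8, dtype=float).reshape(4, 2)
window = 3

d0, d1 = v.shape      # rows, columns
s0, s1 = v.strides    # bytes to step one row, one column

# Same shape/strides arguments as in roll(): one (window x columns) block per
# starting row. Stepping to the next block moves one row (s0) in memory, so no
# data is copied. writeable=False guards against accidental writes to the view.
windows = as_strided(v, (d0 - (window - 1), window, d1), (s0, s0, s1),
                     writeable=False)

print(windows.shape)
print(windows[0])     # same rows as v[0:3]
print(windows[1])     # same rows as v[1:4]

# A safer built-in alternative; its window axis comes last, so transpose it
alt = sliding_window_view(v, window, axis=0).transpose(0, 2, 1)
print(np.array_equal(windows, alt))
```

`as_strided` does no bounds checking, so wrong shape or stride arguments can read arbitrary memory. `sliding_window_view` is the more forgiving choice when it is available.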

Solution 2: using a generator to yield each rolling sub dataframe. We can then 
use list comprehension to wrap up whatever calculation we want to do using 
whichever columns from each sub dataframe.

In [4]:
def groll(df: pd.DataFrame, window: int): 
    """
    Returns a generator that yields each rolling subframe when called.

    Parameters
    ----------
    df : DataFrame
    window : int 
        Number of periods to roll back.
    """
    for i in range(df.shape[0] - window + 1):
        yield pd.DataFrame(df.values[i:i+window, :], 
                           df.index[i:i+window], 
                           df.columns)

# # how to use
# [your_function(subdf, arg1, arg2, ...) for subdf in groll(df, window)]
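
As a quick illustration of the list-comprehension pattern, here is a sketch that computes a two-column rolling statistic on made-up data (`toy`, `corr_groll`, and `corr_builtin` are hypothetical names). I picked correlation only because pandas happens to have a built-in pairwise version, `rolling().corr()`, to check against; for an arbitrary two-column function no such built-in exists.

```python
import numpy as np
import pandas as pd

def groll(df, window):
    """Same generator as in the notebook, repeated so this block runs on its own."""
    for i in range(df.shape[0] - window + 1):
        yield pd.DataFrame(df.values[i:i+window, :],
                           df.index[i:i+window],
                           df.columns)

# Made-up data for illustration only
toy = pd.DataFrame({"x": [1.0, 2.0, 4.0, 3.0, 5.0],
                    "y": [2.0, 1.0, 3.0, 5.0, 4.0]})
window = 3

# One value per window, labeled with the last index of that window
corr_groll = pd.Series(
    [sub["x"].corr(sub["y"]) for sub in groll(toy, window)],
    index=toy.index[window-1:],
)

# Built-in pairwise rolling correlation, used here only as a reference
corr_builtin = toy["x"].rolling(window).corr(toy["y"]).dropna()

print(corr_groll)
print(np.allclose(corr_groll.values, corr_builtin.values))
```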

### Prep Data

In [5]:
dd = {'0xfc5a1a6eb076a2c7ad06ed22c90d7e710e35ad0a':'arbitrum', # GMX on arbitrum
      '0x912CE59144191C1204E64559FE8253a0e49E6548':'arbitrum', # ARB on arbitrum
      '0x82aF49447D8a07e3bd95BD0d56f35241523fBab1':'arbitrum', # ETH on arbitrum      
      }

obj = DefiLlama() # create a DefiLlama instance

# get historical daily close prices 
df = obj.get_daily_open_close(dd, start='2023-03-23', end='2023-07-27', kind='close')

# calc daily returns
rets = df.pct_change().dropna()
rets.head()

               ARB     GMX    WETH
date                              
2023-03-24 -0.0459 -0.0659 -0.0338
2023-03-25 -0.0380 -0.0109 -0.0042
2023-03-26  0.0407 -0.0069  0.0129
2023-03-27 -0.0859  0.0392 -0.0330
2023-03-28  0.0403  0.0815  0.0337


### Correctness Check

Let's check that our custom `roll()` and `groll()` functions give correct results
by comparing with `rolling().apply()` or vectorization on 1 input column.

In [6]:
# calc 7d rolling means
ndays = 7
means_true = rets.rolling(ndays).mean()
means_roll = roll(rets, ndays).mean()
means_groll = pd.concat([pd.DataFrame([subdf.mean()], index=[subdf.index[-1]]) for subdf in groll(rets, ndays)])
print(means_true.dropna().head(), '\n\n')
print(means_roll.head(), '\n\n')
print(means_groll.head(), '\n\n')

               ARB    GMX    WETH
date                             
2023-03-30  0.0051 0.0008 -0.0021
2023-03-31  0.0137 0.0116  0.0051
2023-04-01  0.0099 0.0118  0.0058
2023-04-02 -0.0069 0.0107  0.0017
2023-04-03 -0.0005 0.0016  0.0076 


               ARB    GMX    WETH
date                             
2023-03-30  0.0051 0.0008 -0.0021
2023-03-31  0.0137 0.0116  0.0051
2023-04-01  0.0099 0.0118  0.0058
2023-04-02 -0.0069 0.0107  0.0017
2023-04-03 -0.0005 0.0016  0.0076 


               ARB    GMX    WETH
2023-03-30  0.0051 0.0008 -0.0021
2023-03-31  0.0137 0.0116  0.0051
2023-04-01  0.0099 0.0118  0.0058
2023-04-02 -0.0069 0.0107  0.0017
2023-04-03 -0.0005 0.0016  0.0076 




In [7]:
assert equal(means_true, means_roll)
assert equal(means_true, means_groll)
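
If you need output shaped like `rolling().apply()`, with leading NaNs and the full index, one option is to `reindex()` the shorter result onto the original index. A minimal sketch on made-up data (the names `s`, `short`, and `padded` are hypothetical, and the list comprehension stands in for a `roll()` or `groll()` result):

```python
import numpy as np
import pandas as pd

# Made-up daily series for illustration only
idx = pd.date_range("2023-01-01", periods=5, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
window = 3

# Unpadded result: one value per complete window, labeled by the window's last date
short = pd.Series(
    [s.iloc[i:i+window].mean() for i in range(len(s) - window + 1)],
    index=idx[window-1:],
)

# Pad to the full index; dates without a complete window become NaN
padded = short.reindex(idx)
expected = s.rolling(window).mean()

print(padded)
print(padded.index.equals(expected.index))
```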

In [8]:
# calc 10d auto correlations
ndays = 10
autocorr_true = rets.rolling(ndays).apply(lambda xs: xs.autocorr())
autocorr_roll = roll(rets, ndays).apply(lambda df: df.apply(lambda xs: xs.autocorr()))
autocorr_groll = pd.DataFrame.from_dict({subdf.index[-1]: subdf.apply(lambda xs: xs.autocorr()) for subdf in groll(rets, ndays)}).transpose()
print(autocorr_true.dropna().head(), '\n\n')
print(autocorr_roll.head(), '\n\n')
print(autocorr_groll.head(), '\n\n')


               ARB     GMX    WETH
date                              
2023-04-02  0.0068  0.0442 -0.3296
2023-04-03  0.0423 -0.0081 -0.4963
2023-04-04 -0.0004 -0.1027 -0.3730
2023-04-05  0.1173  0.0664 -0.2141
2023-04-06  0.2224 -0.2227  0.0646 


               ARB     GMX    WETH
date                              
2023-04-02  0.0068  0.0442 -0.3296
2023-04-03  0.0423 -0.0081 -0.4963
2023-04-04 -0.0004 -0.1027 -0.3730
2023-04-05  0.1173  0.0664 -0.2141
2023-04-06  0.2224 -0.2227  0.0646 


               ARB     GMX    WETH
2023-04-02  0.0068  0.0442 -0.3296
2023-04-03  0.0423 -0.0081 -0.4963
2023-04-04 -0.0004 -0.1027 -0.3730
2023-04-05  0.1173  0.0664 -0.2141
2023-04-06  0.2224 -0.2227  0.0646 




In [9]:
assert equal(autocorr_true, autocorr_roll)
assert equal(autocorr_true, autocorr_groll)

### Find the p-value of the Cointegration Test

If you don't know what a cointegration test is or how it's used in trading, work through this [notebook](https://www.quantrocket.com/code/?repo=quant-finance-lectures&path=%2Fcodeload%2Fquant-finance-lectures%2Fquant_finance_lectures%2FLecture42-Introduction-to-Pairs-Trading.ipynb.html) first. The `statsmodels` package has a function called `coint()` that 
takes two input series and runs the cointegration test. I created a wrapper below 
that simply extracts the p-value from the output. 

In [10]:
from statsmodels.tsa.stattools import coint

def coint_pval(
    s1: Union[np.ndarray, pd.Series], 
    s2: Union[np.ndarray, pd.Series]): 
    """ Test if two series are cointegrated and return the p-value. """
    return coint(s1, s2)[1]

Say we want to find the 7-day rolling p-values of the cointegration test between GMX and ARB.
We cannot just `df.rolling().apply(lambda ser1, ser2: coint_pval(ser1, ser2))` 
because `apply()` passes the function only one column at a time! So we'll need to 
use our custom `roll()` or `groll()` functions defined above.

In [11]:
pvals_roll = roll(rets, 7).apply(lambda df: coint_pval(df['GMX'], df['ARB']))
pvals_groll = pd.concat([pd.Series(coint_pval(df['GMX'], df['ARB']), index=[df.index[-1]]) for df in groll(rets, 7)])
print(pvals_roll.dropna().head(), '\n\n')
print(pvals_groll.head(), '\n\n')

date
2023-03-30   0.3185
2023-03-31   0.1819
2023-04-01   0.1916
2023-04-02   1.0000
2023-04-03   0.4440
dtype: float64 


2023-03-30   0.3185
2023-03-31   0.1819
2023-04-01   0.1916
2023-04-02   1.0000
2023-04-03   0.4440
dtype: float64 




In [12]:
assert equal(pvals_roll, pvals_groll)

### Summary

- `rolling().apply()` can NOT take multiple columns as input.
- Use `roll()` or `groll()` defined in this notebook to calculate 
  rolling metrics that require two or more input columns in the calculation.

Sadly, neither `roll()` nor `groll()` is fast, which we'll see in the next notebook.
Make sure you star and watch this [repo](https://github.com/coindataschool/pytips/tree/main/pandas/apply) 
to stay informed.

### Good Read

- [All my notebooks on pandas `apply()`](https://coindataschool.substack.com/p/pandas-apply)
- [The original `roll()` function](https://stackoverflow.com/a/38879051)