## Pandas `rolling().apply()` 

All previous notebooks discussed `apply()` in the context of non-timeseries data. 
When working with timeseries data, we often want to calculate simple rolling 
statistics such as rolling sums, averages, medians, and standard deviations. 
That's when we use `rolling().apply()`. Let me demonstrate with DEX volume data 
from DeFiLlama.
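
Before loading the real data, here's a minimal sketch of what a rolling window does. The series `toy` and its numbers are made up for illustration; they are not DeFiLlama data.

```python
import numpy as np
import pandas as pd

# toy daily series (made-up numbers, for illustration only)
toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

# each output value is np.sum over the current row and the 2 rows before it;
# the first 2 rows have fewer than 3 observations, so they come out as NaN
toy_sums_3d = toy.rolling(3).apply(np.sum)
print(toy_sums_3d)
```

The function passed to `apply()` must return a single number per window.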

In [1]:
import pandas as pd
import numpy as np
from defillama2 import DefiLlama
from typing import Union

In [2]:
def equal(
    a: Union[pd.DataFrame, pd.Series, np.ndarray], 
    b: Union[pd.DataFrame, pd.Series, np.ndarray],
    threshold: float = 1e-8) -> bool:
    """ 
    Check whether the corresponding values of two data frames, series, or 
    numpy arrays differ by no more than `threshold` (an absolute tolerance).
    NaN pairs are ignored because `NaN > threshold` evaluates to False.
    """
    return (abs(a - b) > threshold).sum().sum() == 0 # 0 means same values

In [3]:
obj = DefiLlama() # create a DefiLlama instance
dd = obj.get_dexes_volumes()
dd.keys()

dict_keys(['volume_overall', 'volume_by_dex', 'volume_by_dex_by_chain_24h', 'daily_volume', 'daily_volume_by_dex'])

In [4]:
df = dd['daily_volume'] # focus on daily volumes
df.head()

| date | volume |
|---|---|
| 2019-10-11 00:00:00+00:00 | 943502.8 |
| 2019-10-12 00:00:00+00:00 | 637271.5 |
| 2019-10-13 00:00:00+00:00 | 593975.9 |
| 2019-10-14 00:00:00+00:00 | 1160122.0 |
| 2019-10-15 00:00:00+00:00 | 1118992.0 |


### Calculate Rolling Sums, Means, Medians, and Standard Deviations

In [5]:
# calc 7-day rolling sums, means, and medians
rolling_sums_7d = df['volume'].rolling(7).apply(np.sum)
rolling_means_7d = df['volume'].rolling(7).apply(np.mean)
rolling_meds_7d = df['volume'].rolling(7).apply(np.median)

For these simple rolling statistics, there are designated functions that perform better than the `apply()` versions above.

In [6]:
# do the same without apply()
rolling_sums_7d_v2 = df['volume'].rolling(7).sum()
rolling_means_7d_v2 = df['volume'].rolling(7).mean()
rolling_meds_7d_v2 = df['volume'].rolling(7).median()

In [7]:
# pandas rolling.sum() and rolling.mean() update each window incrementally, so their 
# floating-point results differ slightly from np.sum and np.mean; loosen the threshold
assert equal(rolling_sums_7d, rolling_sums_7d_v2, 1e-4)
assert equal(rolling_means_7d, rolling_means_7d_v2, 1e-5)
assert equal(rolling_meds_7d, rolling_meds_7d_v2) 
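
The thresholds above are absolute, so the right value depends on how large the volumes are. As an alternative to the `equal` helper, here's a sketch that compares with a relative tolerance via `np.allclose`. The synthetic series `vol` and the choice of `rtol=1e-9` are my assumptions, picked only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic "volume" series with large magnitudes (made-up data)
vol = pd.Series(rng.uniform(1e6, 1e9, size=200))

sums_apply = vol.rolling(7).apply(np.sum)
sums_builtin = vol.rolling(7).sum()

# rtol compares |a - b| against rtol * |b|, so the tolerance scales with the data;
# equal_nan=True treats the leading NaNs in both results as matching
same = np.allclose(
    sums_apply.to_numpy(),
    sums_builtin.to_numpy(),
    rtol=1e-9,
    atol=0.0,
    equal_nan=True,
)
print(same)
```

With a relative tolerance, the same setting can be reused for sums, means, and columns of very different scales.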

In [8]:
# pandas rolling.std() calculates the sample standard deviation by default (ddof=1), 
# whereas numpy np.std() calculates the population standard deviation by default (ddof=0)
# most of the time we want the sample standard deviation because we work with samples!
# 
# pandas std() uses ddof=1 by default:
#   https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html
# numpy std() uses ddof=0 by default: 
#   https://numpy.org/doc/stable/reference/generated/numpy.std.html
# 
# also, rolling.std() now stops rounding tiny numbers to zero:
#   https://stackoverflow.com/a/70629589

# let's be explicit and pass `ddof=1` to both the numpy and pandas version
rolling_stds_7d = df['volume'].rolling(7).apply(lambda xs: np.std(xs, ddof=1)) 
rolling_stds_7d_v2 = df['volume'].rolling(7).std(ddof=1) 
# need to pick a larger threshold because a 7-observation window is small, 
# so rounding differences affect the end result more
assert equal(rolling_stds_7d, rolling_stds_7d_v2, 1e-3) 

In [9]:
rolling_stds_30d = df['volume'].rolling(30).apply(lambda xs: np.std(xs, ddof=1)) 
rolling_stds_30d_v2 = df['volume'].rolling(30).std(ddof=1) 
# using 30 obs allows us to decrease the threshold by a factor of 10
assert equal(rolling_stds_30d, rolling_stds_30d_v2, 1e-4) 
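
To make the `ddof` difference concrete, here's a small worked example on four made-up numbers. With `ddof=0` the sum of squared deviations is divided by `n`; with `ddof=1` it is divided by `n - 1`.

```python
import numpy as np
import pandas as pd

xs = np.array([1.0, 2.0, 3.0, 4.0])  # made-up numbers, mean is 2.5

# squared deviations from the mean sum to 5.0
pop_std = np.std(xs)               # ddof=0: sqrt(5.0 / 4)
sample_std = np.std(xs, ddof=1)    # ddof=1: sqrt(5.0 / 3)
pandas_std = pd.Series(xs).std()   # pandas default is ddof=1

print(pop_std, sample_std, pandas_std)
```

The sample version is always the larger of the two, and the gap shrinks as the window grows.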

The above code also works on a data frame whose columns are all numerical. 
Let me show you. First, let me prepare a data frame of all three versions of 
Uniswap's daily volumes since 01 Jan 2023.

In [10]:
df = dd['daily_volume_by_dex']
dexes = ['Uniswap V1', 'Uniswap V2', 'Uniswap V3']
subdf = df.loc['2023-01-01':, dexes]
subdf.head()

| date | Uniswap V1 | Uniswap V2 | Uniswap V3 |
|---|---|---|---|
| 2023-01-01 00:00:00+00:00 | 63368.03101 | 31751490.0 | 300478200.0 |
| 2023-01-02 00:00:00+00:00 | 20800.45717 | 37900350.0 | 441824900.0 |
| 2023-01-03 00:00:00+00:00 | 8090.465679 | 31364640.0 | 322498700.0 |
| 2023-01-04 00:00:00+00:00 | 45177.833705 | 47486690.0 | 806916800.0 |
| 2023-01-05 00:00:00+00:00 | 72497.322509 | 88848050.0 | 442464500.0 |


I can then copy and paste the above `rolling().apply()` code to apply to this new frame.

In [11]:
rolling_sums_7d = subdf.rolling(7).apply(np.sum)
rolling_means_7d = subdf.rolling(7).apply(np.mean)
rolling_meds_7d = subdf.rolling(7).apply(np.median)
rolling_stds_7d = subdf.rolling(7).apply(lambda xs: np.std(xs, ddof=1))

And of course, the better versions also work!

In [12]:
rolling_sums_7d_v2 = subdf.rolling(7).sum()
rolling_means_7d_v2 = subdf.rolling(7).mean()
rolling_meds_7d_v2 = subdf.rolling(7).median()
rolling_stds_7d_v2 = subdf.rolling(7).std() # ddof=1 is the default

In [13]:
assert equal(rolling_sums_7d, rolling_sums_7d_v2, 1e-5)
assert equal(rolling_means_7d, rolling_means_7d_v2, 1e-6)
assert equal(rolling_meds_7d, rolling_meds_7d_v2)
assert equal(rolling_stds_7d, rolling_stds_7d_v2, 1e-4)
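
When a statistic has no built-in rolling method and you do need `apply()`, the `raw` parameter is worth knowing. By default (`raw=False`) each window reaches the function as a pandas Series; with `raw=True` it arrives as a NumPy array, which the pandas docs describe as faster for NumPy reduction functions. Here's a sketch on a synthetic frame (the data and the name `frame` are made up) showing that both settings give the same numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# synthetic frame with three numerical columns (made-up data)
frame = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

# raw=False (default): each window arrives as a pandas Series
stds_series = frame.rolling(7).apply(lambda xs: np.std(xs, ddof=1))
# raw=True: each window arrives as a NumPy array
stds_ndarray = frame.rolling(7).apply(lambda xs: np.std(xs, ddof=1), raw=True)

# both versions agree up to floating-point tolerance
pd.testing.assert_frame_equal(stds_series, stds_ndarray)
```

The trade-off is that with `raw=True` the function cannot call Series methods on the window.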

And there's a more versatile function called `agg()` that allows us to calculate 
different and/or multiple rolling statistics for different columns. 

In [14]:
subdf.rolling(7, min_periods=7).agg(
    {'Uniswap V1': ['median', 'mean'], 
     'Uniswap V2': ['mean', 'std'],
     'Uniswap V3': ['sum'],
})

| date | Uniswap V1 median | Uniswap V1 mean | Uniswap V2 mean | Uniswap V2 std | Uniswap V3 sum |
|---|---|---|---|---|---|
| 2023-01-01 00:00:00+00:00 | NaN | NaN | NaN | NaN | NaN |
| 2023-01-02 00:00:00+00:00 | NaN | NaN | NaN | NaN | NaN |
| 2023-01-03 00:00:00+00:00 | NaN | NaN | NaN | NaN | NaN |
| 2023-01-04 00:00:00+00:00 | NaN | NaN | NaN | NaN | NaN |
| 2023-01-05 00:00:00+00:00 | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... |
| 2023-07-23 00:00:00+00:00 | 30228.619734 | 29588.990736 | 1.331265e+08 | 2.429102e+07 | 4.727569e+09 |
| 2023-07-24 00:00:00+00:00 | 30228.619734 | 29098.222843 | 1.441589e+08 | 3.929839e+07 | 4.456881e+09 |
| 2023-07-25 00:00:00+00:00 | 30228.619734 | 29723.699323 | 1.500543e+08 | 3.175472e+07 | 4.075148e+09 |
| 2023-07-26 00:00:00+00:00 | 30228.619734 | 29391.135849 | 1.397862e+08 | 3.645313e+07 | 4.102507e+09 |
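
The result of `agg()` with a dict of lists has two-level columns: the original column name on top and the statistic underneath. Here's a small sketch on a made-up frame (`toy_df` and its values are hypothetical) showing how to pull out one of those columns with a tuple.

```python
import pandas as pd

# made-up frame, for illustration only
toy_df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

# different statistics for different columns, over a 2-row window
out = toy_df.rolling(2).agg({"a": ["mean", "sum"], "b": ["max"]})
print(out.columns.tolist())

# select a single statistic with a (column, statistic) tuple
a_mean = out[("a", "mean")]
print(a_mean)
```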


Beyond simple rolling statistics, pandas has no built-in rolling method for many 
other statistics. For example, say we want to calculate the 30-day rolling 
auto-correlations of the daily volumes for each Uniswap version. There is no 
rolling `autocorr()` method, so `subdf.rolling(30).autocorr()` fails. Instead, 
we can use `apply(lambda ser: ser.autocorr())`, which works because `apply()` 
passes each window to the function as a Series by default (`raw=False`).

In [15]:
# try it: it raises an AttributeError because the rolling object has no autocorr() method
# subdf.rolling(30).autocorr()

In [16]:
subdf.rolling(30).apply(lambda ser: ser.autocorr()).dropna()

| date | Uniswap V1 | Uniswap V2 | Uniswap V3 |
|---|---|---|---|
| 2023-01-30 00:00:00+00:00 | -0.154849 | 0.539381 | 0.483956 |
| 2023-01-31 00:00:00+00:00 | -0.154838 | 0.511798 | 0.424802 |
| 2023-02-01 00:00:00+00:00 | -0.167159 | 0.456956 | 0.350963 |
| 2023-02-02 00:00:00+00:00 | -0.036204 | 0.638197 | 0.341853 |
| 2023-02-03 00:00:00+00:00 | -0.156475 | 0.695448 | 0.315810 |
| ... | ... | ... | ... |
| 2023-07-23 00:00:00+00:00 | -0.073116 | 0.396239 | 0.118452 |
| 2023-07-24 00:00:00+00:00 | -0.082624 | 0.444259 | 0.128597 |
| 2023-07-25 00:00:00+00:00 | -0.086377 | 0.391377 | 0.139397 |
| 2023-07-26 00:00:00+00:00 | -0.065568 | 0.355146 | 0.122298 |
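
To see what each value in that output means, here's a sketch on a synthetic random walk (made-up data, not Uniswap volumes). It checks the last rolling value against a manual lag-1 autocorrelation, that is, the Pearson correlation between a window's values and the same values shifted by one step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
ser = pd.Series(rng.normal(size=60)).cumsum()  # synthetic random walk

window = 30
# the lambda calls a Series method, so the default raw=False is needed here
rolling_ac = ser.rolling(window).apply(lambda s: s.autocorr())

# manual lag-1 autocorrelation for the last window:
# correlate observations 2..30 with observations 1..29
last = ser.iloc[-window:].to_numpy()
manual = np.corrcoef(last[1:], last[:-1])[0, 1]

print(rolling_ac.iloc[-1], manual)
```

Each 30-day window therefore contributes 29 pairs of consecutive observations to the correlation.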


### Summary

- `rolling().apply(func)` is very handy when calculating rolling statistics 
  on numerical columns of a data frame (or a numerical series). 
- For simple rolling statistics such as sum, mean, median, and standard 
  deviation, drop `apply()` and call the built-in rolling methods directly.
- `rolling().agg()` allows calculation of different and/or multiple simple 
  rolling statistics for different columns. 

There are two drawbacks of `rolling().apply()`:
1. It is slow and memory-inefficient.
2. It cannot apply a function that takes inputs from multiple columns, 
   for example, calculating the rolling betas of a stock against the 
   S&P 500 or the rolling p-values of a cointegration test of two series.
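
For the rolling-beta case specifically, one possible workaround (not necessarily the one the upcoming notebooks will cover) is to skip `apply()` and build the statistic from pandas' pairwise rolling methods, since beta is cov(stock, market) / var(market). The sketch below uses synthetic returns; `market`, `stock`, and the true beta of 1.5 are all made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 120
# synthetic daily returns (made-up data): stock = 1.5 * market + noise
market = pd.Series(rng.normal(0, 0.01, size=n), name="market")
stock = (1.5 * market + rng.normal(0, 0.005, size=n)).rename("stock")

window = 30
# rolling cov() accepts a second series; cov() and var() both default to ddof=1
rolling_beta = stock.rolling(window).cov(market) / market.rolling(window).var()

# manual check on the last window
s_last = stock.iloc[-window:].to_numpy()
m_last = market.iloc[-window:].to_numpy()
manual_beta = np.cov(s_last, m_last, ddof=1)[0, 1] / np.var(m_last, ddof=1)

print(rolling_beta.iloc[-1], manual_beta)
```

This only works when the statistic decomposes into pairwise rolling methods; something like a cointegration p-value does not.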

I'll dive deep into these two drawbacks and provide solutions in the upcoming 
notebooks. Star and watch the repo to stay informed.

### Good Read

- [All my notebooks on pandas `apply()`](https://coindataschool.substack.com/p/pandas-apply).
