## Pandas `apply(func, axis=0)` Speed

> Pandas `apply()` is often slower than vectorization and list comp and 
> consumes a lot more memory.

*Note: The performance of pandas `apply()` has improved as of July 2023. In particular, `apply()` column-wise (`axis=0`) is pretty fast, and in some instances, can be on a par with vectorization or list comp. But the above quote holds true in general.*

This notebook compares the speed of pandas `apply()` used column-wise 
(`axis=0`, the default) on a DataFrame against vectorization and list comp. 
If you are new to pandas `apply()`, you may want to learn its [most common use pattern](https://github.com/coindataschool/pytips/blob/main/pandas/apply/01-pandas-apply-common-use-pattern.ipynb) first.

In [1]:
import pandas as pd
import numpy as np
from defillama2 import DefiLlama

### Data Prep

In [2]:
obj = DefiLlama() # create a DefiLlama instance
df = obj.get_protocols_fundamentals() # get fundamentals for all protocols
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3028 entries, 0 to 3027
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         3028 non-null   object 
 1   symbol       3028 non-null   object 
 2   chain        3028 non-null   object 
 3   category     3028 non-null   object 
 4   chains       3028 non-null   object 
 5   tvl          3028 non-null   float64
 6   change_1d    2497 non-null   float64
 7   change_7d    2479 non-null   float64
 8   mcap         1421 non-null   float64
 9   forked_from  2185 non-null   object 
dtypes: float64(4), object(6)
memory usage: 236.7+ KB


In [3]:
cond = df['category'] == 'Liquid Staking' # focus on LSD protocols
cols = ['name', 'tvl', 'change_1d', 'change_7d', 'mcap'] # focus on these cols
subdf = df.loc[cond, cols].reset_index(drop=True) # drop the original int index
subdf.set_index('name', inplace=True) # use name as index
subdf.head()

| name | tvl | change_1d | change_7d | mcap |
|---|---|---|---|---|
| Lido | 14975090000.0 | -0.943175 | 3.378131 | 1807643000.0 |
| Coinbase Wrapped Staked ETH | 2301431000.0 | -0.938579 | 3.595419 | |
| Rocket Pool | 1910880000.0 | -1.204842 | 2.385279 | 679194600.0 |
| Frax Ether | 455790800.0 | -0.976026 | 2.766195 | |
| StakeWise | 184284600.0 | -0.863176 | 2.17606 | 21577750.0 |


### Speed Comparison

In [4]:
%timeit subdf.apply(lambda col: col.max() - col.min(), axis=0) # apply with raw=False
# raw=True passes each col to the function as an ndarray (no index), which is faster than raw=False (default)
%timeit subdf.apply(lambda col: col.max() - col.min(), axis=0, raw=True) # apply with raw=True
%timeit subdf.max() - subdf.min() # vectorization

957 µs ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
169 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
816 µs ± 59.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


($\mu s$ is microseconds and $ms$ is milliseconds.)

In this toy example, we see that 
- `apply(raw=True)` is the fastest, 
- vectorization comes second, and 
- `apply(raw=False)` is the slowest.

Unfortunately, we cannot use `raw=True` all the time. Setting `raw=True` passes 
each column to the function as an ndarray. If the function calls a method that 
exists on a pandas Series but not on an ndarray, it will throw an error. 
(See code chunk below for an example.) 

On the other hand, having `raw=False` (default) passes each column as a pandas 
Series to the function, so it will still work if the function is a Series method.
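
To make the difference concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` frame, its values, and the helper `make_recorder` are hypothetical and only stand in for `subdf`. It records what type the function receives under each setting, and shows one possible ndarray-friendly workaround for `idxmax()` using `argmax()`.

```python
import pandas as pd

# Hypothetical stand-in for subdf; names and values are made up.
toy = pd.DataFrame(
    {"tvl": [3.0, 1.0, 2.0], "mcap": [10.0, 30.0, 20.0]},
    index=["A", "B", "C"],
)

def make_recorder(seen):
    """Return a range function that also records the type of its input."""
    def spread(col):
        seen.add(type(col).__name__)
        return col.max() - col.min()
    return spread

seen_default, seen_raw = set(), set()
toy.apply(make_recorder(seen_default), axis=0)        # raw=False (default)
toy.apply(make_recorder(seen_raw), axis=0, raw=True)  # raw=True
print(seen_default, seen_raw)

# idxmax() exists on Series but not on ndarray, so raw=True fails here.
try:
    toy.apply(lambda col: col.idxmax(), axis=0, raw=True)
    raw_error = None
except AttributeError as err:
    raw_error = err
print("raw=True with idxmax:", raw_error)

# Workaround sketch: argmax() returns a position, which we map back to a label.
# Unlike idxmax(), argmax() does not skip NaN; toy has no missing values.
pos = toy.apply(lambda col: col.argmax(), axis=0, raw=True)
labels = pos.map(lambda i: toy.index[i])
print(labels)
```

Whether the workaround is worth it depends on the data: with missing values, as in the `mcap` column, `argmax()` and `idxmax()` can disagree, so the plain `subdf.idxmax()` is the safer choice.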

In [5]:
# extract protocols with the largest TVL, 1 and 7 day change in TVL, or MCap
%timeit subdf.apply(lambda col: col.idxmax(), axis=0) # apply with raw=False
# %timeit subdf.apply(lambda col: col.idxmax(), axis=0, raw=True) # throws error because ndarray does not have .idxmax() method
%timeit subdf.idxmax() # vectorization

704 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
365 µs ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In this example, setting `raw=True` throws an error because we call `idxmax()` 
on each column, and `idxmax()` is a pandas Series method that ndarray lacks. 
Notice that vectorization is again faster than `apply(raw=False)`. 

Let's now throw list comp into the mix and compare all three methods. In the next 
example, we'll apply a transformation function over 2 columns of a DataFrame. 
(Transformation functions don't change the shape of the input data.) 

In [6]:
# create categorical versions of TVL and MCap: if value >= 500M, then "500M+", else "500M-"
def bin_var(col): 
    return np.where(col >= 500*1e6, "500M+", "500M-")

%timeit subdf[['tvl', 'mcap']].apply(bin_var, raw=True) # apply with raw=True
%timeit [bin_var(subdf[[cname]].values) for cname in ['tvl', 'mcap']] # list comp
%timeit bin_var(subdf[['tvl', 'mcap']].values) # vectorization

727 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1.04 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
524 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


We see vectorization beats `apply(raw=True)` and `apply(raw=True)` beats list comp. 
But the output from vectorization and list comp are not DataFrames, and we'd need 
to write extra code to convert them to DataFrame and assign index and column names. 
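
As a rough idea of what that extra code looks like, here is a minimal sketch on a made-up frame. The `toy` data is hypothetical, and the list comp uses 1-D arrays (`toy[c].to_numpy()`) rather than the `(n, 1)` arrays from the cell above, purely to keep the rebuild simple.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for subdf; names and values are made up.
toy = pd.DataFrame(
    {"tvl": [6e8, 1e8, 9e8], "mcap": [7e8, np.nan, 2e8]},
    index=["A", "B", "C"],
)

def bin_var(col):
    # NaN >= 5e8 evaluates to False, so missing values land in "500M-".
    return np.where(col >= 500 * 1e6, "500M+", "500M-")

cols = ["tvl", "mcap"]

# Vectorization returns a bare 2-D ndarray; re-attach index and column names.
arr = bin_var(toy[cols].to_numpy())
binned_vec = pd.DataFrame(arr, index=toy.index, columns=cols)

# List comp returns one 1-D array per column; zip them back with their names.
parts = [bin_var(toy[c].to_numpy()) for c in cols]
binned_lc = pd.DataFrame(dict(zip(cols, parts)), index=toy.index)

# apply(raw=True) gives a labeled DataFrame directly.
binned_apply = toy[cols].apply(bin_var, raw=True)

print(binned_vec)
```

All three frames hold the same labels here. Note that a missing `mcap` is silently binned as "500M-", which may or may not be what you want.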

Also, look at the code chunk above again. Why did I use 
`bin_var(subdf[[cname]].values)` instead of `bin_var(subdf[[cname]])`, and 
`bin_var(subdf[['tvl', 'mcap']].values)` instead of `bin_var(subdf[['tvl', 'mcap']])`?
Because `.values` extracts the underlying numpy array from a Series or DataFrame, 
and calling `bin_var()` on numpy arrays improves performance. The next cell shows 
the same effect with a boolean mask.

In [7]:
%timeit subdf[subdf['tvl'] > 1e9]          # slower
%timeit subdf[subdf['tvl'].values > 1e9]   # faster

258 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
120 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### Summary

- `apply(func, axis=0)` is pretty fast and it conveniently returns a pandas DataFrame/Series.
- Vectorization is usually the fastest. So vectorize if possible. (Did you 
  realize that all my functions above that went into `apply()` were vectorized?)
  Also, numpy vectorization beats pandas vectorization. (Remember the examples 
  I gave above where `df['col'].values` makes things faster than `df['col']`?)
- List comp can be slower or faster than `apply()` or vectorization. So 
  definitely keep it in your toolbox. Can you find an example where 
  list comp beats vectorization?
- Numpy vectorization and list comp strip away the index and column names and output numpy 
  arrays. If you need a DataFrame/Series, you will need to write extra code to 
  convert the result and re-assign the index and header. 
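
For the list comp question above, one candidate worth timing yourself is string processing. The sketch below compares the pandas `.str` accessor with a plain list comp on a made-up Series of protocol names. The data and sizes are assumptions, and the code only measures; which one wins depends on your data size and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical data: a few protocol names repeated to get 1,000 rows.
names = pd.Series(["Lido", "Rocket Pool", "Frax Ether", "StakeWise"] * 250)

upper_vec = names.str.upper()                                       # pandas .str accessor
upper_lc = pd.Series([s.upper() for s in names], index=names.index)  # list comp

# Both approaches produce the same values.
same = upper_vec.tolist() == upper_lc.tolist()
print("same result:", same)

t_vec = timeit.timeit(lambda: names.str.upper(), number=200)
t_lc = timeit.timeit(lambda: [s.upper() for s in names], number=200)
print(f".str.upper(): {t_vec / 200 * 1e6:.1f} µs per loop")
print(f"list comp:    {t_lc / 200 * 1e6:.1f} µs per loop")
```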

As a rule of thumb, you want to do things in this order:
1. Vectorization (numpy vectorization first, and then pandas vectorization).
2. List comp.
3. Pandas `apply()`.

And you want to avoid iteration over `df.to_dict()`, `df.to_records()`, `df.iloc[]`, or `df.iterrows()`.
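
As an illustration of the rule of thumb, here is a minimal sketch that computes the same quantity three ways on a made-up frame. The `toy` data and the MCap/TVL ratio are hypothetical examples, not something computed earlier in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for subdf; names and values are made up.
toy = pd.DataFrame(
    {"tvl": [4.0, 2.0, 8.0], "mcap": [2.0, 1.0, 2.0]},
    index=["A", "B", "C"],
)

# 1. Numpy vectorization: fastest in the timings above, but returns a bare ndarray.
ratio_np = toy["mcap"].to_numpy() / toy["tvl"].to_numpy()

# 2. Pandas vectorization: keeps the index.
ratio_pd = toy["mcap"] / toy["tvl"]

# 3. Row iteration with iterrows(): works, but is the pattern to avoid.
ratio_loop = [row["mcap"] / row["tvl"] for _, row in toy.iterrows()]

print(ratio_pd)
```

All three give the same numbers; only the first two scale well as the DataFrame grows.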

### Good Read

- Get DeFi data easily using [defillama2](https://github.com/coindataschool/defillama2).
- [Stop using `iterrows()`](https://ryxcommar.com/2020/01/15/for-the-love-of-god-stop-using-iterrows/).
- [Don't iterate over rows in Pandas DataFrame](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758).
- [Are `for` loops bad in Pandas?](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care).
