## Pandas `apply(func, axis=1)` Speed

> Pandas `apply()` is often slower than vectorization and list comp and 
> consumes a lot more memory.

*Note: The performance of pandas `apply()` has improved as of July 2023, but row-wise `apply(func, axis=1)` is still slow. Not only is it slower than vectorization or list comp, it's even slower than `itertuples()`. So you should avoid using it.*

This notebook tries to convince you `apply()` row-wise (`axis=1`) on a DataFrame 
is bad because it's slower than vectorization, list comp, and even `itertuples()`. 
If you are new to pandas `apply()`, you will want to learn its [most common use pattern](https://github.com/coindataschool/pytips/blob/main/pandas/apply/01-pandas-apply-common-use-pattern.ipynb) and read [the speed 
analysis of `apply(func, axis=0)`](https://github.com/coindataschool/pytips/blob/main/pandas/apply/02-pandas-apply-axis%3D0-speed.ipynb) first. 

In [1]:
import pandas as pd
import numpy as np
from defillama2 import DefiLlama

### Data Prep

In [2]:
obj = DefiLlama() # create a DefiLlama instance
df = obj.get_protocols_fundamentals() # get fundamentals for all protocols
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3032 entries, 0 to 3031
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         3032 non-null   object 
 1   symbol       3032 non-null   object 
 2   chain        3032 non-null   object 
 3   category     3032 non-null   object 
 4   chains       3032 non-null   object 
 5   tvl          3032 non-null   float64
 6   change_1d    2471 non-null   float64
 7   change_7d    2486 non-null   float64
 8   mcap         1421 non-null   float64
 9   forked_from  2189 non-null   object 
dtypes: float64(4), object(6)
memory usage: 237.0+ KB


In [3]:
cond = df['category'] == 'Dexes' # focus on DEXes
cols = ['name', 'symbol', 'chain'] # focus on these cols
subdf = df.loc[cond, cols].reset_index(drop=True) # drop the original index
# subdf.set_index('name', inplace=True) # use name as index
subdf.head()

|   | name | symbol | chain |
|---|------|--------|-------|
| 0 | Curve DEX | CRV | Multi-Chain |
| 1 | Uniswap V3 | UNI | Multi-Chain |
| 2 | PancakeSwap AMM | CAKE | Multi-Chain |
| 3 | Uniswap V2 | UNI | Ethereum |
| 4 | Balancer V2 | BAL | Multi-Chain |


In [4]:
# derive a new col from the name col by stripping away 'DEX' and any versioning
# note I'm using the vectorized string method `.str.replace()` in pandas to do it.
xs = subdf['name'].str.replace('DEX|V[0-9]+', '', regex=True)
# make this col the first col of the dataframe
subdf.insert(0, column='broad_name', value=xs)
subdf.head()

|   | broad_name | name | symbol | chain |
|---|------------|------|--------|-------|
| 0 | Curve | Curve DEX | CRV | Multi-Chain |
| 1 | Uniswap | Uniswap V3 | UNI | Multi-Chain |
| 2 | PancakeSwap AMM | PancakeSwap AMM | CAKE | Multi-Chain |
| 3 | Uniswap | Uniswap V2 | UNI | Ethereum |
| 4 | Balancer | Balancer V2 | BAL | Multi-Chain |


### Compare Character Columns

A common problem we encounter as data analysts is checking whether the values in 
column A are present in column B on the same row, where A and B are both 
columns of strings. For example, given the above data, we'd want to do a sanity 
check to make sure `broad_name` is really a broad description of `name`. 
We'd expect every value of `broad_name` to be present in `name`, but not 
vice versa. Unfortunately, there are no vectorized string methods in pandas that 
allow us to do this.

In [5]:
# # you may attempt to use `.str.contains()`, but it doesn't work because it 
# # expects a single pattern, not a column of patterns. 
# # uncomment and run this block and examine the error yourself
# subdf['name'].str.contains(subdf['broad_name'])

In [6]:
# # you may say, "oh, what about .isin()". Unfortunately, it gives wrong results 
# # because it doesn't compare values row-wise. See doc for details:
# # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html
# # uncomment and run this block and compare results with blocks below
# is_broad_name_in_name = subdf['broad_name'].str.lower().isin(subdf['name'].str.lower())
# is_name_in_broad_name = subdf['name'].str.lower().isin(subdf['broad_name'].str.lower())
# n = len(subdf)
# print(f"How many values of `broad name` are present in `name`: {np.sum(is_broad_name_in_name)} out of {n}")
# print(f"How many values of `name` are present in `broad name`: {np.sum(is_name_in_broad_name)} out of {n}")
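To make the `.isin()` pitfall concrete, here is a minimal sketch on a tiny made-up frame (the `toy` data and the hand-written `expected_row_wise` list are my own illustration, not DefiLlama data). `.isin()` asks "does this exact value appear anywhere in the other column?", which is a different question from "is this value a substring of the value on the same row?".

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "broad_name": ["Curve", "Uniswap", "Balancer"],
    "name": ["Curve DEX", "Uniswap V3", "Balancer"],
})

# isin() tests exact membership against the WHOLE `name` column
via_isin = toy["broad_name"].isin(toy["name"])

# What we actually want: is broad_name a substring of name on the same row?
# Worked out by hand for the three rows above.
expected_row_wise = [True, True, True]

print(via_isin.tolist())   # [False, False, True]
print(expected_row_wise)   # [True, True, True]
```

Only the third row passes `.isin()`, and only because `'Balancer'` happens to equal an entire value in `name`.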

But we can do it with `apply(func, axis=1)`. (That doesn't mean we should; my goal 
is to convince you not to. Keep reading to find out why.)

In [7]:
# apply(func, axis=1), i.e., apply func row-wise
is_broad_name_in_name = subdf.apply(
    lambda row: row['broad_name'] in row['name'], 
    axis=1
)
is_name_in_broad_name = subdf.apply(
    lambda row: row['name'] in row['broad_name'],
    axis=1
)
n = len(subdf)
print(f"How many values of `broad name` are present in `name`: {np.sum(is_broad_name_in_name)} out of {n}")
print(f"How many values of `name` are present in `broad name`: {np.sum(is_name_in_broad_name)} out of {n}")

How many values of `broad name` are present in `name`: 905 out of 909
How many values of `name` are present in `broad name`: 821 out of 909


In [8]:
# oops, we'd expect all broad names to be in name. something is off, let's examine
subdf[~is_broad_name_in_name]

|     | broad_name | name | symbol | chain |
|-----|------------|------|--------|-------|
| 25  | Joe .1 | Joe V2.1 | JOE | Multi-Chain |
| 59  | Bancor .1 | Bancor V2.1 | BNT | Ethereum |
| 619 | Poly-Cryption Network | PolyDEX-Cryption Network | CNT | Polygon |
| 775 | SmartBCH | SmartDEXBCH | DSMART | smartBCH |


In general, it's difficult to vectorize string operations, so you may consider 
`apply(func, axis=1)` when you are comparing or manipulating multiple string 
columns. But you really shouldn't, because there is a better alternative, namely, 
list comp. 

In [9]:
# list comp over zip(colA, colB) 
is_broad_name_in_name = [x in y for x, y in zip(subdf['broad_name'], subdf['name'])]
is_name_in_broad_name = [y in x for x, y in zip(subdf['broad_name'], subdf['name'])]
print(f"How many values of `broad name` are present in `name`: {np.sum(is_broad_name_in_name)} out of {n}")
print(f"How many values of `name` are present in `broad name`: {np.sum(is_name_in_broad_name)} out of {n}")
subdf[~pd.Series(is_broad_name_in_name)]

How many values of `broad name` are present in `name`: 905 out of 909
How many values of `name` are present in `broad name`: 821 out of 909


|     | broad_name | name | symbol | chain |
|-----|------------|------|--------|-------|
| 25  | Joe .1 | Joe V2.1 | JOE | Multi-Chain |
| 59  | Bancor .1 | Bancor V2.1 | BNT | Ethereum |
| 619 | Poly-Cryption Network | PolyDEX-Cryption Network | CNT | Polygon |
| 775 | SmartBCH | SmartDEXBCH | DSMART | smartBCH |
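The four mismatches trace back to the pattern used in cell `In [4]`: `'DEX|V[0-9]+'` strips `V2` out of `V2.1` and leaves `.1` behind, and it also removes `DEX` when it sits inside a longer word such as `PolyDEX`. One possible fix is sketched below. The stricter `pattern` and the small `names` sample are my assumptions, not part of the original notebook, so check the result on the full data before relying on it.

```python
import pandas as pd

# A few names copied from the outputs above, as a small sample
names = pd.Series([
    "Curve DEX",
    "Uniswap V3",
    "Joe V2.1",
    "PolyDEX-Cryption Network",
    "SmartDEXBCH",
])

# Hypothetical stricter pattern:
# - \b...\b only matches 'DEX' or a version tag as a standalone word
# - (?:\.[0-9]+)* also consumes minor versions such as the '.1' in 'V2.1'
# - \s* removes the whitespace in front of the stripped word
pattern = r"\s*\b(?:DEX|V[0-9]+(?:\.[0-9]+)*)\b"
broad = names.str.replace(pattern, "", regex=True)

# Same row-wise check as before, done with list comp over zip()
is_broad_in_name = [b in n for b, n in zip(broad, names)]

print(broad.tolist())
print(all(is_broad_in_name))
```

On this sample, every derived broad name is contained in its original name, and names with an embedded `DEX` are left untouched.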


### Speed Comparison

List comp is the better alternative because it is much faster than `apply(func, axis=1)`. Don't believe me? 
Run the code block below.

In [10]:
%timeit subdf.apply(lambda row: row['broad_name'] in row['name'], axis=1) 
%timeit [x in y for x, y in zip(subdf['broad_name'], subdf['name'])] # zip() is powerful. Use more of it.

9.5 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
263 µs ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In this run, list comp is roughly 36 times faster (263 µs vs. 9.5 ms). In fact, `apply(func, axis=1)` is even slower than `itertuples()`.

In [11]:
%timeit subdf.apply(lambda row: row['broad_name'] in row['name'], axis=1) 
%timeit [row.broad_name in row.name for row in subdf.itertuples()]

9.41 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.28 ms ± 43.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
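If you don't have `defillama2` installed, you can still run the comparison. The sketch below rebuilds it with the standard-library `timeit` module on synthetic data. The `toy` frame, its size, and the three helper functions are made up for illustration, and the absolute timings will differ from the ones above depending on your machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic stand-in for subdf (hypothetical data, roughly the same size)
n_rows = 1_000
toy = pd.DataFrame({
    "broad_name": [f"Protocol{i}" for i in range(n_rows)],
    "name": [f"Protocol{i} V2" for i in range(n_rows)],
})


def with_apply():
    # row-wise apply: builds a Series object for every row
    return toy.apply(lambda row: row["broad_name"] in row["name"], axis=1).tolist()


def with_itertuples():
    # itertuples: yields one lightweight namedtuple per row
    return [row.broad_name in row.name for row in toy.itertuples()]


def with_zip():
    # list comp over zip(): iterates the two columns directly
    return [x in y for x, y in zip(toy["broad_name"], toy["name"])]


# All three approaches must agree before timing them
assert with_apply() == with_itertuples() == with_zip()

for fn in (with_apply, with_itertuples, with_zip):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__:>16}: {seconds * 1e3:.3f} ms per call")
```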


### Summary

- Vectorize if possible.
- Use list comp more. The `zip()` function is your fren. 
- Avoid `apply(func, axis=1)`. 
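As a quick illustration of the first bullet, here is a minimal sketch of the same derived column computed row-wise and vectorized. The `toy` values are made up; `tvl` and `mcap` just borrow the column names from the data above.

```python
import pandas as pd

# Hypothetical numbers, only for illustration
toy = pd.DataFrame({
    "tvl": [100.0, 250.0, 40.0],
    "mcap": [200.0, 125.0, 80.0],
})

# Row-wise apply: one Python-level function call per row
ratio_apply = toy.apply(lambda row: row["mcap"] / row["tvl"], axis=1)

# Vectorized: a single column-wise operation
ratio_vec = toy["mcap"] / toy["tvl"]

print(ratio_vec.tolist())  # [2.0, 0.5, 2.0]
```

Both give the same numbers, but the vectorized version never loops over rows in Python.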


### Good Read

- Get DeFi data easily using [defillama2](https://github.com/coindataschool/defillama2).
- [When should you not use pandas `apply()`](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code).
- [Don't iterate over rows in Pandas DataFrame](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758).
- [Are `for` loops bad in Pandas?](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care).
- [How Not to Use pandas' "apply"](https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/).