## Pandas `apply(func, axis=1)` is even slower than `itertuples()`

If you read my last [notebook](https://github.com/coindataschool/pytips/blob/main/pandas/apply/03-pandas-apply-axis%3D1-speed.ipynb), you know you should avoid `apply(func, axis=1)`: applying a function row-wise (once per row) on a DataFrame is very slow. It is even slower than `itertuples()`, which is not fast either and should also be avoided where possible. In this notebook, I want to show another example to really drive the message home:

> Don't use `apply(func, axis=1)`. It is slower than `itertuples()`.

In [1]:
import pandas as pd
import numpy as np
from defillama2 import DefiLlama

### Data Prep

Let's get the current prices of the following tokens from the corresponding chains.

In [2]:
dd = {'0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2':'ethereum', # WETH on mainnet
      '0x912CE59144191C1204E64559FE8253a0e49E6548':'arbitrum', # ARB on arbitrum
      '0x4200000000000000000000000000000000000042':'optimism', # OP on optimism
      '0xfc5a1a6eb076a2c7ad06ed22c90d7e710e35ad0a':'arbitrum', # GMX on arbitrum
      }

obj = DefiLlama() # create a DefiLlama instance
df = obj.get_tokens_curr_prices(dd) 
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4 entries, 2023-07-25 12:27:53+00:00 to 2023-07-25 12:27:56+00:00
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   symbol         4 non-null      object 
 1   price          4 non-null      float64
 2   chain          4 non-null      object 
 3   decimals       4 non-null      int64  
 4   token_address  4 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 192.0+ bytes


In [3]:
# drop useless cols and 
# reset the index to integers so that the original datetime index becomes a col 
df = df.drop(columns = ['decimals', 'token_address']).reset_index()
# add a new col of timezones
df['timezone'] = ['America/New_York', 'America/Los_Angeles', 'Europe/London', 'Asia/Tokyo']
df

| | timestamp | symbol | price | chain | timezone |
|---|---|---|---|---|---|
| 0 | 2023-07-25 12:27:53+00:00 | OP | 1.49 | optimism | America/New_York |
| 1 | 2023-07-25 12:27:53+00:00 | ARB | 1.17 | arbitrum | America/Los_Angeles |
| 2 | 2023-07-25 12:27:55+00:00 | WETH | 1853.41 | ethereum | Europe/London |
| 3 | 2023-07-25 12:27:56+00:00 | GMX | 53.37 | arbitrum | Asia/Tokyo |


Say we want to convert each `timestamp` value, which is in UTC, to the time zone given in the same row's `timezone` column. 
How should we do it? We may be tempted to call `.tz_convert()` directly on the two columns, 
but that throws an error.

In [4]:
# # Uncomment and run to see the error
# df['timestamp'].dt.tz_convert(df['timezone'])

That's because `.tz_convert()` takes a single time zone for the whole column, not a different one per row. 
But we can run it successfully on two individual values instead of two columns.

In [5]:
df['timestamp'][0].tz_convert(df['timezone'][0])

Timestamp('2023-07-25 08:27:53-0400', tz='America/New_York')

Now let's go one step further to remove the time zone info from the output via `tz_localize(None)`.

In [6]:
df['timestamp'][0].tz_convert(df['timezone'][0]).tz_localize(None)

Timestamp('2023-07-25 08:27:53')
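
The two steps do different things, which is easy to miss. Here is a minimal sketch using the first row's values (the variable names are mine, chosen for illustration): `tz_convert()` keeps the same instant and only changes the zone it is expressed in, while `tz_localize(None)` drops the zone and keeps the local wall-clock time.

```python
import pandas as pd

# First row's timestamp, in UTC (value copied from the table above).
utc_time = pd.Timestamp("2023-07-25 12:27:53", tz="UTC")

# Step 1: same instant, expressed in New York time.
ny_time = utc_time.tz_convert("America/New_York")
same_instant = ny_time == utc_time  # tz-aware comparison is by instant

# Step 2: drop the time zone, keeping the New York wall-clock reading.
naive_time = ny_time.tz_localize(None)

print(ny_time, same_instant, naive_time)
```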

Now that we've done it for the first row, let's do it for all rows by iterating with `itertuples()`.

In [7]:
pd.Series([row.timestamp.tz_convert(row.timezone).tz_localize(None) for row in df.itertuples()])

0   2023-07-25 08:27:53
1   2023-07-25 05:27:53
2   2023-07-25 13:27:55
3   2023-07-25 21:27:56
dtype: datetime64[ns]

But of course, we can also do it for all rows via `apply(func, axis=1)`.

In [8]:
df.apply(lambda row: row.timestamp.tz_convert(row.timezone).tz_localize(None), axis=1)

0   2023-07-25 08:27:53
1   2023-07-25 05:27:53
2   2023-07-25 13:27:55
3   2023-07-25 21:27:56
dtype: datetime64[ns]
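
Before timing anything, it may be worth confirming that the two approaches really agree. The sketch below rebuilds a small stand-in DataFrame (`demo`, with values copied from the table above, so no API call is needed) and compares the results element by element. The helper `to_local()` is a name I made up for illustration.

```python
import pandas as pd

# Stand-in for df: only the two columns the conversion needs.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-07-25 12:27:53", "2023-07-25 12:27:53",
         "2023-07-25 12:27:55", "2023-07-25 12:27:56"],
        utc=True,
    ),
    "timezone": ["America/New_York", "America/Los_Angeles",
                 "Europe/London", "Asia/Tokyo"],
})

def to_local(ts, tz):
    """Convert one tz-aware timestamp to tz, then drop the tz info."""
    return ts.tz_convert(tz).tz_localize(None)

via_itertuples = pd.Series(
    [to_local(row.timestamp, row.timezone) for row in demo.itertuples()]
)
via_apply = demo.apply(
    lambda row: to_local(row["timestamp"], row["timezone"]), axis=1
)

# Both Series share the default integer index, so == compares row by row.
results_match = bool((via_itertuples == via_apply).all())
print(results_match)
```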

Let's compare their speed.

In [9]:
dat = pd.concat([df for _ in range(1000)])
dat.shape

(4000, 5)

In [10]:
%timeit pd.Series([row.timestamp.tz_convert(row.timezone).tz_localize(None) for row in dat.itertuples()])

52 ms ± 891 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [11]:
%timeit dat.apply(lambda row: row['timestamp'].tz_convert(row['timezone']).tz_localize(None), axis=1)

96.5 ms ± 415 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
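
`%timeit` only works in IPython/Jupyter. If you want to repeat the comparison in a plain script, a rough sketch with the standard-library `timeit` module could look like the following. The stand-in data, the row count, and the function names are assumptions of mine, and the timings you get will differ from the ones shown in this notebook.

```python
import timeit
import pandas as pd

# Small stand-in frame, repeated to get 1000 rows.
base = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-07-25 12:27:53", "2023-07-25 12:27:56"], utc=True
    ),
    "timezone": ["America/New_York", "Asia/Tokyo"],
})
dat_demo = pd.concat([base] * 500, ignore_index=True)

def with_itertuples():
    return pd.Series(
        [row.timestamp.tz_convert(row.timezone).tz_localize(None)
         for row in dat_demo.itertuples()]
    )

def with_apply():
    return dat_demo.apply(
        lambda row: row["timestamp"].tz_convert(row["timezone"]).tz_localize(None),
        axis=1,
    )

runs = 3
# Take the best of 3 repeats and divide by the number of runs per repeat.
t_iter = min(timeit.repeat(with_itertuples, number=runs, repeat=3)) / runs
t_apply = min(timeit.repeat(with_apply, number=runs, repeat=3)) / runs

print(f"itertuples: {t_iter * 1e3:.1f} ms per run")
print(f"apply:      {t_apply * 1e3:.1f} ms per run")
```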


We see that even for a small dataset of 4000 rows, `apply(func, axis=1)` is almost twice as slow as `itertuples()` (96.5 ms vs. 52 ms).
Now you might be asking, "Can we have both the speed of `itertuples()` and the convenience of `apply()`?" 
The answer is "Yes." Someone did it for us. 

In [12]:
def faster_rowwise_apply(df, func):
    # credit: https://stackoverflow.com/a/56213688
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True): # iteration via itertuples()
        row_dict = {f:v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0]) # collect index
    return pd.Series(data, index=index)

In [13]:
%timeit faster_rowwise_apply(dat, lambda row: row['timestamp'].tz_convert(row['timezone']).tz_localize(None))

60.3 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
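
One thing to keep in mind when using this helper: `func` receives a plain `dict` for each row, not a pandas Series. So `row['timestamp']` works, but attribute access such as `row.timestamp` does not. The sketch below repeats the helper so it runs on its own and uses a made-up two-row DataFrame (the `qty` column is hypothetical) to show both points, plus the fact that the original index is carried over.

```python
import pandas as pd

def faster_rowwise_apply(df, func):
    # Same helper as above; credit: https://stackoverflow.com/a/56213688
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        data.append(func(dict(zip(cols, row[1:]))))  # func gets a dict
        index.append(row[0])                         # keep the index label
    return pd.Series(data, index=index)

# Hypothetical data for illustration only.
demo = pd.DataFrame({"price": [1.49, 1.17], "qty": [10, 20]},
                    index=["OP", "ARB"])

# Key access works because each row is a dict.
value = faster_rowwise_apply(demo, lambda row: row["price"] * row["qty"])

# Attribute access fails because a dict has no .price attribute.
try:
    faster_rowwise_apply(demo, lambda row: row.price)
    attribute_access_failed = False
except AttributeError:
    attribute_access_failed = True

print(value)
print(attribute_access_failed)
```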


### Summary

- Avoid `apply(func, axis=1)`. 
- If you must process rows one at a time, iterate with `itertuples()` instead. It was almost twice as fast in the test above. 
- Use `faster_rowwise_apply()` defined above to get close to the speed of `itertuples()` 
  while keeping the convenience of `apply()`.
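
For this particular task, one way to act on the first bullet is to loop over the distinct time zones rather than over the rows, since `.dt.tz_convert()` handles a whole column when given one time zone. This is only a sketch under my own assumptions: `convert_per_timezone()` is a hypothetical helper, not something from pandas or from the notebook above, and I have not benchmarked it here. It should help most when there are many rows but few distinct time zones.

```python
import numpy as np
import pandas as pd

def convert_per_timezone(frame, ts_col="timestamp", tz_col="timezone"):
    """Convert UTC timestamps to each row's time zone, one zone at a time.

    Hypothetical helper for illustration. Returns naive local times.
    """
    result = np.empty(len(frame), dtype="datetime64[ns]")
    for tz in frame[tz_col].unique():
        # Positional boolean mask, so duplicate index labels are fine.
        mask = (frame[tz_col] == tz).to_numpy()
        local = frame.loc[mask, ts_col].dt.tz_convert(tz).dt.tz_localize(None)
        result[mask] = local.to_numpy()
    return pd.Series(result, index=frame.index)

# Stand-in data with values copied from the table above.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-07-25 12:27:53", "2023-07-25 12:27:56"], utc=True
    ),
    "timezone": ["America/New_York", "Asia/Tokyo"],
})
local_times = convert_per_timezone(demo)
print(local_times)
```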


**Good Read:**

- Get DeFi data easily using [defillama2](https://github.com/coindataschool/defillama2).
- [Stop using iterrows](https://ryxcommar.com/2020/01/15/for-the-love-of-god-stop-using-iterrows/).
- [Pete Cacioppi's faster df apply](https://stackoverflow.com/a/56213688).