
In [None]:
https://www.gormanalysis.com/blog/python-pandas-for-your-grandpa-2-7-series-apply/

In this section, we’ll see how you can use the apply() method of a Series to apply a function to each element in the Series, and then we’ll see why apply() is usually inferior to a vectorized solution.

Suppose you have a Series called foo with 6 elements, like this:

In [1]:
import numpy as np
import pandas as pd

foo = pd.Series([3, 9, 2, 2, 8, 7])
print(foo)

0    3
1    9
2    2
3    2
4    8
5    7
dtype: int64


And you want to apply some complicated function like this one to each element in the Series.

In [2]:
def my_func(x):
    return x - 1 if x % 2 == 0 else x + 3

Here, my_func() takes in a scalar, x, and returns x - 1 if x is even; otherwise it returns x + 3. Okay, maybe this one-liner isn’t that complicated, but for the sake of argument, pretend this function has hundreds of lines of cryptic code. In cases like this, you can use the apply() method of the Series object to apply my_func() to each element of foo.

In this case, you’d call foo.apply(), passing in the function callable, my_func, to get back a new Series of values.

In [3]:
foo.apply(my_func)

0     6
1    12
2     1
3     1
4     7
5    10
dtype: int64
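
To make it concrete what apply() is doing here, the following minimal sketch compares it with a plain Python loop over the same values. The names looped and applied are introduced only for this illustration, and the loop is a stand-in for the idea, not a description of how pandas implements apply() internally.

```python
import pandas as pd

foo = pd.Series([3, 9, 2, 2, 8, 7])

def my_func(x):
    return x - 1 if x % 2 == 0 else x + 3

# Call my_func once per element in a plain Python loop,
# then rebuild a Series that reuses foo's index
looped = pd.Series([my_func(x) for x in foo], index=foo.index)

# Let apply() do the same element-by-element work
applied = foo.apply(my_func)

# Both approaches should produce the same values
print(applied.tolist() == looped.tolist())
```

Either way, my_func runs once for every element, which is worth keeping in mind when the Series gets large.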

We could even generalize my_func by giving it some parameters, a and b, like this:

In [4]:
def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b

And this time, we can call foo.apply(my_func) and pass in a and b as trailing keyword arguments, which apply() forwards to my_func.

In [5]:
foo.apply(my_func, a=2, b=4)

0     7
1    13
2     0
3     0
4     6
5    11
dtype: int64
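
As a small sketch of an alternative, the extra parameters can also be supplied positionally through the args parameter of Series.apply(). The variable names by_keyword and by_position are made up for this example.

```python
import pandas as pd

foo = pd.Series([3, 9, 2, 2, 8, 7])

def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b

# Keyword form: a and b are forwarded to my_func by name
by_keyword = foo.apply(my_func, a=2, b=4)

# Positional form: the tuple in args= is passed after each element,
# so every call is effectively my_func(x, 2, 4)
by_position = foo.apply(my_func, args=(2, 4))

print(by_position.tolist())  # [7, 13, 0, 0, 6, 11]
```

The keyword form is usually easier to read, since it doesn't depend on the order of my_func's parameters.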

The apply() method is great because it’s easy to use and it generalizes well, but it’s slow because it’s not vectorized: it calls the Python function once per element. If we apply my_func to a Series with 10M values, it takes about 3.6 seconds to execute on Google Colab (timings will vary from run to run).

In [7]:
# Create a Series of 10M values
bigfoo = pd.Series(np.random.randint(low=0, high=9, size=10**7))

In [8]:
%%timeit
# apply()-based method (%%timeit has to be the first line of the cell)
y1 = bigfoo.apply(my_func)  # about 3.6 seconds in this run

1 loop, best of 5: 3.63 s per loop


By contrast, here’s a NumPy-based solution that achieves the same thing in about 250 milliseconds in this run, roughly 14 times faster.

In [9]:
%%timeit
# vectorized NumPy method (%%timeit has to be the first line of the cell)
a = bigfoo.to_numpy()
y2 = pd.Series(np.where(a % 2 == 0, a - 1, a + 3))  # about 250 milliseconds in this run

1 loop, best of 5: 254 ms per loop
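
Before trusting a vectorized rewrite, it helps to check that it agrees with the apply() version on a small input. The sketch below does that for the generalized my_func. The helper my_func_vectorized is a hypothetical name introduced here, and unlike the timing cell above it also carries over the original index.

```python
import numpy as np
import pandas as pd

def my_func(x, a=1, b=3):
    return x - a if x % 2 == 0 else x + b

def my_func_vectorized(s, a=1, b=3):
    # Hypothetical helper: the same rule as my_func, applied to the whole array at once
    values = s.to_numpy()
    result = np.where(values % 2 == 0, values - a, values + b)
    return pd.Series(result, index=s.index)

foo = pd.Series([3, 9, 2, 2, 8, 7])

y_apply = foo.apply(my_func, a=2, b=4)
y_vec = my_func_vectorized(foo, a=2, b=4)

# True when every element matches
print((y_apply == y_vec).all())
```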


With that said, the apply() method is designed for convenience and code clarity, not speed. Keep in mind that sometimes my_func might actually be a function imported from another package, or maybe it makes HTTP requests to some API, so refactoring it into a vectorized solution just isn’t feasible.
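
As a small illustration of that last point, here is a sketch that uses math.factorial from the standard library as a stand-in for an imported function that only accepts one scalar at a time. The choice of math.factorial is an assumption made for the example; any scalar-only function would do.

```python
import math

import pandas as pd

foo = pd.Series([3, 9, 2, 2, 8, 7])

# math.factorial takes a single integer, not an array,
# so apply() is a convenient way to run it over every element
factorials = foo.apply(math.factorial)

print(factorials.tolist())  # [6, 362880, 2, 2, 40320, 5040]
```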