In [1]:
import pandas as pd
import numpy as np
import concurrent.futures as futures

In [2]:
from IPython.display import display

In [3]:
N = 100
df = pd.DataFrame({'a': np.arange(N), 'b': np.random.randn(N)})
display(df.head(10))

Unnamed: 0,a,b
0,0,-0.217307
1,1,0.074995
2,2,0.753345
3,3,0.310279
4,4,0.019605
5,5,-0.371197
6,6,-1.695124
7,7,0.840065
8,8,-1.122004
9,9,-0.639414


## Using the apply function in pandas is a bad idea. It will almost never give you competitive performance.

Let's first define a simple function that adds two numbers.

In [4]:
def add_ab(r):
    return r.a + r.b

def add_ab_i(r):
    return r[1] + r[2]

### The "best" way to do this is to just use the vectorization built into the underlying numpy data.

In [5]:
%timeit df.a+df.b

82.3 µs ± 886 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


### We could also use Python's built-in functional programming constructs

This is certainly slower than above, but what if you had a function that you couldn't break down into optimized, vectorized operations?

The first version uses integer indexes, so you lose a little of the expressiveness. We can regain that with the namedtuples from itertuples, though we pay a price in speed.

In [6]:
%timeit pd.Series(list(map(add_ab_i, df.itertuples(name=None))), index=df.index)

453 µs ± 2.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [7]:
%timeit pd.Series(list(map(add_ab, df.itertuples())), index=df.index)

916 µs ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
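To make the difference between the two timed variants concrete, here is a small sketch of what `itertuples` yields in each mode. The frame `demo` and the variable names are made up for illustration and are not part of the benchmark above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (3 rows instead of the 100 used above).
demo = pd.DataFrame({'a': np.arange(3), 'b': [0.5, 1.5, 2.5]})

# name=None yields plain tuples: position 0 is the index, then the columns
# in order. That is why add_ab_i reads r[1] and r[2].
plain_row = next(demo.itertuples(name=None))
print(plain_row)

# The default yields namedtuples, so fields can be read by column name,
# which is what add_ab relies on with r.a and r.b.
named_row = next(demo.itertuples())
print(named_row._fields)
print(named_row.a, named_row.b)
```

Building the namedtuple for each row is extra work, which is consistent with the slower timing of the named variant.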


### Using the apply method of a data frame we pay a significant speed penalty.

In [8]:
%timeit df.apply(add_ab, axis=1)

2.42 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
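One plausible contributor to this gap, which the timing above does not isolate, is that `apply` with `axis=1` hands the function a full `pd.Series` for every row instead of a lightweight tuple. The sketch below only confirms what type the function receives; `demo`, `seen_types` and `record_type` are illustrative names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': np.arange(3), 'b': [0.5, 1.5, 2.5]})

seen_types = set()

def record_type(row):
    # Remember what kind of object apply passes in for each row.
    seen_types.add(type(row).__name__)
    return row.a + row.b

result = demo.apply(record_type, axis=1)
print(seen_types)
print(result.tolist())
```

Constructing a Series per row carries index and dtype handling that a plain tuple does not need.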


## Define a slightly more complicated function where we want to get a dataframe back out

In [9]:
def add_sub_ab(r):
    return {'apb': r.a+r.b, 'amb': r.a-r.b}

def add_sub_ab_i(r):
    return {'apb': r[1]+r[2], 'amb': r[1]-r[2]}

Again, using optimized vectorized operations is the fastest. The built-in Python functional programming constructs do well, and pandas apply runs dismally behind.

In [10]:
%timeit pd.DataFrame({'apb': df.a + df.b, 'amb': df.a - df.b})
%timeit pd.DataFrame(list(map(add_sub_ab_i, df.itertuples(name=None))), index=df.index)
%timeit pd.DataFrame(list(map(add_sub_ab, df.itertuples())), index=df.index)
%timeit df.apply(lambda r: pd.Series(add_sub_ab(r)), axis=1)

471 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.02 ms ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.54 ms ± 60.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
25.4 ms ± 913 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Expand now to four operations and four columns

In [11]:
def four_op_ab(r):
    return {'apb': r.a+r.b, 'amb': r.a-r.b,
            'atb': r.a*r.b, 'adb': r.a/r.b}

def four_op_ab_i(r):
    return [r[1]+r[2], r[1]-r[2],
            r[1]*r[2], r[1]/r[2]]

The order of the results hasn't changed, but it is interesting that the lead of the vectorized solution is shrinking. This is because it makes four passes over the data while the others make just one. Even so, I don't think the row-wise approaches will ever beat the vectorized one.

In [12]:
%timeit pd.DataFrame({'apb': df.a+df.b, 'amb': df.a-df.b, 'atb': df.a*df.b, 'adb': df.a/df.b})
%timeit pd.DataFrame(list(map(four_op_ab_i, df.itertuples(name=None))), index=df.index, columns=['apb','amb','atb','adb'])
%timeit pd.DataFrame(list(map(four_op_ab, df.itertuples())), index=df.index)
%timeit df.apply(lambda r: pd.Series(four_op_ab(r)), axis=1)

656 µs ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.37 ms ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.91 ms ± 25.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
27.4 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
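Since the approaches are being compared on speed, it is worth checking that they produce the same frame. This is a minimal sketch on a small seeded frame (`demo` and `rng` are illustrative, not the data timed above), comparing the vectorized result against the tuple-based `map` result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({'a': np.arange(5), 'b': rng.standard_normal(5)})
cols = ['apb', 'amb', 'atb', 'adb']

def four_op_ab_i(r):
    # r is a plain tuple: (index, a, b)
    return [r[1]+r[2], r[1]-r[2], r[1]*r[2], r[1]/r[2]]

# Four vectorized passes over the columns.
vectorized = pd.DataFrame({'apb': demo.a+demo.b, 'amb': demo.a-demo.b,
                           'atb': demo.a*demo.b, 'adb': demo.a/demo.b})

# One pass over the rows with map.
mapped = pd.DataFrame(list(map(four_op_ab_i, demo.itertuples(name=None))),
                      index=demo.index, columns=cols)

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(vectorized, mapped)
```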


## Now define a parallel solution and a long running operation that you can't translate into vectorized operations.

You'll see that the parallel apply is essentially the same pattern as the functional constructs that are built into python.

In [19]:
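def parallel_apply(df, func, **kwargs):
    # Optional keyword arguments: max_workers and chunksize.
    max_workers = kwargs.get('max_workers', None)
    chunksize = kwargs.get('chunksize', max(1, df.shape[0] // 64))
    with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Rows are sent to the worker processes as plain tuples (name=None).
        rslt = list(executor.map(func, df.itertuples(name=None), chunksize=chunksize))
    return pd.DataFrame(rslt, index=df.index)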
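The function above iterates with `itertuples(name=None)`. `ProcessPoolExecutor` ships each row to a worker process with pickle, and plain tuples pickle cleanly. The sketch below checks that, and also tries to pickle a default namedtuple row; on the pandas versions I am aware of that attempt fails, but treat the exact behavior as an assumption about your installed version. `demo` and the other names are illustrative.

```python
import pickle

import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': np.arange(3), 'b': [0.5, 1.5, 2.5]})

# Plain tuples survive a pickle round trip.
plain_row = next(demo.itertuples(name=None))
restored = pickle.loads(pickle.dumps(plain_row))

# The default namedtuple class is created on the fly inside pandas, so
# pickle may be unable to look it up again by name.
named_row = next(demo.itertuples())
try:
    pickle.dumps(named_row)
    named_pickles = True
except Exception as exc:
    named_pickles = False
    print(type(exc).__name__, exc)

print('namedtuple row pickles:', named_pickles)
```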

In [14]:
import time

def long_op_ab(r):
    time.sleep(0.25)
    return {'apb': r.a+r.b, 'amb': r.a-r.b,
            'atb': r.a*r.b, 'adb': r.a/r.b}

def long_op_ab_i(r):
    time.sleep(0.25)
    return [r[1]+r[2], r[1]-r[2],
            r[1]*r[2], r[1]/r[2]]

In [15]:
N_big = int(1e3)
df_big = pd.DataFrame({'a': np.arange(N_big), 'b': np.random.randn(N_big)})

Running pandas apply takes about as long as you would expect from a single thread: 1,000 rows at 0.25 s each is roughly 250 s.

In [16]:
%time df_big.apply(lambda r: pd.Series(long_op_ab(r)), axis=1)
None

CPU times: user 1.52 s, sys: 24 ms, total: 1.55 s
Wall time: 4min 12s


Using the Python functional constructs takes basically the same time.

In [17]:
%time pd.DataFrame(list(map(long_op_ab_i, df_big.itertuples(name=None))), index=df_big.index)
None

CPU times: user 188 ms, sys: 4 ms, total: 192 ms
Wall time: 4min 10s


If you use the parallel apply function, the time drops by almost a factor of the number of worker processes. It wasn't very hard to implement, but it cannot use the namedtuples because of pickle problems.

In [18]:
%time parallel_apply(df_big, long_op_ab_i, chunksize=100, max_workers=4)
None

CPU times: user 56 ms, sys: 52 ms, total: 108 ms
Wall time: 1min 16s
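The wall time above is a bit more than a quarter of the serial time. A back-of-the-envelope estimate suggests why, under the simplifying assumptions that each chunk runs entirely on one worker, that chunks are processed in rounds, and that overhead is negligible. The variable names here are illustrative.

```python
import math

n_rows, chunksize, max_workers = 1000, 100, 4
seconds_per_row = 0.25  # the time.sleep in long_op_ab_i

# 1000 rows in chunks of 100 gives 10 chunks.
n_chunks = math.ceil(n_rows / chunksize)

# 4 workers need 3 rounds to get through 10 chunks (4 + 4 + 2).
rounds = math.ceil(n_chunks / max_workers)

estimated_wall = rounds * chunksize * seconds_per_row
serial_wall = n_rows * seconds_per_row

print(estimated_wall, serial_wall)
```

The estimate of 75 s sits close to the measured 1min 16s. Under the same assumptions, a chunksize that divides the rows evenly across the workers (for example 250 or 125) would need only 62.5 s.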


So the only time pandas apply makes sense from a performance point of view is when you are running a computationally intensive function and you have to do it in a single thread. Since even most low-end computers have more than one processor, there doesn't seem to be a general need to ever use apply. You're better off using the vectorized operations when you can; when you cannot, you are better off using the regular functional constructs in Python, like map(...) or a parallel map(...), over the iterator of rows as tuples (not Series). It doesn't require much code at all and can save huge amounts of time.