With the release of the NumPy paper, people have been asking me more questions about efficiency and speed within pandas, and whether broadcasting is worth the added complexity. I want to be clear that I don't think broadcasting is complex - it's much simpler than writing loops yourself! The issue is that we don't typically think this way when programming, and so many of our programming languages and toolkits don't take advantage of the benefits of [array programming](https://en.wikipedia.org/wiki/Array_programming).

Let's look at a couple of examples.

In [3]:
import numpy as np
import pandas as pd

# Reusable setup code: a column of random data, an empty column to fill in,
# and the mean of the data
def setup_dataframe(size=1000):
    df = pd.DataFrame({"data": np.random.normal(size=size), "mean": np.empty(size)})
    mean = np.mean(df["data"])
    return df, mean

First up is building a new column in a `DataFrame` through a traditional `for` loop. Note that each time we want to write a value into the dataframe, we have to look up the existing value and subtract the mean.

In [27]:
%%timeit df, mean = setup_dataframe()
for i in range(len(df)):
    df.iloc[i,1]=df["data"].iloc[i]-mean

165 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Next up is iteratively building the new column through a list comprehension, then assigning it in one statement to the `DataFrame`.

In [12]:
%%timeit df, mean = setup_dataframe()
df["mean"]=[x-mean for x in df["data"]]

382 µs ± 3.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Now that's pretty fast - 382 microseconds on my machine, much, much better than a straightforward iteration through the `DataFrame` filling in values!

It's worth taking a moment to consider Python's built-in functions. In particular, the `map` function is built into the language, and its documentation doesn't say whether it will work in parallel or not, just that it will take any iterable and apply a given function to each item. Unfortunately, the fact that it will take any iterable is really an indication of how it works: it walks through the items one at a time. But let's do a bit of setup outside of the timed function to give it a fighting chance.

In [7]:
# define our function to run
def subtract_mean(x, mean):
    return x - mean

# some additional values we will need to set up: one copy of the mean per row
# for map(), and (value, mean) pairs for starmap()
def setup_additional_dataframe(size=1000):
    df = pd.DataFrame({"data": np.random.normal(size=size)})
    mean = np.mean(df["data"])
    list_of_means = [mean] * len(df)
    arguments = list(zip(df["data"], list_of_means))
    return (df, mean, list_of_means, arguments)

In [13]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe()
df["mean"]=pd.DataFrame(map(subtract_mean, df["data"], list_of_means))

578 µs ± 5.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
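For anyone who hasn't passed more than one iterable to `map` before, here is a minimal sketch of what the call above is doing, using a tiny plain list instead of a `DataFrame`. The `values` list and the use of `itertools.repeat` are my own illustration and are not part of the timed code.

```python
from itertools import repeat

def subtract_mean(x, mean):
    return x - mean

values = [1.0, 2.0, 6.0]
mean = sum(values) / len(values)  # 3.0

# map() pulls one item from each iterable in step and calls the function on them.
# It returns a lazy iterator: nothing is computed until we consume it.
lazy = map(subtract_mean, values, [mean] * len(values))
result = list(lazy)

# map() stops at the shortest iterable, so an endless repeat(mean) also works
# and avoids building a list of identical means.
result_repeat = list(map(subtract_mean, values, repeat(mean)))

print(result)         # [-2.0, -1.0, 3.0]
print(result_repeat)  # [-2.0, -1.0, 3.0]
```

Either way, the function is still called once per item, in order, in a single process.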


Aha! At 578 microseconds, that's pretty close to our list comprehension. And surely we could use Python's parallelization libraries to improve this, right? Even if `map()` isn't parallel, the `threading` and `multiprocessing` libraries within Python must be, and they have a function that looks just like `map()`! So let's set up some parameters for that exploration (it looks much the same as the previous).

In [11]:
import multiprocessing
pool = multiprocessing.Pool()
multiprocessing.cpu_count()

16

Now, before we fire this up, please keep in mind that there is overhead each time we hand work off to another process on the system. There are plenty of parameters which affect this, but two are how many processes are going to work in parallel (by not providing a parameter to `Pool()` we are saying use as many as are available) and how much data we want to batch into each call. Let's try a few different batch sizes (which we do through the `chunksize` parameter).

In [14]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe()
df["mean"]=pd.DataFrame(pool.starmap(subtract_mean, arguments, chunksize=1))

38 ms ± 3.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe()
df["mean"]=pd.DataFrame(pool.starmap(subtract_mean, arguments, chunksize=10))

5.22 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [16]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe()
df["mean"]=pd.DataFrame(pool.starmap(subtract_mean, arguments, chunksize=25))

2.54 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [17]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe()
df["mean"]=pd.DataFrame(pool.starmap(subtract_mean, arguments, chunksize=250))

1.55 ms ± 47.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [18]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe()
df["mean"]=pd.DataFrame(pool.starmap(subtract_mean, arguments, chunksize=500))

2.05 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
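One way to read these timings is to count how many batches each `chunksize` produces for our 1,000 items. The sketch below only does that arithmetic; `task_count` is a hypothetical helper of mine, not something from `multiprocessing`, and it assumes the items are simply grouped `chunksize` at a time with a smaller final batch if needed.

```python
import math

def task_count(n_items, chunksize):
    """Number of batches when n_items are grouped chunksize at a time."""
    return math.ceil(n_items / chunksize)

n_items = 1000
task_counts = {c: task_count(n_items, c) for c in (1, 10, 25, 250, 500)}

for chunksize, tasks in task_counts.items():
    print(f"chunksize={chunksize:>3}: {tasks:>4} batches")
```

With `chunksize=1` there are 1,000 separate batches to ship between processes, which fits the slowest timing. With `chunksize=500` there are only 2 batches, so at most two of the 16 workers have anything to do. That may be part of why 250 beat 500 on my machine, but I haven't measured it directly.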


I found these results interesting. I wonder what happens if we just crank up the number of items in our array: do we eventually get close to our current fastest implementation, the list comprehension?

In [19]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe(size=1000000)
df["mean"]=[x-mean for x in df["data"]]

281 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [23]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe(size=1000000)
df["mean"]=pd.DataFrame(pool.starmap(subtract_mean, arguments, chunksize=10000))

627 ms ± 16.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


So, we've seen a breadth of different approaches. Let's now take a look at broadcasting, though.

In [24]:
%%timeit df, mean, list_of_means, arguments = setup_additional_dataframe(size=1000000)
df["mean"]=df["data"]-mean

2.6 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
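If the one-line version feels like magic, here is a small NumPy-only sketch of what broadcasting does with that subtraction. The seeded generator, the tiny `matrix`, and the variable names are my own choices for illustration. The first part checks that subtracting a scalar from a whole array matches the element-by-element comprehension; the second shows the same idea stretching a row of column means across a 2-D array.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
data = rng.normal(size=1000)
mean = data.mean()

# Element by element, in Python
looped = np.array([x - mean for x in data])

# Broadcasting: the scalar is applied to every element, inside NumPy's compiled code
broadcast = data - mean

print(np.allclose(looped, broadcast))  # True

# Broadcasting also works across axes: shape (2, 3) minus shape (3,)
matrix = np.arange(6.0).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
col_means = matrix.mean(axis=0)        # [1.5, 2.5, 3.5]
centered = matrix - col_means          # each row has the column means subtracted
print(centered)
```

The result is the same as the loop; the difference is only where the loop runs.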


And to compare with our initial test cases...

In [26]:
%%timeit df, mean = setup_dataframe(size=1000)
df["mean"]=df["data"]-mean

213 µs ± 4.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


So, on my system right now this looks strong for broadcasting! Just over 200 microseconds for 1,000 items, and about two and a half milliseconds for 1,000,000 values. That's faster than our previous leader, the list comprehension, which took about 380 microseconds and 281 milliseconds respectively: a bit under twice as fast at the small size, and roughly a hundred times faster at the large one.
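The `%%timeit` cells above only work inside IPython or Jupyter. If you want to try the comparison from a plain Python script, a rough sketch with the standard library's `timeit` module might look like the following. The wrapper functions and `best_time` are hypothetical names of mine, and this reports the best run rather than the mean ± standard deviation that `%%timeit` prints, so the numbers will not line up exactly with the ones in this post.

```python
import timeit

import numpy as np
import pandas as pd

def setup_dataframe(size=1000):
    df = pd.DataFrame({"data": np.random.normal(size=size), "mean": np.empty(size)})
    return df, np.mean(df["data"])

def with_comprehension(df, mean):
    df["mean"] = [x - mean for x in df["data"]]

def with_broadcasting(df, mean):
    df["mean"] = df["data"] - mean

def best_time(func, size=1000, repeat=3, number=10):
    """Best average seconds per call; setup happens outside the timed part."""
    df, mean = setup_dataframe(size)
    runs = timeit.repeat(lambda: func(df, mean), repeat=repeat, number=number)
    return min(runs) / number

timings = {f.__name__: best_time(f) for f in (with_comprehension, with_broadcasting)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```

Pass a larger `size` to `best_time` to see how the gap changes as the data grows.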