## High-Performance Pandas: eval() and query()

Pandas includes some experimental tools that give you direct access to C-speed operations without costly allocation of intermediate arrays. These are the eval() and query() functions, which rely on the Numexpr package.

## Motivating query() and eval(): Compound Expressions

In [12]:
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(int(1e6))
y = rng.rand(int(1E6))
%timeit x + y

1.73 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [6]:
a=1e6
a

1000000.0

In [14]:
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

294 ms ± 6.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
# This abstraction can become less efficient when computing compound expressions
mask = (x > 0.5) & (y < 0.5)
# This is roughly equivalent to the following:

In [16]:
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2

Every intermediate step is explicitly allocated in memory. If the x and y arrays are very large, this can lead to significant memory and computational overhead. The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.

In [17]:
import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)

True

The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays. The Pandas eval() and query() tools that we will discuss here are conceptually similar, and depend on the Numexpr package.
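To make the idea of avoiding full-sized temporaries concrete, here is a pure-NumPy sketch that evaluates the same compound expression one block at a time. This is only an illustration of the principle, not Numexpr's actual implementation; the names chunked_mask and chunk_size are hypothetical and chosen for this example.

```python
import numpy as np

def chunked_mask(x, y, chunk_size=4096):
    """Evaluate (x > 0.5) & (y < 0.5) one block at a time (illustrative only)."""
    out = np.empty(len(x), dtype=bool)
    for start in range(0, len(x), chunk_size):
        stop = start + chunk_size
        # The temporaries created here hold at most chunk_size elements,
        # instead of one full-length array per sub-expression
        out[start:stop] = (x[start:stop] > 0.5) & (y[start:stop] < 0.5)
    return out

rng = np.random.RandomState(42)
x = rng.rand(100_000)
y = rng.rand(100_000)

mask_chunked = chunked_mask(x, y)
mask_numpy = (x > 0.5) & (y < 0.5)
print(np.array_equal(mask_chunked, mask_numpy))
```

A Python-level loop like this is slower than a compiled evaluator, so it is useful for understanding the memory behavior rather than for speed.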

## pandas.eval() for Efficient Operations

In [18]:
import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in range(4))

In [19]:
%timeit df1 + df2 + df3 + df4

62.2 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [20]:
# We can compute the same result via pd.eval by constructing the expression as a string
%timeit pd.eval('df1 + df2 + df3 + df4')

29.3 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [21]:
np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))

True
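pd.eval() also accepts an engine argument. As a small sketch, the following passes engine='python', which evaluates the expression with ordinary Python operations; this can be handy for checking results on a system without Numexpr, though it should not be expected to give any speedup. The names df_a and df_b are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_a = pd.DataFrame(rng.rand(1000, 5))
df_b = pd.DataFrame(rng.rand(1000, 5))

# engine='python' bypasses Numexpr and evaluates the expression
# with regular Pandas/NumPy operations
result_python = pd.eval('df_a + df_b', engine='python')
result_direct = df_a + df_b
print(np.allclose(result_python, result_direct))
```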

### Operations supported by pd.eval()

In [22]:
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3))) for i in range(5))

### Arithmetic operators
pd.eval() supports all arithmetic operators.

In [24]:
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)

True

### Comparison operators
pd.eval() supports all comparison operators, including chained expressions.

In [25]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True

### Bitwise operators
pd.eval() supports the & and | bitwise operators.

In [26]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

In [27]:
# pd.eval() also accepts the literal `and` and `or` in Boolean expressions
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True
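The `and` and `or` keywords are worth a second look, because outside of pd.eval() they do not work element-wise on DataFrames. The sketch below, using two small made-up frames a and b, shows plain Python raising a ValueError while the string expression gives the same result as the & operator.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = pd.DataFrame(rng.rand(5, 2))
b = pd.DataFrame(rng.rand(5, 2))

# In plain Python, `and` asks for the truth value of a whole DataFrame,
# which is ambiguous and raises a ValueError
try:
    (a < 0.5) and (b < 0.5)
    plain_python_works = True
except ValueError:
    plain_python_works = False

# Inside pd.eval(), `and` is treated as the element-wise & operator
with_and = pd.eval('(a < 0.5) and (b < 0.5)')
with_amp = (a < 0.5) & (b < 0.5)
print(plain_python_works, with_and.equals(with_amp))
```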

### Object attributes and indices
pd.eval() supports access to object attributes via the obj.attr syntax and indexes via the obj[index] syntax.

In [28]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)

True

## DataFrame.eval() for Column-Wise Operations

In [29]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | 0.375506 | 0.406939 | 0.069938 |
| 1 | 0.069087 | 0.235615 | 0.154374 |
| 2 | 0.677945 | 0.433839 | 0.652324 |
| 3 | 0.264038 | 0.808055 | 0.347197 |
| 4 | 0.589161 | 0.252418 | 0.557789 |


In [30]:
# Using pd.eval()
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

True

In [31]:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

True

### Assignment in DataFrame.eval()

In [32]:
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | 0.375506 | 0.406939 | 0.069938 |
| 1 | 0.069087 | 0.235615 | 0.154374 |
| 2 | 0.677945 | 0.433839 | 0.652324 |
| 3 | 0.264038 | 0.808055 | 0.347197 |
| 4 | 0.589161 | 0.252418 | 0.557789 |


In [33]:
df.eval('D = (A + B) / C', inplace=True)
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.375506 | 0.406939 | 0.069938 | 11.18762 |
| 1 | 0.069087 | 0.235615 | 0.154374 | 1.973796 |
| 2 | 0.677945 | 0.433839 | 0.652324 | 1.704344 |
| 3 | 0.264038 | 0.808055 | 0.347197 | 3.087857 |
| 4 | 0.589161 | 0.252418 | 0.557789 | 1.508776 |


In [34]:
df.eval('D = (A - B) / C', inplace=True)
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.375506 | 0.406939 | 0.069938 | -0.449425 |
| 1 | 0.069087 | 0.235615 | 0.154374 | -1.078728 |
| 2 | 0.677945 | 0.433839 | 0.652324 | 0.374209 |
| 3 | 0.264038 | 0.808055 | 0.347197 | -1.566886 |
| 4 | 0.589161 | 0.252418 | 0.557789 | 0.603708 |
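The cells above modify df in place. As a small sketch on a made-up frame (df_small is a hypothetical name), leaving out inplace=True returns a new DataFrame with the assigned column and leaves the original untouched:

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0],
                         'B': [3.0, 4.0],
                         'C': [2.0, 4.0]})

# Without inplace=True, eval() returns a copy that includes column D
df_with_d = df_small.eval('D = (A + B) / C')

print(list(df_small.columns))   # the original frame has no D column
print(df_with_d['D'].tolist())  # (1 + 3) / 2 and (2 + 4) / 4
```

This form is convenient when you want to keep the original data unchanged.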


### Local variables in DataFrame.eval()

In [35]:
# Mean across the columns of each row (axis=1)
column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True
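The @ prefix also keeps the two namespaces apart when a local variable and a column share a name. In this sketch (df_small and the local variable A are made up for illustration), the bare name refers to the column and the prefixed name to the Python variable:

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0, 3.0]})
A = 10.0  # a local variable that happens to share the column's name

# `A` is looked up among the columns; `@A` is looked up among local variables
result = df_small.eval('A + @A')
print(result.tolist())
```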

## DataFrame.query() Method

In [36]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)

True

In [39]:
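# This filtering operation cannot be expressed with DataFrame.eval();
# the following does not work:
#     result2 = df.eval('df[(A < 0.5) & (B < 0.5)]')
# Use the query() method instead: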
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)

True

In [40]:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)

True
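Because the random data above makes it hard to see which rows survive, here is the same pattern on a tiny hand-written frame. The names df_small and threshold are hypothetical; the point is only that query() keeps the rows where the condition holds and preserves their original index labels.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [0.1, 0.7, 0.3, 0.2],
                         'B': [0.2, 0.1, 0.9, 0.4]})
threshold = 0.5

# Rows 0 and 3 are the only ones with both A and B below the threshold
filtered = df_small.query('A < @threshold and B < @threshold')
masked = df_small[(df_small.A < threshold) & (df_small.B < threshold)]

print(filtered.index.tolist())
print(filtered.equals(masked))
```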

## Performance: When to Use These Functions

When considering whether to use these functions, there are two considerations: *computation time* and *memory use*. Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays. For example, the expression in the first cell below is roughly equivalent to the explicit steps in the second:

In [41]:
x = df[(df.A < 0.5) & (df.B < 0.5)]

In [42]:
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]

If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes), then it's a good idea to use an eval() or query() expression. You can check the approximate size of your array in bytes using this:

In [43]:
df.values.nbytes

32000
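The same kind of check can be applied to the temporaries themselves. As a rough sketch on a made-up frame of the same shape (df_small is a hypothetical name), the following compares the size of the data with the size of one Boolean mask and uses memory_usage() for a per-column breakdown:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.zeros((1000, 4)), columns=list('ABCD'))

# 1000 rows x 4 columns x 8 bytes per float64 value
values_bytes = df_small.values.nbytes

# Each Boolean temporary such as (df_small.A < 0.5) takes 1 byte per row
mask_bytes = (df_small.A < 0.5).values.nbytes

# Per-column sizes in bytes, excluding the index
per_column = df_small.memory_usage(index=False)

print(values_bytes, mask_bytes)
print(per_column)
```

For a compound expression, multiplying the size of one temporary by the number of sub-expressions gives a first estimate of the extra memory needed.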

On the performance side, eval() can be faster even when you are not maxing out your system memory. The issue is how the size of your temporary DataFrames compares to the size of the L1 or L2 CPU cache on your system; if they are much bigger, then eval() can avoid some potentially slow movement of values between the different memory caches. In practice, I find that the difference in computation time between the traditional methods and the eval/query method is usually not significant; if anything, the traditional method is faster for smaller arrays! The benefit of eval/query is mainly in the saved memory, and the sometimes cleaner syntax they offer.
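The simplest way to decide for your own data is to time both approaches. Outside of IPython, where %timeit is not available, a minimal sketch with the standard-library timeit module might look like this. The frame df_t and the helper functions are hypothetical, and the timings will vary by machine and data size, so no particular result should be expected.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_t = pd.DataFrame(rng.rand(10_000, 3), columns=['A', 'B', 'C'])

def with_mask():
    # Traditional approach: builds full-sized temporary masks
    return df_t[(df_t.A < 0.5) & (df_t.B < 0.5)]

def with_query():
    # String-based approach via DataFrame.query()
    return df_t.query('A < 0.5 and B < 0.5')

n_calls = 20
t_mask = timeit.timeit(with_mask, number=n_calls)
t_query = timeit.timeit(with_query, number=n_calls)

print(f"mask:  {t_mask / n_calls * 1e3:.3f} ms per call")
print(f"query: {t_query / n_calls * 1e3:.3f} ms per call")
```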