# High Performance Pandas

The power of NumPy and Pandas lies in their ability to push basic operations into C through an intuitive syntax. These operations are efficient, but they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computation time and memory use. Pandas includes tools that give direct access to C-speed operations without this costly allocation of intermediate arrays: the eval() and query() functions.

## Motivation: Compound Expressions

NumPy and Pandas support fast vectorized operations; for example, adding the elements of two arrays:

In [1]:
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y

2.09 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In these timings it is nearly two orders of magnitude faster than doing the addition with a Python loop or comprehension:

In [2]:
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

194 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


This approach becomes less efficient, however, when computing compound expressions:

In [3]:
mask = (x > 0.5) & (y < 0.5)

Because NumPy evaluates each subexpression separately, this is roughly equivalent to the following:

In [5]:
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2
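
To make the cost of these temporaries concrete, here is a minimal sketch that measures how much memory the intermediate arrays occupy. It assumes the same array sizes as above; the name `temp_bytes` is introduced only for this illustration.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# Each comparison materializes a full-length boolean array
tmp1 = x > 0.5
tmp2 = y < 0.5
mask = tmp1 & tmp2

# Boolean arrays use one byte per element, so the two temporaries
# together take twice the memory of the final result
temp_bytes = tmp1.nbytes + tmp2.nbytes
print(f"temporaries: {temp_bytes / 1e6:.1f} MB, result: {mask.nbytes / 1e6:.1f} MB")
# prints: temporaries: 2.0 MB, result: 1.0 MB
```

The numbers are small here, but they scale linearly with the array length and with the number of subexpressions in the compound expression.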

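In other words, every intermediate step is explicitly allocated in memory, which can lead to significant overhead if x and y are very large. The numexpr library can evaluate this kind of compound expression without allocating full intermediate arrays. It accepts a string containing the NumPy-style expression we want to compute: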

In [6]:
import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)

True
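
In the cell above, numexpr finds `x` and `y` by looking up the names in the calling scope. As a minimal sketch of an alternative, the arrays can be bound explicitly through the `local_dict` argument of `numexpr.evaluate`, which may be clearer when the expression string is built elsewhere. The array names `a` and `b` are assumptions made for this illustration.

```python
import numpy as np
import numexpr

rng = np.random.RandomState(42)
a = rng.rand(1_000_000)
b = rng.rand(1_000_000)

# Bind the names used in the expression string to specific arrays
# instead of relying on lookup in the calling scope
mask_ne = numexpr.evaluate("(x > 0.5) & (y < 0.5)",
                           local_dict={"x": a, "y": b})

# Reference result computed by NumPy with explicit temporaries
mask_np = (a > 0.5) & (b < 0.5)

print(np.array_equal(mask_ne, mask_np))  # True
```

Because both results are boolean arrays, `np.array_equal` checks for exact agreement; `np.allclose` also works but is intended for approximate floating-point comparison.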