In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## High-Performance Pandas: eval() and query()

Pandas includes some experimental tools
that allow you to directly access C-speed operations without costly allocation of
intermediate arrays. These are the eval() and query() functions, which rely on the
Numexpr package.

### Motivating query() and eval(): Compound Expressions

In [2]:
np.fromiter?

We’ve seen previously that NumPy and Pandas support fast vectorized operations; for
example, when you are adding the elements of two arrays:

In [17]:
rng = np.random.RandomState(42)
x = rng.rand(10000000)
y = rng.rand(10000000)
%timeit x+y

113 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


As discussed in “Computation on NumPy Arrays: Universal Functions,”
this is much faster than doing the addition via a Python loop or comprehension:

In [22]:
%timeit np.fromiter((xi + yi for xi, yi in zip(x,y)),dtype=x.dtype,count=len(x))

9 s ± 552 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


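But this abstraction can become less efficient when you compute a compound expression such as mask = (x > 0.5) & (y < 0.5). Because NumPy evaluates each subexpression separately, x > 0.5 and y < 0.5 are each computed into a temporary array before the two are combined.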

In other words, every intermediate step is explicitly allocated in memory. If the x and y
arrays are very large, this can lead to significant memory and computational overhead.
The Numexpr library gives you the ability to compute this type of compound
expression element by element, without the need to allocate full intermediate arrays.
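
The allocation of intermediates described above can be made explicit with a small sketch. It assumes smaller arrays than the timing cells (only to keep it quick to run), and the names tmp1, tmp2, and mask_stepwise are illustrative rather than anything NumPy exposes.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1_000_000)  # smaller than the arrays timed above, for a quick run
y = rng.rand(1_000_000)

# The compound expression written on one line...
mask = (x > 0.5) & (y < 0.5)

# ...is roughly what NumPy does in three steps, each allocating a full array
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask_stepwise = tmp1 & tmp2

print(np.array_equal(mask, mask_stepwise))
print(tmp1.nbytes + tmp2.nbytes)  # bytes held by the two Boolean temporaries
```

Each Boolean temporary costs one byte per element here, so the overhead grows linearly with the array size and with the number of subexpressions.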

In [32]:
import numexpr
mask = (x > 0.5) & (y < 0.5)  # plain NumPy version, for comparison
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)

True

The benefit here is that Numexpr evaluates the expression in a way that does not use
full-sized temporary arrays, and thus can be much more efficient than NumPy,
especially for large arrays.

In [33]:
np.allclose?

### pandas.eval() for Efficient Operations

The eval() function in Pandas uses string expressions to efficiently compute
operations using DataFrames. For example, consider the following DataFrames:

In [36]:
import pandas as pd
nrows, ncols = 10000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

In [37]:
%timeit df1+df2+df3+df4

25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [38]:
%timeit pd.eval('df1+df2+df3+df4') # pd.eval computes the same sum in roughly 35% less time in this run

16.8 ms ± 519 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [39]:
np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))
# np.allclose checks that the two results agree element-wise (within floating-point tolerance)

True

### Operations supported by pd.eval()

In [40]:
# Let's construct five DataFrames of random integers
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))

#### Arithmetic operators

pd.eval() supports all arithmetic operators:

In [41]:
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)


True

#### Comparison operators

pd.eval() supports all comparison operators, including
chained expressions:

In [45]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
# The chained form below is equivalent and also runs:
# result2 = pd.eval('df1 < df2 <= df3 != df4')
result2 = pd.eval('(df1 < df2) & (df2 <= df3) & (df3 != df4)')
np.allclose(result1, result2)


True

#### Bitwise operators

pd.eval() supports the & and | bitwise operators:

In [46]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)') # the keywords and / or can be used in place of & / |
np.allclose(result1, result2)

True
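
The keyword spelling mentioned in the comment above can be checked directly. This is a minimal sketch on three small DataFrames of made-up data (the names a, b, and c are illustrative); it assumes the default pd.eval() parser, which accepts the literal and and or in Boolean expressions.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Three small illustrative DataFrames (hypothetical data)
a, b, c = (pd.DataFrame(rng.rand(5, 3)) for _ in range(3))

# Symbol form and keyword form of the same Boolean expression
with_symbols = pd.eval('(a < 0.5) & (b < 0.5) | (c < 0.5)')
with_keywords = pd.eval('(a < 0.5) and (b < 0.5) or (c < 0.5)')

print(with_symbols.equals(with_keywords))
```

The parentheses around each comparison keep the grouping unambiguous in both spellings.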

#### Object attributes and indices

pd.eval() supports access to object attributes via the obj.attr syntax, and indexes via the obj[index] syntax:

In [47]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)

True

#### Other operations

Other operations, such as function calls, conditional statements,
loops, and other more involved constructs, are currently not implemented in
pd.eval(). If you’d like to execute these more complicated types of expressions, you
can use the Numexpr library itself.

### DataFrame.eval() for Column-Wise Operations

Just as Pandas has a top-level pd.eval() function, DataFrames have an eval()
method that works in similar ways. The benefit of the eval() method is that columns
can be referred to by name.

In [58]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()


          A         B         C
0  0.259905  0.861337  0.232742
1  0.113747  0.154223  0.778527
2  0.460417  0.908127  0.264339
3  0.741817  0.243234  0.337768
4  0.449753  0.748700  0.057189


In [59]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

True

The DataFrame.eval() method allows much more succinct evaluation of expressions
with the columns:

In [61]:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

True

Notice here that we treat column names as variables within the evaluated expression,
and the result is what we would wish.

#### Assignment in DataFrame.eval()


In [62]:
df.eval('D = (A + B) / C', inplace=True) # adds a new column D computed from the existing columns
df.head()

          A         B         C          D
0  0.259905  0.861337  0.232742   4.817534
1  0.113747  0.154223  0.778527   0.344202
2  0.460417  0.908127  0.264339   5.177240
3  0.741817  0.243234  0.337768   2.916348
4  0.449753  0.748700  0.057189  20.955826
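
As a small sketch of how assignment behaves, the following uses a throwaway DataFrame (df_demo, with made-up data) to show two points: without inplace=True the method returns a new DataFrame, and assigning to an existing column name overwrites that column.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.rand(5, 3), columns=['A', 'B', 'C'])  # hypothetical data

# Without inplace=True, eval() returns a new DataFrame and leaves df_demo untouched
with_d = df_demo.eval('D = (A + B) / C')
print(list(df_demo.columns))
print(list(with_d.columns))

# Assigning to an existing column name overwrites that column
df_demo.eval('C = A + B', inplace=True)
print(np.allclose(df_demo['C'], df_demo['A'] + df_demo['B']))
```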


#### Local variables in DataFrame.eval()

In [66]:
column_mean = df.mean(axis=1)  # axis=1 averages across the columns, giving one mean per row
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True

The @ character here marks a variable name rather than a column name, and lets you
efficiently evaluate expressions involving the two “namespaces”: the namespace of
columns, and the namespace of Python objects. Notice that this @ character is only
supported by the DataFrame.eval() method, not by the pandas.eval() function,
because the pandas.eval() function only has access to the one (Python) namespace.
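
The difference between the two namespaces can be sketched with a scalar. The DataFrame df_demo and the variable threshold below are illustrative names, not part of the examples above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.rand(5, 3), columns=['A', 'B', 'C'])  # hypothetical data
threshold = 0.5  # an ordinary Python variable, not a column

# DataFrame.eval(): bare names are columns, @names are Python variables
via_method = df_demo.eval('A + @threshold')

# pd.eval(): only the Python namespace exists, so the variable is named directly
via_function = pd.eval('df_demo.A + threshold')

print(np.allclose(via_method, via_function))
```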


### DataFrame.query() Method


In [67]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)


True

In [68]:
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)


True

In [75]:
result3 = df.eval('(A < 0.5) & (B < 0.5)')  # df.eval() returns the Boolean mask itself, not the filtered rows, so it differs from result2
result3

0      False
1       True
2      False
3      False
4      False
       ...  
995     True
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool

In [76]:
result2

            A         B         C         D
1    0.113747  0.154223  0.778527  0.344202
6    0.115987  0.119138  0.486502  0.483297
10   0.083683  0.192117  0.170891  1.613896
11   0.458644  0.006565  0.850275  0.547128
12   0.388111  0.492187  0.421714  2.087430
..        ...       ...       ...       ...
983  0.324687  0.277829  0.554381  1.086826
985  0.071297  0.033383  0.174708  0.599172
986  0.326709  0.123239  0.106085  4.241396
989  0.237517  0.208356  0.084580  5.271582
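
The two outputs are related: the Boolean Series from DataFrame.eval() is the mask, and indexing the DataFrame with it recovers the rows that query() returns. A minimal sketch on made-up data (df_demo is an illustrative name):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.rand(100, 3), columns=['A', 'B', 'C'])  # hypothetical data

row_mask = df_demo.eval('(A < 0.5) & (B < 0.5)')   # Boolean Series, one value per row
filtered = df_demo.query('A < 0.5 and B < 0.5')    # DataFrame of the matching rows

# Applying the mask by hand gives the same rows as query()
print(df_demo[row_mask].equals(filtered))
```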


In addition to being a more efficient computation, the query() expression is much
easier to read and understand than the masking expression. Note that query() also
accepts the @ flag to mark local variables:

In [78]:
Cmean = df['C'].mean()     # @ marks a Python variable name; column names are written without it
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)

True

### Performance: When to Use These Functions

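When considering whether to use these functions, there are two considerations:
computation time and memory use. Memory use is the most predictable aspect: every
compound expression involving NumPy arrays or Pandas DataFrames implicitly creates
temporary arrays. For example, this: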

In [79]:
x = df[(df.A < 0.5) & (df.B < 0.5)]


is roughly equivalent to:

In [81]:
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]

If the size of the temporary DataFrames is significant compared to your available
system memory (typically several gigabytes), then it’s a good idea to use an eval() or
query() expression. You can check the approximate size of your array in bytes using
this:

In [82]:
df.values.nbytes

32000
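
The figure above follows from the shape and dtype: 1000 rows times 4 float64 columns times 8 bytes. The sketch below reproduces that arithmetic on a stand-in DataFrame (df_demo, made-up data) and also sizes one Boolean temporary; memory_usage() is shown as an alternative that reports bytes per column.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.rand(1000, 4), columns=['A', 'B', 'C', 'D'])  # hypothetical data

# 1000 rows x 4 float64 columns x 8 bytes each
total_bytes = df_demo.values.nbytes
print(total_bytes)

# A Boolean temporary such as df_demo.A < 0.5 holds one byte per row
tmp = df_demo.A < 0.5
tmp_bytes = tmp.values.nbytes
print(tmp_bytes)

# memory_usage() reports per-column bytes; index=False leaves the index out
column_bytes = int(df_demo.memory_usage(index=False).sum())
print(column_bytes)
```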

On the performance side, eval() can be faster even when you are not maxing out
your system memory. The issue is how your temporary DataFrames compare to the
size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if
they are much bigger, then eval() can avoid some potentially slow movement of values 
between the different memory caches. In practice, I find that the difference in
computation time between the traditional methods and the eval/query method is
usually not significant—if anything, the traditional method is faster for smaller
arrays! The benefit of eval/query is mainly in the saved memory, and the sometimes
cleaner syntax they offer.
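
Because the balance depends on array size and hardware, it is worth measuring on your own data. The following is a rough sketch using the standard timeit module; the size nrows, the DataFrame df_demo, and the helper names masking and querying are all illustrative, and no particular timings are implied.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
nrows = 100_000  # hypothetical size; vary it to see where the balance shifts
df_demo = pd.DataFrame(rng.rand(nrows, 3), columns=['A', 'B', 'C'])

def masking():
    # Traditional approach: builds Boolean temporaries, then indexes
    return df_demo[(df_demo.A < 0.5) & (df_demo.B < 0.5)]

def querying():
    # String-based approach evaluated by query()
    return df_demo.query('A < 0.5 and B < 0.5')

# Both approaches must select the same rows before timing is meaningful
assert masking().equals(querying())

for func in (masking, querying):
    seconds = timeit.timeit(func, number=20) / 20
    print(f'{func.__name__}: {seconds * 1e3:.2f} ms per call')
```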
