# Using NumPy in pandas

If the bottleneck in your code is an array operation on a pandas DataFrame, it is often faster to access the underlying NumPy array directly and then port the results back to pandas.

If we are working with NumPy arrays, we can also use NumPy optimisation techniques such as NumExpr.
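As a rough illustration of that round trip, the sketch below pulls the array out of a small frame, applies a NumPy function, and wraps the result back into a DataFrame with the original labels. The frame `df_small` and its column names are made up for the example, and `to_numpy()` is used here as the accessor; `.values` behaves the same way for a frame like this one.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame, used only for illustration
df_small = pd.DataFrame(
    np.arange(6, dtype=float).reshape(2, 3), columns=["a", "b", "c"]
)

# Step 1: drop down to the underlying NumPy array
arr = df_small.to_numpy()

# Step 2: do the array work in NumPy
roots = np.sqrt(arr)

# Step 3: port the result back to pandas, reusing the original labels
result = pd.DataFrame(roots, index=df_small.index, columns=df_small.columns)
```

Passing `index` and `columns` explicitly matters: the raw array carries no labels, so without them the result would fall back to default integer labels.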

In [6]:
import numpy as np
import pandas as pd
import numexpr as ne

from IPython.display import Markdown, display


In [7]:
def printmd(string):
    display(Markdown(string))

In [2]:
def createDataframe(N:int):
    df = pd.DataFrame(np.random.standard_normal((N,N)))
    return df
df = createDataframe(N=100)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,2.287015,1.043294,0.6754,0.860762,0.121209,0.448892,-0.797251,0.84026,0.25312,0.702086,...,1.909484,-2.138513,0.058621,-0.173132,-0.617359,1.753565,-0.413478,0.049859,1.138952,-0.00073
1,-1.051643,-1.702365,-1.717464,-1.530425,-0.472547,-1.919646,-0.194076,-1.10284,0.625107,-0.72076,...,0.05199,0.937193,-0.165219,1.488532,-0.772458,0.939916,0.978447,3.012688,-0.474286,-1.634295
2,1.550799,-1.193952,-0.749828,0.112164,-0.265085,0.503282,0.142019,0.602865,0.440431,-0.111618,...,-2.897998,-1.739018,1.736361,0.348211,0.429868,0.192145,1.635783,-1.211355,1.014502,-0.396635
3,-1.467763,0.104484,0.453567,0.142141,1.428743,-1.691483,0.660587,-1.002569,-0.137528,-0.966293,...,-0.202935,0.164353,0.79468,-0.361157,0.068248,-0.445872,-0.356927,-0.063437,-1.305159,0.824749
4,1.217281,0.391434,-0.027387,-0.850208,1.013892,-0.899342,-0.849047,-0.129631,-0.791811,1.212166,...,-1.089641,-1.070132,0.482965,-1.399567,0.056385,0.544203,-0.839453,1.257635,-0.899517,0.143148


Create one function that carries out the operation in pandas, another in NumPy, and a third with NumExpr.

Test the outputs to check that they are equivalent.

In [3]:
def squarePandas(df):
    return df**2
def squareNumpy(df):
    return pd.DataFrame(df.values**2,index=df.index,columns=df.columns)
def squareNumExpr(df):
    values = df.values
    return pd.DataFrame(ne.evaluate("values**2"),index=df.index,columns=df.columns)

pd.testing.assert_frame_equal(squarePandas(df),squareNumpy(df))
pd.testing.assert_frame_equal(squarePandas(df),squareNumExpr(df))
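For floating-point columns, `assert_frame_equal` compares values approximately by default rather than bit for bit. If a tighter or more explicit check is wanted, one option is to compare the raw arrays with `np.testing.assert_allclose` and a chosen tolerance. The sketch below assumes a small seeded frame (`df_check`) and a tolerance of `1e-12`, both picked only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with a fixed seed so the check is repeatable
rng = np.random.default_rng(0)
df_check = pd.DataFrame(rng.standard_normal((4, 4)))

via_pandas = df_check**2
via_numpy = pd.DataFrame(
    df_check.to_numpy() ** 2, index=df_check.index, columns=df_check.columns
)

# Label-aware check: values, index, columns and dtypes must agree
pd.testing.assert_frame_equal(via_pandas, via_numpy)

# Value-only check on the raw arrays with an explicit relative tolerance
np.testing.assert_allclose(via_pandas.to_numpy(), via_numpy.to_numpy(), rtol=1e-12)
```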

Create a larger array for timing.

In [10]:
df = createDataframe(N=10000)

In [11]:
printmd("**Pandas**")
%timeit -n 1 -r 5 squarePandas(df=df)
printmd("**Numpy**")
%timeit -n 1 -r 5 squareNumpy(df=df)
printmd("**NumExpr**")
%timeit -n 1 -r 5 squareNumExpr(df=df)


**Pandas**

692 ms ± 41.8 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


**Numpy**

423 ms ± 11.4 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)


**NumExpr**

288 ms ± 5.6 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
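`%timeit` is an IPython magic, so it is only available inside a notebook or IPython session. In a plain script, a similar comparison could be sketched with the standard library's `timeit` module. The helper names and the small 200 x 200 frame below are assumptions chosen so the example runs quickly; they are not the sizes used for the timings above, and the numbers it prints will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical small frame so the sketch finishes quickly
df_timing = pd.DataFrame(np.random.standard_normal((200, 200)))


def square_pandas(frame):
    # Element-wise square using pandas operators
    return frame**2


def square_numpy(frame):
    # Square the underlying array, then rebuild the labelled frame
    return pd.DataFrame(
        frame.to_numpy() ** 2, index=frame.index, columns=frame.columns
    )


timings = {}
for label, func in [("pandas", square_pandas), ("NumPy", square_numpy)]:
    # One call per run, five runs, mirroring `-n 1 -r 5`
    timings[label] = timeit.repeat(lambda: func(df_timing), number=1, repeat=5)
    print(f"{label}: best of 5 = {min(timings[label]) * 1e3:.3f} ms")
```

Reporting the minimum of the runs is one common convention; `%timeit` reports the mean and standard deviation instead, so the two are not directly comparable.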


# Using NumExpr directly on a dataframe

If we are working on columns of a DataFrame, as opposed to the whole DataFrame at once, we can use NumExpr via the `pd.eval` function, which uses NumExpr as its default engine when the library is installed.

In [17]:
N = 100
dfNew = pd.DataFrame({'order':np.arange(N)})
dfNew.head()

Unnamed: 0,order
0,0
1,1
2,2
3,3
4,4


In [20]:
pd.eval("doubleCol = dfNew.order * 2", target=dfNew)

Unnamed: 0,order,doubleCol
0,0,0
1,1,2
2,2,4
3,3,6
4,4,8
...,...,...
95,95,190
96,96,192
97,97,194
98,98,196
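A closely related option is the `DataFrame.eval` method, which resolves column names directly so the frame does not need to be named inside the expression. The sketch below uses a hypothetical five-row frame, `df_demo`, with the same `order` column. With the default `inplace=False`, the call returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row frame mirroring the `order` column above
df_demo = pd.DataFrame({"order": np.arange(5)})

# Column names are looked up on the frame itself; a new frame is returned
df_out = df_demo.eval("doubleCol = order * 2")

print(df_out)
```

To modify the frame in place instead, `inplace=True` can be passed, in which case the method returns `None`.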
