# Advanced Pandas Analysis

This section covers some of the more advanced areas of Pandas, building on the Intermediate topics covered previously, to give you as complete an experience of `pandas` as possible.

In [None]:
import pandas as pd
import numpy as np

## `pandas.melt`: From wide to long

A number of packages require data in *long-form*. Long-form data often repeats identifier values across many rows, which is memory and disk intensive, so it is far more common to store data in wide-form. When we do need to convert data with many similar columns into *long-form*, `pd.melt` is one of the best functions in Pandas for the job.

Take the `cdystonia` dataset for example.

In [None]:
cdystonia = pd.read_csv("datasets/cdystonia.csv")
print(cdystonia.shape)
cdystonia.head(3)

Using the methods covered previously, we can expand the `twstrs` response column into multiple columns using a *pivot*. Here we use `week` as the columns (it carries the same information as the observation number `obs`), and use a set difference to keep every other column as part of the index.

In [None]:
cdystonia_wide = cdystonia.pivot_table("twstrs", index=cdystonia.columns.difference(["twstrs","obs","week"]).tolist(), columns="week")
print(cdystonia_wide.shape)
cdystonia_wide.head()

You can see that the long-form shape $(631,9)$ is substantially larger than the wide-form shape $(109,6)$. By specifying the columns we want to keep as identifiers, `pd.melt` collapses every other column into a single column, which we name `twstrs` again:

In [None]:
cdystonia_long = pd.melt(cdystonia_wide.reset_index(), id_vars=["age","id","patient","sex","site","treat"], value_name="twstrs", var_name="week")
cdystonia_long.head(3)
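
The round trip between the two layouts can be sketched on a small table. The data below is a hypothetical toy frame (not the `cdystonia` file), chosen only so the shapes are easy to check by eye:

```python
import pandas as pd

# Hypothetical toy data: two patients observed at weeks 0 and 2
long_df = pd.DataFrame({
    "patient": [1, 1, 2, 2],
    "week": [0, 2, 0, 2],
    "twstrs": [32, 30, 60, 26],
})

# long -> wide: one row per patient, one column per week
wide = long_df.pivot(index="patient", columns="week", values="twstrs")

# wide -> long: every non-identifier column collapses into (week, twstrs) pairs
back = (pd.melt(wide.reset_index(), id_vars=["patient"],
                var_name="week", value_name="twstrs")
          .sort_values(["patient", "week"])
          .reset_index(drop=True))

print(wide.shape, back.shape)
```

After sorting, `back` holds the same values as `long_df`, which is a quick way to convince yourself that `pd.melt` undoes the pivot.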

## Vectorized String Operations

One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data. In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.

If we recall from NumPy, one of the key advantages was the *vectorization* of mathematical operations, such as this:

In [None]:
x=np.array([2,3,5,7,11,13])
x**2

For arrays of strings, however, NumPy does not provide such simple access, and we have to fall back on a list comprehension:

In [None]:
x=np.array(['peter','Paul','mary','guido'])
[s.capitalize() for s in x]

In [None]:
# Raises AttributeError: a NumPy array has no `capitalize()` method
x.capitalize()

In addition, this Pythonic method will break in cases where there is missing data:

In [None]:
x=np.array(['peter','Paul',None,'mary','guido'])
[s.capitalize() for s in x]

Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the `str` attribute of `pd.Series` and `pd.Index` objects containing string information. 

In [None]:
names = pd.Series(["Jeff", "alan", "Steve", "gUIDO", None, "job", None])
names

We can now call a single method to capitalize the entries, as follows:

In [None]:
names.str.capitalize()

### Available methods in `pandas.str`

Nearly all of Python's built-in string methods are mirrored by Pandas vectorized string methods. Here is a tabular list:

| | | | |
|------- | ----------- | ----------- | ------------- |
| `len()` | `lower()` | `translate()` | `islower()` |
| `ljust()` | `rjust()` | `upper()` | `isupper()` |
| `startswith()` | `endswith()` | `find()` | `isnumeric()` |
| `center()` | `rfind()` | `isalnum()` | `isdecimal()` | 
| `zfill()` | `index()` | `isalpha()` | `split()` |
| `strip()` | `rindex()` | `isdigit()` | `rsplit()` |
| `rstrip()` | `capitalize()` | `isspace()` | `partition()` |
| `lstrip()` | `swapcase()` | `istitle()` | `rpartition()` |

Note that the return values vary: for instance, `lower()` returns a Series of strings, `len()` returns integers, and `startswith()` returns boolean values.
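
As a minimal sketch of these differing return values, the snippet below applies three of the methods to a short, made-up Series that includes a missing entry. The exact dtypes of the results depend on the pandas version, so only the values are inspected:

```python
import pandas as pd

names = pd.Series(["Jeff", "alan", None])

lowered = names.str.lower()          # strings in, strings out
lengths = names.str.len()            # one number per element
starts = names.str.startswith("J")   # one boolean per element

# The missing entry does not raise an error; it stays missing in the result
print(lowered.tolist())
print(lengths.iloc[0], lengths.iloc[1])
print(starts.iloc[0], starts.iloc[1])
```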

### Additional methods using regular expressions

This is where the true power of Pandas comes in: not only can we do direct matching and string manipulation, but we can also examine the content of each element using regular expressions. Some of the available methods are:

| **Method** | **Description** |
| ---------- | -------------------------------- |
| `match()` | Calls `re.match()` on each element, returning a boolean |
| `extract()` | Calls `re.search()` on each element, returning matched groups as strings |
| `findall()` | Calls `re.findall()` on each element |
| `replace()` | Replaces occurrences of pattern with some other string |
| `contains()` | Calls `re.search()` on each element, returning a boolean |
| `count()` | Count occurrences of pattern |
| `split()` | Calls `str.split()`, but accepts regular expressions |
| `rsplit()` | Calls `str.rsplit()` but accepts regular expressions |

With these, we have a wide range of interesting operations. For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of the element:

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'], name="names")

In [None]:
monte.str.extract("([A-Za-z]+)", expand=False)

Note that with `expand=True` we get a single-column `DataFrame`, whereas `expand=False` gives a `pd.Series`. We could also do something more complicated, like finding all the names that start and end with a consonant, making use of the start-of-string (^) and end-of-string (\$) regular expression characters:

In [None]:
monte.str.findall(r"^[^AEIOU].*[^aeiou]$")
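
Two further uses of the regular-expression methods can be sketched on the same names. The group names `first` and `last` are chosen here purely for illustration: `extract()` with named groups returns one column per group, and `contains()` turns the consonant pattern into a boolean mask.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'], name="names")

# Named groups become column names in the resulting DataFrame
parts = monte.str.extract(r"^(?P<first>[A-Za-z]+)\s+(?P<last>[A-Za-z]+)$")

# True where the name starts and ends with a consonant
consonant_mask = monte.str.contains(r"^[^AEIOU].*[^aeiou]$")

print(parts)
print(monte[consonant_mask].tolist())
```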

### Miscellaneous methods

Finally, there are a number of convenient operations which Pandas uniquely provides that can be invaluable when *function chaining*:

| **Method** | **Description** |
| ----------- | ----------------------------- |
| `get()` | Index each element |
| `slice()` | Slice each element |
| `slice_replace()` | Replace slice in each element with passed value |
| `cat()` | Concatenate strings |
| `repeat()` | Repeat values |
| `normalize()` | Return a unicode form of the string |
| `pad()` | Add whitespace to the left, right or both sides of a string |
| `wrap()` | Split long strings into lines of length less than a given width |
| `join()` | Join strings in each element of the Series with passed separator |
| `get_dummies()` | Extract dummy variables as DataFrame |

### Vectorized item access and slicing

The `get()` and `slice()` operations enable vectorized element access within each string. For example:

In [None]:
monte.str[:3]

Is equivalent to:

In [None]:
monte.str.slice(0,3)

In [None]:
monte.str.split(" ", expand=True)
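
These accessors also chain. As a small sketch, splitting each name and then calling `get()` on the resulting lists picks out the last name, and indexing the strings themselves picks out initials:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'], name="names")

# split() yields a list per element; get(-1) takes the last item of each list
last_names = monte.str.split(" ").str.get(-1)

# get(0) on the original strings takes the first character of each name
initials = monte.str.get(0) + last_names.str.get(0)

print(last_names.tolist())
print(initials.tolist())
```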

### Indicator Variables

Another method that requires a bit of extra explanation is the `get_dummies()` method. This is useful when your data has a column containing some sort of coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":

In [None]:
info=pd.Series(["B|C|D","B|D","A|C","B|D","B|C", "B|C|D"])
info.name="info"

In [None]:
full_monte = pd.concat([monte, info],axis=1)

The `get_dummies` routine lets you split out indicator variables into a new DataFrame:

In [None]:
full_monte["info"].str.get_dummies("|")

## Categorical Types

Categoricals are a pandas data type corresponding to categorical variables, such as from statistics. A categorical variable takes on a limited, fixed, number of possible values. Examples include gender, blood type, country or rating. Categorical data may be ordered, but numerical operations are not possible on them.

All of the values in categorical data are either in categories or `np.nan`. Order is defined by the order of *categories*, not the lexical order of the values. Using a categorical data type has a number of **advantages**:

- A string variable consisting of only a few different values can be *efficiently* stored internally as each string is represented by an integer, and only unique strings are in the categories array.
- Sorting through an ordered categorical variable is substantially faster.
- Provides valuable metadata to Pandas when it comes to smart plotting, operations, etc.

Much of this material is drawn from the Pandas documentation, which is extensive and found [here](https://pandas.pydata.org/pandas-docs/stable/categorical.html). 

In [None]:
c = pd.Categorical(['a', 'b', 'b', 'c', 'a', 'b', 'a', 'a', 'a', 'c'])
c

In [None]:
c.describe()

In [None]:
c.codes

You can provide information as to the ordering of the categories:

In [None]:
c.as_ordered()

In [None]:
c.dtype

Converting an existing 'object' feature into a category:

In [None]:
s = pd.Series(["air", "water", "fire", "fire", "water", "earth", "fire", "fire", "water", "air"])
s.astype("category")
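
To illustrate that order comes from the categories rather than from the alphabet, here is a minimal sketch with a made-up `sizes` Series and an explicit ordered dtype:

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"])

# Declare the logical order explicitly: small < medium < large
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
ordered = sizes.astype(size_type)

print(ordered.sort_values().tolist())  # sorted by category order, not alphabetically
print(ordered.cat.codes.tolist())      # integer codes stored internally
print(ordered.min(), ordered.max())
```

An alphabetical sort would have put `"large"` first; the ordered categorical places it last.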

## Time-Series Data

Pandas as a tool was initially developed in the context of financial modelling, so as you might expect, there is a rather large suite of tools for working with dates, times and time-indexed data. There are a number of different formats that date data can come in:

- *Time stamps* reference particular moments in time (e.g. Dec 25, 2011 at 7:45pm).
- *Time intervals* and periods reference a length of time with a beginning and end point.
- *Time deltas* or durations reference an exact length of time (e.g. a duration of 22.56 seconds).

### Native Datetime

Natively, Python has a representation of datetime objects from the `datetime` package:

In [None]:
from datetime import datetime
datetime(year=2015, month=7, day=4)

Or we can use the `dateutil.parser` module to parse dates from a variety of string formats:

In [None]:
from dateutil import parser
parser.parse("4th of July, 2015")

In [None]:
parser.parse("4th of July, 2015").strftime("%A")

In the final line, we've used one of the standard string format codes for printing dates (`%A`, the full weekday name); the full list is in the `strftime` section of Python's `datetime` documentation.

### NumPy's `datetime64`

NumPy introduces a native time-series data type which is encoded as a 64-bit integer, and allows arrays of dates to be represented very compactly. The `datetime64` has a specific input format:

In [None]:
date = np.array("2015-07-04", dtype=np.datetime64)
date

Once we have this date formatted, we can quickly do vectorized operations on it:

In [None]:
date + np.arange(12)

Because of the uniform type of NumPy `datetime64` arrays, this type of operation can be accomplished more quickly than if we were working with native Python objects. One of the important features of `datetime64` and `timedelta64` is that they are built on a fundamental time unit. This means that because the object is limited to 64-bit precision, the range of encodable times is $2^{64}$ times this fundamental unit. This means that `datetime64` imposes a trade-off between *time resolution* and *maximum time span*.

For example, if you want a time resolution of one nanosecond, you only have enough information to encode a range of $2^{64}$ nanoseconds, or just under 600 years.

In [None]:
np.datetime64("2015-07-04")

In [None]:
np.datetime64('2015-07-04 12:00')
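
The unit is inferred from the input string unless one is passed explicitly. The sketch below inspects the inferred units and reproduces the nanosecond span by hand, assuming a year of 365.25 days:

```python
import numpy as np

day = np.datetime64("2015-07-04")                     # day resolution inferred
minute = np.datetime64("2015-07-04 12:00")            # minute resolution inferred
nano = np.datetime64("2015-07-04 12:59:59.50", "ns")  # nanoseconds forced explicitly

print(day.dtype, minute.dtype, nano.dtype)

# A signed 64-bit integer reaches 2**63 units either side of the epoch
seconds_per_year = 365.25 * 24 * 3600
span_years = 2**63 / 1e9 / seconds_per_year
print(round(span_years, 1))
```

The last figure is the origin of the $\pm 292$ years quoted for nanosecond resolution.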

The following table (from the NumPy `datetime64` documentation) lists the available format codes along with the relative timespans they can encode:

| **Code** | **Meaning** | **Time span (relative)** |
| --- | ---------- | ----------------- |
| Y | Year | $\pm 9.2 \times 10^{18}$ years | 
| M | Month | $\pm 7.6 \times 10^{17}$ years |
| W | Week | $\pm 1.7 \times 10^{17}$ years |
| D | Day | $\pm 2.5 \times 10^{16}$ years |
| h | Hour | $\pm 1 \times 10^{15}$ years |
| m | Minute | $\pm 1.7 \times 10^{13}$ years |
| s | Second | $\pm 2.9 \times 10^{11}$ years |
| ms | Millisecond | $\pm 2.9 \times 10^{8}$ years |
| $\mu$s | Microsecond | $\pm 2.9 \times 10^{5}$ years |
| ns | Nanosecond | $\pm 292$ years |
| ps | Picosecond | $\pm 106$ days |

In practice, `datetime64[ns]` is a sensible default: it is precise enough for most real-world data, and a span of $\pm 292$ years is sufficient for most modern applications.

### Dates and times in Pandas

Pandas builds on top of NumPy and the native Python libraries to get the best of both worlds: efficient storage and a vectorized interface, combined with ease of use. For example:

In [None]:
date = pd.to_datetime("4th of July, 2015")
date

In [None]:
date.strftime("%A")

In [None]:
date + pd.to_timedelta(np.arange(12),"D")

### Time-Series Indexing

Where the Pandas time-series tools really become useful is when you can *index data by timestamps*. For example we could do the following:

In [None]:
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 4, 2], index=index)
data

Now that the data is in a `Series`, we can make use of the `Series` indexing patterns covered previously, passing values that can be interpreted as dates:

In [None]:
data["2014-07-04":"2015-07-04"]

There are additional date-only indexing operations, such as passing a year or a month to obtain a slice of all data from that year/month:

In [None]:
data["2015"]
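
The same selections can be written with `.loc`, which makes it explicit that we are selecting by label. This is a small sketch on the series defined above; note that a partial date string used as a slice end point covers the whole period it names:

```python
import pandas as pd

index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 4, 2], index=index)

# Everything in 2015
whole_year = data.loc["2015"]

# From the start of August 2014 to the end of July 2015, inclusive
between = data.loc["2014-08":"2015-07"]

print(whole_year.tolist(), between.tolist())
```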

### Time-series Data Structures

Pandas uses a number of fundamental data structures when describing time-series data:

| **Structure** | **Description** |
| ------------- | ---------------------------------------- |
| Time-stamp | Basic `Timestamp` type, a replacement for Python's native `datetime`, <br> but based on the more efficient `np.datetime64`. Associated to `DatetimeIndex` object. |
| Time Periods | `pd.Period` object provided. Encodes a fixed-frequency <br> interval based on `np.datetime64`. Associated to `PeriodIndex` object. |
| Time Delta/Duration | Pandas provides `Timedelta` type. More efficient than native <br> Python `datetime.timedelta` object, and based on `np.timedelta64`. <br> Associated to `TimedeltaIndex` object. | 

The fundamental objects are the `Timestamp` and `DatetimeIndex` objects. While these objects can be created directly, it's more common to use the `pd.to_datetime()` function, which can parse a wide range of formats.

In [None]:
dates = pd.to_datetime([datetime(2015, 7, 3), "4th of July, 2015", "2015-Jul-6", "07-07-2015", "20150708"])
dates

Any `DatetimeIndex` can be converted to a `PeriodIndex` with the `to_period()` function with the addition of a frequency code; in this case a daily frequency:

In [None]:
dates.to_period("D")

A `TimedeltaIndex` is created, for instance, when a date is subtracted from another:

In [None]:
dates - dates[0]

### Regular sequences: `pd.date_range()`

To aid in the creation of date sequences, Pandas has a number of functions that make input far easier:

| **Function** | **Description** |
| ------------ | ----------------------------- |
| `pd.date_range()` | Creates a sequence of time-stamps |
| `pd.period_range()` | Create a sequence of time periods |
| `pd.timedelta_range()` | Creates a sequence of time-deltas |

These functions work similarly to NumPy's `arange()` and `linspace()`: they accept a start point, an end point, and an optional step size or number of periods, and turn them into a sequence.

In [None]:
pd.date_range("2015-07-03","2015-07-10")

In [None]:
pd.date_range("2015-07-03", periods=8)

The spacing can be modified by altering the `freq` argument, which defaults to days (D). For example, we could step in hours:

In [None]:
pd.date_range("2015-07-03", periods=8, freq="H")
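
The other two functions in the table follow the same pattern. A minimal sketch, with arbitrary start values and the `"min"` spelling of the minute frequency:

```python
import pandas as pd

# Three consecutive monthly periods starting in July 2015
months = pd.period_range("2015-07", periods=3, freq="M")

# Four durations starting at zero, spaced 90 minutes apart
steps = pd.timedelta_range(start="0 days", periods=4, freq="90min")

print([str(p) for p in months])
print(steps[-1])
```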

### Frequencies and Offsets

Pandas provides a number of additional codes to NumPy's standard when defining frequency/date offsets:

| **Code** | **Description** |  **Code** | **Description** |
| --- | -------------------- | --- | --------------------- |
| D | Calendar day | H | Hours |
| W | Weekly | A | Year end |
| M | Month end | T | Minutes | 
| Q | Quarter end | S | Seconds |
| B | Business day | L | Milliseconds |

Additionally, we can combine codes with numbers to specify other frequencies. For example, for a frequency of 2 hours and 30 minutes, we can combine the hour (H) and minute (T) codes as follows:

In [None]:
pd.timedelta_range(0, periods=9, freq="2H30T")

### Resampling, Shifting and Windowing

The ability to use dates and times as indices to intuitively organize and access data is an important piece of the Pandas time series tools. The benefits of indexed data in general (automatic alignment during operations, intuitive data slicing and access, etc.) still apply, and Pandas provides several additional time series-specific operations.

For these examples, we'll look at some stock price data from Google:

In [None]:
goog = pd.read_csv("datasets/goog_stock.csv", index_col=0, 
                   parse_dates=True, skipinitialspace=True).sort_index()
goog.head()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
goog["Close"].plot(grid=True)
plt.ylabel("Closing Price")
plt.title("GOOG STOCK")
plt.savefig("images/goog_stock.svg", format="svg")
plt.show()

One of the most common needs for time-series data is resampling at a higher or lower frequency. This can be done using the `resample()` method, or using `asfreq()`. The primary difference is that, while `resample()` is fundamentally a data aggregation technique, `asfreq()` is a data selection method. 

For the Google data, let's compare what both return when we down-sample the data:

In [None]:
goog["Close"].plot(alpha=.5, style="-")
goog["Close"].resample("BQ").mean().plot(style=":")
goog["Close"].asfreq("BQ").plot(style="--")
plt.legend(["input", "resample", "asfreq"], loc="upper left")
plt.show()

Notice that `resample` reports the average of the previous quarter, whereas `asfreq` reports the value at the end of the quarter (the `"BQ"` code used above is the business-quarter end).

For up-sampling, `resample()` and `asfreq()` are largely equivalent, although `resample()` has considerably more options. The default for both methods is to leave the up-sampled points empty, i.e. as NA values. Both also accept options for imputing the missing values.

In [None]:
fig, ax = plt.subplots(2, sharex=True, figsize=(6,6))
ss = goog["Close"].iloc[:20]
ss.asfreq("D").plot(ax=ax[0], marker="o")
ss.asfreq("D", method="bfill").plot(ax=ax[1], style="-o")
ss.asfreq("D", method="ffill").plot(ax=ax[1], style="--o")
ax[1].legend(["back-fill", "forward-fill"])
plt.show()

The top panel shows the default `asfreq` behaviour: non-business days are left as NA values. The bottom panel shows the differences between *forward-filling* and *backward-filling*.
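
The aggregation-versus-selection distinction, and the two fill strategies, are easier to verify on a series small enough to compute by hand. The six-day series below is made up for illustration:

```python
import pandas as pd

idx = pd.date_range("2015-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Down-sampling to 3-day steps
aggregated = s.resample("3D").mean()  # mean of each 3-day bin
selected = s.asfreq("3D")             # value found at each 3-day step

# Up-sampling the selected points back to daily frequency
gaps = selected.asfreq("D")                       # new points are NaN
forward = selected.asfreq("D", method="ffill")    # carry the last value forward
backward = selected.asfreq("D", method="bfill")   # pull the next value backward

print(aggregated.tolist(), selected.tolist())
print(forward.tolist(), backward.tolist())
```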

### Time-shifts

Another common time-series specific operation is shifting of data in time. Pandas has two methods for this operation:

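1. `shift()`: Shifts the data, leaving the index in place.
2. `tshift()`: Shifts the index, leaving the data in place.

In both cases, the shift is specified in multiples of the frequency. Note that `tshift()` was deprecated in pandas 1.1 and removed in 2.0; `shift(freq=...)` is its replacement.
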

In [None]:
fig, ax = plt.subplots(3, sharey=True, figsize=(6,8))

# apply a frequency to the data
goog_d = goog["Close"].asfreq('D', method='pad')

goog_d.plot(ax=ax[0])
goog_d.shift(900).plot(ax=ax[1])
goog_d.tshift(900).plot(ax=ax[2])

# legends and annotations
local_max = pd.to_datetime(goog_d.index[30])
offset = pd.Timedelta(900, 'D')

ax[0].legend(['input'], loc=2)
ax[0].get_xticklabels()[2].set(weight='heavy', color='red')
ax[0].axvline(local_max, alpha=0.3, color='red')

ax[1].legend(['shift(900)'], loc=2)
ax[1].get_xticklabels()[2].set(weight='heavy', color='red')
ax[1].axvline(local_max + offset, alpha=0.3, color='red')

ax[2].legend(['tshift(900)'], loc=2)
ax[2].get_xticklabels()[1].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red')

plt.tight_layout()
plt.show()

Here we see that `shift(900)` shifts the data by 900 days, pushing some of it off the end of the graph, while `tshift(900)` shifts the index values by 900 days. 
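
The difference between the two kinds of shift is easy to see on a four-day toy series. This sketch uses `shift(freq=...)` for the index shift, which is the spelling newer pandas versions provide for what `tshift()` does:

```python
import pandas as pd

idx = pd.date_range("2015-01-01", periods=4, freq="D")
s = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)

# Data shift: the index stays put, values slide down and leave NaN behind
data_shift = s.shift(2)

# Index shift: the values stay put, every timestamp moves 2 days later
index_shift = s.shift(2, freq="D")

print(data_shift)
print(index_shift)
```

With the data shift the last two values fall off the end; with the index shift nothing is lost, but the series now covers different dates.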

A common context for this type of shift is in computing differences over time. For example, we use shifted values to compute the one-year return on investment for Google stock over the course of the dataset:

In [None]:
# use the daily-frequency series from above: tshift() needs an index with a frequency
ROI = 100 * (goog_d.tshift(-365) / goog_d - 1)
ROI.plot()
plt.ylabel("Return on Investment (%)")
plt.show()

This helps us to see the overall trend in this particular stock.

### Rolling windows

Rolling statistics are another type of time series-specific operation implemented in Pandas. This is achieved through the `rolling()` method of `Series` and `DataFrame` objects, which returns a view similar to the one returned by `groupby()`. This rolling view makes available a number of aggregation operations by default.

For example, here is the half-year centered rolling mean and standard deviation of the Google stock price:

In [None]:
rolling = goog["Close"].rolling(365//2, center=True)
data = pd.DataFrame({"input": goog["Close"],
                     "half-year-mean": rolling.mean(),
                     "half-year-std": rolling.std()})
ax = data.plot(style=["-","--",":"])
ax.lines[0].set_alpha(.5)

As with `groupby` operations, `aggregate()` and `apply()` methods can be used for custom rolling computations.
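
A custom rolling computation can be sketched on a short, made-up series. Here `apply()` receives each window as a NumPy array (because of `raw=True`) and returns the spread between its largest and smallest value:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Centered window of 3 observations; the two edge positions lack a full window
window = s.rolling(3, center=True)

mean = window.mean()
spread = window.apply(lambda w: w.max() - w.min(), raw=True)

print(mean.tolist())
print(spread.tolist())
```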

## Method Chaining

You may have noticed previously that a number of the operations can be chained together, skipping intermediate states of the `DataFrame`. This is known as **method chaining**, and its main benefit is *code readability*.

In [None]:
(cdystonia.assign(age_group=pd.cut(cdystonia.age, [0, 30, 40, 50, 60, 70, 80, 90], right=False))
    .groupby(['age_group','sex']).twstrs.mean()  # select the numeric column before averaging
    .unstack("sex")
    .fillna(0.0)
    .plot.barh(figsize=(10,5)))
plt.show()

## Pipes

One of the problems with method chaining is that every processing step you need has to exist as a method that returns a DataFrame, otherwise the chain breaks. Occasionally we want to apply our own custom function to the data; this is what `pipe()` is for.

For example, we may wish to express the mean `twstrs` of each age group as a *proportion* of the total across age groups, so that groups can be compared in relative terms week by week.

In [None]:
def to_proportions(df, axis=1):
    # divide each entry by the total of its row (axis=1) or column (axis=0)
    totals = df.sum(axis=axis)
    return df.div(totals, axis=1 - axis)

(cdystonia.assign(age_group=pd.cut(cdystonia.age, [0, 30, 40, 50, 60, 70, 80, 90], right=False))
    .groupby(["week","age_group"]).twstrs.mean()
    .unstack("age_group")
    .pipe(to_proportions, axis=1))

We can now see the proportion of response variable across the age groups, per week.
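
To check what the piped function does, here is a minimal sketch on a two-by-two table with made-up numbers, where the proportions can be worked out by hand. The helper is restated so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

def to_proportions(df, axis=1):
    # divide each entry by the total of its row (axis=1) or column (axis=0)
    totals = df.sum(axis=axis)
    return df.div(totals, axis=1 - axis)

table = pd.DataFrame({"a": [1.0, 3.0], "b": [1.0, 1.0]}, index=["w0", "w2"])

by_row = table.pipe(to_proportions, axis=1)  # each row sums to 1
by_col = table.pipe(to_proportions, axis=0)  # each column sums to 1

print(by_row)
print(by_col)
```

`pipe()` passes the DataFrame as the first argument and forwards everything else, so any function with that shape can sit in the middle of a chain.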

## Data Transformation

We have several options for *transforming* labels and other columns into more useful features:

In [None]:
cdystonia.treat.replace({'Placebo': 0, "5000U": 1, "10000U": 2}).head()

In [None]:
cdystonia.treat.astype("category").head()

In [None]:
pd.cut(cdystonia.age, [20,40,60,80], labels=["Young","Middle-Aged","Old"])[-10:]

We can use `pd.qcut` to automatically divide our data into $q$ quantile-based bins containing roughly equal numbers of observations. For example, $q=4$ gives quartiles.

In [None]:
pd.qcut(cdystonia.age, 4)[-8:]
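
The difference between the two binning functions shows up in the bin counts. In this sketch the ages are made up and the quartile labels `Q1` to `Q4` are chosen for illustration: `pd.cut` uses the fixed edges we supply, whereas `pd.qcut` picks edges so that each bin receives the same share of the data.

```python
import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 71, 76, 79])

# Fixed edges: bin sizes depend on where the data happens to fall
fixed = pd.cut(ages, [20, 40, 60, 80], labels=["Young", "Middle-Aged", "Old"])

# Quantile edges: four bins with two observations each
quartiles = pd.qcut(ages, 4, labels=["Q1", "Q2", "Q3", "Q4"])

print(fixed.value_counts().sort_index().tolist())
print(quartiles.value_counts().sort_index().tolist())
```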

## Sparse Dataframes

*Sparse* versions of Series and DataFrame are implemented in Pandas. They are not sparse in the typical sense; rather, these objects are **compressed**: any data matching a specific value (`NaN`/missing by default) is omitted. A special `SparseIndex` object tracks where data has been *sparsified*. See this example:

In [None]:
ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan
sts = ts.to_sparse()
sts

The `to_sparse()` method also lets us choose a fill value other than `NaN`:

In [None]:
ts.fillna(0.).to_sparse(fill_value=0)

These Sparse objects are mostly useful for memory efficiency. Suppose you had a mostly `NaN` DataFrame:

In [None]:
df = pd.DataFrame(np.random.rand(100,100))
df_sp = df.where(df < 0.02).to_sparse()
print(df_sp.density)
df_sp.head()

In [None]:
print("Memory usage [sparse]: %d bytes\nMemory usage [dense]: %d bytes" % (df_sp.memory_usage().sum(), df.memory_usage().sum()))
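
In pandas 1.0 and later the `to_sparse()` method no longer exists; the same compression is requested through a sparse dtype instead. A minimal sketch of that route, using a seeded random series so the result is reproducible:

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.random.RandomState(0).randn(10))
ts.iloc[2:-2] = np.nan  # keep only the first two and last two values

# Store only the non-NaN entries; NaN is the fill value
sts = ts.astype(pd.SparseDtype("float64", np.nan))

print(sts.sparse.npoints)   # number of values actually stored
print(sts.sparse.density)   # stored values / total length
print(sts.memory_usage(index=False), ts.memory_usage(index=False))
```

The `.sparse` accessor exposes the stored points and density, and `sts.sparse.to_dense()` converts back to an ordinary Series.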

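Pandas also supports creating sparse dataframes directly from `scipy.sparse` matrices. It is worth mentioning that Pandas converts scipy matrices that are not already in COOrdinate (COO) format to COO, copying data as needed. Note that `pd.SparseDataFrame` was removed in pandas 1.0; `pd.DataFrame.sparse.from_spmatrix()` is its replacement.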

In [None]:
from scipy import sparse

scip_sps = sparse.coo_matrix(np.random.choice([0,1], size=(1000,1000), p=(.95, .05)))
scip_sps

In [None]:
sdf = pd.SparseDataFrame(scip_sps)
sdf.head()

## High-Performance Pandas: `eval()` and `query()`

As we've seen in previous sections, the power of the PyData stack is built upon the capacity of NumPy and Pandas to push basic operations into C via an intuitive syntax. Examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas. Whilst these abstractions are efficient for common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.

Since version 0.13, Pandas has included some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the `eval()` and `query()` functions, which rely on the [`numexpr` package](https://github.com/pydata/numexpr).

### Motivating `query()` and `eval()`: Compound Expressions

As we've seen before, NumPy and Pandas support fast vectorized operations; for example, adding the elements of two arrays:

In [None]:
rng = np.random.RandomState(777)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y

As discussed previously, this is much faster than doing the addition via a Python loop or comprehension:

In [None]:
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

But this abstraction is less efficient when computing compound expressions. For example:

In [None]:
mask = (x > .5) & (y < .5)

Because NumPy evaluates each subexpression, every intermediate step is explicitly allocated in memory. If $x$ and $y$ are very large, this can lead to significant memory and computational overhead. The `numexpr` library gives you the ability to compute this type of compound expression element by element, without the need for full allocation. 

In [None]:
import numexpr

In [None]:
mask_numexpr = numexpr.evaluate("(x > 0.5) & (y < 0.5)")
np.allclose(mask, mask_numexpr)

### `pd.eval()`

The `eval()` function in Pandas uses string expressions to efficiently compute operations using `DataFrame` objects. For example, consider the following:

In [None]:
nrow, ncol = 100000, 100
rng = np.random.RandomState(777)
df1,df2,df3,df4 = (pd.DataFrame(rng.rand(nrow,ncol)) for i in range(4))

To compute the sum of all 4 DataFrames using a typical approach, we would:

In [None]:
%timeit df1 + df2 + df3 + df4

The same result can be computed via `pd.eval` by passing the expression as a string:

In [None]:
%timeit pd.eval("df1 + df2 + df3 + df4")

The `eval()` version is about 50% faster, while giving the same result.
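
That the two routes agree can be checked directly. This sketch uses much smaller frames than the timing example, so it says nothing about speed; it only confirms the values match:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(777)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(100, 5)) for _ in range(4))

plain = df1 + df2 + df3 + df4
evaluated = pd.eval("df1 + df2 + df3 + df4")  # names are looked up in the calling scope

print(np.allclose(plain, evaluated))
```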

### Operations supported by `pd.eval()`

| **Operation Type** | **Description** | **Code Example** |
| ------------- | ------------------------------ | ------------------- |
| Arithmetic | `pd.eval` supports all arithmetic operators. | `result = pd.eval('-df1 * df2 / (df3 + df4) - df5')` |
| Comparison | `pd.eval` supports all comparison operators. | `result = pd.eval('df1 < df2 <= df3 != df4')` |
| Bitwise | `pd.eval` supports the `&` and \| bitwise operators, as well as the literal `and` and `or` | `result = pd.eval('(df1 < 0.5) & (df2 < 0.5) or (df3 < df4)')` |
| Object attributes and indices | `pd.eval` supports access to object attributes <br> via the `obj.attr` syntax, and index via the `obj[index]` syntax. | `result = pd.eval('df2.T[0] + df3.iloc[1]')` |

Other operations such as function calls, conditional statements, loops and other more involved constructs are currently *not* implemented in `pd.eval()`. Some of these may exist in the `numexpr` library itself.

### `DataFrame.eval()` for column-wise operations

Just as Pandas has a top-level `pd.eval()` function, DataFrames have an `eval()` method that works similarly. The benefit of the `eval()` method is that columns can be referred to *by name*.

In [None]:
df = pd.DataFrame(rng.rand(1000,3), columns=["A","B","C"])
df.head()

In [None]:
res1 = (df["A"] + df["B"]) / (df["C"] - 1)
R = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(res1,R)

In [None]:
R2 = df.eval("(A + B) / (C - 1)")
np.allclose(res1,R2)

### Assignment in `df.eval()`

In addition to the previous operations, `DataFrame.eval()` allows assignment to any column. For example:

In [None]:
df.eval("D = (A + B) / C", inplace=True)
df.head()

This also allows modification of existing columns.

### Local variables in `df.eval()`

`df.eval()` also supports additional syntax that lets it work with local Python variables. Consider the following:

In [None]:
col_mean = df.mean(1)
res2 = df["A"] + col_mean
R3 = df.eval("A + @col_mean")
np.allclose(res2,R3)

### `DataFrame.query()` method

As in the `df.eval()` examples, the argument to `query()` is an expression involving columns of the DataFrame. However, `query()` is a **filtering** operation: it returns the rows for which the expression is true, rather than the evaluated result.

In [None]:
res3 = df[(df.A < 0.5) & (df.B < 0.5)]
R4 = df.query("A < 0.5 and B < 0.5")
np.allclose(res3,R4)

In addition to being a more efficient computation, this is much easier to read and understand. Note that the `query()` method also accepts the `@` prefix to mark local variables.
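
The column-name and `@` syntax can be sketched together on a three-row frame with made-up values, small enough to verify by hand. The variable name `threshold` is chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.6, 0.3],
                   "B": [0.4, 0.2, 0.9],
                   "C": [2.0, 3.0, 5.0]})

# eval(): compute a new Series, referring to columns by name
ratio = df.eval("(A + B) / (C - 1)")

# query(): keep the rows where the condition holds; @ marks a local Python variable
threshold = 0.5
subset = df.query("A < @threshold and B < @threshold")

print(ratio.tolist())
print(subset.index.tolist())
```

Only the first row has both `A` and `B` below the threshold, so `subset` contains a single row.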

### When to use Performance functions

When considering whether to even bother using these functions, there are two main considerations:

1. *Computation time*
2. *Memory use*

Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas `DataFrames` will result in implicit creation of temporary arrays. If the size of the temporary DataFrame is significant compared to available system memory, then it's a good idea to use an `eval()` or `query()` expression. You can check the approximate size of your array in bytes using:

In [None]:
df.values.nbytes

On the performance side, `eval()` can be faster even when you are not maxing out your memory. The deciding factor is usually how the size of your temporary DataFrames compares to the size of the L1 or L2 CPU cache on your system. Often there is no considerable difference in computation time between the traditional methods and the `eval()`/`query()` methods; the larger dividends are the saved memory and the cleaner syntax.

## Tasks

