# Using pandas.DataFrame.query

```
Authors: Alexandre Gramfort
         Thomas Moreau
```


In [None]:
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

## Efficient filtering with `query`

When you want to select rows based on more complex conditions, using _boolean indexing_ can be either inefficient or tedious:

In [None]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries

Let's say I want to select the countries with an area larger than 100000 and a population larger than 50 (in millions). I could do:

In [None]:
countries[(countries['area'] > 100_000) & (countries['population'] > 50)]

This creates two boolean masks in memory and then indexes `countries` with their combination, so the `DataFrame` is scanned three times: once per mask and once for the final selection.
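To make the intermediate objects visible, here is a minimal sketch that spells out each step of the boolean indexing above. The names `mask_area`, `mask_population` and `combined` are introduced only for illustration; the data is the same `countries` table.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
})

# Each comparison allocates a full-length boolean Series.
mask_area = countries['area'] > 100_000
mask_population = countries['population'] > 50

# Combining the masks allocates yet another boolean Series.
combined = mask_area & mask_population

# The final indexing step goes through the rows once more.
selected = countries[combined]
print(selected)
```

Every mask has as many entries as the `DataFrame` has rows, whatever the number of rows finally selected.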

Successive indexing is more efficient (each step works on a subset of the rows) but requires creating many temporary variables:

In [None]:
countries_area = countries[countries['area'] > 100_000]
countries_area[countries_area['population'] > 50]

To avoid these drawbacks, one can use the `query` method of a `DataFrame`:

In [None]:
countries.query('area > 1e5 & population > 50')

With this method, the condition is evaluated and the rows are selected in a single efficient pass.
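As a sanity check, the following sketch compares the output of `query` with the output of boolean indexing on the same table. The names `by_mask` and `by_query` are illustrative, and the sketch only checks that the selected rows match; it says nothing about speed.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
})

# Selection with explicit boolean masks.
by_mask = countries[(countries['area'] > 100_000) & (countries['population'] > 50)]

# Same selection expressed as a query string.
by_query = countries.query('area > 1e5 & population > 50')

# Both approaches should return the same rows, index and dtypes.
print(by_query.equals(by_mask))
```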

The `query` method accepts complex queries written in a mini-language:

- column names are used as variables in the evaluated query
- `@foo` refers to the `foo` variable in the current namespace
- special pandas methods can be accessed when using `engine='python'` (see below).

In [None]:
min_density = 1e2
countries.query(
    'population * 1e6 / area > @min_density '
    '& capital.str.contains("B")',
    engine='python'
)
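The `@` prefix also works with Python lists, which combines well with the `in` operator of the query mini-language. A small sketch, where `wanted_capitals` and `max_population` are hypothetical variables chosen for the example:

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
})

# Hypothetical local variables, referenced in the query with the @ prefix.
wanted_capitals = ['Paris', 'London']
max_population = 65

# `in` tests membership in the list; `and` combines the two conditions.
selected = countries.query(
    'capital in @wanted_capitals and population < @max_population'
)
print(selected['country'].tolist())
```

Keeping thresholds and lists in regular Python variables avoids formatting values into the query string by hand.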

## Let's try on a larger dataset to convince you:

In [None]:
df = pd.read_parquet(Path('data') / 'bike-counter-data.parquet')
df.head(1)

In [None]:
%%timeit 

df.query(
    "(latitude**2 - longitude**2 - 10 >= 2371 and bike_count > 13)",
    engine="numexpr"
).shape

In [None]:
%%timeit

df[
    ((df["latitude"]**2 - df["longitude"]**2 - 10) >= 2371) &
    (df["bike_count"] > 13)
].shape
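If the parquet file is not at hand, or to run the comparison outside a notebook, a rough sketch with the standard `timeit` module could look like the following. The synthetic table, its size, and its value ranges are assumptions made for illustration; only the column names and the filter expression are taken from the boolean indexing cell above. Timings depend on the machine, the table size, and whether `numexpr` is installed, so no result is quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the bike counter table (arbitrary value ranges).
rng = np.random.default_rng(0)
n_rows = 200_000
synthetic = pd.DataFrame({
    "latitude": rng.uniform(48.0, 49.0, n_rows),
    "longitude": rng.uniform(2.0, 3.0, n_rows),
    "bike_count": rng.integers(0, 100, n_rows),
})


def with_query():
    # Default engine: numexpr when available, otherwise plain Python.
    return synthetic.query(
        "latitude**2 - longitude**2 - 10 >= 2371 and bike_count > 13"
    )


def with_masks():
    return synthetic[
        ((synthetic["latitude"]**2 - synthetic["longitude"]**2 - 10) >= 2371)
        & (synthetic["bike_count"] > 13)
    ]


# Time a few repetitions of each approach on the same data.
t_query = timeit.timeit(with_query, number=5)
t_masks = timeit.timeit(with_masks, number=5)
print(f"query: {t_query:.4f} s, boolean masks: {t_masks:.4f} s")
```

Both functions apply the same filter, so any difference in timing comes from how the expression is evaluated rather than from the work requested.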