# DataFrames: Parquet Predicate Pushdown Filtering

This notebook shows how to use Parquet predicate pushdown filtering to skip data when reading files.

Parquet stores metadata in the file footer, including the min / max values of each column, recorded for every row group in the file.

Dask can use the min / max statistics to intelligently skip entire files when performing filtering operations.  The performance gains from using predicate pushdown filters depend on how many files can be skipped.

If you have 1,000 files and can skip 990 of them with Parquet predicate pushdown filtering, Dask only has to read 1% of the files, so the performance gains will be massive.

## Example setup

There are four CSV files in the `data/pets` directory.  We'll convert these to four Parquet files.

After the conversion, the `data/pets_parquet` directory contains four Parquet files:

```
data/
  pets_parquet/
    part.0.parquet
    part.1.parquet
    part.2.parquet
    part.3.parquet
```

Each Parquet file has `nickname` and `age` columns.

The Parquet footer stores the min and max value for the age column in each Parquet file.  Here are the min / max values in our example files:

```
| File          | min | max |
|---------------|-----|-----|
| part.0.parquet | 1   | 9   |
| part.1.parquet | 3   | 9   |
| part.2.parquet | 2   | 4   |
| part.3.parquet | 7   | 12  |
```

Suppose we'd like to fetch the pets that are older than 10.  The Parquet metadata tells us that the first three files have maximum ages of 9, 9, and 4, so none of them contain a pet older than 10.  We can skip those files entirely and only read the fourth file, whose maximum age is 12.
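The skip decision above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Dask implements it: the `age_stats` list and the `may_contain_age_greater_than` helper are hypothetical names, and the (min, max) pairs are copied from the table in file order.

```python
# (min, max) of the age column for each file, in the order shown in the table.
age_stats = [(1, 9), (3, 9), (2, 4), (7, 12)]


def may_contain_age_greater_than(stats, threshold):
    """Return True if a file with these (min, max) stats could hold a matching row."""
    _min_age, max_age = stats
    # If even the largest age is not above the threshold, no row can match.
    return max_age > threshold


# Indexes of the files that cannot be ruled out from the statistics alone.
files_to_read = [
    index
    for index, stats in enumerate(age_stats)
    if may_contain_age_greater_than(stats, 10)
]
skipped_fraction = 1 - len(files_to_read) / len(age_stats)

print(files_to_read, skipped_fraction)
```

The check is deliberately conservative: statistics can prove that a file has no matching rows, but they cannot prove that every row in a kept file matches.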

Reading files and transferring them to a cluster is time-consuming.  In this example, we're able to skip three of the four files (75%), so predicate pushdown filtering will give a nice performance gain.

In [6]:
import dask.dataframe as dd

Convert the CSV files into Parquet files

In [8]:
dd.read_csv('../data/pets/*.csv').to_parquet('data/pets_parquet')

Inspect the contents of one of the Parquet files

In [9]:
dd.read_parquet('data/pets_parquet/part.0.parquet').head(3)

Unnamed: 0,nickname,age
0,fofo,3
1,tio,1
2,lulu,9


## Inefficient approach

Read all of the Parquet files into a DataFrame and perform a filtering operation to grab all the pets that are older than 10.

In [10]:
df1 = dd.read_parquet('data/pets_parquet/*')

In [11]:
df1 = df1[df1['age'] > 10]

In [12]:
df1.head(1, npartitions=4)

Unnamed: 0,nickname,age
1,lll,12


In [13]:
df1.npartitions

4

This approach sends all four Parquet files to the cluster.  We know all four files are being read by Dask because the DataFrame has four partitions, one per file.

Dask needs to filter over all the files, even the files we know don't have any pets older than 10.  Let's use predicate pushdown filtering so we don't needlessly filter files that don't contain any matching data.

## Efficient approach

In [15]:
df1 = dd.read_parquet('data/pets_parquet/*', filters=[('age', '>', 10)])

In [16]:
df1.head(1, npartitions=1)

Unnamed: 0,nickname,age
0,fff,7


In [17]:
df1.npartitions

1

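This approach performs Parquet predicate pushdown filtering.  We can tell because the DataFrame only has one partition, so Dask only read one file.  When the `filters` parameter is populated, Dask inspects the metadata of the Parquet files and skips entire files whenever possible.

Note that the row returned above has an age of 7, which does not satisfy the predicate.  In this run, the filter only decided which files to read; it did not drop individual rows from the file that was kept.  Whether Dask also filters rows can depend on the version and engine in use, so apply a regular row-level filter afterwards, as in the inefficient approach, if you need only the matching rows.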
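As a rough illustration of why a row-level filter is still useful after file skipping, here is a minimal sketch in plain Python. It uses only the two rows visible in the outputs above (`fff`, 7 and `lll`, 12); the list-of-dicts layout and the variable names are illustrative and not part of Dask.

```python
# Two rows seen in the outputs above, both from the one file that was kept.
surviving_rows = [
    {"nickname": "fff", "age": 7},
    {"nickname": "lll", "age": 12},
]

# File skipping kept this file because its max age (12) is above 10,
# but not every row inside it satisfies the predicate.
matching_rows = [row for row in surviving_rows if row["age"] > 10]

print(matching_rows)
```

In Dask, the equivalent step is to apply the same mask used in the inefficient approach, `df1[df1['age'] > 10]`, to the DataFrame returned by `read_parquet` with `filters`.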