In [1]:
import coiled
import dask.distributed
import dask.dataframe as dd

## Cluster setup

In [None]:
cluster = coiled.Cluster(configuration="coiled/default", n_workers=5)

Found software environment build


In [3]:
client = dask.distributed.Client(cluster)


+-------------+-----------+-----------+-----------+
| Package     | client    | scheduler | workers   |
+-------------+-----------+-----------+-----------+
| dask        | 2021.07.1 | 2021.07.2 | 2021.07.2 |
| distributed | 2021.07.1 | 2021.07.2 | 2021.07.2 |
+-------------+-----------+-----------+-----------+


## CSV files

In [7]:
ddf = dd.read_csv(
    "s3://coiled-datasets/timeseries/20-years/csv/*", 
    storage_options={"anon": True, 'use_ssl': True}
)

In [8]:
len(ddf)

662256000

In [9]:
%%time

len(ddf[ddf.id > 1170])

CPU times: user 587 ms, sys: 79.6 ms, total: 667 ms
Wall time: 3min 9s


65

## Parquet files

In [10]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet", 
    storage_options={"anon": True, 'use_ssl': True}
)

In [11]:
%%time

len(ddf[ddf.id > 1170])

CPU times: user 314 ms, sys: 44.1 ms, total: 358 ms
Wall time: 1min 35s


65

## Predicate pushdown filtering

In [12]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet", 
    storage_options={"anon": True, 'use_ssl': True},
    filters=[[('id', '>', 1170)]]
)

In [13]:
len(ddf)

38707200

In [14]:
%%time

len(ddf[ddf.id > 1170])

CPU times: user 42.7 ms, sys: 4.85 ms, total: 47.6 ms
Wall time: 3.77 s


65

## Predicate pushdown filtering and column pruning

In [15]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet", 
    storage_options={"anon": True, 'use_ssl': True},
    filters=[[('id', '>', 1170)]],
    columns=["id"]
)

In [16]:
len(ddf)

38707200

In [17]:
%%time

len(ddf[ddf.id > 1170])

CPU times: user 29.5 ms, sys: 3.35 ms, total: 32.8 ms
Wall time: 2.17 s


65
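
The wall times reported in the four runs above can be put side by side. The sketch below only restates those numbers and divides them; the dictionary labels are made up for illustration, and because each figure comes from a single run, the ratios are rough indications rather than a benchmark.

```python
# Wall times copied from the %%time outputs above, in seconds.
# The labels are illustrative names, not part of any API.
wall_seconds = {
    "csv": 3 * 60 + 9,                    # Wall time: 3min 9s
    "parquet": 1 * 60 + 35,               # Wall time: 1min 35s
    "parquet + filters": 3.77,            # Wall time: 3.77 s
    "parquet + filters + columns": 2.17,  # Wall time: 2.17 s
}

# Speedup of each run relative to the CSV baseline.
baseline = wall_seconds["csv"]
speedups = {name: baseline / secs for name, secs in wall_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than CSV")
```

Most of the gain comes from the predicate pushdown filter; column pruning trims the remaining time further.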

## Understanding predicate pushdowns

Predicate pushdown filters are applied at the row-group level. In our example, they skip every row group that contains no ids greater than 1170. Row groups that do contain ids greater than 1170 are read in full, so they also include rows with ids of 1170 or less. You still need to apply the regular filter after the predicate filter to get the final result.
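
The idea can be sketched with a toy model in plain Python. This is not how Dask or the Parquet reader is implemented: the `row_groups` data and the `pushdown` helper are hypothetical. A real Parquet file stores min/max statistics in its metadata, so the reader can skip a row group without scanning its values. The toy version computes the maximum directly.

```python
# Toy model of row-group pruning. Plain Python, not Dask or Parquet internals.
# Each inner list stands in for the "id" column of one row group (made-up data).
row_groups = [
    [1081, 1022, 984, 996],   # max is 1081: no id > 1170, group is skipped
    [1003, 1175, 990, 1012],  # max is 1175: group is read in full
    [950, 1001, 1171, 1168],  # max is 1171: group is read in full
]


def pushdown(groups, threshold):
    """Keep only the row groups whose largest id exceeds the threshold."""
    return [group for group in groups if max(group) > threshold]


# Step 1: the predicate filter drops whole row groups.
kept = pushdown(row_groups, 1170)
rows_read = [value for group in kept for value in group]

# Step 2: the regular filter drops the remaining non-matching rows.
final = [value for value in rows_read if value > 1170]

print(len(rows_read))  # 8 rows are read, most of them with id <= 1170
print(final)           # [1175, 1171]
```

The first step discards data cheaply but coarsely; only the second step guarantees that every remaining row satisfies the condition.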

In [21]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet", 
    storage_options={"anon": True, 'use_ssl': True},
    filters=[[('id', '>', 1170)]]
)

In [22]:
ddf.head()

timestamp,id,name,x,y
2000-01-29 00:00:00,1081,Edith,0.050667,-0.556958
2000-01-29 00:00:01,1022,Ursula,-0.642827,0.659931
2000-01-29 00:00:02,984,Jerry,0.449249,0.782695
2000-01-29 00:00:03,996,Alice,-0.124976,0.327127
2000-01-29 00:00:04,992,Victor,0.274238,-0.320963


In [23]:
len(ddf)

38707200

In [24]:
len(ddf[ddf.id > 1170])

65

With predicate filtering, the regular filter only needs to process 38,707,200 rows of data. If the predicate filters are not applied, Dask has to run the regular filter on all 662,256,000 rows, as shown below.
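
As a quick check on the scale of the saving, the row counts that `len(ddf)` reports in this notebook (38,707,200 with `filters`, 662,256,000 without, and 65 matching rows) can be compared directly. The variable names below are illustrative only.

```python
# Row counts taken from the len(ddf) outputs in this notebook.
total_rows = 662_256_000    # read_parquet without filters
pushdown_rows = 38_707_200  # read_parquet with filters=[[('id', '>', 1170)]]
matching_rows = 65          # rows left after the regular filter

# Share of the dataset that still has to be scanned after the pushdown.
fraction_read = pushdown_rows / total_rows
rows_skipped = total_rows - pushdown_rows

print(f"rows skipped by pushdown: {rows_skipped:,}")       # 623,548,800
print(f"fraction still scanned:   {fraction_read:.1%}")    # 5.8%
print(f"rows dropped by regular filter: {pushdown_rows - matching_rows:,}")
```

The pushdown removes roughly 94% of the rows before the regular filter runs, yet almost all of the rows it lets through are still discarded, which is why the second filtering step remains necessary.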

In [25]:
ddf = dd.read_parquet(
    "s3://coiled-datasets/timeseries/20-years/parquet", 
    storage_options={"anon": True, 'use_ssl': True}
)

In [27]:
len(ddf)

662256000

