# Scaling Pandas with Dask

Run the scripts in [coiled-datasets](https://github.com/coiled/coiled-datasets) to create the local timeseries datasets that the code in this notebook reads.

In [2]:
import glob
import os

import pandas as pd

In [3]:
home = os.path.expanduser("~")

## Pandas query on small dataset

In [4]:
path = f"{home}/data/timeseries/1-month/parquet"
all_files = glob.glob(path + "/*.parquet")

In [5]:
df = pd.concat((pd.read_parquet(f) for f in all_files))

In [7]:
# Total in-memory size in bytes; deep=True also measures the contents of object columns
df.memory_usage(deep=True).sum()

228240994
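
The raw byte count above is easier to judge in larger units. The following is a minimal sketch of that conversion; `human_size` is a hypothetical helper written for this illustration and is not part of pandas.

```python
def human_size(n_bytes: int) -> str:
    """Render a byte count in decimal megabytes and binary mebibytes."""
    mb = n_bytes / 1_000_000  # decimal megabytes
    mib = n_bytes / 2**20     # binary mebibytes
    return f"{mb:.1f} MB ({mib:.1f} MiB)"


# The value reported by the cell above
print(human_size(228240994))  # prints "228.2 MB (217.7 MiB)"
```

At roughly 228 MB, the one-month dataset fits comfortably in memory on a typical laptop.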

In [15]:
%%time

# Count distinct ids; the double brackets select a DataFrame, so the result is a Series
df[["id"]].nunique()

CPU times: user 38.2 ms, sys: 2.81 ms, total: 41 ms
Wall time: 39.3 ms


id    290
dtype: int64
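
The result above is a labeled Series rather than a bare number because `df[["id"]]` selects a one-column DataFrame. The sketch below shows the difference on a toy DataFrame (the data is invented for illustration and is not the timeseries dataset). This is also why the Dask cell later in this notebook, which uses single brackets, prints a plain integer.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({"id": [1, 2, 2, 3], "x": [0.1, 0.2, 0.3, 0.4]})

per_column = toy[["id"]].nunique()  # DataFrame in, Series out (one entry per column)
scalar = toy["id"].nunique()        # Series in, plain integer out

print(per_column)
print(scalar)
```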

## Pandas query on large dataset

In [16]:
path = f"{home}/data/timeseries/20-years/parquet"
all_files = glob.glob(path + "/*.parquet")

In [None]:
df = pd.concat((pd.read_parquet(f) for f in all_files))

In [None]:
%%time

df[["id"]].nunique()
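
The two cells above have no recorded execution count or output. Before concatenating every file into a single DataFrame, it can help to check how much data sits on disk. The following is a rough sketch under that assumption; `total_size_bytes` is a hypothetical helper, not a pandas function. Parquet files are usually compressed, so the in-memory size is typically larger than the figure this prints.

```python
import glob
import os


def total_size_bytes(pattern: str) -> int:
    """Sum the on-disk size of every file matching a glob pattern."""
    return sum(os.path.getsize(p) for p in glob.glob(pattern))


home = os.path.expanduser("~")
# Same directory as the cell above; the sum is 0 if no files match
size = total_size_bytes(f"{home}/data/timeseries/20-years/parquet/*.parquet")
print(f"{size / 1e9:.2f} GB on disk")
```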

## Dask query on large dataset

In [17]:
import dask
import dask.dataframe as dd

In [18]:
from dask.distributed import Client

# With no arguments, Client() starts a local cluster on this machine and connects to it
client = Client()

Perhaps you already have a cluster running?
Hosting the HTTP server on port 50884 instead


In [23]:
ddf = dd.read_parquet(
    f"{home}/data/timeseries/20-years/parquet",
    engine="pyarrow",
)

In [24]:
%%time

# nunique() only builds a lazy task graph; compute() executes it and returns the result
ddf["id"].nunique().compute()



CPU times: user 3.74 s, sys: 519 ms, total: 4.26 s
Wall time: 8.65 s


367