## Groupby queries from h2o-benchmarks (parquet-pyarrow str)

The "50GB_pyarr" dataset was generated in the same way as the other ones, except that when reading the CSV files, id1, id2 and id3 were cast to pyarrow strings; the data was then repartitioned with a 100MB target partition size and written to parquet.

After reading the data back and casting id1, id2 and id3 to pyarrow strings, each partition is around 72MB in memory.


- Dask = 2022.10.0
- pyarrow = 10.0.0.dev497
- pandas = 1.5.1

In [1]:
import os

import coiled
import dask.dataframe as dd
from dask.distributed import Client, performance_report
from distributed.diagnostics import MemorySampler
import pandas as pd

In [2]:
import dask
import distributed

In [3]:
dask.__version__

'2022.10.0'

In [4]:
distributed.__version__

'2022.10.0'

In [5]:
import coiled
coiled.__version__

'0.2.38'

In [6]:
#cluster = coiled.Cluster(name="h2o-benchmarks")

In [6]:
cluster = coiled.Cluster(
    name="h2o-benchmarks",
    n_workers=10,
    worker_vm_types=["t3.large"],  # 2CPU, 8GiB
    scheduler_vm_types=["t3.large"],
    software="dask-engineering/h2o_pyarrow_nightly",
#     package_sync=True,
)

In [7]:
client = Client(cluster)
client

Connection method: Cluster object, Cluster type: coiled.ClusterBeta
Dashboard: http://18.116.80.128:8787

Dashboard: http://18.116.80.128:8787, Workers: 10
Total threads: 20, Total memory: 72.02 GiB

Comm: tls://10.0.15.238:8786, Workers: 10
Dashboard: http://10.0.15.238:8787/status, Total threads: 20
Started: Just now, Total memory: 72.02 GiB

Comm,Total threads,Dashboard,Memory,Nanny,Local directory
tls://10.0.9.210:33039,2,http://10.0.9.210:8787/status,7.21 GiB,tls://10.0.9.210:35915,/scratch/dask-worker-space/worker-oyf45xvt
tls://10.0.6.47:34001,2,http://10.0.6.47:8787/status,7.20 GiB,tls://10.0.6.47:37639,/scratch/dask-worker-space/worker-370d2l3n
tls://10.0.4.158:32797,2,http://10.0.4.158:8787/status,7.21 GiB,tls://10.0.4.158:33309,/scratch/dask-worker-space/worker-u27kby59
tls://10.0.6.35:41143,2,http://10.0.6.35:8787/status,7.20 GiB,tls://10.0.6.35:45171,/scratch/dask-worker-space/worker-3wp7ita5
tls://10.0.14.68:44901,2,http://10.0.14.68:8787/status,7.19 GiB,tls://10.0.14.68:33073,/scratch/dask-worker-space/worker-0_zv6ysv
tls://10.0.9.227:43897,2,http://10.0.9.227:8787/status,7.21 GiB,tls://10.0.9.227:40387,/scratch/dask-worker-space/worker-cvi30uka
tls://10.0.9.108:39389,2,http://10.0.9.108:8787/status,7.20 GiB,tls://10.0.9.108:45301,/scratch/dask-worker-space/worker-zwh2vl7c
tls://10.0.8.101:44417,2,http://10.0.8.101:8787/status,7.20 GiB,tls://10.0.8.101:43155,/scratch/dask-worker-space/worker-q56m9eeg
tls://10.0.2.161:36669,2,http://10.0.2.161:8787/status,7.20 GiB,tls://10.0.2.161:38415,/scratch/dask-worker-space/worker-ppbbw_3_
tls://10.0.15.89:40063,2,http://10.0.15.89:8787/status,7.20 GiB,tls://10.0.15.89:34827,/scratch/dask-worker-space/worker-cbw2oy5_


In [8]:
ms = MemorySampler()

In [9]:
# "50GB_pyarr": data saved with pyarrow strings and repartitioned after the cast

data_size = {
    "50GB_pyarr": "s3://coiled-datasets/h2o-benchmark/pyarrow_strings/N_1e9_K_1e2/*.parquet",

}

In [10]:
ds = "50GB_pyarr" #run next with "50GB_orig"
report_dir = "performance-reports-pyarr_str-50GB"

In [11]:
ddf = dd.read_parquet(
    data_size[ds],
    engine="pyarrow",
    storage_options={"anon": True},
)
ddf

,id1,id2,id3,id4,id5,id6,v1,v2,v3
npartitions=1000,,,,,,,,,
,string,string,string,Int32,Int32,Int32,Int32,Int32,float64
,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...


### Q1

In [12]:
%%time
with performance_report(filename=os.path.join(report_dir, f"q1_{ds}.html")):
    with ms.sample(f"q1_22-10_{ds}"):
        ddf_q1 = ddf[["id1", "v1"]].astype({"id1": "string[pyarrow]"})
        ddf_q1.groupby("id1", dropna=False, observed=True).agg({"v1": "sum"}).compute()

CPU times: user 289 ms, sys: 75.7 ms, total: 364 ms
Wall time: 1min 13s


In [13]:
client.restart();

### Q2

In [14]:
%%time
with performance_report(filename=os.path.join(report_dir, f"q2_{ds}.html")):
    with ms.sample(f"q2_22-10_{ds}"):
        ddf_q2 = ddf[["id1", "id2", "v1"]].astype({"id1": "string[pyarrow]",
                                                   "id2": "string[pyarrow]"})
        (
            ddf_q2.groupby(["id1", "id2"], dropna=False, observed=True)
            .agg({"v1": "sum"})
            .compute()
        )

CPU times: user 292 ms, sys: 54.3 ms, total: 347 ms
Wall time: 1min 39s


In [15]:
client.restart();

### Q3

In [16]:
%%time
with performance_report(filename=os.path.join(report_dir, f"q3_{ds}-p2p.html")):
    
    with ms.sample(f"q3_22-10_{ds}_p2p"):
        ddf_q3 = ddf[["id3", "v1", "v3"]].astype({"id3": "string[pyarrow]"})
        (
            ddf_q3.groupby("id3", dropna=False, observed=True)
            .agg({"v1": "sum", "v3": "mean"}, shuffle="p2p")
            .compute()
        )

CPU times: user 7.78 s, sys: 2.57 s, total: 10.3 s
Wall time: 15min 17s


In [17]:
client.restart();

### Q7

In [18]:
%%time
with performance_report(filename=os.path.join(report_dir, f"q7_{ds}-p2p.html")):
    
    with ms.sample(f"q7_22-10_{ds}_p2p"):

        ddf_q7 = ddf[["id3", "v1", "v2"]].astype({"id3": "string[pyarrow]"})
        (
            ddf_q7.groupby("id3", dropna=False, observed=True)
            .agg({"v1": "max", "v2": "min"}, shuffle="p2p")
            .assign(range_v1_v2=lambda x: x["v1"] - x["v2"])[["range_v1_v2"]]
            .compute()
        )

CPU times: user 3.62 s, sys: 1.25 s, total: 4.87 s
Wall time: 11min 6s


In [19]:
client.restart();

### Q9

In [23]:
%%time
with performance_report(filename=os.path.join(report_dir, f"q9_{ds}.html")):
    
    with ms.sample(f"q9_22-10_{ds}"):
        
        ddf_q9 = ddf[["id2", "id4", "v1", "v2"]].astype({"id2": "string[pyarrow]"})
        (
            ddf_q9.groupby(["id2", "id4"], dropna=False, observed=True)
            .apply(
                lambda x: pd.Series({"r2": x.corr()["v1"]["v2"] ** 2}),
                meta={"r2": "float64"},
            )
            .compute()
        )

CPU times: user 2.72 s, sys: 467 ms, total: 3.19 s
Wall time: 5min 54s


In [24]:
client.restart();

In [25]:
ms.to_pandas(align=True).to_csv("id1id2id3_pyarrow_str_2022-10.csv")
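
The saved samples can be reloaded later to compare the queries. The sketch below assumes the CSV has one column per sample label with values in bytes, which is how `MemorySampler.to_pandas` is understood to lay out its result. The helper name `peak_memory_gib` and the inline data are made up for illustration and are not measurements from this run.

```python
import io

import pandas as pd


def peak_memory_gib(csv_source):
    """Return the peak sampled memory per label, in GiB, largest first."""
    samples = pd.read_csv(csv_source, index_col=0)
    return (samples.max() / 2**30).sort_values(ascending=False)


# Synthetic stand-in for the saved file: one column per sample label,
# values in bytes, empty where a shorter run has no sample.
fake_csv = io.StringIO(
    "time,q1_demo,q3_demo\n"
    "0 days 00:00:00,1073741824,2147483648\n"
    "0 days 00:00:01,3221225472,4294967296\n"
    "0 days 00:00:02,,6442450944\n"
)

peaks = peak_memory_gib(fake_csv)
print(peaks)
```

For the file written above, the call would be `peak_memory_gib("id1id2id3_pyarrow_str_2022-10.csv")`.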

### Q4, Q5, Q6 and Q8
Note: The columns involved in these queries are not affected by the pyarrow string cast.