## Get better at Dask DataFrames

In this lesson you will learn some good practices for Dask DataFrames and for working with data in general.


### Work close to your data

When your data lives in the cloud, it is best to run your computation close to it, to minimize the impact of network I/O.

In this lesson we will use Coiled clusters created in the same region where our datasets are stored (`"us-east-2"`).

**NOTE:**
If you do not have access to a Coiled cluster, you can still follow along. Just make sure you use the smaller datasets (the `"0.5GB-"` ones).

## Parquet vs CSV

Most people are familiar with CSV files, but when it comes to working with data, Parquet can make a big difference. The Parquet file format is column-oriented and designed to store and retrieve data efficiently.

**Extra reading**
You can read about the advantages of the Parquet format in the blog post [Advantages of Parquet File Format](https://www.coiled.io/blog/parquet-file-column-pruning-predicate-pushdown).
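
To make "column-oriented" concrete, here is a toy sketch in plain Python. It is not how Parquet is implemented (real files add encoding, compression and metadata), and `rows`, `columns` and the two helper functions are hypothetical names used only for illustration. It counts how many stored values have to be touched to sum a single column under each layout.

```python
# Toy illustration only: three records with the same column names as our dataset.
rows = [
    {"id1": "id001", "v1": 3, "v3": 0.5},
    {"id1": "id002", "v1": 5, "v3": 1.5},
    {"id1": "id001", "v1": 2, "v3": 2.5},
]


def sum_v1_row_layout(records):
    """Row-oriented (CSV-like): every field of every record is read."""
    touched = 0
    total = 0
    for record in records:
        touched += len(record)  # the whole row has to be read and parsed
        total += record["v1"]
    return total, touched


# Column-oriented (Parquet-like): the values of each column are stored together.
columns = {name: [record[name] for record in rows] for name in rows[0]}


def sum_v1_column_layout(cols):
    """Column-oriented: only the requested column is read."""
    values = cols["v1"]
    return sum(values), len(values)


row_total, row_touched = sum_v1_row_layout(rows)
col_total, col_touched = sum_v1_column_layout(columns)
print(f"row layout:    total={row_total}, values touched={row_touched}")
print(f"column layout: total={col_total}, values touched={col_touched}")
```

Both layouts give the same answer, but the column layout touches one third of the values here. The gap grows with the number of columns you do not need.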

Let's compare reading the same data stored in two ways: once as `csv` files and once as `parquet` files.





### Notes on CSV and Parquet

CSV: row-based

Parquet: column-based

put image

- Parquet supports predicate pushdown and CSV does not: the filter is applied first, then only the matching data is read
- Parquet stores schema information
- Parquet files carry metadata (for example, about row groups) that makes reads very efficient
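
The row-group metadata mentioned in these notes is what makes predicate pushdown possible. Below is a simplified sketch in plain Python, assuming each row group records the minimum and maximum of a column. `RowGroup` and `read_with_pushdown` are hypothetical names for illustration and are not part of any Parquet library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of one column plus the min/max statistics kept as metadata."""

    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def read_with_pushdown(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and how many row groups were actually read."""
    selected: List[int] = []
    groups_read = 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # the metadata alone proves nothing matches: skip the chunk
        groups_read += 1
        selected.extend(v for v in group.values if v > threshold)
    return selected, groups_read


row_groups = [RowGroup([1, 4, 2]), RowGroup([7, 9, 8]), RowGroup([3, 10, 5])]
matches, groups_read = read_with_pushdown(row_groups, threshold=6)
print(f"matches={matches}, row groups read={groups_read} of {len(row_groups)}")
```

A CSV file has no such statistics, so every row has to be parsed before the filter can be applied. In Dask, the related knobs are the `columns=` and `filters=` arguments of `dd.read_parquet`.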

In [1]:
data = {
    "0.5GB-csv": "s3://coiled-datasets/h2o-benchmark/N_1e7_K_1e2/*.csv",
    "0.5GB-pq": "s3://coiled-datasets/h2o-benchmark/N_1e7_K_1e2_parquet/*.parquet",
    "5GB-csv": "s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2/*.csv",
    "5GB-pq": "s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2_parquet/*.parquet",
}

In [2]:
import coiled
from dask.distributed import Client
import dask.dataframe as dd

### SECTION ON HOW TO LOGIN INTO COILED WHEN WE HAVE INFO

In [None]:
#cluster = coiled.Cluster(name="dask-tutorial")

In [3]:
%%time
cluster = coiled.Cluster(name="dask-tutorial",
                        n_workers=8,
                        package_sync=True,
                        backend_options={"region_name": "us-east-2"},
                        );

## maybe use mi6 instead, the default ones are slower...

{'channel': 'conda-forge', 'sdist': None, 'source': 'conda', 'conda_name': 'zstd', 'name': 'zstd', 'client_version': '1.5.2', 'specifier': '', 'include': False, 'note': None, 'error': '1.5.2 has no install candidate for linux-64', 'md5': None} -1
{'channel': 'conda-forge', 'sdist': None, 'source': 'conda', 'conda_name': 'zlib', 'name': 'zlib', 'client_version': '1.2.13', 'specifier': '', 'include': False, 'note': None, 'error': '1.2.13 has no install candidate for linux-64', 'md5': None} -1
{'channel': 'conda-forge', 'sdist': None, 'source': 'conda', 'conda_name': 'zeromq', 'name': 'zeromq', 'client_version': '4.3.4', 'specifier': '', 'include': False, 'note': None, 'error': '4.3.4 has no install candidate for linux-64', 'md5': None} -1
{'channel': 'conda-forge', 'sdist': None, 'source': 'conda', 'conda_name': 'xz', 'name': 'xz', 'client_version': '5.2.6', 'specifier': '', 'include': False, 'note': None, 'error': '5.2.6 has no install candidate for linux-64', 'md5': None} -1
{'channel'

CPU times: user 9.54 s, sys: 1.42 s, total: 11 s
Wall time: 1min 25s


In [4]:
client = Client(cluster)
client

| Connection method | Cluster type | Dashboard |
|---|---|---|
| Cluster object | coiled.ClusterBeta | http://3.17.134.86:8787 |

| Dashboard | Workers | Total threads | Total memory |
|---|---|---|---|
| http://3.17.134.86:8787 | 8 | 32 | 119.48 GiB |

| Comm | Dashboard | Workers | Total threads | Total memory | Started |
|---|---|---|---|---|---|
| tls://10.0.21.131:8786 | http://10.0.21.131:8787/status | 8 | 32 | 119.48 GiB | Just now |

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tls://10.0.26.81:37455 | http://10.0.26.81:8787/status | tls://10.0.26.81:35277 | 4 | 14.93 GiB | /scratch/dask-worker-space/worker-y98fqwll |
| tls://10.0.23.75:43723 | http://10.0.23.75:8787/status | tls://10.0.23.75:33255 | 4 | 14.94 GiB | /scratch/dask-worker-space/worker-gri2mnuw |
| tls://10.0.28.86:43421 | http://10.0.28.86:8787/status | tls://10.0.28.86:46085 | 4 | 14.94 GiB | /scratch/dask-worker-space/worker-3siz72rt |
| tls://10.0.20.150:45093 | http://10.0.20.150:8787/status | tls://10.0.20.150:40869 | 4 | 14.93 GiB | /scratch/dask-worker-space/worker-2tedpeci |
| tls://10.0.16.35:39337 | http://10.0.16.35:8787/status | tls://10.0.16.35:35053 | 4 | 14.94 GiB | /scratch/dask-worker-space/worker-4fprgxa1 |
| tls://10.0.19.36:44023 | http://10.0.19.36:8787/status | tls://10.0.19.36:42697 | 4 | 14.94 GiB | /scratch/dask-worker-space/worker-qgunmj7s |
| tls://10.0.22.180:42141 | http://10.0.22.180:8787/status | tls://10.0.22.180:36381 | 4 | 14.94 GiB | /scratch/dask-worker-space/worker-jbkj5v_b |
| tls://10.0.16.116:46603 | http://10.0.16.116:8787/status | tls://10.0.16.116:34083 | 4 | 14.93 GiB | /scratch/dask-worker-space/worker-o8f8qw3i |


In [11]:
ddf_csv = dd.read_csv(data["5GB-csv"], storage_options={"anon": True})
ddf_pq = dd.read_parquet(data["5GB-pq"], storage_options={"anon": True})
#dd.read_parquet(data["5GB-pq"], storage_options={"anon": True})

In [12]:
ddf_csv

npartitions=100

| id1 | id2 | id3 | id4 | id5 | id6 | v1 | v2 | v3 |
|---|---|---|---|---|---|---|---|---|
| object | object | object | int64 | int64 | int64 | int64 | int64 | float64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [13]:
ddf_pq

npartitions=100

| id1 | id2 | id3 | id4 | id5 | id6 | v1 | v2 | v3 |
|---|---|---|---|---|---|---|---|---|
| category[unknown] | category[unknown] | category[unknown] | Int32 | Int32 | Int32 | Int32 | Int32 | float64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [10]:
ddf_csv

npartitions=100

| id1 | id2 | id3 | id4 | id5 | id6 | v1 | v2 | v3 |
|---|---|---|---|---|---|---|---|---|
| category[unknown] | category[unknown] | category[unknown] | Int32 | Int32 | Int32 | Int32 | Int32 | float64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [9]:
%%time
ddf_csv.groupby("id1").agg({"v1": "sum"}).compute()

CPU times: user 315 ms, sys: 105 ms, total: 420 ms
Wall time: 2min 25s


| id1 | v1 |
|---|---|
| id048 | 2000740 |
| id080 | 2001520 |
| id035 | 1999546 |
| id086 | 2001468 |
| id009 | 1997231 |
| ... | ... |
| id087 | 2000103 |
| id037 | 1996179 |
| id008 | 1999858 |
| id085 | 1997778 |


In [7]:
%%time
ddf_pq.groupby("id1").agg({"v1": "sum"}).compute()

CPU times: user 40.4 ms, sys: 24.3 ms, total: 64.7 ms
Wall time: 3.07 s


| id1 | v1 |
|---|---|
| id048 | 2000740 |
| id080 | 2001520 |
| id035 | 1999546 |
| id086 | 2001468 |
| id009 | 1997231 |
| ... | ... |
| id087 | 2000103 |
| id037 | 1996179 |
| id008 | 1999858 |
| id085 | 1997778 |


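Notice that, without any tuning, the `parquet` version is already much faster: in the runs recorded above it took about 3 s of wall time, against roughly 2 min 25 s for `csv`. The exact speedup will vary from run to run.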

Let's take a look at the dtypes in both cases and see if we can make some things faster:

In [None]:
ddf_csv

In [8]:
#IF I SPECIFY THE DTYPES THIS GETS MUCH SLOWER ??? Thoughts??

ddf_csv = dd.read_csv(
    data["5GB-csv"],
    dtype={
        "id1": "category",
        "id2": "category",
        "id3": "category",
        "id4": "Int32",
        "id5": "Int32",
        "id6": "Int32",
        "v1": "Int32",
        "v2": "Int32",
        "v3": "float64",
    },
    storage_options={"anon": True},
)

In [None]:
ddf_pq

In [None]:
## example to explain column pruning.


### Read about why `read_parquet` picks up the dtypes but `read_csv` does not

- show `ddf.partitions[0].memory_usage(deep=True).compute() / 1e6`
- see what happens with csv and with parquet
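
One reason the dtypes matter is memory. The sketch below uses plain Python to illustrate dictionary encoding, the idea behind the `category` dtype: store each distinct label once and keep a small integer code per row. It is an approximation, not how pandas or Dask implement categoricals. Sizes from `sys.getsizeof` are rough and platform-dependent, and `approx_size_as_objects` and `dictionary_encode` are hypothetical helper names.

```python
import sys

# 100 distinct labels repeated many times, similar in spirit to the id1 column.
labels = [f"id{i % 100:03d}" for i in range(100_000)]


def approx_size_as_objects(values):
    """Rough size of a list of Python strings: the list plus every string object."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def dictionary_encode(values):
    """Store each distinct label once and keep one small integer code per row."""
    categories = sorted(set(values))
    code_of = {category: code for code, category in enumerate(categories)}
    # One byte per row is enough here because there are fewer than 256 categories.
    codes = bytes(code_of[v] for v in values)
    return categories, codes


categories, codes = dictionary_encode(labels)
decoded = [categories[code] for code in codes]  # round trip back to the labels

raw_size = approx_size_as_objects(labels)
encoded_size = approx_size_as_objects(categories) + sys.getsizeof(codes)
print(f"object-like: {raw_size / 1e6:.2f} MB, dictionary-encoded: {encoded_size / 1e6:.2f} MB")
```

The saving depends on how often labels repeat. For a column where almost every value is distinct, such as `id3`, dictionary encoding helps far less.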

## dtypes

NOTE: 

FOR THE PURPOSE OF THE TUTORIAL I NEED TO GENERATE THE DATA FOR 5GB WITH PYARROW STRINGS. 
OR TYPECAST, EXPLORE THAT.

THEN RUN 
```python
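# Select the needed columns from the Parquet-backed dataframe (ddf_pq, defined above)
# and cast the group key to a pyarrow-backed string
ddf_q3 = ddf_pq[["id3", "v1", "v3"]].astype({"id3": "string[pyarrow]"})
(
    ddf_q3.groupby("id3", dropna=False, observed=True)
    .agg({"v1": "sum", "v3": "mean"})
    .compute()
)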
```

chat with james to see if there is anything else about pyarrow dtypes we could be showing here.

## High cardinality 

- id1 has 100 unique values
- id3 has 1_000_000 unique values

Let's see what happens when we group by a high-cardinality column, and what we can do to make it faster.

Read docs about shuffle, and explain advantages, extract useful info. Ask about p2p docs?
https://docs.dask.org/en/stable/dataframe-groupby.html#shuffle-methods
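
Why cardinality matters can be sketched in plain Python. A groupby aggregation first reduces each partition on its own and then combines the partial results. With few distinct keys the partials are tiny. With many distinct keys each partial is nearly as large as its partition, so the combine step has to move and merge a lot of data. This is a toy model under those assumptions, not Dask's implementation, and `make_partitions`, `partial_sums`, `combine` and `combine_cost` are hypothetical names.

```python
import random
from collections import Counter


def make_partitions(n_partitions, rows_per_partition, n_keys, seed=0):
    """Build partitions of (key, value) rows with keys drawn from n_keys groups."""
    rng = random.Random(seed)
    return [
        [(rng.randrange(n_keys), 1) for _ in range(rows_per_partition)]
        for _ in range(n_partitions)
    ]


def partial_sums(partition):
    """Per-partition aggregation: one entry per distinct key seen in the partition."""
    sums = Counter()
    for key, value in partition:
        sums[key] += value
    return sums


def combine(partials):
    """Merge the per-partition results into the final aggregation."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the values key by key
    return total


def combine_cost(n_keys):
    """Return how many partial rows reach the combine step, and the final result."""
    partitions = make_partitions(n_partitions=4, rows_per_partition=1_000, n_keys=n_keys)
    partials = [partial_sums(p) for p in partitions]
    return sum(len(p) for p in partials), combine(partials)


low_rows, low_result = combine_cost(n_keys=10)
high_rows, high_result = combine_cost(n_keys=100_000)
print(f"low cardinality:  {low_rows} partial rows to combine")
print(f"high cardinality: {high_rows} partial rows to combine")
```

In the high-cardinality case almost none of the 4,000 input rows are collapsed before the combine step. That is the situation the `shuffle=` argument used in the cells below is meant to address.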


In [None]:
#With 5 workers

In [None]:
# CPU times: user 833 ms, sys: 338 ms, total: 1.17 s
# Wall time: 3min 9s

In [14]:
%%time
ddf = ddf_pq[["id3", "v1", "v3"]]
(
    ddf.groupby("id3")
    .agg({"v1": "sum", "v3": "mean"})
    .compute()
)

CPU times: user 644 ms, sys: 269 ms, total: 912 ms
Wall time: 2min 6s


| id3 | v1 | v3 |
|---|---|---|
| id0000608844 | 156 | 45.221557 |
| id0000466449 | 252 | 56.924430 |
| id0000573987 | 151 | 50.175290 |
| id0000776204 | 177 | 48.865545 |
| id0000608718 | 190 | 52.758464 |
| ... | ... | ... |
| id0000821599 | 167 | 51.620558 |
| id0000937302 | 197 | 45.701999 |
| id0000248458 | 181 | 52.910110 |
| id0000428431 | 185 | 53.959775 |


2022-12-13 12:36:18,002 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
Traceback (most recent call last):
  File "/Users/ncclementi/mambaforge/envs/dask-tutorial/lib/python3.10/site-packages/distributed/comm/tcp.py", line 498, in connect
    stream = await self.client.connect(
  File "/Users/ncclementi/mambaforge/envs/dask-tutorial/lib/python3.10/site-packages/tornado/tcpclient.py", line 275, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/ncclementi/mambaforge/envs/dask-tutorial/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ncclementi/mambaforge/envs/dask-tutorial

In [None]:
## Using shuffle tasks is slower :/ explanation?
##CPU times: user 1.58 s, sys: 858 ms, total: 2.44 s
#Wall time: 4min 49s

In [None]:
%%time
ddf = ddf_pq[["id3", "v1", "v3"]]
(
    ddf.groupby("id3")
    .agg({"v1": "sum", "v3": "mean"}, shuffle="tasks")
    .compute()
)

In [None]:
%%time
# THERE IS A BUG, AND I CAN'T RUN THIS
# SEE https://github.com/dask/dask/issues/9754
ddf = ddf_pq[["id3", "v1", "v3"]]
(
    ddf.groupby("id3")
    .agg({"v1": "sum", "v3": "mean"}, shuffle="p2p")
    .compute()
)