## Get better at Dask DataFrames

In this lesson you will learn some good practices for working with Dask DataFrames and with data in general.


### Work close to your data

When your data lives in the cloud, it is best to run your computations close to it, which minimizes the cost of network I/O.

In this lesson, we will use Coiled clusters created in the same region where our datasets are stored (`"us-east-2"`).

**NOTE:**
If you do not have access to a Coiled cluster, you can still follow along; just make sure you use the smaller datasets (the `"0.5GB-"` ones).

## Parquet vs CSV

Most people are familiar with CSV files, but when it comes to working with data, using Parquet can make a big difference. The Parquet file format is column-oriented and designed to store and retrieve data efficiently.

**Extra reading**
You can read about the advantages of the Parquet data format in the blog post [Advantages of Parquet File Format](https://www.coiled.io/blog/parquet-file-column-pruning-predicate-pushdown).

Let's look at an example that compares reading the same data stored as `csv` files in one case and as `parquet` files in the other.

In [None]:
data = {
    "0.5GB-csv": "s3://coiled-datasets/h2o-benchmark/N_1e7_K_1e2/*.csv",
    "0.5GB-pq": "s3://coiled-datasets/h2o-benchmark/N_1e7_K_1e2_parquet/*.parquet",
    "5GB-csv": "s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2/*.csv",
    "5GB-pq": "s3://coiled-datasets/h2o-benchmark/N_1e8_K_1e2_parquet/*.parquet",
}

In [None]:
import coiled
from dask.distributed import Client
import dask.dataframe as dd

### SECTION ON HOW TO LOG IN TO COILED WHEN WE HAVE INFO

In [None]:
#cluster = coiled.Cluster(name="dask-tutorial")

In [None]:
%%time
cluster = coiled.Cluster(
    name="dask-tutorial",
    n_workers=8,
    package_sync=True,
    backend_options={"region_name": "us-east-2"},
)

## maybe use mi6 instead, the default ones are slower...

In [None]:
client = Client(cluster)
client

In [None]:
# Read the same dataset in both formats so the timing comparison is fair.
# Without a Coiled cluster, use the "0.5GB-csv" and "0.5GB-pq" keys instead.
ddf_csv = dd.read_csv(data["5GB-csv"], storage_options={"anon": True})
ddf_pq = dd.read_parquet(data["5GB-pq"], storage_options={"anon": True})

In [None]:
%%time
ddf_csv.groupby("id1").agg({"v1": "sum"}).compute()

In [None]:
%%time
ddf_pq.groupby("id1").agg({"v1": "sum"}).compute()

Notice that the `parquet` version is already about 5x faster, without any extra tuning.

Let's take a look at the dtypes in both cases and see if we can make some things faster:

In [None]:
ddf_csv

In [None]:
##IF I SPECIFY THE DTYPES THIS GETS MUCH SLOWER ??? Thoughts??

# ddf_csv = dd.read_csv(
#             data["5GB-csv"],
#             dtype={
#                 "id1": "category",
#                 "id2": "category",
#                 "id3": "category",
#                 "id4": "Int32",
#                 "id5": "Int32",
#                 "id6": "Int32",
#                 "v1": "Int32",
#                 "v2": "Int32",
#                 "v3": "float64",
#             },
#             storage_options={"anon": True},)

In [None]:
ddf_pq

In [None]:
## Example to explain column pruning.


### Read about why `read_parquet` picks up the dtypes but `read_csv` does not

- Show `ddf.partitions[0].memory_usage(deep=True).compute() / 1e6`.
- See what happens with CSV and with Parquet.

## dtypes

NOTE: 

FOR THE PURPOSE OF THE TUTORIAL I NEED TO GENERATE THE DATA FOR 5GB WITH PYARROW STRINGS.
OR TYPECAST, EXPLORE THAT.

THEN RUN 
```python
# Keep only the needed columns and cast the key to PyArrow-backed strings.
ddf_q3 = ddf_pq[["id3", "v1", "v3"]].astype({"id3": "string[pyarrow]"})
(
    ddf_q3.groupby("id3", dropna=False, observed=True)
    .agg({"v1": "sum", "v3": "mean"})
    .compute()
)
```

chat with james to see if there is anything else about pyarrow dtypes we could be showing here.

## High cardinality 

- id1 has 100 unique values
- id3 has 1_000_000 unique values

Let's see what happens when we group by a high-cardinality column, and what we can do to make this better.

Read docs about shuffle, and explain advantages, extract useful info. Ask about p2p docs?
https://docs.dask.org/en/stable/dataframe-groupby.html#shuffle-methods


In [None]:
#With 5 workers

In [None]:
# CPU times: user 833 ms, sys: 338 ms, total: 1.17 s
# Wall time: 3min 9s

In [None]:
%%time
ddf = ddf_pq[["id3", "v1", "v3"]]
(
    ddf.groupby("id3")
    .agg({"v1": "sum", "v3": "mean"})
    .compute()
)

In [None]:
## Using shuffle tasks is slower :/ explanation?
##CPU times: user 1.58 s, sys: 858 ms, total: 2.44 s
#Wall time: 4min 49s

In [None]:
%%time
ddf = ddf_pq[["id3", "v1", "v3"]]
(
    ddf.groupby("id3")
    .agg({"v1": "sum", "v3": "mean"}, shuffle="tasks")
    .compute()
)

In [None]:
%%time
# THERE IS A BUG, AND I CAN'T RUN THIS
# SEE https://github.com/dask/dask/issues/9754
ddf = ddf_pq[["id3", "v1", "v3"]]
(
    ddf.groupby("id3")
    .agg({"v1": "sum", "v3": "mean"}, shuffle="p2p")
    .compute()
)