# Intro to `dask`


A list of "better" `dask` tutorials:

* [Continuum Analytics `dask` tutorial](http://dask.github.io/dask-tutorial/introduction.html#/) (slides)
    * [Parallelizing scientific Python with `dask`](https://github.com/jcrist/dask-tutorial-pydata-seattle-2017) (the GitHub repository that goes with the slides)
* [Documentation examples/tutorials](http://dask.pydata.org/en/latest/examples-tutorials.html)


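`dask` mirrors much of the `pandas` and `numpy` API.

The snippets below are copied from [their documentation](https://dask.pydata.org/en/latest/docs.html) because they cover several features I'm not yet familiar with. In the side-by-side blocks, the left column is the `pandas`/`numpy` version and the right column is the `dask` equivalent.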

```python
import pandas as pd                     import dask.dataframe as dd
df = pd.read_csv('2015-01-01.csv')      df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean()     df.groupby(df.user_id).value.mean().compute()
```

```python
import numpy as np                       import dask.array as da
import h5py                              import h5py
f = h5py.File('myfile.hdf5')             f = h5py.File('myfile.hdf5')
x = np.array(f['/small-data'])           x = da.from_array(f['/big-data'],
                                                           chunks=(1000, 1000))
x - x.mean(axis=1)                       x - x.mean(axis=1).compute()
```
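Two details of the array snippet are easy to miss: `.compute()` binds to `x.mean(axis=1)` only, so the subtraction itself stays lazy, and `x.mean(axis=1)` drops a dimension, so it only broadcasts against `x` when the shapes happen to line up. The sketch below is one way to centre each row, assuming a small in-memory array in place of the HDF5 dataset; the data and chunk sizes are made up.

```python
import dask.array as da
import numpy as np

# Hypothetical in-memory data standing in for the HDF5 dataset.
data = np.arange(12, dtype='float64').reshape(3, 4)
x = da.from_array(data, chunks=(2, 2))  # split into blocks of at most 2x2

# keepdims=True keeps the row means as a (3, 1) column,
# so they broadcast across each row of x.
centered = x - x.mean(axis=1, keepdims=True)  # still a lazy dask array
result = centered.compute()                   # a NumPy array
```

After this, every row of `result` should sum to zero.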

```python
import json
import dask.bag as db
b = db.read_text('2015-*-*.json.gz').map(json.loads)
b.pluck('name').frequencies().topk(10, lambda pair: pair[1]).compute()
```
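The bag pipeline can be tried without any files by building the bag from an in-memory list. This is a small sketch with invented records; `db.from_sequence` replaces `db.read_text`, and the synchronous scheduler is chosen only to keep the demo single-process.

```python
import json

import dask.bag as db

# Hypothetical records standing in for lines of the gzipped JSON files.
lines = ['{"name": "alice"}', '{"name": "bob"}', '{"name": "alice"}',
         '{"name": "carol"}', '{"name": "alice"}', '{"name": "bob"}']
b = db.from_sequence(lines, npartitions=2).map(json.loads)

# frequencies() yields (name, count) pairs; topk ranks them by count.
top = (b.pluck('name')
        .frequencies()
        .topk(2, key=lambda pair: pair[1])
        .compute(scheduler='synchronous'))
```

Here `top` holds the two most frequent names together with their counts.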

```python
from dask import delayed
L = []
for fn in filenames:                  # Use for loops to build up computation
    data = delayed(load)(fn)          # Delay execution of function
    L.append(delayed(process)(data))  # Build connections between variables

result = delayed(summarize)(L)
result.compute()
```
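The `delayed` snippet relies on `filenames`, `load`, `process` and `summarize` being defined elsewhere. A runnable sketch with hypothetical stand-ins for all four (no files are actually read) might look like this:

```python
from dask import delayed

# Hypothetical stand-ins for the functions used in the snippet above.
def load(fn):
    return list(range(len(fn)))  # pretend the file holds len(fn) numbers

def process(data):
    return sum(data)

def summarize(totals):
    return max(totals)

filenames = ['a.csv', 'bb.csv', 'ccc.csv']  # made-up names

tasks = []
for fn in filenames:
    data = delayed(load)(fn)              # records the call, does not run it
    tasks.append(delayed(process)(data))  # depends on the load task

result = delayed(summarize)(tasks)  # a Delayed object; nothing has run yet
answer = result.compute()           # executes the whole graph
```

Only the final `.compute()` triggers execution, at which point `dask` can run the independent `load`/`process` chains in parallel.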

For a tour through the common operations with `dask` dataframes, see [this tutorial page](http://dask.pydata.org/en/latest/dataframe.html). 

In [None]:
import dask.dataframe as dd
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
ddf = dd.from_pandas(df, npartitions=2)
ddf._meta

In [None]:
ddf._meta.dtypes

In [None]:
ddf._meta_nonempty

In [None]:
ddf.npartitions

In [None]:
ddf.a + 2

In [None]:
(ddf.a + 2).compute()

# How to import the schema for the `cdr` table

In [None]:
DATA_PATH = '/home/asberk/data/workshop-content18/5-cloudpbx/data/cloudpbx_tbl_csv/'
SCHEMA_PATH = DATA_PATH + 'schema.txt'

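with open(SCHEMA_PATH, 'r') as fp:
    # One list of comma-separated fields per line; rstrip avoids cutting a
    # real character when the last line has no trailing newline.
    x = [line.rstrip('\n').split(',') for line in fp]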

In [None]:
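# Pick the schema row whose first field is 'cdrID'.
cdr_schema = [y for y in x if y[0] == 'cdrID'][0]
# Keep only the last two characters of that first field ('ID').
cdr_schema[0] = cdr_schema[0][-2:]
# Drop the final field.
cdr_schema = cdr_schema[:-1]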

cdr_schema
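The layout of `schema.txt` is not shown in this notebook, so the following sketch guesses at it: one table per line, with the table name fused onto the first column name and a trailing comma at the end of the line. The sample text and its column names are invented; only the `cdrID` marker and the slicing steps come from the cells above.

```python
import io

# Hypothetical schema contents (an assumption, not the real file).
sample = "accountID,name,\ncdrID,callerID,duration,\n"

with io.StringIO(sample) as fp:
    rows = [line.rstrip('\n').split(',') for line in fp]

cdr_schema = [row for row in rows if row[0] == 'cdrID'][0]
cdr_schema[0] = cdr_schema[0][-2:]  # 'cdrID' -> 'ID'
cdr_schema = cdr_schema[:-1]        # drop the empty field after the trailing comma
```

Under these assumptions, `cdr_schema` ends up as a plain list of column names for the `cdr` table.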

# How to import the `cdr` table

In [None]:
import pandas as pd
import dask.dataframe as dd
import numpy as np

In [None]:
CDR_PATH = '/home/asberk/data/workshop-content18/5-cloudpbx/data/cloudpbx_tbl_csv/cdr.csv'

In [None]:
ddf = dd.read_csv(CDR_PATH, low_memory=False, na_values='\\N', dtype={'1155': 'float64'})