* Getting started with dask
* Overview of dask features
* Data processing using dask dataframes
* Create dask dataframe using csv files
* Get the row and column count
* Overview of data processing APIs of dask dataframes
* Write data in dask dataframe to csv files
* Real world example of data processing using dask
* Exercise and Solution

In [None]:
# Getting started with dask
# python -m pip install "dask[complete]"  (quotes keep shells such as zsh from expanding the brackets)

In [None]:
# Overview of dask features
# Scale PyData workloads: NumPy with Dask Array, pandas with Dask DataFrame, scikit-learn with Dask-ML
# Scale any Python code using Dask Futures
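
The Futures interface mentioned above can be sketched as follows. This is a minimal illustration, assuming the `distributed` package is installed (it is included in `dask[complete]`). The `square` function and the in-process client settings are illustrative choices, not part of the original notebook.

```python
from dask.distributed import Client


def square(x):
    # Hypothetical stand-in for any ordinary Python function.
    return x * x


# Assumption: an in-process (threaded) local cluster is enough for a demo;
# dashboard_address=None disables the diagnostics dashboard.
client = Client(processes=False, dashboard_address=None)
try:
    futures = client.map(square, range(5))  # returns Future objects immediately
    squares = client.gather(futures)        # blocks until all results are available
finally:
    client.close()

print(squares)
```

`client.map` schedules one task per input, and `client.gather` collects the results in the same order as the inputs.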

In [None]:
# Data processing using dask dataframes
# Read data from files and other sources using the dd.read_* APIs
# Process data using pandas-like DataFrame APIs (query, apply, groupby, join, etc.)
# Write data to files and other targets using the df.to_* APIs

In [None]:
from dask import dataframe as dd

In [None]:
# dd.read_json
# df.to_json
df = dd.read_json('data/retail_db_json/departments/*')
# check the help on df.to_* APIs


In [None]:
# Create dask dataframe using csv files
df = dd.read_csv(
    'data/retail_db/departments/*', 
    names=['department_id', 'department_name']
)

In [None]:
df.compute()  # Dask DataFrames are lazily evaluated; compute() runs the task graph and returns a pandas DataFrame

In [None]:
# Get the row and column count
df.shape  # the row count is lazy (not computed yet); only the column count is concrete

In [None]:
type(df.compute())

In [None]:
df.compute().shape

In [None]:
# Overview of data processing APIs of dask dataframes
# query
# apply
# groupby
# join
# sort_values
df.query('department_id >= 3')

In [None]:
df.query('department_id >= 3').compute()

In [None]:
df.apply(lambda rec: rec['department_name'].upper(), axis=1)

In [None]:
df.apply(lambda rec: rec['department_name'].upper(), meta=(None, 'object'), axis=1).compute()

In [None]:
df.sort_values(by=['department_name']).compute()

In [None]:
# Write data in dask dataframe to csv files
df = dd.read_json('data/retail_db_json/departments/*')

In [None]:
df.compute()

In [None]:
df.to_csv('data/retail_db_csv/departments/part*.csv', index=False)

In [None]:
dd.read_csv('data/retail_db_csv/departments/part*.csv').compute()

In [None]:
# Real world example of data processing using dask
# Convert all the files under retail_db to json format
import glob
import os
import json

In [None]:
def get_schema(ds):
    with open('data/retail_db/schemas.json') as fp:
        schemas = json.load(fp)
    return [
        schema['column_name'] 
        for schema in sorted(schemas[ds], key=lambda s: s['column_position'])
    ]
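
`get_schema` assumes that `schemas.json` maps each dataset name to a list of column descriptors carrying `column_name` and `column_position`. The file's exact contents are not shown here, so the sample below is a hypothetical fragment in that shape, used only to illustrate why the sort by `column_position` matters.

```python
import json

# Hypothetical fragment; the real schemas.json may contain more keys and datasets.
sample = json.loads('''
{
  "departments": [
    {"column_name": "department_name", "column_position": 2},
    {"column_name": "department_id", "column_position": 1}
  ]
}
''')


def column_names(schemas, ds):
    # Order the descriptors by position, then keep only the names.
    ordered = sorted(schemas[ds], key=lambda s: s['column_position'])
    return [s['column_name'] for s in ordered]


print(column_names(sample, 'departments'))
```

Without the sort, the names would follow the order in the file, which may not match the column order in the CSV data.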

In [None]:
for path in glob.glob('data/retail_db/*'):
    if os.path.isdir(path):
        ds = os.path.split(path)[1]
        df = dd.read_csv(f'{path}/*', names=get_schema(ds))
        df.to_json(
            f'data/retail_demo_json/{ds}/part*.json', 
            orient='records',
            lines=True,
            name_function=lambda i: '%05d' % i
        )
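
Dask replaces the `*` in the target path with the string that `name_function` returns for each partition index. The helper below imitates that substitution in plain Python to preview the resulting file names. `preview_names` is a hypothetical helper for illustration, not a Dask API.

```python
def preview_names(pattern, npartitions, name_function):
    # Imitate the substitution: one output path per partition index.
    return [pattern.replace('*', name_function(i)) for i in range(npartitions)]


names = preview_names(
    'data/retail_demo_json/departments/part*.json',
    3,
    lambda i: '%05d' % i,
)
print(names)
```

Zero-padding the index keeps the files in partition order when they are listed alphabetically.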

* Exercise: Convert the text files under `data/nyse_all/nyse_data` to json.
  * Source folder: `data/nyse_all/nyse_data`
  * Target folder: `data/nyse_all/nyse_json`
  * File Format: `gzip` compressed json format.
  * Column Names: `['ticker', 'trade_date', 'open_price', 'high_price', 'low_price', 'close_price', 'volume']`
  * Make sure file names follow the pattern `part-nnnnn.json.gz` (e.g. `part-00000.json.gz`)
  * Validate by using shape on both source and target locations.

In [None]:
from dask import dataframe as dd

In [None]:
df = dd.read_csv(
    'data/nyse_all/nyse_data/*',
    names=['ticker', 'trade_date', 'open_price', 'high_price', 'low_price', 'close_price', 'volume'],
    blocksize=None
)

In [None]:
df.head()

In [None]:
df.query('volume > 0').head()

In [None]:
df.compute().shape

In [None]:
df.to_json(
    'data/nyse_all/nyse_json/part-*.json.gz',
    orient='records',
    lines=True,
    compression='gzip',
    name_function=lambda i: '%05d' % i
)

In [None]:
dd.read_json(
    'data/nyse_all/nyse_json/part-*.json.gz',
    lines=True,
    blocksize=None
).compute().shape
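
As a cross-check that does not depend on Dask, the record count in the target folder can be obtained by counting lines in the gzip-compressed JSON Lines files. The sketch below writes two small made-up files to a temporary directory so that it runs standalone. `count_json_lines` is a hypothetical helper.

```python
import glob
import gzip
import json
import os
import tempfile


def count_json_lines(pattern):
    # Count non-empty lines across all gzip files matching the glob pattern.
    total = 0
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, 'rt', encoding='utf-8') as fp:
            total += sum(1 for line in fp if line.strip())
    return total


with tempfile.TemporaryDirectory() as tmp:
    # Made-up records, written as one JSON object per line.
    rows = [
        {'ticker': 'AAA', 'volume': 100},
        {'ticker': 'BBB', 'volume': 200},
        {'ticker': 'CCC', 'volume': 300},
    ]
    for i, chunk in enumerate([rows[:2], rows[2:]]):
        target = os.path.join(tmp, 'part-%05d.json.gz' % i)
        with gzip.open(target, 'wt', encoding='utf-8') as fp:
            for row in chunk:
                fp.write(json.dumps(row) + '\n')

    n_records = count_json_lines(os.path.join(tmp, 'part-*.json.gz'))

print(n_records)
```

Pointing the same helper at `data/nyse_all/nyse_json/part-*.json.gz` should give a number matching the row count reported by `shape`.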