<img src="images/dask_horizontal.svg"
     width="45%"
     alt="Dask logo">
     
# Scaling your data work with Dask

## PyData Global 2020

### Materials & setup

- Tutorial materials available at https://github.com/coiled/pydata-global-dask
- Two ways to go through the tutorial:
    1. Run locally on your laptop
    2. Run using Binder (no setup required)

### About the speakers

- **[James Bourbeau](https://www.jamesbourbeau.com/)**: Dask maintainer and Software Engineer at [Coiled](https://coiled.io/).
- **[Hugo Bowne-Anderson](http://hugobowne.github.io/)**: Head of Data Science Evangelism and Marketing at [Coiled](https://coiled.io/).

# Overview

## What is Dask?

Dask is a flexible, open source library for parallel computing in Python.

- Documentation: https://docs.dask.org
- GitHub: https://github.com/dask/dask

It is designed to scale the existing Python ecosystem.

## Why Dask?

- Enables parallel and larger-than-memory computations
- Scales the existing Python ecosystem
    - Uses APIs you already know from projects like NumPy, pandas, and scikit-learn
    - Allows you to scale existing workflows with minimal code changes
- Dask works on your laptop, but also scales out to large clusters
- Offers great built-in diagnostic tools

<img src="images/dask-components.svg"
     width="85%"
     alt="Dask components">
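
As a rough illustration of the "familiar APIs" point, here is a minimal sketch using `dask.array`, which mirrors much of the NumPy interface. The array shape and chunk size are arbitrary values chosen for this example and do not come from the tutorial materials.

```python
import dask.array as da

# A 1000 x 1000 array of ones, split into 250 x 250 chunks
# (sizes are illustrative assumptions, not tuned values).
x = da.ones((1000, 1000), chunks=(250, 250))

# NumPy-style expression. This only builds a task graph;
# no computation has happened yet.
total = (x + x.T).sum()

# compute() executes the graph and returns a concrete value.
result = total.compute()
print(result)
```

Each chunk can be processed independently, which is what lets the same NumPy-style code run in parallel or on arrays too large to hold in memory at once.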

# Dask in action!

In [None]:
# Sets up Dask's distributed scheduler
from dask.distributed import Client

client = Client()
client

In [None]:
# Perform Pandas-like operations
import dask.dataframe as dd

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
    },
    storage_options={"anon": True},
    blocksize="10 MiB",
).persist()

df.groupby("passenger_count").tip_amount.mean().compute()

In [None]:
client.close()

## Next step

Let's cover our first Dask collection: the `delayed` interface.