<img src="images/dask_horizontal.svg"
     width="45%"
     alt="Dask logo">
     
# Scaling your data work with Dask

## PyData Global 2020

### Materials & setup

- Tutorial materials available at https://github.com/coiled/pydata-global-dask
- Two ways to go through the tutorial:
    1. Run locally on your laptop
    2. Run using Binder (no setup required)

### About the speakers

- **[James Bourbeau](https://www.jamesbourbeau.com/)**: Dask maintainer and Software Engineer at [Coiled](https://coiled.io/).
- **[Hugo Bowne-Anderson](http://hugobowne.github.io/)**: Head of Data Science Evangelism and Marketing at [Coiled](https://coiled.io/).

# Overview

Dask is a flexible, open-source library for parallel computing in Python.

- Documentation: https://docs.dask.org
- GitHub: https://github.com/dask/dask

At a high level, Dask:

- Enables parallel and larger-than-memory computations
- Scales the existing Python ecosystem
    - Uses familiar APIs from projects like NumPy, Pandas, and scikit-learn
    - Allows you to scale existing workflows with minimal code changes
- Works on your laptop, but also scales out to large clusters
- Offers great built-in diagnostic tools
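
As a rough illustration of the points above, the sketch below uses `dask.array` to run a NumPy-style computation over an array split into chunks. The array size and chunk size are arbitrary values picked for this example and do not come from the tutorial materials.

```python
import numpy as np
import dask.array as da

# Illustrative data: one million floats (size chosen arbitrarily)
x_np = np.arange(1_000_000, dtype="float64")

# Wrap the NumPy array in a Dask array made of 10 chunks of 100,000 elements
x_da = da.from_array(x_np, chunks=100_000)

# Same syntax as NumPy, but this only builds a task graph
total_lazy = (x_da ** 2).sum()

# .compute() executes the graph, processing chunks in parallel where possible
total = total_lazy.compute()

# The result matches the plain NumPy computation
print(np.isclose(total, (x_np ** 2).sum()))
```

The only Dask-specific steps here are choosing a chunk size and calling `.compute()`; the arithmetic itself is written exactly as it would be with NumPy.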

<img src="images/dask-components.svg"
     width="85%"
     alt="Dask components">

# Dask in action!

In [None]:
# Sets up Dask's distributed scheduler
from dask.distributed import Client

client = Client()
client

In [None]:
# Run this cell to download the NYC flights dataset
%run prep.py -d flights

In [None]:
# Perform Pandas-like operations
import os
import dask.dataframe as dd

df = dd.read_csv(os.path.join("data", "nycflights", "*.csv"),
                 parse_dates={"Date": [0, 1, 2]},
                 dtype={"TailNum": str,
                        "CRSElapsedTime": float,
                        "Cancelled": bool})

df.groupby("Origin").DepDelay.mean().compute()

## Tutorial goals

The goal of this tutorial is to cover the basics of Dask. Attendees should walk away with an understanding of what Dask offers, how it works, and ideas for how Dask can help them effectively scale their own data-intensive workloads.

The tutorial consists of several Jupyter notebooks which contain explanatory material on how Dask works. Specifically, the notebooks presented cover the following topics:

- [Dask Delayed](1-delayed.ipynb)
- [Dask DataFrame](2-dataframe.ipynb)
- [Schedulers](3-schedulers.ipynb)

Each notebook also contains hands-on exercises to illustrate the concepts being presented. Let's look at our first example to get a sense of how they work.

### Exercise: Print `"Hello world!"`

Use Python to print the string "Hello world!" to the screen.

In [None]:
# Your solution here

In [None]:
# Run this cell to see a solution
%load solutions/overview.py

Note that several of the examples here have been adapted from the Dask tutorial at https://tutorial.dask.org.

## Next step

Let's start by covering our first Dask collection, the `dask.delayed` interface, in the [Dask Delayed notebook](1-delayed.ipynb).