# Time for a Test Drive!

You've spent some time walking around the Dascar lot, hearing about all the awesome features and specs...

That's enough talk! Let's jump into a racecar and see what it can do!

![](racecar.png "Title")

## Dask DataFrames

The pandas car...with the Dask engine!

In [None]:
import dask.dataframe as dd

In [None]:
%run ../prep_data.py -d flights

In [None]:
import os

files = os.path.join('../data', 'nycflights', '*.csv')
files

In [None]:
df = dd.read_csv(files,
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={"TailNum": str,
                        "CRSElapsedTime": float,
                        "Cancelled": bool})

In [None]:
df.head()

In [None]:
%%time
df.groupby("Origin")["DepDelay"].mean().compute()

### A slight difference from pandas
Notice the `.compute()` call. It is needed because Dask uses **lazy evaluation**: operations like `groupby` and `mean` only describe the work to be done, and nothing actually runs until you call `.compute()`.

If you haven't heard about lazy evaluation before, [this metaphor](https://app.excalidraw.com/s/96tGQEGIZ4c/4ivm6sRL2Kq) might help.
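
For a hands-on feel, here is a toy sketch of lazy evaluation in plain Python. It is not how Dask is implemented: the `Lazy` class, its `compute` method, and the sample `delays` values are hypothetical, chosen only to mirror the `.compute()` pattern above. Building the object records the recipe; the work happens only when `compute()` is called.

```python
class Lazy:
    """Toy stand-in for a lazy result: stores a function and its arguments."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any lazy arguments first, then call the stored function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records when real work actually happens


def mean(values):
    calls.append("mean")
    return sum(values) / len(values)


delays = [4.0, 10.0, 1.0]        # made-up sample data
lazy_mean = Lazy(mean, delays)   # only the recipe exists so far
work_done_before = len(calls)    # still 0: mean() has not been called
result = lazy_mean.compute()     # now mean() actually runs
```

Deferring the work like this is what gives a scheduler the chance to look at the whole recipe before deciding how to run it.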

In [None]:
df

## Dask Arrays

The NumPy car...with Dask engine superpowers!

In [None]:
import dask.array as da

In [None]:
array = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

In [None]:
array

In [None]:
array[:10, :5]

In [None]:
array[:10, :5].compute()

In [None]:
%%time
array.sum(axis=1).compute()
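
Why do chunks help? A reduction like a row sum can be done one chunk at a time and the partial results combined afterwards. The sketch below shows that idea with plain Python lists. It is a simplified illustration, not Dask's actual algorithm, and the helpers `split_columns` and `chunk_row_sums` are hypothetical names.

```python
def split_columns(matrix, width):
    """Split a matrix (list of rows) into column blocks of the given width."""
    n_cols = len(matrix[0])
    return [[row[start:start + width] for row in matrix]
            for start in range(0, n_cols, width)]


def chunk_row_sums(chunk):
    """Row sums for a single chunk."""
    return [sum(row) for row in chunk]


matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8]]

blocks = split_columns(matrix, width=2)
# Each chunk is independent, so these partial sums could run in parallel.
partials = [chunk_row_sums(block) for block in blocks]
# Combine step: add up the partial row sums across chunks.
blocked = [sum(parts) for parts in zip(*partials)]
direct = [sum(row) for row in matrix]
```

Both `blocked` and `direct` hold the same row sums; the chunked route just splits the work into pieces that never need the whole matrix at once.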

## Dask ML

The scikit-learn car with (you guessed it) Dask rocket fuel!

In [None]:
from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification

In [None]:
X, y = make_classification(n_samples=1_000, chunks=50)

In [None]:
X

In [None]:
y

In [None]:
lr = LogisticRegression()

In [None]:
%%time
lr.fit(X, y)

In [None]:
%%time
predictions = lr.predict(X).compute()

In [None]:
lr.score(X, y).compute()

# For the Mechanics in the Room

Dask's lower-level APIs give you even more flexibility and control over what to parallelize in your custom Python code, and how.

## Parallelize Python Code with `dask.delayed`

In [None]:
from time import sleep

def inc(x):
    """Increments x by one"""
    sleep(1)
    return x + 1

def add(x=0, y=0, z=0):
    """Adds x and y and z"""
    sleep(1)
    return x + y + z

In [None]:
%%time

x = inc(1) # takes 1 second
y = inc(2) # takes 1 second
z = add(x, y) # takes 1 second

In [None]:
z

In [None]:
from dask import delayed

In [None]:
%%time

a = delayed(inc)(1)
b = delayed(inc)(2)
c = delayed(add)(a, b)

In [None]:
c

In [None]:
a.visualize()

In [None]:
b.visualize()

In [None]:
c.visualize()

In [None]:
d = delayed(inc)(3)

In [None]:
c = delayed(add)(a, b, d)

In [None]:
c.visualize()

In [None]:
%%time
c.compute()

![](task-graph.png)
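
The speed-up comes from the shape of the task graph: the `inc` calls do not depend on each other, while `add` needs all of their results. As a rough analogy only (Dask's scheduler is more sophisticated than this), the same structure can be run by hand with the standard library's `ThreadPoolExecutor`. The sleeps are shortened to 0.1 seconds here to keep the sketch quick.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep


def inc(x):
    """Increments x by one"""
    sleep(0.1)  # shortened from 1 second for this sketch
    return x + 1


def add(x=0, y=0, z=0):
    """Adds x and y and z"""
    sleep(0.1)
    return x + y + z


start = perf_counter()
with ThreadPoolExecutor(max_workers=3) as pool:
    # The three inc calls are independent, so they can overlap in time.
    futures = [pool.submit(inc, n) for n in (1, 2, 3)]
    a, b, d = (f.result() for f in futures)
# add depends on all three results, so it has to wait for them.
total = add(a, b, d)
elapsed = perf_counter() - start
```

Run one after another, the four calls would need about four sleeps; with the independent calls overlapping, `elapsed` should come out closer to two.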

## Dask Cluster on Coiled

In [None]:
import coiled

In [None]:
cluster = coiled.Cluster(
    name="dask-tutorial", 
    n_workers=20, 
    worker_memory='25GiB',
    software="rrpelgrim/dask-mini-tutorial",
    scheduler_options={'idle_timeout':'3 hours'}, # default is 20min
    shutdown_on_close=False,
    backend_options={'spot': True},
)

In [None]:
from distributed import Client

client = Client(cluster)
client

In [None]:
import dask.dataframe as dd

In [None]:
df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-*.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    dtype={
        "payment_type": "UInt8",
        "VendorID": "UInt8",
        "passenger_count": "UInt8",
        "RatecodeID": "UInt8",
        "store_and_fwd_flag": "category",
        "PULocationID": "UInt16",
        "DOLocationID": "UInt16",
    },
    storage_options={"anon": True},
    blocksize="16 MiB",
)

In [None]:
df

In [None]:
%%time
df.groupby("passenger_count").tip_amount.mean().compute()