# Scaling, Performance, and Memory

In this notebook we will work with a multi-machine cluster running in the cloud.  We will tune the performance of a workflow that enables interactive visualization, and learn how to measure and improve performance in a distributed context.  We'll make some pretty images too.


In [None]:
import datashader
import dask
import coiled

## Request Dask Cluster

There are many services to create Dask clusters in the cloud.  Today we'll use [Coiled](https://coiled.io).

Starting the cluster should take a couple of minutes.

In [None]:
import coiled

cluster = coiled.Cluster(
    n_workers=10,
    package_sync=True,
    # account="...",
)

from dask.distributed import Client, wait
client = Client(cluster)

client
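
The `wait` function imported above blocks until persisted work has finished. That matters when timing cluster operations, because calls such as `persist()` return before the computation completes. A minimal sketch of a timing helper follows; the name `timed` and the `timings` dictionary are illustrative assumptions, not part of Dask or Coiled, and the summation is a stand-in workload so the cell runs without a cluster.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep timings for later comparison
        print(f"{label}: {elapsed:.3f} s")


timings = {}
with timed("stand-in workload", timings):
    total = sum(range(1_000_000))  # placeholder for real cluster work
```

On the cluster, the same pattern could wrap a `wait(...)` call on a persisted collection, so the measured time covers the full computation and not only task submission.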

## Include pickup and dropoff locations

So far we've only been looking at one of these two datasets.  Now we'll look at both together. 

We now apply the lessons learned so far to set this up for interactive visualization at scale.

We'll be visualizing and interacting with more than a billion points now.

You don't need to change anything; just execute these cells and play with the plot at the bottom.

In [None]:
# Read in one year of NYC Taxi data

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009",
    storage_options={"anon": True},
)


In [None]:
df = df[["dropoff_longitude", "dropoff_latitude", "pickup_longitude", "pickup_latitude"]]

# Clean data: keep only trips that start and end inside a bounding box around New York City
df = df.loc[
    (df.dropoff_longitude > -74.1) & (df.dropoff_longitude < -73.7) & 
    (df.dropoff_latitude > 40.6) & (df.dropoff_latitude < 40.9) &
    (df.pickup_longitude > -74.1) & (df.pickup_longitude < -73.7) &
    (df.pickup_latitude > 40.6) & (df.pickup_latitude < 40.9)
]


In [None]:
import pandas as pd

# Reshape to long format: one row per pickup or dropoff point, with shared
# "long"/"lat" columns and a "journey_type" label for the kind of point.
df_dropoff = df[["dropoff_longitude", "dropoff_latitude"]].rename(
    columns={"dropoff_longitude": "long", "dropoff_latitude": "lat"}
).assign(journey_type="dropoff")
df_pickup = df[["pickup_longitude", "pickup_latitude"]].rename(
    columns={"pickup_longitude": "long", "pickup_latitude": "lat"}
).assign(journey_type="pickup")
df = dd.concat([df_dropoff, df_pickup])

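# Store the label as a categorical so each row carries a small integer code
# instead of a Python string.
pickup_dropoff = pd.CategoricalDtype(categories=["pickup", "dropoff"])
df = df.astype({"journey_type": pickup_dropoff})

# Rebalance into partitions of roughly 256 MiB and keep the result in cluster memory.
df = df.repartition(partition_size="256MiB").persist()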
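
To get a feel for what a 256 MiB partition size implies, the arithmetic can be sketched as follows. The 17 bytes per row (two float64 coordinates plus a one-byte categorical code) is an assumption that ignores the index and any pandas overhead, and the one-billion row count is the round figure quoted earlier, not a measured value.

```python
# Assumption: two float64 coordinates (8 bytes each) plus a 1-byte category code.
BYTES_PER_ROW = 8 + 8 + 1
PARTITION_BYTES = 256 * 2**20  # 256 MiB
N_WORKERS = 10                 # matches n_workers in the cluster request


def rows_per_partition(partition_bytes=PARTITION_BYTES, bytes_per_row=BYTES_PER_ROW):
    """Approximate number of rows that fit in one partition."""
    return partition_bytes // bytes_per_row


def partitions_needed(n_rows, partition_bytes=PARTITION_BYTES, bytes_per_row=BYTES_PER_ROW):
    """Number of partitions needed to hold n_rows, rounded up."""
    total_bytes = n_rows * bytes_per_row
    return -(-total_bytes // partition_bytes)  # integer ceiling division


n_rows = 1_000_000_000  # round figure for illustration
n_partitions = partitions_needed(n_rows)
print("rows per partition:", rows_per_partition())
print("partitions:", n_partitions)
print("partitions per worker:", n_partitions / N_WORKERS)
```

Under these assumptions the persisted data comes to about 17 GB in total, or roughly 1.7 GB per worker, which is one way to check that the cluster has enough memory before calling `persist()`.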

In [None]:
import datashader
import hvplot.dask
import holoviews as hv
hv.extension('bokeh')

color_key = {'pickup': "#e41a1c", 'dropoff': "#377eb8"}

df.hvplot.scatter(
    x="long", 
    y="lat", 
    aggregator=datashader.by("journey_type"), 
    datashade=True, 
    cnorm="eq_hist",
    width=700,
    aspect=1.33, 
    color_key=color_key
)
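
Datashader keeps this plot responsive by reducing the points to a fixed-size grid of per-category counts before anything is drawn, so the amount of data sent to the browser depends on the image size and not on the number of rows. The following pure-Python sketch illustrates that idea only. It is not datashader's implementation, `bin_points` is a hypothetical helper, and the tiny 4 x 3 grid and sample points are made up for illustration.

```python
from collections import Counter

# Same bounding box as the cleaning step above.
X_RANGE = (-74.1, -73.7)
Y_RANGE = (40.6, 40.9)


def bin_points(points, width, height):
    """Count points per (row, col, category) cell of a width x height grid."""
    x0, x1 = X_RANGE
    y0, y1 = Y_RANGE
    counts = Counter()
    for lon, lat, kind in points:
        if not (x0 <= lon < x1 and y0 <= lat < y1):
            continue  # skip points outside the viewport
        # Map the coordinate to a pixel index, clamped to guard against rounding.
        col = min(int((lon - x0) / (x1 - x0) * width), width - 1)
        row = min(int((lat - y0) / (y1 - y0) * height), height - 1)
        counts[(row, col, kind)] += 1
    return counts


sample = [
    (-74.05, 40.65, "pickup"),
    (-74.05, 40.65, "dropoff"),
    (-74.04, 40.66, "pickup"),
    (-73.75, 40.85, "dropoff"),
    (-75.00, 40.70, "pickup"),  # outside the bounding box, ignored
]
grid = bin_points(sample, width=4, height=3)
for cell, n in sorted(grid.items()):
    print(cell, n)
```

However many points go in, the result never holds more than `width * height * 2` counts, which is why zooming and panning can stay interactive: each interaction triggers a new aggregation on the cluster and only a small image comes back.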