# Dask dataframes on a cluster

The aim of this afternoon's session is to show you how to go smoothly from 'a pandas dataframe I can handle on my computer' to 'a huuuge dataframe that I can handle on a cluster of X computers using dask'.

[Dask](http://dask.pydata.org) is a library which provides advanced parallelism for analytics using familiar Python APIs like [pandas](https://pandas.pydata.org), [numpy](https://numpy.org) and [scikit-learn](https://scikit-learn.org).

We'll take a look at how we can scale the groupby/apply approaches we learnt this morning to a bigger dataframe on a cluster.

Note that you actually need to have a cluster running for this to work. I've got some basic instructions for spinning up a cluster in Google Cloud in `../handouts/running_dask_gloud.md`, but there are a lot of concepts to follow to get this running. If you want to try this locally on your own computer instead, you can just install dask and distributed using conda (either in Anaconda Navigator or on the command line with `conda install dask distributed`).
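If you go the local route, a minimal sketch looks something like the cell below. It assumes dask and distributed are installed, and the particular settings (one worker, two threads, no dashboard) are just my choices for a laptop-sized test, not something the cloud setup requires.

```python
from dask.distributed import Client, LocalCluster

# Assumption: a tiny in-process cluster is enough for experimenting.
# processes=False runs the worker as threads inside this Python process,
# and dashboard_address=None switches the diagnostics dashboard off.
cluster = LocalCluster(n_workers=1, threads_per_worker=2,
                       processes=False, dashboard_address=None)
client = Client(cluster)

# Submit a trivial task to check that the scheduler and worker respond.
future = client.submit(sum, [1, 2, 3])
result = future.result()
print(result)

# Ask the scheduler how many workers it knows about.
n_workers = len(client.scheduler_info()['workers'])

# Shut everything down again when finished.
client.close()
cluster.close()
```

On a real cluster you would pass the scheduler's address to `Client` instead of building a `LocalCluster` yourself.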

In [1]:
import gcsfs

filesys = gcsfs.GCSFileSystem()
filesys.ls('core-skills-nyc-taxi/2017')

['core-skills-nyc-taxi/2017/green_tripdata_2017-07.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-02.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-04.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-01.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-06.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-09.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-03.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-05.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-10.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-08.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-12.csv',
 'core-skills-nyc-taxi/2017/green_tripdata_2017-11.csv']

This data is too large to fit into pandas on a single computer. However, it can fit in memory if we break it up into many small pieces and load these pieces onto different computers across a cluster.

We connect a client to our Dask cluster, composed of one centralized dask-scheduler process and several dask-worker processes running on each of the machines in our cluster.

In [2]:
from dask.distributed import Client, progress

client = Client()
client

Client: Scheduler: tcp://analytics-dask-scheduler:8786, Dashboard: http://analytics-dask-scheduler:8787/status
Cluster: Workers: 0, Cores: 0, Memory: 0 B


We can use dask to parse the CSVs into a dataframe which looks and feels like a dataframe on our machine but is really being stored on the cluster.

In [7]:
import dask.dataframe as dd

df = dd.read_csv('gcs://core-skills-nyc-taxi/2017/green_tripdata_2017-*.csv',
                 parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime'])
df = client.persist(df)

In [None]:
df.head()

In [None]:
len(df)

In [None]:
df.groupby(df.passenger_count).trip_distance.mean().compute()

In [None]:
df2 = df[(df.tip_amount > 0) & (df.fare_amount > 0)]    # filter out bad rows
df2['tip_fraction'] = df2.tip_amount / df2.fare_amount  # make new column

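# The green taxi files use the lpep_ prefix for their datetime columns
# (see the parse_dates argument to read_csv above).
# Both results are lazy: nothing runs until .compute() is called on them.
dayofweek = (df2.groupby(df2.lpep_pickup_datetime.dt.dayofweek)
                .tip_fraction
                .mean())
hour      = (df2.groupby(df2.lpep_pickup_datetime.dt.hour)
                .tip_fraction
                .mean())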

In [None]:
# Sort the data by pickup time so that time-based lookups are fast
df = client.persist(df.set_index('lpep_pickup_datetime'))

In [None]:
df = df.astype({'VendorID': 'uint8',
                'passenger_count': 'uint8',
                'RateCodeID': 'uint8',
                'payment_type': 'uint8'})

df.to_parquet('gcs://core-skills-nyc-taxi/2017/green_tripdata.',
              compression='snappy',
              has_nulls=False,
              object_encoding='utf8',
              fixed_text={'store_and_fwd_flag': 1})
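The `astype` step in the cell above is there to shrink the small integer columns before writing them out. As a rough illustration of the saving, here is a plain pandas sketch on made-up data. It assumes the columns only hold whole numbers between 0 and 255 with no missing values, which is what `uint8` requires.

```python
import numpy as np
import pandas as pd

# Hypothetical columns holding small integer codes, stored as 64-bit ints.
n = 1000
pdf = pd.DataFrame({'passenger_count': np.ones(n, dtype='int64'),
                    'payment_type': np.full(n, 2, dtype='int64')})

# Bytes used by the column data before the conversion.
before = int(pdf.memory_usage(index=False).sum())

# Convert to 8-bit unsigned integers: one byte per value instead of eight.
small = pdf.astype({'passenger_count': 'uint8', 'payment_type': 'uint8'})
after = int(small.memory_usage(index=False).sum())

print(before, after)
```

Check the range of a real column before converting, because a value that does not fit in `uint8` will not survive the cast intact.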