<a href="https://colab.research.google.com/github/rcameronc/meetings/blob/main/february_22_Extending_Pandas_With_Dask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<td> <img src="https://sdzwildlifeexplorers.org/sites/default/files/2017-07/pandas-closeup.jpg" alt="Drawing" style="width: 250px;"/> </td>

## Extending Pandas with Dask

Feb 22nd, 2023

by [Roger Creel](https://rogercreel.com) for the [Columbia Data Club](https://github.com/columbia-data-club/).

This notebook underpins a ~60-75 minute presentation that demonstrates how to use Pandas with Dask.  It is geared towards complete beginners to Python and to programming. 


# **Extending Pandas with Dask**
*Presented by Columbia University Libraries*
***

Welcome to Columbia University Libraries' Extending Pandas with Dask course! These are our objectives:

* Review Pandas as a tool for data analysis
* Explore how to use Dask dataframes as an extension for Pandas
* Discuss statistics and visualization capabilities of Dask + Pandas
* Answer questions! 






## **Getting started**
### *(with Google Colab)*

Topics to be covered:
1. What is Python?
2. Why does it matter?
3. How can you use Python? (IDEs, notebooks, terminal, Colab)
4. What are packages and why do we need them?
5. Basic familiarity with Colab (shareability, power)
6. Pitfalls of using Colab
7. Why pandas? Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


In [None]:
# Run these once per session; there is no need to rerun them, even if the kernel dies
!python -m pip install 'fastparquet'
!python -m pip install "dask[complete]"



In [1]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import statsmodels as sm
import requests
import pyarrow
import os
import numpy as np
import dask
import dask.dataframe as dd

We'll first download two years of data (2019 and 2020) from the New York Taxi & Limousine Commission (TLC) [Trip Record Data website](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [None]:
# Make a directory for our data if it doesn't exist
path = './taxi_data_parquet/'
os.makedirs(path, exist_ok=True)

months = [str(i).zfill(2) for i in range(1, 13)]
years = ['2019', '2020']

# Download one parquet file per month of yellow taxi trip data
for year in years:
  print(year)
  for month in months:
    print(month)
    url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{year}-{month}.parquet"

    r = requests.get(url) # create HTTP response object
    r.raise_for_status() # stop early if the download failed
    with open(path + f"yellow_tripdata_{year}-{month}.parquet", 'wb') as f:
      f.write(r.content) # save the raw bytes to disk

# Open one parquet file with pandas to check the download
df = pd.read_parquet(path + 'yellow_tripdata_2020-01.parquet')
df.head()
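
The download loop builds one URL per month. As a minimal sketch, the same URLs and local file paths can be generated up front, which makes it easy to check what will be fetched before any request is sent. The helper name `monthly_files` is hypothetical; the URL pattern is the one used in the cell above.

```python
from itertools import product
from pathlib import Path

BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"

def monthly_files(years, data_dir="taxi_data_parquet"):
    """Return (url, local_path) pairs for every month of the given years."""
    pairs = []
    for year, month in product(years, range(1, 13)):
        name = f"yellow_tripdata_{year}-{month:02d}.parquet"
        pairs.append((f"{BASE_URL}/{name}", Path(data_dir) / name))
    return pairs

files = monthly_files(["2019", "2020"])
print(len(files))   # number of monthly files
print(files[0][0])  # first URL in the list
```

Nothing is downloaded here; the list could be handed to a loop like the one above.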

We just downloaded a lot of files locally in the format TLC provides them: Apache Parquet.  Let's check how much data we have using the shell command `du -sh` (disk usage, total size, human readable):

In [None]:
%%bash
cd taxi_data_parquet

du -sh

How many files do we have? Let's find out using the `find` and `wc -l` (count lines) commands, connected via a `|` pipe:

In [None]:
%%bash
cd taxi_data_parquet
find . -type f | wc -l
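
The same two checks can be done from Python. This is a small sketch using only the standard library; the function name `summarize_dir` is made up for illustration, and the demo runs on a temporary directory so it works anywhere.

```python
import tempfile
from pathlib import Path

def summarize_dir(directory):
    """Return (number of files, total size in bytes) for a directory tree."""
    files = [p for p in Path(directory).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)

# Demonstrate on a throwaway directory with two small fake files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.parquet").write_bytes(b"x" * 100)
    (Path(tmp) / "b.parquet").write_bytes(b"x" * 50)
    n_files, n_bytes = summarize_dir(tmp)

print(n_files, n_bytes)
```

Calling `summarize_dir('taxi_data_parquet')` would report on the downloaded taxi data instead.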

We don't want to load all those files separately.  That would be a nightmare! We want to work with them all at once.  Dask makes that easy.  First let's get our filenames:

In [None]:
from glob import glob
filenames = sorted(glob('taxi_data_parquet/yellow_tripdata*.parquet'))
print(filenames[:3])
print(filenames[-3:])


Let's open one of them with pandas to remind ourselves what they look like.

In [None]:
df_one = pd.read_parquet(filenames[0])
df_one

More than seven million entries!  That's a lot, and only for one month.  If we want to look at two years of data, that would be 100+ million taxi rides.  That is too many to load into memory and do computations on.  Thankfully, Dask DataFrames are going to come to our rescue.
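
A back-of-the-envelope estimate shows why memory becomes a problem. The sketch below assumes 19 columns and 8 bytes per value; both numbers are assumptions for illustration, and real usage depends on the actual column types.

```python
def estimate_gb(n_rows, n_cols, bytes_per_value=8):
    """Rough in-memory size in gigabytes, assuming fixed-width numeric columns."""
    return n_rows * n_cols * bytes_per_value / 1e9

one_month = estimate_gb(7_000_000, 19)
many_months = estimate_gb(100_000_000, 19)
print(f"one month: ~{one_month:.1f} GB, 100 million rides: ~{many_months:.1f} GB")
```

Under these assumptions a single month already takes about a gigabyte, and 100 million rides would need more RAM than a free Colab session typically offers.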

In [None]:
# we'll use just the first half of 2020 because we have limited RAM
# (files 0-11 are 2019, so indices 12-17 are January-June 2020)
filenames_fewer = filenames[12:18]

# Parquet files already store the pickup and dropoff columns as timestamps,
# so no date parsing is needed when reading them
df = dd.read_parquet(filenames_fewer,
                     engine='fastparquet',
                     )
df

In [None]:
df_orig_len = len(df)
df_orig_len

What do you see? Let's look at the first rows to check that they look like the pandas DataFrame we opened earlier.

In [None]:
### Get the first 5 rows
df.head(5)

These columns have long, hard-to-read names.  We'll want to make them shorter and drop the less useful ones.

## **Refresher on Python data structures**

But before we do that, let's review a few key data structures in Python and pandas:

In [None]:
# Lists
listex = [1, 2, 3, 4, "python", "makes", "rory", "roar"]
listex[1]
listex[0:5]

# Iterable

In [None]:
# Dictionaries
dictex = {"rory" : "the lion", "columbia": "the university", "founded": 1754}
dictex["founded"]
dictex["rory"]

# Iterable
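
Both structures are iterable, which means a `for` loop can step through them one item at a time. A small sketch using the same example values:

```python
listex = [1, 2, 3, 4, "python", "makes", "rory", "roar"]
dictex = {"rory": "the lion", "columbia": "the university", "founded": 1754}

# Looping over a list yields its elements in order.
words = []
for item in listex:
    if isinstance(item, str):
        words.append(item)

# Looping over a dictionary's .items() yields (key, value) pairs.
pairs = []
for key, value in dictex.items():
    pairs.append(f"{key} -> {value}")

print(words)
print(pairs)
```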

Now let's clean up our data.  

In [None]:
df.columns

We don't need all the columns, and the columns we do need have cumbersome names.  Let's fix that. 

In [None]:
columns = {
          'tpep_pickup_datetime':'pickup',
          'tpep_dropoff_datetime':'dropoff',
          'passenger_count':'passengers', 
          'trip_distance':'distance', 
          'payment_type':'payment_type',
          'fare_amount':'fare',
          'extra':'extra',
          'mta_tax':'tax', 
          'tip_amount':'tip', 
          'tolls_amount':'tolls', 
          'improvement_surcharge':'improvement_surcharge',
          'total_amount':'total_fare', 
          'congestion_surcharge':'congestion_tax', 
          'airport_fee':'airport_fee'
          }

# choose only columns that are keys in dictionary
df = df[list(columns.keys())]

# rename columns by values of dictionary
df = df.rename(columns=columns)

df.head()
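
The select-then-rename pattern works the same way in plain pandas, which makes it easy to try on a tiny table first. The values below are made up for illustration, and `toy_columns` is a shortened stand-in for the dictionary above.

```python
import pandas as pd

# A tiny stand-in for the taxi table (made-up values).
toy = pd.DataFrame({
    "tpep_pickup_datetime": ["2020-01-01 00:10:00"],
    "trip_distance": [1.2],
    "VendorID": [1],
})

# Keys are the columns to keep; values are their new names.
toy_columns = {"tpep_pickup_datetime": "pickup", "trip_distance": "distance"}

# Keep only the dictionary's keys, then rename them to the dictionary's values.
toy = toy[list(toy_columns)].rename(columns=toy_columns)
print(list(toy.columns))
```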

In [None]:
# A single column of a Dask DataFrame is a Dask Series
# (the lazy counterpart of a pandas Series)
type(df["pickup"])

In [None]:
df.dtypes

In [None]:
# Descriptions
df.describe(include = "all")

Any ideas about why it broke?  

In [None]:
df.describe(include=['float64'])

We can run this, but look at the number of tasks.  What does this mean?  

![Task graph](https://www.odatis-ocean.fr/fileadmin/_processed_/a/8/csm_Pangeo_Dask_calcul_parallele_scheduler_1ae2fe84d9.png)
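
One way to picture the tasks: each partition is reduced on its own, and a final task combines the partial results. The sketch below mimics that idea for a mean using plain Python lists with made-up values. It is a conceptual illustration, not Dask's actual implementation.

```python
# Three "partitions" of passenger counts (made-up values).
partitions = [[1, 2, 1], [3, 1], [2, 2, 4, 1]]

# Step 1: reduce each partition independently (these tasks could run in parallel).
partials = [(sum(part), len(part)) for part in partitions]

# Step 2: one final task combines the partial sums and counts.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean_value = total / count
print(partials, mean_value)
```

More partitions and more columns mean more of these small tasks, which is why the task count grows quickly.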

In [None]:
df.iloc[:,0].visualize()

In [None]:
mean_passengers = df.passengers.mean()
mean_passengers.visualize() # draws the task graph without running it
# mean_passengers.compute()

In [None]:
mean_pass_inmem = mean_passengers.persist()
mean_pass_inmem.visualize()

In [None]:
mean_pass_inmem.compute()

Dask has some fast operations and some slower ones. Here are the fastest:

Element-wise operations: `df.x + df.y`, `df * df`

Row-wise selections: `df[df.x > 0]`

Loc: `df.loc[4.0:10.5]`

Common aggregations: `df.x.max(), df.max()`

Is in: `df[df.x.isin([1, 2, 3])]`

Date time/string accessors: `df.timestamp.dt.month`
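
These operations share their syntax with pandas, so they can be tried on a small pandas DataFrame first. The column names `x`, `y`, and `timestamp` are placeholders taken from the list above, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, -2, 3],
    "y": [10, 20, 30],
    "timestamp": pd.to_datetime(["2020-01-15", "2020-02-15", "2020-03-15"]),
})

total = toy.x + toy.y                # element-wise operation
positive = toy[toy.x > 0]            # row-wise selection
largest = toy.x.max()                # common aggregation
chosen = toy[toy.x.isin([1, 2, 3])]  # membership test
months = toy.timestamp.dt.month      # datetime accessor

print(total.tolist(), len(positive), largest, len(chosen), months.tolist())
```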

In [None]:
ridetime = (df.dropoff - df.pickup)
ridetime

Defining the operation was fast because Dask only built a task graph. Actually running the tasks still takes time.

In [None]:
ridetime = ridetime.loc[:10].persist()
ridetime.compute()
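
Ride times come back as timedeltas, which are awkward to average or plot. A common follow-up is to convert them to minutes. The sketch below shows this in plain pandas with two made-up rides; the same `.dt.total_seconds()` call should carry over to a Dask Series, though that is not run here.

```python
import pandas as pd

rides = pd.DataFrame({
    "pickup": pd.to_datetime(["2020-01-01 08:00:00", "2020-01-01 09:30:00"]),
    "dropoff": pd.to_datetime(["2020-01-01 08:12:30", "2020-01-01 10:00:00"]),
})

# Subtracting two datetime columns gives a timedelta column.
duration = rides.dropoff - rides.pickup

# Convert to minutes so the values are easy to filter, average, or plot.
minutes = duration.dt.total_seconds() / 60
print(minutes.tolist())
```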

### Other fast-ish (cleverly parallelizable) operations:

groupby-aggregate (with common aggregations): `df.groupby(df.x).y.max()`, `df.groupby('x').min()`

groupby-apply on index: `df.groupby(['idx', 'x']).apply(myfunc)`, where idx is the index level name

value_counts: `df.x.value_counts()`

Drop duplicates: `df.x.drop_duplicates()`

Join on index: `dd.merge(df1, df2, left_index=True, right_index=True)` or `dd.merge(df1, df2, on=['idx', 'x'])` where idx is the index name for both df1 and df2

Join with pandas DataFrames: `dd.merge(df1, df2, on='id')`

Element-wise operations with different partitions / divisions: `df1.x + df2.y`

Date time resampling: `df.resample(...)`

Rolling averages: `df.rolling(...)`

Pearson’s correlation: `df[['col1', 'col2']].corr()`
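
A few of these can be tried on a toy pandas DataFrame before running them on the full Dask DataFrame. The column names match the taxi data, but the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "payment_type": [1, 2, 1, 1, 2],
    "tip": [2.0, 0.0, 3.5, 1.5, 0.0],
})

max_tip = toy.groupby("payment_type").tip.max()    # groupby-aggregate
counts = toy.payment_type.value_counts()           # value_counts
unique_types = toy.payment_type.drop_duplicates()  # drop duplicates

print(max_tip.to_dict(), counts.to_dict(), unique_types.tolist())
```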

In [5]:
df["pickup_day"] = df["pickup"].dt.day
df["pickup_year"] = df["pickup"].dt.year
df["pickup_month"] = df["pickup"].dt.month

In [None]:
# Start with a simple histogram

# Count rides by day of the month
df_day_cnt = df.groupby("pickup_day")[["pickup"]].count().persist()
df_day_cnt.compute().plot(kind='bar')

In [None]:
df[['distance', 'fare']].corr().compute()

Why would this correlation be so low?  

In [6]:
# Drop rides with implausible dates, fares, distances, or passenger counts
df = df[df['pickup_year'] > 2018]
df = df[(df['fare'] > 2.50) & (df['fare'] < 500)]
df = df[(df['distance'] > 0.05) & (df['distance'] < 500)]
df = df[df["passengers"] > 0]

Did dropping these bad values improve correlations between, for instance, distance and fare?


In [None]:
df[['distance', 'fare']].corr().compute()
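
A handful of bad records can distort a correlation badly. The sketch below uses made-up rides and a hand-written Pearson function to show the effect of a single implausible record. The numbers are illustrative and are not taken from the taxi data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up rides where fare grows linearly with distance.
distance = [1.0, 2.0, 3.0, 4.0, 5.0]
fare = [5.0, 7.5, 10.0, 12.5, 15.0]
r_clean = pearson(distance, fare)

# Add one implausible record: zero distance but a 300-dollar fare.
r_dirty = pearson(distance + [0.0], fare + [300.0])

print(round(r_clean, 3), round(r_dirty, 3))
```

One outlier is enough to pull a perfect correlation below zero here, which is why filtering bad values comes before interpreting `corr()`.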

In [26]:
# df[['passengers', 'distance']].corr().compute()

Other operations are slow, however, because they require a shuffle:

Set index: `df.set_index(df.x)`

groupby-apply not on index (with anything): `df.groupby(df.x).apply(myfunc)`

Join not on the index: `dd.merge(df1, df2, on='name')`

Why is shuffling hard?  

![shuffle graphic](https://assets-global.website-files.com/63192998e5cab906c1b55f6e/633f7b5df9c63728c2ce7ac6_image-3-700x340.png)

Every output partition depends on every input partition, so the graph becomes N² in size. Even with reasonable amounts of input data, this can crash the Dask scheduler.

![crazy graph](https://assets-global.website-files.com/63192998e5cab906c1b55f6e/633f7b5df9c6372f4bce7ac3_image7.png)
The current task graph of a very small shuffle (20 partitions). It grows quadratically with the number of partitions, so imagine this times 100 or 1000—it gets large very quickly!
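
The quadratic growth is simple arithmetic. As a rough sketch that counts one dependency per input-output pair (a simplification of the real task graph):

```python
def shuffle_edges(n_partitions):
    """Dependencies in an all-to-all shuffle: every output reads every input."""
    return n_partitions * n_partitions

for n in (20, 200, 2000):
    print(f"{n:>5} partitions -> {shuffle_edges(n):>9,} dependencies")
```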


### Asking data-driven questions

Now that we have the data, we can ask questions with it.  For instance:

* How did the average number of passengers change during the pandemic?
* Did pandemic drivers get tipped more?
* Did COVID change ride durations or distances? 

In [13]:
df_cov = df[df['pickup_month'] >= 4]
df_precov = df[df['pickup_month'] == 2]


In [None]:
df_cov['tip'].mean().compute()

In [None]:
df_precov['tip'].mean().compute()

In [None]:
df_cov['tip'].std().compute()

In [None]:
df_precov['tip'].std().compute()

In [None]:
print(len(df_precov), len(df_cov))
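
With a mean, a standard deviation, and a row count for each group, one possible next step is to ask whether the difference in mean tips is large compared with its uncertainty. The sketch below computes Welch's t statistic from summary values. The numbers are hypothetical placeholders, not results from the taxi data, and `welch_t` is an illustrative helper.

```python
from math import sqrt

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and the standard error of the difference in means."""
    se = sqrt(std_a**2 / n_a + std_b**2 / n_b)
    return (mean_a - mean_b) / se, se

# Hypothetical summary values, NOT results from the taxi data.
t_stat, se = welch_t(mean_a=2.0, std_a=2.5, n_a=10_000,
                     mean_b=2.2, std_b=2.6, n_b=40_000)
print(f"t = {t_stat:.2f}, standard error = {se:.4f}")
```

With millions of rides, even tiny differences give large t values, so the size of the difference matters as much as its significance.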

There are many other fun questions we could ask:

- Does the number of people in a taxi affect how likely someone is to tip, and how much? Does it relate to how far they travel?
- How much do people generally tip?
- Does the volume of passengers differ by time of day?
- What about payment type: who is still using cash, and at what time of day? Are they riding in groups?
