# INFO 3401 – Class 16: In-class notebook

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT).

## Learning Objectives
This week we'll explore an alternative genre of in-class activity: guided exploratory data analysis. We will work with a new dataset, review previous weeks' material, and apply the concepts introduced in the pre-class lectures. These lectures and notebooks will be recorded and posted to Canvas. We'll also experiment with making the notebooks cumulative, so the work we start on Monday will continue through Wednesday and Friday.

* Review concepts from previous modules and classes on data reshaping, visualization, joining, *etc.*
* Reinforce new concepts about working with temporal data
* Develop strategies for performing exploratory data analyses on new datasets
* Think critically about the limitations of the data and implications this has on our interpretations

## Background
Arielle Daskal reminded me about the very powerful and interesting [NYC Taxi Trip Records data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). These data capture the pick-up and drop-off timestamps, locations, distances, fares, and passenger counts for the millions of NYC taxi, limo, and rideshare trips going back to 2009. The data are nice to work with because they are [well-documented](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf), very detailed, cover a long period of time, and vary across both time and space.

The data are released monthly and broken down by yellow cabs (generally Manhattan), green cabs (generally the other boroughs), and for-hire vehicles (including rideshares). I want to focus in particular on the ridership data in the lead-up to and aftermath of the COVID-19-related quarantines and shutdowns in March and April 2020. What kinds of travel patterns can we discover? Which of these patterns were most and least disrupted by COVID?

## Load libraries

In [1]:
# Our usual libraries for working with data
import pandas as pd
import numpy as np

# Our usual libraries for visualizing data
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

## Load data

These are big files (>200MB!) so be careful and patient when running these code blocks to retrieve the data!

Retrieve the yellow cab data for March: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-03.csv

Print the shape and inspect the head.
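Before pulling a full file, it can help to rehearse the loading step on a tiny sample. The sketch below reads two rows copied from the yellow cab preview out of an in-memory string instead of the real URL; `sample_csv` and `sample` are names made up for this illustration, and restricting the read with `usecols` is just one possible way to keep memory use down.

```python
import io
import pandas as pd

# Two rows copied from the yellow cab preview, held in memory (not the real file)
sample_csv = io.StringIO(
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount\n"
    "1,2020-03-01 00:31:13,2020-03-01 01:01:42,1,4.7,27.8\n"
    "2,2020-03-01 00:08:22,2020-03-01 00:08:49,1,0.0,3.8\n"
)

# Read only the columns we need and parse the pickup timestamp while loading
sample = pd.read_csv(
    sample_csv,
    usecols=["tpep_pickup_datetime", "passenger_count", "trip_distance", "total_amount"],
    parse_dates=["tpep_pickup_datetime"],
)

# shape is a (rows, columns) tuple; head() shows the first rows
print(sample.shape)
print(sample.head())
```

The same `usecols` and `parse_dates` arguments should work when `read_csv` is pointed at a URL, although the download itself will still take a while.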

In [3]:
nyc_yellow_cab = pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2020-03.csv")
nyc_yellow_cab.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1.0,2020-03-01 00:31:13,2020-03-01 01:01:42,1.0,4.7,1.0,N,88,255,1.0,22.0,3.0,0.5,2.0,0.0,0.3,27.8,2.5
1,2.0,2020-03-01 00:08:22,2020-03-01 00:08:49,1.0,0.0,1.0,N,193,193,2.0,2.5,0.5,0.5,0.0,0.0,0.3,3.8,0.0
2,1.0,2020-03-01 00:52:18,2020-03-01 00:59:16,1.0,1.1,1.0,N,246,90,1.0,6.0,3.0,0.5,1.95,0.0,0.3,11.75,2.5
3,2.0,2020-03-01 00:47:53,2020-03-01 00:50:57,2.0,0.87,1.0,N,151,238,1.0,5.0,0.5,0.5,1.76,0.0,0.3,10.56,2.5
4,1.0,2020-03-01 00:43:19,2020-03-01 00:58:27,0.0,4.4,1.0,N,79,261,1.0,16.5,3.0,0.5,4.05,0.0,0.3,24.35,2.5


Repeat for the green cab data: https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2020-03.csv

In [4]:
nyc_green_cab = pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2020-03.csv")
nyc_green_cab.head()


Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2.0,2020-03-01 00:20:18,2020-03-01 00:45:29,N,1.0,41,13,1.0,8.24,26.5,0.5,0.5,7.64,0.0,,0.3,38.19,1.0,1.0,2.75
1,2.0,2020-03-01 00:15:42,2020-03-01 00:44:36,N,1.0,181,107,1.0,4.87,21.0,0.5,0.5,0.0,0.0,,0.3,25.05,2.0,1.0,2.75
2,2.0,2020-03-01 00:36:18,2020-03-01 00:41:03,N,1.0,41,166,1.0,0.69,5.0,0.5,0.5,0.0,0.0,,0.3,6.3,2.0,1.0,0.0
3,1.0,2020-03-01 00:22:14,2020-03-01 00:32:57,N,1.0,129,7,1.0,1.8,9.0,0.5,0.5,0.0,0.0,,0.3,10.3,2.0,1.0,0.0
4,2.0,2020-03-01 00:07:22,2020-03-01 00:14:16,N,1.0,74,152,1.0,1.25,7.0,0.5,0.5,2.49,0.0,,0.3,10.79,1.0,1.0,0.0


Repeat for the for-hire vehicle (FHV) data: https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2020-03.csv

In [5]:
for_hire_vehicle = pd.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/fhv_tripdata_2020-03.csv")
for_hire_vehicle.head()

Unnamed: 0,dispatching_base_num,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,SR_Flag
0,B00013,2020-03-01 00:52:09,2020-03-01 01:18:43,264.0,264.0,
1,B00013,2020-03-01 00:59:46,2020-03-01 01:14:18,264.0,264.0,
2,B00013,2020-03-01 00:48:29,2020-03-01 01:33:03,264.0,264.0,
3,B00013,2020-03-01 00:51:41,2020-03-01 01:32:46,264.0,264.0,
4,B00013,2020-03-01 00:58:29,2020-03-01 01:36:04,264.0,264.0,


Also grab the Taxi Zone Lookup Table: https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv

In [6]:
taxi_zone_lookup = pd.read_csv("https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv")
taxi_zone_lookup.head()

Unnamed: 0,LocationID,Borough,Zone,service_zone
0,1,EWR,Newark Airport,EWR
1,2,Queens,Jamaica Bay,Boro Zone
2,3,Bronx,Allerton/Pelham Gardens,Boro Zone
3,4,Manhattan,Alphabet City,Yellow Zone
4,5,Staten Island,Arden Heights,Boro Zone


## Inspect the columns for values and overlaps across data
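One possible way to start: compare the column names of two tables as sets, then count the values in a categorical column. The frames below are toy stand-ins that reuse a few column names and values from the previews above; `yellow`, `green`, `shared`, `only_yellow`, and `only_green` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins with a handful of columns from the previews (not the full data)
yellow = pd.DataFrame({
    "tpep_pickup_datetime": ["2020-03-01 00:31:13", "2020-03-01 00:08:22"],
    "PULocationID": [88, 193],
    "payment_type": [1.0, 2.0],
    "total_amount": [27.8, 3.8],
})
green = pd.DataFrame({
    "lpep_pickup_datetime": ["2020-03-01 00:20:18", "2020-03-01 00:15:42"],
    "PULocationID": [41, 181],
    "payment_type": [1.0, 2.0],
    "total_amount": [38.19, 25.05],
    "trip_type": [1.0, 1.0],
})

# Columns present in both tables, and columns unique to each
shared = sorted(set(yellow.columns) & set(green.columns))
only_yellow = sorted(set(yellow.columns) - set(green.columns))
only_green = sorted(set(green.columns) - set(yellow.columns))
print(shared, only_yellow, only_green)

# How often each value appears in a categorical column, including missing values
payment_counts = yellow["payment_type"].value_counts(dropna=False)
print(payment_counts)
```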

## Plot some distributions of columns with continuous data

Use [hist()](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#histograms) or [plot.kde()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.kde.html) to explore the distributions of the continuous variables.
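As a minimal sketch, here is a histogram of `total_amount` built from the five rows shown in the yellow cab preview. Five rows say nothing about the real distribution; the point is only the mechanics. The bin count is an arbitrary choice for this toy example.

```python
import pandas as pd

# The five trip_distance / total_amount values from the yellow cab preview
trips = pd.DataFrame({
    "trip_distance": [4.7, 0.0, 1.1, 0.87, 4.4],
    "total_amount": [27.8, 3.8, 11.75, 10.56, 24.35],
})

# Numeric summary first: min, max, and quartiles often reveal odd values
summary = trips.describe()
print(summary)

# Series.hist returns a matplotlib Axes that we can label
ax = trips["total_amount"].hist(bins=5)
ax.set_xlabel("total_amount (USD)")
ax.set_ylabel("Number of trips")
```

On the full data, long right tails are likely, so trying a log-scaled axis or clipping extreme values before plotting may be worthwhile.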

## Cleanup based on inspection

Remove, rename, or otherwise revise the DataFrame to clean up any problematic columns or rows.
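A sketch of one cleanup pass, again using the five preview rows. The rules here (drop trips with zero distance or zero passengers, rename the pickup column to a hypothetical `pickup_datetime`) are assumptions for illustration; what counts as "problematic" is a judgment call you should make from your own inspection.

```python
import pandas as pd

# Selected columns from the five yellow cab preview rows
raw = pd.DataFrame({
    "tpep_pickup_datetime": [
        "2020-03-01 00:31:13", "2020-03-01 00:08:22", "2020-03-01 00:52:18",
        "2020-03-01 00:47:53", "2020-03-01 00:43:19",
    ],
    "passenger_count": [1.0, 1.0, 1.0, 2.0, 0.0],
    "trip_distance": [4.7, 0.0, 1.1, 0.87, 4.4],
    "total_amount": [27.8, 3.8, 11.75, 10.56, 24.35],
})

# Assumed rule: keep rows with a positive distance and at least one passenger
keep = (raw["trip_distance"] > 0) & (raw["passenger_count"] > 0)

# Filter, give the pickup column a shorter (hypothetical) name, and reset the index
clean = (
    raw.loc[keep]
    .rename(columns={"tpep_pickup_datetime": "pickup_datetime"})
    .reset_index(drop=True)
)
print(len(raw), "rows before,", len(clean), "rows after")
```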

## Reshape the data

Make a pivot table of ride counts, with the pickup (PU) locations as the index and the drop-off (DO) locations as the columns.

Make another pivot table with the median "total_amount" by PU and DO location.
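Both pivot tables can be sketched with `pd.pivot_table`, changing only the `aggfunc`. The rows below are invented toy values that borrow location IDs from the preview; `ride_counts` and `median_fares` are illustrative names.

```python
import pandas as pd

# Toy trips: invented fares on a few PU/DO pairs (not real data)
trips = pd.DataFrame({
    "PULocationID": [88, 88, 246, 246, 79],
    "DOLocationID": [255, 255, 90, 90, 261],
    "total_amount": [27.8, 22.2, 11.75, 10.25, 24.35],
})

# Count of rides for each PU (rows) and DO (columns) pair; pairs never seen become 0
ride_counts = pd.pivot_table(
    trips, index="PULocationID", columns="DOLocationID",
    values="total_amount", aggfunc="count", fill_value=0,
)

# Median total_amount for each pair; pairs never seen stay missing (NaN)
median_fares = pd.pivot_table(
    trips, index="PULocationID", columns="DOLocationID",
    values="total_amount", aggfunc="median",
)
print(ride_counts)
print(median_fares)
```

Filling missing counts with 0 is reasonable because "no rides" is a real count, whereas a missing median is left as NaN because there is no fare to summarize.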

## Make heatmaps from the reshaped data

Use seaborn's powerful [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function ([also](https://seaborn.pydata.org/examples/spreadsheet_heatmap.html)).
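A minimal sketch of the call, assuming a small PU-by-DO count table like the one a pivot would produce. The `counts` values are invented, and the colormap and figure size are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

# Invented PU-by-DO ride counts standing in for a real pivot table
counts = pd.DataFrame(
    [[0, 0, 1], [0, 2, 0], [2, 0, 0]],
    index=pd.Index([79, 88, 246], name="PULocationID"),
    columns=pd.Index([90, 255, 261], name="DOLocationID"),
)

# Draw onto an explicit Axes so the title and size are easy to control
f, ax = plt.subplots(figsize=(6, 5))
sb.heatmap(counts, annot=True, fmt="d", cmap="viridis", ax=ax)
ax.set_title("Rides by pickup and drop-off zone (toy data)")
```

With a table covering every zone pair, annotations will be unreadable, so dropping `annot=True` is probably sensible on the full data.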

## Interpret the heatmaps

## Boolean index to airport trips

Use the Taxi Zone Lookup Table to identify zone IDs corresponding with trips starting or ending at airports.

How do the distributions of fares differ for airport trips compared to the entire population?
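One way to approach this, sketched on toy data: find the zones whose name contains "Airport", then keep trips that start or end in one of them. The `zones` rows are copied from the lookup preview above; the `trips` rows are invented, and matching on the word "Airport" is an assumption you should check against the full lookup table.

```python
import pandas as pd

# Rows copied from the Taxi Zone Lookup preview
zones = pd.DataFrame({
    "LocationID": [1, 2, 3, 4, 5],
    "Borough": ["EWR", "Queens", "Bronx", "Manhattan", "Staten Island"],
    "Zone": ["Newark Airport", "Jamaica Bay", "Allerton/Pelham Gardens",
             "Alphabet City", "Arden Heights"],
})

# Invented trips between those zones (not real data)
trips = pd.DataFrame({
    "PULocationID": [1, 4, 2],
    "DOLocationID": [4, 1, 5],
    "total_amount": [85.0, 92.5, 14.0],
})

# Zone IDs whose name mentions an airport
airport_ids = zones.loc[zones["Zone"].str.contains("Airport", na=False), "LocationID"]

# Boolean index: True when either end of the trip is an airport zone
is_airport = trips["PULocationID"].isin(airport_ids) | trips["DOLocationID"].isin(airport_ids)
airport_trips = trips[is_airport]

# Side-by-side summary statistics of fares for airport trips and all trips
comparison = pd.DataFrame({
    "airport": airport_trips["total_amount"].describe(),
    "all": trips["total_amount"].describe(),
})
print(comparison)
```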

## Inspect the pickup and dropoff datetimes

## Convert to `Timestamp`s using `to_datetime`
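A short sketch using two rows from the yellow cab preview. Passing an explicit `format` is optional but tends to speed up parsing on large files; `duration_minutes` is a hypothetical column added only to show that arithmetic works once the columns are real timestamps.

```python
import pandas as pd

# Two pickup/drop-off pairs copied from the yellow cab preview, still as strings
trips = pd.DataFrame({
    "tpep_pickup_datetime": ["2020-03-01 00:31:13", "2020-03-01 00:08:22"],
    "tpep_dropoff_datetime": ["2020-03-01 01:01:42", "2020-03-01 00:08:49"],
})

# Convert each string column to datetime64 in place
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    trips[col] = pd.to_datetime(trips[col], format="%Y-%m-%d %H:%M:%S")

# Subtracting timestamps gives a Timedelta, which we express in minutes
elapsed = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
trips["duration_minutes"] = elapsed.dt.total_seconds() / 60
print(trips.dtypes)
print(trips["duration_minutes"])
```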

## Make new columns for hour, weekday, and date of PU and DO
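The `.dt` accessor exposes these pieces once a column holds timestamps. In this sketch the first timestamp comes from the preview and the second is invented; the `pu_hour`, `pu_weekday`, and `pu_date` column names are illustrative choices, and the same pattern would apply to the drop-off column.

```python
import pandas as pd

# One timestamp from the preview plus one invented Monday-morning pickup
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2020-03-01 00:31:13", "2020-03-02 08:15:00"]),
})

pu = trips["tpep_pickup_datetime"]
trips["pu_hour"] = pu.dt.hour            # integer hour of day, 0 through 23
trips["pu_weekday"] = pu.dt.day_name()   # weekday name such as "Sunday"
trips["pu_date"] = pu.dt.normalize()     # midnight of the same day, still a Timestamp
print(trips)
```

Using `normalize()` rather than `.dt.date` keeps the column as a datetime type, which usually plots more cleanly later on.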

## Make another pivot table and heatmap for hour and date

Pivot hour as an index and date as a column and experiment with the number of trips, fare, passengers, *etc*. as values.
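A sketch of the hour-by-date pivot on invented rows, assuming hypothetical `pu_hour` and `pu_date` columns have already been derived from the pickup timestamp. Swapping `aggfunc="count"` for `"sum"` or `"median"` gives the other variants.

```python
import pandas as pd

# Invented trips with pre-computed hour and date columns (hypothetical names)
trips = pd.DataFrame({
    "pu_hour": [0, 0, 8, 8, 8],
    "pu_date": pd.to_datetime(
        ["2020-03-01", "2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]
    ),
    "total_amount": [27.8, 3.8, 11.75, 10.56, 24.35],
})

# Rows are hours, columns are dates, cells are the number of trips
hourly = pd.pivot_table(
    trips, index="pu_hour", columns="pu_date",
    values="total_amount", aggfunc="count", fill_value=0,
)
print(hourly)
```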

Heatmap it.

Interpret it.

## Reshape to get the daily number of rides, fares, passengers, *etc*.

Use pivot table or groupby-aggregate to get some daily counts of some activity.
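The groupby-aggregate route can compute several daily measures in one pass using named aggregation. The rows are toy values and `pu_date` is a hypothetical date column; `rides`, `total_fares`, and `passengers` are output names chosen for this sketch.

```python
import pandas as pd

# Toy trips across two days (not real data)
trips = pd.DataFrame({
    "pu_date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02"]),
    "total_amount": [27.8, 3.8, 11.75],
    "passenger_count": [1.0, 1.0, 2.0],
})

# Named aggregation: new_column=(source_column, function)
daily = trips.groupby("pu_date").agg(
    rides=("total_amount", "size"),
    total_fares=("total_amount", "sum"),
    passengers=("passenger_count", "sum"),
)
print(daily)
```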

## Visualize this groupby as a time-series

Make a plot with dates as the x-axis and count of rides, total fare, total passengers, *etc*. as the y-axis.
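When the index holds dates, calling `.plot()` on a column puts them on the x-axis automatically. The `daily` table below is filled with invented counts purely to show the call.

```python
import pandas as pd

# Invented daily ride counts indexed by date (not real data)
daily = pd.DataFrame(
    {"rides": [120, 135, 128, 90, 60]},
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
)

# Series.plot draws a line chart and returns the matplotlib Axes
ax = daily["rides"].plot(marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Rides per day")
```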

Interpret some salient features.

## Compare yellow to another kind of mobility service (green, FHV)

Clean.

Reshape.

Visualize.

Interpret!
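One wrinkle in the comparison is that the yellow and green tables name their timestamp columns differently (`tpep_` versus `lpep_`). A possible approach, sketched with a few values from the previews, is to rename to a shared hypothetical `pickup_datetime`, tag each table with a `service` label, and stack them before grouping.

```python
import pandas as pd

# A few pickup times and totals copied from the yellow and green previews
yellow = pd.DataFrame({
    "tpep_pickup_datetime": ["2020-03-01 00:31:13", "2020-03-01 00:08:22", "2020-03-01 00:52:18"],
    "total_amount": [27.8, 3.8, 11.75],
})
green = pd.DataFrame({
    "lpep_pickup_datetime": ["2020-03-01 00:20:18", "2020-03-01 00:15:42", "2020-03-01 00:36:18"],
    "total_amount": [38.19, 25.05, 6.3],
})

# Harmonize the column names and record which service each row came from
yellow_std = yellow.rename(columns={"tpep_pickup_datetime": "pickup_datetime"}).assign(service="yellow")
green_std = green.rename(columns={"lpep_pickup_datetime": "pickup_datetime"}).assign(service="green")

# Stack the tables, then compare a summary statistic by service
combined = pd.concat([yellow_std, green_std], ignore_index=True)
median_by_service = combined.groupby("service")["total_amount"].median()
print(median_by_service)
```

The FHV table has no fare columns in its preview, so a comparison that includes it would probably have to rely on ride counts instead.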

## Add in data from April 2020

Retrieve.

Clean.

Concatenate with the March 2020 data.

Pivot table or groupby-aggregate and make a visualization of the data over another month of time.
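A sketch of the concatenation step with two tiny invented tables standing in for the March and April files. It assumes both months share the same columns after cleaning; `pu_date` is a hypothetical date column.

```python
import pandas as pd

# Tiny invented stand-ins for the cleaned March and April tables
march = pd.DataFrame({
    "pu_date": pd.to_datetime(["2020-03-30", "2020-03-31"]),
    "total_amount": [18.3, 9.8],
})
april = pd.DataFrame({
    "pu_date": pd.to_datetime(["2020-04-01"]),
    "total_amount": [12.3],
})

# Stack the months row-wise and rebuild a clean 0..n-1 index
both = pd.concat([march, april], ignore_index=True)

# Rides per calendar month, grouping on the month of each date
monthly = both.groupby(both["pu_date"].dt.to_period("M")).size()
print(monthly)
```

Grouping on `pu_date` itself instead of the month gives the daily series, ready for the same time-series plot as before but spanning both months.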

Interpret!