<img src="https://github.com/jupytercon/2020-exactlyallan/raw/master/images/RAPIDS-header-graphic.png" style="width:50%">


# RAPIDS Visualization Guide Notebook
### A Streamlined Guide to RAPIDS Accelerated Visualization and Visual Analytics
This guide walks through using RAPIDS cuDF, cuSpatial, and cuGraph with HoloViews, hvPlot, Datashader, cuxfilter, and Plotly Dash on the publicly available Divvy bike share dataset.

**NOTES and TODO:**
- Base this guide on the [JupyterCon Notebooks](https://github.com/rapidsai-community/event-notebooks/blob/main/JupyterCon_2020_RAPIDSViz/00%20Index%20and%20Introduction.ipynb), not the cuxfilter tutorial


## Requirements
- System that meets the [RAPIDS system and GPU requirements](https://docs.rapids.ai/install#system-req)


## Dependencies
Use the following environment specification to install the required dependencies via conda:

channels:
- rapidsai
- nvidia
- pyviz
- conda-forge
- plotly
- anaconda

dependencies:
- cuxfilter>=23.02
- cudf>=23.02
- cuspatial>=23.02
- cugraph>=23.02
- cudatoolkit=11.8
- python>=3.10
- plotly
- dash-core-components
- dash-html-components
- jupyter-dash
- jupyterlab
- jupyter-server-proxy
- holoviews
- hvplot
- geoviews
- cartopy
- networkx
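
Before running the notebook, it can help to confirm that the environment actually resolved these packages. The sketch below is a convenience check and not part of the original guide: `missing_packages` is a hypothetical helper, and the list only covers packages whose import name is assumed to match the conda package name above.

```python
from importlib.util import find_spec

# Top-level import names assumed to match the conda packages listed above.
REQUIRED = ["cudf", "cuspatial", "cugraph", "cuxfilter", "holoviews",
            "hvplot", "geoviews", "cartopy", "networkx", "plotly"]

def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a package without importing it, so the check does not need a GPU to run.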

In [70]:
# imports
import os
from zipfile import ZipFile
from pathlib import Path

import cudf
import hvplot.cudf
import cuspatial
import cugraph
import cuml


## Dataset
The dataset can be downloaded from the [Divvy Bike Share public dataset](https://divvybikes.com/system-data). Use the following script to download the desired date range and load it into a dataframe.


In [3]:
# Define the URL of the Divvy trip data and save dir
S3 = 'https://divvy-tripdata.s3.amazonaws.com/'
DATA_DIR = './data'


In [None]:
# Create the data directory if it does not already exist
Path(DATA_DIR).mkdir(parents=True, exist_ok=True)

# Download the zip files from the URL within date range and unzip
for year in range(2021, 2022):
    for month in range(1, 3):
        file = f'{year}{month:02d}-divvy-tripdata.zip'
        URL = f'{S3}{file}'
        ! wget -P {DATA_DIR} {URL}
     
        # ZipFile is imported directly above, so call it without the module prefix
        with ZipFile(f'{DATA_DIR}/{file}') as archive:
            archive.extractall(DATA_DIR)
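
The `! wget` line relies on the IPython shell escape and on `wget` being installed. Where either is unavailable, the same download can be sketched with the standard library. This is an alternative under stated assumptions rather than the guide's method: `divvy_urls` and `download_and_extract` are hypothetical helper names, and the file naming pattern is copied from the loop above.

```python
from pathlib import Path
from urllib.request import urlretrieve
from zipfile import ZipFile

S3 = "https://divvy-tripdata.s3.amazonaws.com/"

def divvy_urls(years, months):
    """Build the monthly archive URLs with the same pattern as the loop above."""
    return [f"{S3}{year}{month:02d}-divvy-tripdata.zip"
            for year in years for month in months]

def download_and_extract(url, data_dir="./data"):
    """Download one archive (skipping files already present) and unzip it."""
    target = Path(data_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(url, target)
    with ZipFile(target) as archive:
        archive.extractall(data_dir)

urls = divvy_urls(range(2021, 2022), range(1, 3))
print(urls)
# Uncomment to fetch the files (requires network access):
# for url in urls:
#     download_and_extract(url)
```

Separating URL construction from the download makes it easy to inspect the date range before any network traffic happens.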
            

In [51]:
# Load all csv as dataframes and combine into one cudf
df_array = []

for file in Path(DATA_DIR).rglob('*.csv'):
    gdf = cudf.read_csv(file)
    df_array.append(gdf)

df = cudf.concat(df_array)

# Check the data
df

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,89E7AA6C29227EFF,classic_bike,2021-02-12 16:14:56,2021-02-12 16:21:43,Glenwood Ave & Touhy Ave,525,Sheridan Rd & Columbia Ave,660,42.012701,-87.666058,42.004583,-87.661406,member
1,0FEFDE2603568365,classic_bike,2021-02-14 17:52:38,2021-02-14 18:12:09,Glenwood Ave & Touhy Ave,525,Bosworth Ave & Howard St,16806,42.012701,-87.666058,42.019537,-87.669563,casual
2,E6159D746B2DBB91,electric_bike,2021-02-09 19:10:18,2021-02-09 19:19:10,Clark St & Lake St,KA1503000012,State St & Randolph St,TA1305000029,41.885795,-87.631101,41.884866,-87.627498,member
3,B32D3199F1C2E75B,classic_bike,2021-02-02 17:49:41,2021-02-02 17:54:06,Wood St & Chicago Ave,637,Honore St & Division St,TA1305000034,41.895634,-87.672069,41.903119,-87.673935,member
4,83E463F23575F4BF,electric_bike,2021-02-23 15:07:23,2021-02-23 15:22:37,State St & 33rd St,13216,Emerald Ave & 31st St,TA1309000055,41.834733,-87.625827,41.838163,-87.645124,member
...,...,...,...,...,...,...,...,...,...,...,...,...,...
96829,B1A5336E1412D8BF,classic_bike,2021-01-19 19:03:17,2021-01-19 20:10:03,Lake Shore Dr & Monroe St,13300,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.880958,-87.616743,41.984037,-87.652310,member
96830,57EA5CB7DCD75F90,classic_bike,2021-01-05 18:42:27,2021-01-05 19:33:33,Lake Shore Dr & Monroe St,13300,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.880958,-87.616743,41.984037,-87.652310,member
96831,815B319A078CC984,classic_bike,2021-01-07 17:59:47,2021-01-07 19:34:03,Lakefront Trail & Bryn Mawr Ave,KA1504000152,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.984037,-87.652310,41.984037,-87.652310,member
96832,6DB04151565CEE63,classic_bike,2021-01-06 19:20:31,2021-01-06 20:41:57,Lakefront Trail & Bryn Mawr Ave,KA1504000152,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.984037,-87.652310,41.984037,-87.652310,member


The data looks unusually clean, but there are still a few things we can improve. First, let's check for blanks and nulls; we will double-check the dtypes afterwards.


In [52]:
df.isnull().sum()


ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    21296
start_station_id      21296
end_station_name      25912
end_station_id        25912
start_lat                 0
start_lng                 0
end_lat                 420
end_lng                 420
member_casual             0
dtype: int64

In [53]:
# Show the rows where the end coordinates are null
df[df['end_lat'].isnull()]


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
8594,E8AEBDA7D32FEF36,classic_bike,2021-02-19 13:31:46,2021-02-20 14:31:39,Lake Shore Dr & Wellington Ave,TA1307000041,,,41.936688,-87.636829,,,casual
8596,6BC446A9919843BA,classic_bike,2021-02-12 08:31:35,2021-02-13 09:31:17,Jeffery Blvd & 71st St,KA1503000018,,,41.766638,-87.576450,,,casual
8641,8EA050950CE48403,classic_bike,2021-02-05 17:56:04,2021-02-06 18:55:59,Wabash Ave & Adams St,KA1503000015,,,41.879472,-87.625689,,,member
8658,E9401504F648F516,classic_bike,2021-02-20 10:48:19,2021-02-20 11:23:02,Columbus Dr & Randolph St,13263,,,41.884728,-87.619521,,,member
8677,E6BF4E711EC764E3,classic_bike,2021-02-27 18:56:34,2021-02-27 22:05:08,Broadway & Belmont Ave,13277,,,41.940106,-87.645451,,,member
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95322,67B9F437883417D2,classic_bike,2021-01-16 14:34:58,2021-01-17 03:27:14,Sheridan Rd & Montrose Ave,TA1307000107,,,41.961670,-87.654640,,,member
95940,9E3BEB165BBFD925,classic_bike,2021-01-11 18:18:55,2021-01-11 18:34:03,Halsted St & Wrightwood Ave,TA1309000061,,,41.929143,-87.649077,,,member
95943,2619975F025F5D13,classic_bike,2021-01-04 09:19:49,2021-01-04 09:52:58,Orleans St & Chestnut St (NEXT Apts),620,,,41.898203,-87.637536,,,member
95947,5A2F3645BF6D6A83,classic_bike,2021-01-17 17:35:36,2021-01-17 20:53:57,Halsted St & Wrightwood Ave,TA1309000061,,,41.929143,-87.649077,,,member


In [54]:
# Drop rows with no end coordinates, then recheck the null counts
df = df.dropna(subset=['end_lat'])
df.isnull().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    21296
start_station_id      21296
end_station_name      25492
end_station_id        25492
start_lat                 0
start_lng                 0
end_lat                   0
end_lng                   0
member_casual             0
dtype: int64
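
The station-name nulls do not disappear because `dropna(subset=['end_lat'])` only looks at `end_lat`; the `end_station_name` count falls by exactly the 420 dropped rows (25912 - 420 = 25492). The plain-Python sketch below mimics the count-then-drop logic on a made-up miniature table. The rows and the `null_counts` helper are illustrative assumptions, with empty CSV fields standing in for nulls.

```python
import csv
import io

# Made-up miniature of the trip table; empty fields stand in for nulls.
SAMPLE = """ride_id,end_station_name,end_lat,end_lng
r1,Sheridan Rd & Columbia Ave,42.004583,-87.661406
r2,,41.90,-87.63
r3,,,
r4,,,
"""

def null_counts(rows):
    """Count empty fields per column, analogous to df.isnull().sum()."""
    return {col: sum(1 for row in rows if row[col] == "") for col in rows[0]}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
before = null_counts(rows)

# Analogous to df.dropna(subset=['end_lat']): only end_lat decides which rows go.
kept = [row for row in rows if row["end_lat"] != ""]
after = null_counts(kept)

print(before)
print(after)
```

Row `r2` survives the drop with its station name still missing, which is the same pattern as the remaining station nulls in the output above.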

In [55]:
df.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id       object
end_station_name       object
end_station_id         object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object

The 'started_at' and 'ended_at' columns should be proper datetime types.

In [56]:
df['started_at'] = cudf.to_datetime(df['started_at'])
df['ended_at'] = cudf.to_datetime(df['ended_at'])

df.dtypes

ride_id                       object
rideable_type                 object
started_at            datetime64[ns]
ended_at              datetime64[ns]
start_station_name            object
start_station_id              object
end_station_name              object
end_station_id                object
start_lat                    float64
start_lng                    float64
end_lat                      float64
end_lng                      float64
member_casual                 object
dtype: object

To make things a bit easier, let's break out the date and time into separate columns, assuming we only need to worry about the start time.

In [57]:
df['year'] = df['started_at'].dt.year
df['month'] = df['started_at'].dt.month
df['day'] = df['started_at'].dt.day
df['hour'] = df['started_at'].dt.hour

df

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,year,month,day,hour
0,89E7AA6C29227EFF,classic_bike,2021-02-12 16:14:56,2021-02-12 16:21:43,Glenwood Ave & Touhy Ave,525,Sheridan Rd & Columbia Ave,660,42.012701,-87.666058,42.004583,-87.661406,member,2021,2,12,16
1,0FEFDE2603568365,classic_bike,2021-02-14 17:52:38,2021-02-14 18:12:09,Glenwood Ave & Touhy Ave,525,Bosworth Ave & Howard St,16806,42.012701,-87.666058,42.019537,-87.669563,casual,2021,2,14,17
2,E6159D746B2DBB91,electric_bike,2021-02-09 19:10:18,2021-02-09 19:19:10,Clark St & Lake St,KA1503000012,State St & Randolph St,TA1305000029,41.885795,-87.631101,41.884866,-87.627498,member,2021,2,9,19
3,B32D3199F1C2E75B,classic_bike,2021-02-02 17:49:41,2021-02-02 17:54:06,Wood St & Chicago Ave,637,Honore St & Division St,TA1305000034,41.895634,-87.672069,41.903119,-87.673935,member,2021,2,2,17
4,83E463F23575F4BF,electric_bike,2021-02-23 15:07:23,2021-02-23 15:22:37,State St & 33rd St,13216,Emerald Ave & 31st St,TA1309000055,41.834733,-87.625827,41.838163,-87.645124,member,2021,2,23,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96829,B1A5336E1412D8BF,classic_bike,2021-01-19 19:03:17,2021-01-19 20:10:03,Lake Shore Dr & Monroe St,13300,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.880958,-87.616743,41.984037,-87.652310,member,2021,1,19,19
96830,57EA5CB7DCD75F90,classic_bike,2021-01-05 18:42:27,2021-01-05 19:33:33,Lake Shore Dr & Monroe St,13300,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.880958,-87.616743,41.984037,-87.652310,member,2021,1,5,18
96831,815B319A078CC984,classic_bike,2021-01-07 17:59:47,2021-01-07 19:34:03,Lakefront Trail & Bryn Mawr Ave,KA1504000152,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.984037,-87.652310,41.984037,-87.652310,member,2021,1,7,17
96832,6DB04151565CEE63,classic_bike,2021-01-06 19:20:31,2021-01-06 20:41:57,Lakefront Trail & Bryn Mawr Ave,KA1504000152,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.984037,-87.652310,41.984037,-87.652310,member,2021,1,6,19
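
For intuition, the same component extraction can be reproduced for a single timestamp with the standard library. This is a CPU-only sketch, not the cuDF code path; the timestamp is row 0 of the table above, and the format string is an assumption based on how the values are printed.

```python
from datetime import datetime

# Row 0 of the table above; format inferred from the printed values.
started_at = datetime.strptime("2021-02-12 16:14:56", "%Y-%m-%d %H:%M:%S")

parts = {
    "year": started_at.year,
    "month": started_at.month,
    "day": started_at.day,
    "hour": started_at.hour,
}
print(parts)
```

The values match the `year`, `month`, `day`, and `hour` columns shown for that row.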


Finding the duration of each trip would also be a helpful metric.

In [58]:
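# Trip duration as a timedelta
df['duration_min'] = (df['ended_at'] - df['started_at'])

# Convert to minutes. Note: .dt.seconds is only the seconds component (0 to 86399),
# so any whole days in a trip lasting 24 hours or more are not counted.
df['duration_min'] = df['duration_min'].dt.seconds / 60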

df


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,year,month,day,hour,duration_min
0,89E7AA6C29227EFF,classic_bike,2021-02-12 16:14:56,2021-02-12 16:21:43,Glenwood Ave & Touhy Ave,525,Sheridan Rd & Columbia Ave,660,42.012701,-87.666058,42.004583,-87.661406,member,2021,2,12,16,6.783333
1,0FEFDE2603568365,classic_bike,2021-02-14 17:52:38,2021-02-14 18:12:09,Glenwood Ave & Touhy Ave,525,Bosworth Ave & Howard St,16806,42.012701,-87.666058,42.019537,-87.669563,casual,2021,2,14,17,19.516667
2,E6159D746B2DBB91,electric_bike,2021-02-09 19:10:18,2021-02-09 19:19:10,Clark St & Lake St,KA1503000012,State St & Randolph St,TA1305000029,41.885795,-87.631101,41.884866,-87.627498,member,2021,2,9,19,8.866667
3,B32D3199F1C2E75B,classic_bike,2021-02-02 17:49:41,2021-02-02 17:54:06,Wood St & Chicago Ave,637,Honore St & Division St,TA1305000034,41.895634,-87.672069,41.903119,-87.673935,member,2021,2,2,17,4.416667
4,83E463F23575F4BF,electric_bike,2021-02-23 15:07:23,2021-02-23 15:22:37,State St & 33rd St,13216,Emerald Ave & 31st St,TA1309000055,41.834733,-87.625827,41.838163,-87.645124,member,2021,2,23,15,15.233333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96829,B1A5336E1412D8BF,classic_bike,2021-01-19 19:03:17,2021-01-19 20:10:03,Lake Shore Dr & Monroe St,13300,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.880958,-87.616743,41.984037,-87.652310,member,2021,1,19,19,66.766667
96830,57EA5CB7DCD75F90,classic_bike,2021-01-05 18:42:27,2021-01-05 19:33:33,Lake Shore Dr & Monroe St,13300,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.880958,-87.616743,41.984037,-87.652310,member,2021,1,5,18,51.100000
96831,815B319A078CC984,classic_bike,2021-01-07 17:59:47,2021-01-07 19:34:03,Lakefront Trail & Bryn Mawr Ave,KA1504000152,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.984037,-87.652310,41.984037,-87.652310,member,2021,1,7,17,94.266667
96832,6DB04151565CEE63,classic_bike,2021-01-06 19:20:31,2021-01-06 20:41:57,Lakefront Trail & Bryn Mawr Ave,KA1504000152,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.984037,-87.652310,41.984037,-87.652310,member,2021,1,6,19,81.433333
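
One caveat with the `.dt.seconds` accessor: in pandas it returns only the seconds component of a timedelta (less than one day), and cuDF is expected to follow the same semantics, so trips longer than 24 hours would appear shorter than they were. The standard-library sketch below illustrates the difference. `duration_minutes` is a hypothetical helper; the first pair of timestamps is row 0 above and the second comes from one of the rows with null end coordinates shown earlier.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def duration_minutes(started_at, ended_at):
    """Full trip length in minutes, including any whole days."""
    delta = datetime.strptime(ended_at, FMT) - datetime.strptime(started_at, FMT)
    return delta.total_seconds() / 60

# Row 0 of the table above: a short trip, where both approaches agree.
short = duration_minutes("2021-02-12 16:14:56", "2021-02-12 16:21:43")

# A trip spanning more than a day.
start, end = "2021-02-19 13:31:46", "2021-02-20 14:31:39"
delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
wrapped = delta.seconds / 60           # seconds component only: the whole day is lost
full = duration_minutes(start, end)    # whole days included

print(round(short, 6), round(wrapped, 2), round(full, 2))
```

The two results for the long trip differ by exactly 1440 minutes, that is, one day.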


Extracting the day of the week would be helpful too.

In [59]:
df['day_of_week'] = df['started_at'].dt.dayofweek

df

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,year,month,day,hour,duration_min,day_of_week
0,89E7AA6C29227EFF,classic_bike,2021-02-12 16:14:56,2021-02-12 16:21:43,Glenwood Ave & Touhy Ave,525,Sheridan Rd & Columbia Ave,660,42.012701,-87.666058,42.004583,-87.661406,member,2021,2,12,16,6.783333,4
1,0FEFDE2603568365,classic_bike,2021-02-14 17:52:38,2021-02-14 18:12:09,Glenwood Ave & Touhy Ave,525,Bosworth Ave & Howard St,16806,42.012701,-87.666058,42.019537,-87.669563,casual,2021,2,14,17,19.516667,6
2,E6159D746B2DBB91,electric_bike,2021-02-09 19:10:18,2021-02-09 19:19:10,Clark St & Lake St,KA1503000012,State St & Randolph St,TA1305000029,41.885795,-87.631101,41.884866,-87.627498,member,2021,2,9,19,8.866667,1
3,B32D3199F1C2E75B,classic_bike,2021-02-02 17:49:41,2021-02-02 17:54:06,Wood St & Chicago Ave,637,Honore St & Division St,TA1305000034,41.895634,-87.672069,41.903119,-87.673935,member,2021,2,2,17,4.416667,1
4,83E463F23575F4BF,electric_bike,2021-02-23 15:07:23,2021-02-23 15:22:37,State St & 33rd St,13216,Emerald Ave & 31st St,TA1309000055,41.834733,-87.625827,41.838163,-87.645124,member,2021,2,23,15,15.233333,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96829,B1A5336E1412D8BF,classic_bike,2021-01-19 19:03:17,2021-01-19 20:10:03,Lake Shore Dr & Monroe St,13300,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.880958,-87.616743,41.984037,-87.652310,member,2021,1,19,19,66.766667,1
96830,57EA5CB7DCD75F90,classic_bike,2021-01-05 18:42:27,2021-01-05 19:33:33,Lake Shore Dr & Monroe St,13300,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.880958,-87.616743,41.984037,-87.652310,member,2021,1,5,18,51.100000,1
96831,815B319A078CC984,classic_bike,2021-01-07 17:59:47,2021-01-07 19:34:03,Lakefront Trail & Bryn Mawr Ave,KA1504000152,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.984037,-87.652310,41.984037,-87.652310,member,2021,1,7,17,94.266667,3
96832,6DB04151565CEE63,classic_bike,2021-01-06 19:20:31,2021-01-06 20:41:57,Lakefront Trail & Bryn Mawr Ave,KA1504000152,Lakefront Trail & Bryn Mawr Ave,KA1504000152,41.984037,-87.652310,41.984037,-87.652310,member,2021,1,6,19,81.433333,2
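
The `day_of_week` codes follow the Monday = 0 through Sunday = 6 convention. As a quick check, the sketch below recomputes the code for two rows of the table above with the standard library and looks up a short label. `day_label` is a hypothetical helper, and the `DOW` dictionary repeats the lookup that this notebook defines in a later plotting cell.

```python
from datetime import datetime

# Same lookup as the DOW dictionary defined later in the notebook.
DOW = {0: "M", 1: "T", 2: "W", 3: "Th", 4: "F", 5: "Sa", 6: "Su"}

def day_label(timestamp):
    """Map a 'YYYY-MM-DD HH:MM:SS' string to its weekday code and short label."""
    code = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").weekday()
    return code, DOW[code]

print(day_label("2021-02-12 16:14:56"))  # row 0
print(day_label("2021-02-14 17:52:38"))  # row 1
```

The codes agree with the `day_of_week` column for those rows (4 and 6).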


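The numeric `day_of_week` values (0 = Monday through 6 = Sunday) could be mapped to more legible names; a `DOW` lookup for that is defined with the plots below. First, let's count the trips by rider type.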

In [60]:
rider_type = df.groupby('member_casual').size().rename("count").reset_index()
rider_type


Unnamed: 0,member_casual,count
0,casual,46263
1,member,196607


## A Note on Preattentive Attributes
Charts work because of our brain's natural, subconscious ability to quickly recognize patterns through preattentive attributes such as height, orientation, or color. Imagine 100 values in a table and the same 100 values in a bar chart, and how much more quickly you would be able to find the smallest and largest values in the chart.

In [11]:
rider_type.hvplot.bar(x='member_casual', y='count', title='Total Rider Types', yformatter='%0.0f')

In [None]:
hour_counts = df.groupby('hour').size().rename('count').reset_index()
hour_counts.hvplot.bar('hour', 'count', title="Trip starts, per hour", yformatter="%0.0f")

In [12]:
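# Short day labels for day_of_week (0 = Monday); defined for reference, not yet applied to the plot below
DOW = {0:'M', 1:'T', 2:'W', 3:'Th', 4:'F', 5:'Sa', 6:'Su'}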

day_counts = df.groupby('day_of_week').size().rename('count').reset_index().sort_values('day_of_week')
day_counts.hvplot.bar('day_of_week', 'count', title="Trip starts per Week Day", yformatter="%0.0f")

In [61]:
df.hvplot.hist(y='duration_min', bins=120, title="Trip Duration Histogram", yformatter="%0.0f")

In [62]:
# group data by day_of_week and hour, count the number of rows in each group
heatmap_data = df.groupby(['day_of_week','hour']).size().rename("count").reset_index().sort_values('hour')
heatmap_data


heatmap_data.hvplot.heatmap(x='day_of_week', y='hour', C='count')
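
The grouped frame holds one row per (day_of_week, hour) pair that actually occurs, and the heatmap draws each pair as a grid cell. The plain-Python sketch below shows the same idea on the first five rows displayed earlier; the dense `grid` layout is an illustrative assumption about how the cells line up, not hvPlot's internal representation.

```python
from collections import Counter

# (day_of_week, hour) pairs for the first five rows shown earlier.
pairs = [(4, 16), (6, 17), (1, 19), (1, 17), (1, 15)]

# Equivalent of groupby(['day_of_week', 'hour']).size(): one count per pair.
counts = Counter(pairs)

# Dense 24 x 7 grid (hours as rows, days as columns); absent pairs stay 0.
grid = [[counts.get((day, hour), 0) for day in range(7)] for hour in range(24)]

print(counts[(1, 17)], grid[17][1], sum(map(sum, grid)))
```

Pairs with no trips are simply missing from the grouped data, which is why empty cells can appear in the rendered heatmap.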

In [69]:
df.hvplot.hexbin(x='start_lng', y='start_lat', geo=True, tiles="OSM", logz=False, gridsize=150, width=800, height=600)

In [67]:
df.hvplot.hexbin(x='end_lng', y='end_lat', geo=True, tiles="OSM", logz=False, gridsize=150, width=800, height=600)

If you compare these plots against the Divvy system map (https://account.divvybikes.com/map), the latitude and longitude values appear to be accurate.


In [68]:
#df.hvplot.points('start_lng','start_lat', geo=True, color='blue', alpha=0.2, xlim=(-86, -88), ylim=(40, 42), tiles='OSM', width=800, height=800)

## Outline

- Get station locations and count the trips with no station location
- cuSpatial to calculate distance (over time)
- cuSpatial to create a grid for start and stop nodes (in lieu of stations)
- cuML linear regression?
- Cuxfilter (day, week, hour, type, map) - start to stop graph
- cuGraph PageRank leave, PageRank arrive
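
As a reference point for the distance item in the outline, the great-circle (haversine) distance between a trip's start and end coordinates can be sketched in plain Python. This is not the cuSpatial implementation, whose API is not shown in this guide; `haversine_km` is a hypothetical helper, and the Earth radius constant is an assumption of the sketch.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Start and end coordinates of the first trip in the tables above.
trip_km = haversine_km(42.012701, -87.666058, 42.004583, -87.661406)
print(f"{trip_km:.2f} km")
```

This is a straight-line distance, roughly one kilometre for this trip, and will understate the distance actually ridden. Trips that start and end at the same station evaluate to zero.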