<img src="https://github.com/jupytercon/2020-exactlyallan/raw/master/images/RAPIDS-header-graphic.png" style="width:50%">


# RAPIDS Visualization Guide Notebook
### A Streamlined Guide to RAPIDS Accelerated Visualization and Visual Analytics
This guide walks through using RAPIDS cuDF, cuSpatial, and cuGraph with HoloViews, hvPlot, Datashader, cuxfilter, and Plotly Dash on the publicly available Divvy bike share dataset.

**NOTES and TODO:**
- Based on the [JupyterCon Notebooks](https://github.com/rapidsai-community/event-notebooks/blob/main/JupyterCon_2020_RAPIDSViz/00%20Index%20and%20Introduction.ipynb), not the cuxfilter tutorial


## Requirements
- System that meets the [RAPIDS system and GPU requirements](https://docs.rapids.ai/install#system-req)


## Dependencies
Use the channels and dependencies below to install everything required via conda:

channels:
- rapidsai
- nvidia
- pyviz
- conda-forge
- plotly
- anaconda

dependencies:
- cuxfilter>=23.02
- cudf>=23.02
- cuspatial>=23.02
- cugraph>=23.02
- cuml>=23.02
- cudatoolkit=11.8
- python>=3.10
- plotly
- dash-core-components
- dash-html-components
- jupyter-dash
- jupyterlab
- jupyter-server-proxy
- holoviews
- hvplot
- geoviews
- cartopy
- networkx

In [None]:
# imports
import os
from zipfile import ZipFile
from pathlib import Path

import cudf
import hvplot.cudf
import cuspatial
import cugraph
import cuml


## Dataset
The dataset can be downloaded from the [Divvy Bike Share public dataset](https://divvybikes.com/system-data). Use the following script to download the desired date range and load it into a dataframe.


In [None]:
# Define the URL of the Divvy trip data and save dir
S3 = 'https://divvy-tripdata.s3.amazonaws.com/'
DATA_DIR = './data'


In [None]:
# Check dir
Path(DATA_DIR).mkdir(parents=True, exist_ok=True)

# Download the zip files from the URL within date range and unzip
for year in range(2021, 2022):
    for month in range(1, 3):
        file = f'{year}{month:02d}-divvy-tripdata.zip'
        URL = f'{S3}{file}'
        ! wget -P {DATA_DIR} {URL}
     
        # ZipFile is imported directly above, so call it without the module prefix
        with ZipFile(f'{DATA_DIR}/{file}') as archive:
            archive.extractall(f'{DATA_DIR}')
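
The date range in the loop above is controlled by two `range()` calls, whose end values are exclusive. A minimal sketch of the file names that loop would request, using a hypothetical helper `divvy_archive_urls` that is not part of the notebook and runs in plain Python:

```python
# Hypothetical helper (not part of the notebook): list the monthly archive
# URLs that the download loop would request for the given ranges.
S3 = 'https://divvy-tripdata.s3.amazonaws.com/'


def divvy_archive_urls(years, months):
    """Return one URL per (year, month) pair, using the same file naming as the loop."""
    return [
        f'{S3}{year}{month:02d}-divvy-tripdata.zip'
        for year in years
        for month in months
    ]


# range(2021, 2022) covers only 2021; range(1, 3) covers months 1 and 2
urls = divvy_archive_urls(range(2021, 2022), range(1, 3))
for url in urls:
    print(url)
```

Widening either range (for example `range(1, 13)` for a full year) extends the download accordingly.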
            

In [None]:
# Load every CSV as a dataframe and combine them into one cuDF DataFrame
df_array = []

for file in Path(DATA_DIR).rglob('*.csv'):
    gdf = cudf.read_csv(file)
    df_array.append(gdf)

df = cudf.concat(df_array)

# Check the data
df

The data seems unreasonably clean, but there are still a few things we can improve. Let's check for blanks and nulls first, then double check the dtypes.


In [None]:
df.isnull().sum()


In [None]:
# Show the rows where the end latitude is null
df[df['end_lat'].isnull()]


In [None]:
# Drop rows with a null end latitude, then recheck the null counts
df = df.dropna(subset=['end_lat'])
df.isnull().sum()

In [None]:
df.dtypes

The 'started_at' and 'ended_at' columns should be proper datetime types.

In [None]:
df['started_at'] = cudf.to_datetime(df['started_at'])
df['ended_at'] = cudf.to_datetime(df['ended_at'])

df.dtypes

To make things a bit easier, let's break out the date and time into separate columns, assuming we only need to worry about the start time.

In [None]:
df['year'] = df['started_at'].dt.year
df['month'] = df['started_at'].dt.month
df['day'] = df['started_at'].dt.day
df['hour'] = df['started_at'].dt.hour

df

Finding the duration of each trip would also be a helpful metric.

In [None]:
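trip_duration = df['ended_at'] - df['started_at']

# .dt.seconds holds only the part under one day, so add whole days back as minutes
df['duration_min'] = trip_duration.dt.days * 1440 + trip_duration.dt.seconds / 60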

df
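
When converting a time difference to minutes, note that a `seconds` attribute on a timedelta typically holds only the remainder after whole days are removed. A minimal sketch with Python's standard `datetime` module (the timestamps are made up for illustration, and the pandas-style `.dt.seconds` accessor is assumed to follow the same convention):

```python
from datetime import datetime

# Made-up timestamps: a trip lasting one day and fifteen minutes
started_at = datetime(2021, 1, 1, 23, 50)
ended_at = datetime(2021, 1, 3, 0, 5)
elapsed = ended_at - started_at

# .seconds ignores whole days, so it reports only the 15-minute remainder
seconds_only_min = elapsed.seconds / 60

# Adding the whole days back gives the full duration in minutes
days_plus_seconds_min = elapsed.days * 1440 + elapsed.seconds / 60

# total_seconds() covers days and seconds in one call
total_min = elapsed.total_seconds() / 60

print(seconds_only_min, days_plus_seconds_min, total_min)
```

For trips shorter than 24 hours the three values agree; they diverge only for multi-day rentals.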


Extracting the day of the week would be helpful too.

In [None]:
df['day_of_week'] = df['started_at'].dt.dayofweek

df

In [None]:
rider_type = df.groupby('member_casual').size().rename("count").reset_index()
rider_type


## A Note on Preattentive Attributes
Charts work because of our subconscious ability to quickly recognize patterns, which comes from the brain's natural tendency to pick out preattentive attributes such as height, orientation, or color. Imagine 100 values in a table and the same 100 values in a bar chart, and how quickly you would be able to find the smallest and largest values in each.

In [None]:
rider_type.hvplot.bar(x='member_casual', y='count', title='Total Rider Types', yformatter='%0.0f')

In [None]:
hour_counts = df.groupby('hour').size().rename('count').reset_index()
hour_counts.hvplot.bar('hour', 'count', title="Trip starts, per hour", yformatter="%0.0f")

In [None]:
# DOW = {0:'M', 1:'T', 2:'W', 3:'Th', 4:'F', 5:'Sa', 6:'Su'}

day_counts = df.groupby('day_of_week').size().rename('count').reset_index().sort_values('day_of_week')
day_counts.hvplot.bar('day_of_week', 'count', title="Trip starts per Week Day", yformatter="%0.0f")
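
The commented-out `DOW` mapping above relies on the convention that day 0 is Monday and day 6 is Sunday. A minimal sketch that checks this convention with the standard library (`date.weekday()` is assumed to number days the same way as the `dayofweek` accessor; `day_label` is a hypothetical helper):

```python
from datetime import date

# Same mapping as the commented-out DOW dictionary: 0 is Monday, 6 is Sunday
DOW = {0: 'M', 1: 'T', 2: 'W', 3: 'Th', 4: 'F', 5: 'Sa', 6: 'Su'}


def day_label(d):
    """Return the short weekday label for a date."""
    return DOW[d.weekday()]


# 4 January 2021 was a Monday, so these seven dates span one full week
labels = [day_label(date(2021, 1, day)) for day in range(4, 11)]
print(labels)
```

Applying such a mapping to the `day_of_week` column before plotting would give readable axis labels instead of integers.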

In [None]:
df.hvplot.hist(y='duration_min', bins=120, title="Trip Duration Histogram", yformatter="%0.0f")

In [None]:
# group data by day_of_week and hour, count the number of rows in each group
heatmap_data = df.groupby(['day_of_week','hour']).size().rename("count").reset_index().sort_values('hour')
heatmap_data


heatmap_data.hvplot.heatmap(x='day_of_week', y='hour', C='count')

In [None]:
df.hvplot.hexbin(x='start_lng', y='start_lat', geo=True, tiles="OSM", logz=False, gridsize=150, width=800, height=600)

In [None]:
df.hvplot.hexbin(x='end_lng', y='end_lat', geo=True, tiles="OSM", logz=False, gridsize=150, width=800, height=600)

Compared against the [Divvy system map](https://account.divvybikes.com/map), the lat/longs seem to be accurate.
But this seems like a lot of start / stop places, so let's see if we can identify stations.

In [None]:
unique_station_ids = df['start_station_id'].unique()
unique_station_ids


In [None]:
unique_starts = df['start_lat'].unique()
unique_starts

There are many more starting points than stations, so bikes apparently do not have to start and stop at a station. We will have to find a way to bin the start and stop locations into a reasonable number.
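
One simple way to bin locations is to snap each coordinate pair to a regular grid cell and count trips per cell. The sketch below is plain Python, not the cuSpatial approach listed in the outline; the 0.01 degree cell size, the `grid_cell` helper, and the sample coordinates are all assumptions made for illustration:

```python
import math
from collections import Counter

CELL_DEG = 0.01  # hypothetical cell size in degrees (roughly 1 km of latitude)


def grid_cell(lng, lat, cell_deg=CELL_DEG):
    """Return the integer (column, row) index of the grid cell containing a point."""
    return (math.floor(lng / cell_deg), math.floor(lat / cell_deg))


# Made-up sample points (lng, lat), not taken from the dataset
points = [(-87.6512, 41.8823), (-87.6518, 41.8829), (-87.6301, 41.9102)]

# Count how many points fall into each cell
cell_counts = Counter(grid_cell(lng, lat) for lng, lat in points)
print(cell_counts)
```

The first two points land in the same cell, so the three locations collapse into two bins. A smaller cell size keeps more spatial detail at the cost of more bins.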

In [None]:
# Create a cuSpatial GeoSeries from the latitude and longitude columns
start_points = cuspatial.GeoSeries.from_points_xy(df[['start_lng','start_lat']].interleave_columns().astype("float64"))
end_points = cuspatial.GeoSeries.from_points_xy(df[['end_lng','end_lat']].interleave_columns().astype("float64"))
# Print the start points GeoSeries
print(start_points)

In [None]:
distances_in_km = cuspatial.haversine_distance(start_points, end_points)
distances_in_km
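
For reference, the haversine formula gives the great-circle distance between two longitude/latitude points on a sphere. A minimal CPU sketch in plain Python, assuming a mean Earth radius of 6371 km (`haversine_km` is a hypothetical helper, not the cuSpatial implementation):

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lng1, lat1, lng2, lat2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between two (lng, lat) points given in degrees."""
    lng1, lat1, lng2, lat2 = map(math.radians, (lng1, lat1, lng2, lat2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    # Haversine term: squared half-chord length between the two points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


# A trip that ends where it started has zero distance
same_spot = haversine_km(-87.65, 41.88, -87.65, 41.88)

# One degree of latitude along a meridian
one_degree_lat = haversine_km(-87.65, 41.0, -87.65, 42.0)
print(same_spot, round(one_degree_lat, 3))
```

Note that this is the straight-line distance over the Earth's surface, not the distance actually ridden, so round trips show up as zero.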

In [None]:
# Add the distances back into the dataframe, rounded to make it more obvious when a trip stopped at the same place it started
df['dist_km'] = cudf.Series(distances_in_km).values.round(4)
df

In [None]:
#df.hvplot.points('start_lng','start_lat', geo=True, color='blue', alpha=0.2, xlim=(-86, -88), ylim=(40, 42), tiles='OSM', width=800, height=800)

## Outline
- cuSpatial to create a grid for start - stop nodes (in lieu of stations)
- cuML linear regression?
- Cuxfilter (day, week, hour, type, map) - start to stop graph
- cuGraph PageRank leave, PageRank arrive