<img src="https://github.com/jupytercon/2020-exactlyallan/raw/master/images/RAPIDS-header-graphic.png">

# Using RAPIDS and Jupyter to Accelerate Visualization Workflows

In [None]:
## Run this cell to play the walk through video: ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/TnN3a-G_ugs" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

## Introduction to RAPIDS
Backed by NVIDIA, the **[RAPIDS](https://rapids.ai/index.html)** suite of open source software libraries gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs.

Some of the main libraries include [**cuDF**](https://docs.rapids.ai/api/cudf/stable/), a pandas-like dataframe manipulation library; [**cuML**](https://docs.rapids.ai/api/cuml/stable/), a collection of machine learning libraries that provide GPU versions of algorithms available in scikit-learn; [**cuGraph**](https://docs.rapids.ai/api/cugraph/stable/), a NetworkX-like accelerated graph analytics library; and [**cuSpatial**](https://docs.rapids.ai/api/cuspatial/stable/), a library for common spatial and spatiotemporal operations.

For more general information, check out the **[RAPIDS.ai home page](https://rapids.ai/index.html)**.

For a detailed presentation about RAPIDS and the latest release notes, visit the **[RAPIDS overview documentation](https://docs.rapids.ai/overview)**.

## Introduction to RAPIDS Visualization
The RAPIDS viz group's overall goal is to build open source libraries and collaborate with other open source projects. We hope to foster greater adoption of GPUs in the Python visualization ecosystem and beyond. It's not just for the sake of making things faster: we feel that when data scientists and analysts are able to interact with larger datasets in real time and with high fidelity, they will be able to ask better questions, more often, and get more accurate answers to today's complex problems.


## RAPIDS Supported Viz Frameworks
The frameworks below currently support RAPIDS, primarily by using cuDF as a data source:

- **[hvplot](https://hvplot.holoviz.org/)**: wrapper API for easily visualizing data. 
- **[cuxfilter](https://github.com/rapidsai/cuxfilter)**: RAPIDS library for easily cross-filtering data. 
- **[Plotly Dash](https://plotly.com/dash/gpu-dask-acceleration/)**: framework for production ready visualization applications.
- **[Datashader](https://datashader.org/)**: library for high fidelity server side data rendering.

The RAPIDS visualization team is continually working to integrate with other open source projects - if you wish to help or have questions, reach out on our [Community Slack Channel (GOAI)](https://join.slack.com/t/rapids-goai/shared_invite/zt-h54mq1uv-KHeHDVCYs8xvZO5AB~ctTQ). 

### GPU Compute and/or GPU Render
Generally, RAPIDS works to accelerate visualization through faster compute - that is, computing aggregations, filters, algorithms, etc. quickly enough to be directly interacted with through a visualization. GPUs can also speed up visualization through faster data rendering (which is what people more often associate GPUs with). The architecture required to do one or both of these through web browsers can be complex, but is useful to understand when building advanced visualizations. Feel free to ask for details and future plans in our [Community Slack Channel (GOAI)](https://join.slack.com/t/rapids-goai/shared_invite/zt-h54mq1uv-KHeHDVCYs8xvZO5AB~ctTQ).


## Hardware and Software Requirements
To run RAPIDS you will need to meet these general requirements:
- NVIDIA Pascal™ or better GPU
- Ubuntu 16.04+ or CentOS 7 OS (Windows support pending)
- Recent CUDA & NVIDIA Drivers
- Docker and/or Anaconda

For the most up-to-date requirements and installation details, see the [RAPIDS Getting Started Page](https://rapids.ai/start.html).

### Package Requirements
Other packages are required in addition to a RAPIDS (0.16+) release installation. Everything is listed in the `environment.yml` and can be installed via [conda forge](https://conda-forge.org/). Using `conda`, first execute:
```
conda env create --name jupytercon_tutorial --file environment.yml
```
Then:
```
conda activate jupytercon_tutorial
```



# Index of Notebooks

- 00 **Index**: you are here (but are we anywhere..really?)
- [01 **Data inspection and validation**](01%20Data%20Inspection%20and%20Validation.ipynb): dataset procurement as well as inspection with hvplot via bokeh charts.
- [02 **Exploratory data visualization**](02%20Exploratory%20Data%20Visualization.ipynb): exploring preliminary patterns through cross-filtering with cuxfilter.
- [03 **Data analysis with visual analytics**](03%20Data%20Analysis%20with%20Visual%20Analytics.ipynb): applying visual analytics with cuSpatial, cuGraph, hvplot via bokeh charts and datashader. 
- [04 **Explanatory data visualization**](04%20Explanatory%20Data%20Visualization.ipynb): presenting findings through a visualization application with Plotly Dash.

# Data Inspection and Validation
***Loading data, vetting its quality, and understanding its shape***


In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/0PNdgpZGPuk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


## Overview
This intro notebook will use cuDF and hvplot (with bokeh charts) to load a public bike share dataset and get a general sense of what it contains, then run some cursory visualization to validate that the data is free of issues.

### cuDF and hvplot
- [cuDF](https://docs.rapids.ai/api/cudf/stable/), the core of RAPIDS, is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data in a pandas-like API.
- [hvplot](https://hvplot.holoviz.org/) is a high-level plotting API for the PyData ecosystem built on [HoloViews](http://holoviews.org/).

## Imports
Let's first make sure the necessary libraries are installed by importing them.

In [None]:
import cudf
import hvplot.cudf
import cupy
import pandas as pd

## Data Size and GPU Speedups
This tutorial's dataset is about `2.1GB` unzipped and contains about `9 million rows`. While this will do for a tutorial, it's still too small to get a sense of the speedup possible with GPU acceleration. We've created a larger `300 million row` [2010 Census Visualization](https://github.com/rapidsai/plotly-dash-rapids-census-demo) application, available through the RAPIDS [GitHub page](https://github.com/rapidsai), as another demo.

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Q6UQullAAvY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


## Loading Data into cuDF
We need to download and extract the sample data we will use for this tutorial. This notebook uses the Kaggle [Chicago Divvy Bicycle Sharing Data](https://www.kaggle.com/yingwurenjian/chicago-divvy-bicycle-sharing-data) dataset. Once the `data.csv` file is downloaded and unzipped, point the paths below at the location *(Make sure to set DATA_DIR to the path you saved that data file to)*:


In [None]:
from pathlib import Path

DATA_DIR = Path("../data")

In [None]:
# Download and Extract the dataset
! wget -N -P {DATA_DIR} https://data.rapids.ai/viz-data/data.tar.xz
! tar -xf {DATA_DIR}/data.tar.xz -C {DATA_DIR}

In [None]:
FILENAME = Path("data.csv")

We now read the .csv file into a GPU cuDF DataFrame (which behaves similarly to a pandas DataFrame).

In [None]:
df = cudf.read_csv(DATA_DIR / FILENAME)

## Mapping out the Data Shape
cuDF supports many of the standard pandas operations for a quick look at the data, e.g. to see the total number of rows...

In [None]:
len(df)

Or to inspect the column headers and first few rows...

In [None]:
df.head()

Or to see the full list of columns...

In [None]:
df.columns

Or see how many trips were made by subscribers.

In [None]:
df.groupby("usertype").size()
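
Conceptually, `groupby("usertype").size()` counts how many rows share each distinct key. A minimal CPU-only sketch of the same idea follows; the sample values are invented for illustration (the `"Customer"` label in particular is an assumption) and are not taken from the dataset:

```python
from collections import Counter

# Hypothetical sample of the "usertype" column; not real data.
usertype = ["Subscriber", "Customer", "Subscriber", "Subscriber", "Customer"]

# Count rows per distinct key, which is what a group-by size computes.
counts = Counter(usertype)

for key, n in sorted(counts.items()):
    print(f"{key}: {n}")
```

The GPU version does the same bookkeeping, only across millions of rows in parallel.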

## Improving Data Utility
Now that we have a basic idea of how big our dataset is and what it contains, we want to start making the data more meaningful. This task can range from removing unnecessary columns to mapping values to be more human readable or formatting them to be understood by our tools.

Having looked at the `df.head()` above, the first thing we might want is to re-load the data, parsing the start and stop time columns as more usable datetime types:

In [None]:
df = cudf.read_csv(DATA_DIR / FILENAME, parse_dates=['starttime', 'stoptime'])

One thing we will want to do is to look at trips by day of week. Now that we have real datetime columns, we can use `dt.weekday` to add a `weekday` column to our `cudf` Dataframe:

In [None]:
df["weekday"] = df['starttime'].dt.weekday
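
The resulting values follow the pandas convention, where Monday is 0 and Sunday is 6. Python's built-in `datetime.weekday()` uses the same numbering, so it can serve as a quick CPU-side sanity check. The timestamps below are arbitrary examples, not rows from the dataset:

```python
from datetime import datetime

# Arbitrary example timestamps (not taken from the dataset).
starts = [
    datetime(2017, 6, 5, 8, 30),   # a Monday
    datetime(2017, 6, 9, 17, 45),  # a Friday
    datetime(2017, 6, 11, 12, 0),  # a Sunday
]

# weekday() numbers days from Monday=0 to Sunday=6.
weekdays = [ts.weekday() for ts in starts]
print(weekdays)
```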

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/2BrOrIRp76M" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

## Inspecting Data Quality and Distribution
Another important step is getting a sense of the quality of the dataset. As these datasets are often larger than is feasible to look through row by row, mapping out the distribution of values early on helps find issues that can derail an analysis later.

Some examples are gaps in data, unexpected or empty value types, infeasible values, or incorrect projections. 

## Gender and Subscriber Columns
We could do this in a numerical way, such as getting the totals from the 'gender' data column as a table:

In [None]:
mf_counts = df.groupby("gender").size().rename("count").reset_index()
mf_counts

While technically functional as a table, taking values and visualizing them as bars helps to intuitively show the scale of the difference faster (hvplot's API makes this very simple):

In [None]:
mf_counts.hvplot.bar("gender","count").opts(title="Total trips by gender")

### A Note on Preattentive Attributes
This subconscious ability to quickly recognize patterns is due to our brain's natural ability to find [preattentive attributes](http://daydreamingnumbers.com/blog/preattentive-attributes-example/), such as height, orientation, or color. Imagine 100 values in a table and 100 in a bar chart, and how quickly you would be able to find the smallest and largest values in either.

### Try It out
Now try using [hvplot's user guide](https://hvplot.holoviz.org/user_guide/Plotting.html) and our examples to create an hvplot chart that shows the distribution of `Subscriber` types (the `usertype` column):

In [None]:
# code here

The above data columns may show some potentially useful disparities, but without supplemental data, it would be hard to form a follow-up question.


In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/fRH03WEsyVk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

## Trip Starts
Instead, another question we might want to ask is: how many trip starts are there per day of the week? We can group the `cudf` DataFrame and call `hvplot.bar` directly on the result:

In [None]:
day_counts = df.groupby("weekday").size().rename("count").reset_index()
day_counts.hvplot.bar("weekday", "count").opts(title="Trip starts, per Week Day", yformatter="%0.0f")

With 0-4 being weekdays and 5-6 being the weekend, there is a clear drop-off in ridership on the weekends. Let's note that!


## Trips by Duration
Another quick look we can generate is to see the overall distribution of trip durations, this time using `hvplot.hist`:

In [None]:
# We selected an arbitrary 50 for the number of bins; try other values and see which patterns emerge
df.hvplot.hist(y="tripduration", bins=50).opts(
    title="Trips Duration Histogram", yformatter="%0.0f"
)

Clearly, most trips are less than 15 minutes long.
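
A histogram with a fixed number of bins splits the value range into equal-width intervals and counts how many values fall into each. The sketch below shows that binning step in plain Python on invented durations. The helper name `histogram` and the sample values are illustrative only, and hvplot's own implementation may differ in details such as edge handling:

```python
def histogram(values, bins):
    """Count values into `bins` equal-width intervals spanning min..max.

    Assumes the values are not all identical (so the bin width is non-zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges


# Invented trip durations in minutes, for illustration only.
durations = [2.0, 4.5, 5.0, 7.5, 9.0, 11.0, 12.5, 14.0, 30.0, 42.0]
counts, edges = histogram(durations, bins=4)
print(counts)
```

Changing `bins` trades detail for noise: more bins reveal finer structure but leave fewer values in each.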

`hvplot` also makes it simple to interrogate different dimensions. For example, we can add `groupby="month"` to our call to `hvplot.hist`, and automatically get a slider to see a histogram specific to each month:

In [None]:
df.hvplot.hist(y="tripduration", bins=50, groupby="month").opts(
    title="Trips Duration Histogram by Month", yformatter="%0.0f", width=400
)

By scrubbing between the months we can start to see a pattern of slightly longer trip durations emerge during the summer months.



## Trips vs Temperatures
Let's follow up on this by using `hvplot` to generate a KDE distribution from our `cudf` DataFrame of 9 million trips:

In [None]:
df.hvplot.kde(y="temperature").opts(title="Distribution of trip temperatures")

Clearly, most trips occur in a temperature sweet spot of around 65-80 degrees.
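
A kernel density estimate replaces each observation with a small smooth bump and sums the bumps, giving a continuous alternative to a histogram. A minimal Gaussian-kernel sketch in plain Python follows. The bandwidth and the sample temperatures are arbitrary choices for illustration, and hvplot's defaults may differ:

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Invented temperatures, for illustration only.
temps = [55.0, 62.0, 68.0, 70.0, 71.0, 73.0, 75.0, 78.0, 84.0]
density = gaussian_kde(temps, bandwidth=4.0)

# Evaluate on a coarse grid and report where the estimate peaks.
grid = [40.0 + 0.5 * i for i in range(121)]  # 40.0 .. 100.0
peak = max(grid, key=density)
print(f"density peaks near {peak:.1f}")
```

A smaller bandwidth follows the data more closely, while a larger one smooths over local detail.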


The `hvplot.heatmap` method can group in two dimensions and colormap according to aggregations on those groups. Here we see *average* trip duration by year and month: 

In [None]:
df.hvplot.heatmap(x='month', y='year', C='tripduration',
                  reduce_function=cudf.DataFrame.mean, colorbar=True, cmap="Viridis")

So what we saw hinted at with the trip duration slider is much more clearly shown in this literal heatmap *(ba-dom-tss)*. 
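
Each heatmap cell is the result of a two-key group-by followed by an aggregation. The plain-Python sketch below shows that computation for a mean, using invented `(year, month, tripduration)` rows that are not taken from the dataset:

```python
from collections import defaultdict

# Invented (year, month, tripduration) rows, for illustration only.
rows = [
    (2016, 1, 8.0), (2016, 1, 10.0),
    (2016, 7, 14.0), (2016, 7, 16.0), (2016, 7, 18.0),
    (2017, 1, 9.0),
    (2017, 7, 15.0), (2017, 7, 17.0),
]

# Accumulate sum and count per (year, month) cell.
totals = defaultdict(lambda: [0.0, 0])
for year, month, duration in rows:
    cell = totals[(year, month)]
    cell[0] += duration
    cell[1] += 1

# Each heatmap cell is colored by the mean of its group.
means = {key: total / count for key, (total, count) in totals.items()}
for key in sorted(means):
    print(key, means[key])
```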



In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/gqkdgOKiGNM" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


## Trip Geography
Temperature and months aside, we might also want to bin the data geographically to check for anomalies. `hvplot.hexbin` can show the counts for trip starts overlaid on a tile map:

In [None]:
df.hvplot.hexbin(x='longitude_start', y='latitude_start', geo=True, tiles="OSM").opts(width=600, height=600)

## Data Cleanup
Based on our inspection, this dataset is uncommonly well formatted and of high quality. But a little cleanup and formatting aids will make some things simpler in future notebooks. 

One thing that is missing is a list of just station IDs and their coordinates. Let's generate that and save it for later. First, let's group by all the unique "from" and "to" station ID values:

In [None]:
from_ids = df.groupby("from_station_id")
to_ids = df.groupby("to_station_id")

It's possible (but unlikely) that a particular station is only a sink or source for trips. For good measure, let's make sure the group keys are identical:

In [None]:
all(from_ids.size().index.values  == to_ids.size().index.values)
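
The element-wise comparison above relies on both key arrays having the same length and order. A set-based comparison is one alternative that also reports which stations appear on only one side. The sketch below uses invented station IDs in plain Python to show the idea:

```python
# Invented station IDs, for illustration only.
from_station_ids = [2, 3, 5, 5, 7, 9]
to_station_ids = [3, 2, 7, 5, 5, 11]

sources = set(from_station_ids)
sinks = set(to_station_ids)

only_sources = sorted(sources - sinks)  # stations that never receive a trip
only_sinks = sorted(sinks - sources)    # stations that never start a trip

print("identical keys:", sources == sinks)
print("source only:", only_sources)
print("sink only:", only_sinks)
```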

Each group has items for a single station, which all have the same lat/lon. So let's make a new DataFrame by taking a representative from each group, then rename some columns:

In [None]:
stations = from_ids.nth(1).to_pandas()
stations.index.name = "station_id"
stations.rename(columns={"latitude_start": "lat", "longitude_start": "lon"}, inplace=True)
stations = stations.reset_index().filter(["station_id", "lat", "lon"])
stations

Finally write the results to "stations.csv" in our data directory:

In [None]:
stations.to_csv(DATA_DIR / "stations.csv", index=False)

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/c0hQAGPdF5U" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


## Summary of the Data
Overall this is an interesting and useful dataset. Our preliminary vetting found no issues with quality and already started to hint at areas to investigate:

- Weekday vs Weekend trip counts
- Bike trips vs weather correlation 
- Core vs Outward trip concentrations 

We will follow up with these findings in our next notebook.

# Exploratory Data Visualization
***Quickly finding linked patterns in your data***


In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/Xu6R9Tad7H0" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


## Overview

Taking the previous notebook’s vetted Divvy bike share dataset, we will now use cuDF, cuxfilter, and cuGraph to quickly create cross-filtered visualizations to explore different perspectives and slices of the data in search of interesting patterns.

### cuDF, cuxfilter, and cuGraph
- [cuDF](https://docs.rapids.ai/api/cudf/stable/) is a RAPIDS GPU DataFrame library for manipulating data with a pandas-like API.

- [cuxfilter](https://docs.rapids.ai/api/cuxfilter/nightly/) is a RAPIDS viz project. Focused around cross-filtering data, it is designed to quickly build linked dashboards powered by cuDF compute capabilities. Cuxfilter acts as a connector library rather than a visualization library. It abstracts away all the 'plumbing' required to connect a [curated list of visualizations](https://docs.rapids.ai/api/cuxfilter/nightly/charts/charts.html) to a GPU dataframe. By simply enabling accelerated dashboards inline within a notebook workflow, cuxfilter allows analysts to get to exploring their data faster.

- [cuGraph](https://docs.rapids.ai/api/cugraph/stable/) is a RAPIDS GPU accelerated graph analytics library with functionality like NetworkX.
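
Cross-filtering means that a selection made on one chart filters the shared dataset, and every other chart is then recomputed from the filtered rows. The sketch below illustrates that idea in plain Python with invented records. The helper names `trips_per_hour` and `cross_filter` are hypothetical, and this is a conceptual illustration only, not a description of how cuxfilter is implemented:

```python
from collections import Counter

# Invented trip records, for illustration only.
trips = [
    {"hour": 8, "day_type": "weekday"},
    {"hour": 8, "day_type": "weekday"},
    {"hour": 17, "day_type": "weekday"},
    {"hour": 13, "day_type": "weekend"},
    {"hour": 14, "day_type": "weekend"},
    {"hour": 17, "day_type": "weekend"},
]


def trips_per_hour(records):
    """Aggregation behind a hypothetical 'trips per hour' bar chart."""
    return Counter(r["hour"] for r in records)


def cross_filter(records, **selection):
    """Keep only records matching every selected column value."""
    return [r for r in records if all(r[k] == v for k, v in selection.items())]


# No selection: the chart aggregates every record.
print(dict(trips_per_hour(trips)))

# Selecting 'weekend' on one widget re-aggregates the hour chart.
weekend = cross_filter(trips, day_type="weekend")
print(dict(trips_per_hour(weekend)))
```

With millions of rows, every selection triggers a fresh filter and aggregation, which is the step that benefits from GPU compute.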

## Imports
Let's first make sure the necessary libraries are installed by importing them, and set the data location.

In [None]:
import cuxfilter
import cudf
import cugraph
from bokeh.models import NumeralTickFormatter
from pyproj import Proj, Transformer
from pathlib import Path

## Load Data into cuDF
As before, load `data.csv` into the GPU dataframe:

In [None]:
DATA_DIR = Path("../data")
FILENAME = Path("data.csv")

data = cudf.read_csv(DATA_DIR / FILENAME)

## Data Preprocessing
Before we can visualize the data, we need to do some preprocessing to make it more human readable and usable for cuxfilter.

First we need to transform the coordinates from their original [EPSG:4326 projection](https://epsg.io/4326) to the spherical [EPSG:3857 projection](https://epsg.io/3857) that works with the map tile underlays used in cuxfilter:

In [None]:
def transform_coords(df, x='x', y='y'):
    # With pyproj's default axis order, EPSG:4326 input is (latitude, longitude),
    # which is why the latitude column is passed as `x` below.
    transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
    df['x'], df['y'] = transform_4326_to_3857.transform(df[x].to_array(), df[y].to_array())
    return df

# Apply Transformation
trips = transform_coords(data, x='latitude_start', y='longitude_start')
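
For reference, the forward spherical Mercator projection used by EPSG:3857 can be written out directly. This plain-Python sketch is only meant to show what the transformation computes. The function name `to_web_mercator` is hypothetical, the sample coordinates are approximate, and pyproj remains the appropriate tool for real work:

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used as the sphere radius


def to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to EPSG:3857 x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Approximate coordinates of downtown Chicago, for illustration only.
x, y = to_web_mercator(-87.63, 41.88)
print(f"x={x:.0f} m, y={y:.0f} m")
```

Note that this helper takes longitude first, unlike the `Transformer` call above, which expects latitude first for EPSG:4326 input.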

Based on our previous finding about the apparent difference between weekends and weekdays, we will want to extract `day_type` from the dataset:

In [None]:
# Note: days 0-4 are weekdays, days 5-6 are weekends
trips['day_type'] = 0
trips.loc[trips.query('day>4').index, 'day_type'] = 1

Choosing the appropriate fidelity of data to show always takes some trial and error. Showing total trips of every day for every year can be noisy, while showing by month is not granular enough. We settled on weeks. That means we will want to get the global week number in the dataset:

In [None]:
# Note: Data always has edge cases, such as the extra week anomalies of 2015 and 2016:
# trips.groupby('year').week.max().to_pandas().to_dict() is {2014: 52, 2015: 53, 2016: 53, 2017: 52}
# Since 2015 and 2016 have 53 weeks, we add 1 to global week count for their following years - 2016 & 2017
# (data.year/2016).astype('int') => returns 1 if year>=2016, else 0
year0 = int(trips.year.min()) #2014
trips['all_time_week'] = data.week + 52*(data.year - year0) + (data.year/2016).astype('int')
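
The arithmetic in the cell above can be checked on single values. This plain-Python sketch mirrors the same expression. The function name `all_time_week` is hypothetical, the default `year0=2014` follows the comment in the cell, and the sample inputs are illustrative:

```python
def all_time_week(year, week, year0=2014):
    """Mirror of the notebook expression for a single (year, week) pair."""
    # int(year / 2016) is 1 for 2016 and 2017 and 0 for earlier years,
    # matching the (data.year/2016).astype('int') term.
    return week + 52 * (year - year0) + int(year / 2016)


for year, week in [(2014, 1), (2015, 1), (2016, 1), (2017, 1)]:
    print(year, week, all_time_week(year, week))
```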

To make the dashboard values more understandable, we are creating string maps to convert the dataset's numbers to their proper names. Though it may seem trivial, it removes unnecessary ambiguity and helps [reduce cognitive load](https://www.nngroup.com/articles/minimize-cognitive-load/) when our focus needs to be on finding patterns:

In [None]:
# create a weekday string map
days_of_week_map = {
    0: 'monday',
    1: 'tuesday',
    2: 'wednesday',
    3: 'thursday',
    4: 'friday',
    5: 'saturday',
    6: 'sunday'
}

month_map = {
    1: 'jan', 2: 'feb', 3: 'mar', 4: 'apr', 5: 'may', 6: 'jun', 7: 'jul', 8: 'aug', 9: 'sep', 10: 'oct', 11: 'nov', 12: 'dec'
}
day_type_map = {0:'weekday', 1:'weekend', '':'all'}

Finally, we remove the unused columns and reorganize our dataframe:

In [None]:
trips = trips[[
    'year', 'month', 'week', 'day', 'hour', 'gender', 'from_station_name',
    'from_station_id', 'to_station_id', 'x', 'y', 'to_station_name', 'all_time_week', 'day_type'
]]
trips.head()

In [None]:
# Note: save modified trips dataframe to be imported in the final notebook
trips.to_parquet(DATA_DIR / 'modified_trips.parquet')

## cuxfilter Bike Trips Dashboard
First let's investigate trip totals by various time slices by linking the dataframe to cuxfilter:

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/QfJYu_8Cfgs" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
cux_df = cuxfilter.DataFrame.from_dataframe(data)

In [None]:
# Specify the charts and widgets to use with the selected columns of data and string maps
charts = [
    cuxfilter.charts.bar('hour', title='trips per hour'),
    cuxfilter.charts.bar('month', x_label_map=month_map),
    cuxfilter.charts.bar('day', x_label_map=days_of_week_map),
    cuxfilter.charts.multi_select('year'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
]

# Generate the dashboard and select a layout
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.feature_and_double_base, title='Bike Trips Dashboard')

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: use the slider below each chart to cross filter.

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
BASE_URL = 'http://localhost:8888/'
d.show(notebook_url=BASE_URL, service_proxy='none')

### Try It Out
Now try using [cuxfilter's user guide](https://docs.rapids.ai/api/cuxfilter/nightly/) and our examples to create a dashboard of the above data using different layouts, themes, and chart types.

In [None]:
# code here

## cuxfilter Temperature Dashboard
Let's continue investigating, this time following up on the increasing trips year over year and the decreases in winter months.

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/b7Kg9U_M1HM" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
# Specify the charts and widgets to use with the selected columns of data and string maps
charts = [
    cuxfilter.charts.bar('all_time_week', title='rides per week'),
    cuxfilter.charts.heatmap(x='all_time_week', y='day', aggregate_col='temperature',
                             aggregate_fn='mean', point_size=40, legend_position='right',
                             title='mean temperature by day'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
]

# Generate the dashboard and select a layout
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.feature_and_base, title='Temperature Dashboard')

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: pan to match up the top and bottom chart axis

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
BASE_URL = 'http://localhost:8888/'
d.show(notebook_url=BASE_URL, service_proxy='none')

## Weather Findings
The dashboard should look something like this:
<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master//images/cuxfilter_02_dashboard_2.png" />

The weather's effect becomes clear in this dashboard, as warmer temperatures seem to strongly match a large increase in ride counts - which intuitively makes sense. But aside from developing weather control, there isn't much that can be done to respond to this finding.


## cuxfilter Geospatial Trips Graph
Next, let's take a look at the geospatial element of the data and see if we can find interesting patterns. Based on how the trip data is logged, converting it into a graph will make managing it easier.

For this we will need [cuGraph](https://docs.rapids.ai/api/cugraph/stable/api.html) to translate the dataset into an edge list:

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/36yztZl_jfY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
G = cugraph.Graph() 
G.from_cudf_edgelist(data, source='from_station_id', destination='to_station_id')
edges = G.edges()

In [None]:
# Trips have been converted into edges with source and destination based on station IDs.
edges.head()
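
In graph terms, each station is a node and each trip contributes an edge from its start station to its end station. A plain-Python sketch of building a weighted edge list from invented trips is shown below. It illustrates the idea only, and how cuGraph stores, deduplicates, or symmetrizes edges internally is not covered here:

```python
from collections import Counter

# Invented (from_station_id, to_station_id) pairs, for illustration only.
trip_pairs = [(2, 5), (2, 5), (5, 2), (5, 7), (7, 7), (2, 5)]

# Count how many trips share the same directed (src, dst) pair.
edge_weights = Counter(trip_pairs)

# Nodes are every station that appears as a source or a destination.
nodes = sorted({station for pair in trip_pairs for station in pair})

print("nodes:", nodes)
for (src, dst), weight in sorted(edge_weights.items()):
    print(f"{src} -> {dst}: {weight} trips")
```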

Next we load the formatted data into cuxfilter and specify the chart types:

In [None]:
cux_df = cuxfilter.DataFrame.load_graph((trips, edges))

In [None]:
# Specifying a graph chart type will use Datashader and its required parameters
charts = [
    cuxfilter.charts.graph(
        node_id='from_station_id',
        edge_source='src', edge_target='dst',
        node_aggregate_fn='count',
        node_pixel_shade_type='linear', node_point_size=35, #node size is fixed
        edge_render_type='curved', #other option: direct
        edge_transparency=0.7, #0.1 - 0.9
        tile_provider='CARTODBPOSITRON', 
        title='Graph for trip source_stations (color by count)'
    ),
    cuxfilter.charts.multi_select('year'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
    cuxfilter.charts.bar('from_station_id'),
    cuxfilter.charts.bar('to_station_id'),
    cuxfilter.charts.view_dataframe(['from_station_name', 'from_station_id'], drop_duplicates=True)
]

# Generate the dashboard, select a layout and theme
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.feature_and_triple_base, theme=cuxfilter.themes.rapids, title='Geospatial Trips')

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: Graph edges can be turned on/off via the line tool icon
# Note: Inspect Neighboring Edges can be turned on/off for box or lasso select
# Caution: Selecting areas with Inspect Neighboring Edges on can result in slow performance or OOM errors  
# Caution: If the dashboard freezes, simply close the tab and restart this cell
# Note: This is rendering 9 MILLION edges

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
BASE_URL = 'http://localhost:8888/'
d.show(notebook_url=BASE_URL, service_proxy='none')

## cuxfilter Network and Geospatial Graph
While the above produced many findings, filtering through so many trip edges is not ideal.
Next we will try to push the visual analytics further with a clustered network graph alongside the geospatial graph, using the [ForceAtlas2](https://docs.rapids.ai/api/cugraph/stable/api.html?highlight=force#module-cugraph.layout.force_atlas2) algorithm from cuGraph:

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/IgLXuW-LRVk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
# Note: Often a good visualization result only comes from a lot of trial and error
# The below parameters produce useful clustering, but try experimenting with them further
ITERATIONS=500
THETA=10.0
OPTIMIZE=True

# Using the previously created edge list, we calculate the FA2 layout positions here
trips_force_atlas2_layout = cugraph.layout.force_atlas2(G, max_iter=ITERATIONS,
                strong_gravity_mode=False,
                outbound_attraction_distribution=True,
                lin_log_mode=False,
                barnes_hut_optimize=OPTIMIZE, barnes_hut_theta=THETA, verbose=True)

Merge the calculated ForceAtlas2 layout with the trip dataframe:

In [None]:
final_df = trips_force_atlas2_layout.merge(
                trips[['from_station_id', 'from_station_name','to_station_id', 'year', 'hour', 'day_type', 'x', 'y']],
                left_on='vertex',
                right_on='from_station_id',
                suffixes=('', '_original')
)

# Preview
final_df.head()

Next we load the data into cuxfilter and specify the chart types:

In [None]:
cux_df = cuxfilter.DataFrame.load_graph((final_df, edges))

In [None]:
# Both scatter and graph chart types use Datashader 
charts= [
  cuxfilter.charts.graph(
      edge_source='src', edge_target='dst',
      edge_color_palette=['gray', 'black'],
      node_pixel_shade_type='linear',
      edge_render_type='curved', #other option: direct
      edge_transparency=0.7, #0.1 - 0.9
      title='ForceAtlas2 Layout Graph'
  ),
  cuxfilter.charts.scatter(
    x='x_original', y='y_original', 
    tile_provider='CARTODBPOSITRON',
    point_size=3,
    pixel_shade_type='linear',
    pixel_spread='spread',
    title='Original Layout'
  ),
  cuxfilter.charts.multi_select('year'),
  cuxfilter.charts.multi_select('day_type', label_map={0:'weekday', 1:'weekend', '':'all'}),
  cuxfilter.charts.bar('hour', title='Trips per hour'),
  cuxfilter.charts.bar('from_station_id', title='Source station'),
  cuxfilter.charts.bar('to_station_id', title='Destination station'),
  cuxfilter.charts.view_dataframe(['from_station_id', 'from_station_name'], drop_duplicates=True)
] 

# Generate the dashboard, select a layout and theme
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.double_feature_quad_base, theme=cuxfilter.themes.rapids, title="Network and Geospatial Graph")

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: Graph edges can be turned on/off via the line tool icon
# Note: Inspect Neighboring Edges can be turned on/off for box or lasso select
# Caution: Selecting areas with Inspect Neighboring Edges on can result in slow performance or OOM errors  
# Caution: If the dashboard freezes, simply close the tab and restart this cell
# Note: This is rendering 9 MILLION edges

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
BASE_URL = 'http://localhost:8888/'
d.show(notebook_url=BASE_URL, service_proxy='none')

## Summary of Exploratory Findings
Based on the exploratory analytics done above, we've found that there are two distinct groups of behaviors based on time (hour / weekend / weekday) and location. With the next notebook, we will see if we can coax out further information about these groups using more advanced data analytics.

### cuxfilter Troubleshooting
As we just released the graph visualization capability in cuxfilter, we are still working on building out features and fixes. 

If you find something that needs fixing or have feature requests, please submit an [issue on our GitHub page](https://github.com/rapidsai/cuxfilter/issues). Better yet, [help contribute](https://github.com/rapidsai/cuxfilter#contributing-developers-guide). 

# Data Analysis with Visual Analytics

***Combining analytics with visualization***


In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/tZl0mNmBwrA" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


## Overview

In this notebook we will continue to explore the Divvy bikes dataset using cuDF, cuGraph, and cuSpatial, and see how the analysis results can feed directly into visualization tools like hvplot and Datashader.

### cuDF, cuGraph, cuSpatial, hvplot, and Datashader
- [cuDF](https://docs.rapids.ai/api/cudf/stable/) is a RAPIDS GPU DataFrame library for manipulating data with a pandas-like API.

- [cuGraph](https://docs.rapids.ai/api/cugraph/stable/) is a RAPIDS library for GPU accelerated graph analytics with functionality like NetworkX.

- [cuSpatial](https://docs.rapids.ai/api/cuspatial/stable/) is a collection of GPU accelerated algorithms for computing geo-spatial measures.

- [hvplot](https://hvplot.holoviz.org/) is a high-level plotting API for the PyData ecosystem built on [HoloViews](http://holoviews.org/).

- [Datashader](https://datashader.org/) is a library for high fidelity server side data rendering.

## Imports

In addition to the libraries mentioned above, we will also make direct use of [CuPy](https://docs.cupy.dev/en/stable/), [NumPy](https://numpy.org/), and [pandas](https://pandas.pydata.org/).

In [None]:
import cudf
import cugraph
import cupy
import cuspatial

import numpy as np
import pandas as pd

import datashader as ds
import datashader.transfer_functions as tf

import hvplot.cudf
import hvplot.pandas

from pathlib import Path

## Loading Data into cuDF

First let's load the data. In addition to the main Divvy `data.csv` file, we will also load the small `stations.csv` file that we prepared in the first notebook. 

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ZeiLc_DbKEk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
DATA_DIR = Path("../data")

In [None]:
# Note: remember to reparse into datetimes
df = cudf.read_csv(DATA_DIR / "data.csv", parse_dates=('starttime', 'stoptime'))

In [None]:
stations = pd.read_csv(DATA_DIR / "stations.csv")

We will want to continue our investigation into weekday vs weekend patterns, so let's first add a column for that:

In [None]:
df["weekday"] = df['starttime'].dt.weekday

## Trying Analysis with cuSpatial

Let's take a look at some spatial measures and see if there are any interesting features.

We might start with the first station, and see what the max trip length from it is:

In [None]:
r0 = df.iloc[0]
station_id, origin_lon, origin_lat = r0["from_station_id"], r0["longitude_start"], r0["latitude_start"]

The cuSpatial function `lonlat_to_cartesian` will let us quickly compute the x/y distances (in kilometers) from the origin station to every ending trip location:

In [None]:
sub_df = df[df["from_station_id"]==station_id[0]]
dist = cuspatial.lonlat_to_cartesian(origin_lon[0], origin_lat[0], sub_df["longitude_end"], sub_df["latitude_end"])

CuPy functions can compute derived values on these GPU dataframes:

In [None]:
# good ol' Pythagorean theorem
cupy.sqrt(cupy.max(dist.x**2 + dist.y**2))
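
For intuition about the projection-plus-Pythagoras step, here is a minimal CPU-only sketch for a single pair of points. It uses a generic equirectangular approximation, which is an assumption on our part and not necessarily the exact formula cuSpatial applies; the helper names, the Earth radius constant, and the sample coordinates are all hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def lonlat_to_xy_km(origin_lon, origin_lat, lon, lat):
    """Approximate x/y offsets in km from the origin (equirectangular projection)."""
    mid_lat = math.radians((origin_lat + lat) / 2.0)
    x = math.radians(lon - origin_lon) * math.cos(mid_lat) * EARTH_RADIUS_KM
    y = math.radians(lat - origin_lat) * EARTH_RADIUS_KM
    return x, y


def trip_distance_km(origin_lon, origin_lat, lon, lat):
    """Straight-line distance via the Pythagorean theorem on the projected offsets."""
    x, y = lonlat_to_xy_km(origin_lon, origin_lat, lon, lat)
    return math.hypot(x, y)


# Two made-up points in Chicago, one hundredth of a degree of latitude apart
d = trip_distance_km(-87.63, 41.88, -87.63, 41.89)
print(f"{d:.3f} km")
```

One hundredth of a degree of latitude works out to roughly 1.1 km, which is a handy sanity check when reading the distance histograms later on.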

What if we want to compute all trip distances? We can compute the distances using every station as a starting point:

In [None]:
def trip_dists(df):
    results = []

    for idx, row in stations.iterrows():
        station_id, origin_lon, origin_lat = int(row["station_id"]), row["lon"], row["lat"]
        sub_df = df[df["from_station_id"]==station_id]
        res = cuspatial.lonlat_to_cartesian(origin_lon, origin_lat, sub_df["longitude_end"], sub_df["latitude_end"])
        res["dist"] = cupy.sqrt(res.x**2 + res.y**2)
        results.append(res)
        
    return cudf.concat(results)

In [None]:
all_from_dists = trip_dists(df)

## hvplot of Trip Distances
Now that we have all the distances in a dataframe, we can use hvplot to create a plot:

In [None]:
# bin size is chosen after some trial and error
all_from_dists.hvplot.hist(y="dist", normed=True, bins=20)

Clearly most trips are fairly short, usually under 2 km. This makes sense when we remember that most trip durations are also less than 15 minutes. 

It might also be interesting to follow up and break the distribution of trips down by weekday vs weekend:

In [None]:
weekend_trips = df[df["weekday"].isin([5, 6])] # weekend days = 5, 6 
weekday_trips = df[df["weekday"].isin(list(range(5)))]  # weekday days = 0-4

In [None]:
# calculate distances from the previous function
weekend_dists = trip_dists(weekend_trips)
weekday_dists = trip_dists(weekday_trips)

In [None]:
all_combined_dists =  cudf.concat([weekday_dists, weekend_dists])
all_combined_dists.head()

## hvPlot of Weekend vs Weekday Trip Distance
Plotting these two distributions together, we can see that the weekday trips (orange) peak more sharply at shorter distances, while the weekend distribution (blue) has a larger share of longer trips:

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/9qnnVF91Xfc" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
weekend_hist = weekend_dists.hvplot.hist(y="dist", alpha=0.5, bin_range=(0, 10), normed=True, color="blue")
weekday_hist = weekday_dists.hvplot.hist(y="dist", alpha=0.5, bin_range=(0, 10), normed=True, color="orange")
weekend_hist * weekday_hist

While interesting to note, there don't appear to be any major revelations from a distance analysis approach. 

## Trying Analysis with cuDF

Let's use cuDF directly to group and aggregate our data to look for anything interesting about the flow of trips in and out of stations. 

We want to look at the daily net flow of trips at each station, i.e. how many more (or fewer) trips *started* at a station vs *ended* at a station in a given day.

In order to group by day, we first take the "floor" of each timestamp divided by one day:

In [None]:
one_day = np.datetime64(1, 'D').astype('datetime64[ns]').astype('int64') 

# out
df['from_day'] = df['starttime'].astype('int64') // one_day

# in
df['to_day'] = df['stoptime'].astype('int64') // one_day
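
The flooring trick can be checked on the CPU with nothing but the standard library. This is a sketch of the same arithmetic on three made-up timestamps, not the cuDF code path; the helper names `to_ns` and `day_index` are hypothetical.

```python
from datetime import datetime, timezone

NS_PER_DAY = 24 * 60 * 60 * 10**9  # nanoseconds in one day
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ns(dt):
    """Convert a timezone-aware datetime to integer nanoseconds since the Unix epoch."""
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000


def day_index(dt):
    """Floor a timestamp to a whole-day index, mirroring `astype('int64') // one_day`."""
    return to_ns(dt) // NS_PER_DAY


morning = datetime(2014, 6, 24, 8, 15, tzinfo=timezone.utc)
evening = datetime(2014, 6, 24, 23, 59, tzinfo=timezone.utc)
next_day = datetime(2014, 6, 25, 0, 1, tzinfo=timezone.utc)

# Trips on the same calendar day share an index; the next day is one higher
same_day = day_index(morning) == day_index(evening)
day_gap = day_index(next_day) - day_index(morning)

# Multiplying the index back by one day recovers midnight of that day
midnight = datetime.fromtimestamp(day_index(morning) * NS_PER_DAY // 10**9, tz=timezone.utc)
print(same_day, day_gap, midnight.isoformat())
```

Because the day index is a plain integer, it can be used directly as a groupby key and converted back to a timestamp afterwards.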

Now we can group by the station id and day for both the departing and arriving cases. We name the resulting size columns `out` and `in` respectively:

In [None]:
df_out = df.groupby(by=["from_station_id", "from_day"]).size().to_frame('out').reset_index()
df_in = df.groupby(by=["to_station_id", "to_day"]).size().to_frame('in').reset_index()

Let's rename the columns to be the same in both DataFrames:

In [None]:
df_out.rename(columns={"from_station_id": "station_id", "from_day": "day"}, inplace=True)
df_in.rename(columns={"to_station_id": "station_id", "to_day": "day"}, inplace=True)

And set the index to be the (station id, day) pair:

In [None]:
df_out = df_out.set_index(["station_id", "day"])
df_in = df_in.set_index(["station_id", "day"])

Now we can join these two DataFrames to compute a `flow = out - in` column:

In [None]:
full_df = df_in.join(df_out, how="outer").fillna(0).reset_index()
full_df["flow"] = full_df["out"] - full_df["in"]
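
To see what the outer join and `fillna(0)` accomplish, here is a small pure-Python sketch of the same net-flow calculation. The trip tuples are made up, and `Counter` stands in for the groupby sizes; it illustrates the logic only, not cuDF's join.

```python
from collections import Counter

# Hypothetical trips: (from_station_id, from_day, to_station_id, to_day)
sample_trips = [
    (77, 100, 192, 100),
    (77, 100, 192, 100),
    (192, 100, 77, 100),
    (81, 101, 77, 101),
]

# Equivalent of the two groupby(...).size() calls
out_counts = Counter((src, day) for src, day, _, _ in sample_trips)
in_counts = Counter((dst, day) for _, _, dst, day in sample_trips)

# Outer join: keep every (station, day) key seen on either side,
# treating a missing count as 0 (the role of fillna(0))
flow = {
    key: out_counts.get(key, 0) - in_counts.get(key, 0)
    for key in sorted(out_counts.keys() | in_counts.keys())
}

for (station, day), net in flow.items():
    print(station, day, net)
```

Station 77 on day 101 only appears on the arrivals side, so an inner join would have dropped it; the outer join keeps it with a flow of -1. Every trip adds one departure and one arrival, so the flows over all keys sum to zero.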

Let's also convert our "day" values back to proper timestamps:

In [None]:
full_df["time"] = (full_df["day"] * one_day).astype('datetime64[ns]')
full_df = full_df[["station_id", "time", "flow"]]

Now we can take a glimpse at the resulting DataFrame which has the net `out-in` trip flow by station per day:
- A `+` positive number means there was an excess of trips *starting out* of the station that day.
- A `-` negative number means an excess of trips *ending in* the station that day.

In [None]:
full_df.head()

We might like to look at the extreme behaviour. What is a high number of excess arrivals or departures at a station? Let's pull out the individual time series for each station id, and look at the max/min for each station:

In [None]:
flows = []
for i in stations.station_id:
    subdf = full_df[full_df.station_id==i].set_index("time")
    flows.append((i, subdf.flow.max(), subdf.flow.min()))
flows = pd.DataFrame(flows, columns=["station_id", "max_out", "max_in"])

In [None]:
flows

With this information, we can see what stations had the largest ever excess departures (station 192) or arrivals (station 77):

In [None]:
flows.iloc[flows.max_out.argmax()]

In [None]:
flows.iloc[flows.max_in.argmin()]

Knowing about excess arrivals vs departures is probably important for Divvy to be able to manually re-allocate bikes. We could ask how many stations ever had a daily excess of more than 30 trips in either direction:

In [None]:
len(flows[flows.max_out > 30])

In [None]:
len(flows[flows.max_in < -30])

While looking at individual stations or max/mins is useful to get preliminary ideas of patterns, it would be better to see it all at once. First we need to prepare a new Dataframe that has all the series as columns:

In [None]:
series = []

for i in stations.station_id:
    s = full_df[full_df.station_id==i][["time", "flow"]]
    s.rename(columns={"flow": f"s{i}"}, inplace=True)
    s = s.set_index("time")
    series.append(s)
    
df_wide = cudf.concat(series, axis=1).fillna(0)

The resulting Dataframe has a daily time series for every column, one for each station:

In [None]:
df_wide

## hvplot of Select Station Flows
It's simple to pull out individual stations for comparison using `hvplot`:

In [None]:
df_wide.hvplot(y=["s77","s81","s192","s195","s268","s287"], alpha=0.4)


The above plot shows some of the more interesting station patterns, which roughly match the overall seasonal flows. Station 195 appears perpetually overtaxed, while something near station 77 seems to draw in a lot of bikes. Yet it's hard to glean a pattern without seeing the connections between the stations' flows. 

Bonus points for anyone who knows what anomaly happened on 6/24/2014 (seriously, we're curious). 


### Try It Now
See if you can plot all the stations in an hvplot (it is possible but takes a while to render): 

In [None]:
# code here

Lastly, let's take a look at the data with Datashader. First we make a function `series_shade` that can take a wide dataframe of time series like the one we made above, and render *all* of the series at once using Datashader:

In [None]:
# Details here https://datashader.org/user_guide/index.html
def series_shade(df):
    cols = list(df.columns)
    
    itime = cudf.to_datetime(df.index).astype('int64')
    x_range = (itime[0], itime[-1])
    
    y_range = (df.min().min(), df.max().max())
    
    temp = cudf.DataFrame(df)
    temp["itime"] = itime
    
    # the width is 4x365, leaving one pixel width per day
    cvs = ds.Canvas(plot_height=400, plot_width=1460)
    agg = cvs.line(temp, x="itime", y=cols, agg=ds.count(), axis=1)
    
    print(f"y range: ({y_range[0]}, {y_range[1]})")
    return tf.shade(agg, how='linear', cmap=["purple","red","white"])

## Datashader Line Plot of Total Daily Flows
Now let's pass in the daily net excess data to get a rough Datashader plot:

In [None]:

series_shade(df_wide)

It's not completely clear what we can see here, but it points to some ideas for future exploration. If you squint it does seem that there is an unbalanced flow out of stations vs into stations.

## Datashader Line Plot of Cumulative Daily Flows
As a last experiment, let's make the same plot, but with *cumulative* excess trips:

In [None]:
df_cumulative = df_wide.cumsum()
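
As a quick illustration of why the cumulative view is more revealing, consider one hypothetical station whose daily net flow looks like noise around a small positive bias. The numbers below are invented, and `itertools.accumulate` plays the role of `cumsum()` for a single column.

```python
from itertools import accumulate

# Hypothetical daily net flow (out - in) for one station over ten days
daily_flow = [3, -1, 2, 0, 4, -2, 3, 1, -1, 2]

# Running total of excess departures, the per-column equivalent of cumsum()
cumulative_flow = list(accumulate(daily_flow))

print(cumulative_flow)
print(f"largest single-day excess: {max(daily_flow)}, excess after 10 days: {cumulative_flow[-1]}")
```

No single day looks dramatic, but the running total keeps climbing, which is the kind of drift that bike re-allocation has to offset.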

In [None]:
series_shade(df_cumulative)

This view emphasizes the unbalanced flow and is a bit more interesting. It illustrates the notion that Divvy must be engaging in a lot of continual re-allocation of its bikes to offset these excess trips at individual stations.

If we knew the marginal costs compared to ridership income, it could prove an interesting data point on when network expansion would become prohibitive. However, without that we need to look elsewhere for analysis. 

### Try It Now
Datashader plots can be wrapped in hvplots, much like bokeh plots. Try wrapping the above examples in order to make them more interactive by using [Datashader's usage guide](https://datashader.org/user_guide/Timeseries.html):

In [None]:
# code here

## Trying Analysis with cuGraph PageRank

In our previous notebook we were able to find some interesting patterns by converting our dataframe into a graph. Here, we will try the `cugraph.pagerank` algorithm to see if it helps succinctly illustrate patterns for the "most popular" stations.

First, let's see what it looks like to compute PageRank for a single hour of the day, e.g. 5PM, by subsetting the data:

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/bJushO0ebrg" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
d17 = df[df["hour"]==17]

Then groupby (from_station_id, to_station_id) and take the group size to get all the unique individual routes between stations that hour, and also the number of trips that took each of those routes:

In [None]:
g17 = d17.groupby(by=["from_station_id", "to_station_id"])
routes17 = g17.size().reset_index()
routes17.head()

Now we can create a `cugraph.Graph` with the from and to station IDs:

In [None]:
G = cugraph.Graph()
G.from_cudf_edgelist(d17, source='from_station_id', destination='to_station_id')
d17_page = cugraph.pagerank(G)
d17_page.head()
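
For intuition about what a PageRank score represents, here is a minimal pure-Python sketch using power iteration on a handful of made-up routes. It is not cuGraph's implementation: the function name `pagerank_sketch`, the damping factor of 0.85, the convergence tolerance, and the sample edges are all assumptions chosen for illustration.

```python
def pagerank_sketch(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over a list of directed (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        # Each node passes its rank on in equal shares along its outgoing edges
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        converged = sum(abs(new_rank[node] - rank[node]) for node in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical routes: most trips end at station 77
sample_routes = [(81, 77), (192, 77), (195, 77), (77, 81), (192, 81)]
ranks = pagerank_sketch(sample_routes)
top_station = max(ranks, key=ranks.get)
print(top_station, {station: round(score, 4) for station, score in ranks.items()})
```

Stations that many trips flow into, especially from other well-ranked stations, end up with the highest scores, which is why PageRank is a reasonable proxy for station importance.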

PageRank values are relative, and as such they matter less than the ranking they produce for the network of trips. Let's see which stations rank as most important at 5PM (on any day):

In [None]:
d17_top = d17_page.nlargest(20, "pagerank").to_pandas()
d17_top.head()

## hvplot of 5pm Top PageRank Locations
Plotting these stations we can see that at 5PM the most important stations are nearly all downtown, matching our previous notebook findings about a focused downtown core of total trips:

In [None]:
d17_page_locs = stations[stations.station_id.isin(d17_top.vertex)]
d17_page_locs.hvplot.points(x='lon', y='lat', alpha=0.7, size=300, geo=True, tiles="OSM").opts(width=600, height=600)

Now that we know applying PageRank seems to produce useful results, let's look at how stations rank by weekdays vs weekends. To get proper rankings, we need to compute it for every individual day of the week:

In [None]:
results = {}
for w in range(7):
    dfw = df[df["weekday"]==w]
    G = cugraph.Graph()
    G.from_cudf_edgelist(dfw, source='from_station_id', destination='to_station_id')
    df_page = cugraph.pagerank(G).nlargest(20, "pagerank")
    results[w] = set(df_page.to_pandas()["vertex"])

Let's find out what stations were continually highest ranked among all weekdays and weekend days:

In [None]:
weekday = set.intersection(*[results[i] for i in range(5)]) # days 0-4 are weekdays
weekend = set.intersection(results[5], results[6])  # days 5-6 are the weekend
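
The intersection step can be sketched on its own with made-up data. The station ids below are hypothetical and the top lists are shortened to three entries per day; the point is only to show how `set.intersection` keeps the stations that appear in every day's list.

```python
# Hypothetical top-ranked station ids per day of week (0 = Monday ... 6 = Sunday)
top_by_day = {
    0: {35, 77, 192},
    1: {77, 91, 192},
    2: {77, 192, 195},
    3: {35, 77, 192},
    4: {77, 192, 268},
    5: {35, 76, 268},
    6: {35, 76, 85},
}

# Stations present in the top list on every weekday, and on both weekend days
weekday_core = set.intersection(*(top_by_day[d] for d in range(5)))
weekend_core = top_by_day[5] & top_by_day[6]

# Stations important in both regimes
overlap = weekday_core & weekend_core
print(sorted(weekday_core), sorted(weekend_core), sorted(overlap))
```

Station 35 shows up on some weekdays but not all of them, so it is excluded from the weekday core. Requiring membership on every day is a strict filter, which is part of why the two resulting sets can end up with little overlap.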

Listing the stations that are ranked important on weekdays and ranked important on weekends, we find that there is little overlap:


In [None]:
weekend

In [None]:
weekday

Finally, we can plot these quickly using hvplot again. Let's add a column to denote weekday / weekend so that we can group by that type:

In [None]:
r1 = stations[stations.station_id.isin(weekend)]
r1 = r1.assign(type="Weekend")

r2 = stations[stations.station_id.isin(weekday)]
r2 = r2.assign(type="Weekday")

result = pd.concat([r1, r2])

## hvplot of Weekend / Weekday Top PageRank Locations
Looking at the plot, nearly all the important weekday stations are downtown, and on the weekend the important stations are further out in popular districts around downtown:

In [None]:
result.hvplot.points(x='lon', y='lat', by='type', 
                     alpha=0.7, size=300, geo=True, tiles="OSM").opts(width=700, height=600)

The above map of top PageRanked stations for weekend / weekdays matches very well with the ForceAtlas2 clustered graph and time series cross-filtered visualizations of the previous notebook, but in a much more concise and presentable manner. This is the positive result we were hoping to find with our analysis.

## Summary of Analytics Findings 
When running analytics, it's critical to have a solid understanding of the underlying data in order to make correct decisions. We tried several analytical approaches to see if we could glean some meaningful patterns. As is often the case, some worked better than others - but because we did extensive exploratory visualization we now have confidence that the weekend / weekday binned PageRank approach will produce accurate results when used for visualizations in our next notebook.

In [None]:
## Run this cell to show this section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/8lfO8gOOTXI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


# Explanatory Data Visualization
***Interactive presentation dashboards***


In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/GJnUGqYj7D0" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


## Overview
This final notebook is geared towards taking the previous findings and preparing them for presentation through an interactive Plotly Dash visualization application powered by cuDF and cuGraph's PageRank. 

### cuDF, cuGraph, cuxfilter, and Plotly Dash
- [cuDF](https://docs.rapids.ai/api/cudf/stable/) is a RAPIDS GPU DataFrame library for manipulating data with a pandas-like API.

- [cuGraph](https://docs.rapids.ai/api/cugraph/stable/) is a RAPIDS library for GPU accelerated graph analytics with functionality like NetworkX.

- [cuxfilter](https://docs.rapids.ai/api/cuxfilter/nightly/) is a RAPIDS visualization library for cross-filtering data, designed to quickly build linked dashboards powered by cuDF compute capabilities. 

- [Plotly Dash](https://plotly.com/dash/) is a framework for specifying production ready visualization applications all in Python. 

## Dashboard Concepts and Audiences
We've taken a bike share dataset, explored it, run analytics on it, and now have confidence in our ability to highlight the most important stations for two key usage patterns. The next step is communicating our findings - something at which visualization excels.

However, a viz needs to be appropriate for the data it is showing, the audience it is intended for, and the medium or context in which it will be shown.

For instance, is the presentation to highly technical colleagues already familiar with your work, as you drive the presentation from your personal machine? Or to executives in a boardroom? Or completely asynchronous through a web site with a wide range of audience expertise levels?

Thinking about these ahead of time and preparing for them will lead to more successfully communicating your findings.

As you think of this, it helps to explore previous works, such as the [Plotly Dash Gallery](https://dash-gallery.plotly.host/Portal/), for ideas and to see best practices (or worst practices, I'm looking at you 3D pie chart...).



In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/JyZs4ApG7yI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
     

### A Note About Sketching and Design Iterations
<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master/images/dashboard-sketch-ideas.jpg" />
<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master/images/plotly_dashboard_sketch.jpg" />

A well designed visualization owes as much to iteration as it does to skill (though experience helps). Generally, the more iterations, the better the viz. However, the mental overhead of our technical tools often gets in the way of our thought process.

In the time it takes to create a new notebook cell, look up the syntax for your favorite viz library, and load data, you could have quickly generated several iterations of ideas through sketches. 

As shown from our sketches above, they don't have to be high fidelity or even good - just enough to try out ideas quickly and communicate to colleagues. 

### Try It Now
Pull out a piece of paper and do a few thumbnail sized sketches of variations on this dashboard - spending no more than 5-10 minutes. The messier the better. 

Thumbnail sketches are for thinking and personal consumption. Larger ones come later to help communicate ideas to colleagues. 

## Imports
Now that we have sketched out our ideas, let's prototype them. As usual, make sure the necessary imports are present, and set the data location.

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/R5AbKqo2Uvk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
import cudf
import plotly.graph_objects as go
import plotly.express as px
from jupyter_dash import JupyterDash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import pandas as pd
import cugraph
import cuxfilter
from pathlib import Path

DATA_DIR = Path("../data")
FILENAME = Path("modified_trips.parquet")


In [None]:
trips = cudf.read_parquet(DATA_DIR / FILENAME)

In [None]:
trips['time_of_day'] = 0 # day
trips.loc[trips.query('hour>19 or hour<8').index, 'time_of_day'] = 1 # night
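
The day/night split above can be double-checked with a tiny standalone function. This is a plain-Python restatement of the same condition for illustration (the function name `time_of_day_label` is hypothetical); the notebook itself keeps using the vectorized cuDF version.

```python
def time_of_day_label(hour):
    """Return 1 (night) for hours before 8am or from 8pm onward, else 0 (day)."""
    return 1 if hour > 19 or hour < 8 else 0


# Label every hour of the day to confirm where the boundaries fall
labels = {hour: time_of_day_label(hour) for hour in range(24)}
night_hours = [hour for hour, label in labels.items() if label == 1]
print(night_hours)
```

Hours 8 through 19 are labelled day and everything else night, giving twelve hours in each bucket.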

In [None]:
# create a day_type string map
day_type_map = {0:'weekday', 1:'weekend', '':'all'}
time_of_day_map = {0:'day(8am-8pm)', 1:'night(8pm-8am)', '':'all'}

### Dashboard Mockup with cuxfilter
With our sketch idea of what we want the final explanatory visualization to look like, and what data it will show, let's try to create an interactive mockup to test our concept.

As usual, we will load the data into cuDF and spec out the cuxfilter chart:

In [None]:
cux_df = cuxfilter.DataFrame.from_dataframe(trips)

In [None]:
# Specifying a scatter plot chart will use Datashader and its required parameters
charts = [
    cuxfilter.charts.scatter(x='x', y='y', tile_provider='CARTODBPOSITRON',
                           point_size=3, pixel_shade_type='linear', pixel_spread='spread',
                          title='All Trips'),
    cuxfilter.charts.bar('all_time_week', title='Rides per week'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
    cuxfilter.charts.multi_select('hour'),
]

# Generate the dashboard, select a layout and theme
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.feature_and_base, theme=cuxfilter.themes.rapids)


In [None]:
# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
BASE_URL = 'http://localhost:8888/'
d.show(notebook_url=BASE_URL, service_proxy='none')

## Mockup Results
The cuxfilter mockup should look something like this:
<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master/images/notebook_04_dashboard_1.png" />

Overall the design seems to work, with the obvious caveat that we have yet to see PageRank in action. Nevertheless, because of how quick it is to build an interactive dashboard with cuxfilter, it can work well as a mock up tool.

If your design calls for chart types or features which are difficult to fully test (like arbitrary function calls), mocking up still makes sense for even component elements or interactions, supplemented with tools like hvplot or Datashader.

Mockups are particularly important if you haven't yet connected real data, or a full set of your data, to a visualization, since building an explanatory / production level application takes substantial effort (even with a simple API). Skipping lower fidelity interactive mockups will therefore almost certainly end up wasting time on rework. With data visualization there are always surprises, from unusable results and slow performance to the limitations of various chart interactions.

As mentioned above with sketching, increasing the number of design iterations your visualization goes through improves its quality, and interactive mockups are a useful tool for that purpose.



## Production Ready <br> Plotly Dash Visualization Application <br> with Real-time Page Rank Compute
Now that we are confident that our chart types and interactions are appropriate, let's build the dashboard with help from the [Plotly Dash Documentation](https://dash.plotly.com/).

First let's load the data into a cuDF DataFrame and prepare it for the viz:

In [None]:
## Run this cell to show the next section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/nOrVlzvT5Eg" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


In [None]:
# Load the stations data from first notebook
stations = cudf.read_csv(DATA_DIR / "stations.csv")

# Get station names
station_names = trips[['from_station_id', 'from_station_name']].drop_duplicates()
station_names.columns = ['station_id', 'station_name']

# Get total trips per station
total_trips = (trips.groupby('from_station_id').size() + trips.groupby('to_station_id').size()).reset_index()
total_trips.columns = ['station_id', 'total_trips']

# Add total trips to dataframe
stations = stations.merge(total_trips, on='station_id')
stations = stations.merge(station_names, on='station_id')

In [None]:
stations.head()

## Define Application Layout and Style
Plotly Dash apps use standard web elements to define layouts and styles through `html`, `styles`, `class` and `css`. You can learn more in the [Dash layouts documentation](https://dash.plotly.com/layout). For our example we are using a locally hosted `css` file in the default folder, `/assets/dash-style.css`, and inline `styles`. 

Here we define our app and the layout to have a `div` side bar containing the `h3` title, the total trips indicator, and two drop down menus, and another `div` to contain the map chart and bar chart below: 

In [None]:
app = JupyterDash(__name__)

app.layout = html.Div([
    html.Div([
        html.H3(["Divvy Bikeshare Chicago"]),
        html.H5(["Total Selected Trips:"]),
        dcc.Loading(
            dcc.Graph(id = 'number', figure = go.Figure(go.Indicator(mode = "number", value = trips.shape[0])),
            style = {
            'height': '250px'
            }),
            color = '#b0bec5'
        ),
        html.H5(["Day of Week:"]),
        dcc.Dropdown(id = 'day', clearable = False, value = '',
            options = [{'label': day_type_map[c],'value': c} for c in day_type_map]
        ),
        html.H5(["Time of Day:"]),
        dcc.Dropdown(id = 'time', clearable = False, value = '',
            options = [{'label': time_of_day_map[c], 'value': c} for c in time_of_day_map]
        )],
        style = {
            'z-index' : '99',
            'position': 'absolute',
            'width': '15%',
            'height': 'calc(100% - 2em)',
            'padding': '1em 2em',
            'background-color': '#aabacc',
            'color': 'rgb(70, 105, 130)',
            'box-shadow': '5px 0px 3px 0px rgba(0,0,0,0.1)'
        }
    ),
    html.Div([
        html.Div([
            html.H5(["Station Importance PageRank(Color) by Trips(Size)"]),
            dcc.Graph(id = 'pagerank_plot',
                config = {'responsive': True, 'modeBarButtonsToRemove': ['select2d', 'lasso2d']}
            )
        ],
        style = {
            'display': 'inline-block',
            'width': '100%',
            'vertical-align': 'top'
        }),
        html.Div([
            html.H5(["Total Trips Per Week (2014-2017)"]),
            dcc.Graph(id = 'all_time_week_bar',
                config = {'responsive': True, 'modeBarButtonsToRemove': ['zoom2d', 'zoomIn2d', 'zoomOut2d']}
            )
        ],
        style = {
            'display': 'inline-block',
            'width': '100%'
        })
    ],
    style = {
        'width': 'calc(80% - 6em)',
        'height': 'auto',
        'margin-left': 'calc(15% + 6em)',
        'padding-top': '2em',
        'display': 'inline-block',
        'vertical-align': 'top',
        'color': '#aabacc'
    })
],
style = {
    'position': 'relative',
    'border-bottom': '2px solid #aabacc'
})


## Define Function to Generate Plots with Plotly Express
Next let's define the functions to build our two charts and link them to our data:

In [None]:
# Geospatial bubble chart based on Page Rank and Trip data
def get_pagerank_plot(data):
    df = calculate_page_rank(data).to_pandas()
    g = px.scatter_mapbox(df, lat="lat", lon="lon", color="pagerank", size="total_trips",
                          hover_data=["station_name"], mapbox_style="carto-positron",
                          color_continuous_scale=px.colors.cyclical.Edge_r, size_max=15, zoom=10,
                          height=700
                         )
    g.layout['uirevision'] = True
    return g

# Bar chart based on total trips over weeks
def get_week_bar_chart(data):
    all_time_week_df = data.groupby('all_time_week').size().reset_index()
    all_time_week_df.columns = ['week', 'trips']
    g = px.bar(all_time_week_df.to_pandas(), 
               x="week", y='trips', template=dict(layout={'selectdirection': 'h',}), 
               height=300
              )
    g.layout['dragmode']='select'
    g.layout['uirevision'] = True
    return g

## Define Function to Calculate Page Rank
Because Plotly Dash applications are hosted through a Python backend, the web based charts are able to call custom Python functions. Let's use this feature and the speed of cuGraph to calculate new PageRank scores based on a user's selection:

In [None]:
def calculate_page_rank(data):
    G = cugraph.Graph()
    G.from_cudf_edgelist(data, source='from_station_id', destination='to_station_id')
    data_page = cugraph.pagerank(G)
    return data_page.merge(stations, left_on='vertex', right_on='station_id').drop(columns=['vertex'])
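
For intuition about what the PageRank scores represent, here is a small pure-Python power-iteration sketch. It is not cuGraph's implementation: the damping factor, tolerance, and the toy edge list are assumptions chosen for illustration, and edges are treated as directed trips from one station to another.

```python
def pagerank(edges, damping=0.85, max_iter=100, tol=1e-6):
    """Plain-Python power iteration over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        # Each node passes its rank on in equal shares along its outgoing edges
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Hypothetical trips between three stations: (from_station_id, to_station_id)
toy_trips = [(1, 2), (1, 3), (2, 3), (3, 1)]
scores = pagerank(toy_trips)
busiest = max(scores, key=scores.get)
print(busiest, scores)
```

Stations that receive many trips from other well-connected stations end up with higher scores, which is what the bubble color in the map encodes.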

## Define Events and Callbacks
Here we define what happens when a user interacts with chart selections through [Dash callbacks](https://dash.plotly.com/basic-callbacks):

In [None]:
def bar_selection_to_query(selection, column):
    """
    Compute pandas query expression string for selection callback data
    Args:
        selection: selectedData dictionary from Dash callback on a bar trace
        column: Name of the column that the selected bar chart is based on
    Returns:
        String containing a query expression compatible with DataFrame.query. This
        expression will filter the input DataFrame to contain only those rows that
        are contained in the selection.
    """
    point_inds = [p['label'] for p in selection['points']]
    xmin = min(point_inds)
    xmax = max(point_inds)
    # Both bounds are inclusive, so only the selected bars are kept
    return f"{xmin} <= {column} and {column} <= {xmax}"

# Define callback to update graph, id ties plot code to layout
@app.callback(
    [
        Output('pagerank_plot', 'figure'),
        Output('all_time_week_bar', 'figure'),
        Output('number', 'figure')
    ],
    [
        Input("day", "value"), Input("time", "value"),
        Input("all_time_week_bar", "selectedData")
    ]
)
def update_figure(day, time, selected_weeks):
    # Build one filter clause per dropdown; an empty string means "no filter"
    query = [
        'day_type == ' + str(day) if day != "" else "",
        'time_of_day == ' + str(time) if time != "" else ""
    ]
    query_str = ' and '.join([x for x in query if x != ""])
    
    data = trips
    if len(query_str) > 0:
        data = trips.query(query_str)

    week_bar_chart = get_week_bar_chart(data)
    
    if selected_weeks is not None:
        query.append(bar_selection_to_query(selected_weeks, 'all_time_week'))
        query_str = ' and '.join([x for x in query if x != ""])
        if len(query_str) > 0:
            data = trips.query(query_str)
    
    pagerank_plot = get_pagerank_plot(data)
    
    number = go.Figure(go.Indicator(
                mode="number",
                value=data.shape[0]
            ))

    return pagerank_plot, week_bar_chart, number
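
The way the callback combines its filters can be sketched without Dash or cuDF. In the example below, the `selection` payload, the dropdown value, and the helper name `combine_filters` are all hypothetical, and the week-range clause is written by hand for illustration rather than produced by `bar_selection_to_query`:

```python
def combine_filters(clauses):
    """Join the non-empty filter clauses into one query expression."""
    return " and ".join(clause for clause in clauses if clause)

# Hypothetical selectedData payload; only the 'label' of each point is used here
selection = {"points": [{"label": 12}, {"label": 13}, {"label": 14}]}
labels = [point["label"] for point in selection["points"]]

clauses = [
    "day_type == 1",  # hypothetical dropdown value
    "",               # empty string: no time-of-day filter chosen
    f"{min(labels)} <= all_time_week and all_time_week <= {max(labels)}",
]
query_str = combine_filters(clauses)
print(query_str)  # day_type == 1 and 12 <= all_time_week and all_time_week <= 14
```

Dropping empty clauses before joining avoids a dangling `and` when a dropdown is left unset.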

## Start the Plotly Dash Visualization
Now that we have defined everything, let's run the application:

In [None]:
# NOTE: If you are running in a JupyterHub environment, run the below command:
# JupyterDash.infer_jupyter_proxy_config()

# NOTE: For JupyterLab, run:
# app.run_server(mode="jupyterlab")

# NOTE: To run inline with a notebook (NOT recommended): 
# app.run_server(mode="inline")

# NOTE: To run in a separate tab, run the line below and then click the link (recommended):
app.run_server(debug=False)


## Final Plotly Dash Visualization
This is what the dashboard should look like:

<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master/images/PlotlyDash-Dashboard.png">
          
Overall, the speed of cuGraph's PageRank and the simple interactions make this dashboard intuitive and usable for quickly finding the most important bike stations. With preset filters, it gives a succinct overview of how this complicated network behaves over time, and it can handle new data seamlessly (not bad for a tutorial app).

### Try It Now
See if you can adjust the layout of the above app by reordering the `div` tags and changing the `style` tag values.

Or try changing the `css` by adding a reference to `external_stylesheets` as shown below. You can use the external `css` files from the example GitHub repos in Plotly's [Dash Gallery](https://dash-gallery.plotly.host/Portal/).


```
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
```

NOTE: You'll have to re-run all of the Plotly-related cells after updating. Be forewarned: editing css in a notebook is a lot like [this](http://gph.is/1heneJM).

## A Final Summary on the Benefits of <br> Running with RAPIDS

Hopefully, as you've clicked through these tutorial notebooks, you've noticed how seamless it is to work within the RAPIDS libraries and alongside other libraries. One of the key goals of RAPIDS is to keep the tools and workflows you are familiar with, but turn them into end-to-end GPU-accelerated pipelines. From ETL and exploration to analytics and visualization, you can take advantage of the speedups from GPUs.

We on the viz team are continuing to integrate with other visualization libraries, and have projects in the works to improve the performance and capabilities of web visualizations even further.

RAPIDS is still a relatively young project (we aren't even at 1.0 yet!), but we continue to work towards building out more features and improving. Stay up to date with our projects through our [home page](https://rapids.ai/), [GitHub](https://github.com/rapidsai), and [Twitter page](https://twitter.com/rapidsai).

In [None]:
## Run this cell to show this section's walkthrough video ##
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/deGQdljxYlY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')
