<img src="https://github.com/jupytercon/2020-exactlyallan/raw/master/images/RAPIDS-header-graphic.png">

# Exploratory Data Visualization with Cuxfilter
***Quickly finding linked patterns in your data with cross filtering***

**NOTE**: A more extensive version of this tutorial can be found on our [RAPIDS Community GitHub](https://github.com/rapidsai-community/showcase/tree/master/event_notebooks/JupyterCon_2020_RAPIDSViz).

## Introduction
Part of RAPIDS, [cuxfilter](https://github.com/rapidsai/cuxfilter) is a library by the visualization team that enables GPU accelerated cross filtered dashboards in just a few lines of notebook code. 

Cuxfilter acts as a connector library rather than a visualization chart library. It abstracts away all the 'plumbing' required to connect a [curated list of visualizations](https://docs.rapids.ai/api/cuxfilter/stable/charts/charts.html) to a cuDF GPU dataframe. By simply enabling accelerated dashboards inline within a notebook workflow, cuxfilter allows analysts to get to exploring their data faster. You can find all of our features on our [Docs page](https://docs.rapids.ai/api/cuxfilter/stable/).

Illustrating how easy it is to use RAPIDS libraries together, we will also use:

- [cuDF](https://docs.rapids.ai/api/cudf/stable/), a RAPIDS GPU DataFrame library for manipulating data with a pandas-like API.

- [cuGraph](https://docs.rapids.ai/api/cugraph/stable/), a RAPIDS GPU accelerated graph analytics library with functionality like NetworkX.

## Requirements

For this tutorial you'll need an NVIDIA GPU with at least 16GB of memory and RAPIDS installed. You can find further requirements and installation instructions on our [Getting Started Page](https://rapids.ai/start.html).

## Data Download
For this tutorial, we are using the [Divvy Chicago bike share dataset](https://www.divvybikes.com/system-data) sourced from this [Kaggle page](https://www.kaggle.com/yingwurenjian/chicago-divvy-bicycle-sharing-data?select=data.csv).

In [None]:
from pathlib import Path
import cudf

DATA_DIR = Path("../data")
FILENAME = Path("data.csv")

In [None]:
# Download and Extract the dataset
! wget -N -P {DATA_DIR} https://rapidsai-data.s3.us-east-2.amazonaws.com/viz-data/data.tar.xz
! tar -xf {DATA_DIR}/data.tar.xz -C {DATA_DIR}

## Imports
Let's first make sure the imports needed for the rest of the notebook are loaded (the data location was set above).

In [None]:
import cuxfilter
import cudf
import cugraph
from bokeh.models import NumeralTickFormatter
from bokeh.palettes import Inferno
from pathlib import Path
from preprocess import * # for compactness we added functions to preprocess.py


## Load Data Into cuDF and Format Data
Load `data.csv` into the GPU dataframe:

In [None]:
data = cudf.read_csv(DATA_DIR / FILENAME)

# check
data

In [None]:
# remove unnecessary data and format using the script in preprocess.py
trips = process_trips(data)

# check
trips.head()

## A Note on Cognitive Load and Preattentive Attributes
To make the dashboard values more understandable, we are creating string maps to convert the dataset's numbers to their proper names. Though it may seem trivial, it removes unnecessary ambiguity and helps [reduce cognitive load](https://www.nngroup.com/articles/minimize-cognitive-load/) when our focus needs to be on finding patterns.

Our subconscious ability to quickly recognize patterns comes from the brain's natural sensitivity to [preattentive attributes](http://daydreamingnumbers.com/blog/preattentive-attributes-example/), such as height, orientation, or color. Imagine 100 values in a table and the same 100 values in a bar chart, and how quickly you would be able to find the smallest and largest values in each. This is one key reason visualizations are a powerful mechanism for recognizing patterns.

In [None]:
# create a weekday string map
days_of_week_map = {
    0: 'monday',
    1: 'tuesday',
    2: 'wednesday',
    3: 'thursday',
    4: 'friday',
    5: 'saturday',
    6: 'sunday'
}

# month
month_map = {
    1: 'jan', 2: 'feb', 3: 'mar', 4: 'apr', 5: 'may', 6: 'jun', 7: 'jul', 8: 'aug', 9: 'sep', 10: 'oct', 11: 'nov', 12: 'dec'
}

# weekend weekday
day_type_map = {0:'weekday', 1:'weekend', '':'all'}
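
As a rough illustration of what a label map does, the sketch below counts trips per weekday code and then keys the result by the readable name. It uses plain Python rather than cuxfilter; `labeled_counts` and the sample codes are hypothetical and only meant to show the lookup.

```python
from collections import Counter

# Same kind of code-to-name map as the weekday map above (0 = monday).
days_of_week_map = {0: 'monday', 1: 'tuesday', 2: 'wednesday', 3: 'thursday',
                    4: 'friday', 5: 'saturday', 6: 'sunday'}


def labeled_counts(codes, label_map):
    """Count each numeric code and key the result by its readable label."""
    counts = Counter(codes)
    # Fall back to the raw code (as text) when a code has no label.
    return {label_map.get(code, str(code)): n for code, n in sorted(counts.items())}


# Hypothetical weekday codes for six trips.
sample_days = [0, 0, 5, 6, 6, 6]
trip_counts = labeled_counts(sample_days, days_of_week_map)
print(trip_counts)
```

A bar labeled `sunday` can be read at a glance, while a bar labeled `6` forces the reader to recall the encoding first.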

## Cuxfilter Basic Dashboard, Adding Charts, and Custom Layouts
First, let's investigate trip totals by various time slices by linking the dataframe to cuxfilter:

In [None]:
cux_df = cuxfilter.DataFrame.from_dataframe(data)

In [None]:
# Inferno taken from bokeh color palettes https://docs.bokeh.org/en/latest/docs/reference/palettes.html
colors = Inferno[10]

# Specify the charts and widgets to use with the selected columns of data and string maps
charts = [
    cuxfilter.charts.multi_select('year'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
    cuxfilter.charts.bar('hour', title='trips per hour'),
    cuxfilter.charts.bar('month', x_label_map=month_map),
    cuxfilter.charts.bar('day', x_label_map=days_of_week_map)
]


# NOTE try to add chart: cuxfilter.charts.datashader.heatmap(x='hour', y='day', aggregate_col='hour', point_shape='rect_horizontal', point_size=10, color_palette=colors)


# NOTE try custom layout with `layout_array` parameter:
# layout_array = [[1, 2], [3, 2]]

# preset layout with `layout` parameter
layout = cuxfilter.layouts.feature_and_double_base

# Generate the dashboard and select a layout
d = cux_df.dashboard(charts, layout = layout, title='Bike Trips Dashboard')

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")
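
For a feel of what the `"0,0"` tick format is intended to produce, here is a small plain-Python approximation: round to a whole number and insert thousands separators. It is not Bokeh's formatter, and `format_tick` is a hypothetical helper used only for illustration.

```python
def format_tick(value):
    """Round to a whole number and insert thousands separators."""
    return f"{value:,.0f}"


# Hypothetical axis values.
ticks = [format_tick(v) for v in (950, 12500, 1234567.8)]
print(ticks)
```

Large trip counts are much easier to compare when written as `12,500` rather than `12500` or in scientific notation.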

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: use the slider below each chart to cross filter.

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
BASE_URL = 'http://localhost:8888'
d.show(notebook_url=BASE_URL, service_proxy='none')

## Export UI Queried Dataframe


In [None]:
# From currently run df
queried_df = d.export()

queried_df.columns
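
Conceptually, the exported dataframe holds the rows that pass every selection currently made in the dashboard. The sketch below shows that idea, together with the usual cross filtering pattern in which each chart is recomputed from the other charts' selections. It is plain Python over a handful of hypothetical rows, not cuxfilter's GPU implementation; `passes`, `export_rows`, and `chart_counts` are illustrative names.

```python
from collections import Counter

# Hypothetical trip rows.
trips = [
    {"hour": 8, "day": 0},
    {"hour": 8, "day": 5},
    {"hour": 17, "day": 0},
    {"hour": 17, "day": 1},
    {"hour": 23, "day": 6},
]


def passes(row, selections, skip=None):
    """True when the row lies inside every selected (low, high) range, ignoring column `skip`."""
    return all(low <= row[col] <= high
               for col, (low, high) in selections.items() if col != skip)


def export_rows(rows, selections):
    """Rows that satisfy all active selections (the idea behind an export)."""
    return [row for row in rows if passes(row, selections)]


def chart_counts(rows, column, selections):
    """Histogram for one chart: apply the other charts' selections, not its own."""
    return Counter(row[column] for row in rows if passes(row, selections, skip=column))


selections = {"day": (0, 4)}  # weekdays selected on the 'day' chart
hour_counts = chart_counts(trips, "hour", selections)
day_counts = chart_counts(trips, "day", selections)
exported = export_rows(trips, selections)
print(hour_counts, day_counts, len(exported))
```

Here the `hour` chart shrinks to weekday trips only, while the `day` chart still shows every day so the selection can be widened again.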

## Cuxfilter Basic Dashboard, Preview, and Themes
Let's continue investigating, this time following up on the year-over-year increase in trips and the decreases in winter months.

In [None]:
# Specify the charts and widgets to use with the selected columns of data and string maps
charts = [
    cuxfilter.charts.bar('all_time_week', title='rides per week'),
    cuxfilter.charts.heatmap(x='all_time_week', y='day', aggregate_col='temperature',
                             aggregate_fn='mean', point_size=40, legend_position='right',
                             title='mean temperature by day'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
]

# Generate the dashboard and select a layout
d = cux_df.dashboard(charts, 
                     layout=cuxfilter.layouts.feature_and_base, 
                     title='Temperature Dashboard', 
                     theme=cuxfilter.themes.light) #options: rapids, light, dark

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")
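
The heatmap above colors each (week, day) cell by the mean temperature of the trips that fall into it. A minimal plain-Python sketch of that aggregation follows; the records and the `mean_by_cell` helper are hypothetical, and the real chart computes this on the GPU.

```python
from collections import defaultdict

# Hypothetical (all_time_week, day, temperature) records.
records = [
    (1, 0, 30.0), (1, 0, 34.0),
    (1, 1, 28.0),
    (2, 0, 40.0), (2, 0, 44.0), (2, 0, 45.0),
]


def mean_by_cell(rows):
    """Group rows by (week, day) and return the mean temperature per cell."""
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [running total, row count]
    for week, day, temperature in rows:
        cell = sums[(week, day)]
        cell[0] += temperature
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


cell_means = mean_by_cell(records)
print(cell_means)
```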

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# NOTE: pan to match up the top and bottom chart axis

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
d.show(notebook_url=BASE_URL, service_proxy='none')


# NOTE: try inline image preview
# await d.preview()


## Cuxfilter Geospatial Graph Dashboard
Next, let's take a look at the geospatial element of the data and see if we can find interesting patterns. Based on how the trip data is logged, converting it into a graph will make managing it easier.

For this we will need [cuGraph](https://docs.rapids.ai/api/cugraph/stable/api.html) to translate the dataset into an edge list:

In [None]:
G = cugraph.Graph() 
G.from_cudf_edgelist(data, source='from_station_id', destination='to_station_id')
edges = G.edges()

In [None]:
# Trips have been converted into edges with source and destination based on station IDs.
edges.head()
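
To make the trip-to-edge conversion concrete, the sketch below builds an edge list from (source, destination) station pairs in plain Python. Whether repeated or reversed pairs are collapsed in cuGraph depends on the graph type and version, so both views are shown and no claim is made about which one `G.edges()` returns. The station IDs and helper names are hypothetical.

```python
from collections import Counter

# Hypothetical (from_station_id, to_station_id) pairs, one per trip.
trip_pairs = [(35, 90), (90, 35), (35, 90), (35, 76), (76, 76)]


def directed_edges(pairs):
    """Unique (src, dst) pairs, weighted by how many trips used them."""
    return Counter(pairs)


def undirected_edges(pairs):
    """Treat A->B and B->A as the same edge by sorting the endpoints."""
    return Counter(tuple(sorted(pair)) for pair in pairs)


directed = directed_edges(trip_pairs)
undirected = undirected_edges(trip_pairs)
print(directed)
print(undirected)
```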

Next we load the formatted data into cuxfilter and specify the chart types:

In [None]:
cux_df = cuxfilter.DataFrame.load_graph((trips, edges))

In [None]:
# Specifying a graph chart type will use Datashader and its required parameters
charts = [
    cuxfilter.charts.multi_select('year'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
    cuxfilter.charts.graph(
        node_id='from_station_id',
        edge_source='src', edge_target='dst',
        node_aggregate_fn='count',
        node_pixel_shade_type='linear', node_point_size=35, #node size is fixed
        edge_render_type='direct', #options direct, curved
        edge_transparency=0.7, #0.1 - 0.9
        tile_provider='CARTODBPOSITRON', 
        title='Graph for trip source_stations (color by count)'
    ),
    cuxfilter.charts.bar('from_station_id'),
    cuxfilter.charts.bar('to_station_id')
]

# Generate the dashboard, select a layout and theme
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.feature_and_double_base, theme=cuxfilter.themes.rapids, title='Geospatial Trips')

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: Graph edges can be turned on/off via the line tool icon
# Note: Inspect Neighboring Edges can be turned on/off for box or lasso select
# Caution: Selecting areas with Inspect Neighboring Edges on can result in slow performance or OOM errors  
# Caution: If the dashboard freezes, simply close the tab and restart this cell
# Note: This is rendering 9 MILLION edges

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
d.show(notebook_url=BASE_URL, service_proxy='none')


## CuGraph Clustering
While the above produced many findings, filtering through so many trip edges is not ideal.
Next we will try to push the visual analytics further with a clustered network graph alongside the geospatial graph using the [ForceAtlas2](https://docs.rapids.ai/api/cugraph/stable/api.html?highlight=force#module-cugraph.layout.force_atlas2) algorithm from cuGraph:

In [None]:
# Note: Often a good visualization result only comes from a lot of trial and error
# The below parameters produce useful clustering, but try experimenting with them further
ITERATIONS=500
THETA=10.0
OPTIMIZE=True

# Using the previously created edge list, we calculate the FA2 layout positions here
trips_force_atlas2_layout = cugraph.layout.force_atlas2(G, max_iter=ITERATIONS,
                strong_gravity_mode=False,
                outbound_attraction_distribution=True,
                lin_log_mode=False,
                barnes_hut_optimize=OPTIMIZE, barnes_hut_theta=THETA, verbose=True)
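
ForceAtlas2 belongs to the family of force-directed layouts: connected nodes pull toward each other while every pair of nodes pushes apart, so densely connected groups end up close together. The toy below illustrates that idea in plain Python with a naive all-pairs loop. It is not cuGraph's implementation and omits gravity, the Barnes-Hut approximation, and the other options used above; `force_layout`, its parameters, and the station names are hypothetical.

```python
import math


def force_layout(nodes, edges, iterations=200, step=0.05, repulsion=1.0):
    """Toy force-directed layout: linear attraction along edges,
    degree-weighted repulsion between every pair of nodes."""
    # Deterministic start: spread the nodes evenly on a unit circle.
    count = len(nodes)
    pos = {v: [math.cos(2 * math.pi * i / count), math.sin(2 * math.pi * i / count)]
           for i, v in enumerate(nodes)}
    degree = {v: 0 for v in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}
        # Repulsion: every pair pushes apart, more strongly for well-connected nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                push = repulsion * (degree[a] + 1) * (degree[b] + 1) / dist
                fx, fy = push * dx / dist, push * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        # Attraction: connected nodes pull together in proportion to their distance.
        for a, b in edges:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            force[a][0] -= dx
            force[a][1] -= dy
            force[b][0] += dx
            force[b][1] += dy
        # Move each node a small step along its net force.
        for v in nodes:
            pos[v][0] += step * force[v][0]
            pos[v][1] += step * force[v][1]
    return pos


def distance(pos, a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])


# Two hypothetical groups of stations with trips only inside each group.
stations = ["A", "B", "C", "D", "E", "F"]
links = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E"), ("E", "F"), ("D", "F")]
layout = force_layout(stations, links)
print(distance(layout, "A", "B"), distance(layout, "A", "D"))
```

Stations that share trips settle near each other, while the two unconnected groups drift apart. The all-pairs repulsion loop is the expensive part, which is why approximations such as Barnes-Hut matter on larger graphs.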

Merge the calculated forceAtlas2 layout with the trip dataframe:

In [None]:
final_df = trips_force_atlas2_layout.merge(
                trips[['from_station_id', 'from_station_name','to_station_id', 'year', 'hour', 'day_type', 'x', 'y']],
                left_on='vertex',
                right_on='from_station_id',
                suffixes=('', '_original')
)

# Preview
final_df.head()
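
The `suffixes=('', '_original')` argument matters because both tables carry `x` and `y` columns: the layout positions keep their names, and the coordinates coming from the trips table are renamed to `x_original` and `y_original`. The plain-Python sketch below imitates that join behavior on a few dict rows. `merge_rows` is a hypothetical stand-in, not the cuDF merge, and the coordinate values are made up.

```python
def merge_rows(left, right, left_on, right_on, suffixes=("", "_original")):
    """Inner-join two lists of dict rows, renaming overlapping columns with suffixes."""
    merged = []
    for left_row in left:
        for right_row in right:
            if left_row[left_on] != right_row[right_on]:
                continue
            # Columns present on both sides get a suffix so neither is overwritten.
            overlap = left_row.keys() & right_row.keys()
            row = {}
            for key, value in left_row.items():
                row[key + suffixes[0] if key in overlap else key] = value
            for key, value in right_row.items():
                row[key + suffixes[1] if key in overlap else key] = value
            merged.append(row)
    return merged


# Hypothetical layout positions (one row per station) and trips (one row per trip).
layout_rows = [
    {"vertex": 35, "x": -1.2, "y": 0.4},
    {"vertex": 90, "x": 2.0, "y": 1.1},
]
trip_rows = [
    {"from_station_id": 35, "to_station_id": 90, "x": -87.63, "y": 41.88},
    {"from_station_id": 35, "to_station_id": 76, "x": -87.63, "y": 41.88},
]

merged = merge_rows(layout_rows, trip_rows, left_on="vertex", right_on="from_station_id")
print(merged[0])
```

Each trip row picks up the layout position of its source station, which is what lets one dataframe drive both the clustered graph and the geographic scatter.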

## Cuxfilter Clustered Graph and Geospatial Dashboard Two

Next we load the data into cuxfilter and specify the chart types:

In [None]:
cux_df = cuxfilter.DataFrame.load_graph((final_df, edges))

In [None]:
# Both scatter and graph chart types use Datashader 
charts= [
  cuxfilter.charts.graph(
      edge_source='src', edge_target='dst',
      edge_color_palette=['gray', 'black'],
      node_pixel_shade_type='linear',
      edge_render_type='curved', #other option: direct
      edge_transparency=0.7, #0.1 - 0.9
      title='ForceAtlas2 Layout Graph'
  ),
  cuxfilter.charts.scatter(
    x='x_original', y='y_original', 
    tile_provider='CARTODBPOSITRON',
    point_size=3,
    pixel_shade_type='linear',
    pixel_spread='spread',
    title='Original Layout'
  ),
  cuxfilter.charts.bar('hour', title='Trips per hour'),
  cuxfilter.charts.bar('from_station_id', title='Source station'),
  cuxfilter.charts.bar('to_station_id', title='Destination station'),
  cuxfilter.charts.multi_select('year'),
  cuxfilter.charts.multi_select('day_type', label_map={0:'weekday', 1:'weekend', '':'all'})
] 

layout_array_3rds = [[1,1,2],[1,1,2],[3,4,5]]

# Generate the dashboard, select a layout and theme
d = cux_df.dashboard(charts, layout_array = layout_array_3rds, theme=cuxfilter.themes.rapids, title="Network and Geospatial Graph")

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")
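
One way to read a `layout_array` is as a grid in which each number marks the cells a chart occupies. The sketch below turns the array used above into a position and span per chart. It assumes the numbers are 1-based chart positions and that widgets such as `multi_select` are placed outside the grid; both are assumptions to check against the cuxfilter layout docs. `chart_cells` is a hypothetical helper, not part of cuxfilter.

```python
def chart_cells(layout_array):
    """Map each chart number to (row, col, row_span, col_span) within the grid."""
    boxes = {}
    for r, row in enumerate(layout_array):
        for c, chart in enumerate(row):
            # Grow the bounding box of this chart to include the current cell.
            top, left, bottom, right = boxes.get(chart, (r, c, r, c))
            boxes[chart] = (min(top, r), min(left, c), max(bottom, r), max(right, c))
    return {chart: (top, left, bottom - top + 1, right - left + 1)
            for chart, (top, left, bottom, right) in sorted(boxes.items())}


layout_array_3rds = [[1, 1, 2], [1, 1, 2], [3, 4, 5]]
cells = chart_cells(layout_array_3rds)
print(cells)
```

Read this way, the first chart takes a two-by-two block in the top left, the second fills the right-hand column beside it, and the remaining three share the bottom row.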

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: Graph edges can be turned on/off via the line tool icon
# Note: Inspect Neighboring Edges can be turned on/off for box or lasso select
# Caution: Selecting areas with Inspect Neighboring Edges on can result in slow performance or OOM errors  
# Caution: If the dashboard freezes, simply close the tab and restart this cell
# Note: This is rendering 9 MILLION edges

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
# IMPORTANT: if your notebook environment is in jupyterhub, set service_proxy='jupyterhub', otherwise set to 'none'
BASE_URL = 'http://localhost:8888'
d.show(notebook_url=BASE_URL, service_proxy='none')

## Visualization Conclusions

Running the FA2 algorithm to group the station nodes together in a graph and placing the geospatial chart alongside it provided some compelling findings:
- Stations form clusters of connectivity that are clearly geographically distinct 
- The core weekday group is actually multiple distinct clusters in close proximity (different work districts?)
- The weekday group stays focused until after work hours where they then disperse north (happy hour?)
- The weekend group is overall more spread out, starting along the coast then dispersing throughout the city towards the evening (sightseeing?)
- Theater on Lake Station is a hyper focal point for the weekend group

These are only a few notable points found relatively quickly - there are certainly more patterns.


## A Final Summary on the Benefits of Running with RAPIDS

Hopefully, as you've clicked through this tutorial notebook, you've noticed how seamless it is to work within the RAPIDS libraries and alongside other libraries. One of the key goals of RAPIDS is to keep the tools and workflows you are familiar with, but turn them into end-to-end GPU accelerated pipelines. From ETL and exploration to analytics and visualization, you can take advantage of the speedups from GPUs.

We on the viz team are continuing to integrate with other visualization libraries, and have projects in the works to improve the performance and capabilities of web visualizations even further.

## FYI: cuxfilter Troubleshooting
As we just released the graph visualization capability in cuxfilter, we are still working on building out features and fixes. 

If you find something that needs fixing or have feature requests, please submit an [issue on our GitHub page](https://github.com/rapidsai/cuxfilter/issues). Better yet, [help contribute](https://github.com/rapidsai/cuxfilter#contributing-developers-guide).