# The Bokeh Library

## Introduction

Analyzing datasets is an integral part of data science. But an even more important aspect of data science is effectively conveying findings from analysis to others! To that end, we explored static graphs like bar charts and scatter plots in class. However, interactive graphical visualization provides additional dimensions for explaining findings that can put the results further in context.

For example, consider the following graph: 
<img src="http://minimaxir.com/img/online-class-charts/class-attendance.png" width="450">
Although this graph contains useful information, knowing how these statistics have changed over time may help us understand the underlying cause. This level of understanding is achievable with interactivity, as demonstrated in this tutorial!

The Bokeh library, in addition to the static graphical capabilities of Matplotlib discussed in class, provides interactive features to help convey a more comprehensive view of analysis results.

## Table of Contents

1. [Installation](#installation)
2. [The Plot](#the-plot)
3. [Glyphs](#glyphs)
4. [Static Properties](#properties)
5. [Basic Interactivity](#interactivity)
6. [Advanced Interactivity](#server)
7. [Application: Zomato Restaurants](#app)
8. [Summary](#summary)
9. [Resources](#resources)

## Installation
<a id='installation'></a>

Before getting started, we must install the Bokeh library. To do so using Anaconda, run the following command in a terminal: ```conda install bokeh```
 
Ensure that your installation completed successfully by checking that the following import runs without errors:

In [1]:
from bokeh.plotting import figure, output_notebook, show
import bokeh.resources
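As an optional extra check, the short sketch below prints the installed version. This can be handy because some keyword arguments have been renamed between Bokeh releases, so a snippet written against one version may emit warnings or errors on another.

```python
import bokeh

# Print the installed Bokeh version string, e.g. to compare against the documentation
print(bokeh.__version__)
```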

## The Plot
<a id='the-plot'></a>

The first step to creating Bokeh graphs is specifying the output location, which only needs to be set once for the entire notebook. To output plots to a file, import ```output_file``` from ```bokeh.plotting```; this tutorial renders within the notebook and thus uses ```output_notebook``` instead. To clear the current output settings before switching locations, use ```reset_output``` (imported from ```bokeh.io```).

The most important part of a Bokeh graph is the "plot" or Figure object -- a container for the components that make up the graph. The plot is initialized using the ```figure``` constructor as shown:

In [2]:
output_notebook(bokeh.resources.INLINE)
p = figure()
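For comparison, here is a minimal sketch of the file-output route. The file name `file_output_example.html` and the sample data are assumptions made up for this sketch. It uses ```save``` rather than ```show``` so that the file is written without opening a browser tab, and it calls ```reset_output``` at the end so the file setting does not carry over to later cells.

```python
from bokeh.io import reset_output
from bokeh.plotting import figure, output_file, save

# Hypothetical output path chosen for this sketch
output_file("file_output_example.html", title="File Output Example")

p_file = figure(title="File Output Example")
p_file.line([1, 2, 3], [1, 4, 9])

# save() writes the HTML file and returns its path; show() would also open it
saved_path = save(p_file)

# Clear the file output setting; call output_notebook() again to resume notebook output
reset_output()
```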

The title and axes labels are often the first graphical components that the audience reads to understand the graph. These labels can be set on the plot during initialization like so:

In [3]:
p = figure(title="Title Placeholder", x_axis_label="X Axis Label Placeholder", y_axis_label="Y Axis Label Placeholder")

Although this sets the text, Bokeh cannot render the plot properly without data to display.

## Glyphs
<a id='glyphs'></a>

To insert data into the plot, Bokeh uses "glyphs" -- the basic visual components of the graph. Although many glyphs exist in Bokeh, we will cover some of the most commonly used ones: line, circle, and vbar/hbar. For all these glyphs, the x and/or y coordinate data can be specified as list(s) of values. After setting the glyph components, the plot can be viewed using the ```show``` function as demonstrated below.

#### Line

This glyph creates line graphs -- every consecutive pair of (x,y) coordinates is connected by a line segment. For example, consider the ```x^2``` function plotted below:

In [4]:
# Generate random data
import numpy as np
x = np.random.randint(low=1, high=1000, size=100)
x.sort()
y = [i**2 for i in x]

# Line Glyph
p_line = figure(title="Line Glyph Example", x_axis_label="Random Sorted X", y_axis_label="X^2")
p_line.line(x,y)
show(p_line, notebook_handle=True)

#### Circle

This glyph creates scatter plots with the data specified by (x,y) coordinates marked by circles. For example:

In [5]:
# Generate random data
x = np.random.randint(low=1, high=1000, size=100)
y = np.random.randint(low=1, high=1000, size=100)

# Circle Glyph
p_circle = figure(title="Circle Glyph Example", x_axis_label="Random X", y_axis_label="Random Y")
p_circle.circle(x,y)
show(p_circle)

#### Vertical Bar (vbar) and Horizontal Bar (hbar)

As the name implies, the vertical bar glyph (vbar) creates vertical bar charts with the data values determined by the x-coordinates and ```top``` coordinates specifying the heights of the bars. ```vbar``` also requires a specification of the width of each bar using the ```width``` parameter. For instance:

In [6]:
# Generate bar positions and random bar heights
x = [i+1 for i in range(20)]
top = np.random.randint(low=1, high=1000, size=20)

# Vertical Bar (vbar) Glyph
p_vbar = figure(title="VBar Glyph Example", x_axis_label="X", y_axis_label="Random Height")
p_vbar.vbar(x, top=top, width=0.5)
show(p_vbar)

Similarly, the horizontal bar glyph (hbar) creates horizontal bar charts with data values given by y-coordinates, ```right``` coordinates specifying the length of the bars, and ```height``` parameter specifying the thickness of the bars.
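As a minimal sketch of the hbar description above, the snippet below builds a small horizontal bar chart. The data values are made up for illustration, and the bars start at the default ```left``` value of 0.

```python
from bokeh.plotting import figure

# Small hand-written data set (values are illustrative only)
y = [1, 2, 3, 4, 5]
right = [10, 25, 15, 30, 20]

p_hbar = figure(title="HBar Glyph Example", x_axis_label="Bar Length", y_axis_label="Y")
# y positions the bars, right sets their length, height sets their thickness
hbar_renderer = p_hbar.hbar(y=y, right=right, height=0.5)
# Call show(p_hbar) to render the plot
```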

#### Multiple Glyphs

Bokeh also allows overlaying multiple glyphs in one plot, which can help with plotting multiple data series together or marking the data points on a line graph. When combining glyphs, legends are useful for specifying what each glyph represents. Creating the legend simply involves specifying a legend label parameter for each glyph!

In [7]:
# Generate multiple data sets for layering
x = np.random.randint(low=1, high=1000, size=100)
x.sort()
y1 = [i for i in x]
y2 = [4*i + 3 for i in x]
y3 = [(-2*i) + 4 for i in x]
y_bar = [i+1 for i in range(20)]
x_bar1 = [i*2 for i in y_bar]
x_bar2 = [4*i - 3 for i in y_bar]

p_layered = figure(title="Combination of Line & Circle Glyphs", x_axis_label="X Coordinate", y_axis_label="Y Coordinate")
p_layered.line(x, y1, legend="y=x")
p_layered.line(x, y2, legend="y=4x+3")
p_layered.circle(x, y2, legend="y=4x+3")
p_layered.circle(x, y3, legend="y=-2x+4")
show(p_layered)

p_multibar = figure(title="Combination of Bar Glyphs", x_axis_label="Bar Length", y_axis_label="Y Coordinate")
p_multibar.hbar(y_bar, right=x_bar1, legend="x = 2y", height = 0.5)
p_multibar.hbar(y_bar, right=[-1*i for i in x_bar2], legend="x = -(4y-3)", height = 0.5)
show(p_multibar)
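The cells above pass the label through the ```legend``` keyword. Bokeh 1.4 and later also provide ```legend_label``` for fixed label text and deprecate the older keyword, so on a recent release the same idea can be sketched as follows. This is a minimal sketch assuming such a release, with a small made-up data set; it uses ```scatter```, which draws circle markers by default.

```python
from bokeh.plotting import figure

x = [1, 2, 3, 4]
y1 = [i for i in x]
y2 = [4*i + 3 for i in x]

p_labels = figure(title="Legend Label Sketch", x_axis_label="X Coordinate", y_axis_label="Y Coordinate")
p_labels.line(x, y1, legend_label="y=x")
# Glyphs that share a label share a single legend entry
p_labels.line(x, y2, legend_label="y=4x+3")
p_labels.scatter(x, y2, legend_label="y=4x+3")
# Call show(p_labels) to render the plot
```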

## Static Properties
<a id='properties'></a>

When looking at a complex dataset, it often helps to observe the categorical features of the data: maybe a particular feature of the data points creates clusters. To convey this additional information in a plot, we can use dimensions like the glyph's color, the data markers' size, the data markers' shape, and line dashedness/solidness. These customizations can also help differentiate glyphs in the same plot.

#### Color

To set a glyph's color, one option is to apply the color to the entire glyph. This helps differentiate multiple overlaid glyphs and can be done by specifying the ```color``` parameter like so:

In [8]:
# Glyph color - bars
y_bar = [i+1 for i in range(20)]
x_bar1 = [i*2 for i in y_bar]
x_bar2 = [4*i - 3 for i in y_bar]

p_multibar = figure(title="Combination of Bar Glyphs", x_axis_label="Bar Length", y_axis_label="Y Coordinate")
p_multibar.hbar(y_bar, right=x_bar1, legend="x = 2y", height = 0.5, color="orange")
p_multibar.hbar(y_bar, right=[-1*i for i in x_bar2], legend="x = -(4y-3)", height = 0.5)
show(p_multibar)

x = np.random.randint(low=1, high=1000, size=100)
x.sort()
y1 = [i for i in x]
y2 = [4*i + 3 for i in x]
y3 = [(-2*i) + 4 for i in x]

p_layered = figure(title="Combination of Line & Circle Glyphs", x_axis_label="X Coordinate", y_axis_label="Y Coordinate")
p_layered.line(x, y1, legend="y=x", color="red")
p_layered.line(x, y2, legend="y=4x+3", color="purple")
p_layered.circle(x, y2, legend="y=4x+3", color="purple")
p_layered.circle(x, y3, legend="y=-2x+4")
show(p_layered)

Another option is rendering color based on categorical features of the data. To do this, we use palettes from ```bokeh.palettes``` with factor mappers from ```bokeh.transform``` to map the factors of the categorical feature to the palette's colors. Then to apply the factor mapping on the data, the data is compiled into a Column Data Source -- a container for the data specified as a dictionary or a Pandas dataframe. For instance, consider the example provided below:

In [9]:
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral6
from bokeh.transform import factor_cmap

# Generate data with categorical component
x = np.random.randint(low=1, high=1000, size=100)
y = [2*i + 3 for i in x]
category = [str(i) for i in np.random.randint(low=0, high=5, size=100)]
source = ColumnDataSource(data=dict(x=x, y=y, category=category))

# Glyph color - circles
p_colmark = figure(title="Colored Circle Glyphs", x_axis_label="X Coordinate", y_axis_label="Y Coordinate")
p_colmark.circle('x', 'y', source=source, legend='category',
                 fill_color=factor_cmap('category', palette=Spectral6, factors=[str(i) for i in range(6)]))
show(p_colmark)
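Conceptually, a factor mapper pairs each factor with the palette color at the same position. The pure-Python sketch below mimics that pairing to show the idea; it is not Bokeh's implementation, and the three hex colors are hand-written stand-ins rather than values imported from ```bokeh.palettes```.

```python
# Hand-written stand-in palette (hypothetical colors for this sketch)
palette = ["#1b9e77", "#d95f02", "#7570b3"]
factors = ["0", "1", "2"]

# Pair each factor with the color at the same index
color_for_factor = dict(zip(factors, palette))

# Look up the color for every data point's category
category = ["2", "0", "1", "0"]
colors = [color_for_factor[c] for c in category]
print(colors)
```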

#### Marker Size

The data marker sizes can be used to represent ordinal features. The size must be specified as numerical values (in screen space units), so it is important to map the feature's factors to numerical values as shown below. Since the sizes vary within a single glyph, specifying a legend is not useful here: it would only create a legend entry with same-sized markers.

In [10]:
# Generate data with categorical component
x = np.random.randint(low=1, high=1000, size=100)
y = [2*i + 3 for i in x]
categories = ["Very Low", "Low", "Moderate", "High", "Very High"]
category = [categories[i] for i in np.random.randint(low=0, high=5, size=100)]

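# Manipulating data marker size: map each ordinal level to a size from 4 to 20
# so that the lowest level ("Very Low") is still visible instead of having size 0
source = ColumnDataSource(data=dict(x=x, y=y, size=[4 * (categories.index(c) + 1) for c in category]))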
p_sizmark = figure(title="Sized Circle Glyphs", x_axis_label="X Coordinate", y_axis_label="Y Coordinate")
p_sizmark.circle('x', 'y', source=source, size='size')
show(p_sizmark)

#### Marker Shape

Data marker shapes are defined by the glyph constructor and thus, can only help differentiate multiple glyphs plotted together. There are many possible shapes including: circle, asterisk, cross, diamond, square, and triangle.

In [11]:
# Generate two data series that share x-coordinates
x = np.random.randint(low=1, high=1000, size=100)
x.sort()
y1 = [2*i + 3 for i in x]
y2 = [2*i + np.random.randint(low=-10, high=15) for i in x]

# Multiple glyphs marker shape
p_multishape = figure(title="Multiple Shaped Glyphs", x_axis_label="X Coordinate", y_axis_label="Y Coordinate")
p_multishape.diamond(x, y1, legend="y1")
p_multishape.cross(x, y2, legend="y2")
show(p_multishape)

#### Line Dashedness & Thickness

Solidness/dashedness and thickness are specified for the entire line glyph and thus can only help differentiate multiple glyphs on the same plot, as shown below. When setting the ```line_dash``` property, we use a space-separated string of two numbers "i j" to specify a pattern of an i-length solid segment followed by a j-length gap.

In [12]:
# Solid vs. dashed lines
x = np.random.randint(low=1, high=1000, size=100)
x.sort()
y1 = [i for i in x]
y2 = [4*i + 3 for i in x]
y3 = [(-2*i) + 4 for i in x]

p_multiline = figure(title="Dashedness of Line Glyphs", x_axis_label="X Coordinate", y_axis_label="Y Coordinate")
p_multiline.line(x, y1, legend="y=x", color="red", line_width=2)
p_multiline.line(x, y2, legend="y=4x+3", color="purple")
p_multiline.circle(x, y2, legend="y=4x+3", color="purple")
p_multiline.line(x, y3, legend="y=-2x+4", line_dash="3 1")
show(p_multiline)
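Besides the "i j" string form, ```line_dash``` also accepts a few named patterns (such as "solid", "dashed", and "dotted") and a list of integers. The sketch below puts these forms side by side on made-up data.

```python
from bokeh.plotting import figure

x = [0, 1, 2, 3]
p_dash = figure(title="Line Dash Variants")
p_dash.line(x, [0, 1, 2, 3], line_dash="solid")    # named pattern
p_dash.line(x, [1, 2, 3, 4], line_dash="dashed")   # named pattern
p_dash.line(x, [2, 3, 4, 5], line_dash="dotted")   # named pattern
p_dash.line(x, [3, 4, 5, 6], line_dash=[6, 3])     # 6 units drawn, 3 units of gap
# Call show(p_dash) to render the plot
```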

## Basic Interactivity
<a id='interactivity'></a>

Now that we can create graphs, let's see how to add interactive effects to them. Although Bokeh provides many built-in interactive tools, we will cover some of the more commonly used ones.

#### Select/Zoom Area

This interactive feature allows zooming in on or selecting a particular range of the data in the plot. Selecting/zooming on a range is useful in helping direct the audience's attention to a particular region of the graph. 

In the plots we created above, the 'Box Zoom' tool was included by default in the tools at the right-hand side! We can zoom on a region by selecting this tool and then drawing a box around the region to be zoomed. Once zoomed in, it is often helpful to pan the view to see other parts of the data. To pan, the default toolbar provides a 'Pan' tool. Select this tool and then click-and-drag in the plot area. To reset to the original plot view, just click the 'Reset' tool provided in the default toolbar!

Although plot area selection tools are not included by default in the toolbar, they can be easily added to the toolbar as demonstrated below. There are two tools for selection: the box selection tool and the lasso selection tool. As the names imply, lasso selection allows selecting an arbitrary region while box selection is restricted to a rectangular area. Some helpful features of the selection tools are: making multiple selections using the SHIFT key and clearing selections with the ESC key!

In [13]:
# Generate random data
x = np.random.randint(low=1, high=1000, size=100)
y = np.random.randint(low=1, high=1000, size=100)

# Region Select
p_select = figure(title="Graph Area Select/Zoom Example", x_axis_label="Random X",
                  y_axis_label="Random Y", tools="pan,box_select,lasso_select,box_zoom,reset")
p_select.circle(x,y)
show(p_select)

#### Linked Interactivity

Bokeh also allows simultaneously applying selection or zoom to multiple linked graphs. Why might this help? Consider the case of plotting stock data. You may have one plot for the number of stocks owned by a person on any given day and another for the price of the stock across days. However, since these values have different units, they cannot be effectively combined into the same plot. But when we focus on a region in the number-of-stocks plot, we would like to focus on the same region in the stock-prices plot.

Linking plots for zooming and selecting involves linking both the plot ranges (for zooming) and the underlying data via a Column Data Source (for selection). For example with the stock scenario, we would want the day data range to be linked between the graphs as shown below:

In [14]:
from bokeh.layouts import gridplot

# Generate data
day = [i for i in range(100)]
num_stocks = np.random.randint(low=1, high=50, size=100)
price = np.random.randint(low=1, high=500, size=100)
source = ColumnDataSource(dict(day=day,num_stocks=num_stocks,price=price))

# Linked plots
p_link1 = figure(title="Number of Stocks Owned Example", x_axis_label="Day", y_axis_label="Number of Stocks", 
                 tools="pan,box_select,lasso_select,box_zoom,reset", width=450, height=450)
p_link1.circle('day','num_stocks',source=source)

p_link2 = figure(title="Price of Stocks Example", x_axis_label="Day", y_axis_label="Price of Stocks",
                 tools="pan,box_select,lasso_select,box_zoom,reset", x_range=p_link1.x_range,
                 width=450, height=450)
p_link2.circle('day','price',source=source)
show(gridplot([[p_link1,p_link2]]))

#### Hover Effects

The hover tool displays additional information about a data point as the mouse hovers over it. The text displayed by the tool can be customized as illustrated below. Within the text specification, the '\$' character refers to internal plot values like x and y coordinates while the '@' character refers to columns within the data source. For example, in the plot below, we use '@x' and '@y', which show the x and y values stored in the data source for the data point being hovered over. If we used '\$x' and '\$y' instead, hovering would display the precise x and y coordinates of the mouse cursor rather than those of the data point.

In [15]:
from bokeh.models import HoverTool

# Generate random data
x = np.random.randint(low=1, high=1000, size=100)
y = np.random.randint(low=1, high=1000, size=100)
categories = ['apple', 'banana', 'pear', 'guava', 'peach', 'mango']
category = [categories[i] for i in np.random.randint(low=0, high=len(categories), size=100)]
source = ColumnDataSource(dict(x=x,y=y,category=category))

# Hover tool with custom tooltips
hover_text = HoverTool(tooltips=[
    ("X", "@x"),
    ("Y", "@y"),
    ("Category", "@category")
])
p_hover = figure(title="Hover Text Example", x_axis_label="Random X",
                 y_axis_label="Random Y", tools=[hover_text,"pan","box_zoom","reset"])
p_hover.circle('x','y',source=source)
show(p_hover)
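To see the difference between the two prefixes directly, the sketch below puts '@' and '$' fields in the same tooltip. The data and the tooltip labels are made up for illustration; ```scatter``` is used here, which draws circle markers by default.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

source_vars = ColumnDataSource(dict(x=[1, 2, 3], y=[4, 5, 6]))

hover_both = HoverTool(tooltips=[
    ("Data (x, y)", "(@x, @y)"),    # values stored in the data source
    ("Cursor (x, y)", "($x, $y)"),  # position of the mouse cursor in data space
    ("Row index", "$index"),        # index of the hovered point in the data source
])
p_hover_vars = figure(title="Hover Fields Sketch", tools=[hover_both])
p_hover_vars.scatter('x', 'y', source=source_vars, size=10)
# Call show(p_hover_vars) to render the plot
```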

#### GeoData

Given geographical data, Bokeh can put this data into context by integrating it with Google Maps! Using GMapPlot and GMapOptions from ```bokeh.models```, we can create a plot with glyphs overlaid on a Google Maps instance. To accomplish this, we must start by obtaining a Google API key from https://developers.google.com/maps/documentation/javascript/get-api-key and placing it in 'google_api_key.txt' in the same folder as this notebook. Next, we create a GMapOptions object that specifies the center location of the map, the level of zoom, the type of Google map, etc. The map object can then be created using these options. However, Bokeh developer forums discuss issues with displaying these map plots within Jupyter notebooks, so we use the ```output_file``` option to output to an HTML file instead.

In [16]:
from bokeh.io import output_file, reset_output
from bokeh.models import (
  GMapPlot, GMapOptions, ColumnDataSource, Circle, Range1d, PanTool, WheelZoomTool, BoxSelectTool
)

# Data with Geographical Features
lat = [40.443, 40.453, 40.798, 40.730, 40.344]
lon = [-79.945, -79.974, -77.862, -73.999, -74.654]
label = ['CMU', 'PNC Park', 'Penn State Univ.', 'NYU', 'Princeton Univ.']
source = ColumnDataSource(dict(lat=lat,lon=lon,label=label))

# GeoData plot
reset_output() 
plot_opts = GMapOptions(lat=41.75, lng=-77.79, map_type="roadmap", zoom=6)
# Uncomment the next two lines after creating the API key file
# with open('google_api_key.txt', 'r') as key_file:
#     api_key = key_file.read().strip()
map_hover = HoverTool(tooltips=[
    ("Lat", "@lat"),
    ("Lon", "@lon"),
    ("Location", "@label")
])
p_map = GMapPlot(x_range=Range1d(), y_range=Range1d(), map_options=plot_opts)
p_map.add_tools(map_hover,PanTool(),WheelZoomTool(),BoxSelectTool())
# p_map.api_key = api_key  # Uncomment after creating the API key file
p_map.title.text = "Sample GeoData"
p_map.add_glyph(source, Circle(x='lon', y='lat', size=20, fill_color="red"))
output_file('GeoData_sample_plot.html')
# show(p_map)  # Uncomment after creating the API key file

This generates: [GeoData Sample Plot](GeoData_plot.html)


In [17]:
# To stop outputting to file
reset_output()
output_notebook()

## Advanced Interactivity\*\*
<a id='server'></a>
#### \*\*Interactive widgets DON'T render in static environments! Screenshots included instead.

So far, the interactivity we examined manipulated the visual rendering of the underlying data. However, interactivity can also directly modify the underlying data of a plot! Such applications are often created as standalone Bokeh Server applications, but Bokeh also integrates with the ```interact``` functionality of ```ipywidgets``` to allow interactive rendering within a Jupyter notebook. In the following example, we subset the data on the categorical 'year' feature and use a slider to choose which year's data to display. This makes it easier to demonstrate changes in the relation between the x and y coordinates over time since the audience can interact and observe the changes themselves! Also notice that the example uses the Pandas library alongside Bokeh because Pandas makes subsetting the data easy.

In [18]:
from ipywidgets import interact
from bokeh.io import push_notebook, output_notebook
import pandas as pd

# Generate Data
x = np.random.randint(low=1, high=1000, size=500)
y = np.random.randint(low=1, high=100, size=500) 
years = [1900+(5*i) for i in range(10)]
data = pd.DataFrame(dict(x=x,y=y,year=[years[np.random.randint(low=0, high=9)] for i in range(500)]))
source = ColumnDataSource(data[data['year'] == 1900])

# Plot with Sliding Bar
p_slide = figure(title="Sliding Bar Example", x_axis_label="Random X", y_axis_label="Random Y")
circles = p_slide.circle('x', 'y', source=source)

def update(year=1900):
    new_data = data[data['year'] == year]
    new_source = dict(x=new_data['x'].tolist(), y=new_data['y'].tolist(), year=new_data['year'].tolist())
    circles.data_source.data = new_source
    push_notebook()

show(p_slide, notebook_handle=True)

In [19]:
interact(update, year=(1900, 1940, 5))

<function __main__.update>

This renders the widget: <img src="interact_widget.jpg" width="200"> that actively modifies the plot above.

## Example Application: Zomato Restaurants
<a id='app'></a>

Now let's put some of these components together to analyze a real-world dataset of Zomato Restaurants. We will use the Zomato Restaurant data from [https://www.kaggle.com/shrutimehta/zomato-restaurants-data/](https://www.kaggle.com/shrutimehta/zomato-restaurants-data/). Download the zomato.csv file and place it in the same folder as this notebook.

In [20]:
# Load data using the latin-1 encoding; read_csv opens and closes the file itself
zomato = pd.read_csv('zomato.csv', encoding="latin-1")

From the available features, we will create one plot that explores the relation between Aggregate Rating and Average Cost for Two across different Price Ranges with the ability to filter restaurants based on Number of Votes range. For easy comparison, we will use two side-by-side linked graphs.

In [21]:
from bokeh.palettes import Paired12

leftsource = ColumnDataSource(zomato)
rightsource = ColumnDataSource(zomato)
zhover = HoverTool(tooltips=[
    ("Restaurant", "@{Restaurant Name}"),
    ("Cuisines", "@Cuisines"),
    ("Avg. Cost for Two", "@{Average Cost for two}"),
    ("Agg. Rating", "@{Aggregate rating}"),
    ("Votes", "@Votes")
])
factor_map = factor_cmap('Currency', palette=Paired12, factors=zomato['Currency'].unique().tolist())
p_left = figure(title="Zomato Restaurant Rating vs. Average Cost", x_axis_label="Average Cost for Two",
                y_axis_label="Aggregate Rating", tools=[zhover, "pan", "box_select", "lasso_select", "box_zoom", "reset"],
                width=450, height=450)
zlcircles = p_left.circle('Average Cost for two', 'Aggregate rating', source=leftsource, fill_color=factor_map, legend='Currency')

p_right = figure(x_axis_label="Average Cost for Two", y_axis_label="Aggregate Rating",
                 tools=[zhover, "pan", "box_select", "lasso_select", "box_zoom", "reset"],
                 x_range=p_left.x_range, y_range=p_left.y_range, width=450, height=450)
zrcircles = p_right.circle('Average Cost for two', 'Aggregate rating', source=rightsource, fill_color=factor_map, legend='Currency')

def left_update(price_range=1, vote_min='0', vote_max='10934'):
    new_data = zomato[zomato['Price range'] == price_range]
    new_data = new_data[new_data['Votes'].between(int(vote_min), int(vote_max))]
    new_source = new_data.to_dict(orient='list')
    zlcircles.data_source.data = new_source
    push_notebook()
    
def right_update(price_range=1, vote_min='0', vote_max='10934'):
    new_data = zomato[zomato['Price range'] == price_range]
    new_data = new_data[new_data['Votes'].between(int(vote_min), int(vote_max))]
    new_source = new_data.to_dict(orient='list')
    zrcircles.data_source.data = new_source
    push_notebook()

show(gridplot([[p_left, p_right]]), notebook_handle=True)
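The two update functions above differ only in the renderer they modify, so the shared filtering step could live in one helper. The sketch below is one possible way to factor it out; ```filter_restaurants``` and the stand-in DataFrame with made-up rows are hypothetical and introduced only for illustration, with the two column names taken from the code above.

```python
import pandas as pd

def filter_restaurants(df, price_range, vote_min, vote_max):
    """Return the rows matching one price range and an inclusive vote interval."""
    subset = df[df['Price range'] == price_range]
    return subset[subset['Votes'].between(int(vote_min), int(vote_max))]

# Tiny stand-in frame with made-up rows (not the real Zomato data)
sample = pd.DataFrame({'Price range': [1, 1, 2, 1], 'Votes': [5, 120, 40, 60]})
filtered = filter_restaurants(sample, price_range=1, vote_min='50', vote_max='200')
print(filtered['Votes'].tolist())
```

Each update function could then assign ```filter_restaurants(zomato, price_range, vote_min, vote_max).to_dict(orient='list')``` to its own renderer's data source before calling ```push_notebook```.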

For left plot interactivity:

In [22]:
interact(left_update, price_range=(1,4), vote_min='0', vote_max='10934')

<function __main__.left_update>

For right plot interactivity:

In [23]:
interact(right_update, price_range=(1,4), vote_min='0', vote_max='10934')

<function __main__.right_update>

These render interactive widgets like: <img src="zomato_widget.jpg" width="250"> that actively modify the plot contents.

Look at how much information we effectively conveyed within a single side-by-side plot using the features we learned! This plot already gives an extensive overview of the dataset! 

Keep in mind, though, that each additional interactive dimension adds to the work required to rerender the plot on every update. This matters especially when working in notebooks as opposed to server applications.

## Summary
<a id='summary'></a>

This tutorial introduced the basic functionality of the Bokeh library in Python for interactive graphical visualization. For more information regarding Bokeh, the Zomato dataset, or other interactive visualization libraries, please see the resources provided below.

## Resources
<a id='resources'></a>

* Bokeh: [https://bokeh.pydata.org/en/latest/docs/user_guide.html](https://bokeh.pydata.org/en/latest/docs/user_guide.html)
* Zomato Restaurants: [https://www.kaggle.com/shrutimehta/zomato-restaurants-data/](https://www.kaggle.com/shrutimehta/zomato-restaurants-data/)
* mpld3: [http://mpld3.github.io/](http://mpld3.github.io/)
* pygal: [http://pygal.org/](http://pygal.org/)
* Plotly: [https://plot.ly/python/](https://plot.ly/python/)