# Starting out with Bokeh

## One Variable Plotting

In this notebook, we will get our first experience with [Bokeh](https://bokeh.pydata.org/en/latest/), a powerful plotting library accessible through Python. Throughout this series of notebooks, we will use the [nycflights13](https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf) dataset. 

In [1]:
import pandas as pd
import numpy as np

In [2]:
from bokeh.plotting import figure

from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool, CategoricalColorMapper
from bokeh.palettes import Category10_5

# Data Inspection

In [3]:
flights = pd.read_csv('../data/flights.csv', index_col=0)
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
1,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00
2,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00
3,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00
4,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00
5,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00


We are going to focus on a single variable, in this case the arrival delay in minutes. Before we get into plotting, we will want to take a look at the summary statistics for the arrival delay.

In [4]:
flights['arr_delay'].describe()

count    327346.000000
mean          6.895377
std          44.633292
min         -86.000000
25%         -17.000000
50%          -5.000000
75%          14.000000
max        1272.000000
Name: arr_delay, dtype: float64
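The `count` reported by `describe()` only includes non-missing values, so it can be smaller than the number of rows. The sketch below uses a small made-up series (a hypothetical stand-in for `flights['arr_delay']`, not the real data) to show how we might check for missing delays and for the share of values inside a two-hour window.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for flights['arr_delay']; NaN marks a missing arrival delay
delays = pd.Series([11.0, 20.0, -18.0, np.nan, 150.0, -25.0])

# describe() skips NaN, so its count is the number of non-missing values
stats = delays.describe()

# Number of missing values
n_missing = delays.isna().sum()

# Share of all rows with a delay inside [-120, 120] minutes (NaN counts as outside)
share_in_range = delays.between(-120, 120).mean()

print(stats['count'], n_missing, share_in_range)
```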

# Histogram 

The first graph we will make is a simple histogram of the arrival delay. We will consider all airlines on the same plot.

## Data for plotting

In [5]:
# Bins will be five minutes in width, limit delays to [-2, +2] hours
arr_hist, edges = np.histogram(flights['arr_delay'], bins = int(240/5), range = [-120, 120])
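To see what `np.histogram` hands back with these arguments, here is a minimal sketch on a handful of made-up delays (the `sample` array is illustrative, not taken from the dataset). It returns one more edge than there are bins, and values outside `range` are simply not counted.

```python
import numpy as np

# Made-up delays in minutes; -130 and 300 fall outside the [-120, 120] range
sample = np.array([-130.0, -120.0, -3.0, 0.0, 4.9, 5.0, 120.0, 300.0])

counts, edges = np.histogram(sample, bins=int(240 / 5), range=[-120, 120])

# 48 bins of 5 minutes each, delimited by 49 edges
print(len(counts), len(edges))

# Only the six in-range values are counted
print(counts.sum())

# Each bin is closed on the left and open on the right, except the last bin,
# which also includes its right edge (so 120 is counted)
print(counts[24], counts[47])
```

This is why the plotting code pairs `edges[:-1]` (left sides) with `edges[1:]` (right sides): both slices have the same length as the counts.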

In [6]:
# Set up the figure
p = figure(plot_width = 500, plot_height = 500, title = 'Histogram of Arrival Delays',
          x_axis_label = 'Minutes', y_axis_label = 'Count')

# Add a quad glyph
p.quad(bottom=0, top=arr_hist, left=edges[:-1], right=edges[1:], fill_color='red', line_color='black')

# To show in notebook
output_notebook()

# Show the plot
show(p)

# Add Basic Styling

In [7]:
def style(p):
    p.title.align = 'center'
    p.title.text_font_size = '18pt'
    p.xaxis.axis_label_text_font_size = '12pt'
    p.xaxis.major_label_text_font_size = '12pt'
    p.yaxis.axis_label_text_font_size = '12pt'
    p.yaxis.major_label_text_font_size = '12pt'
    
    return p

In [8]:
styled_p = style(p)

show(styled_p)

# Column Data Source

In [9]:
arr_df = pd.DataFrame({'count': arr_hist, 'left': edges[:-1], 'right': edges[1:]})
arr_df['f_count'] = ['%d flights' % count for count in arr_df['count']]
arr_df['f_interval'] = ['%d to %d minutes' % (left, right) for left, right in zip(arr_df['left'], arr_df['right'])]

arr_df.head()

Unnamed: 0,count,left,right,f_count,f_interval
0,0,-120.0,-115.0,0 flights,-120 to -115 minutes
1,0,-115.0,-110.0,0 flights,-115 to -110 minutes
2,0,-110.0,-105.0,0 flights,-110 to -105 minutes
3,0,-105.0,-100.0,0 flights,-105 to -100 minutes
4,0,-100.0,-95.0,0 flights,-100 to -95 minutes


In [10]:
arr_src = ColumnDataSource(arr_df)

In [11]:
arr_src.data.keys()

dict_keys(['count', 'left', 'right', 'f_count', 'f_interval', 'index'])
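Notice the extra `index` key: when a `ColumnDataSource` is built from a DataFrame, the DataFrame index is stored as a column alongside the others. A minimal sketch with a toy DataFrame (the names `toy_df` and `toy_src` are made up for illustration):

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Toy DataFrame with the same kind of columns as arr_df
toy_df = pd.DataFrame({'count': [3, 7],
                       'left': [-5.0, 0.0],
                       'right': [0.0, 5.0]})

toy_src = ColumnDataSource(toy_df)

# The unnamed DataFrame index shows up as an 'index' column
columns = set(toy_src.data.keys())
print(columns)
```

Every key in `data` can be referenced by name from a glyph or a tooltip, which is what makes the source convenient for the hover tool.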

# Add in Tooltips on Hover

In [12]:
# Set up the figure same as before
p = figure(plot_width = 500, plot_height = 500, title = 'Histogram of Arrival Delays',
          x_axis_label = 'Minutes', y_axis_label = 'Count')

# Add a quad glyph with source this time
p.quad(bottom=0, top='count', left='left', right='right', source=arr_src,
       fill_color='red', line_color='black')

# Add style to the plot
styled_p = style(p)

# Add a hover tool referring to the formatted columns
hover = HoverTool(tooltips = [('Delay', '@f_interval'),
                              ('Count', '@f_count')])

# Add the hover tool to the graph
styled_p.add_tools(hover)

# Show the plot
show(styled_p)

# Percentage of Delay Histogram for each Carrier

In [13]:
# Group by the carrier to find the most common
carrier_nums = flights.groupby('carrier')['year'].count().sort_values(ascending=False)

In [14]:
carrier_nums

carrier
UA    58665
B6    54635
EV    54173
DL    48110
AA    32729
MQ    26397
US    20536
9E    18460
WN    12275
VX     5162
FL     3260
AS      714
F9      685
YV      601
HA      342
OO       32
Name: year, dtype: int64

In [15]:
# Subset to the 5 most common carriers
flights = flights[flights['carrier'].isin(carrier_nums.index[:5])]

# Subset to only [-2, +2] hour delays
flights = flights[(flights['arr_delay'] >= -120) & (flights['arr_delay'] <= 120)]
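As a rough sketch of the same two filters on toy data (the `toy` DataFrame below is made up), `value_counts()` gives the carriers already sorted by frequency, and `between()` is a compact way to write the two-sided delay condition. This is an alternative spelling, not the approach used above.

```python
import pandas as pd

# Made-up flights: carrier code and arrival delay in minutes
toy = pd.DataFrame({
    'carrier': ['UA', 'UA', 'UA', 'B6', 'B6', 'DL', 'OO'],
    'arr_delay': [11.0, 130.0, -18.0, 4.0, -125.0, 0.0, 7.0],
})

# value_counts() sorts by frequency, most common first
top = toy['carrier'].value_counts().index[:2]

# Keep the top carriers and delays within [-120, 120] minutes (inclusive)
kept = toy[toy['carrier'].isin(top) & toy['arr_delay'].between(-120, 120)]

print(list(top))
print(kept)
```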

## Find actual carrier names

In [16]:
carrier_names = pd.read_csv('../data/airlines.csv')
carrier_names.head()

Unnamed: 0,carrier,name
0,9E,Endeavor Air Inc.
1,AA,American Airlines Inc.
2,AS,Alaska Airlines Inc.
3,B6,JetBlue Airways
4,DL,Delta Air Lines Inc.


In [17]:
flights = flights.merge(carrier_names, how = 'left', on = 'carrier')
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,name
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00,United Air Lines Inc.
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00,United Air Lines Inc.
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00,American Airlines Inc.
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00,JetBlue Airways
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00,Delta Air Lines Inc.
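A left merge keeps every flight and attaches the matching carrier name. The sketch below, on made-up toy tables, adds the optional `validate` argument of `DataFrame.merge` so that pandas raises an error if a carrier code appears more than once in the lookup table, which would otherwise silently duplicate flights.

```python
import pandas as pd

# Toy stand-ins for the flights and airlines tables
toy_flights = pd.DataFrame({'carrier': ['UA', 'AA', 'UA'],
                            'arr_delay': [11.0, 33.0, 20.0]})
toy_names = pd.DataFrame({'carrier': ['AA', 'UA'],
                          'name': ['American Airlines Inc.', 'United Air Lines Inc.']})

# many_to_one: many flights per carrier, exactly one name row per carrier
merged = toy_flights.merge(toy_names, how='left', on='carrier',
                           validate='many_to_one')

print(merged)
```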


In [23]:
by_carrier = pd.DataFrame(columns=['proportion', 'left', 'right', 
                                   'f_proportion', 'f_interval',
                                   'name', 'color'])

# Iterate through all the carriers
for i, carrier_name in enumerate(flights['name'].unique()):
    
    # Subset to the carrier
    subset = flights[flights['name'] == carrier_name]
    
    # Create a histogram with 5 minute bins
    arr_hist, edges = np.histogram(subset['arr_delay'], bins = int(240/5), range = [-120, 120])
    
    # Divide the counts by the total to get a proportion
    arr_df = pd.DataFrame({'proportion': arr_hist / np.sum(arr_hist), 'left': edges[:-1], 'right': edges[1:] })
    
    # Format the proportion 
    arr_df['f_proportion'] = ['%0.5f' % proportion for proportion in arr_df['proportion']]
    
    # Format the interval
    arr_df['f_interval'] = ['%d to %d minutes' % (left, right) for left, right in zip(arr_df['left'], arr_df['right'])]
    
    # Assign the carrier for labels
    arr_df['name'] = carrier_name
    
    # Color each carrier differently
    arr_df['color'] = Category10_5[i]

    # Add to the overall dataframe (DataFrame.append was removed in pandas 2.0)
    by_carrier = pd.concat([by_carrier, arr_df])
    
# Overall dataframe
by_carrier = by_carrier.sort_values(['name', 'left'])
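One way to sanity check this kind of table is to confirm that the proportions for each carrier sum to 1. The sketch below rebuilds a small version of the table from made-up data (the names `toy` and `toy_by_carrier` are illustrative), collecting the per-carrier frames in a list and concatenating once at the end, which is a common alternative to growing a DataFrame inside the loop.

```python
import numpy as np
import pandas as pd

# Made-up flights for two carriers; the 200-minute delay falls outside the range
toy = pd.DataFrame({'name': ['A', 'A', 'A', 'B', 'B'],
                    'arr_delay': [-3.0, 2.0, 7.0, 1.0, 200.0]})

frames = []
for carrier_name, subset in toy.groupby('name'):
    # Same binning as above: 5 minute bins over [-120, 120]
    counts, edges = np.histogram(subset['arr_delay'], bins=int(240 / 5), range=[-120, 120])
    frames.append(pd.DataFrame({'proportion': counts / counts.sum(),
                                'left': edges[:-1],
                                'right': edges[1:],
                                'name': carrier_name}))

# Concatenate once instead of appending inside the loop
toy_by_carrier = pd.concat(frames, ignore_index=True)

# Each carrier's proportions should add up to 1
totals = toy_by_carrier.groupby('name')['proportion'].sum()
print(totals)
```

Note that the division assumes each carrier has at least one delay inside the range; a carrier with none would produce `NaN` proportions.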

In [24]:
by_carrier.head()

Unnamed: 0,color,f_interval,f_proportion,left,name,proportion,right
0,#ff7f0e,-120 to -115 minutes,0.0,-120.0,American Airlines Inc.,0.0,-115.0
1,#ff7f0e,-115 to -110 minutes,0.0,-115.0,American Airlines Inc.,0.0,-110.0
2,#ff7f0e,-110 to -105 minutes,0.0,-110.0,American Airlines Inc.,0.0,-105.0
3,#ff7f0e,-105 to -100 minutes,0.0,-105.0,American Airlines Inc.,0.0,-100.0
4,#ff7f0e,-100 to -95 minutes,0.0,-100.0,American Airlines Inc.,0.0,-95.0


In [305]:
# Same data drawn as vertical bars. vbar is positioned by bin center, needs an
# explicit source, and can only reference columns that exist in by_carrier.
by_carrier['center'] = (by_carrier['left'] + by_carrier['right']) / 2

p = figure(plot_width = 500, plot_height = 500, title = 'Arrival Delays by Carrier',
          x_axis_label = 'Minutes', y_axis_label = 'Proportion')

p.vbar(x = 'center', width = 5, top = 'proportion', fill_color = 'color', 
       legend = 'name', source = ColumnDataSource(by_carrier))

hover = HoverTool(tooltips = [('Carrier', '@name'),
                              ('Proportion', '@f_proportion'),
                              ('Delay', '@f_interval')])

styled_p = style(p)
styled_p.add_tools(hover)
show(styled_p)


In [220]:
# Create column data source
carrier_src = ColumnDataSource(by_carrier)

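In [223]:
# Set up a fresh figure for the per-carrier histograms
p = figure(plot_width = 500, plot_height = 500, title = 'Arrival Delays by Carrier',
          x_axis_label = 'Minutes', y_axis_label = 'Proportion')

# Add a quad glyph coloring each histogram by carrier
p.quad(bottom=0, top = 'proportion', left = 'left', right = 'right', 
       source = carrier_src, fill_alpha = 0.9,
       fill_color='color', legend = 'name', line_color='black')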

# Add style to the plot
styled_p = style(p)

# Add a hover tool referring to the formatted columns
hover = HoverTool(tooltips = [('Delay', '@f_interval'),
                              ('Proportion', '@f_proportion')])

# Add the hover tool to the graph
styled_p.add_tools(hover)

# Show the plot
show(styled_p)
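The cells above use the Bokeh API as it was when this notebook was written. In more recent Bokeh releases (assumed here to be 3.x), the figure size is given with `width` and `height`, and a legend driven by a data column is requested with `legend_field`. The following is a hedged sketch of the same kind of plot on a toy table (the `toy_by_carrier` data is made up); it only builds the figure, and `show(fig)` would display it.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Toy stand-in for by_carrier: two bins for each of two carriers
toy_by_carrier = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B'],
    'left': [-5.0, 0.0, -5.0, 0.0],
    'right': [0.0, 5.0, 0.0, 5.0],
    'proportion': [0.4, 0.6, 0.7, 0.3],
    'color': ['#1f77b4', '#1f77b4', '#ff7f0e', '#ff7f0e'],
})
toy_src = ColumnDataSource(toy_by_carrier)

# width/height replace plot_width/plot_height in newer Bokeh
fig = figure(width=500, height=500, title='Arrival Delays by Carrier',
             x_axis_label='Minutes', y_axis_label='Proportion')

# legend_field groups the legend entries by the values in the 'name' column
renderer = fig.quad(bottom=0, top='proportion', left='left', right='right',
                    source=toy_src, fill_color='color', fill_alpha=0.9,
                    line_color='black', legend_field='name')

# Tooltips can format numeric columns inline instead of using a preformatted column
fig.add_tools(HoverTool(tooltips=[('Carrier', '@name'),
                                  ('Proportion', '@proportion{0.000}')]))
```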