# Week 08: Mapping data with bokeh

## Objectives

By the end of this tutorial, you should be able to build a geographic map using `bokeh`, and to customize various aspects of it. Specifically you will:

- import geographic coordinates and draw corresponding regions on a graph
- interrogate hierarchical data structures containing geographic data
- build data structures to associate data with map elements

## Getting started

The imports below should be familiar...

In [1]:
from collections import OrderedDict

from bokeh.sampledata import us_counties, unemployment
from bokeh.plotting import figure, show, output_file, ColumnDataSource
from bokeh.models import HoverTool

... but one thing that's new is the sample data. `us_counties` and `unemployment` are both modules, and the next few lines reference them.

In [2]:
state = "tx"

county_xs=[
    us_counties.data[code]['lons'] for code in us_counties.data
    if us_counties.data[code]['state'] == state
]

county_ys=[
    us_counties.data[code]['lats'] for code in us_counties.data
    if us_counties.data[code]['state'] == state
]

We can't make much sense of the last two statements yet. We can tell they are list comprehensions, but can't understand much else without knowing more about the contents of the `us_counties` module.

## Interrogating `us_counties.data`

Most Python objects, including modules, carry a special ("magic") attribute called `__doc__` (two underscores on each side, *four* in total). For a module, it holds the docstring: a string describing what the module does and how to use its contents.

In [3]:
print us_counties.__doc__


This modules exposes geometry data for Unites States. It exposes a dictionary 'data' which is
indexed by the two-tuple containing (state_id, county_id) and has the following dictionary as the
associated value:

    data[(1,1)]['name']
    data[(1,1)]['lats']
    data[(1,1)]['lons']




The `__doc__` attribute of `us_counties` tells us that `data` is the data structure we should use, and that it is a dictionary. Each key identifies a different county (called a parish in Louisiana and a borough in Alaska) and is a tuple containing the state id and county id.
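
The same attribute can be inspected on functions and classes, not only on modules. The sketch below uses two made-up functions (`county_label` and `undocumented` are invented for illustration) to show where a docstring ends up, and what `__doc__` holds when no docstring was written.

```python
def county_label(name, state):
    """Return a display label such as 'Pierce, WA'."""
    return "%s, %s" % (name, state.upper())


def undocumented():
    pass


# The docstring is stored on the function object itself
label_doc = county_label.__doc__

# Objects defined without a docstring have __doc__ set to None
missing_doc = undocumented.__doc__

print(label_doc)
print(missing_doc)
```

Checking for `None` before printing avoids surprises when a module or function turns out to be undocumented.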

What data structure is associated with each key? Let's find out by peeking at a single entry.

In [4]:
# Get a key from data, then use it to access the associated value and its type
# (Python 2: in Python 3, d.keys() must be wrapped in list() before indexing)

d = us_counties.data
first_key = d.keys()[0]
print first_key             # should be a tuple of two numbers

first_value = d[first_key]  # get the value associated with this key

print type(first_value)     # get its type

(53, 53)
<type 'dict'>


It looks like the associated value is yet another dictionary. Let's find out what sort of keys are in *this* dictionary.

In [5]:
# print keys of first_value, and the type and length of associated values

for key in first_value.keys():
    print key, type(first_value[key]), len(first_value[key])

state <type 'str'> 2
lons <type 'list'> 162
name <type 'str'> 6
lats <type 'list'> 162


There are four keys. `state` and `name` are strings, probably to identify the state and county. `lons` and `lats` are both lists of equal size. We can guess that these lists are the geographic coordinates of a polygon that outlines the county. We confirm this by printing each list and seeing that they hold typical longitude and latitude values for the county's state (Washington, in this case). (The same check works for [Texas](https://www.google.com/maps/place/Texas): click on any untagged area of Google Maps to bring up a popup listing latitude and longitude.)

In [6]:
# print values associated with each key

print first_value['state']
print first_value['name']
print first_value['lons']
print first_value['lats']

wa
Pierce
[-122.33363, -122.29115, -122.24857, -122.1658, -122.12055, -122.10325, -122.05577, -122.00459, ...]
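
Rather than eyeballing long lists, the plausibility check can be reduced to a bounding box. The following is a minimal sketch: `bounding_box` is a hypothetical helper, the coordinates are the first three values quoted in this tutorial for Pierce County, and the limits used for the contiguous United States are rough assumed values.

```python
def bounding_box(lons, lats):
    """Return (min_lon, max_lon, min_lat, max_lat) for one outline."""
    return (min(lons), max(lons), min(lats), max(lats))


# First three coordinates of the Pierce County outline, as quoted in this tutorial
lons = [-122.33363, -122.29115, -122.24857]
lats = [47.25792, 47.2572, 47.25742]

box = bounding_box(lons, lats)
print(box)

# Rough, assumed bounds for the contiguous United States
in_contiguous_us = (-125 <= box[0] and box[1] <= -66 and
                    24 <= box[2] and box[3] <= 50)
print(in_contiguous_us)
```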

The structure of `us_counties.data` has a few levels:

```
  
--data (dict)
  |
  +-(id, id)
  | |
  | +-state (string)
  | +-name  (string)
  | +-lons (list of numbers)
  | +-lats (list of numbers)
  |
  +-(id, id)
  | |
  | +-state (string)
  | +-name  (string)
  | +-lons (list of numbers)
  | +-lats (list of numbers)
  |
  ...
  
```

If we wanted to construct it with literal notation, we'd write it like so:

In [7]:
# data = {
#     (53, 53) : {
#         'state' : 'wa',
#         'name' : 'Pierce',
#         'lons' : [-122.33363, -122.29115, -122.24857, ...],
#         'lats' : [47.25792, 47.2572, 47.25742, ...]
#     },
#     (18, 121) : {
#         'state' : 'in',
#         'name' : 'Parke',
#         'lons' : [-87.23233, -87.25305, -87.26713, ...],
#         'lats' : [39.60761, 39.60788, 39.60794, ...]
#     },
#     ...
# }
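
The interrogation steps used so far (list the keys, check each value's type and length) can be wrapped in a small helper. This is only a sketch: `describe` is a hypothetical function, and `record` is a shortened stand-in for one county entry rather than the real data.

```python
def describe(mapping):
    """Summarize each key of a dictionary as (type name, length or None)."""
    summary = {}
    for key, value in mapping.items():
        # Strings, lists and dicts have a length; plain numbers do not
        length = len(value) if hasattr(value, "__len__") else None
        summary[key] = (type(value).__name__, length)
    return summary


# Toy stand-in for one county record (coordinate lists shortened)
record = {
    "state": "wa",
    "name": "Pierce",
    "lons": [-122.33363, -122.29115, -122.24857],
    "lats": [47.25792, 47.2572, 47.25742],
}

summary = describe(record)
for key in sorted(summary):
    print("%s: %s" % (key, summary[key]))
```

The same helper can be pointed at any dictionary of unknown shape, which is handy when a module's docstring is incomplete.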

## Building the coordinate data

Now that we understand the structure of `us_counties.data`, we can look back at the list comprehensions and make sense of them. They are essentially building lists of polygons for counties in Texas.
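
To see what a filtered list comprehension does, it can help to unroll it into an explicit loop. The sketch below does so on `toy_counties`, a tiny invented stand-in for `us_counties.data` (its keys, names, and coordinates are made up), and checks that both forms produce the same list.

```python
# Toy stand-in for us_counties.data; keys and coordinates are invented
toy_counties = {
    (48, 1): {"state": "tx", "name": "A",
              "lons": [-95.0, -95.5], "lats": [31.0, 31.5]},
    (48, 3): {"state": "tx", "name": "B",
              "lons": [-102.0, -102.5], "lats": [32.0, 32.5]},
    (53, 53): {"state": "wa", "name": "C",
               "lons": [-122.3, -122.2], "lats": [47.2, 47.3]},
}
state = "tx"

# Comprehension form, as used in this tutorial
xs_comprehension = [
    toy_counties[code]["lons"] for code in toy_counties
    if toy_counties[code]["state"] == state
]

# Equivalent explicit loop
xs_loop = []
for code in toy_counties:
    county = toy_counties[code]
    if county["state"] == state:      # the filter clause
        xs_loop.append(county["lons"])  # the expression before "for"

print(xs_loop == xs_comprehension)
```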

In [8]:
# allows us to easily change the state being analyzed in the filters below
state = "tx"  

In [9]:
# a filtered list comprehension that iterates over each county's code
# us_counties.data[code] is a dictionary
# us_counties.data[code]['lons'] gets the list of longitudes
# but exclude any counties whose 'state' is not "tx"
# county_xs is a list of lists of longitude, only for texas

county_xs=[
    us_counties.data[code]['lons'] for code in us_counties.data
    if us_counties.data[code]['state'] == state
]

center_xs = [
    sum(us_counties.data[code]['lons'])/len(us_counties.data[code]['lons']) 
    for code in us_counties.data
    if us_counties.data[code]['state'] == state    
]

In [10]:
# likewise for latitudes
county_ys=[
    us_counties.data[code]['lats'] for code in us_counties.data
    if us_counties.data[code]['state'] == state
]


center_ys = [
    sum(us_counties.data[code]['lats'])/len(us_counties.data[code]['lats']) 
    for code in us_counties.data
    if us_counties.data[code]['state'] == state    
]
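
The `center_xs` and `center_ys` lists average the vertices of each outline, which is a quick approximation of a polygon's center. When an outline has many more vertices along one edge, that average drifts toward the crowded edge. The sketch below compares it with the area-weighted centroid computed by the shoelace formula. It treats longitude and latitude as flat x/y coordinates (an assumption that is reasonable at county scale), and both function names are hypothetical.

```python
def vertex_mean(xs, ys):
    """Average of the outline's vertices (the approach used in this tutorial)."""
    return (sum(xs) / float(len(xs)), sum(ys) / float(len(ys)))


def polygon_centroid(xs, ys):
    """Area-weighted centroid of a simple polygon via the shoelace formula."""
    n = len(xs)
    area2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    for i in range(n):
        j = (i + 1) % n  # next vertex, wrapping around to close the polygon
        cross = xs[i] * ys[j] - xs[j] * ys[i]
        area2 += cross
        cx += (xs[i] + xs[j]) * cross
        cy += (ys[i] + ys[j]) * cross
    if area2 == 0:
        # Degenerate outline (no area): fall back to the vertex average
        return vertex_mean(xs, ys)
    return (cx / (3.0 * area2), cy / (3.0 * area2))


# A 2x2 square whose bottom edge is sampled more densely than the others
square_xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.0, 0.0]
square_ys = [0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0]

mean_point = vertex_mean(square_xs, square_ys)
centroid_point = polygon_centroid(square_xs, square_ys)
print(mean_point)
print(centroid_point)
```

For label placement the difference is usually small, so the simpler vertex average is a defensible choice here.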

## Interrogating `unemployment.data`

`unemployment` is a module. As with `us_counties`, we inspect the `__doc__` attribute to find out how this module should be used.

In [11]:
print unemployment.__doc__


This modules exposes per-county unemployment data for Unites States in 2009. It exposes a
dictionary 'data' which is indexed by the two-tuple containing (state_id, county_id) and has the
unemployment rate (2009) as the associated value.




Like `us_counties`, `unemployment` contains a dictionary named `data` whose keys are `(state_id, county_id)` tuples. This is helpful: the `data` dictionaries in both modules use the same keys. Let's make sure that the associated value is a number, and not a string.

In [12]:
# Get a key from data, then use it to access the associated value and its type
# Note the reused code from the interrogation of us_counties above: we simply
# replaced `us_counties` with `unemployment`, and the rest works as is!

d = unemployment.data
first_key = d.keys()[0]
print first_key             # should be a tuple of two numbers

first_value = d[first_key]  # get the value associated with this key

print type(first_value)     # get its type

(53, 53)
<type 'float'>


The value is a float. But is it expressed as a fraction of 1, or as a percentage out of 100%? We need to look at some actual values.

In [13]:
# create a list of the values in d, but only for counties in texas

temp = [
    d[key] for key in us_counties.data
    if us_counties.data[key]['state'] == state
]

print temp

[11.6, 6.1, 6.0, 17.8, 7.4, 7.8, 6.8, 6.4, 7.8, 5.3, 8.5, 9.6, ...]

The values appear to be percentages. What is the range of values?

In [14]:
print "low: %s" % min(temp)
print "average: %s" % (sum(temp)/len(temp))
print "high: %s" % max(temp)

low: 3.0
average: 7.93622047244
high: 17.8
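
Because both dictionaries are keyed by the same tuples, records can be joined by key. A direct lookup such as `d[key]` raises `KeyError` if a county is missing from one dictionary, so a defensive version may be useful. The sketch below uses invented stand-in data (`toy_counties` and `toy_unemployment`) to find missing keys and to join with `.get()`.

```python
# Toy stand-ins; keys and values are invented for illustration
toy_counties = {
    (48, 1): {"state": "tx", "name": "A"},
    (48, 3): {"state": "tx", "name": "B"},
    (53, 53): {"state": "wa", "name": "C"},
}
toy_unemployment = {(48, 1): 7.9, (53, 53): 11.6}

# Keys present in the county data but absent from the unemployment data
missing = sorted(set(toy_counties) - set(toy_unemployment))

# Join on the shared key; .get() returns None instead of raising KeyError
joined = {}
for key in toy_counties:
    joined[key] = (toy_counties[key]["name"], toy_unemployment.get(key))

print(missing)
print(joined[(48, 3)])
```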


## Building the unemployment data

The polygon data allow us to draw the counties, but we also want to present the unemployment rate of each one using a color scale. The loop below collects and organizes this information. (If you want to identify a given color hex, just search for the value: you'll find [a](https://www.colorcodehex.com) [wealth](http://www.colorhexa.com) [of](http://www.spycolor.com) [websites](http://www.color-hex.com) offering to show you a color sample.)

In [15]:
# our color palette: progressively deeper shades of magenta, then of violet
colors = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043",
          "#A347FF", "#9933FF", "#8A2EE6", "#6B24B2", "#4C1A80"]

# create three empty lists, which will be filled in the loop
county_colors = []
county_names = []
county_rates = []

In [16]:
# county_id will be a tuple containing (state #, county #)
# generates county_id from us_counties.data, not unemployment.data,
# to ensure it matches the order of counties in county_xs and county_ys

for county_id in us_counties.data:

    # checks whether the 'state' value is "tx". If it's not, 
    # skip the rest of the commands and start the next loop iteration
    
    if us_counties.data[county_id]['state'] != state:
        continue
        
    # gets the unemployment rate of the current county in the loop
    rate = unemployment.data[county_id]
    
    # converts each rate to an 11-level unemployment index (0 to 10)
    # 0: 0-2%; 1: 2-4%; 2: 4-6%; ...; 9: 18-20%; 10: 20% and above
    idx = min(int(rate/2), 10)
    
    # adds corresponding value to each list:
    #   colors: uses the unemployment index to index into the color palette
    #   names: uses us_counties.data to look up the county name
    #   rates: the actual number (not the index)
    county_colors.append(colors[idx])
    county_names.append(us_counties.data[county_id]['name'])
    county_rates.append(rate)
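
The binning step can be isolated into a function to make the clamping explicit. This is a sketch under the assumption of 2-point-wide bins and an 11-color palette, as in the loop above; `rate_to_index` is a hypothetical helper name.

```python
def rate_to_index(rate, n_colors, step=2.0):
    """Map a percentage to a palette index, clamped to the last color."""
    # int() truncates, so 7.9 / 2 = 3.95 becomes 3;
    # min() keeps very high rates from running past the end of the palette
    return min(int(rate / step), n_colors - 1)


n_colors = 11  # size of the palette used in this tutorial
indices = [rate_to_index(r, n_colors) for r in [3.0, 7.9, 17.8, 25.0]]
print(indices)  # [1, 3, 8, 10]
```

Deriving the cap from the palette size, rather than hard-coding it, means the palette can grow or shrink without touching the loop.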

## Building the graph model

In order to link various aspects of the visual data, we use a `ColumnDataSource` to store seven aspects of each county.

In [17]:
# create a ColumnDataSource object with 7 columns
# dict() takes the keyword arguments and turns them into a dictionary
# with the following keys: 'x', 'y', 'xc', 'yc', 'color', 'name', 'rate'

source = ColumnDataSource(
    data = dict(
        x=county_xs,
        y=county_ys,
        xc=center_xs,
        yc=center_ys,
        color=county_colors,
        name=county_names,
        rate=county_rates,
    )
)

# equivalent to:

source = ColumnDataSource({
        'x'    : county_xs,
        'y'    : county_ys,
        'xc'   : center_xs,
        'yc'   : center_ys,
        'color': county_colors,
        'name' : county_names,
        'rate' : county_rates
    })
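
The columns of a data source are meant to line up row by row, with one entry per county in every list. Since the lists here are built by separate comprehensions and a loop, a quick length check before constructing the source can catch a mismatch early. The helpers below (`column_lengths`, `is_rectangular`) are hypothetical and not part of Bokeh; the sample columns are invented.

```python
def column_lengths(columns):
    """Return {column name: number of rows} for a dict of lists."""
    return dict((name, len(values)) for name, values in columns.items())


def is_rectangular(columns):
    """True when every column has the same number of rows."""
    return len(set(column_lengths(columns).values())) <= 1


good = {"name": ["A", "B"], "rate": [7.9, 6.1]}
bad = {"name": ["A", "B"], "rate": [7.9]}

print(is_rectangular(good))
print(is_rectangular(bad))
print(column_lengths(bad))
```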

In [18]:
# set the destination format
# set the usual tools plus hover
# create the Figure object

output_file("texas.html", title="texas.py example")

TOOLS="pan,wheel_zoom,box_zoom,reset,hover,save"

p = figure(title="Texas Unemployment 2009", tools=TOOLS)

The `Patches` glyph draws one polygon per row of the data source. The `xs` and `ys` arguments name the columns that hold each polygon's lists of x and y coordinates.

In [19]:
# add patches glyphs using the coordinates from data source
# set fill color from data source
# thin, 0.5-pixel white outline
# 70% opacity

a = p.patches(
    xs = 'x',
    ys = 'y',
    fill_color = 'color', fill_alpha = 0.7,
    line_color = "white", line_width = 0.5,
    source = source)

In [20]:
# add text glyphs using the centroid data from data source
# set text color to black
# set text to county name

p.text(
    x = 'xc',
    y = 'yc',
    text = 'name',
    text_color = 'black', text_font_size = "6pt",
    text_align = "center",
    source = source)

<bokeh.models.renderers.GlyphRenderer at 0x10726e790>

In [21]:
# access the hover object within p
# select(dict(type=HoverTool)) searches the figure for objects of that type
hover = p.select(dict(type=HoverTool))

In [22]:
# by default, hover's tool tip pops from center of county being hovered
# but you can change this using its point_policy attribute
hover.point_policy = "follow_mouse"

In [23]:
# sets the information to display in the tool tip
# Format of each tuple: first string is the label,
# second string specifies what data to fill in
# (the OrderedDict coercion is unnecessary in current Bokeh;
# a simple list of tuples will do)

# $x and $y refer to the exact coordinates of the mouse cursor
# @name and @rate refer to keys of the column data source
hover.tooltips = [
    ("Name", "@name"),
    ("Unemployment rate", "@rate%"),
    ("(Long, Lat)", "($x, $y)"),
]
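
To build intuition for the `@column` placeholders, the sketch below imitates the substitution in plain Python. It is only a rough illustration of the idea, not how Bokeh implements tooltips: Bokeh fills them in within the browser and also handles `$x`, `$y`, and format specifiers, all of which this sketch ignores. `fill_tooltip` and the sample `row` are invented.

```python
import re


def fill_tooltip(template, row):
    """Replace each @column placeholder with the value from `row`."""
    def lookup(match):
        # match.group(1) is the column name that follows the "@"
        return str(row[match.group(1)])
    return re.sub(r"@(\w+)", lookup, template)


# One invented row, shaped like a row of the column data source
row = {"name": "Example County", "rate": 7.9}

tooltips = [("Name", "@name"), ("Unemployment rate", "@rate%")]
filled = [(label, fill_tooltip(template, row)) for label, template in tooltips]
print(filled)
```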

In [24]:
show(p)

## Exercises

For each task, document the changes you had to make. To test your changes, use `Cell > Run All`.

### Map a different state

Choose an arbitrary state and map its unemployment rate in the same way. Document the precise changes you had to make.

#### Solution

All you need to do is change the value of the `state` variable to the desired abbreviation. For example, to view California, replace the assignment with `state = "ca"`. This change affects the filtering in the list comprehensions and the loop, because they are all defined in reference to `state`.
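
Before picking a state, it can help to list which abbreviations actually occur in the data. The sketch below does this on `toy_counties`, an invented stand-in for `us_counties.data`; `available_states` and `counties_in` are hypothetical helper names.

```python
# Toy stand-in for us_counties.data; keys and names are invented
toy_counties = {
    (48, 1): {"state": "tx", "name": "A"},
    (48, 3): {"state": "tx", "name": "B"},
    (6, 1): {"state": "ca", "name": "C"},
    (53, 53): {"state": "wa", "name": "D"},
}


def available_states(data):
    """Sorted list of the distinct state abbreviations in the data."""
    return sorted(set(record["state"] for record in data.values()))


def counties_in(data, state):
    """Sorted names of the counties belonging to one state."""
    return sorted(record["name"] for record in data.values()
                  if record["state"] == state)


states = available_states(toy_counties)
print(states)
print(counties_in(toy_counties, "tx"))
```

An abbreviation that is absent from the data yields an empty list, which would otherwise show up only as an empty map.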

### Expand the unemployment scale

Allow the map to show unemployment rates between 10% and 20%, using shades of a different hue than magenta. Most websites about web colors (such as the [w3schools' colorpicker](http://www.w3schools.com/tags/ref_colorpicker.asp) or [color-hex](http://www.color-hex.com)), when given a particular color, will suggest a palette or gradient of related colors, varying by brightness.

#### Solution

1. Add five more values to the list `colors`. (In this case, progressively deeper shades of violet.)
2. In the loop that builds the unemployment data, raise the cap on `idx` to 10.

In [25]:
# our color palette: progressively deeper shades of magenta, then of violet
colors = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043",
          "#A347FF", "#9933FF", "#8A2EE6", "#6B24B2", "#4C1A80"]

# in the loop...

idx = min(int(rate/2), 10)

### Add county names

Label each area with the name of the county, preferably located at the **centroid** (x<sub>avg</sub>, y<sub>avg</sub>). But only include counties that meet a particular criterion with respect to unemployment (you decide what the criterion is).

#### Solution

1. Define two new lists, `center_xs` and `center_ys`, as the average of each county's longitudes and latitudes.
2. Add `xc` and `yc` columns to the data source, for storing the centers.
3. Add text glyphs using the centroid data.

In the resulting map, the names are numerous and overlap, but they become more readable when you zoom in on a specific part of Texas.
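
The exercise also asks to label only the counties that meet a criterion. One possible approach, sketched here with invented names and rates, is to build a separate list of labels in which counties below a threshold get an empty string. The `threshold` value and the `labels` name are assumptions chosen for illustration.

```python
# Invented sample data, parallel lists as in the tutorial
county_names = ["A", "B", "C", "D"]
county_rates = [5.2, 10.4, 9.9, 15.0]

threshold = 10.0  # assumed criterion: label counties at or above 10%

# Keep the name when the rate meets the criterion, otherwise an empty string
labels = [
    name if rate >= threshold else ""
    for name, rate in zip(county_names, county_rates)
]
print(labels)  # ['', 'B', '', 'D']
```

Such a list could then be stored as an extra column of the data source (for instance under a key like `'label'`) and referenced from the text glyph in place of `'name'`, which leaves the hover tooltips unaffected.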

In [26]:
center_xs = [
    sum(us_counties.data[code]['lons'])/len(us_counties.data[code]['lons']) 
    for code in us_counties.data
    if us_counties.data[code]['state'] == state    
]

center_ys = [
    sum(us_counties.data[code]['lats'])/len(us_counties.data[code]['lats']) 
    for code in us_counties.data
    if us_counties.data[code]['state'] == state    
]

source = ColumnDataSource({
        'x'    : county_xs,
        'y'    : county_ys,
        'xc'   : center_xs,
        'yc'   : center_ys,
        'color': county_colors,
        'name' : county_names,
        'rate' : county_rates
    })

# add text glyphs using the centroid data from data source
# set text color to black
# set text to county name

p.text(
    x = 'xc',
    y = 'yc',
    text = 'name',
    text_color = 'black', text_font_size = "6pt",
    text_align = "center",
    source = source)

<bokeh.models.renderers.GlyphRenderer at 0x10844fe50>

### Add a color key

Build a color key, with swatches of each color in the palette associated with the corresponding interval of unemployment rates.

#### Solution

Create variables and (11-item) lists to specify the location of each swatch. Then add `rect` and `text` glyphs for each swatch. Use `colors` defined above to assign the same colors to the swatches.

In [28]:
x_pos = -106.2
y_pos = [32.5 + i * 0.4 for i in range(0,11)]
intervals = ["0-2%", "2-4%", "4-6%", "6-8%", "8-10%", 
             "10-12%", "12-14%", "14-16%", "16-18%", "18-20%", "> 20%"]

p.rect(
    x = x_pos,
    y = y_pos,
    width = 0.8,
    height = 0.25,
    color = colors,
    line_color = "black",
    fill_alpha = 0.7
)

p.text(
    x = x_pos + 0.6,
    y = y_pos,
    text = intervals,
    text_font_size = "10pt",
    text_color = 'black',
    text_align = "left",
    text_baseline = "middle"
)

show(p)
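
The interval labels can also be generated from the palette size, so the key stays in sync if the palette changes. A minimal sketch, assuming 2-point-wide bins with an open-ended last bin; `make_intervals` is a hypothetical helper.

```python
def make_intervals(n_colors, step=2):
    """Build labels such as '0-2%' for each bin, plus an open-ended last bin."""
    labels = ["%d-%d%%" % (i * step, (i + 1) * step)
              for i in range(n_colors - 1)]
    labels.append("> %d%%" % ((n_colors - 1) * step))
    return labels


intervals = make_intervals(11)
print(intervals)
```

With 11 colors this reproduces the hand-written `intervals` list above.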