# Using Python for Research Homework: Week 4, Case Study 1

This case study gives step-by-step instructions for preparing plots in Bokeh, a library designed for simple, interactive plotting. We will demonstrate Bokeh by continuing the analysis of Scotch whiskies.

In [130]:
# DO NOT EDIT THIS CODE
import numpy as np, pandas as pd

whisky = pd.read_csv("https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@whiskies.csv", index_col=0)
correlations = pd.DataFrame.corr(whisky.iloc[:,2:14].transpose())
correlations = np.array(correlations)
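
The setup cell transposes the flavor columns before calling `corr`, because `DataFrame.corr` correlates columns with each other. A minimal sketch of that idea, using a small made-up table (the names `toy` and `toy_corr` and all values are hypothetical, not part of the whisky data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 "distilleries" (rows) scored on 4 flavors (columns).
toy = pd.DataFrame(
    [[2, 3, 0, 1],
     [4, 6, 0, 2],   # exactly twice the first row, so perfectly correlated with it
     [1, 0, 3, 2]],
    index=["A", "B", "C"],
    columns=["Body", "Sweetness", "Smoky", "Honey"],
)

# DataFrame.corr() correlates columns, so transposing first yields
# one correlation per pair of rows (distilleries).
toy_corr = np.array(toy.transpose().corr())
print(toy_corr.shape)   # (3, 3)
```

The result is a square, symmetric array with one row and one column per distillery, which is the shape the later exercises rely on.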

### Exercise 1

In this exercise, we provide a basic demonstration of an interactive grid plot using Bokeh. Make sure to study this code now, as we will edit similar code in the exercises that follow.

#### Instructions
- Execute the following code and follow along with the comments. We will later adapt this code to plot the correlations among distillery flavor profiles as well as plot a geographical map of distilleries colored by region and flavor profile.
- Once you have run the code, hover, click, and drag your cursor on the plot to interact with it. Additionally, explore the icons in the top-right corner of the plot for more interactive options!

In [131]:
# First, we import a tool to allow text to pop up on a plot when the cursor
# hovers over it.  Also, we import a data structure used to store arguments
# of what to plot in Bokeh.  Finally, we will use numpy for this section as well!

from bokeh.models import HoverTool, ColumnDataSource

# Let's plot a simple 5x5 grid of squares, alternating in color as red and blue.

plot_values = [1,2,3,4,5]
plot_colors = ["red", "blue"]

# How do we tell Bokeh to plot each point in a grid?  Let's use a function that
# finds each combination of values from 1-5.
from itertools import product

grid = list(product(plot_values, plot_values))
print(grid)

[(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5)]


In [132]:
# The first value is the x coordinate, and the second value is the y coordinate.
# Let's store these in separate lists.

xs, ys = zip(*grid)
print(xs)
print(ys)

(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5)
(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5)


In [133]:
# Now we will make a list of colors, alternating between red and blue.

colors = [plot_colors[i%2] for i in range(len(grid))]
print(colors)

['red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red', 'blue', 'red']
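
The colors above alternate by position in the list, not by position in the grid. The sketch below suggests that this produces a checkerboard only because the grid has an odd side length. The helper names `colors_by_index` and `colors_by_parity` are hypothetical, and the parity rule is an illustrative alternative rather than part of the exercise:

```python
from itertools import product

plot_colors = ["red", "blue"]

def colors_by_index(n):
    """Alternate colors by list position, as in the cell above."""
    grid = list(product(range(1, n + 1), repeat=2))
    return [plot_colors[i % 2] for i in range(len(grid))]

def colors_by_parity(n):
    """Hypothetical alternative: pick the color from the parity of x + y."""
    grid = list(product(range(1, n + 1), repeat=2))
    return [plot_colors[(x + y) % 2] for x, y in grid]

# For an odd side length the two rules agree ...
print(colors_by_index(5) == colors_by_parity(5))  # True
# ... but for an even side length, index-based alternation gives stripes.
print(colors_by_index(4) == colors_by_parity(4))  # False
```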


In [134]:
# Finally, let's determine the strength of transparency (alpha) for each point,
# where 0 is completely transparent.

alphas = np.linspace(0, 1, len(grid))

# Bokeh likes each of these to be stored in a special dataframe, called
# ColumnDataSource.  Let's store our coordinates, colors, and alpha values.

source = ColumnDataSource(
    data = {
        "x": xs,
        "y": ys,
        "colors": colors,
        "alphas": alphas,
    }
)
# We are ready to make our interactive Bokeh plot!
from bokeh.plotting import figure, output_file, show

output_file("Basic_Example.html", title="Basic Example")
fig = figure(tools="hover")
fig.rect("x", "y", 0.9, 0.9, source=source, color="colors",alpha="alphas")
hover = fig.select(dict(type=HoverTool))
hover.tooltips = {
    "Value": "@x, @y",
    }
show(fig)

**Potential edX question:** Which column has the most transparent squares in the plot?

**Answer**: Column 1
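
One way to check this answer without reading the plot is to average the alpha values within each column. This is a small sketch that rebuilds the same grid and alphas as the example; the names `xs_arr`, `mean_alpha`, and `most_transparent` are introduced here for illustration only:

```python
import numpy as np
from itertools import product

plot_values = [1, 2, 3, 4, 5]
grid = list(product(plot_values, plot_values))
xs, ys = zip(*grid)
alphas = np.linspace(0, 1, len(grid))

# Average alpha per column (x value); lower means more transparent.
xs_arr = np.array(xs)
mean_alpha = {int(x): float(alphas[xs_arr == x].mean()) for x in plot_values}
most_transparent = min(mean_alpha, key=mean_alpha.get)
print(most_transparent)  # 1
```

Because `product` varies the second coordinate fastest, the first five alpha values (the smallest ones) all land in the column where `x` is 1.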

### Exercise 2

In this exercise, we will create the names and colors we will use to plot the correlation matrix of whisky flavors. Later, we will also use these colors to plot each distillery geographically.

#### Instructions 
- Create a dictionary `region_colors` with `regions` as keys and `cluster_colors` as values.
- Print `region_colors`.

In [135]:
cluster_colors = ["red", "orange", "green", "blue", "purple", "gray"]
regions = ["Speyside", "Highlands", "Lowlands", "Islands", "Campbelltown", "Islay"]

region_colors = dict(zip(regions, cluster_colors))
region_colors

{'Speyside': 'red',
 'Highlands': 'orange',
 'Lowlands': 'green',
 'Islands': 'blue',
 'Campbelltown': 'purple',
 'Islay': 'gray'}

### Exercise 3

`correlations` is a two-dimensional `np.array` with both rows and columns corresponding to distilleries and elements corresponding to the flavor correlation of each row/column pair. In this exercise, we will define a list `correlation_colors` of `string` values giving the color used to plot each distillery pair. Pairs with low correlation will be white; pairs with high correlation will take a distinct group color if both distilleries are from the same group, and light gray otherwise.

#### Instructions

- Edit the code so that `correlation_colors` is `'white'` for each distillery pair whose correlation is less than 0.7.
- `whisky` is a `pandas` dataframe, and `Group` is a column consisting of distillery group memberships. For distillery pairs with a correlation of at least 0.7, use the corresponding color from `cluster_colors` if they share the same whisky group. Otherwise, set the `correlation_colors` value for that pair to `'lightgray'`.
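
The coloring rule in these instructions can be sketched on a tiny made-up example before applying it to the real data. Everything prefixed with `toy_` is hypothetical; only `cluster_colors` and the 0.7 threshold come from the exercise:

```python
import numpy as np

# Hypothetical toy inputs: 3 distilleries, their pairwise correlations,
# and a group label (an index into cluster_colors) for each one.
cluster_colors = ["red", "orange", "green", "blue", "purple", "gray"]
toy_corr = np.array([[1.0, 0.8, 0.9],
                     [0.8, 1.0, 0.3],
                     [0.9, 0.3, 1.0]])
toy_groups = np.array([0, 0, 2])

toy_colors = []
for i in range(len(toy_groups)):
    for j in range(len(toy_groups)):
        if toy_corr[i, j] < 0.7:                    # low correlation
            toy_colors.append("white")
        elif toy_groups[i] == toy_groups[j]:        # high correlation, same group
            toy_colors.append(cluster_colors[toy_groups[i]])
        else:                                       # high correlation, different groups
            toy_colors.append("lightgray")

print(toy_colors)
# ['red', 'red', 'lightgray', 'red', 'red', 'white', 'lightgray', 'white', 'green']
```

The list has one entry per (row, column) pair in row-major order, so its length is the square of the number of distilleries.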

In [136]:
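def same_group(distilleries, i, j):
    # Look up the Group value of each distillery by name, then compare.
    # Series.item() returns the single matching value as a Python scalar.
    x1 = int(whisky.loc[whisky.Distillery == distilleries[i], 'Group'].item())
    x2 = int(whisky.loc[whisky.Distillery == distilleries[j], 'Group'].item())
    return x1 == x2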

dict_dist = {}
distilleries = list(whisky.Distillery)
correlation_colors = []
for i in range(len(distilleries)):
    for j in range(len(distilleries)):
        if correlations[i, j] < 0.7:                   # if the correlation is low,
            correlation_colors.append('white')         # just use white.
            dict_dist[distilleries[i]] = 'white'
        else:                                          # otherwise,
            if same_group(distilleries, i, j):
                correlation_colors.append(cluster_colors[whisky.Group[i]]) # color them by their mutual group.
                dict_dist[distilleries[i]] = cluster_colors[whisky.Group[i]]
            else:                                      # otherwise
                correlation_colors.append('lightgray') # color them lightgray.
                dict_dist[distilleries[i]] = 'lightgray'

In [137]:
list(dict_dist.values())

['white',
 'lightgray',
 'white',
 'white',
 'white',
 'white',
 'lightgray',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'lightgray',
 'white',
 'white',
 'lightgray',
 'white',
 'white',
 'lightgray',
 'white',
 'white',
 'white',
 'lightgray',
 'white',
 'lightgray',
 'lightgray',
 'white',
 'lightgray',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'lightgray',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'white',
 'gray',
 'gray',
 'gray',
 'gray',
 'gray',
 'gray',
 'white',
 'white',
 'white',
 'gray',
 'gray',
 'gray',
 'white',
 'gray',
 'white',
 'gray',
 'white',
 'gray',
 'gray']

In [138]:
int(whisky.loc[whisky.Distillery == distilleries[i], 'Group'].item())

5

In [139]:
correlations.flatten()

array([1.        , 0.44904168, 0.46216816, ..., 0.6626506 , 0.76520727,
       1.        ])

### Exercise 4

In this exercise, we will edit the given code to make an interactive grid of the correlations among distillery pairs based on the quantities found in previous exercises. Most plotting specifications are made by editing `ColumnDataSource`, a `bokeh` structure used for defining interactive plotting inputs. The rest of the plotting code is already complete.

#### Instructions 

- `correlation_colors` is a list of `string` colors for each pair of distilleries. Set this as `colors` in `ColumnDataSource`.
- Define `correlations` in `source` using `correlations` from the previous exercise. To convert `correlations` from a two-dimensional `np.array` to a one-dimensional array, use the `flatten()` method. This correlation coefficient will be used to define both the color transparency and the hover text for each square.
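
The flattened correlations only line up with the `x` and `y` columns if all three are in the same order. The following sketch, on hypothetical names and values, suggests why pairing `np.repeat` for `x` with list repetition for `y` matches the row-major order that `flatten()` produces:

```python
import numpy as np

names = ["A", "B", "C"]                      # hypothetical distillery names
toy_corr = np.array([[1.0, 0.2, 0.3],
                     [0.2, 1.0, 0.6],
                     [0.3, 0.6, 1.0]])

x = np.repeat(names, len(names))             # A A A B B B C C C
y = names * len(names)                       # A B C A B C A B C
flat = toy_corr.flatten()                    # row-major: row 0, then row 1, ...

# Each flattened entry matches the (row, column) pair at the same position.
for xi, yi, c in zip(x, y, flat):
    assert c == toy_corr[names.index(xi), names.index(yi)]

print(flat.tolist())  # [1.0, 0.2, 0.3, 0.2, 1.0, 0.6, 0.3, 0.6, 1.0]
```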

In [140]:
list(distilleries)

['Tullibardine',
 'GlenElgin',
 'GlenDeveronMacduff',
 'GlenSpey',
 'Glenfiddich',
 'Glenkinchie',
 'Glenlivet',
 'Inchgower',
 'Bowmore',
 'GlenGarioch',
 'Bladnoch',
 'Linkwood',
 'Benriach',
 'GlenOrd',
 'Speyburn',
 'Belvenie',
 'Tomintoul',
 'RoyalBrackla',
 'Teaninich',
 'ArranIsleOf',
 'GlenScotia',
 'Isle of Jura',
 'Bruichladdich',
 'OldPulteney',
 'Oban',
 'Glenturret',
 'Glenrothes',
 'Mortlach',
 'RoyalLochnagar',
 'Springbank',
 'Glenfarclas',
 'Glendullan',
 'Tormore',
 'Tomatin',
 'Highland Park',
 'Macallan',
 'Glendronach',
 'Auchroisk',
 'Aberlour',
 'Balmenach',
 'Deanston',
 'GlenKeith',
 'Dalmore',
 'Dailuaine',
 'Balblair',
 'Strathmill',
 'Loch Lomond',
 'Tamnavulin',
 'Bunnahabhain',
 'Auchentoshan',
 'Cardhu',
 'GlenMoray',
 'Glenmorangie',
 'AnCnoc',
 'Dufftown',
 'Glenallachie',
 'Tobermory',
 'Dalwhinnie',
 'Craigganmore',
 'Craigallechie',
 'Glenlossie',
 'Laphroig',
 'Clynelish',
 'Talisker',
 'Ardbeg',
 'Lagavulin',
 'Caol Ila',
 'Tamdhu',
 'Strathisla',


In [141]:
source = ColumnDataSource(
    data = {
        "x": np.repeat(distilleries,len(distilleries)),
        "y": list(distilleries)*len(distilleries),
        "colors": correlation_colors,
        "correlations": correlations.flatten(),
    }
)

output_file("Whisky Correlations.html", title="Whisky Correlations")
fig = figure(title="Whisky Correlations",
    x_axis_location="above", tools="hover,save",  # the hover tool must be enabled for the tooltips below
    x_range=list(reversed(distilleries)), y_range=distilleries)
fig.grid.grid_line_color = None
fig.axis.axis_line_color = None
fig.axis.major_tick_line_color = None
fig.axis.major_label_text_font_size = "5pt"
fig.xaxis.major_label_orientation = np.pi / 3
fig.rect('x', 'y', .9, .9, source=source,
     color='colors', alpha='correlations')
hover = fig.select(dict(type=HoverTool))
hover.tooltips = {
    "Whiskies": "@x, @y",
    "Correlation": "@correlations",
}
show(fig)

### Exercise 5

In this exercise, we give a demonstration of plotting geographic points.

#### Instructions 

- Run the following code, to be adapted in the next section. Compare this code to that used in plotting the distillery correlations.

In [142]:
points = [(0,0), (1,2), (3,1)]
xs, ys = zip(*points)
colors = ["red", "blue", "green"]

output_file("Spatial_Example.html", title="Regional Example")
location_source = ColumnDataSource(
    data={
        "x": xs,
        "y": ys,
        "colors": colors,
    }
)

fig = figure(title = "Title",
    x_axis_location = "above", tools="hover, save")
fig.plot_width  = 300
fig.plot_height = 380
fig.circle("x", "y", size=10, source=location_source,
     color='colors', line_color = None)

hover = fig.select(dict(type = HoverTool))
hover.tooltips = {
    "Location": "(@x, @y)"
}
show(fig)

In [143]:
colors

['red', 'blue', 'green']

**Potential edX question:** What is the location of the blue point in this plot?

**Answer**: (1,2)

### Exercise 6

In this exercise, we will define a function `location_plot(title, colors)` that takes a string `title` and a list of colors corresponding to each distillery and outputs a Bokeh plot of each distillery by latitude and longitude. It will also display the distillery name, latitude, and longitude as hover text.

#### Instructions 

- Adapt the given code beginning with the first comment and ending with `show(fig)` to create the function `location_plot()`, as described above.
- `Region` is a column in the `pandas` dataframe `whisky`, containing the regional group membership for each distillery. Make a list consisting of the value of `region_colors` for each distillery, and store this list as `region_cols`.
- Use `location_plot` to plot each distillery, colored by its regional grouping.
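
Building a per-distillery color list amounts to looking up each row's region in the color dictionary. A minimal sketch on a made-up dataframe (the frame `toy` and its values are hypothetical stand-ins for `whisky`):

```python
import pandas as pd

region_colors = {"Speyside": "red", "Highlands": "orange", "Islay": "gray"}

# Hypothetical stand-in for the `whisky` dataframe.
toy = pd.DataFrame({
    "Distillery": ["X", "Y", "Z"],
    "Region": ["Islay", "Speyside", "Islay"],
})

# One color per row, in the same order as the rows of the dataframe.
region_cols = [region_colors[r] for r in toy["Region"]]
print(region_cols)  # ['gray', 'red', 'gray']

# Equivalent lookup with Series.map.
assert region_cols == toy["Region"].map(region_colors).tolist()
```

Keeping the colors in row order matters because `ColumnDataSource` pairs the columns up position by position.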

In [144]:
prc = pd.DataFrame(region_colors.items(), columns=['Region', 'Color'])
newWhisky = pd.merge(whisky, prc, on='Region')
newWhisky

Unnamed: 0,RowID,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral,Postcode,Latitude,Longitude,Region,Group,Color
0,86,Tullibardine,2,3,0,0,1,0,2,1,1,2,2,1,PH4 1QG,289690,708850,Highlands,0,orange
1,35,GlenGarioch,2,1,3,0,0,0,3,1,0,2,2,2,AB51 0ES,381020,827590,Highlands,0,orange
2,39,GlenOrd,3,2,1,0,0,1,2,1,1,2,2,2,IV6 7UJ,251810,850860,Highlands,0,orange
3,81,Teaninich,2,2,2,1,0,0,2,0,0,0,2,2,IV17 0XB,265360,869120,Highlands,0,orange
4,69,OldPulteney,2,1,2,2,1,0,1,1,2,2,2,2,KW1 5BA,336730,950130,Highlands,1,orange
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,82,Tobermory,1,1,1,0,0,1,0,0,1,2,2,2,PA75 6NR,150450,755070,Islands,3,blue
82,78,Talisker,4,2,3,3,0,1,3,0,1,2,2,0,IV47 8SR,137950,831770,Islands,4,blue
83,72,Scapa,2,2,1,1,0,2,1,1,2,2,2,2,KW15 1SE,342850,1008930,Islands,5,blue
84,40,GlenScotia,2,2,2,2,0,1,0,1,2,2,1,1,PA28 6DS,172090,621010,Campbelltown,1,purple


In [146]:
# edit this to make the function `location_plot`.
def location_plot(title, colors):
    output_file(title+".html")
    location_source = ColumnDataSource(
        data = {
            "x": whisky[" Latitude"],
            "y": whisky[" Longitude"],
            "colors": colors,          # one color per distillery, supplied by the caller
            "regions": whisky["Region"],
            "distilleries": whisky["Distillery"]
        }
    )

    fig = figure(title = title,
        x_axis_location = "above", tools="hover, save")
    fig.plot_width  = 400
    fig.plot_height = 500
    fig.xaxis.major_label_orientation = np.pi / 3
    fig.circle("x", "y", size=10, source=location_source, color='colors', line_color = None)
    hover = fig.select(dict(type = HoverTool))
    hover.tooltips = {
        "Distillery": "@distilleries",
        "Location": "(@x, @y)"
    }
    show(fig)

# Look up the region color for each distillery, in dataframe order.
region_cols = [region_colors[region] for region in whisky["Region"]]
location_plot("Whisky Locations and Regions", region_cols)

### Exercise 7 

In this exercise, we will use this function to plot each distillery, colored by region and taste coclustering classification, respectively.

#### Instructions 
- Create the list `region_cols` consisting of the color in `region_colors` that corresponds to each whisky in `whisky.Region`.
- Similarly, create a list `classification_cols` consisting of the color in `cluster_colors` that corresponds to each cluster membership in `whisky.Group`.
- Create two interactive plots of distilleries, one with colors defined by `region_cols` and the other with colors defined by `classification_cols`. How well do the coclustering groupings match the regional groupings?

In [None]:
region_cols = ## ENTER CODE HERE! ##
classification_cols = ## ENTER CODE HERE! ##

location_plot("Whisky Locations and Regions", region_cols)
location_plot("Whisky Locations and Groups", classification_cols)
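
Besides comparing the two plots by eye, one possible way to approach the question of how well the groupings match is a cross-tabulation of region against group. The sketch below uses a made-up dataframe `toy` with invented regions and groups, so its counts say nothing about the real whisky data:

```python
import pandas as pd

# Hypothetical stand-in for `whisky`, with made-up regions and groups.
toy = pd.DataFrame({
    "Region": ["Islay", "Islay", "Speyside", "Speyside", "Speyside"],
    "Group":  [5, 5, 0, 0, 1],
})
cluster_colors = ["red", "orange", "green", "blue", "purple", "gray"]

# Group labels are integers, so they can index cluster_colors directly.
classification_cols = [cluster_colors[g] for g in toy["Group"]]
print(classification_cols)  # ['gray', 'gray', 'red', 'red', 'orange']

# Rows are regions, columns are groups; counts concentrated in a single
# column per row would indicate close agreement.
table = pd.crosstab(toy["Region"], toy["Group"])
print(table)
```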