# SI 370 - Homework #2: Visualization

Imagine you are employed as a data scientist at the Olympic Studies Centre, whose mission is "to share Olympic knowledge with professionals and researchers through providing information, giving access to our unique collections, enabling research and stimulating intellectual exchange" (https://www.olympic.org/olympic-studies-centre).

You are tasked with creating visualizations of historical Olympic data that reflect your exploration of the data.
In addition, you have been asked to create an interactive visualization that would be suitable for use by 
someone without a data science background but who is interested in the history of Olympic performance.

This homework assignment consists of three main parts:
1. Question Formulation
2. Seaborn and Bokeh Visualizations
3. Adapting the Gapminder Visualization (an advanced, interactive visualization using Bokeh)

You will be using the same data set(s) you used in the first homework assignment, available via  https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results.  You are allowed to 
use adjunct data sets (such as the one that was used for the Bonus Question in Homework #1), although you
are not required to do so.


## 1. Question Formulation (6 points)

Being able to formulate good questions is an important aspect of exploratory data analysis.  This part
of the homework provides an opportunity for you to practice developing this skill.

Generate three (3) authentic exploratory questions about the data, similar in nature to what was asked in the previous homework assignment's questions 5, 6, and the Bonus Question.  When contemplating which questions to pose, keep in mind that you should generate questions that can be answered using visualizations of the data.  Questions should be chosen to allow you to demonstrate your ability to both manipulate data and visualize it.  Selecting overly simplistic questions (e.g. "What is the median age of female swimmers") will not earn full points.

Questions should start with "I wonder...".  For example:  "I wonder how the number of different events in the Summer Olympics has changed over time".

We suggest working on question formulation in teams during class and asking peers and the teaching team for feedback.  You should also feel free to circle back after you work on your visualizations to rephrase or reframe your questions.


## 2. Seaborn and Bokeh Visualizations (20 points)
Create one or more Seaborn-based **and** one or more Bokeh-based visualizations that can provide visual answers to the questions you posed in the previous section.  A total of *at least* four (4) visualizations should be used, including at least one made with Seaborn and at least one made with Bokeh.  If you create more than four visualizations, the best four will be
counted.  The following rubric will be used for each visualization:

* 5 points: Excellent visualization that goes beyond the basics covered in class.  Clear understanding of the visualization toolkit's functionality, typically learned from studying the documentation and/or examples from other sources.
* 4 points: Good visualization that uses basic charting and plotting functions as covered in class.
* 3 points: Acceptable visualization with some errors or omissions.  
* 2 points: Perfunctory attempt at creating a visualization.


## 3. Adapting the ```Gapminder``` Visualization (4 points)
Use some of the functionality from the Bokeh-based Gapminder visualization (detailed below) to create an interactive visualization that allows a user to explore the Olympics data set.  This section should also be driven by an overarching question that seeks to understand the relationship between two or more variables.  You must state this question as part of your answer to this section.

To create your interactive visualization, think about the number of variables that you want to present, along with the number of modalities that are available to you.  For example, in the Gapminder demo, there are three continuous variables (Children per woman, Life expectancy at birth, and Population) and two categorical variables (continent and year).  You don't need to use that many variables but you should be aware that you have quite a few options.

The number of points awarded for this section will be based on the degree to which your visualization allows the user (in this case, Chris and Minje) to explore your data while attempting to answer your question.  The following rubric will be used:

* 4 points: Excellent work that allows deep and thorough exploration of the data and that can reveal otherwise invisible features ("surprises").
* 2 points: Acceptable work that uses one advanced feature from the Gapminder visualization.
* 1 point: At least you tried!

You should create a new notebook for this homework.  The notebook
should conform to PEP-8 and "Elements of Style" guidelines.  The final notebook should represent your own original
work, although you are encouraged to work in groups while formulating questions and general approaches to
visualizations.

The remainder of this notebook is based on the Gapminder visualization tutorial https://rebeccabilbro.github.io/interactive-viz-bokeh/, which is a Bokeh-based implementation of the animated bubble chart from
Prof. Hans Rosling's famous [TED talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen).

You are encouraged to study that page for more information about how the following code works.

In [None]:
import numpy as np
import pandas as pd

from bokeh.layouts import layout
from bokeh.layouts import widgetbox

from bokeh.embed import file_html

from bokeh.io import show
from bokeh.io import output_notebook

from bokeh.models import Text
from bokeh.models import Plot
from bokeh.models import Slider
from bokeh.models import Circle
from bokeh.models import Range1d
from bokeh.models import CustomJS
from bokeh.models import HoverTool
from bokeh.models import LinearAxis
from bokeh.models import ColumnDataSource
from bokeh.models import SingleIntervalTicker

from bokeh.palettes import Spectral6

In [None]:
output_notebook()

First we will load each of the datasets with the process_data() function and do a bit of clean up:

In [None]:
def process_data():
    fertility = pd.read_csv('data/gapminder_fertility.csv')
    regions = pd.read_csv('data/gapminder_regions.csv')
    population = pd.read_csv('data/gapminder_population.csv')
    life_expect = pd.read_csv('data/gapminder_life_expectancy.csv')

    fertility = fertility.set_index('Country')
    regions = regions.set_index('Country')
    population = population.set_index('Country')
    life_expect = life_expect.set_index('Country')
    
    # Make the column names ints not strings for handling.
    # range() excludes its end point, so add 1 to keep the final year.
    columns     = list(fertility.columns)
    years       = list(range(int(columns[0]), int(columns[-1]) + 1))
    rename_dict = dict(zip(columns, years))

    fertility   = fertility.rename(columns=rename_dict)
    life_expect = life_expect.rename(columns=rename_dict)
    population  = population.rename(columns=rename_dict)
    regions     = regions.rename(columns=rename_dict)

    # Turn population into bubble sizes.
    # Use min_size and scaling to tweak.
    scaling  = 200
    pop_size = np.sqrt(population / np.pi) / scaling
    min_size = 3
    pop_size = pop_size.where(
                  pop_size >= min_size
                  ).fillna(min_size)

    # Use pandas categories and categorize & color the regions
    regions.Group = regions.Group.astype('category')
    regions_list  = list(regions.Group.cat.categories)

    def get_color(r):
        return Spectral6[regions_list.index(r.Group)]

    regions['region_color'] = regions.apply(get_color, axis=1)

    return (fertility, life_expect, pop_size,
        regions, years, regions_list)
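
The sizing rule in process_data() treats each population as the area of a circle, takes that circle's radius, shrinks it by scaling, and then enforces a floor of min_size. As a minimal plain-Python sketch of the same arithmetic for a single value (the helper name bubble_size is hypothetical and not part of the tutorial):

```python
import math


def bubble_size(population, scaling=200, min_size=3):
    """Return a bubble size for one population value."""
    # Radius of a circle whose area equals the population, scaled down.
    size = math.sqrt(population / math.pi) / scaling
    # Small countries would otherwise be invisible, so apply a floor.
    return max(size, min_size)


small = bubble_size(1_000_000)      # falls below the floor
large = bubble_size(1_000_000_000)  # stays above the floor
print(small, round(large, 1))
```

Raising scaling shrinks every bubble, while raising min_size only affects the smallest ones.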

Next we will add each of our sources to the sources dictionary, where each key is the year (prefixed with an underscore) and each value is a ColumnDataSource built from a dataframe of the aggregated values for that year.

Note that we need the prefix because JavaScript variable names cannot begin with a digit.

In [None]:
(fertility_df, life_expect_df,
pop_size_df, regions_df, years, regions) = process_data()

sources = {}

region_color      = regions_df['region_color']
region_color.name = 'region_color'

for year in years:
    fertility       = fertility_df[year]
    fertility.name  = 'fertility'
    life            = life_expect_df[year]
    life.name       = 'life'
    population      = pop_size_df[year]
    population.name = 'population'

    new_df = pd.concat(
                [fertility, life, population, region_color],
                axis=1, sort=False
    )
    sources['_' + str(year)] = ColumnDataSource(new_df)
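
To make the shape of each per-year frame concrete, here is a small sketch with invented numbers for two placeholder countries, A and B. It assumes only pandas and leaves out the ColumnDataSource wrapper:

```python
import pandas as pd

# Invented toy values; rows are countries, columns are years.
fertility_df = pd.DataFrame(
    {1964: [6.0, 3.0], 1965: [5.9, 2.9]}, index=['A', 'B']
)
life_expect_df = pd.DataFrame(
    {1964: [50.0, 70.0], 1965: [50.5, 70.2]}, index=['A', 'B']
)

frames = {}
for year in fertility_df.columns:
    # rename() returns a renamed copy of the selected column.
    fertility = fertility_df[year].rename('fertility')
    life = life_expect_df[year].rename('life')
    # One frame per year: one row per country, one column per variable.
    frames['_' + str(year)] = pd.concat([fertility, life], axis=1)

print(frames['_1964'])
```

Each value in frames has one row per country and one column per plotted variable, which is the layout the glyphs refer to by column name.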

Later we will pass this sources dictionary to the JavaScript callback. As a result, our JavaScript will have variables named after the underscore-prefixed years, each referring to the corresponding ColumnDataSource.

We can also create a corresponding dict_of_sources object, where the keys are the integer years and the values are the names of the ColumnDataSources from above:

In [None]:
dict_of_sources = dict(zip(
                      [x for x in years],
                      ['_%s' % x for x in years])
                      )

js_source_array = str(dict_of_sources).replace("'", "")
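
To see what the JavaScript will receive, it can help to print js_source_array for a short list of years. The three years below are arbitrary examples rather than the range in the data files:

```python
# Arbitrary example years, used only to show the string that is produced.
years = [1964, 1965, 1966]

dict_of_sources = {year: '_%s' % year for year in years}

# Removing the quotes turns the string values into bare names.
js_source_array = str(dict_of_sources).replace("'", "")

print(js_source_array)
# {1964: _1964, 1965: _1965, 1966: _1966}
```

Because the quotes are stripped, the browser reads _1964 as a variable name rather than a string, and that variable is the ColumnDataSource passed in through the callback's args.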

Now we need to create a Plot object. We’ll start with a basic frame, only specifying things like plot height, width, and ranges for the axes.

In [None]:
xdr  = Range1d(1, 9)
ydr  = Range1d(20, 100)
plot = Plot(
    x_range=xdr,
    y_range=ydr,
    plot_width=800,
    plot_height=400,
    outline_line_color=None,
    toolbar_location=None,
    min_border=20,
)

If you were to call show() here, what would you expect to see? Bokeh’s API works in much the same way as Matplotlib’s, meaning that we can imagine our digital canvas in the same way we would imagine a traditional fabric canvas. As we add new elements to our plot object, we are adding new layers of information onto our canvas that will appear as overlays (unless they explicitly reset some earlier-set parameter). So far we have only created the plot object, so if we were to show() it at this phase, we would get… a blank canvas!

Next we can make some stylistic modifications to the plot axes (e.g. by specifying the text font, size, and color, and by adding labels), to make the plot look more like the one in Hans Rosling’s video.

In [None]:
AXIS_FORMATS = dict(
    minor_tick_in=None,
    minor_tick_out=None,
    major_tick_in=None,
    major_label_text_font_size="10pt",
    major_label_text_font_style="normal",
    axis_label_text_font_size="10pt",

    axis_line_color='#AAAAAA',
    major_tick_line_color='#AAAAAA',
    major_label_text_color='#666666',

    major_tick_line_cap="round",
    axis_line_cap="round",
    axis_line_width=1,
    major_tick_line_width=1,
)

xaxis = LinearAxis(
    ticker     = SingleIntervalTicker(interval=1),
    axis_label = "Children per woman (total fertility)",
    **AXIS_FORMATS
)
yaxis = LinearAxis(
    ticker     = SingleIntervalTicker(interval=20),
    axis_label = "Life expectancy at birth (years)",
    **AXIS_FORMATS
)   

plot.add_layout(xaxis, 'below')
plot.add_layout(yaxis, 'left')

show(plot)

One of the features of Rosling’s animation is that the year appears as the text background of the plot. We will add this feature to our plot first so it will be layered below all the other glyphs.

In [None]:
text_source = ColumnDataSource({'year': ['%s' % years[0]]})
text        = Text(
                  x=2, y=35, text='year',
                  text_font_size='150pt',
                  text_color='#EEEEEE'
                  )
plot.add_glyph(text_source, text)

show(plot)

Next we will add the bubbles using Bokeh’s Circle glyph. We start with the first year of data as the source that drives the circles (the other sources come into play later, in the slider callback).

In [None]:
# Add the circle
renderer_source = sources['_%s' % years[0]]
circle_glyph    = Circle(
                    x='fertility', y='life',
                    size='population', fill_alpha=0.8,
                    fill_color='region_color',
                    line_color='#7c7e71',
                    line_width=0.5, line_alpha=0.5
                    )

circle_renderer = plot.add_glyph(renderer_source, circle_glyph)

show(plot)

In [None]:
# Add hover for the circle (not other plot elements)
tooltips = "@index"
plot.add_tools(HoverTool(
                  tooltips=tooltips,
                  renderers=[circle_renderer]
                  )
              )

show(plot)

Next we will manually build a legend for our plot by adding circles and text to the upper right-hand portion:

In [None]:
text_x = 7
text_y = 95
for i, region in enumerate(regions):
    plot.add_glyph(Text(
                      x=text_x, y=text_y,
                      text=[region],
                      text_font_size='10pt',
                      text_color='#666666'
                      )
                  )
    plot.add_glyph(Circle(
                      x=text_x - 0.1,
                      y=text_y + 2,
                      fill_color=Spectral6[i],
                      line_color=None,
                      fill_alpha=0.8,
                      size=10,
                      )
                  )
    text_y = text_y - 5
    
show(plot)

Next we add the slider widget and the JavaScript callback code, which replaces the data of the renderer_source (powering the bubbles / circles) and the data of the text_source (powering our background text) whenever the slider moves. In the code below, assigning a new object to a source's data property is what prompts the redraw; no separate call is made. slider, renderer_source, and text_source are all available inside the JavaScript because we add them to the args of the CustomJS callback.

It is the combination of sources = %s (filled in with js_source_array) in the JavaScript and CustomJS(args=sources, ...) that provides the ability to look up, by year, the JavaScript version of our Python-made ColumnDataSource.

In [None]:
# Add the slider
code = """
    var year = slider.value,
        sources = %s,
        new_source_data = sources[year].data;
    renderer_source.data = new_source_data;
    text_source.data = {'year': [String(year)]};
""" % js_source_array

callback = CustomJS(args=sources, code=code)
slider   = Slider(
              start=years[0], end=years[-1],
              value=years[0], step=1, title="Year",
              callback=callback
              )
callback.args["renderer_source"] = renderer_source
callback.args["text_source"] = text_source
callback.args["slider"] = slider
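
If your Bokeh version rejects the callback= argument (newer releases attach JavaScript callbacks with js_on_change instead), the same wiring can be sketched as follows. This is an adaptation under that assumption, not the tutorial's own code, and the two toy sources with their x column are placeholders rather than Gapminder data:

```python
from bokeh.models import ColumnDataSource, CustomJS, Slider

# Toy stand-ins for the per-year sources built earlier (placeholder data).
years = [1964, 1965]
sources = {'_%s' % y: ColumnDataSource({'x': [y]}) for y in years}
renderer_source = ColumnDataSource({'x': [years[0]]})
text_source = ColumnDataSource({'year': [str(years[0])]})

# Same year-to-name lookup string as js_source_array.
lookup = str({y: '_%s' % y for y in years}).replace("'", "")

# cb_obj is the model that fired the event, here the slider itself.
code = """
    var year = cb_obj.value;
    var sources = %s;
    renderer_source.data = sources[year].data;
    text_source.data = {'year': [String(year)]};
""" % lookup

args = dict(sources, renderer_source=renderer_source,
            text_source=text_source)
callback = CustomJS(args=args, code=code)

slider = Slider(start=years[0], end=years[-1],
                value=years[0], step=1, title="Year")
# Run the callback whenever the slider's value property changes.
slider.js_on_change('value', callback)
```

Reading the year from cb_obj means the slider does not have to be added to args after it is created.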

In [None]:
# Preview the slider widget on its own.
show(widgetbox(slider))

Last but not least, we put the chart and the slider together in a layout, which we can display inline in a notebook by calling show(layout([[plot], [slider]], sizing_mode='scale_width')):

In [None]:
show(layout([[plot], [slider]], sizing_mode='scale_width'))