### Download and Clean Data

In [1]:
'''import packages required for data wrangling'''

import pandas as pd
import re

The datasets that we have downloaded are in a wide format: each row is a country and each column is a year, so the year variable is spread across the column headers. To make the datasets easier to work with, we need to create 'tidy' datasets, in which each variable is in its own column and each observation is in its own row. You can read more about this in Hadley Wickham's paper, <a href="https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">Tidy Data</a>.
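
To make the wide-to-long reshaping concrete, here is a minimal sketch on a made-up two-country table. The country names and values are invented for illustration and are not Gapminder figures. `pd.melt` keeps `country` as an identifier and stacks the year columns into rows.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    'country': ['A', 'B'],
    '2008': [1.0, 3.0],
    '2009': [2.0, 4.0],
})

# Stack the year columns so each row holds one (country, year, income) observation.
tidy = pd.melt(wide, id_vars=['country'], var_name='year', value_name='income')
print(tidy)
```

The result has four rows, one per country and year, which is the shape the cells below produce for the real files.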

In [2]:
'''melting the income dataset using pandas'''

income = pd.read_excel('/Users/adityakamath/Desktop/gapminder_data/income.xlsx', sheet_name = 'Data')
income = pd.melt(income, id_vars = ['country'])
income.columns = ['country', 'year', 'income']
'''strip the country names of spaces so that we can use them for merging later on'''
income['country'] = income['country'].map(lambda x: x.strip()).astype(str)
income['year'] = income['year'].astype(str)

In [3]:
'''melting the population dataset'''

population = pd.read_excel('/Users/adityakamath/Desktop/gapminder_data/population.xlsx', sheet_name = 'Data')
population = pd.melt(population, id_vars = ['country'])
population.columns = ['country', 'year', 'population']
population['country'] = population['country'].map(lambda x: x.strip())
population['year'] = population['year'].astype(str)
'''fill missing population values with 0 and convert the column to integer type'''
population['population'] = population['population'].fillna(value = 0)
population['population'] = population['population'].astype('int64')

In [4]:
'''melting the life_expectancy dataset'''

life_expectancy = pd.read_excel('/Users/adityakamath/Desktop/gapminder_data/life_expectancy.xlsx', sheet_name = 'Data')
life_expectancy = pd.melt(life_expectancy, id_vars = ['country'])
life_expectancy.columns = ['country', 'year', 'life_expectancy']
life_expectancy['country'] = life_expectancy['country'].map(lambda x: x.strip()).astype(str)
life_expectancy['year'] = life_expectancy['year'].astype(str)

The three datasets above don't have any data on which region each country belongs to. So we need to find a dataset that gives us this information. Head on over to this <a href="https://github.com/mledoze/countries/tree/master/dist">GitHub repo</a> and download the countries.csv dataset.

In [5]:
'''for each country in the countries.csv dataset, there are several different names. 
Let us take only the first name for each country and create a regions dataset with only the country 
and the corresponding region.'''

regions = pd.read_csv('/Users/adityakamath/Desktop/gapminder_data/countries.csv')
regions['country'] = regions['name'].apply(lambda x: re.findall('[^,]*', x)[0])
regions = regions[['country', 'region']]
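
The regular expression above takes everything before the first comma. As a small sketch of what it does, the snippet below applies it to a made-up string (the sample value is an assumption about the comma-separated style of the `name` column, not a row copied from the file) and compares it with a plain `split`.

```python
import re

def first_name(names):
    """Return the text before the first comma, or the whole string if there is none."""
    return re.findall('[^,]*', names)[0]

# Hypothetical value in a comma-separated style.
sample = 'Aruba,Aruba,Aruba'

print(first_name(sample))
# A plain split on the comma gives the same result for strings like this.
print(sample.split(',')[0])
```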

In [6]:
'''first let's merge the income and population dataset'''

merge1 = pd.merge(left = income, right = population, on = ['country', 'year'], how = 'inner')

'''then merge merge1 and life_expectancy'''
merge2 = pd.merge(left = merge1, right = life_expectancy, on = ['country', 'year'], how = 'inner')
#worldDF.set_index('year', inplace = True)

'''finally merge merge2 and the regions dataset'''
worldDF = pd.merge(left = merge2, right = regions, on = 'country', how = 'left')
worldDF.region = worldDF.region.fillna('Other')
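
Because the inner joins drop rows whose country names differ between files, and the left join labels every unmatched country as 'Other', it can be worth checking which names fail to match. The sketch below uses the `indicator` option of `pd.merge` on two toy frames. The frame names and values are hypothetical stand-ins for `merge2` and `regions`.

```python
import pandas as pd

# Hypothetical stand-ins for merge2 and regions.
stats = pd.DataFrame({'country': ['A', 'B', 'C'], 'income': [1, 2, 3]})
regions_demo = pd.DataFrame({'country': ['A', 'B'], 'region': ['North', 'South']})

# indicator=True adds a _merge column recording where each row was found.
checked = pd.merge(stats, regions_demo, on='country', how='left', indicator=True)

# Rows marked 'left_only' have no region; fillna('Other') would relabel these.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)
```

Running the same check on the real data would list the countries whose spelling needs to be reconciled before merging.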

### Load Bokeh

Creating a plot with Bokeh involves four steps. The first step, preparing the data, is complete. Second, you create a plot using the figure() function. Third, you add the glyphs needed to visualize the data in the plot. Finally, you show the output.

In [7]:
'''let's import the required packages. To see the output inside the Jupyter notebook, call the output_notebook() 
function. To save the output as a separate HTML file, use the output_file() function. I've used both here since I want to use 
the output file separately.'''

from bokeh.io import output_notebook, output_file, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
output_notebook()
output_file('/Users/adityakamath/Desktop/gapminder.html')

In [8]:
'''ColumnDataSource is an important data structure in Bokeh and it is used to pass data to many of the plotting
functions. You can pass a dictionary, a pandas dataframe or even a pandas groupby object to it. In this case, I will
be passing the worldDF dataframe filtered to the year 2009 only.'''

worldDF2 = worldDF[worldDF['year'].astype(int) == 2009]
source = ColumnDataSource(worldDF2)

In [9]:
p = figure(height = 500, title = 'Life Expectancy v/s Income', 
           x_axis_label = 'Income ($)', y_axis_label = 'Life Expectancy (Years)')

p.circle( x= 'income', y = 'life_expectancy', source = source, size = 15, 
         fill_color  = '#f98d77', fill_alpha = 0.5, line_color = '#e65c6f')

show(p)

In [10]:
'''import some tools for enhancing the plot as well as a color palette'''
from bokeh.models import HoverTool, LinearInterpolator, CategoricalColorMapper
from bokeh.palettes import Spectral6

In [26]:
'''the LinearInterpolator is used to set the size of each datapoint depending on the value of the population'''
sizeMapper = LinearInterpolator( x = [worldDF2.population.astype(float).min(), worldDF2.population.astype(float).max()],
                               y = [12, 60])

'''HoverTool is used to display some information when you hover over a datapoint in the plot'''
hover = HoverTool(tooltips = [('Country','@country'),('Population', '@population'), ('Region', '@region')])

'''map different colors to different regions'''
color_map = CategoricalColorMapper(factors = list(worldDF2.region.unique()), palette = Spectral6)

'''repeat the code for plotting but this time include the hover tools in the tools parameter'''
p = figure(height = 500, width = 700, title = 'Life Expectancy v/s Income in 2009', 
           x_axis_label = 'Income ($)', y_axis_label = 'Life Expectancy (Years)', tools = [hover])

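'''map the marker size to population and the marker color to region; legend_field replaces the older
legend parameter, which recent Bokeh releases no longer accept'''
p.circle( x= 'income', y = 'life_expectancy', 
          source = source, size = {'field': 'population', 'transform': sizeMapper}, 
          color = {'field': 'region', 'transform': color_map}, 
          fill_alpha = 0.6,
          legend_field = 'region')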

'''adjust the location of the legend'''
p.legend.location = (512, 10)

show(p)
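
To see what the size mapping above does to each population value, here is a plain-Python sketch of linear interpolation onto the range 12 to 60. It illustrates the idea only and is not Bokeh's implementation. The function name, the clamping of out-of-range values, and the populations are assumptions made for this example.

```python
def linear_size(value, x_min, x_max, y_min=12, y_max=60):
    """Map value from [x_min, x_max] onto [y_min, y_max] along a straight line."""
    if x_max == x_min:
        return y_min  # degenerate range: fall back to the smallest size
    # Clamp so out-of-range inputs stay within the size limits (an assumption of this sketch).
    value = max(x_min, min(x_max, value))
    fraction = (value - x_min) / (x_max - x_min)
    return y_min + fraction * (y_max - y_min)

# Made-up populations: the smallest, a midpoint, and the largest.
print(linear_size(0, 0, 1000))     # 12.0
print(linear_size(500, 0, 1000))   # 36.0
print(linear_size(1000, 0, 1000))  # 60.0
```

The smallest country gets a marker of size 12, the largest gets 60, and everything in between scales proportionally, so a few very large populations will dominate the plot while most markers stay near the minimum size.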