[This post by Ramiro Gomez](http://exploringdata.github.io/vis/evolution-internet-users/) is nice: it shows the number of internet users by country against GDP. However, the interaction is somewhat limited. Can we do better with Bokeh, given that the chart looks a lot like this [gapminder animation](https://notebooks.anaconda.org/bokeh/gapminder)?

# Getting the data 

Ramiro Gomez has created a repository for the data. However, one first has to download it in raw form from the World Bank.

In [17]:
import numpy as np

In [1]:
import pandas as pd

In [25]:
df = pd.read_excel('files/InternetUsers_GDP.xlsx')

In [26]:
df.columns

Index(['Series Name', 'Series Code', 'Country Name', 'Country Code',
       '1991 [YR1991]', '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]',
       '1995 [YR1995]', '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]',
       '1999 [YR1999]', '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]',
       '2003 [YR2003]', '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]',
       '2007 [YR2007]', '2008 [YR2008]', '2009 [YR2009]', '2010 [YR2010]',
       '2011 [YR2011]', '2012 [YR2012]', '2013 [YR2013]', '2014 [YR2014]',
       '2015 [YR2015]'],
      dtype='object')

In [27]:
df['Series Name'].unique()

array(['Internet users (per 100 people)',
       'GDP per capita (constant 2005 US$)', 'Population, total', nan,
       'Data from database: World Development Indicators',
       'Last Updated: 11/12/2015'], dtype=object)

Let's build the dataframe for internet use.

In [28]:
df_internet = df[df['Series Name'] == 'Internet users (per 100 people)'][['Country Name', '1991 [YR1991]', '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]',
       '1995 [YR1995]', '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]',
       '1999 [YR1999]', '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]',
       '2003 [YR2003]', '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]',
       '2007 [YR2007]', '2008 [YR2008]', '2009 [YR2009]', '2010 [YR2010]',
       '2011 [YR2011]', '2012 [YR2012]', '2013 [YR2013]', '2014 [YR2014]',
       '2015 [YR2015]']]

In [29]:
df_internet.replace(to_replace='..', value=np.nan, inplace=True)
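
The same cleanup can also happen at load time. The sketch below is only an illustration, not the code used in this notebook: it parses a small inline CSV with made-up values and passes `na_values='..'`, so the World Bank placeholder becomes `NaN` while the data is read. `pd.read_excel` accepts the same `na_values` argument.

```python
import io

import pandas as pd

# Made-up sample in the World Bank layout; '..' marks a missing value
raw = io.StringIO(
    "Country Name,1991 [YR1991],1992 [YR1992]\n"
    "Country A,..,0.5\n"
    "Country B,1.25,..\n"
)

# na_values turns the placeholder into NaN, so the year columns parse as floats
sample = pd.read_csv(raw, na_values='..')
print(sample)
print(sample.dtypes)
```

Handling the placeholder while parsing also avoids calling `replace(..., inplace=True)` on a slice of the original dataframe.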

In [30]:
s = df_internet.pop('Country Name')
# set_index returns a new dataframe, so keep the result
df_internet = df_internet.set_index(s)
df_internet.head(10)

Country Name,1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999],2000 [YR2000],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
Afghanistan,,,,,,,,,,,...,2.107124,1.900000,1.840000,3.550000,4.00,5.000000,5.454545,5.90000,6.39000,
Albania,,,,,0.011169,0.032197,0.048594,0.065027,0.081437,0.114097,...,9.609991,15.036115,23.860000,41.200000,45.00,49.000000,54.655959,57.20000,60.10000,
Algeria,,,,0.000361,0.001769,0.001739,0.010268,0.020239,0.199524,0.491706,...,7.375985,9.451191,10.180000,11.230000,12.50,14.000000,15.228027,16.50000,18.09000,
American Samoa,,,,,,,,,,,...,,,,,,,,,,
Andorra,,,,,,1.526601,3.050175,6.886209,7.635686,10.538836,...,48.936847,70.870000,70.040000,78.530000,81.00,81.000000,86.434425,94.00000,95.90000,
Angola,,,,,,0.000776,0.005674,0.018454,0.071964,0.105046,...,1.907648,3.200000,4.600000,6.000000,10.00,14.776000,16.937210,19.10000,21.26000,
Antigua and Barbuda,,,,,2.200769,2.858450,3.480537,4.071716,5.300681,6.482226,...,30.000000,34.000000,38.000000,42.000000,47.00,52.000000,58.000000,63.40000,64.00000,
Argentina,,0.002993,0.029527,0.043706,0.086277,0.141955,0.280340,0.830767,3.284482,7.038683,...,20.927202,25.946633,28.112623,34.000000,45.00,51.000000,55.800000,59.90000,64.70000,
Armenia,,,,0.009117,0.052743,0.094573,0.111651,0.128659,0.970738,1.300470,...,5.631788,6.021253,6.210000,15.300000,25.00,32.000000,37.500000,41.90000,46.30000,
Aruba,,,,,,2.768383,,,4.506179,15.442823,...,28.000000,30.900000,52.000000,58.000000,62.00,69.000000,74.000000,78.90000,83.78000,


Let's now build the dataframe for GDP.

In [31]:
df_gdp = df[df['Series Name'] == 'GDP per capita (constant 2005 US$)'][['Country Name', '1991 [YR1991]', '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]',
       '1995 [YR1995]', '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]',
       '1999 [YR1999]', '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]',
       '2003 [YR2003]', '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]',
       '2007 [YR2007]', '2008 [YR2008]', '2009 [YR2009]', '2010 [YR2010]',
       '2011 [YR2011]', '2012 [YR2012]', '2013 [YR2013]', '2014 [YR2014]',
       '2015 [YR2015]']]

In [32]:
df_gdp.replace(to_replace='..', value=np.nan, inplace=True)

In [33]:
s = df_gdp.pop('Country Name')
# set_index returns a new dataframe, so keep the result
df_gdp = df_gdp.set_index(s)
df_gdp.head(10)

Country Name,1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999],2000 [YR2000],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
Afghanistan,,,,,,,,,,,...,263.012374,291.128823,294.238183,347.208097,366.324813,377.292766,418.426197,413.233959,408.898697,
Albania,1211.431789,1131.047006,1247.214425,1359.050613,1549.345209,1700.873966,1536.967481,1743.097819,1931.344256,2085.582721,...,2939.136002,3136.159294,3398.487894,3536.231650,3685.568522,3790.101771,3857.339741,3916.231237,3994.625479,
Algeria,2484.153186,2470.566078,2366.015901,2297.100086,2339.655291,2393.552046,2381.351303,2465.744244,2509.110754,2530.011442,...,3109.768204,3167.388563,3179.776708,3176.744294,3233.176770,3262.063307,3304.701427,3330.802120,3400.732802,
American Samoa,,,,,,,,,,,...,,,,,,,,,,
Andorra,29833.095771,28970.387067,27685.024766,27574.571029,27825.961843,28921.863630,31615.216190,32757.517100,33955.180327,33700.762129,...,40745.465162,40054.227486,36296.274603,34968.485540,33512.015532,32713.610776,33357.462074,34835.709370,,
Angola,1376.983953,1241.205271,904.267859,906.193896,970.202243,1048.154225,1100.061612,1142.965689,1146.517824,1145.236277,...,1838.473307,2178.362871,2397.088332,2373.831980,2373.765225,2385.576823,2426.366161,2507.085626,2521.102581,
Antigua and Barbuda,9793.500455,9717.444050,10014.827235,10379.394318,9684.082744,10057.745742,10258.959051,10426.206800,10593.961477,10897.778167,...,13547.711833,14671.133807,14517.635341,12629.713786,11602.142227,11275.280156,11607.745344,11481.385188,11731.963766,
Argentina,4395.007076,4852.381210,5070.344112,5296.817066,5081.892652,5298.425979,5661.959372,5813.822438,5554.697946,5449.989077,...,6108.427140,6527.158500,6659.230509,6594.499948,7143.504413,7662.156733,7642.930027,7781.549510,7737.715767,
Armenia,1021.638633,605.352200,565.159435,610.113998,665.722822,715.818418,748.023547,808.996242,840.862615,895.603701,...,1847.746891,2111.675685,2267.312170,1952.342100,1997.052261,2087.751968,2230.288855,2297.661964,2364.748214,
Aruba,,,,23086.937951,22319.245642,23233.536705,24129.274973,23896.337025,24490.145798,23902.920892,...,23662.635648,22710.463505,21121.812032,19913.149353,,,,,,


Finally, let's build the dataframe for population:

In [34]:
df_pop = df[df['Series Name'] == 'Population, total'][['Country Name', '1991 [YR1991]', '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]',
       '1995 [YR1995]', '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]',
       '1999 [YR1999]', '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]',
       '2003 [YR2003]', '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]',
       '2007 [YR2007]', '2008 [YR2008]', '2009 [YR2009]', '2010 [YR2010]',
       '2011 [YR2011]', '2012 [YR2012]', '2013 [YR2013]', '2014 [YR2014]',
       '2015 [YR2015]']]

In [35]:
df_pop.replace(to_replace='..', value=np.nan, inplace=True)

In [36]:
s = df_pop.pop('Country Name')
# set_index returns a new dataframe, so keep the result
df_pop = df_pop.set_index(s)
df_pop.head(10)

Country Name,1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999],2000 [YR2000],...,2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015]
Afghanistan,12789374,13745630,14824371,15869967,16772522,17481800,18034130,18511480,19038420,19701940,...,25183615,25877544,26528741,27207291,27962207,28809167,29726803,30682500,31627506,
Albania,3266790,3247039,3227287,3207536,3187784,3168033,3148281,3128530,3108778,3089027,...,2992547,2970017,2947314,2927519,2913021,2904780,2900489,2897366,2894475,
Algeria,26554277,27180921,27785977,28362015,28904300,29411839,29887717,30336880,30766551,31183658,...,33749328,34261971,34811059,35401790,36036159,36717132,37439427,38186135,38934334,
American Samoa,48379,49597,50725,51807,52874,53926,54942,55899,56768,57522,...,58648,57904,57031,56226,55636,55316,55227,55302,55434,
Andorra,56674,58904,61003,62707,63854,64291,64147,63888,64161,65399,...,83373,84878,85616,85474,84419,82326,79316,75902,72786,
Angola,11472173,11848971,12246786,12648483,13042666,13424813,13801868,14187710,14601983,15058638,...,18541467,19183907,19842251,20520103,21219954,21942296,22685632,23448202,24227524,
Antigua and Barbuda,62412,63434,64868,66550,68349,70245,72232,74206,76041,77648,...,83467,84397,85350,86300,87233,88152,89069,89985,90900,
Argentina,33193920,33655149,34110912,34558114,34994818,35419683,35833965,36241578,36648054,37057453,...,39558750,39969903,40381860,40798641,41222875,41655616,42095224,42538304,42980026,
Armenia,3511912,3449497,3369673,3289943,3223173,3173425,3137652,3112958,3093820,3076098,...,3002161,2988117,2975029,2966108,2963496,2967984,2978339,2992192,3006154,
Aruba,64623,68235,72498,76700,80326,83195,85447,87276,89004,90858,...,100830,101218,101342,101416,101597,101936,102393,102921,103441,
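
The three extractions above repeat the same steps: filter on `Series Name`, keep the year columns, turn `'..'` into `NaN` and index by country. As a rough sketch, they could be folded into one helper. `extract_series` is a hypothetical name, and the sample rows are made up to mimic the spreadsheet layout.

```python
import pandas as pd


def extract_series(df, series_name):
    """Return one indicator as a country-indexed frame of floats (hypothetical helper)."""
    # Year columns look like '1991 [YR1991]'
    year_columns = [col for col in df.columns if '[YR' in str(col)]
    subset = df.loc[df['Series Name'] == series_name, ['Country Name'] + year_columns]
    subset = subset.set_index('Country Name')
    # errors='coerce' maps the '..' placeholder (and any other non-number) to NaN
    return subset.apply(pd.to_numeric, errors='coerce')


# Made-up rows in the same layout as the spreadsheet
sample = pd.DataFrame({
    'Series Name': ['Internet users (per 100 people)', 'Population, total',
                    'Internet users (per 100 people)', 'Population, total'],
    'Country Name': ['Country A', 'Country A', 'Country B', 'Country B'],
    '1991 [YR1991]': ['..', 1000, 0.5, 2000],
    '1992 [YR1992]': [0.1, 1100, '..', 2100],
})

internet_sample = extract_series(sample, 'Internet users (per 100 people)')
population_sample = extract_series(sample, 'Population, total')
print(internet_sample)
```

Because every frame returned this way shares the same country index, a later `pd.concat(..., axis=1)` lines the indicators up row by row.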


# Plotting this with Bokeh 

We follow the gapminder example.

## Setting up the data

In [37]:
from bokeh.models import ColumnDataSource

In [39]:
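# Reduce labels such as '1991 [YR1991]' to '1991' so that the source names
# built below ('_1991', ...) are valid JavaScript identifiers.
for frame in (df_pop, df_internet, df_gdp):
    frame.columns = [str(col).split(' ')[0] for col in frame.columns]

years = df_pop.columns

sources = {}

for year in years:
    # Series.rename returns a renamed copy and leaves the dataframes untouched
    population = df_pop[year].rename('population')
    internet = df_internet[year].rename('internet users')
    gdp = df_gdp[year].rename('gdp (2005 $)')

    # One ColumnDataSource per year, with one row per country
    new_df = pd.concat([internet, gdp, population], axis=1)
    sources['_' + str(year)] = ColumnDataSource(new_df)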

Let's build the dict that references the datasources:

In [41]:
dictionary_of_sources = dict(zip([x for x in years], ['_%s' % x for x in years]))
js_source_array = str(dictionary_of_sources).replace("'", "")
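
To see what ends up inside the JavaScript callback, here is a small stand-alone sketch of the same string manipulation. It assumes the year labels are already plain years such as `'1991'`; with labels like `'1991 [YR1991]'` the resulting text would not be valid JavaScript. `sample_years` is a shortened, made-up list.

```python
# Assumption: labels are plain years, not '1991 [YR1991]'
sample_years = ['1991', '1992', '1993']

# Map each year to the name of its data source, e.g. '1991' -> '_1991'
mapping = dict(zip(sample_years, ['_%s' % y for y in sample_years]))

# Dropping the quotes turns the Python repr into a JavaScript object literal
# whose values are variable names rather than strings
js_mapping = str(mapping).replace("'", "")
print(js_mapping)  # {1991: _1991, 1992: _1992, 1993: _1993}
```

Each `_1991`-style name must match a key passed to the callback's `args`, since that is how the JavaScript side resolves it to a data source.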

## Setting up the plot 

Let's talk about axes:

- Limits in x: \$100 to \$100,000 (GDP per capita).
- Limits in y: 0 to 100 (internet users per 100 people).

In [43]:
from bokeh.models import Range1d, Plot, LinearAxis, SingleIntervalTicker

In [45]:
# Set up the plot
xdr = Range1d(100, 100000)  # GDP per capita: $100 to $100,000, as stated above
ydr = Range1d(0, 100)       # internet users per 100 people
plot = Plot(
    x_range=xdr,
    y_range=ydr,
    title="",
    plot_width=800,
    plot_height=400,
    outline_line_color=None,
    toolbar_location=None,    
)
AXIS_FORMATS = dict(
    minor_tick_in=None,
    minor_tick_out=None,
    major_tick_in=None,
    major_label_text_font_size="10pt",
    major_label_text_font_style="normal",
    axis_label_text_font_size="10pt",

    axis_line_color='#AAAAAA',
    major_tick_line_color='#AAAAAA',
    major_label_text_color='#666666',

    major_tick_line_cap="round",
    axis_line_cap="round",
    axis_line_width=1,
    major_tick_line_width=1,
)

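# Pass the ticker by keyword. An interval of 1 would mean one tick per dollar on the GDP axis;
# 20000 is a choice that yields a handful of ticks over the x range.
xaxis = LinearAxis(ticker=SingleIntervalTicker(interval=20000), axis_label="GDP per capita (2005 $)", **AXIS_FORMATS)
yaxis = LinearAxis(ticker=SingleIntervalTicker(interval=20), axis_label="Internet users (per 100 people)", **AXIS_FORMATS)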
plot.add_layout(xaxis, 'below')
plot.add_layout(yaxis, 'left')

Adding the background: 

In [49]:
[s[0] for s in years.str.split(' ')]

['1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015']

In [51]:
from bokeh.models import Text

In [52]:
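# Add the year in the background (add before the circles so it is drawn underneath)
# Start with the first year only; the slider callback replaces it later
text_source = ColumnDataSource({'year': [years[0].split(' ')[0]]})
# x and y are in data units (GDP on x, internet users on y); x=10000 is a rough placement inside the x range
text = Text(x=10000, y=35, text='year', text_font_size='150pt', text_color='#EEEEEE')
plot.add_glyph(text_source, text)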

<bokeh.models.renderers.GlyphRenderer at 0x9101518>

Bubbles and hover:

In [53]:
from bokeh.models import Circle, HoverTool

In [117]:
# Add the circle
renderer_source = sources['_%s' % years[0]]
circle_glyph = Circle(
    x='gdp (2005 $)', y='internet users', size='population',
    fill_color='#7c7e71', fill_alpha=0.8, 
    line_color='#7c7e71', line_width=0.5, line_alpha=0.5)
circle_renderer = plot.add_glyph(renderer_source, circle_glyph)

# Add the hover (only against the circle and not other plot elements)
tooltips = "@index"
plot.add_tools(HoverTool(tooltips=tooltips, renderers=[circle_renderer]))

In [118]:
renderer_source.column_names

['population', 'index', 'gdp (2005 $)', 'internet users']
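
One caveat: `size='population'` uses raw head counts as the circle size in screen units, which gives bubbles millions of pixels wide. A common remedy for bubble charts is to make the area, not the size, proportional to the population. The sketch below shows the idea. `bubble_size`, `scale_factor` and `min_size` are illustrative assumptions and not part of this notebook; the populations are the 1991 values from the table above.

```python
import math


def bubble_size(population, scale_factor=200, min_size=3):
    """Map a population to a circle size in pixels (illustrative constants)."""
    if population is None or math.isnan(population):
        return min_size
    # Area proportional to population, so the size grows with the square root
    return max(min_size, math.sqrt(population / math.pi) / scale_factor)


for name, pop in [('Andorra', 56674), ('Albania', 3266790), ('Argentina', 33193920)]:
    print(name, round(bubble_size(pop), 1))
```

A column computed this way could be added to each yearly dataframe and referenced by the glyph instead of `population`.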

Add the slider:

In [56]:
from bokeh.models import CustomJS, Slider

In [60]:
years_label = [int(s[0]) for s in years.str.split(' ')]

In [61]:
# Add the slider
code = """
    var year = slider.get('value'),
        sources = %s,
        new_source_data = sources[year].get('data');
    renderer_source.set('data', new_source_data);
    text_source.set('data', {'year': [String(year)]});
""" % js_source_array

callback = CustomJS(args=sources, code=code)
# Start the slider on the first year; a value of 1 lies outside the start/end range
slider = Slider(start=years_label[0], end=years_label[-1], value=years_label[0], step=1, title="Year", callback=callback, name='testy')
callback.args["renderer_source"] = renderer_source
callback.args["slider"] = slider
callback.args["text_source"] = text_source
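
For readers less familiar with JavaScript, the logic of the callback can be mimicked in plain Python. This is only an analogy with hypothetical stand-ins (`sources_by_year`, `state`, `on_slider_change`), not Bokeh code. The numbers are Albania's 1995 and 1996 values from the tables above.

```python
# Plain dicts standing in for the per-year ColumnDataSource objects
sources_by_year = {
    1995: {'gdp (2005 $)': [1549.345209], 'internet users': [0.011169]},
    1996: {'gdp (2005 $)': [1700.873966], 'internet users': [0.032197]},
}

# What the plot currently shows: the bubble data and the background label
state = {'renderer': sources_by_year[1995], 'text': {'year': ['1995']}}


def on_slider_change(year, state):
    """Mimic the callback: swap in the data for `year` and update the label."""
    state['renderer'] = sources_by_year[year]
    state['text'] = {'year': [str(year)]}


on_slider_change(1996, state)
print(state['text'], state['renderer']['internet users'])
```

The real callback does the same two assignments in the browser, which is why all yearly sources have to be shipped to the page up front.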

Displaying:

In [70]:
from bokeh.plotting import vplot
from IPython.display import display, HTML
from bokeh.resources import JSResources
from bokeh.embed import file_html

In [120]:
# Stick the plot and the slider together
layout = vplot(plot, slider)

# Use inline resources
js_resources = JSResources(mode='inline')    
html = file_html(layout, None, "Bokeh - Gapminder Bubble Plot", js_resources=js_resources)

#display(HTML(html))

  warn('No Bokeh CSS Resources provided to template. If required you will need to provide them manually.')
ERROR:C:\Anaconda3\lib\site-packages\bokeh\validation\check.py:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: fertility, life [renderer: GlyphRenderer, ViewModel:GlyphRenderer, ref _id: 682659a1-a16c-4276-ad8b-0f7009498737]
ERROR:C:\Anaconda3\lib\site-packages\bokeh\validation\check.py:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: region_color [renderer: GlyphRenderer, ViewModel:GlyphRenderer, ref _id: acf6ef2f-79cd-4445-8d73-bc9542c35609]
