## Tutorial 4. Interactive exploratory data analysis: *the beauty of HoloViews*


Created by Emanuel Flores-Bautista, 2019. All content contained in this notebook is licensed under a [Creative Commons Attribution 4.0 license](https://creativecommons.org/licenses/by/4.0/). The code is licensed under an [MIT license](https://opensource.org/licenses/MIT).

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
import holoviews as hv
from holoviews import dim, opts
from scipy.stats import pearsonr
import TCD19_utils as TCD_19

hv.extension('bokeh')

# Render matplotlib plots inline in the notebook
%matplotlib inline

# Render inline figures as SVG
%config InlineBackend.figure_format = 'svg'

TCD_19.set_plotting_style_2()

In [None]:
df = pd.read_csv('../data/data_CONAPO_municipal_90-15.csv', encoding = "ISO-8859-1")

In [None]:
df.head()

Let's filter the data to keep only the rows for Yucatán.

In [None]:
df_yuc = df[df['ENT'] == 'Yucatán']

Let's rename the columns.

In [None]:
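df_yuc = df_yuc.rename(columns = {
    'SPRIM': '% sin primaria',                              # % without primary schooling
    'OVSD': '% sin drenaje',                                # % without drainage
    'ANALF': '% analfabeta',                                # % illiterate
    'OVSEE': '% sin energía eléctrica',                     # % without electricity
    'OVPT': '% con piso de tierra',                         # % with dirt floors
    'GM': 'Grado de marginación',                           # degree of marginalization
    'PO2SM': '% con ingresos de menos de 2 salarios mín.',  # % earning less than 2 minimum wages
    'OVSAE': '% sin agua entubada',                         # % without piped water
    'IM': 'índice de marginación'})                         # marginalization index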

In [None]:
cols = df_yuc.columns[4:17]
df_yuc[cols] = df_yuc[cols].apply(pd.to_numeric, errors='coerce')
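
The `errors='coerce'` argument matters here: any entry that cannot be parsed as a number becomes `NaN` instead of raising an exception. A minimal sketch on a made-up frame (the column names `a` and `b` and their values are hypothetical, not taken from the CONAPO file):

```python
import pandas as pd

# Hypothetical toy data: numbers stored as strings, plus one unparseable entry
toy = pd.DataFrame({'a': ['1.5', '2.0', 'n.d.'], 'b': ['10', '20', '30']})

# Convert every column; values that cannot be parsed become NaN
toy_numeric = toy.apply(pd.to_numeric, errors='coerce')

# Count the entries that were coerced to NaN in each column
n_missing = toy_numeric.isna().sum()
print(toy_numeric.dtypes)
print(n_missing)
```

Checking `isna().sum()` right after the conversion is a quick way to see how many values were lost.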

In [None]:
df_yuc = df_yuc[df_yuc['AÑO']== 2015]
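
The state and year filters can also be combined into a single boolean mask. A sketch on made-up rows (only the column names `ENT`, `AÑO` and `IM` come from the dataset; the values are hypothetical):

```python
import pandas as pd

# Hypothetical rows that mimic the layout of the CONAPO table
toy = pd.DataFrame({
    'ENT': ['Yucatán', 'Yucatán', 'Campeche'],
    'AÑO': [2010, 2015, 2015],
    'IM': [0.4, 0.3, 0.1],
})

# Each comparison yields a boolean Series; `&` keeps rows where both hold
mask = (toy['ENT'] == 'Yucatán') & (toy['AÑO'] == 2015)
toy_yuc_2015 = toy[mask]
print(toy_yuc_2015)
```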

In [None]:
df_yuc.tail(3)

In [None]:
# Recent seaborn releases no longer accept `stat_func`, so it is not passed here
sns.jointplot(data = df_yuc, x = '% analfabeta', y = '% con ingresos de menos de 2 salarios mín.',
              kind ='hex', color = 'grey');
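
To report the Pearson correlation as a number, `pearsonr` (imported at the top) can be called directly on two columns. A sketch on synthetic data; the column names `x` and `y` and the generated values are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic data: y depends linearly on x plus noise (fixed seed for repeatability)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(scale=0.5, size=200)})

# Drop incomplete rows first so that NaN cannot contaminate the result
clean = toy[['x', 'y']].dropna()
r, p = pearsonr(clean['x'], clean['y'])
print(f'Pearson r = {r:.2f}, p-value = {p:.2g}')
```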



In [None]:
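# `hue` needs a seaborn release that no longer accepts `stat_func`. With presumably one
# row per municipality, hue = 'MUN' leaves no sample to estimate a density from, so group
# by marginalization degree instead.
sns.jointplot(data = df_yuc, x = '% analfabeta', y = '% con ingresos de menos de 2 salarios mín.',
              hue = 'Grado de marginación', kind ='kde', alpha = 0.4);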



Let's practice! Make a `PairGrid` with 2D KDE plots in the upper triangle, scatter plots in the lower triangle, and box plots on the diagonal.

In [None]:
# Write your code here

Write a function that calculates the correlation matrix, and draw conclusions about which variables are highly correlated. After that, plot a clustermap to find clusters of correlated variables.
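
As a starting hint, pandas can compute pairwise Pearson correlations with `DataFrame.corr()`. The sketch below wraps it in a small helper; the function name `correlation_matrix` and the synthetic columns are hypothetical, not part of the tutorial's utilities:

```python
import numpy as np
import pandas as pd

def correlation_matrix(df):
    """Return the Pearson correlation matrix of the numeric columns of df."""
    return df.select_dtypes(include='number').corr(method='pearson')

# Synthetic data: b follows a closely, c is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.1, size=300),
                    'c': rng.normal(size=300)})

corr = correlation_matrix(toy)
print(corr.round(2))
```

A matrix like `corr` can then be passed to `sns.clustermap` to group correlated variables.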

Be creative! Practice your data visualization skills by writing an interesting description of the distributions above. Remember to put your example code in a new notebook and upload it to GitHub.

## Interactive data visualization

In [None]:
df_yuc.head()

In [None]:
hv.Scatter(df_yuc , kdims = ['índice de marginación'], vdims = ['% analfabeta'] )

## Shine with style

HoloViews applies its own default style to this scatter plot rather than the default Bokeh style, which I prefer. We might also want to set the plot dimensions and other attributes of the plot. In HoloViews, you can specify both style options and plot options.

* style options: options passed to the renderer (in our case Bokeh), such as the color of the glyphs.

* plot options: options that control how HoloViews builds the graphic, such as whether to display a title or show a grid.

In [None]:
style_opts = {'color': '#24f295','size': 5}
plot_opts = {'show_grid':  True,'width': 450,'height': 350}

scatter = hv.Scatter(df_yuc, kdims=['índice de marginación'] , vdims=['% analfabeta'])

# Newer HoloViews versions take style and plot options as plain keyword arguments
scatter.opts(**style_opts, **plot_opts)

In [None]:
%%opts Scatter [show_grid=True, width=500, height=300] (size=6, color= '#24f295')

scatter = hv.Scatter(df_yuc , kdims = ['índice de marginación'], vdims = ['% analfabeta',
                                                                          'Grado de marginación', 'MUN'] )

scatter 

## Split-apply-combine with graphic elements

In [None]:
%%opts Scatter [show_grid=True, width=450, height=350] (size=5)

gb = scatter.groupby('Grado de marginación')

gb

In [None]:
overlay = gb.overlay()

overlay

In [None]:
layout = gb.layout()

layout

## Exploring your data with hover tools

In [None]:
%%opts Scatter [show_grid=True, width=500, height=300, tools=['hover']] (size=6, color= '#24f295')

scatter = hv.Scatter(df_yuc , kdims = ['índice de marginación'], vdims = ['% analfabeta',
                                                                          'Grado de marginación', 'MUN'] )

scatter 

In [None]:
%%opts Scatter [show_grid=True, width=600, height=400, tools=['hover']] (size=6, color= '#72827b')

scatter = hv.Scatter(df_yuc, kdims = ['% analfabeta'],
                     vdims = ['% con ingresos de menos de 2 salarios mín.', 'Grado de marginación', 'MUN'],
                     label='Marginación').opts(fontsize={'title': 16,'labels': 14, 'xticks': 9, 'yticks': 12})


scatter

In [None]:
slope, intercept, r_value, p_value, std_err = st.linregress(df_yuc['% analfabeta'].values,
                              df_yuc['% con ingresos de menos de 2 salarios mín.'].values)
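
`st.linregress` fits an ordinary least-squares line and returns the slope, the intercept, the correlation coefficient, a p-value, and the standard error of the slope. NaN inputs generally propagate into the fitted values, so if the numeric conversion left any `NaN` in these columns, those rows need to be dropped first. A sketch on made-up data (the names `x` and `y` and the values are hypothetical):

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical data on the exact line y = 2x + 1, with one missing value
toy = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0, 4.0],
                    'y': [1.0, 3.0, np.nan, 7.0, 9.0]})

# Drop rows where either variable is missing before fitting
clean = toy.dropna(subset=['x', 'y'])
fit = st.linregress(clean['x'].values, clean['y'].values)

# Evaluate the fitted line on a grid of x values
grid = np.linspace(0, 4, 5)
predicted = fit.slope * grid + fit.intercept
print(fit.slope, fit.intercept, fit.rvalue)
```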

In [None]:
r_value

In [None]:
analfabet = np.linspace(0,25, 1000)

In [None]:
low_income = slope * analfabet + intercept

In [None]:
%%opts Curve (line_width=2, color= '#24f295')

regression_line = hv.Curve((analfabet, low_income),
                           kdims=['% analfabeta'],
                           vdims=['% con ingresos de menos de 2 salarios mín.'])

regression_line * scatter

In [None]:
regression_line + scatter