# Follow along: 
## Go to <a href="http://bit.ly/caffeina-data-vis">bit.ly/caffeina-data-vis</a> and click on the `launch binder` badge

# The speaker

My name is Sergey Antopolskiy. I studied biology at MSU, Moscow, and neuroscience at SISSA, Trieste. I currently work as a senior research scientist / data scientist at the Camlin Group R&D office in Parma (https://www.camlingroup.com/), where I work with biomedical data.

Contacts: 
- https://github.com/antopolskiy
- s.antopolsky@gmail.com

# Intro to data visualization
- Why do we do data visualization?
- Why choose Python over alternatives?
- Main data visualization libraries in Python
    - Matplotlib
    - Seaborn
    - Bokeh

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
mpl.rcParams['figure.dpi'] = 110

# Data vis: from presentation to exploration

## Presentation ("Infographics")
- Made for non-experts
- Relies on design ideas
- Form as important as content

https://informationisbeautiful.net/

<img src="https://thumbnails-visually.netdna-ssl.com/infographics--the-benefits-of-their-use-online_565c628147e97.jpg" width=600>

More recently -- interactive visualizations, e.g.: http://informationisbeautiful.net/visualizations/snake-oil-scientific-evidence-for-nutritional-supplements-vizsweet/

## Exploration
- Content is more important than form
- Must be quick to implement ideas (high level commands)
- Preferably interactive

## Middle-ground

Visualization tools for non-experts: https://www.gapminder.org/tools/

Data-driven journalism: https://www.bloomberg.com/graphics/2015-dangerous-jobs/

How these tools differ from infographics: an infographic presents only a selected set of results to attract the viewer's attention, while here the user gets full access to the data and is free to explore it on her own.

# Data exploration

Why not just use descriptive statistics: mean, standard deviation, statistical tests, etc.?

In [None]:
anscombe = pd.read_csv('data/anscombe.csv')[['y1','y2','y3','y4']]
anscombe

In [None]:
anscombe.describe().loc[['mean','std']]
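
To see why the summary table is not enough, it helps to plot the four datasets side by side. The sketch below hard-codes the standard published values of Anscombe's quartet, which I assume match `data/anscombe.csv`. The `x` columns and the names `quartet` and `summary` are my additions for illustration and are not loaded by the cells above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Standard published values of Anscombe's quartet
# (assumed to match data/anscombe.csv, which only provides y1..y4 above).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean and sample standard deviation of y for each dataset
summary = pd.DataFrame({name: pd.Series(y).agg(['mean', 'std'])
                        for name, (x, y) in quartet.items()})
print(summary.round(2))

# One scatter plot per dataset, on shared axes so the shapes are comparable
fig, axes = plt.subplots(1, 4, figsize=(12, 3), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes, quartet.items()):
    ax.scatter(x, y)
    ax.set_title(name)
plt.show()
```

The printed means and standard deviations agree to two decimals, yet the four scatter plots look nothing alike.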

We are visual animals: we notice visual patterns much more easily. For example, from the recent Russian presidential elections:

<img src="img/russian-elections-graph.png" width=600>

Note the peaks at the multiples of 5. An image much like this one provoked massive protests in 2011 after the parliamentary elections.

# Why in Python?

- Free and open source
- Complete programming language
- Many visualization libraries suited for various purposes
- Jupyter notebooks
- Very large community of data scientists and ML engineers

# Visualization tools in Python
We need some data. As an example, let's consider a dataset of births in the US from 1969 to 1988:

In [None]:
births = pd.read_csv('data/births.csv').dropna()
births = births.loc[births.day.notnull()]
births = births.loc[births.births>1000]
births['day'] = pd.to_numeric(births.day)
births['month'] = pd.to_numeric(births.month)
births['year'] = pd.to_numeric(births.year)
print(births.shape)

In [None]:
births.head()

In [None]:
births.tail()

Before we can visualize, we need to transform the data in some way. I do it here with the `pandas` library, which deserves a talk of its own. I will not explain how I transform the data (i.e. the code); I will only show what I get in the end. If you're interested, take a look at this repo: https://github.com/jakevdp/PythonDataScienceHandbook
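
For a rough idea of what those transformations do, here is a minimal sketch of the two `pandas` operations used in the cells below, `groupby` and `pivot_table`. The `toy` table and its numbers are made up for illustration and are not the births data.

```python
import pandas as pd

# Made-up numbers, only to show the shape of each result
toy = pd.DataFrame({
    'year':   [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001],
    'month':  [1, 1, 2, 2, 1, 1, 2, 2],
    'gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
    'births': [10, 12, 11, 13, 20, 22, 21, 23],
})

# groupby: one aggregated value per group, here a Series indexed by year
per_year = toy.groupby('year')['births'].mean()
print(per_year)

# pivot_table: one row per year, one column per month, summed births in the cells
year_month = toy.pivot_table(index='year', columns='month',
                             values='births', aggfunc='sum')
print(year_month)
```

The first result is the kind of object a line plot needs, and the second is the kind of 2D table a heatmap needs.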

## Matplotlib
https://matplotlib.org/
### Lineplots

In [None]:
births_per_year = births.groupby('year')['births'].mean()
plt.plot(births_per_year)

In [None]:
plt.plot(births_per_year, color='darkgreen', marker='s', linestyle=':')

In [None]:
births_per_year_m = births.groupby(['gender','year'])['births'].mean()['M']
births_per_year_f = births.groupby(['gender','year'])['births'].mean()['F']

In [None]:
plt.plot(births_per_year_m, color='darkblue', marker='o', label='Boys')
plt.plot(births_per_year_f, color='darkred', marker='o', label='Girls')
plt.legend()

### Histogram

In [None]:
plt.hist(births['births'],bins=100)
plt.xlabel('Number of children born')
plt.ylabel('Freq')
plt.show()

### Heatmap

In [None]:
# the string 'sum' avoids the deprecation warning that recent pandas raises for np.sum
births_year_month = births.pivot_table(index='year',columns='month',values='births',aggfunc='sum')

plt.imshow(births_year_month, interpolation=None, cmap='Greens', aspect=0.6)
plt.colorbar()

plt.xticks(range(len(births_year_month.columns)), 
           births_year_month.columns)

plt.yticks(range(len(births_year_month.index)), 
           births_year_month.index)

plt.xlabel('Month')
plt.ylabel('Year')
plt.title('Total births for each month from {} to {}'
          .format(str(births_year_month.index[0]),
                  str(births_year_month.index[-1])));

### Boxplots

In [None]:
births_month_day = births.pivot_table(columns='month',values='births',index='day').dropna()
plt.boxplot(births_month_day.values);

### Barplots

In [None]:
plt.bar(births_month_day.mean().index, births_month_day.mean())
plt.errorbar(births_month_day.mean().index, 
             births_month_day.mean(), 
             yerr=births_month_day.std(),
             linestyle='None',
             color='k')

In [None]:
plt.bar(anscombe.mean().index, anscombe.mean())
plt.errorbar(anscombe.mean().index, anscombe.mean(), yerr=anscombe.std(), 
             linestyle='None', color='k')

In [None]:
plt.boxplot(anscombe.values);

### Interactivity

In [None]:
%matplotlib notebook

In [None]:
plt.plot(births_per_year_m, color='darkblue', marker='o', label='Boys')
plt.plot(births_per_year_f, color='darkred', marker='o', label='Girls')
plt.legend()

In [None]:
%matplotlib inline
mpl.rcParams['figure.dpi'] = 110

### Pros
- Easy to use
- Very flexible: can control each element

### Cons
- Too low-level for quick use
- Need to transform data into appropriate shape
- Limited interactivity
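
As an illustration of "can control each element", here is a minimal sketch using the object-oriented `matplotlib` interface (`fig, ax = plt.subplots()`), which the cells above do not use. The sine curve is a stand-in for real data.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# Work with explicit Figure and Axes objects instead of the implicit plt state
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, np.sin(x), color='darkgreen', linewidth=2, label='sin(x)')

# Each element of the plot can be adjusted individually
ax.set_xlabel('x (radians)')
ax.set_ylabel('sin(x)')
ax.set_xticks([0, np.pi, 2 * np.pi])
ax.set_xticklabels(['0', 'π', '2π'])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(frameon=False)
plt.show()
```

This level of control is also why `matplotlib` feels low-level: every tweak is one more line of code.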

# Tidy data plotting
What is tidy data?

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. Core principles of tidy data are simple:
1. Each **variable** forms a **column**
2. Each **observation** forms a **row**
3. Each **type of observational unit** forms a **table**

<img src="img/tidy_data_ex.png">

# Example

In [None]:
untidy = pd.DataFrame({'treatment_a':[9, 16, 3],'treatment_b':[2,11,1]}, 
                      index=['John Smith', 'Jane Doe','Mary Johnson'])
untidy

In [None]:
untidy.index.name = 'person'
untidy.columns.name = 'treatment'
tidy = pd.melt(untidy.reset_index(),id_vars=['person'],value_name='IgG_level')
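# assign the result back: calling replace(..., inplace=True) on a selected column is unreliable in recent pandas
tidy['treatment'] = tidy['treatment'].replace({'treatment_a':'a','treatment_b':'b'})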
tidy

## Seaborn: tidy plotting
https://seaborn.pydata.org/index.html

In [None]:
sns.boxplot(data=tidy, x='treatment', y='IgG_level', width=0.3)

In [None]:
sns.boxplot(data=births, x='month', y='births', palette='Blues')

In [None]:
sns.boxplot(data=births, x='year', y='births', palette='Greens')
plt.xticks(rotation='vertical')
plt.show()

In [None]:
sns.stripplot(data=births.loc[births.day==1], x='gender', y='births', jitter=0.1)

In [None]:
sns.catplot(data=births, x='year', y='births', kind='point', height=5, aspect=1.2)
plt.xticks(rotation='vertical');

In [None]:
sns.catplot(data=births, x='year', y='births', kind='point', hue='gender', height=5, aspect=1.2)
plt.xticks(rotation='vertical');

In [None]:
sns.catplot(data=births.loc[births.day<=3], x='month', y='births', 
               kind='point', col='gender', hue='day');

### Pros
- Given a tidy dataset, very easy to use
- Automatic labeling
- Versatile (https://seaborn.pydata.org/examples/index.html)
- Built on top of `matplotlib` -> can be modified with low-level `matplotlib` functions
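
A minimal sketch of that last point: `sns.boxplot` returns a regular `matplotlib` Axes, so the usual low-level methods work on it. The small `toy` table repeats the numbers from the tidy example above, and the title and reference line are my additions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Same numbers as the tidy example, typed in directly so the cell is self-contained
toy = pd.DataFrame({
    'treatment': ['a', 'a', 'a', 'b', 'b', 'b'],
    'IgG_level': [9, 16, 3, 2, 11, 1],
})

# seaborn draws the plot and hands back a matplotlib Axes
ax = sns.boxplot(data=toy, x='treatment', y='IgG_level', width=0.3)

# From here on it is plain matplotlib
ax.set_title('IgG level by treatment')
ax.set_ylabel('IgG level')
ax.axhline(toy['IgG_level'].mean(), color='k', linestyle=':')  # overall mean
plt.show()
```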

### Cons
- Static (with very limited interactivity)

## Bokeh: interactive plotting
https://bokeh.pydata.org/en/latest/

In [None]:
from bokeh.plotting import figure, show, output_notebook
output_notebook()

In [None]:
from bokeh.sampledata.iris import flowers

colormap = {'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}
colors = [colormap[x] for x in flowers['species']]

p = figure(title = "Iris Morphology")
p.xaxis.axis_label = 'Petal Length'
p.yaxis.axis_label = 'Petal Width'

p.circle(flowers["petal_length"], flowers["petal_width"],
         color=colors, fill_alpha=0.2, size=10)
show(p)

In [None]:
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource

x = list(range(-20, 21))
y0, y1 = [abs(xx) for xx in x], [xx**2 for xx in x]

# create a column data source for the plots to share
source = ColumnDataSource(data=dict(x=x, y0=y0, y1=y1))

TOOLS = "box_select,lasso_select,help"

# create a new plot and add a renderer
left = figure(tools=TOOLS, width=300, height=300)
left.circle('x', 'y0', source=source)

# create another new plot and add a renderer
right = figure(tools=TOOLS, width=300, height=300)
right.circle('x', 'y1', source=source)

p = gridplot([[left, right]])

show(p)

In [None]:
from bokeh.plotting import figure
from bokeh.palettes import Spectral5
from bokeh.sampledata.autompg import autompg_clean as df
from bokeh.transform import factor_cmap

df.cyl = df.cyl.astype(str)
df.yr = df.yr.astype(str)

# pass the keys as a list: recent pandas treats a tuple as a single key
group = df.groupby(['cyl', 'mfr'])

index_cmap = factor_cmap('cyl_mfr', palette=Spectral5, factors=sorted(df.cyl.unique()), end=1)

p = figure(width=800, height=300, title="Mean MPG by # Cylinders and Manufacturer",
           x_range=group, toolbar_location=None, tooltips=[("MPG", "@mpg_mean"), ("Cyl, Mfr", "@cyl_mfr")])

p.vbar(x='cyl_mfr', top='mpg_mean', width=1, source=group,
       line_color="white", fill_color=index_cmap, )

p.y_range.start = 0
p.x_range.range_padding = 0.05
p.xgrid.grid_line_color = None
p.xaxis.axis_label = "Manufacturer grouped by # Cylinders"
p.xaxis.major_label_orientation = 1.2
p.outline_line_color = None

show(p)

In [None]:
from bokeh.models import HoverTool
from bokeh.plotting import figure

n = 500
x = 2 + 2*np.random.standard_normal(n)
y = 2 + 2*np.random.standard_normal(n)

p = figure(title="Hexbin for 500 points", match_aspect=True,
           tools="wheel_zoom,reset", background_fill_color='#440154')
p.grid.visible = False

r, bins = p.hexbin(x, y, size=0.5, hover_color="pink", hover_alpha=0.8)

p.circle(x, y, color="white", size=1)

p.add_tools(HoverTool(
    tooltips=[("count", "@c"), ("(q,r)", "(@q, @r)")],
    mode="mouse", point_policy="follow_mouse", renderers=[r]
))

show(p)

# Bokeh apps

https://demo.bokehplots.com/apps/movies

### Pros
- Full interactivity in the browser, powered by JavaScript
- Can create apps to explore the datasets

### Cons
- Has its own syntax, which you need to learn from scratch (doesn't rely on `matplotlib`)
- Interactivity can be tedious to implement (sometimes you need to use JavaScript snippets)
- Not highly optimized for very large-scale projects: you need workarounds
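
To give a feeling for the JavaScript snippets mentioned above, here is a minimal sketch of a slider that redraws a curve through a `CustomJS` callback. The data, the names `slider` and `layout`, and the power function are made up for illustration. I assume a recent Bokeh version in which `figure` accepts `width` and `height`.

```python
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, CustomJS, Slider
from bokeh.plotting import figure

# Made-up data: y = x ** power, starting with power = 1
x = [i / 10 for i in range(0, 51)]
source = ColumnDataSource(data=dict(x=x, y=list(x)))

p = figure(width=400, height=300, title='y = x ** power')
p.line('x', 'y', source=source)

slider = Slider(start=0.5, end=3, value=1, step=0.1, title='power')

# The callback body is JavaScript: it runs in the browser, not in Python
callback = CustomJS(args=dict(source=source, slider=slider), code="""
    const x = source.data.x
    const y = Array.from(x, (v) => Math.pow(v, slider.value))
    source.data = { x, y }
""")
slider.js_on_change('value', callback)

# Stack the slider above the plot
layout = column(slider, p)
```

Calling `show(layout)` after `output_notebook()` displays the slider and the plot. Because the update logic lives in the browser, it keeps working in a static HTML export, but it has to be written in JavaScript.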

Also check out the `mpld3` library, which lets you create D3.js graphics from `matplotlib` figures: http://mpld3.github.io/

# Conclusions

Python offers a large variety of data visualization tools, especially for data exploration.

- If you've got a tidy dataset (or know how to tidy an existing messy dataset) -- use `seaborn`. 
- If you need to tweak settings, use `matplotlib` in addition. 
- If you want interactivity and web integration -- use `bokeh`.

# Further resources

Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/

Notebooks on data visualization: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#data-visualization-and-plotting