# Introduction to data visualization

First things first: why should you look at your data? Isn't statistics enough?

In [None]:
import pandas as pd

In [None]:
data4 = pd.read_csv('data/data4.csv')
data4.head()

In [None]:
datasets = data4.groupby('dataset')
datasets.agg(['count', 'mean', 'var'])

In [None]:
datasets[['x', 'y']].corr().loc[(slice(None), 'x'), 'y']

In [None]:
from scipy import stats
datasets.apply(lambda df: stats.linregress(df.x, df.y)[:2])

Surely these four datasets must be more or less the same for all statistically meaningful purposes...

But let's double-check to be sure...

In [None]:
%matplotlib inline
import seaborn as sns

In [None]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=data4,
           col_wrap=2, ci=None, height=4);  # `size` was renamed to `height` in seaborn 0.9

In [None]:
from IPython.display import Video
Video("https://pbs.twimg.com/tweet_video/CrIDuOhWYAAVzcM.mp4")

For more, see [The Datasaurus dozen](https://www.autodeskresearch.com/publications/samestats).
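
The same effect is easy to reproduce with synthetic data. The sketch below does not use the datasets above: it draws two made-up samples (one bell-shaped, one bimodal; the names `bell`, `bimodal` and `standardize` are ours), rescales both to mean 0 and variance 1, and plots their histograms. The summary statistics agree, but the pictures do not.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def standardize(values):
    """Shift and scale a sample to mean 0 and (population) variance 1."""
    return (values - values.mean()) / values.std()

# Two made-up samples with very different shapes
bell = standardize(rng.normal(size=1000))
bimodal = standardize(np.concatenate([rng.normal(-3, 0.5, size=500),
                                      rng.normal(3, 0.5, size=500)]))

# Identical summary statistics (up to floating-point rounding)...
for name, sample in [('bell', bell), ('bimodal', bimodal)]:
    print(f'{name:8s} mean={sample.mean():+.3f} var={sample.var():.3f}')

# ...but clearly different distributions once plotted
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True)
axes[0].hist(bell, bins=30)
axes[0].set_title('bell')
axes[1].hist(bimodal, bins=30)
axes[1].set_title('bimodal');
```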

## Introduction to matplotlib

The [matplotlib](http://matplotlib.org) library is a powerful tool capable of producing complex, publication-quality figures in two and three dimensions with fine layout control. Here we provide only a minimal, self-contained introduction that covers the functionality needed for the rest of the book. We encourage the reader to work through the tutorials included with the matplotlib documentation and to browse its extensive gallery of examples, which include source code.

Just as we typically use the shorthand `np` for NumPy, we will use `plt` for the `matplotlib.pyplot` module, where the easy-to-use plotting functions reside (the library also has a rich object-oriented architecture that we don't have the space to discuss here):

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

The `plot` command:

In [None]:
x = np.random.rand(100)
plt.plot(x)

Plotting a function, $f(x) = \sin(x)$:

In [None]:
x = np.linspace(0, 2*np.pi, 300)
y = np.sin(x)
plt.plot(x, y);

The most frequently used function is simply called `plot`. Here is how you can make a simple plot of $\sin(x)$ and $\sin(x^2)$ for $x \in [0, 2\pi]$ with labels and a grid (the semicolon on the last line suppresses the display of some information that is unnecessary right now):

In [None]:
y2 = np.sin(x**2)
plt.plot(x, y, label=r'$\sin(x)$')
plt.plot(x, y2, label=r'$\sin(x^2)$')
plt.title('Some functions')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.legend();
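
The pyplot functions above act on an implicit "current" figure. As a minimal sketch of the object-oriented interface mentioned earlier, the same plot can be built by creating the figure and axes explicitly with `plt.subplots` and calling methods on the returned `Axes` object. The variable names are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 300)

fig, ax = plt.subplots()                       # one Figure holding one Axes
ax.plot(x, np.sin(x), label=r'$\sin(x)$')
ax.plot(x, np.sin(x**2), label=r'$\sin(x^2)$')
ax.set_title('Some functions')                 # plt.title  -> ax.set_title
ax.set_xlabel('x')                             # plt.xlabel -> ax.set_xlabel
ax.set_ylabel('y')                             # plt.ylabel -> ax.set_ylabel
ax.grid()
ax.legend();
```

Holding on to `ax` becomes useful as soon as a figure has several panels, since each panel is its own `Axes` object.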

You can control the style, color and other properties of the lines and markers, for example:

In [None]:
x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)
plt.plot(x, y, linewidth=2);

In [None]:
plt.plot(x, y, 'o', markersize=5, color='r');
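
As a further sketch of the styling options, `plot` also accepts a compact format string that combines color, marker and line style, and the same properties can be spelled out as keyword arguments. The curves and labels chosen here are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)

fig, ax = plt.subplots()

# Format strings: 'r--' is a red dashed line, 'g^' is green triangles with no line
dashed, = ax.plot(x, np.sin(x), 'r--', linewidth=2, label='format string: r--')
triangles, = ax.plot(x, np.cos(x), 'g^', markersize=4, label='format string: g^')

# The same kind of styling written out with keyword arguments
dotted, = ax.plot(x, np.sin(x) * np.cos(x), color='black', linestyle=':',
                  marker='o', markersize=3, alpha=0.6, label='keyword arguments')

ax.legend();
```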

## Plotting two-dimensional arrays

In [None]:
a = np.random.rand(5,10)
plt.imshow(a, interpolation='bilinear', cmap=plt.cm.Blues)
plt.figure()
plt.imshow(a, interpolation='bicubic', cmap=plt.cm.Blues)
plt.figure()
plt.imshow(a, interpolation='nearest', cmap=plt.cm.Blues);

In [None]:
img = plt.imread('data/dessert.png')
img.shape

In [None]:
plt.imshow(img);

## Subplots
Plot the red, green and blue channels of the image separately, next to the original:

In [None]:
fig, ax = plt.subplots(1, 4, figsize=(10, 6))
ax[0].imshow(img[:, :, 0], cmap=plt.cm.Reds_r)    # channel 0: red
ax[1].imshow(img[:, :, 1], cmap=plt.cm.Greens_r)  # channel 1: green
ax[2].imshow(img[:, :, 2], cmap=plt.cm.Blues_r)   # channel 2: blue
ax[3].imshow(img)                                 # original image
for axis in ax:
    # removing the ticks also removes their labels
    axis.set_xticks([])
    axis.set_yticks([])
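
If `data/dessert.png` is not at hand, the same channel-splitting idea can be tried on a synthetic image. The sketch below builds a hypothetical RGB array called `demo` (our stand-in, not the image used above) whose red channel varies horizontally and whose green channel varies vertically, then shows each channel in its own subplot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in image: height x width x 3 floats in [0, 1]
h, w = 64, 96
demo = np.zeros((h, w, 3))
demo[..., 0] = np.linspace(0, 1, w)            # red grows from left to right
demo[..., 1] = np.linspace(0, 1, h)[:, None]   # green grows from top to bottom
demo[..., 2] = 0.5                             # constant blue

fig, axes = plt.subplots(1, 4, figsize=(10, 3))
channels = [('red', 'Reds_r'), ('green', 'Greens_r'), ('blue', 'Blues_r')]
for index, (name, cmap) in enumerate(channels):
    # each channel is a 2-d array, so it needs a colormap to be displayed
    axes[index].imshow(demo[:, :, index], cmap=cmap, vmin=0, vmax=1)
    axes[index].set_title(name)
axes[3].imshow(demo)                           # full RGB image, no colormap needed
axes[3].set_title('combined')
for axis in axes:
    axis.set_axis_off()
```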

## Simple 3d plotting with matplotlib

Matplotlib is mainly a 2-d plotting library, but it has basic 3-d capabilities; you can read more about them in the [3d plotting tutorial](https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html). 

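Note that older matplotlib versions require the following import to be executed at least once in your session (recent versions register the 3d projection automatically):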

In [None]:
from mpl_toolkits.mplot3d import Axes3D

Once this has been done, you can create 3d axes by passing the `projection='3d'` keyword to `add_subplot`:

    fig = plt.figure()
    fig.add_subplot(<other arguments here>, projection='3d')
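
An equivalent route, shown here as a small sketch, is to pass the projection through the `subplot_kw` argument of `plt.subplots`. The helix is just an arbitrary curve chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (only needed on older matplotlib)

# subplot_kw is forwarded to the call that creates each subplot
fig, ax = plt.subplots(subplot_kw={'projection': '3d'})

t = np.linspace(0, 4 * np.pi, 200)
ax.plot(np.cos(t), np.sin(t), t)   # x, y and z coordinates of a helix
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
```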

A simple surface plot:

In [None]:
from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib import cm

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.viridis,
        linewidth=0, antialiased=False)
ax.set_zlim3d(-1.01, 1.01);

## The [matplotlib gallery](http://matplotlib.sourceforge.net/gallery.html)

In [None]:
# %load http://matplotlib.org/mpl_examples/pie_and_polar_charts/polar_scatter_demo.py

In [None]:
"""
==========================
Scatter plot on polar axis
==========================

Demo of scatter plot on a polar axis.

Size increases radially in this example and color increases with angle
(just to verify the symbols are being scattered correctly).
"""
import numpy as np
import matplotlib.pyplot as plt


# Compute areas and colors
N = 150
r = 2 * np.random.rand(N)
theta = 2 * np.pi * np.random.rand(N)
area = 200 * r**2
colors = theta

ax = plt.subplot(111, projection='polar')
c = ax.scatter(theta, r, c=colors, s=area, cmap='hsv', alpha=0.75)

plt.show()


## Matplotlib and dataframes

In [None]:
cars = pd.read_csv('data/cars.csv')
cars.head()

In [None]:
x, y = 'Horsepower', 'Miles_per_Gallon'

plt.scatter(x=x, y=y, data=cars);

In [None]:
fig, ax = plt.subplots()
for name, df in cars.groupby('Origin'):
    ax.scatter(x=x, y=y, data=df, label=name)

ax.set_xlabel(x)
ax.set_ylabel(y)
ax.legend();
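
The calls above rely on matplotlib's `data` keyword: when `data` is given, string arguments such as `x` and `y` are looked up as keys (here, column names) in that object. The sketch below repeats the grouped scatter plot on a tiny made-up table so that it runs without `data/cars.csv`. The values in `toy` are invented for illustration and are not taken from the cars dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented toy table that only mimics the column names used above
toy = pd.DataFrame({
    'Horsepower': [130, 165, 95, 70, 88, 110],
    'Miles_per_Gallon': [18.0, 15.0, 24.0, 31.0, 27.0, 21.5],
    'Origin': ['USA', 'USA', 'Japan', 'Japan', 'Europe', 'Europe'],
})

fig, ax = plt.subplots()
for name, group in toy.groupby('Origin'):
    # the two strings are resolved to group['Horsepower'] and group['Miles_per_Gallon']
    ax.scatter('Horsepower', 'Miles_per_Gallon', data=group, label=name)

ax.set_xlabel('Horsepower')
ax.set_ylabel('Miles_per_Gallon')
ax.legend();
```

Each pass through the loop adds one collection of points with its own color and legend entry, which is the bookkeeping that seaborn's `hue` argument automates.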

## Seaborn to the rescue: statistical plots with matplotlib

[Seaborn](https://seaborn.pydata.org/) was created by Michael Waskom while he was a graduate student at Stanford.

*Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.*

"...built on top of matplotlib and tightly integrated with the PyData stack, including support for `numpy` and `pandas` data structures and statistical routines from `scipy` and `statsmodels`."

Some of the features that seaborn offers are

 - Several built-in themes that improve on the default matplotlib aesthetics
 - Tools for choosing color palettes to make beautiful plots that reveal patterns in your data
 - Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data
 - Tools that fit and visualize linear regression models for different kinds of independent and dependent variables
 - Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices
 - A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate
 - High-level abstractions for structuring grids of plots that let you easily build complex visualizations

"The plotting functions operate on dataframes and arrays containing a whole dataset and internally perform the necessary aggregation and statistical model-fitting to produce informative plots. If matplotlib “tries to make easy things easy and hard things possible”, seaborn tries to make a well-defined set of hard things easy too."

In [None]:
import seaborn as sns

In [None]:
sns.lmplot(x=x, y=y, data=cars, hue='Origin', fit_reg=False);

In [None]:
sns.lmplot(x='Horsepower', y='Miles_per_Gallon', data=cars, hue='Origin');

In [None]:
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species");

## Some other tools

### Declarative plots with Altair

[Altair](https://altair-viz.github.io) is a declarative statistical visualization library for Python, based on the [Vega-Lite](https://vega.github.io/vega-lite) visualization grammar for interactive graphics. It was created by Jupyter/Cal Poly's [Brian Granger](https://github.com/ellisonbg) and UW's [Jake VanderPlas](http://vanderplas.com).

In [None]:
import altair as alt
# alt.enable_mime_rendering()  # Uncomment for rendering in JupyterLab & nteract

In [None]:
alt.Chart(cars).mark_circle().encode(x=x, y=y, color='Origin')

### Interactive plots with plotly

Plotly is an [open-source plotting library](https://plot.ly) with rich JavaScript-based interactivity, Jupyter Notebook integration and cross-language support (Python, R, Julia, Matlab). Plotly charts can be rendered [offline in the Jupyter Notebook](https://plot.ly/python/offline) or [shared online via Plotly](https://help.plot.ly/how-sharing-works-in-plotly).

In [None]:
import plotly.offline as py
py.init_notebook_mode()

In [None]:
data = [ dict(x=df[x], y=df[y], name=name, mode='markers')
         for name, df in cars.groupby('Origin') ]

In [None]:
fig = {'data': data,
       'layout': {'xaxis': {'title': x},
                  'yaxis': {'title': y} } }
py.iplot(fig)