# 2. Integrated Plotting

We have already seen some plotting with Matplotlib, a low-level library that gives fine control over the output. Now we are going to look at two higher-level plotting interfaces:

- Pandas.plotting
- Seaborn

Both *Pandas.plotting* and *Seaborn* work directly with tabular data (such as Pandas DataFrames) and set axis labels, legends and similar details automatically.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

## Pandas.plotting

Pandas integrates with Matplotlib directly through the .plot() method of a Series or DataFrame. We can therefore plot data straight from a DataFrame, and the column names are used to label the series:

In [None]:
var = pd.DataFrame({'normal': np.random.normal(size=100),
                    'gamma': np.random.gamma(1, size=100),
                    'poisson': np.random.poisson(size=100)})
var.cumsum(0).plot()

Here we can see that each column is plotted as a line by *default*. Passing subplots=True gives each column its own axes:

In [None]:
var.cumsum(0).plot(subplots=True, grid=True)

We may want to have some series displayed on the secondary y-axis, which can allow for greater detail and less empty space:

In [None]:
var.cumsum(0).plot(secondary_y="normal", grid=True)
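
When a series is moved to the secondary axis, it helps to label both scales. The following is a minimal sketch, assuming a frame like the one above rebuilt with a seeded generator (the seed, the name `demo` and the label text are illustrative). Pandas attaches the secondary axes to the returned axes object as `right_ax`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
demo = pd.DataFrame({'normal': rng.normal(size=100),
                     'gamma': rng.gamma(1, size=100),
                     'poisson': rng.poisson(size=100)})

# 'normal' is drawn against the right-hand axis; the other columns stay on the left.
ax = demo.cumsum().plot(secondary_y=["normal"], grid=True)
ax.set_ylabel("gamma / poisson")
ax.right_ax.set_ylabel("normal")
```

In the legend, Pandas marks the series drawn on the secondary axis with a "(right)" suffix.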

Let's use one of our tabular datasets as an example: Titanic.

In [None]:
titanic = pd.read_excel("../2. Python Data Handling - Pandas and Networkx/titanic.xlsx")

Here we use .groupby() to count the survivors in each passenger class (1st to 3rd). The result is a Series, which we plot directly with .plot.bar().

In [None]:
titanic.groupby("Pclass").Survived.sum().plot.bar()

In [None]:
titanic.groupby(["Sex","Pclass"]).Survived.sum().plot.barh()

Here we use **crosstab** to count individuals by *class* and *sex*, split by whether they survived.

In [None]:
death_counts = pd.crosstab([titanic.Pclass, titanic.Sex], titanic.Survived.astype(bool))
death_counts.plot.bar(stacked=True, color=['black','gold'], grid=True)

Here we divide by the total of each group to obtain the *proportion* of survivors and victims per group.

In [None]:
death_counts.div(death_counts.sum(axis=1), axis=0).plot.barh(stacked=True, color=['black','gold'], grid=True)
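
The row-wise scaling can be checked on a small table. This is a sketch on made-up data (the `toy` frame is a hypothetical stand-in for the Titanic columns, not real passenger records). It also shows that `pd.crosstab` can do the same scaling itself through its `normalize` argument:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic columns used above.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

counts = pd.crosstab([toy.Pclass, toy.Sex], toy.Survived.astype(bool))

# Divide each row by its own total so that every row sums to 1.
manual = counts.div(counts.sum(axis=1), axis=0)

# crosstab can apply the same row-wise scaling directly.
direct = pd.crosstab([toy.Pclass, toy.Sex], toy.Survived.astype(bool), normalize="index")

print(manual)
```

Either table can be passed to .plot.barh(stacked=True) in the same way as above.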

In [None]:
titanic.Fare.hist(bins=20)

In [None]:
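from scipy.stats import skew
# Doane's rule for the number of bins: Sturges' rule plus a skewness term.
# (len(data) / 6.) ** 0.5 approximates 1 / sigma_g1 for large samples.
doanes = lambda data: int(1 + np.log2(len(data)) + np.log2(1 + abs(skew(data)) * (len(data) / 6.) ** 0.5))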
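
NumPy ships its own Doane estimator, which makes a hand-written rule easy to cross-check. The sketch below assumes synthetic right-skewed data (not the Titanic fares) and uses Doane's rule as it is usually stated, with the sample skewness and base-2 logarithms:

```python
import numpy as np

rng = np.random.default_rng(42)        # illustrative seed
sample = rng.gamma(1.0, size=500)      # synthetic right-skewed data

n = len(sample)
g1 = np.mean(((sample - sample.mean()) / sample.std()) ** 3)   # sample skewness
sigma_g1 = np.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
k = 1 + np.log2(n) + np.log2(1 + abs(g1) / sigma_g1)           # Doane's rule

# NumPy's built-in estimator returns bin edges; there is one more edge than bins.
edges = np.histogram_bin_edges(sample, bins="doane")
n_bins = len(edges) - 1
print(round(k, 2), n_bins)
```

The same estimator can be selected when plotting by passing bins="doane" to Matplotlib's hist, which forwards the string to NumPy.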

We can create a raw *fig* and *axes* object using Matplotlib and then pass the *ax* object to the Pandas plotting calls, so that both plots are drawn on the same axes:

In [None]:
fig, ax = plt.subplots()
# Normalise the histogram to unit area so that it shares a scale with the KDE.
titanic.Fare.hist(bins=doanes(titanic.Fare.dropna()), ax=ax, density=True, color='lightseagreen')
titanic.Fare.dropna().plot.kde(ax=ax, xlim=(0,600), style='r--')
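
The normalisation matters because a KDE integrates to 1, whereas a raw histogram sums to the number of observations. A small sketch with synthetic, fare-like data (the values and seed are assumptions made for illustration) shows that the normalised bars enclose unit area:

```python
import numpy as np

rng = np.random.default_rng(1)                    # illustrative seed
values = rng.exponential(scale=30.0, size=1000)   # synthetic skewed data

counts, edges = np.histogram(values, bins=20)              # raw counts per bin
density, _ = np.histogram(values, bins=20, density=True)   # normalised bar heights
widths = np.diff(edges)

area = float(np.sum(density * widths))   # total area under the normalised bars
print(counts.sum(), area)
```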

In [None]:
titanic.boxplot("Fare", "Pclass", grid=False)

For scatter plots, the x and y axes must be given as column names. We first load the wine dataset:

In [None]:
wine = pd.read_table("../2. Python Data Handling - Pandas and Networkx/wine.dat", sep=r"\s+")
attributes = ['Grape','Alcohol','Malic acid','Ash','Alcalinity of ash','Magnesium','Total phenols',
            'Flavanoids','Nonflavanoid phenols','Proanthocyanins','Color intensity','Hue',
            'OD280/OD315 of diluted wines','Proline']
wine.columns = attributes

In [None]:
wine.plot.scatter("Color intensity", "Hue")

We could assign scatter point size using another column:

In [None]:
wine.plot.scatter("Color intensity", "Hue", s=wine.Alcohol*100, alpha=.5)

In [None]:
wine.plot.scatter("Color intensity", "Hue", c="Grape")

Pandas provides a convenience function that draws the pairwise scatter plots of all the selected variables, with the diagonal (each variable against itself) showing a histogram by default, or optionally a KDE.

In [None]:
_ = pd.plotting.scatter_matrix(wine.iloc[:,1:6], figsize=(12,12), diagonal='kde')

## Task

Using the Titanic dataset, create a KDE plot comparing the age distributions of survivors and victims.

In [None]:
# your code here

## Advanced Pandas.plotting

Here we consider some less traditional plots that Pandas provides.

For instance, when we want to compare many observations across several continuous columns, the parallel_coordinates function is particularly useful. Each observation becomes a line that crosses one vertical axis per column:

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.concat([pd.DataFrame(iris['data'], columns=iris['feature_names']),
           pd.DataFrame(iris['target'], columns=["Species"]).replace(dict(zip([0, 1, 2], iris.target_names)))], axis=1)
from pandas.plotting import parallel_coordinates
fig = plt.figure(figsize=(10,6))
parallel_coordinates(iris_df, "Species")

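Alternatively, we can use *Andrews curves*, which turn each observation into a smooth curve built from a finite Fourier series whose coefficients are the feature values. Observations of the same species then tend to form bundles of similar curves: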

In [None]:
from pandas.plotting import andrews_curves
fig = plt.figure(figsize=(8,6))
andrews_curves(iris_df, "Species")
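
To make the construction concrete, here is a sketch of the curve for a single observation, using the textbook definition $f(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + \dots$ on $[-\pi, \pi]$. The helper name `andrews_curve` and the four measurements are illustrative and not part of Pandas:

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one observation at the angles t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    curve = np.full_like(t, row[0] / np.sqrt(2.0))   # constant term
    for i, x in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2                      # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            curve += x * np.sin(harmonic * t)
        else:
            curve += x * np.cos(harmonic * t)
    return curve

t = np.linspace(-np.pi, np.pi, 200)
flower = [5.1, 3.5, 1.4, 0.2]    # four illustrative measurements
curve = andrews_curve(flower, t)
```

Plotting one such curve per row, coloured by class, gives the same kind of picture as andrews_curves.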

*Bootstrapping* is a common way to visually assess the uncertainty of a statistic, such as a mean, median or midrange. A random subset of a specified size is selected from the data set, the statistic in question is computed for this subset, and the process is repeated $n$ times.

In [None]:
data = pd.Series(np.random.randn(10000))
from pandas.plotting import bootstrap_plot
fig = plt.figure(figsize=(14,8))
bootstrap_plot(data, size=100, samples=1000, fig=fig, color='lightseagreen')
plt.show()
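
The resampling loop itself is short. The sketch below reproduces the idea for the mean with NumPy only, assuming subsets are drawn without replacement (one reading of "random subset" above) and using an illustrative seed:

```python
import numpy as np

rng = np.random.default_rng(0)            # illustrative seed
population = rng.standard_normal(10000)

size, samples = 100, 1000
means = np.empty(samples)
for i in range(samples):
    # Draw one random subset and record its mean.
    subset = rng.choice(population, size=size, replace=False)
    means[i] = subset.mean()

spread = means.std()    # spread of the statistic across the resamples
print(means.mean(), spread)
```

For standard normal data and subsets of 100 points, the spread of the means should be close to $1/\sqrt{100} = 0.1$.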

## Seaborn

In Python there are many libraries to use for plotting, most of which are built on top of Matplotlib. One of the most commonly used libraries is *Seaborn*.

Matplotlib is a very powerful package, but it is a relatively **low-level** plotting library: by design it makes very few assumptions about what constitutes good layout, and in return gives the user the flexibility to customize the look of the output completely.

Seaborn, on the other hand, makes **high-level** assumptions about good layout and design, which allows users to generate publication-quality visualizations in an automated way.

In [None]:
import seaborn as sns
sns.set_context("notebook")

For instance, with Pandas plotting we would write:

In [None]:
normals = pd.Series(np.random.randn(50))
normals.cumsum().plot()

**Seaborn's** high-level interface makes it easy to explore data visually by iterating through different plot types and layouts. Seaborn can also immediately improve existing Matplotlib plots through plot styles:

In [None]:
sns.set_style("whitegrid")

normals.cumsum().plot()

In [None]:
titanic = pd.read_excel("../2. Python Data Handling - Pandas and Networkx/titanic.xlsx")
sns.boxplot(x="Pclass", y="Age", data=titanic, order=['1st class', '2nd class', '3rd class'])

We can remove the top and right axis spines with despine():

In [None]:
sns.boxplot(x="Pclass", y="Age", data=titanic, order=['1st class', '2nd class', '3rd class'])
sns.despine()

Seaborn also gives us aesthetic parameters to control the *scale* of plot elements. There are four preset contexts:
* paper
* notebook
* talk
* poster

The default is *notebook*, which is optimized for Jupyter notebooks. We can change the scaling with set_context():

In [None]:
sns.set_context("poster")
sns.boxplot(x="Pclass", y="Age", data=titanic, order=['1st class', '2nd class', '3rd class'])

Detailed settings are available from plotting_context():

In [None]:
sns.plotting_context()
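
Each context is a dictionary of Matplotlib rc parameters, so two presets can be compared entry by entry. The sketch below also uses plotting_context() as a context manager, which limits the scaling to a single block. It assumes only that Seaborn is installed:

```python
import seaborn as sns

notebook = sns.plotting_context("notebook")
poster = sns.plotting_context("poster")

# Compare one rc parameter between the two presets.
print(notebook["font.size"], poster["font.size"])

# Inside the with-block the "talk" scaling is active; afterwards the
# previous settings are restored.
with sns.plotting_context("talk"):
    inside = sns.plotting_context()["font.size"]
after = sns.plotting_context()["font.size"]
```

This is handy when only one figure in a notebook needs a different scale.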

In [None]:
dfx = pd.DataFrame(np.random.normal(2.0, 1.0, size=(100,2)), columns=['x','y'])
fig,ax=plt.subplots(ncols=2, figsize=(12,4))
# Recent Seaborn releases require x= and y= keywords, and fill= replaces shade=.
sns.kdeplot(x=dfx.x, y=dfx.y, ax=ax[1], fill=True)
sns.kdeplot(dfx.x, ax=ax[0])

In [None]:
sns.distplot(dfx.y)

A jointplot will generate a shaded joint KDE, with marginal KDEs for each of the two variables.

In [None]:
with sns.axes_style("dark"):
    sns.jointplot(x="x", y="y", data=dfx, kind='kde', fill=True)

In [None]:
sns.axes_style()

To explore correlations between several continuous variables, *pairplot()* draws pairwise scatter plots with histograms or KDEs on the diagonal, and accepts options such as hue, markers and palette:

In [None]:
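# Assign the result back: an inplace replace on a column accessed as an attribute may not update the DataFrame.
titanic["Pclass"] = titanic.Pclass.replace({'1st class': 1, '2nd class': 2, '3rd class': 3})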
sns.pairplot(titanic.dropna(), vars=['Age', 'Fare', 'Pclass'], hue='Survived', markers='+', palette="muted")

### Plotting Small Multiples on Data-aware Grids

The pairplot above is an example of replicating the same visualisation on subsets of a particular dataset. This enables an easy comparison between groups.

We can generate plots in Seaborn using *data-aware grids*, provided that the DataFrame is structured appropriately in *long-form*, such that the variables are columns and the observations are rows. One of the tools for this is *FacetGrid*:

In [None]:
sns.set_context("notebook")
sns.FacetGrid(titanic, col="Pclass", row="Sex")
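
If a table is stored in wide form instead, with one column per measurement occasion, it can be reshaped into the long form described above with `melt`. The table below is hypothetical and only illustrates the reshaping:

```python
import pandas as pd

# Hypothetical wide table: one row per patient, one column per visit week.
wide = pd.DataFrame({
    "patient": [1, 2, 3],
    "week_0":  [32, 60, 44],
    "week_2":  [30, 26, 40],
})

# melt() stacks the week columns into one row per (patient, week) pair.
long_df = wide.melt(id_vars="patient", var_name="week", value_name="score")
print(long_df)
```

In the long table, "week" can be used directly as a col or hue variable of a grid.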

We can then map a plotting function onto each grid cell. For instance, distplot draws both a histogram and a KDE of age for every sex/class combination, using colour to separate those who survived from those who did not:

In [None]:
g = sns.FacetGrid(titanic, col="Pclass", row="Sex", hue="Survived", legend_out=True)
g.map(sns.distplot, "Age")

In [None]:
cdystonia = pd.read_csv("../2. Python Data Handling - Pandas and Networkx/cdystonia.csv")

With col_wrap, a long row of facets (here one per patient, each a short time series) is wrapped onto several rows:

In [None]:
g = sns.FacetGrid(cdystonia[cdystonia.patient <= 8], col="patient", col_wrap=4)
g.map(sns.pointplot, "week", "twstrs", color="0.5")

We can specify the order of the column facets with the col_order argument:

In [None]:
g = sns.FacetGrid(cdystonia, col="treat", col_order=['Placebo',"5000U","10000U"])
g.map(sns.pointplot, "week", "twstrs", color="r")

In [None]:
from scipy.stats import norm
sns.set_context("notebook")
g = sns.FacetGrid(cdystonia, row='treat', col='week')
g.map(sns.distplot, 'twstrs', kde=False, fit=norm)

In [None]:
g = sns.FacetGrid(cdystonia, col="week", row="treat", hue="sex")
g.map(sns.regplot, "age", "twstrs")

We can achieve something similar specifically for *categorical variables* with Seaborn's *catplot* function (called *factorplot* in older releases), where we break our data down into 3 dimensions using $x$, $y$ and $hue$:

In [None]:
g = sns.catplot(x="Pclass", y="Survived", hue="Sex", data=titanic, height=6, kind='bar')
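
With kind='bar', each bar shows by default the mean of the y variable within its group, so for a 0/1 column such as Survived the bar height is a survival rate. The same numbers can be computed with a groupby. This is a sketch on a made-up frame, not on the real Titanic data:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic columns used in the plot.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male"] * 2,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each class/sex group.
rates = toy.groupby(["Pclass", "Sex"]).Survived.mean().unstack("Sex")
print(rates)
```

The error bars that Seaborn adds to each bar are not reproduced here.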

We can break the data down further by adding row and column facets on top of colour:

In [None]:
# catplot is the current name of the older factorplot function.
sns.catplot(x="n_siblings", y="Age", col="Pclass", row="Sex", hue="Survived", data=titanic, kind='bar')

Seaborn also happens to make some of the easiest heatmaps (in my opinion):

In [None]:
fig,ax=plt.subplots(figsize=(4,8))
_ = sns.heatmap(cdystonia.pivot_table(index=['patient'], columns="week", values="twstrs"), linewidths=.05, ax=ax)
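
The heatmap input is a patient-by-week matrix built with pivot_table. A sketch on hypothetical long-form records (the values are invented) shows the reshaping and how missing visits appear:

```python
import pandas as pd

# Hypothetical long-form records: one row per patient visit.
visits = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 3, 3, 3],
    "week":    [0, 2, 4, 0, 4, 0, 2, 4],
    "twstrs":  [32, 30, 24, 60, 27, 44, 40, 23],
})

# Rows become patients and columns become weeks; missing visits show up as NaN.
grid = visits.pivot_table(index="patient", columns="week", values="twstrs")
print(grid)
print(grid.dropna())   # keeps only patients with a score in every week
```

Clustering needs a complete matrix, which is why rows with missing weeks are dropped with dropna() before a clustermap is drawn.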

We can look for structure by clustering these values in a *clustermap*:

In [None]:
g = sns.clustermap(cdystonia.pivot_table(index=['patient'], columns="week", values="twstrs").dropna(), cmap="summer")