# ``pdvega``: Pandas Plotting API to Vega-Lite

[``pdvega``](http://github.com/jakevdp/pdvega) is a package that extends the pandas plotting API to work with [Vega-Lite](http://vega.github.io/vega-lite/).
It adds the ``vgplot`` attribute to pandas ``Series`` and ``DataFrame`` objects, so that the ``data.vgplot()`` method can be used much like the ``data.plot()`` method to create interactive Vega-Lite plots.

This notebook contains examples of ``vgplot`` in action. Note that if you are viewing this notebook on GitHub or nbviewer, you will see only static snapshots of the outputs. Run the notebook locally to see the full interactive versions of the plots.

Throughout this notebook, we will use the [vega_datasets](https://github.com/jakevdp/vega_datasets) package for loading example data as pandas dataframes. This can be installed with:

```
pip install vega_datasets
```

For example, here is some stock market data from a few large tech companies:

In [None]:
from vega_datasets import data
stocks = data.stocks(pivoted=True)
stocks.head()
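
As a rough sketch of what the ``pivoted=True`` layout means, the snippet below builds a tiny long-form table and pivots it so that each ticker symbol becomes its own column, indexed by date. The column names ``symbol``, ``date``, and ``price`` and the values are assumptions made for illustration; they are not read from the dataset itself.

```python
import pandas as pd

# Hypothetical long-form records: one row per (symbol, date) pair.
# The prices are illustrative values, not taken from the real dataset.
long_form = pd.DataFrame({
    "symbol": ["MSFT", "MSFT", "AMZN", "AMZN"],
    "date": pd.to_datetime(["2000-01-01", "2000-02-01",
                            "2000-01-01", "2000-02-01"]),
    "price": [39.81, 36.35, 64.56, 68.87],
})

# Pivot to wide form: dates become the index, one column per symbol.
wide_form = long_form.pivot(index="date", columns="symbol", values="price")
print(wide_form)
```

The wide layout is convenient here because a single plotting call can draw one line per column.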

## The Pandas Plotting API

Let's start with the standard matplotlib, pandas, and numpy imports, as well as some matplotlib setup:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Pandas has long provided a simple plotting API based on the ``plot`` method:

In [None]:
stocks.plot();

The result is a nice plot, but it is just a static PNG.

Importing the ``pdvega`` module will add the ``vgplot`` attribute to pandas ``Series`` and ``DataFrame`` objects that produces a more dynamic plot:

In [None]:
# import adds the vgplot attribute to Series and DataFrames
import pdvega

hasattr(pd.DataFrame, 'vgplot')
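
One way a package can attach an attribute like this on import is the accessor registration hook that pandas documents for extensions. The sketch below uses that hook with a hypothetical accessor named ``demo_plot``; it is not pdvega's actual implementation, which may use a different mechanism.

```python
import pandas as pd

# "demo_plot" is a hypothetical accessor name used only for this sketch.
@pd.api.extensions.register_dataframe_accessor("demo_plot")
class DemoPlotAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor was reached from.
        self._data = pandas_obj

    def __call__(self):
        # Stand-in for real plotting logic: report the columns to draw.
        return list(self._data.columns)

    def line(self):
        # Accessor methods make calls like df.demo_plot.line() possible.
        return {"mark": "line", "columns": list(self._data.columns)}

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(hasattr(pd.DataFrame, "demo_plot"))
print(df.demo_plot())
print(df.demo_plot.line())
```

Defining ``__call__`` alongside named methods is what lets one attribute support both a ``df.demo_plot()`` and a ``df.demo_plot.line()`` style of use.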

The ``vgplot`` method is designed to have mostly the same interface as the ``plot`` method:

In [None]:
stocks.vgplot()

The result is an interactive plot; if you're running the notebook live, you should be able to pan and zoom the plot using your mouse or trackpad.

Let's explore a few types of plots that ``pdvega`` makes available:

## Line Plots

Above we saw that the default plot style is a line plot.
We can also specify the line plot explicitly using ``plot.line()`` or ``vgplot.line()``:

In [None]:
stocks.plot.line();

In [None]:
stocks.vgplot.line()

## Scatter Plots

Scatter plots are also supported; for this let's use the ``cars`` data:

In [None]:
cars = data('cars')
cars.head()

In [None]:
cars.plot.scatter(x='Horsepower', y='Miles_per_Gallon', c='Cylinders');

In [None]:
cars.vgplot.scatter(x='Horsepower', y='Miles_per_Gallon', c='Cylinders')

Notice that unlike matplotlib, pdvega recognizes that the ``Cylinders`` column consists of discrete, ordinal values, and chooses the legend accordingly.
For continuous values, a continuous legend is used:

In [None]:
cars.vgplot.scatter(x='Horsepower', y='Miles_per_Gallon', c='Weight_in_lbs')

Both the size and the color of points can be specified, as well as the transparency:

In [None]:
cars.vgplot.scatter(x='Horsepower', y='Miles_per_Gallon',
                    c='Origin', s='Weight_in_lbs', alpha=0.3)

## Bar Plots

Bar plots are supported with the ``bar`` and ``barh`` methods.
Let's create some simple data to plot:

In [None]:
rand = np.random.RandomState(42)
df = pd.DataFrame({'x': 0.5 + 0.5 * rand.rand(20),
                   'y': 0.7 + 0.5 * rand.rand(20)})

### Unstacked

The default bar plot type is unstacked. There is a slight difference between matplotlib and Vega here: matplotlib puts the bars side by side, while Vega layers the unstacked bars on top of each other and makes them transparent:

In [None]:
df.iloc[-10:].plot.bar();

In [None]:
df.vgplot.bar()

### Stacked

It can be more useful to stack the bars instead:

In [None]:
df.plot.bar(stacked=True);

In [None]:
df.vgplot.bar(stacked=True)
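
To make the idea of stacking concrete, here is a minimal sketch of the arithmetic involved: each segment starts where the previous column's segment ended. It uses only pandas and numpy on a smaller version of the data above, and it illustrates the concept rather than the code either plotting library runs.

```python
import numpy as np
import pandas as pd

rand = np.random.RandomState(42)
df = pd.DataFrame({'x': 0.5 + 0.5 * rand.rand(5),
                   'y': 0.7 + 0.5 * rand.rand(5)})

# Each bar segment ends at the running total across columns...
tops = df.cumsum(axis=1)
# ...and starts where the previous column's segment ended.
bottoms = tops - df

print(bottoms.join(tops, lsuffix='_bottom', rsuffix='_top'))
```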

### Horizontal bar

By changing ``bar`` to ``barh``, we get horizontal bar plots:

In [None]:
df10 = df.iloc[:10]

In [None]:
df10.plot.barh(stacked=True);

In [None]:
df10.vgplot.barh(stacked=True)

## Area Plots

Filled line plots become area plots, which are stacked by default:

In [None]:
df = pd.DataFrame({'x': 1 + np.random.rand(50),
                   'y': 1 + np.random.rand(50)})

In [None]:
df.plot.area();

In [None]:
df.vgplot.area()

When unstacked, they become partially transparent for clarity:

In [None]:
df.plot.area(stacked=False);

In [None]:
df.vgplot.area(stacked=False)

## Histograms

For exploring distributions of data, histograms are useful:

In [None]:
df = pd.DataFrame({'x': np.random.randn(1000),
                   'y': 1 + np.random.randn(1000)})

In [None]:
df.plot.hist(bins=50);

In [None]:
df.vgplot.hist(bins=50)

Vega supports different histogram types, including step plots:

In [None]:
df.vgplot.hist(bins=50, histtype='step')

And filled step plots:

In [None]:
df.vgplot.hist(bins=50, histtype='stepfilled')
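
For reference, the binning behind a histogram can be sketched with numpy alone. The example below counts points in 50 equal-width bins and then builds the coordinates of a step outline from those counts. It is an illustration of the idea, not a description of how Vega-Lite bins data internally.

```python
import numpy as np

rng = np.random.RandomState(0)
values = rng.randn(1000)

# Split the data range into 50 equal-width bins and count points per bin.
counts, edges = np.histogram(values, bins=50)

# A step outline repeats each count at both edges of its bin.
step_x = np.repeat(edges, 2)[1:-1]
step_y = np.repeat(counts, 2)

print(len(edges), len(step_x), len(step_y))
```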

## KDE Plots

A kernel density estimate (KDE) gives a smooth estimate of the same distribution, without the dependence on bin edges that a histogram has. It can be plotted with the ``kde`` method:

In [None]:
df.plot.kde();

In [None]:
df.vgplot.kde()
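
A kernel density estimate can be sketched in a few lines of numpy: place a Gaussian bump on every sample and average them. The bandwidth below follows Scott's rule of thumb, which is an assumption of this sketch; the plotting methods above may choose their bandwidth differently.

```python
import numpy as np

rng = np.random.RandomState(0)
samples = rng.randn(1000)

# Bandwidth from Scott's rule of thumb (an assumption for this sketch).
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

def gaussian_kde(grid, samples, bandwidth):
    """Average a Gaussian bump centred on every sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

grid = np.linspace(-5, 5, 501)
density = gaussian_kde(grid, samples, bandwidth)

# A density should integrate to roughly 1 over its support.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother but less detailed curve; a smaller one follows the samples more closely.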

## Heatmaps and Hexbin

For two-dimensional data, matplotlib provides a hexagonally-binned heatmap with the ``hexbin`` function:

In [None]:
df.plot.hexbin(x='x', y='y', gridsize=20);

Unfortunately, Vega-Lite does not support hexagonal binning, but does support Cartesian-binned heatmaps:

In [None]:
df.vgplot.heatmap(x='x', y='y', gridsize=20)
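
The counts behind a Cartesian-binned heatmap can be sketched with ``np.histogram2d``. This example assumes that ``gridsize=20`` corresponds to 20 rectangular bins along each axis; that reading is an assumption of the sketch, not something confirmed here.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(1000)
y = 1 + rng.randn(1000)

# 20 x 20 rectangular bins; counts[i, j] is the number of points whose
# x falls in bin i and whose y falls in bin j.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=20)

print(counts.shape)
```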

## Scatter Matrix

For higher-dimensional data, a matrix of scatterplots can be a useful way to explore data. Pandas provides this in the ``pd.plotting.scatter_matrix()`` function:

In [None]:
iris = data.iris()

In [None]:
pd.plotting.scatter_matrix(iris, figsize=(8, 8));

pdvega provides this as well:

In [None]:
pdvega.plotting.scatter_matrix(iris, 'species', figsize=(8, 8))

In this case, the scatter matrix supports two types of interactions:

- clicking and dragging will pan and zoom the plots in a linked manner
- clicking and dragging while holding the SHIFT key will perform linked-brushing, highlighting selected points across the plot.

## Parallel Coordinates

Another way to visualize higher-dimensional data is through a parallel coordinates plot:

In [None]:
pd.plotting.parallel_coordinates(iris, 'species');

In [None]:
pdvega.plotting.parallel_coordinates(iris, 'species')

## Andrews Curves

Similar in spirit is the Andrews curve plot, which converts each datapoint into a smooth curve via a Fourier series:

In [None]:
pd.plotting.andrews_curves(iris, 'species');

In [None]:
pdvega.plotting.andrews_curves(iris, 'species')
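
The Fourier series behind an Andrews curve can be sketched as follows, using the standard form f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... The function name ``andrews_curve`` and the sample row are illustrative choices for this sketch, not part of either library's API.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate Andrews' Fourier series for one data point at angles t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature sets the constant term.
    result = np.full_like(t, row[0] / np.sqrt(2))
    # Remaining features alternate sin/cos with increasing frequency:
    # x2*sin(t), x3*cos(t), x4*sin(2t), x5*cos(2t), ...
    for i, value in enumerate(row[1:]):
        harmonic = i // 2 + 1
        basis = np.sin if i % 2 == 0 else np.cos
        result = result + value * basis(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
# Example four-feature observation, used only for illustration.
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
print(curve.shape)
```

Points that are close together in feature space produce curves that stay close together, which is why clusters tend to show up as bundles of similar curves.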

## Lag Plot

A lag plot is a way of exploring the temporal relationships within a time series, by plotting values separated by a constant lag.
For example, here is the 12-month lag plot of some tech stock prices between 1998 and 2010:

In [None]:
stocks = data.stocks(pivoted=True)

In [None]:
pd.plotting.lag_plot(stocks[['MSFT', 'AMZN']], lag=12);

Unlike matplotlib, ``pdvega`` automatically colors and labels points to make the relationships clearer:

In [None]:
pdvega.plotting.lag_plot(stocks[['MSFT', 'AMZN']], lag=12)

This shows us that during 1998-2010, Amazon, a startup company, was quite volatile (its price in any given month was a poor predictor of its price a year later), while Microsoft, an established tech giant, was much more stable.
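
The pairing that a lag plot draws can be sketched with plain numpy: each value is matched with the value ``lag`` steps later. The helper ``lag_pairs`` and the synthetic monthly values are hypothetical and exist only for this illustration.

```python
import numpy as np

def lag_pairs(series, lag):
    """Return (y(t), y(t + lag)) arrays for a 1-D sequence; lag must be >= 1."""
    values = np.asarray(series, dtype=float)
    return values[:-lag], values[lag:]

# Hypothetical monthly values, used only to show the pairing.
monthly = np.arange(24, dtype=float)
current, later = lag_pairs(monthly, lag=12)

# The correlation between the paired values summarizes how well a value
# predicts the one `lag` steps later.
corr = np.corrcoef(current, later)[0, 1]
print(len(current), corr)
```

A tight diagonal band in the lag plot corresponds to a correlation near 1, while a diffuse cloud corresponds to a correlation near 0.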

## Learning More

Hopefully this whets your appetite for the types of plots you can make in Vega-Lite. For more information on the pdvega package, please see the documentation at http://jakevdp.github.io/pdvega/.