# Common Plots

## Introduction

In this chapter, we'll look at some of the most common plots that you might want to make, and how to create them using the most popular libraries. If you need an introduction to these libraries, see the previous chapter.

Bear in mind that for many of the **matplotlib** examples, the `df.plot.*` syntax will get you the plot you want more quickly. To be more comprehensive, the examples below show a solution that works for any kind of data.

Throughout, we'll assume that the data are in a tidy format (one row per observation, one variable per column). Remember that all Altair plots can be made interactive by adding `.interactive()` at the end.

First, though, let's import the libraries we'll need.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from plotnine import *
import altair as alt
from vega_datasets import data

# Set seed for reproducibility
np.random.seed(10)
# Set max rows displayed for readability
pd.set_option('display.max_rows', 6)
# Nicer matplotlib fonts
plt.style.use({'mathtext.fontset': 'stix',
               'font.family': 'STIXGeneral',
               'figure.figsize': (10, 5.5),
               'xtick.labelsize': 20,
               'ytick.labelsize': 20,
               'font.size': 20})

## Scatter plot

In this example, we see a simple scatter plot with categories using the cars data:

In [None]:
cars = data.cars()
cars.head()

### Matplotlib

In [None]:
colormap = plt.cm.Set1
colorst = [colormap(i) for i in
           np.linspace(0, 0.9, len(cars['Origin'].unique()))]
fig, ax = plt.subplots()
for i, origin in enumerate(cars['Origin'].unique()):
    cars_sub = cars[cars['Origin'] == origin]
    ax.scatter(cars_sub['Horsepower'],
               cars_sub['Miles_per_Gallon'],
               color=colorst[i],
               label=origin,
               edgecolor='grey')
ax.set_ylabel('Miles per Gallon')
ax.set_xlabel('Horsepower')
ax.legend()
plt.show()
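
As mentioned in the introduction, the **pandas** `df.plot.*` shortcut is often quicker. Here is a minimal sketch of that route; it uses a small made-up data frame (the `toy` values are illustrative, not the real cars data) so that it runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the cars data; values are made up for illustration
toy = pd.DataFrame({
    "Horsepower": [130, 165, 95, 88, 70, 110],
    "Miles_per_Gallon": [18.0, 15.0, 24.0, 27.0, 31.0, 21.5],
    "Origin": ["USA", "USA", "Japan", "Japan", "Europe", "Europe"],
})

fig, ax = plt.subplots()
# One call per category so that each gets its own colour and legend entry
for colour, (origin, group) in zip(["C0", "C1", "C2"], toy.groupby("Origin")):
    group.plot.scatter(x="Horsepower", y="Miles_per_Gallon",
                       color=colour, label=origin, ax=ax)
ax.set_xlabel("Horsepower")
ax.set_ylabel("Miles per Gallon")
ax.legend(title="Origin")
```

The loop over categories is still needed here, so the saving is mostly in not having to subset the data frame by hand.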

### Seaborn

In this first example, I'll also show how to tweak the labels by using the underlying matplotlib `Axes` object, here called `ax`.

In [None]:
fig, ax = plt.subplots()
sns.scatterplot(data=cars,
                x="Horsepower",
                y="Miles_per_Gallon",
                hue="Origin",
                ax=ax)
ax.set_ylabel('Miles per Gallon')
ax.set_xlabel('Horsepower')
plt.show()

### Plotnine

In [None]:
(
    ggplot(cars, aes(x="Horsepower",
                     y="Miles_per_Gallon",
                     color='Origin'))
    + geom_point()
    + ylab('Miles per Gallon')
)

### Altair

For this first example, we'll also show how to make the **Altair** plot interactive, with movable axes and more information on mouse hover.

In [None]:
alt.Chart(cars).mark_circle(size=60).encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
    tooltip=['Name', 'Origin', 'Horsepower', 'Miles_per_Gallon']
).interactive()

## Bubble plot

This is a scatter plot where the size of the point carries an extra dimension of information.

### Matplotlib



In [None]:
fig, ax = plt.subplots()
scat = ax.scatter(cars['Horsepower'],
                  cars['Miles_per_Gallon'],
                  s=cars['Displacement'],
                  alpha=0.4)
ax.set_ylabel('Miles per Gallon')
ax.set_xlabel('Horsepower')
ax.legend(*scat.legend_elements(prop="sizes", num=4), loc="upper right", title="Displacement", frameon=False)
plt.show()
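
Matplotlib's `s` argument is measured in points squared, so it sets the marker *area*; the example above passes `Displacement` straight through. The sketch below shows one way to rescale a variable into a chosen range of areas first. Note that `scale_to_area` and its default range are hypothetical choices for illustration, not part of matplotlib.

```python
import numpy as np

def scale_to_area(values, min_area=20.0, max_area=400.0):
    """Linearly map values onto marker areas (in points squared) for `s=`."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    if hi == lo:
        # All values equal: use the mid-point area for every marker
        return np.full(values.shape, (min_area + max_area) / 2)
    return min_area + (values - lo) / (hi - lo) * (max_area - min_area)

displacement = np.array([70.0, 140.0, 250.0, 455.0])  # made-up example values
areas = scale_to_area(displacement)
```

The result could then be passed as `s=scale_to_area(cars['Displacement'])`. Bear in mind that a size legend built from the scatter would then show the rescaled areas rather than the original values.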

### Seaborn



In [None]:
sns.scatterplot(data=cars,
                x="Horsepower",
                y="Miles_per_Gallon",
                size="Displacement");

### Plotnine

In [None]:
(
    ggplot(cars, aes(x="Horsepower",
                     y="Miles_per_Gallon",
                     size='Displacement'))
    + geom_point()
)

### Altair


In [None]:
alt.Chart(cars).mark_circle().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    size='Displacement'
)

## Line plot

First, let's get some data on GDP growth:

In [None]:
import pandas_datareader.data as web

ts_start_date = pd.to_datetime('1999-01-01')

df = pd.concat([web.DataReader('ticker=RGDP' + x, 'econdb', start=ts_start_date) for x in ['US', 'UK']], axis=1)
df.columns = ['US', 'UK']
df.index.name = 'Date'
df = 100*df.pct_change(4)
df = pd.melt(df.reset_index(),
             id_vars=['Date'],
             value_vars=df.columns,
             value_name='Real GDP growth, %',
             var_name='Country')
df = df.set_index('Date')
df.head()
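
The `pct_change(4)` step computes growth relative to four rows earlier, which for quarterly data means growth on the same quarter a year before. Here is a minimal sketch of that step and the reshaping to tidy format; the quarterly levels and the `levels` name are made up for illustration (they are not real GDP figures), so it also runs without a network connection.

```python
import pandas as pd

# Made-up quarterly levels for two countries; not real GDP figures
dates = pd.date_range("2000-01-01", periods=8, freq="QS")
levels = pd.DataFrame(
    {"US": [100, 101, 102, 103, 104, 106, 105, 107],
     "UK": [50, 50, 51, 51, 52, 51, 53, 54]},
    index=dates,
    dtype=float,
)
levels.index.name = "Date"

# Percentage change on the same quarter a year earlier (4 rows back)
growth = 100 * levels.pct_change(4)

# Reshape from wide (one column per country) to tidy (one row per observation),
# dropping the first year, which has nothing to compare against
tidy = growth.reset_index().melt(
    id_vars=["Date"], var_name="Country", value_name="Real GDP growth, %"
).dropna()
```

The first four rows of `growth` are missing by construction, which is why the tidy version is shorter than the original series.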

### Matplotlib

Note that **Matplotlib** prefers data in wide format, with one series per column (here, one column per country), in which case we could have just run

```python
fig, ax = plt.subplots()
df.plot(ax=ax)
ax.set_title('Real GDP growth, %', loc='right')
ax.yaxis.tick_right()
```

but we are working with tidy data here, so we'll do the plotting slightly differently.

In [None]:
colormap = plt.cm.Set1
colorst = [colormap(i) for i in
           np.linspace(0, 0.9, len(df['Country'].unique()))]
fig, ax = plt.subplots()
for i, country in enumerate(df['Country'].unique()):
    df_sub = df[df['Country'] == country]
    ax.plot(df_sub.index,
            df_sub['Real GDP growth, %'],
            color=colorst[i],
            label=country,
            lw=2)
ax.set_title('Real GDP growth, %', loc='right')
ax.yaxis.tick_right()
ax.legend()
plt.show()
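
To use the shorter `df.plot` route with tidy data, one option is to pivot back to wide format first. This is a sketch that assumes a tidy frame with the same three columns as above; the `tidy` frame here is made up so that the snippet is self-contained.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up tidy data with the same column names as the frame above
tidy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-04-01"] * 2),
    "Country": ["US", "US", "UK", "UK"],
    "Real GDP growth, %": [1.0, 2.0, 0.5, 1.5],
})

# One column per country, indexed by date
wide = tidy.pivot(index="Date", columns="Country", values="Real GDP growth, %")

fig, ax = plt.subplots()
wide.plot(ax=ax, lw=2)
ax.set_title("Real GDP growth, %", loc="right")
ax.yaxis.tick_right()
```

`pivot` needs each date and country pair to appear only once; with duplicates, `pivot_table` with an explicit aggregation would be the alternative.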

### Seaborn

Note that **seaborn** prefers not to work with an index, so in the snippet below we use `df.reset_index()` to turn the 'Date' index into a regular column:

In [None]:
fig, ax = plt.subplots()
sns.lineplot(x="Date", y="Real GDP growth, %",
             hue="Country",
             data=df.reset_index(),
             ax=ax)
ax.yaxis.tick_right()
plt.show()

### Plotnine

In [None]:
(
    ggplot(df.reset_index(), aes(x='Date',
                                 y='Real GDP growth, %',
                                 color='Country'))
    + geom_line()
)

### Altair

In [None]:
alt.Chart(df.reset_index()).mark_line().encode(
    x='Date:T',
    y='Real GDP growth, %',
    color='Country',
    strokeDash='Country',
)

## Bar chart

Let's see a bar chart, using the 'barley' dataset.

In [None]:
barley = data.barley()
barley = pd.DataFrame(barley.groupby(['site'])['yield'].sum())
barley.head()

### Matplotlib

Just remove the 'h' in `ax.barh` to get a vertical plot.

In [None]:
fig, ax = plt.subplots()
ax.barh(barley['yield'].index, barley['yield'], 0.35)
ax.set_xlabel('Yield')
plt.show()

### Seaborn

Just switch x and y variables to get a vertical plot.

In [None]:
sns.catplot(
    data=barley.reset_index(),
    kind="bar",
    y="site", x="yield",
)

### Plotnine

Just omit `coord_flip()` to get a vertical plot.

In [None]:
(
    ggplot(barley.reset_index(), aes(x='site', y='yield'))
    + geom_col()
    + coord_flip()
)

### Altair

Just switch x and y to get a vertical plot.

In [None]:
alt.Chart(barley.reset_index()).mark_bar().encode(
    y='site',
    x='yield',
).properties(
    width=alt.Step(40)  # controls width of bar.
)

## Grouped bar chart



In [None]:
barley = data.barley()
barley = pd.DataFrame(barley.groupby(['site', 'year'])['yield'].sum()).reset_index()
barley.head()

### Matplotlib

In [None]:
labels = barley['site'].unique()
y = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
ax.barh(y - width/2, barley.loc[barley['year'] == 1931, 'yield'], width, label='1931')
ax.barh(y + width/2, barley.loc[barley['year'] == 1932, 'yield'], width, label='1932')

# Add axis labels and custom y-axis tick labels
ax.set_xlabel('Yield')
ax.set_yticks(y)
ax.set_yticklabels(labels)
ax.legend(frameon=False)
plt.show()

### Seaborn

In [None]:
sns.catplot(
    data=barley,
    kind="bar",
    y="site", x="yield",
    hue="year"
)

### Plotnine

In [None]:
(
    ggplot(barley, aes(x='site', y='yield', fill='factor(year)'))
    + geom_col(position='dodge')
    + coord_flip()
)

### Altair


In [None]:
alt.Chart(barley.reset_index()).mark_bar().encode(
    y='year:O',
    x='yield',
    color='year:N',
    row='site:N'
).properties(
    width=alt.Step(40)  # controls width of bar.
)

## Stacked bar chart



### Matplotlib 

In [None]:
labels = barley['site'].unique()
y = np.arange(len(labels))  # the label locations
width = 0.35  # the width (or height) of the bars

fig, ax = plt.subplots()
ax.barh(y, barley.loc[barley['year'] == 1931, 'yield'], width, label='1931')
ax.barh(y, barley.loc[barley['year'] == 1932, 'yield'], width, label='1932', left=barley.loc[barley['year'] == 1931, 'yield'])

# Add axis labels and custom y-axis tick labels
ax.set_xlabel('Yield')
ax.set_yticks(y)
ax.set_yticklabels(labels)
ax.legend(frameon=False)
plt.show()

### Seaborn

As far as I know, there's no easy way of doing this.
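
One workaround sidesteps **seaborn** and uses the **pandas** plotting shortcut instead, after pivoting so that each year becomes a column. This is a sketch with made-up yields; the `toy` frame is a hypothetical stand-in for `barley`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the grouped barley data; yields are made up
toy = pd.DataFrame({
    "site": ["Crookston", "Crookston", "Duluth", "Duluth"],
    "year": [1931, 1932, 1931, 1932],
    "yield": [40.0, 30.0, 25.0, 20.0],
})

# One row per site, one column per year
wide = toy.pivot(index="site", columns="year", values="yield")

fig, ax = plt.subplots()
wide.plot.barh(stacked=True, ax=ax)
ax.set_xlabel("Yield")
ax.legend(title="year", frameon=False)
```

Any seaborn styling that has been set will still apply, because the chart is drawn on an ordinary matplotlib `Axes`.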

### Plotnine



In [None]:
(
    ggplot(barley, aes(x='site', y='yield', fill='factor(year)'))
    + geom_col()
    + coord_flip()
)

### Altair

In [None]:
alt.Chart(barley.reset_index()).mark_bar().encode(
    y='site',
    x='yield',
    color='year:N',
).properties(
    width=alt.Step(40)  # controls width of bar.
)

## Kernel density estimate

We'll use the diamonds dataset to demonstrate this.

In [None]:
diamonds = sns.load_dataset("diamonds").sample(1000)
diamonds.head()

### Matplotlib

Technically, there is a way to do this, but it's pretty inelegant if you want a quick plot. That's because **matplotlib** doesn't do the density estimation itself. [Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html) has a nice example but, as it relies on a few extra libraries, I won't reproduce it here.
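
For a rough idea of what is involved, here is a minimal sketch that does the estimation by hand with **numpy** and then draws it with **matplotlib**. It assumes a Gaussian kernel and Silverman's rule-of-thumb bandwidth, which is one common choice rather than what any particular library does; `gaussian_kde_1d` is a hypothetical helper and the data are simulated.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Silverman's rule of thumb for the bandwidth (an assumed choice)
    iqr = np.subtract(*np.percentile(samples, [75, 25]))
    h = 0.9 * min(samples.std(ddof=1), iqr / 1.34) * n ** (-1 / 5)
    # Average of normal densities centred on each sample
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(10)
samples = rng.lognormal(mean=-0.5, sigma=0.5, size=500)  # simulated data
grid = np.linspace(samples.min() - 1, samples.max() + 1, 400)
density = gaussian_kde_1d(samples, grid)

fig, ax = plt.subplots()
ax.plot(grid, density)
ax.fill_between(grid, density, alpha=0.5)
ax.set_xlabel("Simulated values")
ax.set_ylabel("Density")
```

Splitting by a category such as `cut` would mean repeating the estimate for each group, which is the bookkeeping that the other libraries do for you.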

### Seaborn



In [None]:
sns.displot(diamonds,
            x="carat", kind="kde", hue='cut',
            fill=True);

### Plotnine



In [None]:
(
   ggplot(diamonds, aes(x='carat', fill = 'cut', colour = 'cut')) +
   geom_density(alpha=0.5)
)

### Altair

In [None]:
alt.Chart(diamonds).transform_density(
    density='carat',
    as_=['carat', 'density'],
    groupby=['cut']
).mark_area(fillOpacity=0.5).encode(
    x='carat:Q',
    y='density:Q',
    color='cut:N',
)

## Histogram or probability density function

For this, let's go back to the penguins dataset.

In [None]:
penguins = sns.load_dataset("penguins")
penguins.head()

### Matplotlib

The `density=` keyword parameter decides whether to create counts or a probability density function.

In [None]:
fig, ax = plt.subplots()
ax.hist(penguins['flipper_length_mm'], bins=30, density=True, edgecolor='k')
ax.set_xlabel('Flipper length (mm)')
ax.set_ylabel('Probability density')
fig.tight_layout()
plt.show()
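
To see what `density=True` changes, the sketch below uses `np.histogram`, which takes the same `bins` and `density` arguments. The data are simulated stand-ins for flipper lengths, not the penguins data.

```python
import numpy as np

rng = np.random.default_rng(10)
lengths = rng.normal(loc=200, scale=14, size=340)  # simulated, not real data

counts, edges = np.histogram(lengths, bins=30)
density, _ = np.histogram(lengths, bins=30, density=True)
widths = np.diff(edges)

# Counts add up to the number of observations...
total_count = counts.sum()
# ...while densities are scaled so that the bar areas add up to one
total_area = (density * widths).sum()
```

In other words, each density value is the bin's count divided by the number of observations times the bin width, so the shape of the histogram is unchanged and only the vertical scale differs.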

### Seaborn

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm", bins=30, stat='density');

### Plotnine



In [None]:
(
    ggplot(penguins, aes(x='flipper_length_mm', y='stat(density)'))
    + geom_histogram(bins=30)  # specify the number of bins
)

### Altair



In [None]:
alt.Chart(penguins).mark_bar().encode(
    alt.X("flipper_length_mm:Q", bin=True),
    y='count()',
)

## Marginal histograms



### Matplotlib

[Jake VanderPlas's excellent notes](https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html) have a great example of this, but now there's an easier way to do it with Matplotlib's new `constrained_layout` options.

In [None]:
fig = plt.figure(constrained_layout=True)
# Create a layout with 3 panels in the given ratios
axes_dict = fig.subplot_mosaic([['.', 'histx'], ['histy', 'scat']],
                         gridspec_kw={'width_ratios': [1, 7],
                                      'height_ratios': [2, 7]})
# Glue all the relevant axes together
axes_dict['histy'].invert_xaxis()
axes_dict['histx'].sharex(axes_dict['scat'])
axes_dict['histy'].sharey(axes_dict['scat'])
# Plot the data
axes_dict['scat'].scatter(penguins['bill_length_mm'], penguins['bill_depth_mm'])
axes_dict['histx'].hist(penguins['bill_length_mm'])
axes_dict['histy'].hist(penguins['bill_depth_mm'], orientation='horizontal');

### Seaborn

In [None]:
sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm");

### Plotnine

I couldn't find an easy way to do this in plotnine but you can make rug plots, which have some similarities in terms of information conveyed.

In [None]:
(ggplot(penguins, aes(x='bill_length_mm', y='bill_depth_mm')) +
  geom_point() +
  geom_rug())

### Altair

This is a bit fiddly.

In [None]:
base = alt.Chart(penguins)

xscale = alt.Scale(domain=(20, 60))
yscale = alt.Scale(domain=(10, 30))

area_args = {'opacity': .5, 'interpolate': 'step'}

points = base.mark_circle().encode(
   alt.X('bill_length_mm', scale=xscale),
   alt.Y('bill_depth_mm', scale=yscale)
)

top_hist = base.mark_area(**area_args).encode(
    alt.X('bill_length_mm:Q',
          # when using bins, the axis scale is set through
          # the bin extent, so we do not specify the scale here
          # (which would be ignored anyway)
          bin=alt.Bin(maxbins=30, extent=xscale.domain),
          stack=None,
          title=''
         ),
    alt.Y('count()', stack=None, title='')
).properties(height=60)

right_hist = base.mark_area(**area_args).encode(
    alt.Y('bill_depth_mm:Q',
          bin=alt.Bin(maxbins=30, extent=yscale.domain),
          stack=None,
          title='',
         ),
    alt.X('count()', stack=None, title=''),
).properties(width=60)

top_hist & (points | right_hist)

## Heatmap