In [None]:
%matplotlib inline

# Matplotlib tutorial

Before going through this tutorial, read this [short description](http://matplotlib.org/faq/usage_faq.html#general-concepts) of Matplotlib.

There are two different interfaces: a MATLAB-like interface and an object-oriented interface. I prefer the object-oriented interface and will use it here, although it is easy to switch to the other one if you choose. In general, a reasonable approach is to use the MATLAB-like interface for interactive plotting from IPython and the object-oriented interface for scripting (but again, I typically just use the object-oriented interface).
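As a rough illustration of the difference between the two interfaces, the sketch below draws the same sine curve twice. The data and the variable names (`fig_pyplot`, `fig_oo`) are made up for this example and are not used elsewhere in the tutorial.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)

# MATLAB-like (pyplot) interface: functions act on the "current" figure and axes.
plt.figure()
plt.plot(x, np.sin(x))
plt.xlabel('x')
plt.title('pyplot interface')
fig_pyplot = plt.gcf()  # grab the current figure so we can refer to it later

# Object-oriented interface: call methods on explicit Figure and Axes objects.
fig_oo, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlabel('x')
ax.set_title('object-oriented interface')
```

The two figures should look the same. The object-oriented version simply makes it explicit which axes each command modifies, which tends to matter once a figure has more than one subplot.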

When you are trying to find your way around Matplotlib, a good place to start is the [gallery](http://matplotlib.org/gallery.html), where you can scan through images to find something similar to what you are trying to produce. To explore the difference between the object-oriented interface and the MATLAB-like interface, you can compare code from the [api](http://matplotlib.org/gallery.html#api) and [pylab](http://matplotlib.org/gallery.html#pylab_examples) examples.

To get started, we'll need to import `matplotlib.pyplot` as well as `numpy` and `pandas`, which we'll use to work with the data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

For many of the plots below, we'll work with the data set from the last Pandas tutorial.

In [None]:
dat = pd.read_csv('../data/array.csv')

In [None]:
dat.info()

In [None]:
dat.head()

In [None]:
print('The data set contains {} genes, {} brains, and {} regions'.format(dat.gene.nunique(),
                                                                         dat.brain.nunique(),
                                                                         dat.region.nunique()))

## Creating figures and axes

The easiest way to create a figure is `plt.subplots`. A figure can consist of several different subplots, each of which has its own axes. At first, we'll just focus on the single-axes case. If no arguments are passed, `plt.subplots` returns a figure object and a single axes object.

    fig, ax = plt.subplots()

Both the figure and axes objects have a lot of methods, so just pressing `fig.<TAB>` or `ax.<TAB>` can be pretty overwhelming. The main plotting commands can be accessed under `ax`. Some popular ones include `scatter`, `plot`, `bar`, `boxplot`, and `hist`.

## Histograms

First, we'll create a histogram using the "value" column from `dat`. Each plot method returns objects that you may want to customize further or use for something else. In the case of a histogram, these are the heights of the bars, the bin edges, and the bar objects themselves. If you don't need them for anything, you don't have to assign them a name (and, because they're not assigned, they'll show up in the output of IPython).

In [None]:
fig, ax = plt.subplots()
ax.hist(dat.value)
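To make the return values of `hist` concrete, here is a minimal sketch that unpacks them. It uses synthetic values as a stand-in for `dat.value` (an assumption, so that the example runs on its own), and the names `counts`, `bin_edges`, and `patches` are just illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for dat.value (an assumption; not the tutorial's data).
rng = np.random.default_rng(0)
values = rng.normal(size=500)

fig, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(values)

print(counts.sum())    # total number of observations that were binned
print(len(bin_edges))  # one more edge than there are bins
print(len(patches))    # one bar (Rectangle) per bin
```

Assigning the return values also keeps them from being echoed in the notebook output.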

Change the bin size.

Add grid lines.

In [None]:
fig, ax = plt.subplots()
ax.scatter(mean_exp.CBC, mean_exp.DFC, alpha=0.7, edgecolor='none')

ax.grid(True)

ax.set_xlabel('CBC')
ax.set_ylabel('DFC')

Keep grid but reduce number of ticks.

In [None]:
fig, ax = plt.subplots()
ax.scatter(mean_exp.CBC, mean_exp.DFC, alpha=0.7, edgecolor='none')

ax.grid(True)
ax.locator_params(nbins=5)

ax.set_xlabel('CBC')
ax.set_ylabel('DFC')

## Bar plots

For the bar plots, we'll use the gene that has the maximum expression difference between the two regions.

In [None]:
max_diff_gene = mean_by_reg.abs_diff.idxmax()
max_diff = mean_by_reg.loc[max_diff_gene, ['CBC', 'DFC']]

In [None]:
max_diff

In [None]:
fig, ax = plt.subplots()
xlocs = np.arange(max_diff.size)
ax.bar(xlocs, max_diff)

Change width of bar and move to center.

In [None]:
fig, ax = plt.subplots()
xlocs = np.arange(max_diff.size)
ax.bar(xlocs, max_diff, width=0.6, align='center')

Relabel x axis and only keep two x-tick locations.

For the scatter plots, we'll use the mean expression of each gene for the two regions present.

In [None]:
mean_exp = dat.groupby(['gene', 'region'])['value'].mean().unstack()

In [None]:
mean_exp.head()

In [None]:
fig, ax = plt.subplots()

ax.scatter(mean_exp.CBC, mean_exp.DFC)
ax.set_xlabel('CBC')
ax.set_ylabel('DFC')

Make dots semi-transparent and remove edgecolor.

In [None]:
fig, ax = plt.subplots()
ax.scatter(mean_exp.CBC, mean_exp.DFC, alpha=0.7, edgecolor='none')
ax.set_xlabel('CBC')
ax.set_ylabel('DFC')

Change color by the difference between CBC and DFC (which is a bit silly, considering that is already reflected in the position).

In [None]:
mean_by_reg = dat.groupby(['gene', 'region'])['value'].mean().unstack()
mean_by_reg['abs_diff'] = np.abs(mean_by_reg.CBC - mean_by_reg.DFC)

In [None]:
fig, ax = plt.subplots()
ax.scatter(mean_exp.CBC, mean_exp.DFC, edgecolor='none', c=mean_by_reg.abs_diff)
ax.set_xlabel('CBC')
ax.set_ylabel('DFC')
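When points are colored by a value, a colorbar can help in reading the colors. The sketch below is one possible way to add one, using `fig.colorbar` with the object returned by `scatter`. The data are synthetic stand-ins for the CBC and DFC means (an assumption), and `cbc`, `dfc`, and `abs_diff` are hypothetical names.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the per-gene means in the two regions (assumption).
rng = np.random.default_rng(1)
cbc = rng.normal(8, 2, size=200)
dfc = cbc + rng.normal(0, 1, size=200)
abs_diff = np.abs(cbc - dfc)

fig, ax = plt.subplots()
# scatter returns a collection that remembers the values passed as `c`
points = ax.scatter(cbc, dfc, c=abs_diff, edgecolor='none')

# The colorbar is drawn in its own axes next to `ax`
cbar = fig.colorbar(points, ax=ax)
cbar.set_label('|CBC - DFC|')

ax.set_xlabel('CBC')
ax.set_ylabel('DFC')
```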

Change the tick locations.

In [None]:
fig, ax = plt.subplots()
ax.scatter(mean_exp.CBC, mean_exp.DFC, alpha=0.7, edgecolor='none')

ax.locator_params(nbins=5)  ## reduce the number of ticks

ax.set_xlabel('CBC')
ax.set_ylabel('DFC')

Remove the ticks completely.

Move legend to the top.

In [None]:
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label='sine')
ax.plot(x, np.cos(x), label='cosine')
ax.legend(bbox_to_anchor=(0.5, 1.1), ncol=2,
          loc='center', frameon=False)

Fill area between curves.

In [None]:
fig, ax = plt.subplots()
xsin = np.sin(x)

ax.plot(x, xsin)

ax.fill_between(x, 0, xsin, alpha=0.3)
ax.set_xlim(0, 6)

## Box plots

For the box plots, we'll use the expression values for the gene that has the maximum difference between the two regions.

In [None]:
max_dat = dat[dat.gene == max_diff_gene]

In [None]:
max_dat

The `boxplot` method accepts a sequence of vectors, so below we make a list containing the expression values for each region.

In [None]:
regions = max_dat.region.unique()

## make a list that contains value column for each region
## [ [CBC values], [DFC values] ]
region_vals = [max_dat[max_dat.region == reg].value for reg in regions]

In [None]:
fig, ax = plt.subplots()
ax.boxplot(region_vals)
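Like `hist`, `boxplot` returns something useful: a dictionary that maps element names to lists of artists. The following sketch prints what it contains for two groups. The values in `region_vals` here are random stand-ins (an assumption), not the expression data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two synthetic groups standing in for the per-region values (assumption).
rng = np.random.default_rng(2)
region_vals = [rng.normal(5, 1, size=30), rng.normal(8, 1, size=30)]

fig, ax = plt.subplots()
bp = ax.boxplot(region_vals)

# Each key maps to a list of artists, e.g. one box and two whiskers per group.
for name, artists in bp.items():
    print(name, len(artists))
```

These lists are what you iterate over when you want to restyle individual parts of the plot.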

Set the x label names.

In [None]:
fig, ax = plt.subplots()
ax.boxplot(region_vals)
ax.set_xticklabels(regions)

ax.set_ylabel('Expression value')
ax.set_xlabel('Region')

Change box colors and increase width.

In [None]:
fig, ax = plt.subplots()
ax.hist(dat.value, bins=25)

Adjust colors.

In [None]:
fig, ax = plt.subplots()
ax.hist(dat.value, bins=25,
        edgecolor='w', facecolor='0.5')

Set axis labels.

In [None]:
fig, ax = plt.subplots()
ax.hist(dat.value, bins=25,
        edgecolor='w', facecolor='0.5')
ax.set_xlabel('Expression value')
ax.set_ylabel('Frequency')

Normalize the histogram.

In [None]:
fig, ax = plt.subplots()
ax.hist(dat.value, bins=25,
        edgecolor='w', facecolor='0.5',
        density=True)  # `density` replaces the removed `normed` argument
ax.set_xlabel('Expression value')
ax.set_ylabel('Density')
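To see what normalization does, the sketch below checks that the bars of a histogram drawn with `density=True` enclose a total area of 1. It uses synthetic values rather than `dat.value` (an assumption), and `heights` and `area` are names chosen for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic values standing in for dat.value (assumption).
rng = np.random.default_rng(3)
values = rng.normal(size=1000)

fig, ax = plt.subplots()
heights, bin_edges, _ = ax.hist(values, bins=25, density=True)

# Area of each bar = height * bin width; the areas should sum to 1.
area = np.sum(heights * np.diff(bin_edges))
print(area)
```

Note that the bar heights themselves do not sum to 1 unless every bin has width 1.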

Create a cumulative histogram.

## Line plots

In [None]:
x = np.linspace(0, 10)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_title('A wave')

In [None]:
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label='sine')
ax.plot(x, np.cos(x), label='cosine')
ax.set_title('Some waves')
ax.legend()

In [None]:
fig, ax = plt.subplots()
bp = ax.boxplot(region_vals, widths=0.5)

## bp is just a dictionary of elements
## that make up the box.
## Iterate through the values and set the color.
## plt.setp is often used instead of a for loop
for elems in bp.values():
    for elem in elems:
        elem.set_color('g')
    
ax.set_xticklabels(regions)

ax.set_ylabel('Expression value')
ax.set_xlabel('Region')
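As mentioned in the comments above, `plt.setp` can take the place of the nested loop. A minimal sketch, again with synthetic data in place of `region_vals` (an assumption), might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for region_vals (assumption).
rng = np.random.default_rng(4)
data = [rng.normal(5, 1, size=30), rng.normal(8, 1, size=30)]

fig, ax = plt.subplots()
bp = ax.boxplot(data, widths=0.5)

# Collect every artist from the dictionary into one flat list,
# then set the color on all of them with a single setp call.
all_artists = [artist for elems in bp.values() for artist in elems]
plt.setp(all_artists, color='g')
```

`plt.setp` accepts a single artist or a list of artists, followed by the properties to set as keyword arguments.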

Fill the box instead and only change the box color.

In [None]:
fig, ax = plt.subplots()
bp = ax.boxplot(region_vals, widths=0.5,
                patch_artist=True)

for box in bp['boxes']:
    box.set_color('0.6')
    
ax.set_xticklabels(regions)

ax.set_ylabel('Expression value')
ax.set_xlabel('Region')

Same as above, but make outliers red dots.

In [None]:
fig, ax = plt.subplots()
bp = ax.boxplot(region_vals, widths=0.5,
                patch_artist=True,
                sym='.')

for box in bp['boxes']:
    box.set_color('0.6')
    
for flier in bp['fliers']:
    flier.set_color('r')
    
ax.set_xticklabels(regions)

ax.set_ylabel('Expression value')
ax.set_xlabel('Region')

## Multiple subplots

Above we have been calling `plt.subplots` without arguments. If we pass `ncols` or `nrows` to this function, it will create several axes objects for us and return them in a NumPy array.

In [None]:
vals1, vals2 = np.random.randn(2, 100)

fig, axes = plt.subplots(ncols=2, figsize=(9, 4))
axes[0].hist(vals1)
axes[1].hist(vals2)
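What `plt.subplots` returns depends on the grid that is requested. The sketch below shows the three common cases. The variable names (`axes_row`, `axes_grid`) are only for illustration.

```python
import matplotlib.pyplot as plt

# No arguments: a single Axes object
fig1, ax = plt.subplots()

# One row, two columns: a 1-D array of Axes
fig2, axes_row = plt.subplots(ncols=2)

# Two rows, three columns: a 2-D array of Axes, indexed as [row, column]
fig3, axes_grid = plt.subplots(nrows=2, ncols=3)

print(axes_row.shape, axes_grid.shape)

# ravel() flattens the grid, which is convenient for looping over all axes
for i, a in enumerate(axes_grid.ravel()):
    a.set_title('Axes {}'.format(i))
```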

Share the y axis.

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(9, 4),
                         sharey=True)
axes[0].hist(vals1)
axes[1].hist(vals2)

Add plot titles and legends as usual.

In [None]:
fig, ax = plt.subplots()
ax.scatter(mean_exp.CBC, mean_exp.DFC, alpha=0.7, edgecolor='none')

ax.tick_params(length=0)
ax.locator_params(nbins=5)

ax.set_xlabel('CBC')
ax.set_ylabel('DFC')

In [None]:
fig, ax = plt.subplots()
xlocs = np.arange(max_diff.size)

ax.bar(xlocs, max_diff, width=0.6, align='center')

ax.set_xticks(xlocs)  # only label where bar is placed
ax.set_xticklabels(max_diff.index)  # use region names instead of numbers

ax.set_xlabel('Region')
ax.set_ylabel('Expression value')

Change figure size, remove x ticks, and reduce number of y ticks.

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(9, 4),
                               sharey=True)
ax1.hist(vals1)
ax2.hist(vals2)

ax1.set_title('Histogram 1')
ax2.set_title('Histogram 2')

ax1.set_ylabel('Frequency')

If the two x axes would have the same label, we might not want to add an x label to each. We can instead add text to the figure to create just one shared label.

In [None]:
fig, ax = plt.subplots()
ax.hist(dat.value, bins=25,
        edgecolor='w', facecolor='0.5',
        cumulative=True)
ax.set_xlabel('Expression value')
ax.set_ylabel('Frequency')

## Scatter plots

In [None]:
fig, ax = plt.subplots(figsize=(5, 5))
xlocs = np.arange(max_diff.size)

ax.bar(xlocs, max_diff, width=0.7, align='center')

ax.locator_params(axis='y', nbins=5)
ax.tick_params(axis='x', length=0)

ax.set_xticks(xlocs)  # only label where bar is placed
ax.set_xticklabels(max_diff.index)  # use region names instead of numbers

ax.set_xlim(-0.5, 1.5)

ax.set_xlabel('Region')
ax.set_ylabel('Expression value')

Tweak colors and line width.

In [None]:
fig, ax = plt.subplots(figsize=(5, 5))
xlocs = np.arange(max_diff.size)

ax.bar(xlocs, max_diff, width=0.7, align='center',
       edgecolor='#006699', facecolor='#006699',
       alpha=0.5,
       linewidth=3)

ax.locator_params(axis='y', nbins=5)
ax.tick_params(axis='x', length=0)

ax.set_xticks(xlocs)  # only label where bar is placed
ax.set_xticklabels(max_diff.index)  # use region names instead of numbers

ax.set_xlim(-0.5, 1.5)

ax.set_xlabel('Region')
ax.set_ylabel('Expression value')

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(9, 4),
                               sharey=True)
ax1.hist(vals1)
ax2.hist(vals2)

ax1.set_title('Histogram 1')
ax2.set_title('Histogram 2')

ax1.set_ylabel('Frequency')

fig.text(0.5, 0, 'Thing',
         ha='center')
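Newer versions of Matplotlib (3.4 and later) also provide `fig.supxlabel` for a shared x label. The sketch below uses it when it is available and falls back to `fig.text` otherwise. The random data and the label text are placeholders (assumptions).

```python
import numpy as np
import matplotlib.pyplot as plt

# Random placeholder data (assumption).
rng = np.random.default_rng(5)
vals1, vals2 = rng.normal(size=(2, 100))

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(9, 4), sharey=True)
ax1.hist(vals1)
ax2.hist(vals2)
ax1.set_ylabel('Frequency')

if hasattr(fig, 'supxlabel'):
    # Available in Matplotlib 3.4 and later
    label = fig.supxlabel('Shared x label')
else:
    # Fallback: place the text manually in figure coordinates
    label = fig.text(0.5, 0, 'Shared x label', ha='center')
```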

Add a title for the figure.

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(9, 4),
                               sharey=True)
ax1.hist(vals1)
ax2.hist(vals2)

ax1.set_ylabel('Frequency')

fig.text(0.5, 0, 'Thing',
         ha='center')
fig.suptitle('Both histograms',
             size='large')

Add more rows.

In [None]:
vals = np.random.randn(4, 100)

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2,
                         sharex=True, sharey=True)

for val, ax in zip(vals, axes.ravel()):
    ax.hist(val)
    ax.locator_params(nbins=4)
    
fig.tight_layout()

## Heat map