In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Matplotlib

<!-- requirement: small_data/flights14.csv -->

`Matplotlib` is a plotting library for Python, and `matplotlib.pyplot` is a layer on top of it that provides a `MATLAB`-like syntax. You can make illustrations ranging from simple line plots to complicated combinations of different plotting primitives.  `Matplotlib` is great for static academic illustrations.  For more interactive plots, check out a package like `Bokeh`.

`Matplotlib` operates with a wide set of default settings for the way things should look (for example, the width of a plotted line), but is also extremely customizable by way of optional arguments and keyword arguments to most of the plotting functions.  As we work through examples, keep an eye out for these arguments.

Let us start by looking at a simple line plot.

### Line plot
Matplotlib can do basic X-Y plots if you give it the `x` and `y` data of equal length.  Here is a plot of a few sample paths of Brownian Motion.

Notice that calling `plt.plot` multiple times results in multiple lines on the same figure.  Call `plt.figure` to create a new figure.

In [None]:
# Line plot example
xs = np.random.randn(5, 100)

plt.title("A few paths of Brownian Motion")
bms = xs.cumsum(1)
for bm in bms:
    plt.plot(bm) # [0,...,N-1] is used for x when only y data is specified 

Cool example!  But we should really always have labels on our plots, so let's add labels for the x and y axes.  We can do this with the `xlabel` and `ylabel` commands.

In [None]:
# Line plot example
xs = np.random.randn(5, 100)

plt.title("A few paths of Brownian Motion")
bms = xs.cumsum(1)
for bm in bms:
    plt.plot(bm)
plt.xlabel('step')
plt.ylabel('displacement');

Our plot is getting much nicer, but we often want to add labels to the individual lines and gather these into a legend.  The easiest way to do this is to provide a `label` argument to each `plt.plot` command and then call the `plt.legend()` function, which finds all the labels for the plotted lines and formats them into a legend.

In [None]:
# Line plot example
xs = np.random.randn(5, 100)

plt.title("A few paths of Brownian Motion")
bms = xs.cumsum(1)
for i, bm in enumerate(bms):
    plt.plot(bm, label='path {}'.format(i))
plt.xlabel('step')
plt.ylabel('displacement')
plt.legend(loc='best')
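As a small sketch of the customization mentioned earlier, the following draws a single path with explicit styling. `color`, `linewidth`, and `linestyle` are standard `plt.plot` keyword arguments; the particular values are arbitrary choices for illustration. It also calls `plt.figure` first so that the path lands in a fresh figure.

```python
import matplotlib.pyplot as plt
import numpy as np

# Start a new figure so this plot does not share axes with an earlier one
fig = plt.figure()

# One Brownian path, styled through keyword arguments (values chosen for illustration)
path = np.random.randn(100).cumsum()
lines = plt.plot(path, color='tab:red', linewidth=2.5, linestyle='--',
                 label='styled path')

plt.title("A styled path of Brownian Motion")
plt.xlabel('step')
plt.ylabel('displacement')
plt.legend(loc='best');
```

`plt.plot` returns a list of the line objects it created, which is why the result is kept in `lines`; the same properties can be inspected or changed on those objects afterwards.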

### Scatter plot
`Matplotlib` can generate 2D scatter plots. Just like the `plt.plot` function, the `plt.scatter` function takes in two arrays (or lists) of equal length and plots them as `x,y` coordinates.  We can also pass in a few other parameters, namely `c`, an array that controls the color of each point, and `s`, an array that controls the size of each point.

In [None]:
# Generate randomly sampled dots within the unit circle, with gamma-distributed marker sizes
N = 250
A = 20
xo, yo = np.random.uniform(low=-1, high=1, size=N), np.random.uniform(low=-1, high=1, size=N)
so = A * np.random.gamma(4.5, 1.0, size=N)

# Keep only the points that fall inside the unit circle
inside = xo**2 + yo**2 < 1
x = xo[inside]
y = yo[inside]
s = so[inside]

# Scatter plot, with _s_izes and translucent circles
plt.scatter(x, y, s=s, alpha=0.5)
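The `c` argument is described above but not used in the example. A minimal sketch of how it might be used, assuming we want to color each point by its distance from the origin (that choice, the `viridis` color map, and the variable names are all illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

# Sample points uniformly in the square, then keep those inside the unit circle
N = 250
px = np.random.uniform(low=-1, high=1, size=N)
py = np.random.uniform(low=-1, high=1, size=N)
inside = px**2 + py**2 < 1
px, py = px[inside], py[inside]

# One color value per point: here, the distance from the origin
radius = np.sqrt(px**2 + py**2)

plt.figure()
points = plt.scatter(px, py, c=radius, cmap='viridis', s=40, alpha=0.7)
plt.colorbar(points, label='distance from origin')
plt.xlabel('x')
plt.ylabel('y');
```

Passing the object returned by `plt.scatter` to `plt.colorbar` ties the color bar to the values given in `c`.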

### Histograms
`Matplotlib` can also plot histograms directly from raw data.  They are a useful way of looking at distributions of data.  A histogram can be made with the `plt.hist` function.  This function returns three things: the histogram values (the count in each bin), the edges of the bins, and the patch or list of patches used to draw the histogram.

Let's try this with a `gamma` distribution.

In [None]:
data = np.random.gamma(4.5, 1.0, 10000)
n, bins, patches = plt.hist(data, bins=50, alpha=.5)
plt.title("Gamma(4.5, 1.0) distribution, 10000 samples")
plt.xlabel("Value")
plt.ylabel("Occurrences per 10,000");

Now we can look at the counts and the bin edges:

In [None]:
n[:5], bins[:5]
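Note that `bins` holds the bin *edges*, so it is one element longer than the array of counts. A short sketch of that relationship, using `np.histogram`, which returns counts and edges in the same form without drawing anything (the variable names are placeholders):

```python
import numpy as np

samples = np.random.gamma(4.5, 1.0, 10000)
counts, edges = np.histogram(samples, bins=50)

# 50 bins are delimited by 51 edges
print(len(counts), len(edges))

# Every sample falls in exactly one bin, so the counts sum to the sample size
print(counts.sum())

# Bin centers are handy when plotting something on top of a histogram
centers = 0.5 * (edges[:-1] + edges[1:])
```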

One reason we might want to know the bins is to plot two histograms with the same bins:

In [None]:
bins = np.linspace(0, 20, 50)
for gamma in [3, 4.5]:
    data = np.random.gamma(gamma, 1.0, 10000)
    n, bins, patches = plt.hist(data, bins=bins, 
                                alpha=.5, 
                                density=True,
                                label="Gamma({}, 1.0)".format(gamma))
plt.title("Gamma distribution, 10000 samples")
plt.legend(loc='best')
plt.xlabel("Value")
plt.ylabel("Probability density");

You might have noticed that we have been adjusting the opacity of the histograms with the `alpha` keyword argument. Histograms tend to look better when they are not fully opaque, particularly when several of them share the same plot.

### Images
`Matplotlib` can plot arrays as 2D images, using a color map that you specify.  Conventionally, images are represented with the origin at the upper left corner rather than the lower left, and `plt.imshow` follows that convention by default, so watch out!

In [None]:
a = np.arange(-4, 4, 0.01)

x, y = np.meshgrid(a, a)
assert(x.shape == (len(a), len(a)))
r = np.sqrt(x ** 2 + y ** 2)
plt.imshow(r, cmap=plt.cm.viridis)
plt.colorbar()
plt.title("radius")
plt.xlabel("x")
plt.ylabel("y")
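In the plot above the tick labels are pixel indices and row 0 is drawn at the top. If the usual mathematical orientation is preferred, one option is the `origin` and `extent` arguments of `plt.imshow`. A sketch under the assumption that the axes should show the same coordinate range as the grid (variable names are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

coords = np.arange(-4, 4, 0.01)
gx, gy = np.meshgrid(coords, coords)
dist = np.sqrt(gx ** 2 + gy ** 2)

plt.figure()
# origin='lower' puts row 0 at the bottom; extent maps pixel indices to data coordinates
img = plt.imshow(dist, cmap=plt.cm.viridis, origin='lower',
                 extent=[coords[0], coords[-1], coords[0], coords[-1]])
plt.colorbar(img)
plt.title("radius")
plt.xlabel("x")
plt.ylabel("y");
```

For this radially symmetric image the picture looks the same either way; the difference shows up in the tick labels and matters for any array that is not symmetric top to bottom.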

### Exercises
1. Generate an `np.array` of normally distributed samples and plot them using the histogram function.  Overlay a line plot of the standard normal PDF.  What happens as the number of random samples increases?

## Matplotlib and Pyplot
You'll notice that all of the plots created thus far started with `plt.` That references this import at the top of the notebook:

```python
import matplotlib.pyplot as plt
```

Pyplot is a special plotting "state machine" created for Matplotlib to simplify the creation of plots. Basically, it has an internal concept of the current chart being operated on by the set of methods made available to you. It is a wrapper around Matplotlib's object oriented plotting library.

For the previous plot, we could have created it like this:

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.imshow(r, cmap=plt.cm.viridis)
fig.colorbar(ax.get_images()[0])
ax.set_title("radius")
ax.set_xlabel("x")
ax.set_ylabel("y")

This approach requires more typing, but it exposes some of the complexity hidden by `pyplot`. There are figure and axes objects, and each has methods that contribute to the result.

One approach is not necessarily better than the other, but it is important to know that there is a `pyplot` state machine that creates plots and there is a separate object oriented approach for creating plots.

Later in your Python adventures you will see sample Matplotlib code on the Internet and will want to use it to add features to your data visualizations. The sample code might not easily fit the code you have already written if one is using `pyplot` and the other is not.

To help you with this, `pyplot` provides the `gcf` ("get current figure") and `gca` ("get current axes") functions. You can use these to get `pyplot`'s current figure or axes objects.
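A minimal sketch of mixing the two styles: the plot is started through `pyplot`, and the objects it tracks are then retrieved and modified directly. The title, label, and figure size are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Draw with the pyplot state machine...
plt.figure()
plt.plot(np.random.rand(10))

# ...then grab the objects pyplot is tracking and continue in the object-oriented style
fig = plt.gcf()   # current Figure
ax = plt.gca()    # current Axes
ax.set_title("Labelled through the Axes object")
ax.set_xlabel("index")
fig.set_size_inches(6, 3)
```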

### Matplotlib subplots
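Frequently you will want two or more plots in the same figure. You can do that with the `plt.subplot` function.

`plt.subplot` takes three integers: the number of rows, the number of columns, and the index of the current subplot (counting from 1, left to right and then top to bottom). When all three are single digits, they can also be written as one 3-digit number such as `221`, where the hundreds digit is the number of rows, the tens digit is the number of columns, and the ones digit is the current subplot. You call the function repeatedly to move from one subplot to the next.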

In [None]:
# create a 2x2 subplot grid, and prepare to plot data into the first subplot.
plt.subplot(2, 2, 1)
plt.title('Upper Left')
plt.plot(np.random.rand(10))

# move to the second subplot
plt.subplot(2, 2, 2)
plt.title('Upper Right')
plt.plot(np.random.rand(10))

# move to the third
plt.subplot(2, 2, 3)
plt.title('Lower Left')
plt.plot(np.random.rand(10))

# move to the last subplot
plt.subplot(2, 2, 4)
plt.title('Lower Right')
plt.plot(np.random.rand(10))

These plots look a bit squeezed together. To tidy them up, we can ask `Matplotlib` to adjust the spacing so that titles and tick labels no longer overlap, using the `plt.tight_layout()` function.

In [None]:
# create a 2x2 subplot grid, and prepare to plot data into the first subplot.
plt.subplot(2, 2, 1)
plt.title('Upper Left')
plt.plot(np.random.rand(10))

# move to the second subplot
plt.subplot(2, 2, 2)
plt.title('Upper Right')
plt.plot(np.random.rand(10))

# move to the third
plt.subplot(2, 2, 3)
plt.title('Lower Left')
plt.plot(np.random.rand(10))

# move to the last subplot
plt.subplot(2, 2, 4)
plt.title('Lower Right')
plt.plot(np.random.rand(10))

plt.tight_layout()
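The three-digit form described above is interchangeable with passing the three integers separately. A brief sketch that builds the same 2x2 grid in a loop (the loop and the `titles` list are just one way to cut down on repetition):

```python
import matplotlib.pyplot as plt
import numpy as np

plt.figure()
titles = ['Upper Left', 'Upper Right', 'Lower Left', 'Lower Right']

# 221 means: 2 rows, 2 columns, subplot number 1 (and so on up to 224)
for code, title in zip([221, 222, 223, 224], titles):
    plt.subplot(code)
    plt.title(title)
    plt.plot(np.random.rand(10))

plt.tight_layout()
grid_axes = plt.gcf().axes
```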

## Matplotlib plots from Pandas
The Pandas library comes with built-in plotting tools. Data stored in a DataFrame can be plotted just as easily as the previous examples.

In [None]:
import pandas as pd

Let's make a `DataFrame` with some random data that we can then use to make some nice plots.

In [None]:
test_data = pd.DataFrame(np.random.rand(10, 2),
                      index=np.arange(10),
                      columns=['A', 'B'])
test_data

`DataFrame` objects have a few plotting methods. Each of them makes some assumptions about the type of data generally stored in a `DataFrame` (namely structured, tabular data) and can therefore make a decent-looking plot with effectively zero configuration.

For example, when we call the `plot` method, notice that it automatically adds a legend with the column names.

In [None]:
test_data.plot()

By default, it assumes you would like to see a line chart. Other choices are available.  Perhaps we want a bar chart:

In [None]:
test_data.plot.bar()

Just like in `Matplotlib`, we can pass parameters to the `bar` method to adjust the chart. Here, let's use a stacked chart and color the bars red and blue.

In [None]:
test_data.plot.bar(stacked=True, color=['red', 'blue'], legend=False)
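Because the Pandas plotting methods are built on `Matplotlib`, they return a `Matplotlib` axes object and accept an `ax` argument, so the two libraries can be mixed freely. A rough sketch, using a stand-in frame called `demo_data` and `plt.subplots` (a helper not used elsewhere in this notebook) to create two side-by-side axes:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# demo_data is a stand-in frame, built the same way as test_data above
demo_data = pd.DataFrame(np.random.rand(10, 2), columns=['A', 'B'])

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Pass ax= to tell Pandas which axes to draw on
demo_data.plot(ax=left)
returned = demo_data.plot.bar(ax=right, stacked=True)

# The return value is a regular Matplotlib axes object, so the usual methods apply
returned.set_ylabel('A + B')
left.set_title('line')
right.set_title('stacked bar')
fig.tight_layout()
```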

These plots can be useful for visually inspecting your data and general exploratory data analysis.

As we have already seen, a histogram is particularly helpful for understanding the range and distribution of your data. Outliers will be visible, as well as potential data errors.

In [None]:
test_hist = pd.DataFrame(np.random.beta(0.6, 0.5, size=5000),
                      columns=['Beta(0.6, 0.5)'])

test_hist.hist(bins=100, color='red');

One of the great features of Pandas and plotting is how it handles dates.

In [None]:
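# `pandas.util.testing` is deprecated and no longer available in recent Pandas
# releases, so build the same kind of frame directly: four random-walk columns
# indexed by 50 consecutive business days.
time_df = pd.DataFrame(np.random.randn(50, 4),
                       index=pd.date_range('2000-01-03', periods=50, freq='B'),
                       columns=list('ABCD')).cumsum()

time_df.head()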

This DataFrame has dates in the index. Pandas tries to figure out an intelligent way of arranging the x axis so the labels look pretty.

In [None]:
time_df.plot()

### `Matplotlib` in and out of `Jupyter` Notebooks

You may have noticed that our plots appear directly in the notebook, which is extremely useful for most forms of exploratory data science.  This is the effect of the `%matplotlib inline` magic, which we use in the top cell of most of our notebooks.  Often this is enough, but when we want to use `Matplotlib` outside of a notebook, the usage is slightly different.  Instead of figures appearing at the end of a cell, they appear only after the `plt.show()` function is executed.  This creates a separate window with an interactive plot.  To try it, run the following code in a Python interpreter within a desktop environment (sadly not our `Jupyterhub` environment).

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 20)
y = x ** 2
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```

Both in and out of a notebook we can save the figures we create to disk, through the `plt.savefig` command.  This takes in a string for the location on disk to place the saved figure and will inspect the extension to use the correct format.  For example, let's run the previous code and save it as a `png` file.

In [None]:
x = np.linspace(0, 10, 20)
y = x ** 2
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('example_plot.png')

Let's look at this image.

In [None]:
from IPython import display
display.Image('example_plot.png')
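Since the format is picked from the file extension, writing the same figure in several formats only requires changing the file name. A small sketch; the temporary directory and the list of extensions are assumptions made to keep the example tidy, and the formats actually available depend on the installed backend.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

xv = np.linspace(0, 10, 20)
fig = plt.figure()
plt.plot(xv, xv ** 2)
plt.xlabel('x')
plt.ylabel('y')

# Write into a temporary directory so the example does not clutter the working directory
out_dir = tempfile.mkdtemp()
saved = []
for ext in ['png', 'pdf', 'svg']:
    path = os.path.join(out_dir, 'example_plot.' + ext)
    fig.savefig(path)          # format is inferred from the extension
    saved.append(path)

# The canvas can report which formats the current backend supports
supported = fig.canvas.get_supported_filetypes()
print(sorted(supported))
```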

`plt.savefig` can also write a host of other file formats, such as `jpg` and `pdf`.  Whatever the format, one common issue can be seen in the following plot, where we rotate the tick labels so that they are readable.

In [None]:
def gen_plot():
    labels = ['carl baggins', 
              'patrick callahan', 
              'mitt clinton', 
              'donald jackson',
              'mickey mouse',
              'sally mcfarland',
              'hope walker',
              'cynthia smith']
    data = [25, 21, 40, 10, 15, 50, 20, 43]
    plt.bar(labels, data)
    plt.xticks(np.arange(len(labels)), labels, rotation=40)

gen_plot()
plt.savefig('bar.png', dpi=250)

If we look at the image, we can see some of the text is cut off.

In [None]:
display.Image('bar.png')

In order to fix this, we can generate the plot again and then save it with the `bbox_inches` keyword argument.

In [None]:
gen_plot()
plt.savefig('bar_fixed.png', dpi=250, bbox_inches='tight')

In [None]:
display.Image('bar_fixed.png')

The author has found this trick useful many times.

## Seaborn

`Seaborn` is a higher-level library built on top of `Matplotlib` and `pandas` for creating statistical visualizations. In many people's opinion, it also gives basic `Matplotlib` functionality a better set of default style parameters.  The `sns.set()` command at the beginning of the notebook sets these parameters.

Let's start with some data about airport delays with a few categorical variables.  We can use a `FacetGrid` to plot the data in a grid with one plot for each combination of category values.  We will give the `FacetGrid` `col` and `row` arguments to specify which categorical variable should define the columns and rows respectively.

Just creating the `FacetGrid` creates a plotting grid, but if we want to put some data on the plots, we need to specify which plotting function we want applied to each subset of the data (filtered by row and column).  We can do this by using the `map` method of the `FacetGrid`.

In [None]:
def get_quarter(x):
    """Map a month number (1-12) to its calendar quarter (1-4)."""
    if x < 4:
        return 1
    elif x < 7:
        return 2
    elif x < 10:
        return 3
    else:
        return 4
df = pd.read_csv('small_data/flights14.csv')
# restrict to largest carriers
df = df[df['carrier'].isin(df['carrier'].value_counts().index[:4])]
df['quarter'] = df['month'].apply(get_quarter)
df.head()
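As an aside, the month-to-quarter mapping can also be computed without `apply`. A sketch of two alternatives on a stand-in `months` series; the year 2014 used to build dates is an arbitrary placeholder:

```python
import pandas as pd

months = pd.Series(range(1, 13))

# Integer arithmetic: months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
quarters = (months - 1) // 3 + 1

# Pandas can also derive the quarter from real dates
dates = pd.to_datetime(pd.DataFrame({'year': 2014, 'month': months, 'day': 1}))
quarters_from_dates = dates.dt.quarter
```

Both operate on the whole column at once, which tends to be faster than calling a Python function for every row.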

In [None]:
g = sns.FacetGrid(df, row='carrier', col='quarter')
g.map(plt.hist, "dep_delay")
plt.tight_layout()

We can also make other types of plots; good examples can be found in the [Gallery](https://seaborn.pydata.org/examples/index.html).  For example, let's take a look at a box plot of departure delays broken up by quarter and origin airport.  Here we restrict to flights whose arrival delay was under an hour, which drops the extreme delays.

In [None]:
sns.boxplot(x='quarter', 
            hue='origin', 
            y='dep_delay', 
            data=df[df['arr_delay'] < 60])

We can also do cool things like create a `PairGrid` to look at distributions and relationships between different variables.

In [None]:
(sns.PairGrid(df[['dep_delay', 'arr_delay']].iloc[:500], diag_sharey=False)
  .map_diag(plt.hist, lw=3, bins=20)
  .map_upper(sns.scatterplot)
  .map_lower(sns.kdeplot))

Much more customization is possible in both `Matplotlib` and libraries built upon it such as `Seaborn`.  They are effective tools to represent data in a useful way.

*Copyright &copy; 2018 The Data Incubator.  All rights reserved.*