# Subplots

In [1]:
%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np

plt.subplot?

In [22]:
plt.figure()
# subplot with 1 row, 2 columns, and current axis is 1st subplot axes
plt.subplot(1, 2, 1)

linear_data = np.array([1,2,3,4,5,6,7,8])

plt.plot(linear_data, '-o')

<IPython.core.display.Javascript object>

[<matplotlib.lines.Line2D at 0x7f0c28ab7e48>]

In [23]:
# square the linear data (quadratic growth, despite the variable name)
exponential_data = linear_data**2

# subplot with 1 row, 2 columns, and current axis is 2nd subplot axes
plt.subplot(1, 2, 2)
plt.plot(exponential_data, '-o')

[<matplotlib.lines.Line2D at 0x7f0c289d39e8>]

In [4]:
# plot exponential data on 1st subplot axes
plt.subplot(1, 2, 1)
plt.plot(exponential_data, '-x')

[<matplotlib.lines.Line2D at 0x7f0c4c580588>]

In [25]:
plt.figure()
ax1 = plt.subplot(1, 2, 1)
plt.plot(linear_data, '-o')
# pass sharey=ax1 to ensure the two subplots share the same y axis
ax2 = plt.subplot(1, 2, 2, sharey=ax1)
plt.plot(exponential_data, '-x')

<IPython.core.display.Javascript object>

[<matplotlib.lines.Line2D at 0x7f0c2879a630>]
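The shared y-axis can also be set up through the object-oriented interface. The sketch below is one possible equivalent, not part of the original notebook. It assumes the non-interactive Agg backend so that it runs as a plain script outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs as a script
import matplotlib.pyplot as plt
import numpy as np

linear_data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exponential_data = linear_data ** 2

# sharey=True links the y-axis of every subplot in the grid
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(linear_data, '-o')
ax2.plot(exponential_data, '-x')

# both axes report the same y limits because the axis is shared
shared_limits = (ax1.get_ylim(), ax2.get_ylim())
print(shared_limits)
```

Because the axis is shared, the linear data on the left appears compressed: its y range is stretched to cover the much larger squared values on the right.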

In [8]:
plt.figure()
# the right hand side is equivalent shorthand syntax
plt.subplot(1,2,1) == plt.subplot(121)

<IPython.core.display.Javascript object>

True

In [18]:
# create a 3x3 grid of subplots
fig, ((ax1,ax2,ax3), (ax4,ax5,ax6), (ax7,ax8,ax9)) = plt.subplots(3, 3, sharex=True, sharey=True)
# plot the linear_data on the 5th subplot axes 
ax5.plot(linear_data, '-')

<IPython.core.display.Javascript object>

[<matplotlib.lines.Line2D at 0x7f0c29168748>]

In [19]:
# set inside tick labels to visible
for ax in plt.gcf().get_axes():
    for label in ax.get_xticklabels() + ax.get_yticklabels():
        label.set_visible(True)

In [20]:
# necessary on some systems to update the plot
plt.gcf().canvas.draw()
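Unpacking nine axes into nine names gets unwieldy for larger grids. As a hedged alternative, `plt.subplots` can hand back the axes as a NumPy array that is indexed by row and column. The sketch assumes the Agg backend so it runs outside a notebook, and uses `tick_params` as one way to switch the inner tick labels back on.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs as a script
import matplotlib.pyplot as plt
import numpy as np

linear_data = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# axes is a 3x3 NumPy array of Axes objects
fig, axes = plt.subplots(3, 3, sharex=True, sharey=True)

# row 1, column 1 is the centre subplot, i.e. the 5th one
axes[1, 1].plot(linear_data, '-')

# re-enable the tick labels that sharing hides on the inner subplots
for ax in axes.flat:
    ax.tick_params(labelbottom=True, labelleft=True)
```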

# Histograms

In [None]:
'''
A histogram is a bar chart which shows the frequency of a given phenomenon. 
A great example is a probability distribution. For instance, in the first course in this specialization, we touched on the difference between the random, uniform, normal, and chi-squared distributions.
A probability function can be visualized as a curve, where the y-axis holds the probability that a given value would occur, and the x-axis is the value itself. This is called a probability density function. Probabilities are limited to between zero and one, where zero means there is no chance of a given value occurring and one means that the value will always occur. 
The x-axis values are labeled in terms of the distribution function. In the case of the normal distribution, this is usually in terms of standard deviations. 
So a histogram is just a bar chart where the x-axis is a given observation and the y-axis is the frequency with which that observation occurs. So we should be able to plot a given probability distribution by sampling from it. 
Now, recall that sampling means that we just pick a number out of the distribution, like rolling a die or pulling a single card out of a deck. As we do this over and over again, we get a more accurate description of the distribution. 
Let's pull some samples from the normal distribution and plot four different histograms as subplots. First I'll create our 2 x 2 grid of axis objects. In this case, we don't want to share the y-axis between the plots since we're intentionally looking at a number of different sizes of samples. 
We are mostly interested in how uniform the distribution looks. Then we can iterate through a list of four different values, 10, 100, 1,000 and 10,000. And we will pull samples using NumPy's random module. 
Remember that the normal function of random just creates an array of numbers based on the underlying normal distribution. 
We can then plot these to a given axis object using the hist function. And set the title as appropriate. 
Well, there we go. The first plot only has ten samples, so it looks pretty jagged. And in my version here I don't think anyone would say this is obviously a normal distribution. 
When we jump to 100 samples, it gets better, but it's still quite jagged. Then it seems to smooth out a bit on the plots for 1,000 and 10,000 samples. 
But if we look closely, we can see that the bars of the 10,000-sample plot are actually wider than those of the 10 or the 100 plot. What's going on here? 
By default, the histogram in Matplotlib uses ten bins, that is, ten different bars. Here we created a shared x-axis, and as we sample more from the distribution, we're more likely to get outlier values further from our mean. Thus, ten bins for n=10 is at best capturing ten unique values, while for n=10,000, many values have to be combined into a single bin. 

'''
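The effect of the default of ten bins can be checked numerically without drawing anything. The following is a minimal sketch using `np.histogram`, which also defaults to ten bins. The fixed seed and the `bin_widths` dictionary are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability

bin_widths = {}
for n in (10, 100, 1000, 10000):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    # with no bins argument, the sample range is split into 10 equal-width bins
    counts, edges = np.histogram(sample)
    bin_widths[n] = edges[1] - edges[0]
    print(n, len(counts), round(float(bin_widths[n]), 3))
```

Larger samples tend to span a wider range of values, so each of the ten bins usually has to cover a wider interval.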


In [26]:
# create 2x2 grid of axis subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex=True)
axs = [ax1,ax2,ax3,ax4]

# draw n = 10, 100, 1000, and 10000 samples from the normal distribution and plot corresponding histograms
for n in range(0,len(axs)):
    print(n)
    sample_size = 10**(n+1)
    sample = np.random.normal(loc=0.0, scale=1.0, size=sample_size)
    print(sample)
    axs[n].hist(sample)
    axs[n].set_title('n={}'.format(sample_size))

<IPython.core.display.Javascript object>

0
[ 0.39964336  1.47463916 -0.03818158 -1.30844545  0.53380954 -0.33793137
 -0.28614652 -0.45754638  0.18301746 -0.79949538]
1
[-0.13032899  0.12840564  0.91563644  0.11042218  0.15960335 -0.32036142
  1.18786218  0.00427099 -1.9357584   0.09003686  1.67265125  0.99788023
 -0.64785729  1.18646492  0.45676184 -0.57563368  0.02491935 -0.71290162
  1.06227679 -0.46260106  0.09987893 -1.31242565 -0.39944517  0.10674331
  1.7621709  -0.92922095 -0.52824405  0.49251493  1.65605685  0.20167119
  1.77196406 -0.51315254 -2.18770191 -0.50119576  1.20071134 -0.28997626
 -0.58584733 -0.24523477 -1.25715681  0.32610732 -0.406268    1.18414955
  1.55999793 -0.20385093 -0.0260339   0.48275973  0.90517055  1.14690698
  1.29797319  1.148002   -0.41354972 -1.67269261  0.02423616 -1.04509456
 -1.46627003 -1.15230286 -0.47054536  0.17514656  0.44106903 -0.80078359
 -3.94853609 -1.07190366 -0.29824364  0.52459015  0.15090477  0.00402419
  3.43234289  0.37443886 -0.07120049 -1.45673187 -1.01101501 -0.970893

In [None]:
'''
Let's do the same thing with the number of bins set to 100. 
Now we see that the 10,000-sample plot looks much smoother than all of the others. And the 10-sample plot shows that each sample is basically in its own bin. 

So I think this brings up an important question of how many bins you should plot when using a histogram. I'm afraid that the answer isn't really clear. Both of these plots are true: one is a visual of the data at a coarse granularity, and one at a finer granularity. If we looked at the finest granularity in our data, plotting with 10,000 bins, then the histograms would become basically useless for decision making, since they wouldn't be showing trends between samples as much as they'd just be showing the samples themselves. This is similar to using aggregate statistics like the mean and standard deviation to describe a sample of a population. These values are coarse, and whether they are appropriate depends highly on your questions and interests. 
'''
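There is no single right answer for the bin count, but NumPy ships a few rule-of-thumb estimators that can serve as a starting point. The sketch below is an illustration rather than a recommendation from the lecture. It compares the number of bins suggested by three of the rules accepted by `np.histogram_bin_edges`.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
sample = rng.normal(loc=0.0, scale=1.0, size=10000)

bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    # each rule estimates a bin width from the data and returns the bin edges
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(rule, bin_counts[rule])
```

The same rule names can be passed as the `bins` argument of `hist`, which makes it easy to compare a suggested binning against a hand-picked one.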

In [27]:
# repeat with number of bins set to 100
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex=True)
axs = [ax1,ax2,ax3,ax4]

for n in range(0,len(axs)):
    sample_size = 10**(n+1)
    sample = np.random.normal(loc=0.0, scale=1.0, size=sample_size)
    axs[n].hist(sample, bins=100)
    axs[n].set_title('n={}'.format(sample_size))

<IPython.core.display.Javascript object>

In [28]:
plt.figure()
Y = np.random.normal(loc=0.0, scale=1.0, size=10000)
X = np.random.random(size=10000)
plt.scatter(X,Y)

<IPython.core.display.Javascript object>

<matplotlib.collections.PathCollection at 0x7f0c21d7c6a0>

In [66]:
# use gridspec to partition the figure into subplots
import matplotlib.gridspec as gridspec

plt.figure()
gspec = gridspec.GridSpec(3, 3)

top_histogram = plt.subplot(gspec[0, 1:])  # change the slicing and values to see how the layout changes
side_histogram = plt.subplot(gspec[1:, 0])
lower_right = plt.subplot(gspec[1:, 1:])

<IPython.core.display.Javascript object>

In [67]:
Y = np.random.normal(loc=0.0, scale=1.0, size=10000)
X = np.random.random(size=10000)
lower_right.scatter(X, Y)
top_histogram.hist(X, bins=100)
s = side_histogram.hist(Y, bins=100, orientation='horizontal')

In [68]:
# clear the histograms and plot normalized histograms
# (density=True replaces the normed=True argument, which was removed in Matplotlib 3.1)
top_histogram.clear()
top_histogram.hist(X, bins=100, density=True)
side_histogram.clear()
side_histogram.hist(Y, bins=100, orientation='horizontal', density=True)
# flip the side histogram's x axis
side_histogram.invert_xaxis()
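A normalized histogram rescales the bar heights so that the total area of the bars is one. A small check with `np.histogram`, which applies the same kind of normalization through its `density` argument, is sketched below. The seed and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
X = rng.random(size=10000)

# density=True divides each count by (total count * bin width)
density, edges = np.histogram(X, bins=100, density=True)

# bar height times bar width, summed over all bars, gives the total area
area = np.sum(density * np.diff(edges))
print(area)
```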

In [69]:
# change axes limits
for ax in [top_histogram, lower_right]:
    ax.set_xlim(0, 1)
for ax in [side_histogram, lower_right]:
    ax.set_ylim(-5, 5)



![MOOC DATA](moocdata.png "Image")

# Box and Whisker Plots

In [None]:
'''
A box plot, sometimes called a box-and-whisker plot, is a method of showing aggregate statistics of various samples in a concise manner. 

The box plot simultaneously shows, for each sample, the median of the values, the minimum and maximum of the sample, and the interquartile range. '''

In [70]:
import pandas as pd
normal_sample = np.random.normal(loc=0.0, scale=1.0, size=10000)
random_sample = np.random.random(size=10000)
gamma_sample = np.random.gamma(2, size=10000)

df = pd.DataFrame({'normal': normal_sample, 
                   'random': random_sample, 
                   'gamma': gamma_sample})

In [75]:
df.head()

      gamma     normal    random
0  0.526025   0.107124  0.000666
1  4.297316   0.915933  0.268139
2  1.439985  -0.170468  0.034307
3  2.986515    -1.2111  0.959507
4  2.139898  -2.110759  0.505692


In [None]:
'''
Now we can use the pandas describe function to see some summary statistics about our data frame. Each column has 10,000 entries. The mean values and standard deviations vary heavily. 

The minimum and maximum values are shown, and there are three different percentage values. 

These percentage values make up what's called the interquartile range. There are four different quarters of the data. The first is between the minimum value and the first 25% of the data. And this value of 25% is called the first quartile. The second quarter of data is between the 25% mark and the 50% mark of the data. The third is between 50% and 75% of the data. And the 75% mark is called the third quartile. And the final piece of data is between the 75% mark and the maximum of the data. 

Like standard deviation, the interquartile range is a measure of variability of data. And it's common to plot this using a box plot. In a box plot, the median of the data is plotted as a straight line. Two boxes are formed, one above, which represents the 50% to 75% data group, and one below, which represents the 25% to 50% data group. Thin lines which are capped are then drawn out to the minimum and maximum values. 

Here's an example. We'll create a new figure. Then we call plt.boxplot, and we pass in the column that we want to visualize. 





'''

In [71]:
df.describe()

           gamma     normal    random
count    10000.0    10000.0   10000.0
mean    2.014615  -0.000302  0.495747
std     1.439124   0.999818  0.289131
min     0.005355  -3.968116  0.000147
25%     0.968529  -0.666876  0.243171
50%     1.691479    9.7e-05  0.494635
75%      2.70944   0.666016  0.745779
max    13.950869   3.889279  0.999999
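The 25%, 50%, and 75% rows reported by describe can be reproduced directly with `np.percentile`, and the interquartile range is simply the distance between the third and first quartiles. The sketch below uses a freshly drawn sample with an assumed seed, so its numbers will not match the table above exactly.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
normal_sample = rng.normal(loc=0.0, scale=1.0, size=10000)

# first quartile, median, and third quartile
q1, median, q3 = np.percentile(normal_sample, [25, 50, 75])

# the interquartile range spans the middle 50% of the data
iqr = q3 - q1
print(q1, median, q3, iqr)
```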


In [None]:
'''
Finally, we set the whis parameter to 'range' (newer Matplotlib releases spell this whis=(0, 100)). This tells the box plot to set the whisker values all the way out to the minimum and maximum values. Also, I am going to assign the output of the box plot function to a variable, which is just an underscore. 
Now this is either horrendous or beautiful, depending on who you are. You see, underscore is actually a legal name for a variable in Python. But it's also completely uninformative. 
It's common practice by some to use an underscore when unpacking values which you don't care about and won't use later. 

I am using it here because if we don't assign the return value of a plotting function to a variable, the Jupyter Notebook will assume that we wanted to print that output. 
Since plotting functions return a list of all of the artists plotted, this would really muddy up our display. It's up to you whether you want to use this underscore pattern or not in your own code. 
Great, this gives us a basic box plot. 




'''

In [76]:
plt.figure()
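# create a boxplot of the normal data, assign the output to a variable to suppress output
# whis=(0, 100) stretches the whiskers to the min and max; older Matplotlib spelled this whis='range'
_ = plt.boxplot(df['normal'], whis=(0, 100))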

<IPython.core.display.Javascript object>

In [77]:
# clear the current figure
plt.clf()
# plot boxplots for all three of df's columns
_ = plt.boxplot([ df['normal'], df['random'], df['gamma'] ], whis=(0, 100))  # (0, 100) replaces the older whis='range'

In [74]:
plt.figure()
_ = plt.hist(df['gamma'], bins=100)

<IPython.core.display.Javascript object>

In [78]:
import mpl_toolkits.axes_grid1.inset_locator as mpl_il

plt.figure()
plt.boxplot([ df['normal'], df['random'], df['gamma'] ], whis=(0, 100))  # (0, 100) replaces the older whis='range'
# overlay axis on top of another 
ax2 = mpl_il.inset_axes(plt.gca(), width='60%', height='40%', loc=2)
ax2.hist(df['gamma'], bins=100)
ax2.margins(x=0.5)

<IPython.core.display.Javascript object>

In [79]:
# switch the y axis ticks for ax2 to the right side
ax2.yaxis.tick_right()

In [80]:
# if `whis` argument isn't passed, boxplot defaults to showing 1.5*interquartile (IQR) whiskers with outliers
plt.figure()
_ = plt.boxplot([ df['normal'], df['random'], df['gamma'] ] )

<IPython.core.display.Javascript object>

In [None]:
'''
If you don't supply the whis argument, the whiskers only go out as far as 1.5 times the interquartile range beyond the box. 
You can figure out the interquartile range by taking the top of the box minus the bottom of the box, and then multiplying that value by 1.5 to get the maximum whisker length. 
This is one method of detecting outliers. And the points which are plotted beyond the whiskers are called fliers. 
You can also plot the confidence interval in a couple of different ways on the data. The most common is to add notches to the box plot representing the 95% confidence interval around the median, and there are lots of other ways to customize the box plot. 
The box plot is one of the more common plots that you might use as a data scientist. 

'''
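The 1.5 x IQR rule can be worked through by hand. The following sketch, with an assumed seed and illustrative variable names, computes the whisker bounds for a gamma sample and collects the points that fall outside them as fliers.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
gamma_sample = rng.gamma(2, size=10000)

q1, q3 = np.percentile(gamma_sample, [25, 75])
iqr = q3 - q1  # top of the box minus bottom of the box

# whiskers may reach at most 1.5 * IQR beyond the box
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# each whisker ends at the most extreme data point still inside the bounds
inside = gamma_sample[(gamma_sample >= lower_bound) & (gamma_sample <= upper_bound)]
whisker_low, whisker_high = inside.min(), inside.max()

# everything beyond the bounds is drawn as an individual flier
fliers = gamma_sample[(gamma_sample < lower_bound) | (gamma_sample > upper_bound)]
print(whisker_low, whisker_high, len(fliers))
```

The gamma distribution has a long right tail, so nearly all of its fliers sit above the upper whisker.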

In [None]:
'''


Heatmaps are a way to visualize three-dimensional data and to take advantage of spatial proximity of those dimensions. 
Heatmaps aren't all bad, but where they break down is when there's no continuous relationship between dimensions. Using a heatmap for categorical data, for instance, is just plain wrong. It misleads the viewer into looking for patterns and ordering through spatial proximity. And any such patterns would be purely spurious. 




'''

# Heatmaps

In [81]:
plt.figure()

Y = np.random.normal(loc=0.0, scale=1.0, size=10000)
X = np.random.random(size=10000)
_ = plt.hist2d(X, Y, bins=25)

<IPython.core.display.Javascript object>

In [82]:
plt.figure()
_ = plt.hist2d(X, Y, bins=100)

<IPython.core.display.Javascript object>

In [83]:
# add a colorbar legend
plt.colorbar()

<matplotlib.colorbar.Colorbar at 0x7f0c219f5160>
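The colors in a two-dimensional histogram come from a grid of counts, one per cell. As a sketch of that underlying binning step, `np.histogram2d` returns the grid directly. The seed below is an assumption made for repeatability.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
n = 10000
Y = rng.normal(loc=0.0, scale=1.0, size=n)
X = rng.random(size=n)

# counts[i, j] is the number of points in x-bin i and y-bin j
counts, xedges, yedges = np.histogram2d(X, Y, bins=25)
print(counts.shape, int(counts.sum()), int(counts.max()))
```

With `bins=100` the same 10,000 points are spread over 10,000 cells, so individual counts become small and the picture gets noisier.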

In [None]:
'''
So far we've focused on static images, but Matplotlib does have some support for both animation and interactivity. Recall the backend, the layer that renders the plot to the screen. Animation and interactivity depend heavily on support from this backend layer, and a static image backend such as the inline PNG one doesn't provide this. However, the nbAgg backend, which the %matplotlib notebook magic function selects, does provide for some interactivity, so we can leverage that here. 

The matplotlib.animation module contains important helpers for building animations. 

For our discussion, the important object here is called FuncAnimation. It builds an animation by iteratively calling a function which you define. Essentially, your function will either clear the axis object and redraw the next frame which you want users to see, or it will return a list of objects which need to be redrawn. 


'''

# Animations

In [84]:
import matplotlib.animation as animation

n = 100
x = np.random.randn(n)

In [85]:
# create the function that will do the plotting, where curr is the current frame
def update(curr):
    # check if the animation is at the last frame, and if so, stop the animation (stored in the variable a)
    if curr == n: 
        a.event_source.stop()
    plt.cla()
    bins = np.arange(-4, 4, 0.5)
    plt.hist(x[:curr], bins=bins)
    plt.axis([-4,4,0,30])
    plt.gca().set_title('Sampling the Normal Distribution')
    plt.gca().set_ylabel('Frequency')
    plt.gca().set_xlabel('Value')
    plt.annotate('n = {}'.format(curr), [3,27])

In [86]:
fig = plt.figure()
a = animation.FuncAnimation(fig, update, interval=100)

<IPython.core.display.Javascript object>
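Instead of stopping the event source from inside the update function, the animation can be given a finite sequence of frames. The sketch below is a hedged variation on the cell above, not the notebook's own approach. It assumes the Agg backend so it runs as a script, draws on an explicit `ax` rather than the current axes, and adds 4.0 as a final bin edge.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs as a script
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import numpy as np

n = 100
x = np.random.default_rng(0).standard_normal(n)  # assumption: fixed seed
bins = np.arange(-4, 4.5, 0.5)  # includes 4.0 as the last bin edge

fig, ax = plt.subplots()

def update(curr):
    # clear the axes and redraw the histogram of the first curr samples
    ax.clear()
    ax.hist(x[:curr], bins=bins)
    ax.set_xlim(-4, 4)
    ax.set_ylim(0, 30)
    ax.set_title('Sampling the Normal Distribution')
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.annotate('n = {}'.format(curr), (3, 27))

# frames supplies the values passed to update; repeat=False stops after the last one
anim = animation.FuncAnimation(fig, update, frames=range(1, n + 1),
                               interval=100, repeat=False)

# a single frame can be drawn by hand to check the update function
update(10)
```

Keeping a reference to the animation object, here `anim`, matters: if it is garbage collected the animation stops.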

# Interactivity

In [None]:
'''
Interactivity and animation are very similar in Matplotlib. For interactivity though, we have to head down to the artist layer a bit more. In particular, we have to reference the canvas object of the current figure. 
The canvas object handles all of the drawing events and it's tightly connected with a given backend. 
If event listening is something you're not familiar with, it can be a bit of a tough concept to grasp. For decades, computers have had abstract methods for doing multiple things at once, and now with multiprocessing and multi-core machines, there are actually physical ways to do many things at once. 
But even before that, the abstraction was largely focused on the notion of events. 
Moving a mouse pointer, for instance, would create an event, clicking would create an event, and pressing keys on the keyboard would create an event. And this didn't only happen at the hardware level, such as with IRQ interrupts, but at the software level as well. In fact, event-driven programming has infiltrated most of the ways computer programmers regularly engage with software, from HTML and JavaScript down to lower-level C code. 
You can think of an event as a piece of data which is associated with a function call. And when the event happens, the software environment, in our case Matplotlib's backend, will call the function with the relevant data. 
Let's look at a trivial example. 

We'll create a new figure and plot some random data to it. 
Then we'll create a new function called onclick. This takes one parameter, which is the event object. So what's in an event object? Well, that depends on the type of the event. Here we're going to deal with mouse events. They have both an x and a y value for the location of the mouse in pixels on the canvas, as well as x and y values for the location of the mouse relative to our data and axes. 
So for our onclick we'll clear the current axis, then plot our data, then set the title of the plot to the various locations of the mouse. 
Finally, we have to connect this event to an event listener, and this process is usually called wiring it up. In this case it's very easy: get the current figure and its canvas object, then call the mpl_connect function, passing in the string button_press_event as well as a reference to the function onclick, which will be called when the event is detected. 
Now when we click on our plot we see the mouse information printed to the title. 
The Matplotlib documentation describes the kinds of events you can listen for. But whether they work or not depends on the backend you're using, and some backends are not interactive. Button presses, key presses, scroll events, and figure and axes enter and leave events are all common. But the most important event for us is the pick event. 
The pick event allows you to respond when the user has actually clicked on a visual element in the figure. 


'''
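The idea of wiring a function to a named event can be sketched in a few lines of plain Python. The `EventRegistry` class below is hypothetical and is not part of Matplotlib. It only mimics the shape of the connect and disconnect calls, including the integer connection id that gets returned.

```python
import itertools

class EventRegistry:
    """Hypothetical stand-in for a canvas; not a Matplotlib class."""

    def __init__(self):
        self._callbacks = {}           # connection id -> (event name, function)
        self._ids = itertools.count(1)

    def connect(self, event_name, func):
        # register func for event_name and return a connection id
        cid = next(self._ids)
        self._callbacks[cid] = (event_name, func)
        return cid

    def disconnect(self, cid):
        # forget the callback registered under this connection id
        self._callbacks.pop(cid, None)

    def fire(self, event_name, event):
        # call every function that is listening for this event name
        for name, func in list(self._callbacks.values()):
            if name == event_name:
                func(event)

clicks = []
registry = EventRegistry()
cid = registry.connect('button_press_event', clicks.append)

registry.fire('button_press_event', {'x': 10, 'y': 20})  # delivered to the listener
registry.fire('key_press_event', {'key': 'a'})           # nobody is listening
registry.disconnect(cid)
registry.fire('button_press_event', {'x': 30, 'y': 40})  # ignored after disconnect
print(clicks)
```

The event is just data, here a dictionary, and the registry decides which functions receive it.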

In [87]:
plt.figure()
data = np.random.rand(10)
plt.plot(data)

def onclick(event):
    plt.cla()
    plt.plot(data)
    plt.gca().set_title('Event at pixels {},{} \nand data {},{}'.format(event.x, event.y, event.xdata, event.ydata))

# tell mpl_connect we want to pass a 'button_press_event' into onclick when the event is detected
plt.gcf().canvas.mpl_connect('button_press_event', onclick)

<IPython.core.display.Javascript object>

7

In [88]:
from random import shuffle
origins = ['China', 'Brazil', 'India', 'USA', 'Canada', 'UK', 'Germany', 'Iraq', 'Chile', 'Mexico']

shuffle(origins)

df = pd.DataFrame({'height': np.random.rand(10),
                   'weight': np.random.rand(10),
                   'origin': origins})
df

Unnamed: 0,height,origin,weight
0,0.885444,Iraq,0.268489
1,0.982305,UK,0.868182
2,0.380378,Chile,0.632582
3,0.081783,Germany,0.222194
4,0.192636,India,0.105106
5,0.831148,Canada,0.585171
6,0.102643,China,0.7514
7,0.180478,USA,0.088051
8,0.090761,Brazil,0.864824
9,0.711172,Mexico,0.942976


In [92]:
plt.figure()
# picker=5 means the mouse doesn't have to click directly on a point, but can be up to 5 pixels away
plt.scatter(df['height'], df['weight'], picker=5)
plt.gca().set_ylabel('Weight')
plt.gca().set_xlabel('Height')

<IPython.core.display.Javascript object>

<matplotlib.text.Text at 0x7f0c1bff7240>

In [105]:
def onpick(event):
    origin = df.iloc[event.ind[0]]['origin']
    plt.gca().set_title('Selected item came from {}'.format(origin))

# tell mpl_connect we want to pass a 'pick_event' into onpick when the event is detected
plt.gcf().canvas.mpl_connect('pick_event', onpick)

19
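The bare numbers printed after the `mpl_connect` cells are connection ids. As a hedged sketch, such an id can be stored and later passed to `mpl_disconnect` to remove the listener again. The example assumes the Agg backend so it runs as a script, uses `picker=True` rather than a pixel tolerance, and reports every picked index instead of only the first.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs as a script
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
fig, ax = plt.subplots()
ax.scatter(rng.random(10), rng.random(10), picker=True)

def onpick(event):
    # event.ind holds the indices of all points that were hit by the click
    ax.set_title('Picked point indices: {}'.format(list(event.ind)))

# mpl_connect returns an integer connection id
cid = fig.canvas.mpl_connect('pick_event', onpick)
print(cid)

# passing the id to mpl_disconnect removes the listener
fig.canvas.mpl_disconnect(cid)
```

Disconnecting a stale listener is useful in a notebook, where re-running a cell would otherwise stack several callbacks on the same figure.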