## Introduction to Histograms

In [None]:
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set(style="ticks", color_codes=True)

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Let's get a quick intro to what a Histogram is:

http://en.wikipedia.org/wiki/Histogram

Key Points:
* Bins are used to create bars. Bins are usually all the same width.
* The height of each bar represents the frequency or proportion of values in that bin.
* Bars touch because the data is __continuous__: each bin begins where the previous one ends.
* Generally, each bin is __inclusive__ on the left and __exclusive__ on the right. In NumPy and Matplotlib, the last bin is the exception: it also includes its right edge.
* Changing the number of bins can have a big impact on how the data appears. Experiment with different numbers of bins to find a display that communicates the data as clearly and accurately as possible. 
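
The bin-edge rule in the key points can be checked numerically. The sketch below is only an illustration: it uses NumPy's `np.histogram`, which follows the same binning convention, on a small made-up list of values with hand-picked edges (both are assumptions, not part of the datasets used later).

```python
import numpy as np

# Hypothetical toy data, chosen so that several values sit exactly on a bin edge
values = [1, 2, 2, 3, 4]
edges = [1, 2, 3, 4]  # three bins: [1, 2), [2, 3), [3, 4]

counts, returned_edges = np.histogram(values, bins=edges)

# 1 falls in the first bin, both 2s in the second, and 3 and 4 in the last,
# because the last bin includes its right edge
print(counts)  # [1 2 2]
```

Each value that lies on an interior edge is counted in the bin to its right, which is what "inclusive on the left" means in practice.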

We'll start by creating a random sample of 10,000 values from a Normal distribution, and just drawing the default Histogram.

In [None]:
dataset1 = np.random.randn(10000)

# Plot a histogram of the dataset; note that bins=10 by default
plt.hist(dataset1)
plt.show()

Describe this histogram. What is the x-axis and what is the y-axis?

Let's create another data set by sampling from a Normal distribution, but this time let's only take 20 values instead of 10,000.

In [None]:
dataset2 = np.random.randn(20)

# Let's change the color for "fun"
plt.hist(dataset2, color='darkorchid')
plt.show()

Compare this histogram to the first one. Besides having a much more "fun" color, how is it different?

OK, now let's compare our two datasets. Since we are going to have two colors on one plot, let's make it readable for colorblind people by setting the default color palette. This is a Seaborn function, but Matplotlib will also use the palette once it's set.

In [None]:
sns.set_palette("colorblind")

One has 10,000 values and one has 20. It wouldn't be reasonable to compare them on the same scale, so let's __normalize__ them. Instead of the y-axis being __frequency__, it will now be __probability density__: the bar heights are scaled so that the total area of the bars in each histogram is 1.
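
To make the normalization concrete, here is a minimal sketch of what `density=True` computes, using `np.histogram` on a freshly generated sample. The seed and the variable names are arbitrary choices for illustration, not part of the notebook's datasets.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
sample = rng.standard_normal(20)

# Raw counts and normalized heights over the same 10 default bins
counts, edges = np.histogram(sample, bins=10)
density, _ = np.histogram(sample, bins=10, density=True)
widths = np.diff(edges)

# Density = count / (total count * bin width)
manual_density = counts / (counts.sum() * widths)

# Summing height * width over all bars gives the total area
total_area = (density * widths).sum()

print(np.allclose(density, manual_density))
print(total_area)
```

Because the heights are densities and not proportions, an individual bar can be taller than 1 when the bins are narrower than 1. Only the total area is constrained.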

In [None]:
# Set density=True for the plots to be normalized
# Set alpha=0.75 for transparency
plt.hist(dataset1, density=True, alpha=0.75)
plt.hist(dataset2, density=True, alpha=0.75)

# Since we are changing the y-axis, let's add some labels
plt.xlabel('Value')
plt.ylabel('Probability density')
plt.title('Histogram of Different Sized Samples')
# np.random.randn samples the standard Normal distribution (mean 0, standard deviation 1)
plt.text(-3, 0.35, r'$\mu=0,\ \sigma=1$')
plt.show()

Compare these 2 samples. What is similar, what is different? Is it possible they came from the same distribution?

If we also want the bins to line up, we can compute one set of bin edges from both datasets combined, and then pass those edges to each `plt.hist()` call through the `bins` argument.

In [None]:
# TODO: Look up how to make the same plot as above, but make both distributions
# have the same bins (the bars should line up).

Now how do the samples compare?
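
One possible way to approach the shared-bin exercise is sketched below. It is not the only solution. It assumes `np.histogram_bin_edges` is used to compute the edges, and it generates its own stand-in samples (`large_sample` and `small_sample`, hypothetical names in place of `dataset1` and `dataset2`) so that it runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
large_sample = rng.standard_normal(10_000)  # stand-in for dataset1
small_sample = rng.standard_normal(20)      # stand-in for dataset2

# One set of 10 equal-width bins spanning both samples together
combined = np.concatenate([large_sample, small_sample])
shared_edges = np.histogram_bin_edges(combined, bins=10)

# Passing the same edges to both calls makes the bars line up
plt.hist(large_sample, bins=shared_edges, density=True, alpha=0.75, label='n = 10,000')
plt.hist(small_sample, bins=shared_edges, density=True, alpha=0.75, label='n = 20')
plt.xlabel('Value')
plt.ylabel('Probability density')
plt.legend()
plt.show()
```

Since the edges cover the range of the combined data, the small sample will usually leave the outermost bins empty, which is itself a useful point of comparison.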