## Boxplots
Boxplots are a great way to communicate the center, spread, shape, and outliers of a distribution of univariate data. They are often the first visualization to reach for when looking at quantitative data. For a quick look at what a boxplot shows, see:
https://en.wikipedia.org/wiki/Box_plot

Key points:
* Center line = median
* Box ends = Q1 and Q3
* Whisker ends = Min and Max of the data, excluding any outliers
* Each segment contains about 25% of the data points
* IQR = length of the box
* If one side is stretched, the data is skewed in that direction
* Outliers are represented as points beyond the whiskers (more than 1.5 IQR below Q1 or above Q3)
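
To make the key points concrete, here is a minimal sketch that computes the same quantities by hand with NumPy. The sample values are hypothetical and chosen so that one point sits far from the rest; `np.percentile` uses linear interpolation by default, which is an assumption about how the quartiles are defined.

```python
import numpy as np

# Hypothetical sample for illustration; 9.5 is deliberately far from the rest
data = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.6, 9.5])

# Box: quartiles and median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (data < lower_fence) | (data > upper_fence)

# Whiskers stop at the most extreme points that are not outliers
inliers = data[~is_outlier]
whisker_low, whisker_high = inliers.min(), inliers.max()

print("Q1, median, Q3:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", data[is_outlier])
```

In this sample only 9.5 is flagged, so the upper whisker stops at 3.6 rather than at the maximum.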

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Let's start by generating a random sample of size 20 from a Normal distribution, and drawing a basic boxplot.

In [None]:
n = 20
normal_sample = np.random.normal(size=n)
print(normal_sample)
sns.boxplot(x=normal_sample, whis=1.5)
plt.show()

If you don't get any outliers on your boxplot, repeat the above code until you do.

Now take a moment to describe what the boxplot shows.

### How Significant Is A Boxplot Outlier?

<img src="https://imgs.xkcd.com/comics/boyfriend.png">

*It is common to consider the boxplot as an informal test for the existence of outliers. While the procedure is useful, it should be used with caution, as at least **30% of samples from a normally-distributed population** of any size will be flagged as containing an outlier, while for small samples (N<10) even extreme outliers indicate little. This fact is most easily seen using a simulation.*

Journal of Statistics Education, Volume 19, Number 2 (2011), www.amstat.org/publications/jse/v19n2/dawson.pdf  
Robert Dawson, Saint Mary’s University

### Definition of Outlier

A data point is flagged as an outlier if it lies below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$. Such a point is viewed as being too far from the central values to be reasonable.

How does this work with data that is normally distributed? Recall the 68-95-99.7 rule.

<img src="https://upload.wikimedia.org/wikipedia/commons/2/22/Empirical_rule_histogram.svg">
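
The outlier fences can be placed on the same scale as the 68-95-99.7 rule. The sketch below works out where the fences fall for an exact standard Normal distribution (not a sample) and how much probability lies beyond them. It uses `statistics.NormalDist` from the Python standard library; `scipy.stats.norm` would serve equally well. The helper name `tail_probability` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Population quartiles of the standard Normal distribution
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

def tail_probability(k):
    """Probability that a standard Normal value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return std_normal.cdf(lower) + (1.0 - std_normal.cdf(upper))

inner_fence = q3 + 1.5 * iqr   # upper inner fence, in standard deviations
outer_fence = q3 + 3.0 * iqr   # upper outer fence, in standard deviations
p_mild = tail_probability(1.5)
p_extreme = tail_probability(3.0)

print(f"inner fences at +/- {inner_fence:.3f} sd, P(outside) = {p_mild:.4f}")
print(f"outer fences at +/- {outer_fence:.3f} sd, P(outside) = {p_extreme:.2e}")
```

The inner fences land at roughly 2.7 standard deviations, between the 95% and 99.7% bands, so a little under 1% of individual Normal values fall beyond them. In a sample of 10,000 that is still dozens of flagged points.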

Let's generate a random sample of 10,000 normally distributed values with mean 0 and variance 1, and then plot a histogram of the sample with guide lines drawn 1.5 IQR and 3.0 IQR beyond the quartiles.

In [None]:
n = 10000
normal_sample = np.random.normal(size=(n,1))

plt.hist(normal_sample,100)

# Indicate the mean with a black dashed line
plt.axvline(normal_sample.mean(), color='k', linestyle='dashed', linewidth=1)

# Compute the 25th, 50th, and 75th percentiles for the data
Q1, median, Q3 = np.percentile(normal_sample, [25,50,75])

# Calculate the IQR (interquartile range)
IQR = Q3 - Q1

# Indicate the inner and outer guides computed from the IQR
# Data outside the inner range guides are outliers
# Data outside the outer range guides are extreme values
inner_range_lower = Q1 - 1.5*IQR
inner_range_upper = Q3 + 1.5*IQR
outer_range_lower = Q1 - 3.0*IQR
outer_range_upper = Q3 + 3.0*IQR

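# Plot 4 more lines for the inner and outer ranges, all in red:
# solid lines for the inner ranges, dotted lines for the outer ranges
plt.axvline(inner_range_lower, color='r', linestyle='solid', linewidth=1)
plt.axvline(inner_range_upper, color='r', linestyle='solid', linewidth=1)
plt.axvline(outer_range_lower, color='r', linestyle='dotted', linewidth=1)
plt.axvline(outer_range_upper, color='r', linestyle='dotted', linewidth=1)

# Enlarge the default figure size; this applies to figures created after this line
plt.rcParams["figure.figsize"] = (12,9)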

Now let's create a boxplot with outliers for our sample, directly above a histogram of the same sample and on the same x-axis.

In [None]:
# Cut the plot window into 2 parts
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, 
                                    gridspec_kw={"height_ratios": (.15, .85)})
 
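# Add a boxplot and a histogram with 20 bins.
# sns.histplot is used here because sns.distplot is deprecated in recent seaborn releases.
sns.boxplot(x=normal_sample.ravel(), ax=ax_box)
sns.histplot(x=normal_sample.ravel(), bins=20, kde=True, ax=ax_hist)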

# Remove x axis name for the boxplot
ax_box.set(xlabel='')
plt.show()

How does this set of plots relate to the previous histogram?

### How often will samples from a Normal distribution contain outliers?
To find out, let's create a simulation where we randomly sample from a Normal distribution. We will try 5 different sample sizes, so 5 different simulations. For each size, we'll draw 10,000 samples and count how many contain an outlier.

In [None]:
# Sample sizes to test (number of values in each sample)
sample_sizes = [5, 9, 13, 17, 21]

# Number of trials to run for each sample size
trials = 10000

output_results = {}

for s in sample_sizes:

    count = 0
    for i in range(trials):

        # Generate a random sample of normally distributed values with mean 0 and variance 1
        normal_sample = np.random.normal(size=s)

        # Compute the quartiles and the IQR
        Q1, Q2, Q3 = np.percentile(normal_sample, [25,50,75])
        IQR = Q3 - Q1

        inner = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
        outer = (Q1 - 3.0*IQR, Q3 + 3.0*IQR)

        # Flag the values that lie inside each pair of fences
        inside_inner = (normal_sample >= inner[0]) & (normal_sample <= inner[1])
        inside_outer = (normal_sample >= outer[0]) & (normal_sample <= outer[1])

        # Values outside the inner range but inside the outer range are mild (suspect) outliers;
        # if the sample contains at least one, record that the boxplot method flagged it
        if np.any(~inside_inner & inside_outer):
            count += 1

    output_results[s] = [count]

In [None]:
df = pd.DataFrame(output_results)
df
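
The table above holds raw counts out of `trials` samples, so dividing each count by `trials` gives the rate of flagged samples. As a cross-check, here is a minimal vectorized sketch of the same experiment that returns rates directly. The function name `mild_outlier_rate`, the fixed seed, and the smaller trial count are assumptions made for illustration, not part of the simulation above.

```python
import numpy as np

def mild_outlier_rate(sample_size, trials=2000, seed=0):
    """Fraction of Normal samples with at least one value between the inner and outer fences."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(size=(trials, sample_size))  # one sample per row

    # Quartiles per row; keepdims lets them broadcast against the samples
    q1, q3 = np.percentile(samples, [25, 75], axis=1, keepdims=True)
    iqr = q3 - q1

    outside_inner = (samples < q1 - 1.5 * iqr) | (samples > q3 + 1.5 * iqr)
    inside_outer = (samples >= q1 - 3.0 * iqr) & (samples <= q3 + 3.0 * iqr)

    # A sample counts if any of its values is a mild outlier
    return (outside_inner & inside_outer).any(axis=1).mean()

rates = {s: mild_outlier_rate(s) for s in [5, 9, 13, 17, 21]}
for size, rate in rates.items():
    print(f"n = {size:2d}: {rate:.1%} of samples contain a mild outlier")
```

Because the seed is fixed, rerunning the sketch gives the same rates; change `seed` or raise `trials` to see how much they vary.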

If we consider the boxplot as an informal test of the Null Hypothesis that the sample is from a normally-distributed (and uncontaminated) population, a flagged outlier in a properly-distributed sample corresponds to a “Type I error”. But, as we have just seen, whatever the sample size, the $\alpha$ of such a test is often close to 30%!

Thus, the existence of a datum flagged as a mild outlier can never be taken, on its own, as significant evidence against the purity of the sample or the normality of the population.