<center><img src='img/ms_logo.jpeg' height=40% width=40%></center>

<center><h1>Outlier Detection, Sample Size, and Confidence Intervals</h1></center>

When you're designing an experiment, numbers matter.  After all, we want our experiments to be statistically valid; otherwise, we're just guessing.  In this notebook, we'll learn a method for detecting outliers in our data set called "Tukey Fences", named after famed statistician John Tukey.  

Next, we'll learn about confidence intervals, sample size, and the relationship between the two.  We'll learn how to calculate confidence intervals based on sample size, as well as how to determine the minimum sample size needed in order to reach a specific confidence interval.  

Let's get started!

<center><h2>Outlier Detection</h2></center>

Recall that before we begin an experiment, we usually start by "cleaning" our dataset.  This step usually includes things like:

* Exploring our dataset(s) to get a feel for what changes need to be made to make it more usable
* Examining and standardizing the values within cells (converting "yes"/"no" answers to 1's and 0's, for example)
* Dealing with cells that contain NaNs (Null values)
* Organizing and structuring datasets as needed (for instance, combining many small datasets into one big one)
* Normalizing continuous data into z-scores with a mean of 0 and unit variance.  

Another major step at this point in the project is to detect **outliers** and decide how to deal with them.  Outliers are extreme values that can skew our dataset, sometimes giving us an incorrect picture of how things actually are.  The hardest part is deciding which data points are acceptable and which ones qualify as outliers.  This is where "Tukey Fences" come into play!

### 1.5 x IQR

In order to find outliers, we first need a working definition of what constitutes an outlier.  Tukey suggested we calculate the range between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile) of the data, called the **interquartile range** (IQR).  We then multiply this value by 1.5.  To get the Fence for high values, add 1.5 x IQR to Q3.  Anything greater than this "Fence" value is considered an outlier.  Similarly, to get the Fence for low values, subtract 1.5 x IQR from Q1.  Anything less than this "Fence" value is also considered an outlier.  
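To make the arithmetic concrete, here is a minimal sketch on a tiny made-up list (the values in `toy_data` are hypothetical, chosen so the numbers are easy to check by hand). It assumes NumPy's default linear interpolation when computing percentiles; other quartile conventions can give slightly different fences.

```python
import numpy as np

# Hypothetical toy data: nine well-behaved values and one extreme value
toy_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles (NumPy's default linear interpolation)
q1 = np.percentile(toy_data, 25)
q3 = np.percentile(toy_data, 75)

# Interquartile range and fence distance
iqr = q3 - q1
fence_distance = 1.5 * iqr

# Tukey fences
lower_fence = q1 - fence_distance
upper_fence = q3 + fence_distance

# Anything beyond either fence is flagged as an outlier
outliers = toy_data[(toy_data < lower_fence) | (toy_data > upper_fence)]
print(q1, q3, lower_fence, upper_fence)
print(outliers)
```

Here Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5 and the fences sit at -3.5 and 14.5. Only the value 100 falls outside them.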

Let's try an example!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1547)
%matplotlib inline

In [2]:
# Generate a random normal distribution of 1000 samples with mean 100 and std_dev of 8
normal_dist = np.random.normal(100, 8, (1000)).astype('float64')
# Generate a random uniform distribution between 1 and 200 with 100 samples
uniform_dist = np.random.uniform(1, 200, (100)).astype('float64')
# Combine both distributions and store them in a pandas Series object
sample_dset = pd.Series(np.append(normal_dist, uniform_dist))
sample_dset

0       106.145874
1        91.044219
2       107.202858
3        86.464344
4       103.249059
5        90.923516
6       104.537804
7        98.168093
8       100.619681
9        88.562850
10       98.669220
11      101.294004
12       95.603203
13       96.755166
14      100.191893
15       98.559921
16       97.932122
17       90.858907
18       87.431719
19       98.345744
20      101.390562
21      106.771191
22      108.409304
23      106.708683
24       95.973145
25      116.029499
26      108.147891
27      110.542668
28       93.369981
29      103.094504
           ...    
1070    189.189803
1071    147.178346
1072    184.463681
1073    155.482583
1074    156.048024
1075     48.678164
1076    136.150561
1077     20.009754
1078    153.901911
1079    142.498713
1080    191.318721
1081    158.503217
1082     82.783222
1083    115.211518
1084     56.099865
1085     34.242390
1086    183.971021
1087    127.040301
1088     63.634793
1089     59.152878
1090     98.049143
1091    100.

In [3]:
sample_dset[1]

91.04421874119312

In [4]:
sample_dset.describe()['25%']

94.65885980256553

Now that we've created an ugly data set, let's see if we can identify some outliers.  

Start by calculating the **Inter-Quartile Range**: Q3 - Q1.

Next, calculate how far your fences are from the quartiles: f = IQR x 1.5

Finally, place your fences and filter for values outside them:  Lower Fence = Q1 - f, Upper Fence = Q3 + f

See if you can write some code to filter for outliers in the `sample_dset` Series we've just created.

In [5]:
# Get locations for Q1 and Q3 directly from the data instead of hard-coding them
q1 = sample_dset.quantile(0.25)
q3 = sample_dset.quantile(0.75)

# Inter-Quartile Range and fence distance
IQR = q3 - q1
fence_distance = IQR * 1.5

# calculate fence locations
lower_fence = q1 - fence_distance
upper_fence = q3 + fence_distance

# Keep only the values that fall strictly inside the fences, then inspect the result
sample_dset_no_outliers = sample_dset[(lower_fence < sample_dset) & (sample_dset < upper_fence)]
sample_dset_no_outliers

0       106.145874
1        91.044219
2       107.202858
3        86.464344
4       103.249059
5        90.923516
6       104.537804
7        98.168093
8       100.619681
9        88.562850
10       98.669220
11      101.294004
12       95.603203
13       96.755166
14      100.191893
15       98.559921
16       97.932122
17       90.858907
18       87.431719
19       98.345744
20      101.390562
21      106.771191
22      108.409304
23      106.708683
24       95.973145
25      116.029499
26      108.147891
27      110.542668
28       93.369981
29      103.094504
           ...    
999     112.524148
1001     79.065670
1004    106.480217
1005     82.031813
1006    110.448626
1008     90.398271
1012    114.049114
1017    122.622859
1023    100.802027
1025    113.260434
1027     86.526187
1028    116.898955
1030    118.484723
1035    103.943877
1036     88.793482
1042    121.690856
1048    122.493991
1049     89.491819
1050     99.592893
1054    115.880911
1061    117.151834
1064    118.

Great! That works, but it isn't efficient to calculate this manually every time we run across a new data set.  

**TASK:** Write a function that takes in a pandas series, and returns a new pandas series with the outliers removed!

In [6]:
def remove_outliers(series):
    # Get locations for Q1 and Q3 from the series passed in (not the global sample_dset)
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)

    # Calculate inter-quartile range and fence distance
    iqr = q3 - q1
    fence_distance = iqr * 1.5

    # Calculate fence locations
    lower_fence = q1 - fence_distance
    upper_fence = q3 + fence_distance

    # Keep only the values that fall strictly inside the fences
    series_no_outliers = series[(lower_fence < series) & (series < upper_fence)]
    return series_no_outliers

In [7]:
remove_outliers(sample_dset)

0       106.145874
1        91.044219
2       107.202858
3        86.464344
4       103.249059
5        90.923516
6       104.537804
7        98.168093
8       100.619681
9        88.562850
10       98.669220
11      101.294004
12       95.603203
13       96.755166
14      100.191893
15       98.559921
16       97.932122
17       90.858907
18       87.431719
19       98.345744
20      101.390562
21      106.771191
22      108.409304
23      106.708683
24       95.973145
25      116.029499
26      108.147891
27      110.542668
28       93.369981
29      103.094504
           ...    
999     112.524148
1001     79.065670
1004    106.480217
1005     82.031813
1006    110.448626
1008     90.398271
1012    114.049114
1017    122.622859
1023    100.802027
1025    113.260434
1027     86.526187
1028    116.898955
1030    118.484723
1035    103.943877
1036     88.793482
1042    121.690856
1048    122.493991
1049     89.491819
1050     99.592893
1054    115.880911
1061    117.151834
1064    118.
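Sometimes we want to look at the outliers themselves rather than throw them away. One possible variation, sketched below under the assumption that a helper named `tukey_fences` (a hypothetical name, not part of pandas) returns the fence values, is to build a boolean mask and count or inspect the flagged points. The toy data here is made up for illustration.

```python
import pandas as pd

def tukey_fences(series, k=1.5):
    """Return (lower_fence, upper_fence) for a pandas Series.

    Hypothetical helper for illustration; k is the IQR multiplier.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    fence_distance = k * (q3 - q1)
    return q1 - fence_distance, q3 + fence_distance

# Made-up data: a tight cluster plus two obviously extreme values
toy = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, -50, 90])

lower, upper = tukey_fences(toy)

# Boolean mask that is True for values beyond either fence
is_outlier = (toy < lower) | (toy > upper)

print(toy[is_outlier])   # the flagged values, with their original index
print(is_outlier.sum())  # how many values were flagged
```

Keeping the mask around makes it easy to decide later whether to drop, cap, or investigate the flagged values.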

<center><h2>Sample Size and Confidence Intervals</h2></center>

## What is a Confidence Interval?

Recall that in statistics, we almost never get to work with the entire population.  Instead, we work with samples taken from the population, and use statistical methods to make estimates about the population based on what we see in the samples. When you think about this estimation process, it leads to two very important questions:

<center>1. **_How accurate are our estimates?_**</center>
<br>
<center>2. **_How many samples do we need to be sure our estimates are accurate?_**</center>

This is where confidence intervals come into play.  When estimating population parameters such as the population mean, for example, it is impossible to know with certainty that our estimate is 100% accurate.  Instead, statisticians define an acceptable margin of error.  In plain English, that means that we're okay with our estimate being wrong, as long as we're {X}% sure we're within a certain distance from the mean.  

To illustrate this concept, let's look at a type of graph statisticians use to summarize a sample, called a **_Box Plot_**.  

<center><img src='http://www.cs.utsa.edu/~cs1173/lessons/BoxPlotQuestions/BoxPlotQuestions_02.png'></center>

This is a box plot for 3 different types of Iris flowers (you'll get very familiar with this data set when you move onto supervised learning).  A box plot shows the median, quartiles, and spread of each sample; it does not draw a confidence interval by itself, but it gives us a feel for the sample data the interval is computed from.  The only way that we could know the true mean of the sepal length of these three species of Irises is if we took the time to record the sepal length of every one of them *in the world*.  This isn't plausible.  Instead, we can use the data we've collected about our samples to determine upper and lower bounds for our confidence interval.  If we have an acceptable error rate (often referred to as an 'Alpha' value) of 5%, then we have a confidence level of 95%.  This means that if we repeated the sampling many times, about 95% of the intervals built this way would contain the actual value of the population mean (often called the 'ground truth').  We find the upper and lower bounds by using the confidence interval formula.  
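One way to get a feel for what "95% confident" means is to simulate it. The sketch below assumes a made-up normal population (mean 100, standard deviation 8; these numbers are illustrative, not taken from the Iris data), draws many samples from it, builds a 95% interval around each sample mean, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Made-up population parameters, used only for this simulation
true_mean, true_std = 100.0, 8.0
n, trials = 50, 2000

hits = 0
for _ in range(trials):
    sample = rng.normal(true_mean, true_std, n)
    # Standard error of the mean, using the sample standard deviation
    sem = sample.std(ddof=1) / np.sqrt(n)
    # 95% interval centered on the sample mean
    lower, upper = stats.norm.interval(0.95, loc=sample.mean(), scale=sem)
    hits += int(lower <= true_mean <= upper)

coverage = hits / trials
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95. Any single interval either contains the true mean or it doesn't; the 95% describes how often the procedure succeeds over many repetitions.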

<center><img src='img/Confidence_Interval_Formulas.png' height=60% width=60%></center>

Don't let the mathematical notation in those pictures scare you.  Here's what they each mean:

n = sample size
<br>
x_bar = mean of the sample
<br>
s = standard deviation of the sample
<br>
z* = critical value of the standard normal distribution for the chosen confidence level, roughly 1.96 for 95% (can be found with a lookup table or using the `scipy.stats` package)
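As a quick sketch of how these pieces fit together, the snippet below looks up z* with `scipy.stats.norm.ppf` and computes an interval as x_bar plus or minus z* times s divided by the square root of n, which is the form used in the code cells of this notebook. The values for `n`, `x_bar`, and `s` are hypothetical, picked to keep the arithmetic easy to follow.

```python
from scipy import stats

confidence = 0.95
alpha = 1 - confidence

# Two-sided critical value: leave alpha / 2 in each tail of the standard normal
z_star = stats.norm.ppf(1 - alpha / 2)
print(round(z_star, 2))  # 1.96

# Hypothetical sample statistics, for illustration only
n = 100       # sample size
x_bar = 50.0  # sample mean
s = 10.0      # sample standard deviation

# Margin of error: z* times the standard error of the mean
margin_of_error = z_star * s / n ** 0.5
lower_bound = x_bar - margin_of_error
upper_bound = x_bar + margin_of_error
print(lower_bound, upper_bound)
```

Because the margin of error shrinks with the square root of n, quadrupling the sample size only halves the width of the interval.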

**TASK:**  Read in the `iris.csv` data set from the datasets folder. Make sure you specify `header=None`, and use the `column_names` variable to set the column names.  Compute the confidence intervals for at least one type of Iris flower.  

**STRETCH CHALLENGE #1:**  Write a function that takes in a Pandas Series and confidence level and returns the confidence interval.  (Hint: remember that each column in a dataframe is just a Series object!)

**STRETCH CHALLENGE #2:** Pick a column and visualize the sample mean or median for at least one flower using a box-whisker plot.   (Hint: Consider writing the function from the first challenge to output everything needed for this visualization; then this will be really easy!)

In [8]:
# Read in the dataset from iris.csv, in the datasets folder.  Make sure you pass 'header=None' and 'names=column_names'
# when calling pd.read_csv()!
column_names = ['Sepal Length(cm)', 'Sepal Width(cm)', 'Petal Length(cm)', 'Petal Width(cm)', 'Class']
df = pd.read_csv('datasets/iris.csv', header=None, names=column_names)
df

|   | Sepal Length(cm) | Sepal Width(cm) | Petal Length(cm) | Petal Width(cm) | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


In [18]:
Iris_virginica_df = df[df['Class'] == 'Iris-virginica']
Iris_virginica_df.describe()

|   | Sepal Length(cm) | Sepal Width(cm) | Petal Length(cm) | Petal Width(cm) |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 6.588 | 2.974 | 5.552 | 2.026 |
| std | 0.63588 | 0.322497 | 0.551895 | 0.27465 |
| min | 4.9 | 2.2 | 4.5 | 1.4 |
| 25% | 6.225 | 2.8 | 5.1 | 1.8 |
| 50% | 6.5 | 3.0 | 5.55 | 2.0 |
| 75% | 6.9 | 3.175 | 5.875 | 2.3 |
| max | 7.9 | 3.8 | 6.9 | 2.5 |


In [27]:
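import scipy.stats as st
# Sepal Length (values copied from the describe() output above)
sample_size = 50
sample_mean = 6.58800
sample_std_dev = 0.63588
# st.norm.interval(0.95) returns the (lower, upper) z-values bounding the central 95% of the standard normal
z_star = st.norm.interval(0.95)
# Margin of error: upper z-value times the standard error of the mean
interval = z_star[1] * (sample_std_dev / sample_size ** 0.5)
ucl = sample_mean + interval
lcl = sample_mean - interval
ucl, lcl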

(6.7642537047654949, 6.4117462952345052)

In [31]:
# Sepal Width
sample_size = 50
sample_mean = 2.974000
sample_std_dev = 0.322497
z_star = st.norm.interval(0.95)
interval = z_star[1] * (sample_std_dev / sample_size ** 0.5)
ucl = sample_mean + interval
lcl = sample_mean - interval
ucl, lcl

(3.0633899651282599, 2.8846100348717405)

In [32]:
# Petal Length
sample_size = 50
sample_mean = 5.552000
sample_std_dev = 0.551895
z_star = st.norm.interval(0.95)
interval = z_star[1] * (sample_std_dev / sample_size ** 0.5)
ucl = sample_mean + interval
lcl = sample_mean - interval
ucl, lcl

(5.7049746782278925, 5.3990253217721067)

In [33]:
# Petal Width
sample_size = 50
sample_mean = 2.02600
sample_std_dev = 0.27465
z_star = st.norm.interval(0.95)
interval = z_star[1] * (sample_std_dev / sample_size ** 0.5)
ucl = sample_mean + interval
lcl = sample_mean - interval
ucl, lcl

(2.1021276970715275, 1.9498723029284719)
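The four cells above repeat the same calculation with different numbers, which is a good sign that it belongs in a function. Here is one possible sketch for Stretch Challenge #1. The function name `mean_confidence_interval` and the synthetic `sample` Series are assumptions made for illustration; in the notebook you would pass a real column such as `Iris_virginica_df['Sepal Length(cm)']` instead. Like the cells above, it uses the normal (z) critical value rather than a t-distribution.

```python
import numpy as np
import pandas as pd
from scipy import stats

def mean_confidence_interval(series, confidence=0.95):
    """Return (lower, upper) bounds of a z-based confidence interval for the mean.

    Hypothetical helper for illustration; assumes a numeric pandas Series.
    """
    n = series.count()         # number of non-null observations
    sample_mean = series.mean()
    sample_std = series.std()  # pandas uses the sample std (ddof=1) by default

    # Two-sided critical value for the requested confidence level
    z_star = stats.norm.ppf(1 - (1 - confidence) / 2)

    # Margin of error: z* times the standard error of the mean
    margin_of_error = z_star * sample_std / np.sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Synthetic stand-in data so the example runs on its own
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(6.5, 0.6, 50))

lcl, ucl = mean_confidence_interval(sample, confidence=0.95)
print(lcl, ucl)
```

Note that this returns the bounds in (lower, upper) order, whereas the cells above display `ucl, lcl`. Asking for a higher confidence level, such as 0.99, widens the interval.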