In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Exercise 1
Define a function named `get_lower_and_upper_bounds` that takes two arguments. The first argument is a pandas Series. The second argument is the multiplier, which should have a default value of 1.5.

In [17]:
def get_lower_and_upper_bounds(series, multiplier=1.5):
    """Return the (lower, upper) IQR bounds for a pandas Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1  # interquartile range
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    return lower_bound, upper_bound
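
As a quick sanity check, the sketch below runs the function on a small made-up Series. The values are illustrative and do not come from lemonade.csv; the function is repeated so the cell runs on its own.

```python
import pandas as pd

def get_lower_and_upper_bounds(series, multiplier=1.5):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

# Hypothetical toy data: the integers 1 through 9
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# For this data Q1 = 3, Q3 = 7, so IQR = 4
lower, upper = get_lower_and_upper_bounds(toy)
wide_lower, wide_upper = get_lower_and_upper_bounds(toy, multiplier=3)

print(lower, upper)            # -3.0 13.0
print(wide_lower, wide_upper)  # -9.0 19.0
```

A larger multiplier widens the bounds, so fewer values are flagged.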

## Exercise 2
Using lemonade.csv dataset and focusing on continuous variables:

- Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?

In [5]:
df = pd.read_csv('lemonade.csv')

In [7]:
df.head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15
3,1/4/17,Wednesday,44.1,1.05,28,0.5,17
4,1/5/17,Thursday,42.4,1.0,33,0.5,18


In [38]:
df.describe()

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
count,365.0,365.0,365.0,365.0,365.0
mean,61.224658,0.825973,40.10411,0.5,27.865753
std,18.085892,0.27389,13.786445,0.0,30.948132
min,15.1,0.4,-38.0,0.5,7.0
25%,49.7,0.65,31.0,0.5,20.0
50%,61.1,0.74,39.0,0.5,25.0
75%,71.7,0.91,49.0,0.5,30.0
max,212.0,2.5,80.0,0.5,534.0


In [8]:
# Assign the Series to be explored
temp = df.Temperature
rain = df.Rainfall
flyers = df.Flyers
price = df.Price
sales = df.Sales

In [25]:
lower_temp, upper_temp = get_lower_and_upper_bounds(temp)

temp[temp <= lower_temp]

364    15.1
Name: Temperature, dtype: float64

Depending on the location, this temperature could make sense. We will keep this outlier moving forward.

In [26]:
lower_rain, upper_rain = get_lower_and_upper_bounds(rain)

rain[rain <= lower_rain]

Series([], Name: Rainfall, dtype: float64)

There are no lower outliers for rainfall.

In [27]:
lower_flyers, upper_flyers = get_lower_and_upper_bounds(flyers)
flyers[flyers <= lower_flyers]

324   -38
Name: Flyers, dtype: int64

Unless somebody lost flyers that day, a negative flyer count is not possible. This outlier should be dropped.

In [28]:
lower_price, upper_price = get_lower_and_upper_bounds(price)
price[price <= lower_price]

0      0.5
1      0.5
2      0.5
3      0.5
4      0.5
      ... 
360    0.5
361    0.5
362    0.5
363    0.5
364    0.5
Name: Price, Length: 365, dtype: float64

After further investigation, every price is 0.5. The IQR is therefore 0 and both bounds equal 0.5, so the `<=` comparison flags all 365 rows even though none is a real outlier. We can ignore price.
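
The all-rows result for a constant column can be reproduced in isolation. The sketch below uses a made-up constant Series (not the real Price column) and compares an inclusive comparison with a strict one; the strict version is an alternative, not what the cells above use.

```python
import pandas as pd

# Hypothetical constant column, mimicking a price that never changes
price = pd.Series([0.5] * 10)

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1                 # 0.0 for a constant column
lower = q1 - 1.5 * iqr        # equals 0.5
upper = q3 + 1.5 * iqr        # equals 0.5

flagged_inclusive = price[price <= lower]  # every row, since 0.5 <= 0.5
flagged_strict = price[price < lower]      # no rows

print(len(flagged_inclusive), len(flagged_strict))  # 10 0
```

Whether a value sitting exactly on a bound counts as an outlier is a convention; the strict form avoids flagging an entire constant column.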

In [33]:
lower_sales, upper_sales = get_lower_and_upper_bounds(sales)
sales[sales <= lower_sales]

Series([], Name: Sales, dtype: int64)

There are no lower outliers for sales.
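
The column-by-column checks can also be collapsed into a loop. This is a minimal sketch, assuming a hypothetical helper `find_iqr_outliers` and a made-up DataFrame; it uses strict comparisons, which differs slightly from the `<=` used in the cells above.

```python
import pandas as pd

def get_lower_and_upper_bounds(series, multiplier=1.5):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def find_iqr_outliers(df, multiplier=1.5):
    """Return {column: (lower_outliers, upper_outliers)} for numeric columns."""
    results = {}
    for col in df.select_dtypes(include="number").columns:
        lower, upper = get_lower_and_upper_bounds(df[col], multiplier)
        results[col] = (df[col][df[col] < lower], df[col][df[col] > upper])
    return results

# Hypothetical toy data, not the real lemonade.csv
toy = pd.DataFrame({
    "Temperature": [50, 55, 60, 65, 70, 212],
    "Flyers": [30, 35, 40, 45, 50, -38],
    "Day": ["a", "b", "c", "d", "e", "f"],  # non-numeric, so it is skipped
})

results = find_iqr_outliers(toy)
for col, (low, high) in results.items():
    print(col, "lower:", list(low), "upper:", list(high))
```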

- Use the IQR Range Rule and the upper and lower bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?

In [39]:
temp[temp >= upper_temp]

41    212.0
Name: Temperature, dtype: float64

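A reading of 212 is far above every other temperature in the data (the 75th percentile is 71.7), and if the values are in degrees Fahrenheit it is the boiling point of water. They are probably not selling lemonade in those conditions, so this is most likely a mistake. We should drop this outlier.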

In [40]:
rain[rain >= upper_rain]

0      2.00
1      1.33
2      1.33
5      1.54
6      1.54
10     1.54
11     1.33
12     1.33
15     1.67
16     1.43
19     1.43
23     1.54
27     1.33
28     1.33
337    1.54
338    1.82
342    1.43
343    1.82
345    1.33
346    1.43
347    1.54
350    1.33
351    1.43
354    1.33
355    1.54
359    1.43
363    1.43
364    2.50
Name: Rainfall, dtype: float64

I think we should keep all of the upper rainfall outliers, because a typical person can't tell the difference between 1.5 inches and 2 inches of rain. More than likely, sales stay roughly constant once rainfall passes a certain amount.

In [43]:
flyers[flyers >= upper_flyers]

166    77
171    76
194    80
198    76
Name: Flyers, dtype: int64

These four values sit at or just above the upper bound of 76. It may be in our best interest to keep them, since they are all within 4 units of the bound.

In [46]:
price[price >= upper_price]

0      0.5
1      0.5
2      0.5
3      0.5
4      0.5
      ... 
360    0.5
361    0.5
362    0.5
363    0.5
364    0.5
Name: Price, Length: 365, dtype: float64

As with the lower bound, every price is 0.5, so the IQR is 0 and the upper bound is also 0.5. The `>=` comparison flags all 365 rows even though none is a real outlier. We can ignore price.

In [47]:
sales[sales >= upper_sales]

181    143
182    158
183    235
184    534
Name: Sales, dtype: int64

In [49]:
df.iloc[180:185,:]

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
180,6/30/17,Friday,89.4,0.53,47,0.5,38
181,7/1/17,Saturday,102.9,0.47,59,0.5,143
182,7/2/17,Sunday,93.4,0.51,68,0.5,158
183,7/3/17,Monday,81.5,0.54,68,0.5,235
184,7/4/17,Tuesday,84.2,0.59,49,0.5,534


The upper outliers are roughly 3 to 12 times the upper bound of 45. Indexes 183 and 184 might be input mistakes, given how high their sales are compared to the rest, so we should probably drop them. Indexes 181 and 182 could possibly be explained as "normal" outliers, since the temperatures on those days are much higher.
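
The size of these outliers relative to the upper bound can be checked with a little arithmetic. The sketch below uses the Sales quartiles shown in the `df.describe()` output (20 and 30) and the four flagged values; it assumes those displayed quartiles are exact.

```python
# Quartiles for Sales as displayed in the df.describe() output
q1, q3 = 20.0, 30.0
upper_bound = q3 + 1.5 * (q3 - q1)  # 30 + 1.5 * 10

# The four flagged Sales values, keyed by row index
flagged = {181: 143, 182: 158, 183: 235, 184: 534}
ratios = {idx: round(value / upper_bound, 2) for idx, value in flagged.items()}

print(upper_bound)  # 45.0
print(ratios)       # {181: 3.18, 182: 3.51, 183: 5.22, 184: 11.87}
```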

- Using the multiplier of 3, IQR Range Rule, and the lower and upper bounds, identify the outliers below the lower bound in each column of lemonade.csv. Do these lower outliers make sense? Which outliers should be kept?
- Using the multiplier of 3, IQR Range Rule, and the lower and upper bounds, identify the outliers above the upper bound in each column of lemonade.csv. Do these upper outliers make sense? Which outliers should be kept?
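
As a starting point for these two items, the bounds at a multiplier of 3 can be worked out from the quartiles in the `df.describe()` output. This is only a sketch: it assumes the displayed quartiles are exact, and it checks just the minimum and maximum of each column rather than every row. Price is left out because it is constant.

```python
# Quartiles and extremes copied from the df.describe() output
stats = {
    "Temperature": {"q1": 49.7, "q3": 71.7, "min": 15.1, "max": 212.0},
    "Rainfall": {"q1": 0.65, "q3": 0.91, "min": 0.4, "max": 2.5},
    "Flyers": {"q1": 31.0, "q3": 49.0, "min": -38.0, "max": 80.0},
    "Sales": {"q1": 20.0, "q3": 30.0, "min": 7.0, "max": 534.0},
}

multiplier = 3
summary = {}
for name, s in stats.items():
    iqr = s["q3"] - s["q1"]
    lower = s["q1"] - multiplier * iqr
    upper = s["q3"] + multiplier * iqr
    summary[name] = {
        "lower": lower,
        "upper": upper,
        "min_below_lower": s["min"] < lower,
        "max_above_upper": s["max"] > upper,
    }
    print(f"{name}: lower={lower:.2f}, upper={upper:.2f}")
```

Under these assumptions, only Flyers has a minimum below its lower bound (-38 against -23), while Temperature, Rainfall, and Sales each have a maximum above the upper bound. Running `get_lower_and_upper_bounds(series, multiplier=3)` on the real columns would list every affected row.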

## Exercise 3
Identify if any columns in lemonade.csv are normally distributed. For normally distributed columns:

- Use a 2 sigma decision rule to isolate the outliers.
- Do these make sense?
- Should certain outliers be kept or removed?
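
One way to approach this exercise is with z-scores. The sketch below assumes a hypothetical helper `get_sigma_outliers` and made-up data; `Series.skew()` is included only as a rough symmetry check, not a formal normality test.

```python
import pandas as pd

def get_sigma_outliers(series, n_sigma=2):
    """Return values more than n_sigma standard deviations from the mean."""
    z_scores = (series - series.mean()) / series.std()
    return series[z_scores.abs() > n_sigma]

# Hypothetical toy data: 21 evenly spaced values plus one extreme value
toy = pd.Series(list(range(40, 61)) + [150])

# Skew near 0 suggests a symmetric distribution; a large value does not
print(toy.skew())

outliers = get_sigma_outliers(toy)
print(list(outliers))  # [150]
```

Note that an extreme value inflates both the mean and the standard deviation, so the sigma rule can be less sensitive than the IQR rule on heavily skewed columns.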

## Exercise 4
Now use a 3 sigma decision rule to isolate the outliers in the normally distributed columns from lemonade.csv.
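
The 3 sigma ranges can be estimated from the means and standard deviations in the `df.describe()` output. This sketch does not establish which columns are normally distributed; it only computes mean ± 3·std for each non-constant column, using a hypothetical helper `sigma_bounds`.

```python
# Mean and standard deviation copied from the df.describe() output
summary = {
    "Temperature": (61.224658, 18.085892),
    "Rainfall": (0.825973, 0.27389),
    "Flyers": (40.10411, 13.786445),
    "Sales": (27.865753, 30.948132),
}

def sigma_bounds(mean, std, n_sigma=3):
    """Return (mean - n_sigma * std, mean + n_sigma * std)."""
    return mean - n_sigma * std, mean + n_sigma * std

bounds = {name: sigma_bounds(mean, std) for name, (mean, std) in summary.items()}
for name, (lower, upper) in bounds.items():
    print(f"{name}: {lower:.2f} to {upper:.2f}")
```

With these figures, the maxima of Temperature (212), Rainfall (2.5), and Sales (534) and the minimum of Flyers (-38) fall outside their ranges. The rule is only meaningful for columns that are at least approximately normal.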