## Continuous Probabilistic Methods Exercises

**Define a function named ```get_lower_and_upper_bounds``` that has two arguments. The first argument is a pandas Series. The second argument is the multiplier, which should have a default argument of 1.5.**

In [1]:
def get_lower_and_upper_bounds(df, multiplier = 1.5):
    '''
    This function takes in a data frame and a multiplier (default multiplier
    is set at 1.5) and prints the lower and upper bounds of the IQR range for
    each numeric column. It returns None.
    '''
    for col in df:
        
        # limit to only columns with numeric data
        if np.issubdtype(df[col].dtype, np.number):
        
            # define first and third quartiles
            q1 = df[col].quantile(0.25)  
            q3 = df[col].quantile(0.75)
        
            # define IQR
            iqr = q3 - q1
        
            # calculate upper and lower bounds, based on multipliers
            lower_bound = round(q1 - (multiplier * iqr), 2)
            upper_bound = round(q3 + (multiplier * iqr), 2)

            print(f"The lower bound of the range for '{col}'  is: {lower_bound} and the upper bound is {upper_bound}")
    
    return
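
The prompt asks for a function that takes a single Series, while the cell above loops over a data frame and prints its results. A minimal sketch of a Series-based variant that returns the bounds instead of printing them might look like this. The name `get_lower_and_upper_bounds_series` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def get_lower_and_upper_bounds_series(s, multiplier=1.5):
    """Return (lower_bound, upper_bound) for a numeric Series using the IQR rule."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    return lower_bound, upper_bound

# toy data for illustration only (not taken from lemonade.csv)
example = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])
lower, upper = get_lower_and_upper_bounds_series(example)
print(lower, upper)
```

Returning the bounds as a tuple lets later cells reuse them for filtering, which printed text does not allow.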

**1.  Using ```lemonade.csv``` dataset and focusing on continuous variables:**

- Use the IQR Range Rule and the upper and lower bounds to identify the lower outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these lower outliers make sense? Which outliers should be kept?

- Use the IQR Range Rule and the upper and lower bounds to identify the upper outliers of each column of lemonade.csv, using the multiplier of 1.5. Do these upper outliers make sense? Which outliers should be kept?

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# turn off warning boxes for presentation purposes
import warnings
warnings.filterwarnings("ignore")

In [3]:
# read csv to a dataframe
df = pd.read_csv('https://gist.githubusercontent.com/ryanorsinger/19bc7eccd6279661bd13307026628ace/raw/e4b5d6787015a4782f96cad6d1d62a8bdbac54c7/lemonade.csv')
df.head(2)

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         365 non-null    object 
 1   Day          365 non-null    object 
 2   Temperature  365 non-null    float64
 3   Rainfall     365 non-null    float64
 4   Flyers       365 non-null    int64  
 5   Price        365 non-null    float64
 6   Sales        365 non-null    int64  
dtypes: float64(3), int64(2), object(2)
memory usage: 20.1+ KB


In [5]:
# convert Date to datetime format
df['Date'] = pd.to_datetime(df.Date)

In [6]:
get_lower_and_upper_bounds(df)

The lower bound of the range for 'Temperature'  is: 16.7 and the upper bound is 104.7
The lower bound of the range for 'Rainfall'  is: 0.26 and the upper bound is 1.3
The lower bound of the range for 'Flyers'  is: 4.0 and the upper bound is 76.0
The lower bound of the range for 'Price'  is: 0.5 and the upper bound is 0.5
The lower bound of the range for 'Sales'  is: 5.0 and the upper bound is 45.0


In [7]:
# Draft of an outlier listing for a single column (left commented out)
# col = 'Rainfall'
# q1 = df[col].quantile(0.25)
# q3 = df[col].quantile(0.75)
#
# # define IQR
# iqr = q3 - q1
#
# # calculate upper and lower bounds, based on the multiplier
# lower_bound = round(q1 - (1.5 * iqr), 2)
# upper_bound = round(q3 + (1.5 * iqr), 2)
#
# # keep the values that fall outside the bounds
# outliers = df[col][(df[col] < lower_bound) | (df[col] > upper_bound)].tolist()
# print(f'The outliers in {col} are {outliers}')




### Do bounds and outliers make sense (multiplier of 1.5):
- Temperature -- temperatures over 104 are probably errors
- Rainfall -- rainfall lower bound should be 0, no upper bounds are necessary
- Flyers -- negative amounts should be deleted (can't have negative flyers)
- Price -- is a constant
- Sales -- lower bound should be 0, upper bound makes sense
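
The cell above prints only the bounds, so the individual outlying values still have to be pulled out to judge them. A rough sketch of that step follows. The helper `iqr_outliers` and the sample values are assumptions made for illustration; with the real data the input would be a column such as `df['Flyers']`.

```python
import pandas as pd

def iqr_outliers(s, multiplier=1.5):
    """Split a numeric Series into the values below and above the IQR bounds."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    return s[s < lower_bound], s[s > upper_bound]

# made-up values for illustration only
flyers = pd.Series([35, 40, 38, 42, 39, 41, 37, -38, 36, 80])
low, high = iqr_outliers(flyers)
print("below lower bound:", low.tolist())
print("above upper bound:", high.tolist())
```

Keeping the lower and upper outliers separate matches the two questions in the exercise, which ask about each side in turn.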

- Using the multiplier of 3, IQR Range Rule, and the lower and upper bounds, identify the outliers below the lower bound in each column of lemonade.csv. Do these lower outliers make sense? Which outliers should be kept?

- Using the multiplier of 3, IQR Range Rule, and the lower and upper bounds, identify the outliers above the upper_bound in each column of lemonade.csv. Do these upper outliers make sense? Which outliers should be kept?

In [8]:
get_lower_and_upper_bounds(df, 3)

The lower bound of the range for 'Temperature'  is: -16.3 and the upper bound is 137.7
The lower bound of the range for 'Rainfall'  is: -0.13 and the upper bound is 1.69
The lower bound of the range for 'Flyers'  is: -23.0 and the upper bound is 103.0
The lower bound of the range for 'Price'  is: 0.5 and the upper bound is 0.5
The lower bound of the range for 'Sales'  is: -10.0 and the upper bound is 60.0


### Do bounds and outliers make sense (multiplier of 3):
- Temperature -- the bounds are too wide to be useful: -16.3 is too low and 137.7 is too high
- Rainfall -- rainfall lower bound should be 0, no negative rainfall / upper bound is reasonable
- Flyers -- negative amounts should be deleted (can't have negative flyers)
- Price -- is a constant
- Sales -- lower bound should be 0, upper bound makes sense
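
Writing the rule out shows why the larger multiplier pushes the lower bounds below zero. The Temperature quartiles used here are back-solved from the printed, rounded bounds, so they should be read as approximate rather than as values taken from the data.

$$
\text{IQR} = Q_3 - Q_1, \qquad \text{lower} = Q_1 - k \cdot \text{IQR}, \qquad \text{upper} = Q_3 + k \cdot \text{IQR}
$$

With $k = 1.5$ the printed Temperature bounds are 16.7 and 104.7. Their difference is $(1 + 2k) \cdot \text{IQR} = 4 \cdot \text{IQR} = 88$, so $\text{IQR} \approx 22$, $Q_1 \approx 49.7$ and $Q_3 \approx 71.7$. Substituting $k = 3$:

$$
\text{lower} \approx 49.7 - 3 \cdot 22 = -16.3, \qquad \text{upper} \approx 71.7 + 3 \cdot 22 = 137.7
$$

These agree with the bounds printed for the multiplier of 3.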

**2. Identify if any columns in lemonade.csv are normally distributed. For normally distributed columns:**

- Use a 2 sigma decision rule to isolate the outliers.
- Do these make sense?
- Should certain outliers be kept or removed?

- Temperature and Flyers look approximately normally distributed (possibly Rainfall too)
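
The notebook does not show how normality was judged. One quick numeric check, sketched below, is to compare skewness and excess kurtosis, both of which sit near 0 for a normal distribution. The helper `shape_summary` and the synthetic samples are assumptions for illustration; a histogram of each column would be a reasonable complement.

```python
import numpy as np
import pandas as pd

def shape_summary(s):
    """Return skewness and excess kurtosis; both are near 0 for normal data."""
    return {"skew": s.skew(), "excess_kurtosis": s.kurt()}

# synthetic samples for illustration only (not from lemonade.csv)
rng = np.random.default_rng(0)
bell_shaped = pd.Series(rng.normal(loc=60, scale=15, size=1000))
right_skewed = pd.Series(rng.exponential(scale=1.0, size=1000))

print(shape_summary(bell_shaped))
print(shape_summary(right_skewed))
```

With the real data, the same function could be applied to `df['Temperature']`, `df['Flyers']` and `df['Rainfall']`. A single extreme value, such as a data-entry error, can inflate both statistics on its own.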

In [9]:
#Using 2 Sigma Decision Rule to isolate outliers
norm = ['Temperature','Flyers','Rainfall']

 
# Calculate the z-score
z_scores = pd.Series((df['Temperature'] - df['Temperature'].mean()) / df['Temperature'].std())

# Create a column for z-scores
df['2z_scores_temp'] = z_scores

# Finds all of the observations two standard deviations or more from the mean.
z_outliers = df[df['2z_scores_temp'].abs() >= 2]

z_outliers


Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales,2z_scores_temp
41,2017-02-11,Saturday,212.0,0.91,35,0.5,21,8.336627
166,2017-06-16,Friday,99.3,0.47,77,0.5,41,2.105251
176,2017-06-26,Monday,102.6,0.47,60,0.5,42,2.287714
181,2017-07-01,Saturday,102.9,0.47,59,0.5,143,2.304301
190,2017-07-10,Monday,98.0,0.49,66,0.5,40,2.033372
198,2017-07-18,Tuesday,99.3,0.47,76,0.5,41,2.105251
202,2017-07-22,Saturday,99.6,0.47,49,0.5,42,2.121838
207,2017-07-27,Thursday,97.9,0.47,74,0.5,43,2.027843
338,2017-12-05,Tuesday,22.0,1.82,11,0.5,10,-2.168799
364,2017-12-31,Sunday,15.1,2.5,9,0.5,7,-2.550311


In [10]:
# Using 2 Sigma Decision Rule to isolate outliers

norm = ['Temperature','Flyers','Rainfall']

 
# Calculate the z-score
z_scores = pd.Series((df['Flyers'] - df['Flyers'].mean()) / df['Flyers'].std())

# Create a column for z-scores
df['2z_scores_flyer'] = z_scores

# Finds all of the observations two standard deviations or more from the mean.
z_outliers = df[df['2z_scores_flyer'].abs() >= 2]

z_outliers

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales,2z_scores_temp,2z_scores_flyer
166,2017-06-16,Friday,99.3,0.47,77,0.5,41,2.105251,2.676244
170,2017-06-20,Tuesday,85.1,0.54,70,0.5,37,1.320109,2.168499
171,2017-06-21,Wednesday,94.3,0.47,76,0.5,41,1.828792,2.603709
182,2017-07-02,Sunday,93.4,0.51,68,0.5,158,1.77903,2.023429
183,2017-07-03,Monday,81.5,0.54,68,0.5,235,1.121058,2.023429
194,2017-07-14,Friday,92.0,0.5,80,0.5,40,1.701621,2.893849
198,2017-07-18,Tuesday,99.3,0.47,76,0.5,41,2.105251,2.603709
203,2017-07-23,Sunday,89.1,0.51,72,0.5,37,1.541275,2.313569
204,2017-07-24,Monday,83.5,0.57,69,0.5,35,1.231642,2.095964
207,2017-07-27,Thursday,97.9,0.47,74,0.5,43,2.027843,2.458639


In [11]:
# Using 2 Sigma Decision Rule to isolate outliers

norm = ['Temperature','Flyers','Rainfall']

 
# Calculate the z-score
z_scores = pd.Series((df['Rainfall'] - df['Rainfall'].mean()) / df['Rainfall'].std())

# Create a column for z-scores
df['2z_scores_rain'] = z_scores

# Finds all of the observations two standard deviations or more from the mean.
z_outliers = df[df['2z_scores_rain'].abs() >= 2]

z_outliers

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales,2z_scores_temp,2z_scores_flyer,2z_scores_rain
0,2017-01-01,Sunday,27.0,2.0,15,0.5,10,-1.89234,-1.820927,4.286488
5,2017-01-06,Friday,25.3,1.54,23,0.5,11,-1.986336,-1.240647,2.606983
6,2017-01-07,Saturday,32.9,1.54,19,0.5,13,-1.566119,-1.530787,2.606983
10,2017-01-11,Wednesday,32.6,1.54,23,0.5,12,-1.582706,-1.240647,2.606983
15,2017-01-16,Monday,30.6,1.67,24,0.5,12,-1.69329,-1.168112,3.081626
16,2017-01-17,Tuesday,32.2,1.43,26,0.5,14,-1.604823,-1.023042,2.205363
19,2017-01-20,Friday,31.6,1.43,20,0.5,12,-1.637998,-1.458252,2.205363
23,2017-01-24,Tuesday,28.6,1.54,20,0.5,12,-1.803873,-1.458252,2.606983
337,2017-12-04,Monday,34.9,1.54,16,0.5,13,-1.455535,-1.748392,2.606983
338,2017-12-05,Tuesday,22.0,1.82,11,0.5,10,-2.168799,-2.111067,3.629291
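
The three cells above repeat the same steps with a different column each time. A possible way to fold them into one helper is sketched here. The function name `sigma_outliers` and the toy data frame are hypothetical, and unlike the cells above the helper does not add a z-score column to the data frame.

```python
import pandas as pd

def sigma_outliers(df, col, n_sigma=2):
    """Return rows whose value in col is n_sigma or more standard deviations from the mean."""
    z_scores = (df[col] - df[col].mean()) / df[col].std()
    return df[z_scores.abs() >= n_sigma]

# toy data for illustration only; 212 mimics an obvious entry error
toy = pd.DataFrame({"Temperature": [60, 62, 58, 61, 59, 60, 63, 57, 61, 212]})
result = sigma_outliers(toy, "Temperature", n_sigma=2)
print(result)
```

With the notebook's data this could be called once per column, for example `sigma_outliers(df, 'Rainfall')`, and again with `n_sigma=3` for the 3 sigma rule.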


### Dataframe of outliers under the 2 sigma decision rule

In [12]:
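# 2-sigma z-score outliers

# create a dataframe with rain outliers
z2_outliers = df[df['2z_scores_rain'].abs() >= 2]
# add columns for temp and flyer outliers
# NOTE: the right-hand sides are whole data frames, not z-score columns, which is
# why the displayed result shows NaT and dates in these two columns
z2_outliers['2z_scores_temp'] = df[df['2z_scores_temp'].abs() >= 2]
z2_outliers['2z_scores_flyer'] = df[df['2z_scores_flyer'].abs() >= 2]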
z2_outliers.shape

(19, 10)

**3. Now use a 3 sigma decision rule to isolate the outliers in the normally distributed columns from lemonade.csv**

In [13]:
z2_outliers

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales,2z_scores_temp,2z_scores_flyer,2z_scores_rain
0,2017-01-01,Sunday,27.0,2.0,15,0.5,10,NaT,NaT,4.286488
5,2017-01-06,Friday,25.3,1.54,23,0.5,11,NaT,NaT,2.606983
6,2017-01-07,Saturday,32.9,1.54,19,0.5,13,NaT,NaT,2.606983
10,2017-01-11,Wednesday,32.6,1.54,23,0.5,12,NaT,NaT,2.606983
15,2017-01-16,Monday,30.6,1.67,24,0.5,12,NaT,NaT,3.081626
16,2017-01-17,Tuesday,32.2,1.43,26,0.5,14,NaT,NaT,2.205363
19,2017-01-20,Friday,31.6,1.43,20,0.5,12,NaT,NaT,2.205363
23,2017-01-24,Tuesday,28.6,1.54,20,0.5,12,NaT,NaT,2.606983
337,2017-12-04,Monday,34.9,1.54,16,0.5,13,NaT,NaT,2.606983
338,2017-12-05,Tuesday,22.0,1.82,11,0.5,10,2017-12-05 00:00:00,2017-12-05 00:00:00,3.629291
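
The `NaT` and date entries in the z-score columns above suggest that the assignments did not carry the z-scores over. If the goal is one data frame holding every row that is an outlier in at least one column, a boolean mask built with `any(axis=1)` is one way to get there. This is a sketch under that assumption; `any_sigma_outliers`, the `_z` suffix and the toy data are all hypothetical.

```python
import pandas as pd

def any_sigma_outliers(df, cols, n_sigma=2):
    """Return rows that are outliers in at least one of cols, with z-scores attached."""
    # column-wise z-scores for the selected columns
    z = (df[cols] - df[cols].mean()) / df[cols].std()
    z = z.add_suffix("_z")
    # True for rows where any column meets or exceeds the threshold
    mask = (z.abs() >= n_sigma).any(axis=1)
    return df.join(z)[mask]

# toy data for illustration only
toy = pd.DataFrame({
    "Temperature": [60, 62, 58, 61, 59, 60, 63, 57, 61, 212],
    "Rainfall": [0.5, 0.6, 0.5, 0.4, 0.5, 2.5, 0.6, 0.5, 0.4, 0.5],
})
combined = any_sigma_outliers(toy, ["Temperature", "Rainfall"], n_sigma=2)
print(combined)
```

Passing `n_sigma=3` gives the corresponding frame for the 3 sigma rule.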


In [14]:
# Using 3 Sigma Decision Rule to isolate outliers

norm = ['Temperature','Flyers','Rainfall']

 
# Calculate the z-score
z_scores = pd.Series((df['Temperature'] - df['Temperature'].mean()) / df['Temperature'].std())

# Create a column for z-scores
df['z_scores_temp'] = z_scores

# Finds all of the observations three standard deviations or more.
z_outliers = df[df['z_scores_temp'].abs() >= 3]


In [15]:
# Using 3 Sigma Decision Rule to isolate outliers

norm = ['Temperature','Flyers','Rainfall']

 
# Calculate the z-score
z_scores = pd.Series((df['Flyers'] - df['Flyers'].mean()) / df['Flyers'].std())

# Create a column for z-scores
df['z_scores_flyer'] = z_scores

# Finds all of the observations three standard deviations or more.
z_outliers = df[df['z_scores_flyer'].abs() >= 3]

z_outliers

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales,2z_scores_temp,2z_scores_flyer,2z_scores_rain,z_scores_temp,z_scores_flyer
324,2017-11-21,Tuesday,47.0,0.95,-38,0.5,20,-0.786506,-5.665283,0.452836,-0.786506,-5.665283


In [16]:
# Using 3 Sigma Decision Rule to isolate outliers

norm = ['Temperature','Flyers','Rainfall']

 
# Calculate the z-score
z_scores = pd.Series((df['Rainfall'] - df['Rainfall'].mean()) / df['Rainfall'].std())

# Create a column for z-scores
df['z_scores_rain'] = z_scores

# Finds all of the observations three standard deviations or more.
z_outliers = df[df['z_scores_rain'].abs() >= 3]

z_outliers

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales,2z_scores_temp,2z_scores_flyer,2z_scores_rain,z_scores_temp,z_scores_flyer,z_scores_rain
0,2017-01-01,Sunday,27.0,2.0,15,0.5,10,-1.89234,-1.820927,4.286488,-1.89234,-1.820927,4.286488
15,2017-01-16,Monday,30.6,1.67,24,0.5,12,-1.69329,-1.168112,3.081626,-1.69329,-1.168112,3.081626
338,2017-12-05,Tuesday,22.0,1.82,11,0.5,10,-2.168799,-2.111067,3.629291,-2.168799,-2.111067,3.629291
343,2017-12-10,Sunday,31.3,1.82,15,0.5,11,-1.654586,-1.820927,3.629291,-1.654586,-1.820927,3.629291
364,2017-12-31,Sunday,15.1,2.5,9,0.5,7,-2.550311,-2.256137,6.112037,-2.550311,-2.256137,6.112037


### Dataframe of all outliers identified under the 3 sigma decision rule

In [17]:
# 3-sigma z-score outliers

# create a dataframe with rain outliers
z3_outliers = df[df['z_scores_rain'].abs() >= 3]
# add columns for temp and flyer outliers
# NOTE: the right-hand sides are whole data frames, not z-score columns, which is
# why the displayed result shows NaT in these two columns
z3_outliers['z_scores_temp'] = df[df['2z_scores_temp'].abs() >= 3]
z3_outliers['z_scores_flyer'] = df[df['2z_scores_flyer'].abs() >= 3]
z3_outliers

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales,2z_scores_temp,2z_scores_flyer,2z_scores_rain,z_scores_temp,z_scores_flyer,z_scores_rain
0,2017-01-01,Sunday,27.0,2.0,15,0.5,10,-1.89234,-1.820927,4.286488,NaT,NaT,4.286488
15,2017-01-16,Monday,30.6,1.67,24,0.5,12,-1.69329,-1.168112,3.081626,NaT,NaT,3.081626
338,2017-12-05,Tuesday,22.0,1.82,11,0.5,10,-2.168799,-2.111067,3.629291,NaT,NaT,3.629291
343,2017-12-10,Sunday,31.3,1.82,15,0.5,11,-1.654586,-1.820927,3.629291,NaT,NaT,3.629291
364,2017-12-31,Sunday,15.1,2.5,9,0.5,7,-2.550311,-2.256137,6.112037,NaT,NaT,6.112037
