# Fall 2021 - Section 01 - Project 1

__Members:__
* Ethan Kamus
* Nathaniel Matthew Marquez
* Rebecca Lee

__Goal:__ Use Python and Jupyter to implement algorithms for outlier detection.

## Experiment 1

#### __1. Use the csv module to load in the dataset of participants:__
* The csv module is used to open the participants dataset file and load its contents.
* csv.DictReader turns each data row into a dictionary whose keys are the column headers and whose values are the entries recorded under each column. The result is a list of dictionaries, one per participant.

In [9]:
import csv

participants = []
with open('participants.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        participants.append(row)
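
To see the shape of the rows that `csv.DictReader` yields without needing `participants.csv`, the sketch below reads a small in-memory sample. The two sample rows are invented for illustration; only the column headers `Student Name` and `Week 1` come from the notebook.

```python
import csv
import io

# Hypothetical sample standing in for participants.csv (values are invented).
sample_csv = "Student Name,Week 1\nStudent A,175\nStudent B,24\n"

# DictReader uses the first row as keys and yields one dict per data row.
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(sample_rows[0])

# Every value is read as a string, so numeric columns need an explicit cast.
week1_values = [int(row["Week 1"]) for row in sample_rows]
print(week1_values)
```

This is why the later experiments cast each column value to `int` before computing statistics.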


## Experiment 2

#### __2. Using the statistics module, find the mean and median for Week 1's data:__
* The variable __*wk1*__ is built with a list comprehension that iterates through the list of participant dictionaries created above, takes the value stored under the key 'Week 1' (the column header), and casts it to int, since the csv module reads every value as a string.
* We then compute the mean and median of the Week 1 values with the corresponding functions of the statistics module.

In [10]:
import statistics

wk1 = [int(m['Week 1']) for m in participants]

wk1_mean = statistics.mean(wk1)
wk1_median = statistics.median(wk1)

wk1_mean, wk1_median

(161, 175)

## Experiment 3
#### __3. Use the statistics module to get the quartiles of the Week 1 data points__
To get the quartiles of the Week 1 values, we call the quantiles function in the statistics module with n = 4 and store the three resulting cut points in the list variable __*quartiles*__.

In [11]:
quartiles = statistics.quantiles(wk1, n=4)

quartiles

[174.0, 175.0, 179.0]
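
`statistics.quantiles` also accepts a `method` argument; the call above relies on the default, `'exclusive'`. As a small illustration on made-up data (not the Week 1 values), the two available methods can give different cut points:

```python
import statistics

# Made-up sample, used only to illustrate the `method` argument.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Default: treats the data as a sample drawn from a larger population.
exclusive = statistics.quantiles(sample, n=4, method="exclusive")

# Alternative: treats the data as the whole population, min and max included.
inclusive = statistics.quantiles(sample, n=4, method="inclusive")

print(exclusive)
print(inclusive)
```

Because outlier fences are built from Q1 and Q3, the choice of method can shift which values count as outliers, so it is worth stating which one was used.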

## Experiment 4
__4. Using Tukey's fences method, find the outliers in the dataset, where k = 1.5 indicates an "outlier"__
$$ [Q1 - k(Q3 - Q1), Q3 + k(Q3 - Q1)] $$

* The Tukey's fences method identifies outliers from the IQR (interquartile range) of the observed data, the same rule used for boxplots. Q1 is the lower quartile and Q3 is the upper quartile, and we set k = 1.5, the value John Tukey proposed for indicating "outliers".
* We take the lower and upper quartile values from Experiment 3's results and apply the formula above to get the lower and upper outlier bounds.
* To get the outliers, we loop through the list of Week 1 values and pick out any that fall outside that range.

In [12]:
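k = 1.5
# Lower and upper quartiles from Experiment 3 (the median is not needed here).
q1 = quartiles[0]
q3 = quartiles[2]

# Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR].
tukey_fence = [(q1 - k*(q3 - q1)), (q3 + k*(q3 - q1))]

# Week 1 values that fall outside the fences.
outliers = [i for i in wk1 if (i < tukey_fence[0] or i > tukey_fence[1])]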

outliers

[77, 51, 9, 24]
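
As a check on the arithmetic, the fences can be recomputed from the quartiles reported in Experiment 3. The sketch below hard-codes the printed values rather than re-reading the dataset; the names `lower_fence` and `upper_fence` are illustrative.

```python
# Quartiles and outliers as printed in Experiments 3 and 4.
q1, q3 = 174.0, 179.0
reported_outliers = [77, 51, 9, 24]
k = 1.5

iqr = q3 - q1                # interquartile range
lower_fence = q1 - k * iqr   # values below this are outliers
upper_fence = q3 + k * iqr   # values above this are outliers

print(lower_fence, upper_fence)
# Check that every reported outlier falls below the lower fence.
print(all(value < lower_fence for value in reported_outliers))
```

With an IQR of 5, the fences are 166.5 and 186.5, so all four reported outliers are unusually low values rather than unusually high ones.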

## Experiment 5

__5. Compute the standard deviation of the Week 1 dataset to find outliers using the 68-95-99.7 rule, then compare with the outliers found with Tukey's fences method in Experiment 4.__

$$ \Pr(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 99.73\% $$

* The 68-95-99.7 rule is another method for detecting outliers in normally distributed data, where nearly all values lie within three standard deviations of the mean. The formula above states the three-standard-deviation case: a value falls inside that range with about 99.7% probability, so anything outside it is almost certainly an outlier.
* We get the standard deviation of the Week 1 data from the statistics module, then loop through the Week 1 values and collect in the __*outliers_sd*__ list every value that lies three or more standard deviations below or above the mean.

In [13]:
sd = statistics.stdev(wk1)
outliers_sd = [j for j in wk1 if (j <= wk1_mean - 3*sd or j >= wk1_mean + 3*sd)]

outliers_sd

[9, 24]

#### Results:
The Tukey fence (IQR) method and the 99.7% rule flagged different outliers. Under the empirical rule for a normally distributed dataset, 99.7% of values lie within 3 standard deviations of the mean, and only two values fall outside that range: [9, 24]. The Tukey fence method flagged four: [77, 51, 9, 24]. Had we instead used the 68% rule, with a band of only one standard deviation, we would have obtained the same result as the Tukey fence method.
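
One reason the two methods can disagree is that extreme values inflate the standard deviation, which widens the three-standard-deviation band, while the quartiles move far less. The sketch below illustrates this on invented data (not the Week 1 values); `typical`, `with_extremes`, and `spread` are hypothetical names.

```python
import statistics

# Invented data: a tight cluster, then the same cluster plus two extreme values.
typical = [172, 173, 174, 174, 175, 175, 176, 177, 178, 179]
with_extremes = typical + [9, 24]

def spread(values):
    """Return (sample standard deviation, interquartile range)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return statistics.stdev(values), q3 - q1

sd_typical, iqr_typical = spread(typical)
sd_extreme, iqr_extreme = spread(with_extremes)

print(f"stdev: {sd_typical:.1f} -> {sd_extreme:.1f}")
print(f"IQR:   {iqr_typical:.1f} -> {iqr_extreme:.1f}")
```

In this toy sample the standard deviation grows by more than an order of magnitude once the two extreme values are added, while the IQR changes by a single unit.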

## Experiment 6

__6. Create a function tardy_iqr() to return a list of the names of the outliers found with the Tukey fence method.__

The function takes a parameter 'col_name', a string naming one of the columns in participants.csv. It calculates the outliers in that column with the Tukey fence method and returns a list of the names of the students whose values are outliers.

In [14]:
def tardy_iqr(col_name, k=1.5):
    # Column values are read as strings, so cast them to int.
    d = [int(m[col_name]) for m in participants]

    # Lower and upper quartiles (the median is not needed).
    q1, _, q3 = statistics.quantiles(d, n=4)

    # Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR].
    lower = q1 - k*(q3 - q1)
    upper = q3 + k*(q3 - q1)

    # Names of students whose value lies outside the fences.
    return [student['Student Name'] for student in participants
            if not lower <= int(student[col_name]) <= upper]

f'Week 1: {tardy_iqr("Week 1")}'

"Week 1: ['Adrian Ellison', 'Tayla Sparrow', 'Owain Emerson', 'Alaya Dickinson']"

## Experiment 7

__7. Create a function tardy_stdev() to return a list of the names of the outliers found using the 99.7% probability rule for a normally distributed dataset.__

The function takes a parameter 'col_name', a string naming one of the columns in participants.csv. It calculates the outliers in that column with the 99.7% probability rule for normally distributed datasets and returns a list of the names of the students whose values are outliers.

In [15]:
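def tardy_stdev(col_name):
    # Column values are read as strings, so cast them to int.
    d = [int(m[col_name]) for m in participants]
    d_mean = statistics.mean(d)
    sd = statistics.stdev(d)

    # Bounds three standard deviations below and above the mean.
    lower, upper = d_mean - 3*sd, d_mean + 3*sd

    # Names of students whose value lies on or beyond either bound.
    return [student['Student Name'] for student in participants
            if int(student[col_name]) <= lower or int(student[col_name]) >= upper]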

f'Week 1: {tardy_stdev("Week 1")}'

"Week 1: ['Owain Emerson', 'Alaya Dickinson']"

## Experiment 8

__8. Print the Week 2 through Week 5 outlier results from tardy_iqr() and tardy_stdev() and compare them__

In [16]:
print('IQR outlier results of Week 2 - 5:')
print(f'Week 2: {tardy_iqr("Week 2")}')
print(f'Week 3: {tardy_iqr("Week 3")}')
print(f'Week 4: {tardy_iqr("Week 4")}')
print(f'Week 5: {tardy_iqr("Week 5")}\n')

print('Stdev outlier results of Week 2 - 5:')
print(f'Week 2: {tardy_stdev("Week 2")}')
print(f'Week 3: {tardy_stdev("Week 3")}')
print(f'Week 4: {tardy_stdev("Week 4")}')
print(f'Week 5: {tardy_stdev("Week 5")}')

IQR outlier results of Week 2 - 5:
Week 2: ['Yasir Fenton', 'Tamara Cottrell', 'Jazmin Foreman', 'Bear Zuniga', 'Miles Lyons', 'Owain Emerson']
Week 3: ['Adrian Ellison', 'Adeline Jordan', 'Jaye Sweeney']
Week 4: ['Dora Delacruz', 'Shaquille Wood']
Week 5: ['Jazmin Foreman', 'Sanjay Edwards', 'Alfie-James Pierce', 'Adeline Jordan', 'Saffa Brook']

Stdev outlier results of Week 2 - 5:
Week 2: ['Miles Lyons', 'Owain Emerson']
Week 3: ['Adrian Ellison']
Week 4: ['Dora Delacruz']
Week 5: ['Jazmin Foreman']
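
The printed lists can also be compared programmatically. The sketch below copies the names from the output above into two dictionaries (the variable names are illustrative, not part of the notebook) and reports which students only the IQR method flags.

```python
# Names copied from the printed output for Weeks 2-5.
iqr_results = {
    "Week 2": ["Yasir Fenton", "Tamara Cottrell", "Jazmin Foreman",
               "Bear Zuniga", "Miles Lyons", "Owain Emerson"],
    "Week 3": ["Adrian Ellison", "Adeline Jordan", "Jaye Sweeney"],
    "Week 4": ["Dora Delacruz", "Shaquille Wood"],
    "Week 5": ["Jazmin Foreman", "Sanjay Edwards", "Alfie-James Pierce",
               "Adeline Jordan", "Saffa Brook"],
}
stdev_results = {
    "Week 2": ["Miles Lyons", "Owain Emerson"],
    "Week 3": ["Adrian Ellison"],
    "Week 4": ["Dora Delacruz"],
    "Week 5": ["Jazmin Foreman"],
}

only_iqr = {}
for week, names in iqr_results.items():
    # Students flagged by the IQR method but not by the 3-sigma rule.
    only_iqr[week] = [n for n in names if n not in stdev_results[week]]
    # True when every 3-sigma outlier was also flagged by the IQR method.
    is_subset = set(stdev_results[week]) <= set(names)
    print(week, only_iqr[week], is_subset)
```

For these four weeks, every student flagged by the standard deviation rule also appears in the IQR list, so the IQR results are a superset in each case.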


#### Results:

As the results show, the IQR method produces a larger set of outliers each week than the 99.7% rule for standard deviations. The reasoning is the same as in the Experiment 5 comparison: on this data the Tukey fences are much narrower than the three-standard-deviation band and closer to the range of the 68% rule, so the IQR method flags more values and is less certain that each one is a true outlier. The standard deviation is itself inflated by the extreme values, which widens its band further, so the stdev method returns only values that are almost certainly outliers.
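
For reference, the position of the Tukey fences under an exactly normal distribution can be worked out with `statistics.NormalDist`. This is a sketch for the idealized normal case only; the participants data need not be normal, so it does not describe the fences computed in the experiments.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)
k = 1.5

# Quartiles of a standard normal, in units of standard deviations.
q1 = standard_normal.inv_cdf(0.25)
q3 = standard_normal.inv_cdf(0.75)
lower_fence = q1 - k * (q3 - q1)
upper_fence = q3 + k * (q3 - q1)

# Probability mass between the fences versus within three standard deviations.
inside_fences = standard_normal.cdf(upper_fence) - standard_normal.cdf(lower_fence)
inside_three_sd = standard_normal.cdf(3) - standard_normal.cdf(-3)

print(f"fences at about ±{upper_fence:.2f} standard deviations")
print(f"inside fences: {inside_fences:.4f}, inside 3 sd: {inside_three_sd:.4f}")
```

In that idealized case the fences sit near 2.7 standard deviations from the mean, so large differences between the two methods on real data come mostly from departures from normality and from extreme values inflating the standard deviation.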