## Performing Outlier Detection
### 1. Read the inputs (training data and location data)

In [None]:
import numpy as np
import pandas as pd
from numpy import percentile

In [None]:
train = pd.read_csv('../data/cases_train.csv')
location = pd.read_csv('../data/location.csv')

## Training Dataset
*Only dealing with the important numerical attributes*

In [None]:
train_df = train[["age","latitude", "longitude"]] 

## Location Dataset
*Only dealing with the numerical attributes*

In [None]:
location_df = location[["Lat", "Long_", "Confirmed", "Deaths", "Recovered", "Active", "Incidence_Rate", "Case-Fatality_Ratio"]]

## Using Inter-Quartile Range to find Outliers

In [None]:
def IQR(data):

    '''
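    This function identifies outliers and non-outliers in one numeric column using the interquartile range (IQR)
    
    Input: pandas Series (or array-like) of numeric values with missing values already dropped
    Output: none; prints the quartiles, the IQR, the bounds, and the outlier counts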
    
    References:
    # reference: https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
    # reference: https://www.thoughtco.com/what-is-the-interquartile-range-rule-3126244#:~:text=Using%20the%20Interquartile%20Rule%20to%20Find%20Outliers&text=Multiply%20the%20interquartile%20range%20(IQR,this%20is%20a%20suspected%20outlier.&text=Any%20number%20less%20than%20this%20is%20a%20suspected%20outlier.    
    '''
 
    # calculate interquartile range
    q25, q75 = percentile(data, 25), percentile(data, 75)
    iqr = q75 - q25
    print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))

    # calculate the outlier cutoff
    # Multiply the interquartile range (IQR) by 1.5
    # Add 1.5 x (IQR) to the third quartile - anything greater is considered an outlier
    # Subtract 1.5 x (IQR) from the first quartile - anything less is considered an outlier
    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    print("lower bound: ", lower)
    print("upper bound: ", upper)

    # identify outliers
    outliers = [x for x in data if x < lower or x > upper]
    print('Identified outliers: %d' % len(outliers))

    # count the observations that remain once outliers are excluded
    outliers_removed = [x for x in data if lower <= x <= upper]
    print('Non-outlier observations: %d' % len(outliers_removed))
    
    return
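
The `IQR` function above only prints its findings. When the flagged rows are needed later, a variant that returns a boolean mask may be more convenient. The following is a minimal sketch of the same 1.5 x IQR rule; the names `iqr_bounds` and `iqr_outlier_mask` and the parameter `k` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) cutoffs of the IQR rule for non-missing values."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = (q75 - q25) * k
    return q25 - cut_off, q75 + cut_off

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean Series that is True where a value lies outside the IQR bounds."""
    lower, upper = iqr_bounds(series.dropna(), k)
    # Comparisons with NaN evaluate to False, so missing values are never flagged
    return (series < lower) | (series > upper)

# Small illustrative sample (not taken from the datasets above)
sample = pd.Series([1, 2, 3, 4, 5, 100])
lower, upper = iqr_bounds(sample)
mask = iqr_outlier_mask(sample)
print(lower, upper)           # -1.5 8.5
print(sample[mask].tolist())  # [100]
```

Returning a mask keeps the original index, so the same mask can be used to inspect or filter the matching rows of the source dataframe.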

In [None]:
def outlier_detection(df):
    
    '''
    This function displays the number of outliers, the number of non-outliers and the interquartile range for each column
    
    Input: dataframe of numerical attributes
    Output: none; prints one outlier summary per column
    
    References:
    # https://stackoverflow.com/questions/28218698/how-to-iterate-over-columns-of-pandas-dataframe-to-run-regression
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.name.html
    # https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html
    # https://stackoverflow.com/questions/40705480/python-pandas-remove-everything-after-a-delimiter-in-a-string
    '''
    
    for column in df:
        
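        if df[column].name == "age":
            # Age values may contain decimals or ranges, so reduce each one to an integer
            data = df[column].dropna()
            # Keep the first two characters; note that this truncates ages of 100 or more
            data = data.map(lambda x: str(x)[:2])
            # Drop whatever follows a decimal point or a range separator
            data = data.str.split('.').str[0]
            data = data.str.split('-').str[0]
            data = data.astype(int)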
            print(df[column].name, " outlier summary:")
            IQR(data)
            print()

        else:
            data = df[column].dropna()
            print(df[column].name, " outlier summary:")
            IQR(data)
            print()
    
    return
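
The age clean-up in `outlier_detection` keeps only the first two characters of each value, so a three-digit age would be cut short. One possible alternative, sketched below, extracts the leading run of digits with a regular expression. The helper name `parse_age` is hypothetical, and the sample strings are assumptions about how ages might be written; the actual formats in `cases_train.csv` are not confirmed here.

```python
import re
import pandas as pd

def parse_age(value):
    """Return the leading whole number of an age string, or None if it has none."""
    match = re.match(r"\s*(\d+)", str(value))
    return int(match.group(1)) if match else None

# Assumed example formats: plain number, range, decimal, three-digit age, missing
ages = pd.Series(["25", "20-29", "35.0", "101", None]).dropna()
parsed = ages.map(parse_age)
print(parsed.tolist())  # [25, 20, 35, 101]
```

Like the original code, this takes the lower end of a range and discards any decimal part, but it does not truncate longer numbers.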

## The Results of Outlier Detection

In [None]:
outlier_detection(train_df)

In [None]:
outlier_detection(location_df)

*The table below summarizes the outlier detection results and is referenced in the report.*

In [None]:
from IPython.display import Image
Image(filename='../plots/table1-outlier_summary.png') 

***NOTE: The outliers identified here are statistical outliers based on the IQR calculations, not necessarily logical (actual) outliers in the data.***

For example, the IQR rule gives a Latitude lower bound of 19.9 and an upper bound of 55.5 for the location data, yet valid latitudes range from -90 to 90 and valid longitudes from -180 to 180. A coordinate outside the IQR bounds can therefore still be a legitimate value.
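
Logical outliers in the coordinates can be checked separately from the IQR rule by testing each value against its valid geographic range. The sketch below assumes the `Lat` and `Long_` column names used in `location_df`; the function name `invalid_coordinates` and the demo values are hypothetical.

```python
import pandas as pd

def invalid_coordinates(df, lat_col="Lat", lon_col="Long_"):
    """Return the rows whose latitude or longitude is outside the valid range."""
    # between() is inclusive on both ends and returns False for NaN,
    # so notna() is required to avoid flagging missing values
    bad_lat = df[lat_col].notna() & ~df[lat_col].between(-90, 90)
    bad_lon = df[lon_col].notna() & ~df[lon_col].between(-180, 180)
    return df[bad_lat | bad_lon]

# Illustrative values only (not taken from location.csv)
demo = pd.DataFrame({"Lat": [45.0, 95.0, None], "Long_": [-75.0, 10.0, 200.0]})
print(invalid_coordinates(demo).index.tolist())  # [1, 2]
```

An empty result would indicate that every non-missing coordinate is physically plausible, regardless of how many values the IQR rule flags.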