## Performing Outlier Detection
### 1. Read the inputs (training data and location data)

In [19]:
import numpy as np
import pandas as pd
from numpy import percentile

In [20]:
train = pd.read_csv('../data/cases_train.csv')
location = pd.read_csv('../data/location.csv')

## Training Dataset
*Only dealing with the important numerical attributes*

In [21]:
train_df = train[["age","latitude", "longitude"]] 

## Location Dataset
*Only dealing with the numerical attributes*

In [22]:
location_df = location[["Lat", "Long_", "Confirmed", "Deaths", "Recovered", "Active", "Incidence_Rate", "Case-Fatality_Ratio"]]

## Using Inter-Quartile Range to find Outliers

In [23]:
def IQR(data):

    '''
    This function counts outliers and non-outliers in a numeric column using the interquartile range (IQR) rule
    
    Input: one-dimensional numeric data with missing values removed (e.g. a dataframe column)
    Output: series of print statements (nothing is returned)
    
    References:
    # reference: https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
    # reference: https://www.thoughtco.com/what-is-the-interquartile-range-rule-3126244
    '''
 
    # calculate interquartile range
    q25, q75 = percentile(data, 25), percentile(data, 75)
    iqr = q75 - q25
    print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))

    # calculate the outlier cutoffs
    # Multiply the interquartile range (IQR) by 1.5
    # Add 1.5 x IQR to the third quartile: anything above this is considered an outlier
    # Subtract 1.5 x IQR from the first quartile: anything below this is considered an outlier
    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    print("lower bound: ", lower)
    print("upper bound: ", upper)

    # identify outliers
    outliers = [x for x in data if x < lower or x > upper]
    print('Identified outliers: %d' % len(outliers))

    # collect the non-outlier observations (values within the bounds, inclusive)
    outliers_removed = [x for x in data if x >= lower and x <= upper]
    print('Non-outlier observations: %d' % len(outliers_removed))
    
    return
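
The function above only prints its findings. When the flagged rows are needed later, a variant that returns the bounds and a boolean mask can be handy. The following is a minimal sketch, not part of the original notebook: `iqr_bounds` and `iqr_outlier_mask` are hypothetical helper names, and the sample values are made up for illustration. It relies on `Series.quantile`, which uses linear interpolation by default, as `numpy.percentile` does.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR bounds of a numeric Series (hypothetical helper)."""
    q25, q75 = series.quantile(0.25), series.quantile(0.75)
    cut_off = (q75 - q25) * k
    return q25 - cut_off, q75 + cut_off

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR bounds."""
    lower, upper = iqr_bounds(series, k)
    # Missing values compare as False, so they are never flagged
    return (series < lower) | (series > upper)

# Illustrative data only, not taken from the datasets above
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_bounds(sample)
mask = iqr_outlier_mask(sample)
print(lower, upper, sample[mask].tolist())
```

Indexing with the mask (`sample[mask]`) gives the flagged values, and `sample[~mask]` gives the rest, so both counts printed by `IQR` can be recovered from it.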

In [24]:
def outlier_detection(df):
    
    '''
    This function prints the interquartile range, the number of outliers and the number of non-outliers for each column
    
    Input: dataframe
    Output: series of print statements
    
    References:
    # https://stackoverflow.com/questions/28218698/how-to-iterate-over-columns-of-pandas-dataframe-to-run-regression
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.name.html
    # https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html
    # https://stackoverflow.com/questions/40705480/python-pandas-remove-everything-after-a-delimiter-in-a-string
    '''
    
    for column in df:
        
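        if df[column].name == "age":
            # age entries are not all clean integers (the splits below assume decimals and ranges),
            # so reduce each entry to a whole number first
            data = df[column].dropna()
            # keep only the first two characters; note that a three-digit age such as "101" is read as 10
            data = data.map(lambda x: str(x)[:2])
            # drop anything after a decimal point or a range dash, e.g. "5." -> "5" and "5-" -> "5"
            data = data.str.split('.').str[0]
            data = data.str.split('-').str[0]
            data = data.astype(int)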
            print(df[column].name, " outlier summary:")
            IQR(data)
            print()

        else:
            data = df[column].dropna()
            print(df[column].name, " outlier summary:")
            IQR(data)
            print()
    
    return
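
The age cleaning above keeps the first two characters of each entry (`str(x)[:2]`), which shortens any three-digit age. One possible alternative is to extract the leading run of digits with a regular expression. This is only a sketch under assumptions: `parse_age` is a hypothetical helper, and the sample entries are invented to mirror the decimal and range formats that the splits on `'.'` and `'-'` suggest. For a range it keeps the lower end, as the original code does.

```python
import re
import pandas as pd

def parse_age(value):
    """Return the leading whole number of an age entry, or None if it has none (hypothetical helper)."""
    match = re.match(r"\s*(\d+)", str(value))
    return int(match.group(1)) if match else None

# Illustrative entries only, not taken from cases_train.csv
ages = pd.Series(["25", "20-29", "35.0", "101", None])
parsed = ages.dropna().map(parse_age)
print(parsed.tolist())
```

Switching to such a parser could change the age summary, since ages of 100 or more would no longer be truncated.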

## The Results of Outlier Detection

In [25]:
outlier_detection(train_df)

age  outlier summary:
Percentiles: 25th=28.000, 75th=57.000, IQR=29.000
lower bound:  -15.5
upper bound:  100.5
Identified outliers: 0
Non-outlier observations: 158371

latitude  outlier summary:
Percentiles: 25th=12.682, 75th=27.577, IQR=14.895
lower bound:  -9.659629999999986
upper bound:  49.91869000000007
Identified outliers: 84142
Non-outlier observations: 283492

longitude  outlier summary:
Percentiles: 25th=-0.143, 75th=77.209, IQR=77.352
lower bound:  -116.17190000000001
upper bound:  193.23770000000002
Identified outliers: 295
Non-outlier observations: 367339



In [26]:
outlier_detection(location_df)

Lat  outlier summary:
Percentiles: 25th=33.270, 75th=42.159, IQR=8.888
lower bound:  19.93817088125
upper bound:  55.490836791250004
Identified outliers: 415
Non-outlier observations: 3459

Long_  outlier summary:
Percentiles: 25th=-96.611, 75th=-77.639, IQR=18.972
lower bound:  -125.06939710625
upper bound:  -49.18077613625001
Identified outliers: 530
Non-outlier observations: 3344

Confirmed  outlier summary:
Percentiles: 25th=137.000, 75th=2129.000, IQR=1992.000
lower bound:  -2851.0
upper bound:  5117.0
Identified outliers: 639
Non-outlier observations: 3315

Deaths  outlier summary:
Percentiles: 25th=1.000, 75th=48.000, IQR=47.000
lower bound:  -69.5
upper bound:  118.5
Identified outliers: 609
Non-outlier observations: 3345

Recovered  outlier summary:
Percentiles: 25th=0.000, 75th=0.000, IQR=0.000
lower bound:  0.0
upper bound:  0.0
Identified outliers: 614
Non-outlier observations: 3340

Active  outlier summary:
Percentiles: 25th=114.000, 75th=1453.000, IQR=1339.000
lower bound

***NOTE: The outliers identified here are purely statistical, based on the IQR calculations; they are not necessarily logical (actual) outliers in the data.***

For example, the IQR bounds for Lat in the location data are roughly 19.9 and 55.5, yet any latitude from -90 to 90 is a valid value, as is any longitude from -180 to 180. A point flagged by the IQR rule can therefore still be a perfectly legitimate location.
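
A logical check for coordinates can be written directly against those valid ranges. The sketch below is an illustration rather than part of the original analysis: `out_of_range_coordinates` is a hypothetical helper, and the small dataframe is made-up data that reuses the `Lat` and `Long_` column names from the location dataset.

```python
import pandas as pd

def out_of_range_coordinates(df, lat_col, lon_col):
    """Return the rows whose coordinates fall outside the valid geographic ranges."""
    valid_lat = df[lat_col].between(-90, 90)    # bounds are inclusive by default
    valid_lon = df[lon_col].between(-180, 180)
    # Rows with a missing coordinate are skipped rather than flagged
    present = df[lat_col].notna() & df[lon_col].notna()
    return df[present & ~(valid_lat & valid_lon)]

# Illustrative data only
coords = pd.DataFrame({
    "Lat": [45.0, 95.0, -12.5, None],
    "Long_": [-75.0, 10.0, 200.0, 30.0],
})
invalid = out_of_range_coordinates(coords, "Lat", "Long_")
print(invalid)
```

Here only the rows with a latitude of 95.0 or a longitude of 200.0 are returned, regardless of how far a value sits from the quartiles.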