# Outlier detection

The time series of COVID cases may contain outlier observations, and we need to find and correct them. A Hampel filter lets us do both. The idea is simple: pass a moving window through the series, compute the median and the median absolute deviation (MAD, scaled to estimate the standard deviation) in each window, and then use Pearson's criterion (a difference from the window median greater than 3 estimated standard deviations) to detect outliers. Once detected, the outlier value can be replaced by the rolling median or something similar.
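
To illustrate why the `hampel` function in this notebook relies on the median and MAD rather than the mean and standard deviation, here is a minimal sketch on a made-up 7-day window with a single spike. The values are invented for illustration only; 1.4826 is the usual factor that scales the MAD to a standard deviation estimate under normality.

```python
import numpy as np

# Hypothetical 7-day window of new cases with one obvious spike
window = np.array([100, 110, 105, 900, 108, 102, 107], dtype=float)

mean, std = window.mean(), window.std()
median = np.median(window)
mad = np.median(np.abs(window - median))
robust_std = 1.4826 * mad  # MAD scaled to estimate the standard deviation

# 3-sigma rule with mean/std: the spike inflates both statistics
flag_classic = np.abs(window - mean) > 3 * std
# Same rule with median/MAD: the spike barely moves either statistic
flag_robust = np.abs(window - median) > 3 * robust_std

print("mean/std flags:  ", flag_classic)
print("median/MAD flags:", flag_robust)
```

In this window the spike inflates the standard deviation enough that the mean/std rule flags nothing, while the median/MAD rule flags only the spike.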

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import pandas as pd
import os

In [None]:
def hampel(vals_orig, k=7, threshold=3):
    """Detect and filter outliers in a time series.
    
    Parameters
    vals_orig: pandas series of values from which to remove outliers
    k: size of window (including the sample; 7 is equal to 3 on either side of value)
    threshold: number of standard deviations to filter outliers
    
    Returns
    pandas series: copy of vals_orig with outliers replaced by the rolling median
    """
    
    #Make copy so original not edited
    vals = vals_orig.copy()
    
    #Hampel Filter
    L = 1.4826 # Constant factor to estimate STD from MAD assuming normality
    rolling_median = vals.rolling(window=k, center=True).median()
    MAD = lambda x: np.median(np.abs(x - np.median(x)))
    rolling_MAD = vals.rolling(window=k, center=True).apply(MAD)
    threshold = threshold * L * rolling_MAD
    difference = np.abs(vals - rolling_median)
    
    # Perhaps a condition should be added here for the case where the threshold
    # value is 0.0; maybe do not mark as outlier. MAD may be 0.0 without the
    # original values being equal. See the differences between MAD and
    # standard deviation.
    
    outlier_idx = difference > threshold
    vals[outlier_idx] = rolling_median[outlier_idx]
    return vals
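
The note inside `hampel` about a zero threshold can be made concrete. In the sketch below, a made-up series is zero except for two small reports, so more than half of each window is identical, the rolling MAD is 0, and any value that differs from the median counts as an outlier. `rolling_stats` and `hampel_guarded` are hypothetical helpers, not part of this notebook's code, and skipping zero-MAD windows is only one possible policy.

```python
import numpy as np
import pandas as pd

def rolling_stats(vals, k=7):
    """Rolling median and MAD over a centered window of size k."""
    mad = lambda x: np.median(np.abs(x - np.median(x)))
    rolling_median = vals.rolling(window=k, center=True).median()
    rolling_mad = vals.rolling(window=k, center=True).apply(mad, raw=True)
    return rolling_median, rolling_mad

def hampel_guarded(vals_orig, k=7, threshold=3):
    """Hypothetical variant: like hampel, but ignores windows whose MAD is 0."""
    vals = vals_orig.copy()
    rolling_median, rolling_mad = rolling_stats(vals, k)
    limit = threshold * 1.4826 * rolling_mad
    difference = np.abs(vals - rolling_median)
    # Only flag a point when the window has some spread to compare against
    outlier_idx = (difference > limit) & (rolling_mad > 0)
    vals[outlier_idx] = rolling_median[outlier_idx]
    return vals

# Made-up series: mostly zeros with two small non-zero reports
series = pd.Series([0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0], dtype=float)
rolling_median, rolling_mad = rolling_stats(series)

# Without a guard, a zero MAD gives a zero limit, so 2 and 1 count as outliers
unguarded_flags = np.abs(series - rolling_median) > 3 * 1.4826 * rolling_mad
print(unguarded_flags.sum())            # points the unguarded rule would replace
print(hampel_guarded(series).tolist())  # guarded version leaves the series as is
```

For sparse series like this one (for example, small regions with few reported cases), the guard keeps genuine small counts from being flattened to the median.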

## Test on single time series

To test our Hampel filter, we're going to use it on Mexico's time series. First we need to apply some standard preprocessing steps to interpolate missing values.

In [None]:
df = pd.read_csv("../data_sources/OxCGRT_latest.csv")
# get a single country
df = df[df['CountryCode'] == 'MEX'].copy()  # copy so adding columns does not raise SettingWithCopyWarning
# Add new cases column
df['NewCases'] = df.ConfirmedCases.diff().fillna(0)
# Fill any missing case values by interpolation and setting NaNs to 0
df.NewCases = df.NewCases.interpolate().fillna(0)
df['NewCases'].plot()

There are clearly some outlier observations. Let's see what happens if we apply Hampel filters with different window lengths.

In [None]:
filter_7 = hampel(df.NewCases, k=7, threshold=3)
filter_11 = hampel(df.NewCases, k=11, threshold=3)
filter_15 = hampel(df.NewCases, k=15, threshold=3)
comparison = pd.concat([filter_7, filter_11, filter_15], axis=1)
comparison.columns = ['7-days', '11-days', '15-days']
comparison.plot(alpha=0.6)

So all three window sizes filter the outliers in this time series. Using the smallest window may introduce fewer errors, but I'm not sure.
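
One rough way to compare window sizes is to count how many points each filter changes. The sketch below does this on a synthetic series (Poisson counts with three injected spikes, not the OxCGRT data). `hampel_sketch` is a compact, hypothetical restatement of the rolling median/MAD rule so that the block runs on its own; the noise model and spike sizes are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def hampel_sketch(vals, k=7, threshold=3):
    """Compact restatement of the rolling median/MAD rule, for illustration."""
    mad = lambda x: np.median(np.abs(x - np.median(x)))
    med = vals.rolling(window=k, center=True).median()
    rolling_mad = vals.rolling(window=k, center=True).apply(mad, raw=True)
    limit = threshold * 1.4826 * rolling_mad
    is_outlier = np.abs(vals - med) > limit
    # Keep original values except where a point was flagged as an outlier
    return vals.where(~is_outlier, med)

# Synthetic daily counts (assumed Poisson noise) with three injected spikes
rng = np.random.default_rng(0)
synthetic = pd.Series(rng.poisson(lam=200, size=120).astype(float))
spike_days = [30, 60, 90]
synthetic.iloc[spike_days] = 2000.0

# Count how many values each window size replaces
changed = {}
for k in (7, 11, 15):
    filtered = hampel_sketch(synthetic, k=k)
    changed[k] = int((filtered != synthetic).sum())
print(changed)
```

Any count above the number of injected spikes corresponds to ordinary values that were also replaced. A lower count does not by itself prove fewer errors, but it gives a first indication of how aggressive each window size is.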

## Apply to the whole series

Now let's apply the Hampel filter to the full dataset, with every country and region.

In [None]:
df = pd.read_csv("../data_sources/OxCGRT_latest.csv", 
                 parse_dates=['Date'],
                 encoding="ISO-8859-1",
                 dtype={"RegionName": str,
                        "RegionCode": str},
                 on_bad_lines='skip')  # replaces error_bad_lines=False, removed in pandas 2.0
df['GeoID'] = df['CountryName'] + '__' + df['RegionName'].astype(str)
df['NewCases'] = df.groupby('GeoID').ConfirmedCases.diff().fillna(0)
df.update(df.groupby('GeoID').NewCases.apply(
    lambda group: group.interpolate()).fillna(0))

# Filter each country/region series separately so windows never span two series;
# transform preserves the original row index, so no positional join is needed
df['NewCasesHampel'] = df.groupby('GeoID').NewCases.transform(hampel)
df

Let's check a couple of countries.

In [None]:
df[df.CountryCode == 'MEX'][['NewCases', 'NewCasesHampel']].plot()

In [None]:
df[df.CountryCode == 'ESP'][['NewCases', 'NewCasesHampel']].plot()