# Outlier Detection

Outliers are something analysts and data scientists deal with constantly, because they require special attention: left unhandled, they can lead to badly wrong estimates.
Simply put, an outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.

## What Is an Outlier?
An outlier is an observation that is numerically distant from the rest of the data or, in a nutshell, a value that is out of range. Let’s take an example to see what happens to a dataset with and without outliers.

| Statistic | Data without outliers | Data with outliers |
|---|---|---|
| Data | 1, 2, 3, 3, 4, 5, 4 | 1, 2, 3, 3, 4, 5, 400 |
| Mean | 3.143 | 59.714 |
| Median | 3 | 3 |
| Standard deviation (sample) | 1.345 | 150.057 |

As you can see, the dataset with outliers has a significantly different mean and standard deviation, while the median is unchanged. Without the outlier, the average is about 3.14. With the outlier, the average climbs to 59.71, which would completely change the estimate.
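As a quick check, the summary statistics for the two datasets can be recomputed with Python's standard `statistics` module. This is a minimal sketch; the variable names `clean` and `with_outlier` are illustrative, and `statistics.stdev` is used on the assumption that the standard deviation in the comparison is the sample (n - 1) version.

```python
import statistics

# The two example datasets: identical except for the last value
clean = [1, 2, 3, 3, 4, 5, 4]
with_outlier = [1, 2, 3, 3, 4, 5, 400]

for label, values in (("without outliers", clean), ("with outliers", with_outlier)):
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    print(f"{label}: mean={mean:.3f}, median={median}, sd={sd:.3f}")
```

The mean and standard deviation move by more than an order of magnitude, while the median does not move at all.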

Let’s take a concrete example of an outlier. In a company of 50 employees, 45 people earn a monthly salary of Rs. 6,000 and 5 seniors earn Rs. 100,000 each. The average monthly salary works out to Rs. 15,400, that is, (45 × 6,000 + 5 × 100,000) / 50, which gives a misleading picture of what a typical employee earns.

But if you take the median salary, it is Rs. 6,000, which is far more representative than the average. For this reason, the median is a more appropriate measure than the mean when outliers are present. Here you can see the effect of an outlier.
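The salary example can be sketched in a few lines. The list `salaries` is built directly from the figures in the text (45 employees at Rs. 6,000 and 5 at Rs. 100,000); the variable names are illustrative.

```python
import statistics

# 45 employees at Rs. 6,000 and 5 seniors at Rs. 100,000
salaries = [6000] * 45 + [100000] * 5

mean_salary = statistics.mean(salaries)      # pulled upward by the 5 large salaries
median_salary = statistics.median(salaries)  # middle value, unaffected by the extremes

print(f"Mean salary:   Rs. {mean_salary}")
print(f"Median salary: Rs. {median_salary}")
```

Only 10% of the values are extreme, yet they more than double the mean relative to what 90% of the employees actually earn.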

Now let’s have a quick look at the main causes of outliers before getting started with the task of outlier detection:

- Data Entry Errors: Human errors such as errors caused during data collection, recording, or entry can cause outliers in data.
- Measurement Errors: The most common source of outliers. They occur when the measurement instrument used turns out to be faulty.
- Natural Outliers: When an outlier is not artificial (that is, not due to error), it is a natural outlier. Many outliers in real-world data belong to this category.

## Outlier Detection in Machine Learning using Hypothesis Testing
An outlier can be one of two types: univariate or multivariate. The example discussed above is a univariate outlier. Such outliers can be found by looking at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space, found by looking at several variables jointly.
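To make the multivariate case concrete, here is a small sketch using the Mahalanobis distance. This technique is not covered in this article; it is used here only as one common way to measure how far a point lies from the center of a multivariate sample. The 2-D `points` are made-up illustrative data in which the last point is unremarkable on each axis alone but breaks the joint pattern.

```python
import numpy as np

# Hypothetical 2-D data: the two coordinates rise together,
# except for the last point (2, 7), which is in range on both axes
points = np.array([
    [1, 1.1], [2, 1.9], [3, 3.2], [4, 3.8],
    [5, 5.1], [6, 5.9], [7, 7.2], [8, 7.8],
    [2, 7.0],
])

center = points.mean(axis=0)
cov = np.cov(points, rowvar=False)  # sample covariance matrix (2 x 2)
inv_cov = np.linalg.inv(cov)

# Squared Mahalanobis distance of each point: diff^T * inv_cov * diff
diff = points - center
d_squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
distances = np.sqrt(d_squared)

most_unusual = int(np.argmax(distances))
print("Mahalanobis distances:", np.round(distances, 2))
print("Most unusual point:", points[most_unusual])
```

Looking at either column alone would not single out the last point, which is why multivariate outliers need a measure that accounts for how the variables move together.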

Hypothesis testing is a common technique for detecting outliers in machine learning. It is a method of testing a claim, or hypothesis, about a population parameter using data measured in a sample. In this method, we test a hypothesis by determining how probable the observed sample statistic would be if the hypothesis about the population parameter were true.

The purpose of a hypothesis test is to decide whether a claim about a population parameter, such as the mean, is likely to be true given the sample data. There are four steps in a hypothesis test:

1. State the assumptions.
2. Define the criteria for a decision.
3. Calculate the test statistic.
4. Make a decision.
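For the outlier test implemented in the code that follows, the four steps can be written out as formulas. The critical-value expression in that code matches the two-sided Grubbs' test for a single outlier. That name is an identification made here, not something the article states. The symbols are assumptions introduced for this sketch: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $\alpha$ the significance level (0.05 in the code).

```latex
\begin{aligned}
H_0 &: \text{there are no outliers in the sample} \\
H_1 &: \text{there is exactly one outlier in the sample} \\[4pt]
G &= \frac{\max_i \lvert x_i - \bar{x} \rvert}{s} \\[4pt]
G_{\text{crit}} &= \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2}{n - 2 + t^2}},
\qquad t = t_{\alpha/(2n),\, n-2}
\end{aligned}
```

Here $t_{\alpha/(2n),\, n-2}$ is the upper critical value of the t distribution with $n - 2$ degrees of freedom, which is what `stats.t.ppf(1 - 0.05 / (2 * n), n - 2)` computes. The decision rule is to reject $H_0$ when $G > G_{\text{crit}}$.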

In [1]:
import numpy as np
import scipy.stats as stats

x = np.array([12, 13, 14, 19, 21, 23])
y = np.array([12, 13, 14, 19, 21, 23, 45])

def H_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier in a 1-D sample."""
    n = len(x)
    mean_x = np.mean(x)
    # Grubbs' statistic is defined with the sample standard deviation (ddof=1)
    sd_x = np.std(x, ddof=1)
    # Test statistic: largest absolute deviation from the mean, in units of sd
    g_calculated = np.max(np.abs(x - mean_x)) / sd_x

    print(f"H_test Calculated Value: {g_calculated:.4f}")

    # Critical value derived from the t distribution with n - 2 degrees of freedom
    t_value = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = ((n - 1) * t_value) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value)))

    print(f"H_test Critical Value: {g_critical:.4f}")

    if g_critical > g_calculated:
        print("From H_test we observe that the calculated value is less than the critical value: fail to reject the null hypothesis and conclude that there are no outliers\n")
    else:
        print("From H_test we observe that the calculated value is greater than the critical value: reject the null hypothesis and conclude that there is an outlier\n")

H_test(x)
H_test(y)

H_test Calculated Value: 1.3031
H_test Critical Value: 1.8871
From H_test we observe that the calculated value is less than the critical value: fail to reject the null hypothesis and conclude that there are no outliers

H_test Calculated Value: 2.1076
H_test Critical Value: 2.0200
From H_test we observe that the calculated value is greater than the critical value: reject the null hypothesis and conclude that there is an outlier



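Outliers are one of the major problems in machine learning: if you neglect them, your model is likely to perform poorly. A common treatment, used in the following cells, is the interquartile range (IQR) rule: values more than 1.5 × IQR below the first quartile or above the third quartile are flagged and replaced, here with the median.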

In [2]:
# IQR for outlier treatment
import numpy as np

data = np.array([-1,54,77,53,43,32,12,22,31,420], dtype='float')
print(data, '\n')
print(f"Mean for Data: {np.mean(data)}")
print(f"Median for Data: {np.median(data)}")
print(f"Standard Deviation for Data: {np.std(data)}")

[ -1.  54.  77.  53.  43.  32.  12.  22.  31. 420.] 

Mean for Data: 74.3
Median for Data: 37.5
Standard Deviation for Data: 117.18024577547189


In [3]:
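# First and third quartiles, and the interquartile range
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
IQR = q3 - q1

# Fences: values more than 1.5 * IQR beyond the quartiles count as outliers
upper = q3 + 1.5 * IQR
lower = q1 - 1.5 * IQR

# Replace each outlier with the median of the original data
outlier = (data > upper) | (data < lower)
data[outlier] = np.median(data)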
print(data,'\n')
print(f"Mean for Data: {np.mean(data)}")
print(f"Median for Data: {np.median(data)}")
print(f"Standard Deviation for Data: {np.std(data)}")

[-1.  54.  77.  53.  43.  32.  12.  22.  31.  37.5] 

Mean for Data: 36.05
Median for Data: 34.75
Standard Deviation for Data: 21.277276611446307


In [4]:
# Standardization (z-scores)

print(data := np.array([-1,54,77,53,43,32,12,22,31,420], dtype='float'))
print(meann := np.mean(data))
print(stdd := np.std(data),'\n')

z = [(x-meann)/stdd for x in data]
print(z)

print(meann := np.mean(z))
print(stdd := np.std(z))

[ -1.  54.  77.  53.  43.  32.  12.  22.  31. 420.]
74.3
117.18024577547189 

[-0.6425997786715836, -0.17323739053164866, 0.023041426326869553, -0.18177125213419293, -0.2671098681596356, -0.3609823457876226, -0.531659577838508, -0.4463209618130653, -0.3695162073901669, 2.9501559559995543]
4.4408920985006264e-17
1.0
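The standardized values above can also be used to flag outliers by thresholding the absolute z-score. The sketch below is an illustration under stated assumptions: the helper `flag_by_zscore` is hypothetical, and the cut-offs 3.0 and 2.5 are conventional choices rather than values taken from this article.

```python
import numpy as np

data = np.array([-1, 54, 77, 53, 43, 32, 12, 22, 31, 420], dtype=float)

# z-scores using the population standard deviation, as in the cell above
z = (data - data.mean()) / data.std()

def flag_by_zscore(z_scores, threshold):
    """Return a boolean mask of values whose |z| exceeds the threshold."""
    return np.abs(z_scores) > threshold

print("Flagged at |z| > 3.0:", data[flag_by_zscore(z, 3.0)])
print("Flagged at |z| > 2.5:", data[flag_by_zscore(z, 2.5)])
```

With the stricter cut-off nothing is flagged, because 420 has a z-score of about 2.95: the outlier inflates the very mean and standard deviation used to score it. The IQR rule shown earlier is less affected by this, since quartiles barely move when a single extreme value is added.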
