# Handling Outliers

## 1) Introduction

There is no censenus definition of outliers, but we can considere these data point as an observations that lies an abnormal distance from other values.

Outlier can be of two types: Univariate and Multivariate. Above, we have discussed the example of univariate outlier. These outliers can be found when we look at distribution of a single variable. Multi-variate outliers are outliers in an n-dimensional space. In order to find them, you have to look at distributions in multi-dimensions.

There is a bench of questions that should be asked when outliers are spotted, among these questions:
#### 1) Is the outlier a mistke or legitimate point?
#### 2) Is the outlier part of the population of interest?
#### 3) Does the outlier have to be always dropped?
#### 4) How to properly deal with the outliers?
#### 5) What is the impact of outliers on a dataset?

Sometimes outliers are simply caused by data recording errors, in other cases, outliers are legitimate observations. 

For the second question is whether the outlier is part of the population of interest. Depending on the answer to this question, we can decide whether outliers should be included in our analysis, which is the answer of the third question. 

There's no single solution. If an outlier is the result of data recording errors, we should recorrect the error if possible, otherwise we just remove it. If an outlier is outside of the population of interest, we should simply remove the outlier from further analysis. One should be cautious when removing outliers as removing them can sometime dramatically change the result of subsequent analysis. Outliers are not always outside of the population of interest. Sometimes they are actually the main focus of our analysis. As an example, it can be argued that all successful startups are outliers, as the success rate of startups is very low. However, in many analysis we're only interested in analyzing successful startups. 

As far as the last question, outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in the data set:

- It increases the error variance and reduces the power of statistical tests

- If the outliers are non-randomly distributed, they can decrease normality

- They can bias or influence estimates that may be of substantive interest

- They can also impact the basic assumption of Regression, ANOVA and other statistical model assumptions.

## 2) Identifying outliers

As far as identifying outliers, there is, of course, a degree of ambiguity. Qualifying a data point as an anomaly leaves it up to the analyst or model to determine what is abnormal and what to do with such data points.

These outliers are typically easy to detect using straightforward methods like box plots, histograms and scatter-plots. In other cases, mathematical techniques are extremely valuable in fields which process large amounts of data and require a means to perform pattern recognition in larger datasets.


### 2.1) Interquartile Range (IQR)
The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the difference between the 75th and 25th percentiles. It is represented by the formula IQR = Q3 − Q1, hence any value beyond the range of -1.5 x IQR to 1.5 x IQR may be considered as outlier.

### 2.2) Skewness
Several machine learning algorithms make the assumption that the data follow a normal (or Gaussian) distribution. This is easy to check with the skewness value, which explains the extent to which the data is normally distributed. Ideally, the skewness value should be between -1 and +1, and any major deviation from this range indicates the presence of outliers.

### 2.3) Standard Deviation Method
If a value is higher than the mean plus or minus three Standard Deviation is considered as outlier. It is based on the characteristics of a normal distribution for which 99.87% of the data appear within this range. 

This method has several shortcomings :
- The mean and standard deviation are strongly affected by outliers.
- It assumes that the distribution is normal (outliers included)
- It does not detect outliers in small samples

### 2.4) DBSCAN
This technique is based on the DBSCAN clustering method. DBSCAN is a non-parametric, density based outlier detection method in a one or multi dimensional feature space.

In the DBSCAN clustering technique, all data points are defined either as Core Points, Border Points or Noise Points.

Core Points are data points that have at least MinPts neighboring data points within a distance ℇ.
Border Points are neighbors of a Core Point within the distance ℇ but with less than MinPts neighbors within the distance ℇ.
All other data points are Noise Points, also identified as outliers.
Outlier detection thus depends on the required number of neighbors MinPts, the distance ℇ and the selected distance measure, like Euclidean or Manhattan.

Dbscan pros:
- It is a super effective method when the distribution of values in the feature space can not be assumed.
- Works well if the feature space for searching outliers is multidimensional (ie. 3 or more dimensions)
- Sci-kit learn’s implementation is easy to use and the documentation is superb.
- Visualizing the results is easy and the method itself is very intuitive.

Dbscan cons:
- The values in the feature space need to be scaled accordingly.
- Selecting the optimal parameters eps, MinPts and metric can be difficult since it is very sensitive to any of the three params.
- It is an unsupervised model and needs to be re-calibrated each time a new batch of data is analyzed.
- It can predict once calibrated but is strongly not recommended.

#### Article ==> A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

### 2.5) Isolation Forest
This is a non-parametric method for large datasets in a one or multi dimensional feature space.

An important concept in this method is the isolation number.

The isolation number is the number of splits needed to isolate a data point. This number of splits is ascertained by following these steps:

A point “a” to isolate is selected randomly.
A random data point “b” is selected that is between the minimum and maximum value and different from “a”.
If the value of “b” is lower than the value of “a”, the value of “b” becomes the new lower limit.
If the value of “b” is greater than the value of “a”, the value of “b” becomes the new upper limit.
This procedure is repeated as long as there are data points other than “a” between the upper and the lower limit.
It requires fewer splits to isolate an outlier than it does to isolate a non-outlier, i.e. an outlier has a lower isolation number in comparison to a non-outlier point. A data point is therefore defined as an outlier if its isolation number is lower than the threshold.

The threshold is defined based on the estimated percentage of outliers in the data, which is the starting point of this outlier detection algorithm.
#### An explanation with images ==> https://quantdare.com/isolation-forest-algorithm/.

Isolation Forest pros:
- There is no need of scaling the values in the feature space.
- It is an effective method when value distributions can not be assumed.
- It has few parameters, this makes this method fairly robust and easy to optimize.
- Scikit-Learn’s implementation is easy to use and the documentation is superb.

Isolation Forest cons:
- The Python implementation exists only in the development version of Sklearn.
- Visualizing results is complicated.
- If not correctly optimized, training time can be very long and computationally expensive.

### 2.6) Angle-Based Outlier Detection (ABOD)
It considers the relationship between each point and its neighbor(s). It does not consider the relationships among these neighbors. The variance of its weighted cosine scores to all neighbors could be viewed as the outlying score
ABOD performs well on multi-dimensional data
PyOD provides two different versions of ABOD:

- Fast ABOD: Uses k-nearest neighbors to approximate
- Original ABOD: Considers all training points with high-time complexity

### 2.7) k-Nearest Neighbors Detector
For any data point, the distance to its kth nearest neighbor could be viewed as the outlying score
PyOD supports three kNN detectors:

- Largest: Uses the distance of the kth neighbor as the outlier score
- Mean: Uses the average of all k neighbors as the outlier score
- Median: Uses the median of the distance to k neighbors as the outlier score

### 2.8) Local Correlation Integral (LOCI)
- LOCI is very effective for detecting outliers and groups of outliers. It provides a LOCI plot for each point which summarizes a lot of the information about the data in the area around the point, determining clusters, micro-clusters, their diameters, and their inter-cluster distances
- None of the existing outlier-detection methods can match this feature because they output only a single number for each point

### 2.9) Automating outliers detection with SVM

Support Vector Machines (SVM) is a powerful machine learning technique. OneClassSVM is an algorithm that specializes in learning the expected distributions in a dataset. OneClassSVM is especially useful as a novelty detector method if you can first provide data cleaned from outliers; otherwise, it’s effective as a detector of multivariate outliers. In order to have OneClassSVM work properly, you have two key parameters to fix:


- gamma, telling the algorithm whether to follow or approximate the dataset distributions. For novelty detection, it is better to have a value of 0 or superior (follow the distribution); for outlier detection values, smaller than 0 values are preferred (approximate the distribution).

- nu, which can be calculated by the following formula: nu_estimate = 0.95 * f + 0.05, where f is the percentage of expected outliers (a number from 1 to 0). If your purpose is novelty detection, f will be 0.

## 3) Dealing with outliers
Should an outlier be removed from analysis? Should you keep outliers, or change them to another variable? The answer to these questions may seem straightforward, but isn’t so simple.

There are many strategies for dealing with outliers in data. Depending on the situation and data set, any could be the right or the wrong way.
### 3.1) Remove the outliers

### 3.2) Change the value of outliers
#### 3.2.1) Percentile Capping (Winsorization)
In layman's terms, Winsorization (Winsorizing) at 1st and 99th percentile implies values that are less than the value at 1st percentile are replaced by the value at 1st percentile, and values that are greater than the value at 99th percentile are replaced by the value at 99th percentile. The winsorization at 5th and 95th percentile is also common.
#### 3.2.2) Imputing

#### 3.2.3) Transforming (Discretization)




# Code

In [10]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from scipy import stats
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import DBSCAN
from sklearn import svm


class handlingOutliers:
    def __init__(self, df):
        self.df = df
        
    def isolationForest(self,column,n_estimators=100,outliers_fraction=0.01):
        model=IsolationForest(n_estimators=n_estimators, max_samples='auto', contamination=float(outliers_fraction),max_features=1.0)
        model.fit(self.df[[column]])
        self.df['scores']=model.decision_function(self.df[[column]])
        self.df['anomaly']=model.predict(self.df[[column]])
        anomaly=self.df.loc[self.df['anomaly']==-1]
        anomaly_index=list(anomaly.index)
        return anomaly_index
    
    def KNN(self,outliers_fraction,column):
        model = KNN(contamination=float(outliers_fraction))
        model.fit(self.df[column].values.reshape(-1,1))
        self.df['scores'] = model.decision_function(self.df[[column]])
        self.df['anomaly']=model.predict(self.df[[column]])
        anomaly=self.df.loc[self.df['anomaly']==1]
        anomaly_index=list(anomaly.index)
        return anomaly_index
    
    def predictOutliers(self,column1,column2,classifier,outliers_fraction=0.3,random_state=np.random.RandomState(100),n_neighbors=5,scale=True):      
        classifiers = {
        'Angle-based Outlier Detector (ABOD)': ABOD(contamination=float(outliers_fraction)),
        'Cluster-based Local Outlier Factor (CBLOF)':CBLOF(contamination=float(outliers_fraction),check_estimator=False, random_state=random_state),
        'Feature Bagging':FeatureBagging(LOF(n_neighbors=n_neighbors),contamination=float(outliers_fraction),check_estimator=False,random_state=random_state),
        'Histogram-base Outlier Detection (HBOS)': HBOS(contamination=float(outliers_fraction)),
        'Isolation Forest': IForest(contamination=float(outliers_fraction),random_state=random_state),
        'K Nearest Neighbors (KNN)': KNN(contamination=float(outliers_fraction)),
        'Average KNN': KNN(method='mean',contamination=float(outliers_fraction))
         }
        colPara = {column1:[self.df[column1].min(),self.df[column1].max()],column2:[self.df[column2].min(),self.df[column2].max()]}
        if scale == True:
            scaler = MinMaxScaler(feature_range=(0, 1))
            self.df[[column1,column2]] = scaler.fit_transform(self.df[[column2,column2]])
            
        #multivariate outliers (2 columns like weigh and heigh)
        X1 = df[column1].values.reshape(-1,1)
        X2 = df[column2].values.reshape(-1,1)
        X = np.concatenate((X1,X2),axis=1)
        model = classifiers[classifier]
        model.fit(X)
        self.df['scores'] = model.decision_function(X)
        self.df['anomaly']=model.predict(X)
        anomaly=self.df.loc[self.df['anomaly']==1]
        anomaly_index=list(anomaly.index)
        return [anomaly_index,colPara]
    
    #EPS ==> The maximum distance between two samples for one to be considered as in the neighborhood of the other. 
    #This is not a maximum bound on the distances of points within a cluster. 
    #This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
    
    #min_samples ==> The number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
    #This includes the point itself.

    def DBScan(self,eps=4,min_samples=2):
        model = DBSCAN(eps=eps, min_samples=min_samples)
        
        float_col = self.df.select_dtypes(include=['float']) # This will select float columns only
        for col in float_col.columns.values:
            self.df[col] = self.df[col].round(0).astype(int)        
        model.fit(self.df)
        self.df['anomaly'] = model.labels_
        return self.df[ self.df['anomaly'] == -1].index
        
    def OneClassSVM(self,outliers_fraction=0.01,gamma=0.1):
        nu_estimate = 0.95 * outliers_fraction + 0.05
        auto_detection = svm.OneClassSVM(kernel="rbf", gamma=gamma, degree=3, nu=nu_estimate)
        auto_detection.fit(self.df)
        self.df['anomaly'] = auto_detection.predict(self.df)
        anomaly=self.df.loc[self.df['anomaly']==-1]
        anomaly_index=list(anomaly.index)
        return anomaly_index
    
    def removeOutliers(self,anomaly_index):
        self.df.drop(anomaly_index,inplace=True)
        try:
            self.df.drop(['anomaly'],axis=1,inplace=True)
            self.df.drop(['scores'],axis=1,inplace=True)
        except:
            print("columns don't exist")
        
    def imputeOutliers(self,anomaly_index,column):
        self.df.loc[anomaly_index,column] = np.nan
        try:
            self.df.drop(['anomaly'],axis=1,inplace=True)
            self.df.drop(['scores'],axis=1,inplace=True)        
        except:
            print("columns don't exist")
            
    #factor ==> The common value for the factor k is the value 1.5. A factor k of 3 or more can be used to identify values 
    #that are extreme outliers or “far outs” when described in the context of box and whisker plots.       
    def removeOutliersIQR(self,column,factor=1.5): 
        Q1=self.df[column].quantile(0.25)
        Q3=self.df[column].quantile(0.75)
        IQR=(Q3-Q1) * factor
        Lower_Whisker = Q1 - IQR
        Upper_Whisker = Q3 + IQR
        self.df = self.df[self.df[column]< Upper_Whisker]
        self.df = self.df[self.df[column]> Lower_Whisker]
        return self.df

    def removeOutliersZScore(self,column,threshold=3):
        z_scores = stats.zscore(self.df)
        abs_z_scores = np.abs(z_scores)
        filtered_entries = (abs_z_scores < threshold).all(axis=1)
        return self.df[filtered_entries]

    def removeByTrimming(self,column,lowerW,upperW):
        index = self.df[(self.df[column] >= upperW)|(self.df[column] <= lowerW)].index
        self.df.drop(index, inplace=True)

    def WinsorizeStats(self):
        out = stats.mstats.winsorize(self.df, limits=[0.05, 0.05])
        return out
    
    def replaceByMedian(self,anomaly_index,column):
        self.df.loc[anomaly_index,column] = self.df[column].median()
        try:
            self.df.drop(['anomaly'],axis=1,inplace=True)
            self.df.drop(['scores'],axis=1,inplace=True)
        except:
            print("columns are already removed")

    def replaceByMedianUp(self,column,upperWhisker):
        self.df[column] = np.where(self.df[column] <upperWhisker, self.df[column].median(),self.df[column])

    def replaceByMedianLow(self,column,lowerWhisker):
        self.df[column] = np.where(self.df[column] >lowerWhisker, self.df[column].median(),self.df[column])


    def read(self):
        return self.df

In [25]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,35]})
f = handlingOutliers(df)
f.replaceByMedianLow("salary",10)
print(f.read())

    salary  age
0      1.0   10
1      2.0   11
2      3.0   12
3      4.0   15
4      2.0   40
5      4.0   90
6      4.0    8
7      3.0   20
8      7.0   17
9      4.0   19
10     4.0   35


In [26]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,35]})
f = handlingOutliers(df)
f.isolationForest("salary")

[6, 10]

In [27]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,35]})
f = handlingOutliers(df)
print(df)
print(f.removeOutliers(f.OneClassSVM()))
print(f.read())


    salary  age
0        1   10
1        2   11
2        3   12
3        4   15
4        2   40
5        4   90
6      100    8
7        3   20
8        7   17
9        4   19
10     200   35
columns don't exist
None
   salary  age
1       2   11
2       3   12
3       4   15
4       2   40
7       3   20
8       7   17
9       4   19


In [28]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,35]})
f = handlingOutliers(df)
d = f.DBScan()
f.removeOutliers(d)
print(f.read())


columns don't exist
   salary  age
0       1   10
1       2   11
2       3   12
3       4   15
7       3   20
8       7   17
9       4   19


In [29]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,35]})
f = handlingOutliers(df)
print(df)
ind =f.KNN(0.3,'salary')
print(f.replaceByMedian(ind,'salary'))
print(f.read())

    salary  age
0        1   10
1        2   11
2        3   12
3        4   15
4        2   40
5        4   90
6      100    8
7        3   20
8        7   17
9        4   19
10     200   35
None
    salary  age
0      1.0   10
1      2.0   11
2      3.0   12
3      4.0   15
4      2.0   40
5      4.0   90
6      4.0    8
7      3.0   20
8      4.0   17
9      4.0   19
10     4.0   35


In [9]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,35]})
f = handlingOutliers(df)
print(df)
print(f.predictOutliers('salary','age','Cluster-based Local Outlier Factor (CBLOF)'))
print(f.predictOutliers('salary','age','Histogram-base Outlier Detection (HBOS)'))
print(f.predictOutliers('salary','age','Angle-based Outlier Detector (ABOD)'))
print("ABOD is done")
print(f.predictOutliers('salary','age','Feature Bagging'))
print(f.predictOutliers('salary','age','Isolation Forest'))
print(f.predictOutliers('salary','age','K Nearest Neighbors (KNN)'))
print(f.predictOutliers('salary','age','Average KNN'))


    salary  age
0        1   10
1        2   11
2        3   12
3        4   15
4        2   40
5        4   90
6      100    8
7        3   20
8        7   17
9        4   19
10     200   35
[90. 90.]
okokokokokoko [0.01219512 0.01219512]
[0, 2, 8]
[1. 1.]
okokokokokoko [1. 1.]
[4, 5, 10]
[1. 1.]
okokokokokoko [1. 1.]
[4, 5, 10]
ABOD is done
[1. 1.]
okokokokokoko [1. 1.]
[4, 5, 10]
[1. 1.]
okokokokokoko [1. 1.]
[4, 5, 10]
[1. 1.]
okokokokokoko [1. 1.]
[4, 5, 10]
[1. 1.]
okokokokokoko [1. 1.]
[4, 5, 10]


In [31]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,350]})
f = handlingOutliers(df)
f.WinsorizeStats()

masked_array(
  data=[[  2,  10],
        [  2,  11],
        [  3,  12],
        [  4,  15],
        [  2,  40],
        [  4,  90],
        [100,   8],
        [  3,  20],
        [  7,  17],
        [  4,  19],
        [200, 200]],
  mask=False,
  fill_value=999999,
  dtype=int64)

In [34]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,350]})
f = handlingOutliers(df)
ind =f.KNN(0.3,'salary')
f.replaceByMedian(ind,"salary")
print(f.read())

    salary  age
0      1.0   10
1      2.0   11
2      3.0   12
3      4.0   15
4      2.0   40
5      4.0   90
6      4.0    8
7      3.0   20
8      4.0   17
9      4.0   19
10     4.0  350


In [35]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,35]})
f = handlingOutliers(df)
f.removeByTrimming("salary",1,10)
print(f.read())

   salary  age
1       2   11
2       3   12
3       4   15
4       2   40
5       4   90
7       3   20
8       7   17
9       4   19


In [36]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,35]})
f = handlingOutliers(df)
f.removeOutliersIQR("salary")

Unnamed: 0,salary,age
0,1,10
1,2,11
2,3,12
3,4,15
4,2,40
5,4,90
7,3,20
8,7,17
9,4,19


In [37]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,350]})
f = handlingOutliers(df)
f.removeOutliersZScore("salary")

Unnamed: 0,salary,age
0,1,10
1,2,11
2,3,12
3,4,15
4,2,40
5,4,90
6,100,8
7,3,20
8,7,17
9,4,19


In [9]:
df=pd.DataFrame({'salary':[1,2,3,4,2,4,100,3,7,4,200],'age':[10,11,12,15,40,90,8,20,17,19,350]})
df

Unnamed: 0,salary,age
0,1,10
1,2,11
2,3,12
3,4,15
4,2,40
5,4,90
6,100,8
7,3,20
8,7,17
9,4,19
