# Outliers 
Outliers are data points that deviate significantly from the rest of the observations. They appear as extreme values that do not follow the general pattern of the dataset.

Outliers can arise due to **measurement errors**, but they can also indicate important anomalies, such as **rare events** or **fraud**.

It is critical to *identify* and *treat* outliers, since they can degrade the performance of machine learning models and introduce bias.


There are two categories of outlier detection methods: Statistical Methods, like **z-score** and **IQR**, and Machine Learning Methods, like **Isolation Forests** and **DBSCAN**.

# Statistical Methods 

## Z-Score

The z-score measures how many **standard deviations** a data point is away from the mean.
For a dataset with mean $\mu$ and standard deviation $\sigma$, the *z-score* of a data point $x$ is given by

$$ z = \frac{x-\mu}{\sigma} $$

A data point is considered an outlier when $|z|$ exceeds a chosen threshold, commonly 3.
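As a quick illustration of the formula, the sketch below computes z-scores for a small made-up sample. The values and variable names are assumptions chosen only to make the arithmetic easy to follow. Note that `np.std` computes the population standard deviation (`ddof=0`) by default.

```python
import numpy as np

# Hypothetical sample, chosen so that the mean is 5 and the population
# standard deviation is 2.
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mu = sample.mean()     # 5.0
sigma = sample.std()   # 2.0 (population standard deviation, ddof=0)

# z-score of every point: distance from the mean in units of sigma
z = (sample - mu) / sigma
print(z)
```

Here the largest value, 9, sits exactly two standard deviations above the mean, so it would be flagged with a threshold below 2 but not with the usual threshold of 3.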


## IQR

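We define a lower bound (LB) and an upper bound (UB) based on the interquartile range (IQR), the difference between the third quartile ($Q3$) and the first quartile ($Q1$). Observations that fall outside
these limits are considered outliers. More precisely: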

* $IQR = Q3 - Q1$
* $LB = Q1 - 1.5 \times IQR$
* $UB = Q3 + 1.5 \times IQR$
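
The three definitions above can be traced step by step on a small made-up sample. This is only a sketch: the sample and the variable names are assumptions, and the quartiles depend on the interpolation rule (here the `np.percentile` default, linear interpolation).

```python
import numpy as np

# Hypothetical sample with one obviously extreme value.
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30], dtype=float)

# np.percentile interpolates linearly between data points by default.
q1, q3 = np.percentile(sample, [25, 75])   # 3.0 and 7.0
iqr = q3 - q1                              # 4.0

lower_bound = q1 - 1.5 * iqr               # -3.0
upper_bound = q3 + 1.5 * iqr               # 13.0

# Boolean mask: True where the observation falls outside the bounds
is_outlier = (sample < lower_bound) | (sample > upper_bound)
print(sample[is_outlier])
```

Only the value 30 lies outside the interval from -3 to 13, so it is the only observation flagged.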




In [13]:
import numpy as np

class OutlierDetection:
    """
    Class for detecting outliers using Z-Score and IQR methods.
    """

    def __init__(self):
        self.mu = None
        self.std = None
        self.lower_bound = None
        self.upper_bound = None

    def fit(self, X):
        """
        Compute the mean, standard deviation and IQR bounds from the data.

        Parameters:
            X: array-like
                Training data from which the statistics are computed.
        """
        # Guarantee that data is a numpy array
        X = np.array(X)

        # Mean and (population) standard deviation for the Z-Score
        self.mu = np.mean(X)
        self.std = np.std(X)

        # Quartiles and bounds for the IQR method
        Q1 = np.percentile(X, 25)
        Q3 = np.percentile(X, 75)

        self.lower_bound = Q1 - 1.5 * (Q3 - Q1)
        self.upper_bound = Q3 + 1.5 * (Q3 - Q1)

    def z_score_outliers(self, X, threshold=3):
        """
        Compute Z-scores and classify outliers for training or new data.

        Parameters:
            X: array-like
                Data to be evaluated.
            threshold: float, optional (default=3)
                How many standard deviations away from the mean an
                observation must be to be considered an outlier.

        Returns:
            Z: numpy-array
                The Z-Score for each data point.
            outliers: numpy-array
                Array with value 1 for outliers and 0 otherwise.
        """
        # Guarantee that data is a numpy array
        X = np.array(X)

        # Check that fit() has been called before using z_score_outliers()
        if self.mu is None or self.std is None:
            raise ValueError("Model must be fitted before detecting outliers")

        if self.std == 0:
            raise ValueError("Data has no variability. Cannot compute Z-score")

        Z = (X - self.mu) / self.std

        outliers = np.where(np.abs(Z) > threshold, 1, 0)

        return Z, outliers


    def IQR_outliers(self, X):
        """
        Detect outliers based on the interquartile range (IQR) method.

        Parameters:
            X: array-like
                Data to be evaluated.

        Returns:
            outliers: numpy-array
                Array with value 1 for outliers and 0 otherwise.
        """
        # Check that fit() has been called before using IQR_outliers()
        if self.lower_bound is None or self.upper_bound is None:
            raise ValueError("Model must be fitted before detecting outliers")

        X = np.array(X)
        outliers = np.where(
            (X < self.lower_bound) | (X > self.upper_bound), 1, 0)
        return outliers


In [14]:
data = [10, 12, 14, 15, 14, 13, 120, 14, 13]
outlier_detector = OutlierDetection()
outlier_detector.fit(data)
z_score_outliers = outlier_detector.z_score_outliers(data, threshold=2)
iqr_method_outliers = outlier_detector.IQR_outliers(data)

print(f"Outliers using Z-Score Method: {z_score_outliers}.")
print(f"Outliers using IQR Method: {iqr_method_outliers}.")

Outliers using Z-Score Method: (array([-0.44622309, -0.38672668, -0.32723026, -0.29748206, -0.32723026,
       -0.35697847,  2.82607956, -0.32723026, -0.35697847]), array([0, 0, 0, 0, 0, 0, 1, 0, 0])).
Outliers using IQR Method: [1 0 0 0 0 0 1 0 0].


# Isolation Forest
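Isolation Forest is a model-based outlier detection method that relies on an ensemble of decision trees. Each tree isolates observations through random splits, and outliers tend to be isolated in fewer splits than regular points.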
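
A minimal sketch of this approach, assuming scikit-learn is available and using its `IsolationForest` estimator on the same toy data as in the statistical example. The hyperparameter values (`n_estimators=100`, `random_state=0`) are illustrative assumptions, not tuned choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same toy data as in the statistical example. scikit-learn expects a 2-D
# array of shape (n_samples, n_features), hence the reshape.
data = np.array([10, 12, 14, 15, 14, 13, 120, 14, 13], dtype=float).reshape(-1, 1)

# Illustrative settings; contamination="auto" uses the library's default
# decision threshold instead of a user-supplied outlier fraction.
model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)

# fit_predict returns -1 for points flagged as outliers and 1 for inliers.
labels = model.fit_predict(data)

# score_samples returns an anomaly score; the lower, the more abnormal.
scores = model.score_samples(data)

print(labels)
print(scores)
```

The value 120 should receive the lowest score, since almost any random split separates it from the other points. Whether borderline values such as 10 are flagged depends on the threshold and the random seed.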

# Outlier Treatment

Simply replacing outliers without thinking about why they have occurred is a 
dangerous practice. They may provide useful information about the process that
produced the data, which should be taken into account when forecasting. 

However, dealing with outliers is essential, since they can introduce bias in
parameter estimates, leading to inaccurate predictions.
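
If an outlier has been investigated and judged to be an error rather than a genuine event, one possible treatment is to cap extreme values at the IQR bounds instead of dropping them. The sketch below illustrates this idea. `cap_outliers` is a hypothetical helper written for this example, and capping is only one of several options (others include removal or replacement by the median).

```python
import numpy as np

def cap_outliers(values, k=1.5):
    """Hypothetical helper: clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # np.clip replaces anything below/above the bounds with the bound itself
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)

data = [10, 12, 14, 15, 14, 13, 120, 14, 13]
capped = cap_outliers(data)
print(capped)
```

For this data the bounds are 11.5 and 15.5, so 10 is raised to 11.5 and 120 is lowered to 15.5, while all other observations are left unchanged.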