
## Outlier detection

Outlier detection for time series data involves identifying data points that deviate significantly from the expected pattern or behavior of the series. There are several methods you can use to detect outliers in time series data using Python. Here are a few examples:



**Z-Score Method:** The z-score method measures the deviation of a data point from the mean in terms of standard deviations. Generally, a data point whose absolute z-score exceeds a certain threshold (e.g., 3) is considered an outlier.

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

data = np.random.randn(1000)  # Replace with your own time series data

# Calculate z-scores
z_scores = np.abs(stats.zscore(data))

# Set threshold for outliers, e.g. 3 standard deviations
threshold = 3

# Find outlier indices
outlier_indices = np.where(z_scores > threshold)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)

plt.figure(figsize=(8,5));
plt.plot(data);
plt.scatter(outlier_indices,data[outlier_indices], marker='*', color='red');
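
Since the sample data above is pure noise, it contains no known anomalies to check the detector against. One way to sanity-check the method is to inject a few spikes at known positions and confirm they are flagged. The sketch below does this with plain NumPy; the seed, `spike_indices`, and `spike_values` are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
series = rng.standard_normal(1000)

# Inject three artificial spikes at known positions (illustrative values)
spike_indices = np.array([100, 500, 900])
spike_values = np.array([8.0, -9.0, 10.0])
series[spike_indices] = spike_values

# Absolute z-score of every point relative to the global mean and standard deviation
z_scores = np.abs((series - series.mean()) / series.std())

threshold = 3
flagged = np.where(z_scores > threshold)[0]

# Check whether every injected spike appears among the flagged indices
print("Injected spikes recovered:", np.isin(spike_indices, flagged).all())
```

A few ordinary noise points may also exceed the threshold, which is expected for normally distributed data.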

**Median Absolute Deviation (MAD) Method:** The MAD method measures how far each data point lies from the median of the series, in units of the median absolute deviation. Typically, a data point is considered an outlier if this scaled deviation exceeds a certain threshold.




In [None]:
# Calculate the median and the median absolute deviation (MAD)
median_val = np.median(data)
abs_dev = np.abs(data - median_val)
mad = np.median(abs_dev)

# Express each point's deviation from the median in units of MAD
mad_scores = abs_dev / mad

# Set threshold for outliers (in MAD units; tune for your data)
threshold = 3

# Find outlier indices
outlier_indices = np.where(mad_scores > threshold)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)

**Moving Z-Score Method:** The moving z-score method calculates the z-score for each data point within a moving window. Outliers are detected based on the z-scores exceeding a certain threshold.



In [None]:
from scipy.ndimage import uniform_filter

# Set window size
window_size = 30

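# Calculate mean and standard deviation within the window
rolling_mean = uniform_filter(data, size=window_size)
rolling_var = uniform_filter(data**2, size=window_size) - rolling_mean**2
rolling_std = np.sqrt(np.maximum(rolling_var, 0))  # clip tiny negative values from rounding

# Calculate z-scores for each data point
z_scores = (data - rolling_mean) / rolling_std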

# Set threshold for outliers
threshold = 3

# Find outlier indices
outlier_indices = np.where(np.abs(z_scores) > threshold)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)
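
To see why a windowed score can matter, the sketch below builds a synthetic series with a linear trend and one injected spike, then compares a global z-score with a moving one. The trend, the spike size, and the seed are assumptions made for illustration; the windowed standard deviation is computed from the windowed mean of squares, which is one of several ways to obtain it.

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 1000
trend = np.linspace(0, 50, n)
series = trend + rng.standard_normal(n)
series[500] = trend[500] + 15  # injected local spike (illustrative)

threshold = 3
window_size = 30

# Global z-score: the trend inflates the overall standard deviation
global_z = np.abs((series - series.mean()) / series.std())
global_flagged = np.where(global_z > threshold)[0]

# Moving z-score: mean and standard deviation computed inside each window
local_mean = uniform_filter(series, size=window_size)
local_var = uniform_filter(series**2, size=window_size) - local_mean**2
local_std = np.sqrt(np.maximum(local_var, 1e-12))  # guard against zero or negative variance
moving_z = np.abs((series - local_mean) / local_std)
moving_flagged = np.where(moving_z > threshold)[0]

print("Spike flagged by global z-score:", 500 in global_flagged)
print("Spike flagged by moving z-score:", 500 in moving_flagged)
```

In this setup the spike is small relative to the spread caused by the trend but large relative to its local neighborhood, so only the moving score picks it up.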



**Quantile Method:** The quantile method involves calculating the lower and upper quantiles of the data distribution and considering data points outside of this range as outliers.


In [None]:
# Set quantile thresholds
lower_quantile = 0.01
upper_quantile = 0.99

# Calculate quantiles
lower_threshold = np.quantile(data, lower_quantile)
upper_threshold = np.quantile(data, upper_quantile)

# Find outlier indices
outlier_indices = np.where((data < lower_threshold) | (data > upper_threshold))[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)

plt.figure(figsize=(8,5));
plt.plot(data);
plt.scatter(outlier_indices,data[outlier_indices], marker='*', color='red');
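
One property worth keeping in mind is that fixed quantile cut-offs flag roughly the same share of points regardless of whether the data contains real anomalies. The sketch below illustrates this on pure noise with no injected outliers; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
clean = rng.standard_normal(1000)  # pure noise, no injected anomalies

lower_threshold = np.quantile(clean, 0.01)
upper_threshold = np.quantile(clean, 0.99)

flagged = np.where((clean < lower_threshold) | (clean > upper_threshold))[0]

# About 1% of points fall below the lower cut-off and 1% above the upper one
share_flagged = len(flagged) / len(clean)
print("Share of points flagged:", share_flagged)
```

For this reason the quantile levels act more like a budget for how many points to review than like a test for whether outliers exist.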

**Robust Z-Score Method:** The robust z-score method uses the median and median absolute deviation (MAD) instead of the mean and standard deviation to measure the deviation of data points from the expected behavior.




In [None]:
# Calculate median and median absolute deviation (MAD)
median_val = np.median(data)
mad = np.median(np.abs(data - median_val))

# Calculate robust z-scores
robust_z_scores = 0.6745 * (data - median_val) / mad

# Set threshold for outliers
threshold = 3

# Find outlier indices
outlier_indices = np.where(np.abs(robust_z_scores) > threshold)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)
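
The constant 0.6745 in the code above is the 75th percentile of the standard normal distribution: for normally distributed data the MAD is approximately 0.6745 times the standard deviation, so the scaling makes robust z-scores comparable to ordinary z-scores. The sketch below checks this numerically; the simulated mean, `sigma`, and sample size are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# For a normal distribution, MAD = sigma * (75th percentile of the standard normal)
mad_to_sigma = norm.ppf(0.75)
print("Theoretical MAD / sigma:", round(mad_to_sigma, 4))

# Empirical check on simulated normal data with a known sigma (illustrative values)
rng = np.random.default_rng(0)
sigma = 2.0
sample = rng.normal(loc=5.0, scale=sigma, size=100_000)
mad = np.median(np.abs(sample - np.median(sample)))
print("Empirical MAD / sigma:", round(mad / sigma, 4))
```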


**Moving Median Absolute Deviation (MAD) Method:** Similar to the moving z-score method, this approach calculates the median and median absolute deviation within a moving window. Outliers are identified based on the deviation from the window's median.


In [None]:
from scipy.ndimage import median_filter

# Set window size
window_size = 30

# Calculate median within the window
rolling_median = median_filter(data, size=window_size)

# Approximate the MAD within the window as the rolling median of the
# absolute deviations from the rolling median
rolling_mad = median_filter(np.abs(data - rolling_median), size=window_size)

# Set threshold for outliers
threshold = 3

# Find outlier indices
outlier_indices = np.where(np.abs(data - rolling_median) / rolling_mad > threshold)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)

**Autoregressive Integrated Moving Average (ARIMA) Residuals:** If you have fitted an ARIMA model to your time series data, you can examine the residuals to identify outliers. Unusually large residuals can indicate the presence of outliers.



In [None]:
import statsmodels.api as sm

# Fit ARIMA model to the time series data
model = sm.tsa.ARIMA(data, order=(2, 2, 1))  # Replace p, d, q with appropriate values
arima_model = model.fit()

# Get residuals and standardize them so the threshold is in standard deviations
residuals = arima_model.resid
standardized_residuals = (residuals - np.mean(residuals)) / np.std(residuals)

# Set threshold for outliers
threshold = 3

# Find outlier indices
outlier_indices = np.where(np.abs(standardized_residuals) > threshold)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)
plt.plot(residuals)
plt.scatter(outlier_indices, residuals[outlier_indices], marker='*', color='red');

**One-Class Support Vector Machines (SVM):**
One-Class SVM is a machine learning algorithm that can be used for outlier detection. It learns the normal pattern of the data and identifies points that deviate significantly from it.


In [None]:
from sklearn.svm import OneClassSVM

# Fit One-Class SVM model
model = OneClassSVM(nu=0.05)  # Adjust nu parameter based on your data
model.fit(data.reshape(-1, 1))  # Reshape data if needed

# Predict outliers
outliers = model.predict(data.reshape(-1, 1)) == -1

# Find outlier indices
outlier_indices = np.where(outliers)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)


**Isolation Forest:**
Isolation Forest is an ensemble algorithm that separates outliers by randomly partitioning the data. It measures the number of partitions needed to isolate an instance, and outliers tend to require fewer partitions.


In [None]:
from sklearn.ensemble import IsolationForest

# Fit Isolation Forest model
model = IsolationForest(contamination=0.05)  # Adjust contamination parameter based on your data
model.fit(data.reshape(-1, 1))  # Reshape data if needed

# Predict outliers
outliers = model.predict(data.reshape(-1, 1)) == -1

# Find outlier indices
outlier_indices = np.where(outliers)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)
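
The idea that outliers need fewer partitions can be illustrated with a simplified one-dimensional sketch. This is not scikit-learn's implementation: the helper `isolation_depth` is hypothetical and simply counts how many uniformly random splits are needed before a chosen point is alone, averaged over repeated trials. The injected outlier value, the seed, and the number of trials are arbitrary.

```python
import numpy as np

def isolation_depth(values, target_idx, rng):
    """Count random splits needed until values[target_idx] is alone."""
    idx = np.arange(len(values))
    depth = 0
    while len(idx) > 1:
        lo, hi = values[idx].min(), values[idx].max()
        if lo == hi:  # remaining points are identical and cannot be split
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target point
        if values[target_idx] < split:
            idx = idx[values[idx] < split]
        else:
            idx = idx[values[idx] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
values = rng.standard_normal(1000)
values[0] = 10.0  # injected outlier (illustrative)
inlier_idx = int(np.argmin(np.abs(values - np.median(values))))  # point nearest the median

n_trials = 200
outlier_depth = np.mean([isolation_depth(values, 0, rng) for _ in range(n_trials)])
inlier_depth = np.mean([isolation_depth(values, inlier_idx, rng) for _ in range(n_trials)])

print("Average splits to isolate the outlier:", outlier_depth)
print("Average splits to isolate the inlier:", inlier_depth)
```

The isolated point far from the bulk of the data is typically separated after only a few splits, while a point near the median needs many more.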


**Local Outlier Factor (LOF):** LOF is an unsupervised outlier detection algorithm that measures the local density deviation of a data point with respect to its neighbors. Here's an example of how you can use the LocalOutlierFactor for outlier detection in time series data:


In [None]:
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Generate sample time series data
data = np.random.randn(1000, 1)  # Replace with your own time series data

# Fit the Local Outlier Factor model and label each point (-1 = outlier, 1 = inlier)
lof = LocalOutlierFactor(contamination=0.05)  # Adjust the contamination parameter based on your data
outlier_labels = lof.fit_predict(data)

# Find outlier indices
outlier_indices = np.where(outlier_labels == -1)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)



**Elliptic Envelope:** This is a robust method that fits an elliptical envelope to the data and identifies points lying outside the envelope as outliers. It is based on the Mahalanobis distance, a measure of the distance between a point and a distribution. Here's an example of how you can use the Elliptic Envelope for outlier detection in time series data:

In [None]:
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Generate sample time series data
data = np.random.randn(1000, 1)  # Replace with your own time series data

# Fit the Elliptic Envelope model
envelope = EllipticEnvelope(contamination=0.05)  # Adjust the contamination parameter based on your data
envelope.fit(data)

# Predict outliers
outliers = envelope.predict(data) == -1

# Find outlier indices
outlier_indices = np.where(outliers)[0]

# Print outlier indices
print("Outlier indices:", outlier_indices)


In this example, the Elliptic Envelope model is fitted to the time series data. The contamination parameter is used to control the expected proportion of outliers in the data. Outliers are predicted using the predict() method, and the indices of the outliers are extracted.
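
To make the Mahalanobis distance concrete, the sketch below computes it by hand for two points under an assumed mean and covariance. The values of `mean` and `cov` and the two test points are illustrative; `EllipticEnvelope` estimates the location and covariance robustly from the data rather than taking them as given. For one-dimensional data the distance reduces to the absolute deviation from the mean divided by the standard deviation.

```python
import numpy as np

# Assumed (known) mean and covariance of two strongly correlated features
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
cov_inv = np.linalg.inv(cov)

def mahalanobis(point):
    """Mahalanobis distance of `point` from the distribution (mean, cov)."""
    diff = point - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

along = np.array([2.0, 2.0])     # follows the correlation
against = np.array([2.0, -2.0])  # contradicts the correlation

print("Distance along the correlation:", mahalanobis(along))
print("Distance against the correlation:", mahalanobis(against))
```

Both points are two standard deviations from the mean on each axis, yet the one that contradicts the correlation is much farther away in Mahalanobis terms, which is why it would fall outside the envelope.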

**Change Point Detection:**
Change point detection focuses on identifying points in a time series where the underlying pattern or behavior changes significantly. These points can indicate the presence of outliers or shifts in the data distribution.

One popular approach for change point detection is using the `ruptures` library in Python. It provides various algorithms to detect change points in time series data. Here's an example using the Pelt algorithm:


In [None]:
!pip install ruptures

import numpy as np
import matplotlib.pyplot as plt
import ruptures as rpt

# Reuse `data` from the previous cells (replace with your own time series data)

# Perform change point detection
algo = rpt.Pelt(model="rbf")
algo.fit(data)
result = algo.predict(pen=12)

# Plot the time series data with change points
plt.plot(data)
for point in result:
    plt.axvline(x=point, color='r', linestyle='--')
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Time Series with Change Points')
plt.show();

# Print change point indices
print("Change point indices:", result)



In this example, the `ruptures` library is used to detect change points in the time series data. The Pelt algorithm is applied with a penalty value (`pen`) to control the sensitivity of change point detection. The resulting change point indices are plotted as vertical lines on the time series plot.

You can explore other algorithms provided by the `ruptures` library, such as `Binseg` or `Dynp`, to find the most suitable approach for your specific data.

