<a href="https://colab.research.google.com/github/hussain0048/Machine-Learning/blob/master/Anomaly_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Anomaly Detection [Outliers Detection]**

**Introduction:**

Anomaly detection is the process of identifying outliers in your data. An outlier is a sample that is inconsistent with the other regular samples, which raises suspicion about its validity. Outliers can degrade the performance of machine learning algorithms on supervised tasks, and they can also interfere with data scaling, a common preprocessing step. In this tutorial, we'll discuss the estimators available in scikit-learn that help identify outliers in data.
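
To make the scaling point concrete, here is a minimal sketch using a made-up feature with a single extreme value. The numbers and the `min_max_scale` helper are illustrative assumptions, not part of the datasets used later.

```python
import numpy as np

# Hypothetical toy feature: five ordinary readings and one extreme value.
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 1000.0])

def min_max_scale(v):
    """Rescale values linearly to the [0, 1] range."""
    return (v - v.min()) / (v.max() - v.min())

scaled_with_outlier = min_max_scale(values)
scaled_without_outlier = min_max_scale(values[:-1])

# With the outlier present, the ordinary readings are squeezed close to 0.
print("Largest ordinary value, outlier kept    : ", scaled_with_outlier[:-1].max())
print("Largest ordinary value, outlier removed : ", scaled_without_outlier.max())
```

With the extreme value kept, the five ordinary readings occupy only a tiny slice of the scaled range, which is why removing outliers before scaling often helps.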

Below is a list of scikit-learn estimators which let us identify outliers present in data that we'll be discussing as a part of this tutorial:

 - KernelDensity
 - OneClassSVM
 - IsolationForest
 - LocalOutlierFactor
 
We'll be explaining the usage of each one with various examples.

Let’s start by importing the necessary libraries.

In [None]:
!git clone https://github.com/hussain0048/Machine-Learning.git

# **Importing necessary libraries** #

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

# **Load Datasets**
We'll start by loading the two datasets that we'll use throughout the explanation.

- Blobs Dataset - A generated dataset of 500 samples in 3 clusters, with 2 features per sample. We'll primarily use this dataset to explain the sklearn estimators.
- Digits Dataset - The second dataset is the digits dataset, which has 1797 images of the digits 0-9. Each image is 8x8 pixels and is flattened into an array of size 64.


In [None]:
from sklearn.datasets import make_blobs

X, Y = make_blobs(n_features=2, centers=3, n_samples=500,
                  random_state=42)

print("Dataset Size : ", X.shape, Y.shape)

In [None]:
with plt.style.context("ggplot"):
    plt.scatter(X[:, 0], X[:, 1], c="tab:green")

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()

X_digits, Y_digits = digits.data, digits.target

print("Dataset Size : ", X_digits.shape, Y_digits.shape)

# 1- **KernelDensity** 
The KernelDensity estimator is available from the neighbors module of sklearn. It estimates the kernel density at each sample, which can then be used to pick out outliers. It uses a KDTree or BallTree algorithm for kernel density estimation.

Below is a list of important parameters of KernelDensity estimator:

- algorithm - It accepts a string specifying which tree algorithm to use for kernel density estimation. We can specify one of the values below for this parameter.
  - auto - Default.
  - kd_tree
  - ball_tree
- kernel - It accepts a string specifying which kernel to use for estimation. We can specify one of the values below.
  - gaussian
  - tophat
  - epanechnikov
  - exponential
  - linear
  - cosine
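
The sketch below loops over the kernels listed above and reports the mean log density each one assigns to a blobs dataset. It regenerates the blobs so it is self-contained. The `bandwidth=1.0` argument is an assumption added for illustration; it is not discussed in this tutorial.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KernelDensity

X_demo, _ = make_blobs(n_features=2, centers=3, n_samples=500, random_state=42)

kernel_names = ["gaussian", "tophat", "epanechnikov", "exponential", "linear", "cosine"]
mean_log_density = {}
for name in kernel_names:
    # bandwidth=1.0 is an illustrative choice, not a tuned value.
    kde_demo = KernelDensity(kernel=name, bandwidth=1.0, algorithm="auto").fit(X_demo)
    log_density = kde_demo.score_samples(X_demo)
    mean_log_density[name] = log_density.mean()
    print(f"{name:>12}: mean log density = {mean_log_density[name]:.3f}")
```

The kernels differ in how quickly a neighbour's influence falls off with distance, so the same sample can receive a different density under each one.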

## 1.1  Fitting Model to Data

We'll first fit the KernelDensity estimator to our dataset using its fit() method and then use it to find outliers.

In [None]:
from sklearn.neighbors import KernelDensity
# Estimate density with a Gaussian kernel density estimator
kde = KernelDensity(kernel='gaussian')
kde.fit(X)

## 1.2 - Calculate Log Density Evaluations for Each Sample
The KernelDensity estimator has a method named score_samples() which accepts a dataset and returns the log density evaluation for each sample. Based on these values, we'll treat the 95% of samples with the highest density as valid data and the remaining 5% as outliers.

In [None]:
kde_X = kde.score_samples(X)
kde_X[:5]  # contains the log-likelihood of the data. The smaller it is the rarer is the sample
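
To show what a log density value means, the sketch below computes a Gaussian kernel density estimate by hand with NumPy: the average of Gaussian bumps centred on every data point, taken in log space. The function name `gaussian_kde_log_density` and the random demo data are assumptions for illustration. With the same bandwidth, its results should closely match score_samples() for the gaussian kernel.

```python
import numpy as np

def gaussian_kde_log_density(points, data, bandwidth=1.0):
    """Log of the Gaussian kernel density estimate at each row of `points`."""
    n_samples, n_features = data.shape
    # Squared Euclidean distance between every query point and every data point.
    diffs = points[:, None, :] - data[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Log of each unnormalised Gaussian kernel value.
    log_kernels = -sq_dists / (2.0 * bandwidth ** 2)
    # Numerically stable log-sum-exp over the data points.
    row_max = log_kernels.max(axis=1, keepdims=True)
    log_sum = row_max[:, 0] + np.log(np.exp(log_kernels - row_max).sum(axis=1))
    # Normalise by the sample count and the Gaussian constant.
    log_norm = np.log(n_samples) + 0.5 * n_features * np.log(2.0 * np.pi * bandwidth ** 2)
    return log_sum - log_norm

rng = np.random.default_rng(0)
data_demo = rng.normal(size=(200, 2))
queries = np.array([[0.0, 0.0], [6.0, 6.0]])  # one central point, one far away
log_dens_demo = gaussian_kde_log_density(queries, data_demo)
print(log_dens_demo)
```

The far-away query point gets a much smaller log density than the central one, which is the property used below to flag outliers.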

## 1.3 - Dividing Dataset into Valid Samples and Outliers
Below we compute the 5% quantile of the log density values. We'll use that value as the threshold that divides the data into outliers and valid samples.

In [None]:
from scipy.stats.mstats import mquantiles

alpha_set = 0.95
tau_kde = mquantiles(kde_X, 1. - alpha_set)

tau_kde

All samples whose value in kde_X is less than tau_kde are treated as outliers, and samples whose value is greater than or equal to it are treated as valid. We'll find the indexes of both groups and then use those indexes to split the data into outliers and valid samples.

In [16]:
outliers = np.argwhere(kde_X < tau_kde)
outliers = outliers.flatten()
X_outliers = X[outliers]

normal_samples = np.argwhere(kde_X >= tau_kde)
normal_samples = normal_samples.flatten()
X_valid = X[normal_samples]

print("Original Samples : ",X.shape[0])
print("Number of Outliers : ", len(outliers))
print("Number of Normal Samples : ", len(normal_samples))

Original Samples :  500
Number of Outliers :  25
Number of Normal Samples :  475
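
The same split can be written with a boolean mask instead of index arrays. The sketch below uses `np.quantile` on randomly generated scores that stand in for `kde_X`; both the stand-in data and the choice of `np.quantile` are assumptions. Note that `mquantiles` uses a slightly different interpolation rule by default, so the two thresholds may differ a little.

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(size=500)        # hypothetical stand-in for kde_X

outlier_fraction = 0.05
threshold = np.quantile(scores, outlier_fraction)

is_outlier = scores < threshold      # boolean mask, one entry per sample
print("Threshold          : ", threshold)
print("Number of Outliers : ", is_outlier.sum())
print("Number of Normal   : ", (~is_outlier).sum())
```

A mask like `is_outlier` can index the data directly, for example `X[is_outlier]` and `X[~is_outlier]`.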


## 1.4 - Plot Outliers with Valid Samples for Comparison
We have designed the method below named plot_outliers_with_valid_samples which takes as input valid samples and outliers and then plots them using different colors to differentiate between them. The figure will give a better idea about the performance of KernelDensity.


In [17]:
def plot_outliers_with_valid_samples(X_valid, X_outliers):
    with plt.style.context("ggplot"):
        plt.scatter(X_valid[:, 0], X_valid[:, 1], c="tab:green", label="Valid Samples")
        plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c="tab:red", label="Outliers")
        plt.legend(loc="best")

In [None]:
plot_outliers_with_valid_samples(X_valid, X_outliers)

# 2- **OneClassSVM**

The OneClassSVM estimator is available from the svm module of sklearn. It's based on the SVM algorithm, which is used behind the scenes to decide whether a sample is an outlier or not.

Below is a list of important parameters of OneClassSVM which can be tweaked to get better results:

 - kernel - It specifies the kernel type to be used for SVM. It accepts one of the below values as input.
   - linear
   - poly
   - rbf
   - sigmoid
   - precomputed
 - degree - It accepts integer specifying degree of polynomial kernel (kernel='poly'). It's ignored when other kernels are used.
 - gamma - It specifies the kernel coefficient to use for the rbf, poly and sigmoid kernels. It accepts a float or one of the strings below as input.
   - scale - Default. It uses 1 / (n_features * X.var()) as the value of gamma.
   - auto - It uses 1 / n_features as the value of gamma.
 - nu - It accepts a float value in the range (0, 1] specifying an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
 - cache_size - It specifies the kernel cache size in MB. It accepts a float value as input. The default value is 200 MB. A larger value is recommended for bigger datasets for better performance.
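
The two string options for gamma can be worked out by hand from the formulas above. This sketch regenerates the blobs dataset so it is self-contained; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_features=2, centers=3, n_samples=500, random_state=42)

n_features = X_demo.shape[1]
gamma_scale = 1.0 / (n_features * X_demo.var())  # value implied by gamma='scale'
gamma_auto = 1.0 / n_features                    # value implied by gamma='auto'

print("gamma='scale' : ", gamma_scale)
print("gamma='auto'  : ", gamma_auto)
```

Because the scale option divides by the variance of the data, it adapts to the spread of the features, while auto depends only on the number of features.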

## 2.1 Fitting Model to Data
We'll now fit OneClassSVM to our Gaussian blobs dataset. We'll then use the trained model to predict whether each sample is an outlier or not.

In [None]:
from sklearn.svm import OneClassSVM

nu = 0.05  # theory says it should be an upper bound of the fraction of outliers
ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)
ocsvm.fit(X)

## 2.2 - Predict Sample Class (Outlier vs Normal)

OneClassSVM provides a predict() method which accepts samples and returns an array of 1 and -1 values. Here 1 represents a valid sample and -1 represents an outlier.


In [19]:
preds = ocsvm.predict(X)
preds[:10]

array([ 1, -1,  1,  1,  1,  1, -1,  1,  1,  1])

## 2.3 - Dividing Dataset into Valid Samples and Outliers

We'll now filter the original data and divide it into two categories.

- Valid Samples
- Outliers

We'll also print the number of samples that the model considered outliers.

In [None]:
X_outliers = X[preds == -1]
X_valid = X[preds != -1]

print("Original Samples : ",X.shape[0])
print("Number of Outliers : ", X_outliers.shape[0])
print("Number of Normal Samples : ", X_valid.shape[0])

## 2.4 Plot Outliers with Valid Samples for Comparison

In [None]:
plot_outliers_with_valid_samples(X_valid, X_outliers)
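
Since nu bounds the fraction of training errors, changing it changes how many samples get flagged. The sketch below refits the model for a few nu values on a regenerated blobs dataset and prints the fraction predicted as outliers. The list of nu values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM

X_demo, _ = make_blobs(n_features=2, centers=3, n_samples=500, random_state=42)

flagged_fraction = {}
for nu_value in [0.01, 0.05, 0.10, 0.20]:
    model = OneClassSVM(kernel="rbf", gamma=0.05, nu=nu_value).fit(X_demo)
    preds_demo = model.predict(X_demo)
    # Fraction of training samples labelled -1 (outlier).
    flagged_fraction[nu_value] = np.mean(preds_demo == -1)
    print(f"nu = {nu_value:.2f} -> fraction flagged = {flagged_fraction[nu_value]:.3f}")
```

In practice the flagged fraction tends to track nu, though it need not match it exactly.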

## 2.5 Important Attributes and methods of OneClassSVM
Below is a list of important attributes and methods of OneClassSVM which can be used once the model is trained to get meaningful insights.

- support_ - It returns indices of support vectors.
- support_vectors_ - It returns actual support vectors of SVM.
- dual_coef_ - It represents the coefficients of the support vectors in the decision function.
- coef_ - It returns the weights assigned to each feature, as an array with one entry per feature of the dataset. It's available only when the linear kernel is used.
- intercept_ - It returns the constant term of the decision function as an array holding a single value.
- decision_function(X) - It accepts a dataset as input and returns the signed distance to the separating boundary for each sample. A positive distance marks a valid sample and a negative one marks an outlier.


In [None]:
print("Support Indices : ",ocsvm.support_)

In [None]:
print("Support Vector Sizes : ", ocsvm.support_vectors_.shape)
ocsvm.support_vectors_[:5]

In [None]:
print("Dual Coef Size ", ocsvm.dual_coef_.shape)
ocsvm.dual_coef_[0][:5]

In [None]:
ocsvm_X = ocsvm.decision_function(X)

X_outliers = X[ocsvm_X < 0]
X_valid = X[ocsvm_X > 0]

print("Number of Outliers : ", X_outliers.shape[0])
print("Number of Normal Samples : ", X_valid.shape[0])

ocsvm_X[:10]

The output above shows that decision_function() can also be used to find outliers: it flags the same samples as predict().

## 2.6 - Trying OneClassSVM on DIGITS Dataset
We'll now try OneClassSVM on the digits dataset. We'll fit it to the digits data and then use it to predict whether a sample is an outlier or not.

## 2.7 Fitting Model to Data


In [None]:
nu = 0.05  # theory says it should be an upper bound of the fraction of outliers
ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)
ocsvm.fit(X_digits)
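
The prediction step for the digits data follows the same pattern as for the blobs. Below is a minimal, self-contained sketch of that step. It uses `gamma="scale"`, which is an assumption made for this sketch and not the setting used in the cell above; the digits pixels range from 0 to 16 across 64 features, so a gamma chosen for the blobs may behave differently here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import OneClassSVM

X_digits_demo = load_digits().data

# gamma="scale" is an illustrative assumption, not the value used above.
ocsvm_digits = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_digits_demo)
preds_digits = ocsvm_digits.predict(X_digits_demo)

X_digits_outliers = X_digits_demo[preds_digits == -1]
X_digits_valid = X_digits_demo[preds_digits == 1]

print("Original Samples : ", X_digits_demo.shape[0])
print("Number of Outliers : ", X_digits_outliers.shape[0])
print("Number of Normal Samples : ", X_digits_valid.shape[0])
```

Each row of `X_digits_outliers` can be reshaped to 8x8 and displayed as an image to inspect which digits the model considers unusual.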

### 5.2- Plotting Confusion Matrix


In [None]:
plot_confusion_matrix(Y_test, gaussian_nb.predict(X_test))
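
The call above depends on names that this section does not define: `plot_confusion_matrix`, `gaussian_nb`, `X_test` and `Y_test`. A minimal sketch of one possible setup follows, to be run before that call. The stratified 80/20 split, the `random_state` value and the body of the helper are all assumptions, not the original author's code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def plot_confusion_matrix(y_true, y_pred):
    """Hypothetical helper: draw the confusion matrix as a heatmap and return it."""
    conf_mat = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(conf_mat, cmap="Blues")
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    ax.set_xticks(range(conf_mat.shape[1]))
    ax.set_yticks(range(conf_mat.shape[0]))
    # Write the count inside every cell.
    for i in range(conf_mat.shape[0]):
        for j in range(conf_mat.shape[1]):
            ax.text(j, i, conf_mat[i, j], ha="center", va="center")
    return conf_mat

# Assumed setup: a stratified 80/20 split of the digits data and a default GaussianNB.
X_all, Y_all = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X_all, Y_all, train_size=0.80, stratify=Y_all, random_state=123)
gaussian_nb = GaussianNB().fit(X_train, Y_train)

conf_mat = plot_confusion_matrix(Y_test, gaussian_nb.predict(X_test))
print(conf_mat.sum(), "test samples")
```

Rows correspond to true digits and columns to predicted digits, so the diagonal holds the correctly classified counts.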

### 5.3 - Important Attributes of GaussianNB
Below is a list of important attributes available through a fitted GaussianNB estimator instance.
  - class_prior_ - It represents the probability of each class.
  - epsilon_ - It represents the absolute additive value to variances.
  - var_ - It represents the variance of each feature per class; it was named sigma_ before scikit-learn 1.0. (n_classes x n_features)
  - theta_ - It represents the mean of each feature per class. (n_classes x n_features)
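
As a quick check of what these attributes hold, the sketch below fits GaussianNB on the full digits dataset and compares `theta_` and `class_prior_` with per-class means and class frequencies computed directly with NumPy. Fitting on the full dataset, and the `gnb_demo` name, are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB

X_all, Y_all = load_digits(return_X_y=True)
gnb_demo = GaussianNB().fit(X_all, Y_all)

classes = gnb_demo.classes_
# Per-class feature means and class frequencies, computed directly with NumPy.
manual_theta = np.array([X_all[Y_all == c].mean(axis=0) for c in classes])
manual_prior = np.array([(Y_all == c).mean() for c in classes])

print("theta_ shape : ", gnb_demo.theta_.shape)
print("theta_ matches per-class means : ", np.allclose(gnb_demo.theta_, manual_theta))
print("class_prior_ matches class frequencies : ", np.allclose(gnb_demo.class_prior_, manual_prior))
```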
 

In [None]:
gaussian_nb.class_prior_

In [None]:
gaussian_nb.epsilon_

In [None]:
print("Gaussian Naive Bayes Variance Shape : ", gaussian_nb.var_.shape)  # sigma_ was renamed to var_ in scikit-learn 1.0

In [None]:
print("Gaussian Naive Bayes Theta Shape : ", gaussian_nb.theta_.shape)

## 6. ComplementNB 
The next estimator that we'll introduce is ComplementNB, available in the naive_bayes module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance with hyperparameter tuning. We'll also evaluate its performance using a confusion matrix, and we'll cover the important attributes of ComplementNB, which can give helpful insight once the model is trained.


### 6.2 Fitting Model To Train Data

In [None]:
from sklearn.naive_bayes import ComplementNB
complement_nb = ComplementNB()
complement_nb.fit(X_train, Y_train)

### 6.3 - Evaluating Trained Model On Test Data
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable for the test set passed to it.
    

In [None]:
Y_preds = complement_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%complement_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%complement_nb.score(X_train, Y_train))

### 6.4 Plotting Confusion Matrix


In [None]:
plot_confusion_matrix(Y_test, complement_nb.predict(X_test))


### 6.5 - Important Attributes of ComplementNB

Below is a list of important attributes available through a fitted ComplementNB estimator instance.

  - class_log_prior_ - It represents the log probability of each class.
  - feature_log_prob_ - It holds the empirical weights computed from each class's complement. (n_classes x n_features)



In [None]:
complement_nb.class_log_prior_

In [None]:
print("Log Probability of Each Feature per class : ", complement_nb.feature_log_prob_.shape)

### 6.6 Finetuning Model By Doing Grid Search On Various Hyperparameters.
Below is a list of common hyperparameters that need tuning to get the best fit for our data. We'll try various hyperparameter settings on several train/test splits to find the setting that gives nearly the same accuracy on both the train and test datasets.

  - alpha - It accepts a float value representing the additive smoothing parameter. The value of 0.0 represents no smoothing. The default value of this parameter is 1.0.

Below we try various values of the above-mentioned hyperparameter to find the best estimator for our dataset using 5-fold cross-validation.

In [None]:
%%time

from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0]}

complement_nb_grid = GridSearchCV(ComplementNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
complement_nb_grid.fit(X_digits, Y_digits)

print('Train Accuracy : %.3f'%complement_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%complement_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%complement_nb_grid.best_score_)
print('Best Parameters : ',complement_nb_grid.best_params_)

### 6.7 Plotting Confusion Matrix

In [None]:
plot_confusion_matrix(Y_test, complement_nb_grid.best_estimator_.predict(X_test))

## 7. MultinomialNB
The next estimator that we'll introduce is MultinomialNB, available in the naive_bayes module of sklearn. We'll first fit it to the data with default parameters and then try to improve its performance with hyperparameter tuning. We'll also evaluate its performance using a confusion matrix, and we'll cover the important attributes of MultinomialNB, which can give helpful insight once the model is trained.


### 7.1 Fitting Default Model To Train Data

In [None]:
from sklearn.naive_bayes import MultinomialNB
multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train, Y_train)

### 7.2 Evaluating Trained Model On Test Data
Almost all models in the Scikit-Learn API provide a predict() method which can be used to predict the target variable for the test set passed to it.

In [None]:
Y_preds = multinomial_nb.predict(X_test)

print(Y_preds[:15])
print(Y_test[:15])

print('Test Accuracy : %.3f'%multinomial_nb.score(X_test, Y_test)) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : %.3f'%multinomial_nb.score(X_train, Y_train))

### 7.3 Plotting Confusion Matrix

In [None]:
plot_confusion_matrix(Y_test, multinomial_nb.predict(X_test))

### 7.4 Important Attributes of MultinomialNB
Below is a list of important attributes available through a fitted MultinomialNB estimator instance.

  - class_log_prior_ - It represents log probability of each class.
  - feature_log_prob_ - It represents log probability of particular feature based on class. (n_classes x n_features)

In [None]:
multinomial_nb.class_log_prior_

In [None]:
print("Log Probability of Each Feature per class : ", multinomial_nb.feature_log_prob_.shape)

### 7.5 Finetuning Model By Doing Grid Search On Various Hyperparameters
Below is a list of common hyperparameters that need tuning to get the best fit for our data. We'll try various hyperparameter settings on several train/test splits to find the setting that gives nearly the same accuracy on both the train and test datasets.

  - alpha - It accepts float value representing the additive smoothing parameter. The value of 0.0 represents no smoothing. The default value of this parameter is 1.0.
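
To see what additive smoothing does, the sketch below fits MultinomialNB on a tiny made-up count matrix and reproduces its feature probabilities by hand. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy counts: 4 samples, 3 features, 2 classes.
X_counts = np.array([[2, 0, 1],
                     [3, 0, 0],
                     [0, 4, 1],
                     [0, 2, 2]])
y_counts = np.array([0, 0, 1, 1])

alpha = 1.0
mnb_demo = MultinomialNB(alpha=alpha).fit(X_counts, y_counts)

# Additive smoothing by hand: (feature count + alpha) / (class total + alpha * n_features).
feature_counts = np.array([X_counts[y_counts == c].sum(axis=0) for c in (0, 1)])
smoothed = (feature_counts + alpha) / (
    feature_counts.sum(axis=1, keepdims=True) + alpha * X_counts.shape[1])

print(np.exp(mnb_demo.feature_log_prob_))
print(smoothed)
```

Feature 1 never occurs in class 0, yet its smoothed probability is 1/9 instead of 0. Without smoothing, a single unseen feature would push the class likelihood to zero.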
  
Below we try various values of the above-mentioned hyperparameter to find the best estimator for our dataset using 5-fold cross-validation.

In [None]:
%%time

from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0]}

multinomial_nb_grid = GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
multinomial_nb_grid.fit(X_digits, Y_digits)

print('Train Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_train, Y_train))
print('Test Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_test, Y_test))
print('Best Accuracy Through Grid Search : %.3f'%multinomial_nb_grid.best_score_)
print('Best Parameters : ',multinomial_nb_grid.best_params_)

### 7.6 Plotting Confusion Matrix
Below we are plotting the confusion matrix again with the best estimator that we found out using grid search.

In [None]:
plot_confusion_matrix(Y_test, multinomial_nb_grid.best_estimator_.predict(X_test))

References:
Scikit-Learn - Anomaly Detection [Outliers Detection]
https://coderzcolumn.com/tutorials/machine-learning/scikit-learn-sklearn-anomaly-detection-outliers-detection