## Outlier Detection with Isolation Forest

Isolation Forest, like other tree ensemble methods, is built from decision trees. Each tree partitions the data by first randomly selecting a feature and then selecting a random split value between the minimum and maximum of that feature.

In principle, outliers are less frequent than regular observations and differ from them in value (they lie further away from the regular observations in the feature space). Under such random partitioning they should therefore be isolated closer to the root of the tree, with fewer splits. In other words, their average path length (the number of edges an observation passes through going from the root to its terminal node) is shorter.

A normal point requires more partitions to be isolated than an abnormal point.
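
The isolation argument above can be illustrated with a minimal sketch in plain Python. It does not build a full forest: it follows a single point through freshly randomised partitions and counts the splits until the point is alone, which corresponds to its path length h(x). The function names, the depth cap, and the toy data (a Gaussian cluster plus one point at (8, 8)) are assumptions made for illustration and are not part of any library.

```python
import random


def path_length(x, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate x from the points in data."""
    # Stop once x is alone, or when the (assumed) depth cap is reached.
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Randomly select a feature, then a split value within its observed range.
    feature = rng.randrange(len(x))
    values = [p[feature] for p in data]
    lo, hi = min(values), max(values)
    if lo == hi:
        # No split is possible on this feature; stop here (a simplification).
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as x.
    same_side = [p for p in data if (p[feature] < split) == (x[feature] < split)]
    return path_length(x, same_side, rng, depth + 1, max_depth)


def mean_path_length(x, data, rng, n_trees=200):
    """Average path length of x over n_trees independent random partitionings."""
    return sum(path_length(x, data, rng) for _ in range(n_trees)) / n_trees


rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
outlier = (8.0, 8.0)
# Use the cluster point closest to the origin as a typical "normal" point.
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)
data = cluster + [outlier]

avg_outlier = mean_path_length(outlier, data, rng)
avg_inlier = mean_path_length(inlier, data, rng)
print("outlier:", avg_outlier, "inlier:", avg_inlier)
```

Averaged over many random partitionings, `avg_outlier` should come out well below `avg_inlier`, which is the property the anomaly score relies on.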
 
As with other outlier detection methods, an anomaly score is required for decision making. In the case of Isolation Forest it is defined as:


> $s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$



where h(x) is the path length of observation x, E(h(x)) is the average of h(x) over the trees in the forest, c(n) is the average path length of an unsuccessful search in a Binary Search Tree, and n is the number of external nodes.

Each observation is given an anomaly score and the following decision can be made on its basis:

- A score close to 1 indicates an anomaly
- A score much smaller than 0.5 indicates a normal observation
- If all scores are close to 0.5, then the entire sample does not seem to contain clearly distinct anomalies
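
The formula and the three rules above can be checked numerically. The sketch below assumes the expression for c(n) given in the original Isolation Forest paper, c(n) = 2H(n-1) - 2(n-1)/n, with the harmonic number approximated as H(i) ≈ ln(i) + 0.5772156649. The function names and the handling of n ≤ 2 are illustrative choices, not taken from this notebook.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, used to approximate H(i)


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    # Edge cases are conventions assumed for this sketch.
    if n < 2:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n - 1) approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)), with E(h(x)) passed in directly."""
    return 2.0 ** (-mean_path_length / c(n))


n = 256
print(anomaly_score(0.0, n))    # E(h(x)) -> 0: score of exactly 1
print(anomaly_score(c(n), n))   # E(h(x)) = c(n): score of 0.5
print(anomaly_score(n - 1, n))  # E(h(x)) -> n - 1: score close to 0
```

The score is 0.5 exactly when an observation's average path length equals c(n), which is why a sample whose scores all sit near 0.5 shows no clearly distinct anomalies.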


In [0]:
# Importing libraries ----
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import savefig
from sklearn.ensemble import IsolationForest
from sklearn import metrics
import seaborn as sns

In [0]:
# Generating data ----
rng = np.random.RandomState(42)

In [0]:
# Generating training data ----
X_train = 0.2 * rng.randn(1000, 2)
X_train = np.r_[X_train + 3, X_train]
X_train = pd.DataFrame(X_train, columns = ['x1', 'x2'])

In [0]:
# Generating new, 'normal' observation ----
X_test = 0.2 * rng.randn(200, 2)
X_test = np.r_[X_test + 3, X_test]
X_test = pd.DataFrame(X_test, columns = ['x1', 'x2'])

In [0]:
# Generating outliers ----
X_outliers = rng.uniform(low=-1, high=5, size=(50, 2))
X_outliers = pd.DataFrame(X_outliers, columns = ['x1', 'x2'])

## Isolation Forest 

In [0]:
# Create model ----
clf = IsolationForest(max_samples=100, 
                      random_state=rng)

In [8]:
# Training the model ----
clf.fit(X_train)

IsolationForest(behaviour='deprecated', bootstrap=False, contamination='auto',
                max_features=1.0, max_samples=100, n_estimators=100,
                n_jobs=None,
                random_state=RandomState(MT19937) at 0x7F8A20CCAA98, verbose=0,
                warm_start=False)

In [0]:
# Predictions ----
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
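
The `predict` calls above return +1 for inliers and -1 for outliers. As a hedged illustration of how these labels relate to the anomaly score defined earlier, the sketch below uses scikit-learn's `score_samples`, `decision_function` and `offset_` on a small toy data set (the data and variable names are assumptions, not the notebook's). In scikit-learn, `score_samples` returns the negative of the paper's score, so negating it recovers a value where higher means more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data: one tight cluster around the origin (assumed for illustration).
toy_rng = np.random.RandomState(0)
X_toy = 0.2 * toy_rng.randn(500, 2)
X_new = np.array([[0.0, 0.0],   # centre of the cluster
                  [4.0, 4.0]])  # far away from the cluster

model = IsolationForest(max_samples=100, random_state=0).fit(X_toy)

# Score in the paper's orientation: higher means more anomalous.
paper_scores = -model.score_samples(X_new)

# decision_function shifts score_samples by the fitted offset;
# predict labels a point -1 when its decision value is negative.
decision = model.decision_function(X_new)
labels = model.predict(X_new)

print("paper-style scores:", paper_scores)
print("decision values:   ", decision)
print("labels:            ", labels)
```

Working with the continuous scores rather than the ±1 labels makes it possible to choose a custom threshold instead of relying on the default `contamination` setting.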

In [10]:
# New, 'normal' observations: share predicted as inliers (+1) ----
print("Accuracy:", list(y_pred_test).count(1)/y_pred_test.shape[0])

Accuracy: 0.555


In [11]:
# Outliers: share predicted as outliers (-1) ----
print("Accuracy:", list(y_pred_outliers).count(-1)/y_pred_outliers.shape[0])

Accuracy: 0.98


In [0]:
# Plot CM ----
def pretty_cm(y_pred, y_truth, labels):
    '''
    'Pretty' implementation of a confusion matrix with some evaluation statistics.
    
    Input:
    y_pred - object with class predictions from the model
    y_truth - object with actual classes
    labels - list containing label names
    '''

    cm = metrics.confusion_matrix(y_truth, y_pred)
    ax = plt.subplot()
    # Draw the heatmap on the axes that are labelled below
    sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square=True, cmap='BuGn_r', ax=ax)
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('Actual label')
    ax.set_title('Confusion Matrix', size = 15) 
    ax.xaxis.set_ticklabels(labels)
    ax.yaxis.set_ticklabels(labels)
    
    print('#######################')
    print('Evaluation metrics ####')
    print('#######################')
    print('Accuracy: {:.4f}'.format(metrics.accuracy_score(y_truth, y_pred)))
    print('Precision: {:.4f}'.format(metrics.precision_score(y_truth, y_pred)))
    print('Recall: {:.4f}'.format(metrics.recall_score(y_truth, y_pred)))
    print('F1: {:.4f}'.format(metrics.f1_score(y_truth, y_pred)))

In [0]:
#pretty_cm(y_pred_test, y_pred_test, [0, 1])
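
A confusion matrix needs ground-truth labels alongside the predictions, whereas the commented-out call above passes the predictions twice. The sketch below is one possible way to build such labels, under the assumption that every point drawn from the two Gaussian clusters is a true inlier (+1) and every uniformly drawn point is a true outlier (-1). That assumption is only approximate, since a uniform draw can land inside a cluster. The data are regenerated here so the block runs on its own, and the integer seed is an illustrative choice.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Same recipe as the notebook: two Gaussian clusters plus uniform outliers.
base_train = 0.2 * rng.randn(1000, 2)
X_train = np.r_[base_train + 3, base_train]
base_test = 0.2 * rng.randn(200, 2)
X_test = np.r_[base_test + 3, base_test]
X_outliers = rng.uniform(low=-1, high=5, size=(50, 2))

clf = IsolationForest(max_samples=100, random_state=42).fit(X_train)

# Stack the evaluation data and build the assumed ground truth.
X_eval = np.r_[X_test, X_outliers]
y_truth = np.r_[np.ones(len(X_test), dtype=int),
                -np.ones(len(X_outliers), dtype=int)]
y_pred = clf.predict(X_eval)

# Rows are actual classes, columns are predicted classes, ordered [-1, 1].
cm = metrics.confusion_matrix(y_truth, y_pred, labels=[-1, 1])
recall_outliers = metrics.recall_score(y_truth, y_pred, pos_label=-1)

print(cm)
print("Recall on outliers: {:.4f}".format(recall_outliers))
```

With `y_truth` and `y_pred` defined this way, `pretty_cm(y_pred, y_truth, [-1, 1])` would plot the same matrix. Note that the precision, recall and F1 values printed by `pretty_cm` treat +1 (inliers) as the positive class by default.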

In [24]:
!pip install eif



## Extended Isolation Forest

In [0]:
import eif as iso

In [0]:
# Setting ExtensionLevel to 0 ----
if_eif = iso.iForest(X_train.values, 
                     ntrees = 100, 
                     sample_size = 256, 
                     ExtensionLevel = 0)

In [0]:
# Calculate anomaly scores ----
anomaly_scores = if_eif.compute_paths(X_in = X_train.values)

In [0]:
# Sort the scores ----
anomaly_scores_sorted = np.argsort(anomaly_scores)

In [0]:
# Define % of anomalies ----
anomalies_ratio = 0.009

# Retrieve indices of the highest-scoring (most anomalous) observations
indices_with_preds = anomaly_scores_sorted[-int(np.ceil(anomalies_ratio * X_train.shape[0])):]
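
The slicing used above can be seen more easily on a toy example. The helper below is a hypothetical wrapper (its name is not part of `eif` or NumPy) around the same `argsort` and `ceil` logic. It also guards against one edge case: with a ratio of 0, the slice `[-0:]` would return every index rather than none.

```python
import numpy as np


def top_anomaly_indices(scores, ratio):
    """Indices of the ceil(ratio * n) observations with the highest scores."""
    scores = np.asarray(scores)
    k = int(np.ceil(ratio * scores.shape[0]))
    if k == 0:
        # order[-0:] would select everything, so return an empty result.
        return np.array([], dtype=int)
    order = np.argsort(scores)  # ascending: the largest scores come last
    return order[-k:]


toy_scores = [0.1, 0.9, 0.3, 0.8]
flagged = top_anomaly_indices(toy_scores, ratio=0.5)
print(flagged)  # indices of the two highest scores (0.8 and 0.9)
```

Because the cut-off is a fixed share of the sample, this approach always flags the same number of points, regardless of how anomalous they actually are.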

In [0]:
# Set the level to 1 (number of dimensions - 1; the data has 2 features) ----
eif = iso.iForest(X_train.values, 
                  ntrees = 100, 
                  sample_size = 256, 
                  ExtensionLevel = X_train.shape[1] - 1)

In [0]:
anomaly_scores = eif.compute_paths(X_in = X_train.values)
anomaly_scores_sorted = np.argsort(anomaly_scores)
indices_with_preds = anomaly_scores_sorted[-int(np.ceil(anomalies_ratio * X_train.shape[0])):]
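
The two forests above differ only in `ExtensionLevel`. As a rough illustration of what that parameter controls, the sketch below draws a single branching rule in the style described for Extended Isolation Forest: a random normal vector and a random intercept point define a hyperplane, and the extension level determines how many coordinates of the normal vector are allowed to be non-zero. This is a simplified sketch under that assumption, not the `eif` package's implementation, and `random_split` is a hypothetical name.

```python
import numpy as np


def random_split(X, extension_level, rng):
    """Draw one random hyperplane split for the rows of X."""
    n_features = X.shape[1]
    if not 0 <= extension_level <= n_features - 1:
        raise ValueError("extension_level must be between 0 and n_features - 1")
    # Random direction of the splitting hyperplane.
    normal = rng.normal(size=n_features)
    # Zero out (n_features - extension_level - 1) randomly chosen coordinates.
    n_zeroed = n_features - extension_level - 1
    zeroed = rng.permutation(n_features)[:n_zeroed]
    normal[zeroed] = 0.0
    # Random intercept point inside the bounding box of the data.
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))
    # Points on the negative side of the hyperplane go to the left branch.
    goes_left = (X - intercept) @ normal < 0
    return normal, intercept, goes_left


rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(100, 3))

# Level 0: one non-zero coordinate, i.e. an axis-parallel split.
normal_axis, _, left_axis = random_split(X_demo, 0, rng_demo)
# Level n_features - 1: all coordinates free, i.e. an arbitrarily oriented split.
normal_full, _, left_full = random_split(X_demo, X_demo.shape[1] - 1, rng_demo)

print(normal_axis)
print(normal_full)
```

At level 0 the rule reduces to comparing one feature against a threshold, which matches the standard Isolation Forest split. For the two-feature data in this notebook the only other option is level 1.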