<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Anomalies_2_Isolation_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we use an Isolation Forest for anomaly detection.

**The IsolationForest** ‘isolates’ observations by randomly selecting<br>
a feature and then randomly selecting a split value between<br>
the maximum and minimum values of the selected feature.

In [None]:
from IPython.display import Image
Image(url="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*5UGQoBaapqYa-SxGscx7Ow.gif", width=600)

We have seen a number of methods to detect outliers, namely, the Z-Score and Interquartile Range methods. <br>

They are effective when the *underlying data follows a normal distribution* <br>(a distribution where most data points are closer to the mean and become less frequent as the distance to the mean increases). <br>

If the data does not have a normal distribution, then these methods may incorrectly classify normal observations as outliers.<br>
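
To make the point above concrete, here is a minimal sketch (not part of the original notebook) that applies the usual 1.5 × IQR rule to two synthetic samples with no planted anomalies: one normally distributed and one right-skewed. The exponential distribution, the sample size, and the helper name `iqr_outlier_fraction` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def iqr_outlier_fraction(values):
    """Fraction of values outside the usual 1.5 * IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.mean((values < lower) | (values > upper))


# Neither sample contains injected anomalies.
normal_data = rng.normal(size=10_000)
skewed_data = rng.exponential(size=10_000)  # right-skewed (assumed example)

normal_fraction = iqr_outlier_fraction(normal_data)
skewed_fraction = iqr_outlier_fraction(skewed_data)

print(f"flagged in normal data: {normal_fraction:.2%}")
print(f"flagged in skewed data: {skewed_fraction:.2%}")
```

The skewed sample should have a noticeably larger share of points flagged, even though every point was drawn from the same distribution.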

The Isolation Forest method is non-parametric, <br>
which means that we don’t have to make assumptions about how the underlying data is distributed.

We detect anomalies (outliers) so that we can treat them before conducting data analyses.<br>
Anomaly detection can also be used on its own, for example to detect fraudulent credit card spending.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.ensemble import IsolationForest
#create a simple dataset
X = [[-1.1], [0.3], [0.5], [100]]
#train the model on the dataset
clf = IsolationForest(random_state=0).fit(X)
#predict the class of the following examples
clf.predict([[0.1], [0], [90]])
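
`predict` returns 1 for inliers and -1 for outliers. The sketch below shows how those labels relate to the scores scikit-learn computes underneath. It refits the same toy data under new, illustrative variable names (`X_toy`, `toy_model`) so that it does not overwrite anything in the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X_toy = [[-1.1], [0.3], [0.5], [100]]
toy_model = IsolationForest(random_state=0).fit(X_toy)

X_new = [[0.1], [0], [90]]
raw_scores = toy_model.score_samples(X_new)   # lower means more anomalous
shifted = toy_model.decision_function(X_new)  # raw score minus toy_model.offset_
labels = toy_model.predict(X_new)             # -1 where the shifted score is negative

for point, raw, shift, label in zip(X_new, raw_scores, shifted, labels):
    print(f"x={point[0]:>5}: score={raw:.3f}, decision={shift:.3f}, label={label}")
```

In other words, the binary label is just the sign of `decision_function`, and `offset_` is the threshold that separates the two classes.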

In [None]:
n_samples, n_outliers = 120, 40
rng = np.random.RandomState(0)
covariance = np.array([[0.5, -0.1], [0.7, 0.4]])
cluster_1 = 0.4 * rng.randn(n_samples, 2) @ covariance + np.array([2, 2])  # general
cluster_2 = 0.3 * rng.randn(n_samples, 2) + np.array([-2, -2])  # spherical
outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))

X = np.concatenate([cluster_1, cluster_2, outliers])
y = np.concatenate(
    [np.ones((2 * n_samples), dtype=int), -np.ones((n_outliers), dtype=int)]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [None]:
import matplotlib.pyplot as plt

scatter = plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
handles, labels = scatter.legend_elements()
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.title("Gaussian inliers with \nuniformly distributed outliers")
plt.show()

In [None]:
from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X_train)

In [None]:
import matplotlib.pyplot as plt

from sklearn.inspection import DecisionBoundaryDisplay

disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="predict",
    alpha=0.5,
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
disp.ax_.set_title("Binary decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.show()

In [None]:
disp = DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    response_method="decision_function",
    alpha=0.5,
)
disp.ax_.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor="k")
disp.ax_.set_title("Path length decision boundary \nof IsolationForest")
plt.axis("square")
plt.legend(handles=handles, labels=["outliers", "inliers"], title="true class")
plt.colorbar(disp.ax_.collections[1])
plt.show()

# **Isolation Forest for Anomaly Detection**

The Z-Score and Interquartile Range methods identify outliers one variable at a time. <br>
If you have reason to believe that multiple variables interact with each other and create outliers,<br>
these methods will not be able to detect those outliers. <br>

For example:<br>
>An SAT score of 1350/1600 (90th percentile) does not seem to be an outlier by itself. <br>
However, if we introduce another dimension, age,<br>
and find that a 12-year-old scored 1350/1600, <br>
then this observation is likely an outlier within the sub-sample of 12-year-olds. <br>

Unlike single-variable outlier detection methods, <br>
**Isolation Forest detects outliers in multi-dimensional space**.
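
As a rough illustration of this idea, the sketch below plants one point that looks ordinary on each axis but breaks the joint pattern of two correlated features. The data, the correlation of 0.95, and the planted point are all hypothetical. The exact rank of the planted point depends on the random seed, so treat the output as indicative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: two strongly correlated features.
cov = [[1.0, 0.95], [0.95, 1.0]]
inliers = rng.multivariate_normal(mean=[0, 0], cov=cov, size=500)

# A point that is unremarkable on each axis but far from the joint trend.
joint_outlier = np.array([[-2.0, 2.0]])
data = np.vstack([inliers, joint_outlier])

# Per-feature z-scores of the planted point (checked one variable at a time).
z_scores = (joint_outlier - data.mean(axis=0)) / data.std(axis=0)
print("per-feature z-scores:", np.round(z_scores, 2))

# Isolation Forest scores all features jointly; lower means more anomalous.
forest = IsolationForest(random_state=0).fit(data)
scores = forest.score_samples(data)
outlier_score = scores[-1]
more_anomalous = int((scores < outlier_score).sum())
print(f"points scored as more anomalous than the planted one: {more_anomalous}")
```

Each z-score stays under the common cutoff of 3, so a single-variable check would not flag the point, while the forest gives it a lower score than a typical observation.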

Isolation Forest randomly cuts a given sample until a point is isolated.

We saw in the GIF above that it took four splits to isolate an outlier. <br>

Let's now use Isolation Forest to isolate a normal point. <br>

You'll see that it takes many more splits to isolate a normal point. With Isolation Forest, an outlier is isolated after only a few splits, whereas a normal data point needs many more.

In [None]:
Image(url="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*TgjBE1oHI9KNHoAj-imsGA.gif", width=600)

**The Isolation Forest Algorithm**

First, the algorithm creates an isolation tree by going through the following steps:
1. Randomly select a sub-sample (scikit-learn’s default is `max_samples='auto'`, i.e. min(256, n_samples) data points).
2. Select a point to isolate.
3. Randomly select a feature (i.e., variable) from the set of features X.
4. Randomly select a threshold between the minimum and the maximum value of the feature x.
5. If the data point's value is less than the threshold, it flows down the left branch of the tree; otherwise it flows down the right branch.<br>
In other words, the threshold becomes the new maximum (left branch) or the new minimum (right branch) of the range for the next iteration.
6. Repeat steps 3 through 5 until the point is isolated or until a pre-defined max number of iterations is reached.
7. Record the number of times steps 3 through 5 were repeated.
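
The steps above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: it follows a single point through one random sub-sample and counts the splits. The function name `splits_to_isolate`, the sub-sample size of 64, the cap of 50 splits, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def splits_to_isolate(X, target_idx, rng, sample_size=64, max_splits=50):
    """Count the random splits needed to isolate X[target_idx] in one sub-sample."""
    # Steps 1-2: draw a random sub-sample that always contains the target point.
    others = np.delete(np.arange(len(X)), target_idx)
    chosen = rng.choice(others, size=min(sample_size - 1, len(others)), replace=False)
    sample = np.vstack([X[target_idx], X[chosen]])  # row 0 is the target
    target = sample[0]
    in_node = np.ones(len(sample), dtype=bool)  # points sharing the target's node

    n_splits = 0
    while in_node.sum() > 1 and n_splits < max_splits:  # step 6
        feature = rng.integers(sample.shape[1])  # step 3
        values = sample[in_node, feature]
        threshold = rng.uniform(values.min(), values.max())  # step 4
        # Step 5: keep only the points on the target's side of the threshold.
        if target[feature] < threshold:
            in_node &= sample[:, feature] < threshold
        else:
            in_node &= sample[:, feature] >= threshold
        n_splits += 1  # step 7
    return n_splits


rng = np.random.default_rng(0)
cluster = rng.normal(size=(200, 2))
points = np.vstack([cluster, [[6.0, 6.0]]])  # last row is a planted outlier
outlier_idx = len(points) - 1
normal_idx = int(np.argmin(np.linalg.norm(cluster, axis=1)))  # nearest the centre

mean_outlier = np.mean([splits_to_isolate(points, outlier_idx, rng) for _ in range(200)])
mean_normal = np.mean([splits_to_isolate(points, normal_idx, rng) for _ in range(200)])
print(f"average splits, outlier: {mean_outlier:.1f}")
print(f"average splits, normal point: {mean_normal:.1f}")
```

Averaging over many random sub-samples plays the role of the forest: a single tree is noisy, but the average path length separates the planted outlier from a point in the dense centre.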



In scikit-learn, the expected proportion of anomalies in the dataset is set through the `contamination` parameter.

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest
from matplotlib import pyplot

**Create an imbalanced dataset**

In [None]:
X, y = make_classification(
    n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4,
)

In [None]:
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

**Create and train an IsolationForest model**

Train on the majority class only.

The `contamination` parameter is the expected proportion of outliers in the data set. It is used when fitting to define the threshold on the scores of the samples.<br>

The contamination should be either `'auto'` (the default) or a float in the range (0, 0.5].

The contamination value can affect model performance.
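
A minimal sketch of how `contamination` sets the threshold: on the data the model was fitted to, roughly that fraction of points ends up labelled -1. The synthetic data and the three contamination values below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))  # hypothetical data with no planted anomalies

flagged = {}
for contamination in (0.01, 0.05, 0.10):
    forest = IsolationForest(contamination=contamination, random_state=0).fit(X_demo)
    # Fraction of the training points that the fitted model labels as outliers.
    flagged[contamination] = float(np.mean(forest.predict(X_demo) == -1))
    print(f"contamination={contamination:.2f} -> flagged {flagged[contamination]:.3f}")
```

Note that the model flags this share of points whether or not the data actually contains anomalies, which is why the choice of value matters.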

In [None]:
model = IsolationForest(contamination=0.01)
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)

Relabel the test set to match the model's output:<br>
outliers = class -1<br>
inliers = class 1

In [None]:
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1

In [None]:
score = f1_score(testy, yhat, pos_label=-1)
print('F-measure: %.3f' % score)
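
With `pos_label=-1`, the F-measure treats outliers as the positive class. The small worked example below uses made-up labels (not the notebook's data) to show how precision and recall combine into the score.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = inlier, -1 = outlier.
y_true = [1, 1, 1, -1, -1]
y_pred = [1, -1, 1, -1, 1]

# One true outlier is caught, one inlier is wrongly flagged, one outlier is missed.
precision = precision_score(y_true, y_pred, pos_label=-1)  # 1 / (1 + 1)
recall = recall_score(y_true, y_pred, pos_label=-1)        # 1 / (1 + 1)
f1 = f1_score(y_true, y_pred, pos_label=-1)

# F1 is the harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```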

In [None]:
pyplot.scatter(testX[:, 0], yhat, s=30, label='prediction')
pyplot.scatter(testX[:, 0], testy, color='red', s=3,label='ground truth')
pyplot.legend()
pyplot.show()

**Assignment**<br>
1. Change the contamination value to get a better F-measure.
2. Try different datasets to determine their effect on performance.
