<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Anomalies_2_Isolation_Forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Using an isolation forest for classification: `predict` returns 1 for inliers and -1 for outliers.

In [None]:
from sklearn.ensemble import IsolationForest
# Create a simple one-feature dataset; 100 is far from the other values
X = [[-1.1], [0.3], [0.5], [100]]
# Train the model on the dataset
clf = IsolationForest(random_state=0).fit(X)
# Predict the class of new examples (1 = inlier, -1 = outlier)
clf.predict([[0.1], [0], [90]])
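
To see what drives those predictions, the sketch below refits the same toy model and prints the raw anomaly scores next to the predicted labels. It assumes scikit-learn's default `contamination='auto'`, for which the threshold `offset_` is fixed at -0.5. The names `X_toy` and `queries` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X_toy = [[-1.1], [0.3], [0.5], [100]]   # same toy data as above
queries = [[0.1], [0], [90]]

clf = IsolationForest(random_state=0).fit(X_toy)

# score_samples: the lower the score, the more anomalous the point
scores = clf.score_samples(queries)
# decision_function shifts the scores by offset_; negative values are predicted as outliers
decisions = clf.decision_function(queries)
labels = clf.predict(queries)

for q, s, d, lab in zip(queries, scores, decisions, labels):
    print(f"x={q[0]:>5}  score={s:.3f}  decision={d:.3f}  label={lab}")
```

The point near 100 should receive the lowest score, because random splits isolate it from the cluster around 0 in very few steps.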

# **Isolation Forest for Anomaly Detection**

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest
from matplotlib import pyplot

**Create an imbalanced dataset**

In [None]:
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)

In [None]:
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

**Create and train an IsolationForest model**

Train on the majority class only. 

The `contamination` parameter is the expected proportion of outliers in the data set. It is used when fitting to define the threshold on the scores of the samples.

The contamination should be either `'auto'` (the default) or a float in the range (0, 0.5].

The contamination value can affect model performance.
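
As a rough illustration of how `contamination` sets the threshold, the sketch below fits a forest on synthetic Gaussian data (an assumption made for this example, not the notebook's dataset) and reports the fraction of training points flagged as outliers. With a numeric `contamination`, that fraction should land close to the requested value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 2))  # synthetic data, for illustration only

flagged = {}
for c in (0.01, 0.05, 0.2):
    m = IsolationForest(contamination=c, random_state=0).fit(X_demo)
    # offset_ is the score threshold chosen at fit time; points scoring below it are labelled -1
    flagged[c] = np.mean(m.predict(X_demo) == -1)
    print(f"contamination={c:.2f}  offset_={m.offset_:.3f}  flagged={flagged[c]:.3f}")
```

A larger contamination moves the threshold up, so more points are labelled -1. That usually raises recall on true outliers and lowers precision.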

In [None]:
model = IsolationForest(contamination=0.01)
# Keep only the majority class (label 0) for training
trainX = trainX[trainy == 0]
model.fit(trainX)

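`predict` labels outliers as -1 and inliers as 1, so relabel the ground truth to match:<br>
minority class (1) becomes -1 (outlier)<br>
majority class (0) becomes 1 (inlier)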

In [None]:
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
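
The in-place relabelling above depends on the order of the two assignments and changes `testy` for good, so running the cell twice corrupts the labels. A non-destructive alternative is sketched below with `np.where`. The array `y_demo` is a made-up example, not the notebook's data.

```python
import numpy as np

y_demo = np.array([0, 0, 1, 0, 1])  # hypothetical labels: 1 = minority class, 0 = majority class

# Map minority (1) to -1 (outlier) and everything else to 1 (inlier), leaving y_demo unchanged
y_iforest = np.where(y_demo == 1, -1, 1)
print(y_iforest)
```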

In [None]:
# Treat outliers (-1) as the positive class
score = f1_score(testy, yhat, pos_label=-1)
print('F-measure: %.3f' % score)
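
To make the role of `pos_label=-1` concrete, the sketch below computes precision, recall and the F-measure by hand on a small made-up pair of label arrays and compares the result with `f1_score`. The arrays are illustrative assumptions, not results from the model above.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 1, 1, -1, -1, 1])   # hypothetical ground truth
y_pred = np.array([1, -1, 1, -1, 1, 1])   # hypothetical predictions

# Outliers (-1) are the positive class
tp = np.sum((y_true == -1) & (y_pred == -1))  # outliers correctly flagged
fp = np.sum((y_true == 1) & (y_pred == -1))   # inliers wrongly flagged
fn = np.sum((y_true == -1) & (y_pred == 1))   # outliers missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred, pos_label=-1)

print(f"precision={precision:.3f}  recall={recall:.3f}  F-measure={f1_manual:.3f}")
```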

In [None]:
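# x-axis: first feature; y-axis: class label (1 = inlier, -1 = outlier)
pyplot.scatter(testX[:, 0], yhat, s=30, label='prediction')
pyplot.scatter(testX[:, 0], testy, color='red', s=3, label='ground truth')
pyplot.xlabel('feature 0')
pyplot.ylabel('class label')
pyplot.legend()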
pyplot.show()

**Assignment**<br>
1. Change the contamination value to get a better F-measure.
2. Try different datasets to determine the effect on performance.
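
One possible starting point for the first task is a simple sweep over contamination values, sketched below. The candidate values, the fixed `random_state=0` on the model and the variable names are assumptions chosen for illustration. The dataset and split reuse the parameters from the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_all, y_all = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                   n_clusters_per_class=1, weights=[0.999], flip_y=0,
                                   random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.5,
                                          random_state=2, stratify=y_all)
y_te_if = np.where(y_te == 1, -1, 1)  # outliers -1, inliers 1

results = {}
for c in (0.001, 0.005, 0.01, 0.05, 0.1):  # candidate values, chosen arbitrarily
    sweep_model = IsolationForest(contamination=c, random_state=0)
    sweep_model.fit(X_tr[y_tr == 0])       # fit on the majority class only
    results[c] = f1_score(y_te_if, sweep_model.predict(X_te), pos_label=-1)
    print(f"contamination={c:<6} F-measure={results[c]:.3f}")
```

Because the test split contains only a handful of true outliers, the scores can vary noticeably between seeds, so it is worth repeating the sweep with several `random_state` values before drawing conclusions.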
