In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Anomaly Detection using Isolation Forest

This is my learning note on anomaly detection using the **Isolation Forest** method. An anomaly is a data point or observation that differs greatly from normal values; we can also think of it as an **outlier**. Handling such data is difficult, because the outlier itself may carry new and important information, so removing it could mean losing valuable data. Anomaly detection has many everyday applications, such as credit card fraud detection, machine fault detection in manufacturing, and detection of malicious network activity.

The main goal of anomaly detection is to build a model that can detect and explain anomalies in the data. Many methods can be used to detect anomalies; in this note I use the **isolation forest** method. It works like other tree-based ensemble algorithms, but to detect anomalies it uses the depth at which a data point is split off in each tree (its path length).

## A. Isolation Tree

### 1. Introduction

The isolation tree was first proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008 [1]. It is an unsupervised, non-parametric, tree-based algorithm. The name tells us how it works: isolation means separating an instance from the rest of the instances. Because anomalies are few and different, they are more susceptible to isolation [1]. The algorithm partitions the data recursively until each instance is isolated, so anomalies tend to have short paths in the tree structure.

### 2. How Isolation Tree Works

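An isolation tree is built by repeating two steps:
1. Pick one feature at random.
2. Pick a random split value between the minimum and maximum of the chosen feature, and partition the data points at that value.

These steps repeat on each partition until every data point is isolated.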

![How Isolation Tree Works](https://miro.medium.com/max/2400/1*d-4xINDQHv0G82o2GUApJQ.png)
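
The splitting procedure can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the helper names `path_length` and `mean_path_length` and the toy data are assumptions made for this example, and the subsampling used by a real isolation forest is left out for brevity.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Step 1: pick one feature at random.
    feature = rng.randrange(len(point))
    values = [row[feature] for row in data]
    low, high = min(values), max(values)
    if low == high:  # a constant feature cannot be split
        return depth
    # Step 2: pick a random split value between the feature's min and max.
    split = rng.uniform(low, high)
    # Keep only the side of the split that contains `point`, then recurse.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

rng = random.Random(0)
# Toy data (hypothetical): a dense cluster around the origin plus one far point.
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]
# The cluster point closest to the origin serves as a typical "normal" point.
typical = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)

def mean_path_length(point, n_trees=200):
    """Average the path length of `point` over many random trees."""
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

print("outlier:", mean_path_length(outlier))
print("typical:", mean_path_length(typical))
```

The far point should need noticeably fewer splits on average than the central point, which is the property the isolation forest exploits.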





### 3. Anomaly Score

The anomaly score states how anomalous an observed data point is. It ranges between 0 and 1 and can be interpreted as follows:

1. When the anomaly score is close to 1, the data point is an anomaly and its path length is short.
2. When the anomaly score is much smaller than 0.5, the data point is normal and its path length is long.
3. When the anomaly scores of all data points are around 0.5, the dataset has no distinct anomaly.
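
For reference, the score defined in [1] is $s(x, n) = 2^{-E(h(x)) / c(n)}$, where $E(h(x))$ is the mean path length of $x$ over the trees, $n$ is the number of samples used to build each tree, and $c(n) = 2H(n-1) - 2(n-1)/n$ normalizes by the average path length of an unsuccessful binary search tree lookup. The sketch below evaluates that formula; the function names are illustrative assumptions, and the harmonic number is approximated as $H(i) \approx \ln(i) + 0.5772156649$.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, used to approximate H(i)

def average_path_length(n):
    """c(n): normalization term for trees built from n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256
c = average_path_length(n)
print(anomaly_score(0.0, n))    # path length near 0 gives a score near 1
print(anomaly_score(c, n))      # an average path length gives a score of 0.5
print(anomaly_score(3 * c, n))  # a long path gives a score well below 0.5
```

Note that this is the score as defined in the paper; scikit-learn reports a negated and shifted version of it, so its values should not be read on the 0 to 1 scale.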

## B. Implementation of Isolation Tree Using Scikit-Learn

### 1. Import Library

First we import the libraries we will use: the **sklearn** package and the **plotly** package. We also import the loader for the dataset used in this study, the Iris dataset from scikit-learn's datasets module.

In [2]:
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

### 2. Load dataset

In [3]:
data = load_iris(as_frame=True)
X, y = data.data, data.target
df = data.frame
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


### 3. Build the model

When building the model, a few parameters deserve attention to get the best performance:

1. **n_estimators** --> The number of trees in the forest.
2. **contamination** --> The expected proportion of anomalies in the dataset. It also sets the threshold that decides whether a data point is an anomaly or normal.
3. **max_features** --> The maximum number of features drawn to train each tree.
4. **max_samples** --> The maximum number of samples drawn from the feature matrix to train each tree.
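
To see how `contamination` acts as a threshold, the sketch below fits the same forest on the Iris features with three contamination values and counts the flagged points. The variable names `X_demo` and `flagged` and the chosen values are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

X_demo = load_iris(as_frame=True).data

flagged = {}
for contamination in (0.01, 0.05, 0.10):
    # Same trees each time (fixed random_state); only the threshold changes.
    model = IsolationForest(n_estimators=100, contamination=contamination,
                            random_state=1)
    labels = model.fit_predict(X_demo)
    flagged[contamination] = int((labels == -1).sum())

print(flagged)
```

A larger contamination value moves the threshold so that more points are labeled as outliers; the parameter fixes roughly what share of the data is flagged rather than discovering that share from the data.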

In [4]:
iforest = IsolationForest(n_estimators=100, max_samples='auto',
                         contamination=0.05, max_features=4,
                         bootstrap=False, n_jobs=-1, random_state=1)

In [5]:
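# Fit the forest and label each row: -1 for outliers, 1 for inliers
pred = iforest.fit_predict(X)
# decision_function returns shifted scores; negative values indicate outliers
df['scores'] = iforest.decision_function(X)
df['anomaly_label'] = pred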

### 4. Checking result

After predicting the anomaly scores and labels, each data point carries a label of -1 for **outliers** or 1 for **inliers**. The `scores` column holds the output of `decision_function`, which is negative for outliers.
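
The relation between the scores and the labels can be checked directly. The sketch below is an illustration under the same settings as the model above, with hypothetical variable names: in scikit-learn, `decision_function` equals `score_samples` minus the fitted `offset_`, and `predict` labels a point -1 when that shifted score is negative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import IsolationForest

X_demo = load_iris(as_frame=True).data
model = IsolationForest(n_estimators=100, contamination=0.05,
                        random_state=1).fit(X_demo)

raw = model.score_samples(X_demo)          # lower means more abnormal
shifted = model.decision_function(X_demo)  # raw score minus the threshold offset_
labels = model.predict(X_demo)             # -1 for outliers, 1 for inliers

# Both checks should hold for any fitted IsolationForest.
print(np.allclose(shifted, raw - model.offset_))
print(np.array_equal(labels == -1, shifted < 0))
```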

In [6]:
df[df.anomaly_label == -1]

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | scores | anomaly_label |
|---|---|---|---|---|---|---|---|
| 13 | 4.3 | 3.0 | 1.1 | 0.1 | 0 | -0.039104 | -1 |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 | 0 | -0.003895 | -1 |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 | 0 | -0.038639 | -1 |
| 60 | 5.0 | 2.0 | 3.5 | 1.0 | 1 | -0.008813 | -1 |
| 109 | 7.2 | 3.6 | 6.1 | 2.5 | 2 | -0.037663 | -1 |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | 2 | -0.046873 | -1 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | 2 | -0.055233 | -1 |
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | 2 | -0.064742 | -1 |


In [7]:
df[df.anomaly_label == 1]

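| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | scores | anomaly_label |
|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 0.177972 | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0.148945 | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 0.129540 | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 0.119440 | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 0.169537 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | 0.131967 | 1 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | 0.122848 | 1 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | 0.160523 | 1 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | 0.073536 | 1 |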


In [8]:
# Map the numeric labels to readable names for plotting
df['anomaly'] = df['anomaly_label'].map({-1: 'outliers', 1: 'inliers'})
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | scores | anomaly_label | anomaly |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 0.177972 | 1 | inliers |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0.148945 | 1 | inliers |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | 0.12954 | 1 | inliers |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | 0.11944 | 1 | inliers |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 0.169537 | 1 | inliers |


### 5. Visualization

In [9]:
# Distribution of decision-function scores, colored by predicted label
fig = px.histogram(df, x='scores', color='anomaly')
fig.show()

In [10]:
# 3D view of three features, colored by predicted label
fig = px.scatter_3d(df, x='petal width (cm)',
                    y='petal length (cm)',
                    z='sepal width (cm)', color='anomaly')
fig.show()

## C. Conclusion

From the explanation above we gain a brief understanding of what an anomaly is and of the concept of anomaly detection. Anomaly detection with an isolation forest shares certain similarities with other tree-based ensemble algorithms. We also saw how the isolation forest algorithm works and built a simple model to detect anomalies in the Iris dataset. With contamination set to 0.05, the model flagged 8 data points in the Iris dataset as anomalies, or outliers.

## D. Reference

1. Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation Forest". Eighth IEEE International Conference on Data Mining (ICDM '08), 2008.

2. Eugenia Anello. "Anomaly Detection With Isolation Forest". https://betterprogramming.pub/anomaly-detection-with-isolation-forest-e41f1f55cc6

