### Anomaly detection

Anomaly detection is the process of identifying items or events in a dataset that differ from the norm. Anomalies can indicate critical incidents, such as technical glitches, fraud, or errors.

***Anomaly detection, or outlier detection, is the identification of observations, events or data points that deviate from what is usual, standard or expected, making them inconsistent with the rest of a data set.***

### Introduction

An anomaly is something that is not normal. Any data point that lies far from the normal data points is an anomaly, which is why anomalies are also called outliers.

Anomaly detection is also called deviation detection, because anomalous objects have attribute values that differ markedly from those of normal data objects.

Everyone loves a mystery and that is what anomaly detection is – spotting the unusual, catching the fraud, discovering the strange activity.

<img src="pattern.png" width="650">

### Methods/Algorithms:

1. Isolation Forest
2. DBSCAN
3. Local Outlier Factor

------------------------------------
## Isolation Forest
----------------------------

Isolation Forest is a technique for identifying outliers in data that was first introduced by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. The approach uses binary trees to detect anomalies, which gives it linear time complexity and low memory usage and makes it well suited to large datasets.

Isolation Forests (IF), like Random Forests, are built from decision trees. Since no pre-defined labels are used, it is an unsupervised model.
Isolation Forests rest on the observation that anomalies are data points that are “few and different”.

In an Isolation Forest, randomly sub-sampled data is processed in a tree structure based on randomly selected features. Samples that travel deeper into the tree are less likely to be anomalies, because more cuts were required to isolate them. Conversely, samples that end up in shorter branches are likely anomalies, because the tree could separate them from the other observations more easily.

### How do Isolation Forests work?

An Isolation Forest is an ensemble of binary decision trees, each of which is called an Isolation Tree (iTree). Training consists of generating these Isolation Trees from the data.

### Step by Step Guide on How Isolation Forest Works

1. Given a dataset, a random sub-sample of the data is selected and assigned to a binary tree.
2. Branching starts by selecting a random feature (from the set of all N features). A random threshold is then chosen (any value between the minimum and maximum values of the selected feature).
3. If a data point's value for that feature is less than the threshold, it goes to the left branch; otherwise it goes to the right. The node is thus split into left and right branches.
4. The process from step 2 is repeated recursively until each data point is completely isolated or the maximum depth (if defined) is reached.
5. The above steps are repeated to construct the remaining random binary trees of the ensemble.
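The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`build_itree`, `path_length`, `average_path_length`), the dictionary-based tree, the toy data, and the default sizes are all assumptions made for this example. For brevity, a node whose chosen feature has a single value simply becomes a leaf, and no path-length adjustment is applied at leaves that still hold several points.

```python
import random

def build_itree(points, depth=0, max_depth=8, rng=random):
    """Recursively split `points` (a list of equal-length tuples) at random."""
    # Step 4: stop once the node is isolated or the depth limit is reached.
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    # Step 2: pick a random feature, then a random threshold within its range.
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:  # simplification: no split is possible on this feature
        return {"size": len(points)}
    threshold = rng.uniform(lo, hi)
    # Step 3: values below the threshold go left, the rest go right.
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_itree(left, depth + 1, max_depth, rng),
        "right": build_itree(right, depth + 1, max_depth, rng),
    }

def path_length(tree, x):
    """Count the edges from the root to the leaf that `x` falls into."""
    edges = 0
    while "feature" in tree:
        side = "left" if x[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[side]
        edges += 1
    return edges

def average_path_length(data, x, n_trees=100, sample_size=64, seed=0):
    """Steps 1 and 5: build one tree per random sub-sample and average the depths."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(build_itree(sample, rng=rng), x)
    return total / n_trees

# Toy data (made up): one dense cluster plus a single distant point.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((8.0, 8.0))

print(average_path_length(data, (8.0, 8.0)))  # distant point
print(average_path_length(data, (0.0, 0.0)))  # centre of the cluster
```

In this setup the distant point should come out with a clearly shorter average path than the point at the centre of the cluster, which is the "few and different" property the algorithm relies on.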

After an ensemble of iTrees (the Isolation Forest) has been built, model training is complete. During scoring, each data point is passed down every tree, and an anomaly score is assigned based on the depth at which the point ends up. This score aggregates the depths obtained from all the iTrees. Implementations such as scikit-learn's `IsolationForest` then turn the score into a label, -1 for anomalies and 1 for normal points, with the cutoff determined by the contamination parameter (the expected proportion of anomalies in the data).
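The labeling step can be illustrated with a small thresholding sketch. It assumes that a higher score means a more anomalous point and that the contamination value is the fraction of points to flag. The function name `label_by_contamination` and the scores are hypothetical, and real libraries may handle ties and the exact cutoff differently.

```python
def label_by_contamination(scores, contamination=0.1):
    """Return -1 for the highest-scoring fraction of points and 1 for the rest."""
    n_anomalies = round(contamination * len(scores))
    # Indices ordered from most anomalous (highest score) to least.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_anomalies])
    return [-1 if i in flagged else 1 for i in range(len(scores))]

# Made-up anomaly scores for ten points; only the third one stands out.
scores = [0.42, 0.45, 0.81, 0.40, 0.47, 0.44, 0.43, 0.46, 0.41, 0.48]
labels = label_by_contamination(scores, contamination=0.1)
print(labels)  # [1, 1, -1, 1, 1, 1, 1, 1, 1, 1]
```

Note that the contamination value fixes how many points are flagged regardless of how extreme their scores are, so it is worth choosing with some knowledge of the data.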

<img src="isolation-forest.png" width="550">

***Anomaly score***: a number that expresses how far the observed behavior deviates from what is expected.

The higher the score, the more anomalous the observation.

Isolation Forest (iForest) is an algorithm used for anomaly detection. It works on the principle of isolating observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Anomalies are the observations that are easier to isolate.

Here’s how the anomaly score is calculated in an Isolation Forest:

1. **Isolation Tree Construction**:
    - Build multiple isolation trees (iTrees). Each tree is constructed by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
    - This process continues recursively, resulting in each observation being isolated in the leaf nodes.

2. **Path Length Calculation**:
    - The path length of an observation is the number of edges traversed from the root node to the terminating node.
    - Anomalies are expected to have shorter path lengths because they are easier to isolate.

3. **Average Path Length**:
    - The average path length \(E(h(x))\) of an observation \(x\) is calculated over all the trees in the forest.
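    - When traversal ends at a leaf that still holds more than one observation (for example, because the depth limit was reached), the path length is adjusted by adding \(c(n)\), the normalization term defined in the next step, where \(n\) is the number of observations at that leaf.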

4. **Normalization Factor**:
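    - The average path length \(c(n)\) for a data set of size \(n\) is derived from the harmonic number:

      \[ c(n) = 2H(n-1) - \frac{2(n-1)}{n} \]

      where \(H(i)\) is the harmonic number, which can be estimated as \(\ln(i) + 0.5772156649\) (Euler's constant).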

5. **Anomaly Score Calculation**:
    - The anomaly score \(s(x, n)\) for an observation \(x\) is calculated as:

      \[ s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \]
    - The score ranges from 0 to 1, where a score close to 1 indicates an anomaly and a score close to 0 indicates a normal observation.

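**Interpretation of the Anomaly Score**:
- If \(s\) is close to 1 (\(E(h(x))\) is close to 0), the observation is very likely an anomaly.
- If \(s\) is well below 0.5 (\(E(h(x))\) is much larger than \(c(n)\)), the observation can be regarded as normal.
- If \(s \approx 0.5\) for all observations (\(E(h(x)) \approx c(n)\)), the sample has no distinct anomalies.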

### Example
Let's walk through a simple example of calculating the anomaly score for an observation.

1. **Construct isolation trees**: Assume we build an iForest with 100 trees and the data set has 256 observations.
2. **Calculate path length**: Assume for a specific observation \(x\), the average path length across the 100 trees is 4.
3. **Compute normalization factor**: For \(n = 256\), using \(H(i) \approx \ln(i) + 0.5772\):

   \[ H(255) \approx \ln(255) + 0.5772 \approx 6.1185 \]

   \[ c(256) = 2H(255) - \frac{2 \cdot 255}{256} \approx 12.2370 - 1.9922 = 10.2448 \]
4. **Calculate anomaly score**:

   \[ s(x, 256) = 2^{-\frac{4}{10.2448}} \approx 2^{-0.3904} \approx 0.763 \]

In this example, the anomaly score of about 0.763 suggests that the observation \(x\) is likely an anomaly.
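The calculation in this example can be reproduced with a short script. It is a sketch that assumes the standard Isolation Forest definitions, \(c(n) = 2H(n-1) - 2(n-1)/n\) with \(H(i) \approx \ln(i) + 0.5772156649\), and \(s(x, n) = 2^{-E(h(x))/c(n)}\). The function names are chosen for illustration only.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, used to approximate H(i)

def harmonic(i):
    """Approximate the i-th harmonic number as ln(i) + gamma."""
    return math.log(i) + EULER_GAMMA

def c(n):
    """Normalization factor: average path length for a sample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

n = 256             # observations in the data set
avg_path = 4        # average path length of x over the 100 trees
print(c(n))                        # normalization factor c(256)
print(anomaly_score(avg_path, n))  # anomaly score of x
print(anomaly_score(c(n), n))      # a path length equal to c(n) gives 0.5
```

Feeding in an average path length equal to `c(n)` returns exactly 0.5, which is a convenient sanity check on the formula.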

----------------------
## DBSCAN
-----------------------

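In DBSCAN (Density-Based Spatial Clustering of Applications with Noise), every point is classified as a core point, a border point, or a noise point. Core points have at least a minimum number of neighbors within a given radius, border points lie within that radius of a core point without being core points themselves, and noise points are neither. Noise points belong to no cluster and are the ones treated as outliers or anomalies; border points are still assigned to a cluster.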
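The sketch below is a brute-force DBSCAN in plain Python, meant only to show which points end up flagged as outliers. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors, counting the point itself), the label value -1, and the toy points are assumptions for this example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks the outliers."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # provisional: may turn out to be a border point
            continue
        cluster += 1  # point i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # within eps of a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # dense cluster
    (2.2, 0),                                # border point of that cluster
    (10, 10), (10, 11), (11, 10), (11, 11),  # second dense cluster
    (5, 5),                                  # isolated point
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Only the isolated point at `(5, 5)` is labeled -1. The point at `(2.2, 0)` has too few neighbors to be a core point, but it lies within `eps` of one, so it joins the first cluster as a border point.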

---------------------
## Local Outlier Factor (LOF)
------------------------

Local Outlier Factor (LOF) is an algorithm used for identifying anomalies (outliers) in data. Unlike global outlier detection methods, LOF identifies outliers by considering the local density deviation of a given data point with respect to its neighbors. It compares the density of a data point to the densities of its neighbors; points that have a substantially lower density than their neighbors are considered outliers.

<img src="lof-steps.webp" width="750">

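**Local Outlier Factor (LOF):**

The LOF of a point \(p\) is the average ratio of the local reachability density (lrd) of each of its \(k\)-nearest neighbors to the local reachability density of \(p\) itself:

\[ \mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)} \]

where \(N_k(p)\) is the set of \(k\)-nearest neighbors of \(p\).

- If LOF(p) ≈ 1, p has a density similar to that of its neighbors (not an outlier).
- If LOF(p) is noticeably greater than 1, p lies in a sparser region than its neighbors and is considered an outlier. The larger the LOF value, the more significant the outlier.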
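A brute-force sketch of the LOF computation is shown below. It assumes Euclidean distance, takes exactly `k` neighbors per point (ties at the k-th distance are broken by index rather than all being included), and assumes the data contains no duplicate points. The function name `lof_scores` and the toy data are made up for illustration.

```python
import math

def lof_scores(points, k):
    """Compute the Local Outlier Factor of every point by brute force."""
    n = len(points)
    # k nearest neighbors of each point, stored as (distance, index) pairs.
    knn = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        knn.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_distance = [neigh[-1][0] for neigh in knn]

    def lrd(i):
        # Reachability distance from i to o is max(k-distance(o), d(i, o));
        # the local reachability density is the inverse of its mean.
        reach = [max(k_distance[o], d) for d, o in knn[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF: mean lrd of the neighbors divided by the point's own lrd.
    return [
        sum(lrds[o] for _, o in knn[i]) / (len(knn[i]) * lrds[i])
        for i in range(n)
    ]

# A 3x3 grid of evenly spaced points plus one distant point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

On this toy data the grid points should receive scores close to 1, while the point at `(10, 10)` should receive a score several times larger.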