# Anomaly detection:
- Identifies rare events that deviate significantly from the majority of the data.
- Used in real-world applications.

# Why anomaly detection?
- Unexpected events can be caused by production faults or system defects.
- Outliers can degrade the performance of forecasting models.

## There are two types of anomaly detection tasks:
- Point-wise anomaly detection.
- Pattern-wise anomaly detection.

### Point-wise:
Looks at isolated points and marks them as anomalies.
<br/>

![image.png](attachment:image.png)

<br/>

### Pattern-wise:
Looks at a sequence of points and identifies anomalies within the sequence.
<br/>

![image-2.png](attachment:image-2.png)

<br/>

# Methods to identify outliers in time series data:

- Median Absolute Deviation (MAD): When data is normally distributed, we can reasonably conclude that points at each tail are outliers. (Only for normally distributed data.)
    - This is done using the z-score method <br>
    ![image.png](attachment:image.png)

    <br>

    ![image-2.png](attachment:image-2.png)

    <br>
    - Outliers affect the mean (and the standard deviation), and therefore the z-score itself


    - So we use the robust z-score method, which uses the median instead of the mean 
    <br>

    ![image-3.png](attachment:image-3.png)
    
    <br>

    - Robust z-score:
    <br>
![image-4.png](attachment:image-4.png)
    <br>
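        - We use 0.6745 because the robust z-score is centered on the median and scaled by the MAD, which is usually smaller than the standard deviation: for normally distributed data, MAD ≈ 0.6745σ, so the factor makes the robust z-score comparable to the standard z-score.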

*NOTE: The robust z-score only works when the data is close to normally distributed, and the MAD must not be equal to zero.*
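
The robust z-score described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `robust_z_scores`, the sample `series`, and the cutoff of 3.5 are assumptions made for this example (3.5 is a commonly used threshold, but the notes above do not specify one).

```python
from statistics import median


def robust_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each x in values."""
    center = median(values)
    # MAD = median of the absolute deviations from the median
    mad = median(abs(x - center) for x in values)
    if mad == 0:
        # Matches the note above: the score is undefined when MAD is zero
        raise ValueError("MAD is zero; the robust z-score is undefined")
    return [0.6745 * (x - center) / mad for x in values]


# Hypothetical sample: seven similar readings and one obvious spike
series = [10.0, 10.2, 9.9, 10.1, 10.3, 9.8, 10.05, 25.0]
scores = robust_z_scores(series)

threshold = 3.5  # assumption: a commonly used cutoff, not taken from the notes
outliers = [x for x, z in zip(series, scores) if abs(z) > threshold]
print(outliers)
```

Because the median and the MAD barely move when a single extreme value is added, the spike stands out clearly, whereas with the ordinary z-score the spike itself would inflate the mean and standard deviation.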



# Isolation Forest:
- Tree-based algorithm to detect outliers.
- Partitions the data repeatedly until each point is isolated
    - Many partitions needed: the point is an inlier
    - Few partitions needed: the point is an outlier
    <br>
![image.png](attachment:image.png)
    <br>

    - In the example below, Xj is an outlier
  
    <br>

![image-2.png](attachment:image-2.png)
    <br>
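
The partition-counting intuition can be illustrated with a deliberately simplified one-dimensional sketch. This is not how scikit-learn implements Isolation Forest; the helpers `isolation_depth` and `average_depth` and the sample `data` are hypothetical names introduced only for this example. Each "tree" picks random split values and counts how many splits are needed before the target point is alone.

```python
import random


def isolation_depth(values, target, rng):
    """Count the random splits needed until `target` is alone in its partition."""
    points = list(values)
    depth = 0
    while len(points) > 1:
        lo, hi = min(points), max(points)
        if lo == hi:
            break  # only duplicates left; no further split is possible
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            points = [p for p in points if p < split]
        else:
            points = [p for p in points if p >= split]
        depth += 1
    return depth


def average_depth(values, target, n_trees=200, seed=0):
    """Average the isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees


# Hypothetical sample: a tight cluster around 10 and one far-away point
data = [10.0, 10.2, 9.9, 10.1, 10.3, 9.8, 10.05, 25.0]
inlier_depth = average_depth(data, 10.1)
outlier_depth = average_depth(data, 25.0)
print(inlier_depth, outlier_depth)
```

The far-away point is usually separated by the very first split, while a point inside the cluster needs several splits, which is the signal an isolation forest turns into an anomaly score.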

In [2]:
from sklearn.ensemble import IsolationForest

?IsolationForest

Init signature:
IsolationForest(
    *,
    n_estimators=100,
    max_samples='auto',
    contamination='auto',
    max_features=1.0,
    bootstrap=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    warm_start=False,
)
Docstring:
Isolation Forest Algorithm.

Return the anomaly score of each sample using the IsolationForest algorithm

The IsolationForest 'isolates' observations by randoml

In [None]:
# A common starting point for the contamination level of an isolation forest is
# 1 / len(training_data) (roughly one expected anomaly), with random_state=42 for reproducibility.
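
As a rough sketch of how the contamination setting above might be used, the example below fits an `IsolationForest` on synthetic data. The data, the injected spike at index 120, and the variable names are assumptions made for illustration; only the `IsolationForest` parameters come from the signature shown earlier.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic series (assumption): 200 readings around 10 with one injected spike
rng = np.random.default_rng(42)
values = rng.normal(loc=10.0, scale=0.5, size=200)
values[120] = 25.0

# scikit-learn expects a 2D array of shape (n_samples, n_features)
X = values.reshape(-1, 1)

# Contamination of 1 / len(training data), i.e. roughly one expected anomaly
contamination = 1 / len(X)
model = IsolationForest(contamination=contamination, random_state=42)

# fit_predict returns 1 for inliers and -1 for outliers
labels = model.fit_predict(X)
outlier_indices = np.where(labels == -1)[0]
print(outlier_indices)
```

With such a small contamination value, only the most easily isolated sample falls below the decision threshold, so this setting suits data where anomalies are expected to be very rare.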

## Local Outlier Factor (LOF)

- Unsupervised method for anomaly detection.

- Intuition: compare the local density of a point to that of its neighbors.

- If the point's density is substantially lower than that of its neighbors, the point is an outlier; otherwise it is an inlier.

- The metric used for LOF is the reachability distance:<br>
![image.png](attachment:image.png)
<br>

- Once we have the reachability distances from A to all of its k nearest neighbors, we can calculate the local reachability density
<br>

![image.png](attachment:image.png)
<br>

- The local reachability density is the inverse of the average of all the reachability distances.

- If the LOF is close to or smaller than 1, the point is an inlier; if it is substantially larger than 1, the point is an outlier.

- LOF is available in scikit-learn
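
The density comparison described above can be tried out with scikit-learn's `LocalOutlierFactor`. This is a minimal sketch under stated assumptions: the synthetic data, the injected spike at index 120, and the two `new_points` are made up for illustration, and the contamination value reuses the 1 / len(data) rule of thumb from the Isolation Forest section.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic series (assumption): 200 readings around 10 with one injected spike
rng = np.random.default_rng(42)
values = rng.normal(loc=10.0, scale=0.5, size=200)
values[120] = 25.0
X = values.reshape(-1, 1)

# Outlier detection on the training data itself (novelty=False is the default)
lof = LocalOutlierFactor(n_neighbors=20, contamination=1 / len(X))
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier

# scikit-learn stores the negated LOF; flip the sign to get LOF values
lof_scores = -lof.negative_outlier_factor_

# Novelty detection: fit on data assumed clean, then score unseen points
train = X[labels == 1]
novelty_lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train)
new_points = np.array([[10.1], [30.0]])  # hypothetical new observations
new_labels = novelty_lof.predict(new_points)
print(lof_scores[120], new_labels)
```

Note that `fit_predict` is only meant for the default `novelty=False` mode, while `predict` on unseen data requires `novelty=True`, which is why the sketch fits two separate models.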

In [None]:
from sklearn.neighbors import LocalOutlierFactor

# LOF accepts the same contamination parameter as IsolationForest.
# novelty=True lets the model predict whether new, unseen data contains anomalies.
?LocalOutlierFactor

Init signature:
LocalOutlierFactor(
    n_neighbors=20,
    *,
    algorithm='auto',
    leaf_size=30,
    metric='minkowski',
    p=2,
    metric_params=None,
    contamination='auto',
    novelty=False,
    n_jobs=None,
)
Docstring:
Unsupervised Outlier Detection using the Local Outlier Factor (LOF).

The anomaly score of each sample is called the Local Outlier Factor.
It measures the local deviation