In [1]:
%%capture

%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext tfl_training_anomaly_detection

In [2]:
%presentation_style

In [3]:
%%capture

%set_random_seed 12

In [4]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


# Taxonomy of Anomaly Detection Approaches
<img src="_static/images/aai_presentation_first_slide.svg" alt="Snow" style="width:100%;">


# Overview of approaches to AD


Anomaly detection does not rest on a single common foundation.

Rather, it is a collection of loosely related approaches, each with its own methods.

For actual applications of anomaly detection the structure of the data and the problem
at hand are of extreme importance for choosing the right method.

Sometimes, hand-designed rules work better than fancy algorithms! Generally, a thorough
statistical analysis of the data should be performed before reaching for "heavier machinery".

Below we present a small taxonomy of approaches to AD, each with some representative methods. Keep in mind that the lines between these approaches can become quite blurry.

## Distance based methods

__Philosophy__: a point is an outlier if it has few neighbors.

Distance to the $k$-th nearest neighbor (where $k$ has to be chosen), clustering (together with the Mahalanobis distance), local outlier factor (LOF), the matrix profile for time series etc.
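As a minimal sketch of the distance based philosophy, the snippet below scores each point by the distance to its $k$-th nearest neighbor. The helper name `kth_neighbor_distance`, the toy data, and the choice $k=5$ are illustrative assumptions, not part of the training material.

```python
import numpy as np

def kth_neighbor_distance(X, k):
    """Distance from each row of X to its k-th nearest other row."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point's zero distance to itself,
    # so column k holds the distance to the k-th nearest neighbor.
    return np.sort(dists, axis=1)[:, k]

# Toy data (assumption): 100 standard normal points plus one planted outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

scores = kth_neighbor_distance(X, k=5)
most_anomalous = int(np.argmax(scores))  # index of the point with the fewest close neighbors
```

The full pairwise distance matrix costs $O(n^2)$ memory, so for larger data sets a spatial index would likely be preferable.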

## Probabilistic methods

__Philosophy__: since most data points are normal, we can fit a probabilistic model to "normality". A point is an outlier if it has low probability under the fitted model.

Kernel density estimation, Gaussian mixtures, extreme value theory, GAN-based anomaly detection, time series forecasting etc.

We already see blurry lines: Gaussian mixtures could also be considered a distance based clustering method.
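To illustrate the probabilistic philosophy, here is a hedged sketch of a Gaussian kernel density estimate written directly in NumPy. The function name `kde_log_density`, the bandwidth of 0.5, and the 1% quantile threshold are arbitrary assumptions chosen for the toy data.

```python
import numpy as np

def kde_log_density(train, query, bandwidth):
    """Log-density of each query point under a Gaussian KDE fitted to train."""
    n, d = train.shape
    sq_dists = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    log_kernels = -0.5 * sq_dists / bandwidth**2
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    # Log-sum-exp over the training points for numerical stability.
    m = log_kernels.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_kernels - m).sum(axis=1)) - log_norm

# Toy data (assumption): the "normal" points are standard normal in 2D.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
query = np.array([[0.0, 0.0], [6.0, 6.0]])

log_density = kde_log_density(train, query, bandwidth=0.5)
# Flag points that are less likely than the 1% least likely training points.
threshold = np.quantile(kde_log_density(train, train, bandwidth=0.5), 0.01)
is_outlier = log_density < threshold
```

Both the bandwidth and the threshold quantile strongly influence the result and would have to be tuned on real data.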

## Subspace based methods

__Philosophy__: The space where data lives can be partitioned into a normal region and an abnormal one. This partitioning might happen in a lower dimensional subspace. 

Isolation trees/forests, one class SVM, genetic algorithms, etc.
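The idea behind isolation trees can be sketched in a few lines: points that can be separated from the rest with few random axis-aligned splits are likely anomalies. The code below is a toy illustration under simplifying assumptions (no subsampling, no path length normalization, paths built on the fly instead of stored trees), not the reference isolation forest algorithm. The names `isolation_depth` and `mean_isolation_depth` are hypothetical.

```python
import numpy as np

def isolation_depth(x, X, rng, max_depth=12):
    """Number of random axis-aligned splits until x is alone (capped at max_depth)."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        j = rng.integers(X.shape[1])           # pick a random feature
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:                           # feature is constant, cannot split
            break
        split = rng.uniform(lo, hi)            # pick a random threshold
        # Keep only the points that fall on the same side of the split as x.
        X = X[X[:, j] < split] if x[j] < split else X[X[:, j] >= split]
        depth += 1
    return depth

def mean_isolation_depth(X, n_trees=50, seed=0):
    """Average isolation depth of every row of X over n_trees random partitions."""
    rng = np.random.default_rng(seed)
    return np.array([
        np.mean([isolation_depth(x, X, rng) for _ in range(n_trees)])
        for x in X
    ])

# Toy data (assumption): 100 standard normal points plus one planted outlier.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(size=(100, 2)), [[8.0, 8.0]]])

depths = mean_isolation_depth(X)
most_anomalous = int(np.argmin(depths))  # shortest average path = easiest to isolate
```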

## Reconstruction based methods

__Philosophy__: We learn an encoding (or projection) of our data to a low dimensional space together with a corresponding decoding. Since most data points are normal, the learned encoding will represent them more faithfully than the anomalies. Thus, anomalies will have a higher reconstruction error.

Autoencoders, variational autoencoders, associative memory models (Hopfield networks), subspace methods etc.

Again blurry lines: is PCA reconstruction based or probabilistic? The variational autoencoder also gives a probabilistic model for data as a byproduct.
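As a small sketch of the reconstruction based view, PCA can serve as a linear encoder and decoder: project onto the leading principal directions, map back, and use the residual norm as the anomaly score. The function name `pca_reconstruction_error` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_reconstruction_error(X, n_components):
    """Norm of the residual after projecting onto the top principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]
    codes = Xc @ W.T            # encode: coordinates in the low dimensional space
    reconstruction = codes @ W  # decode: map back to the original space
    return np.linalg.norm(Xc - reconstruction, axis=1)

# Toy data (assumption): points close to a line in 3D plus one point far off the line.
rng = np.random.default_rng(0)
direction = np.ones(3) / np.sqrt(3)
t = 3.0 * rng.normal(size=(200, 1))
normal_points = t * direction + 0.1 * rng.normal(size=(200, 3))
X = np.vstack([normal_points, [[3.0, -3.0, 0.0]]])

errors = pca_reconstruction_error(X, n_components=1)
most_anomalous = int(np.argmax(errors))
```

Note that the outlier here has a moderate norm; it stands out only because it lies far from the subspace that the normal points occupy.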

## Supervised methods

__Philosophy__: We can just train a classifier to predict a binary (or more complex) label. The downside here is that this requires labeled data.

Standard classifiers, combined with techniques for handling imbalanced classes.
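One common trick for imbalanced classes is to reweight samples inversely to their class frequency. The sketch below computes such weights with the heuristic $w_c = n / (C \cdot n_c)$, which is the rule scikit-learn documents for `class_weight="balanced"`. The label vector with 5% anomalies is a made-up example.

```python
import numpy as np

# Hypothetical labels: 95 normal points (0) and 5 anomalies (1).
y = np.array([0] * 95 + [1] * 5)

counts = np.bincount(y)
# w_c = n_samples / (n_classes * n_samples_in_class_c)
class_weights = len(y) / (len(counts) * counts)
# Per-sample weights, e.g. for a classifier that accepts sample weights when fitting.
sample_weights = class_weights[y]
```

With these weights each class contributes the same total weight to the loss, so the rare anomaly class is not drowned out by the majority class.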

## Other approaches and conclusion

Information based methods (removing anomalies should drastically decrease the information content of the data set), domain specific methods, combinations of different methods etc.

We cannot possibly cover all the approaches above. Instead, we will demonstrate a few techniques that often work well in practice and then dive deeper into two general topics: _time series_ and _extreme value theory_.

The topics covered there will be useful in a variety of situations. 

<img src="_static/images/aai_presentation_last_slide.svg" alt="Snow" style="width:100%;">