# Anomaly detection

## Introduction

### Definition of the problem

Given a data set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, we consider a new example $x_{test}$ anomalous if its estimated probability
under a model fitted to the data set satisfies $p(x_{test}) < \epsilon$, where $\epsilon$ is a small threshold.

### Density estimation

Training set: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where each $x^{(i)}$ has $n$ features.

Each feature is assumed to follow a normal (Gaussian) distribution, and the features are assumed to be independent of each other.

$$p(x) = p(x_1; \mu_1, \sigma^2_1) \cdot p(x_2; \mu_2, \sigma^2_2) \cdot p(x_3; \mu_3, \sigma^2_3) \cdots p(x_n;\mu_n, \sigma^2_n) = \prod_{j = 1}^n p(x_j; \mu_j, \sigma_j^2)$$
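The notes do not spell out how the parameters are obtained. A common choice, assumed here, is the maximum-likelihood estimate computed per feature over the $m$ training examples:

$$\mu_j = \frac{1}{m}\sum_{i = 1}^m x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i = 1}^m \left(x_j^{(i)} - \mu_j\right)^2$$

Dividing by $m - 1$ instead of $m$ gives the unbiased variance estimate; for large $m$ the difference is negligible.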

## Algorithm

<pre>
   Fit parameters &mu;<sub>1</sub>, ..., &mu;<sub>n</sub>, &sigma;<sub>1</sub><sup>2</sup>, ..., &sigma;<sub>n</sub><sup>2</sup>
   Given a new example x, compute p(x):
</pre>

$$p(x) = \prod_{j = 1}^n p(x_j; \mu_j, \sigma_j^2) = \prod_{j = 1}^n \frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$

<pre>
   Anomaly if p(x) &lt; &epsilon;
</pre>
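The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names (`fit_gaussian`, `density`, `is_anomaly`) are hypothetical, the variance is assumed to be the population variance (dividing by m), and the data in the usage lines is synthetic.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from an (m, n) training matrix."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # assumption: population variance (divides by m)
    return mu, sigma2

def density(X, mu, sigma2):
    """Return p(x) for each row of X as a product of univariate Gaussians."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return np.prod(coef * expo, axis=1)

def is_anomaly(X, mu, sigma2, epsilon):
    """Flag each row of X whose density falls below the threshold epsilon."""
    return density(X, mu, sigma2) < epsilon

# Usage on synthetic data (two independent Gaussian features).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(1000, 2))
mu, sigma2 = fit_gaussian(X_train)
X_new = np.array([[0.1, 5.2], [6.0, -4.0]])
flags = is_anomaly(X_new, mu, sigma2, epsilon=1e-4)  # epsilon chosen arbitrarily here
```

With many features the product of small densities can underflow; summing log-densities and comparing against log &epsilon; is a common workaround.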

## Evaluating an anomaly detection algorithm

Assume we have some labeled data, with a few anomalous examples (y=1) and many non-anomalous ones (y=0).

We will prepare:

1. Training set: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$. We assume all are normal, non-anomalous examples (it is not a problem if in reality a few are not).
2. Cross validation set: $\{x_{cv}^{(1)}, x_{cv}^{(2)}, \dots, x_{cv}^{(m_{cv})}\}$, $\{y_{cv}^{(1)}, y_{cv}^{(2)}, \dots, y_{cv}^{(m_{cv})}\}$
3. Test set: $\{x_{test}^{(1)}, x_{test}^{(2)}, \dots, x_{test}^{(m_{test})}\}$, $\{y_{test}^{(1)}, y_{test}^{(2)}, \dots, y_{test}^{(m_{test})}\}$

The cross validation and test sets will contain some anomalies (i.e., data points with y=1).

Alternatively, we could use only a training set and a cross validation set, for example if we don't have enough examples of anomalies.

The algorithm will predict:

$$
y = \begin{cases}
      1, & \text{if } p(x) < \epsilon\\
      0, & \text{if } p(x) \geq \epsilon
\end{cases}
$$

Then you can use a metric like the F<sub>1</sub> score, computed on the cross validation set, to choose the parameter &epsilon;.
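One way to pick &epsilon; is to scan candidate thresholds and keep the one with the best F<sub>1</sub> score on the cross validation set. The sketch below assumes the densities `p_cv` have already been computed for the cross validation examples; the function names, the linear grid of candidates, and `num_steps` are illustrative choices, not something the notes prescribe.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 score for binary labels, where 1 marks an anomaly."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0  # precision and recall are both 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, num_steps=1000):
    """Scan thresholds between min and max density; return the best (epsilon, F1)."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), num_steps):
        y_pred = (p_cv < eps).astype(int)
        score = f1_score(y_cv, y_pred)
        if score > best_f1:  # strict comparison keeps the smallest best threshold
            best_eps, best_f1 = eps, score
    return best_eps, best_f1
```

F<sub>1</sub> is preferred over plain accuracy here because anomalies are rare: a model that always predicts y=0 would score high accuracy while detecting nothing.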

## When to use anomaly detection vs supervised learning

Use anomaly detection if:

1. You have a very small number of positive examples (y=1) (0-20 is common), and
2. There are many different "types" of anomalies, so it is hard for any algorithm to learn from the positive examples what anomalies look like.

Use supervised learning if:

1. You have a (relatively) large number of positive examples, enough for the algorithm to get a sense of what positive examples are like.

## Transforming the features

If the features are non-Gaussian, they can be transformed, for example by applying a logarithm $\log(x_2 + c)$ or a root $x_3^{1/3}$.
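The sketch below shows both transforms in NumPy, together with a simple sample-skewness measure that can help judge whether a transformed feature looks more symmetric. The helper names and the default `c=1.0` are assumptions for illustration; in practice `c` and the exponent are tuned by inspecting a histogram of the feature.

```python
import numpy as np

def log_transform(x, c=1.0):
    """Return log(x + c); c is a hypothetical shift chosen so that x + c > 0."""
    return np.log(x + c)

def root_transform(x, power=1.0 / 3.0):
    """Return x ** power; assumes x is non-negative."""
    return np.power(x, power)

def skewness(x):
    """Sample skewness: close to 0 for a symmetric, Gaussian-like feature."""
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Usage on a synthetic right-skewed feature.
rng = np.random.default_rng(0)
x_raw = rng.exponential(scale=2.0, size=5000)
skew_raw = skewness(x_raw)
skew_log = skewness(log_transform(x_raw))
skew_root = skewness(root_transform(x_raw))
```

A transform whose skewness is closer to zero is usually the better fit for the Gaussian assumption, though a histogram remains the more reliable check.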