## Anomaly Detection

Anomaly detection is the task of identifying examples that differ markedly from the rest of the data.

How does it work?

- Train an algorithm on the training data
  - The algorithm estimates the probability $P(X=x)$ using *density estimation* (for example, Kernel Density Estimation) on the training data.
  - Compute the distance between the training data and the density estimate
  - Compute the threshold
  - Compute the anomaly score
- Use the algorithm on the test data
  - Compute the probability $P(X=x_{\text{test}})$. If it is lower than the threshold $\epsilon$, then $x_{\text{test}}$ is an anomaly.

The user defines the anomaly region by choosing the threshold.
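The procedure above can be sketched with a one-dimensional Gaussian kernel density estimate written in plain Python. This is only an illustration: the training values, the bandwidth of `0.3`, and the threshold `epsilon = 0.05` are assumptions chosen for the example, not values from these notes.

```python
import math

def gaussian_kde(train, bandwidth):
    """Return a function that estimates the density p(x) from 1-D training data."""
    norm = 1.0 / (len(train) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian bumps centred on each training point.
        return norm * sum(math.exp(-0.5 * ((x - t) / bandwidth) ** 2) for t in train)

    return density

# Hypothetical training data: "normal" values clustered around 10.
train = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0]
p = gaussian_kde(train, bandwidth=0.3)

epsilon = 0.05  # user-chosen threshold (assumed value)

def is_anomaly(x):
    """Flag x as an anomaly when its estimated density is below the threshold."""
    return p(x) < epsilon

print(is_anomaly(10.05))  # a point close to the training data
print(is_anomaly(13.0))   # a point far from the training data
```

A smaller bandwidth makes the estimate follow the training points more tightly, so the bandwidth and the threshold have to be tuned together.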

Anomaly detection example
- Fraud detection

Anomaly detection algorithms

1. Choose $n$ features $x_j$ that you think might be indicative of anomalous examples.

2. Fit parameters $\mu_1, \mu_2, ..., \mu_n$ and $\sigma_1, \sigma_2, ..., \sigma_n$ on the $m$ training examples $x^{(1)}, x^{(2)}, ..., x^{(m)}$

$$\mu_j = \frac{1}{m} \sum_{i=1}^m x_j^{(i)}$$

$$\sigma_j = \sqrt{\frac{1}{m} \sum_{i=1}^m \left(x_j^{(i)} - \mu_j\right)^2}$$

3. Given a new example $x$, compute $p(x)$

$$p(x)=\prod_{j=1}^n \frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right) $$

4. If $p(x) < \epsilon$, the new example is an anomaly
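Steps 1 to 4 can be sketched in plain Python as follows. The sketch takes the mean and standard deviation of each feature over the training rows and multiplies the per-feature Gaussian densities. The helper names `fit_gaussian` and `p`, the toy training data, and the value of `epsilon` are assumptions made for illustration.

```python
import math

def fit_gaussian(X):
    """Estimate the mean and standard deviation of each feature (column) of X."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma = [math.sqrt(sum((row[j] - mu[j]) ** 2 for row in X) / m) for j in range(n)]
    return mu, sigma

def p(x, mu, sigma):
    """Product of the univariate Gaussian densities of each feature of x."""
    prob = 1.0
    for xj, mj, sj in zip(x, mu, sigma):
        prob *= math.exp(-((xj - mj) ** 2) / (2 * sj ** 2)) / (math.sqrt(2 * math.pi) * sj)
    return prob

# Hypothetical training set with two features; all rows are assumed normal.
X_train = [[1.0, 10.0], [1.2, 9.5], [0.8, 10.5], [1.1, 10.2], [0.9, 9.8]]
mu, sigma = fit_gaussian(X_train)

epsilon = 1e-3  # assumed threshold

for x in ([1.0, 10.0], [2.0, 10.0]):
    print(x, "anomaly" if p(x, mu, sigma) < epsilon else "normal")
```

Multiplying the per-feature densities treats the features as independent, which is the simplification this model makes.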

> Note
> If a feature takes an unusually large or small value, its factor in the product is small, which lowers the overall probability $p(x)$ and makes the example more likely to be flagged as an anomaly.
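
As a quick numeric illustration of the note, take a single feature and assume $\mu_j = 0$ and $\sigma_j = 1$ (values chosen only for this example). The factor that feature contributes to $p(x)$ at a typical value $x_j = 0$ and at an extreme value $x_j = 4$ is:

$$\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{0^2}{2}\right) \approx 0.399, \qquad \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{4^2}{2}\right) \approx 1.3 \times 10^{-4}$$

A single extreme feature therefore shrinks the whole product by roughly three orders of magnitude, even if every other feature looks typical.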

#### The importance of real-number evaluation

When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way to evaluate the algorithm with a single number.

Assume we have some *labeled data* with anomalous and non-anomalous examples (e.g. $y=1$ if the example is anomalous and $y=0$ if it is not).

Train the algorithm on the *training set* $x^{(1)}, x^{(2)}, ..., x^{(m)}$, assuming all of its examples are normal ($y=0$).

Use the *cross validation set* $((x_{cv}^{(1)}, y_{cv}^{(1)}), ..., (x_{cv}^{(k)}, y_{cv}^{(k)}))$ to tune the parameter $\epsilon$ and to choose which features to include in the model. The cross validation set must include both anomalous examples ($y=1$) and normal examples ($y=0$).

Use the *test set* $((x_{test}^{(1)}, y_{test}^{(1)}), ..., (x_{test}^{(k)}, y_{test}^{(k)}))$ to evaluate the algorithm. The test set must include both anomalous examples ($y=1$) and normal examples ($y=0$).

If we have only a few anomalous examples, we can split the data into just a *training set* and a *cross validation set*, without a separate test set.
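
One way to build these three sets is sketched below. The 60/20/20 split of the normal examples and the 50/50 split of the anomalous examples are assumed proportions, and `split_for_anomaly_detection` is a hypothetical helper name. Neither comes from these notes.

```python
import random

def split_for_anomaly_detection(normal, anomalous, seed=0):
    """Split labeled data into train (normal only), cross validation and test sets.

    Assumed proportions: 60/20/20 of the normal examples go to train/cv/test,
    and the anomalous examples are divided evenly between cv and test.
    """
    rng = random.Random(seed)
    normal, anomalous = normal[:], anomalous[:]
    rng.shuffle(normal)
    rng.shuffle(anomalous)

    n_train = len(normal) * 6 // 10
    n_cv = len(normal) * 2 // 10
    a_cv = len(anomalous) // 2

    train = [(x, 0) for x in normal[:n_train]]
    cv = ([(x, 0) for x in normal[n_train:n_train + n_cv]]
          + [(x, 1) for x in anomalous[:a_cv]])
    test = ([(x, 0) for x in normal[n_train + n_cv:]]
            + [(x, 1) for x in anomalous[a_cv:]])
    return train, cv, test

# Hypothetical data: 100 normal examples and 10 anomalous ones.
normal = [[float(i)] for i in range(100)]
anomalous = [[1000.0 + i] for i in range(10)]
train, cv, test = split_for_anomaly_detection(normal, anomalous)
print(len(train), len(cv), len(test))
```

With very few anomalous examples, the same idea applies with all of them placed in the cross validation set.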

Since the data is highly imbalanced, we can use *undersampling* to balance it. In this case it is highly recommended to compute the following *metrics*, since accuracy alone can be misleading:

- True positive rate, False positive rate, True negative rate, False negative rate
- Precision, Recall
- F1 score
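
A minimal sketch of how these metrics can drive the choice of $\epsilon$ on the cross validation set: for each candidate threshold, flag every example with $p(x) < \epsilon$ as an anomaly and keep the threshold with the best F1 score. The probabilities, labels, and candidate thresholds are made up for illustration, and the function names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive (anomalous) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def select_epsilon(p_cv, y_cv, candidates):
    """Return the candidate threshold with the highest F1 on the cv set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]
        _, _, f1 = precision_recall_f1(y_cv, y_pred)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical p(x) values and labels for a cross validation set.
p_cv = [0.30, 0.25, 0.40, 0.001, 0.35, 0.0005, 0.02, 0.28]
y_cv = [0, 0, 0, 1, 0, 1, 0, 0]
candidates = [1e-4, 1e-3, 1e-2, 1e-1]

best_eps, best_f1 = select_epsilon(p_cv, y_cv, candidates)
print(best_eps, best_f1)
```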



Anomaly detection vs Supervised Learning

- Anomaly detection 
  - Works with a very small number of positive examples ($y = 1$; 0-20 is common) and a very large number of negative examples ($y = 0$).
  - Many types of anomalies exist, so it is hard for any algorithm to learn from the positive examples what anomalies look like.
  - Future anomalies may look nothing like any of the anomalous examples we've seen so far.
  - e.g. *fraud*

- Supervised learning
  - Large number of positive and negative examples.
  - Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.
  - e.g. *spam*

Convert non-Gaussian features to approximately Gaussian ones with a transformation such as:
- $\ln(x+c)$, where the constant $c$ avoids $\ln(0)$
- $x^{1/n}$
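
The effect of these transformations can be checked by comparing the skewness of a feature before and after applying them (values closer to 0 indicate a more symmetric, more Gaussian-like shape). The sketch below is only a rough check: the feature values, $c = 1$, and $n = 3$ are assumptions chosen for the example.

```python
import math

def skewness(values):
    """Third standardized moment; close to 0 for symmetric data."""
    m = len(values)
    mean = sum(values) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / m)
    return sum(((v - mean) / std) ** 3 for v in values) / m

# Hypothetical right-skewed feature.
x = [0, 1, 1, 2, 2, 3, 5, 8, 20, 60]

c = 1.0                                   # assumed constant, avoids log(0)
x_log = [math.log(v + c) for v in x]      # ln(x + c)
x_root = [v ** (1 / 3) for v in x]        # x^(1/n) with n = 3 (assumed)

print(skewness(x), skewness(x_log), skewness(x_root))
```

In practice a histogram of the transformed feature is the usual way to judge whether a given $c$ or $n$ works well.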

Adding more features to the model can help it detect anomalies.
- Only 2 or 3 features are usually not enough to detect anomalies.

```python
import pandas as pd
import h2o
from h2o.estimators.isolation_forest import H2OIsolationForestEstimator

# Load the dataset (file path not specified)
pd.read_csv
```