# Gaussian Anomaly Detection

Fitting a Gaussian distribution to data and flagging points that fall far into its tails is a simple but often effective anomaly detection technique.
In this notebook we explore the basics of this approach using synthetic Gaussian data.

In [None]:
import math
from functools import reduce
from scipy.stats import norm
import altair as alt
import numpy as np
import pandas as pd
alt.renderers.enable('notebook')

# Online mean and variance estimates

The following class uses [Chan's formula](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm) for online mean and variance estimates, so that we can calculate estimated mean and variance in a single pass over a large data set, and parallelize over partitioned data.

In this lab, we will not cover the details of how efficient streaming moments operate: the important takeaway is that obtaining means and variances from data for fitting Gaussian distributions is fast and efficient, and so it can be a useful tool in the anomaly detection toolbox.
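
For readers who want to see the idea anyway, here is a minimal sketch of the pairwise merge step from Chan's formula. The class `MomentsSketch` and its methods are hypothetical names invented for this illustration; this is not the `StreamingMoments` implementation used in this notebook, and the choice of sample variance (dividing by `count - 1`) is an assumption.

```python
import numpy as np

class MomentsSketch:
    """Hypothetical stand-in for illustration; not the notebook's StreamingMoments."""

    def __init__(self, count=0, mean=0.0, m2=0.0):
        self.count = count
        self.mean = mean
        self.m2 = m2  # sum of squared deviations from the mean

    @classmethod
    def from_data(cls, values):
        # Summarize one partition as (count, mean, M2)
        values = np.asarray(values, dtype=float)
        if values.size == 0:
            return cls()
        mean = values.mean()
        return cls(values.size, mean, ((values - mean) ** 2).sum())

    def merge(self, other):
        # Chan's pairwise update: combine two summaries without revisiting the data
        n = self.count + other.count
        if n == 0:
            return MomentsSketch()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / n
        return MomentsSketch(n, mean, m2)

    def variance(self):
        # Sample variance (assumed convention)
        return self.m2 / (self.count - 1)

rng = np.random.default_rng(0)
parts = [rng.normal(0.0, 1.0, size=1000) for _ in range(4)]

merged = MomentsSketch()
for p in parts:
    merged = merged.merge(MomentsSketch.from_data(p))

all_data = np.concatenate(parts)
print(merged.count, merged.mean, merged.variance())
print(all_data.size, all_data.mean(), all_data.var(ddof=1))
```

The two printed lines should agree up to floating-point rounding, which is what allows each partition to be summarized independently and combined afterwards.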

In [None]:
from detail.streamingmoments import StreamingMoments

# Data and Moments

In the following cell we generate some Gaussian data, simulating data that we might collect from systems, users or the outside world.

We use our `StreamingMoments` class to get the mean and variance of the data. The data is generated in multiple independent partitions, to demonstrate that streaming moments can be computed across partitions and merged to get a global result.

In [None]:
# Generate some data divided into multiple partitions
data_partitions = [norm.rvs(0, 1, size=10000) for x in range(10)]

# Sketch the moments of each partition, and then merge the sketches to a final result
moments = reduce(lambda m1, m2: m1.merge_from(m2),
                 [StreamingMoments() << p for p in data_partitions],
                 StreamingMoments())

print("count= %d  mean= %.2f  variance= %.2f" % (moments.count, moments.mean(), moments.variance()))

# A Gaussian Anomaly Detector

In the following cell we define a class for doing anomaly detection, assuming Gaussian data. You can see that this class defines two example anomaly score methods. These two scores will be explored and compared below.

In [None]:
class GaussianAnomalyDetector(object):
    def __init__(self, mean, variance):
        self.mean = mean
        self.stdv = math.sqrt(variance)

    # take the "tail probability" as the anomaly score
    def anomaly1(self, x):
        cdf = norm.cdf(x, self.mean, self.stdv)
        # find the smallest tail probability
        anomaly = min(cdf, 1 - cdf)
        return anomaly
    
    # use the negative log of twice the tail probability (the two-sided tail mass) as the score
    def anomaly2(self, x):
        cdf = norm.cdf(x, self.mean, self.stdv)
        # find the smallest tail probability
        t = min(cdf, 1 - cdf)
        # make sure we don't try to take the logarithm of zero
        t = max(t, 1e-100)
        return -math.log(2 * t)
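
One numerical detail is worth noting: for points far above the mean, `1 - cdf` loses precision because `cdf` rounds to exactly 1.0 in double precision. The following sketch shows the effect and one possible alternative using SciPy's survival function `norm.sf` and its logarithm `norm.logsf`. The helper `two_sided_score` is a hypothetical name introduced here for illustration; it is not part of the detector class above.

```python
import math
from scipy.stats import norm

def two_sided_score(x, mean=0.0, stdv=1.0):
    """Negative log of the two-sided tail probability, computed in log space."""
    z = abs(x - mean) / stdv
    # norm.logsf evaluates log(1 - cdf) without forming 1 - cdf explicitly
    return -(math.log(2.0) + norm.logsf(z))

naive_tail = 1 - norm.cdf(10.0)   # cdf(10) rounds to 1.0, so this is 0.0
stable_tail = norm.sf(10.0)       # upper tail computed directly

print(naive_tail, stable_tail)
print(two_sided_score(0.0), two_sided_score(10.0))
```

With the naive subtraction, any sufficiently large positive `x` yields a tail of zero and falls back to the `1e-100` clamp, so such points all receive the same score. Working in log space keeps them distinguishable.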

# Create a Detector

The following cell creates an anomaly detector using the mean and variance we estimated from our example data.
We also define some example data points that we can use to examine how our anomaly scores behave.

Our data has a mean of approximately zero and a standard deviation of approximately 1, so our test points range from 0 to 8 standard deviations from the mean.

In [None]:
detector = GaussianAnomalyDetector(moments.mean(), moments.variance())
data = [0, -1, 2, -3, 4, -5, 6, -7, 8]

# Anomaly Score 1

First we will create a table of our test points and their anomaly scores using anomaly-score 1. Score 1 is simply the "tail probability": the probability mass to the right of our data, if it is greater than the mean, otherwise the mass of the tail to the left.
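
Written as a formula, with $\mu$ and $\sigma$ the detector's mean and standard deviation, $\Phi$ the standard normal CDF, and $F$ the CDF of the fitted Gaussian (these symbols are introduced here for illustration; the code calls `norm.cdf`):

$$
\text{score}_1(x) = \min\bigl(F(x),\; 1 - F(x)\bigr), \qquad F(x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right)
$$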

Numerically, we can see two things. First, as we get farther from the mean, these scores get smaller, so for score 1, smaller values are "more anomalous".
Second, the values become small very quickly, rapidly approaching zero.

In [None]:
scores1 = pd.DataFrame({
  'x': data,
  'score1': [detector.anomaly1(x) for x in data]
})
scores1

# Plotting Anomaly Score 1

The following plot shows visually what we saw in the table: as values get farther from the mean, anomaly scores approach zero very fast. This makes it difficult to compare "very anomalous" values in an absolute sense: the difference between the scores of any two highly anomalous points is itself very close to zero.

In [None]:
alt.Chart(scores1).mark_line(point=True).encode(
    x='x',
    y='score1'
)


# Anomaly Score 2

In this table we list the values of anomaly score 2 for our test points. Anomaly score 2 is based on the negative logarithm of the "tail probability"; this means that as data becomes "more anomalous", the score gets larger.

In [None]:
scores2 = pd.DataFrame({
  'x': data,
  'score2': [detector.anomaly2(x) for x in data]
})
scores2

# Plotting Score 2

When we plot the table above, we see that visually this score may be easier to work with. More anomalous values have larger anomaly scores, and comparing absolute scores is easy.

In [None]:
alt.Chart(scores2).mark_line(point=True).encode(
    x='x',
    y='score2'
)
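
The contrast between the two scores can also be checked numerically. The sketch below assumes a standard normal distribution (mean 0, standard deviation 1) rather than the fitted detector, and compares points at 6 and 8 standard deviations above the mean. It uses `norm.sf` for the upper tail.

```python
import math
from scipy.stats import norm

# Upper-tail probabilities at 6 and 8 standard deviations (standard normal assumed)
tail6 = norm.sf(6.0)
tail8 = norm.sf(8.0)

# Negative log of twice the tail probability, as in anomaly score 2
score6 = -math.log(2 * tail6)
score8 = -math.log(2 * tail8)

print("tail difference:  %.3g" % (tail6 - tail8))
print("score difference: %.3g" % (score8 - score6))
```

The tail probabilities differ by less than one part in a hundred million, while the log-based scores differ by more than ten units.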


# Exercises
1. How would you choose an anomaly threshold for score 1?  Score 2?
1. What other anomaly scores can you design, based on Gaussian distribution PDFs or CDFs?
1. As the number of data samples you collect from a Gaussian distribution increases, how does the probability of a high anomaly score change?
1. How do these Gaussian-based anomaly scores behave if they are given non-Gaussian data?
1. Can you adapt `GaussianAnomalyDetector` to use different distributions?