# Extreme Values and Outliers
When we collect samples, we may want to determine if values from our samples are somehow "outliers" - that is, anomalous - or if they are explainable by natural variations in the data.
Answering this question can be tricky, but a branch of statistics called Extreme Value Statistics is devoted to providing principled answers.
In this notebook we will explore how to apply Extreme Value Statistics (EVS) to the problem of distinguishing normal variations from anomalous outliers.

In [None]:
import math
import numpy as np
import scipy
import scipy.stats
from scipy.stats import gamma
import pandas as pd
import altair as alt
from detail.altairdf import altairDF
alt.renderers.enable("notebook")

# Normal Variation, or Outlier?
Consider a Gamma distribution with shape parameter 1.0, whose CDF is plotted below.
Imagine that we took some samples, and want to know if the maximum value we measured is "normal" or an "outlier".

In the cell below, our hypothetical maximum value is 10.
We can see that 10 is out on the tail of our distribution, yet it is not outlandishly far out.
In theory, any positive value _might_ be sampled from a Gamma distribution.

Should we treat this value 10 as an outlier?

In [None]:
plotdf = altairDF(np.linspace(0, 20, num=20).tolist(),
                  [lambda x: gamma.cdf(x, 1.)],
                  labels=["gamma(1.0)"], ycol="CDF")
chart = alt.Chart().mark_line().encode(x="x", y="CDF", color="color")
rule = alt.Chart().mark_rule(color="red").encode(x='maximum:Q')
alt.layer(chart, rule, data=plotdf).transform_calculate(maximum="10")

# Effect of Sample Size
Intuitively, we understand that the more samples we take, the more likely it is that we will measure "larger" values, just by chance.
In the cells that follow, we take some samples from a distribution, find the maximum value, and then average these maximum values to get a sense of how increasing sample sizes might affect what kind of maximum values we measure.

* What do you expect to happen as we increase sample size?
* In the previous example, how does our judgement of whether our maximum value of 10 is an "outlier" or "expected" change, as our sample size changes?

In [None]:
# Average the maximum value over 10 samples, each of size 100
sum([max(gamma.rvs(1.0, size=100)) for _ in range(10)])/10

In [None]:
# Average the maximum value over 10 samples, each of size 1000
sum([max(gamma.rvs(1.0, size=1000)) for _ in range(10)])/10

In [None]:
# Average the maximum value over 10 samples, each of size 10000
sum([max(gamma.rvs(1.0, size=10000)) for _ in range(10)])/10
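
Each of the cells above averages only 10 maxima, so the results are noisy. As a rough cross-check, `gamma(1.0)` with the default scale is the standard exponential distribution, for which the expected maximum of `n` independent samples is the harmonic number `1 + 1/2 + ... + 1/n`. The sketch below compares that value with a simulated average. The helper names `mean_max` and `harmonic`, the seed, and the repetition count are illustrative choices, not part of the original notebook.

```python
import numpy as np
from scipy.stats import gamma

def mean_max(n, reps=200, seed=0):
    """Average the maximum of `n` gamma(1.0) samples over `reps` repetitions."""
    rng = np.random.default_rng(seed)
    samples = gamma.rvs(1.0, size=(reps, n), random_state=rng)
    return float(samples.max(axis=1).mean())

def harmonic(n):
    """Harmonic number H_n: expected maximum of n standard exponential samples."""
    return float(np.sum(1.0 / np.arange(1, n + 1)))

# Simulated average maximum next to the closed-form expectation.
for n in (100, 1000, 10000):
    print(n, round(mean_max(n), 2), round(harmonic(n), 2))
```

Both columns should grow roughly like the logarithm of the sample size, which matches the intuition that larger samples produce larger maxima.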

# Extreme Value Distributions
If we know the CDF of the distribution we are sampling from, it turns out that there is a simple formula that allows us to compute the CDF of a maximum, or "extreme" value from a sample of a given size.

Define our sample size as `n`, and define the CDF of our data distribution as `F(x)`.
If the samples are independent, the probability that our maximum value is <= x is the probability that _all_ n of our samples are <= x, or:
`F(x) * F(x) * ... * F(x)` (n factors), or simply `F(x)^n`
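
A quick simulation can make this formula concrete. The sketch below assumes independent samples from `gamma(1.0)` and compares `F(x)^n` with the fraction of simulated samples whose maximum is at most `x`. The function name `empirical_max_cdf`, the seed, and the test point `x = 5` are illustrative choices.

```python
import numpy as np
from scipy.stats import gamma

def empirical_max_cdf(x, n, reps=5000, seed=0):
    """Fraction of `reps` simulated samples of size `n` whose maximum is <= x."""
    rng = np.random.default_rng(seed)
    maxima = gamma.rvs(1.0, size=(reps, n), random_state=rng).max(axis=1)
    return float(np.mean(maxima <= x))

n, x = 100, 5.0
theoretical = float(gamma.cdf(x, 1.0) ** n)  # F(x)^n
simulated = empirical_max_cdf(x, n)
print(theoretical, simulated)
```

The two numbers should agree up to simulation noise, which shrinks as `reps` grows.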

In the following cells we define this function and use it to generate a table showing how the CDF of the maximum, evaluated at x = 10, changes as sample size increases:

In [None]:
def extremeCDF(x, n, cdf):
    """Probability that the maximum of n independent samples is <= x, i.e. cdf(x)^n."""
    return math.pow(cdf(x), n)

In [None]:
ssv = [100, 1000, 10000, 100000]
pd.DataFrame({
    'ss': ssv,
    'evCDF': [extremeCDF(10, ss, lambda x: gamma.cdf(x, 1.0)) for ss in ssv]
})

The table above shows that our sample size has the effect our intuition expects: as the sample size grows, the probability that the maximum stays at or below 10 shrinks, so a maximum of 10 or more becomes more likely.
For a largish sample size of 100000, seeing a maximum of at least 10 is far more likely than if we only sample 100 values.
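
The same formula can be inverted to ask how large a maximum must be before it looks surprising. Setting `F(x)^n = q` gives `x = F^{-1}(q^(1/n))`. The sketch below is one way to compute that threshold for `gamma(1.0)`. The name `extreme_quantile` and the 0.99 level are illustrative, and the inverse survival function `gamma.isf` is used here because `q^(1/n)` sits very close to 1 for large `n`.

```python
import numpy as np
from scipy.stats import gamma

def extreme_quantile(q, n, shape=1.0):
    """Value x for which the maximum of n gamma(shape) samples has P(max <= x) = q."""
    # Per-sample upper-tail probability 1 - q**(1/n), computed stably via expm1.
    tail = -np.expm1(np.log(q) / n)
    return float(gamma.isf(tail, shape))

# Threshold that the maximum stays below with probability 0.99, per sample size.
for n in (100, 1000, 10000, 100000):
    print(n, round(extreme_quantile(0.99, n), 2))
```

Comparing an observed maximum against this threshold for the relevant sample size is one way to frame the "outlier or expected" question from earlier.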

# An Extreme Value Anomaly Score

Since the extreme value (the sample maximum) has a well-defined CDF, we can define an anomaly score using the same approach as we used previously for t-digests.
In the next cell we declare an extreme value anomaly detector that returns the negative logarithm of the tail probability `1 - F(x)^n`.
Here, the anomaly score depends on the sample size `n`, as well as the value:

In [None]:
from detail.tdigest import TDigest

class ExtremeValueAnomalyDetector(object):
    def __init__(self, td):
        self.td = td
    
    def anomaly(self, xmax, n):
        p = 1 - math.pow(self.td.cdf(xmax), n)
        p = max(p, 1e-100)
        return -math.log(p)
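
When the CDF at `xmax` is extremely close to 1, computing `1 - cdf**n` directly can lose precision, and the score falls back to the `1e-100` floor. If a per-sample tail probability is available, the same score can be computed with `log1p` and `expm1` instead. The sketch below assumes such a tail probability as input; the function `anomaly_from_tail` is hypothetical, and the `TDigest` class used in this notebook is not assumed to provide a tail method, so the example uses `gamma.sf` from SciPy.

```python
import math
from scipy.stats import gamma

def anomaly_from_tail(tail, n, floor=1e-100):
    """Anomaly score -log(1 - (1 - tail)**n), given the per-sample tail probability."""
    if tail >= 1.0:
        return 0.0  # every sample exceeds xmax, so the maximum certainly does (p = 1)
    # 1 - (1 - tail)**n equals -expm1(n * log1p(-tail)), accurate even for tiny tails.
    p = -math.expm1(n * math.log1p(-tail))
    return -math.log(max(p, floor))

# Example with a parametric tail: gamma.sf(x, 1.0) is 1 - CDF for gamma(1.0).
for n in (100, 1000, 10000):
    print(n, anomaly_from_tail(gamma.sf(10.0, 1.0), n))
```

For moderate tail probabilities this agrees with the direct formula; the difference only matters far out in the tail.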

# Create an Anomaly Detector
To use our anomaly detector, we sketch our data with a t-digest and use it to instantiate a detector:

In [None]:
sketch = TDigest(compression = 0.05)
for x in gamma.rvs(1.0, size = 100000):
    sketch.update(x)
detector = ExtremeValueAnomalyDetector(sketch)

# Anomaly Scores For Sample Sizes
In the next cell we plot anomaly scores for different sample sizes.
The plot provides visual confirmation that as sample sizes increase, any given extreme value becomes more likely to appear in our sample, and so its anomaly score is smaller:

In [None]:
plotdf = altairDF(range(13),
                  [lambda x: detector.anomaly(x, 100),
                   lambda x: detector.anomaly(x, 1000),
                   lambda x: detector.anomaly(x, 10000)],
                  labels=["100", "1000", "10000"],
                  xcol="xmax", ycol="anomaly-score", ccol="sample-size")
alt.Chart(plotdf).mark_line(point=True, clip=True).encode(
    alt.Y('anomaly-score', scale=alt.Scale(domain=(0, 7))),
    x="xmax", color="sample-size")

# Exercises
1. Try generating an extreme value anomaly detector from a different data sketch.
1. As sample size approaches infinity, what happens to the probability of an anomalous extreme value?
1. What are the advantages and disadvantages of using a sketch such as t-digest compared with a parametric distribution?
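
As a possible starting point for the first exercise, the detector only needs an object with a `cdf` method. The sketch below uses a sorted-sample empirical CDF as a stand-in for a data sketch. The class `EmpiricalCDF` and the function `anomaly` are hypothetical names introduced here, and the score simply mirrors the formula used in `ExtremeValueAnomalyDetector`.

```python
import math
import numpy as np
from scipy.stats import gamma

class EmpiricalCDF:
    """Hypothetical stand-in for a data sketch: keeps the sorted sample, exposes cdf(x)."""
    def __init__(self, data):
        self.sorted = np.sort(np.asarray(data, dtype=float))

    def cdf(self, x):
        # Fraction of stored values that are <= x.
        return float(np.searchsorted(self.sorted, x, side="right")) / self.sorted.size

def anomaly(sketch, xmax, n):
    """Same score as ExtremeValueAnomalyDetector.anomaly, written as a plain function."""
    p = max(1 - math.pow(sketch.cdf(xmax), n), 1e-100)
    return -math.log(p)

rng = np.random.default_rng(0)
ecdf = EmpiricalCDF(gamma.rvs(1.0, size=100000, random_state=rng))
print([round(anomaly(ecdf, 8.0, n), 3) for n in (100, 1000, 10000)])
```

Unlike a t-digest, this stand-in stores every data point, and it reports a CDF of exactly 1 beyond the largest stored value, which is worth keeping in mind for the third exercise.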