# Extreme Value Distributions

## A motivational example

### Suppose I'm collecting some values that I expect to be Normally distributed (mean 0, standard deviation 1). I get a value of `5.0`. Should I be suspicious? Was that some kind of outlier that I might want to discard? Or should I keep it? What if it was `4.0`? What if it was `6.0`? No value is _impossible_ under a Gaussian distribution, and yet values far from the mean are very rare.

### With my value of `5.0`, does it impact my judgement if I was sampling 100 values, as compared to 1,000,000 values?

### Is there a principled way to make this kind of judgement?

## Lib imports

In [1]:
import math
import numpy as np

import plotly
import plotly.plotly as pty
from plotly.tools import FigureFactory as ff

# Work in offline-mode for notebooks
plotly.offline.init_notebook_mode()

# plotly version
plotly.__version__

'1.9.6'

## Suppose I am sampling from a Gaussian R.V. and I want to measure the maximum deviation from the mean.  In other words I am sampling from |N(0,1)|:

In [2]:
# Draw a sample of size (m) from |N(0,1)| and take its maximum.
# Repeat (n) times and return those (n) maximum values.
def sampleMaxVals(n, m):
    # range (not xrange) so this runs on both Python 2 and Python 3
    return [float(np.abs(np.random.randn(m)).max()) for _ in range(n)]

## If I take the maximum of 100 values sampled from a Gaussian, what kind of values could I expect?

In [3]:
sampleMaxVals(5, 100)

[2.8538677178404419,
 3.7014300355335532,
 2.7307805735713901,
 2.5797158259943886,
 2.3035328035709752]

## What if my sample size is larger? (1000)

In [4]:
sampleMaxVals(5, 1000)

[3.4148635992390592,
 3.3594308026191126,
 3.4029265075442026,
 2.8490645360412166,
 3.1307050494859046]

## Still larger? (10000)

In [5]:
sampleMaxVals(5, 10000)

[4.0683504989213022,
 3.5072592214112794,
 3.9655352189181658,
 4.523331243717883,
 4.4184741665070142]

## Intuitively: the larger I make my sample size, the more likely it is that my sample will include some larger values. When sampling from a Gaussian, the probability density is non-zero at every finite value, so any value can eventually turn up.
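
This intuition can be made exact because the draws are independent: the maximum of (m) draws stays below x only if every single draw does. The following is a minimal sketch using only the standard library. The function name `prob_max_exceeds` is illustrative and not part of the original notebook.

```python
import math

def prob_max_exceeds(x, m):
    """Exact P(max of m i.i.d. |N(0,1)| draws > x), for x >= 0."""
    # Tail probability of a single draw: P(|Z| > x) = erfc(x / sqrt(2))
    p_single = math.erfc(x / math.sqrt(2.0))
    if p_single >= 1.0:
        # x == 0: every draw exceeds it with probability 1
        return 1.0
    # P(max > x) = 1 - (1 - p_single)**m, written with log1p/expm1
    # so that very small tail probabilities are not rounded away
    return -math.expm1(m * math.log1p(-p_single))

for m in [100, 1000, 10000, 100000, 1000000]:
    print(m, prob_max_exceeds(5.0, m))
```

For a fixed x the result grows with (m), which is the effect seen in the simulated maxima.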

# The maximum value in my sample _is itself a random variable with its own distribution_ -- an Extreme Value Distribution.

## This distribution depends on:

* ## the shape of the underlying distribution I'm sampling from
* ## the size of my sample

## We can generate samples and plot a sketch of these Extreme Value Distributions:

In [6]:
# n_repeats maxima are simulated for each sample size (m) named in group_labels
n_repeats = 10000
sample_sizes = [100, 1000, 10000, 100000]
hist_data = [sampleMaxVals(n_repeats, m) for m in sample_sizes]
group_labels = ['ss = %d' % m for m in sample_sizes]
fig = ff.create_distplot(hist_data, group_labels,
                         bin_size = 0.1, show_rug = False)
fig['layout'].update(title='Extreme Value Sampling Distributions for |N(0,1)|')
plotly.offline.iplot(fig, validate=False)
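
The same sampling distributions can also be summarized numerically when a plot is not available. This is a small sketch under stated assumptions: it uses the standard-library `random` module instead of NumPy, far fewer repetitions than the plot, and the helper names `sample_max_abs` and `summarize_maxima` are hypothetical.

```python
import random

def sample_max_abs(m, rng):
    """Largest |z| among m standard-normal draws."""
    return max(abs(rng.gauss(0.0, 1.0)) for _ in range(m))

def summarize_maxima(m, reps, rng):
    """Return (mean, approximate median) of `reps` simulated sample maxima."""
    maxima = sorted(sample_max_abs(m, rng) for _ in range(reps))
    mean = sum(maxima) / len(maxima)
    # Middle element of the sorted list (upper median when reps is even)
    median = maxima[len(maxima) // 2]
    return mean, median

rng = random.Random(0)  # fixed seed so the run is repeatable
summary = {m: summarize_maxima(m, 200, rng) for m in (100, 1000)}
for m in sorted(summary):
    print(m, summary[m])
```

Both the mean and the median of the maxima shift to the right as the sample size grows, matching the rightward shift of the plotted curves.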

## The Generalized Extreme Value Distribution
## "The Central Limit Theorem for Extreme Values"
![GEV](http://mathurl.com/zlteayo.png)

## For |N(0,1)| this takes the form of the Gumbel Distribution (gamma -> 0):
![Gumbel](http://mathurl.com/jqfv998.png)

## Gamma is the shape parameter that selects the family of EV distribution. mu<sub>m</sub> and sigma<sub>m</sub> are the location and scale parameters, and both are functions of the sample size (m). In the case of the Gumbel distribution for |N(0,1)|, closed-form (asymptotic) expressions are known for mu<sub>m</sub> and sigma<sub>m</sub>:

![shapes](http://mathurl.com/zcnfer8.png)
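
In case the formula images do not load, the Gumbel CDF and its two parameters can be written out as they are implemented in the `probNotOutlier` function in this notebook. This is a transcription of that code rather than an independent derivation, and the symbol G is introduced here only for convenience:

$$
\begin{aligned}
G(x) &= \exp\left(-\exp\left(-\frac{x - \mu_m}{\sigma_m}\right)\right) \\
\mu_m &= \sqrt{2 \ln m} - \frac{\ln \ln m + \ln 2\pi}{2\sqrt{2 \ln m}} \\
\sigma_m &= \frac{1}{\sqrt{2 \ln m}}
\end{aligned}
$$

The probability that the maximum of (m) samples exceeds x is then 1 - G(x).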

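## Now we have a principled way to judge how likely it is that a value is an outlier: we can compute the probability, under the null hypothesis that it was _not_ an outlier, that the maximum of (m) samples would be at least this large. A small probability is evidence of an outlier: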

In [7]:
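# (x) is the value to test, (m) is the sample size to assume.
# Returns P(max of m samples >= x) under the Gumbel approximation.
def probNotOutlier(x, m):
    lnm = math.log(m)
    root = math.sqrt(2 * lnm)
    # Location (mu) and scale (sigma) of the Gumbel distribution for sample size m
    mu = root - (math.log(lnm) + math.log(2 * math.pi)) / (2 * root)
    sigma = 1.0 / root
    # Gumbel CDF is exp(-exp(-(x - mu) / sigma)); return its complement
    return 1 - math.exp(-math.exp(-(x - mu) / sigma))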

In [8]:
[(ss, probNotOutlier(5, ss)) for ss in [100, 1000, 10000, 100000, 1000000]]

[(100, 0.0004776185651890197),
 (1000, 0.0012875611741147708),
 (10000, 0.006274716044369466),
 (100000, 0.04372976008366991),
 (1000000, 0.33857974737245156)]
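
The same Gumbel formula can be inverted to ask the reverse question: for a given sample size, how large must a value be before its probability drops below a chosen threshold? The sketch below repeats the mu and sigma expressions from `probNotOutlier` so that it is self-contained. The names `gumbel_params` and `critical_value` and the threshold `alpha = 0.01` are assumptions made for illustration.

```python
import math

def gumbel_params(m):
    """Location and scale used by probNotOutlier for sample size m."""
    lnm = math.log(m)
    root = math.sqrt(2 * lnm)
    mu = root - (math.log(lnm) + math.log(2 * math.pi)) / (2 * root)
    sigma = 1.0 / root
    return mu, sigma

def critical_value(alpha, m):
    """Value x whose Gumbel exceedance probability equals alpha (0 < alpha < 1)."""
    mu, sigma = gumbel_params(m)
    # Invert alpha = 1 - exp(-exp(-(x - mu) / sigma)) for x
    return mu - sigma * math.log(-math.log1p(-alpha))

for m in [100, 1000, 10000, 100000, 1000000]:
    print(m, critical_value(0.01, m))
```

A value above `critical_value(alpha, m)` would be flagged at level `alpha` under this approximation, and the cutoff rises as the sample size grows.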