# Extreme Value Distributions

## A motivational example

### Suppose I'm collecting some values that I expect to be Normally distributed (mean 0, standard deviation 1). I get a value of `5.0`. Should I be suspicious? Was that some kind of outlier that I might want to discard? Or should I keep it? What if it was `4.0`? What if it was `6.0`? No value is _impossible_ under a Gaussian distribution, yet values far from the mean are extremely rare.

### With my value of `6.0`, does it change my judgement if I was sampling 100 values rather than 1,000,000 values?

### Is there a principled way to make these kinds of assessments?

## Library imports

In [1]:
import numpy as np

import plotly
import plotly.plotly as pty
from plotly.tools import FigureFactory as ff

# Work in offline-mode for notebooks
plotly.offline.init_notebook_mode()

# plotly version
plotly.__version__

'1.9.6'

## Sample from the extrema (maximum) of some Gaussian distributions

In [2]:
# Draw a sample of size m from a standard Normal and take its maximum.
# Repeat n times and return those n maximum values.
# (range works on both Python 2 and 3; xrange is Python 2 only.)
def sampleMaxVals(n, m):
    return [np.random.randn(m).max() for _ in range(n)]

## If I take the maximum of 100 values sampled from a Gaussian, what kind of values could I expect?

In [3]:
sampleMaxVals(5, 100)

[2.0961966220808508,
 1.9944822085324472,
 2.6571031095809334,
 2.1642120639502425,
 2.4792606689074557]

## What if my sample size is larger? (1000)

In [4]:
sampleMaxVals(5, 1000)

[2.7868659781395566,
 3.093538669437458,
 3.2584131316827731,
 3.7528771528372347,
 3.695775503467305]

## Still larger? (10000)

In [5]:
sampleMaxVals(5, 10000)

[4.0820235319420268,
 4.3201784177495401,
 3.3332205977811045,
 4.0329307575626565,
 3.7165510880325843]

## Intuitively: the larger I make my sample size, the more likely it is that my sample will include some larger values. When sampling from a Gaussian, the probability density is non-zero at every finite value, so no value is ever ruled out.
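This intuition can be turned into numbers. As a minimal sketch, assuming the draws are independent standard Normals, the chance that at least one of `m` draws exceeds `x` is `1 - (1 - p)^m`, where `p` is the upper-tail probability of a single draw. The helper names `tail_prob` and `prob_max_exceeds` are illustrative and not part of the original notebook.

```python
import math

def tail_prob(x):
    """P(X > x) for a standard Normal X, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def prob_max_exceeds(x, m):
    """P(at least one of m independent standard Normal draws exceeds x).

    Equals 1 - (1 - p)^m with p = tail_prob(x). It is computed with
    log1p/expm1 so that a very small p is not lost to rounding.
    """
    p = tail_prob(x)
    return -math.expm1(m * math.log1p(-p))

# The motivating values, for a small and a large sample size.
for x in (4.0, 5.0, 6.0):
    for m in (100, 1000000):
        print("x = %.1f, m = %7d: P(max > x) = %.3g" % (x, m, prob_max_exceeds(x, m)))
```

Under these assumptions, seeing a `6.0` among 100 draws is roughly a one-in-ten-million event, while among 1,000,000 draws it is closer to one in a thousand. The same value is far less surprising in the larger sample.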

# The maximum value in my sample _is itself a random variable with its own distribution_: an Extreme Value Distribution.

## This distribution depends on:

* ## the shape of the underlying distribution I'm sampling from
* ## the size of my sample
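Both dependencies can be written down explicitly. As a short derivation, assuming the $m$ draws $X_1, \dots, X_m$ are independent with a common CDF $F$ and density $f$ (the symbols $M_m$, $F$ and $f$ are notation introduced here, not taken from the notebook), the maximum $M_m = \max(X_1, \dots, X_m)$ is at most $x$ exactly when every draw is at most $x$:

$$
P(M_m \le x) = P(X_1 \le x, \dots, X_m \le x) = \prod_{i=1}^{m} P(X_i \le x) = F(x)^m
$$

Differentiating with respect to $x$ gives the density of the maximum:

$$
f_{M_m}(x) = m \, F(x)^{m-1} f(x)
$$

The underlying distribution enters through $F$ and $f$, and the sample size enters through $m$. For the standard Gaussian used here, $F$ is the Normal CDF $\Phi$.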

## We can generate samples and plot a sketch of these Extreme Value Distributions:

In [6]:
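n_maxima = 10000  # number of maxima to draw for each sample size
hist_data = [sampleMaxVals(n_maxima, 100),
             sampleMaxVals(n_maxima, 1000),
             sampleMaxVals(n_maxima, 10000),
             sampleMaxVals(n_maxima, 100000)]
# Labels give the size of each sample whose maximum is taken
group_labels = ['sample size = 100', 'sample size = 1000',
                'sample size = 10000', 'sample size = 100000']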
fig = ff.create_distplot(hist_data, group_labels,
                         bin_size = 0.25, show_rug = False)
fig['layout'].update(title='Extreme Value Sampling Distributions')
plotly.offline.iplot(fig, validate=False)
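One way to sanity-check the simulated distributions is to compare their medians with the exact value. For independent standard Normal draws, the maximum of `m` values satisfies P(max ≤ x) = Φ(x)^m, so its median solves Φ(x)^m = 0.5. The sketch below is not part of the original notebook: it assumes Python 3.8+ (for `statistics.NormalDist`) and NumPy 1.17+ (for `default_rng`), and the function names are illustrative.

```python
import numpy as np
from statistics import NormalDist  # standard library, Python 3.8+

def exact_median_of_max(m):
    """Median of the maximum of m independent standard Normal draws.

    Solves Phi(x)^m = 0.5, i.e. x = Phi^{-1}(0.5 ** (1/m)).
    """
    return NormalDist().inv_cdf(0.5 ** (1.0 / m))

def simulated_median_of_max(n, m, seed=0):
    """Median of n simulated maxima, each taken over m standard Normal draws."""
    rng = np.random.default_rng(seed)
    maxima = rng.standard_normal((n, m)).max(axis=1)
    return float(np.median(maxima))

# Compare the exact and simulated medians for a few sample sizes.
for m in (100, 1000, 10000):
    print(m, exact_median_of_max(m), simulated_median_of_max(500, m))
```

If the two columns agree closely, the histograms above are tracking the exact distribution of the maximum. The median also shifts upward as `m` grows, which matches the rightward drift of the plotted curves.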