In [1]:
# default_exp simulate.distribution

# Validate

Expose distribution tests for exploratory analysis.

In [2]:
#hide
from nbdev.showdoc import *

In [3]:
#hide
pwd = %pwd
if pwd.split('/')[-1] == 'nbs':
    %cd ..

/Users/davidrichards/codes/hydra/lab


In [5]:
%matplotlib inline
from lab.util.test_functions import *
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

## Validating a Distribution

Sometimes I look at a variable and check for missing values, size, shape, and a few metrics. That's a good start for exploring a variable. If I look a little deeper, I can check the distribution of the data to see if it makes sense. It's a fun trick, and it can be useful. It comes with the danger of over-analyzing things.

But it also supports another thing I like to do: creating a baseline model that adds no real insight but gives me a point of comparison. This is how I look at machine learning: what is it adding over something simple or obvious? It's probably a good idea to check that the distribution makes sense, and that I have my wits about me, before spending too much time building new tools.

It's a balancing act.

Either way, we can validate a distribution:

* plot: ...
* box plot: ...
* KS: ...
* Shapiro-Wilk: ...
* The other: ...

And there's more too...
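The first two checks in the list are visual. As a minimal sketch (the helper name `plot_hist_and_box` and its `bins` parameter are hypothetical, not part of this module), a histogram with a fitted normal curve next to a box plot might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_hist_and_box(values, bins=30):
    """Hypothetical helper: histogram with a fitted normal curve, plus a box plot."""
    values = np.asarray(values)
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
    # Use a density histogram so the normal pdf shares the same vertical scale.
    ax_hist.hist(values, bins=bins, density=True, alpha=0.6)
    # Fit a normal by maximum likelihood and overlay its pdf.
    mu, sigma = stats.norm.fit(values)
    x = np.linspace(values.min(), values.max(), 200)
    ax_hist.plot(x, stats.norm.pdf(x, mu, sigma))
    ax_hist.set_title("Histogram vs fitted normal")
    # The box plot makes skew and outliers easy to spot.
    ax_box.boxplot(values)
    ax_box.set_title("Box plot")
    return fig
```

Calling it on something like `stats.skewnorm.rvs(5, size=500)` should show the fitted curve missing the histogram's long tail, which is the kind of mismatch the numeric tests try to quantify.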

In [9]:
#export
DEFAULT_VALIDATION_SIZE = 500

def is_traditional_distribution(o):
    """A 'traditional' distribution has an `rvs`
    method attached, a tool for extracting random
    variables. Check if this is the case."""
    return hasattr(o, 'rvs')

def extract_values(o, *a, **kw):
    """Create a simple interface for either
    a distribution (e.g. `o.rvs(**kw)`), a
    function (`o(**kw)`), or just an array-like
    object (`np.array(o)`)."""
    if is_traditional_distribution(o): return np.array(o.rvs(*a, **kw))
    if callable(o): return np.array(o(*a, **kw))
    return np.array(o)

def extract_test_values(o, *a, **kw):
    """Offer test-sized values"""
    kw = {**{'size': DEFAULT_VALIDATION_SIZE}, **kw}
    return extract_values(o, *a, **kw)

The `is_traditional_distribution` function checks whether the object looks like something from `scipy.stats` (meaning it has an `rvs` method).

In [10]:
assert is_traditional_distribution(stats.skewnorm)

## Extract Values

I like having top-level functions that hide the details of underlying code. The `extract_values` function will see if it can create random values from an `rvs` method. If it can't do that, it will call the object as a function. Failing that, it will convert the object to a NumPy array.

There's also a shortcut, `extract_test_values`, which uses `DEFAULT_VALIDATION_SIZE` as the sample size unless another `size` is given.

In [11]:
values = extract_values(stats.skewnorm, 5, size=3, loc=2, scale=0.001)
check_is_near(values, 2, atol=0.01)
dist = lambda *a, **kw: np.zeros(3) + 2
values = extract_values(dist)
check_equals(values, 2)
values = extract_values(np.zeros(3) + 1)
expected = [1,1,1]
check_equals(expected, values)

# Compare the length itself, not the length of an element-wise comparison.
assert len(extract_test_values(stats.t, df=3)) == DEFAULT_VALIDATION_SIZE
assert len(extract_test_values(stats.t, df=3, size=3)) == 3

In [14]:
def plot_qq(o, **kw):
    pass
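The `plot_qq` stub is still empty. One way it might be filled in is with `scipy.stats.probplot`, which orders the sample against theoretical quantiles and can draw onto a Matplotlib axes. This is only a sketch: the name `plot_qq_sketch`, its `dist` and `ax` parameters, and the choice to return the correlation `r` are assumptions, not the author's design.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_qq_sketch(values, dist="norm", ax=None):
    """Hypothetical Q-Q plot helper built on scipy.stats.probplot."""
    if ax is None:
        _, ax = plt.subplots()
    # probplot returns the theoretical and ordered sample quantiles, plus a
    # least-squares fit (slope, intercept, r). Passing `plot=ax` also draws
    # the points and the fitted line on the axes.
    (osm, osr), (slope, intercept, r) = stats.probplot(
        np.asarray(values), dist=dist, plot=ax
    )
    ax.set_title(f"Q-Q plot vs {dist} (r = {r:.3f})")
    return ax, r
```

Points that hug the fitted line (and an `r` close to 1) suggest the sample is consistent with the reference distribution; curvature at the ends usually points to heavy tails or skew.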

Statistical validation can show me whether data fits a distribution numerically.

In [12]:
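def validate_with_shapiro_wilk(o, accept=0.05, **kw):
    """Validate with the Shapiro-Wilk test: return True when the
    sample is consistent with a normal distribution at level `accept`."""
    kw = {**dict(size=DEFAULT_VALIDATION_SIZE), **kw}
    data = extract_values(o, **kw)
    w, p = stats.shapiro(data)
    # SciPy notes the p-value may be inaccurate above 5000 samples, so fall
    # back to W (values near 1 suggest normality). The `1 - accept` cutoff
    # for W is a rough heuristic, not a calibrated threshold.
    if len(data) > 5000: return w >= 1 - accept
    # Null hypothesis: the data are normal. A p-value above `accept`
    # means we fail to reject it.
    return p > accept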

def validate_distribution(o, validator=validate_with_shapiro_wilk, **kw):
    """Validate a distribution generally, using Shapiro-Wilk as
    a default validation."""
    return validator(o, **kw)

In [13]:
assert validate_with_shapiro_wilk(stats.norm)
with check_raises(message="Shapiro-Wilk should require at least 3 items to work."):
    validate_with_shapiro_wilk(stats.norm, size=2)
assert validate_distribution(stats.norm)
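Because `validate_distribution` accepts any `validator`, other tests could be swapped in. Here is a minimal sketch of a Kolmogorov-Smirnov validator with the same call shape. The name `validate_with_ks_sketch` and its `size` parameter are hypothetical, and it inlines a simplified version of the value extraction so the block runs on its own.

```python
import numpy as np
from scipy import stats

def validate_with_ks_sketch(o, accept=0.05, size=500, **kw):
    """Hypothetical KS-based validator: True when the sample is
    consistent with a normal distribution at level `accept`."""
    # Simplified extraction: draw from `rvs` when present, else treat as array-like.
    data = np.asarray(o.rvs(size=size, **kw)) if hasattr(o, "rvs") else np.asarray(o)
    # Standardize so the sample can be compared against a standard normal.
    # Caveat: estimating mean and std from the same data makes the KS
    # p-value optimistic, so treat this as a loose check.
    z = (data - data.mean()) / data.std(ddof=1)
    statistic, p = stats.kstest(z, "norm")
    # Fail to reject normality when the p-value exceeds `accept`.
    return p > accept
```

With the functions in this notebook, it could then be passed along as `validate_distribution(stats.norm, validator=validate_with_ks_sketch)`.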

* ~~extract_test_values~~
* plot_qq
* plot_distribution_vs_normal
* fix shapiro
