# Anderson-Darling Test

The Anderson-Darling (AD) test is used to determine whether a collection of data points follows a given distribution. It gives more weight to the tails of the distribution than the Kolmogorov-Smirnov (KS) test does. Mathematically, it tests the null hypothesis

\begin{align}
H_0: \text{data follows the distribution }P(x)
\end{align}

against the alternative hypothesis

\begin{align}
H_a: \text{data does not follow the distribution }P(x)
\end{align}

with the test statistic being

\begin{align}
A^2 = - n - S
\end{align}

where

\begin{equation}
S = \sum_{i=1}^{n} \frac{(2i - 1)}{n} \left[ \ln(F(Y_{i})) + \ln(1 - F(Y_{n + 1 - i})) \right]
\end{equation}

with $F$ being the cumulative distribution function (CDF) of the hypothesized distribution, $Y_i$ the data points sorted in ascending order, and $n$ the number of data points.

OK, cool. But what does this equation really mean? How does it give the tails more weight than the KS test does? Let's break it down using a story and some examples!
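
Before the story, a short look at where the tail weighting comes from. The sum $S$ is the computational form of a weighted squared distance between the empirical CDF and the hypothesized CDF $F$. The integral below is the standard textbook definition rather than something derived in this notebook, and $F_n$ (the empirical CDF of the $n$ data points) is a symbol introduced here:

\begin{align}
A^2 = n \int_{-\infty}^{\infty} \frac{\left(F_n(x) - F(x)\right)^2}{F(x)\left(1 - F(x)\right)} \, dF(x)
\end{align}

The denominator $F(x)(1 - F(x))$ shrinks toward zero in both tails, so a given gap between $F_n$ and $F$ counts for more there than it does near the median. The KS statistic, by comparison, is the unweighted largest gap:

\begin{align}
D_n = \sup_x \left| F_n(x) - F(x) \right|
\end{align}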

## Scenario

<!-- Let's say you want to model how much money customers spend in a visit to your website. One data scientist on your team suggests that this random variable (amount spent) follows a gaussian distribution. While another suggest this is more exponential. How do you determine which one more accurately models customer behavior? You can perform AD test to get an analytical proof for this. -->


<!-- Let's say you want to model how much time each user spends on a given article -->


Let's say you run a datacenter. You want to model the time between mechanical failures (network cables, disks, cooling, etc.) so that you can schedule maintenance accordingly. One data scientist on your team suggests that this random variable (time between failures) follows a Gaussian distribution, while another suggests it is closer to exponential. How do you determine which one models the datacenter machines' behavior more accurately? You can run the AD test to get quantitative evidence for one candidate over the other.

* This is the data we see (don't add the code for data generation; pull it from a GitHub CSV instead)
* It's hard to eyeball whether this is Gaussian or exponential
* Run the AD test
* What if the data had 10n points? What if it followed a heavier-tailed distribution? Or a lighter-tailed one?

## Get Dataset

In [26]:
import numpy as np
from plotly import graph_objects as go

np.random.seed(42)

In [27]:
# visualization helper
def visualize(X, fig=None, fig_title="Histogram", x_title=None, y_title="Number of Points", trace_name=None):
    # create new figure object. else add trace to existing
    if fig is None:
        fig = go.Figure()
    
    # create histogram
    fig.add_trace(
        go.Histogram(x=X, name=trace_name)
    )
    
    # add axis details
    fig.update_layout(
        title=fig_title,
        xaxis_title=x_title,
        yaxis_title=y_title,
    )
    
    return fig

In [28]:
n_points = 500

In [32]:
# generate lognormal samples; other (mean, sigma) pairs tried: (2, 0.75), (5, 0.5), (0, 0.5)
data = np.random.lognormal(
    mean=5,
    sigma=0.3,
    size=n_points,
)

fig = visualize(
    data,
    fig_title="Distribution of Days b/w Failures",
    x_title="Days b/w Failures",
    y_title="Number of Instances with N Days b/w Failures",
    trace_name="Actual Data"
)
fig.show()

## Create Estimates

In [30]:
# MLE estimates for mu, sigma of a Gaussian (np.std defaults to ddof=0, which is the MLE)
data.mean(), data.std()

(155.42200364466495, 48.38053219829909)

In [33]:
# Gaussian parameters, rounded from the MLE estimates of the mean and standard deviation
gaussian_mean = 155
gaussian_std = 48

gaussian_samples = np.random.normal(
    loc=gaussian_mean,
    scale=gaussian_std,
    size=n_points,
)

fig = visualize(
    gaussian_samples,
    fig=fig,
    fig_title="Distribution of Days b/w Failures",
    x_title="Days b/w Failures",
    y_title="Number of Instances with N Days b/w Failures",
    trace_name="Normal Estimate"
)
fig.show()

In [34]:
# MLE estimates for mu, sigma of a lognormal: mean and std (ddof=0) of log(data)
np.log(data).mean(), np.sqrt(
    ((np.log(data) - np.log(data).mean())**2).mean()
)

(5.032545358952086, 0.30277067796393886)
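
For reference, the MLE cells above compute the standard maximum likelihood estimates. Written out, with $x_i$ denoting the raw observations (a symbol introduced here for illustration), the Gaussian estimates are

\begin{align}
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \hat{\mu}\right)^2
\end{align}

and the lognormal estimates are the same expressions applied to $\ln x_i$:

\begin{align}
\hat{\mu}_{\ln} = \frac{1}{n}\sum_{i=1}^{n} \ln x_i, \qquad \hat{\sigma}_{\ln}^2 = \frac{1}{n}\sum_{i=1}^{n} \left(\ln x_i - \hat{\mu}_{\ln}\right)^2
\end{align}

Note the $1/n$ factor (not $1/(n-1)$) in the variance, which matches what `data.std()` computes by default (`ddof=0`).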

In [35]:
# lognormal parameters, rounded from the MLE estimates of log(data)
lognormal_mean = 5.03
lognormal_sigma = 0.30

lognormal_samples = np.random.lognormal(
    mean=lognormal_mean,
    sigma=lognormal_sigma,
    size=n_points,
)

fig = visualize(
    lognormal_samples,
    fig=fig,
    fig_title="Distribution of Days b/w Failures",
    x_title="Days b/w Failures",
    y_title="Number of Instances with N Days b/w Failures",
    trace_name="Lognormal Estimate"
)
fig.show()

## Run Test

Recall that the test statistic is

\begin{align}
A^2 = - n - S
\end{align}

where

\begin{equation}
S = \sum_{i=1}^{n} \frac{(2i - 1)}{n} \left[ \ln(F(Y_{i})) + \ln(1 - F(Y_{n + 1 - i})) \right]
\end{equation}

In [42]:
np.sort(data)[:10]

array([62.24781203, 63.14522383, 68.21655739, 68.98048645, 70.11791224,
       77.78767238, 77.79933022, 81.49484728, 81.6040838 , 82.05256103])

In [None]:
y_i = np.sort(data)
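
The formula for $A^2$ can be turned into a few lines of NumPy. What follows is a minimal sketch, not a reference implementation: `anderson_darling`, `normal_cdf`, and `lognormal_cdf` are hypothetical helper names, the sample is regenerated with `np.random.default_rng` as a stand-in for `data` (so the numbers will not match the cells above exactly), and the parameters of both candidates are re-estimated by MLE from that sample.

```python
import math
import numpy as np

# vectorized error function, used to build the Gaussian CDF without extra dependencies
_erf = np.vectorize(math.erf)

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal whose logarithm has mean mu and standard deviation sigma."""
    return normal_cdf(np.log(x), mu, sigma)

def anderson_darling(samples, cdf):
    """Compute A^2 = -n - S for samples against a fully specified CDF."""
    y = np.sort(np.asarray(samples, dtype=float))  # Y_1 <= ... <= Y_n
    n = y.size
    i = np.arange(1, n + 1)
    f = cdf(y)  # F(Y_i)
    # f[::-1] holds F(Y_{n+1-i}) at position i
    s = np.sum((2 * i - 1) / n * (np.log(f) + np.log1p(-f[::-1])))
    return -n - s

# stand-in sample (assumption: same lognormal parameters as the notebook's data)
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=5, sigma=0.3, size=500)

# MLE parameters for each candidate distribution
mu_n, sigma_n = sample.mean(), sample.std()
mu_l, sigma_l = np.log(sample).mean(), np.log(sample).std()

a2_normal = anderson_darling(sample, lambda x: normal_cdf(x, mu_n, sigma_n))
a2_lognormal = anderson_darling(sample, lambda x: lognormal_cdf(x, mu_l, sigma_l))
print(f"A^2 vs Gaussian:  {a2_normal:.3f}")
print(f"A^2 vs lognormal: {a2_lognormal:.3f}")
```

A smaller $A^2$ means the hypothesized CDF tracks the data more closely, so for lognormal data the lognormal candidate should typically score lower. Because the parameters were estimated from the same sample, critical values tabulated for a fully specified distribution do not apply directly; the sketch only compares the two candidates against each other.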

### Effect of sample size

### Effect of tail

### Effect of distribution

## Conclusion

**Pros**
- more sensitive than the KS test, especially to deviations in the tails

**Cons**
- critical values depend on the distribution being tested, so they have to be calculated for each distribution