# Using Chebyshev's inequality to detect accumulated precipitation data anomalies.
The basic idea is to gather accumulated precip data from stations near each other. Under the assumption that stations near each other produce similar data, we can treat the accumulated precip data as a random variable with some distribution that is unknown to us. Then we can use Chebyshev's Inequality or the Markov inequality to get an upper bound for the probability that this distribution produces a value as far from the mean as a given data point. All we need for this is the *sample mean* and *sample variance* of our local distribution, which are simple to compute.

I will break this down into steps:
1. Given a station that may or may not be an anomaly, find all of its neighbors within a reasonable radius (more on choosing the radius later).
2. Compute the mean and variance of the station's neighbors, excluding the station itself (more on this as well).
3. Compute the Chebyshev Inequality and obtain an upper bound for the probability of such a point.
4. Gather results based on their bound and interpret the results.
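
The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the array layout (`lats`, `lons`, `precip`), the helper names, the great-circle distance, the 10 km radius, the use of `ddof=1`, and the toy values are all assumptions made for the example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def neighbor_bound(idx, lats, lons, precip, radius_km):
    """Steps 1-3 for the station at position idx (hypothetical layout)."""
    # Step 1: neighbors within the radius, excluding the station itself.
    dist = haversine_km(lats[idx], lons[idx], lats, lons)
    mask = dist <= radius_km
    mask[idx] = False
    neighbors = precip[mask]
    if neighbors.size < 2:
        return None  # too few neighbors to estimate a variance

    # Step 2: sample mean and sample variance of the neighbors.
    mu = neighbors.mean()
    var = neighbors.var(ddof=1)

    # Step 3: Chebyshev bound for the upper tail; only meaningful for eps > 0.
    eps = precip[idx] - mu
    if eps <= 0:
        return 1.0  # at or below the neighbor mean: no information
    return min(1.0, var / eps ** 2)

# Toy data (made up): four nearby stations and one far away.
lats = np.array([40.00, 40.01, 40.02, 39.99, 45.00])
lons = np.array([-105.00, -105.01, -104.99, -105.02, -100.00])
precip = np.array([50.0, 2.0, 2.5, 1.5, 30.0])

# Step 4: gather the bound for every station.
bounds = [neighbor_bound(i, lats, lons, precip, radius_km=10.0)
          for i in range(len(precip))]
print(bounds)
```

A small bound for a station (here the first one, whose total is far above its neighbors) marks it as a candidate anomaly; `None` means the station had too few neighbors to say anything.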

Some important things to remember:
- The distribution of a station's neighbors may also be contaminated. This is the nature of the beast. We will explore the impact of this as a function of how contaminated the distribution is.
- This method heavily relies on the assumption that observations geographically close to each other behave similarly.
- For now we will only consider the accumulated precip. It may be possible to extend this to several variables.

## Chebyshev's Inequality
Chebyshev's Inequality is used to obtain an upper bound for the probability that a random variable falls at least a given distance from its mean. More precisely, let $X$ be a random variable with some distribution $D$. Let $\mu, \sigma^{2}$ be the mean and variance, respectively, of your data. Then, for any $\epsilon > 0$,
$$
\begin{equation}
P(|X-\mu| \geq \epsilon) \leq \dfrac{\sigma^{2}}{\epsilon^{2}}=\delta
\end{equation}
$$
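
As a quick sanity check of the inequality, the sketch below compares the observed fraction of a sample lying at least $\epsilon$ from its mean with the bound $\sigma^{2}/\epsilon^{2}$. The exponential sample, its scale, the seed, and the helper name `chebyshev_check` are assumptions chosen only for illustration. With the mean and the `ddof=0` variance computed from the sample itself, the inequality holds for the sample's empirical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, skewed, non-negative sample (an assumption, not real precip data).
sample = rng.exponential(scale=5.0, size=1000)

def chebyshev_check(x, eps):
    """Return (observed fraction with |x - mean| >= eps, Chebyshev bound)."""
    mu = x.mean()
    var = x.var()  # ddof=0: variance of the empirical distribution
    observed = np.mean(np.abs(x - mu) >= eps)
    return observed, var / eps ** 2

for eps in (2.0, 5.0, 10.0, 20.0):
    observed, bound = chebyshev_check(sample, eps)
    print(f"eps={eps:5.1f}  observed={observed:.4f}  bound={bound:.4f}")
```

The bound is usually far from tight, which is the price of making no assumption about the shape of the distribution.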
#### Our case:
In our case $X \geq 0$ and we will only consider the upper tail: for $\epsilon > 0$, $P(X-\mu \geq \epsilon) \leq P(|X-\mu| \geq \epsilon) \leq \dfrac{\sigma^{2}}{\epsilon^{2}}=\delta$. We can obtain $\epsilon$ by considering the concentration inequality,
$$
P(X \geq \alpha)=P(X-\mu \geq \alpha-\mu)=P(X-\mu \geq \epsilon) \leq \dfrac{\sigma^{2}}{\epsilon^{2}}=\delta, \quad \epsilon = \alpha-\mu
$$
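where $\alpha$ is the point of interest and the bound applies only when $\epsilon > 0$, i.e. when $\alpha$ lies above the mean. When $\epsilon$ is small relative to $\sigma$ the problem is uninteresting: $\alpha$ is very close to the mean, so we can intuitively conclude that it is not a bad data point. In particular, $\delta > 1$ exactly when $\sigma > |\epsilon|$. A probability never exceeds 1, so in that case the bound is vacuous and gives no reason to doubt that $\alpha$ belongs to the distribution.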
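
A minimal sketch of how $\epsilon$ and $\delta$ could be computed for a point $\alpha$ is shown here. The function name `upper_tail_bound`, the clipping of $\delta$ to 1, and the choice to return 1 when $\alpha$ is at or below the mean are assumptions made for this example. The values of `mu` and `var` are the ones printed for the Uniform(0,1) sample in the example that follows.

```python
def upper_tail_bound(alpha, mu, var):
    """Chebyshev upper bound on P(X >= alpha), clipped to [0, 1].

    Returns 1.0 when alpha <= mu, where the bound does not apply.
    """
    eps = alpha - mu
    if eps <= 0:
        return 1.0
    return min(1.0, var / eps ** 2)

# Mean and variance as printed for the Uniform(0,1) sample in this document.
mu = 0.4720924834365427
var = 0.09793207613751041

for alpha in (0.25, 0.5, 0.75, 1.0):
    print(alpha, upper_tail_bound(alpha, mu, var))
```

Clipping makes the vacuous cases explicit: any returned value of 1.0 means the inequality says nothing about that point.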

## Example dataset
We will use this strategy on some simple synthetic distributions.
#### Uniform(0,1)

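import numpy as np
import matplotlib.pyplot as plt

np.random.seed(10)
x = np.random.uniform(0, 1, 15)  # 15 synthetic samples from Uniform(0,1)
mu = x.mean()
var = x.var()  # NumPy default ddof=0 (population variance)

# Evaluate the bound at 10 candidate points alpha in [0, 1].
# For alpha < mu, epsilon is negative and var/epsilon**2 bounds only the
# two-sided deviation P(|X - mu| >= |epsilon|), not the upper tail.
for alpha in np.linspace(0, 1, 10):
    epsilon = alpha - mu
    bound = var / epsilon**2
    print('Mu: {}, \nVar: {}, \nAlpha: {}, \nEpsilon: {}, \nBound: {}\n'.format(mu, var, alpha, epsilon, bound))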

Mu: 0.4720924834365427, 
Var: 0.09793207613751041, 
Alpha: 0.0, 
Epsilon: -0.4720924834365427, 
Bound: 0.43941086385513173

Mu: 0.4720924834365427, 
Var: 0.09793207613751041, 
Alpha: 0.1111111111111111, 
Epsilon: -0.3609813723254316, 
Bound: 0.7515456722288489

Mu: 0.4720924834365427, 
Var: 0.09793207613751041, 
Alpha: 0.2222222222222222, 
Epsilon: -0.2498702612143205, 
Bound: 1.5685408003993397

Mu: 0.4720924834365427, 
Var: 0.09793207613751041, 
Alpha: 0.3333333333333333, 
Epsilon: -0.13875915010320938, 
Bound: 5.086296804356288

Mu: 0.4720924834365427, 
Var: 0.09793207613751041, 
Alpha: 0.4444444444444444, 
Epsilon: -0.02764803899209828, 
Bound: 128.11391266612617

Mu: 0.4720924834365427, 
Var: 0.09793207613751041, 
Alpha: 0.5555555555555556, 
Epsilon: 0.08346307211901288, 
Bound: 14.058410781164257

Mu: 0.4720924834365427, 
Var: 0.09793207613751041, 
Alpha: 0.6666666666666666, 
Epsilon: 0.19457418323012393, 
Bound: 2.5867504267057058

Mu: 0.4720924834365427, 
Var: 0.097932076137510