# Probability and sampling distributions
***
1. Discrete
2. Continuous
3. Random data generation with Python, Spark MLlib
    https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs
4. Kernel density estimation


# Two ways to work with distributions in Python
***
- [SciPy](https://www.scipy.org/) (`scipy.stats`)

>This module contains a large number of probability distributions as well as a growing library of statistical functions.


- [SymPy](http://www.sympy.org/)

> SymPy is a Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python.


# Working with stats in SymPy
***
## Finite Variable Types



In [11]:
from sympy.stats import Die, density, E, variance, std, cdf
from sympy import symbols, pprint

X = Die('X', 6)
density(X).dict

{1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

In [12]:
cdf(X)

{1: 1/6, 2: 1/3, 3: 1/2, 4: 2/3, 5: 5/6, 6: 1}

In [13]:
pprint([E(X), std(X)])

⎡     √105⎤
⎢7/2, ────⎥
⎣      6  ⎦


## Discrete Types
***
Poisson Distribution

In [14]:
from sympy.stats import Poisson, density, E, variance
from sympy import Symbol, simplify

rate = Symbol("lambda", positive=True)
z = Symbol("z")
P = Poisson("x", rate)
pprint(density(P)(z))
pprint(cdf(P)(z))

 z  -λ
λ ⋅ℯ  
──────
  z!  
⎧   -z - 1  z + 1                                   
⎪  λ      ⋅λ     ⋅(z + 1)⋅γ(z + 1, λ)               
⎪- ────────────────────────────────── + 1  for z ≥ 0
⎨               (z + 1)!                            
⎪                                                   
⎪                   0                      otherwise
⎩                                                   


In [15]:
pprint(variance(P))

   2            
- λ  + λ⋅(λ + 1)
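The unsimplified variance above reduces to $\lambda$, the same value as the mean. A minimal sketch of that check, assuming a fresh symbol named `lam` and a variable `X_pois` (both names are illustrative, not from the cells above):

```python
from sympy import Symbol, simplify
from sympy.stats import Poisson, E, variance

# Hypothetical names for this sketch: lam is the rate, X_pois the random variable
lam = Symbol("lambda", positive=True)
X_pois = Poisson("x", lam)

# Simplify the symbolic mean and variance of the Poisson variable
mean_expr = simplify(E(X_pois))
var_expr = simplify(variance(X_pois))

print(mean_expr, var_expr)
```

Both expressions should come out as the rate parameter, which is the defining property of the Poisson distribution.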


In [16]:
from sympy.stats import Normal, density, E, std, cdf, skewness, sample

mu = Symbol("mu")
sigma = Symbol("sigma", positive=True)
z = Symbol("z")


In [19]:
%matplotlib inline
from matplotlib import pyplot as plt
from sympy.stats import Gamma, E, density, sample_iter, cdf, ContinuousRV
import numpy as np

X = Normal('X', 0, 1)
Y = Normal('Y', 500, 1)

e = 0.1
k = 1

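# Weighted sum of the two random variables (itself normally distributed)
Mix = (1-e) * X + e * Y / k

# Density of a two-component mixture: X with weight 1-e, Y/k with weight e
M = (1-e) * density(X)(z) + e * density(Y/k)(z)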

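In [20]:
from sympy import Interval, oo

# M is written in terms of z and is a density on the whole real line
H = ContinuousRV(z, M, Interval(-oo, oo))

In [18]:
# Sampling from a hand-made density may be slow or unsupported, depending on the SymPy version
sample(H)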

In [None]:
u = np.linspace(-10, 40, 1000)
# Evaluate the mixture density on the grid; the component centred at 500/k lies outside this range
v1 = [float(M.subs(z, i)) for i in u]
plt.plot(u, v1)

In [None]:
from sympy import Interval

x = Symbol('x')
# Custom density f(x) = 2x on the interval [0, 1]
X = ContinuousRV(x, 2*x, Interval(0, 1))

In [None]:
u = np.linspace(-10, 100)
D1 = density(X); D2 = density(Y)
# M is already a density expression in z, so evaluate it directly instead of calling density(M)
v1 = [float(D1(i)) for i in u]
v2 = [float(D2(i)) for i in u]
v3 = [float(M.subs(z, i)) for i in u]
plt.plot(u, v1)
plt.plot(u, v2)
plt.plot(u, v3)

In [None]:
X

In [None]:
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

In [None]:
fig, ax = plt.subplots(1, 1)
x = np.linspace(norm.ppf(0.01),
    norm.ppf(0.99), 10000)

ax.plot(x, norm.pdf(x),
    'r-', lw=4, alpha=0.7, label='norm pdf')

rv = norm()
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

r = norm.rvs(size=1000)

# density=True replaces the removed normed=True argument
ax.hist(r, density=True, histtype='stepfilled', alpha=0.2, bins=100)
ax.legend(loc='best', frameon=False)

plt.show()

In [None]:
mean, var, skew, kurt = norm.stats(moments='mvsk')

# Central Limit Theorem
***

<div class="alert alert-block alert-info">
Let $\xi_1, \xi_2, \ldots, \xi_n$ be $n$ independent, identically distributed random variables drawn from an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Then the sample mean $\overline{\xi} = \frac{1}{n}\sum\limits_{i=1}^{n}\xi_i$ satisfies
 
<center> $\overline{\xi} \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$ in distribution for large $n$ </center>

or, more precisely,

<center> $ Z = \frac{\overline{\xi} - \mu}{\sigma / \sqrt{n}} = \frac{\sum\limits_{i=1}^{n}\xi_i - n\mu}{\sigma\sqrt{n}} \longrightarrow 
\mathcal{N}\left(0, 1\right)$ in distribution as $n \rightarrow \infty$</center>
</div>
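The statement can be checked numerically. The sketch below is one possible simulation, not part of the original notebook: it assumes an exponential distribution with mean 2 as the "arbitrary" distribution, and the sample size `n`, the number of repetitions `reps` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed source distribution: exponential with scale 2, so mean = sd = 2
mu, sigma = 2.0, 2.0
n, reps = 200, 20000

# Each row is one sample of size n; take the mean of every row
samples = rng.exponential(scale=mu, size=(reps, n))
xbar = samples.mean(axis=1)

# Standardize the sample means as in the formula for Z
z_scores = (xbar - mu) / (sigma / np.sqrt(n))

# For a standard normal, about 95% of the mass lies within 1.96
frac_within = np.mean(np.abs(z_scores) < 1.96)
print(z_scores.mean(), z_scores.std(), frac_within)
```

If the approximation holds, the standardized means should have mean close to 0, standard deviation close to 1, and roughly 95% of them should fall inside $\pm 1.96$.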

***
Resources: 
- https://www.khanacademy.org/math/statistics-probability
- https://onlinecourses.science.psu.edu/stat414

# Limitations of CLT
***

Under the CLT, the distribution of a statistic such as the sample mean approaches a Gaussian as the sample size grows, but
<br />
<br />
<center><div class="alert alert-block alert-warning">
NB! There is no rule of thumb for $n$
<br />
Always check (algebraically or via simulation) how close to normality you are: $n=2$ can be enough, while $n=1000$ may not be
</div></center>
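A simulation check of this kind might look as follows. The two distributions are assumptions chosen for illustration: a uniform distribution, whose sample mean is already symmetric at small $n$, and a rare-event Bernoulli with $p = 0.001$, whose sample mean is still strongly skewed at $n = 1000$. Skewness (0 for a normal distribution) is used here as a rough measure of non-normality.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
reps = 100_000

# Sample means of 12 uniform(0, 1) draws
uniform_means = rng.uniform(0, 1, size=(reps, 12)).mean(axis=1)

# Sample means of 1000 Bernoulli(0.001) draws;
# the sum of the draws is Binomial(1000, 0.001), so sample that directly
rare_means = rng.binomial(n=1000, p=0.001, size=reps) / 1000

skew_uniform = skew(uniform_means)
skew_rare = skew(rare_means)
print(skew_uniform, skew_rare)
```

The first skewness should be close to zero, while the second stays far from zero even though its sample size is almost a hundred times larger.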


***
Resources:
- https://www.quora.com/Why-is-the-central-limit-theorem-considered-such-a-foundational-concept-to-inferential-statistics
- https://stats.stackexchange.com/questions/81074/how-useful-is-the-clt-in-applications
- https://stats.stackexchange.com/questions/61798/example-of-distribution-where-large-sample-size-is-necessary-for-central-limit-t/61849#61849


# Limitations of CLT
## Examples

Consider the time between events (the lag between an ad and an increase in sales, the time between failures in reliability theory, or latency in IT). The CLT says that if you take enough samples of the time between events, their means will be approximately normally distributed (and in practice they are). Taken literally, the normal approximation assigns a nonzero probability to a negative average time between events, which is impossible.
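To put a number on this, the sketch below assumes exponentially distributed waiting times with mean 1 (an assumption made for illustration). The mean of $n$ such times is exactly Gamma distributed and can never be negative, yet the normal approximation $\mathcal{N}(\mu, \sigma^2/n)$ puts some mass below zero.

```python
from math import sqrt
from scipy.stats import norm, gamma

# Assumed waiting-time model: exponential with mean 1, so mu = sigma = 1
mu = sigma = 1.0

results = {}
for n in (4, 30):
    # Probability of a negative mean under the normal approximation
    p_neg_normal = norm.cdf(0.0, loc=mu, scale=sigma / sqrt(n))
    # Exact distribution of the mean: Gamma(shape=n, scale=mu/n)
    p_neg_exact = gamma.cdf(0.0, a=n, scale=mu / n)
    results[n] = (p_neg_normal, p_neg_exact)
    print(n, p_neg_normal, p_neg_exact)
```

For $n = 4$ the approximation gives $\Phi(-2)$, a bit over 2%, while the exact probability is zero. The discrepancy shrinks as $n$ grows but never vanishes.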

The sample third and fourth moments are averages, so the CLT should apply to them. The Jarque-Bera test relies on this (together with Slutsky's theorem for the denominator and asymptotic independence) to obtain a chi-square distribution for the sum of squares of the standardized values.
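As a small illustration, `scipy.stats.jarque_bera` returns the test statistic and its p-value. The two samples below (one normal, one exponential) and the seed are assumptions for the sketch.

```python
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(2)

# Hypothetical data: one sample that is normal, one that is clearly skewed
normal_sample = rng.normal(size=5000)
skewed_sample = rng.exponential(size=5000)

# The result unpacks into (statistic, p-value)
stat_normal, p_normal = jarque_bera(normal_sample)
stat_skewed, p_skewed = jarque_bera(skewed_sample)

print(stat_normal, p_normal)
print(stat_skewed, p_skewed)
```

Under the null hypothesis of normality the statistic is asymptotically chi-square with 2 degrees of freedom, so the exponential sample should produce a very large statistic and a tiny p-value.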

# Limitations of CLT
***
Suppose we have $\xi$ with a distribution like this:
<img src="../pictures/xzits.png">



# Limitations of CLT
***
$\overline{\xi}$ for $n = 1000$ has the following shape:

<img src="../pictures/shape.png">

- close enough to normal for some purposes (e.g. the density within 2 sd of the mean)
- not close enough to normal to assess the probability of falling more than 3 sd from the mean
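The distribution in the picture is not specified here, so the sketch below uses a stand-in: the sum of $n = 1000$ Bernoulli draws with $p = 0.001$ (an assumption chosen for illustration). It compares exact binomial probabilities with the normal approximation, once for the central mass within 2 sd and once for the upper tail beyond 3 sd.

```python
from math import sqrt
from scipy.stats import binom, norm

# Stand-in distribution (assumption): sum of 1000 Bernoulli(0.001) draws
n, p = 1000, 0.001
mean = n * p
sd = sqrt(n * p * (1 - p))

# Probability of falling within 2 sd of the mean
central_exact = binom.cdf(mean + 2 * sd, n, p) - binom.cdf(mean - 2 * sd, n, p)
central_normal = norm.cdf(2) - norm.cdf(-2)

# Probability of exceeding the mean by more than 3 sd
tail_exact = binom.sf(mean + 3 * sd, n, p)
tail_normal = norm.sf(3)

print(central_exact, central_normal)
print(tail_exact, tail_normal, tail_exact / tail_normal)
```

In this stand-in case the central probabilities differ by a few percentage points, while the 3 sd tail probability is off by more than an order of magnitude.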





## Example: computing sample properties using the CLT

## Confidence intervals

## Confidence intervals based on bootstrap
http://www.kdnuggets.com/2016/01/hypothesis-testing-bootstrap-apache-spark.html
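A minimal sketch of a percentile bootstrap confidence interval for the mean, using plain NumPy rather than Spark. The data are simulated and the number of resamples `n_boot` is an arbitrary choice. This is one common bootstrap variant and not necessarily the one used in the linked article.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical observed data: 500 draws from a skewed distribution
data = rng.exponential(scale=2.0, size=500)

# Resample the data with replacement n_boot times and record each mean
n_boot = 5000
idx = rng.integers(0, data.size, size=(n_boot, data.size))
boot_means = data[idx].mean(axis=1)

# 95% percentile interval: the 2.5th and 97.5th percentiles of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(data.mean(), lower, upper)
```

The interval is read directly from the empirical distribution of the resampled means, so it does not rely on the normal approximation.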

## Apache Spark Bootstrapping