### Header

In [1]:
import numpy as np
import pandas as pd
from math import sqrt
import matplotlib.pyplot as plt
import matplotlib as mpl
import ipywidgets as ipw
from scipy.stats import zscore
from scipy.special import comb
from scipy.integrate import quad
from sklearn.model_selection import train_test_split

def normal_dist(x):
    # standard normal probability density function
    return 1/(np.sqrt(2*np.pi))*np.exp(-0.5*np.power(x,2))
    
# value of sigma on the x axis of the standard normal distribution for which
# the area from -sigma to +sigma equals the confidence (a fraction between 0 and 1)
def sigma_from_confidence(conf):
    # search grid from 0 to 3.5; any larger confidence is clipped to 3.5
    sigma=np.hstack([np.linspace(0,1.5,500), np.linspace(1.505,3.5,400)])
    
    # the area from 0 to sigma is half the area from -sigma to +sigma
    area = map(lambda x:quad(normal_dist,0,x)[0] ,sigma) 
    area = 2*np.round(np.array(list(area)),10)
    return sigma[area<conf][-1]

def prob_from_zscore(z):
    # area under the standard normal curve from -infinity to z
    area = round(quad(normal_dist,-np.inf,z)[0],5)
    return area

In [2]:
prob_from_zscore(6)

1.0

In [3]:
sigma_from_confidence(99.9999999999999999)

3.5
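
As a cross-check on the two helpers above, `scipy.stats.norm` provides the same quantities in closed form. This is a minimal sketch that assumes `conf` is a fraction between 0 and 1; the `_norm` function names are illustrative and not part of the notebook.

```python
from scipy.stats import norm

def sigma_from_confidence_norm(conf):
    # two-sided: the area between -sigma and +sigma equals conf (0 < conf < 1),
    # so the area to the left of +sigma is 0.5 + conf/2
    return norm.ppf(0.5 + conf / 2)

def prob_from_zscore_norm(z):
    # area under the standard normal curve from -infinity to z
    return norm.cdf(z)

print(sigma_from_confidence_norm(0.95))
print(prob_from_zscore_norm(6))
```

Unlike the grid search, the inverse CDF is not limited to the range 0 to 3.5.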

### HYPOTHESIS TESTING

#### Hypothesis $H_0$:
The null hypothesis: there is NOT a significant difference between the measured sample mean and the known mean of the population. 

#### Hypothesis $H_x$
The alternative hypothesis: there is a significant difference between the sample and the population, which suggests that something other than chance is causing the difference.

#### Error type 1:
Rejecting $H_0$ when it is true, i.e. wrongly dismissing $H_0$ (a false positive).

#### Error Type 2:
Failing to reject $H_0$ when it is false (a false negative).
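
A type 1 error rate can be illustrated by simulation: when $H_0$ is true, a two-tailed test at the 5% level should still reject in roughly 5% of experiments. The sketch below uses a hypothetical population (mean 1.2, std 0.5), a hypothetical sample size and a fixed seed, all chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed, illustrative only
mu, sigma, n, trials = 1.2, 0.5, 100, 20000    # hypothetical population and sample size

# draw many samples for which H0 is true and compute the z-score of each sample mean
sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
z = (sample_means - mu) / (sigma / np.sqrt(n))

# a two-tailed test at the 5% level rejects H0 when |z| > 1.96
type_1_rate = np.mean(np.abs(z) > 1.96)
print(type_1_rate)
```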

### P-Value:
is the probability, assuming $H_0$ is true, of obtaining a value as extreme or more extreme than the sample mean $\overline{x}$ in the normal distribution of the population's sample means. It is not the probability that $H_0$ is correct. A low P-value (5% or less) can be used to conclude that $H_0$ should be rejected. With $\overline{X}$ the sample mean as a random variable under $H_0$ and $\mu$ the known population mean:
$$P(|\overline{X}-\mu|\geq|\overline{x}-\mu|)\leq5\%\;\;\;\text{"Two tails" test case}$$
$$P(\overline{X}\leq\overline{x})\leq5\% \;\;\;\; or \;\;\;\; P(\overline{X}\geq\overline{x})\leq5\%\;\;\;\text{"one tail" test case}$$

Mathematically, the two-tailed P-value is the area under the standard normal distribution from $-\infty$ to $-|Z_{\overline{x}}|$ plus the area from $+|Z_{\overline{x}}|$ to $\infty$, where $\sigma_{\overline{x}}=\sigma/\sqrt{n}$ is the standard error of the mean and

$$ Z_{\overline{x}}=\frac{\overline{x}-\mu}{\sigma_{\overline{x}}}$$
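
The tail-area description of the P-value can be sketched with `scipy.stats.norm`. The helper name `p_value_from_z` and its `tails` parameter are assumptions made for this illustration.

```python
from scipy.stats import norm

def p_value_from_z(z, tails=2):
    # area in one tail beyond |z| under the standard normal curve;
    # the one-tailed value assumes the sample fell on the side named by the alternative
    one_tail = norm.sf(abs(z))
    return tails * one_tail

print(p_value_from_z(1.96))            # two tails
print(p_value_from_z(1.645, tails=1))  # one tail
```

Both calls land very close to 5%, which is why 1.96 and 1.645 are the usual cut-offs for two-tailed and one-tailed tests at that level.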



In [None]:
'''A stimulant was tested on 100 rats. The mean response time was 1.05s with std=0.5s, and the known 
average response time is 1.2s. Did the drug have anything to do with the response time? '''

std_err=0.5/sqrt(100)
z=(1.05-1.2)/std_err
print(z)
# the z-score is -3: the sample mean is 3 standard errors below the known mean.
# The area under the standard normal curve within 3 standard deviations of the mean is about 99.7%,
# so the probability of a sample mean at least 3 standard errors away (in either direction)
# is about 0.3%. Therefore H0 can be rejected: the drug very likely affected the response time.
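
The tail areas for this example can be computed directly rather than read from a table. A minimal sketch using `scipy.stats.norm` (the variable names `p_one_tail` and `p_two_tail` are illustrative):

```python
from math import sqrt
from scipy.stats import norm

std_err = 0.5 / sqrt(100)
z = (1.05 - 1.2) / std_err

p_one_tail = norm.cdf(z)             # P(Z <= z), response time lower than expected
p_two_tail = 2 * norm.cdf(-abs(z))   # P(|Z| >= |z|), difference in either direction
print(z, p_one_tail, p_two_tail)
```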

### Sum of random variables.

If z = x + y and x and y are independent random variables then:
$$E(z)=E(x+y)=E(x)+E(y)$$
$$\mu_z=\mu_x+\mu_y$$
$$\sigma_z^2=\sigma_x^2+\sigma_y^2$$

Similarly if z = x - y and x and y are independent random variables then:
$$E(z)=E(x-y)=E(x)-E(y)$$
$$\mu_z=\mu_x-\mu_y$$
$$\sigma_z^2=\sigma_x^2{\color{red}+}\sigma_y^2$$
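
A quick simulation makes the sign of the variance rule concrete: the variances add for both the sum and the difference. The distributions, parameters and seed below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)            # fixed seed, illustrative only
x = rng.normal(5.0, 2.0, size=200_000)    # sigma_x^2 = 4
y = rng.normal(3.0, 1.5, size=200_000)    # sigma_y^2 = 2.25, drawn independently of x

mean_diff = np.mean(x - y)   # close to mu_x - mu_y = 2
var_sum = np.var(x + y)      # close to 4 + 2.25 = 6.25
var_diff = np.var(x - y)     # also close to 6.25, not 4 - 2.25
print(mean_diff, var_sum, var_diff)
```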


##### Example 1:
Two groups of 100 people each are on a diet. **Group x** is on a low-fat diet, **group y** on a control diet with no reduction in fat. After some months **group x** lost a mean of 9.32lbs with s=4.67lbs. **Group y** lost a mean of 7.4lbs with s = 4.04lbs.
Find the difference in weight loss between the two groups with a confidence interval of 95%

In [None]:
diff_mean=9.32-7.4
# standard error of each sample mean, estimated from the sample std
x_means_dist_std = 4.67/sqrt(100)
y_means_dist_std = 4.04/sqrt(100)
# the variances of independent variables add, also for a difference
diff_std = sqrt(pow(x_means_dist_std,2)+pow(y_means_dist_std,2))
z_score=sigma_from_confidence(.95)
print('Confidence Interval: {} to {}'.format(diff_mean-z_score*diff_std,diff_mean+z_score*diff_std))
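
The same interval can be cross-checked with the exact critical value from `scipy.stats.norm.ppf` instead of the grid search. This is a sketch under the same normal approximation; `z_crit`, `low` and `high` are illustrative names.

```python
from math import sqrt
from scipy.stats import norm

diff_mean = 9.32 - 7.4
# standard error of the difference of the two sample means
diff_std = sqrt((4.67**2 + 4.04**2) / 100)

z_crit = norm.ppf(0.975)   # 95% two-sided leaves 2.5% in each tail
low, high = diff_mean - z_crit * diff_std, diff_mean + z_crit * diff_std
print('Confidence Interval: {:.2f} to {:.2f}'.format(low, high))
```

The interval lies entirely above zero, which already hints at the result of the hypothesis test that follows.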

#### Example 2
- Show that the low-fat diet has a greater weight-loss effect.<br>
    Hypothesis H0: the low-fat diet had no effect on the weight loss, hence $(\mu_{\overline{x}})-(\mu_{\overline{y}})=0$

In [None]:
# If H0 is true, what is the probability of a difference of means at least as large as the observed one?
z_diff = (diff_mean-0)/diff_std
p = round(prob_from_zscore(z_diff),4)
print('Z-score: {}\nP-value: {}%'.format(round(z_diff,3),round((1-p)*100,4)))
# since the P-value is less than 0.1% the null hypothesis can be rejected

In [None]:
## Understanding Histograms

x=np.random.randn(10000)
bins = 100
fig,axes=plt.subplots(1,2,figsize=(10,5))

# left: histogram computed by matplotlib
axes[0].hist(x,bins=bins)

# right: the same histogram built by hand
X = np.linspace(x.min(),x.max(),bins+1) # bin edges
y = np.zeros(bins)
for i in range(bins):
    # counts how many values fall in [X[i], X[i+1]); the last bin also includes its right edge
    upper = (x<=X[i+1]) if i==bins-1 else (x<X[i+1])
    y[i] = np.count_nonzero(np.logical_and(x>=X[i], upper))
axes[1].bar(X[:-1],y,width=np.diff(X),align='edge')
plt.show()
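
The counting step behind a histogram can also be checked without plotting. The sketch below compares `np.histogram` with a manual count over the same edges; the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, illustrative only
x = rng.standard_normal(10000)
bins = 100

# counts per bin and the bins + 1 edges, equally spaced between min and max
counts, edges = np.histogram(x, bins=bins)

# manual count: every bin is half-open [left, right) except the last, which is closed
manual = np.array([
    np.count_nonzero(
        (x >= edges[i]) & ((x < edges[i + 1]) if i < bins - 1 else (x <= edges[i + 1]))
    )
    for i in range(bins)
])
print(counts.sum(), manual.sum())
```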

### SKEWNESS AND KURTOSIS

#### Skewness

Skewness measures the symmetry of a data distribution. Negative skewness indicates that the "tail" of the distribution is skewed to the left; positive skewness means the "tail" is skewed to the right.

In the graph above the skewness is negative.

Pearson's coefficient of skewness, with $n$ the number of observations, $\mu$ their mean and $\sigma$ their standard deviation: $$ SK_P = \Big(\frac{n}{(n-1)(n-2)}\Big)\frac{\sum_{i=1}^{n}{(x_i-\mu)^3}}{\sigma^3} $$

When the data consists of a frequency distribution (ranges of values with a given frequency), then if $f_i$ is the frequency of the value $x_i$, $k$ is the number of classes and $n=\sum_{i} f_i$:

$$ SK_P = \Big(\frac{n}{(n-1)(n-2)}\Big)\frac{\sum_{i=1}^{k}{f_i\cdot(x_i-\mu)^3}}{\sigma^3} $$


#### Kurtosis
Kurtosis measures the "flatness" or "pointiness" of a data distribution compared to a normal distribution.

In the graph above the kurtosis is positive since the graph is "pointy" instead of flat.

$$K =\Big(\frac{n(n+1)}{(n-1)(n-2)(n-3)}\Big)\cdot\frac{\sum_{i=1}^{n} (x_i-\mu)^4}{\sigma^4}-\frac{3(n-1)^2}{(n-2)(n-3)} $$

If it is a frequency distribution then $$\dots\frac{\sum_{i=1}^{k} f_i\cdot(x_i-\mu)^4}{\sigma^4}\dots $$
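
The two formulas for raw data can be written out directly and compared with `scipy.stats.skew` and `scipy.stats.kurtosis` called with `bias=False`. This is a sketch that assumes $\mu$ and $\sigma$ in the formulas are the sample mean and the sample standard deviation (divisor $n-1$); the function names and the data values are made up for illustration.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def sample_skewness(x):
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.std(ddof=1)   # sample standard deviation (assumption)
    return n / ((n - 1) * (n - 2)) * np.sum((x - x.mean()) ** 3) / s ** 3

def sample_excess_kurtosis(x):
    x = np.asarray(x, dtype=float)
    n = x.size
    s = x.std(ddof=1)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
            * np.sum((x - x.mean()) ** 4) / s ** 4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

data = np.array([2.0, 3.0, 3.5, 4.0, 4.2, 5.0, 9.0])   # made-up values with a long right tail
print(sample_skewness(data), skew(data, bias=False))
print(sample_excess_kurtosis(data), kurtosis(data, bias=False))
```

The hand-written and library values agree, and the skewness is positive because the single large value stretches the right tail.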


#### Covariance
Covariance is a measure of the joint variability of two random variables; its sign shows the direction of their linear relationship.

$$\text{if E is expected value} \to cov(X,Y)=E[(X-E(X))\cdot(Y-E(Y))]$$
when applied to discrete random variables with joint probability $f(x,y)$: $$ \sigma_{xy}=\sum_{(x,y)} f(x,y)\cdot(x-\mu_x)(y-\mu_y)$$
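
The discrete formula can be evaluated on a small joint probability table. The table below is hypothetical; `xs`, `ys` and `f` are illustrative names.

```python
import numpy as np

xs = np.array([0, 1, 2])
ys = np.array([0, 1])
# f[i, j] = P(X = xs[i], Y = ys[j]); hypothetical values that sum to 1
f = np.array([[0.10, 0.20],
              [0.20, 0.20],
              [0.25, 0.05]])

mu_x = np.sum(f.sum(axis=1) * xs)   # mean of X from its marginal distribution
mu_y = np.sum(f.sum(axis=0) * ys)   # mean of Y from its marginal distribution

# sum over all (x, y) pairs of f(x, y) * (x - mu_x) * (y - mu_y)
cov_xy = np.sum(f * np.outer(xs - mu_x, ys - mu_y))
print(mu_x, mu_y, cov_xy)
```

The covariance is negative here because large values of X tend to occur together with Y = 0.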


### <font color='blue'>From _Standard Probability and Statistics Tables and Formulae_, Daniel Zwillinger, Stephen Kokoska</font>


In [None]:
from scipy.stats import skew
skew?

In [None]:
# swimming pool acidity dataset
from scipy.stats import mode
ph = np.array(
    [
        6.4, 6.6, 6.2, 7.2, 6.2, 8.1, 7.0,
        7.0, 5.9, 5.7, 7.0, 7.4, 6.5, 6.8,
        7.0, 7.0, 6.0, 6.3, 5.6, 6.3, 5.8,
        5.9, 7.2, 7.3, 7.7, 6.8, 5.2, 5.2,
        6.4, 6.3, 6.2, 7.5, 6.7, 6.4, 7.8
    ] )
# np.atleast_1d handles both the array and the scalar form of the mode result
print(np.mean(ph),np.median(ph),np.atleast_1d(mode(ph).mode)[0])


In [None]:
from scipy.stats import mode
mode([1,1,1,2,2,2,3,3,3])

In [None]:
np.unique(ph,return_counts=True)

### Characteristic function

### Moments

### Entropy

Laplace distribution: the difference between two independent, identically distributed exponential random variables follows this distribution. The residuals between a model and the real measurements are sometimes modelled with it.
$$f(x|\mu,\beta)=\frac{1}{2\beta}e^{-\frac{|x-\mu|}{\beta}}$$
knowing that the gamma function is:
$$\Gamma(z)=\int\limits_{0}^{\infty}x^{z-1}e^{-x}dx$$
Then the moments are:
$$\mu_r=\Big(\frac{1}{2}\Big)$$
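
The link between the Laplace density and a difference of exponential variables can be checked by simulation. The sketch below uses a hypothetical scale $\beta=2$ and a fixed seed; a Laplace distribution with $\mu=0$ has mean 0 and variance $2\beta^2$.

```python
import numpy as np

def laplace_pdf(x, mu, beta):
    # Laplace density with location mu and scale beta
    return np.exp(-np.abs(x - mu) / beta) / (2 * beta)

rng = np.random.default_rng(3)   # fixed seed, illustrative only
beta = 2.0                       # hypothetical scale
n = 200_000

# difference of two independent exponential variables with the same scale
d = rng.exponential(beta, n) - rng.exponential(beta, n)

sample_mean = d.mean()   # close to 0
sample_var = d.var()     # close to 2 * beta**2 = 8
print(sample_mean, sample_var, laplace_pdf(0.0, 0.0, beta))
```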

In [None]:
from ipywidgets import widgets, interact
from scipy.stats import norm

@interact(alpha = widgets.FloatSlider(min=0.1,max=100,step=0.1,value=10),
         beta = widgets.FloatSlider(min=0.1,max=100,step=0.1,value=10),
         mu = widgets.FloatSlider(min=-10,max=10,step=0.1,value=1),
         sigma = widgets.FloatSlider(min=0.1,max=10,step=0.01,value=1),)
def basic_normal_laplace_cdf(alpha,beta,mu,sigma):
    
    # phi(L)*R(z), where phi is the standard gaussian probability density function and
    # R(z) = (1-Phi(z))/phi(z) is the Mills ratio; computed in log space so that
    # extreme slider values do not produce 0/0 or 0*inf
    def phi_times_mills(L,z):
        return np.exp(norm.logpdf(L)+norm.logsf(z)-norm.logpdf(z))
    
    # normal Laplace cumulative distribution function, assuming Reed's form:
    # F(x) = Phi(L) - phi(L)*(beta*R(alpha*sigma-L) - alpha*R(beta*sigma+L))/(alpha+beta)
    def nl_cdf(x,alpha,beta,mu,sigma):
        L = (x-mu)/sigma
        tails = beta*phi_times_mills(L,alpha*sigma-L)-alpha*phi_times_mills(L,beta*sigma+L)
        return norm.cdf(L)-tails/(alpha+beta)
    
    x = np.linspace(-20,20,200)
    nld = nl_cdf(x,alpha,beta,mu,sigma)
    plt.plot(x,nld)
    plt.show()