<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#1.-Random-Variables" data-toc-modified-id="1.-Random-Variables-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>1. Random Variables</a></span></li></ul></li><li><span><a href="#Probability-distribution" data-toc-modified-id="Probability-distribution-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Probability distribution</a></span></li><li><span><a href="#Discrete-Distributions" data-toc-modified-id="Discrete-Distributions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Discrete Distributions</a></span></li><li><span><a href="#Statistical-functions" data-toc-modified-id="Statistical-functions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Statistical functions</a></span></li><li><span><a href="#Hypothesis-testing" data-toc-modified-id="Hypothesis-testing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Hypothesis testing</a></span></li><li><span><a href="#Statistical-Models" data-toc-modified-id="Statistical-Models-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Statistical Models</a></span></li><li><span><a href="#Probability-Density-Functions-and-Cumulative-Distribution-Functions" data-toc-modified-id="Probability-Density-Functions-and-Cumulative-Distribution-Functions-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Probability Density Functions and Cumulative Distribution Functions</a></span></li><li><span><a href="#Multivariate-Distributions-and-Tests" data-toc-modified-id="Multivariate-Distributions-and-Tests-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Multivariate Distributions and Tests</a></span></li><li><span><a href="#Non-parametric-Methods" data-toc-modified-id="Non-parametric-Methods-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Non-parametric Methods</a></span></li></ul></div>

**Using scipy to understand a data set**

In [2]:
from scipy.stats import describe 

data = [0.2, -1.91, 0.41, -0.7, -0.03, 0.53, -0.1]
description = describe(data)

for key, value in description._asdict().items(): 
    print(f"{key}:{value}")

nobs:7
minmax:(-1.91, 0.53)
mean:-0.22857142857142856
variance:0.7120476190476189
skewness:-1.2203002099412794
kurtosis:0.28876812085165326


In [4]:
from scipy.stats import describe 

data2 = [[1,2], [3,4]]
description = describe(data2)

for key,value in description._asdict().items(): 
    print(f"{key}:{value}")

nobs:2
minmax:(array([1, 2]), array([3, 4]))
mean:[2. 3.]
variance:[2. 2.]
skewness:[0. 0.]
kurtosis:[-2. -2.]
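
The two-dimensional result above is computed column by column. The sketch below illustrates two `describe` defaults that are easy to miss: statistics are taken along `axis=0`, and the variance uses `ddof=1` (the sample variance). The names `per_column`, `flattened`, and `sample_var` are only illustrative.

```python
import numpy as np
from scipy.stats import describe

data2 = np.array([[1, 2], [3, 4]])

# Default axis=0: one set of statistics per column
per_column = describe(data2)

# axis=None flattens the array and describes all four values together
flattened = describe(data2, axis=None)

# describe uses ddof=1 by default, so it matches NumPy's sample variance
sample_var = np.var(data2, axis=0, ddof=1)

print("Per-column mean:", per_column.mean)
print("Flattened nobs and mean:", flattened.nobs, flattened.mean)
print("Matches np.var(ddof=1):", np.allclose(per_column.variance, sample_var))
```

With only two observations per column, the skewness and kurtosis values carry little meaning; they are shown in the output above only because `describe` always reports them.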


**Using scipy to perform calculations on a normal distribution**

In [5]:
from scipy.stats import norm 
import numpy as np 

distribution = norm(loc=0, scale=2)  # mean 0, standard deviation 2
array = np.array([-2, -1, 0.0, 1, 2, 3])

print(f"Cumulative distribution: {distribution.cdf(array)}")
print(f"PPF:{distribution.ppf(0.5)}")
print(f"Sampled distribution: {distribution.rvs(size=5)}")

Cumulative distribution: [0.15865525 0.30853754 0.5        0.69146246 0.84134475 0.9331928 ]
PPF:0.0
Sampled distribution: [ 0.1495538   0.35598259 -1.63577539  0.17290247 -0.73660232]
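
As a rough illustration of how `cdf` and `ppf` relate, the sketch below checks that `ppf` inverts `cdf`, and that the CDF of a normal distribution with standard deviation 2 equals the standard normal CDF of the rescaled points. The variable names are illustrative, and the 95% interval is just one example of a derived quantity.

```python
import numpy as np
from scipy.stats import norm

distribution = norm(loc=0, scale=2)
points = np.array([-2, -1, 0.0, 1, 2, 3])

# cdf maps values to probabilities; ppf maps probabilities back to values
probabilities = distribution.cdf(points)
recovered = distribution.ppf(probabilities)

# Dividing by the standard deviation reduces the problem to the standard normal
standardized = norm.cdf(points / 2)

# Central interval that contains 95% of the probability mass
lower, upper = distribution.interval(0.95)

print("ppf inverts cdf:", np.allclose(recovered, points))
print("Matches standard normal:", np.allclose(probabilities, standardized))
print("95% interval:", lower, upper)
```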


**Using scipy to perform a t-test**

A one-sample t-test checks whether the mean of a sample differs significantly from a given reference value (the hypothesized population mean).

In [12]:
# The manufacturer claims that the volume of a drink is 225 ml.
# Is this true?
# The measured volumes all look lower than 225 ml.

from scipy import stats

volumes = [220.1, 220.5, 221.2, 221.8, 222.5, 223.1, 223.7]
result = stats.ttest_1samp(volumes, 225)
print(result)
print("-----")
print(result[1])

Ttest_1sampResult(statistic=-6.249990669147964, pvalue=0.0007777459267474468)
-----
0.0007777459267474468
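
To make the test statistic less of a black box, the sketch below recomputes it by hand from the usual formula $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ and compares it with `ttest_1samp`. The names `t_manual` and `p_manual` are illustrative, and the two-sided p-value is assumed, which is the default of `ttest_1samp`.

```python
import numpy as np
from scipy import stats

volumes = np.array([220.1, 220.5, 221.2, 221.8, 222.5, 223.1, 223.7])
claimed_mean = 225

n = len(volumes)
sample_mean = volumes.mean()
sample_std = volumes.std(ddof=1)  # sample standard deviation

# t statistic: distance from the claimed mean in units of the standard error
t_manual = (sample_mean - claimed_mean) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(volumes, claimed_mean)
print("Manual:", t_manual, p_manual)
print("SciPy: ", result.statistic, result.pvalue)
```

Because the p-value is far below 0.05, the data are hard to reconcile with a true mean of 225 ml.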


### 1. Random Variables

**Generating random variables**

Uniform, Normal, Exponential, etc. 

In [14]:
from scipy.stats import uniform, norm, expon 

# Example: uniform distribution
data1 = uniform.rvs(size=100)
print(data1)

print()

# Example: normal distribution
data2 = norm.rvs(size=100)
print(data2)

print()

# Example: exponential distribution
data3 = expon.rvs(size=100)
print(data3)

[0.38365154 0.64992949 0.07761502 0.60325459 0.73674025 0.71542407
 0.37277582 0.48504099 0.00910934 0.43852116 0.09404274 0.98684974
 0.75599639 0.39652622 0.10901255 0.8185625  0.74371899 0.72817371
 0.53039774 0.21805092 0.69886342 0.54288128 0.50999744 0.1182268
 0.01974857 0.58416851 0.33580088 0.42575701 0.81509121 0.52579852
 0.79360297 0.97393955 0.49156342 0.8147491  0.1711942  0.40359373
 0.7509017  0.52978067 0.23948485 0.02086207 0.7353068  0.78607555
 0.2727768  0.4395637  0.06642154 0.55203341 0.75137476 0.27494629
 0.11390457 0.74514938 0.76192735 0.31548182 0.27401287 0.44095328
 0.8906154  0.23158622 0.84599063 0.57411319 0.53709177 0.12209399
 0.94651758 0.24521098 0.51715308 0.13384271 0.52054808 0.8872801
 0.16239083 0.16111402 0.45464206 0.65148192 0.16296755 0.2781843
 0.42109104 0.9571285  0.35256818 0.19175826 0.28808779 0.17676487
 0.32125734 0.62845974 0.43836009 0.8824237  0.44945533 0.46965069
 0.1539367  0.20237206 0.97389082 0.78297983 0.58411493 0.3797720
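
Each run of the cell above produces different numbers. A minimal sketch of how to make the draws reproducible, assuming the `random_state` argument of `rvs` is used for seeding (the seed value 42 and the `loc`/`scale` choices are arbitrary):

```python
import numpy as np
from scipy.stats import uniform, norm, expon

# The same integer seed gives the same sample
first = uniform.rvs(size=5, random_state=42)
second = uniform.rvs(size=5, random_state=42)
print("Identical draws:", np.array_equal(first, second))

# A shared NumPy Generator can drive several distributions in sequence
rng = np.random.default_rng(42)
normal_sample = norm.rvs(loc=10, scale=2, size=5, random_state=rng)
expon_sample = expon.rvs(scale=3, size=5, random_state=rng)

print(normal_sample)
print(expon_sample)
```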

## Probability distribution 
**Continuous distributions**

Normal, exponential, chi-squared, etc.

In [16]:
# Example: Normal distributions 
data = norm.rvs(loc=0, scale=1, size=1000) # Generate random numbers
print(data)

print()

mean, std = norm.fit(data)  # fit returns the estimated mean (loc) and standard deviation (scale)
print("The mean is", mean)
print()

print("The standard deviation is", std)


[ 8.49863730e-01  8.68057217e-01 -1.73554136e+00 -5.41361498e-01
  4.27327656e-01  5.94566417e-01 -7.24784074e-01  5.61180296e-01
 -1.07627650e+00 -3.60556472e-01  4.71148070e-01 -4.80415161e-01
 -1.73525240e-01 -8.76306029e-01 -4.14991843e-01  7.24395048e-01
 -4.42572139e-01  1.63304136e+00 -4.67457767e-01  1.01458739e+00
 -5.25803189e-01 -1.65755914e-01 -3.78959278e-02  9.34355862e-01
 -7.23254220e-01 -2.54025128e+00 -1.42299179e+00 -1.21239827e+00
 -7.47414124e-01  5.94387475e-01 -9.04148145e-01  2.37943010e-01
 -7.96393482e-01  1.71064205e+00  3.76616502e-01 -6.45761623e-01
 -9.82257133e-01  4.18314465e-01  2.73539633e-02 -2.99119137e-01
  1.99390107e-02  1.45811039e+00 -2.91979827e-01 -1.06370329e-01
  2.40515166e-03 -5.23769666e-01  1.69672371e+00  6.47039333e-02
  1.35722021e+00  3.23030489e+00 -3.33038493e-01 -1.35751492e-01
 -1.02065155e+00  8.38337041e-01 -5.68113591e-01  1.42092873e+00
 -6.03674694e-01 -1.26670185e+00  2.02243058e+00 -7.84841565e-01
  7.89865115e-01 -1.68818
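
Note that `norm.fit` returns the location and scale, that is, the mean and the standard deviation rather than the variance. The sketch below checks this against NumPy, assuming a seeded sample for reproducibility (the names `loc_hat` and `scale_hat` are illustrative):

```python
import numpy as np
from scipy.stats import norm

data = norm.rvs(loc=0, scale=1, size=1000, random_state=0)

# Maximum-likelihood estimates of loc (mean) and scale (standard deviation)
loc_hat, scale_hat = norm.fit(data)

# For the normal distribution these are the sample mean and the
# standard deviation with ddof=0
print("loc matches mean:", np.isclose(loc_hat, data.mean()))
print("scale matches std:", np.isclose(scale_hat, data.std(ddof=0)))
print("Variance estimate:", scale_hat ** 2)
```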

## Discrete Distributions 
Poisson, binomial, geometric, etc.

In [17]:
from scipy.stats import poisson, binom, geom 

# Example: binomial distribution
successes = binom.rvs(n=10, p=0.5, size=1000)  # simulate 1000 experiments of 10 trials each
print(successes)

print()

mean, variance = binom.stats(n=10, p=0.5)  # theoretical mean and variance
print("The mean is", mean)
print()
print("The variance is", variance)

[ 7  7  6  4  6  2  3  4  6  2  3  5  4  7  7  3  4  6  4  4  5  6  3  4
  5  4  6  5  6  4  7  1  7  7  5  6  2  1  3  3  2  7  6  7  6  7  5  4
  7  5  1  2  2  6  5  7  3  7  6  5  7  3  4  6  9  2  5  4  7  6  2  5
  5  5  6  7  3  6  7  5  4  5  4  4  6  5  4  5  2  4  4  9  5  6  4  5
  7  7  3  5  4  4  6  7  3  4  7  0  4  3  7  5  5  7  6  4  7  4  4  8
  4  7  5  5  9  7  4  2  4  1  6  6  5  5  5  3  7  5  5  7  6  5  4  5
  6  4  5  7  1  4  5  5  5  4  5  6  4  6  3  7  3  5  5  6  7  4  4  2
  2  4  4  4  8  5  6  4  4  5  5  4  5  4  4  3  6  6  3  5  3  4  6  9
  5  5  5  5  5  3  6  5  3  3  4  8  6  5  5  5  7  6  6  6  4  4  6  7
  4  4  6  4  2  4  6  3  3  6  5  5  7  6  5  4  6  3  5  7  6  6  7  4
  5  5  5  5  4  3  5  2  7  4  5  4  3  6  5  4  6  5  4  4  5  5  5  6
  7  7  7  5  5  4  5  5  6  5  2  6  4  2  2  4  6  8  4  4  7  3  4  5
  5  6  5  7  4  7  7  2  6  4  5  6  9  7  5  8  6  7  6  3  5  4  6  3
  6  5  4  5  8  6  4  2  4  4  6  5  4  6  3  4  4
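
The theoretical values returned by `binom.stats` follow the formulas $np$ for the mean and $np(1-p)$ for the variance. A short sketch that also evaluates the probability mass function and the CDF for the same parameters (the two example probabilities are chosen only for illustration):

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.5

mean, variance = binom.stats(n=n, p=p)
print("Mean:", mean, "expected:", n * p)
print("Variance:", variance, "expected:", n * p * (1 - p))

# Probability of each possible number of successes, 0 through n
k = np.arange(n + 1)
pmf = binom.pmf(k, n, p)
print("PMF sums to one:", np.isclose(pmf.sum(), 1.0))

p_exactly_5 = binom.pmf(5, n, p)  # P(X = 5)
p_at_most_3 = binom.cdf(3, n, p)  # P(X <= 3)
print(p_exactly_5, p_at_most_3)
```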

## Statistical functions 
**Descriptive statistics**

Mean, minimum, variance, skewness, kurtosis, etc.

In [18]:
from scipy.stats import describe, norm 
import numpy as np 

# Generate some random data from a normal distribution
data = norm.rvs(loc=0, scale=1, size=1000)
print(data)

print()

# Calculate descriptive statistics using describe
description = describe(data)
print(description)
print()

# Print the descriptive statistics
print("Descriptive statistics")
print("Mean:", description.mean)
print("Minimum:", description.minmax[0])
print("Maximum:", description.minmax[1])
print("Variance:",description.variance)
print("Skewness:", description.skewness)
print("Kurtosis:",description.kurtosis)

[ 3.10644735e-01 -7.21953272e-01 -4.10193174e-01 -8.50248172e-01
  1.92175420e+00  9.03884435e-01 -5.07501243e-01 -2.81636158e-01
  2.45863438e-01  3.43365076e-01  1.28990930e+00  5.35603350e-01
  3.40977655e-01  8.22425509e-01  1.74200074e+00  2.15448890e-01
  7.75373224e-01 -8.22578274e-01  9.10574716e-01  6.67320822e-01
  2.64003396e-02  6.48604965e-01 -7.42337425e-02 -5.23709835e-01
 -9.76581232e-01 -3.87154375e-01  1.36569340e+00 -5.91290171e-01
 -1.46789479e+00 -5.79251066e-01 -1.50838913e+00 -1.79900645e+00
 -1.61856300e+00  7.41998984e-02 -2.14841823e-01  1.96520125e+00
  6.55235087e-02 -1.26264853e+00  1.92146281e+00 -2.35631460e+00
 -1.25992626e+00 -6.17510442e-01  1.09440306e+00 -9.53816965e-02
 -1.79539791e-01  1.69288226e+00 -6.52730465e-01 -1.38457060e+00
  2.73453194e-01 -3.33392259e-02  2.81991341e+00 -1.12455006e+00
  1.01313563e+00  5.20388636e-01  1.08264675e+00 -1.75625532e+00
  2.65579320e-02  1.37986485e+00  5.52332990e-01  7.30256912e-01
 -5.34728860e-01 -2.30984
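
The skewness and kurtosis reported by `describe` can also be computed individually. The sketch below, using a seeded sample as an assumption for reproducibility, shows that `describe` agrees with `skew` and `kurtosis` under their defaults, and that the kurtosis is the Fisher (excess) definition, which is 0 for a normal distribution.

```python
import numpy as np
from scipy.stats import describe, kurtosis, norm, skew

data = norm.rvs(loc=0, scale=1, size=1000, random_state=1)
description = describe(data)

# describe uses the same defaults as skew() and kurtosis()
print("Skewness:", description.skewness, skew(data))
print("Excess kurtosis:", description.kurtosis, kurtosis(data))

# fisher=False gives the Pearson definition, which is larger by exactly 3
pearson_kurtosis = kurtosis(data, fisher=False)
print("Pearson kurtosis:", pearson_kurtosis)
```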

## Hypothesis testing 
t-test, ANOVA, Kolmogorov-Smirnov test, etc

In [19]:
from scipy.stats import norm, ttest_ind, f_oneway, ks_2samp

# Example: independent two-sample t-test
group1 = norm.rvs(loc=0, scale=1, size=1000)
print(group1)

print()

group2 = norm.rvs(loc=0.5, scale=1, size=1000)
print(group2)

print()

t_stat, p_value = ttest_ind(group1, group2)
print("Test statistic:", t_stat)
print("P-value:", p_value)

[-1.12845484e+00  2.07738013e-01  1.83502789e+00  6.53165118e-01
  8.54464889e-01 -1.20537721e+00  3.86077910e-01  6.16519139e-01
 -1.72115230e-01  5.85581811e-01  1.26387366e+00 -6.31739492e-01
 -2.01703114e+00  3.66734322e-01 -1.14860409e+00  1.22607787e-01
  8.78811659e-01  5.94199651e-01  1.42081614e+00  1.09578399e+00
  9.74415096e-01  6.80595806e-01  1.61989802e-01 -1.15588248e+00
  3.87156677e-01 -1.26885811e+00  1.42613179e+00 -8.00873937e-01
 -7.38607815e-01  5.15029201e-01  5.82176045e-01  5.67285055e-01
  3.21357229e-01  1.63680998e+00 -7.59254239e-01  8.66125252e-01
 -8.84925576e-01 -3.54066947e-02  9.01064999e-01 -6.08660630e-01
 -4.56515147e-01  3.89740855e-02 -9.72961879e-01  5.02807658e-02
  5.54148514e-01  5.80407607e-01  1.38036759e+00  7.09810711e-02
  1.18817576e+00 -7.94122204e-03 -2.29003803e+00  1.24223599e+00
  1.73658779e+00 -1.33859528e+00  2.20100501e+00 -1.77117412e-01
  1.12563389e+00  2.62923029e-04 -2.29861801e-01  1.24179046e+00
  6.35993792e-01  1.44517
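
The cell above imports `f_oneway` and `ks_2samp` but only runs the t-test. A minimal sketch of all three tests on the same kind of data, assuming seeded samples so that the run is repeatable (the seeds are arbitrary):

```python
import numpy as np
from scipy.stats import f_oneway, ks_2samp, norm, ttest_ind

group1 = norm.rvs(loc=0, scale=1, size=1000, random_state=0)
group2 = norm.rvs(loc=0.5, scale=1, size=1000, random_state=1)

# Two-sample t-test: do the group means differ?
t_stat, t_p = ttest_ind(group1, group2)

# One-way ANOVA: with two groups, F equals the squared t statistic
f_stat, f_p = f_oneway(group1, group2)

# Kolmogorov-Smirnov: do the two samples come from the same distribution?
ks_stat, ks_p = ks_2samp(group1, group2)

print("t-test:", t_stat, t_p)
print("ANOVA: ", f_stat, f_p)
print("KS:    ", ks_stat, ks_p)
print("F equals t squared:", np.isclose(f_stat, t_stat ** 2))
```

ANOVA becomes more useful than the t-test once there are three or more groups, since `f_oneway` accepts any number of samples.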

## Statistical Models 
Linear regression 

In [22]:
from scipy.stats import linregress

# Example: linear regression 
x = np.array([0,1,2,3,4])
y = np.array([0,0.8,0.9, 0.1, -0.8])

slope, intercept, r_value, p_value, std_err = linregress(x,y)

print("Slope:", slope)
print("intercept:", intercept)
print("r_value:", r_value)
print("p_value:", p_value)
print("std_err:", std_err)

Slope: -0.22999999999999998
intercept: 0.66
r_value: -0.5276561879022921
p_value: 0.36079504544175134
std_err: 0.2137755832643195
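
The fitted line can be used for prediction and cross-checked against another least-squares routine. The sketch below is one way to do that; it accesses the result by attribute name instead of tuple unpacking, and `np.polyfit` is used purely as an independent check.

```python
import numpy as np
from scipy.stats import linregress

x = np.array([0, 1, 2, 3, 4])
y = np.array([0, 0.8, 0.9, 0.1, -0.8])

result = linregress(x, y)

# Predicted values on the fitted line
predicted = result.intercept + result.slope * x

# Coefficient of determination: share of the variance explained by the line
r_squared = result.rvalue ** 2

# A degree-1 polynomial fit should give the same slope and intercept
poly_slope, poly_intercept = np.polyfit(x, y, deg=1)

print("Predicted:", predicted)
print("R squared:", r_squared)
print("polyfit agrees:", np.allclose([poly_slope, poly_intercept],
                                     [result.slope, result.intercept]))
```

The low R squared reflects that these points follow a curve rather than a straight line.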


In [27]:
from scipy.stats import linregress 

#Example: Linear regression
math_score = np.array([70,92,45,48,76,65,67,83,54,29])
phys_score = np.array([87,58,64,56,88,43,72,53,72,55])
slope, intercept, r_value, p_value, std_err = linregress(math_score,phys_score)

print("slope:", slope)
print("intercept:", intercept)
print("p_value:", p_value)

slope: 0.12980489448375143
intercept: 56.635272136972034
p_value: 0.6444585968786516


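$\text{phys_score} = (0.12980489448375143\times \text{ math_score}) + 56.635272136972034$

With a p-value of about 0.64, the slope is not significantly different from zero, so these data give no evidence of a linear relationship between the two scores.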

## Probability Density Functions and Cumulative Distribution Functions 

**PDFs and CDFs**

Compute the probability density function (PDF) and the cumulative distribution function (CDF) for various distributions.

In [28]:
from scipy.stats import norm
import numpy as np

# Example: normal distribution PDF and CDF
x = np.linspace(-5, 5, 100)

pdf_values = norm.pdf(x, loc=0, scale=1)  # PDF values
cdf_values = norm.cdf(x, loc=0, scale=1)  # CDF values

print("pdf_values:",pdf_values,"\n")
print("cdf_values:", cdf_values)

pdf_values: [1.48671951e-06 2.45106104e-06 3.99989037e-06 6.46116639e-06
 1.03310066e-05 1.63509589e-05 2.56160812e-05 3.97238224e-05
 6.09759040e-05 9.26476353e-05 1.39341123e-04 2.07440309e-04
 3.05686225e-04 4.45889725e-04 6.43795498e-04 9.20104770e-04
 1.30165384e-03 1.82273110e-03 2.52649578e-03 3.46643792e-03
 4.70779076e-03 6.32877643e-03 8.42153448e-03 1.10925548e-02
 1.44624148e-02 1.86646099e-02 2.38432745e-02 3.01496139e-02
 3.77369231e-02 4.67541424e-02 5.73380051e-02 6.96039584e-02
 8.36361772e-02 9.94771388e-02 1.17117360e-01 1.36486009e-01
 1.57443188e-01 1.79774665e-01 2.03189836e-01 2.27323506e-01
 2.51741947e-01 2.75953371e-01 2.99422683e-01 3.21590023e-01
 3.41892294e-01 3.59786558e-01 3.74773979e-01 3.86422853e-01
 3.94389234e-01 3.98433802e-01 3.98433802e-01 3.94389234e-01
 3.86422853e-01 3.74773979e-01 3.59786558e-01 3.41892294e-01
 3.21590023e-01 2.99422683e-01 2.75953371e-01 2.51741947e-01
 2.27323506e-01 2.03189836e-01 1.79774665e-01 1.57443188e-01
 1.36486009e
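
The CDF is the running integral of the PDF. A rough numerical check of that relationship, assuming a finer grid of 1001 points and SciPy's `cumulative_trapezoid` (available in recent SciPy versions) for the integration:

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.stats import norm

x = np.linspace(-5, 5, 1001)
pdf_values = norm.pdf(x)

# Integrate the PDF from the left edge, then add the small tail mass below -5
cdf_numeric = norm.cdf(x[0]) + cumulative_trapezoid(pdf_values, x, initial=0)

max_error = np.max(np.abs(cdf_numeric - norm.cdf(x)))
print("Largest deviation from norm.cdf:", max_error)

# Differences of CDF values give interval probabilities
p_within_one_sd = norm.cdf(1) - norm.cdf(-1)
print("P(-1 <= X <= 1):", p_within_one_sd)
```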

## Multivariate Distributions and Tests 
**Multivariate Normal Distribution**

Generate samples and perform statistical tests.

In [30]:
from scipy.stats import  multivariate_normal 

# Example: Multivariate normal distribution 

mean = [0,0]
cov = [[1,0.5], [0.5,1]]
mv_normal = multivariate_normal(mean, cov)
samples = mv_normal.rvs(size=100)

print("Samples:", samples)

Samples: [[-1.30259507e+00 -8.62547690e-01]
 [-5.41792416e-02 -7.46669427e-01]
 [ 5.77576786e-01 -1.12229662e+00]
 [-1.28355853e+00 -2.11338032e+00]
 [-2.78355769e-01  7.35778632e-01]
 [-7.42861445e-02 -9.47733159e-01]
 [ 8.10328502e-01 -8.01962774e-02]
 [ 2.62223926e+00 -3.17851212e-01]
 [ 1.45310284e+00  8.01288107e-01]
 [ 1.07721438e+00  8.49595393e-01]
 [-1.85592303e+00 -9.04703080e-01]
 [-1.82884780e-01  4.82075028e-01]
 [ 5.68449479e-02 -4.32854446e-03]
 [-2.01487642e+00 -1.08423193e+00]
 [ 7.30803604e-01 -4.36743859e-03]
 [ 6.05527294e-01  9.96304963e-01]
 [-1.14930931e-01  4.02900944e-01]
 [-2.03713308e+00  8.05288350e-01]
 [-9.91610600e-01 -3.01746319e-01]
 [-3.66564154e-01 -3.24196055e-01]
 [-9.09762091e-01 -1.24054800e+00]
 [ 9.32093962e-01  6.68847437e-02]
 [ 2.06196611e-01  7.63526512e-01]
 [ 2.83962002e+00  3.49910663e+00]
 [ 3.17472997e-01  1.11292552e+00]
 [ 1.51172772e+00  1.38867196e+00]
 [ 6.76869246e-01  1.78429837e+00]
 [-1.68942889e+00 -2.82977773e-01]
 [ 1.555783
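
One simple sanity check on the samples is to compare their empirical mean and covariance with the parameters of the distribution. The sketch below does this with a larger, seeded sample (both the size of 10,000 and the seed are assumptions made for stability) and also evaluates the density at the mean.

```python
import numpy as np
from scipy.stats import multivariate_normal

mean = [0, 0]
cov = [[1, 0.5], [0.5, 1]]
mv_normal = multivariate_normal(mean, cov)

samples = mv_normal.rvs(size=10_000, random_state=0)

# Each row is one sample, so statistics are taken across rows
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)

print("Sample mean:", sample_mean)
print("Sample covariance:\n", sample_cov)

# Density at the mean: 1 / (2 * pi * sqrt(det(cov)))
density_at_mean = mv_normal.pdf(mean)
print("Density at the mean:", density_at_mean)
```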

## Non-parametric Methods

**Kernel Density estimation (KDE)**

Estimate the probability density function of a random variable. 

In [33]:
from scipy.stats import gaussian_kde
import numpy as np


# Example: kernel density estimation (KDE)
data = np.random.normal(size=1000)

kde = gaussian_kde(data)

# Generate a grid of points for evaluation
x_grid = np.linspace(min(data), max(data), 100)

estimated_density = kde.evaluate(x_grid)  # Evaluate the density on the grid

print("data:", data, "\n")
print("x_grid:", x_grid, "\n")
print("estimated_density:", estimated_density)

data: [-1.82731712e-01 -1.26209365e+00  4.44960036e-01  4.57812940e-01
 -6.93200969e-01  9.29081884e-01  7.41311418e-01  5.90108195e-01
  8.27002076e-01 -2.08858291e-02  1.81459167e+00 -2.28229615e-02
  1.62444135e+00  3.93976780e-01 -2.91661159e+00 -9.03718420e-02
 -6.14829206e-01  2.18837432e+00 -3.19129973e-01  2.57446446e-01
  1.55125270e+00 -2.32899872e+00 -1.12518420e-01 -5.81001982e-01
  1.19998920e-01  1.75940935e+00  7.63260375e-01 -8.52196386e-01
 -4.66879286e-01  1.77691175e+00  3.30081839e-02 -7.86879341e-01
  3.33990173e-02  5.91116508e-01 -1.42194639e+00 -5.85687891e-02
 -4.51722589e-01  5.89247149e-01  1.58621126e+00 -2.21124091e-01
 -3.24718776e-02  9.02090626e-01  1.33597523e-01 -4.20929662e-01
  6.61833427e-01 -4.37473859e-01 -3.03944279e-01  9.39169541e-01
 -5.19887877e-01 -1.57846541e-02 -4.85371662e-01 -2.39606033e+00
  1.02781084e+00  1.21460232e+00 -2.49465961e+00 -5.92214118e-01
 -6.30051289e-01 -1.05739656e+00 -3.83265051e-01  1.33469235e+00
 -1.86155621e+00 -1
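
Since the data here come from a known distribution, the estimate can be compared with the true density. A minimal sketch, assuming a seeded NumPy generator in place of `np.random.normal` so the run is repeatable:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

kde = gaussian_kde(data)
x_grid = np.linspace(data.min(), data.max(), 100)

# Calling the object is equivalent to kde.evaluate(x_grid)
estimated_density = kde(x_grid)

# A density estimate should integrate to one over the real line
total_mass = kde.integrate_box_1d(-np.inf, np.inf)
print("Total probability mass:", total_mass)

# Compare with the true standard normal density on the same grid
max_abs_error = np.max(np.abs(estimated_density - norm.pdf(x_grid)))
print("Largest deviation from the true PDF:", max_abs_error)
```

The bandwidth is chosen automatically (Scott's rule by default); passing `bw_method` to `gaussian_kde` changes how smooth the estimate is.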