# Statistics Primer

![](https://i.stack.imgur.com/c88K3.png)

## Table of contents

- [Preparation](#1)
- [Discrete and Continuous Variables](#2)
  - PMF (Probability Mass Function)
  - PDF (Probability Density Function)
  - CDF (Cumulative Distribution Function)
- [Distributions](#3)
  - Uniform Distribution
  - Normal Distribution
  - Binomial Distribution
  - Poisson Distribution
  - Log-normal Distribution
- [Summary Statistics and Moments](#4)


## Preparation <a id="1"></a>

In [None]:
# Dependencies

# Standard Dependencies
import os
import numpy as np
import pandas as pd
from math import sqrt

# Visualization
from pylab import *
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
from statistics import median
from scipy import signal
from scipy.special import factorial  # factorial was removed from scipy.misc in newer SciPy releases
import scipy.stats as stats
from scipy.stats import sem, binom, lognorm, poisson, bernoulli, spearmanr
from scipy.fftpack import fft, fftshift

# Scikit-learn for Machine Learning models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Read in csv of Toy Dataset
# We will use this dataset throughout the tutorial
df = pd.read_csv('toy_dataset.csv')

## Discrete and Continuous Variables <a id="2"></a>

A discrete variable can only take values from a countable set. If you can list or count its possible values, the variable is discrete. An example is the outcome of rolling a die: it can only be one of 6 possible outcomes and is therefore discrete.

A continuous variable can take any value within a range, so it has uncountably many possible values. An example is length: it can be measured to an arbitrary degree of precision and is therefore continuous.

In statistics we describe the distribution of a discrete variable with a PMF (Probability Mass Function) and a CDF (Cumulative Distribution Function). We describe the distribution of a continuous variable with a PDF (Probability Density Function) and a CDF.

The PMF gives the probability P(X = x) of each possible value x of the random variable X. The PDF is its continuous counterpart: it gives a probability density, and the probability that X falls in an interval is the area under the PDF over that interval.
The CDF gives the probability that the random variable X takes a value less than or equal to x. The name CDF is used for both discrete and continuous distributions.

The formulas that define PMFs, PDFs and CDFs can look daunting at first, but their plots are quite intuitive.
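
As a small illustration of these definitions, the sketch below builds the PMF and CDF of a fair six-sided die with NumPy. The variable names (`outcomes`, `die_pmf`, `die_cdf`) are chosen for this example only.

```python
import numpy as np

# Possible outcomes of a fair six-sided die
outcomes = np.arange(1, 7)

# PMF: each outcome has probability 1/6
die_pmf = np.full(outcomes.shape, 1 / 6)

# CDF: running sum of the PMF, i.e. P(X <= x) for each outcome x
die_cdf = np.cumsum(die_pmf)

for x, p_x, c_x in zip(outcomes, die_pmf, die_cdf):
    print(f"x={x}  P(X=x)={p_x:.3f}  P(X<=x)={c_x:.3f}")
```

The PMF values sum to 1, and the CDF reaches 1 at the largest possible outcome.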

### PMF (Probability Mass Function)

Here we visualize a PMF of a binomial distribution. You can see that the possible values are all integers. For example, no values are between 50 and 51. 

The PMF of a binomial distribution in function form:

![](http://reliabilityace.com/formulas/binomial-pmf.png)

See the "[Distributions](#3)" section for more information on binomial distributions.

In [None]:
# PMF Visualization
n = 100
p = 0.5

fig, ax = plt.subplots(1, 1, figsize=(17,5))
x = np.arange(binom.ppf(0.01, n, p), binom.ppf(0.99, n, p))
ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='Binomial PMF')
ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
rv = binom(n, p)
#ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='frozen PMF')
ax.legend(loc='best', frameon=False, fontsize='xx-large')
plt.title('PMF of a binomial distribution (n=100, p=0.5)', fontsize='xx-large')
plt.show()

### PDF (Probability Density Function)

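The PDF is the continuous counterpart of the PMF. A continuous distribution has infinitely many possible values, so the PDF gives a density rather than a probability: the probability of an interval is the area under the curve over that interval. Here we visualize a standard normal distribution, which has a mean of 0 and a standard deviation of 1.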

PDF of a normal distribution in formula form:

![](https://www.mhnederlof.nl/images/normalpdf.jpg)
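
To make the difference between a density and a probability concrete, here is a minimal sketch using `scipy.stats.norm` and `scipy.integrate.quad`. The interval [-1, 1] and the narrow standard deviation of 0.1 are arbitrary choices for illustration.

```python
from scipy import stats
from scipy.integrate import quad

# Height of the standard normal PDF at x = 0: a density, not a probability
density_at_zero = stats.norm.pdf(0)

# A narrow normal distribution (sigma = 0.1) has a density above 1 at its peak
narrow_density = stats.norm.pdf(0, loc=0, scale=0.1)

# Probability of landing in [-1, 1]: the area under the PDF over that interval
area, _ = quad(stats.norm.pdf, -1, 1)

# The same probability obtained from the CDF
cdf_diff = stats.norm.cdf(1) - stats.norm.cdf(-1)

print(f"pdf(0) for sigma=1:   {density_at_zero:.4f}")
print(f"pdf(0) for sigma=0.1: {narrow_density:.4f}")
print(f"P(-1 <= X <= 1): {area:.4f} (integral), {cdf_diff:.4f} (CDF)")
```

A density can exceed 1, but the area under the whole curve is always 1.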

In [None]:
# Plot normal distribution
mu = 0
variance = 1
sigma = sqrt(variance)
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.figure(figsize=(16,5))
plt.plot(x, stats.norm.pdf(x, mu, sigma), label='Normal Distribution')
plt.title('Normal Distribution with mean = 0 and std = 1')
plt.legend(fontsize='xx-large')
plt.show()

### CDF (Cumulative Distribution Function)

The CDF gives the probability that a random variable X takes a value less than or equal to x, written P(X ≤ x). CDFs can be discrete or continuous. In this section we visualize the continuous case. You can see in the plot that the CDF accumulates probability from left to right and is therefore bounded: 0 ≤ P(X ≤ x) ≤ 1.

The CDF of a normal distribution as a formula:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/187f33664b79492eedf4406c66d67f9fe5f524ea)

*Note: erf means "[error function](https://en.wikipedia.org/wiki/Error_function)".*
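
The error-function form of the normal CDF can be checked numerically. The sketch below writes it with `math.erf` and compares it to `scipy.stats.norm.cdf`; the helper name `normal_cdf` is hypothetical and used only here.

```python
from math import erf, sqrt
from scipy import stats

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution written with the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Compare the hand-written formula with SciPy at a few points
for x in (-1.96, 0.0, 1.0):
    print(f"x={x:5.2f}  erf form={normal_cdf(x):.6f}  scipy={stats.norm.cdf(x):.6f}")
```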

In [None]:
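# Define X and Y
dx = 0.01
X  = np.arange(-2, 2, dx)
Y  = np.exp(-X ** 2)

# Normalize Y so that the area under the curve is 1
Y /= (dx * Y).sum()

# The running sum of the density approximates the CDF
CY = np.cumsum(Y * dx)

# Plot the PDF and CDF
plt.figure(figsize=(16, 5))
plt.plot(X, Y, label='PDF')
plt.plot(X, CY, 'r--', label='CDF')
plt.legend(fontsize='xx-large')
plt.show()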

## Distributions <a id="3"></a>

A probability distribution tells us how likely each value (or range of values) of a random variable is.

A random variable X is a function that maps the outcomes of a random experiment to real numbers.

This section covers both discrete distributions (binomial, Bernoulli and Poisson) and continuous distributions (uniform, normal and log-normal).

### Uniform Distribution

A uniform distribution is straightforward: every value in its range has an equal chance of occurring, so the sampled values show no pattern. In this example we generate random floating-point numbers between 0 and 1.

The PDF of a Uniform Distribution:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/648692e002b720347c6c981aeec2a8cca7f4182f)

CDF:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/eeeeb233753cfe775b24e3fec2f371ee8cdc63a6)
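
As a quick numerical check of the uniform PDF and CDF, the sketch below evaluates them with `scipy.stats.uniform`. The interval bounds `a = 2` and `b = 5` are arbitrary values chosen for this example.

```python
from scipy.stats import uniform

a, b = 2.0, 5.0  # illustrative interval bounds

# SciPy parameterizes the uniform distribution by loc = a and scale = b - a
dist = uniform(loc=a, scale=b - a)

# Inside [a, b] the PDF is 1 / (b - a); outside it is 0.
# The CDF rises linearly from 0 at a to 1 at b.
for x in (1.0, 3.0, 6.0):
    print(f"x={x}: pdf={dist.pdf(x):.4f}, cdf={dist.cdf(x):.4f}")
```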

In [None]:
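# Define a variable uniform_dist
# Set uniform_dist to 1000 random values between 0 and 1
uniform_dist = np.random.random(1000)

In [None]:
# Plot uniform_dist as a scatter plot
plt.figure(figsize=(18,5))
plt.scatter(np.arange(len(uniform_dist)), uniform_dist)
plt.title('Scatterplot of 1000 uniform random values', fontsize='xx-large')
plt.show()

In [None]:
# Histogram with a kernel density estimate
# (sns.distplot is deprecated in recent seaborn releases; histplot replaces it)
plt.figure(figsize=(18,5))
sns.histplot(uniform_dist, kde=True)
plt.title('Random/Uniform distribution', fontsize='xx-large')
plt.show()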

### Normal Distribution

A normal distribution (also called a Gaussian distribution or bell curve) is very common and convenient. This is mainly because of the [Central Limit Theorem (CLT)](https://en.wikipedia.org/wiki/Central_limit_theorem), which states that the sum (or average) of a large number of independent, identically distributed random variables with finite variance (like coin flips) tends towards a normal distribution.
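
The Central Limit Theorem can be seen in a small simulation. The sketch below repeats a coin-flip experiment many times and compares the spread of the observed fraction of heads with the value the CLT predicts. The seed and the sizes `n_flips` and `n_experiments` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_flips = 500         # coin flips per experiment
n_experiments = 2000  # number of repeated experiments

# Each row is one experiment of fair coin flips (0 = tails, 1 = heads)
flips = rng.integers(0, 2, size=(n_experiments, n_flips))

# Fraction of heads in each experiment
heads_fraction = flips.mean(axis=1)

# CLT prediction for the fraction of heads: mean p, std sqrt(p * (1 - p) / n)
p = 0.5
predicted_std = np.sqrt(p * (1 - p) / n_flips)

print(f"observed mean: {heads_fraction.mean():.4f} (expected {p})")
print(f"observed std:  {heads_fraction.std():.4f} (predicted {predicted_std:.4f})")
```

A histogram of `heads_fraction` would show the familiar bell shape centered on 0.5.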

PDF of a normal distribution:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/2ce7e315b02666699e0cd8ea5fb1a3e0c287cd9d)

CDF:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/187f33664b79492eedf4406c66d67f9fe5f524ea)


In [None]:
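# Generate Normal Distribution (10000 draws from a standard normal)
normal_dist = np.random.randn(10000)

# Create a Pandas Series for easy sample function
normal_dist = pd.Series(normal_dist)

In [None]:
# Scatterplot
plt.figure(figsize=(18,5))
plt.scatter(normal_dist.index, normal_dist)
plt.title('Scatterplot of a normal distribution', fontsize='xx-large')
plt.show()

In [None]:
# Normal Distribution as a Bell Curve
plt.figure(figsize=(18,5))
sns.histplot(normal_dist, kde=True)
plt.title('Normal distribution (n=10000)', fontsize='xx-large')
plt.show()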

### Binomial Distribution

A Binomial Distribution has a countable number of outcomes and is therefore discrete.

Binomial distributions must meet the following three criteria:

1. The number of observations or trials is fixed. In other words, you can only figure out the probability of something happening if you do it a certain number of times.
2. Each observation or trial is independent. In other words, none of your trials have an effect on the probability of the next trial.
3. The probability of success is exactly the same from one trial to another.

An intuitive example of a binomial distribution is flipping a coin 10 times. If we have a fair coin, our chance of getting heads (p) is 0.50. Now we flip the coin 10 times and count how many times it comes up heads. The most likely single outcome is 5 heads (a probability of about 0.246), but there is also a small chance (about 0.0098) that we get heads 9 times. The PMF of a binomial distribution gives these probabilities if we say N = 10 and p = 0.5. Each individual flip is coded as 1 for heads and 0 for tails, and x is the total number of heads.
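
The coin-flip probabilities can be computed directly with `scipy.stats.binom`. This is a minimal sketch; the variable names are chosen for this example only.

```python
from scipy.stats import binom

n_flips = 10   # number of coin flips (N)
p_heads = 0.5  # probability of heads on each flip

# Probability of exactly 5 heads and of exactly 9 heads
p_five = binom.pmf(5, n_flips, p_heads)
p_nine = binom.pmf(9, n_flips, p_heads)

# Probability of at most 5 heads, taken from the CDF
p_at_most_five = binom.cdf(5, n_flips, p_heads)

print(f"P(X = 5)  = {p_five:.4f}")
print(f"P(X = 9)  = {p_nine:.4f}")
print(f"P(X <= 5) = {p_at_most_five:.4f}")
```

Five heads is the single most likely count, yet it occurs in only about a quarter of all 10-flip experiments.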

PMF:

![](http://reliabilityace.com/formulas/binomial-pmf.png)

CDF:

![](http://reliabilityace.com/formulas/binomial-cpf.png)


A **Bernoulli Distribution** is a special case of a Binomial Distribution with a single trial (n = 1).

All values in a Bernoulli Distribution are either 0 or 1.

For example, if we take an unfair coin which lands on heads 60% of the time, we can describe the Bernoulli distribution as follows:

p (chance of heads) = 0.6

1 - p (chance of tails) = 0.4

heads = 1

tails = 0

Formally, we can describe a Bernoulli distribution with the following PMF (Probability Mass Function):

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/a9207475ab305d280d2958f5c259f996415548e9)
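
The relationship between the two distributions can be verified numerically. The sketch below checks that the Bernoulli PMF matches a binomial PMF with one trial, using the unfair coin with p = 0.6.

```python
from scipy.stats import bernoulli, binom

p = 0.6  # chance of heads (outcome 1)

# A Bernoulli PMF equals a binomial PMF with a single trial (n = 1)
for k in (0, 1):
    print(f"k={k}: bernoulli={bernoulli.pmf(k, p):.2f}, binomial(n=1)={binom.pmf(k, 1, p):.2f}")

# Mean and variance of a Bernoulli distribution are p and p * (1 - p)
mean, var = bernoulli.stats(p, moments='mv')
print(f"mean={float(mean):.2f}, variance={float(var):.2f}")
```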


In [None]:
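# Chance of heads (outcome 1)
p = 0.6

# Create Bernoulli samples: 1000 draws with probability p of outcome 1, using rvs
bern_dist = bernoulli.rvs(p, size=1000)

In [None]:
# Plot Distribution: count how often each outcome (0 or 1) occurs
plt.figure(figsize=(18,5))
pd.Series(bern_dist).value_counts().sort_index().plot(kind='bar', rot=0)
plt.title('Bernoulli distribution: p = 0.6, n = 1000', fontsize='xx-large')
plt.show()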

### Poisson Distribution

The Poisson distribution is a discrete distribution and is popular for modelling the number of times an event occurs in an interval of time or space. 

It takes a single parameter lambda, which is equal to both the mean and the variance of the distribution.

PMF: 

![](https://study.com/cimages/multimages/16/poisson1a.jpg)

CDF:

![](http://www.jennessent.com/images/cdf_poisson.gif)
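
As a check on the Poisson PMF, the sketch below writes the standard formula λ^k · e^(−λ) / k! by hand and compares it with `scipy.stats.poisson`. The rate `lam = 4` and the helper name `poisson_pmf` are assumptions made for this example.

```python
from math import exp, factorial
from scipy.stats import poisson

lam = 4.0  # illustrative rate: average number of events per interval

def poisson_pmf(k, lam):
    """Poisson PMF written out directly: lam^k * exp(-lam) / k!"""
    return lam ** k * exp(-lam) / factorial(k)

# Compare the hand-written formula with SciPy for a few counts
for k in (0, 3, 8):
    print(f"k={k}: formula={poisson_pmf(k, lam):.6f}, scipy={poisson.pmf(k, lam):.6f}")

# Both the mean and the variance equal lambda
mean, var = poisson.stats(lam, moments='mv')
print(f"mean={float(mean)}, variance={float(var)}")
```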

In [None]:
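# Define x and y (lambda = 4 is an arbitrary choice for illustration)
lam = 4
x = np.arange(0, 15)
y = poisson.pmf(x, lam)

# Plot the Poisson distribution
plt.bar(x, y)
plt.title('Poisson distribution (lambda = 4)', fontsize='xx-large')
plt.show()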

### Log-Normal Distribution

A log-normal distribution is continuous. Its defining characteristic is that the logarithm of a log-normally distributed variable is normally distributed. It is also referred to as Galton's distribution.

PDF: 

![](https://www.mhnederlof.nl/images/lognormaldensity.jpg)

CDF:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/29095d9cbd6539833d549c59149b9fc5bd06339b)

Where Phi is the CDF of the standard normal distribution.
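
The link between the log-normal and normal distributions can be checked with `scipy.stats.lognorm`. Note that SciPy parameterizes it with `s` (the standard deviation of the underlying normal) and `scale = exp(mu)`. The values of `mu` and `sigma` below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 0.5, 0.75  # mean and std of the underlying normal (illustrative)

# SciPy parameterization: s = sigma, scale = exp(mu)
dist = lognorm(s=sigma, scale=np.exp(mu))

# Taking the log of log-normal samples should give normal samples
samples = dist.rvs(size=100_000, random_state=0)
log_samples = np.log(samples)
print(f"mean of log(samples): {log_samples.mean():.3f} (mu = {mu})")
print(f"std of log(samples):  {log_samples.std():.3f} (sigma = {sigma})")

# The CDF equals the standard normal CDF evaluated at (ln x - mu) / sigma
x = 2.0
cdf_from_phi = norm.cdf((np.log(x) - mu) / sigma)
print(f"lognorm cdf: {dist.cdf(x):.6f}, via Phi: {cdf_from_phi:.6f}")
```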

In [None]:
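# Specify standard deviation and mean of the underlying normal (illustrative values)
sigma = 0.5
mu = 0

# Create log-normal distribution (SciPy uses s = sigma and scale = exp(mu))
dist = lognorm(s=sigma, scale=np.exp(mu))

# Visualize log-normal distribution
x = np.linspace(0.01, 5, 200)
plt.plot(x, dist.pdf(x), label='Log-normal PDF')
plt.legend(fontsize='xx-large')
plt.show()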

## Summary Statistics and Moments <a id="4"></a>

**Mean, Median and Mode** 

Note: The mean is also called the first moment.


![](https://qph.fs.quoracdn.net/main-qimg-29a4925034e075f16e1c743a4b3dda8b)

### Moments

A moment is a quantitative measure that says something about the shape of a distribution. There are central moments and non-central moments. This section is focused on the central moments.

The 0th central moment is the total probability and is always equal to 1.

The 1st non-central (raw) moment is the mean (expected value). The 1st central moment is always 0.

The 2nd central moment is the variance.

**Variance** = The average of the squared distances from the mean. Variance is mathematically convenient, but the standard deviation is often easier to interpret as a measure of spread because it is in the same units as the data.

![](http://www.visualmining.com/wp-content/uploads/2013/02/analytics_formula_variance.png)

**Standard Deviation** = The square root of the variance

![](http://www.visualmining.com/wp-content/uploads/2013/02/analytics_formula_std_dev.png)

The 3rd central moment, standardized by dividing by the cube of the standard deviation, is the skewness.

**Skewness** = A measure of the asymmetry of a distribution, contrasting one tail with the other. For example, if the tail towards high values is longer than the tail towards low values, the distribution is positively skewed (skewed to the right).

![](http://www.visualmining.com/wp-content/uploads/2013/02/analytics_formula_skewness.png)

The 4th central moment, standardized by dividing by the 4th power of the standard deviation, is the kurtosis.

**Kurtosis** = A measure of how 'fat' the tails of the distribution are.

![](http://www.visualmining.com/wp-content/uploads/2013/02/analytics_formula_kurtosis.png)

The higher the moment, the harder it is to estimate with samples. Larger samples are required in order to obtain good estimates.
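
To show how these quantities follow from the central moments, here is a minimal sketch that computes them by hand and compares the results with `scipy.stats`. The exponential sample is an arbitrary right-skewed example, and `central_moment` is a hypothetical helper defined only for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=50_000)  # right-skewed sample

def central_moment(x, k):
    """k-th central moment: the average of (x - mean)^k."""
    return np.mean((x - np.mean(x)) ** k)

variance = central_moment(sample, 2)

# Skewness and kurtosis are the 3rd and 4th central moments,
# standardized by the matching power of the standard deviation
skewness = central_moment(sample, 3) / variance ** 1.5
excess_kurtosis = central_moment(sample, 4) / variance ** 2 - 3  # Fisher's definition

print(f"variance: {variance:.4f} (np.var: {np.var(sample):.4f})")
print(f"skewness: {skewness:.4f} (scipy: {stats.skew(sample):.4f})")
print(f"excess kurtosis: {excess_kurtosis:.4f} (scipy: {stats.kurtosis(sample):.4f})")
```

These are the biased (population) estimators, which match the SciPy defaults. The pandas `skew` and `kurtosis` methods apply a bias correction, so their results differ slightly.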

In [None]:
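# Summary
# Illustrative sample: 10000 draws from a standard normal distribution
sample = pd.Series(np.random.randn(10000))
print('Summary Statistics for a normal distribution: ')
# Median
print('Median: ', sample.median())

# Standard Deviation
print('Standard Deviation: ', sample.std())

# Mean
print('Mean: ', sample.mean())

# Variance
print('Variance: ', sample.var())

# Return unbiased skew normalized by N-1
print('Skewness: ', sample.skew())

# Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis
# (kurtosis of normal == 0.0) normalized by N-1
print('Kurtosis: ', sample.kurtosis())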

**The End**