# Probability Distribution - Discrete
---
- Author: Diego Inácio
- GitHub: [github.com/diegoinacio](https://github.com/diegoinacio)
- Notebook: [distributions.ipynb](https://github.com/diegoinacio/data-science-notebooks/blob/master/Probability-and-Statistics/distributions.ipynb)
---
Brief overview of *discrete probability distributions*.

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, HTML

import numpy as np
import pandas as pd

In [None]:
np.seterr(divide='ignore')
plt.rcParams['figure.figsize'] = (16, 8)

## Introduction
---
A **discrete probability distribution** is the probability distribution of a random variable that takes a countable number of values, for example a finite list of values or the non-negative integers ($k = 0, 1, 2, \dots$). Thus, a **discrete random variable** is a random variable whose probability distribution is discrete.

The [probability mass function](https://en.wikipedia.org/wiki/Probability_mass_function) gives the probability that a discrete random variable $X$ takes the value $k$, and can be expressed as:

$$ \large
f(k)=Pr(X = k)
$$
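
As a small illustration that is not part of the original notebook, the sketch below builds the probability mass function of a fair six-sided die with exact fractions and checks the two properties any such function should satisfy: no negative values, and a total of 1. The name `die_pmf` is an assumption made only for this example.

```python
from fractions import Fraction

# Hypothetical example: PMF of a fair six-sided die, f(k) = 1/6 for k = 1..6
die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Every probability is non-negative and the probabilities sum to exactly 1
assert all(p >= 0 for p in die_pmf.values())
assert sum(die_pmf.values()) == 1

# Pr(X = 3)
print(die_pmf[3])
```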

### Cumulative distribution functions
---
A **cumulative distribution function** gives the probability of a random observation being less than or equal to a certain value. It can be expressed as:

$$ \large
F(k)=Pr(X \leq k)=\sum_{i \leq k}Pr(X = i)
$$
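
The sum above can be sketched as a running total over the probability mass function. The example below assumes a fair six-sided die (an illustrative choice, not taken from the notebook) and uses `itertools.accumulate` to build the cumulative values; the names `support`, `pmf` and `cdf` are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate

# Hypothetical example: fair six-sided die with values 1..6
support = list(range(1, 7))
pmf = [Fraction(1, 6)] * len(support)

# F(k) is the running sum of Pr(X = i) over all i <= k
cdf = dict(zip(support, accumulate(pmf)))

# Pr(X <= 3), and the total at the largest value
print(cdf[3], cdf[6])
```

Because the probabilities are non-negative, the cumulative values never decrease, and the last one equals 1.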

## Uniform distribution
---
A **discrete uniform distribution** is characterized by [symmetry](https://en.wikipedia.org/wiki/Symmetric_probability_distribution) and a finite number of values that are equally likely to be observed. In other words, each of the $\large n$ values has the same probability $\large\frac{1}{n}$ of occurring. It is convenient to represent its values by all the integers $k$ in an interval $[a,b]$, so that $n = b - a + 1$ and $a$ and $b$ are the main parameters of the distribution, expressed as:

$$ \large
f(k;a,b)=Pr(X = k)=\frac{1}{b - a + 1}
$$
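
One way to make the formula tangible is a quick simulation. The sketch below is an illustration only: it assumes the interval $[10, 35]$, a fixed seed and 100,000 draws from Python's standard `random` module, then compares the observed frequency of each integer with $\frac{1}{b - a + 1}$. All names are hypothetical.

```python
import random
from collections import Counter

a, b = 10, 35
n = b - a + 1  # number of equally likely values

# Draw integers uniformly from [a, b]; randint includes both endpoints
rng = random.Random(0)
samples = [rng.randint(a, b) for _ in range(100_000)]
counts = Counter(samples)

# Observed relative frequency of every k in [a, b]
freqs = {k: counts[k] / len(samples) for k in range(a, b + 1)}

# Largest gap between an observed frequency and the theoretical 1/n
max_dev = max(abs(f - 1 / n) for f in freqs.values())
print(n, 1 / n, max_dev)
```

The deviation shrinks as the number of draws grows, which is what the flat bars in the plots below express.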

In [None]:
def uniformDistribution(k, a, b):
    den = b - a + 1
    return 1/den

In [None]:
ud = uniformDistribution

# Distribution A
a, b = 10, 35
RANGE = np.arange(a, b + 1)
n = RANGE.size

plt.bar(
    RANGE, [ud(k, a, b) for k in RANGE], 
    alpha=0.5, 
    label=f'$n={n} \\quad|\\quad ab:[{a},{b}]$'
)

# Distribution B
a, b = 25, 60
RANGE = np.arange(a, b + 1)
n = RANGE.size

plt.bar(
    RANGE, [ud(k, a, b) for k in RANGE], 
    alpha=0.5, 
    label=f'$n={n} \\quad|\\quad ab:[{a},{b}]$'
)

# Distribution C
a, b = 45, 65
RANGE = np.arange(a, b + 1)
n = RANGE.size

plt.bar(
    RANGE, [ud(k, a, b) for k in RANGE], 
    alpha=0.5, 
    label=f'$n={n} \\quad|\\quad ab:[{a},{b}]$'
)

# Visualization
plt.xlabel('k')
plt.ylabel('Probability')
plt.legend()
plt.show()

### Cumulative uniform distribution
---
The *cumulative distribution function* of the *discrete uniform distribution* can be expressed, for any $k \in [a,b]$, as:

$$ \large
F(k;a,b)=Pr(X \leq k)=\frac{\lfloor{k}\rfloor - a + 1}{b - a + 1}
$$
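
As a sanity check, the closed form can be compared with a running sum of the constant probability $\frac{1}{b - a + 1}$. This is a minimal sketch with exact fractions, assuming the interval $[10, 35]$; `uniform_cdf` is a hypothetical helper, not the notebook's own function.

```python
from fractions import Fraction
from math import floor

def uniform_cdf(k, a, b):
    # Closed form (floor(k) - a + 1) / (b - a + 1), valid for a <= k <= b
    return Fraction(floor(k) - a + 1, b - a + 1)

a, b = 10, 35
step = Fraction(1, b - a + 1)  # Pr(X = k) for every k in [a, b]

# The closed form agrees with the accumulated probabilities at every integer
running = Fraction(0)
for k in range(a, b + 1):
    running += step
    assert uniform_cdf(k, a, b) == running

# A non-integer k is floored first, so 12.7 counts the values 10, 11 and 12
print(uniform_cdf(12.7, a, b))
```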

In [None]:
def cumulativeUniformDistribution(k, a, b):
    k = np.floor(k)
    num = k - a + 1
    den = b - a + 1
    return num/den

In [None]:
cud = cumulativeUniformDistribution

# Cumulative distribution A
a, b = 10, 35
RANGE = np.arange(a, b + 1)
n = RANGE.size

plt.bar(
    RANGE, [cud(k, a, b) for k in RANGE], 
    alpha=0.5, 
    label=f'$n={n};ab:[{a},{b}]$'
)

# Cumulative distribution B
a, b = 25, 60
RANGE = np.arange(a, b + 1)
n = RANGE.size

plt.bar(
    RANGE, [cud(k, a, b) for k in RANGE], 
    alpha=0.5, 
    label=f'$n={n};ab:[{a},{b}]$'
)

# Cumulative distribution C
a, b = 45, 65
RANGE = np.arange(a, b + 1)
n = RANGE.size

plt.bar(
    RANGE, [cud(k, a, b) for k in RANGE], 
    alpha=0.5, 
    label=f'$n={n};ab:[{a},{b}]$'
)

# Visualization
plt.xlabel('k')
plt.ylabel('Probability')
plt.legend()
plt.show()

## Binomial distribution
---
A **binomial distribution** gives the probability of getting exactly *k successes* in *n* independent [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution) [trials](https://en.wikipedia.org/wiki/Bernoulli_trial). In other words, it is a discrete probability distribution over the number of successes when each trial has only two possible outcomes: success (with probability $p$) and failure (with probability $1 - p$).

$$ \large
f(k;n,p)=Pr(X=k)={n \choose k}p^k (1-p)^{ n-k}=\frac{n!}{k!(n-k)!}p^k (1-p)^{ n-k}
$$

where:
- ${n \choose k}$ is the **binomial coefficient**;
- $k$ is the number of successes;
- $n$ is the total number of trials;
- $p$ is the probability of success in each trial.
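
For a hand-checkable case, the sketch below evaluates the formula for $n = 4$ fair coin flips ($p = 0.5$), where the probabilities are $\frac{1}{16}, \frac{4}{16}, \frac{6}{16}, \frac{4}{16}, \frac{1}{16}$. It uses `math.comb` from the standard library (Python 3.8 or later) for the binomial coefficient; `binomial_pmf` is a hypothetical name chosen for this example.

```python
from math import comb, isclose

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n independent trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 4, 0.5
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The probabilities over k = 0..n sum to 1
assert isclose(sum(probs), 1.0)

# Pr(X = 2) = 6/16
print(probs[2])
```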

In [None]:
def factorial(n):
    # Factorial of n (cast to a Python int so large values cannot overflow)
    n = int(n)
    if n <= 1:
        return 1
    return n*factorial(n - 1)

def binomial(k, n):
    # Binomial coefficient "n choose k"
    k, n = int(k), int(n)
    n_ = factorial(n)
    k_ = factorial(k)
    nk_ = factorial(n - k)
    return n_//(k_*nk_)

def binomialDistribution(k, n, p):
    # Binomial distribution: probability of exactly k successes in n trials
    k, n = int(k), int(n)
    B = binomial(k, n)
    return B*p**k*(1 - p)**(n - k)

In [None]:
bd = binomialDistribution

n = 50
RANGE = np.arange(n + 1, dtype=np.uint64)  # k = 0..n, including k = n

# Distributions A, B and C
plt.bar(RANGE, [bd(k, n, 0.15) for k in RANGE], alpha=0.5, label='p=0.15')
plt.bar(RANGE, [bd(k, n, 0.50) for k in RANGE], alpha=0.5, label='p=0.50')
plt.bar(RANGE, [bd(k, n, 0.75) for k in RANGE], alpha=0.5, label='p=0.75')

# Visualization
plt.xlabel('k successes')
plt.ylabel('Probability')
plt.legend()
plt.show()

### Cumulative binomial distribution
---
The *cumulative distribution function* of the *binomial distribution* can be expressed as:

$$ \large
F(k;n,p)=Pr(X \leq k)=\sum_{i=0}^{k}{n \choose i}p^i (1-p)^{n-i}=\sum_{i=0}^{k}\frac{n!}{i!(n-i)!}p^i (1-p)^{n-i}
$$
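
A common use of this sum is answering "at most" and "at least" questions. The sketch below assumes $n = 4$ fair coin flips and computes $Pr(X \leq 1)$ directly, then $Pr(X \geq 2)$ through the complement rule; `binomial_cdf` is a hypothetical helper built on `math.comb`.

```python
from math import comb

def binomial_cdf(k, n, p):
    # Sum of the binomial probabilities for i = 0..k
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p = 4, 0.5

# Pr(X <= 1) = (1 + 4) / 16
at_most_one = binomial_cdf(1, n, p)

# Pr(X >= 2) = 1 - Pr(X <= 1), by the complement rule
at_least_two = 1 - at_most_one

print(at_most_one, at_least_two)
```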

In [None]:
def cumulativeBinomialDistribution(k, n, p):
    # Cumulative binomial distribution function
    K = np.arange(int(k) + 1)  # i = 0..k as plain integers
    B = np.array([binomial(i, n) for i in K])
    return np.sum(B*p**K*(1 - p)**(n - K))

In [None]:
cbd = cumulativeBinomialDistribution

n = 50
RANGE = np.arange(n + 1, dtype=np.uint64)  # k = 0..n, so the last bar reaches 1

# Cumulative distributions A, B and C
plt.bar(RANGE, [cbd(k, n, 0.15) for k in RANGE], alpha=0.5, label='p=0.15')
plt.bar(RANGE, [cbd(k, n, 0.50) for k in RANGE], alpha=0.5, label='p=0.50')
plt.bar(RANGE, [cbd(k, n, 0.75) for k in RANGE], alpha=0.5, label='p=0.75')

# Visualization
plt.xlabel('k successes')
plt.ylabel('Probability')
plt.legend()
plt.show()

## Poisson distribution
---
The **Poisson distribution** expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur at a constant mean rate and independently of the time since the last event.

For example, during a controlled walk your heart beats at an average of *120 beats per minute*. Assume the heartbeats are independent, which means that when a heartbeat occurs it does not change the probability of when the next one will happen. The number of heartbeats in an interval of 10 seconds then has a Poisson distribution with mean $120 \times \frac{10}{60} = 20$. This means that in 10 seconds you are most likely to have 19 or 20 heartbeats; 18 and 21 are also likely, but with smaller probability.

$$ \large
f(k;\lambda)=Pr(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}
$$

where:
- $k$ is the number of occurrences ($k = 0, 1, 2, \dots$);
- $\lambda$ is a positive real number, equal to the expected number of occurrences during the given interval;
- $\large e$ is *Euler's number*;
- $k!$ is the factorial of $k$.
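
The formula can be evaluated without forming the large intermediate values $\lambda^k$ and $k!$ by working with logarithms. The sketch below is one such variant, not the notebook's own implementation: it uses `math.lgamma`, relying on $\log k! = \log\Gamma(k + 1)$, and assumes $\lambda = 10$ for illustration. The name `poisson_pmf` is hypothetical.

```python
from math import exp, log, lgamma, isclose

def poisson_pmf(k, lam):
    # exp(k*log(lam) - lam - log(k!)), with log(k!) computed as lgamma(k + 1)
    return exp(k * log(lam) - lam - lgamma(k + 1))

lam = 10
p8, p9, p10, p11 = (poisson_pmf(k, lam) for k in (8, 9, 10, 11))

# For an integer lam, k = lam - 1 and k = lam are equally likely,
# because f(lam) = f(lam - 1) * lam / lam
assert isclose(p9, p10, rel_tol=1e-9)

# Neighbouring values on either side are less likely
assert p8 < p9 and p11 < p10

print(p9, p10)
```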

In [None]:
def poissonDistribution(k, l):
    # Poisson distribution
    k = int(k)  # plain Python int, so l**k and k! cannot overflow
    num = float(l)**k*np.exp(-l)
    den = float(factorial(k))
    return num/den

In [None]:
# Alias named psd so that the pandas alias pd (imported above) is not shadowed
psd = poissonDistribution

n = 50
RANGE = np.arange(n, dtype=np.uint64)

# Distributions A, B and C
plt.bar(RANGE, [psd(k, 10) for k in RANGE], alpha=0.5, label=r'$\lambda=10$')
plt.bar(RANGE, [psd(k, 15) for k in RANGE], alpha=0.5, label=r'$\lambda=15$')
plt.bar(RANGE, [psd(k, 30) for k in RANGE], alpha=0.5, label=r'$\lambda=30$')

# Visualization
plt.xlabel('k')
plt.ylabel('Probability')
plt.legend()
plt.show()

### Cumulative Poisson distribution
---
The *cumulative distribution function* of the *Poisson distribution* can be expressed as:

$$ \large
F(k;\lambda)=Pr(X \leq k)=\sum_{i=0}^{k}\frac{\lambda^i e^{-\lambda}}{i!}
$$
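
Since consecutive terms of the sum differ only by a factor $\frac{\lambda}{i}$, the cumulative value can also be built incrementally. The sketch below shows this recurrence as an alternative to computing each power and factorial separately; it assumes $\lambda = 10$, and `poisson_cdf` is a hypothetical name.

```python
from math import exp, isclose

def poisson_cdf(k, lam):
    # Start from Pr(X = 0) = exp(-lam)
    term = exp(-lam)
    total = term
    for i in range(1, k + 1):
        # Pr(X = i) = Pr(X = i - 1) * lam / i
        term *= lam / i
        total += term
    return total

lam = 10
values = [poisson_cdf(k, lam) for k in range(50)]

# The cumulative values never decrease and approach 1 for large k
assert all(x <= y for x, y in zip(values, values[1:]))
assert isclose(values[-1], 1.0, rel_tol=1e-9)

print(values[0], values[10])
```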

In [None]:
def cumulativePoissonDistribution(k, l):
    # Cumulative Poisson distribution function
    K = range(int(k) + 1)  # i = 0..k as plain Python ints
    Li = np.array([float(l)**i for i in K])
    I_ = np.array([float(factorial(i)) for i in K])
    return np.sum(Li*np.exp(-l)/I_)

In [None]:
cpd = cumulativePoissonDistribution

n = 50
RANGE = np.arange(n, dtype=np.uint64)

# Cumulative distributions A, B and C
plt.bar(RANGE, [cpd(k, 10) for k in RANGE], alpha=0.5, label=r'$\lambda=10$')
plt.bar(RANGE, [cpd(k, 15) for k in RANGE], alpha=0.5, label=r'$\lambda=15$')
plt.bar(RANGE, [cpd(k, 30) for k in RANGE], alpha=0.5, label=r'$\lambda=30$')

# Visualization
plt.xlabel('k')
plt.ylabel('Probability')
plt.legend()
plt.show()