![QMUL](Images/QMUL-logo.jpg)

# Statistics for Biologists


## Probability distributions

There are a number of fundamental probability distributions that should be studied and understood in detail.
It is also useful to know which __R__ functions evaluate or sample from each of them.

- __Discrete__ probability distributions include: the uniform discrete, Bernoulli,
binomial, geometric and Poisson distributions.

- Univariate __continuous__ probability distributions include: the uniform continuous, exponential, Normal, Chi-squared ($\chi^2$), log-normal, Gamma.

For all these distributions, we can calculate notable quantities, including _expectation_ and _variance_.

The __expected value__ of a continuous random variable $X$ is
\begin{equation*}
E[X] = \int_{-\infty}^{\infty} x f_X(x) dx
\end{equation*}

What is the expected value of a discrete random variable?

The __variance__ of a random variable is
\begin{equation*}
Var[X] = E[(X-E[X])^2] = E[X^2] - (E[X])^2
\end{equation*}
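
The two forms of the variance can be compared numerically. The sketch below is only an illustration: it assumes a continuous uniform distribution on $[0,1]$ (whose variance is $1/12$), an arbitrary seed and an arbitrary sample size, and it replaces the expectations with sample averages.

```r
# Check Var[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2 on simulated data.
set.seed(1)                            # arbitrary seed, for reproducibility only
x <- runif(n = 1e5, min = 0, max = 1)  # assumed example: X ~ U(0, 1)

var_centred  <- mean((x - mean(x))^2)  # sample version of E[(X - E[X])^2]
var_shortcut <- mean(x^2) - mean(x)^2  # sample version of E[X^2] - (E[X])^2

var_centred
var_shortcut
1 / 12                                 # theoretical variance of U(0, 1)
```

The two sample quantities agree up to floating-point error, and both should be close to the theoretical value.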

### Discrete uniform distribution

Assume that, at a diallelic single nucleotide polymorphism in a DNA sequence alignment of 6 haploid samples, each possible allele count $X$ has the same probability of occurring, such that
$P(X=i)=\frac{1}{6}$ for all $i \in \{1,2,3,4,5,6\}$. 

We say that $X$ follows a _discrete uniform distribution_ on $S=\{1,2,3,4,5,6\}$. 

The discrete uniform distribution is a probability distribution where a finite number of values are equally likely to be observed: every one of $n$ values has equal probability of $\frac{1}{n}$.
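
As a rough illustration, values from a discrete uniform distribution can be simulated with `sample`, which draws each value with equal probability by default. The seed and the number of draws below are arbitrary choices, not taken from the text.

```r
# Simulate draws from a discrete uniform distribution on {1, ..., 6}
set.seed(42)                                      # arbitrary seed
draws <- sample(1:6, size = 6e4, replace = TRUE)  # equal probabilities by default

freq <- table(draws) / length(draws)  # observed relative frequencies
freq                                  # each should be close to 1/6
mean(draws)                           # compare with (1 + 6) / 2 = 3.5
```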

What is the pmf and cdf of a discrete uniform distribution?

In [None]:
pmf_dunif<-function(x, min=0, max=1) ifelse(x>=min & x<=max & round(x)==x, 1/(max-min+1), 0)
cdf_dunif<-function(q, min=0, max=1) ifelse(q<min, 0, ifelse(q>=max, 1, (floor(q)-min+1)/(max-min+1)))

In [None]:
plot(x=1:6, y=pmf_dunif(1:6, min=1, max=6), ylab="f_X", xlab="X", pch=16, cex=1.5)

In [None]:
plot(x=1:6, y=cdf_dunif(1:6, min=1, max=6), type="s", ylab="F_X", xlab="X", ylim=c(0,1), lwd=1.5)

What is the probability of $X$ being equal to 2? What is the probability of $X$ being 2 or less?

In [None]:
pmf_dunif(x=2, min=1, max=6) 
cdf_dunif(q=2, min=1, max=6) 

### Bernoulli distribution

A _Bernoulli trial_ is an experiment that has two outcomes. The sample space is $S=\{success, fail\}$ or $S=\{1, 0\}$.

If $X$ is a random variable following a Bernoulli distribution, then

$P(X=1)=p$ and $P(X=0)=1-p$

A classical example of a Bernoulli experiment is a single toss of a coin. The coin might come up heads with a probability $p$ and tails with a probability $1-p$.

The experiment is called _fair_ if $p=0.5$.

The expected value of a Bernoulli random variable $X$ is

$E[X]=1 \times p + 0 \times (1-p) = p$

The variance of a Bernoulli random variable $X$ is

$Var[X] = E[(X-E[X])^2] = (1-p)^2 \times p + (0-p)^2 \times (1-p) = p(1-p)$ 

In [None]:
# pmf: Bernoulli trial
plot(x=0:1, y=dbinom(x=0:1, size=1, prob=0.9), ylab="f_X", xlab="X", pch=16, cex=1.5)

In [None]:
# expected value and variance (with a numerical approach)
samples <- rbinom(n=1e6, size=1, prob=0.9)
mean(samples)
var(samples)

### Binomial distribution

The probability of obtaining $k$ successes out of $n$ Bernoulli trials is

$P(Y=k) = \frac{n!}{k!(n-k)!}p^k(1-p)^{n-k} $

The random variable $Y$ follows a _Binomial distribution_ with parameters $n$ and $p$, denoted by $B(n,p)$.

$E[Y]=np$

$Var[Y]=np(1-p)$

The variance is a measure of spread and it increases with $n$ and decreases as $p$ approaches $0$ or $1$. For a given $n$, the variance is maximised when $p=0.5$.
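
These formulas can be checked directly in __R__. The sketch below uses hypothetical values $n=10$ and $p=0.3$ (not taken from the text), compares the explicit formula with `dbinom`, computes the mean and variance from the pmf, and finds the value of $p$ that maximises $np(1-p)$ on a grid.

```r
# Hypothetical example values, chosen only for illustration
n <- 10; p <- 0.3; k <- 0:n

pmf_formula <- choose(n, k) * p^k * (1 - p)^(n - k)  # explicit formula
pmf_builtin <- dbinom(x = k, size = n, prob = p)     # built-in pmf
all.equal(pmf_formula, pmf_builtin)

# Expected value and variance computed from the pmf
mean_Y <- sum(k * pmf_builtin)
var_Y  <- sum((k - mean_Y)^2 * pmf_builtin)
c(from_pmf = mean_Y, formula = n * p)
c(from_pmf = var_Y,  formula = n * p * (1 - p))

# Variance as a function of p for fixed n, evaluated on a grid
p_grid <- seq(0, 1, 0.01)
p_max  <- p_grid[which.max(n * p_grid * (1 - p_grid))]
p_max
```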

### Exercise 

Colour blindness (colour vision deficiency, or CVD) affects approximately 1 in 12 men in the global population. If we take a random sample of 80 men, what is the probability that exactly 10 of them will be affected by this condition?

Define the random variable and plot its probability mass function. What is the sample space of the random variable? What is the expected value and variance of its probability distribution?

In __R__, `dbinom` calculates the pmf of a Binomial distribution while `choose` calculates binomial coefficients.

### Exponential distribution

If $X$ is an exponential random variable then its density function is given by

$f_X(x) = \lambda e^{-\lambda x}$ if $x \geq 0$, and $f_X(x) = 0$ otherwise, where $\lambda > 0$ is the rate parameter.

$E[X] = 1 / \lambda$

$Var[X] = 1 / \lambda^2$
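
One way to check these two results is to apply the definitions of expectation and variance with numerical integration. This is a minimal sketch, assuming an example rate of $\lambda = 0.7$ and using the base __R__ function `integrate`.

```r
rate <- 0.7   # example value; any positive rate could be used

# E[X]: integrate x * f_X(x) over [0, Inf)
m1 <- integrate(function(x) x * dexp(x, rate = rate),
                lower = 0, upper = Inf)$value
# E[X^2]: integrate x^2 * f_X(x) over [0, Inf)
m2 <- integrate(function(x) x^2 * dexp(x, rate = rate),
                lower = 0, upper = Inf)$value

c(numerical = m1,        theoretical = 1 / rate)    # expected value
c(numerical = m2 - m1^2, theoretical = 1 / rate^2)  # variance
```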

In [None]:
# pdf: Exponential distribution
x = seq(0,5,0.001)
plot(x=x, y=dexp(x=x, rate=0.7), type="l", ylab="f_X", xlab="X", lwd=1.5)

### Normal distribution

The probability density function of a Normal (or Gaussian) random variable is 

$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

which depends on a location parameter $\mu$ and a scale parameter $\sigma$.

If $X \sim N(\mu, \sigma^2)$, then

$E[X]=\mu$

$Var[X]=\sigma^2$

This distribution is symmetric with a bell shape. Points of inflection are at $\mu - \sigma$ and $\mu + \sigma$. The mean $\mu$ is also the median and mode.
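
These properties can be explored numerically. The sketch below assumes hypothetical parameters $\mu=2$ and $\sigma=1.5$, and approximates the second derivative of the density with central finite differences (the helper `d2` and its step size are illustrative choices).

```r
mu <- 2; sigma <- 1.5   # hypothetical values; any mu and sigma > 0 would do
f  <- function(x) dnorm(x, mean = mu, sd = sigma)

# Median: the 0.5 quantile
med <- qnorm(p = 0.5, mean = mu, sd = sigma)
med

# Mode: maximise the density numerically over a wide interval
mode_x <- optimize(f, interval = c(mu - 5 * sigma, mu + 5 * sigma),
                   maximum = TRUE)$maximum
mode_x

# Symmetry: the density is the same at equal distances either side of mu
all.equal(f(mu - 0.8), f(mu + 0.8))

# Inflection points: the second derivative is approximately zero at mu -/+ sigma
d2 <- function(x, h = 1e-3) (f(x + h) - 2 * f(x) + f(x - h)) / h^2
curv <- d2(c(mu - sigma, mu + sigma))
curv
```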

In [None]:
# pdf: Normal distribution
x = seq(-5,5,0.001)
plot(x=x, y=dnorm(x=x, mean=0, sd=0.2), type="l", ylab="f_X", xlab="X", lwd=2)

In [None]:
# cdf: Normal distribution
x = seq(-10,10,0.01)
plot(x=x, y=pnorm(q=x, mean=0, sd=1), type="l", ylab="F_X", xlab="X", lwd=1.5)

The distribution of height $X$ in a population follows a Normal distribution (when corrected by sex). Assume that it is $N(\mu=170, \sigma^2=100)$. 
* What is $P(X>190)$? 

In [None]:
1 - pnorm(q=190, mean=170, sd=sqrt(100))

* What is $P(X=170)$? 
* Where is the density $f_X(x)$ maximised?

In [None]:
x <- seq(150, 190, 1)
plot(x, dnorm(x=x, mean=170, sd=10))
x[which.max(dnorm(x=x, mean=170, sd=10))]

In [None]:
# E[aX+b] for X ~ N(mean=1, sd=1): simulated value vs a*E[X]+b
mu <- 1; scale <- 1
a <- 2; b <- 3
mean(a*rnorm(n=1e5, mean=mu, sd=scale)+b)
a*mu+b

In [None]:
# Var[aX+b] for X ~ N(mean=1, sd=1): simulated value vs a^2*Var[X]
mu <- 1; scale <- 1
a <- 2; b <- 3
var(a*rnorm(n=1e5, mean=mu, sd=scale)+b)
a^2*scale^2

There are several rules for manipulating expected values and variances.
* If $X$ is a continuous random variable with pdf $f_X(x)$, then for any real-valued function $g$
\begin{equation*}
E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) dx
\end{equation*}
* If $a$ and $b$ are constants, $E[aX+b]=aE[X]+b$ and $Var[aX+b]=a^2Var[X]$
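
The first rule can be illustrated with numerical integration. This is a sketch under stated assumptions: $X \sim N(1, 1)$ as in the simulations earlier in this section, and $g(x) = x^2$ chosen as an example, so that the result can be compared with $E[X^2] = Var[X] + (E[X])^2$.

```r
mu <- 1; sigma <- 1      # X ~ N(mean = 1, sd = 1)
g <- function(x) x^2     # example choice of g

# E[g(X)]: integrate g(x) * f_X(x) over the real line
E_g <- integrate(function(x) g(x) * dnorm(x, mean = mu, sd = sigma),
                 lower = -Inf, upper = Inf)$value
E_g
sigma^2 + mu^2           # E[X^2] rearranged from Var[X] = E[X^2] - (E[X])^2
```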

### Intended Learning Outcomes 

At the end of this session, you should be able to:
* Describe the principles of set theory and set operations
* Illustrate the foundations of probability theory and appropriate counting methods
* Identify dependence and independence of events
* Show the utility of distribution functions for random variables
* Demonstrate how to implement basic probability calculus in __R__