![QMUL](Images/QMUL-logo.jpg)

# Statistics for Biologists


## Probability distributions - functions

## Distribution functions

With every random variable $X$, we associate a function called the cumulative distribution function of $X$.
> The _cumulative distribution function_ or _cdf_ of a random variable $X$, denoted by $F_X(x)$, is defined by
\begin{equation*}
F_X(x) = P_X(X \leq x) \text{, for all } x
\end{equation*}

Consider the experiment of sampling three nucleotide bases, each equally likely to be a GC or an AT base, and let $X$ be the number of GC bases observed.
What is the cdf of $X$? Can we plot it?

\begin{align*}
F_X(x) &= 0 \text{, if } -\infty < x < 0 \\
F_X(x) &= ? \text{, if } \\
...
\end{align*}

In [None]:
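# cdf of X evaluated at x = -1, 0, 1, 2, 3, 4
cdf <- c(0,1/8,4/8,7/8,1,1)
# type="s" draws a step function that jumps at each value of x
plot(x=c(-1,0,1,2,3,4), cdf, xlab="X", type="s", lwd=2)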

Note that the step function $F_X(x)$ is defined for all values of $x$, not just those in $\mathcal{X}$.
Also note that $F_X(x)$ has jumps at values of $x_i \in \mathcal{X}$ and the size of the jump at $x_i$ is equal to $P(X=x_i)$.
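
The link between jump sizes and point probabilities can be checked numerically. The following is a minimal sketch, assuming (as the cdf values above imply) that each sampled base is a GC base with probability 1/2, so that $X$ follows a binomial distribution with 3 trials. The names `support`, `cdf_values`, and `jumps` are chosen here for illustration only.

```r
# Assumption: each of the 3 bases is G or C with probability 1/2,
# so X ~ Binomial(3, 1/2) and its support is {0, 1, 2, 3}.
support <- 0:3
cdf_values <- pbinom(support, size = 3, prob = 0.5)  # F_X at each support point

# Jump at x_i = F_X(x_i) minus the value of F_X just to the left of x_i
# (which is 0 to the left of the first support point).
jumps <- diff(c(0, cdf_values))

# The jumps should match P(X = x_i) as given by the binomial pmf.
print(jumps)
print(all.equal(jumps, dbinom(support, size = 3, prob = 0.5)))
```

Taking successive differences of the cdf values is one way to recover the point probabilities from a step-function cdf.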

$F_X(x)$ can be discontinuous but has the property of right-continuity.

The function $F_X(x)$ is a cdf if and only if the following three conditions hold:
1. $\lim_{x \rightarrow -\infty} F_X(x)=0$ and $\lim_{x \rightarrow \infty} F_X(x)=1$.
2. $F_X(x)$ is a nondecreasing function of $x$.
3. $F_X(x)$ is right-continuous.
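
The three conditions can be spot-checked numerically for the GC-count example. This is only a sketch on a finite grid, not a proof; `cdf_fun`, `grid`, and `eps` are illustrative names, and the step function is built with base R's `stepfun()`.

```r
# Hypothetical helper: the cdf of the GC-count example as a step function.
# With its default right = FALSE, stepfun() treats each piece as [x_i, x_{i+1}),
# which makes the resulting function right-continuous.
cdf_fun <- stepfun(x = 0:3, y = c(0, 1/8, 4/8, 7/8, 1))

# Condition 1: limits at -Infinity and +Infinity (approximated by far-away points).
limits <- cdf_fun(c(-1e6, 1e6))

# Condition 2: nondecreasing on a fine grid.
grid <- seq(-2, 5, by = 0.01)
is_nondecreasing <- all(diff(cdf_fun(grid)) >= 0)

# Condition 3: approaching each jump point from the right recovers F_X(x_i),
# whereas approaching it from the left does not.
eps <- 1e-9
right_continuous <- all(cdf_fun(0:3 + eps) == cdf_fun(0:3))
left_continuous  <- all(cdf_fun(0:3 - eps) == cdf_fun(0:3))

print(limits)
print(is_nondecreasing)
print(right_continuous)
print(left_continuous)
```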

Whether a cdf is continuous or has jumps determines whether the associated random variable is continuous or discrete.

> A random variable $X$ is _continuous_ if $F_X(x)$ is a continuous function of $x$. A random variable $X$ is _discrete_ if $F_X(x)$ is a step function of $x$.

> The random variables $X$ and $Y$ are _identically distributed_ if, for every set $A \in \mathcal{B}^1$ (the smallest sigma-algebra containing all the intervals of real numbers), $P(X \in A) = P(Y \in A)$.

If $X$ and $Y$ are identically distributed, $F_X(x)=F_Y(x)$ for every $x$.

$F_X(x)$ completely determines the probability distribution of a random variable $X$.

Consider the experiment of sampling nucleotide bases and let $X$ be the number of GC bases observed and $Y$ the number of AT bases observed.

Are $X$ and $Y$ identically distributed?

## Density and mass functions

Apart from its cdf $F_X$, another function is associated with a random variable $X$.
This function is called either the probability density function (pdf) or probability mass function (pmf).

> The _probability mass function (pmf)_ of a discrete random variable $X$ is given by
\begin{equation*}
f_X(x) = P(X=x) \text{ for all } x
\end{equation*}

Recall that $f_X(x)$ is the size of the jump in the cdf at $x$. 


In [None]:
# pmf of X at x = 0, 1, 2, 3
pmf <- c(1/8,3/8,3/8,1/8)
# one point per value of x, with height P(X = x)
plot(x=c(0,1,2,3), pmf, xlab="X", cex=2, pch=16, ylim=c(0,4/8))

From the pmf, we can calculate probabilities of events, as 
\begin{equation*}
P(a \leq X \leq b) = \sum_{k=a}^b f_X(k)
\end{equation*}
for integers $a$ and $b$, with $a \leq b$, when $X$ takes integer values.
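
As a small worked illustration of this sum, the sketch below computes $P(1 \leq X \leq 2)$ for the GC-count example in two ways. The bounds `a` and `b` and the other variable names are arbitrary choices made here.

```r
# pmf of the GC-count example (X = number of GC bases among 3, values 0..3).
support <- 0:3
pmf <- c(1/8, 3/8, 3/8, 1/8)

# P(a <= X <= b) as a sum of pmf values; a and b are example bounds chosen here.
a <- 1
b <- 2
prob_sum <- sum(pmf[support >= a & support <= b])

# The same probability from the cdf: F_X(b) - F_X(a - 1) for integer-valued X.
# This lookup works here because a - 1 = 0 belongs to the support.
cdf <- cumsum(pmf)
prob_cdf <- cdf[support == b] - cdf[support == a - 1]

print(prob_sum)
print(prob_cdf)
```

Both routes give the same value, which reflects the fact that the pmf and the cdf carry the same information.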

What happens in the continuous case?

> The _probability density function_ or _pdf_, $f_X(x)$, of a continuous random variable $X$ is the function that satisfies
\begin{equation*}
F_X(x) = \int_{-\infty}^{x} f_X(t) dt \text{ for all } x
\end{equation*}

The expression "$X$ has a distribution given by $F_X(x)$" can be written as "$X \sim F_X(x)$". We can also write $X \sim f_X(x)$ or $X \sim Y$ if $X$ and $Y$ have the same distribution.

The pdf (or pmf) contains the same information as the cdf. We can use either one to solve problems.

We need to be more careful in the continuous case. In the discrete case, we can sum over values of the pmf to get the cdf. Similarly, in the continuous case, we replace sums with integrals.


\begin{equation*}
P(X \leq x) = F_X(x) = \int_{-\infty}^x f_X(t) dt
\end{equation*}
If $f_X(x)$ is continuous, then
\begin{equation*}
\frac{d}{dx} F_X(x) = f_X(x)
\end{equation*}
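
Both relations can be checked numerically for a specific continuous distribution. The sketch below assumes, purely as a convenient example, that $X$ is standard normal, and uses base R's `dnorm()` (pdf), `pnorm()` (cdf), and `integrate()`. The evaluation point `x` and step size `h` are arbitrary.

```r
# Assumption: X is standard normal, used only as a convenient continuous example.
x <- 1

# F_X(x) as the integral of the pdf from -Infinity to x.
cdf_by_integration <- integrate(dnorm, lower = -Inf, upper = x)$value
print(all.equal(cdf_by_integration, pnorm(x), tolerance = 1e-4))

# f_X(x) as the derivative of the cdf, approximated by a central difference.
h <- 1e-5
pdf_by_difference <- (pnorm(x + h) - pnorm(x - h)) / (2 * h)
print(all.equal(pdf_by_difference, dnorm(x), tolerance = 1e-6))
```

The comparisons use a tolerance because both the numerical integral and the finite difference are approximations.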

A function $f_X(x)$ is a pdf (or pmf) of a random variable $X$ if and only if
1. $f_X(x) \geq 0$ for all $x$
2. $\sum_x f_X(x)=1$ (pmf) or $\int_{-\infty}^{\infty} f_X(x) dx = 1$ (pdf)
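
The two conditions can be verified numerically for concrete candidates. The sketch below checks the pmf of the GC-count example and, as an assumed continuous example not taken from the text, an exponential pdf with rate 2 via base R's `dexp()`. Non-negativity of the pdf is only checked on a finite grid.

```r
# pmf of the GC-count example.
pmf <- c(1/8, 3/8, 3/8, 1/8)
pmf_ok <- all(pmf >= 0) && isTRUE(all.equal(sum(pmf), 1))

# Assumed continuous example: the exponential pdf with rate 2 (zero for x < 0),
# so integrating from 0 to Infinity covers all of its mass.
total_area <- integrate(dexp, lower = 0, upper = Inf, rate = 2)$value
grid <- seq(0, 10, by = 0.1)
pdf_ok <- all(dexp(grid, rate = 2) >= 0) &&
  isTRUE(all.equal(total_area, 1, tolerance = 1e-4))

print(pmf_ok)
print(pdf_ok)
```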

### Exercise

Assume that a genetic test for a certain disease provides three possible outcomes for an individual's susceptibility to the disorder: `none`, `mild`, or `severe`. Each outcome is equally likely.

Suppose that you are interested in cases where the susceptibility is different from `none`.

What is the probability that, out of two tested individuals, you observe at least one outcome different from `none`?

Define your random variable and plot its cdf and pmf.
