![QMUL](Images/QMUL-logo.jpg)

# Statistics for Biologists


## Probability theory - set theory

Probability theory is the foundation for all statistical inferences. Through the use of models of experiments, we are able to make inferences about populations based on examining only a part of the whole.

Here we outline the basic ideas of probability theory that are of direct importance for statistical inference. Just as statistics builds upon probability theory, probability theory builds upon set theory.

### Intended Learning Outcomes 

By the end of this session, you will be able to:
* Describe the principles of set theory and set operations
* Illustrate the axiomatic foundations of probability theory and appropriate counting methods
* Identify dependence and independence of events
* Show the utility of distribution functions for random variables
* Demonstrate how to implement basic probability calculus in _*R*_

## Set theory

If one of our main objectives in statistics is to draw conclusions about a population of objects after an experiment, then it is essential to identify the possible outcomes of that experiment.

> The set $S$ of all possible outcomes of a particular experiment is called the _sample space_ for the experiment.

If the experiment consists of tossing a coin, then the sample space contains only two outcomes, heads and tails, and therefore: $S=\{H,T\}$.

If, on the other hand, the experiment consists of observing the new GCSE scores of randomly selected pupils, the sample space would be the set of integers between 0 and 9, that is $S=\{0,1,2,...,8,9\}$.

Finally, consider an experiment where the observation is the reaction time to a stimulus. In this case, the sample space consists of all positive numbers, that is $S=(0,\infty)$.


Imagine that our experiment consists of observing a single nucleotide in the sequence of a particular gene of interest.

What is the sample space? 

$S_G=\{A,C,G,T\}$

How can we represent this sample space in __R__?

In [None]:
# how to represent arrays in R
a <- c("A", "B", "C")
a

b <- c(4, 1, 0.4)
b

# mixing types in one vector coerces every element to character
c <- c("A", 5, 2.4, 1e3)
c

In [None]:
# how to access values in an array (1-based indexing)
c[1]
c[2:3]
c[b[1]]

In [None]:
# how to change values in an array
c 
c[3] <- "purine"
c

c[b[2]] <- "purine"
c

In [None]:
# how to concatenate arrays
a
b
d <- c(a,b)
d

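# d is a character vector, so d[5] is the string "1", not the number 1
d[5]
# indexing with a string looks up an element name; a has no names, so this returns NA
a[d[5]]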

In [None]:
# types and operations, and how to call functions
a
typeof(a)
length(a)
p <- sort(a)
p

In [None]:
b
typeof(b)
length(b)
sort(b)

In [None]:
c
typeof(c)
length(c)
sort(c)

$S_G=\{A,C,G,T\}$

How can we represent this sample space in __R__?

In [None]:
S <- c("A", "C", "G", "T")
S

In [None]:
length(S)

S[1]

sample(x=S, size=1)

In [None]:
# how to access documentation
?sample
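
As a minimal sketch of how `sample` can mimic repeated observations from $S_G$, the cell below draws several nucleotides at once. The seed value and the name `seq_obs` are illustrative assumptions; by default `sample` gives every element of `S` the same probability, which is a simplification and not a claim about real genes.

```r
# Sample space for a single nucleotide
S <- c("A", "C", "G", "T")

# Fix the random seed so the draw is reproducible (the value 42 is arbitrary)
set.seed(42)

# Draw 10 nucleotides; replace = TRUE lets the same base appear more than once
seq_obs <- sample(x = S, size = 10, replace = TRUE)
seq_obs

# Count how often each base was drawn, keeping bases that were never drawn
table(factor(seq_obs, levels = S))
```

Without `replace = TRUE`, `sample` cannot draw more elements than `S` contains.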

Now suppose that we are interested in making inferences on the amino acid sequence of a protein.

What is the sample space?

$S_P=?$
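
One possible answer, sketched under the assumption that we only consider the 20 standard amino acids and write them with their one-letter codes (non-standard residues such as selenocysteine are ignored here):

```r
# One-letter codes of the 20 standard amino acids
# (assumption: non-standard residues are not part of the sample space)
S_P <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
         "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Size of the sample space
length(S_P)
```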

Finally, let's suppose that our observations consist of the divergence between orthologous genes of arbitrary length.
In other words, for a large set of genes we calculate the relative difference in nucleotide content between genes in different species.

What is the sample space for such divergence? How can we represent it in __R__?

$S_D=?$

In [None]:
S <- seq(from=0, to=1, by=0.001)

?seq

In [None]:
length(S)

In [None]:
hist(x = S) 

In [None]:
?hist

From these examples, we see that we can classify sample spaces into two types according to the number of elements they can contain.
Sample spaces can be either _countable_ or _uncountable_.
If the sample space is finite, or if its elements can be put into a one-to-one correspondence with a subset of the integers, the sample space is countable.
Therefore, the coin toss and GCSE score sample spaces are countable, whereas the reaction time sample space is uncountable.

The distinction between countable and uncountable sample spaces, although sometimes trivial, is of great importance as it dictates the way in which probabilities can be assigned.
In practice, probabilistic methods associated with uncountable sample spaces are less cumbersome and can provide an approximation to the true countable situation.
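
The relationship between the two types can be illustrated with a rough sketch. A grid with step 0.001 on $[0,1]$ is a finite, and therefore countable, stand-in for the continuous interval, while `runif` draws values from the continuous interval itself. The names `grid`, `x` and `x_grid` and the seed are illustrative assumptions.

```r
# A finite grid on [0, 1]: a countable stand-in for a continuous interval
grid <- seq(from = 0, to = 1, by = 0.001)
length(grid)   # 1001 grid points

# Five values drawn from the continuous interval (0, 1)
set.seed(1)    # arbitrary seed, for reproducibility only
x <- runif(5)
x

# Rounding to 3 decimals maps each continuous value to its nearest grid point
x_grid <- round(x, digits = 3)
x_grid
```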

Once the sample space has been defined (e.g., is it countable or uncountable?), we can consider collections of possible outcomes of an experiment.
> An _event_ is any collection of possible outcomes of an experiment, that is, any subset of $S$, including $S$ itself.

In [None]:
# an event A of set S
S <- c("A", "C", "G", "T")
cat("sample space is", S, "\n")

A <- S[1:2] 
cat("\nevent is",A,"\n")

cat("\nIs event A in S?")
A %in% S

In [None]:
A
S
A %in% S

In [None]:
# TRUE and FALSE
typeof(TRUE)
typeof(FALSE)

In [None]:
# logical operations (AND)
TRUE & TRUE

TRUE & FALSE

FALSE & FALSE

In [None]:
# logical operations (OR)
TRUE | TRUE

TRUE | FALSE

FALSE | FALSE

In [None]:
# logical operations (AND) on numbers: 0 is treated as FALSE, any other number as TRUE
1 & 1

1 & 0

0 & 0

In [None]:
# logical operations (OR) on numbers: 0 is treated as FALSE, any other number as TRUE
1 | 1

1 | 0

0 | 0

In [None]:
cat("Is event A in S?")
A %in% S

In [None]:
prod(A %in% S)

In [None]:
TRUE & TRUE

1 & TRUE

1 & 1

1 * 1
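
The product works because logical values are coerced to 0 and 1. As an alternative sketch, base R also provides `all` and `any`, which summarise a logical vector directly. The event `B`, which contains an outcome outside $S$, is a made-up example for illustration.

```r
S <- c("A", "C", "G", "T")
A <- S[1:2]

in_S <- A %in% S   # one logical value per element of A
prod(in_S)         # 1 only if every element is TRUE (logicals coerced to 0/1)
all(in_S)          # TRUE only if every element is TRUE
any(in_S)          # TRUE if at least one element is TRUE

# A collection containing an outcome ("U") that is not in S
B <- c("A", "U")
all(B %in% S)
any(B %in% S)
```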

In [None]:
# an event A of sample space S
S <- seq(from=0, to=1, by=0.001)

A <- S[S<0.05]

In [None]:
S<0.05 

In [None]:
S[S<0.05]

In [None]:
# conditions to select values in an array: < > == !=
S[S==0]
S[S!=0]

In [None]:
S[S>1]

In [None]:
# use which to return the index values
which(S>0.9)

S[which(S>0.9)]

In [None]:
index <- which(S>0.90)
S[index]

In [None]:
S_h <- hist(S, plot=FALSE)
A_h <- hist(A, plot=FALSE)

plot(S_h)
plot(A_h, add=TRUE, col="red")

In [None]:
cat("Is event A in S?")
prod(A %in% S) 

Let $A$ be an event, a subset of $S$. We say that the event $A$ occurs if the outcome of the experiment is in the set $A$.

For two events $A$ and $B$, containment and equality are defined as follows:

\begin{equation}
A \subset B \iff x \in A \implies x \in B
\end{equation}
\begin{equation}
A = B \iff A \subset B \text{ and } B \subset A
\end{equation}
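
The two definitions above can be sketched in __R__ as follows. The helper `is_subset` is a hypothetical name introduced only for this illustration, and `setequal` is the base R function for comparing two vectors as sets.

```r
# Hypothetical helper: TRUE if every element of A is also in B
is_subset <- function(A, B) all(A %in% B)

S <- c("A", "C", "G", "T")
A <- c("A", "G")   # purines
B <- c("G", "A")   # same outcomes, listed in a different order

is_subset(A, S)    # is A contained in S?
is_subset(S, A)    # is S contained in A?

# Equality as mutual containment
is_subset(A, B) && is_subset(B, A)

# Base R equivalent of the equality check
setequal(A, B)
```

Note that `A == B` would compare the vectors element by element and so depends on the order, which is not what set equality means.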

### Exercise

A pain assessment chart defines self-perceived levels of pain as discrete integer numbers from 0 (no pain) to 10 (worst possible pain).

Define the sample space of pain level. Define the event of "more than severe pain" with pain level greater than 6.

Assume that out of 40 patients in A&E, 75\% of them are reported as "more than severe pain" while 25\% of them display a pain level of 0. Produce a `barplot` of the distribution of pain levels at A&E with the information you have. Label axes and provide captions.

