# Probability theory

## Uncertainty

Probability is used to quantify our uncertainty concerning unknown events or phenomena.

Uncertainty comes in 2 different flavours:
1. **Epistemic** (also called model uncertainty): originates from our imperfect understanding of the processes that we are modelling. Collecting more, and better, data can help to reduce this.
2. **Aleatoric**: originates in the intrinsic variability of these processes. Collecting more data won't change this.

That is, we can become extremely confident that our model correctly describes a process, but if it tells us that there's an 80% chance of outcome A and a 20% chance of outcome B, then there is still (aleatoric) uncertainty in any individual prediction that we make.
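The distinction can be illustrated with a small simulation. This is only a sketch: the 80% figure is taken from the text, while the names `TRUE_P_A` and `estimate_p_a` and the sample sizes are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_P_A = 0.8  # assumed "true" chance of outcome A (the 80% from the text)

def estimate_p_a(n_samples):
    """Estimate Pr(A) from n_samples simulated observations of the process."""
    hits = sum(random.random() < TRUE_P_A for _ in range(n_samples))
    return hits / n_samples

# More data shrinks the epistemic uncertainty: the estimate approaches 0.8.
for n in (10, 1_000, 100_000):
    print(n, estimate_p_a(n))

# The aleatoric uncertainty remains: even with a perfect estimate,
# any single outcome is still A only about 80% of the time.
```

However many samples we collect, the best we can do is learn that $Pr(A) \approx 0.8$; the outcome of the next individual trial stays uncertain.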

## Fundamental concepts

### Basic notation

$Pr(A)$ represents the probability of an event $A$ occurring

$Pr(\bar{A})$ represents the probability that an event $A$ will not occur, and is equal to $1 - Pr(A)$

### Joint probability

If we need to consider the possible occurrence of more than one event, for example events $A$ and $B$, then we're interested in their **joint probability distribution**.

The probability that they both occur (ie. the joint probability of A & B):

$Pr(A \wedge B) = Pr(A, B)$

If $A$ & $B$ are **independent**, then $Pr(A, B) = Pr(A)Pr(B)$
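As a quick check of the product form for independent events, the sketch below enumerates a small sample space. The coin-and-die setup and the helper `pr` are assumptions chosen for illustration, not part of the original notes.

```python
from fractions import Fraction
from itertools import product

# Sample space: one fair coin flip paired with one fair die roll (assumed example).
outcomes = list(product(["H", "T"], range(1, 7)))  # 12 equally likely outcomes

def pr(event):
    """Probability of an event, given as a predicate over (coin, die) outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

pr_heads = pr(lambda o: o[0] == "H")                         # Pr(A)
pr_six = pr(lambda o: o[1] == 6)                             # Pr(B)
pr_heads_and_six = pr(lambda o: o[0] == "H" and o[1] == 6)   # Pr(A, B)

print(pr_heads, pr_six, pr_heads_and_six)  # 1/2 1/6 1/12
print(pr_heads_and_six == pr_heads * pr_six)  # True
```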

### Union

The union of events $A$ & $B$ is the event that $A$ or $B$ (or both) occur:
- We'll accept the occurrence of either $A$ or $B$ to satisfy the criteria, so we can add their probabilities together.
- But outcomes where **both** $A$ and $B$ occur are then counted twice (once in $Pr(A)$ and once in $Pr(B)$), so we need to subtract the probability that they occur together from the total

$Pr(A \vee B) = Pr(A) + Pr(B) - Pr(A \wedge B)$

> If the two events are mutually exclusive (they can't both occur at the same time) then $Pr(A \wedge B)$ is $0$ and can be ignored.
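The union formula can be checked by enumeration. The sketch below assumes a single roll of a fair six-sided die, with $A$ = 'the roll is even' and $B$ = 'the roll is greater than 3'; these events and the helper `pr` are illustrative choices. The outcomes 4 and 6 belong to both events, which is why the joint term has to be subtracted.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (an assumed example).
outcomes = range(1, 7)

def pr(event):
    """Probability of an event given as a predicate over a single roll."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def is_even(o):   # event A: roll is 2, 4 or 6
    return o % 2 == 0

def is_high(o):   # event B: roll is 4, 5 or 6
    return o > 3

pr_a = pr(is_even)
pr_b = pr(is_high)
pr_a_and_b = pr(lambda o: is_even(o) and is_high(o))
pr_a_or_b = pr(lambda o: is_even(o) or is_high(o))

print(pr_a_or_b)                 # 2/3, by direct enumeration
print(pr_a + pr_b - pr_a_and_b)  # 2/3, by the formula
```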

### Conditional probability

When we're working with multiple interdependent events, information about the status of one event can tell us something about the status of the other events. We can therefore update our beliefs about these events in light of this information.

The joint probability of $A$ & $B$ is the probability of $A$ **and** $B$ occurring together.
If we also know the probability of event $A$, then dividing the joint probability by $Pr(A)$ restricts our attention to the cases in which $A$ occurs, and yields the probability of $B$ within those cases.

More formally, the **conditional probability** of event $B$ occurring, given that event $A$ occurs, is defined as follows:

$Pr(B|A) \triangleq \frac{Pr(A,B)}{Pr(A)}$

For example,
- Event $A$ is that it rains on a given morning
- Event $B$ is that there is a traffic jam outside my house on a given morning
- After months of measurements, we believe that the probability of it raining **AND** there being a traffic jam outside my house is $Pr(A, B) = 0.49$
- If we also know that the probability of it raining on any given morning is $Pr(A) = 0.7$, then we can use conditional probability to estimate the probability of a traffic jam specifically on the mornings that it rains:

$Pr(B|A) = \frac{Pr(A, B)}{Pr(A)} = \frac{0.49}{0.7} = 0.7$

So whilst we believe that the probability of having rain & traffic jams together on any given morning is approx. 50%, as soon as we know that it's raining outside, we can update our expectation that there will be a traffic jam to 70%.

> NB. If $Pr(A) = 0$, then $Pr(B|A)$ is undefined, because the definition would require dividing by zero.
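The definition translates directly into a small helper. This is a minimal sketch: the function name `conditional` and its argument names are hypothetical, and the inputs are the values used in the worked calculation above ($0.49$ and $0.7$).

```python
import math

def conditional(pr_joint, pr_given):
    """Return Pr(B|A) = Pr(A, B) / Pr(A); undefined when Pr(A) is zero."""
    if pr_given == 0:
        raise ValueError("Pr(B|A) is undefined when Pr(A) = 0")
    return pr_joint / pr_given

pr_rain = 0.7           # Pr(A)
pr_rain_and_jam = 0.49  # Pr(A, B)

pr_jam_given_rain = conditional(pr_rain_and_jam, pr_rain)
print(round(pr_jam_given_rain, 2))  # 0.7
```

The result is rounded for display because floating-point division may not return exactly `0.7`.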

### Independence & conditional independence

If two events are **independent**, then the outcome of one event has no impact on our beliefs about the outcome of the other; our belief about one event does not *depend* on our beliefs about the other.
In this case, their joint probability is simply the product of their individual probabilities:

$Pr(A, B) = Pr(A)Pr(B)$

If two events are independent, then the conditional probability of either event given the other collapses to the probability of that event alone (provided the conditioning event has non-zero probability):

$Pr(A|B) = Pr(A)$
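The sketch below contrasts an independent pair of events with a dependent pair, using two fair dice. The events and the helpers `pr` and `pr_given` are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event given as a predicate over (first, second)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def pr_given(event, given):
    """Conditional probability Pr(event | given) = Pr(event, given) / Pr(given)."""
    return pr(lambda o: event(o) and given(o)) / pr(given)

def first_even(o):          # A: the first die is even
    return o[0] % 2 == 0

def second_high(o):         # B: the second die shows 5 or 6
    return o[1] > 4

def total_at_least_10(o):   # C: the two dice sum to 10 or more
    return o[0] + o[1] >= 10

# A and B concern different dice, so knowing B tells us nothing about A.
print(pr(first_even), pr_given(first_even, second_high))        # 1/2 1/2

# A and C are dependent: a high total makes an even first die more likely.
print(pr(first_even), pr_given(first_even, total_at_least_10))  # 1/2 2/3
```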

In scenarios where an event depends on several others, information about one of them (an 'intermediate' event) can contain all the information we need concerning (some of) the rest. In this case, conditioning on the intermediate event allows us to treat the main event as independent of those others; this is **conditional independence**.

#### An example

- Let $A$ represent the event 'the bus will arrive in the next 5 minutes'.
- $B$ is the event 'the bus can be seen leaving the previous stop'. (once we can see the bus leaving the previous stop, it never takes more than 5 minutes to get to our stop)
- And $C$ is the event 'the screen at the bus stop says the bus will arrive in the next 5 minutes'.

Clearly $A$ is dependent on $C$: if we can't see the bus but the screen says it will be here in the next 5 minutes, then this influences our beliefs about $A$.
However, as soon as we can see the bus leaving the previous stop, looking at the screen will no longer influence our beliefs about $A$; $A$ becomes **conditionally independent** of $C$, given $B$.

$Pr(A|B,C) = Pr(A|B)$

Or, equivalently:

$Pr(A,C|B) = Pr(A|B)Pr(C|B)$

This can be written $A \perp C|B$
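The bus example can be sketched numerically. All of the probabilities below are hypothetical, and the joint distribution is built under the assumption that $A$ and $C$ each depend only on $B$ (for both values of $B$), which is stronger than what the example itself states. With that assumption, the check confirms that $A$ and $C$ are dependent on their own but independent once $B$ is known.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical numbers for the bus example (not from the original notes).
p_b = {True: F(3, 10), False: F(7, 10)}           # Pr(B = b)
p_a_given_b = {True: F(1), False: F(1, 5)}        # Pr(A = True | B = b)
p_c_given_b = {True: F(9, 10), False: F(1, 5)}    # Pr(C = True | B = b)

def bern(p_true, value):
    """Probability of a binary value given the probability of True."""
    return p_true if value else 1 - p_true

# Joint distribution built so that A and C are conditionally independent given B.
joint = {
    (a, b, c): p_b[b] * bern(p_a_given_b[b], a) * bern(p_c_given_b[b], c)
    for a, b, c in product([True, False], repeat=3)
}

def pr(event):
    """Probability of an event given as a predicate over (a, b, c)."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def pr_given(event, given):
    """Conditional probability Pr(event | given)."""
    return pr(lambda a, b, c: event(a, b, c) and given(a, b, c)) / pr(given)

# Marginally, the screen is informative about the bus arriving.
print(pr(lambda a, b, c: a))                               # 11/25
print(pr_given(lambda a, b, c: a, lambda a, b, c: c))      # 149/205

# Once we can see the bus (B), the screen (C) adds nothing.
print(pr_given(lambda a, b, c: a, lambda a, b, c: b))        # 1
print(pr_given(lambda a, b, c: a, lambda a, b, c: b and c))  # 1
```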

### Random variables

Let $X$ represent a variable; a quantity whose value we are interested in.
If we do not know the value of $X$ with certainty (ie. there is uncertainty concerning the current value, or how the value will change with time), then we call $X$ a **random variable**.

> The **state space** $\mathcal{X}$: the set of possible values that $X$ can take.

### Discrete random variables

**Discrete**: the possible values that the random variable can take form a finite or countably infinite set

Examples:

- Binary classes: 1, 0
- Categories: 'small', 'medium', 'large'
- Specific set of values: {0, 1, 2, 3, 4, 5}

#### Probability mass function (pmf)

The probability that $X$ takes a specific value $x$ is denoted $Pr(X=x)$

We can treat scenarios where $X$ takes one of its possible values $x$ as an event.

The **probability mass function** is a function which computes the probability of these events for each/any possible value of a **discrete** random variable.

$p(x) \triangleq Pr(X=x)$

The **pmf** satisfies 2 properties:
1. Always returns a value between 0 & 1 (inclusive): $0 \leq p(x) \leq 1$
2. The sum of $p(x)$ across all values in the state space always equals 1: $\sum_{x \in \mathcal{X}} p(x) = 1$
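A pmf over a small state space can be represented as a lookup table. In the sketch below the categories come from the examples above, but the probabilities assigned to them are hypothetical.

```python
from fractions import Fraction

# Hypothetical pmf over the 'small' / 'medium' / 'large' state space.
pmf = {
    "small": Fraction(1, 2),
    "medium": Fraction(3, 10),
    "large": Fraction(1, 5),
}

def p(x):
    """p(x) = Pr(X = x); values outside the state space have probability 0."""
    return pmf.get(x, Fraction(0))

state_space = list(pmf)

# Property 1: every value lies between 0 and 1 (inclusive).
print(all(0 <= p(x) <= 1 for x in state_space))  # True

# Property 2: the values sum to 1 across the state space.
print(sum(p(x) for x in state_space))            # 1

print(p("medium"))                               # 3/10
```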

### Continuous random variables

**Continuous**: the random variable can take any value within an interval of the real line (possibly unbounded), so its set of possible values is uncountably infinite

#### Cumulative distribution function (cdf)

When working with continuous random variables, it is no longer possible to create a countable set of distinct possible values that the variable can take. The **pmf** therefore does not apply to continuous variables.

Instead, we can partition the real line into a countable set of intervals and we can define events where $X$ takes any value within defined intervals.

The **cumulative distribution function** is a function which computes the probability that $X$ takes a value equal to, or less than, some possible value $x$:

$P(x) = Pr(X \leq x)$

The **cdf** satisfies the following properties:
1. Always returns a value between 0 & 1 (inclusive): $0 \leq P(x) \leq 1$
2. Tends to 0 at the lower end of the range of possible values and to 1 at the upper end: $\lim_{x \to -\infty} P(x) = 0$ and $\lim_{x \to \infty} P(x) = 1$
3. $P(x)$ is a **monotonically non-decreasing function**

From the **cdf**, it is possible to calculate the probability that $X$ lies inside any interval:
$Pr(a < X \leq b) = P(b) - P(a)$
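To make the interval calculation concrete, the sketch below uses an exponential distribution as an assumed example (the notes do not name a particular distribution). Its cdf is $P(x) = 1 - e^{-\lambda x}$ for $x \geq 0$; the names `RATE`, `cdf` and `pr_interval` are hypothetical.

```python
import math

RATE = 1.0  # assumed rate parameter (lambda) of an exponential distribution

def cdf(x, rate=RATE):
    """P(x) = Pr(X <= x) for an exponential random variable."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def pr_interval(a, b):
    """Pr(a < X <= b) = P(b) - P(a), for a <= b."""
    return cdf(b) - cdf(a)

# The cdf never decreases as x increases (checked on a grid from -2 to 10).
grid = [i / 10 for i in range(-20, 101)]
is_monotone = all(cdf(lo) <= cdf(hi) for lo, hi in zip(grid, grid[1:]))
print(is_monotone)

# Probability that X falls between 1 and 2.
print(round(pr_interval(1.0, 2.0), 4))  # 0.2325
```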

#### Probability density function (pdf)

The **pdf** is the derivative of the **cdf**

$p(x) \triangleq \frac{d}{dx}P(x)$
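The relationship can be checked numerically by differentiating a known cdf. As a sketch, this again assumes an exponential distribution, whose pdf is $p(x) = \lambda e^{-\lambda x}$ for $x \geq 0$; the central-difference helper `numerical_derivative` and the step size are illustrative choices.

```python
import math

RATE = 2.0  # assumed rate parameter (lambda) of an exponential distribution

def cdf(x):
    """P(x) = Pr(X <= x) for an exponential random variable."""
    return 1.0 - math.exp(-RATE * x) if x >= 0 else 0.0

def pdf(x):
    """Analytic derivative of the cdf above."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation to df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# The two columns should agree to several decimal places.
for x in (0.5, 1.0, 2.0):
    print(x, pdf(x), numerical_derivative(cdf, x))
```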

#### Percent point function (ppf)

Also known as the ***inverse cdf*** or ***quantile function***

## Laws

### Sum rule

Also known as the ***rule of total probability***

### Product rule


### Chain rule

## Bayes' rule

## Properties of a distribution

### Moments

#### Mean


#### Variance


#### Mode


### Conditional moments

#### Covariance

#### Correlation

## Central limit theorem