<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Chapter-1:-An-introduction-to-Bayesian-inference" data-toc-modified-id="Chapter-1:-An-introduction-to-Bayesian-inference-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Chapter 1: An introduction to Bayesian inference</a></span></li><li><span><a href="#Chapter-2:-Subjective-worlds-of-Frequentist-and-Bayesian-statistics" data-toc-modified-id="Chapter-2:-Subjective-worlds-of-Frequentist-and-Bayesian-statistics-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Chapter 2: Subjective worlds of Frequentist and Bayesian statistics</a></span><ul class="toc-item"><li><span><a href="#A-couple-of-useful-ideas" data-toc-modified-id="A-couple-of-useful-ideas-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>A couple of useful ideas</a></span><ul class="toc-item"><li><span><a href="#You-can-ignore-this-if-you-just-want-answers" data-toc-modified-id="You-can-ignore-this-if-you-just-want-answers-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>You can ignore this if you just want answers</a></span></li></ul></li><li><span><a href="#The-key-takeaway" data-toc-modified-id="The-key-takeaway-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>The key takeaway</a></span></li></ul></li><li><span><a href="#Chapter-3:-Probability---the-nuts-and-bolts-of-bayesian-inference" data-toc-modified-id="Chapter-3:-Probability---the-nuts-and-bolts-of-bayesian-inference-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Chapter 3: Probability - the nuts and bolts of bayesian inference</a></span><ul class="toc-item"><li><span><a href="#Probability-distributions" data-toc-modified-id="Probability-distributions-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Probability distributions</a></span><ul class="toc-item"><li><span><a href="#Notation" data-toc-modified-id="Notation-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Notation</a></span></li></ul></li><li><span><a href="#Formulae/facts" 
data-toc-modified-id="Formulae/facts-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Formulae/facts</a></span><ul class="toc-item"><li><span><a href="#Independence" data-toc-modified-id="Independence-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Independence</a></span></li></ul></li><li><span><a href="#Central-limit-theorem" data-toc-modified-id="Central-limit-theorem-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Central limit theorem</a></span></li></ul></li></ul></div>

import numpy as np

## Chapter 1: An introduction to Bayesian inference

Very very gentle.

## Chapter 2: Subjective worlds of Frequentist and Bayesian statistics

Will skip the philosophical discussion here. Frequentists think in terms of repeated events and their different outcomes; Bayesians think more in terms of degrees of "uncertainty". The distinction is somewhat subtle, and it doesn't matter if you don't fully appreciate it.

### A couple of useful ideas

The core idea will be Bayes' formula that allows us to move from effect to cause.

So we see some evidence and then update our beliefs about what *ought to have been the case* about the underlying mechanism generating the data.

Somewhat math-y:

$$ \Pr (\text{Effect | Cause}) \xrightarrow{\text{Bayes' Theorem}} \Pr (\text{Cause | Effect}) $$

And without the mysterious "Bayes' Theorem" stage:

$$ \Pr (C | E) = \frac{\Pr(E | C) \cdot \Pr(C)}{\Pr (E)}$$

#### You can ignore this if you just want answers

A bit of algebra makes this formula totally indisputable. Draw yourself a Venn diagram and observe that the numerator, $\Pr(E | C) \cdot \Pr(C)$, is $ = \Pr (C \cap E)$. Dividing that by $\Pr(E)$ is just the definition of $\Pr(C | E)$.
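
If you'd rather check the Venn diagram claim with numbers, here is a minimal sketch. The joint table is entirely made up for illustration (any non-negative 2x2 table summing to 1 would do); it just confirms that the numerator of Bayes' formula is the overlap $\Pr(C \cap E)$.

```python
import numpy as np

# Hypothetical joint probabilities (made up for illustration).
# Rows: C false / C true. Columns: E false / E true.
joint = np.array([[0.50, 0.20],
                  [0.05, 0.25]])

p_c = joint[1].sum()               # Pr(C): sum the "C true" row
p_e = joint[:, 1].sum()            # Pr(E): sum the "E true" column
p_e_given_c = joint[1, 1] / p_c    # Pr(E | C), from the definition
p_c_given_e = joint[1, 1] / p_e    # Pr(C | E), from the definition

# Numerator of Bayes' formula should equal the overlap Pr(C and E).
numerator = p_e_given_c * p_c
bayes = numerator / p_e

print("numerator:", numerator, " overlap:", joint[1, 1])
print("Bayes:", bayes, " direct:", p_c_given_e)
```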

Perhaps some more intuition can be provided by the following:

$$ \Pr (C | E) = \Pr(C) \cdot \frac{\Pr(E | C)}{\Pr (E)}$$

In this form it is clear that the LHS term (soon "posterior") is the initial (soon to be called "prior") probability of C multiplied by a factor, so *given any evidence/effect* a larger prior pushes the posterior up. But, more interestingly, that factor is $ = \frac{\Pr(E | C)}{\Pr (E)}$. This term is $ > 1$ when E is more likely under C than it is overall, and is $< 1$ in the reverse case.

**Intuitive example** (for this factor idea) - Suppose I am happy almost always (e.g. 95% of the time, or $\Pr (E = \text{Happy}) = 0.95$), and am always happy (so 100%, or 1) when my kill:death ratio is above 4. Then if you see me and I'm happy, you have some reason to believe that my kill:death ratio is above 4, but not a huge amount: the factor is only $1 / 0.95 \approx 1.05$. Conversely, if I am almost without fail miserable and you see me happy, you might think there is actually quite a bit of a chance of my k:d being above 4.
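
A rough sketch of the same example in numbers. The 0.95 and 1.0 come from the text; the prior of 0.02 for k:d > 4 and the 0.05 for the "almost always miserable" version are assumptions added purely for illustration.

```python
def posterior(prior, likelihood, evidence):
    """Pr(C | E) = Pr(C) * Pr(E | C) / Pr(E)."""
    return prior * likelihood / evidence

prior_kd = 0.02           # assumed Pr(k:d > 4); not given in the text
p_happy_given_kd = 1.0    # always happy when k:d > 4 (from the text)

# Case 1: happy 95% of the time (from the text).
factor_usually_happy = p_happy_given_kd / 0.95
post_usually_happy = posterior(prior_kd, p_happy_given_kd, 0.95)

# Case 2: almost always miserable; Pr(Happy) = 0.05 is an assumed value.
factor_rarely_happy = p_happy_given_kd / 0.05
post_rarely_happy = posterior(prior_kd, p_happy_given_kd, 0.05)

print("usually happy: factor", factor_usually_happy, "posterior", post_usually_happy)
print("rarely happy:  factor", factor_rarely_happy, "posterior", post_rarely_happy)
```

Seeing me happy barely moves the needle in the first case, but multiplies the prior by about 20 in the second.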

**Extension**
In the example above we basically dealt with the terms *apart from* $\Pr (C)$. If you know that I suck at online shooting games then *even if* you see me happy you still won't think my k:d > 4, because it is just so unlikely in the first instance. Instead you might *look for alternative reasons* to explain my happiness. This relates to the idea of "explaining away", where the existence of one cause makes another less likely because it is no longer required to account for the effects we observed. E.g. if you see that I am happy and you know that I just got a job promotion, you become less reliant on the k:d explanation. Note that in the limiting case where I am *only* happy when my k:d > 4, our problem reduces to propositional logic: happy implies k:d > 4. You might therefore like to think of Bayes/stats/probability in general as an extension of logic to the real world. In fact, you might want to question any mental distinction between math and logic you have in the first instance...
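
"Explaining away" can be sketched with a tiny two-cause model. Everything numeric here is an assumption for illustration: two independent causes (k:d > 4 and a promotion), each with prior 0.1, and a made-up table for how likely I am to be happy under each combination.

```python
import numpy as np

p_k = 0.1   # assumed prior Pr(k:d > 4)
p_j = 0.1   # assumed prior Pr(promotion)

# Assumed table: p_happy[k, j] = Pr(Happy | K=k, J=j), index 0 = false, 1 = true.
p_happy = np.array([[0.05, 0.90],
                    [0.90, 0.95]])

prior_k = np.array([1 - p_k, p_k])
prior_j = np.array([1 - p_j, p_j])

# joint[k, j] = Pr(K=k, J=j, Happy), with K and J independent a priori.
joint = prior_k[:, None] * prior_j[None, :] * p_happy

# Belief in k:d > 4 after seeing me happy.
p_k_given_h = joint[1].sum() / joint.sum()

# Belief in k:d > 4 after seeing me happy AND learning about the promotion.
p_k_given_h_and_j = joint[1, 1] / joint[:, 1].sum()

print("Pr(K)          =", p_k)
print("Pr(K | H)      =", p_k_given_h)
print("Pr(K | H, J=1) =", p_k_given_h_and_j)
```

With these numbers, happiness alone lifts the k:d belief well above its prior, and the promotion pulls it most of the way back down.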

### The key takeaway

The last little bits are the following:
- $prior + data \xrightarrow{\text{model}} posterior$
- $p(\theta | data) = \frac{p(data | \theta) p(\theta)}{p(data)}$

No explanation needed, just a reformulation of the above in terms of a data-generating model with parameters (often a vector), $\theta$.
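
To make $p(\theta | data)$ concrete, here is a minimal grid-approximation sketch. The setup is an assumption, not something from the notes above: a coin with unknown bias $\theta$, a flat prior, and 7 heads in 10 flips.

```python
import numpy as np

heads, flips = 7, 10                      # hypothetical data
theta = np.linspace(0.0, 1.0, 1001)       # grid of candidate parameter values
d_theta = theta[1] - theta[0]

prior = np.ones_like(theta)               # flat prior: p(theta) = 1 on [0, 1]

# Likelihood p(data | theta), dropping the binomial coefficient
# (it does not depend on theta, so it cancels when normalising).
likelihood = theta**heads * (1 - theta)**(flips - heads)

unnormalised = likelihood * prior
evidence = unnormalised.sum() * d_theta   # p(data), up to that dropped constant
posterior = unnormalised / evidence       # p(theta | data) on the grid

posterior_mean = (theta * posterior).sum() * d_theta
print("posterior mean of theta:", posterior_mean)
```

For this particular setup the exact posterior is a Beta(8, 4), whose mean is 8/12, so the grid answer can be sanity-checked against that.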

## Chapter 3: Probability - the nuts and bolts of bayesian inference

Just a quick summary of a few facts and properties that will be relied on extensively. Not much explanation will be added here. (Anyone with a decent maths background should find no dragons here.)

### Probability distributions

These will be the bread and butter of all that we do, in particular because, unlike a point estimate (e.g. the mean of a distribution), they implicitly 'carry along' the uncertainty (and where it is placed) associated with our belief. Remember, uncertainty is what Bayesian analysis is all about.

Distributions:
- are non-negative for any event
- sum (or integrate) to 1 across the entire possibility space

For discrete cases we speak of "probability mass". For continuous cases we speak in terms of "densities". For a continuous variable the *probability* of any single specific value is zero, even where the density is positive; only intervals carry probability.

Note, we do not commit the common logical fallacy!

Impossible $\rightarrow \Pr(X = x) = 0$.

But...

$\Pr(X = x) = 0 \nrightarrow$ impossible.

i.e. every individual value of a continuous variable having zero *probability* is not us saying that it will never occur.

#### Notation

- Discrete: $\Pr (X = a)$
- Continuous: $p(X = \theta)$

However, you will often see real abuses of notation occur with continuous cases.

e.g. $p_{random\_variable}(RV = \alpha)$, becomes...

$p_A(\alpha)$, becomes...

$p(A)$

### Formulae/facts

Will skip the discrete versions; they're all analogous (swap the integrals for sums).

- Sum to 1:
    - $\int p(x) dx = 1$
- Expected value:    
    - $\mu = \mathbb{E}[x] = \int x \cdot p(x) dx$
- Marginal distribution:
    - $p_A(\alpha) = \int p_{AB}(\alpha, \beta) d\beta$
    - OR $p(A) = \int p(A, B) dB$
- Conditional distribution:
    - $p (A | B) = \frac{p(A, B)}{p(B)}$
- Independent:
    - $p(A | B) = p(A)$
    - $p(A, B) = p(A) \cdot p(B)$
        - It is very easy to verify that the above two expressions imply and are implied by each other (wherever $p(B) > 0$): substitute one into the conditional distribution formula.
        - Note as well that to show independence, one must show that this relationship holds for every possible value of the variables.
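
A quick numerical sanity check of the formulae above, as a sketch. The joint density is an assumed example (a bivariate normal with standard normal marginals and correlation 0.6), and the integrals are approximated by plain Riemann sums on a grid.

```python
import numpy as np

rho = 0.6                                   # assumed correlation
grid = np.linspace(-8.0, 8.0, 801)
step = grid[1] - grid[0]
a, b = np.meshgrid(grid, grid, indexing="ij")   # joint[i, j] <-> (a_i, b_j)

# Assumed joint density p(A, B): bivariate normal, unit variances.
joint = np.exp(-(a**2 - 2 * rho * a * b + b**2) / (2 * (1 - rho**2)))
joint /= 2 * np.pi * np.sqrt(1 - rho**2)

total = joint.sum() * step**2               # sum to 1
p_a = joint.sum(axis=1) * step              # marginal: integrate out B
p_b = joint.sum(axis=0) * step              # marginal: integrate out A
mean_a = (grid * p_a).sum() * step          # expected value of A

# Conditional p(A | B = b0) = p(A, b0) / p(b0), at the grid point nearest 1.
j = np.argmin(np.abs(grid - 1.0))
cond_a = joint[:, j] / p_b[j]
cond_mean = (grid * cond_a).sum() * step    # theory for this density: rho * b0

std_normal = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)
print("total:", total, " E[A]:", mean_a, " E[A | B=1]:", cond_mean)
print("marginal matches N(0, 1):", np.allclose(p_a, std_normal, atol=1e-6))
```

Since `cond_a` differs from `p_a`, A and B here are not independent, which is what you'd expect with a non-zero correlation.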

#### Independence

Note that not being independent (i.e. "dependence") is not the kind of dependence that might first spring to mind. It is not any kind of causal comment, rather simply that information about one gives some information about the other. The reason could be direct (e.g. a parent's hair colour to a child's), co-varying (e.g. sibling intelligence), or even a complicated process that is not obvious at all. Note that this last case still falls into the "co-varying" scenario; the point is just that the degree of separation can be massive.
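
The "information about one gives information about the other" idea can be sketched with small discrete tables. Both tables and the helper `is_independent` are hypothetical, made up for illustration.

```python
import numpy as np

def is_independent(joint, tol=1e-12):
    """True if joint[a, b] == p(a) * p(b) for every cell of the table."""
    p_a = joint.sum(axis=1)
    p_b = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_a, p_b), atol=tol))

# Built as a product of marginals, so independent by construction.
independent = np.outer([0.3, 0.7], [0.4, 0.6])

# Same 50/50 marginals on both variables, but the cells do not factorise.
dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])

# In the dependent table, learning B changes what we believe about A.
p_a_marginal = dependent.sum(axis=1)                      # p(A)
p_a_given_b0 = dependent[:, 0] / dependent[:, 0].sum()    # p(A | B = 0)

print(is_independent(independent), is_independent(dependent))
print("p(A):", p_a_marginal, " p(A | B=0):", p_a_given_b0)
```

Nothing in the second table says *why* A and B move together; it only says that knowing one shifts the distribution of the other.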

### Central limit theorem

Without this guy, stats would be in some trouble...!

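As a rule of thumb, the CLT approximation is usually decent once the sample size is > 20 or so (heavily skewed or heavy-tailed distributions need more). It says that the distribution of the *average* (or sum) of a sample of independent, identically distributed RVs with finite variance becomes approximately normal as the sample size grows, whatever the distribution of the individual RVs.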

> The above CLT applies to the average of independent, identically distributed random variables. However, there are also central limit theorems that apply far less stringent conditions. This means that whenever an output is the result of the sum or average of a number of largely independent factors, then it may be reasonable to assume it is normally distributed. For example, one can argue that an individual's intelligence is the result of the average of a number of factors, including parenting, genetics, life experience and health, among others. Hence, we might assume that an individual's test score picked at random from the population is normally distributed.
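
A small simulation sketch of the CLT in action. The choices here are assumptions for illustration: exponential draws (which are strongly skewed), a sample size of 30, and 20,000 repeated samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 20_000          # sample size, number of repeated samples

# Exponential(1) draws: mean 1, variance 1, and clearly not normal.
raw = rng.exponential(scale=1.0, size=(reps, n))
means = raw.mean(axis=1)      # one sample average per row

def skewness(x):
    """Sample skewness: 0 for a symmetric distribution such as the normal."""
    z = (x - x.mean()) / x.std()
    return (z**3).mean()

print("skewness of raw draws:   ", skewness(raw.ravel()))
print("skewness of sample means:", skewness(means))
print("std of sample means:", means.std(), " theory:", 1 / np.sqrt(n))
```

The sample means cluster around 1 with spread close to $1/\sqrt{n}$, and are far less skewed than the raw draws, though at this sample size some skew is still visible.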