# What is Probability?

Probability is an elusive concept. The idea seems intuitive to many of us, but it is difficult to pin down with a specific definition. When forced to think and speak about it, we say things like "what are the odds" or "what are the chances" because we have an intuition that it involves thinking about how "likely" certain possibilities are to occur. But even here, we're essentially substituting the term "likely" for "probable." Thinking a bit harder, we might say "perhaps it involves mapping possibilities to numbers such that more likely possibilities get higher numbers." That would be a start. To think more completely about the nature of probability, it will be helpful to consider how our predecessors have thought about it historically, particularly the two main schools of modern probability and statistics: frequentists and Bayesians.

## Interpretations and definitions in brief

The pioneers of probability, such as Bernoulli, Bayes, and Laplace, considered _probability_ to represent a degree-of-belief in a statement or hypothesis. Thus, for example, when a Bayesian is talking about the probability $P(D)$ of the earth being a certain distance from the sun, they are talking about the state of their knowledge of the relevant distance, given all the information that they have. 

For frequentists, probability is defined as the long-run relative frequency of an event, given many repeated experiments. As we'll see, in some cases this addresses an apparent challenge of the Bayesian framework (which we'll deal with): that the Bayesian definition _seems_ subjective. If we define probability as the long-run relative frequency of the event, it doesn't matter from whose perspective we derive the results, as relative frequencies are objective. This goal is admirable. However, it introduces additional challenges, such as "What does it now mean to say that 'the probability of the true distance between the earth and the sun is so and so'?" In a frequentist framework, there is only one true distance and so it doesn't make sense to even talk about a probability here.

To be clear, when done correctly, frequentism solves the challenge of subjectivity, but we also raise it now to give a hint of its limitations. In this chapter, we will start first by stating our desired outcomes, show how history has influenced our ways of thinking, and finally show how the issues with frequentism are solved by Bayesianism.



## What we're looking for: towards a principled statistical workflow

Most conversations and works about interpretations and definitions of probability are not really about interpretations and definitions of probability. More often than not, they're driven by worldviews about the best way to do statistical inference, and the conversation about definitions is a proxy for the real conversation about the methods we want to use. To this end, we want to expose our motivations clearly and succinctly.

This book is part of a movement working towards principled statistical workflows that allow humans to understand the world better and to make better decisions. We believe that such a movement requires a lot of moving parts and our opinionated take is that working data scientists require the following:

- The ability to build models that capture our current state of knowledge of the world;
- Principled and consistent workflows: this means being explicit about assumptions and, ideally, that someone else performing the analysis with exactly the same information would get the same results;
- The ability to model the data-generating process(es);
- An ergonomic and intuitive abstraction layer, such as those provided by probabilistic programming languages (which are also in active development!).


There are also key questions around how to build organizations in which such work can be seamlessly integrated into the decision function, among other cultural concerns, such as how we think about, talk about, and model uncertainty. Our main project in this book, however, is to equip individual data practitioners with the key concepts and tools to do the above.


We believe that Bayesian inference, when combined with probabilistic programming and modern hardware, is well-poised to meet all of these needs. Through a discussion of the different historical interpretations and definitions of probability, we'll also meet the practical considerations that impact this question, including the affordances of both frequentist and Bayesian statistics, what they're capable of, and what it looks and feels like to work with them. To this end, our intention is not to cast shade on frequentism by any means, but to show how principled Bayesian workflows better meet the needs we have outlined above. In this spirit, let's jump into a brief historical tour of how humans have thought about probability over time.

## Frequentism

### Playing games and the doctrine of chances

In 1654, the Chevalier de Méré posed a question to Blaise Pascal that became known as the _problem of points,_ a question about a game of chance. This question was first posed in the 15th century, but Méré's posing it to Pascal is central to any historical discussion of probability as it played a key role in Pascal's correspondence with Fermat, which laid the foundations for modern probability theory. In fact, the problem of points resulted in many developments, such as Pascal working through the first explicit discussions of what we now call _expected value._ These were the first inklings of probability taking on a quantitative nature. Probability had of course been thrown about, but it was essentially qualitative up to this point.

What is important to recognize is that these pioneers were _explicitly_ thinking about games of chance (playing cards, rolling dice, flipping coins) and developing the first explicit quantitative techniques to think about probabilities in this setting. To be clear, in this setting, we can enumerate a finite set of explicit outcomes, perform experiments (either physically or as thought experiments), calculate the fraction of times a particular event occurs out of all possible occurrences, and define probability as such:

$$P(E) = \frac{\text{Number of ways E can occur}}{\text{Number of total outcomes}}$$
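As a minimal sketch of this counting definition, the snippet below enumerates the 36 equally likely outcomes of rolling two fair six-sided dice and counts those that sum to 7. The choice of two dice and of the event "the sum is 7" is an illustrative assumption, not an example taken from the historical correspondence.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of rolling two fair six-sided dice
# (the dice and the event below are illustrative assumptions)
outcomes = list(product(range(1, 7), repeat=2))

# Event E: the two dice sum to 7
favourable = [o for o in outcomes if sum(o) == 7]

# Classical definition: ways E can occur / total number of outcomes
p_seven = Fraction(len(favourable), len(outcomes))
print(p_seven)  # 1/6
```

Note that the calculation only works because every outcome in the enumeration is assumed to be equally likely, which is exactly the kind of assumption that games of chance make easy and the real world does not.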

The reason for being so explicit about this is that we shouldn't necessarily expect such a theory to be applicable to probabilistic questions that occur outside games of chance. To expect this is akin to expecting that machine learning agents, trained to play games such as chess, Go, and Starcraft, will somehow perform well in the real world. Or expecting that self-driving cars trained on real-world car data will perform well when confronted with streets full of other self-driving cars.

### Frequencies

In games of chance, we can map out explicitly all the possible options, assign probabilities, and theoretically calculate the probability of any given outcome. But the question still remained: how does this relate to empirically measured frequencies? It was the genius of Jacob Bernoulli and what he called his "golden theorem", now called the "law of large numbers" (LLN), that allowed us to answer this question. The LLN states that:

> the average of the results obtained from a large number of trials should be close to the expected value and tends to become closer to the expected value as more trials are performed.

Let's do an experiment. In fact, let's do Bernoulli's urn experiment. Bernoulli's idea was to have an urn filled with tens of thousands of balls, some red and others white, with the fraction of red balls being some unknown quantity $F$. He would then draw balls from the urn and calculate the sample ratio $S$, in an effort to estimate $F$.

Let’s simulate it and check out $S$ as we draw more and more balls ($n$). The law of large numbers says that with increasing $n$, $S$ will get closer to $F$. (If any of the following code doesn't yet make sense to you, it will in the following chapters.)


In [None]:
#| output: false
#Import packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

In [None]:
#| fig-cap: Simulating the Law of Large Numbers
rng = np.random.default_rng(42)  # Set seed for PRNG
k = 10**3 # largest number of draws
F = 0.6 # True urn fraction of red balls
data = rng.binomial(1, F, k)  # draw balls from urn

n = np.arange(1, k + 1)  # sample sizes 1, 2, ..., k

# Running sample fraction of red balls after each draw (vectorized running mean)
S = np.cumsum(data) / n

plt.plot(n, S)
plt.axhline(F, color='k', linestyle='--')  # true urn fraction F, for reference
plt.xlabel('Sample size $n$')
plt.ylabel('Sample fraction $S$');


By eye, we can see that $S$ indeed seems to converge towards $F$: pretty good! But did this answer Bernoulli's question? Bernoulli was interested in finding out what the true fraction $F$ was and showed that $S$, which we can calculate, gets closer to $F$ as $n$ gets larger. So it tells us about the probability of $S$, rather than that of $F$!

_If this feels counter-intuitive to your prior classical statistics training, you’re not alone._ Aubrey Clayton, to whom the authors are much indebted, has dubbed this [Bernoulli's Fallacy](https://cup.columbia.edu/book/bernoullis-fallacy/9780231199940). Let's now dive a bit deeper into this fallacy as it sheds light on the types of questions that frequentist and Bayesian methods are adept at handling, along with the types of questions that scientists are usually interested in.

### Bernoulli's fallacy

Recall that the law of large numbers (LLN) states that:

> the average of the results obtained from a large number of trials should be close to the expected value and tends to become closer to the expected value as more trials are performed.

So what Bernoulli was saying with the law of large numbers, and what we demonstrated with the above simulation, is that the more balls we draw, the closer the sample urn ratio $S$ gets to the true urn fraction $F=f$. Using the notation of probability,

> $P(S \text{ is close to } f | F = f)$ is high.

Now Bernoulli's interest was in finding out things about the true urn fraction $F$ and it seems intuitive (if not obvious) that the statements "the sample urn ratio is close to the true urn fraction" and "the true urn fraction is close to the sample urn ratio" should be interchangeable. Although the property of closeness is indeed symmetric in many cases, it is definitely NOT when considering probabilities. In our case, the resulting statement would be that 

> $P(F \text{ is close to } s | S = s)$ is high,

where $s$ is the observed sample fraction.

Bernoulli interpreted it to mean he could make statements about the true urn fraction and he did *not* have the notation to recognize that he was confusing $P(S | F)$ for $P(F | S).$

Essentially, he confused the following two statements:
    
- We conclude that, with high probability, the observed sample fraction is close to the true urn fraction.
- We conclude that, with high probability, the true urn fraction is close to the observed sample fraction.

The confusion is understandable as the difference is challenging to parse: one way to think about it is that the first statement says that, if you sample many times, then 95% of the time (for example) the observed sample fraction will be close to the true urn fraction. But the latter makes a claim about the true urn fraction, given a single sample fraction (which could actually be quite far from the true urn fraction). To summarize:

> $P(S | F)$ is not logically the same as $P(F | S)$, even if in some cases, they may be identical quantities.

In Bernoulli's case, the difference did not particularly matter, but in many cases it does. As we'll soon see, Bernoulli's fallacy is the same mistake as assuming the following are the same:

- the probability of person X having an infectious disease, given a positive test result;
- the probability of person X getting a positive test result, knowing that they have the infectious disease.

Later, we'll see how the former can be around 30%, while the latter is close to 100%.
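To get a feel for how far apart these two probabilities can be, here is a minimal sketch that simply counts people in a hypothetical population. The prevalence, sensitivity, and false-positive rate below are made-up illustrative assumptions, not figures from a real test; they are chosen only so that the gap is easy to see.

```python
# Illustrative numbers only (assumptions, not taken from a real test)
population = 100_000
prevalence = 0.01           # fraction of people who have the disease
sensitivity = 0.99          # P(positive test | disease)
false_positive_rate = 0.02  # P(positive test | no disease)

sick = population * prevalence
healthy = population - sick

true_positives = sick * sensitivity
false_positives = healthy * false_positive_rate

# Among everyone who tests positive, what fraction is actually sick?
p_disease_given_positive = true_positives / (true_positives + false_positives)

print(f"P(positive | disease) = {sensitivity:.2f}")
print(f"P(disease | positive) = {p_disease_given_positive:.2f}")
```

Under these assumed numbers, most positive tests come from the much larger healthy group, which is why the two conditional probabilities differ so sharply.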

In another statistical anecdote, popular statistics outlet FiveThirtyEight had an article that described how, by focusing on the wrong probability statement, English football fan Charles Reep may have contributed to English football’s fascination with the “no more than three passes” style of playing, possibly helping to “ruin decades of English soccer”. Here, Reep was fixated on “the percentage of goals generated by passing sequences of a certain length” (i.e. $P(\text{sequence length} | \text{goal})$), which focuses on sequence lengths that produce a goal. Yet the more valuable question that should have been answered is $P(\text{goal} | \text{sequence length})$, or “the probability of a sequence length generating a goal”, which requires also looking at sequences that _didn’t_ produce a goal.

How does this all relate to frequentist and Bayesian ways of approaching statistics? Using both examples above to illustrate, frequentist methods start with a given hypothesis (having a disease; producing a goal) and ask about the probability of observed data (positive test; sequence length). Bayesian methods start with data (positive test; sequence length) and ask about the probability of a range of hypotheses (having a disease; producing a goal). In a sense, frequentist and Bayesian methods answer _mirror_ questions. To reinforce the point, frequentist methods do *not* tell us about the probability of a specific value or hypothesis, given observed data. In fact, they *cannot* tell us this, as the probability of a specific value doesn't even make sense for a frequentist! What they can and do tell us about is the probability of seeing certain data, assuming specific values and/or hypotheses, which can be valuable.

Yet, the vast majority of scientific questions are of the former form: we want to know the probability of the actual rate of infection (or more generally, the probability of a particular hypothesis) given the observed data, and *not* the probability of seeing the observed data, were the hypothesis true (which is what frequentist hypothesis testing gives us).

Put another way, when doing science, we are interested in learning about the _inferential_ probability, rather than the _sampling_ probability.

This issue does not merely apply to estimates, as above, but to the underlying questions these methods are able to answer. For another example, let's look at confidence intervals. As Jake Vanderplas wrote in [Frequentism and Bayesianism: A Python-driven Primer](https://arxiv.org/pdf/1411.5018.pdf), given a 95% confidence interval,

> - A frequentist would say: “If this experiment is repeated many times, in 95% of these cases the computed confidence interval will contain the true $f$.”
> - A Bayesian would say: “Given our observed data, there is a 95% probability that the true value of $f$ lies within the credible region.”
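The frequentist statement is a claim about the procedure under repetition, and that can be checked by simulation. The sketch below repeats the urn experiment many times and counts how often a 95% interval contains the true fraction. The normal-approximation (Wald) interval, the seed, and the sample sizes are assumptions chosen for brevity; other interval constructions exist.

```python
import numpy as np

rng = np.random.default_rng(0)
F = 0.6                 # true fraction, known only because we are simulating
n = 1_000               # draws per experiment
n_experiments = 10_000  # number of repeated experiments

# Sample fraction S from each repeated experiment
S = rng.binomial(n, F, size=n_experiments) / n

# Normal-approximation (Wald) 95% confidence interval for each experiment
half_width = 1.96 * np.sqrt(S * (1 - S) / n)
covered = (S - half_width <= F) & (F <= S + half_width)

coverage = covered.mean()
print(f"Fraction of intervals containing F: {coverage:.3f}")
```

The coverage should come out close to 0.95. Notice that this says nothing about whether any one particular interval contains $F$; it only describes how the interval-building recipe behaves over many repetitions.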

To be fair to Bernoulli, for many of the questions he was interested in, including games of chance, $P(S | F)$ and $P(F | S)$ are numerically identical, even if logically opposite. However, even in games of chance, the frequentist interpretation of probability presents other issues, such as in hypothesis testing and in estimating the probability of extreme events.


### Hypothesis testing

Null hypothesis significance testing (NHST) is the gold standard for hypothesis testing in a frequentist framework. Let's say that we wanted to estimate the difference in means between two groups, a control and an experiment. We would set up:

* a null hypothesis $H_0$ (that there is no difference), 
* an alternative hypothesis (that there is a difference), and 
* a test statistic (related to the sample difference in means).

We then calculate the p-value, defined as the probability of observing a test statistic as extreme or more extreme than the one observed, assuming that the null hypothesis is true. Generally, if $p \leq 0.05$, we reject the null hypothesis $H_0$ at the 5% significance level in favour of the alternative hypothesis. Note that we're essentially calculating $P(T | H_0)$, the probability of the test statistic, given the null hypothesis, rather than $P(H_0 | T)$, the probability of $H_0$, given the test statistic. Reading the former as the latter is precisely Bernoulli's fallacy! Having said that, when used and interpreted correctly, NHST is valid.
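To make the definition of the p-value concrete, here is a minimal sketch that computes one by simulation. It uses a permutation test, which is only one of many possible NHST procedures and is chosen here because its logic is easy to read in code. The data are synthetic, and the group sizes, effect size, and seed are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): two groups whose true means differ by 0.5
control = rng.normal(loc=0.0, scale=1.0, size=50)
experiment = rng.normal(loc=0.5, scale=1.0, size=50)

observed = experiment.mean() - control.mean()  # test statistic T

# Under H0 the group labels are exchangeable, so shuffle them many times
pooled = np.concatenate([control, experiment])
n_perm = 10_000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_stats[i] = shuffled[:50].mean() - shuffled[50:].mean()

# Two-sided p-value: fraction of shuffles at least as extreme as the observed T
p_value = np.mean(np.abs(perm_stats) >= np.abs(observed))
print(f"observed difference = {observed:.3f}, p-value = {p_value:.4f}")
```

Every shuffled dataset is generated assuming $H_0$ is true, so the resulting number is a statement about the test statistic given $H_0$, not about $H_0$ given the data.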

The major challenges with NHST are the arbitrariness of the 5% significance level and the choice of the test statistic. Many treatises have been written on these, so we will say merely a few more words, in the spirit of our purpose, that of looking for principled statistical workflows and methodologies. So let us be clear: NHST is anything but principled! It is a motley collection of dozens (if not more) of statistical tests, and which one to use in any given situation is often unclear. Your authors have been hearing the same types of questions for years now: can I use Student's t-test here? Or is Mann-Whitney more appropriate? Is it a paired test or a one-sample test? Scrap that, I just realised what I'm looking for is the Chi-squared test!

It is, of course, important to choose the right test when doing NHST. However, a principled workflow shouldn't require anybody to choose between a set of dozens of bespoke tests, all built for slightly different purposes, and all containing relatively opaque assumptions ("wait, does my data need to be normally distributed for this test? If so, do I need to test for that before doing the hypothesis test I actually want to do?"). As we'll see in this book, a principled Bayesian workflow includes a single protocol that can be used for most types of scientific questions.

In addition to (1) its propensity for answering the wrong question, (2) the arbitrariness of what we call significant and not, and (3) the practical difficulties involved in choosing between an array of tests, the other major concern we (and many others) have with NHST is the unintuitive nature of the p-value. The p-value is _the probability of observing a test statistic as extreme or more extreme than the one observed, given that the null hypothesis is true_. This definition is so challenging to parse that even research scientists are prone to make the mistake of thinking the p-value is the probability that the null hypothesis is true -- which is the wrong definition! Or that it's the probability of observing your data if the null hypothesis is true -- which it isn’t! For more on this, check out [Not Even Scientists Can Easily Explain P-values](https://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/)! Once again, we are not claiming that NHST is wrong. When used correctly, it is correct! What we are saying is that it is often used incorrectly and that it is easy to use incorrectly.

To summarize: 

* The key definition of the p-value is unintuitive and, as a result, often misinterpreted, even by experts;
* The collection of tests to use in NHST is difficult to navigate, and lay practitioners incorrectly assume that they just have to pick a test to answer their statistical inference question;
* In the vast majority of scientific settings, NHST answers the wrong statistical inference question ($P(\text{data} | \text{hypothesis})$ rather than $P(\text{hypothesis} | \text{data})$); and
* Frequentism can be considered unnecessarily complicated, and this will become clear once you learn the logic of Bayesian statistical practice.



### Becoming a definition


For Bernoulli, using frequencies was a method of calculating probabilities, rather than a definition. It took the 19th and 20th centuries and such statistical and mathematical powerhouses as Francis Galton, Karl Pearson, Ronald Fisher, and John Venn to make frequencies the definition of probability.

A major issue here is that this definition doesn't even make sense for games of chance! To see this, consider the following: as Aubrey Clayton points out in _Bernoulli's Fallacy_, the probability of getting 13 spades in a hand of bridge is about 1 in 635 billion, and 635 billion is many orders of magnitude larger than the number of bridge hands dealt throughout history. A frequentist method could potentially require us to deal bridge hands for millions of years in order to find the correct frequency. But what if we still saw it zero times, or twice, by chance? A frequentist would then say "you need to do the experiment more times" or "we need to perform this experiment as a thought experiment due to the limitations of the physical world." Note how this interpretation results in a circularity: if I flip a coin enough times and don't get close to 50% heads, a frequentist would tell me that there is something wrong with the coin or I haven't flipped it enough times.
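The 1-in-635-billion figure comes from counting, not from dealing cards. As a quick check, the sketch below counts the distinct 13-card hands that can be drawn from a standard 52-card deck, assuming every hand is equally likely.

```python
from math import comb

# Number of distinct 13-card hands from a 52-card deck
n_hands = comb(52, 13)

# Exactly one of those hands consists of all 13 spades
p_all_spades = 1 / n_hands

print(n_hands)  # 635013559600
```

The probability is obtained from what we know about the deck and the deal, without observing a single long-run frequency.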


This means that, from the frequentist POV, the logic of probability is:

> We somehow "know" the frequency of an outcome --> this allows us to compute one-time probabilities

Whereas the logic really should be:

> we know things about the physics and symmetry of coins that leads to a degree-of-belief around $P(H)$, say, $P(H)=0.5.$ --> We then predict the long-run frequency of the theoretical result of flipping the coin lots of times --> THIS is the frequency

The real damage was done when the frequentist interpretation of probability became definitional, rather than an approach for calculating it. This resulted in _the definition_ of probability becoming objective, which was part of a broader scientific project. Bernoulli had always interpreted _probability_ as representing a degree-of-belief and used the LLN as a method of calculation, not as a definition.

## Bayesianism, degrees of belief, and probability as logic


### Solving Bernoulli's fallacy

Reverend Thomas Bayes was explicitly interested in answering inverse questions such as "given a sample rate of infection, what can we say about the infection rate in the entire population?"  Thus, in a strong sense, his aim was to find a solution to Bernoulli's fallacy, although he did not frame it as such. In order to do this, he needed to be able to switch between conditional probabilities $P(A | B)$ and $P(B | A)$ and, to this end, he proved an equation we now know as Bayes' Theorem:


$$P(B | A) = \frac{P(A | B) P(B)}{P(A)}$$

$B$ can be a hypothesis or some variable we're attempting to estimate, while $A$ is classically the data we've observed. Notice that the space of hypotheses is large, as is the space of ways in which the data could have been generated and collected, meaning that we’re able to assign probabilities to them.

Let's spend a little more time with this form of the equation, $P(H | D) = \frac{P(D | H) P(H)}{P(D)},$ for data $D$ and hypothesis $H.$ For reasons we'll go into more deeply later, much of the time we are interested in what $P(H | D)$ is proportional to, in which case the denominator is irrelevant:

$$P(H | D) \propto P(D | H)P(H).$$
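To see why the denominator can often be ignored, consider a minimal sketch with just two competing hypotheses about the urn. The candidate fractions, the equal prior, and the observed draws below are all hypothetical values chosen for illustration.

```python
from math import comb

# Two candidate hypotheses for the urn's fraction of red balls (illustrative assumption)
hypotheses = [0.4, 0.6]
prior = [0.5, 0.5]  # equal prior belief in each hypothesis

# Hypothetical data: 7 red balls in 10 draws (with replacement)
n_draws, n_red = 10, 7

# Likelihood P(D | H) under each hypothesis: binomial probability
likelihood = [
    comb(n_draws, n_red) * f**n_red * (1 - f) ** (n_draws - n_red)
    for f in hypotheses
]

# Unnormalized posterior: P(D | H) * P(H)
unnormalized = [lik * p for lik, p in zip(likelihood, prior)]

# Dividing by the sum (which plays the role of P(D)) rescales the values
# so they add up to 1, but does not change their ratio
posterior = [u / sum(unnormalized) for u in unnormalized]
print([round(p, 3) for p in posterior])  # [0.165, 0.835]
```

The ratio between the two unnormalized values already tells us which hypothesis the data favour and by how much; $P(D)$ only turns those values into probabilities that sum to one.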


### Solving the inverse problem: temporal and causal order are unimportant

One of Bayes' keen insights in answering the _inverse problem_ was to decouple probability and causation. Until Bayes came on the scene, it made sense to talk about $P(E_1 | E_2)$, the probability of an event $E_1$ occurring, knowing that $E_2$ occurred, only if $E_1$ occurred after $E_2$. But Bayes' solution to the inverse problem now meant that it made sense to think and talk about $P(E_1 | E_2),$ independent of temporal or causal order!

Let's consider a variation on a simplified example that Edwin Jaynes constructed in [Clearing up Mysteries — The Original Goal](https://link.springer.com/chapter/10.1007/978-94-015-7860-8_1):

> You have an urn containing 5 red balls and 5 white balls, randomly mixed. You take out one ball $B_1$ and it is white. You take out a second ball $B_2$. What is the probability that $B_2$ is white?

We know that there are 4 white balls and 5 red balls in the urn before you draw $B_2$, from which we conclude that the probability of $B_2$ being white is $4/9$. Now let's consider a second related, yet distinct question:

> You have an urn containing 5 red balls and 5 white balls, randomly mixed. You take out one ball $B_1$ and do not look at it. You take out a second ball $B_2$ and it is white. What is the probability that $B_1$ is white?

The answer is again $4/9$, exactly as in the first question. This can seem counter-intuitive, even for those of us who have seen this example many times! The psychological paradox-of-sorts lies in reconciling the following two observations:

* The colour of $B_2$ CANNOT causally impact the colour of $B_1$;
* The colour of $B_2$ DOES provide information about the distribution of colours across the other balls and thus does tell us something about the colour of $B_1$.

To make this totally clear, consider the case in which our urn contains 1 ball of each colour. In this case, it is clear that, if the second ball is white, then the first must have been red, and vice versa. What this reveals is that we are dealing with logical statements and connections here, rather than causal ones. What Bayes had done was frame probability as logic, rather than merely in causal terms, and it was this _probability as logic_ that allowed him to move back and forth between $P(E_1 | E_2)$ and $P(E_2 | E_1)$ to solve inverse problems. This idea of _probability as logic_ allows us to incorporate all forms of information into our state of knowledge and probabilities. We will have more to say about the utility of thinking about _probability as logic_ as we go on, but now it's time to see how Bayes' Theorem can lead to a principled workflow.
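The claim that the second draw is informative about the first can be checked by brute-force enumeration. The sketch below lists every equally likely ordered pair of draws from the urn with 5 red and 5 white balls and computes the probability that the first ball is white, given that the second one is. The variable names are our own.

```python
from fractions import Fraction
from itertools import permutations

# Urn from the example: 5 red (R) and 5 white (W) balls
urn = ["R"] * 5 + ["W"] * 5

# Every ordered pair of distinct balls (first draw, second draw) is equally likely
pairs = list(permutations(range(len(urn)), 2))

# Keep only the pairs in which the second ball is white
second_white = [(i, j) for i, j in pairs if urn[j] == "W"]

# Of those, count the pairs in which the first ball is also white
both_white = [(i, j) for i, j in second_white if urn[i] == "W"]

p_b1_white_given_b2_white = Fraction(len(both_white), len(second_white))
print(p_b1_white_given_b2_white)  # 4/9
```

No causal story is needed here: conditioning is just a matter of restricting attention to the outcomes consistent with what we know.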

### The principled inferential workflow

Let's now name the terms in the above equation as they are key players in the world of Bayesian inference:

* $P(H | D)$ is the **posterior distribution** and tells us about the probability of the hypothesis $H$, in light of the data;
* $P(D | H)$ is the **likelihood** of the data, given the hypothesis _or_ the probability of seeing the data, assuming that the hypothesis is true;
* $P(H)$ is the **prior distribution**, which encodes what we know about the hypothesis before seeing the data.

We can think about these terms as follows:

> The posterior is what we want, the data is what we have, and the prior encodes our knowledge about the world.


So what does the workflow look like to calculate our posterior distribution? In the basics of Bayesian model building, the steps are

1. To completely specify the model in terms of _probability distributions_. This includes specifying 
    - what the form of the sampling distribution of the data is, given by the **likelihood** _and_ 
    - what form describes our _uncertainty_ in the unknown parameters, given by the **prior**.
2. Calculate the _posterior distribution_.

This is it. We will see this play out time and time again throughout this book. We'll also see how probabilistic programming languages, such as PyMC, make these steps relatively straightforward, including calculating the posterior for us!

Note that, in completely specifying the model in terms of _probability distributions_, we have to make explicit what we consider the data-generating process(es) to be, which is one of our requirements for our workflow.
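The two steps can be sketched for the urn example without any special tooling by using a grid approximation, a simple stand-in for what a probabilistic programming language would do for us. The flat prior, the grid resolution, the simulated data, and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
F_true = 0.6                              # used only to simulate data
data = rng.binomial(1, F_true, size=100)  # 1 = red ball, 0 = white ball
n_red, n = data.sum(), data.size

# Grid of candidate values for the unknown fraction F
grid = np.linspace(0, 1, 1001)

# Step 1a: likelihood of the data for each candidate F (binomial kernel)
likelihood = grid**n_red * (1 - grid) ** (n - n_red)

# Step 1b: prior over F; a flat prior is an assumption made for simplicity
prior = np.ones_like(grid)

# Step 2: posterior is proportional to likelihood times prior,
# then normalized so that it sums to 1 over the grid
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

posterior_mean = (grid * posterior).sum()
print(f"posterior mean of F = {posterior_mean:.3f}")
```

The result is a full distribution over $F$ given the data, which is the inferential quantity Bernoulli was after. Grid approximation becomes impractical as the number of parameters grows, which is one reason dedicated tools are used in practice.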

It is now worth saying a few words about the prior, as it is one of the less well understood aspects of Bayesian inference.

### On the subjectivity of the prior

The main concern voiced about the prior is the perspective that it is subjective. First we'd like to note that this concern is understandable, if overblown. The truth of the matter is that the prior is intended to encode certain assumptions about our data and the data-generating process(es) that any analysis will necessarily contain, explicit or otherwise, and that all Bayesian inference does for us in this regard is to force us to be explicit about our assumptions.

Perhaps more importantly, the true nature of the project of Bayesian inference requires that, if two people have exactly the same information, then their priors and analyses should be the same. This underlies not only the underpinnings of Bayesian inference, but of probabilistic programming languages, such as PyMC, and principled (Bayesian) workflows, which we are teaching in this book.

Both Bayes and Laplace knew these things and used them well. Laplace, for example, used this method to estimate the mass of Saturn, given orbital data and knowledge of celestial mechanics:

In [None]:
#| echo: false
from IPython.display import Image
Image("../../img/saturn-laplace.png")

We'll have a lot more to say about the choice of priors in the coming chapters, but we wanted to mention it here as we're aware it's top of mind for many people. But it also raises the question: how do we go about incorporating known information into our probability distributions? This raises a broader point worth mentioning about Bayesian inference: that _all_ probability incorporates information, in other words, that _all probability is conditional_. Let’s address this last point now.

### All probability is conditional

Notice that, in the above figure, we are interested in $P(M | D, I),$ the probability distribution of the mass of Saturn $M$, given data $D$ and something $I.$ This $I$ is the set of all other information used to calculate $P.$ In Laplace's case, that distribution included, and therefore was _conditioned on_, what people knew about physics, celestial mechanics, and the telescopic telemetry of the time. Once again, Bayesian inference requires us to be explicit about our assumptions, along with what we're conditioning on. In the Bayesian interpretation of _probability_, then, **all probability** is conditional.

An obvious question would then be: "What are we conditioning on when we say, for a fair coin, the probability of heads $P(H)=0.5$?" And the answer is we're conditioning on all the information that led us to presume the coin is fair, from the physics of wind resistance and gravity, to trust in the Mint that they are making unweighted coins.

**The point is**: There is ALWAYS information we're considering, and that information is used to condition our probabilities.

**The question is**: do you want to be explicit about it?

Having said that, it can become cumbersome to write $P(X | I)$ all the time so, for brevity's sake, it is standard to write $P(X)$, as long as we remember that when we write $P(H)=0.5$ in the case of a coin flip, what we really mean is $P(H | I)=0.5,$ where $I$ is all the relevant information that led us to assume the coin is fair. Now let's move back to thinking about _probability as logic_ and how it can help us.


### Probability as logic

The 20th century saw key developments in the formalization of _probability as logic_, from Andrey Kolmogorov's work on Measure Theory to Richard Cox's axiomatic approach to probability and Edwin Jaynes' work on principled Bayesian workflows (such as robust approaches for assigning priors), the principle of _maximum entropy_, and his work on transformation groups.

Cox, a professor of physics at Johns Hopkins University, stepped back and considered what quantitative rules might exist for reasoning that is both logical and consistent and he began by thinking about how we can express relative beliefs in the truth value of certain propositions. He discovered that _transitivity_ was a minimal requirement for expressing beliefs consistently, that is, 

* if I believe proposition $A$ more than I believe proposition $B$ **and**
* I believe proposition $B$ more than I believe proposition $C$, **then**
* I also necessarily believe proposition $A$ more than I believe proposition $C$.

He then made two other assumptions, which seem reasonable:

* if I state how much I believe $X$ to be true, I've also (implicitly) stated something about how much I believe $X$ to be false (although Cox didn't assume anything about the nature of this relationship);
* if I state (1) how much I believe $Y$ AND (2) how much I believe $X$, given $Y$, then I have also specified how much I believe that both $X$ and $Y$ are true.

Under these assumptions, assigning real numbers to degrees of belief, and using the techniques of Boolean logic, Cox was able to show that

$$P (X | I) + P (\bar{X} | I) = 1$$

and

$$P (X, Y | I) = P(X | Y, I) \times P(Y | I)$$

Not only was Cox able to derive these building blocks of probability theory from his minimal assumptions, but he also obtained two important corollaries, _Bayes' theorem_ and _marginalization_ (see below).
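
As a brief worked step, Bayes' theorem follows from the product rule in two lines. Because the joint belief in $X$ and $Y$ does not depend on the order in which we name them, the product rule can be written both ways:

$$P(X, Y | I) = P(X | Y, I) \times P(Y | I) = P(Y | X, I) \times P(X | I)$$

Dividing through by $P(Y | I)$, and assuming $P(Y | I) > 0$, gives

$$P(X | Y, I) = \frac{P(Y | X, I) \times P(X | I)}{P(Y | I)}$$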


Edwin Jaynes continued this project of _probability-as-logic_ and made it clear that Bayesian inference is not subjective; rather, it is about the information you have and the degree of certainty that information warrants. Any two people with the same information should arrive at the same priors and the same analysis. Moreover, Jaynes envisaged a future in which you could program these analyses into a robot.

### A word on marginalization

The second corollary to Cox's theorem was the principle of _marginalization_:

$$ P(X | I) = \sum_i P(X , Y_i | I),$$

where the $Y_i$ are all the possibilities (mutually exclusive and exhaustive) for another variable $Y$. As we'll see time and time again, marginalization is key for statistical inference, in particular when dealing with what are known as _nuisance parameters_, and it is something that Bayesian inference is better equipped to deal with than frequentism is.
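
To make the sum concrete, here is a minimal sketch of marginalization over a small discrete joint distribution. The variables (weather and season) and all of the numbers are hypothetical, chosen only for illustration, and the function name `marginalize` is ours.

```python
# Hypothetical joint probabilities P(X, Y | I), keyed by (x, y).
# The values are made up for illustration and sum to 1.
joint = {
    ("rain", "winter"): 0.20,
    ("rain", "summer"): 0.05,
    ("dry", "winter"): 0.30,
    ("dry", "summer"): 0.45,
}


def marginalize(joint_probs):
    """Return P(X | I) by summing P(X, Y_i | I) over every value Y_i."""
    marginal = {}
    for (x, _y), p in joint_probs.items():
        marginal[x] = marginal.get(x, 0.0) + p
    return marginal


p_x = marginalize(joint)
print(p_x)
```

Here the season plays the role of $Y$: we do not care about it directly, but summing over it is what gives a coherent $P(X | I)$.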

In [Frequentism and Bayesianism: A Python-driven Primer](https://arxiv.org/pdf/1411.5018.pdf), Jake Vanderplas gives a nice treatment of an example that harks back to Reverend Bayes himself:

> Alice and Bob enter a room. Behind a curtain there is a
billiard table, which they cannot see. Their friend Carol rolls
a ball down the table, and marks where it lands. Once this mark
is in place, Carol begins rolling new balls down the table. If
the ball lands to the left of the mark, Alice gets a point; if
it lands to the right of the mark, Bob gets a point. We can
assume that Carol’s rolls are unbiased: that is, the balls have
an equal chance of ending up anywhere on the table. The first
person to reach six points wins the game.
Here the location of the mark (determined by the first roll)
can be considered a nuisance parameter: it is unknown and
not of immediate interest, but it clearly must be accounted for
when predicting the outcome of subsequent rolls. If this first
roll settles far to the right, then subsequent rolls will favor
Alice. If it settles far to the left, Bob will be favored instead.
Given this setup, we seek to answer this question: _In a
particular game, after eight rolls, Alice has five points and
Bob has three points. What is the probability that Bob will get
six points and win the game?_

The naive frequentist approach uses maximum likelihood to estimate the location of the marker and then computes the probability of Bob winning from that single estimate, using what's called a binomial likelihood (to be defined in Chapter 3). The Bayesian approach recognizes that we need to incorporate our uncertainty about the location of the marker into our analysis, and it uses marginalization to do so by setting $Y$ above to the location of the marker. As Vanderplas demonstrates via simulation, the Bayesian approach gives the correct answer and the naive frequentist approach does not. He also points out that the issue is not necessarily with frequentism itself, but with applying it naively. However, he also correctly notes that Bayesianism provides a more natural framework for these types of questions.
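
The comparison can be sketched in a few lines of Python. This is a minimal sketch rather than Vanderplas's own code: the function names are ours, $p$ denotes the probability that Alice scores on any given roll, and we assume a uniform prior on $p$, as implied by Carol's unbiased first roll. Bob wins only if he takes the next three rolls in a row.

```python
import random
from math import factorial


def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a and b."""
    return factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)


def naive_frequentist():
    """Plug the maximum likelihood estimate p_hat = 5/8 into (1 - p)^3."""
    p_hat = 5 / 8
    return (1 - p_hat) ** 3


def bayesian():
    """Marginalize (1 - p)^3 over the posterior for p, given 5 vs 3 points.

    With a uniform prior the posterior is proportional to p^5 (1 - p)^3, so
    the answer is the ratio of two Beta integrals: B(6, 7) / B(6, 4).
    """
    return beta_fn(6, 7) / beta_fn(6, 4)


def monte_carlo(n_games=100_000, seed=0):
    """Simulate games and keep those where Alice leads 5-3 after 8 rolls."""
    rng = random.Random(seed)
    matching = 0
    bob_wins = 0
    for _ in range(n_games):
        p = rng.random()  # position of Carol's mark, i.e. P(Alice scores)
        alice_points = sum(rng.random() < p for _ in range(8))
        if alice_points != 5:
            continue
        matching += 1
        # Bob needs all of the next three rolls to reach six points first.
        if all(rng.random() >= p for _ in range(3)):
            bob_wins += 1
    return bob_wins / matching


print("naive frequentist:", naive_frequentist())
print("Bayesian:", bayesian())
print("Monte Carlo:", monte_carlo())
```

Under these assumptions the naive estimate is $(3/8)^3 \approx 0.053$, while the marginalized answer is $1/11 \approx 0.091$; the simulated frequency should land close to the latter.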

## Towards a principled statistical workflow

As stated earlier, this book is part of a movement working towards principled statistical workflows that allow humans to understand the world better and to make better decisions. We believe that such a movement has a lot of moving parts, and our opinionated take is that working data scientists require the following:

- The ability to build information-inclusive models;
- Principled and consistent workflows: this means being explicit about assumptions and, ideally, that someone else performing the same analysis with exactly the same information would get the same results;
- The ability to model the data-generating process(es);
- An ergonomic and intuitive abstraction layer, such as those provided by probabilistic programming languages (which are also in active development!).


There are also key questions around how to build organizations in which such work can be seamlessly integrated into the decision function, along with other cultural concerns, such as how we think about, talk about, and model uncertainty. Our main project in this book, however, is to equip individual data practitioners with the key concepts and tools to do this.

What's clear to us is that the above is achieved far more readily by Bayesian inference than by frequentism because 

- it requires us to be explicit about our assumptions and modeling choices;
- the workflow is principled, in the sense that the steps are similar each time: unlike in frequentism, there is not a different test, statistic, or approach for each type of question;
- modeling the data-generating process is baked into the workflow through our priors and likelihoods;
- PPLs provide good, ergonomic abstraction layers for such modeling.

We look forward to going on this Bayesian journey with you!

___