# Synopsis


This notebook provides a quick overview of the histories of probability theory and statistics. It then briefly introduces some concepts in probability theory, including:

* Events, micro-states, macro-states
* Sample spaces 
* Axioms of probability
* Functions of random variables
* Connection to statistical physics

# Read libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from copy import copy, deepcopy
from pathlib import Path
from sys import path

path.append( str(Path.cwd().parent) )

In [None]:
import itertools

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import scipy.stats as stats

from Amaral_libraries.my_stats import half_frame, get_product_sample_space

In [None]:
my_fontsize = 15

# A brief history of probability theory (see [Wikipedia article](https://en.wikipedia.org/wiki/History_of_probability))

To the best of my knowledge, the systematic study of probabilities started in Europe in the 1500s. **Girolamo Cardano**, a mathematician born in Pavia, introduced the binomial expansion (see [Wikipedia article](https://en.wikipedia.org/wiki/Gerolamo_Cardano) for the source of these sentences)

> $(x + y)^n$

and the binomial coefficients. He was apparently "notoriously short of money and kept himself solvent by being an accomplished gambler and chess player". He used the game of throwing dice to understand the basic concepts of probability. He demonstrated the efficacy of defining odds as the ratio of favorable to unfavorable outcomes.

His book about games of chance, *Liber de ludo aleae* ("Book on Games of Chance"), written around 1564, but not published until 1663, contains the first systematic treatment of probability, as well as a section on effective cheating methods. One has to wonder if others able to reach the same understanding of stochastic/random/aleatory processes chose to keep that knowledge to themselves.

The French mathematicians **Pierre de Fermat** and **Blaise Pascal** took the torch from Cardano. Through their correspondence in the 1650s, they set the foundations of probability theory. Pascal developed an arithmetic approach to calculate the binomial coefficients, which ended up being called **Pascal's triangle**. Fermat is credited with carrying out the first-ever rigorous probability calculation. In it, he demonstrated why a professional gambler was correct in his experience that if he bet on rolling at least one six in four throws of a die he won in the long term, whereas betting on throwing at least one double-six in 24 throws of two dice resulted in his losing.  

It is important to understand the importance of scientific correspondence at this time.  This was before there were scientific journals.  Scientists advertised their work through letters to their colleagues and rivals.  Those letters would then be transcribed and discussed at salons around Europe.  An accomplished scientist might eventually publish his $-$ they were mostly men who did this even though many women participated in the salon discussions $-$ work in book form.



Many of the results presented in Fermat and Pascal's correspondence ended up in book form due to **Christiaan Huygens**' publication of *Van Rekeningh in Spelen van Gluck* ("Reasoning on Games of Chance"). Huygens introduced the concept of **expected value**.

Huygens' book later inspired the work of **Jacob Bernoulli** and **Pierre-Simon Laplace**.  The Swiss mathematician Bernoulli is also responsible for defining the **Bernoulli trial**, a stochastic 'experiment' with two possible outcomes ('failure' and 'success') and a constant probability of success $p$. He wrote an influential book on probabilities $-$ *Ars Conjectandi* ("The Art of Conjecturing") $-$ which was published after his death. In the book, Bernoulli gives the first known derivation of the law of large numbers, a **theorem** that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should approach the **expected value** as the number of trials tends to infinity. 

The French polymath Laplace was one of the most accomplished scientists of Western Europe. He was responsible for advances in many fields.  Among his contributions to probability theory and statistics, his 1812 book *Théorie analytique des probabilités* ("Analytical Theory of Probabilities") stands out. The first half of the book dealt with probability methods, the second half with statistical methods and applications. 

In his *Essai philosophique sur les probabilités* (1814), Laplace set out a mathematical system of inductive reasoning based on probability, which we would today recognize as Bayesian. He begins the text with a set of "principles of probability":

> 1. Probability is the ratio of the *favored events* to the total possible events.
>
> > $P(s) = \frac{s}{n}$
>
> 2. The first principle assumes equal probabilities for all events. When this is not true, we must first determine the probabilities of each event. Then, the probability is the sum of the probabilities of all possible favored events.
>
> 3. For independent events, the probability of the occurrence of all is the probability of each multiplied together.
>
> > $P(B, A) = P(A)~P(B)$
>    
> 4. For events not independent, the probability of event B following event A (or event A causing B) is the probability of A multiplied by the probability that, given A, B will occur.
>
> > $P(B, A) = P(A)~P(B~|~A)$
>
> 5. The probability that A will occur, given that B has occurred, is the probability of A and B occurring divided by the probability of B.
>
> > $P(A~|~ B) = \frac{P(A~ \cap~ B)}{P(B)}$ 

One well-known formula arising from Laplace's system is the rule of succession, given as principle seven. Suppose that some trial has only two possible outcomes, labeled *success* and *failure*. Under the assumption that little or nothing is known a priori about the relative plausibilities of the outcomes, Laplace derived a formula for the probability that the next trial will be a success.

> $P({\text{next outcome is success}}) = \frac {s+1}{n+2}$

where $s$ is the number of previously observed successes and $n$ is the total number of observed trials. It is still used as an estimator for the probability of an event if we know the event space, but have only a small number of samples.  **The rule of succession has an advantage over simply using the first principle on finite data.  The first principle would set the probability of any unobserved outcome to zero, and it cannot be used at all in the absence of data, since $s/n$ is undefined for $n = 0$. The rule of succession instead gives $1/2$ when $n = 0$.**
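As a minimal sketch of the comparison above, the snippet below evaluates both estimators for a few values of $s$ and $n$. The function names `first_principle` and `rule_of_succession` are illustrative choices, not part of any library, and returning `None` for $n = 0$ is just one way to flag the undefined case.

```python
from fractions import Fraction

def first_principle(s, n):
    """Naive estimate s / n; returns None when there is no data (n == 0)."""
    if n == 0:
        return None
    return Fraction(s, n)

def rule_of_succession(s, n):
    """Laplace's estimate (s + 1) / (n + 2)."""
    return Fraction(s + 1, n + 2)

# (successes, trials): no data, nothing but failures, nothing but successes, mixed
for s, n in [(0, 0), (0, 3), (3, 3), (7, 10)]:
    print(f"s = {s}, n = {n}: first principle = {first_principle(s, n)}, "
          f"rule of succession = {rule_of_succession(s, n)}")
```

With three failures and no successes, the first principle assigns probability 0 to a success, whereas the rule of succession assigns 1/5.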

Later in the 19$^{th}$ century, the probabilistic approach to stochastic processes was exploited by the physicists **Ludwig Boltzmann** and **J. Willard Gibbs** to develop statistical physics. Below, I provide an example.  

The axiomatic approach to probability was fully developed by the Russian polymath **Andrey Kolmogorov** in his  *Foundations of the Theory of Probability* (1933). 


# A brief history of statistics (see [Wikipedia article](https://en.wikipedia.org/wiki/History_of_statistics))

While initially statistics referred to the collection of data and their description (what is now called *descriptive statistics*), it grew to include the analysis and interpretation of data (what is now called *inferential statistics*).  It is in the domain of inferential statistics that probability theory comes in and connects to statistics.

The word statistics is derived from the New Latin *statisticum collegium* ("council of state") and the Italian word *statista* ("statesman" or "politician"). The reason is that statistics owes its existence to the needs of the bureaucratic state. How many people are there? Where do they live? How long will they live? What do they produce? What do they consume? How many can we conscript for our armies? 

<img src = "Images/westminster_abbey_pyx.png" width = 400>

Both the **Han Dynasty** and the **Roman Empire** extensively gathered data on the size of the empire's population, geographical area and wealth (descriptive statistics). The first recorded example of inferential statistics in Western Europe was the **Trial of the Pyx**, which is a test of the purity of the coinage of the Royal Mint. Starting in the 12$^{th}$ century, the Trial used a statistical sampling method. After minting a large number of coins, a single coin was selected and placed in the Pyx $-$ a box in Westminster Abbey (see photo; a pyx is a container used to store consecrated hosts). After a given period, the coins were removed and weighed, and a smaller sample was tested for purity. 

In 1662, **John Graunt** and **William Petty** developed early census methods that provided a framework for modern demography. They produced the first life table, which tabulates the probabilities of survival to each age. In his book *Natural and Political Observations Made upon the Bills of Mortality*, Graunt used analysis of the mortality rolls to make the first statistically based estimation of the population of London.  

In 1710, **John Arbuthnot** investigated the human sex ratio at birth. He consulted birth records in London for the period 1629 to 1710 (82 years), and found that the number of males born in London exceeded the number of females in every year. Assuming an equal ratio, the probability of observing more male than female births in a year would be 1/2. Thus the probability of the observed outcome under this hypothesis would be $0.5^{82}$.  This is vanishingly small, leading Arbuthnot to conclude that this was not due to chance, but to "divine providence". For this and similar work, Arbuthnot is credited as the first user of **significance tests**.  
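To get a feel for how small this number is, here is a short sketch of the calculation. The variable names are illustrative; the only inputs are the years and the 1/2 probability quoted above.

```python
n_years = 1710 - 1629 + 1   # both endpoints are included, giving 82 years
p_more_males = 0.5          # hypothesis: a male-majority year is as likely as a female-majority one

# Probability that males outnumber females in every one of the 82 years
p_all_years = p_more_males ** n_years
print(f"{n_years} years -> probability = {p_all_years:.3e}")
```

The result is of order $10^{-25}$.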

In the past, accomplished scientists made important contributions to statistics. **Isaac Newton** was employed by the Royal Mint and worked to standardize coinage. Laplace made important contributions to demographics. **Johann Carl Friedrich Gauss** spent 14 years later in life surveying the territory of Hanover. 

From the beginning, statistics would also play a critical role in the development of science and the scientific method $-$ testing scientific hypotheses and models. In the 1500s, **Tycho Brahe** used the **arithmetic mean** to reduce the error in his estimates of the position of celestial bodies. The English mathematicians **Roger Cotes** and **Thomas Simpson** developed the **theory of errors**, which postulates how measurements may differ from actual values.  This work would lead to the development of the normal distribution.

The Scottish engineer and political economist **William Playfair** introduced the idea of graphical representation into statistics. His *Statistical Breviary* (1801) popularized the line chart, bar chart, pie chart, and histogram. In 1801, Gauss developed and applied the **method of least squares** to accurately predict the orbit of Ceres $-$ a dwarf planet in the asteroid belt $-$ using only a few data points corresponding to less than 1% of the entire orbit.


The development of modern mathematical statistics owes much to the work of English polymath (and racist) **Francis Galton** and his (also racist) compatriot **Karl Pearson**.  Galton introduced many important statistical concepts, including the standard deviation, correlation, and regression. He applied these methods to the study of the variety of human characteristics $-$ height, weight, and eyelash length among others $-$ and found that many of these could be well fitted to a normal distribution.  Galton's study of the accuracy of 787 guesses of the weight of an ox at a country fair led to the concept of **wisdom of the crowd**.

Pearson's work, and that of Galton, underpins many of the "classical" statistical methods that we will study in these materials.  They include the correlation coefficient, the method of moments for the estimation of the parameters of distributions fitted to samples, and the P-value.  Pearson emphasized the statistical foundation of scientific laws and initiated the concept of statistical hypothesis testing. Pearson also developed a powerful dimensionality reduction technique $-$ principal component analysis.

**Ronald Aylmer Fisher**, a racist English polymath, is frequently credited as the genius who "almost single-handedly created the foundations for modern statistical science". Fisher pioneered the **principles of the design of experiments** and developed computational algorithms for analyzing data from his balanced experimental designs. Critically, he pursued the analysis of real data as the springboard for the development of new statistical methods. Fisher's book *Statistical Methods for Research Workers* (1925) became the standard reference for the teaching and study of statistics. It was followed in 1935 by *The Design of Experiments*, which was also widely used. 

In addition to analysis of variance, Fisher named and promoted the **method of maximum likelihood estimation** as a replacement for the method of moments and the method of least squares. Fisher is also responsible for the popularity of the 5% level of significance. He stated that deviations exceeding twice the standard deviation are regarded as significant. Before him, deviations exceeding three times the **probable error** were considered significant. For a normal distribution, the probable error is approximately 2/3 of the standard deviation.
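The two significance thresholds can be compared numerically. The sketch below uses `statistics.NormalDist` from the Python standard library and assumes a two-sided test on a standard normal deviate; the helper `two_sided_tail` is an illustrative name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probable error: half of the probability mass lies within +/- PE of the mean,
# so PE is the 75th percentile of the standard normal distribution.
probable_error = std_normal.inv_cdf(0.75)

def two_sided_tail(z):
    """Probability that a standard normal deviate exceeds z in absolute value."""
    return 2 * (1 - std_normal.cdf(z))

print(f"probable error           = {probable_error:.4f} standard deviations")
print(f"P(|Z| > 2 std. dev.)     = {two_sided_tail(2):.4f}")
print(f"P(|Z| > 3 probable err.) = {two_sided_tail(3 * probable_error):.4f}")
```

Both thresholds correspond to tail probabilities slightly below 5%, so Fisher's convention is close to the older one.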


# Probability




## Bernoulli trials and Bernoulli processes

A **binomial trial** or **Bernoulli trial** is a random 'experiment' with two possible outcomes: *success* and *failure*.  The probability of success in an individual trial is denoted $p$ and **is assumed to be constant**. 

Examples of Bernoulli trials include:

> Tossing a coin $-$ getting `heads` is success, getting `tails` is failure.
> 
> Rolling a die $-$ rolling a `1` is success, `any other outcome` is failure.
>
> Drawing a card from a shuffled deck $-$ getting an `Ace` is success, `any other outcome` is failure.
>
> Birth of child $-$ getting a `girl` is success, `any other outcome` is failure.
>
> Football match involving Portugal $-$ Portugal `winning` is success, `any other outcome` is failure.

A **Bernoulli process** is a finite or infinite sequence of Bernoulli trials.  The outcome of trial $i$ is denoted $X_i$ and it can take values in $\{0, 1\}$, where `1` is success and `0` is failure.
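A finite Bernoulli process is easy to simulate. The following is a minimal sketch using only the standard library; the function name `bernoulli_process` and the choice of seed are assumptions made for illustration.

```python
import random

def bernoulli_process(p, n_trials, seed=None):
    """Return a list of n_trials outcomes X_i in {0, 1}, each equal to 1 with probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n_trials)]

# Rolling a die where a `1` counts as success: p = 1/6
outcomes = bernoulli_process(p=1/6, n_trials=10_000, seed=42)
print(outcomes[:20])
print(f"fraction of successes = {sum(outcomes) / len(outcomes):.3f}")
```

For a large number of trials, the fraction of successes should be close to $p$.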

We will learn some of the properties of Bernoulli processes when discussing [discrete random variables](nb_09_Discrete_random_variables.ipynb).


## Equi-probability assumption, i.e., maximum entropy

The equi-probability assumption states that every simple outcome in a process has the same probability of occurring. For example, when rolling a die, the probability of getting any face value is identical, $P(i) = p$, so that

> $P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 6p$   .

If we further assume that when rolling a die we will always obtain a single face $-$ i.e., the die always settles onto a stable equilibrium position $-$ then the probabilities of getting each of the faces must add up to one.

> $1 = P(1 \lor 2 \lor 3 \lor 4 \lor 5 \lor 6)$  ,

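where $\lor$ is a logical `or`.  Because the six faces are mutually exclusive, this probability equals the sum of the individual probabilities. Thus,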

> $6p = 1 \Rightarrow p = \frac{1}{6}$  .

If the assumption $P(i) = p$ holds, then we say that **the die is fair**.   The idea of fairness comes from the observation that everyone has the same information.  If a die is loaded, then just some people know that this is the case and what kind of bias the die has, providing them with an unfair advantage in a game of chance involving the die.

This concept of fairness as everyone holding the same information is the reason why insider trading is illegal. Unless, of course, you are a member of Congress or a market maker, in which case it is just par for the course.



# Micro-states, macro-states, events, and sample spaces 

A possible outcome from a random process defines a **micro-state**. For example, a tossed coin has two possible micro-states: `Head` and `Tail`. Two tossed coins have four micro-states: `Head_Head`, `Head_Tail`, `Tail_Head`, and `Tail_Tail`.


The set of all possible outcomes of a random process is called the **Sample space**, and it is usually denoted by $S$. **Sample spaces can be discrete (and thus countable) or continuous (uncountable)**.

A subset of the sample space is called an **event**. Examples of events for two tossed coins are `2 heads`, `at least 1 head`, `equal faces`.  A set of micro-states with some particular characteristic can also be called a macro-state.

Typically, one wants to know the likelihood that a given event will occur. For example, 

> What is the probability of `tossing 2 heads`? 
>
> What is the probability that `your favorite team wins the Super Bowl`?
>
> What is the probability of [`nuclear war`](https://www.wired.com/story/micromorts-nuclear-war/)?


**Sometimes defining the sample space is easy (outcomes of tossing two coins), sometimes it is very, very hard (*known unknowns and unknown unknowns*)**.


## Sample spaces and counting techniques

Nowadays, calculating sample spaces is made dramatically easy by the availability of powerful computers. However, for some situations even a very powerful computer will be useless because sample spaces can grow so fast in size.

In order to determine the size of sample spaces or the size of certain events, one uses **counting techniques**.

The size of the sample space of rolling a die and tossing a coin can be calculated by **multiplying** the size of the individual sample spaces.


In [None]:
toss_coin = {'Heads', 'Tails'}
roll_6_die = {1, 2, 3, 4, 5, 6}
card_suits = {'Spades', 'Hearts', 'Diamonds', 'Clubs'}
card_ranks = {2, 3, 4, 5, 6, 7, 8, 9, 10, 'Jack', 'Queen', 'King', 'Ace'}

simple_sets = [toss_coin, roll_6_die]
events = get_product_sample_space( simple_sets )
print( f"\nThere are {len(set(events))} events obtained by combining the simple events "
       f"from the sets {simple_sets}.\n" )
print(events)
print()

events = sorted( list( itertools.product( *simple_sets ) ) )
print( f"\nThere are {len(events)} events obtained by combining the simple events "
       f"from the sets {simple_sets}.\n" )
print(events)
print()


In [None]:
simple_sets = [card_ranks, card_suits]
events = get_product_sample_space( simple_sets )
print( f"\nThere are {len(set(events))} events obtained by combining the simple events "
       f"from the sets {simple_sets}.\n" )
print(events)



<br><br><br>

<br><br><br>

Another type of sample space arises from **permutations**. Order matters for permutations, which are the different orderings of the $n$ items in a set.  The number of permutations of the $n$ elements in a set is 

> $P_n = n!$ 


If we are only interested in permutations of $r$ elements drawn from a set of $n$ elements, then we have

> $P_{r,n} = \frac{n!}{(n-r)!}$

In [None]:
simple_events = {'A', 'B', 'C', 'D'}
r = 2

events = sorted( list( itertools.permutations(simple_events, r = r) ) )

print( f"There are {len(set(events))} permutations of the set "
       f"{simple_events} with {r} elements.\n" )
print(events)


<br><br><br>

<br><br><br>


While order matters for permutations, there are situations in which we are also drawing from a single set **without replacement** but the order does not matter.  For example, consider a Powerball-like lottery game in which we are drawing 6 balls out of a possible 69. What is the number of **combinations** $-$ that is, possible outcomes?

The number of distinct subsets of $r$ elements that can be drawn from a possible set of $n$ is given by:

> $C_r^n = \frac{n!}{r!(n-r)!}$

You have already encountered this expression in the Pascal's triangle problem. 


In [None]:
simple_events = {'A', 'B', 'C', 'D', 'E'}
r = 2

events = sorted( list( itertools.combinations( simple_events, r = r ) ) )

print( f"There are {len(set(events))} combinations of length {r} of the set "
       f"{simple_events}.\n" )
print(events)

<br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>


In some situations, one may be interested in combinations drawing from a single set but **with replacement**. The number of **combinations** in this case is:


> $CR_r^n = C_{r}^{n+r-1}$



In [None]:
simple_events = {'A', 'B', 'C', 'D', 'E'}
r = 2

events = sorted( list( itertools.combinations_with_replacement( simple_events, r = r ) ) )

print( f"There are {len(set(events))} combinations with replacement of "
       f"length {r} of the set {simple_events}.\n" )
print(events)
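The counting formulas in this section can be checked against brute-force enumeration. The sketch below is one way to do so, assuming Python 3.8 or later for `math.perm` and `math.comb`; the dictionary layout is just an illustrative way of organizing the comparison.

```python
import itertools
import math

items = ['A', 'B', 'C', 'D', 'E']
n, r = len(items), 2

# Each entry pairs the size of the enumerated sample space with the closed-form count.
counts = {
    'permutations of all n':
        (len(list(itertools.permutations(items))), math.factorial(n)),
    'permutations of r out of n':
        (len(list(itertools.permutations(items, r))), math.perm(n, r)),
    'combinations':
        (len(list(itertools.combinations(items, r))), math.comb(n, r)),
    'combinations with replacement':
        (len(list(itertools.combinations_with_replacement(items, r))),
         math.comb(n + r - 1, r)),
}

for name, (enumerated, formula) in counts.items():
    print(f"{name:30s} enumerated = {enumerated:4d}   formula = {formula:4d}")

# Enumeration is impractical for the lottery example, but the formula is immediate.
print(f"Drawing 6 balls out of 69: {math.comb(69, 6):,} combinations")
```

For small sets the two counts agree, while the lottery example shows how quickly a sample space outgrows explicit enumeration.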

# Kolmogorov's axioms of probability

In the axiomatic view of probability, the **probability** of an event belonging to the sample space $S$ is a number that can be assigned to the event and that satisfies the following properties:

> **Axiom 1:** $P(S) = 1$
>
> **Axiom 2:** $0 \le P(E) \le 1$ for $E \subset S$ 
>
> **Axiom 3:** If $E_1 \cap E_2 = \emptyset$, then $P(E_1 \cup E_2) = P(E_1) + P(E_2)$

Let's see what these axioms mean in a concrete case $-$ yes, a die. With a die, we have 6 possible distinct outcomes: 

> $S = \{ 1, 2, 3, 4, 5, 6 \}$ ,

so the probability of getting any of those six outcomes must equal 1: $P(S) = 1$.  

Let us define some events to play with:

> $E_1 = \{ 1 , 2 \}$,  $E_2 = \{ 1 , 3 \}$, $E_3 = \{ 3 \}$

All these events are sub-sets of $S$, so the second axiom applies.  Axiom 2 tells us that the probability of each of those events should be greater than or equal to 0 and smaller than or equal to 1.  This clearly makes sense.  In fact, we can tell that each of these events has a probability greater than 0 and smaller than 1.

What about the third axiom? If we consider $P(E_1 \cup E_2)$, we can see that $E_1 \cap E_2 = \{ 1 \} \ne \emptyset$, so the third axiom does not apply.  On the other hand, for $P(E_1 \cup E_3)$, we can see that $E_1 \cap E_3 = \emptyset$, so 

> $P(E_1 \cup E_3) = P(E_1) + P(E_3)$.


From these axioms, it follows that 

> Probability of empty set: $P(\emptyset) = 0$
>
> Probability of sample space minus some event: $P(S-E) = 1 - P(E)$
>
> Probability of a sub-set: $E_1 \subset E_2 \Rightarrow P(E_1) \le P(E_2)$
>
> Probability of two independent events: $E_1$ is independent of $E_2 \Rightarrow P(E_1 \land E_2) = P(E_1)~P(E_2)$
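The die example can be checked directly. The sketch below assumes a fair die, so that the probability of an event is its size divided by 6; the helper `prob` is an illustrative name, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

S = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Probability of an event (a subset of S), assuming a fair die."""
    assert event <= S, "an event must be a subset of the sample space"
    return Fraction(len(event), len(S))

E1, E2, E3 = frozenset({1, 2}), frozenset({1, 3}), frozenset({3})

print("Axiom 1:", prob(S))
# E1 and E3 are disjoint, so the probabilities add (Axiom 3).
print("E1 or E3:", prob(E1 | E3), "vs sum", prob(E1) + prob(E3))
# E1 and E2 share the outcome 1, so simply adding would count it twice.
print("E1 or E2:", prob(E1 | E2), "vs sum", prob(E1) + prob(E2))
# Consequences of the axioms: empty set and complement.
print("Empty set:", prob(frozenset()))
print("Complement of E1:", prob(S - E1), "vs", 1 - prob(E1))
```

For the overlapping events, the sum overshoots by exactly $P(E_1 \cap E_2) = 1/6$.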



# Connection to statistical physics and kinetic theory

The concepts used so far are crucial to *statistical physics*, which provided a mechanistic understanding of *Thermodynamics*. 

Consider a **cubic box** containing $n$ particles. Particles are moving around and colliding with the walls and one another. 

The equi-probability assumption lets us conclude that every particle has an equal probability of being in any section of the box. Further, assuming we are dealing with point particles without attractive interactions, it follows that the position of a particle is independent from the position of the other particles.

In order to get some intuition for the implications of these simple assumptions, let us partition the box into 8 smaller cubes and label each cube with a number between 1 and 8. **What is the sample space for this problem when $n = 2$?**


What about for $n = 10$?

How many events correspond to all the particles being in box 1?

In the frequentist interpretation of probability and in statistical physics, **the probability of a macro-state (event) is the ratio of the number of micro-states that satisfy the event condition to the total number of micro-states in the sample space**.

So, what is the probability of finding all particles in box 1?

What about the probability of finding $k = 1, \dots, n$ particles inside box 1? 
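One way to explore these questions is brute-force enumeration for small $n$. The sketch below assumes the particles are distinguishable, so a micro-state is a tuple giving the sub-box of each particle; the variable names are illustrative. The closed-form count used for comparison, $\binom{n}{k}\,7^{n-k}$, follows from choosing which $k$ particles sit in box 1 and letting each remaining particle occupy any of the other 7 boxes.

```python
import itertools
from math import comb

n_boxes = 8       # the box is partitioned into 8 smaller cubes
n_particles = 2   # keep this small: there are n_boxes ** n_particles micro-states

# Each micro-state lists the sub-box occupied by each (distinguishable) particle.
micro_states = list(itertools.product(range(1, n_boxes + 1), repeat=n_particles))
print(f"{len(micro_states)} micro-states for n = {n_particles}")

# Macro-state: exactly k particles are found in sub-box 1.
for k in range(n_particles + 1):
    n_favorable = sum(1 for state in micro_states if state.count(1) == k)
    n_formula = comb(n_particles, k) * (n_boxes - 1) ** (n_particles - k)
    probability = n_favorable / len(micro_states)
    print(f"k = {k}: {n_favorable} micro-states (formula: {n_formula}), "
          f"probability = {probability:.4f}")
```

Changing `n_particles` to 10 already produces $8^{10}$, more than a billion, micro-states, which is where the counting formula becomes necessary.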

# Functions of random variables

Imagine that you are an insurance company and you are trying to decide how much to charge Florida homeowners for insuring their houses. In order to estimate your costs related to covering claims, you need to have information about the costs associated with a particular type of storm and the frequency of each type of storm.

Let $P(S)$ be the probability of a storm with a strength $S$ occurring within a one-week period during the summer or fall (here $S$ denotes storm strength, not the sample space). Assume also that the amount of damages to homes is some function of $S$

> $C(S)$  .

What is the distribution of payouts?

Let's assume that $S$ is a continuous variable with probability density $p(S)$, and let $q(C)$ be the probability density of the cost. Then 

> $1 = \int_0^{S_m} ~p(S)~dS$

and

> $1 = \int_0^{C_m} ~q(C)~dC$

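Provided that $C(S)$ is monotonically increasing, so that each cost corresponds to exactly one storm strength, it then follows that 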

> $q(C)~dC = p(S)~dS $

and 

> $q(C) = p(S)~\frac{dS}{dC}$.

For $p(S) = e^{-S}$ and $C(S) = S^2$, we find

> $S = \sqrt{C}$ ,
>
> $q(C) = e^{-\sqrt{C}} ~\frac{d \sqrt{C}}{dC}$
>
> $~~~~~~~= \frac{1}{2 ~\sqrt{C}} ~e^{-\sqrt{C}} ~ $.
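The result can be checked by sampling. The sketch below draws storm strengths from $p(S) = e^{-S}$, maps each to a cost $C = S^2$, and compares the sampled cumulative probabilities with the integral of $q(C)$, which is $1 - e^{-\sqrt{C}}$. The sample size, seed, and test points are arbitrary choices made for illustration.

```python
import math
import random

rng = random.Random(0)
n_samples = 100_000

# Draw storm strengths with density p(S) = exp(-S), then map each to a cost C = S**2.
costs = [rng.expovariate(1.0) ** 2 for _ in range(n_samples)]

def cost_cdf(c):
    """Integral of q(C') = exp(-sqrt(C')) / (2 sqrt(C')) from 0 to c."""
    return 1 - math.exp(-math.sqrt(c))

for c in (0.25, 1.0, 4.0, 16.0):
    sampled = sum(1 for x in costs if x <= c) / n_samples
    print(f"P(C <= {c:5.2f}): sampled = {sampled:.3f}, analytic = {cost_cdf(c):.3f}")
```

The sampled and analytic values should agree to within sampling noise, which is of order $1/\sqrt{N}$ for $N$ samples.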






In [None]:
s = np.arange(0.1, 10., 0.1)
c = np.arange(0.1, 100., 0.1)

fig = plt.figure( figsize = (12, 5) )
ax = []

ax.append( fig.add_subplot(121))
half_frame(ax[-1], 'Strength, S', 'Probability density', font_size = my_fontsize)
ax[-1].semilogy(s, np.exp(-s), 'b-', lw = 2)

ax[-1].set_xlim(0, 10)
ax[-1].set_ylim(0.00001, 1)

ax.append( fig.add_subplot(122))
half_frame(ax[-1], 'Cost, C', 'Probability density', font_size = my_fontsize)
ax[-1].semilogy(c, 1/2*np.sqrt(1/c)*np.exp(-np.sqrt(c)), 'r-', lw = 2)

ax[-1].set_xlim(0, 100)
ax[-1].set_ylim(0.00001, 1)

plt.tight_layout()


# Exercises


So, why was the gambler correct? Why is it smart to bet on rolling at least one six in four throws of a die, but not on rolling at least one double-six in twenty-four throws of two dice?