# Synopsis


This notebook provides a brief overview of the interpretations of probability. I also discuss conditional probability and Bayes' theorem.

# Read libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from copy import copy, deepcopy
from pathlib import Path
from sys import path

path.append( str(Path.cwd().parent) )

In [None]:
import itertools
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import scipy.stats as stats

from numpy import linspace
from scipy import integrate

from Amaral_libraries.my_stats import half_frame

In [None]:
my_fontsize = 15

# Conditional probability

Sometimes, probabilities need to be re-evaluated as new information becomes available.  The probability of `SOME CURRENT EVENT` may now be quite different from what it was a week ago.  The necessity to handle these situations gives rise to the concept of conditional probability.  Consider two events $A$ and $B$,

> $P(B ~|~ A)$

is the probability of $B$ conditional on $A$ being true. The conditional probability obeys the relationship

> $P(B ~| ~A) = \frac{P(A ~\cap ~B)}{P(A)}$

if $P(A) > 0$.

From that it follows that

> $P(A ~\cap~ B) = P(B ~|~ A)~ P(A) = P(A ~|~ B)~ P(B)$


The definition of conditional probability provides a way to determine whether two events are **independent**. If $P(B ~|~ A) = P(B)$, then $B$ is independent of $A$ (and vice-versa).
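
As an illustration of the definitions above, here is a minimal sketch that enumerates the 36 outcomes of rolling two fair dice and checks independence exactly. The dice example and the helper names (`prob`, `cond_prob`, and the three event functions) are assumptions introduced for this illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a predicate on outcomes) under equal weights."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def cond_prob(event_b, event_a):
    """P(B | A) = P(A and B) / P(A), defined only when P(A) > 0."""
    p_a = prob(event_a)
    if p_a == 0:
        raise ValueError("P(A) must be positive")
    return prob(lambda o: event_a(o) and event_b(o)) / p_a

def first_is_six(o):      # event A
    return o[0] == 6

def sum_is_seven(o):      # event B: turns out to be independent of A
    return sum(o) == 7

def sum_is_twelve(o):     # event C: depends on A
    return sum(o) == 12

print(prob(sum_is_seven), cond_prob(sum_is_seven, first_is_six))    # 1/6 1/6
print(prob(sum_is_twelve), cond_prob(sum_is_twelve, first_is_six))  # 1/36 1/6
```

Knowing that the first die shows a six leaves the probability of a total of seven unchanged, so those two events are independent, whereas it raises the probability of a total of twelve from 1/36 to 1/6.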


Let's consider an example that baffles a lot of people.  Imagine that you are a contestant in a game show.  Behind one of three closed doors is a cash prize; behind the others is nothing. You are given the choice of selecting a door to open. After you have selected a door, the host shows you what is behind one of the doors you did not choose, and asks you whether you want to change your choice.  What should you do?

To many people, it seems that there is no correct answer to this question, because there appears to be a 50% chance that the cash prize is behind each remaining closed door. However, there is in fact a correct answer.  The host and I both assume that you want to select the door hiding the cash prize rather than one with nothing, and that the host will never choose to open the door hiding the cash prize. Let us also assume that the game is fair, that is, the prize has *a priori* an equal chance of being behind each door.

So, let's analyze the problem: Before anything else happens, you have a 1/3 probability of selecting the correct door and a 2/3 probability of selecting the wrong door.

Once the host opens a door, presumably one **without** the prize behind it, then things become perfectly clear:

> if before you were correct in your original choice, then you should stick with your original choice, 
>
> if before you were wrong in your original choice, then you should change your choice.  

**How probable was it that you were correct in your original choice? How probable was it that you were wrong in your original choice?**
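
One way to check your answer to these questions is by simulation. The sketch below assumes that the host always opens an empty door other than the one you picked, choosing at random when two such doors are available. The names `play` and `win_rate` are illustrative.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    choice = rng.choice(doors)
    # Assumption: the host opens a door that is neither the contestant's
    # choice nor the prize door, picking at random if two doors qualify.
    opened = rng.choice([d for d in doors if d != choice and d != prize])
    if switch:
        # Move to the only door that is still closed and not yet chosen.
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize

def win_rate(switch, n_rounds=100_000, seed=0):
    """Fraction of rounds won when always switching or always staying."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_rounds)) / n_rounds

stay = win_rate(switch=False)
change = win_rate(switch=True)
print(f"stay: {stay:.3f}   switch: {change:.3f}")
```

The two win rates should come out close to the probabilities of having been right and wrong, respectively, in your original choice.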


# Interpretations of probability (see [Wikipedia article](https://en.wikipedia.org/wiki/Probability_interpretations)) 

Mathematically, probability is not controversial and rests on a solid foundation. The interpretation of probability is another matter altogether.  Currently, the greatest controversy is between so-called **frequentists** and **Bayesians**.

The Wikipedia article mentioned above summarizes the different interpretations:

<img src = "Images/interpretations.png" width = 100%>

In the **classical** interpretation, one assumes that basic events are equi-probable $-$ the principle of indifference.  The conceptual basis for this is a hypothesized symmetry of all possible outcomes. For example, when considering the process of rolling a die, the assumption is that all 6 outcomes are equally probable, so the probability of each outcome is simply 1/6. In general, if there are $N$ outcomes, then the probability of each outcome is

> $p = \frac{1}{N}$.

A major advantage of this approach is that it **does not require past observations**.

Clearly, this interpretation fails when there are infinitely many outcomes. Another major problem is that if the outcomes are not equi-probable, then their probabilities need to be determined somehow.  The classical approach is thus well suited to fair coins and dice, but not to unfair ones. It is also not appropriate for situations such as determining **the probability that tomorrow's maximum temperature will be greater than 40$^\circ$F**.

In the **frequentist** interpretation, one uses past observations to determine the probability of an event $-$ the probability of an event is determined by how frequently it occurs.  That is, probability values are obtained empirically.  It follows that the empirical basis of the frequentist approach meshes well with the scientific method and the search for 'truth'.

Frequentists determine that the probability of getting `heads` in a coin toss is 1/2, not because there are two equally likely outcomes but because repeated series of large numbers of trials demonstrate that the empirical frequency converges to the limit 1/2 as the number of trials goes to infinity.  This results in the frequentist approach's two major weaknesses. First, it requires lots of past observations in order to estimate probabilities. Even then, any two finite series of observations, even very long ones, will yield somewhat different estimates of the probability. The second weakness derives from the first. In order to determine that the probability of tossing `heads` is 1/2, one has to make use of the theory of errors, which makes statements about the probability of observing a given error. This results in a circular process: the concept of probability needs the concept of frequency but the concept of frequency needs the concept of probability.
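
The first weakness can be seen in a small simulation. The sketch below tracks the running frequency of `heads` for two independent series of simulated tosses of a fair coin. The function name `running_frequency` and the seeds are arbitrary choices made for illustration.

```python
import random

def running_frequency(n_tosses, p_heads=0.5, seed=None):
    """Return the empirical frequency of heads after each of n_tosses tosses."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for n in range(1, n_tosses + 1):
        heads += rng.random() < p_heads   # True counts as 1
        freqs.append(heads / n)
    return freqs

checkpoints = (10, 100, 1_000, 10_000, 100_000)
for seed in (1, 2):
    freqs = running_frequency(checkpoints[-1], seed=seed)
    print(f"series {seed}:", [round(freqs[n - 1], 4) for n in checkpoints])
```

Both series drift toward 1/2, but at any finite number of tosses they generally report slightly different estimates.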


In the **Bayesian** interpretation $-$ which is also denoted **subjective** or **epistemic** $-$ probability is a measure of the **degree of belief** of an individual or organization when assessing an uncertain situation. As such, two individuals may view the occurrence of the same event as having different probabilities.  If I am playing a dice game with you, I may believe the dice to be fair while you may believe that they are loaded. 

More generally, the world around us shows us many examples of individuals holding different beliefs.  Stock traders make buying/selling decisions based on their beliefs about where a stock's price is headed. Pundits argue for different outcomes in an election. Different models provide different estimates for what the global temperature will be in 2050.

While this makes it sound as if all beliefs $-$ opinions $-$ are of equal value, this is not correct.  As we will see below, Bayesians use Bayes' theorem to define a rigorous process for using data to update one's beliefs. The challenge comes from the fact that this process relies on the definition of a prior probability.  For example, when considering the tossing of a coin, **in the absence of any data**, a Bayesian could use the principle of indifference to select the prior probability of tossing `heads` (see analysis below).

In [None]:
print(Fore.RED, Style.BRIGHT)
print('Two acceptable priors for the probability of tossing heads:\n', 
      Style.RESET_ALL)

x = linspace(0, 1, num = 100)

fig = plt.figure( figsize = (6, 4))
ax = fig.add_subplot(111)

half_frame(ax, 'p', 'Prior probability', font_size= my_fontsize)

ax.plot(x, stats.beta(5, 5).pdf(x))
ax.plot(x, stats.beta(2, 2).pdf(x))
ax.set_xlim(0, 1)

plt.show()

This actually gives rise to one of the weaknesses of the Bayesian interpretation.  The issue is that for a given problem, multiple thought experiments could apply, and choosing one is a matter of judgement: different people may assign different prior probabilities. This issue is known as the **reference class problem**.  

# Bayes' Theorem

The concept of conditional probability connects belief (given by probability) with information.  This actually has enormous consequences.  Since one can never observe an infinite number of events, one cannot in most situations truly determine $P(E)$.  One can nonetheless build hypotheses for what $P(E)$ is $-$ a so-called **prior**.  **Conditional probabilities enable us to update our priors as new information becomes available!**

This is expressed by Bayes' theorem, which appears to simply rewrite an equation above but does so much more:


> $P(B | A) = \frac{P(A | B)~ P(B)}{P(A)}$

if $P(A) > 0$.


Let's now make use of Bayes' theorem to understand how new data enables us to update our prior probability for $p_{\rm heads}$. Let's denote the new data by ${\bf X} = \{X_1, \ldots, X_m\}$ and the current estimate of our confidence in the value of $p_{\rm heads}$ by $P_{\theta}(p_{\rm heads})$. Then, it follows that the **posterior probability** $P'_{\theta}(p_{\rm heads})$ is given by:

> $P'_{\theta}(p_{\rm heads}) = P(p_{\rm heads} ~|~ {\bf X}) = \frac{P({\bf X}~|~ p_{\rm heads})}{P({\bf X})}~P_{\theta}(p_{\rm heads})$ .

Things get computationally complex because:

> $P({\bf X}) = \int_0^1 ~P({\bf X}~|~ p_{\rm heads})~P_{\theta}(p_{\rm heads}) ~dp_{\rm heads} $.
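
For a single toss the integral is easy to check numerically. The sketch below is a minimal illustration using only the standard library: it assumes a Beta(5, 5) prior sampled on $1 + 2^7$ equally spaced points and uses a plain trapezoid rule. The helper names `beta_pdf` and `trapezoid` are introduced here for illustration. Because the likelihood of `heads` is just $p_{\rm heads}$, the result should equal the prior mean, 1/2.

```python
from math import gamma

def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p, for a, b >= 1."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * p ** (a - 1) * (1 - p) ** (b - 1)

def trapezoid(values, dx):
    """Trapezoid rule for equally spaced samples."""
    return dx * (sum(values) - 0.5 * (values[0] + values[-1]))

n_points = 1 + 2 ** 7
dx = 1 / (n_points - 1)            # n points span n - 1 intervals on [0, 1]
grid = [i * dx for i in range(n_points)]
prior = [beta_pdf(p, 5, 5) for p in grid]

# P(X = heads) and P(X = tails) for one toss, integrating over p_heads
evidence_heads = trapezoid([p * q for p, q in zip(grid, prior)], dx)
evidence_tails = trapezoid([(1 - p) * q for p, q in zip(grid, prior)], dx)
print(f"P(heads) = {evidence_heads:.6f}, P(tails) = {evidence_tails:.6f}")
```

Note that the grid spacing is $1/(n-1)$ for $n$ grid points, since the end points 0 and 1 are both included.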

To demonstrate the process, let us first define our initial prior and generate a set of observations by simulating the tossing of a coin on the computer.


## Generate observations

In [None]:
n_observations = 1000
X = {}

In [None]:
# Generate fair observations
#
X['fair'] = stats.randint.rvs(0, 2, size = n_observations)
print(f"The observed probability of tossing heads is {sum(X['fair'])/len(X['fair']):.3f}")

In [None]:
# Generate quite biased observations
#
bias = 0.6
X['biased'] = [1 if y < bias else 0 for y in stats.uniform.rvs(size = n_observations)]
print(f"The observed probability of tossing heads is {sum(X['biased'])/len(X['biased']):.3f}")

In [None]:
# Generate very biased observations
#
bias = 0.8
X['very biased'] = [1 if y < bias else 0 for y in stats.uniform.rvs(size = n_observations)]
print(f"The observed probability of tossing heads is "
      f"{sum(X['very biased'])/len(X['very biased']):.3f}")

## Set prior

In [None]:
p_heads = linspace(0, 1, num = 1+2**7)
initial_prior = {}

In [None]:
# Define prior 1
#
alpha, beta = 5, 5
initial_prior['beta55'] = stats.beta(alpha, beta).pdf(p_heads)

In [None]:
# Define prior 2
#
initial_prior['uniform01'] = stats.uniform(0, 1).pdf(p_heads)

In [None]:
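# Define (unreasonable) prior 3: uniform on [0.4, 0.6] (loc = 0.4, scale = 0.2),
# which assigns zero probability to values of p_heads outside that interval
#
initial_prior['uniform.4.2'] = stats.uniform(0.4, 0.2).pdf(p_heads)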

## Calculate the probability of observing the data given the prior

In [None]:
def calculate_prob_data(prior, data, p_heads):
    """
    Calculates the probability of observing the data given the prior. 
    Integrates the prior distribution multiplied by the probability of 
    observing the data given that parameter value, using the romb algorithm,
    which requires 1+2^k equally spaced parameter values and the value 
    of the prior at those values.
    
    inputs:
        prior -- list of float, values of the prior at different values of parameter
        data -- int, value of a single observation (1 for heads, 0 for tails)
        p_heads -- list of floats, different values of parameter
        
    outputs:
        float, probability of observing the data given the prior
    """
    integrand = [p*p_theta if data == 1  else (1-p)*p_theta 
                 for p, p_theta in zip(p_heads, prior)]
    
    # Grid spacing: n equally spaced points span n-1 intervals
    dx = p_heads[1] - p_heads[0]
    return integrate.romb(integrand, dx = dx)
        


**Experiment with the different priors and levels of bias in the data.**

In [None]:
fig = plt.figure( figsize = (8, 6))
ax = fig.add_subplot(111)

prior = copy(initial_prior['beta55'])
for i, data in enumerate( X['fair'] ):
    prob_data = calculate_prob_data(prior, data, p_heads)
    
    posterior = []
    for p, p_theta in zip(p_heads, prior):
        
        if data == 1:
            posterior.append(p * p_theta / prob_data)
        else:
            posterior.append((1-p) * p_theta / prob_data)
    
    if i%50 == 0:
        ax.plot(p_heads, posterior, 'r', alpha = 0.4)
    prior = copy(posterior)

# Plot initial and final form of P(p_heads)
#
half_frame(ax, '$p_{heads}$', 'Posterior probability')
ax.plot(p_heads, initial_prior['beta55'], lw = 4)    
ax.semilogy(p_heads, posterior, 'r', lw = 4)
# ax.plot(p_heads, posterior, 'r', lw = 4)

ax.vlines([0.2, 0.4, 0.6, 0.8], 0, max(posterior)*10, color = '0.6')

ax.set_xlim(0, 1)
ax.set_ylim(10**(-10), max(posterior)*10)

# Check integral of final posterior (same grid spacing as in calculate_prob_data)

final = integrate.romb(posterior, dx = p_heads[1] - p_heads[0])
print(Fore.RED, Style.BRIGHT)
print(f"The integral of the final posterior estimate equals {final:.6f}\n", 
      Style.RESET_ALL)


plt.show()
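
For a Beta prior, the result of the numerical update can be checked against a closed form: a Beta($\alpha$, $\beta$) prior combined with $h$ heads and $t$ tails gives a Beta($\alpha + h$, $\beta + t$) posterior, whose mean is $(\alpha + h)/(\alpha + \beta + h + t)$. The sketch below is a standalone illustration and does not use the notebook's code. It repeats the grid update with a trapezoid rule and standard-library tools only. The 200 simulated tosses, the bias of 0.6, and the helper names are assumptions made for this example.

```python
import random
from math import gamma

def trapezoid(values, dx):
    """Trapezoid rule for equally spaced samples."""
    return dx * (sum(values) - 0.5 * (values[0] + values[-1]))

def grid_update(prior, grid, toss):
    """One Bayesian update of a gridded density for a toss (1 = heads, 0 = tails)."""
    dx = grid[1] - grid[0]
    unnormalized = [(p if toss == 1 else 1 - p) * q for p, q in zip(grid, prior)]
    evidence = trapezoid(unnormalized, dx)
    return [u / evidence for u in unnormalized]

n_points = 1 + 2 ** 7
dx = 1 / (n_points - 1)
grid = [i * dx for i in range(n_points)]

# Beta(5, 5) prior evaluated on the grid
alpha, beta = 5, 5
norm = gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
density = [norm * p ** (alpha - 1) * (1 - p) ** (beta - 1) for p in grid]

# Simulated tosses of a coin with an assumed bias of 0.6
rng = random.Random(42)
tosses = [1 if rng.random() < 0.6 else 0 for _ in range(200)]
for toss in tosses:
    density = grid_update(density, grid, toss)

heads = sum(tosses)
tails = len(tosses) - heads
grid_mean = trapezoid([p * q for p, q in zip(grid, density)], dx)
exact_mean = (alpha + heads) / (alpha + beta + len(tosses))
print(f"grid posterior mean:        {grid_mean:.4f}")
print(f"Beta({alpha + heads}, {beta + tails}) posterior mean: {exact_mean:.4f}")
```

The two means should agree closely, which suggests that updating one observation at a time reaches the same posterior as processing all the data at once.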


# Practical impact of Bayes' theorem

One of the most important applications of Bayes' theorem is in determining whether mass testing for a medical condition is appropriate or not. Consider the following situation involving a low-prevalence infectious disease such as HIV. Assume that the prevalence in the population is 0.1%,

> $P(D) = 0.001$ .  

It follows that the probability of not having the disease is

> $P( \not D) = 0.999$ .

Consider a nearly perfect test that **correctly diagnoses the disease** 99% of the time. Thus, the probability of getting a `+` result in the test of an individual who has the disease is:

> $P(+~|~D) = 0.99$ . 

Assume also that the test **correctly diagnoses absence of the disease** 95% of the time. Thus, the probability of getting a `-` result in the test of an individual who does not have the disease is:

> $P(- ~|~ \not D) = 0.95$,

and it follows that 

> $P(+~|~ \not D) = 0.05$.


**Imagine you take the test and get a positive result. In the absence of any other information, what is the probability that you *do* have the disease?**

> $P(D~ |~ +) = $ ?

Let's 'unpack' this probability

> $P(D~ |~ +) = \frac{P(+ ~|~ D)~ P(D)}{P(+)}$ 

> $~~~~~~~~~~~~~~~~  = \frac{0.99 \times 0.001}{P(+)}$

> $~~~~~~~~~~~~~~~~  = \frac{0.00099}{P(+)}$

We still need to determine $P(+)$.

> $P(+) = P(+ ~|~ D)~ P(D) + P(+~|~ \not D )~ P(\not D)$

> $~~~~~~~~  = (0.99 \times 0.001) + (0.05 \times 0.999) = 0.05094$

and it follows that 

> $P(D~ |~ +) = \frac{0.00099}{0.05094} \approx 0.0194$

**Less than 2%!**

So, you should get another test.  

What does it mean if you get a positive result again?

Why would that change things?
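
One way to explore these questions is to apply Bayes' theorem repeatedly, using the posterior after each positive result as the prior for the next test. The sketch below rests on an assumption not stated above, namely that repeated test results are conditionally independent given the true disease status. The function name `posterior_given_positive` is illustrative.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(D | +) from P(D), P(+ | D), and P(+ | not D) via Bayes' theorem."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p_disease = 0.001          # P(D), the prevalence
sensitivity = 0.99         # P(+ | D)
false_positive_rate = 0.05 # P(+ | not D) = 1 - 0.95

# Assumption: successive tests are conditionally independent given disease
# status, so each posterior can serve as the prior for the next test.
history = []
p = p_disease
for n_positive in range(1, 4):
    p = posterior_given_positive(p, sensitivity, false_positive_rate)
    history.append(p)
    print(f"after {n_positive} positive result(s): P(D | data) = {p:.4f}")
```

Under that assumption, the second positive result starts from a prior of about 2% rather than 0.1%, which is why it carries so much more weight. If false positives were instead caused by something persistent in the individual, a repeat of the same test would be much less informative.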

# Exercises
