In [47]:
# Frequentist versus Bayesian Statistics

# To use this tutorial, read the comments and execute the code section by section

# The learning objective is to gain insights into thinking about inference
# from a "Frequentist" versus a "Bayesian" perspective. In brief, because a
# Frequentist does not consider the probability of an event or state of the
# world or hypothesis, only their frequency of occurrence, it is not possible 
# to ask questions of the form "what is the probability that hypothesis x is
# true?" Instead, one can only consider questions of the form, "what is the
# probability that I would have obtained my data, given that hypothesis x
# is true?" In contrast, Bayesians consider the probabilities of such
# things (often called the strength of belief), but doing so can require
# making assumptions that can be difficult to prove.

# For NGG students, this tutorial is meant to be used in tandem with this
# Discussion on the NGG Canvas site: https://canvas.upenn.edu/courses/1358934/discussion_topics/5440322

# Copyright 2019 by Joshua I. Gold, University of Pennsylvania
# Originally written for Matlab, translated to Python on 1.29.20 by CMH

# *****************************************************************************************

import numpy as np 

# Let's start with a simple example, taken from:
# https://en.wikipedia.org/wiki/Base_rate_fallacy#Example_1:_HIV

# "Imagine running an HIV test on" A SAMPLE "of 1000 persons ..."
# (note that the code uses a larger sample than the quoted example: 10000 persons)
N = 10000 # size of the SAMPLE

# "The test has a false positive rate of 5% (0.05)..."
# i.e., the probability that someone who takes the test gets a POSITIVE 
# result despite the fact that the person does NOT have HIV
falsePositiveRate = 0.05

# "...and no false negative rate."
# i.e., the probability that someone who takes the test gets a NEGATIVE 
# result despite the fact that the person DOES have HIV
falseNegativeRate = 0

In [48]:
# Question #1: If someone gets a positive test, is it "statistically significant"?

# Answer: Statistical significance from the Frequentist perspective is 
# typically measured by comparing p to a threshold value; e.g., p<0.05.
# In this case, "p" is shorthand for "the probability of obtaining the 
# data under the Null Hypothesis", so we are checking for:
#     p(data | Null Hypothesis) < 0.05. 
# Here we take the Null Hypothesis as "not infected", and the data are 
# just the single positive test. 

# Therefore, the relevant p-value is simply the false-positive rate: 
# p=0.05, which is typically considered "not significant." However, 
# you can also see that it is not particularly informative.
p = falsePositiveRate
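
One way to see what this p-value means is to simulate it. The sketch below is an illustration, not part of the original tutorial: it tests a large group of people who are all uninfected (the Null Hypothesis) and counts how often the test comes back positive. The names `simulate_null_positive_fraction`, `n_people`, and `seed` are hypothetical and chosen only for this example.

```python
import numpy as np

def simulate_null_positive_fraction(false_positive_rate, n_people=100_000, seed=0):
    """Fraction of positive tests in a group where nobody is infected."""
    rng = np.random.default_rng(seed)
    # Under the Null Hypothesis nobody is infected, so every positive
    # result is a false positive.
    is_positive = rng.random(n_people) < false_positive_rate
    return is_positive.mean()

null_fraction = simulate_null_positive_fraction(0.05)
print(f"Simulated p(positive test | not infected) = {null_fraction:.3f}")
```

The simulated fraction should land close to the false-positive rate, which is all that the Frequentist p-value describes here: how often the data would occur if the Null Hypothesis were true.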

In [49]:
# Question #2: If someone gets a positive test, what is the probability
# that the person is infected?

# Answer: Here we are asking for a different probability:
#  p(infected | positive test) = p(hypothesis | data) = the "posterior
#  probability" of the hypothesis, given the data.

# Let's work our way backwards to figure out what information we need to 
#  solve this problem.

# We can compute the probability that someone with a positive test is 
# infected from a particular population as:
# pIsInfectedGivenIsPositiveTest = np.sum(isInfected & isPositive)/np.sum(isPositive)

# (I am keeping this commented out right now because the variables on
# the right-hand side of the equation are not defined yet, so Python would
# raise a NameError if we tried to evaluate that expression.)

# It should be obvious that to compute this quantity, we need to know the
# number of people in the population who are actually infected (i.e., we
# need to compute the number of people corresponding to isInfected ==
# true), in addition to knowing the number of people who had a positive test.

# So let's start by defining how many in the population are actually
# infected. We'll start by assuming that the *real* rate of infection is
# 0.5 (i.e., half the POPULATION is infected), and then do a quick 
# simulation to find out how many in our SAMPLE of N people are infected.
# We can do this simulation by getting N picks from a 
# binomial distribution, where each pick determines "isInfected" for a
# single person according to the assumed rate of infection:
isInfected = np.random.binomial(1,0.5,N)

# Now we can count the number infected
nInfected = np.sum(isInfected)

# Now we need to count the number of people who got a positive test in this
# example. There is no false negative rate, which implies that everyone
# who is infected got a positive test:
isPositive = list(isInfected)

# But there is a non-zero false-positive rate, which implies that some of 
# the people who are not infected will also have a positive test. We
# can use np.random.binomial again to generate random picks from a binomial 
# distribution according to the false-positive rate:
notInfectedIndices = np.where(isInfected == 0)[0]
# nNotInfected = len((isInfected[notInfectedIndices]))
for x in notInfectedIndices:
    isPositive[x] = np.random.binomial(1, falsePositiveRate)

# Now we can compute the probability that someone with a positive test is infected:
# count the people who are both infected and positive, then divide by all positives
nInfectedAndPositive = np.sum(np.logical_and(isInfected, isPositive))
pIsInfectedGivenIsPositiveTest = nInfectedAndPositive/np.sum(isPositive)
print(pIsInfectedGivenIsPositiveTest)

0.9522915101427498
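
As a cross-check on the simulation above, the same steps can be sketched without the per-person loop and compared against the value expected from the rates alone. This is a minimal sketch under stated assumptions, not the tutorial's own code: the fixed seed, the snake_case names, and the use of `np.random.default_rng` are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed (an assumption) so reruns match

n_people = 10_000
prevalence = 0.5              # assumed POPULATION infection rate
false_positive_rate = 0.05
false_negative_rate = 0.0

# Who is infected in the SAMPLE
is_infected = rng.random(n_people) < prevalence

# Infected people test positive unless the test misses them; uninfected
# people test positive at the false-positive rate.
u = rng.random(n_people)
is_positive = np.where(is_infected,
                       u >= false_negative_rate,
                       u < false_positive_rate)

# Proportion of positive tests that belong to infected people
p_simulated = np.sum(is_infected & is_positive) / np.sum(is_positive)

# Value expected from the rates alone: true positives / all positives
true_positive_mass = prevalence * (1 - false_negative_rate)
false_positive_mass = (1 - prevalence) * false_positive_rate
p_expected = true_positive_mass / (true_positive_mass + false_positive_mass)

print(f"simulated={p_simulated:.3f}, expected={p_expected:.3f}")
```

With half the population infected, the expected value is 0.5 / (0.5 + 0.025), or about 0.952, so the simulated proportion should fall close to that.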


In [53]:
# Let's do the same thing, but this time we will try different values for 
# the proportion of the population that is actually infected. What you
# should notice is that the PROPORTION INFECTED GIVEN A POSITIVE TEST
# depends (a lot!) on the OVERALL RATE OF INFECTION. Put another way, to
# determine the probability of a hypothesis, given your data 
# (e.g., proportion infected given a positive test), you have to know the
# probability that the hypothesis was true without any data.

# Why is this the case? It is a simple consequence of the definition of 
# a conditional probability, formulated as Bayes' Rule. Specifically,
# the joint probability of two events, call them A and B, is defined as:
# p(A and B) = p(B) * p(A | B)
# p(B and A) = p(A) * p(B | A)
# Because p(A and B) = p(B and A), we have p(B) * p(A | B) = p(A) * p(B | A).

# Now, calling A the Hypothesis and B the Data, then rearranging, we
# get:
# p(Hypothesis | Data) = p(Data | Hypothesis) * p(Hypothesis)
#                         ------------------------------------
#                                        p(Data)

# So you cannot calculate the probability of the hypothesis, given the
# data (i.e., the Bayesian posterior), without knowing the probability 
# of the hypothesis independent of any data (i.e., the prior)
# np.linspace guarantees exactly 11 values from 0 to 1 inclusive
infectedProportions = np.linspace(0, 1, 11).tolist()

for ii in np.arange(0, len(infectedProportions)):
   
    # Simulate # infections in the SAMPLE, given the POPULATION rate
    isInfected = np.random.binomial(1, infectedProportions[ii], N)
   
    # Count the number infected
    nInfected = np.sum(isInfected)
   
    # Make array of positive tests, given that falseNegativeRate=0 ...
    isPositive = list(isInfected)
   
    # And falsePositiveRate > 0
    notInfectedIndices = np.where(isInfected == 0)[0]
    # nNotInfected = len((isInfected[notInfectedIndices]))
    for x in notInfectedIndices:
        isPositive[x] = np.random.binomial(1, falsePositiveRate)
   
    # The probability that someone with a positive test is infected:
    # people who are both infected and positive, divided by all positives
    nInfectedAndPositive = np.sum(np.logical_and(isInfected, isPositive))
    pIsInfectedGivenIsPositiveTest = nInfectedAndPositive/np.sum(isPositive)
   
    # We can compute the Bayesian Posterior as:
    # p(hypothesis | data) = (p(data | hypothesis) * p(hypothesis)) / p(data)
    # Note that the prior, p(hypothesis), is the true rate from the full
    # POPULATION, whereas p(data) is estimated from the SAMPLE, so these
    # predictions will differ slightly from the proportion computed
    # above (pIsInfectedGivenIsPositiveTest), which uses the SAMPLE only
    pDataGivenHypothesis = 1 - falseNegativeRate
    pHypothesis = infectedProportions[ii]
    pData = np.sum(isPositive)/len(isPositive)
    pHypothesisGivenData = (pDataGivenHypothesis * pHypothesis) / pData

    # Print the simulated proportion alongside the theoretical posterior: 
    print('Infection rate={0:.3f}, proportion infected given a positive test={1:.3f}, Posterior={2:.3f}'
          .format(infectedProportions[ii], pIsInfectedGivenIsPositiveTest, pHypothesisGivenData))

Infection rate=0.000, proportion infected given a positive test=0.000, Posterior=0.000
Infection rate=0.100, proportion infected given a positive test=0.680, Posterior=0.677
Infection rate=0.200, proportion infected given a positive test=0.832, Posterior=0.846
Infection rate=0.300, proportion infected given a positive test=0.897, Posterior=0.917
Infection rate=0.400, proportion infected given a positive test=0.925, Posterior=0.948
Infection rate=0.500, proportion infected given a positive test=0.946, Posterior=0.959
Infection rate=0.600, proportion infected given a positive test=0.967, Posterior=0.960
Infection rate=0.700, proportion infected given a positive test=0.979, Posterior=0.978
Infection rate=0.800, proportion infected given a positive test=0.987, Posterior=0.985
Infection rate=0.900, proportion infected given a positive test=0.993, Posterior=0.996
Infection rate=1.000, proportion infected given a positive test=1.000, Posterior=1.000
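
The posterior can also be written purely in terms of the population rates, with no simulation at all, by expanding p(Data) with the law of total probability: p(positive) = p(positive | infected) * p(infected) + p(positive | not infected) * p(not infected). The sketch below is one possible way to do that; the function name `posterior_infected_given_positive` and its argument names are hypothetical and not part of the original tutorial.

```python
def posterior_infected_given_positive(prior, false_positive_rate,
                                      false_negative_rate=0.0):
    """Bayes' rule for p(infected | positive test), using only the rates."""
    # p(positive | infected)
    p_data_given_hypothesis = 1.0 - false_negative_rate
    # p(positive), summed over the infected and uninfected cases
    p_data = (p_data_given_hypothesis * prior
              + false_positive_rate * (1.0 - prior))
    if p_data == 0:
        # No positive tests are possible, so the posterior is undefined
        return float("nan")
    return p_data_given_hypothesis * prior / p_data

# Same priors as the loop above: 0.0, 0.1, ..., 1.0
for k in range(11):
    prior = k / 10
    posterior = posterior_infected_given_positive(prior, 0.05)
    print(f"Infection rate={prior:.3f}, analytic posterior={posterior:.3f}")
```

Because nothing here is sampled, these values do not change from run to run, which makes them a useful reference when judging how much of the variation in the simulated columns above is due to sampling noise.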
