To use this tutorial, read the text and then try to generate code to solve the exercises. Answers will be posted to GitHub after the class they are due.

The learning objective is to gain insight into thinking about inference from a "Frequentist" versus a "Bayesian" perspective. In brief, because a Frequentist does not consider the probability of an event, state of the world, or hypothesis, only its frequency of occurrence, it is not possible to ask questions of the form "what is the probability that hypothesis x is true?" Instead, one can only consider questions of the form "what is the probability that I would have obtained my data, given that hypothesis x is true?" In contrast, Bayesians do consider the probabilities of such things (often called the strength of belief), but doing so can require making assumptions that are difficult to prove.

Let's start with a simple example, taken from:

https://en.wikipedia.org/wiki/Base_rate_fallacy#Example_1:_HIV

"Imagine running an HIV test on A SAMPLE of 1000 persons ..."

"The test has a false positive rate of 5% (0.05)..." i.e., the probability that someone who takes the test gets a POSITIVE result despite the fact that the person does NOT have HIV

"...and no false negative rate." i.e., The probability that someone who takes the test gets a NEGATIVE result despite the fact that the person DOES have HIV.

Answers to the exercises below will be found [here](https://github.com/PennNGG/Quantitative-Neuroscience/tree/master/Answers%20to%20Exercises/Python) after the due date.

In [16]:
import numpy as np

### Exercise #1: If someone gets a positive test, is it "statistically significant" at the p<0.05 level? Why or why not?

From a frequentist perspective, a p-value represents the probability of observing data at least as extreme as the data actually obtained, if the null hypothesis is true. Here, the data are just a single positive test (N=1), and the null hypothesis is that the person is not infected.

As such, our p-value is just the false positive rate of 5% or 0.05. Since it is not less than 0.05, we'd say it is not statistically significant.

In [12]:
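N = 1                  # a single test result
false_positive = 0.05  # P(positive test | not infected)
false_negative = 0     # P(negative test | infected)
alpha = 0.05           # significance threshold

# Under the null hypothesis (not infected), the p-value of a single
# positive test equals the false positive rate
print("If someone gets a positive test, the p-value is "+str(false_positive))
if false_positive < alpha:
    print("This is statistically significant at the 0.05 level.")
else:
    print("This is not statistically significant at the 0.05 level.")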

If someone gets a positive test, the p-value is 0.05
This is not statistically significant at the 0.05 level.
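
The frequentist reading of the false positive rate can also be checked by simulation. The sketch below is not part of the original exercise: the population size, the random seed, and the variable names are illustrative assumptions. It draws test results for a large group of people who are all uninfected (that is, the null hypothesis is true for everyone) and counts how often the test comes back positive.

```python
import numpy as np

# Hypothetical simulation settings (assumptions, not from the original exercise)
rng = np.random.default_rng(seed=0)
n_uninfected = 100_000      # everyone here satisfies the null hypothesis (not infected)
false_positive_rate = 0.05  # false positive rate quoted in the example

# Each uninfected person tests positive with probability equal to the false positive rate
positive_tests = rng.random(n_uninfected) < false_positive_rate

# Fraction of positive tests under the null hypothesis
simulated_rate = positive_tests.mean()
print(f"Simulated fraction of positive tests among uninfected people: {simulated_rate:.3f}")
```

The simulated fraction should land close to 0.05, which is the sense in which the p-value of a single positive test equals the false positive rate.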


### Exercise #2: What is the probability that if someone gets a positive test, that person is infected?

To answer this question, we will try different values for the proportion of the population that is actually infected. What you should notice is that the **PROPORTION INFECTED GIVEN A POSITIVE TEST** depends (a lot!) on the **OVERALL RATE OF INFECTION**. Put another way, to determine the probability of a hypothesis, given your data (e.g., proportion infected given a positive test), you have to know the probability that the hypothesis was true without any data.

Why is this the case? It is a simple consequence of the definition of a conditional probability, formulated as Bayes' Rule. Specifically, the joint probability of two events, call them A and B, can be written in two equivalent ways:
$$p(A\,and\,B) = p(A) \times p(B\,|\,A)$$
$$p(B\,and\,A) = p(B) \times p(A\,|\,B)$$

Because $p(A\,and\,B) = p(B\,and\,A)$, the two right-hand sides are equal. Now, calling A the Hypothesis and B the Data, then rearranging, we get:
$$p(Hypothesis\,|\,Data) = \frac{p(Data\,|\,Hypothesis) \times p(Hypothesis)}{p(Data)}$$

So you cannot calculate the probability of the hypothesis, given the data (i.e., the Bayesian posterior), without knowing the probability of the hypothesis independent of any data (i.e., the prior).
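
One way to see why the prior matters is to count people rather than multiply probabilities. The sketch below uses the sample of 1000 persons and the test error rates from the example above; the 10% infection rate is a hypothetical value chosen only for illustration, and the variable names are not from the original.

```python
# Counting sketch; the 10% infection rate is an assumption chosen for illustration
n_people = 1000             # sample size from the example
prior_infected = 0.10       # hypothetical overall infection rate (the prior)
false_positive_rate = 0.05  # P(positive test | not infected)
false_negative_rate = 0.0   # P(negative test | infected)

# Split the sample according to the prior
n_infected = n_people * prior_infected
n_uninfected = n_people - n_infected

# Count the positive tests that come from each group
true_positives = n_infected * (1 - false_negative_rate)
false_positives = n_uninfected * false_positive_rate

# Among everyone who tests positive, what fraction is actually infected?
fraction_infected_given_positive = true_positives / (true_positives + false_positives)
print(f"Infected, positive test: {true_positives:.0f}")
print(f"Not infected, positive test: {false_positives:.0f}")
print(f"Fraction infected given a positive test: {fraction_infected_given_positive:.4f}")
```

With these assumed numbers, roughly 100 positive tests come from infected people and 45 from uninfected people, so only about 69% of the positive tests belong to someone who is infected. Changing `prior_infected` changes both counts, which is why the posterior cannot be computed without the prior.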

For this exercise, assume a range of priors (infection rates) from 0 to 1 in steps of 0.1.
**************

This question can be answered from the Bayesian perspective, where the probability that someone is infected with HIV given a positive test is:

$$
Posterior \;=\; \frac{(Likelihood)(Prior)}{Marginal}
$$
or
$$
P(\mathrm{HIV}\mid +) \;=\; \frac{P(+\mid \mathrm{HIV})\,P(\mathrm{HIV})}{P(+)}
$$
where
$$
P(+) = P(+ \mid \text{HIV}) \, P(\text{HIV}) \;+\; P(+ \mid \text{No HIV}) \, P(\text{No HIV})
$$
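
As a worked check of the formulas above, take an illustrative prior of $P(\mathrm{HIV}) = 0.5$ (one value from the requested range, picked here only as an example), together with $P(+ \mid \text{HIV}) = 1$ (no false negatives) and $P(+ \mid \text{No HIV}) = 0.05$:
$$
P(+) = (1)(0.5) + (0.05)(0.5) = 0.525
$$
$$
P(\mathrm{HIV}\mid +) = \frac{(1)(0.5)}{0.525} \approx 0.952
$$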

In [15]:
# The posterior depends on the prior (the overall infection rate),
# so we evaluate it over a range of priors from 0 to 1 in steps of 0.1.
# np.linspace avoids the floating-point endpoint issues of np.arange with a 0.1 step.
prior_lists = np.linspace(0.0, 1.0, 11)
for prior in prior_lists:
    # P(+ | HIV): with a false negative rate of 0, every infected person tests positive
    likelihood = 1 - false_negative
    p_no_hiv = 1 - prior  # P(No HIV)
    marginal = (likelihood*prior) + (false_positive*p_no_hiv)  # P(+)
    posterior = (likelihood*prior) / marginal                  # P(HIV | +)
    print(f"If someone gets a positive test and the infection rate is {round(prior*100,2)}%, the probability that they are infected is {round(posterior*100,2)}%")


If someone gets a positive test and the infection rate is 0.0%, the probability that they are infected is 0.0%
If someone gets a positive test and the infection rate is 10.0%, the probability that they are infected is 68.97%
If someone gets a positive test and the infection rate is 20.0%, the probability that they are infected is 83.33%
If someone gets a positive test and the infection rate is 30.0%, the probability that they are infected is 89.55%
If someone gets a positive test and the infection rate is 40.0%, the probability that they are infected is 93.02%
If someone gets a positive test and the infection rate is 50.0%, the probability that they are infected is 95.24%
If someone gets a positive test and the infection rate is 60.0%, the probability that they are infected is 96.77%
If someone gets a positive test and the infection rate is 70.0%, the probability that they are infected is 97.9%
If someone gets a positive test and the infection rate is 80.0%, the probability that they a