### Bayesian Statistics with Conditional Probability


People rely on the collective intelligence of previous experiences to protect themselves or to make better decisions in the future, 
like saving themselves from eating bad food at the wrong restaurant. 

We discussed conditional probability of an event as the probability obtained using additional information that some other event has already occurred. 
We used the following formula for finding P(B|A):

                                             P(A ∩ B)  
                                    P(B|A) = --------  
                                               P(A)

where the occurrence of event B depends on event A.
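
As a quick illustration of this formula, the sketch below uses one roll of a fair six-sided die. This is a made-up example, not one from the earlier discussion, with A = "the roll is even" and B = "the roll is greater than 3".

```r
# Hypothetical example: one roll of a fair six-sided die
outcomes <- 1:6
A <- outcomes %% 2 == 0   # event A: roll is even     -> {2, 4, 6}
B <- outcomes > 3         # event B: roll is above 3  -> {4, 5, 6}

p_a       <- mean(A)       # P(A)     = 3/6
p_a_and_b <- mean(A & B)   # P(A ∩ B) = 2/6 (rolls 4 and 6)

p_b_given_a <- p_a_and_b / p_a   # P(B|A) = 2/3
p_b_given_a
```

Knowing that the roll is even raises the chance of a roll above 3 from 1/2 to 2/3.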

In this notebook, 
we will extend the discussion of conditional probability to applications of Bayes' theorem (or Bayes' rule). 
Bayes' rule is used to update the prior probabilities based on additional information that is obtained later. 
Bayes' theorem deals with a sequence of events in which each subsequent event provides 
new information that is used to revise the probability of an earlier event.
The probabilities before and after the revision are called the _prior probability_ and the _posterior probability_.

**Prior probability** (a priori) is an initial probability value obtained before any additional information is obtained.

**Posterior probability** (a posteriori) is a probability value that has been revised by using additional information that is later obtained.

### Bayes' Theorem

The probability of event A, given that event B has subsequently occurred, is given by the formula below, where A' denotes the complement of A:

                                             P(A) * P(B|A)
                        P(A|B) = -------------------------------------
                                  [P(A) * P(B|A)] + [P(A') * P(B|A')]
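
A minimal R sketch of this two-event form follows. The function name `bayes_binary`, its argument names, and the input values are all illustrative assumptions, not part of any package or of this notebook's data.

```r
# Hypothetical helper: posterior P(A|B) for an event A and its complement A'
bayes_binary <- function(p_a, p_b_given_a, p_b_given_not_a) {
  numerator   <- p_a * p_b_given_a                        # P(A) * P(B|A)
  denominator <- numerator + (1 - p_a) * p_b_given_not_a  # P(B) by total probability
  numerator / denominator
}

# Made-up inputs, purely for illustration
bayes_binary(p_a = 0.5, p_b_given_a = 0.8, p_b_given_not_a = 0.2)   # 0.8
```

With equal priors, the posterior is driven entirely by how much more likely B is under A than under A'.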
                                  

Consider an example of conducting cancer tests. 
Tests are imperfect: they can detect things that don’t exist (false positives) and miss things that do exist (false negatives).
People often take test results at face value, without considering the errors in the tests. 
Bayes’ theorem converts the results from a test into the real probability of the event. 

**Correct for measurement errors...** 
If you know the real probabilities and the chance of a false positive and false negative, 
you can correct for measurement errors.

**Relate the actual probability to the measured test probability...** 
Bayes’ theorem lets you relate `P(A|X)`, the chance that an event A happened given the indicator X, 
and `P(X|A)`, the chance the indicator X happened given that event A occurred. 
Given mammogram test results and known error rates, you can predict the actual chance of having cancer.

Bayes’ theorem lets you take the test results and correct for the “skew” introduced by false positives. 
Consider the cancer test example again to see what Bayes' formula is doing.

Let 'A' be the event that a person has cancer.
Let 'X' be the event that the test is positive.

P(A|X) = Chance of having cancer (A) given a positive test (X). 
This is what we **want to know**: 
How likely is it to have cancer with a positive result?

P(X|A) = Chance of a positive test (X) given that you have cancer (A). This is the chance of a true positive.

P(A) = Chance of having cancer.

P(not A) = Chance of not having cancer.

P(X|not A) = Chance of a positive test (X) given that you don’t have cancer (not A). 
This is the chance of a false positive.

It all comes down to the chance of a true positive result divided by the chance of any positive result. We can simplify the equation to:

                      P(X|A) * P(A)
            P(A|X) = ---------------
                          P(X)

P(X) is a normalizing constant and helps scale our equation. 
P(X) tells us the chance of getting any positive result, 
whether it’s a real positive in the cancer population or a false positive in the non-cancer population. 
It’s a bit like a weighted average and helps us compare against the overall chance of a positive result.
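
To make the role of P(X) concrete, here is a small R sketch. The prevalence and error rates below are assumed values chosen only for illustration; they are not given in this notebook.

```r
# Illustrative numbers only: these are assumptions, not values from the text
p_cancer        <- 0.01    # P(A): prior chance of having cancer
p_pos_if_cancer <- 0.80    # P(X|A): true positive rate
p_pos_if_clear  <- 0.096   # P(X|not A): false positive rate

# P(X): chance of any positive result (true positives + false positives)
p_pos <- p_pos_if_cancer * p_cancer + p_pos_if_clear * (1 - p_cancer)

# P(A|X): chance of cancer given a positive test
p_cancer_if_pos <- p_pos_if_cancer * p_cancer / p_pos
round(p_cancer_if_pos, 3)
```

Under these assumed rates, most positive results come from the much larger non-cancer population, so the posterior stays well below the test's true positive rate.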

The example below illustrates the formula... 

----
`1. Consider an example:` In Boone County, Missouri 51% of the adults are males.
One adult is randomly selected for a survey involving credit card usage. 
What is the prior probability that the selected person is a male?

**Solution:** It's known that 51% of the adults in the county are males. 
Let 'm' be the event that the selected adult is a male. 
Then the probability of randomly selecting an adult and getting a male is P(m) = 0.51.

`2. Consider another example:` 
It is later learned that the selected survey subject was smoking a cigar. 
It is known that 9.5% of males smoke cigars, 
whereas 1.7% of females smoke cigars (based on data from the Substance Abuse and Mental Health Services Administration).
Use this additional information to find the probability that the selected subject is a male if we know the subject smokes cigars.

**Solution:** Based on the additional given information, we have the following:
    
  Let c denote the event that the adult smokes cigars
        
  c' is the complement of c and represents the event that the adult does not smoke cigars
        
  P(m) = 0.51 because 51% of the adults are males
    
  P(m') = 0.49 because 49% of the adults are females (not males)
    
  P(c|m) = 0.095 because 9.5% of the males smoke cigars 
  (That is, the probability of getting someone who smokes cigars, given that the person is a male, is 0.095.)

  P(c|m') = 0.017 because 1.7% of the females smoke cigars 
  (That is, the probability of getting someone who smokes cigars, given that the person is a female, is 0.017)

Applying Bayes' theorem to the information above, we get the following result:

                                               P(m) * P(c|m)
                        P(m | c) = --------------------------------------
                                    [P(m) * P(c|m)] + [P(m') * P(c|m')]
                                    
                                 =         0.51 * 0.095
                                   -------------------------------
                                   (0.51 * 0.095) + (0.49 * 0.017)
                                   
                                 =  0.853
                                

Before we knew that the survey subject smoked cigars, there was a 0.51 probability that the subject was male. 
After learning that the subject smoked cigars, 
the probability is revised to 0.853. 
That is, there is a 0.853 probability that the cigar-smoking respondent is a male. 
The probability that the subject is male increased substantially with the additional information that the subject smokes cigars.
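
The same arithmetic can be checked in R. This is a minimal sketch; the variable names are chosen here for illustration, and the input values are the ones given in the example above.

```r
p_m         <- 0.51    # P(m): prior probability that the subject is male
p_c_given_m <- 0.095   # P(c|m): share of males who smoke cigars
p_c_given_f <- 0.017   # P(c|m'): share of females who smoke cigars

# P(c): overall chance of selecting a cigar smoker (total probability)
p_c <- p_m * p_c_given_m + (1 - p_m) * p_c_given_f

# P(m|c): posterior probability of a male, given a cigar smoker
p_m_given_c <- p_m * p_c_given_m / p_c
round(p_m_given_c, 3)   # 0.853
```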

Let's apply Bayes' theorem to a multivariate dataset to learn more. Load the Framingham data from the directory '/dsa/data/all_datasets/framingham' ... 
This data is from the Framingham Heart Study: https://www.framinghamheartstudy.org

In [None]:
framingham_data <- read.csv("/dsa/data/all_datasets/framingham/framingham.csv")
head(framingham_data)

In [None]:
with(framingham_data,table(currentSmoker,TenYearCHD))

**Question:** What is the probability that a person has coronary heart disease, given that the person is a smoker?

According to the Bayes' theorem formula, let's define events...

**Solution:** 
Let c be the event of selecting a current smoker and c' be the event of selecting a non-smoker.

Let d be the event that the person has a risk of coronary heart disease and d' be the event that the person does not have a risk of coronary heart disease...

                    P(d) * P(c|d)
     P(d|c) =  -------------------------------------
               [P(d) * P(c|d)] + [P(d') * P(c|d')]
            

P(d|c) = Chance of having coronary heart disease (d) given a person is a smoker (c). 
This is what we want to know: 
How likely is it to have heart disease if a person smokes? 

P(c|d) = Chance of a person being a smoker (c) given that the person has coronary heart disease (d). 
Here it plays the role of the true positive rate: 333/644 = 0.517

P(d) = Chance of having coronary heart disease = (311+333)/(1762+333+1834+311)  = 644/4240 ≈ 0.15

P(d') = Chance of not having coronary heart disease = (1834+1762)/(1762+333+1834+311) = 3596/4240 ≈ 0.85

P(c|d') = Chance of a person being a smoker (c) given that the person doesn't have the disease (d'). 
Here it plays the role of the false positive rate: 1762/3596 = 0.49

P(c) = Chance of being a smoker = (1762+333)/(1762+333+1834+311)  = 2095/4240 = 0.4941


              P(c ∩ d)      333
    P(c|d) =  --------   = ----- = 0.517
                P(d)        644


                P(c ∩ d')       1762
    P(c|d') =  -----------   = ------ = 0.49
                 P(d')          3596

                         P(d) * P(c|d)               
    P(d|c) =   ---------------------------------------
                    [P(d) * P(c|d)] + [P(d') * P(c|d')]


                     0.15 * 0.517            
           =   -----------------------------------
                  (0.15 * 0.517) + (0.85 * 0.49)
         
           ≈    0.157
        

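Here is a simplified version of Bayes' rule, which uses P(c) directly as the denominator. 
Compare the results.

                                      P(d) * P(c|d)
                        P(d|c)  =  -------------------
                                         P(c)
                                         
                                     0.15 * 0.517
                                =  ---------------
                                         0.4941
                                         
                                ≈  0.157

The inputs above are rounded (in particular, P(d) = 644/4240 is rounded to 0.15). 
With unrounded counts, the result is simply 333/2095 ≈ 0.159.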
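
The calculation can also be reproduced in R straight from the counts quoted above, without rounding the inputs. This is a sketch under one assumption: the 0/1 coding of `currentSmoker` and `TenYearCHD`, and which count belongs to which cell, are inferred from the arithmetic in the text rather than read from the file.

```r
# Counts as quoted in the text (rows: currentSmoker, columns: TenYearCHD).
# The 0/1 labels are an assumption inferred from the hand calculation.
counts <- matrix(c(1834, 1762, 311, 333), nrow = 2,
                 dimnames = list(currentSmoker = c("0", "1"),
                                 TenYearCHD    = c("0", "1")))
n <- sum(counts)

p_d          <- sum(counts[, "1"]) / n                 # P(d)    = 644/4240
p_c_given_d  <- counts["1", "1"] / sum(counts[, "1"])  # P(c|d)  = 333/644
p_c_given_nd <- counts["1", "0"] / sum(counts[, "0"])  # P(c|d') = 1762/3596

# Bayes' theorem with the full denominator
bayes <- p_d * p_c_given_d /
  (p_d * p_c_given_d + (1 - p_d) * p_c_given_nd)

# Direct conditional probability: smokers with CHD / all smokers
direct <- counts["1", "1"] / sum(counts["1", ])        # 333/2095

c(bayes = bayes, direct = direct)
```

Both routes agree, which shows that Bayes' theorem here is a rearrangement of the conditional probability read off the smoker row of the table.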

**Example from:** [IPSUR](https://cran.r-project.org/web/packages/IPSUR/vignettes/IPSUR.pdf)

**Misfiling Assistants problem.**
In this problem, there are three assistants working at a company: 
Moe, Larry, and Curly. 
Their primary job duty is to file paperwork in the filing cabinet when papers become available.
The three assistants have different work schedules:

|        |Moe |Larry |Curly
|--------|----|------|-----
|Workload|60% |30%   |10%

That is, Moe works 60% of the time, Larry works 30% of the time, and Curly does the remaining 10%, and they file documents at approximately the same speed. Suppose a person were to select one of the documents from the cabinet at random. 

Let M be the event, M = {Moe filed the document}  and 

Let L and C be the events that Larry and Curly, respectively, filed the document. 


What are these events’ respective probabilities? 
In the absence of additional information, reasonable prior probabilities would just be

|        |Moe        |Larry      |Curly|
|--------|-----------|-----------|-----------|
|Prior probability|P(M) = 60% |P(L) = 30% | P(C) = 10%|

Now, the boss comes in one day, opens up the file cabinet, and selects a file at random. 
The boss discovers that the file has been misplaced. 
The boss is so angry at the mistake that (s)he threatens to fire the one who erred. 
The question is: Who misplaced the file?

The boss decides to use probability to decide, and walks straight to the workload schedule. 
(S)he reasons that, since the three employees work at the same speed, 
the probability that a randomly selected file would have been filed by each one would be proportional to his workload.
The boss notifies Moe that he has until the end of the day to empty his desk. 
But Moe argues in his defense that the boss has ignored additional information.
Moe’s likelihood of having misfiled a document is smaller than Larry’s and Curly’s, 
since he is a diligent worker who pays close attention to his work.
Moe admits that he works longer than the others, 
but he doesn’t make as many mistakes as they do. 
Thus, Moe recommends that – before making a decision – the boss should update the probability 
(initially based on workload alone) to incorporate the likelihood of having observed a misfiled document.

And, as it turns out, the boss has information about Moe, Larry, and Curly’s filing accuracy in the past (due to historical performance evaluations). 
The performance information may be represented by the following table:

|        |Moe |Larry |Curly
|--------|----|------|-----
|Misfile rate|0.003| 0.007| 0.010


In other words, on the average, Moe misfiles 0.3% of the documents he is supposed to file. 
Notice that Moe was correct: he is the most accurate filer, followed by Larry, and lastly Curly. 
If the boss were to make a decision based only on the worker’s overall accuracy, 
then Curly should get the axe.
But Curly hears this and interjects that he only works a short period during the day, and consequently makes mistakes only very rarely; 
there is only the tiniest chance that he misfiled this particular document.

The boss would like to use this additional information to update the probabilities for the three assistants; that is, 
(s)he wants to use the likelihood of observing a misfiled document to update his/her beliefs about the likely culprit. 

Let **A** be the event that **a document is misfiled**.
What the boss would like to know are the three probabilities...

            P(M|A), P(L|A), and P(C|A)
            
We will show the calculation for P(M|A), the other two cases being similar.
We use Bayes’ Rule in the form

                  P(M ∩ A)        
        P(M|A) = ----------
                    P(A)
   

Let’s try to find P(M ∩ A), which is just P(M) · P(A|M) by the Multiplication Rule.
We already know P(M) = 0.6 and P(A|M) is nothing more than Moe’s misfile rate, 
given above to be P(A|M) = 0.003.
Thus, we compute

        P(M ∩ A) = (0.6)(0.003) = 0.0018.

        Similarly, P(L ∩ A) = (0.3)(0.007) = 0.0021 and P(C ∩ A) = (0.1)(0.010) = 0.0010.

Using the theorem of Total Probability we can write P(A) = P(A ∩ M) + P(A ∩ L) + P(A ∩ C).

        P(A) = 0.0018 + 0.0021 + 0.0010 = 0.0049
        
                                         0.0018
    According to Bayes' rule,  P(M|A) = --------  
                                         0.0049

                                       = 0.37

This last quantity is called the posterior probability that Moe misfiled the document. 
We can use the same argument to calculate the posterior probabilities for Larry and Curly:


|        |Moe        |Larry      |Curly|
|--------|-----------|-----------|-----------|
|Posterior probability|P(M/A) = 0.37| P(L/A) = 0.43| P(C/A) = 0.20|

The conclusion:
Larry gets the axe.
What is happening is an intricate interplay between the time on the job and the misfile rate. 
It is not obvious who the winner (or in this case, loser) will be, 
and the statistician needs to consult Bayes’ Rule to determine the best course of action.

Let's try to implement the same thing in R. 
All the math in the problem above used four simple steps. 

In [None]:
# prior_probs are the prior probabilities for Moe, Larry, and Curly. They are based on each
# assistant's share of the workload, since we have no other information yet.
prior_probs <- c(0.6, 0.3, 0.1)

# Historical misfile rates for Moe, Larry, and Curly, used as the likelihood
# of each assistant misfiling a document.
like <- c(0.003, 0.007, 0.01)

# Multiply prior by likelihood to get the unnormalized posterior, P(assistant ∩ A).
post <- prior_probs * like   # Note: This is vector math
post

In [None]:
post/sum(post) # More vector math: divide by P(A) so the posterior probabilities sum to 1
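
The two cells above can be wrapped into a small reusable function. This is only a sketch: `bayes_update` is a hypothetical helper name, not a function from base R or any package, and the named vectors simply restate the workload and misfile rates from the tables above.

```r
# Hypothetical helper: normalise prior * likelihood over any number of hypotheses
bayes_update <- function(prior, likelihood) {
  stopifnot(length(prior) == length(likelihood))
  joint <- prior * likelihood   # P(hypothesis ∩ evidence) for each hypothesis
  joint / sum(joint)            # divide by P(evidence), the sum of the joints
}

prior     <- c(Moe = 0.6,   Larry = 0.3,   Curly = 0.1)
misfile   <- c(Moe = 0.003, Larry = 0.007, Curly = 0.010)
posterior <- bayes_update(prior, misfile)

round(posterior, 2)
names(which.max(posterior))   # "Larry"
```

Naming the vector elements keeps each posterior attached to the right assistant, so the most likely culprit can be picked out with `which.max`.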

We see that we can compute the results using R.
Later in the course, you will see Bayes' Rule applied to a classification problem.