# Conditional Probabilities

Conditional Probability: Probability of one event given the occurrence of one or more other events, e.g. P(A|B).

The conditional probability can be calculated using the joint probability:

$ \large P(A|B) = \large \frac{P(A\cap B) }{P(B)} $

Conditional probability is not symmetrical: in general P(A|B) is not equal to P(B|A). However, one conditional probability can be calculated from the other.
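As a small illustration of the formula and of the asymmetry, here is a sketch using a fair six-sided die. The events A (roll is even) and B (roll is greater than 2) are assumptions chosen for this example, not part of the notes above.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
outcomes = set(range(1, 7))
A = {n for n in outcomes if n % 2 == 0}  # event A: roll is even
B = {n for n in outcomes if n > 2}       # event B: roll is greater than 2


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


def conditional(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(event & given) / prob(given)


p_a_given_b = conditional(A, B)
p_b_given_a = conditional(B, A)
print("P(A|B) =", p_a_given_b)
print("P(B|A) =", p_b_given_a)
```

The two results differ (1/2 versus 2/3), which is the asymmetry described above.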

# Bayes Theorem

Bayes Theorem provides a principled way for calculating a conditional probability and an alternative to using the joint probability. This alternate approach of calculating the conditional probability is useful either when the joint probability is challenging to calculate, or when the reverse conditional probability is available or easy to calculate.

$ \large P(A|B) = \frac{P(B|A) * P(A)}{P(B)} $, also written as $ \large \text{Posterior} = \frac{\text{Likelihood} * \text{Prior}}{\text{Evidence}} $

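Sometimes P(B) is not readily available. It can then be computed with the law of total probability:

$ P(B) = P(B|A) * P(A) + P(B|\text{not } A) * P(\text{not } A) $

which can use the complements

$ P(\text{not } A) = 1 - P(A); \quad P(B|\text{not } A) = 1 - P(\text{not } B|\text{not } A) $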
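The two formulas above can be combined into a pair of small helper functions. This is a minimal sketch; the function names and the input values in the usage lines are assumptions for illustration only.

```python
def total_probability(p_b_given_a, p_a, p_b_given_not_a):
    """P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)."""
    p_not_a = 1 - p_a
    return p_b_given_a * p_a + p_b_given_not_a * p_not_a


def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B), with the evidence P(B) expanded by total probability."""
    p_b = total_probability(p_b_given_a, p_a, p_b_given_not_a)
    return p_b_given_a * p_a / p_b


# Hypothetical inputs: P(B|A) = 0.8, P(A) = 0.5, P(B|not A) = 0.2
p_b = total_probability(0.8, 0.5, 0.2)
p_a_given_b = bayes_posterior(0.8, 0.5, 0.2)
print(f"P(B) = {p_b:.3f}, P(A|B) = {p_a_given_b:.3f}")
```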

## Fire Smoke example

$ \large P(\text{Fire}|\text{Smoke}) = \frac{P(\text{Smoke}|\text{Fire}) * P(\text{Fire})}{P(\text{Smoke})} $

We can also think in terms of a binary classifier, where A = Fire is the actual condition and B = Smoke is the positive signal.

P(B|A): True Positive Rate (TPR), the sensitivity, or P(Smoke given Fire)

P(B|not A): False Positive Rate (FPR), or P(Smoke given No Fire)

P(not B|not A): True Negative Rate (TNR), the specificity, or P(No Smoke given No Fire)

P(A|B): Positive Predictive Value (PPV), the precision, or P(Fire given Smoke)
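To make the four rates concrete, the sketch below derives them from a confusion matrix. The counts are made up for illustration and do not come from real fire data.

```python
# Hypothetical confusion-matrix counts for 1,000 observations (invented for illustration).
tp = 90    # fire and smoke
fn = 10    # fire, no smoke
fp = 50    # no fire, smoke
tn = 850   # no fire, no smoke

tpr = tp / (tp + fn)   # P(Smoke given Fire), sensitivity
fpr = fp / (fp + tn)   # P(Smoke given No Fire)
tnr = tn / (fp + tn)   # P(No Smoke given No Fire), specificity
ppv = tp / (tp + fp)   # P(Fire given Smoke), precision

# The same PPV via Bayes Theorem, using the base rate of fire as the prior.
p_fire = (tp + fn) / (tp + fn + fp + tn)
ppv_bayes = tpr * p_fire / (tpr * p_fire + fpr * (1 - p_fire))

print(f"TPR={tpr:.3f} FPR={fpr:.3f} TNR={tnr:.3f} PPV={ppv:.3f}")
print(f"PPV via Bayes Theorem: {ppv_bayes:.3f}")
```

Counting directly and applying Bayes Theorem give the same PPV, and FPR and TNR sum to 1 because both are conditioned on No Fire.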


## Elderly Fall cases

Consider the case where an elderly person falls: what is the probability that they will die from the fall?
Let's assume the base rate of an elderly person dying, P(A), is 10%, the base rate of an elderly person falling, P(B), is 5%, and that 7% of the elderly people who die had a fall, P(B|A).

P(Die given Fall) = P(Fall given Die) * P(Die) / P(Fall)

= (7% * 10%) / 5% = 14%
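The same arithmetic as a runnable sketch. The variable names are chosen here for readability; the three input values are the assumed rates stated above.

```python
p_die = 0.10             # P(A): assumed base rate of an elderly person dying
p_fall = 0.05            # P(B): assumed base rate of an elderly person falling
p_fall_given_die = 0.07  # P(B|A): assumed share of those who die that had a fall

# Bayes Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_die_given_fall = p_fall_given_die * p_die / p_fall
print(f"P(die | fall) = {p_die_given_fall:.0%}")
```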


## Spam email detection

Consider the case where we receive an email and the spam detector puts it in the spam folder: what is the probability that it really is spam?

- P(A) = P(Spam) = 2%: the share of the email we receive that is spam.
- P(B|A) = P(Detected given Spam) = 99%: the detector is very good and flags 99% of the emails that are spam.
- P(B|not A) = P(Detected given not Spam) = 0.1%: when an email is not spam, it is still marked as spam at a very low rate.

We need to estimate 
P(Spam given Detected) = P(Detected given Spam) * P(Spam) / P(Detected)

We don't know P(Detected), but we can calculate it using:
P(Detected) = P(Detected given Spam) * P(Spam) + P(Detected given not Spam) * P(not Spam)

We know P(Detected given not Spam), which is 0.1%, and we can calculate P(not Spam) as 1 - P(Spam) = 1 - 0.02 = 0.98

Hence P(Detected) = 0.99 * 0.02 + 0.001 * 0.98 = 0.02078. That is, about 2% of all email is detected as spam, regardless of whether it is spam or not.

Now we can calculate the answer to our original question as:
P(Spam given Detected) = (0.99 * 0.02) / 0.02078 ≈ 0.9528

That is, if an email is in the spam folder, there is a 95.3% probability that it is in fact spam.
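The calculation can be reproduced with the sketch below. The first false positive rate (0.1%) is the one from this example; the other two rates in the loop are hypothetical values added only to show how strongly the result depends on that rate when spam is rare.

```python
p_spam = 0.02                 # P(Spam)
p_detected_given_spam = 0.99  # P(Detected given Spam)


def p_spam_given_detected(p_detected_given_not_spam):
    """Posterior P(Spam given Detected) for a given false positive rate."""
    p_detected = (p_detected_given_spam * p_spam
                  + p_detected_given_not_spam * (1 - p_spam))
    return p_detected_given_spam * p_spam / p_detected


# 0.001 is the rate from the example; 0.01 and 0.05 are hypothetical alternatives.
for false_positive_rate in (0.001, 0.01, 0.05):
    posterior = p_spam_given_detected(false_positive_rate)
    print(f"FPR={false_positive_rate:.3f} -> P(Spam | Detected) = {posterior:.4f}")
```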

## Liars and Lie Detectors

Consider the case where a person is tested with a lie detector and gets a positive result suggesting that they are lying: what is the probability that the person is indeed lying?
P(A) = Most people who are tested are telling the truth, say 98%, meaning (1 - 0.98) or 2% are liars.

P(Lying) = 2% 

P(B|A) = When someone is lying, the test detects them reasonably well but not perfectly, say 72% of the time.

P(Caught given Lying) = 72% & P(Not Caught given Lying) = 28%

P(not B|not A) = When someone is not lying, the machine correctly says so 97% of the time.

P(Not Caught given Not Lying) = 97% & P(Caught given Not Lying) = 3%, so 3% of truthful people are wrongly flagged.

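We need P(Lying given Caught) = P(Caught given Lying) * P(Lying) / P(Caught)

P(A|B) = 0.72 * 0.02 / P(Caught)

But P(Caught) = P(Caught given Lying) * P(Lying) + P(Caught given Not Lying) * P(Not Lying)

= 72% * 2% + 3% * 98% = 4.38%

Hence

P(A|B) = 0.72 * 0.02 / 0.0438 ≈ 0.3288

That is, even after a positive test there is only about a 33% probability that the person is lying, because liars are rare among the people tested.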
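Another way to check this result is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical group of 10,000 tested people (a number chosen only so the counts come out whole) and applies the rates from this example.

```python
population = 10_000  # hypothetical number of people tested

liars = round(population * 0.02)       # 2% are lying
truthful = population - liars          # 98% are telling the truth

caught_liars = round(liars * 0.72)     # liars correctly flagged (72%)
false_alarms = round(truthful * 0.03)  # truthful people wrongly flagged (3%)

total_caught = caught_liars + false_alarms
p_lying_given_caught = caught_liars / total_caught

print(f"{caught_liars} of {total_caught} flagged people are actually lying")
print(f"P(Lying | Caught) = {p_lying_given_caught:.4f}")
```

Because the truthful group is so much larger, its 3% of false alarms outnumbers the correctly caught liars, which is why the posterior stays low.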


![1_NMBaonn0t2fPHYDYl1L17g.png](attachment:1_NMBaonn0t2fPHYDYl1L17g.png)

#  Bayes Theorem and Machine Learning

An ML model can be thought of as a hypothesis about the relationship between input (X) and output (y). The practice of applied ML is the testing and analysis of different hypotheses (models) on a given dataset. Bayes Theorem provides a probabilistic model to describe the relationship between data (D) and a hypothesis (h); for example:

P(h|D) = P(D|h) * P(h) / P(D)

In words: the probability of a hypothesis h being true given some observed data D equals the probability of observing D given h, multiplied by the probability of h being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.

The terms have specific names:
- P(h|D): Posterior probability of the hypothesis (the thing we want to calculate)
- P(h): Prior probability of the hypothesis
- P(D|h): Likelihood of the data given the hypothesis
- P(D): Evidence, the probability of the data regardless of the hypothesis

Any prior domain knowledge is captured in the prior probability. If there is none, all hypotheses may be given the same prior probability.
- If P(D) increases while P(D|h) and P(h) stay fixed, then P(h|D) decreases, because P(D) is the denominator.
- Conversely, if P(h) or P(D|h) increases while P(D) stays fixed, then P(h|D) increases.

Testing different models on a dataset means estimating the probability of each hypothesis (h1, h2, h3, ... in H) being true given the constant D. If we do not have any prior information about the hypotheses being tested, they can be assigned a uniform prior probability.

Since P(D) is the same for every hypothesis and the prior is uniform, maximizing P(h|D) reduces to maximizing P(D|h). Hence the goal is to locate the hypothesis that best explains the observed data. This simplification provides the basis for an optimization procedure for searching for a model and set of parameters that best fits the data, referred to generally as density estimation.
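As a toy illustration of comparing hypotheses on fixed data, the sketch below scores three candidate coin biases against an observed sequence summary. The hypotheses, the data (7 heads in 10 flips) and the uniform prior are all assumptions made for this example.

```python
from math import comb

heads, flips = 7, 10           # hypothetical observed data D
hypotheses = [0.3, 0.5, 0.8]   # hypothetical candidate values of P(heads)
prior = {h: 1 / len(hypotheses) for h in hypotheses}  # uniform prior P(h)


def likelihood(h):
    """Binomial likelihood P(D|h) of the observed head count."""
    return comb(flips, heads) * h ** heads * (1 - h) ** (flips - heads)


# Unnormalized posterior P(D|h) * P(h), then divide by the evidence P(D).
unnormalized = {h: likelihood(h) * prior[h] for h in hypotheses}
evidence = sum(unnormalized.values())
posterior = {h: value / evidence for h, value in unnormalized.items()}

best_by_posterior = max(hypotheses, key=posterior.get)
best_by_likelihood = max(hypotheses, key=likelihood)

for h in hypotheses:
    print(f"h={h}: P(D|h)={likelihood(h):.4f}  P(h|D)={posterior[h]:.4f}")
print("Best hypothesis:", best_by_posterior)
```

With a uniform prior the hypothesis with the highest posterior is also the one with the highest likelihood, which is the simplification described above.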



# Density Estimation

Density estimation is a common modeling problem: it involves selecting a probability distribution function and the parameters that best explain the joint probability distribution of the observed data. Often estimating the full density is too challenging; instead, we settle for a point estimate from the target distribution, such as the mean.

Two common approaches that frame the problem as an optimization, searching for a distribution and a set of parameters, are:
- Maximum a Posteriori (MAP), a Bayesian method. A popular alternative to maximizing the likelihood, it maximizes the Bayesian posterior probability density of the parameters instead.
- Maximum Likelihood Estimation (MLE), a frequentist method. It tries to maximize the probability of observing the data from the joint probability distribution, P(x1, x2, x3, ..., xn | $\theta $), which is the conditional probability, or likelihood, of the data given the parameters.

In Bayes Theorem, the quantity that we are calculating is the posterior probability of A given B, and P(A) is referred to as the prior probability of A. Dropping the normalizing constant P(B), the posterior is proportional to the probability of B given A multiplied by the prior. Maximizing this quantity over a range of $\theta $ solves an optimization problem for estimating the central tendency of the posterior probability (e.g. the mode of the distribution). Hence it is referred to as maximum a posteriori estimation, or MAP estimation for short, and sometimes simply maximum posterior estimation.

Or, dropping the normalizing constant, maximizing P($\theta $|X) is equivalent to maximizing P(X|$\theta $) * P($\theta $).

... Finding MAP hypotheses is often much easier than Bayesian learning, because it
requires solving an optimization problem instead of a large summation (or integration)
problem.

This is very similar to MLE, with the addition of the prior probability over the distribution and parameters. If we assume a uniform prior, then both calculations are equivalent. Because of this equivalence, both MLE and MAP often converge to the same optimization problem for many machine learning algorithms. 
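The relationship between the two estimates can be sketched with a coin-flip model and a simple grid search over $\theta $. Everything specific here is an assumption for illustration: the data (7 heads in 10 flips), the Beta(a, b) family used as the prior, and the grid resolution. A real application would normally use a closed-form solution or a numerical optimizer instead of a grid.

```python
import math

heads, flips = 7, 10  # hypothetical observed data X


def log_likelihood(theta):
    """log P(X|theta) for independent coin flips, up to an additive constant."""
    return heads * math.log(theta) + (flips - heads) * math.log(1 - theta)


def log_beta_prior(theta, a, b):
    """Unnormalized log density of a Beta(a, b) prior on theta."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)


# Candidate values of theta, excluding 0 and 1 where the logs are undefined.
grid = [i / 1000 for i in range(1, 1000)]

# MLE: maximize the likelihood alone.
theta_mle = max(grid, key=log_likelihood)

# MAP: maximize likelihood * prior, i.e. the sum of their logs.
theta_map_uniform = max(grid, key=lambda t: log_likelihood(t) + log_beta_prior(t, 1, 1))
theta_map_informed = max(grid, key=lambda t: log_likelihood(t) + log_beta_prior(t, 2, 2))

print("MLE:", theta_mle)
print("MAP with uniform Beta(1, 1) prior:", theta_map_uniform)
print("MAP with Beta(2, 2) prior:", theta_map_informed)
```

With the uniform prior the MAP estimate coincides with the MLE, while the Beta(2, 2) prior pulls the estimate toward 0.5.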

.... to be updated...