# 1) Bayes Rule Review

## Bayes Rule & Probability Review
- You've probably already seen the formula below:

$$p(A|B) = \frac{p(A,B)}{p(B)}$$

- $p(A|B)$ - conditional probability of A given B
- $p(A,B)$ - joint probability of A and B
- $p(B)$ - marginal probability of B
- This is the most basic form, but it is too abstract on its own


## Bayes Example

- A = {Buy, does not buy}, B = {USA, Canada, Mexico}
- Suppose we want to find p(Buy? | Country) from the purchase counts below

| |CA|US|MX|
|----|---|---|---|
|Buy = 1 (did buy)|20|50|10|
|Buy = 0 (did not buy)|300|500|200|

- Marginal probability: each country's column total divided by the grand total
    - p(Country = MX) = 210/(210+550+320) = 0.19
    - p(Country = US) = 550/(210+550+320) = 0.51
    - p(Country = CA) = 320/(210+550+320) = 0.30
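
The marginals above can be reproduced with a few lines of Python. This is only a sketch: the `counts` dictionary and the variable names are illustrative choices, not part of the original notes.

```python
# Purchase counts from the table above: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},     # Buy = 1 (did buy)
    0: {"CA": 300, "US": 500, "MX": 200},  # Buy = 0 (did not buy)
}
countries = ("CA", "US", "MX")

# Grand total over every cell of the table
total = sum(sum(row.values()) for row in counts.values())

# Marginal p(country): add up the column over Buy, then divide by the total
p_country = {c: (counts[0][c] + counts[1][c]) / total for c in countries}

for c in countries:
    print(f"p(country={c}) = {p_country[c]:.2f}")
```

The three marginals sum to 1 because every customer belongs to exactly one country.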

## Joint Probabilities
- Now let's find p(Buy?, Country)
- How many possibilities?
    - Buy? = 2 possibilities
    - Country = 3 possibilities
    - Total possibilities = 2 x 3 = 6
- Think of it as an area or volume
- In general, total = |RV1| x |RV2| x |RV3| x ... x |RVn|
- This grows exponentially as we add more variables
- Curse of dimensionality: this growth is a bad thing, because as the volume grows
    - we need to do more computation
    - we need more samples to get accurate estimates
- Each joint probability is a cell count divided by the grand total, 210 + 550 + 320 = 1080
- p(Buy = 1, CA) = 20/1080 = 0.019
- p(Buy = 0, CA) = 300/1080 = 0.28
- p(Buy = 1, US) = 50/1080 = 0.046
- p(Buy = 0, US) = 500/1080 = 0.46
- p(Buy = 1, MX) = 10/1080 = 0.0093
- p(Buy = 0, MX) = 200/1080 = 0.19

- These seem a lot smaller than the marginals
- The probabilities of all possible outcomes must sum to 1
- If the number of possibilities grows exponentially, the individual probability values shrink exponentially
- This is another consequence of the curse of dimensionality
- Computers have finite precision: a 32-bit float holds only 32 bits of information
    - so it can represent only a finite number of values
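
As a quick check of the joint table, the sketch below counts the cells with `math.prod` and confirms that the six joint probabilities sum to 1. The data layout and names are assumptions made for illustration.

```python
import math

# Purchase counts from the first table: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 300, "US": 500, "MX": 200},
}
buy_values = (1, 0)
countries = ("CA", "US", "MX")

# Number of joint outcomes = product of the sizes of each variable's range
n_cells = math.prod(len(values) for values in (buy_values, countries))

total = sum(sum(row.values()) for row in counts.values())

# Joint p(Buy, Country): one cell count divided by the grand total
p_joint = {(b, c): counts[b][c] / total for b in buy_values for c in countries}

print("number of joint outcomes:", n_cells)
for (b, c), p in p_joint.items():
    print(f"p(Buy={b}, {c}) = {p:.4f}")
print("sum over all cells:", sum(p_joint.values()))
```

Adding another variable with k values would multiply `n_cells` by k, which is the exponential growth described above.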


## Underflow
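- As a probability -> 0, eventually the computer will round it down to 0
- This is called the underflow problem
- It is common when working with probabilities, especially products of many small ones
- A standard remedy is to work with log probabilities: log changes slowly with its argument, so even a tiny value such as 1e-300 maps to a moderate number (about -691), and products turn into sums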
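
A minimal demonstration of underflow, assuming 200 independent events that each have probability 0.01 (both numbers are made up for illustration). Python floats are 64-bit rather than 32-bit, but the same effect appears, just later.

```python
import math

p = 0.01   # hypothetical probability of a single event
n = 200    # hypothetical number of independent events

# Direct product: the true value is 1e-400, far below the smallest
# positive 64-bit float (about 5e-324), so the result rounds down to 0.0
product = 1.0
for _ in range(n):
    product *= p

# Sum of logs: the same quantity in log space stays perfectly representable
log_sum = sum(math.log(p) for _ in range(n))

print("direct product:", product)
print("sum of logs:   ", log_sum)
```

Once the product has collapsed to 0.0 the information is gone, whereas the log-space value can still be compared against other log probabilities.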


## Conditional probabilities

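- Divide each joint probability by the marginal of its country; the factor 1/1080 cancels, leaving cell count / column total
- p(Buy = 1|CA) = 20/320 = 0.06
- p(Buy = 0|CA) = 300/320 = 0.94
- p(Buy = 1|US) = 50/550 = 0.09
- p(Buy = 0|US) = 500/550 = 0.91
- p(Buy = 1|MX) = 10/210 = 0.05
- p(Buy = 0|MX) = 200/210 = 0.95

Note: dividing the rounded joint and marginal values instead (e.g. 0.28/0.30 = 0.93) gives slightly different numbers because of roundoff error.
- These no longer sum to 1; they sum to 3. Why?
    - We are given the country, so the space of random variables is only buy / not buy
    - Country is not random here, so the two values for each country sum to 1, and there are 3 countries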



| |CA|US|MX|
|----|---|---|---|
|Buy = 1 (did buy)|20|50|10|
|Buy = 0 (did not buy)|300|500|200|

- p(Buy = 1 | Country = US) = p(Buy = 1, Country = US) / p(Country = US)
    - = (50/1080) / [(50+500)/1080]
    - = 0.09
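
The worked example above generalizes to every column. The sketch below computes p(Buy | Country) as joint divided by marginal and checks that each country's pair of values sums to 1. Names and data layout are illustrative assumptions.

```python
# Purchase counts from the table above: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 300, "US": 500, "MX": 200},
}
countries = ("CA", "US", "MX")
total = sum(sum(row.values()) for row in counts.values())

p_buy_given_country = {}
for c in countries:
    p_c = (counts[0][c] + counts[1][c]) / total           # marginal p(Country = c)
    p_buy_given_country[c] = {
        b: (counts[b][c] / total) / p_c                   # joint / marginal
        for b in (1, 0)
    }

for c in countries:
    cond = p_buy_given_country[c]
    print(f"{c}: p(Buy=1|{c}) = {cond[1]:.3f}, p(Buy=0|{c}) = {cond[0]:.3f}, "
          f"sum = {cond[0] + cond[1]:.1f}")
```

Summing over all six conditional values gives 3, one unit per country, which matches the observation in the notes.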

## Similar but different problem

| |CA|US|MX|
|----|---|---|---|
|Buy = 1 (did buy)|20|50|10|
|Buy = 0 (did not buy)|200|500|100|

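- p(Buy=1|Country=CA) = 20/220 = 0.09
- p(Buy=1|Country=US) = 50/550 = 0.09
- p(Buy=1|Country=MX) = 10/110 = 0.09
- The probability of buying is the same in every country, so knowing the country tells us nothing about Buy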




## Independence
- When two variables are independent, the joint becomes the product of the marginals, e.g. if A and B are independent:
    - p(A,B) = p(A)p(B)
- So, if Buy and Country are independent:
    - p(Buy|Country) = p(Buy,Country)/p(Country) = p(Buy)p(Country)/p(Country) = p(Buy)
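
One way to test independence numerically is to compare every joint cell with the product of its two marginals. The helper `is_independent` below is a hypothetical name introduced for this sketch; the two tables are the ones shown earlier in these notes.

```python
import math

def is_independent(counts):
    """Return True if p(buy, country) == p(buy) * p(country) for every cell."""
    total = sum(sum(row.values()) for row in counts.values())
    countries = list(next(iter(counts.values())))
    p_buy = {b: sum(row.values()) / total for b, row in counts.items()}
    p_country = {c: sum(counts[b][c] for b in counts) / total for c in countries}
    return all(
        math.isclose(counts[b][c] / total, p_buy[b] * p_country[c], rel_tol=1e-9)
        for b in counts
        for c in countries
    )

# First table: purchase rate differs by country
table_1 = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 300, "US": 500, "MX": 200},
}
# "Similar but different" table: purchase rate is the same in every country
table_2 = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 200, "US": 500, "MX": 100},
}

print("table 1 independent?", is_independent(table_1))
print("table 2 independent?", is_independent(table_2))
```

With real data the equality would rarely hold exactly, so a statistical test would be more appropriate than an exact comparison; the exact check works here only because the second table was constructed to be perfectly proportional.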

## Manipulating Bayes Rule
- Let's rewrite the rule in the form we will actually use, starting from the definition:
$$
p(A|B) = \frac{p(A,B)}{p(B)}
$$

- The same definition holds with A and B swapped:
$$
p(B|A) = \frac{p(B,A)}{p(A)}
$$
- Since p(A,B) = p(B,A) = p(B|A)p(A), we can substitute into the first equation:
$$
p(A|B) = \frac{p(B|A)p(A)}{p(B)}
$$

- The marginal in the denominator can be expanded by summing the joint over A:
$$
p(B) = \sum_{A} p(A,B) = \sum_{A} p(B|A)\,p(A)
$$

- If we are working with continuous distributions, the sum turns into an integral

$$
p(A|B) = \frac{p(B|A)p(A)}{\int{p(B|A)p(A)dA}}
$$

- We can also think of the denominator as a normalization constant that makes the distribution sum (or integrate) to 1; it does not depend on A


$$
p(A|B) \propto p(B|A)p(A)
$$

- Many times, we just want the $\underset{A}{\operatorname{argmax}}\,p(A|B)$; because the denominator does not depend on A, it can be dropped
$$
\underset{A}{\operatorname{argmax}}p(A|B) = \underset{A}{\operatorname{argmax}}p(B|A)p(A)
$$
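
To make the manipulation concrete, the sketch below applies it to the first purchase table with A = Country and B = (Buy = 1), i.e. it infers the country of a customer who bought. The choice of this example and all variable names are assumptions for illustration.

```python
# Purchase counts from the first table: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 300, "US": 500, "MX": 200},
}
countries = ("CA", "US", "MX")
total = sum(sum(row.values()) for row in counts.values())

# p(A): prior over countries
prior = {c: (counts[0][c] + counts[1][c]) / total for c in countries}
# p(B|A): probability of buying given the country
likelihood = {c: counts[1][c] / (counts[0][c] + counts[1][c]) for c in countries}

# Numerator of Bayes rule, p(B|A) p(A), for each value of A
unnormalized = {c: likelihood[c] * prior[c] for c in countries}
# Denominator p(B) = sum over A of p(B|A) p(A)
evidence = sum(unnormalized.values())
# Posterior p(A|B)
posterior = {c: unnormalized[c] / evidence for c in countries}

# The argmax is the same with or without the normalization constant
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)

for c in countries:
    print(f"p({c}|Buy=1) = {posterior[c]:.3f}")
print("argmax (unnormalized):", best_unnormalized)
print("argmax (posterior):   ", best_posterior)
```

Dividing by `evidence` rescales every entry by the same positive number, which is why the winner does not change.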

## Bayes for classification
$$
p(y|x) = \frac{p(x|y)p(y)}{p(x)}
$$

- p(x|y) is a generative distribution: it tells us what x looks like, given that the class is y


# 2) Probability Exercise

- We have a fair coin: p(H) = p(T) = 0.5, where H = heads and T = tails
- We plan to toss the coin 200 times in total (N = 200)
- After 20 tosses, we have 15 H and 5 T
- What is the total number of heads we expect to get by the end of the experiment?

## Gambler's Fallacy
- You just lost 100 times, so you must have a better chance of winning next time?
- Incorrect
- It doesn't matter how many times you have already lost; your chances of losing the next round are the same as they have always been
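
The fallacy ties directly back to the coin exercise: the 15 heads already observed are fixed, and the remaining 180 tosses are still fair, so the coin does not "compensate" for the early streak. A short sketch, with a seeded Monte Carlo run as a sanity check (the trial count and seed are arbitrary choices):

```python
import random

N_TOTAL = 200       # planned number of tosses
tossed = 20         # tosses already made
heads_so_far = 15   # heads observed in those tosses
p_heads = 0.5       # fair coin

remaining = N_TOTAL - tossed

# Exact expectation: observed heads plus expected heads in the remaining tosses
expected_total = heads_so_far + p_heads * remaining

# Monte Carlo check: simulate only the remaining tosses many times
rng = random.Random(0)
trials = 5000
simulated_total = sum(
    heads_so_far + sum(rng.random() < p_heads for _ in range(remaining))
    for _ in range(trials)
) / trials

print("expected total heads: ", expected_total)
print("simulated total heads:", round(simulated_total, 2))
```

The intuitive but wrong answer of 100 assumes the coin will produce extra tails to balance the early heads, which is exactly the gambler's fallacy.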


# 3) Monty Hall Problem

- Famous problem in probability, inspired by a TV game show
- The show was Let's Make a Deal and the host was Monty Hall, hence the Monty Hall problem

## How the game works
- There are 3 doors: one hides a car, the other two hide goats
- You pick a door (door 1)
- Monty Hall opens a door you didn't pick and reveals a goat (door 2)
- You are given a choice: stay with door 1 or switch to door 3
- This question might seem silly: it doesn't matter which door you choose, since the probability of each is 1/3, right?

[answer](https://cba.snu.ac.kr/ko/sblcolumn?mode=view&bbsidx=77838)
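
A simulation is a convenient way to test that intuition. The sketch below assumes the standard rules: the host knows where the car is and always opens a goat door that you did not pick. The function name `play`, the seed, and the trial count are illustrative choices.

```python
import random

def play(switch, rng):
    """Play one round; return True if the final pick hides the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the original pick nor the opened one
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
trials = 100_000
stay_rate = sum(play(False, rng) for _ in range(trials)) / trials
switch_rate = sum(play(True, rng) for _ in range(trials)) / trials

print(f"win rate when staying:   {stay_rate:.3f}")
print(f"win rate when switching: {switch_rate:.3f}")
```

Staying wins about 1/3 of the time and switching about 2/3: the host's choice is constrained by where the car is, so opening a door carries information about the remaining door.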

# 4) Maximum likelihood - Mean of Gaussian

- Maximum likelihood - what does that mean?


## A Gaussian sample
- Suppose we have collected one data point from a source of Gaussian-distributed data; call it x
- What is the probability density of that one data point?

$$
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right)
$$

## Multiple samples
- In a real experiment, we will collect multiple samples
- Typically these samples are IID: independent and identically distributed
- Identically distributed: they are all Gaussian with the same mean and variance
- Independent: the outcome of one sample tells us nothing about the others
    - If I flip a coin and I've gotten 10 heads so far, what's the next toss likely to be?
    - p(T|HHHHHHHHHH) = p(T) = 0.5




## Joint probability density
- Independence means we can multiply the probability densities of the individual samples to get the joint probability density of all the samples

$$
p(x_{1},x_{2},...,x_{N}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2}\frac{(x_{i}-\mu)^2}{\sigma^2}\right)
$$

## Data likelihood

- We want to phrase it as the probability of the data given the parameter
    - p(data|parameter)
- The parameters depend on what the model is, e.g. Gaussian, Beta, Gamma, etc.
- A Gaussian has a mean and a variance; we will focus on the mean

$$
p(x_{1},x_{2},...,x_{N}|\mu) =\prod_{i=1}^{N} p(x_{i}|\mu)
$$
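
As a rough illustration, the likelihood can be evaluated for a few candidate values of $\mu$. The data points in `xs`, the fixed variance of 1, and the function names are all made up for this sketch.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of a Gaussian with mean mu and variance sigma2 at the point x."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def likelihood(xs, mu, sigma2=1.0):
    """p(x_1, ..., x_N | mu) for IID samples: the product of individual densities."""
    return math.prod(gaussian_pdf(x, mu, sigma2) for x in xs)

# Hypothetical samples, for illustration only
xs = [4.8, 5.1, 5.3, 4.9, 5.4]

for mu in (4.0, 5.0, 5.1, 6.0):
    print(f"mu = {mu}: likelihood = {likelihood(xs, mu):.6g}")
```

Candidate means close to the data give a much larger likelihood than candidates far away, which is the quantity maximum likelihood tries to maximize. With many samples this product would underflow, which is one more reason to work with logs.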

## Maximum likelihood 
- Now that we have the likelihood, how do we maximize it?
- What is the best setting of $\mu$, such that the likelihood is maximized?
- When we want to maximize a function with respect to a variable, calculus provides the tools we need: take the derivative and set it to zero
- Taking the log first is useful (this gives the log-likelihood)
- Why? The Gaussian contains an exponential, and the derivative of an exponential is still an exponential, so the derivative of the raw likelihood is not easy to solve


## Getting rid of the exponential
- log(exp(A)) = A
- Taking the log is OK because log() is a monotonically increasing function
    - If $\mu^{*}$ yields the maximum P, then it also yields the maximum log(P)



## Log-likelihood

$$
\begin{aligned}
L = \log p(X|\mu) &= \log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2}\frac{(x_{i}-\mu)^2}{\sigma^2}\right)\\
&= \sum_{i=1}^{N} \log\left[\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2}\frac{(x_{i}-\mu)^2}{\sigma^2}\right)\right] \\
&= \sum_{i=1}^{N} -\frac{(x_{i}-\mu)^2}{2\sigma^2} + \text{const wrt } \mu
\end{aligned}
$$

## Derivative of log-likelihood
$$
\begin{aligned}
\frac{\partial L}{\partial \mu} &= \sum_{i=1}^{N}\frac{x_{i}-\mu}{\sigma^2} = 0 \\
\sum_{i=1}^{N}x_{i} - N\mu &= 0 \\
\mu &= \frac{1}{N}\sum_{i=1}^{N}x_{i}
\end{aligned}
$$
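
The closed-form result can be checked numerically: at the sample mean the derivative of the log-likelihood should vanish, and a brute-force grid search should land on the same value. The data, the fixed variance of 1, the grid range, and the function names are assumptions made for this sketch.

```python
import math

def log_likelihood(xs, mu, sigma2=1.0):
    """Gaussian log-likelihood of IID samples xs for a given mean mu."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def dL_dmu(xs, mu, sigma2=1.0):
    """Derivative of the log-likelihood with respect to mu."""
    return sum(x - mu for x in xs) / sigma2

# Hypothetical samples, for illustration only
xs = [4.8, 5.1, 5.3, 4.9, 5.4]

# Closed-form maximum likelihood estimate: the sample mean
mu_hat = sum(xs) / len(xs)

# Brute-force check over a grid of candidate means from 4.00 to 6.00
grid = [4.0 + 0.01 * k for k in range(201)]
mu_grid = max(grid, key=lambda m: log_likelihood(xs, m))

print("sample mean (closed form):", mu_hat)
print("derivative at sample mean:", dL_dmu(xs, mu_hat))
print("best mu on the grid:      ", round(mu_grid, 2))
```

Note that the variance drops out of the estimate: changing `sigma2` rescales the log-likelihood but does not move the location of its maximum.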