# 1) Bayes Rule Review

## Bayes Rule & Probability Review
- You've probably already seen the formula below

$$p(A|B) = \frac{p(A,B)}{p(B)}$$

- $p(A|B)$ - conditional
- $p(A,B)$ - joint
- $p(B)$ - marginal 
- This is the most basic form, but too abstract


## Bayes Example

- A = {Buy, does not buy}, B = {USA, Canada, Mexico}
- Suppose we want to find p(Buy?|Country)

[]()|CA|US|MX|
----|---|---|---|
Buy = 1(did buy)|20|50|10|
Buy = 0(did not buy)|300|500|200|

- Marginal probability
- p(country=Mexico) = 210/(210+550+320) = 0.19
- p(country=US) = 550/(210+550+320) = 0.51
- p(country=CA) = 320/(210+550+320) = 0.30
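
As a quick check, the marginals above can be recomputed from the count table. This is a minimal sketch in plain Python; the `counts` dictionary layout and the variable names are illustrative choices, not part of the original notes.

```python
# Purchase counts from the table above: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},      # Buy = 1 (did buy)
    0: {"CA": 300, "US": 500, "MX": 200},   # Buy = 0 (did not buy)
}

# Grand total of all observations
total = sum(sum(row.values()) for row in counts.values())

# Marginal over country: add up the column, divide by the grand total
p_country = {
    c: (counts[1][c] + counts[0][c]) / total for c in ("CA", "US", "MX")
}

for country, p in p_country.items():
    print(f"p(country={country}) = {p:.2f}")
```

The three marginals sum to 1 because every observation belongs to exactly one country.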

## Joint Probabilities
- Now let's find p(Buy?,Country)
- How many possibilities?
- Buy?= 2 possibilities
- Country = 3 possibilities
- Total possibilities = 2 x 3 = 6
- Think of it as an area or volume
- In general, total = |RV1| x|RV2|x |RV3| x ... x|RVn|
- Grows exponentially as we add more variables
- Curse of dimensionality
- is a bad thing, because as the volume grows
- we need to do more computation
- need more samples to get accurate estimates
- p(Buy = 1,CA) = 20/(210+550+320) = 20/1080 = 0.019
- p(Buy = 0,CA) = 300/(210+550+320) = 300/1080 = 0.28
- p(Buy = 1,US) = 50/(210+550+320) = 50/1080 = 0.046
- p(Buy = 0,US) = 500/(210+550+320) = 500/1080 = 0.46
- p(Buy = 1,MX) = 10/(210+550+320) = 10/1080 = 0.0093
- p(Buy = 0,MX) = 200/(210+550+320) = 200/1080 = 0.19

- These seem a lot smaller than the marginals
- Sum of all possible outcomes must = 1
- If the number of total possibilities grows exponentially, the actual probability values will shrink exponentially
- Another consequence of the curse of dimensionality
- Computers have finite precision: a 32-bit float holds only 32 bits of information
- It can't represent infinitely many distinct values
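
The six joint probabilities can be computed in one pass over the count table. A small sketch, assuming the same illustrative `counts` layout (not from the original notes):

```python
# Purchase counts from the table: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 300, "US": 500, "MX": 200},
}
total = sum(sum(row.values()) for row in counts.values())

# Joint probability of each (buy, country) cell: cell count / grand total
p_joint = {
    (buy, country): n / total
    for buy, row in counts.items()
    for country, n in row.items()
}

for (buy, country), p in p_joint.items():
    print(f"p(Buy = {buy},{country}) = {p:.4f}")

# 2 x 3 = 6 cells, and together they must sum to 1
print(len(p_joint), sum(p_joint.values()))
```

With more variables the dictionary would need one entry per combination of values, which is where the exponential growth shows up in practice.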


## Underflow
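- As a probability -> 0, eventually the computer will round it down to 0
- This is called the underflow problem
- It is common in probability, where many small numbers get multiplied together
- Working with log probabilities avoids it: log grows slowly as its argument increases, so even a tiny probability maps to a moderately sized negative number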
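
To make the underflow problem concrete, the sketch below multiplies many small probabilities directly and then does the same computation in log space. The values `p = 0.01` and `n = 200` are hypothetical, chosen only so that the product falls below what a 64-bit float can represent.

```python
import math

p = 0.01   # probability of a single event (hypothetical value)
n = 200    # number of independent events (hypothetical value)

# Direct product: 0.01**200 = 1e-400, too small for a 64-bit float
product = 1.0
for _ in range(n):
    product *= p
print(product)  # 0.0

# Log space: the product becomes a sum, which stays well within range
log_product = sum(math.log(p) for _ in range(n))
print(log_product)  # roughly -921
```

The log of the product is still available even though the product itself has been rounded to zero.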


## Conditional probabilities

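- p(Buy = 1|CA) = (20/1080)/(320/1080) = 0.06
- p(Buy = 0|CA) = (300/1080)/(320/1080) = 0.94
- p(Buy = 1|US) = (50/1080)/(550/1080) = 0.09
- p(Buy = 0|US) = (500/1080)/(550/1080) = 0.91
- p(Buy = 1|MX) = (10/1080)/(210/1080) = 0.05
- p(Buy = 0|MX) = (200/1080)/(210/1080) = 0.95

note: dividing the already rounded joints by the rounded marginals (e.g. 0.185/0.19 = 0.97) introduces some roundoff error, so the values above use the raw counts
- The six values no longer sum to 1; they sum to 3. Why?
- We are given the country, so the space of random variables is only buy/not buy
- Country is not random here, so each country's pair of probabilities sums to 1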



[]()|CA|US|MX|
----|---|---|---|
Buy = 1(did buy)|20|50|10|
Buy = 0(did not buy)|300|500|200|

- p(Buy = 1| Country = US) = p(Buy = 1,Country = US)/p(country = US)
- = (50/1080)/[(50+500)/1080]
- = 0.09
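
The same calculation can be written as a small function. This is a sketch under the assumption that the counts are stored as below; `p_buy_given_country` is a hypothetical helper name introduced for illustration.

```python
# Purchase counts from the table: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 300, "US": 500, "MX": 200},
}


def p_buy_given_country(buy, country):
    """Conditional probability p(Buy = buy | Country = country).

    The grand total cancels in joint / marginal, so only the
    column for the given country is needed.
    """
    column_total = counts[0][country] + counts[1][country]
    return counts[buy][country] / column_total


for country in ("CA", "US", "MX"):
    p1 = p_buy_given_country(1, country)
    p0 = p_buy_given_country(0, country)
    print(f"{country}: p(Buy=1|.) = {p1:.2f}, p(Buy=0|.) = {p0:.2f}")
```

Within each country the two conditionals sum to 1, which is why the six values together sum to 3.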

## Similar but different problem

[]()|CA|US|MX|
----|---|---|---|
Buy = 1(did buy)|20|50|10|
Buy = 0(did not buy)|200|500|100|

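- p(Buy=1|Country=CA) = 20/220 = 0.09
- p(Buy=1|Country=US) = 50/550 = 0.09
- p(Buy=1|Country=MX) = 10/110 = 0.09
- The probability of buying is the same in every country, so knowing the country tells us nothing about Buy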




## Independence
- When 2 variables are independent, the joint becomes the product of the marginals, e.g. if A & B are independent:
- p(A,B) = p(A)p(B)
- So, if Buy & Country are independent:
- p(Buy|Country) = p(Buy,Country)/p(Country) = p(Buy)p(Country)/p(Country) = p(Buy)
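
Independence can be checked numerically on the second table (20/50/10 buyers, 200/500/100 non-buyers) by comparing each joint probability with the product of its marginals. A minimal sketch; the data layout and names are illustrative assumptions.

```python
import math

# Counts from the "similar but different" table: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 200, "US": 500, "MX": 100},
}
countries = ("CA", "US", "MX")
total = sum(sum(row.values()) for row in counts.values())

# Marginals of each variable
p_buy = {b: sum(counts[b].values()) / total for b in (0, 1)}
p_country = {c: (counts[0][c] + counts[1][c]) / total for c in countries}

# Independent if every joint cell equals the product of its marginals
independent = all(
    math.isclose(counts[b][c] / total, p_buy[b] * p_country[c])
    for b in (0, 1)
    for c in countries
)
print(independent)  # True
```

Running the same check on the first table (300/500/200 non-buyers) would print `False`, since the buy rate there differs by country.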

## Manipulating Bayes Rule
- Let's manipulate it into the form we'll actually use
$$
p(A|B) = \frac{p(A,B)}{p(B)}
$$

- The opposite is also true:
$$
p(B|A) = \frac{p(B,A)}{p(A)}
$$
- Since p(A,B) = p(B,A)
$$
p(A|B) = \frac{p(B|A)p(A)}{p(B)}
$$

- The denominator can be expanded by marginalizing over A:
$$
p(B) = \sum_{A}^{}{p(A,B)} = \sum_{A}^{}p(B|A)p(A)
$$

- If working with continuous distributions, sum turns into integral

$$
p(A|B) = \frac{p(B|A)p(A)}{\int{p(B|A)p(A)dA}}
$$

- Can also think of the bottom term as a normalization constant so that the distribution sums to 1


$$
p(A|B) \propto p(B|A)p(A)
$$

- Many times, we just want the $\underset{A}{\operatorname{argmax}}p(A|B)$
$$
\underset{A}{\operatorname{argmax}}p(A|B) = \underset{A}{\operatorname{argmax}}p(B|A)p(A)
$$
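
As an illustration of these formulas, the sketch below applies Bayes rule to the first purchase table with A = Country and B = (Buy = 1), so it computes p(Country | Buy = 1). The choice of example and all variable names are assumptions made for illustration.

```python
# Purchase counts from the first table: counts[buy][country]
counts = {
    1: {"CA": 20, "US": 50, "MX": 10},
    0: {"CA": 300, "US": 500, "MX": 200},
}
countries = ("CA", "US", "MX")
total = sum(sum(row.values()) for row in counts.values())

# Prior p(A): marginal probability of each country
prior = {c: (counts[0][c] + counts[1][c]) / total for c in countries}

# Likelihood p(B|A): probability of buying given the country
likelihood = {
    c: counts[1][c] / (counts[0][c] + counts[1][c]) for c in countries
}

# Numerator p(B|A)p(A), and the normalizer p(B) as a sum over A
unnormalized = {c: likelihood[c] * prior[c] for c in countries}
evidence = sum(unnormalized.values())
posterior = {c: unnormalized[c] / evidence for c in countries}

# The argmax does not change when we skip the normalization
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)
print(posterior, best_unnormalized, best_posterior)
```

Because dividing by p(B) rescales every candidate by the same constant, both argmax calls pick the same country.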

## Bayes for classification
$$
p(y|x) = \frac{p(x|y)p(y)}{p(x)}
$$

- p(x|y) is a generative distribution: it tells us what x looks like, given that the class is y


# 2) Probability Exercise

- We have a fair coin: p(H) = p(T) = 0.5, H = heads, T = tails
- We plan to toss the coin 200 times in total (N = 200)
- After 20 tosses, we have 15H, 5T
- What is the total # of heads we expect to get by the end of the experiment?

## Gambler's Fallacy
- You just lost 100 times, so you must have a better chance of winning next?
- Incorrect
- It doesn't matter how many times you have lost already; your chance of losing next time is the same as it has always been
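
Applying this to the coin exercise: the 15 heads already observed are fixed, and the remaining 180 tosses are still fair, so the expected total is 15 + 180 x 0.5. The sketch below computes that and compares it with a simulation; the seed and number of trials are arbitrary choices.

```python
import random

p_heads = 0.5
total_tosses = 200
heads_so_far, tosses_so_far = 15, 20

# Past tosses do not change the coin: the remaining tosses are still fair
remaining = total_tosses - tosses_so_far
expected_total_heads = heads_so_far + remaining * p_heads
print(expected_total_heads)  # 105.0

# Monte Carlo check: simulate only the remaining tosses many times
rng = random.Random(0)
n_trials = 5000
simulated = [
    heads_so_far + sum(rng.random() < p_heads for _ in range(remaining))
    for _ in range(n_trials)
]
average_total_heads = sum(simulated) / n_trials
print(average_total_heads)
```

The coin does not "make up" for the early surplus of heads, so the answer is 105 rather than 100.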


# 3) Monty Hall Problem

- Famous problem in probability, inspired by a TV game show
- The TV show was Let's Make a Deal and the host was Monty Hall, hence the Monty Hall problem

## How the game works
- There are three doors: one hides a car, the other two hide goats
- You pick a door (say door 1)
- Monty Hall opens a door you didn't pick and reveals a goat (say door 2)
- You are given a choice: stay with door 1 or switch to door 3
- This question might seem silly: it doesn't matter which door you choose, since the probability of each is 1/3, right?

[answer](https://cba.snu.ac.kr/ko/sblcolumn?mode=view&bbsidx=77838)
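
One way to test the intuition is to simulate the game. The sketch below assumes the standard rules (the host knows where the car is and always opens a goat door that the player did not pick); the function name `play` and the seed are illustrative.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Host opens a door that is neither the player's pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


rng = random.Random(0)
n_games = 100_000
stay_wins = sum(play(False, rng) for _ in range(n_games)) / n_games
switch_wins = sum(play(True, rng) for _ in range(n_games)) / n_games
print(f"stay: {stay_wins:.3f}, switch: {switch_wins:.3f}")
```

Staying wins about 1/3 of the time and switching about 2/3, so the two remaining doors are not equally likely.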

# 4) Maximum likelihood - Mean of Gaussian

- Maximum likelihood - what does that mean?


## A Gaussian sample
- Suppose we have collected one data point from a source of Gaussian distributed data; call it x
- What is the probability density of that one data point?

$$
p(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}exp(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2})
$$

## Multiple sample
- In a real experiment, we will collect multiple samples
- Typically these samples are IID: independent and identically distributed
- Identically distributed: they are all Gaussian with the same mean/variance
- Independent
    - If I flip a coin and I've gotten 10 heads so far, what's the next toss likely to be?
    - p(T|HHHHHHHHHH) = p(T) = 0.5




## Joint probability density
- This independence means I can multiply the probability of each individual sample to get the joint probability of all the samples

$$
p(x_{1},x_{2},...,x_{N}) =\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}}exp(-\frac{1}{2}\frac{(x_{i}-\mu)^2}{\sigma^2})
$$

## Data likelihood

- we want to phrase it as probability of data given the parameter
- p(data|parameter)
- Parameters depend on what the model is, e.g. Gaussian, Beta, Gamma, etc.
- Gaussian has mean/variance; we will focus on the mean

$$
p(x_{1},x_{2},...,x_{N}|\mu) =\prod_{i=1}^{N} p(x_{i}|\mu)
$$

## Maximum likelihood 
- Now that we have the likelihood, how do we maximize it?
- What is the best setting of $\mu$, such that the likelihood is maximized?
- When we want to maximize a function wrt a variable, calculus provides the tools we need: set the derivative to zero
- Taking the log first is useful (log-likelihood)
- Why? The Gaussian contains an exponential, and the derivative of an exponential is still an exponential, so the product would not be easy to differentiate and solve directly


## Getting rid of the exponential
- log(exp(A)) = A
- This is ok because log() is a monotonically increasing function
- If $\mu^{*}$ yields the maximum P, then it also yields the maximum log(P)



## Log-likelihood

$$
L = log\,p(X|\mu) = log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^{2}}}exp(-\frac{1}{2}\frac{(x_{i}-\mu)^2}{\sigma^2})\\
= \sum_{i=1}^{N} log \left[\frac{1}{\sqrt{2\pi\sigma^{2}}}exp(-\frac{1}{2}\frac{(x_{i}-\mu)^2}{\sigma^2})\right] \\
= \sum_{i=1}^{N} -\frac{(x_{i}-\mu)^2}{2\sigma^2} + const\;wrt\; \mu
$$

## Derivative of log-likelihood
$$
\frac{\partial L}{\partial \mu} = \sum_{i=1}^{N}\frac{x_{i}-\mu}{\sigma^2} = 0 \\
\mu = \frac{1}{N}\sum_{i=1}^{N}x_{i}
$$
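
The closed-form result can be sanity-checked numerically: the sample mean should give a higher log-likelihood than any nearby value of $\mu$. A minimal sketch, assuming a known $\sigma = 1$ and synthetic data drawn with a hypothetical true mean of 5.

```python
import math
import random


def log_likelihood(mu, xs, sigma=1.0):
    """Gaussian log-likelihood of the data xs for a given mean mu."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2)
        - (x - mu) ** 2 / (2 * sigma**2)
        for x in xs
    )


rng = random.Random(0)
xs = [rng.gauss(5.0, 1.0) for _ in range(1000)]  # synthetic sample

# Maximum likelihood estimate of the mean: the sample average
mu_hat = sum(xs) / len(xs)

# Moving away from mu_hat in either direction lowers the log-likelihood
for delta in (-0.1, 0.0, 0.1):
    print(f"mu = mu_hat {delta:+.1f}: L = {log_likelihood(mu_hat + delta, xs):.2f}")
```

The middle line (delta = 0) has the largest value, in agreement with the derivative argument above.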

# 5) Maximum likelihood - Click-Through Rate

- CTR = click-through rate; closely related to conversion rate
- Common measure in e-commerce, online advertising, clickbait news sites
- Not a Gaussian distribution, more like a coin toss (2 possible outcomes)
- Buy/don't buy, click/don't click
- We will use CTR even though all the methods could be applied to conversion rate
- They are all just the Bernoulli distribution (first seen in my logistic regression class: the cross-entropy cost function is the negative log of the Bernoulli likelihood)

## Problem setup
- H = click, T = no click, H+T = total number of impressions
- The samples are also IID
- Let's call p(H) = p, so p(T) = 1-p
- Bernoulli only has 1 parameter (Gaussian has 2)
- Suppose we observe 2 H, 3 T. What is the total likelihood?


## Likelihood
- L(2H,3T) = $p^2(1-p)^3$= p(H)p(H)p(T)p(T)p(T)
- $L(N_{H},N_{T}) = p^{N_{H}}(1-p)^{N_{T}}$
- what's the maximum likelihood estimate of p?


## Log-likelihood
$$
L = log(p^{N_{H}}(1-p)^{N_{T}}) = N_{H}logp + N_{T}log(1-p)\\
\frac{\partial L}{\partial p} = \frac{N_{H}}{p} - \frac{N_{T}}{1-p} = 0\\
p = \frac{N_{H}}{N_{H}+N_{T}}
$$
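
For the 2 H, 3 T example, the formula gives p = 2/5 = 0.4. The sketch below compares that with a brute-force search over a grid of candidate values; the grid resolution is an arbitrary choice.

```python
import math


def log_likelihood(p, n_heads, n_tails):
    """Bernoulli log-likelihood for n_heads successes and n_tails failures."""
    return n_heads * math.log(p) + n_tails * math.log(1 - p)


n_heads, n_tails = 2, 3

# Closed-form maximum likelihood estimate
p_hat = n_heads / (n_heads + n_tails)

# Brute-force check over p = 0.001, 0.002, ..., 0.999
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, n_heads, n_tails))

print(p_hat, p_grid)
```

Both approaches land on the same value, the observed fraction of heads.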

## Interesting Result
- Suppose we let $x_{i} = 1$ be heads and $x_{i} = 0$ be tails
$$
p=\frac{1}{N}\sum_{i=1}^{N}x_{i}
$$
- Exact same result as the Gaussian (p is the mean of the Bernoulli distribution)

- Gaussian mean maximum likelihood estimate:
$$
\mu = \frac{1}{N}\sum_{i=1}^{N}x_{i}
$$

## Problem
- Same problem we had before- we don't know how precise this measurement is
- Intuitively, we know 1/10 is not as accurate as 10/100, which is not as accurate as 100/1000
- They all give the same p
- Next, we'll look at the frequentist way of dealing with this, and later the Bayesian way

# 6) Confidence Intervals

- Non-Bayesian/ Frequentist methods of dealing with uncertainty of measurement of parameters

## Sum of Random Variables
- We will use a hat to denote an estimated parameter, and no hat to denote the true parameter
- The estimate is a sum of random variables

$$
\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_{i}
$$

- X is random
- Y is random
- Z = X + Y
- Is Z also random? Yes
- It has a probability distribution, a mean, a variance

- The trick here is that our estimate of the mean is also a random variable
- The central limit theorem (CLT) states that a sum of IID random variables tends to a Gaussian distribution

$$
\hat{\mu} \sim N(\mu,\frac{\sigma^2}{N})
$$

## Distribution of estimate
- As we collect more samples(N), variance decreases
- $\mu$ and $\sigma$ refer to mean/std dev of X
- The estimate $\hat{\mu}$ should have the same mean as X
- More variance in X should lead to more variance in $\hat{\mu}$

$$
\hat{\mu} \sim N(\mu,\frac{\sigma^2}{N})\\
E(X) = \mu, var(X) = \sigma^2
$$

## Scale difference
- The variance of the estimate grows proportionally to the variance of X
- The standard deviation of the estimate only decreases by the square root of N
- Therefore, we need to collect many more samples to compensate for a larger variance

$$
\hat{\mu} \sim N(\mu,\frac{\sigma^2}{N})\\
var(\hat{\mu}) \propto \sigma^2, \quad var(\hat{\mu}) \propto 1/N
$$
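
The scaling $var(\hat{\mu}) = \sigma^2/N$ can be checked by repeating an experiment many times and measuring how much the sample mean varies. This is a rough simulation sketch; `variance_of_sample_mean`, the seed, and the specific values of $\sigma$ and N are all illustrative assumptions.

```python
import random


def variance_of_sample_mean(sigma, n_samples, n_experiments=4000, seed=0):
    """Empirical variance of the sample mean over many repeated experiments."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_experiments):
        xs = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
        means.append(sum(xs) / n_samples)
    grand_mean = sum(means) / n_experiments
    return sum((m - grand_mean) ** 2 for m in means) / n_experiments


for sigma, n in [(2.0, 25), (2.0, 100), (4.0, 25)]:
    empirical = variance_of_sample_mean(sigma, n)
    theoretical = sigma**2 / n
    print(f"sigma={sigma}, N={n}: empirical={empirical:.4f}, theory={theoretical:.4f}")
```

Quadrupling N cuts the variance by a factor of 4 (and the standard deviation by only 2), while doubling $\sigma$ multiplies the variance by 4.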

## Equivalent formulation

$$
\hat{\mu} - \mu \sim N(0,\frac{\sigma^2}{N})
$$

## Confidence Intervals
- we want to know the range of values that are likely to contain true $\mu$
- shade in 95% of middle of Gaussian, we can say $\mu$ is probably here
- Note: 95% CI(confidence Interval) doesn't tell us $\mu$ is in this interval with probability 95%
- In reality, all we can say is if we did many experiments to calculate the sample mean, 95% of the time, those confidence intervals would contain the true $\mu$

![](https://cn.bing.com/th?id=OIP.Mmp-41G15YALhVs1_OGisQHaFK&pid=Api&rs=1)

## Confidence level / Significance Level
- We call the confidence level 1 - $\alpha$
- We call the significance level $\alpha$
- we'll see significance level again later with statistical testing

## Confidence interval limits
- we want the min/max value for the range where $\mu$ should lie
- let's call them $x_{left},x_{right}$
- we want to find the limits such that the area under the Gaussian is 0.95
- calculus provides the tools - integral

$$
0.95 = 1- \alpha = \int_{x_{left}}^{x_{right}}N(x;0,\frac{\sigma^2}{N})dx
$$

- Standardize the normal and rescale
- New limits $z_{left}$ and $z_{right}$

$$
0.95 = 1- \alpha = \int_{z_{left}}^{z_{right}}N(z;0,1)dz\\
z = \frac{(x - \mu)}{(\sigma/\sqrt{N})}
$$

- we're pretending $\mu$  = 0 and we'll shift it back later

## Cumulative Distribution Function(CDF)
- can we make use of this?

$$
P(X \le x) = \int_{-\infty}^{x}p(t)dt
$$

- The Gaussian is symmetric
- So if we want 5% on the tail ends, then we want each tail to be 2.5%

- In other words, $z_{right}$ should give us an area of 1 - 0.05/2 = 0.975

$$
0.975 = \int_{-\infty}^{z_{right}}N(z;0,1)dz = \Phi(z_{right})\\
z_{right} = \Phi^{-1}(0.975)
$$

## Inverse CDF
- SciPy has a function to do this
- scipy.stats.norm.ppf
- ppf = percent point function, because statisticians like crazy names
- Since the Gaussian is symmetric:

$$
z_{left} = -z_{right}\\
z_{left} = \Phi^{-1}(0.025)
$$

$$
z_{left} = z_{\alpha/2}\\
z_{right} = z_{1-\alpha/2}
$$
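
The limits can be computed with an inverse CDF. The notes use `scipy.stats.norm.ppf`; the sketch below uses `statistics.NormalDist.inv_cdf` from the Python standard library instead, which computes the same quantity for the standard normal, so that it runs without extra dependencies.

```python
from statistics import NormalDist

alpha = 0.05  # significance level, so the confidence level is 0.95

# Standard normal N(0, 1)
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Inverse CDF at alpha/2 and 1 - alpha/2
z_left = standard_normal.inv_cdf(alpha / 2)
z_right = standard_normal.inv_cdf(1 - alpha / 2)

print(f"z_left = {z_left:.4f}, z_right = {z_right:.4f}")
```

This is where the familiar value of about 1.96 for a 95% interval comes from, and `z_left` is its mirror image.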

## Confidence Interval

$$
[\hat{\mu}+z_{left}\frac{\sigma}{\sqrt{N}},\hat{\mu}+z_{right}\frac{\sigma}{\sqrt{N}}]
$$

- We don't know $\sigma$
- But plugging in the sample estimate $\hat{\sigma}$ is a valid approximation

$$
[\hat{\mu}+z_{left}\frac{\hat{\sigma}}{\sqrt{N}},\hat{\mu}+z_{right}\frac{\hat{\sigma}}{\sqrt{N}}]\\
\hat{\sigma} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\hat{\mu})^2}
$$

## Confidence Interval Approximation

$$
95\%\;CI \approx [\hat{\mu}+z_{left}\frac{\hat{\sigma}}{\sqrt{N}},\hat{\mu}+z_{right}\frac{\hat{\sigma}}{\sqrt{N}}]
$$
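
Putting the pieces together, the approximate interval can be sketched as a small function. `mean_confidence_interval` is a hypothetical helper name; it uses the 1/N form of $\hat{\sigma}$ given above and the standard library's `NormalDist.inv_cdf` in place of `scipy.stats.norm.ppf`.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(xs, confidence=0.95):
    """Approximate CI for the mean using the Gaussian approximation."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # Plug-in estimate of sigma (1/N form, matching the formula above)
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)
    # z_right = inverse CDF at 1 - alpha/2; z_left = -z_right by symmetry
    z_right = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_right * sigma_hat / math.sqrt(n)
    return mu_hat - half_width, mu_hat + half_width


lower, upper = mean_confidence_interval([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

With only five samples the Gaussian approximation is rough; the example is only meant to show the mechanics.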

## Bernoulli Confidence Interval
- Replace the Gaussian symbols with Bernoulli symbols

$$
var(X) = p(1-p)\\
95\%\;CI \approx [\hat{p}+z_{left}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}},\hat{p}+z_{right}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}]
$$
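
This interval addresses the earlier question of why 1/10, 10/100, and 100/1000 are not equally trustworthy. The sketch below applies the formula to those three cases; `bernoulli_confidence_interval` is a hypothetical helper name, and `NormalDist.inv_cdf` again stands in for `scipy.stats.norm.ppf`.

```python
import math
from statistics import NormalDist


def bernoulli_confidence_interval(n_success, n_total, confidence=0.95):
    """Approximate CI for a Bernoulli parameter p (Gaussian approximation)."""
    p_hat = n_success / n_total
    z_right = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_right * math.sqrt(p_hat * (1 - p_hat) / n_total)
    return p_hat - half_width, p_hat + half_width


intervals = {
    (clicks, views): bernoulli_confidence_interval(clicks, views)
    for clicks, views in [(1, 10), (10, 100), (100, 1000)]
}

for (clicks, views), (lower, upper) in intervals.items():
    print(f"{clicks}/{views}: [{lower:.3f}, {upper:.3f}]")
```

All three cases have the same estimate of 0.1, but the interval narrows as N grows. For 1/10 the lower limit comes out negative, a sign that the Gaussian approximation is poor at such a small N.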

# 7) Bayesian Paradigm

- CLT(Central Limit Theorem)
