# Bayesian Machine Learning in Python: A/B Testing

## Part 1 - Probability Review

### Probability Review

- Marginal Distributions:    p(A), p(B)
- Joint Distribution:        p(A, B)
- Conditional Distribution:  p(A|B), p(B|A)

#### Marginalization

- How can you find the marginal distribution given the joint?

$
\displaystyle p(A) = \sum_B {p(A, B)} \\
\; \\
\displaystyle p(B) = \sum_A {p(A, B)}
$

#### Conditional Distributions

- How can we calculate the conditional from the joint?

$
\displaystyle p(A | B) = \frac{p(A, B)}{p(B)} = \frac{p(A, B)}{\sum_A{p(A, B)}} \\
\; \\
\displaystyle p(B | A) = \frac{p(A, B)}{p(A)} = \frac{p(A, B)}{\sum_B{p(A, B)}}
$

#### Conditional from Conditional

Given: $\displaystyle  p(A | B), p(B) $<br>
Want: $\displaystyle  p(B | A) $<br>
Recall: $\displaystyle  p(A, B) = p(A | B)p(B) $ <br>
<br>
Derive: $\displaystyle  p(B | A) = \frac{p(A, B)}{p(A)} = \frac{p(A, B)}{\sum_B{p(A, B)}} = \frac{p(A | B)p(B)}{\sum_B{p(A | B)p(B)}} $ -> This is **Bayes' rule**!
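As a quick numerical check of the derivation, the sketch below builds a small joint table and confirms that Bayes' rule recovers p(B | A) from p(A | B) and p(B) alone. The joint values are made up for illustration and are not part of the course material.

```python
import numpy as np

# Hypothetical joint distribution p(A, B): rows index A, columns index B.
# These numbers are invented for illustration only.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_a = joint.sum(axis=1)              # marginalize over B -> p(A)
p_b = joint.sum(axis=0)              # marginalize over A -> p(B)

# Conditionals computed directly from the joint
p_a_given_b = joint / p_b            # column j holds p(A | B=j)
p_b_given_a = joint / p_a[:, None]   # row i holds p(B | A=i)

# Bayes' rule: rebuild p(B | A) using only p(A | B) and p(B)
numerator = p_a_given_b * p_b        # p(A | B) p(B), which equals p(A, B)
bayes_b_given_a = numerator / numerator.sum(axis=1, keepdims=True)

print(np.allclose(bayes_b_given_a, p_b_given_a))
```

Both routes give the same table, which is all Bayes' rule claims: the denominator is just the marginal p(A) written in terms of the known quantities.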

#### Discrete vs. Continuous Random Variables

- for continuous random variables, p() is a probability density (not a probability)
- sums become integrals, but the rules still hold!

Joint: $\displaystyle  p(x, y) $<br>
Marginal: $\displaystyle  p(x) = \int{p(x, y)dy} \;,\; p(y) = \int{p(x, y)dx}$

#### Conditional Distributions

$
\displaystyle p(x | y) = \frac{p(x, y)}{p(y)} = \frac{p(x, y)}{\int{p(x, y)dx}}\\
\displaystyle p(y | x) = \frac{p(x, y)}{p(x)} = \frac{p(x, y)}{\int{p(x, y)dy}}
$

#### Bayes' Rule

$
\displaystyle p(x | y) = \frac{p(y | x)p(x)}{\int{p(y | x)p(x)dx}} \\
\displaystyle p(y | x) = \frac{p(x | y)p(y)}{\int{p(x | y)p(y)dy}}
$

### Example Dataset

|| CA | US | MX |
| --- | --- | --- | --- |
|Buy = True | 20 | 50 |10 |
| Buy = False | 300 | 500 | 200 |

##### Marginal

* Let's find p(Country) first:

$
\displaystyle p(Country = MX) = (10 + 200) / ((20+300) + (50+500) + (10+200)) = 210 / 1080 = 0.1944 \\
\displaystyle p(Country = US) = (50 + 500) / ((20+300) + (50+500) + (10+200)) = 550 / 1080 = 0.5093 \\
\displaystyle p(Country = CA) = (20 + 300) / ((20+300) + (50+500) + (10+200)) = 320 / 1080 = 0.2963
$

##### Joint Probabilities

- there are 6 joint probabilities.
- the number of probability values increases exponentially as we add more random variables

$ Volume = |x_1| \times |x_2| \times |x_3| \times ... \times |x_N| $
-> **Curse of dimensionality!**
<br>

$
\displaystyle p(Buy=True, CA) = 20 / 1080 = 0.0185 \\
\displaystyle p(Buy=False, CA) = 300 / 1080 = 0.2778 \\
\displaystyle p(Buy=True, US) = 50 / 1080 = 0.0463 \\
\displaystyle p(Buy=False, US) = 500 / 1080 = 0.4630 \\
\displaystyle p(Buy=True, MX) = 10 / 1080 = 0.0093 \\
\displaystyle p(Buy=False, MX) = 200 / 1080 = 0.1852
$
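The joint and marginal values can be reproduced from the raw counts. The following is a minimal sketch using NumPy; the array layout (rows for Buy, columns for Country) is a choice made here for illustration.

```python
import numpy as np

# Counts from the example table: rows = Buy (True, False), columns = CA, US, MX
counts = np.array([[20, 50, 10],
                   [300, 500, 200]])
countries = ["CA", "US", "MX"]

total = counts.sum()             # 1080 observations in total
joint = counts / total           # p(Buy, Country): six values that sum to 1

p_country = joint.sum(axis=0)    # sum over Buy     -> p(Country)
p_buy = joint.sum(axis=1)        # sum over Country -> p(Buy)

for name, p in zip(countries, p_country):
    print(f"p(Country={name}) = {p:.4f}")
print(f"p(Buy=True) = {p_buy[0]:.4f}")
```

Summing along one axis of the joint table is exactly the marginalization rule from the review section.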

##### Conditional Probabilities

$
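\displaystyle p(Buy=True | CA) = \frac{20/1080}{320/1080} = 0.0625 \\
\displaystyle p(Buy=False | CA) = \frac{300/1080}{320/1080} = 0.9375 \\
\displaystyle p(Buy=True | US) = \frac{50/1080}{550/1080} = 0.0909 \\
\displaystyle p(Buy=False | US) = \frac{500/1080}{550/1080} = 0.9091 \\
\displaystyle p(Buy=True | MX) = \frac{10/1080}{210/1080} = 0.0476 \\
\displaystyle p(Buy=False | MX) = \frac{200/1080}{210/1080} = 0.9524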
$

In [1]:
print(f"10/1080: {10/1080}\n 200/1080: {200/1080} \n 210/1080: {210/1080}\n p(buy|mx): {(10/1080)/(210/1080)} \n p(nobuy|mx): {(200/1080)/(210/1080)}")

10/1080: 0.009259259259259259
 200/1080: 0.18518518518518517 
 210/1080: 0.19444444444444445
 p(buy|mx): 0.047619047619047616 
 p(nobuy|mx): 0.9523809523809523


##### Conditional Probabilities - Alternative Calculation

$ \displaystyle p(Buy=True | US) = \frac{p(Buy=True, US)}{p(Country=US)} = \frac{50/1080}{(50+500)/1080} = \frac{50}{(50+500)} $
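Both routes to the conditional (dividing the joint by the marginal, or normalizing the raw counts per country) can be compared directly. This is a small sketch with NumPy; the variable names are illustrative.

```python
import numpy as np

counts = np.array([[20, 50, 10],       # Buy = True
                   [300, 500, 200]])   # Buy = False
countries = ["CA", "US", "MX"]

joint = counts / counts.sum()
p_country = joint.sum(axis=0)

# Route 1: divide the joint by the marginal p(Country)
cond_from_joint = joint / p_country

# Route 2: normalize the raw counts within each country (the 1/1080 cancels)
cond_from_counts = counts / counts.sum(axis=0)

print(np.allclose(cond_from_joint, cond_from_counts))
for j, name in enumerate(countries):
    print(f"p(Buy=True | {name}) = {cond_from_counts[0, j]:.4f}")
```

Working from counts avoids the rounding error that creeps in when already-rounded joint and marginal values are divided by hand.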

#### Example 2


|| CA | US | MX |
| --- | --- | --- | --- |
| Buy=True | 20 | 50 | 10 |
| Buy=False | 180 | 450 | 90 |

$
\displaystyle p(Buy=True | CA) = 0.1 \\
\displaystyle p(Buy=True | US) = 0.1 \\
\displaystyle p(Buy=True | MX) = 0.1
$

### Independence

* Intuitively: knowing the value of one random variable doesn't tell me anything about the other
* Ex. A = Person 1 coin toss result
* Ex. B = Person 2 coin toss result
* Two tosses of the same (fair) coin are also independent of each other.

$ A \perp B \iff p(A, B) = p(A)p(B) $

**"Two random variables are independent if and only if their joint distribution p(A, B) is equal to the product of their marginal distributions p(A) times p(B)."**


#### Independence (conditional)

- Suppose Buy and Country are independent

$
\displaystyle Buy \perp Country \iff p(Buy, Country) = p(Buy)p(Country)  \\
\displaystyle p(Buy | Country) = \frac{p(Buy, Country)}{p(Country)} = \frac{p(Buy)p(Country)}{p(Country)} = p(Buy)
$

*continuing with Example 2*

$
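\displaystyle p(Buy=True) = (20 + 50 + 10)/(200 + 500 + 100) = 80/800 = 0.1 \\
\displaystyle p(Buy=True | Country) = 0.1 = p(Buy=True) \text{ for every country, so } Buy \perp Country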
$
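The factorization test p(A, B) = p(A)p(B) can be applied to both example tables. The sketch below uses a hypothetical helper, `is_independent`, written for this illustration.

```python
import numpy as np

def is_independent(counts):
    """Return True if the joint implied by `counts` equals the outer product of its marginals."""
    joint = counts / counts.sum()
    p_buy = joint.sum(axis=1)
    p_country = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_buy, p_country)))

example_1 = np.array([[20, 50, 10], [300, 500, 200]])   # first example table
example_2 = np.array([[20, 50, 10], [180, 450, 90]])    # Example 2

print(is_independent(example_1))   # False
print(is_independent(example_2))   # True
```

Exact factorization rarely appears in real count data, so in practice one would use a statistical test rather than `np.allclose`; the check here only illustrates the definition.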

### Simple Probability Exercise

Assumptions:
* each coin toss in a sequence of coin tosses is independent
* they are identically distributed:
    - i.e. probability of heads is the same for each toss
* iid = independent and identically distributed
* the coin is a fair coin: p(H) = p(T) = 0.5

Setup:
* we plan to toss the coin 200 times
* we have tossed 100 times so far, results:
    - 80 heads, 20 tails

Question: how many heads and tails do we expect in total once all 200 tosses are done?

**2 Typical answers:**

1. Since it's a fair coin, we expect to end up with 100 heads and 100 tails.

2. The past is fixed and can't affect future tosses.
    - 100 tosses left (expect to see 50 heads and 50 tails)
    - Total heads 80+50 = 130
    - Total tails 20+50 = 70
    
**The correct answer is 2.**
- Coin tosses are independent, i.e. p(toss2 | toss1) = p(toss2)
- the past is fixed
- we can only predict the future outcomes
- we therefore predict 50 *more* heads and 50 *more* tails
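A simulation makes the same point. The sketch below fixes the first 100 tosses at 80 heads and 20 tails, then simulates the remaining 100 fair tosses many times; the seed and the number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

heads_so_far, tails_so_far = 80, 20
remaining = 200 - (heads_so_far + tails_so_far)   # 100 tosses left
n_experiments = 100_000

# For each experiment, draw the number of heads in the remaining fair tosses
future_heads = rng.binomial(n=remaining, p=0.5, size=n_experiments)

total_heads = heads_so_far + future_heads
print(f"average total heads: {total_heads.mean():.2f}")
print(f"average total tails: {(200 - total_heads).mean():.2f}")
```

The averages land near 130 and 70, not 100 and 100, because the remaining tosses carry no memory of the first hundred.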


#### The Gambler's Fallacy
- The incorrect choice (answer 1) is so common that it has a name
- it is the false belief that things will "balance out" in the end
- e.g. a gambler who has just lost several times believes it is *more likely* they will win next
- the chance of losing the next game is exactly the same as it has always been.

### The Monty Hall Problem

- Famous problem in probability, inspired by a TV game show
- TV show was *Let's Make a Deal*, host was *Monty Hall* - hence **The Monty Hall Problem**

*How does it work?*

1. contestant picks a door out of three, without opening it.
2. Monty Hall opens a different door and reveals a goat.
3. contestant can stay with the original choice or switch doors.
---

**Calculation**

- let's assume contestant chooses door #1

random variable C: which door the car is behind (C = 1, 2, 3)<br>
random variable H: which door Monty Hall opens (assume H = 2)

$
p(H = 2 | C = 1) = 0.5 \\
p(H = 2 | C = 2) = 0 \\
p(H = 2 | C = 3) = 1
$

- what probability do we want?

$ Want: p(C = 3|H = 2), p(C = 1| H = 2) $

**Bayes' rule -> a way of switching around the givens (from p(H | C) to p(C | H))**

$
\displaystyle p(C = 3|H = 2) = \frac{p(H=2, C=3)}{p(H=2)} = \frac{p(H=2|C=3)p(C=3)}{\sum_{c=1}^3{p(H=2|C=c)p(C=c)}}  \\
\displaystyle p(C = 3|H = 2) = \frac{p(H=2|C=3)p(C=3)}{p(H=2|C=1)p(C=1) + p(H=2|C=2)p(C=2) + p(H=2|C=3)p(C=3)}
$

- let's assume the car is equally likely to be behind any door: p(C = c) = 1/3

$
\displaystyle p(C = 3|H = 2) = \frac{1 \cdot \frac{1}{3}}{\frac{1}{2}\frac{1}{3} + 0 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3}} = \frac{\frac{1}{3}}{\frac{1}{2}} = \frac{2}{3}
$

- always switch!

$
\displaystyle p(C = 1|H = 2) = \frac{\frac{1}{2}\frac{1}{3}}{\frac{1}{2}\frac{1}{3} + \frac{1}{3}} = \frac{1}{3}
$

## Maximum Likelihood Estimation

Maximum likelihood estimation is a technique for fitting the parameters of a statistical model to data.<br>
Imagine we have collected data from an experiment and would like to fit a model to that data.<br>
Such a model usually comes with parameters. Our job is to find the parameter values that make the model fit the collected data as closely as possible.

- modern example: Deep Learning / Neural Networks
- the *learning* part is simply finding the best **parameters** to fit the **data**.

### The Bernoulli Distribution

- Example:
    - p(heads) = 0.6
    - p(tails) = 0.4
    
Mathematical:
- discrete random variable
- PMF (probability mass function)

$
\displaystyle p(x) = \theta^x(1-\theta)^{1-x}
$

*in this case, x can only be 0 or 1*<br>
*$\theta$ is the only parameter in this distribution*

**Note:** $\displaystyle  p(x = 1) = \theta$

$
p(x = 1) = \theta^1(1-\theta)^{1-1} = \theta \\
p(x = 0) = \theta^0(1-\theta)^1 = 1 - \theta \\
\; \\
p(x = 1) + p(x = 0) = 1
$
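The PMF translates directly into code. The function name `bernoulli_pmf` below is a hypothetical helper written for this sketch, using the p(heads) = 0.6 example.

```python
def bernoulli_pmf(x, theta):
    """p(x) = theta**x * (1 - theta)**(1 - x), defined for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return theta**x * (1 - theta) ** (1 - x)

theta = 0.6   # p(heads) from the example
p_heads = bernoulli_pmf(1, theta)
p_tails = bernoulli_pmf(0, theta)
print(p_heads, p_tails, p_heads + p_tails)
```

The exponents act as a switch: for x = 1 only the theta factor survives, and for x = 0 only the (1 - theta) factor does.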

**Problem Setup**
- suppose we have collected some data (flipped a coin several times)

$
\displaystyle data = \{x_1, x_2, \ldots, x_N\}
$

**Likelihood:**

$
\displaystyle L(\theta) = p(data | \theta) = \prod_{i=1}^N{p(x_i|\theta)} = \prod_{i=1}^N{\theta^{x_i}(1-\theta)^{1-x_i}}
$


### What is the Likelihood a Function of?

- many people think it's **x**
    - this is not correct!
    - the **x**s are just the values we recorded in our experiment (1s and 0s)
- the variable is $\theta$

**Example**

$
x_1 = 1, x_2 = 0, x_3 = 1 \\
L(\theta) = \prod_{i=1}^N{\theta^{x_i}(1-\theta)^{1-x_i}} = \theta^1 \times (1-\theta)^1 \times \theta^1 = \theta^2(1-\theta)
$
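To see the likelihood as a function of $\theta$, hold the three recorded outcomes fixed and vary only the parameter. The function name `likelihood` and the candidate values of theta are choices made for this sketch.

```python
import numpy as np

def likelihood(theta, data):
    """L(theta) = prod_i theta**x_i * (1 - theta)**(1 - x_i)."""
    data = np.asarray(data)
    return np.prod(theta**data * (1 - theta) ** (1 - data))

data = [1, 0, 1]                 # fixed: the recorded outcomes
for theta in (0.2, 0.5, 0.8):    # varied: candidate parameter values
    print(f"L({theta}) = {likelihood(theta, data):.4f}")
```

The data never change between calls; only theta does, which is why L is written as L($\theta$).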

### Why is it called 'Maximum Likelihood'?

- what value of $\theta$ makes the data we collected **most probable**?
- what value of $\theta$ maximizes the likelihood?
- ex. if we got 100 heads, 0 tails, $\theta = 5\%$ would not make sense!
- it's more likely that the probability of heads is close to 100%

$\displaystyle p(H) \approx \frac{N_H}{N_H + N_T} $

### Maximizing a Function

- we want to take the derivative of L with respect to $\theta$
- we want to find the value of $\theta$ that makes the derivative 0
- we call this value $\hat{\theta}$ (theta hat); the hat is the symbol usually used for statistical estimates

$
\displaystyle \frac {dL}{d\theta} = 0 \\
\;\\
\displaystyle \hat{\theta} = \arg\max_{\theta}L(\theta)
$
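Before doing any calculus, the arg max can be approximated by brute force: evaluate L on a fine grid of theta values and keep the largest. The data below are hypothetical tosses invented for this sketch, and the grid resolution is an arbitrary choice.

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical tosses: 6 heads, 2 tails

thetas = np.linspace(0.001, 0.999, 999)     # candidate values of theta, step 0.001
n_heads = data.sum()
n_tails = len(data) - n_heads

# The Bernoulli product collapses to theta^N_H * (1 - theta)^N_T
L = thetas**n_heads * (1 - thetas) ** n_tails

theta_hat = thetas[np.argmax(L)]
print(f"grid argmax:       {theta_hat:.3f}")
print(f"N_H / (N_H + N_T): {n_heads / len(data):.3f}")
```

The grid maximum agrees with the fraction of heads up to the grid spacing, which is the estimate the intuition suggested.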

### Log-Likelihood

- it's better to take the log of the likelihood before differentiating
- usually, the derivative is easier to solve 
- for Bernoulli, it's solvable both ways
- why does this work?
    - log() is a monotonically increasing function
    - whatever $\theta$ maximizes L also maximizes logL
    - pick any two positive values where A > B: then log(A) > log(B)

**Calculation**

- take the log

$
\displaystyle l(\theta) = \log{L(\theta)} = \log\prod_{i=1}^N{\theta^{x_i}(1-\theta)^{1-x_i}}\\
\;\\
\displaystyle = \sum_{i=1}^N{\{x_i{\log\theta} + (1-x_i)\log(1-\theta)\}}
$

- take the derivative

$
\displaystyle l(\theta) = \sum_{i=1}^N \{x_i\log\theta + (1-x_i)\log(1-\theta)\} \\
\displaystyle \frac{dl}{d\theta} = \frac{1}{\theta}\sum_{i=1}^Nx_i - \frac{1}{1-\theta}\sum_{i=1}^N(1-x_i)
$
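The derivative formula can be sanity-checked numerically. The sketch below compares it against a central finite difference and evaluates it at the sample mean, where it should vanish. The function names and the data are hypothetical, chosen for this illustration.

```python
import numpy as np

def log_likelihood(theta, data):
    """l(theta) = sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]."""
    return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

def d_log_likelihood(theta, data):
    """dl/dtheta = sum(x_i) / theta - sum(1 - x_i) / (1 - theta)."""
    return data.sum() / theta - (1 - data).sum() / (1 - theta)

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical tosses: 6 heads, 2 tails
theta = 0.4
eps = 1e-6

# Central finite difference as an independent check of the analytic derivative
numeric = (log_likelihood(theta + eps, data) - log_likelihood(theta - eps, data)) / (2 * eps)
print(np.isclose(numeric, d_log_likelihood(theta, data)))

# The derivative is zero at the sample mean, i.e. N_H / (N_H + N_T)
print(d_log_likelihood(data.mean(), data))
```

Setting the derivative to zero and solving for theta gives the sample mean, which matches the N_H / (N_H + N_T) estimate.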