# Bayesian Methods

## Probability Basics
#### PMF vs. PDF
A probability mass function (PMF) describes a discrete probability distribution, while a probability density function (PDF) describes a continuous one.

#### Independence
Two variables are independent if their **joint** probability equals the product of their **marginals**: $P(X, Y) = P(X)P(Y)$.

#### Conditional Probability
The **conditional** probability of X given Y is $$P(X|Y) = \frac{P(X, Y)}{P(Y)}$$

#### Chain Rule
The joint probability of X and Y is the conditional probability of X given Y times the marginal probability of Y. The same rule extends to more variables:
$$P(X,Y) = P(X|Y)P(Y)$$
$$P(X,Y,Z)=P(X|Y,Z)P(Y|Z)P(Z)$$

#### Sum Rule
If you want the marginal probability $p(X)$ and you know only the joint probability $p(X, Y)$, you can integrate out the variable Y.
$$p(X) = \int_{-\infty}^{\infty} p(X,Y)\,dY$$
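
For discrete variables the integral becomes a sum. The sketch below uses a made-up 2x2 joint table (the numbers are illustrative assumptions, not taken from any dataset) to show the sum rule, conditional probability, the chain rule, and the independence check side by side.

```python
# Hypothetical joint distribution P(X, Y) over binary X and Y.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def marginal_x(x):
    # Sum rule: P(X=x) = sum over y of P(X=x, Y=y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    # Sum rule: P(Y=y) = sum over x of P(X=x, Y=y)
    return sum(p for (_, yi), p in joint.items() if yi == y)

def conditional_x_given_y(x, y):
    # Conditional probability: P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)
    return joint[(x, y)] / marginal_y(y)

# Chain rule check: P(X, Y) = P(X | Y) P(Y) for every cell.
chain_rule_holds = all(
    abs(conditional_x_given_y(x, y) * marginal_y(y) - p) < 1e-12
    for (x, y), p in joint.items()
)

# Independence check: does P(X, Y) = P(X) P(Y) hold for every cell?
independent = all(
    abs(marginal_x(x) * marginal_y(y) - p) < 1e-12
    for (x, y), p in joint.items()
)

print(marginal_x(0), marginal_y(1), chain_rule_holds, independent)
```

The chain rule holds for any joint table, while independence is a property of this particular table; with these numbers X and Y are not independent.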

### Bayes Theorem 
$$P(\theta|X) = \frac{P(X,\theta)}{P(X)} = \frac{P(X|\theta)P(\theta)}{P(X)}$$

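Posterior = likelihood × prior / evidence.
The prior $P(\theta)$ encodes what we believe about the parameters before seeing the data, for example that they are centered around 0. The likelihood $P(X|\theta)$ shows how well the parameters explain our data. The evidence $P(X)$ is the probability of the data averaged over all parameter values, and it normalizes the result. The posterior $P(\theta|X)$ is the distribution over the parameters after we observe our data.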
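
As a small numeric sketch of the theorem, assume a coin whose heads probability $\theta$ can only take one of two candidate values. The candidate values, the uniform prior, and the observed flips are all made up for illustration.

```python
# Two hypothetical candidate values for a coin's heads probability theta.
prior = {0.5: 0.5, 0.8: 0.5}   # P(theta), assumed uniform over the candidates
heads, tails = 3, 0            # made-up observations X

def likelihood(theta):
    # P(X | theta) for independent flips
    return theta ** heads * (1 - theta) ** tails

# Evidence P(X): the likelihood averaged over the prior.
evidence = sum(likelihood(t) * p for t, p in prior.items())

# Posterior P(theta | X) = P(X | theta) P(theta) / P(X)
posterior = {t: likelihood(t) * p / evidence for t, p in prior.items()}

print(posterior)
```

After three heads in a row, most of the posterior mass moves to the candidate that favors heads.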

### Probabilistic Models
A probabilistic model treats the quantities we care about as random variables and describes how they are connected, so that we can draw samples from them and reason about their relationships.

## The Bayesian approach to statistics
Frequentists treat randomness as objective, while Bayesians treat it as subjective. Bayesians consider the parameters random and the data fixed; frequentists hold the opposite view. Frequentist methods require the number of data points to be much greater than the number of parameters, but Bayesian methods work for any number of data points. Frequentists use maximum likelihood, while Bayesians try to compute the posterior. The prior in Bayesian methods acts as a regularizer. Bayesian methods also work well for online learning, because the posterior after one batch of data can serve as the prior for the next.

### Bayesian Network (not Bayesian neural network)
The nodes in a Bayesian network are random variables and the edges are direct influences. The probabilistic model is a joint probability over all variables, written as a product of the probability of each variable given its parents.
$$P(X_1,\ldots,X_n) = \prod_{k=1}^{n}P(X_k|\mathrm{pa}(X_k))$$
For example, take a graph with nodes R, S, and G where pa(G) = {R, S}, pa(S) = {R}, and R has no parents.

The joint probability is then $P(S,R,G) = P(G|S,R)P(S|R)P(R)$

### Naive Bayes Classifier 
$$P(c, f_1,\ldots,f_n)=P(c)\prod_{i=1}^{n}P(f_i|c)$$

We have a class $c$ that directly affects the values of the features $f_1,\ldots,f_n$. The distribution of the features may differ between classes. The joint distribution above is the probability of the class times the product, over all features, of the probability of each feature given the class.
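
The factorization above can be sketched in a few lines. The class names, priors, and per-feature probabilities below are hypothetical values chosen only to make the example run; features are assumed binary.

```python
# Hypothetical model: class prior P(c) and per-feature P(f_i = 1 | c).
class_prior = {"spam": 0.4, "ham": 0.6}
feature_given_class = {
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.1, 0.3, 0.5],
}

def joint(c, features):
    # P(c, f_1..f_n) = P(c) * prod_i P(f_i | c)
    p = class_prior[c]
    for p_one, f in zip(feature_given_class[c], features):
        p *= p_one if f == 1 else 1 - p_one
    return p

def classify(features):
    # Normalizing the joint over the classes gives P(c | f_1..f_n).
    scores = {c: joint(c, features) for c in class_prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

class_posterior = classify([1, 1, 0])
print(class_posterior)
```

Classification picks the class with the largest posterior, and the normalizing total plays the role of the evidence in Bayes' theorem.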

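Bayesian networks must be acyclic. When the dependencies between variables contain cycles, they are modeled with undirected graphs instead, which are called Markov random fields.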

### Bayes Network Example
We begin with a directed graph over four binary variables: thief (t), alarm (a), earthquake (e), and radio (r). The joint distribution is $P(t, a, e, r) = P(t)P(e)P(a|t, e)P(r|e)$. We first define the priors (marginals) for T and E, since they have no parents. Then we define the probability of the alarm for each 0/1 outcome of thief and earthquake, since they are the alarm's parents. Last, we define the probability of the radio report given earthquake. These are all binary events: each either happened or did not.

The probability of thief given alarm with Bayes formula:
$$P(T|A)=\frac{P(T,A)}{P(A)}=\frac{P(T,A,E) + P(T, A, \overline{E})}{P(T,A,E)+P(T,A,\overline{E})+ P(\overline{T},A,E) + P(\overline{T},A,\overline{E})}$$

The formula being used here seems to be **conditional = joint / marginal**, where **each term is expanded with the sum rule**: the numerator sums the joint over the unobserved parent E, and the denominator sums the same kind of terms over both T and E.
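
The same computation can be done by brute-force enumeration over all assignments. This is only a sketch: the conditional probability tables below are hypothetical numbers, since the notes do not record the values used in the lecture.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative numbers only).
p_t = 0.01                       # P(T = 1)
p_e = 0.02                       # P(E = 1)
p_a_given = {                    # P(A = 1 | T, E), keyed by (t, e)
    (0, 0): 0.001, (0, 1): 0.3,
    (1, 0): 0.9,   (1, 1): 0.95,
}
p_r_given = {0: 0.0, 1: 0.5}     # P(R = 1 | E), keyed by e

def bern(p, value):
    # Probability of a binary outcome given P(outcome = 1) = p
    return p if value == 1 else 1 - p

def joint(t, a, e, r):
    # P(t, a, e, r) = P(t) P(e) P(a | t, e) P(r | e)
    return (bern(p_t, t) * bern(p_e, e)
            * bern(p_a_given[(t, e)], a) * bern(p_r_given[e], r))

def prob(**fixed):
    # Sum rule: add up the joint over every variable the caller did not fix.
    total = 0.0
    for t, a, e, r in product((0, 1), repeat=4):
        assignment = {"t": t, "a": a, "e": e, "r": r}
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(t, a, e, r)
    return total

p_thief_given_alarm = prob(t=1, a=1) / prob(a=1)
p_thief_given_alarm_radio = prob(t=1, a=1, r=1) / prob(a=1, r=1)
print(p_thief_given_alarm, p_thief_given_alarm_radio)
```

With these made-up tables, hearing the radio report lowers the probability of a thief, because the earthquake already explains the alarm.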

### Linear Regression
We can use several versions of the coordinate systems to save space in memory. 

We can also approach linear regression from a Bayesian perspective. With weights $w$, targets $y$, and data $X$, we can write $P(w, y|X)=P(y|X,w)P(w)$.
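
As a hedged illustration of where this factorization leads, assume a Gaussian likelihood $P(y|X,w)=\mathcal{N}(y|Xw,\sigma^2 I)$ and a zero-mean Gaussian prior $P(w)=\mathcal{N}(w|0,\sigma_w^2 I)$. These distributional choices, and the symbols $\sigma$, $\sigma_w$, and $\lambda$, are assumptions that the notes do not state. Taking the log of the joint gives
$$\log P(w,y|X) = -\frac{1}{2\sigma^2}\lVert y - Xw\rVert^2 - \frac{1}{2\sigma_w^2}\lVert w\rVert^2 + \text{const}$$
so maximizing over $w$ is the same as minimizing
$$\lVert y - Xw\rVert^2 + \lambda\lVert w\rVert^2,\qquad \lambda = \frac{\sigma^2}{\sigma_w^2}$$
Under these assumptions the prior plays the role of an L2 regularizer on the weights.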

## Probability Distribution Estimation 
Estimation here means finding the parameters of a probability distribution from observed data. MLE and MAP are two ways to fit a distribution, such as a Gaussian, to the data we have observed.

### Maximum a posteriori (MAP) 
Computing the evidence can be expensive, so instead we look for the value of the parameters that maximizes the posterior probability. By Bayes' theorem the posterior is proportional to the likelihood times the prior. The evidence does not depend on the parameters, so we can drop it from the maximization and solve what remains with numerical optimization methods.

We are given some data. Assume a joint distribution on both the data and theta, $P(D, \theta)$.

The goal of the MAP estimate is to find a good value of theta for D. To do this we choose $\theta_{MAP}= argmax_{\theta}P(D|\theta)P(\theta)$. MAP maximizes the posterior $P(\theta|D)$, while MLE maximizes the likelihood $P(D|\theta)$: the same two terms with the conditioning flipped.

**Pros:**
MAP is easy to compute and interpretable. It also helps avoid overfitting, because the prior acts as regularization. Another advantage is that the MAP estimate approaches the MLE asymptotically as the amount of data grows.

**Cons:**
MAP is a point estimate, so it gives no representation of your uncertainty in theta. The MAP estimate is not invariant under reparameterization, unlike the MLE. It also requires assuming a prior on theta.
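
A minimal sketch of the MAP recipe, using a coin-flip model and a simple grid search in place of a real optimizer. The data and the Beta(2, 2) prior are assumptions made up for this example.

```python
import math

flips = [1, 1, 1, 0, 1]            # made-up data D (1 = heads)
heads = sum(flips)
tails = len(flips) - heads

def log_likelihood(theta):
    # log P(D | theta) for independent Bernoulli flips
    return heads * math.log(theta) + tails * math.log(1 - theta)

def log_prior(theta, a=2.0, b=2.0):
    # Unnormalized log Beta(a, b) density; the prior choice is an assumption.
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]   # candidate theta values in (0, 1)

# The evidence P(D) does not depend on theta, so it is left out of the argmax.
theta_map = max(grid, key=lambda t: log_likelihood(t) + log_prior(t))
theta_mle = max(grid, key=log_likelihood)

print(theta_mle, theta_map)
```

The prior pulls the MAP estimate toward 0.5 relative to the MLE, which is the regularization effect listed under the pros.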


### Maximum Likelihood Estimation (MLE)
MLE is a method that determines values for the parameters of a model. The parameter values are found such that they maximize the likelihood that the process described by the model produced the data that were actually observed. 

If the events that generated the data are independent, then the total probability of observing all of the data is the product of the probabilities of observing each data point individually, that is, the product of the marginal probabilities.

$$\theta_{MLE} = argmax_{\theta}P(X|\theta)$$
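
As a sketch, assume the data are independent draws from a Gaussian, for which the maximizer has a closed form. The data values are made up, and the Gaussian model is an assumption chosen for illustration.

```python
import math

data = [2.1, 1.9, 2.4, 2.0, 1.6]    # made-up observations

def log_likelihood(mu, sigma, xs):
    # Independence turns the product of densities into a sum of log densities.
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )

# Closed-form Gaussian MLE: sample mean and (biased) sample variance.
mu_mle = sum(data) / len(data)
var_mle = sum((x - mu_mle) ** 2 for x in data) / len(data)
sigma_mle = math.sqrt(var_mle)

print(mu_mle, sigma_mle, log_likelihood(mu_mle, sigma_mle, data))
```

Moving either parameter away from these values lowers the log likelihood, which is a quick way to sanity-check the estimate.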

### Conjugate Distributions/Prior
In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. If our likelihood is a Gaussian distribution and our prior is a Gaussian, then we know that our posterior will be a Gaussian. 

One important property of the Gaussian density is that it is e raised to the power of a parabola (a quadratic in the variable). When we multiply two Gaussian densities we add their exponents, and the sum of two parabolas is again a parabola, so the product is again Gaussian up to a normalizing constant.
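
To make the parabola argument concrete, here is a sketch for two Gaussian densities in the same variable $x$ (the symbols $\mu_1, \sigma_1, \mu_2, \sigma_2, \mu, \sigma$ are introduced here for illustration):
$$\mathcal{N}(x|\mu_1,\sigma_1^2)\,\mathcal{N}(x|\mu_2,\sigma_2^2) \propto \exp\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right) \propto \mathcal{N}(x|\mu,\sigma^2)$$
$$\frac{1}{\sigma^2}=\frac{1}{\sigma_1^2}+\frac{1}{\sigma_2^2},\qquad \mu=\sigma^2\left(\frac{\mu_1}{\sigma_1^2}+\frac{\mu_2}{\sigma_2^2}\right)$$
The inverse variances add, and the new mean is an average of the two means weighted by those inverse variances.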

If we choose a conjugate prior, the posterior will have the same form as the prior. For example, the Beta distribution is conjugate to the Bernoulli likelihood, so a Beta prior gives a Beta posterior.

If we choose the correct conjugate prior, we can work out the parameters of the posterior from known closed-form update formulas, so the posterior is obtained exactly rather than approximately.
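
For the Beta and Bernoulli pair, the update formula is just counting: a Beta(a, b) prior combined with observed heads and tails gives a Beta(a + heads, b + tails) posterior. A minimal sketch, with made-up flips and a uniform Beta(1, 1) prior as assumptions:

```python
def beta_bernoulli_update(a, b, flips):
    # Beta(a, b) prior + Bernoulli data -> Beta(a + heads, b + tails) posterior
    heads = sum(flips)
    tails = len(flips) - heads
    return a + heads, b + tails

def beta_mean(a, b):
    # Mean of a Beta(a, b) distribution
    return a / (a + b)

flips = [1, 0, 1, 1]                      # made-up data (1 = heads)
a_post, b_post = beta_bernoulli_update(1, 1, flips)

# Updating one flip at a time gives the same posterior as one batch update.
a_seq, b_seq = 1, 1
for flip in flips:
    a_seq, b_seq = beta_bernoulli_update(a_seq, b_seq, [flip])

print((a_post, b_post), (a_seq, b_seq), beta_mean(a_post, b_post))
```

The sequential loop shows why conjugate models suit online updates: each posterior becomes the prior for the next observation.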

### Gamma Distribution
The gamma distribution with parameters $a > 0$ and $b > 0$ has the density $$p(\gamma|a,b)=\frac{b^a}{\Gamma(a)}\gamma^{a-1}e^{-b\gamma}$$ Here $\Gamma$ is the gamma function, which satisfies $$\Gamma(n) = (n-1)!$$ for positive integers $n$ and so extends the factorial to non-integer arguments. The gamma distribution is used in many fields to model things like the amount of rainfall in an area over a period of time. Its support is $(0, \infty)$.
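
A short sketch that evaluates this density with the standard library and checks the factorial identity. The parameter values `a=2.0` and `b=3.0` are arbitrary choices for illustration.

```python
import math

def gamma_pdf(x, a, b):
    # Density b^a / Gamma(a) * x^(a-1) * exp(-b x), defined for x > 0.
    if x <= 0:
        return 0.0
    # Work in log space so large a or b do not overflow.
    return math.exp(a * math.log(b) - math.lgamma(a)
                    + (a - 1) * math.log(x) - b * x)

# Gamma(n) = (n - 1)! for positive integers n
factorial_identity_holds = all(
    math.isclose(math.gamma(n), math.factorial(n - 1)) for n in range(1, 10)
)

# Crude Riemann sum over (0, 20): the density should integrate to roughly 1.
step = 0.001
area = sum(gamma_pdf(i * step, a=2.0, b=3.0) * step for i in range(1, 20000))

print(factorial_identity_holds, area)
```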

**Precision:** The precision is the inverse of the variance. 

We can substitute the precision for the variance when parameterizing a distribution.

### Beta Distribution
The beta distribution has two parameters, both assumed to be positive, and is defined on the interval (0,1). The beta distribution is the conjugate prior for the Bernoulli, binomial, and geometric distributions.