# Bayesian Statistics

## [Slides](https://github.com/cs109/2015/blob/master/Lectures/16-BayesianMethods.pdf)

### Recommended Reading
[The theory that would not die](https://itunes.apple.com/us/book/the-theory-that-would-not-die/id646319489?mt=11)

[Think Bayes](https://itunes.apple.com/us/book/think-bayes/id705489536?mt=11)

[Probabilistic Programming and Bayesian Methods for Hackers](https://itunes.apple.com/us/book/bayesian-methods-for-hackers/id1045072989?mt=11)

- [iPython Notebook](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers)

[Doing Bayesian Data Analysis](https://www.amazon.ca/Doing-Bayesian-Data-Analysis-Tutorial/dp/0124058884/ref=dp_ob_title_bk)

[Bayesian Data Analysis](https://www.amazon.ca/Bayesian-Data-Analysis-Andrew-Gelman/dp/1439840954/ref=pd_bxgy_14_img_2?_encoding=UTF8&pd_rd_i=1439840954&pd_rd_r=b4c8b1a6-cd71-11e8-ab44-9143b7b890b1&pd_rd_w=FreZI&pd_rd_wg=RXthh&pf_rd_i=desktop-dp-sims&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_p=cda2b2aa-f379-4b98-b5ff-b78659186dbe&pf_rd_r=X4FZ259D7N4VZ1CY1P4P&pf_rd_s=desktop-dp-sims&pf_rd_t=40701&psc=1&refRID=X4FZ259D7N4VZ1CY1P4P)

# Bayes' rule
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

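$P(A)$ is your prior belief about $A$.

$B$ is new information (something you observed); $P(B)$ is the probability of observing it, and $P(B|A)$ is the probability of observing it if $A$ holds.
- Now you want to update your belief

$P(A|B)$ is the posterior probability of $A$, your belief after seeing $B$.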
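
A small numeric sketch of the update may help. The scenario (a diagnostic test) and every number below are made up for illustration; they are not from the slides.

```python
# Hypothetical numbers, chosen only to illustrate the update.
prior = 0.01               # P(A): prior belief that the condition is present
p_pos_given_a = 0.95       # P(B|A): positive test when the condition is present
p_pos_given_not_a = 0.05   # P(B|not A): false-positive rate

# P(B) via the law of total probability
evidence = p_pos_given_a * prior + p_pos_given_not_a * (1 - prior)

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
posterior = p_pos_given_a * prior / evidence
print(f"P(A|B) = {posterior:.3f}")  # P(A|B) = 0.161
```

Even with a fairly accurate test, the small prior keeps the posterior well below one half.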

## Statistical Notation
$$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}$$

Treating the data $y$ as fixed,

$$p(\theta|y) \propto L(\theta)p(\theta)$$

Bayes' rule says the posterior density is proportional to the likelihood function times the prior density.

**Likelihood** in statistics means the probability of the data given the parameter, viewed as a function of the parameter: $L(\theta) = p(y|\theta)$ with $y$ held fixed.
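
The proportionality "posterior is likelihood times prior" can be sketched on a grid of parameter values. This is only an illustration under assumed inputs: the binomial data (7 successes in 10 trials), the flat prior, and the grid resolution are all hypothetical choices.

```python
from math import comb

def likelihood(theta, successes, trials):
    """Binomial probability of the fixed data, viewed as a function of theta."""
    return comb(trials, successes) * theta**successes * (1 - theta) ** (trials - successes)

# Hypothetical data: 7 successes in 10 trials.
successes, trials = 7, 10

# Grid of candidate parameter values and a flat prior over them.
grid = [i / 100 for i in range(101)]
prior = [1.0 for _ in grid]

# Unnormalized posterior: likelihood times prior, point by point.
unnormalized = [likelihood(t, successes, trials) * p for t, p in zip(grid, prior)]

# Normalizing over the grid plays the role of dividing by p(y).
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

best = grid[posterior.index(max(posterior))]
print("posterior mode on the grid:", best)  # 0.7
```

With a flat prior the posterior mode coincides with the maximum-likelihood estimate; a non-flat `prior` list would pull it elsewhere.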

The classical frequentist is used to thinking of $\theta$ as an unknown constant and is very uncomfortable giving it a probability distribution.

The Bayesian perspective is that $\theta$ is unknown, we want to quantify its uncertainty, and probability is the best way to do that.

A Bayesian is, in essence, treating everything as a random variable.

### The ongoing debate is:
Where does the prior, $p(\theta)$, come from?

Bayesians think in terms of rational agents who can subjectively supply a prior.

# Discriminative vs. Generative Classifiers

- What to model and what not to model?

**Discriminative:** just model $p(y|x)$

**Generative:** give a full probability model $p(x,y)=p(x)p(y|x)=p(y)p(x|y)$
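
To make the two factorizations concrete, here is a minimal sketch on a made-up joint distribution over binary $x$ and $y$. The table values are hypothetical; the point is only that a generative model carries the whole joint, while a discriminative model needs just the $p(y|x)$ table.

```python
# Hypothetical joint distribution p(x, y) over binary x and y.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y)
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

# Conditionals, keyed as (target, given)
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in (0, 1) for y in (0, 1)}

# Both factorizations recover the same joint.
for (x, y), p in joint.items():
    assert abs(p_x[x] * p_y_given_x[(y, x)] - p) < 1e-12
    assert abs(p_y[y] * p_x_given_y[(x, y)] - p) < 1e-12

# A discriminative model would only need this table:
print(p_y_given_x)
```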

# Naive Bayes Spam Filter
- Naive Bayes assumption: the features (for example, the words in a message) are conditionally independent given spam, and also conditionally independent given not spam.
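
A toy spam filter shows how the independence assumption turns into a product of per-word factors. Everything here is an illustrative assumption: the four training messages are made up, and add-one (Laplace) smoothing is one common choice that the notes do not specify.

```python
from collections import Counter
from math import log

# Tiny made-up training set: (message, label)
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_score(text, label):
    """log p(label) + sum of log p(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        # Conditional independence: each word contributes its own factor.
        score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score

def classify(text):
    return max(("spam", "ham"), key=lambda label: log_score(text, label))

print(classify("free money"))        # spam
print(classify("meeting tomorrow"))  # ham
```

Working in log space avoids underflow when many small probabilities are multiplied.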

# Full Probability Modeling
The process of Bayesian data analysis can be idealized by dividing it into the following three steps:
1. Setting up a full probability model - a joint probability distribution for all (observed,) observable and unobservable quantities in a problem...
2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution - the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data
3. Evaluating the fit of the model and the implications of the resulting posterior distribution...

### Think like a Bayesian, check like a frequentist.

# Conjugate Priors: Beta-Binomial

$X$ - observed data

$p$ - parameter

$\text{Bin}$ - Binomial

$n$ - sample size

$$X|p \sim Bin(n,p)$$

$Beta$ - the Beta distribution, used here as the prior density on $p$

$$p \sim Beta(a,b)$$

- $Beta(a,b)$ [density](https://en.wikipedia.org/wiki/Beta-binomial_distribution):

$$f(p) \propto p^{a-1}(1-p)^{b-1}$$
![distribution](https://upload.wikimedia.org/wikipedia/commons/e/e1/Beta-binomial_distribution_pmf.png)

Conjugate priors are extremely convenient. Beta is a whole family of distributions. If you start out with a prior in that family and then observe binomial data, the posterior is still in that family.

The posterior is then $p|X = x \sim Beta(a + x, b + n - x)$
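
The conjugate update is just arithmetic on the two Beta parameters. A minimal sketch, with a hypothetical prior $Beta(2, 2)$ and hypothetical data of 7 successes in 10 trials:

```python
def beta_binomial_update(a, b, x, n):
    """Posterior Beta parameters after observing x successes in n trials."""
    return a + x, b + n - x

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical prior and data
a, b = 2, 2
x, n = 7, 10

a_post, b_post = beta_binomial_update(a, b, x, n)
print(a_post, b_post)             # 9 5
print(beta_mean(a, b))            # prior mean: 0.5
print(beta_mean(a_post, b_post))  # posterior mean: 9/14
```

The posterior mean lies between the prior mean and the observed proportion $x/n$, which is one way to read $a$ and $b$ as prior pseudo-counts.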

# Conjugate Priors: Normal-Normal

### Normal is [conjugate](https://en.wikipedia.org/wiki/Conjugate_prior) to itself.
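
As a sketch of what this means in the simplest case, assume the data are Normal with a known noise variance and the unknown mean has a Normal prior (these assumptions, and the numbers used, are not from the notes). The posterior for the mean is again Normal: precisions (inverse variances) add, and the posterior mean is a precision-weighted average.

```python
def normal_normal_update(prior_mean, prior_var, data, noise_var):
    """Posterior mean and variance of an unknown mean, noise variance known."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    # Precisions add; the posterior mean is a precision-weighted average.
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return post_mean, post_var

# Hypothetical prior N(0, 1) and three observations with unit noise variance.
post_mean, post_var = normal_normal_update(0.0, 1.0, [1.0, 2.0, 3.0], 1.0)
print(post_mean, post_var)  # 1.5 0.25
```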


# Supervised Learning

## Random Forest

- Tune number of trees and number of features
- Rule of thumb: the number of features tried at each split is about $\sqrt{\text{number of features}}$

## Out of Bag Error
- Very similar to cross-validation
- Measured during training
- Can be too optimistic
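
Out-of-bag samples exist because each tree is trained on a bootstrap sample, so some points are never drawn for that tree. A minimal sketch of one such draw, assuming a hypothetical data set of 1000 points (only indices are simulated, no model is trained):

```python
import random

random.seed(0)
n_samples = 1000
indices = list(range(n_samples))

# Bootstrap sample for one tree: draw n_samples indices with replacement.
in_bag = [random.choice(indices) for _ in range(n_samples)]

# Points never drawn are "out of bag" for this tree and can be used to score it.
out_of_bag = sorted(set(indices) - set(in_bag))
oob_fraction = len(out_of_bag) / n_samples
print(f"out-of-bag fraction: {oob_fraction:.2f}")
```

On average roughly a third of the points (about $1/e \approx 0.37$) are out of bag for any given tree, which is why the error can be measured during training without a separate split.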

# Variable Importance
- Again use out of bag samples
- Predict class for these samples
- Randomly permute values of one feature
- Predict classes again
- Measure decrease in accuracy
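
The permutation steps above can be sketched as follows. This is a simplified illustration: `predict` is a hypothetical stand-in for a fitted forest, and the whole made-up data set plays the role of the out-of-bag samples.

```python
import random

random.seed(0)

# Made-up data: the label depends only on feature 0; feature 1 is noise.
X = [[random.random(), random.random()] for _ in range(500)]
y = [int(row[0] > 0.5) for row in X]

def predict(rows):
    """Stand-in for a fitted classifier (hypothetical): thresholds feature 0."""
    return [int(row[0] > 0.5) for row in rows]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(X, y, feature):
    """Drop in accuracy after shuffling one feature column."""
    baseline = accuracy(y, predict(X))
    column = [row[feature] for row in X]
    random.shuffle(column)  # break the link between this feature and the label
    X_perm = [row[:] for row in X]
    for row, value in zip(X_perm, column):
        row[feature] = value
    return baseline - accuracy(y, predict(X_perm))

importances = [permutation_importance(X, y, f) for f in range(2)]
print(importances)
```

Shuffling the informative feature costs a lot of accuracy, while shuffling the noise feature costs nothing because the stand-in model never reads it.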

## Continued...
- Measure split criterion improvement
- Record improvements for each feature
- Accumulate over whole ensemble

# Regression
- Preserves the distance between the labels
- A classification problem does not concern itself with "is 1 star closer to 3 stars or 5 stars" in a ranking system.

# k-NN
- How would you modify k-NN for regression?
    - Average the target values of the k nearest neighbours
    - Or, build a weighted average over the k nearest neighbours

```python
from sklearn.neighbors import KNeighborsRegressor
KNeighborsRegressor(n_neighbors=5, weights='uniform')  # plain average of the 5 neighbours
```
- The predictions end up being quite discrete (piecewise constant)

```python
from sklearn.neighbors import KNeighborsRegressor
KNeighborsRegressor(n_neighbors=5, weights='distance')  # closer neighbours count more
```
- Each neighbour is weighted by $\frac{1}{\text{distance}}$
    - This is not the default in scikit-learn, which uses `weights='uniform'`
- You end up overweighting outliers and including them all
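
To see the difference between the two weightings without any library, here is a minimal one-dimensional sketch. The function `knn_predict` and the toy data are hypothetical and written only for illustration; this is not scikit-learn's implementation.

```python
def knn_predict(train_x, train_y, query, k=3, weights="uniform"):
    """1-D k-NN regression with uniform or inverse-distance weighting."""
    neighbours = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    if weights == "uniform":
        return sum(y for _, y in neighbours) / k
    # Distance weighting: an exact match (distance 0) determines the prediction.
    for x, y in neighbours:
        if x == query:
            return y
    w = [1.0 / abs(x - query) for x, _ in neighbours]
    return sum(wi * y for wi, (_, y) in zip(w, neighbours)) / sum(w)

# Toy data: y = x squared at five points
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.0, 4.0, 9.0, 16.0]

uniform_pred = knn_predict(train_x, train_y, 1.2, k=3, weights="uniform")
distance_pred = knn_predict(train_x, train_y, 1.2, k=3, weights="distance")
print(uniform_pred, distance_pred)
```

With uniform weights the three neighbours contribute equally, so the prediction is their plain mean; with distance weights the point at $x = 1$ dominates because it is closest to the query.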

# Decision Tree

# Regression Tree
- You cannot use the Gini impurity
    - Instead you use the squared error
- Again we average, this time over all points in one of the cells
- During training, split in the way that reduces the squared error the most
- Same idea as before
- Train multiple trees in parallel and average
- Different defaults
- max_features = n_features (in scikit-learn's `RandomForestRegressor`, the default `max_features` considers all features at each split)
- squared error
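
Choosing the split "that reduces the squared error the most" can be sketched for a single feature as below. The helper names and the toy data are hypothetical, and real implementations are far more efficient; this only illustrates the criterion.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty cell)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Try each midpoint between sorted x values; keep the lowest total SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Toy data with two obvious groups
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

threshold, cost = best_split(xs, ys)
print(threshold, cost)
```

Each resulting cell then predicts the average of the training targets that fall into it, which matches the averaging step described in the list above.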

# SVM for Regression
- The original incarnation of SVM was purely classification
- We need to rethink the optimization problem

## Boosting
- Adds weighting to the data and smooths out the regression prediction.

# Best Practices
- Typical progress in a machine learning problem is to make large gains early and then fight hard for small gains until the end of the project.

## Generally
- It will be harder than it looks
- Know your application:
    - zero values
    - outliers
    - where do labels come from
        - most labels are generated by humans, so there is judgement bias, which is the source of a lot of noise.
- Document, document, document
    - for yourself and for others
- commit, pull, push, repeat