<H3> Bayesian Probability </H3>

* A core part of machine learning, as well as a fundamentally different viewpoint on statistical theory itself.
* The frequentist view of the world (which is how most people learn statistics) is that one can only make statements based on observed data (i.e. data sampling).
* Bayesians allow for prior beliefs, held before doing any sampling, which the observed data then update to give the posterior belief.
* Helpful in situations where there is not much data. For example, in earthquake modelling there may be only 4 or 5 earthquakes that have ever occurred on some particular fault.
* Bayesian probability statements are also easier to interpret.

--------------

<h4>Q) 
Suppose we are given a coin and told that it could be biased, so the
probability of landing heads is not necessarily 0.5. Let θ denote the
probability of it landing heads. We wish to learn about θ .
We toss the coin N times and obtain Y heads. In frequentist statistics,
the point estimate of θ would be Y / N, and a confidence interval can be
constructed around this.
Is this reasonable? </h4>

Ans:
Well, say we performed 100 tosses and got 48
heads. The point estimate would be θ = 0.48. However in this situation
it may be more reasonable to conclude that the coin isn’t biased as the vast majority of coins in the world are not biased, and observing 48 heads in 100 tosses is a normal outcome from tossing an unbiased coin.

In other words, rather than concluding that θ = 0.48, we may wish to
include prior information to make a more informed judgement.

In a Bayesian analysis, we first need to represent our prior beliefs about
θ . Specifically, we construct a probability distribution p (θ) which
encapsulates our beliefs.

There is no one way to do this! p (θ) represents the beliefs of one
particular person based on their assessment of the prior evidence – it
will not be the same for different people if they have different knowledge
about what proportion of coins are biased. In some cases, p (θ) may be
based on subjective judgement, while in others it may be based on
objective evidence. This is the essence of Bayesian statistics –
probabilities express degrees of beliefs.
However since θ here represents the probability of the coin landing
heads, it must lie between 0 and 1. So the function we use to represent
our beliefs should only have mass in the interval [ 0 , 1 ] .

The Beta distribution only has mass in [ 0 , 1 ], so it is a sensible choice for the probability density function of our prior belief.
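
To make the coin example concrete, here is a minimal sketch that combines a Beta-shaped prior with the 48-heads-in-100-tosses data on a grid of θ values. The Beta(50, 50) prior (a fairly strong belief that the coin is fair) and the helper names are assumptions chosen for illustration, not part of the original question.

```python
def beta_pdf_unnormalised(theta, a, b):
    """Beta(a, b) density up to a constant factor."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1)

def grid_posterior(heads, tosses, a, b, n_grid=1001):
    """Posterior over theta on an evenly spaced grid in [0, 1]."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    # prior x likelihood at each grid point; constant factors cancel on normalising
    weights = [
        beta_pdf_unnormalised(t, a, b) * t ** heads * (1 - t) ** (tosses - heads)
        for t in grid
    ]
    total = sum(weights)
    return grid, [w / total for w in weights]

# Assumed prior: Beta(50, 50), centred on 0.5
grid, post = grid_posterior(heads=48, tosses=100, a=50, b=50)
posterior_mean = sum(t * p for t, p in zip(grid, post))
print(round(posterior_mean, 3))
```

With this prior the posterior mean sits between the frequentist estimate 0.48 and the prior mean 0.5, which is the "more informed judgement" described above.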

----------------------

<h4>Give a high level overview of Bayesian decision making</h4>

We start with prior beliefs p (θ) about the state of the world. After observing the
data y , we update these to give p (θ| y ) . Based on this, we then choose an
action $a_i$ from the set of k actions.

--------------------

<H4>List the fundamental rules of probability</H4>

<b>1) UNION (OR) rule:</b> $$p(a \ or \ b) = p(a) + p(b) - p(both)$$  i.e. $p(a \cup b) = p(a) + p(b) - p(a \cap b)$

<b>2) Product/Joint probability rule:</b>
$$p(a,b)=p(a|b)*p(b) = p(b|a)*p(a) $$

i.e. Specific combination of a & b = prob. of one given the other * prob of the other

<b>3) SUM rule / Marginal distribution:</b>
$$p(a)=\sum_{b} p(a,b) =\sum_{b} p(a|b)*p(b) $$
i.e. p(a) across all the possible values of b

<b>4) Bayes Rule</b>
$$ P(a \mid b) = \frac{P(b \mid a) \, P(a)}{P(b)}=\frac{P(a,b) \,}{P(b)} $$

The denominator here can be expanded using the sum rule, marginalising over every possible value $a'$ of a:

$$ P(a \mid b) =\frac{P(b \mid a) \, P(a)}{\sum_{a'} P(b \mid a') \, P(a')} $$
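
The four rules can be checked numerically on a small joint table. This is a sketch only: the joint distribution below is made up for illustration, and the function names are hypothetical.

```python
# Hypothetical joint distribution p(a, b) over two binary variables.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def p_a(a):
    """Sum rule: marginalise b out of the joint."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def p_b(b):
    """Sum rule: marginalise a out of the joint."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def p_b_given_a(b, a):
    """Product rule rearranged: p(b|a) = p(a,b) / p(a)."""
    return joint[(a, b)] / p_a(a)

def bayes_p_a_given_b(a, b):
    """Bayes rule with the denominator expanded by the sum rule."""
    evidence = sum(p_b_given_a(b, ai) * p_a(ai) for ai in (0, 1))
    return p_b_given_a(b, a) * p_a(a) / evidence

posterior = bayes_p_a_given_b(a=1, b=1)
direct = joint[(1, 1)] / p_b(1)               # p(a,b) / p(b)
p_a1_or_b1 = p_a(1) + p_b(1) - joint[(1, 1)]  # union rule
```

Both routes to p(a=1 | b=1) agree, which is the point of the second form of Bayes rule: the denominator is just p(b) written out.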


Important to know: 
* P(a) is referred to as the prior
* P(b $\mid$ a) is referred to as the likelihood
* P(a $\mid$ b) is referred to as the posterior

-----------------

<h4>Q) Explain the chain rule</h4>

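The chain rule factorises a joint probability into a product of conditional probabilities by applying the product rule repeatedly:

P(A,B,C) = P(A| B,C) P(B,C) = P(A|B,C) P(B|C) P(C)

P(A, B, ..., Z) = P(A| B, ..., Z) P(B| C, ..., Z) ... P(Y|Z) P(Z)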

------------------

<h4> Describe the Beta distribution</h4>
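The Beta distribution only has mass in [ 0 , 1 ] and so it makes a good distribution to use for representing probabilities. Its density is

$$ p(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} $$

where the Beta function B(α, β) is the normalising constant. For example:

* B(1,2) = 0.5
* B(2,1) = 0.5
* B(1,1) = 1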
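
The three values listed above match the Beta function B(a, b) = Γ(a)Γ(b)/Γ(a+b), which normalises the Beta density. A small sketch using only the standard library (the function names are my own):

```python
import math

def beta_function(a, b):
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for 0 <= theta <= 1."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_function(a, b)

print(beta_function(1, 2), beta_function(2, 1), beta_function(1, 1))
uniform_density = beta_pdf(0.3, 1, 1)  # Beta(1, 1) is the uniform distribution
```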

-------------

<h4>Explain prior and posterior probabilities in relation to estimating parameters</h4>

Ans: In a typical inference problem we have an unknown parameter θ which
we wish to estimate. For example, θ may be the mean of a Normal
distribution, or the probability of a particular coin landing heads when
tossed. We also have data Y , such as the outcome of tossing the coin
multiple times. We wish to use the data Y to learn about θ .

The prior distribution p (θ) represents our beliefs about θ before
incorporating the information from the data.
The posterior distribution p (θ| Y ) represents our beliefs about θ
after incorporating the information from the data.
Bayes theorem tells us how to move from p (θ) to p (θ| Y ) . I.e. given we
have some beliefs about θ before seeing the data, it tells us the beliefs
p (θ| Y ) we should have about θ after seeing the data.
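
As a sketch of moving from prior to posterior, suppose θ can take only two values. The candidate values, the 95% prior weight on a fair coin, and the data below are all assumptions made up for illustration.

```python
def posterior_over_hypotheses(prior, likelihoods):
    """Bayes' theorem for a discrete set of candidate values of theta."""
    unnormalised = {h: prior[h] * likelihoods[h] for h in prior}
    evidence = sum(unnormalised.values())  # the denominator of Bayes' theorem
    return {h: w / evidence for h, w in unnormalised.items()}

def bernoulli_likelihood(theta, heads, tails):
    """Probability of one particular sequence with these counts."""
    return theta ** heads * (1 - theta) ** tails

# Assumed: theta is either 0.5 (fair) or 0.8 (biased towards heads),
# and beforehand we believe 95% of coins are fair.
prior = {0.5: 0.95, 0.8: 0.05}
heads, tails = 8, 2
likelihoods = {t: bernoulli_likelihood(t, heads, tails) for t in prior}
posterior = posterior_over_hypotheses(prior, likelihoods)
print(posterior)
```

Seeing 8 heads in 10 tosses shifts belief towards the biased hypothesis, but the strong prior still leaves the fair coin as the more probable explanation.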


---------------------

<h4>Explain what is a conjugate prior</h4>
The prior is conjugate if it belongs to the same family as the posterior. This will be the case if it matches the likelihood in terms of its dependence on the parameter.

If the posterior distributions p(θ|x) are in the same family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. E.g. the Beta distribution is a conjugate prior to the binomial likelihood, and the resulting posterior is also a Beta distribution.

There are three reasons why it's useful if your problem has a conjugate prior: 

* Calculating the posterior for θ is made a lot easier: think about Bayes' theorem and what the denominator is for a simple problem in which θ is one of 2 possible values. The denominator is then a sum of just two terms. <br>Now imagine θ can be any value in a continuous distribution: the denominator becomes an integral, which is much harder to calculate. With a conjugate prior the form of the posterior is already known, so the integral never has to be worked out explicitly.

* A conjugate prior gives you a way to control how much influence the likelihood has in determining the posterior. 
http://lesswrong.com/lw/5sn/the_joys_of_conjugate_priors/

* Every new observation leads only to a change in the values of the parameters of the distribution for θ, as indicated by the sequential learning in § 1; no new algebra needed.
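
The third point can be sketched for the Beta prior and binomial likelihood mentioned above: each batch of tosses only changes the two Beta parameters. The starting prior Beta(2, 2) and the batches are assumed values for illustration.

```python
def update_beta(a, b, heads, tails):
    """Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails) posterior."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

a, b = 2, 2                         # assumed prior
batches = [(3, 1), (1, 4), (5, 0)]  # (heads, tails) seen in each batch
for heads, tails in batches:
    a, b = update_beta(a, b, heads, tails)  # yesterday's posterior is today's prior

all_at_once = update_beta(2, 2, 9, 5)       # same data in a single update
print(a, b, beta_mean(a, b))
```

Updating batch by batch and updating once with the pooled counts give the same posterior, which is why no new algebra is needed as data arrive.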

<h4>Explain what is a predictive function in relation to decision and risk?</h4>

* If you see questions such as p($\theta$ | Y), you are being asked to determine $\theta$ given the historical observations.
* If you see questions such as p(future value | past results), you are being asked to use Bayes to predict future values, folding all your historical observations up to that point into your prior.
* E.g. p(Z | R)

<h4>What is the fundamental equation of Bayesian prediction that uses theorem of total probability?</h4>

If you have
p($\tilde{Y}$ | B) and p(B | Y) <br>
(Y here represents the historical results, so p(B | Y) is the posterior)
<br><br>
then p($\tilde{Y}$ | Y) = $\int$ p($\tilde{Y}$ | B) p(B | Y) dB
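
A minimal sketch of this integral for the coin example, where B plays the role of θ and $\tilde{Y}$ is "the next toss is heads". The Beta(11, 7) posterior is an assumed input, and the integral is approximated by a sum on a grid.

```python
def beta_pdf_unnormalised(theta, a, b):
    """Beta(a, b) density up to a constant factor."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1)

def predictive_prob_heads(a, b, n_grid=10001):
    """Approximate the integral of p(heads | theta) * p(theta | Y) over theta."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    weights = [beta_pdf_unnormalised(t, a, b) for t in grid]
    total = sum(weights)
    # p(heads | theta) = theta, weighted by the normalised posterior
    return sum(t * w for t, w in zip(grid, weights)) / total

a_post, b_post = 11, 7                   # assumed posterior Beta(11, 7)
numeric = predictive_prob_heads(a_post, b_post)
closed_form = a_post / (a_post + b_post)  # posterior mean of a Beta distribution
print(numeric, closed_form)
```

For this model the predictive probability of heads equals the posterior mean, so the grid approximation can be checked against the closed form.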

<h4>Q: Explain conditional independence</h4>

Ans: If a and b are independent then p(a and b) = p(a) * p(b).
Unfortunately total independence is rare. Instead two variables may be independent only under certain scenarios, i.e. once the value of a third variable c is known.
Hence we can say a and b are independent, given c.

If a <i>independent</i> b | c, then p(a,b|c) = p(a|c)*p(b|c)
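
The distinction can be checked numerically. The sketch below builds a joint over three binary variables in which a and b are independent given c by construction; all probabilities are made-up values.

```python
from itertools import product

p_c = {0: 0.5, 1: 0.5}
p_a_given_c = {0: 0.9, 1: 0.2}   # p(a=1 | c)
p_b_given_c = {0: 0.8, 1: 0.1}   # p(b=1 | c)

def bern(p_one, value):
    """Probability of a binary value when p(value=1) = p_one."""
    return p_one if value == 1 else 1 - p_one

# Joint p(a, b, c) = p(c) p(a|c) p(b|c)
joint = {
    (a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
    for a, b, c in product((0, 1), repeat=3)
}

def marginal(**fixed):
    """Sum the joint over every variable not fixed by keyword (a=, b=, c=)."""
    names = ("a", "b", "c")
    return sum(
        p for key, p in joint.items()
        if all(key[names.index(n)] == v for n, v in fixed.items())
    )

# Given c = 0 the joint factorises: p(a,b|c) = p(a|c) p(b|c)
lhs = marginal(a=1, b=1, c=0) / marginal(c=0)
rhs = (marginal(a=1, c=0) / marginal(c=0)) * (marginal(b=1, c=0) / marginal(c=0))

# Without conditioning on c, a and b are not independent
p_ab = marginal(a=1, b=1)
p_a_times_p_b = marginal(a=1) * marginal(b=1)
print(lhs, rhs, p_ab, p_a_times_p_b)
```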


--------------

<h4> Q: What is a loss function?</h4>

Also known as a cost function. The loss function $L(\theta, \hat{\theta})$ defines the loss incurred if we estimate the true
value of θ by $\hat{\theta}$.

Expected loss = loss matrix entry ($L_{kj}$) * the joint probability of observing x and each possible class C. Because the solution space is continuous rather than discrete, we are interested in the probability of a region, not a point, hence the integral. We then sum this loss over all k and j (i.e. over every entry in the loss matrix).

In the loss matrix, k and j represent the class labels (e.g. isCancer, isNormal). 

<img src="img/lossMatrix.png" height="100" width="150">

Down the side it's what you say the label is, along the top it's what it actually is, and the value in the cell where they meet is some loss you devise (in this case the loss may be heavier if you say isNormal when it's actually isCancer).
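
Here is a sketch of using such a loss matrix to pick a label. The loss values and the class probabilities are assumptions for illustration; the layout follows the description above (rows are what you say, columns are what it actually is).

```python
labels = ["isCancer", "isNormal"]

# loss[said][actual]; values are made up for illustration.
loss = {
    "isCancer": {"isCancer": 0, "isNormal": 1},
    "isNormal": {"isCancer": 100, "isNormal": 0},
}

def expected_loss(said, posterior):
    """Average loss of saying `said`, weighted by how probable each true label is."""
    return sum(loss[said][actual] * posterior[actual] for actual in labels)

def best_label(posterior):
    """Pick the label with the smallest expected loss."""
    return min(labels, key=lambda said: expected_loss(said, posterior))

posterior = {"isCancer": 0.05, "isNormal": 0.95}  # assumed p(class | x)
decision = best_label(posterior)
print(decision, expected_loss("isCancer", posterior), expected_loss("isNormal", posterior))
```

Even though isNormal is far more probable here, the heavy penalty for missing a cancer makes isCancer the lower-loss decision.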

----------------

<h4>Q: What is maximum likelihood estimation</h4>

A method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters.

In formulas:<br>
$\theta$ represents the parameter <br>
$\theta$ can be one or more variables (e.g. mean and std. dev)<br>

Max $p(data |  \theta)$ across all possible values of $\theta$<br>
= Max $\prod_i p(X_i = x_i|\theta)$ across all possible values of $\theta$ (assuming the observations are independent)

Pros:
- Easy to compute & interpret<br>
- Asymptotically consistent (converges towards the true solution as the size of the data, N, increases)
- Lowest asymptotic variance (lowest possible error)
- Invariant: any transformation applied to the real $\theta$ can also be applied to the MLE of $\theta$

Cons:<br>
- Point estimate - no indication of how much uncertainty there is
- $\theta$ may not be unique - could have more than 1 solution

-----------------

<h4>Q: What is a decision region, decision boundary?</h4>

Ans: 
* Decision region - a subset of your solution space that has been labelled as one classification.
* Decision boundary - the boundary between decision regions.

Decision Theory is concerned with making a decision based on probabilities and particularly Bayes Theorem.
How this is applied to machine learning and classification is examined here.

<h4>Q: Describe minimizing the risk of misclassification</h4>
    
For classification we can either minimize the probability of misclassification or maximize the probability of correct classification. We can model the decision in terms of Bayes' theorem and then pick the classification with whichever option has the lower probability of being wrong (minimization of risk) or, equivalently, the higher probability of being correct.

<h4>What is Utility?</h4>
Ans: The opposite of loss, U = -L

---------------------

#### Q: Explain Likelihood, Maximum Likelihood, and Log-likelihood

<b>Likelihood:</b>

Likelihood is the reverse of knowing a probability distribution and asking about the probability of seeing a value from that distribution. Likelihood instead holds the observed values fixed and asks how probable they are under each candidate setting of the distribution's parameters.

$$ L(\theta) = p(X|\theta) = \prod^N_{n=1} p(X_n|\theta) $$

The likelihood of a model with parameter(s) $\theta$ = the probability of seeing the data sample X given $\theta$ = the probabilities of each x given $\theta$ multiplied together (assuming each x is independent)


<b>Log-likelihood</b>

For computational reasons it's better to work with the log-likelihood.

$$ \ln L(\theta) = \sum^N_{n=1} \ln p(X_n|\theta) $$

Note: By using log you can move from a 'product of' equation to a 'sum of' equation as <a href="https://people.richland.edu/james/lecture/m116/logs/properties.html">"the log of a product is the sum of the logs"</a>

<b> Maximum Likelihood Estimation</b>

Think back to generative models where you are trying to build an internal view of an unknown external world based on your observations. This is what maximum likelihood estimation (MLE) seeks to do. It is a way of determining the parameter(s) $\theta$ of whatever model you assume the external world to be, so as to maximize the chances of the values X being observed.

$$ \hat{\theta} _{MLE} = \underset{\theta}{\arg\max} \sum\limits_{i=1}^n \log f(x_i|\theta)  $$

In practice, maximizing the log-likelihood is done by minimizing the negative log-likelihood  $- \sum^N_{n=1} \ln p(X_n|\theta) $ 


To find the minimum of a function we need to find the point at which the gradient is 0.

<b> Maximum Likelihood Estimation for a Gaussian distribution</b>

If we suspect the external model to be a Gaussian distribution then the process of determining the parameters simplifies to:

mean: $\hat\mu = \frac{1}{N} \sum^N_{n=1} x_n$ - i.e. the mean of your sample

variance: $\hat\sigma^2 = \frac{1}{N} \sum^N_{n=1} (x_n - \hat\mu)^2$ - i.e. the variance of your sample (dividing by N rather than N - 1)
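
A short sketch of the Gaussian case: compute the two estimates directly, then confirm that nudging either parameter raises the negative log-likelihood. The data values are made up and the function names are my own.

```python
import math

def gaussian_mle(xs):
    """Maximum likelihood estimates of the mean and variance of a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # divides by N, not N - 1
    return mu, var

def negative_log_likelihood(xs, mu, var):
    """Minus the sum of the Gaussian log-densities of the sample."""
    return sum(
        0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
        for x in xs
    )

data = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat, var_hat = gaussian_mle(data)
nll_at_mle = negative_log_likelihood(data, mu_hat, var_hat)

# Moving away from the estimates should never lower the negative log-likelihood
nll_shifted_mean = negative_log_likelihood(data, mu_hat + 0.1, var_hat)
nll_shifted_var = negative_log_likelihood(data, mu_hat, var_hat * 1.5)
print(mu_hat, var_hat, nll_at_mle)
```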




Good video: https://www.youtube.com/watch?v=TaotW-u6eys


--------------

#### Q: Imagine you observe two samples of data that come from two Gaussian distributions. You then receive a new data point. How would you use Maximum Likelihood Estimation to determine which of the two underlying classes the data point belongs to?

Ans:

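Assign to class 1 if $ P(x | \mu_1^*, \Sigma_1^*) > P(x | \mu_0^*, \Sigma_0^*)$, otherwise to class 0, where $\mu_k^*$ and $\Sigma_k^*$ are the maximum likelihood estimates of the mean and covariance for class k.

i.e. if the probability of x happening given class 1 is greater than the probability of x happening given class 0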
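
A minimal one-dimensional sketch of this rule, so each class has a variance rather than a full covariance matrix. The two samples are invented and the helper names are hypothetical.

```python
import math

def fit_gaussian(xs):
    """Maximum likelihood mean and variance of a 1-D sample."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(x, params0, params1):
    """Return 1 if x is more probable under class 1's fitted Gaussian, else 0."""
    return 1 if gaussian_pdf(x, *params1) > gaussian_pdf(x, *params0) else 0

sample0 = [0.8, 1.1, 1.3, 0.9, 1.0]   # observations from class 0
sample1 = [3.9, 4.2, 4.0, 3.7, 4.4]   # observations from class 1
params0 = fit_gaussian(sample0)
params1 = fit_gaussian(sample1)

print(classify(1.2, params0, params1), classify(3.8, params0, params1))
```

Note that this compares likelihoods only; it does not weight the classes by how common each one is.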

#### Q: What is the formula for calculating risk?

$$Risk(action_i) = \int L(\theta, action_i) \, p(\theta|y) \, d\theta $$

or, when θ takes discrete values,

$$Risk(action_i) = \sum_{\theta} L(\theta,action_i) \, p(\theta|y) $$

In English:

Risk of doing an action = the sum, over every possible outcome θ, of: 
    - the posterior prob. of that outcome given data y (which already combines the prior prob. of the outcome with the likelihood), 
    - times the size of the loss if that outcome happened and you did that action

--------------

<h4>Q: What are the two main philosophies to predicting the class of something given x, p(c|x)?</h4>

Ans: The empirical risk approach (based on the empirical distribution of the data) and the Bayesian decision approach.

<h4>Q: Explain zero-one loss utility</h4>
Ans: Measuring the prediction performance based on the count of correct predictions: each prediction scores 1 if correct, else 0, and the scores are summed. A class-weighted variant scores a correct prediction 1 or 2 depending on which class it was, else 0.

The expected utility of predicting class j is:

$$ U(c^* = j) = \sum_i U_{ij} \, p(c^{true} = i|x^*) $$

<H4>Explain the Trapezoid rule</H4>

The Trapezoid rule is one method for calculating the integral (space under a curve) in cases where we cannot use an analytic solution for the indefinite integral.


<img src="img/0005.png" height="100" width="200">

It works by calculating the average value and multiplying it by the width:

$$\int_a^bf(x)dx \approx \frac{f(a)+f(b)}{2}(b-a)$$

If you wish to gain more accuracy, you can divide the interval into several narrower strips, apply the rule to each strip separately, and add up the results.
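
A small sketch of the rule with an optional number of strips. The test function x² on [0, 1], whose exact integral is 1/3, is just an assumed example.

```python
def trapezoid(f, a, b, n=1):
    """Composite trapezoid rule with n equal-width strips on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))       # end points count half
    for i in range(1, n):
        total += f(a + i * h)         # interior points count fully
    return total * h

def square(x):
    return x * x

one_strip = trapezoid(square, 0.0, 1.0, n=1)      # the single-strip formula above
many_strips = trapezoid(square, 0.0, 1.0, n=100)
print(one_strip, many_strips)
```

With one strip the estimate is (0 + 1)/2 × 1 = 0.5; with 100 strips it is much closer to 1/3.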


<b>Explain Non-informative priors</b>
 
If we have no strong prior beliefs about μ, we can use an uninformative
prior such as Gamma(0,0). Without any prior knowledge, our posterior mean is simply the empirical mean, and the posterior variance is the empirical variance.

So we can still do Bayesian inference even when we have no strong prior beliefs about the parameter (μ here): just take a prior that has a very high variance, like this one.

<H4>Explain numerical integration</h4>
Numeric integration is a set of techniques for solving definite integrals
in cases where we cannot find an analytic solution for the indefinite
integral. I.e. it lets us do integrals of the form:

$$\int_a^bf(x)dx$$

When doing Bayesian analysis on any non-trivial problem, some form of
numeric integration will usually be required. Indeed, while most of the
theory and mathematics of Bayesian inference was initially worked out
during the years 1900-1970, it was only with the widespread availability
of fast computers able to quickly perform numeric integration that it
became popular.