<H3> Bayesian Statistics </h3>

The Bayesian theory of probability is a core part of machine learning as well as a fundamentally different viewpoint on statistical theory itself. The frequentist view of the world is that one can only make statements based on observed data (i.e. data sampling). Bayesians allow prior beliefs about the parameters, held before any sampling is done, to be combined with the data to give a posterior belief.

This is helpful in situations where there is not much data. For example, in earthquake
modelling there may be only 4 or 5 earthquakes that have ever
occurred on some particular fault.

Bayesian probability statements are also easier to interpret, which is
important when communicating with non-statisticians. A frequentist 95% confidence interval for a parameter θ does not mean that we are 95% sure that θ lies in the interval. However, Bayesian credible intervals do have this interpretation.

--------------

#### Give a high level overview of Bayesian decision making

We start with prior beliefs p(θ) about the state of the world. After observing the
data y, we update these to give the posterior p(θ|y). Based on this, we then choose an
action $a_i$ from the set of k available actions.

--------------------

<H4>List the fundamental rules of probability</H4>

<b>1) UNION (OR) rule:</b> $$p(a \ or \ b) = p(a) + p(b) - p(both)$$  i.e. p(a union b) = p(a) + p(b) - p(a intersect b)

<b>2) Product/Joint probability rule:</b>
$$p(a,b)=p(a|b)*p(b) = p(b|a)*p(a) $$

i.e. Specific combination of a & b = prob. of one given the other * prob of the other

<b>3) SUM rule / Marginal distribution:</b>
$$p(a)=\sum_{b} p(a,b) =\sum_{b} p(a|b)*p(b) $$
i.e. p(a) across all the possible values of b

<b>4) Bayes Rule</b>
$$ P(a \mid b) = \frac{P(b \mid a) \, P(a)}{P(b)}=\frac{P(a,b) \,}{P(b)} $$

The denominator here can be expanded using the sum (marginal distribution) rule, summing over every possible value $a'$ of a:

$$ P(a \mid b) =\frac{P(b \mid a) \, P(a)}{\sum_{a'} P(b \mid a') \, P(a')} $$
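
The four rules can be checked numerically on a small joint table. This is only a sketch: the joint probabilities and the helper names (`marginal_a`, `bayes_a_given_b`, and so on) are made up for illustration.

```python
# Hypothetical joint distribution p(a, b) over two binary variables.
joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def marginal_a(a):
    # Sum rule: p(a) = sum over b of p(a, b)
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    # Sum rule: p(b) = sum over a of p(a, b)
    return sum(p for (_, bi), p in joint.items() if bi == b)

def cond_a_given_b(a, b):
    # Product rule rearranged: p(a | b) = p(a, b) / p(b)
    return joint[(a, b)] / marginal_b(b)

def cond_b_given_a(b, a):
    # Product rule rearranged: p(b | a) = p(a, b) / p(a)
    return joint[(a, b)] / marginal_a(a)

def bayes_a_given_b(a, b):
    # Bayes rule, with the denominator expanded by summing over every value of a.
    numerator = cond_b_given_a(b, a) * marginal_a(a)
    denominator = sum(cond_b_given_a(b, a2) * marginal_a(a2) for a2 in (0, 1))
    return numerator / denominator

# Union rule: p(a=1 or b=1) = p(a=1) + p(b=1) - p(a=1, b=1)
p_union = marginal_a(1) + marginal_b(1) - joint[(1, 1)]

direct = cond_a_given_b(1, 1)      # read straight off the joint table
via_bayes = bayes_a_given_b(1, 1)  # computed through Bayes rule
print(p_union, direct, via_bayes)
```

Both routes to p(a=1 | b=1) should agree, and the union probability should equal 1 minus p(a=0, b=0).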

-----------------

<h4>Explain prior and posterior probabilities in relation to estimating parameters</h4>

Ans: In a typical inference problem we have an unknown parameter θ which
we wish to estimate. For example, θ may be the mean of a Normal
distribution, or the probability of a particular coin landing heads when
tossed. We also have data Y, such as the outcome of tossing the coin
multiple times. We wish to use the data Y to learn about θ.

The prior distribution p(θ) represents our beliefs about θ before
incorporating the information from the data.
The posterior distribution p(θ|Y) represents our beliefs about θ
after incorporating the information from the data.
Bayes theorem tells us how to move from p(θ) to p(θ|Y), i.e. given we
have some beliefs about θ before seeing the data, it tells us the beliefs
p(θ|Y) we should have about θ after seeing the data.


---------------------

<h4> Describe the Beta distribution</h4>
Excellent explanation: http://stats.stackexchange.com/questions/47771/what-is-the-intuition-behind-beta-distribution

beta(1,2) = 0.5

beta(2,1) = 0.5

beta(1,1) = 1
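
The three values above match the Beta function $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta) / \Gamma(\alpha + \beta)$, which is the normalising constant of the Beta density. Assuming that is what they are meant to show, a small sketch (the function names are illustrative) reproduces them and uses them to evaluate the density:

```python
import math

def beta_function(a, b):
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_pdf(theta, a, b):
    # Density of Beta(a, b) at theta, for 0 <= theta <= 1.
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_function(a, b)

for a, b in [(1, 2), (2, 1), (1, 1)]:
    print(a, b, beta_function(a, b))

# Beta(1, 1) is the uniform distribution, so its density is 1 everywhere on [0, 1].
print(beta_pdf(0.3, 1, 1))
```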

-------------

<h4>
A new medical screening test is developed to assess whether a patient
has a particular disease. The test is advertised to have the following
degrees of accuracy: "if the patient truly has the disease, then the test
will correctly detect this and return a positive result with probability 0.95.
If the patient truly does not have the disease, the test will correctly detect
this and return a negative result with probability 0.98"
Given that 1 in 1000 people in the population have the disease, what is
the chance that a person testing positive on the test really has the
disease?</h4>

Ans:

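* p(d) = 1/1000 = 0.001
* p(pos|d) = 0.95
* p(not pos | not d) = 0.98, so p(pos | not d) = 0.02

* Therefore: p(d|pos) = P(pos | d)P(d) / P(pos), where P(pos) = P(pos | d)P(d) + P(pos | not d)P(not d):

$(0.95 * 0.001) / (0.95 * 0.001 + 0.02 * 0.999) = 0.045$

i.e. only about 4.5% of people who test positive actually have the disease.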
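
The same calculation as a short sketch. The function name and the `sensitivity` / `specificity` parameter names are my own labels for p(pos|d) and p(not pos | not d).

```python
def posterior_given_positive(prior, sensitivity, specificity):
    # Numerator of Bayes rule: p(pos | d) * p(d)
    true_positive = sensitivity * prior
    # Other term of the denominator: p(pos | not d) * p(not d)
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)

p_disease = posterior_given_positive(prior=0.001, sensitivity=0.95, specificity=0.98)
print(round(p_disease, 3))
```

Changing `prior` shows how strongly the answer depends on how rare the disease is: with a prior of 0.1 the posterior rises to roughly 0.84.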

--------------

<h4>Your company is launching a new product and is trying to determine whether a customer will buy the product after viewing it. Based on gut feel Tom thinks the probability distribution for whether someone will buy it is centered around 0.5 (half the people will buy it) and is quite confident about this. His colleague Jill also thinks it's 0.5 but thinks there's a wider amount of uncertainty around it. From an early test you carried out you found that 48% of people (out of 100 people who viewed it) purchased the product. How would you go about defining the probability distribution on what the true probability of buying the product is and the variance around it?</h4>

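1. Each customer either buys or does not buy, so the number of buyers out of n viewers can be represented with a binomial distribution (a series of successes and failures).
2. The best way to represent the prior expectations of Tom and Jill is with the Beta distribution. The domain of the Beta distribution is (0, 1), just like a probability, so we already know we're on the right track, but the appropriateness of the Beta for this task goes far beyond that: it is the conjugate prior for the binomial, so the posterior is also a Beta distribution.
3. We can ask Tom and Jill for estimates of the mean and variance and then construct the Beta distribution that matches each view. Both priors have mean 0.5; Tom's has a small variance and Jill's a larger one.
4. Observing 48 buyers out of 100 then updates a Beta(α, β) prior to a Beta(α + 48, β + 52) posterior, whose mean and variance answer the question.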
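
A sketch of this approach. The standard deviations given to Tom (0.05) and Jill (0.15) are invented for illustration, since the question only says that one is more confident than the other. The Beta parameters are found by matching the mean and variance, and the update uses the standard Beta-binomial conjugate result (posterior = Beta(α + successes, β + failures)).

```python
def beta_from_mean_sd(mean, sd):
    # Method of moments: solve mean = a/(a+b) and var = ab/((a+b)^2 (a+b+1)) for a, b.
    common = mean * (1 - mean) / sd ** 2 - 1
    if common <= 0:
        raise ValueError("sd is too large for a Beta distribution with this mean")
    return mean * common, (1 - mean) * common

def update(a, b, successes, trials):
    # Conjugate update of a Beta(a, b) prior with binomial data.
    return a + successes, b + (trials - successes)

def beta_mean_var(a, b):
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# Hypothetical prior beliefs: same mean, different confidence.
priors = {
    "Tom": beta_from_mean_sd(0.5, 0.05),
    "Jill": beta_from_mean_sd(0.5, 0.15),
}

# Early test from the question: 48 purchases out of 100 viewers.
posteriors = {name: update(a, b, 48, 100) for name, (a, b) in priors.items()}

for name, (a, b) in posteriors.items():
    mean, var = beta_mean_var(a, b)
    print(name, round(a, 2), round(b, 2), round(mean, 4), round(var, 6))
```

Because Tom's prior is tighter, his posterior mean stays closer to 0.5, while Jill's moves further towards the observed 0.48.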

------------------

<h4>Q) 
Suppose we are given a coin and told that it could be biased, so the
probability of landing heads is not necessarily 0.5. Let θ denote the
probability of it landing heads. We wish to learn about θ.
We toss the coin N times and obtain Y heads. In frequentist statistics,
the point estimate of θ would be Y / N, and a confidence interval can be
constructed around this.
Is this reasonable? </h4>

Ans:
Well, say we performed 100 tosses and got 48
heads. The point estimate would be θ = 0.48. However, in this situation
it may be more reasonable to conclude that the coin isn’t biased, as the vast majority of coins in the world are not biased, and observing 48 heads in 100 tosses is a normal outcome from tossing an unbiased coin.

In other words, rather than concluding that θ = 0.48, we may wish to
include prior information to make a more informed judgement.

In a Bayesian analysis, we first need to represent our prior beliefs about
θ. Specifically, we construct a probability distribution p(θ) which
encapsulates our beliefs.

There is no one way to do this! p(θ) represents the beliefs of one
particular person based on their assessment of the prior evidence, so it
will not be the same for different people if they have different knowledge
about what proportion of coins are biased. In some cases, p(θ) may be
based on subjective judgement, while in others it may be based on
objective evidence. This is the essence of Bayesian statistics:
probabilities express degrees of belief.
However, since θ here represents the probability of the coin landing
heads, it must lie between 0 and 1. So the function we use to represent
our beliefs should only have mass in the interval [0, 1].

The Beta distribution only has mass in [0, 1], so it is a sensible choice for the probability density function of our prior belief.

----------------------

<h4>Q) Explain the chain rule</h4>

P(A,B,C) = P(A| B,C) P(B,C) = P(A|B,C) P(B|C) P(C)

P(A, B, ..., Z) = P(A| B, ..., Z) P(B| C, ..., Z) ... P(Y|Z) P(Z)

------------------

#### Q: What is the difference between generative and discriminative models?

* Discriminative models learn the (hard or soft) boundary between classes.
* Generative models model the distribution of individual classes. I.e. they build a model of the external world.

<h4>Q: Explain conditional independence</h4>

Ans: If a and b are independent then p(a and b) = p(a) * p(b).
Unfortunately, total independence is rare. Instead, two variables may be independent under certain scenarios.
Hence we can say that a and b are independent, given c.

If a <i>independent</i> b | c, then p(a,b|c) = p(a|c)*p(b|c)

i.e. If a and b are independent given c, then once c is known, the probability of both a and b is the probability of a given c times the probability of b given c. Equivalently, p(a|b,c) = p(a|c): once c is known, learning b tells us nothing more about a.
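
A small sketch of the difference between independence, p(a,b) = p(a)p(b), and conditional independence, p(a,b|c) = p(a|c)p(b|c). The probabilities below are invented; the joint is built so that a and b are independent given c, yet they turn out not to be independent once c is summed out.

```python
# Hypothetical model: c is a common cause of a and b (all binary).
p_c = {0: 0.5, 1: 0.5}
p_a1_given_c = {0: 0.9, 1: 0.2}  # p(a=1 | c)
p_b1_given_c = {0: 0.8, 1: 0.1}  # p(b=1 | c)

def bernoulli(p_one, value):
    # Probability of `value` (0 or 1) when p(value=1) = p_one.
    return p_one if value == 1 else 1 - p_one

def joint(a, b, c):
    # Built as p(c) * p(a|c) * p(b|c), so a and b are independent given c.
    return p_c[c] * bernoulli(p_a1_given_c[c], a) * bernoulli(p_b1_given_c[c], b)

# Conditional on c = 0, the joint factorises.
p_ab_given_c0 = joint(1, 1, 0) / p_c[0]
product_given_c0 = p_a1_given_c[0] * p_b1_given_c[0]

# Marginally (summing c out), it does not.
p_ab = sum(joint(1, 1, c) for c in (0, 1))
p_a = sum(joint(1, b, c) for b in (0, 1) for c in (0, 1))
p_b = sum(joint(a, 1, c) for a in (0, 1) for c in (0, 1))

print(p_ab_given_c0, product_given_c0)  # equal
print(p_ab, p_a * p_b)                  # not equal
```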


--------------

<h4>Q: What is Expectation? </h4>

* Another way of saying average
* The average value of some function f(x) under a probability distribution p(x)

$ \mathbf{E}(f) = \displaystyle \sum_x p(x)f(x)  $ -- discrete case

$ \mathbf{E}(f) = \displaystyle \int p(x)f(x)dx  $ -- continuous case

* Expectation can be estimated from N samples $x_1, ..., x_N$ drawn from the probability distribution p(x):

$ \mathbf{E}(f) \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n)$

Note: Multiplying by $\frac{1}{N}$ is the same as dividing by N. You'll see this a lot in machine learning formulas.
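
A minimal sketch of the sample-based estimate. The choice of f(x) = x² and a standard Normal for p(x) is mine, picked because the exact answer is known (E[x²] = 1).

```python
import random

def monte_carlo_expectation(f, sampler, n):
    # E[f] is approximated by (1/N) * sum of f(x_n), with x_n drawn from p(x).
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is repeatable

estimate = monte_carlo_expectation(
    f=lambda x: x * x,
    sampler=lambda: rng.gauss(0.0, 1.0),
    n=100_000,
)
print(estimate)  # should be close to 1
```

The estimate gets closer to the true value as N grows; the error shrinks roughly in proportion to 1/sqrt(N).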

---------------

#### Q: What is variance and covariance?

* <b>Variance</b> is the amount of variability around the Expectation:

$$ var[f] = \mathbf{E}[(f(x) - \mathbf{E}[f(x)])^2]= \mathbf{E}[f(x)^2] - \mathbf{E}[f(x)]^2  $$


This translates to: Variance = the mean of the squared values minus the square of the mean.

<br>
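* <b>Covariance</b>: Measures the joint variability between two variables: $cov[x, y] = \mathbf{E}[(x - \mathbf{E}[x])(y - \mathbf{E}[y])] = \mathbf{E}[xy] - \mathbf{E}[x]\mathbf{E}[y]$. It is positive when the two tend to move together and negative when they move in opposite directions.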
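
Both quantities can be computed with the "mean of the product minus the product of the means" form. This is a sketch on a tiny made-up dataset, treating each point as equally likely (so these are population-style values that divide by N).

```python
def expectation(values):
    # Average under a uniform distribution over the listed values.
    return sum(values) / len(values)

def variance(xs):
    # var[x] = E[x^2] - E[x]^2
    return expectation([x * x for x in xs]) - expectation(xs) ** 2

def covariance(xs, ys):
    # cov[x, y] = E[xy] - E[x] E[y]
    products = [x * y for x, y in zip(xs, ys)]
    return expectation(products) - expectation(xs) * expectation(ys)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # ys = 2 * xs, so the two move together

print(variance(xs))        # 1.25
print(covariance(xs, ys))  # 2.5
```

Note that the covariance of a variable with itself is just its variance.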


------------

#### Q: What does this show?

<img src="img/covariance.png" height="200" width="400">

Ans: 

------------------

<h4>Q: What is maximum likelihood estimation</h4>

A method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters.

In formulas:<br>
$\theta$ represents the parameter <br>
$\theta$ can be one or more variables (e.g. mean and std. dev)<br>

Max $p(data |  \theta)$ across all possible values of $\theta$<br>
=Max $\prod_i p(X_i =x_i|\theta)$ across all possible values of $\theta$ (assuming the observations are independent)

Pros:
- Easy to compute & interpret<br>
- Asymptotically consistent (converges towards the true solution as the size of the data, N, increases)
- Asymptotically efficient (under regularity conditions, it attains the lowest possible asymptotic variance)
- Invariant: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$ for any transformation $g$

Cons:<br>
- Point estimate - no indication of how much uncertainty there is
- $\theta$ may not be unique - could have more than 1 solution
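
To make "maximize over all possible values of θ" concrete, here is a brute-force sketch for the coin example used earlier (48 heads in 100 tosses). A grid search is used purely for illustration; in practice the maximum is usually found analytically or with an optimizer.

```python
import math

def bernoulli_log_likelihood(theta, heads, tosses):
    # log p(data | theta) for independent tosses with p(heads) = theta.
    return heads * math.log(theta) + (tosses - heads) * math.log(1 - theta)

# Candidate values of theta strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]

theta_hat = max(grid, key=lambda t: bernoulli_log_likelihood(t, heads=48, tosses=100))
print(theta_hat)
```

The search lands on the same value as the analytic MLE for this model, heads / tosses. As the cons above note, the result is a single point with no indication of uncertainty.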

-----------------

## Decision Theory

Decision Theory is concerned with making a decision based on probabilities and particularly Bayes Theorem.
How this is applied to machine learning and classification is examined here.

<h4>Q: What is a decision region, decision boundary?</h4>

Ans: 
* Decision region - a subset of your solution space that has been labelled as one classification.
* Decision boundary - the boundary between decision regions.

<h4>Q: Describe minimizing the risk of misclassification</h4>
    
For classification we can either minimize the probability of misclassification or, equivalently, maximize the probability of correct classification. We can model the decision in terms of Bayes theorem and then pick the classification with the lowest probability of being wrong, which is the class with the highest posterior probability.

<h4> Q: What is a loss function?</h4>

Ans: Also known as a cost function. 
    
$$ \mathbf{E}[L] = \sum_k \sum_j \int_{R_j} L_{kj}p(x,C_k)dx $$

Translates to: Expected Loss = the loss matrix entry ($L_{kj}$) times the joint probability of observing x and the class being $C_k$, integrated over the decision region $R_j$ in which we assign class j. The input space is continuous rather than discrete (so we are interested in the probability of a region, not a point), hence the integral. We then sum this loss over all k and j, i.e. over every entry of the loss matrix.

In the loss matrix, k and j represent the class labels (e.g. isCancer, isNormal). 

<img src="img/lossMatrix.png" height="100" width="150">

Down the side it's what you say the label is, along the top it's what it actually is, and the value in each cell is a loss that you choose (in this case the loss may be heavier if you say isNormal when it's actually isCancer)
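
For a single input x, minimizing expected loss means picking the label whose posterior-weighted loss is smallest. A sketch follows; the loss values and the posterior are invented for illustration and are not taken from the image above.

```python
classes = ["isCancer", "isNormal"]

# Hypothetical losses, keyed by (actual class, predicted class).
# Missing a cancer is made far more costly than a false alarm.
loss = {
    ("isCancer", "isCancer"): 0,
    ("isCancer", "isNormal"): 1000,
    ("isNormal", "isCancer"): 1,
    ("isNormal", "isNormal"): 0,
}

def expected_loss(predicted, posterior):
    # Sum over the actual classes of loss * p(actual class | x).
    return sum(loss[(actual, predicted)] * posterior[actual] for actual in classes)

def best_decision(posterior):
    return min(classes, key=lambda predicted: expected_loss(predicted, posterior))

# Hypothetical posterior p(class | x) for one patient.
posterior = {"isCancer": 0.01, "isNormal": 0.99}

for label in classes:
    print(label, expected_loss(label, posterior))
print(best_decision(posterior))
```

Even though isNormal is by far the more probable class here, the asymmetric losses make isCancer the lower-risk decision.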

----------------

<h4>What is Utility?</h4>
Ans: The opposite of loss, U = -L

---------------------

#### Q: Explain Likelihood, Maximum Likelihood, and Log-likelihood

<b>Likelihood:</b>

Likelihood reverses the usual question. With a known probability distribution we ask how probable a value is under that distribution. With likelihood, the observed values are fixed and we ask how probable they are under each candidate setting of the distribution's parameters.

$$ L(\theta) = p(X|\theta) = \prod^N_{n=1} p(X_n|\theta) $$

The likelihood of a model with parameter(s) $\theta$ = the probability of seeing the data sample X given $\theta$ = the probability of each x given $\theta$ multiplied together (assuming each x is independent)


<b>Log-likelihood</b>

For computational reasons it's better to work with the log-likelihood.

$$ ln \, L(\theta) = \sum^N_{n=1} ln\,p(X_n|\theta) $$

Note: By using log you can move from a 'product of' equation to a 'sum of' equation as <a href="https://people.richland.edu/james/lecture/m116/logs/properties.html">"the log of a product is the sum of the logs"</a>

<b> Maximum Likelihood Estimation</b>

Think back to generative models where you are trying to build an internal view of an unknown external world based on your observations. This is what Maximum likelihood estimation (MLE) seeks to do. It is a way of determining the parameter(s) $\theta$ of whatever model you assume the external world to be, so as to maximize the chances of the values X being observed.

$$ \hat{\theta} _{MLE} = \underset{\theta}{\arg\max} \sum\limits_{i=1}^n \log f(x_i|\theta)  $$

In practice, maximizing the log-likelihood is usually done by minimizing the negative log-likelihood  $- \sum^N_{n=1} ln\,p(X_n|\theta) $ 


To find the minimum of a function we look for a point at which the gradient is 0 (a necessary condition; we then check that it is a minimum rather than a maximum or saddle point).

<b> Maximum Likelihood Estimation for a Gaussian distribution</b>

If we suspect the external model to be a Gaussian distribution then the process of determining the parameters simplifies to:

mean = $\hat\mu = \frac{1}{N} \sum^N_{n=1} x_n$ - i.e. the mean of your sample

variance = $\hat\sigma^2 = \frac{1}{N} \sum^N_{n=1} (x_n - \hat\mu)^2$ - i.e. the variance of your sample (dividing by N, not N - 1)
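
The two closed-form estimates as a sketch on a small made-up sample. The comparison against `statistics.pvariance` is only a sanity check that the divide-by-N form is being used.

```python
import statistics

def gaussian_mle(xs):
    # Closed-form maximum likelihood estimates for a univariate Gaussian.
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divides by N, not N - 1
    return mu_hat, var_hat

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
mu_hat, var_hat = gaussian_mle(sample)

print(mu_hat, var_hat)
print(statistics.pvariance(sample))  # population variance, also divides by N
```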




Good video: https://www.youtube.com/watch?v=TaotW-u6eys


--------------

#### Q: Imagine you observe two samples of data that come from two Gaussian distributions. You then receive a new data point. How would you use Maximum Likelihood Estimation to determine which of the two underlying classes the data point belongs to?

Ans:

Fit each class by maximum likelihood to get $(\mu_0^*, \Sigma_0^*)$ and $(\mu_1^*, \Sigma_1^*)$, then assign based on whether $ P(x | \mu_1^*, \Sigma_1^*) > P(x | \mu_0^*, \Sigma_0^*)$

i.e. assign to class 1 if the probability of x given class 1 is greater than the probability of x given class 0, and to class 0 otherwise.
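
A sketch of this procedure for the one-dimensional case, where each class has just a mean and a variance. The two samples are invented, and log-densities are compared rather than densities, which gives the same decision.

```python
import math

def gaussian_mle(xs):
    # Maximum likelihood mean and variance of a univariate Gaussian.
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

def gaussian_log_pdf(x, mu, var):
    # Log of the Gaussian density N(x | mu, var).
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def classify(x, sample0, sample1):
    # Fit each class separately, then pick the class under which x is more likely.
    mu0, var0 = gaussian_mle(sample0)
    mu1, var1 = gaussian_mle(sample1)
    ll0 = gaussian_log_pdf(x, mu0, var0)
    ll1 = gaussian_log_pdf(x, mu1, var1)
    return 1 if ll1 > ll0 else 0

# Hypothetical observed samples from the two classes.
class0 = [1.0, 1.5, 2.0, 2.5, 3.0]
class1 = [6.0, 6.5, 7.0, 7.5, 8.0]

print(classify(2.2, class0, class1))
print(classify(6.8, class0, class1))
```

This compares likelihoods only. If one class is known to be much more common than the other, multiplying each likelihood by the class prior would give the Bayesian version of the same rule.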

<h4>Q: What are the two main philosophies to predicting the class of something given x, p(c|x)?</h4>

Ans: Empirical risk minimisation (based on the empirical distribution) and Bayesian decision theory.

<h4>Q: Explain zero-one loss utility</h4>
Ans: Measuring the prediction performance based on the count of correct predictions: each prediction scores 1 if it is correct and 0 otherwise. A weighted variant for 2 classes scores a correct prediction 1 or 2 depending on which class it was, and 0 otherwise.

For multiple classes, the expected utility of deciding on class j is:

$$ U(c^* = j) = \sum_i U_{ij} \, p(c^{true} = i|x^*) $$

<h4>Q: You are planning a vacation in Italy. Before packing, you hear that there might
be an earthquake the day you arrive.
After consulting Google, you learn that in recent years there have been (on
average) five earthquakes a year in the part of the country you are visiting
(ignore leap years). Moreover, you learn that when there is an earthquake, the
earthquake forecast service has correctly predicted it 90% of the time. The other
10% of the times an earthquake is predicted, the forecast is wrong and there is
no earthquake.
What is the probability that there will be an earthquake on the day you arrive?</h4>


Ans: 
* e = earthquake, e' = not earthquake,
* f = an earthquake being forecasted,   f' = an earthquake not being forecasted
* p(e) = 5/365 = 0.0137,  p(e') = 1-p(e) = 0.9863
* p(f|e) = 0.9,   p(f|e') = 0.1
* p(f) = p(f|e)*p(e) + p(f|e')*p(e')       (prob. of a forecast and an earthquake + prob. of a forecast and no earthquake)

Therefore: p(e|f) = p(f|e)p(e) / p(f) = (0.9*0.0137) / (0.9*0.0137 + 0.1*0.9863) = 0.11 ~ 11%

--------------------

<h4>Q: Consider a woman who has a brother with haemophilia, but whose father does
not have haemophilia. This implies that her mother must be a carrier of the
haemophilia gene on one of her X chromosomes and that her father is not a
carrier. The woman herself thus has a fifty-fifty chance of having the gene.
The situation involving uncertainty is whether or not the woman carries the
haemophilia gene. The parameter of interest θ can take two states:
• Carries the gene (θ = 1)
• Does not carry the gene (θ = 0).
<br><br>
1. Write down the prior distribution for θ using the above information.
<br>
2. The data Y is the number of the woman’s sons who are affected. Suppose
she has two sons, neither of whom is affected. Assuming the status of the two
sons is independent, write down the likelihood function p(Y |θ) (if the woman is
not a carrier then her sons cannot be affected, but if she is a carrier they each
have a 50% chance of being affected).
<br>
3. Find the corresponding posterior distribution for θ</h4><br>

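1) p($\theta$ = 0) = 0.5, p($\theta$ = 1) = 0.5

2) 

p(Y|$\theta$ = 0) = 1 (we know that neither son can be affected if the mother is not a carrier)

p(Y|$\theta$ = 1) = 0.5 * 0.5 = 0.25

3) 

p(Y) = p(Y|$\theta$ = 0)p($\theta$ = 0) + p(Y|$\theta$ = 1)p($\theta$ = 1) = 1 * 0.5 + 0.25 * 0.5 = 0.625

p($\theta$ = 0 | Y) = $\frac{p(Y|\theta = 0)p(\theta = 0)}{p(Y)}$ = 0.5 / 0.625 = 0.8

p($\theta$ = 1 | Y) = $\frac{p(Y|\theta = 1)p(\theta = 1)}{p(Y)}$ = 0.125 / 0.625 = 0.2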
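
The same update written as a generic "prior times likelihood, then normalise" step for a parameter with a finite number of states. The function name `discrete_posterior` is just a label for this sketch.

```python
def discrete_posterior(prior, likelihood):
    # prior[s] = p(theta = s); likelihood[s] = p(Y | theta = s)
    unnormalised = {s: prior[s] * likelihood[s] for s in prior}
    evidence = sum(unnormalised.values())  # p(Y)
    return {s: value / evidence for s, value in unnormalised.items()}

prior = {0: 0.5, 1: 0.5}            # 0 = not a carrier, 1 = carrier
likelihood = {0: 1.0, 1: 0.5 ** 2}  # probability that both sons are unaffected

posterior = discrete_posterior(prior, likelihood)
print(posterior)
```

Seeing two unaffected sons lowers the probability that the woman is a carrier from 0.5 to 0.2. Feeding this posterior back in as the prior would handle a third son in the same way.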

--------------------

#### Q: What is the formula for calculating risk?

$$Risk(action_i) = \int p(\theta|y)L(\theta, action_i)\,d\theta \quad \text{or, for discrete } \theta, \quad \sum_{\theta} p(\theta|y)L(\theta,action_i) $$

In English:

Risk of doing an action = the sum, over every possible outcome, of: 
    - the posterior prob. of that outcome given data y, 
    - times the size of the loss (if that outcome happened and you did that action)

The prior p(θ) does not appear as a separate factor: it is already included in the posterior p(θ|y).

--------------

<h4>Q: The police believe that a criminal is guilty of theft. A criminal prosecutor has
to decide whether the case is worth taking to court. He will take it to court if
he believes the person will be convicted of the crime, and not take it to court
otherwise (he is more concerned with his own career advancement than with
justice!)

Based on previous experience, the prosecutor knows that 70% of people who the
police believe are guilty of theft get convicted in court. His loss function (based
on career considerations) is as follows:

<img src="img/dr1.png" height="100" width="400">


Based purely on the prosecutor’s prior beliefs, should he take the case to
court? (i.e. calculate the risks of both taking to court, and not taking to court)</h4>


Ans:

* p(guilty) = 0.7,  p(not guilty) = 0.3

* Risk if taken to court = p(not found guilty) * loss(if not found guilty & taken to court) = $0.3 * 5 = 1.5$ 

* Risk if not taken to court = p(found guilty) * loss(if found guilty & not taken to court) = $0.7 * 1 = 0.7$

The risk is lower if the case is not taken to court, therefore do not take it to court.
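
The risk calculation as a sketch. The loss table in the image is not reproduced here, so the code assumes the two losses used in the answer (5 and 1) and assumes the remaining two entries are 0; the outcome and action names are my own.

```python
def risk(action, belief, loss):
    # Risk = sum over outcomes of p(outcome) * loss(outcome, action).
    return sum(belief[outcome] * loss[(outcome, action)] for outcome in belief)

# Assumed loss table: only the two mistakes carry a loss.
loss = {
    ("convicted", "court"): 0,
    ("acquitted", "court"): 5,
    ("convicted", "no_court"): 1,
    ("acquitted", "no_court"): 0,
}

prior = {"convicted": 0.7, "acquitted": 0.3}

risks = {action: risk(action, prior, loss) for action in ("court", "no_court")}
best_action = min(risks, key=risks.get)

print(risks)
print(best_action)
```

Passing a different `belief` (for example a posterior after new evidence) into the same function recomputes the risks without changing anything else.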

-----------------

<h4>Q: A witness comes forward and claims to have information that will prove the
person is guilty. The prosecutor knows that not all witnesses are reliable. Based
on previous experience he knows that the probability of getting a conviction with
a favourable witness is 0.9. Now that the prosecutor has this witness testimony, compute the risks of both
taking to court and not taking to court, and find his optimal decision
(hint: treat "having a witness" as being the y = 1 case).</h4>


Ans:

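* p(found guilty | witness) = 0.9, so p(not found guilty | witness) = 0.1
* Risk if taken to court = 0.1 * 5 = 0.5
* Risk if not taken to court = 0.9 * 1 = 0.9

Using the same losses as in the previous question, the risk is now lower if the case is taken to court, so the optimal decision is to take it to court.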


-----------------------------

<h4>Q: A company knows from previous experience that only 0.3% of batches are bad but also knows that the percentage of defective items in each bad batch varies. They know based on previous experience that it is equal to 0.05 on average, with a standard
deviation of 0.01. A new batch is tested that contains 4 defective widgets out of 100. </h4>

<h4>Derive the
risk associated with both decisions (keeping and scrapping the batch)
(in other words, rather than assuming that the defective rate is equal to 0.05,
put a Beta distribution prior on the defective rate with α and β chosen to give a
mean of 0.05 and standard deviation 0.01, and then the risk of both decisions under
this posterior. Recall that the dbeta() function in R will evaluate p(y|θ)).</h4>

----------------------

<h4>Q: Items are produced on an assembly line and the probability that any item is
defective is given by θ. A uniform prior on θ is assumed (i.e. Beta(1,1)). An
item is selected from the line and tested. What is the posterior distribution for
θ if the item is
(a) defective?
(b) non defective?</h4>


*  As the prior here is a Beta distribution and each test is a Bernoulli trial, the prior is conjugate: the posterior is also a Beta distribution. Wikipedia's table of conjugate priors gives its parameters as:

$$\alpha +\sum _{i=1}^{n}x_{i},\,\beta +n-\sum _{i=1}^{n}x_{i}$$

* For this question where we have Beta(1,1) to start with, this translates into: 
$$p(\theta |Y) = \mathrm{Beta}\left(1 +\sum _{i=1}^{n}x_{i},\,1 +n-\sum _{i=1}^{n}x_{i}\right)$$ where n is the number of trials (in this case 1) and $x_i$ is 1 if item i is defective and 0 otherwise

Therefore:
* (a) defective: p(θ|Y=1) = Beta(2,1)
* (b) non-defective: p(θ|Y=0) = Beta(1,2)
--------------------
