<H3> Bayesian Statistics </h3>

* A core part of machine learning, as well as a fundamentally different viewpoint on statistical theory itself.
* The frequentist view of the world (which is how most people learn statistics) is that one can only make statements based on observed data (i.e. data sampling).
* Bayesians allow prior beliefs to be stated before any sampling is done; the observed data then updates those beliefs into a posterior belief.
* Helpful in situations where there is not much data. For example, in earthquake modelling there may only ever have been 4 or 5 earthquakes on a particular fault.
* Bayesian probability statements are also easier to interpret.

--------------

<h4>Give a high level overview of Bayesian decision making</h4>

We start with prior beliefs $p(\theta)$ about the state of the world. After observing the
data $y$, we update these to give the posterior $p(\theta \mid y)$. Based on this, we then choose an
action $a_i$ from the set of $k$ available actions.

--------------------

<H4>List the fundamental rules of probability</H4>

<b>1) UNION (OR) rule:</b> $$p(a \ or \ b) = p(a) + p(b) - p(both)$$  i.e. $p(a \cup b) = p(a) + p(b) - p(a \cap b)$

<b>2) Product/Joint probability rule:</b>
$$p(a,b)=p(a|b)*p(b) = p(b|a)*p(a) $$

i.e. Specific combination of a & b = prob. of one given the other * prob of the other

<b>3) SUM rule / Marginal distribution:</b>
$$p(a)=\sum_{b} p(a,b) =\sum_{b} p(a|b)*p(b) $$
i.e. p(a) across all the possible values of b

<b>4) Bayes Rule</b>
$$ P(a \mid b) = \frac{P(b \mid a) \, P(a)}{P(b)}=\frac{P(a,b) \,}{P(b)} $$

The denominator can be expanded using the marginal distribution (sum) rule, summing over every possible value $a'$ of $a$:

$$ P(a \mid b) =\frac{P(b \mid a) \, P(a)}{\sum_{a'} P(b \mid a') \, P(a')} $$


Important to know: 
* P(a) is referred to as the prior
* P(b $\mid$ a) is referred to as the likelihood
* P(a $\mid$ b) is referred to as the posterior

-----------------

<h4>Q) Explain the chain rule</h4>

P(A,B,C) = P(A| B,C) P(B,C) = P(A|B,C) P(B|C) P(C)

P(A, B, ..., Z) = P(A| B, ..., Z) P(B| C, ..., Z) ... P(Y|Z) P(Z)

------------------

<h4>a) Write down Bayes theorem which relates p(T |E) to p(E|T ) </h4>
p(T|E) = p(E|T)p(T)/p(E)

<h4>b) Based on past experience, you know that when a large attack occurs, the expert
correctly predicts it with probability 0.8. However there is also a 0.4 probability that
the expert incorrectly predicts that an attack will occur when in fact no attack occurs.
Additionally, your prior p(T) is 0.5
Calculate p(T|E)</h4>
<br>
p(T |E) = 0.8 ∗ 0.5/0.6 = 0.67

<h4>c)Write down the Theorem of Total Probability which expresses p(E) in terms of
p(E|T ) and p(E|T')   (T' denotes that event T does not occur).</h4>

p(E) = p(E|T)p(T) + p(E|T')p(T')
<br>
p(E) = 0.8 ∗ 0.5 + 0.4 ∗ 0.5 = 0.6
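
Parts a) to c) can be checked numerically. The following is a minimal sketch using only the numbers given in the question; the function and argument names are illustrative and not part of the original notes.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return (p(E), p(T|E)) for a binary hypothesis T and evidence E."""
    # Theorem of total probability: p(E) = p(E|T)p(T) + p(E|T')p(T')
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    # Bayes theorem: p(T|E) = p(E|T)p(T) / p(E)
    return evidence, likelihood * prior / evidence


p_e, p_t_given_e = posterior(prior=0.5, likelihood=0.8, false_positive_rate=0.4)
print(round(p_e, 2), round(p_t_given_e, 2))  # 0.6 0.67
```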

--------------------------------------------------------

<h4>Explain prior and posterior probabilities in relation to estimating parameters</h4>

Ans: In a typical inference problem we have an unknown parameter θ which
we wish to estimate. For example, θ may be the mean of a Normal
distribution, or the probability of a particular coin landing heads when
tossed. We also have data Y , such as the outcome of tossing the coin
multiple times. We wish to use the data Y to learn about θ .

The prior distribution p (θ) represents our beliefs about θ before
incorporating the information from the data.
The posterior distribution p (θ| Y ) represents our beliefs about θ
after incorporating the information from the data.
Bayes theorem tells us how to move from p (θ) to p (θ| Y ) . I.e. given we
have some beliefs about θ before seeing the data, it tells us the beliefs
p (θ| Y ) we should have about θ after seeing the data.


---------------------

<h4>Explain what is a conjugate prior</h4>

If the posterior distribution p(θ|x) is in the same family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. E.g. the Beta distribution is a conjugate prior for the binomial likelihood, and the resulting posterior is also a Beta distribution.

There are two reasons why it's useful if your problem has a conjugate prior. 

a) Computing the posterior for θ becomes much easier. Think about Bayes theorem and what the denominator is for a simple problem in which θ takes one of 2 possible values: it is a sum of two terms. <br>Now imagine θ can take any value in a continuous distribution: the denominator becomes an integral over all of those values, which is much harder to calculate. With a conjugate prior the posterior has a known closed form, so this integral never has to be evaluated explicitly.

b) A conjugate prior gives you a way to control how much influence the likelihood has in determining the posterior. 
http://lesswrong.com/lw/5sn/the_joys_of_conjugate_priors/

<h4> Describe the Beta distribution</h4>
The Beta distribution only has mass in [ 0 , 1 ] and so it makes a good distribution to use for representing probabilities.

Its normalizing constant is the Beta function B(α, β) = Γ(α)Γ(β) / Γ(α + β), for example:

* B(1,2) = 0.5
* B(2,1) = 0.5
* B(1,1) = 1
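
The three values listed above are values of the Beta function, the normalizing constant of the Beta density. A small sketch using Python's standard `math.gamma`; the helper names `beta_function` and `beta_pdf` are made up for this illustration.

```python
import math


def beta_function(a, b):
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def beta_pdf(theta, a, b):
    """Density of the Beta(a, b) distribution at theta in [0, 1]."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_function(a, b)


print(beta_function(1, 2), beta_function(2, 1), beta_function(1, 1))  # 0.5 0.5 1.0
print(beta_pdf(0.3, 1, 1))  # 1.0, because Beta(1, 1) is the uniform distribution
```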

-------------

<h4>
A new medical screening test is developed to assess whether a patient
has a particular disease. The test is advertised to have the following
degrees of accuracy: "if the patient truly has the disease, then the test
will correctly detect this and return a positive result with probability 0.95.
If the patient truly does not have the disease, the test will correctly detect
this and return a negative result with probability 0.98."
Given that 1 in 1000 people in the population have the disease, what is
the chance that a person testing positive on the test really has the
disease?</h4>

Ans:

* p(d) = 1/1000 = 0.001
* p(pos|d) = 0.95
* p(not pos | not d) = 0.98

* Therefore: p(d|pos) =  P(pos | d)P(d) / P(pos), where P(pos) = P(pos | d)P(d) + P(pos | not d)P(not d) and P(pos | not d) = 1 - 0.98 = 0.02:

$\frac{(0.95 * 0.001)}{(0.95 * 0.001 + 0.02 * 0.999)} = 0.045$
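
One way to see why the answer is so small is to count expected cases in a large population instead of working with probabilities. The sketch below assumes a hypothetical population of 100,000 people; the function name is illustrative. The ratio it prints is the same Bayes fraction as above and evaluates to roughly 0.045.

```python
def positive_test_breakdown(population, prevalence, sensitivity, specificity):
    """Expected numbers of true positives and false positives."""
    diseased = population * prevalence
    healthy = population - diseased
    true_positives = diseased * sensitivity        # diseased and test positive
    false_positives = healthy * (1 - specificity)  # healthy but test positive
    return true_positives, false_positives


tp, fp = positive_test_breakdown(100_000, 0.001, 0.95, 0.98)
p_disease_given_pos = tp / (tp + fp)
print(round(tp), round(fp), round(p_disease_given_pos, 3))  # 95 1998 0.045
```

Roughly 95 of the 2093 people who test positive actually have the disease, because the healthy group is so much larger than the diseased group.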

--------------

<h4> Q: What is a loss function?</h4>

Also known as a cost function. The loss function $L(\theta, \hat{\theta})$ defines the loss incurred if we estimate the true
value of $\theta$ by $\hat{\theta}$.

For classification, the expected loss weights each entry of the loss matrix $L_{kj}$ by the joint probability of the input $x$ and the class $C_k$:

$$E[L] = \sum_k \sum_j \int_{R_j} L_{kj} \, p(x, C_k) \, dx$$

The input space is continuous rather than discrete, so we are interested in the probability of a region $R_j$ (the set of inputs assigned to class $j$) rather than of a single point, hence the integral. The sums then add up this loss over all k and j.

In the loss matrix, k and j represent the class labels (e.g. isCancer, isNormal). 

<img src="img/lossMatrix.png" height="100" width="150">

Down the side it's what you say the label is, along the top it's what it actually is, and the value in each cell is the loss you assign to that combination (in this case the loss may be heavier if you say isNormal when it's actually isCancer).

----------------

<h4>Q) 
Suppose we are given a coin and told that it could be biased, so the
probability of landing heads is not necessarily 0.5. Let θ denote the
probability of it landing heads. We wish to learn about θ .
We toss the coin N times and obtain Y heads. In frequentist statistics,
the point estimate of θ would be Y / N, and a confidence interval can be
constructed around this.
Is this reasonable? </h4>

Ans:
Well, say we performed 100 tosses and got 48
heads. The point estimate would be θ = 0.48. However in this situation
it may be more reasonable to conclude that the coin isn’t biased as the vast majority of coins in the world are not biased, and observing 48 heads in 100 tosses is a normal outcome from tossing an unbiased coin.

In other words, rather than concluding that θ = 0.48, we may wish to
include prior information to make a more informed judgement.

In a Bayesian analysis, we first need to represent our prior beliefs about
θ . Specifically, we construct a probability distribution p (θ) which
encapsulates our beliefs.

There is no one way to do this! p (θ) represents the beliefs of one
particular person based on their assessment of the prior evidence – it
will not be the same for different people if they have different knowledge
about what proportion of coins are biased. In some cases, p (θ) may be
based on subjective judgement, while in others it may be based on
objective evidence. This is the essence of Bayesian statistics –
probabilities express degrees of beliefs.
However since θ here represents the probability of the coin landing
heads, it must lie between 0 and 1. So the function we use to represent
our beliefs should only have mass in the interval [ 0 , 1 ] .

The Beta distribution only has mass in [ 0 , 1 ] so it is a sensible choice for the probability density function of our prior belief.
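
To make the effect of the prior concrete, the sketch below applies the conjugate Beta-Binomial update that is derived later in these notes (posterior Beta(α + Y, β + N − Y)) to the 48-heads-in-100-tosses example. The Beta(50, 50) "fair-coin" prior is a hypothetical choice made only for illustration, as are the function names.

```python
def beta_binomial_update(alpha, beta, heads, tosses):
    """Posterior Beta parameters after observing `heads` in `tosses`."""
    return alpha + heads, beta + (tosses - heads)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Two illustrative priors: flat, and one concentrated around 0.5.
priors = {"uniform": (1, 1), "fair-coin": (50, 50)}
posteriors = {name: beta_binomial_update(a, b, heads=48, tosses=100)
              for name, (a, b) in priors.items()}
for name, (a, b) in posteriors.items():
    print(name, (a, b), round(beta_mean(a, b), 3))
```

With the flat prior the posterior mean stays close to the frequentist estimate of 0.48, while the prior concentrated around 0.5 pulls it to 0.49.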

----------------------

<h4>Q: Explain conditional independence</h4>

Ans: If a and b are independent then p(a and b) = p(a) * p(b).
Unfortunately total independence is rare. Instead, two variables may be independent only once the value of a third variable is known.
Hence we can say that a and b are independent, given c.

If a <i>independent</i> b | c, then p(a,b|c) = p(a|c)*p(b|c)
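
A small numeric check can make the distinction concrete. The sketch below builds a joint distribution over three binary variables in which c is a common cause of a and b; all probabilities are made-up numbers. It shows that a and b are independent given c, yet not independent once c is summed out.

```python
from itertools import product

# Hypothetical numbers: c is a common cause of a and b.
p_c = {0: 0.6, 1: 0.4}
p_a_given_c = {0: 0.1, 1: 0.8}  # p(a=1 | c)
p_b_given_c = {0: 0.2, 1: 0.9}  # p(b=1 | c)


def bern(p_one, value):
    """Probability of `value` (0 or 1) when p(value=1) = p_one."""
    return p_one if value == 1 else 1 - p_one


# Joint built from the factorisation p(a, b, c) = p(a|c) p(b|c) p(c).
joint = {(a, b, c): bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b) * p_c[c]
         for a, b, c in product((0, 1), repeat=3)}


def prob(**fixed):
    """Marginal probability of the fixed values, e.g. prob(a=1, c=1)."""
    names = ("a", "b", "c")
    return sum(p for key, p in joint.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))


# Conditional independence holds: p(a, b | c) == p(a | c) p(b | c)
lhs = prob(a=1, b=1, c=1) / prob(c=1)
rhs = (prob(a=1, c=1) / prob(c=1)) * (prob(b=1, c=1) / prob(c=1))
# Marginal independence fails: p(a, b) != p(a) p(b)
marginal_gap = prob(a=1, b=1) - prob(a=1) * prob(b=1)
print(abs(lhs - rhs) < 1e-12, round(marginal_gap, 4))  # True 0.1176
```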


--------------

<h4>Q: What is maximum likelihood estimation?</h4>

A method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters.

In formulas:<br>
$\theta$ represents the parameter <br>
$\theta$ can be one or more variables (e.g. mean and std. dev)<br>

Max $p(data |  \theta)$ across all possible values of $\theta$<br>
=Max $\prod_i p(X_i =x_i|\theta)$ across all possible values of $\theta$ (for independent observations)

Pros:
- Easy to compute and interpret<br>
- Asymptotically consistent (converges towards the true value as the size of the data, N, increases)
- Asymptotically efficient (lowest possible asymptotic variance)
- Invariant: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$ for any transformation $g$

Cons:<br>
- Point estimate: no indication of how much uncertainty there is
- $\hat\theta$ may not be unique: there could be more than 1 solution

-----------------

## Decision Theory

Decision Theory is concerned with making a decision based on probabilities and particularly Bayes Theorem.
How this is applied to machine learning and classification is examined here.

<h4>Q: What is a decision region, decision boundary?</h4>

Ans: 
* Decision region - a subset of your solution space that has been labelled as one classification.
* Decision boundary - the boundary between decision regions.

<h4>Q: Describe minimizing the risk of misclassification</h4>
    
For classification we can either minimize the probability of misclassification or, equivalently, maximize the probability of correct classification. We can model the decision in terms of Bayes theorem and then pick the class with the highest posterior probability, which is also the class with the lowest probability of being wrong.

<h4>What is Utility?</h4>
Ans: The opposite of loss, U = -L

---------------------

#### Q: Explain Likelihood, Maximum Likelihood, and Log-likelihood

<b>Likelihood:</b>

Likelihood reverses the usual probability question. With a probability distribution we fix the parameters and ask how probable a given value is. With likelihood we fix the observed values and ask how well each candidate parameter setting (i.e. each candidate distribution) explains them.

$$ L(\theta) = p(X|\theta) = \prod^N_{n=1} p(X_n|\theta) $$

The likelihood of a model with parameter(s) $\theta$ = the probability of seeing the data sample X given $\theta$ = the probabilities of each individual x given $\theta$ multiplied together (assuming the observations are independent)


<b>Log-likelihood</b>

For computational reasons it's better to work with the log-likelihood: a product of many small probabilities can underflow, whereas a sum of logs does not.

$$ ln \, L(\theta) = \sum^N_{n=1} ln\,p(X_n|\theta) $$

Note: By using log you can move from a 'product of' equation to a 'sum of' equation as <a href="https://people.richland.edu/james/lecture/m116/logs/properties.html">"the log of a product is the sum of the logs"</a>

<b> Maximum Likelihood Estimation</b>

Think back to generative models where you are trying to build an internal view of an unknown external world based on your observations. This is what Maximum likelihood estimation (MLE) seeks to do. It is a way of determining the parameter(s) $\theta$ of whatever model you assume the external world to be, so as to maximize the chances of the values X being observed.

$$ \hat{\theta} _{MLE} = \underset{\theta}{\arg\max} \sum\limits_{i=1}^n \log f(x_i|\theta)  $$

In practice, maximizing the log-likelihood is usually done by minimizing the negative log-likelihood  $- \sum^N_{n=1} ln\,p(X_n|\theta) $ 


To find the minimum of a function we need to find the point at which the gradient is 0.

<b> Maximum Likelihood Estimation for a Gaussian distribution</b>

If we suspect the external model to be a Gaussian distribution then setting the gradient of the negative log-likelihood to 0 gives closed-form estimates of the parameters:

mean: $\hat\mu = \frac{1}{N} \sum^N_{n=1} x_n$, i.e. the mean of your sample

variance: $\hat\sigma^2 = \frac{1}{N} \sum^N_{n=1} (x_n - \hat\mu)^2$, i.e. the variance of your sample (dividing by N rather than N - 1)
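
The closed-form Gaussian estimates can be sanity-checked against the negative log-likelihood directly. The sketch below uses a made-up sample and illustrative function names, and simply confirms that moving either parameter away from the estimate does not lower the negative log-likelihood.

```python
import math


def gaussian_mle(xs):
    """Closed-form maximum likelihood estimates (mean, variance)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # divides by N, not N - 1
    return mu, var


def neg_log_likelihood(xs, mu, var):
    """Negative Gaussian log-likelihood of an independent sample."""
    return sum(0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
               for x in xs)


sample = [2.1, 1.9, 2.4, 2.0, 1.6]  # made-up observations
mu_hat, var_hat = gaussian_mle(sample)
best = neg_log_likelihood(sample, mu_hat, var_hat)

# Nudging either parameter away from the MLE should not lower the NLL.
worse_mu = [neg_log_likelihood(sample, mu_hat + d, var_hat) for d in (-0.1, 0.1)]
worse_var = [neg_log_likelihood(sample, mu_hat, var_hat * s) for s in (0.8, 1.25)]
print(round(mu_hat, 3), round(var_hat, 3))  # 2.0 0.068
print(all(v >= best for v in worse_mu + worse_var))  # True
```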




Good video: https://www.youtube.com/watch?v=TaotW-u6eys


--------------

#### Q: Imagine you observe two samples of data that come from two Gaussian distributions. You then receive a new data point. How would you use Maximum Likelihood Estimation to determine which of the two underlying classes the data point belongs to?

Ans:

Fit each class by maximum likelihood to get $(\mu_0^*, \Sigma_0^*)$ and $(\mu_1^*, \Sigma_1^*)$, then assign x to class 1 if $ P(x | \mu_1^*, \Sigma_1^*) > P(x | \mu_0^*, \Sigma_0^*)$ and to class 0 otherwise.

i.e. assign to class 1 if the probability of x under class 1 is greater than the probability of x under class 0.
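
As a rough sketch of this procedure, the code below handles the one-dimensional case only (a single variance instead of a covariance matrix Σ), with made-up training samples and illustrative function names. Each class is fitted by maximum likelihood and the new point goes to whichever fitted density is higher.

```python
import math


def fit_gaussian(xs):
    """Maximum likelihood mean and variance of a 1-D sample."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def classify(x, class0, class1):
    """Return 1 if x is more likely under class 1's fitted Gaussian, else 0."""
    p0 = gaussian_pdf(x, *fit_gaussian(class0))
    p1 = gaussian_pdf(x, *fit_gaussian(class1))
    return 1 if p1 > p0 else 0


class0 = [1.0, 1.2, 0.8, 1.1, 0.9]  # made-up sample from class 0
class1 = [3.0, 3.3, 2.7, 3.1, 2.9]  # made-up sample from class 1
print(classify(1.4, class0, class1), classify(2.8, class0, class1))  # 0 1
```

Note that this compares likelihoods only; it implicitly treats both classes as equally probable a priori.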

<h4>Q: What are the two main philosophies for predicting the class c of something given x, i.e. p(c|x)?</h4>

Ans: Empirical risk minimisation (based on the empirical distribution) and the Bayesian decision approach.

<h4>Q: Explain zero-one loss utility</h4>
Ans: Measuring the prediction performance based on the count of correct predictions. In the case of 1 class, Sum of (If correct = 1 else 0). For 2 classes, Sum of (If correct then 1 or 2 depending on which class it was, else 0).

For j classes:

$$ U(c^* = j) = \sum_i U_{ij} \, p(c^{true} = i \mid x^*) $$

<h4>Q: You are planning a vacation in Italy. Before packing, you hear that there might
be an earthquake the day you arrive.
After consulting Google, you learn that in recent years there have been (on
average) five earthquakes a year in the part of the country you are visiting
(ignore leap years). Moreover, you learn that when there is an earthquake, the
earthquake forecast service has correctly predicted it 90% of the time. The other
10% of the times an earthquake is predicted, the forecast is wrong and there is
no earthquake.
What is the probability that there will be an earthquake on the day you arrive?</h4>


Ans: 
* e = earthquake, e' = no earthquake
* f = an earthquake being forecast,   f' = an earthquake not being forecast
* p(e) = 5/365 = 0.0137,  p(e') = 1-p(e) = 0.9863
* p(f|e) = 0.9,   p(f|e') = 0.1
* p(f) = p(f|e)*p(e) + p(f|e')*p(e')       (prob. of a forecast together with an earthquake + prob. of a forecast together with no earthquake)

Therefore: p(e|f) = p(f|e)p(e) / p(f) = (0.9*0.0137) / (0.9*0.0137 + 0.1*0.9863) = 0.11 ~ 11%

--------------------

<h4>Q: Consider a woman who has a brother with haemophilia, but whose father does
not have haemophilia. This implies that her mother must be a carrier of the
haemophilia gene on one of her X chromosomes and that her father is not a
carrier. The woman herself thus has a fifty-fifty chance of having the gene.
The situation involving uncertainty is whether or not the woman carries the
haemophilia gene. The parameter of interest θ can take two states:
• Carries the gene (θ = 1)
• Does not carry the gene (θ = 0).
<br><br>
1. Write down the prior distribution for θ using the above information.
<br>
2. The data Y is the number of the woman’s sons who are affected. Suppose
she has two sons, neither of whom is affected. Assuming the status of the two
sons is independent, write down the likelihood function p(Y |θ) 

(Note: if the woman is
not a carrier then her sons cannot be affected, but if she is a carrier they each
have a 50% chance of being affected).
<br>
3. Find the corresponding posterior distribution for θ</h4><br>

1) $p(\theta$ = 0) = 0.5, p($\theta$ = 1) = 0.5

2) The data is that neither son is affected. Let ’D’ denote this data.

p(D|$\theta$ = 0) = 1 (we know that neither son can be affected if the mother is not a carrier)

p(D|$\theta$ = 1) = 0.5 * 0.5 = 0.25

3) 
p(D) = p(D|θ=0)p(θ = 0) + p(D|θ = 1)p(θ = 1) = 1 ∗ 0.5 + 0.25 ∗ 0.5 = 0.625
<img src="img/0006.png" height="100" width="400">
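
The posterior can also be computed directly from the prior and likelihood values in parts 1 and 2. The sketch below uses an illustrative function name and generalises to any number of unaffected sons; with two sons it gives p(θ = 1|D) = 0.125 / 0.625 = 0.2.

```python
def carrier_posterior(prior_carrier, unaffected_sons):
    """p(theta = 1 | D) when every one of the sons is unaffected."""
    like_carrier = 0.5 ** unaffected_sons  # each son independently unaffected
    like_not_carrier = 1.0                 # sons cannot be affected
    evidence = (like_carrier * prior_carrier
                + like_not_carrier * (1 - prior_carrier))
    return like_carrier * prior_carrier / evidence


print(carrier_posterior(0.5, 2))  # 0.2
print(carrier_posterior(0.5, 3) < carrier_posterior(0.5, 2))  # True
```

Each additional unaffected son lowers the probability that the woman is a carrier.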

--------------------

#### Q: What is the formula for calculating risk?

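$$Risk(action_i) = \int p(\theta|y)L(\theta, action_i)\,d\theta \ \text{ (continuous } \theta \text{)}, \qquad Risk(action_i) = \sum_{\theta} p(\theta|y)L(\theta,action_i) \ \text{ (discrete } \theta \text{)} $$

In English:

Risk of doing an action = the sum, over every possible state of the world, of:

- the posterior prob. of that state given data y,
- times the size of the loss (if that state is the true one and you did that action)

The prior is already contained in the posterior $p(\theta|y)$, so it does not appear as a separate factor. Before any data is seen, the prior $p(\theta)$ takes the place of the posterior.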

--------------

<h4>Q: The police believe that a criminal is guilty of theft. A criminal prosecutor has
to decide whether the case is worth taking to court. He will take it to court if
he believes the person will be convicted of the crime, and not take it to court
otherwise (he is more concerned with his own career advancement than with
justice!)

Based on previous experience, the prosecutor knows that 70% of people who the
police believe are guilty of theft get convicted in court. His loss function (based
on career considerations) is as follows:

<img src="img/dr1.png" height="100" width="400">
<br>

Based purely on the prosecutor’s prior beliefs, should he take the case to
court? (i.e. calculate the risks of both taking to court, and not taking to court)</h4>


Ans:

* p(guilty) = 0.7,  p(not guilty) = 0.3

* Risk if taken to court = p(not guilty) * loss(if not found guilty & taken to court) = $0.3 * 5 = 1.5$ 

* Risk if not taken to court = p(found guilty) * loss(if found guilty & not taken to court) = $0.7 * 1 = 0.7$

The risk is lower if the case is not taken to court, therefore do not take it to court.

-----------------

<h4>Q: A witness comes forward and claims to have information that will prove the
person is guilty. The prosecutor knows that not all witnesses are reliable. Based
on previous experience he knows that the probability of getting a conviction with
a favourable witness is 0.9. Now that the prosecutor has this witness testimony, compute the risks of both
taking to court and not taking to court, and find his optimal decision
(hint: treat "having a witness" as being the y = 1 case).</h4>


Ans:
After seeing the data, the risk becomes:

$R(a_0 |y) = p(θ = 0|y)L(θ = 0, a_0 )+p(θ = 1|y)L(θ = 1, a_0 ) = 0.9∗0+0.1∗5 = 0.5$

$R(a_1 |y) = p(θ = 0|y)L(θ = 0, a_1 )+p(θ = 1|y)L(θ = 1, a_1 ) = 0.9∗1+0.1∗0 = 0.9$

The risk is now minimised by taking the case to court ($a_0$).

Note: the calculations in this case are easier than the ones from the lecture
because you are given p(θ|y) here, while in the lecture you were only given
p(y|θ) and had to use Bayes theorem to find p(θ|y). Make sure you understand
the difference! The question stated:
"Based on previous experience he knows that the probability of getting a
conviction with a favourable witness is 0.9"
i.e. this is telling you p(θ = 0|y) = 0.9 and p(θ = 1|y) = 0.1 directly, without
the need for further calculation.
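
Both of the prosecutor's risk calculations, before and after the witness, follow the same pattern: weight each loss by the current belief in that outcome and pick the action with the lowest total. A minimal sketch, with the loss values read off the worked answers above (the loss table itself is an image) and with illustrative names:

```python
# LOSS[state][action]; values taken from the worked answers above.
LOSS = {
    "convicted":     {"court": 0, "no_court": 1},
    "not_convicted": {"court": 5, "no_court": 0},
}


def risk(action, beliefs):
    """Expected loss of an action under the current beliefs about the state."""
    return sum(p * LOSS[state][action] for state, p in beliefs.items())


def best_action(beliefs):
    """Action with the lowest risk."""
    return min(("court", "no_court"), key=lambda a: risk(a, beliefs))


prior = {"convicted": 0.7, "not_convicted": 0.3}
with_witness = {"convicted": 0.9, "not_convicted": 0.1}
for beliefs in (prior, with_witness):
    print(round(risk("court", beliefs), 2),
          round(risk("no_court", beliefs), 2),
          best_action(beliefs))
# 1.5 0.7 no_court
# 0.5 0.9 court
```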


-----------------------------

<h4>Q:
A manufacturing company knows from previous experience that only 0.3% of batches are bad but also knows that the percentage of defective items in each bad batch varies. They know based on previous experience that it is equal to 0.05 on average, with a standard
deviation of 0.01. A new batch is tested that contains 4 defective widgets out of 100. </h4>

<h4>Derive the risk associated with both decisions (keeping and scrapping the batch), using the loss function below, where $\theta = 1$ means the batch is bad and $a_0$ is to keep the batch.
<br>
In other words, rather than assuming that the defective rate is equal to 0.05, put a Beta distribution prior on the defective rate with α and β chosen to give a mean of 0.05 and standard deviation of 0.01, and then compute the risk of both decisions under the resulting posterior. Recall that the dbeta() function in R will evaluate p(y|θ).</h4>

<img src="img/0007.png" width="150">


The risk associated with action $a_1$ is:

$$R(a_1|y) = \sum_{i=0}^1 p(θ = i|y)L(θ=i,a_1) = p(θ=0|y)$$

<img src="img/0008.png" width="500">

so the risk is minimised by taking action $a_0$ , i.e. keeping the batch

----------------------

<H4>Q: Suppose that we believe that the time-between-earthquakes on a particular fault follows an Exponential(0.1) distribution (higher numbers towards beginning then long tail). If an earthquake occurs today, what is the probability of the next earthquake occurring within 10 years?</H4>

* The R function pexp(x, lambda) returns the probability of a random variable with the Exponential(λ) distribution having a value <b>less than or equal to</b> x
* pexp(10,0.1) = 0.6321206
* The probability of the next earthquake happening within 10 years is hence 0.63, or 63%


<h4>Q: The average number of fatalities in a terrorist attack is 4.17 and follows an exponential distribution with $\lambda$ = 0.24, what is the probability of 10 or more people dying? </h4>

* The probability of 10 or more people dying is 1 minus the probability of fewer than 10 dying
* =1-pexp(10,0.24) = 0.09071795. Therefore there is a 9% chance.

<h4>Q: What would the probability of 30 or more people dying be if the number of attacks increased to 2000 a year?</h4>

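Assuming the 2000 attacks are independent, the probability that every one of them kills fewer than 30 people is pexp(30,0.24)$^{2000}$, so:

1 - pexp(30,0.24)$^{2000}$ = 0.78

So there is a 78% chance of at least one attack killing 30 or more people.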
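
The three exponential calculations above can be reproduced without R, since the Exponential(λ) CDF is simply 1 − e^(−λx). The sketch below defines a small stand-in named `pexp` to mirror the R function; it is not a library call.

```python
import math


def pexp(x, rate):
    """Exponential CDF, mirroring R's pexp(x, rate): P(X <= x)."""
    return 1.0 - math.exp(-rate * x)


within_10_years = pexp(10, 0.1)                     # next earthquake within 10 years
ten_or_more = 1 - pexp(10, 0.24)                    # a single attack kills 10 or more
at_least_one_in_2000 = 1 - pexp(30, 0.24) ** 2000   # any of 2000 attacks kills 30+
print(round(within_10_years, 4), round(ten_or_more, 4),
      round(at_least_one_in_2000, 2))  # 0.6321 0.0907 0.78
```

The last line relies on the attacks being independent, which is an assumption of the calculation rather than something the data guarantees.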

--------------------

<h4>Suppose that a parameter θ has only two possible values, 1 or 2. A random
variable Y is observed with the following distribution:
<br><br>
$$p(Y|θ = 1) = Gamma(2, 4)$$
$$p(Y |θ = 2) = Gamma(6, 8)$$
<br>

Your task is to estimate θ. Let the actions $a_1$ and $a_2$ correspond to claiming that
θ = 1 and θ = 2 respectively. The loss function is:</h4>

<img src="img/0001.png" height="100" width="150">

<h4>
a) Before observing the observation Y, your prior on θ is p(θ = 1) = 0.75. Compute
the risk associated with both actions based only on this prior knowledge, and hence
decide which action to take. </h4>

For action $a_1$, the risk is: R($a_1$) = L(θ = 1, $a_1$)p(θ = 1) + L(θ = 2, $a_1$)p(θ = 2)
= 0 + 9 ∗ 0.25 = 2.25


and for action $a_2$ the risk is:
R($a_2$) = L(θ = 1, $a_2$)p(θ = 1) + L(θ = 2, $a_2$)p(θ = 2) = 2 ∗ 0.75 + 0 = 1.5

So we would take action $a_2$, since it has the lower risk.

-----------------------

<h4>

b) You now observe the value Y = 1. Compute the risk of both actions given this
observation, and hence decide which action to take. You may want to use the formula
sheet. Note that when n is a positive integer, the Gamma function Γ(n) used in the
Gamma distribution is equal to Γ(n) = (n − 1)! = 1 × 2 × ... × (n − 1). Also, recall
that 0! = 1.</h4>

<img src="img/0003.png" height="100" width="500">

Notes:
* We have to compute the posterior probability, given that we now have observational data
* We write out Bayes theorem as shown
* 0.29 is the Gamma(2, 4) density evaluated at Y = 1: $\frac{4^2}{1!}1^{2-1}e^{-4 \cdot 1}$
* 0.22P(Y) should read 0.22 / p(Y)
* The two posterior probabilities must sum to 1: $\frac{0.22}{p(Y)} + \frac{0.18}{p(Y)} = \frac{0.4}{p(Y)}=1$, so $p(Y) = 0.4$
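
The numbers in these notes can be reproduced with a few lines of code. The sketch below assumes the Gamma(shape, rate) parameterisation (which is what gives the density value 0.29 at Y = 1) and takes the loss values from the answer to part a); all names are illustrative.

```python
import math


def gamma_pdf(y, shape, rate):
    """Gamma(shape, rate) density at y > 0."""
    return rate ** shape / math.gamma(shape) * y ** (shape - 1) * math.exp(-rate * y)


prior = {1: 0.75, 2: 0.25}
params = {1: (2, 4), 2: (6, 8)}  # theta -> (shape, rate) of p(Y | theta)
# LOSS[(theta, action)], taken from the answer to part a).
LOSS = {(1, "a1"): 0, (2, "a1"): 9, (1, "a2"): 2, (2, "a2"): 0}


def posterior(y):
    """p(theta | Y = y) via Bayes theorem over the two values of theta."""
    joint = {t: gamma_pdf(y, *params[t]) * prior[t] for t in prior}
    evidence = sum(joint.values())  # p(Y), roughly 0.4 at y = 1
    return {t: j / evidence for t, j in joint.items()}


def risk(action, beliefs):
    """Expected loss of an action under the given beliefs about theta."""
    return sum(p * LOSS[(t, action)] for t, p in beliefs.items())


post = posterior(1.0)
print({t: round(p, 2) for t, p in post.items()})               # {1: 0.55, 2: 0.45}
print(round(risk("a1", prior), 2), round(risk("a2", prior), 2))  # 2.25 1.5
print(round(risk("a1", post), 2), round(risk("a2", post), 2))    # 4.09 1.09
```

Under these assumptions action $a_2$ has the lower risk both before and after observing Y = 1.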

------------------------------

<h4>An investment company wants to predict the probability of a stock having a negative
log return on a particular week. The company has access to the previous 10 weekly
log returns of the stock. 

Let $Y_i$ denote the log return of the stock on week i for
i = 1, 2, . . . , 10. 
<br>
$Z_i$ = 0 if $Y_i$ ≥ 0 and $Z_i$ = 1 if $Y_i$ < 0. 
<br>
Let:
    $$R = \sum^{10}_{i=1}{Z_i}$$

be the number of weeks with negative log returns in the historical data.</h4>
<h4>
Assuming that the $Y_i$ variables are independent and identically distributed, R has a Binomial(10, $\theta$) distribution.
</h4>
<h4>
a) Explain why the Beta(α, β) distribution is the conjugate prior for $\theta$. You do not
need to explicitly calculate the integral in the denominator of Bayes theorem.
</h4>
The prior distribution $p(\theta)$ is said to be conjugate to the likelihood $p(Y|\theta)$ if multiplying the two together (as in the numerator of Bayes theorem) and normalizing (as the denominator of Bayes theorem does) gives a posterior $p(\theta|Y)$ from the same family of distributions as the prior.
<br>
<br>In this case the likelihood (up to proportionality, i.e. ignoring the normalizing part of the binomial formula) is $$\theta^R (1 - \theta)^{n-R}$$ 
The Beta density is proportional to $\theta^{\alpha-1}(1-\theta)^{\beta-1}$, which has the same functional form in $\theta$. Their product is therefore again of Beta form, so the Beta prior is conjugate.
<h4>
b) Show that the posterior distribution $p(\theta|Y_1, \ldots, Y_{10})$ is Beta(α + R, β + 10 − R).
You do not need to explicitly calculate the integral in the denominator.
</h4>
$$p(\theta|Y)  = p(\theta|R) \propto p(R|\theta)p(\theta) \propto \theta^R (1 - \theta)^{10-R}\theta^{\alpha-1}(1 - \theta)^{\beta-1} = \theta^{\alpha+R-1}(1-\theta)^{\beta+10-R-1}$$
<br>
This has the form of a Beta(α + R, β + 10 − R) distribution.
<br>

<h4>
For the remainder of this question you should use the fact that the Beta function
B(α, β) used to define the Beta distribution, is defined as:</h4>

<img src="img/0002.png" height="100" width="160">

<b>and that when n is a positive integer, Γ(n) = (n − 1)! <br>(this is known as the gamma function)</b>

<h4>c) Suppose that 6 of the 10 observed log returns $Y_i$ are negative (i.e. R=6), and
that a uniform Beta(1,1) prior is chosen for the unknown Binomial parameter $\theta$.
Using the Trapezoid rule, compute the probability that $\theta$ is between 0.5 and 0.6</h4>

* $p(\theta|Y)$ follows a Beta(7, 5) distribution. 
* The probability of $\theta$ being less than or equal to a given value is the cumulative distribution function of this Beta distribution.
* For this question we need the probability that $\theta$ lies between two points, i.e. the integral of the Beta(7, 5) density between them.

* Therefore we need the normalizing constant $1/B(7, 5)$:

$$\frac{1}{\frac{Γ(\alpha)Γ(\beta)}{Γ(\alpha+\beta)}} = \frac{Γ(12)}{Γ(7)Γ(5)} = \frac{11!}{6! \, 4!} = 2310$$
Note that: Γ(n) = (n − 1)! 
<br>
$$\int_{0.5}^{0.6}2310 \, \theta^6(1-\theta)^4 \, d\theta$$

We can approximate the above integral using the trapezoid rule $\int_a^b f(x)dx \approx (b-a)(f(a) + f(b))/2$

In this case we have f(0.5) = 2.26 and f(0.6) = 2.76,
so the probability is approximately $0.1 \times (2.26 + 2.76)/2 \approx 0.25$.
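
The arithmetic for part c) can be verified with a short script. This is a sketch using the standard-library `math.gamma`; the helper name `beta_pdf` is made up for the example.

```python
import math


def beta_pdf(theta, a, b):
    """Beta(a, b) density; the leading factor is 1 / B(a, b)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)


lo, hi = 0.5, 0.6
f_lo = beta_pdf(lo, 7, 5)
f_hi = beta_pdf(hi, 7, 5)
# Single-interval trapezoid rule: width times the average endpoint height.
probability = (hi - lo) * (f_lo + f_hi) / 2
print(round(f_lo, 2), round(f_hi, 2), round(probability, 2))  # 2.26 2.76 0.25
```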

<h4>d) Again suppose that R = 6 and that a Beta(1,1) prior has been chosen for $\theta$. Show
that the predictive distribution:<br>
$$p(\tilde{Z}|R) =\int  p(\tilde{Z}|\theta)p(\theta|R)d\theta$$

for predicting whether the log return will be negative on a particular week in the
future is equal to:<br>
$$p(\tilde{Z}|R) = \frac{1}{12} \frac{Γ(7 + \tilde{Z})Γ(6 - \tilde{Z})}{Γ(7)Γ(5)}$$

where $\tilde{Z}$ is the value of $Z_i$ on the week in question</h4>


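Since $\tilde{Z}$ is either 0 or 1, $p(\tilde{Z}|\theta) = \theta^{\tilde{Z}}(1-\theta)^{1-\tilde{Z}}$, and the posterior is the Beta(7, 5) density $p(\theta|R) = \frac{Γ(12)}{Γ(7)Γ(5)}\theta^{6}(1-\theta)^{4}$. Therefore:

$$p(\tilde{Z}|R) = \frac{Γ(12)}{Γ(7)Γ(5)}\int_0^1 \theta^{6+\tilde{Z}}(1-\theta)^{5-\tilde{Z}}\,d\theta = \frac{Γ(12)}{Γ(7)Γ(5)} \cdot \frac{Γ(7+\tilde{Z})Γ(6-\tilde{Z})}{Γ(13)} = \frac{1}{12}\frac{Γ(7+\tilde{Z})Γ(6-\tilde{Z})}{Γ(7)Γ(5)}$$

The integral is the Beta function $B(7+\tilde{Z}, 6-\tilde{Z})$, and the last step uses $Γ(13) = 12\,Γ(12)$.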

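<h4>
e) Based on the above equation for p($\tilde{Z}$|R) given in part d), compute the probability
of the log return being negative on a particular week in the future.

Note that when n is a positive integer, the Gamma function Γ(n) is equal to Γ(n) =
(n − 1)! = 1 × 2 × ... × (n − 1).</h4>

A negative log return corresponds to $\tilde{Z} = 1$, so:

$$p(\tilde{Z} = 1|R) = \frac{1}{12}\frac{Γ(8)Γ(5)}{Γ(7)Γ(5)} = \frac{1}{12} \cdot \frac{7!}{6!} = \frac{7}{12} \approx 0.58$$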
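
The formula given in part d) can be evaluated numerically. The sketch below writes it for a general Beta(α, β) posterior, which reduces to the formula in part d) when α = 7 and β = 5; the function name is illustrative.

```python
import math


def predictive(z, alpha=7, beta=5):
    """p(Z~ = z | R) for z in {0, 1} under a Beta(alpha, beta) posterior.

    With alpha = 7 and beta = 5 this is
    (1/12) * Gamma(7 + z) * Gamma(6 - z) / (Gamma(7) * Gamma(5)).
    """
    return (math.gamma(alpha + z) * math.gamma(beta + 1 - z)
            / (math.gamma(alpha) * math.gamma(beta) * (alpha + beta)))


p_negative = predictive(1)  # Z~ = 1 means a negative log return
p_non_negative = predictive(0)
print(round(p_negative, 4), round(p_non_negative, 4))  # 0.5833 0.4167
```

The two probabilities are 7/12 and 5/12, so they sum to 1 as a predictive distribution should.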

--------------------------------------------------------------------

<h4>An insurance company offers hurricane insurance. To price the insurance product,
the company must estimate the number of hurricanes that occur in a given year. Let
$Y_1 , . . . , Y_n$ denote the number of hurricanes that have occurred in each of the previous n years. The company models these as independent and identically distributed Poisson(λ) random variables, i.e. $Y_i ∼ Poisson(λ)$.
<br><br>
a) Show that the Gamma($\alpha$, $\beta$) distribution is the conjugate prior for λ (you do not need to evaluate the integral in the denominator of Bayes theorem).</h4>

* Start by writing out Bayes theorem (minus the denominator) and substituting in both the Poisson likelihood and the Gamma prior:

$$p(λ|Y) \propto p(Y|λ)p(λ) = \frac{λ^Ye^{-λ}}{Y!} \frac{\beta^{\alpha}}{Γ(\alpha)}λ^{\alpha-1}{e^{-\beta λ}}$$

* Then drop the factors that do not depend on λ, as they are just part of the normalizing constant:

$$\propto  λ^{\alpha-1}{e^{-\beta λ}λ^Ye^{-λ}}$$

* Then simplify:
$$=λ^{\alpha-1+Y}e^{-λ(\beta+1)}$$

This is the form of a Gamma($\alpha + Y$, $\beta + 1$) distribution, so the posterior is in the same family as the prior.


Notes: This is demonstrating the conjugate prior definition (see above) in action. The posterior is proportional to the likelihood * prior. The denominators are dropped as they only serve a normalization role and don't affect the form of the solution.

<h4>
b) Write down the posterior distribution of λ given an uninformative Gamma(0,0)
prior, and the observations $Y_1 = 3, Y_2 = 0, Y_3 = 1, Y_4 = 0, Y_5 = 4$.</h4>

Based on the above argument, given $Y_1, \ldots, Y_n$ the posterior is Gamma($\alpha+\sum Y_i , \beta + n$).

So in this case the posterior is Gamma(8,5).
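
The update rule amounts to adding the total count to α and the number of observations to β. A minimal sketch with an illustrative function name, treating β as a rate parameter so that the posterior mean is α / β:

```python
def gamma_poisson_update(alpha, beta, counts):
    """Posterior Gamma parameters after observing Poisson counts."""
    return alpha + sum(counts), beta + len(counts)


hurricanes = [3, 0, 1, 0, 4]
post_alpha, post_beta = gamma_poisson_update(0, 0, hurricanes)
posterior_mean = post_alpha / post_beta  # mean of a Gamma(shape, rate)
print(post_alpha, post_beta, posterior_mean)  # 8 5 1.6
```

With the Gamma(0,0) prior the posterior mean equals the sample mean of the observed counts.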

<h4>
c) Derive the predictive distribution:
$$p(\tilde{Y} |Y_1 , \ldots , Y_5 ) = \int p(\tilde{Y}|λ)p(λ|Y_1 , \ldots , Y_5 )dλ$$

for predicting the number of hurricanes occurring in a future year, given the prior
and data from part b). </h4>

<img src="img/0004.png" height="100" width="400">

Note: the prior and likelihood positions have been switched around here


-----------------------------------

<H4>Explain numerical integration</h4>
Numeric integration is a set of techniques for solving definite integrals
in cases where we cannot find an analytic solution for the indefinite
integral. I.e. it lets us do integrals of the form:

$$\int_a^bf(x)dx$$

When doing Bayesian analysis on any non-trivial problem, some form of
numeric integration will usually be required. Indeed, while most of the
theory and mathematics of Bayesian inference was initially worked out
during the years 1900-1970, it was only with the widespread availability
of fast computers able to quickly perform numeric integration that it
became popular.

<H4>Explain the Trapezoid rule</h4>

The Trapezoid rule is one method for approximating a definite integral (the area under a curve) in cases where we cannot find an analytic solution for the indefinite integral.


<img src="img/0005.png" height="100" width="200">

It works by taking the average of the function values at the two endpoints and multiplying it by the width:

$$\int_a^bf(x)dx \approx \frac{f(a)+f(b)}{2}(b-a)$$

To gain more accuracy, you can divide the interval into several narrower sub-intervals, apply the rule to each one separately, and add up the results.
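
Splitting the interval into several sub-intervals is often called the composite trapezoid rule. The sketch below applies it to a test integral chosen only because its exact value is known, $\int_0^1 x^2 dx = 1/3$; the function name and the test integrand are illustrative.

```python
def trapezoid(f, a, b, panels=1):
    """Composite trapezoid rule with `panels` equal-width sub-intervals."""
    width = (b - a) / panels
    xs = [a + i * width for i in range(panels + 1)]
    # Apply the single-interval rule to each neighbouring pair and add up.
    return sum((f(x0) + f(x1)) / 2 * width for x0, x1 in zip(xs, xs[1:]))


def square(x):
    return x ** 2


approximations = {n: trapezoid(square, 0.0, 1.0, n) for n in (1, 10, 100)}
for n, value in approximations.items():
    print(n, round(value, 5), "error:", round(abs(value - 1 / 3), 5))
```

The error shrinks roughly a hundredfold each time the number of sub-intervals grows tenfold, which is the behaviour expected of the trapezoid rule for smooth functions.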
