## Probability and Statistics: Assignment 2


### Problem 1: Bayesian Interpretation and Probability

Let the events be defined as follows:

$H$: *the event that Alex is an expert, i.e. the hypothesis*

$E$: *the event of hitting 3 bull's-eyes out of 5, i.e. the evidence*

**Prior Belief** : It is the probability we assign to an event before seeing any new evidence. It is largely based on general data, existing knowledge, or just our intuition, and it serves as the starting point that new evidence then updates.

In our case, we assume there is only a 1% chance that Alex is as good as he claims.
Hence,
$$
P(Expert)=P(H)=0.01 ~\text{or} ~\frac{1}{100}
$$

**Likelihood** : The number of bull's-eyes he hits in 5 trials follows a Binomial distribution, $X\sim \text{Binom}(5, 0.7)$, if he is an expert and is as good as he claims. Otherwise, $X\sim \text{Binom}(5, 0.1)$, i.e. he is not an expert and his accuracy matches the general population. If Alex is an expert, the likelihood of him hitting 3 bull's-eyes out of 5 is
$$
P(\text{3 bull's-eyes out of 5 | Expert})=P(E | H)=\binom{5}{3}\cdot(0.7)^3\cdot(0.3)^2 = 0.3087 ~\text{or}~ 30.87\%
$$

If he is not an expert, the same likelihood is
$$
P(\text{3 bull's-eyes out of 5 | Not an Expert})=P(E | H')=\binom{5}{3}\cdot(0.1)^3\cdot(0.9)^2 = 0.0081 ~\text{or}~ 0.81\%
$$

**Bayesian Update** : Bayes' theorem is given by
$$
P(H | E) = \frac {P(E | H)P(H)}{P(E | H)P(H) + P(E | H')P(H')} 
$$ 
In our case,

$$
P(H | E)= \frac {\binom{5}{3}\cdot(0.7)^3\cdot(0.3)^2\cdot\frac{1}{100}}{\binom{5}{3}\cdot(0.7)^3\cdot(0.3)^2\cdot\frac{1}{100}+\binom{5}{3}\cdot(0.1)^3\cdot(0.9)^2\cdot\frac{99}{100}} \approx 0.278
$$

**Interpretation**

- (a) The posterior, i.e. our updated belief given the new evidence, is approximately **0.278**.

- (b) Initially, we believed that Alex is not as good as he claims, and based on our intuition we set the probability of him being an expert at only 1%. The evidence that Alex hit 3 bull's-eyes out of 5 raised that probability to 27.8%.

Intuitively, hitting 3 out of 5 is far more plausible for an expert than for a non-expert. The numbers agree: the probability of this outcome is 30.87% for an expert but only about 0.81% for a non-expert, a likelihood ratio of roughly 38 in favour of $H$.

- (c) If we change our prior belief to 20%, then $P(H) = 0.2$
$$
\therefore P(H | E) = \frac {\binom{5}{3}\cdot(0.7)^3\cdot(0.3)^2\cdot\frac{1}{5}}{\binom{5}{3}\cdot(0.7)^3\cdot(0.3)^2\cdot\frac{1}{5}+\binom{5}{3}\cdot(0.1)^3\cdot(0.9)^2\cdot\frac{4}{5}} \approx 0.905 ~\text{or}~ 90.5\%
$$

With the larger prior, the posterior probability that Alex is an expert rises sharply. Even a prior of only 1% was lifted to 27.8%, because the evidence strongly favours the hypothesis that Alex is an expert. Starting from a prior of 20%, the same evidence lifts the belief to 90.5%. With only five trials of data, the prior strongly influences the result, and the posterior is highly sensitive to changes in it.
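The update in parts (a) and (c) can be checked numerically. The following is a minimal Python sketch, not part of the original solution; the function names `binom_pmf` and `posterior_expert` are illustrative choices, and the hit rates 0.7 and 0.1 are the ones assumed in the likelihood step above.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_expert(prior: float, k: int = 3, n: int = 5,
                     p_expert: float = 0.7, p_other: float = 0.1) -> float:
    """P(H | E) from Bayes' theorem for the two-hypothesis setup."""
    like_h = binom_pmf(k, n, p_expert)      # P(E | H)
    like_not_h = binom_pmf(k, n, p_other)   # P(E | H')
    evidence = like_h * prior + like_not_h * (1 - prior)  # P(E)
    return like_h * prior / evidence


# Compare the two priors discussed in the text.
for prior in (0.01, 0.2):
    print(f"prior={prior:.2f} -> posterior={posterior_expert(prior):.3f}")
```

Running it should reproduce the values 0.278 and 0.905 obtained by hand, and other priors can be tried to see how sensitive the posterior is.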

### Problem 2: Estimating an Exponential Rate from Truncated Data

**Part A**

Since $T_i$ follows an exponential distribution with rate $\lambda$, its pdf is given as:
$$
f_T(t) = 
\begin{cases}
\lambda e^{-\lambda t}, & t \geq 0 \\
0, & t < 0
\end{cases}
$$
Since we only observe $T_i$ when $T_i \geq 10$, we normalize the pdf over the new domain so that it still integrates to 1:
$$
\begin{align*}
f_{T | T \geq 10}(t) &= \frac{f_{T}(t)}{P(T \geq 10)} \\
                     &= \frac{\lambda e^{-\lambda t}}{\int_{10}^\infty \lambda e^{-\lambda s}\, ds} = \frac{\lambda e^{-\lambda t}}{e^{-10\lambda}} = \lambda e^{-\lambda (t-10)}, ~t \geq 10
\end{align*}
$$

From this, we get the likelihood function as
$$
L(\lambda)= \prod_{i=1}^n \lambda e^{-\lambda (t_i-10)} 
$$

The log-likelihood is given as:

$$
\begin{align*}
l(\lambda) &= \log L(\lambda) \\
           &= \log\left(\lambda^n \prod_{i=1}^n e^{-\lambda (t_i-10)}\right) \\
           &= n\log\lambda -\lambda \sum_{i=1}^n (t_i-10)
\end{align*}
$$

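For the MLE, we set the derivative to zero:

$$
\begin{align*}
\frac{dl(\lambda)}{d\lambda} &= 0 \Rightarrow~ \frac{n}{\lambda}-\sum_{i=1}^n (t_i-10)=0 \Rightarrow~ \hat{\lambda}=\frac{n}{\sum_{i=1}^n (t_i-10)}
\end{align*}
$$

Since $\frac{d^2 l}{d\lambda^2} = -\frac{n}{\lambda^2} < 0$, this stationary point is a maximum.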

- Ignoring the truncation means fitting the model over $t \geq 0$, including the region below 10 minutes where no data can be recorded. The naive estimate would be $n / \sum_{i=1}^n t_i$, which is lower than $\hat{\lambda}$ above, so it understates the rate and misrepresents the data we have.

- Our MLE assumes that every waiting time greater than 10 minutes is recorded. If the device also misses some data points with $t > 10$, this assumption fails and the estimate can be wrong. One possible remedy is to condition the pdf, and hence the likelihood, on the device's random recording behaviour as well.
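The effect of ignoring truncation can be illustrated with a small simulation. This is only a sketch under assumed values: the true rate of 0.2 per minute, the sample size, and the helper names `truncated_mle` and `naive_mle` are all hypothetical and not given in the problem.

```python
import random


def truncated_mle(times, cutoff=10.0):
    """MLE of lambda when only waits >= cutoff are recorded: n / sum(t_i - cutoff)."""
    excess = [t - cutoff for t in times]
    return len(excess) / sum(excess)


def naive_mle(times):
    """Plain exponential MLE n / sum(t_i), which ignores the truncation."""
    return len(times) / sum(times)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_rate = 0.2          # hypothetical rate, chosen only for illustration

# Draw exponential waiting times and keep only those the device would record.
waits = [rng.expovariate(true_rate) for _ in range(200_000)]
observed = [t for t in waits if t >= 10.0]

print(f"recorded waits  : {len(observed)}")
print(f"truncated MLE   : {truncated_mle(observed):.4f}")
print(f"naive MLE       : {naive_mle(observed):.4f}")
```

On a sample this large the truncated MLE should land close to the assumed rate, while the naive estimate tends towards $1/(10 + 1/\lambda)$, which is well below it.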

**Part B**

Since we have assumed a Gamma prior on $\lambda$, we have $\lambda \sim \text{Gamma}(\alpha, \beta)$.

$$
\therefore p(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda}, ~ \lambda > 0
$$

Thus the unnormalized posterior is given as:

$$
\begin{align*}
p(\lambda | \text{data}) &\propto p(\text{data} | \lambda)\cdot p(\lambda) \\
                         &= \lambda^n \prod_{i=1}^n e^{-\lambda (t_i-10)}\cdot \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha - 1} e^{-\beta \lambda}
\end{align*}
$$

The log-posterior is given as:

$$
\log p(\lambda | \text{data}) = n\log\lambda -\lambda \sum_{i=1}^n (t_i-10) + (\alpha -1)\log\lambda -\beta\lambda + k
$$

where $k$ collects the terms that do not depend on $\lambda$. To maximise this, we set its derivative to 0.

$$
\begin{align*}
\frac{d\log p(\lambda | \text{data})}{d\lambda} &= 0 \Rightarrow \frac{n}{\lambda}-\sum_{i=1}^n (t_i-10) + \frac{(\alpha -1)}{\lambda} - \beta = 0 \Rightarrow \hat{\lambda}_{\text{MAP}}=\frac{n+\alpha-1}{\sum_{i=1}^n (t_i-10) + \beta}
\end{align*}
$$

- The MAP estimate also contains $\alpha$ and $\beta$, which encode what we believed about $\lambda$ before we started observing, whereas the MLE is based purely on the observations we made.

MLE and MAP can differ significantly when the sample size is small. We saw this in the earlier problem as well: with little data, the result is strongly prior-driven and consequently highly sensitive to the prior. As $n$ grows, the data terms $n$ and $\sum_{i=1}^n (t_i-10)$ dominate $\alpha - 1$ and $\beta$, and the two estimates converge.
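The gap between the two estimators at small $n$ can be seen with a few made-up numbers. In this sketch the waiting times and the prior parameters $\alpha = 2$, $\beta = 10$ are hypothetical, chosen only to illustrate the formulas derived above; the mode formula also assumes $n + \alpha - 1 > 0$.

```python
def mle_rate(times, cutoff=10.0):
    """Truncated-exponential MLE: n / sum(t_i - cutoff)."""
    return len(times) / sum(t - cutoff for t in times)


def map_rate(times, alpha, beta, cutoff=10.0):
    """Posterior mode under a Gamma(alpha, beta) prior (shape, rate)."""
    return (len(times) + alpha - 1) / (sum(t - cutoff for t in times) + beta)


alpha, beta = 2.0, 10.0        # hypothetical prior parameters
small = [11.0, 14.0, 25.0]     # hypothetical recorded waits (minutes)
large = small * 200            # same pattern repeated, to mimic a big sample

for name, data in (("n=3", small), ("n=600", large)):
    print(f"{name:>6}: MLE={mle_rate(data):.4f}  MAP={map_rate(data, alpha, beta):.4f}")
```

With three observations the prior pulls the MAP estimate noticeably away from the MLE; with 600 observations the two should agree to about three decimal places.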

**Interpretation**
- $\lambda$ calculated using pure MLE is based only on the observed data. Thus a single outlier can change the result drastically, which is often not an accurate representation of how things actually are. This can be understood through the difference between climate and weather. Suppose you visit a new city planning to buy a property there, but it is raining that day, so you decide against it, assuming it is always rainy. This is similar to MLE, where the decision rests on immediate observations alone.
On the other hand, climate describes the trend of the weather in that place over the years, and so gives more reliable data for your decision. This is similar to MAP, which takes prior information into account as well and can lead to more holistic results.  
Whether to use MAP or MLE is mostly case-dependent, and both have their uses.  
In our case, adding the prior gives a more holistic view of the bus arrivals, so the $\lambda$ obtained this way can be applied to the overall trends as well.

- Say that on a particular day, a huge rally is passing through the main roads. Traffic and delays can suddenly increase the waiting times for buses that day, skewing the MLE. In this case, MAP gives a better estimate of $\lambda$ than MLE, because it takes into account what we already know about the usual arrival times, so a single outlier does not affect the estimate drastically.



### Problem 4: Hypothesis Testing with Known Variance

The filling amount follows a Normal distribution with $\mu = 500$ grams and $\sigma = 10$ grams. To check whether the machine is correctly calibrated, we formulate our hypotheses.

**Null Hypothesis, $H_0$** : $\mu = 500$, i.e. the machine is correctly calibrated and $\mu$ matches the initial value

**Alternative Hypothesis, $H_1$** : $\mu \neq 500$

Sampling $n=16$ bags, we get $\bar{X} = 504$ grams.

For known variance, the test statistic $Z$ for $\mu$ is given as:

$$
Z = \frac{\bar{X}-\mu_0}{\sigma / \sqrt{n}} = \frac{504-500}{10 / 4}= 1.6
$$

For significance $\alpha = 0.05$,  

$$
z_{\frac{\alpha}{2}}=z_{0.025}=1.96
$$

The acceptance region for a two-sided test with known variance is given by $|Z| \leq z_{\frac{\alpha}{2}}$.

Since $|Z| = 1.6 \leq 1.96$, our test statistic lies in this region, and we fail to reject the Null Hypothesis: the data are not strong enough to conclude that the machine is miscalibrated.
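The test above can be reproduced with Python's standard library. This is a minimal sketch; the function name `two_sided_z_test` is an illustrative choice, and the p-value is an extra quantity not computed in the solution above.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Return (z, critical value, p-value, reject?) for a known-variance z-test."""
    std_normal = NormalDist()                    # standard normal N(0, 1)
    z = (xbar - mu0) / (sigma / sqrt(n))         # test statistic
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-sided tail probability
    return z, z_crit, p_value, abs(z) > z_crit


z, z_crit, p_value, reject = two_sided_z_test(xbar=504, mu0=500, sigma=10, n=16)
print(f"Z={z:.2f}, critical value={z_crit:.2f}, p-value={p_value:.4f}, reject H0: {reject}")
```

With the numbers from this problem it should report $Z = 1.6$ against a critical value of about 1.96, with a two-sided p-value of roughly 0.11, which is above 0.05 and consistent with failing to reject $H_0$.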

**Interpretation**
- Rejecting $H_0$ means that the sample data are sufficiently inconsistent with $H_0$. In hypothesis testing, this means that our test statistic lies in the rejection region defined by $\alpha$, the level of significance. $\alpha$ is the probability of rejecting the Null Hypothesis when it is actually true. If our test statistic lies outside the acceptance region defined by this $\alpha$, we have enough evidence to reject the Null Hypothesis, and the result supports the alternative.

  If we fail to reject $H_0$, the data we collected were not enough to reject it. This does not prove that $H_0$ is true. The point of hypothesis testing is to see whether the evidence is sufficient to doubt the Null Hypothesis. If we reject, the evidence goes strongly against $H_0$; here, our conclusion is that the Null Hypothesis cannot be rejected because the evidence is not strong enough.

- With a small sample, random variation has a larger influence on the sample mean, leading to errors. A larger sample reduces the standard error $\sigma/\sqrt{n}$, which damps this variation and gives a better basis to reject $H_0$ when it is actually false, thus improving the power of the test.
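The link between sample size and power can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that the true mean is 504 grams; that value and the function name `power_two_sided` are hypothetical and not part of the problem statement.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided(mu_true, mu0, sigma, n, alpha=0.05):
    """P(reject H0) for a two-sided z-test when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = (mu_true - mu0) / (sigma / sqrt(n))   # mean of Z under mu_true
    # Z ~ N(shift, 1), so add up the probability in both rejection tails.
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


for n in (16, 36, 64, 100):
    print(f"n={n:>3}: power={power_two_sided(504, 500, 10, n):.3f}")
```

Under this assumed 4-gram shift, the power should come out at roughly 0.36 for $n = 16$ and roughly 0.89 for $n = 64$, so the same miscalibration becomes much easier to detect with more bags.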





