# Exercise set 2

> This exercise set investigates Bayes' theorem and uses ANOVA to determine whether a certain factor affects the results of an experiment.

## Exercise 2.1 Bayes' theorem

A particular disease occurs randomly in the general population with a probability of $\tfrac{1}{10\,000}$.
A test has been developed for this disease which is $99\%$ correct. Here, this means
that if a person has the disease, the test will correctly be positive in $99\%$ of the cases.
Similarly, if a person does not have the disease, the test will correctly be negative in $99\%$ of the cases.

### 2.1(a)
Assume that we are testing for this disease in a population of $1\,000\,000$ people. How many
are expected to have the disease? How many are expected to not
have the disease? Further:

- (i)  How many people with the disease will have a positive test?

- (ii)  How many people with the disease will have a negative test?

- (iii)  How many people without the disease will have a positive test? 

- (iv)  How many people without the disease will have a negative test? 


Summarize your answers to the four points above in a table of the following
form:

|**Test result**|**Has the disease: Yes**|**Has the disease: No**|
|:--------------|:----------------------:|:---------------------:|
| Positive      |                        |                       |
| Negative      |                        |                       |

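As a hedged illustration of the bookkeeping behind a table of this form (not the answer to the exercise), the sketch below computes the four expected counts from a prevalence and the two test accuracies. The function name `expected_counts` and the numbers used (a population of 1000, a prevalence of 0.1, and accuracies of 90% and 80%) are assumptions chosen only for illustration.

```python
def expected_counts(population, prevalence, p_pos_if_sick, p_neg_if_healthy):
    """Return the expected counts for the four cells of the 2x2 table.

    All argument names are hypothetical; they are not taken from the exercise.
    """
    sick = population * prevalence  # Expected number with the disease
    healthy = population - sick  # Expected number without the disease
    return {
        ("sick", "positive"): sick * p_pos_if_sick,
        ("sick", "negative"): sick * (1 - p_pos_if_sick),
        ("healthy", "positive"): healthy * (1 - p_neg_if_healthy),
        ("healthy", "negative"): healthy * p_neg_if_healthy,
    }


# Illustrative numbers only, deliberately different from the exercise:
counts = expected_counts(
    population=1000, prevalence=0.1, p_pos_if_sick=0.9, p_neg_if_healthy=0.8
)
for (status, result), value in counts.items():
    print(f"{status:>8s} / {result:<8s}: {value:.1f}")
```

The four cells always add up to the population size, which is a quick sanity check on the table.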


In [None]:
# Your code here

#### Your answer to question 2.1(a):
*Double click here*

### 2.1(b)
Later in the course, when dealing with classification problems,
we will refer to this table as the "confusion matrix". This table
summarizes the types of errors we are making: the number of
false positives ("FP") and false negatives ("FN").
In addition, it shows us the number of true positives ("TP") and true negatives ("TN").
Identify the location of these $4$ labels in the table above.
Using these labels, we can define different metrics that tell us
something about the performance of the test. One example of such a
measure is the [precision](https://en.wikipedia.org/wiki/Precision_and_recall) which is
defined as the ratio between the number of true positives and
the total number of positives:

\begin{equation}
\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
\tag{1}\end{equation}

We can interpret the precision as the probability of having the
disease, given that the test was positive.
Calculate this probability, using the formula given above.
Do you have any comments about the size of
this probability?
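
Equation (1) translates directly into a small helper function. The sketch below is only an illustration: the function name `precision` and the counts passed to it are hypothetical and are not the counts from this exercise.

```python
def precision(true_positives, false_positives):
    """Precision as defined in Eq. (1): TP / (TP + FP)."""
    total_positives = true_positives + false_positives
    if total_positives == 0:
        raise ValueError("Precision is undefined when there are no positive tests.")
    return true_positives / total_positives


# Hypothetical counts, chosen only to illustrate the formula:
example_precision = precision(true_positives=90, false_positives=180)
print(f"Precision: {example_precision:.3f}")
```

Note that the number of negative tests does not enter the formula, so the precision only describes what a positive test result means.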




In [None]:
# Your code here

#### Your answer to question 2.1(b):
*Double click here*

### 2.1(c)
Use Bayes' theorem to calculate the probability of
having the disease, given that a test was positive. Compare
this to the probability you found in the previous answer.
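
For a test with two outcomes, Bayes' theorem combined with the law of total probability reads $P(D \mid +) = \frac{P(+ \mid D) P(D)}{P(+ \mid D) P(D) + P(+ \mid \neg D) P(\neg D)}$, where $D$ denotes having the disease and $+$ a positive test. The sketch below evaluates this expression. The function name `posterior_given_positive` and the input probabilities are assumptions used only for illustration; they are not the numbers from this exercise.

```python
def posterior_given_positive(prior, p_pos_if_sick, p_neg_if_healthy):
    """Return P(disease | positive test) from Bayes' theorem.

    prior: P(disease)
    p_pos_if_sick: P(positive | disease)
    p_neg_if_healthy: P(negative | no disease)
    """
    p_pos_if_healthy = 1 - p_neg_if_healthy  # False positive rate
    numerator = p_pos_if_sick * prior
    # Law of total probability: P(positive) over both groups
    evidence = numerator + p_pos_if_healthy * (1 - prior)
    return numerator / evidence


# Illustrative probabilities only:
posterior = posterior_given_positive(prior=0.1, p_pos_if_sick=0.9, p_neg_if_healthy=0.8)
print(f"P(disease | positive) = {posterior:.3f}")
```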

In [None]:
# Your code here

#### Your answer to question 2.1(c):
*Double click here*

## Exercise 2.2

The fertilizer magnesium ammonium phosphate MgNH$_4$PO$_4$ is an effective
supplier of nutrients necessary for plant growth. In an experiment, you have
tested the effect of this fertilizer on the growth of chrysanthemums by
measuring the height of the plants after growing them for four weeks.
You have considered $4$ different concentrations of the fertilizer (measured in g/bu)
and you have grown $10$ plants per concentration, measuring the height of each plant.
The measured data are given in Table 1 and in the file
[`Data/fertilizer.txt`](Data/fertilizer.txt).


|**50 g/bu** | **100 g/bu** | **200 g/bu** | **400 g/bu** |
|:---:|:---:|:---:|:---:|
|$13.2$ | $16.0$ | $7.8$ | $21.0$ |
|$12.4$ | $12.6$ | $14.4$ | $14.8$ |
|$12.8$ | $14.8$ | $20.0$ | $19.1$ |
|$17.2$ | $13.0$ | $15.8$ | $15.8$ |
|$13.0$ | $14.0$ | $17.0$ | $18.0$ |
|$14.0$ | $23.6$ | $27.0$ | $26.0$ |
|$14.2$ | $14.0$ | $19.6$ | $21.1$ |
|$21.6$ | $17.0$ | $18.0$ | $22.0$ |
|$15.0$ | $22.2$ | $20.2$ | $25.0$ |
|$20.0$ | $24.4$ | $23.2$ | $18.2$ |


**Table 1:** *Measured plant heights (in cm) as a function of the fertilizer concentration (in g/bu).*

### 2.2(a)
Here, we will test the hypothesis that the mean height
of the plants is not affected by the amount of fertilizer. We
are going to test this with a specified significance level, $\alpha$.
What is the meaning of $\alpha$ in connection with a hypothesis test?

In [None]:
# Your code here

#### Your answer to question 2.2(a):
*Double click here*

### 2.2(b)
To test the hypothesis, we will perform ANOVA. Before we do that,
it is a good idea to visualize the raw data. Create a suitable plot
of the raw data. Does it look like our hypothesis is correct?

In [None]:
# Here is some code to get you started:
import pandas as pd  # For reading the data
from matplotlib import pyplot as plt  # For plotting
import seaborn as sns  # More plotting

sns.set_context("notebook")  # Style plots for a Jupyter notebook

data = pd.read_csv("Data/fertilizer.txt")  # Read the data

In [None]:
data  # Just show the data table

In [None]:
# Your code here.
# Here is an example plot:
fig, ax = plt.subplots(constrained_layout=True)
sns.stripplot(data=data, jitter=False, ax=ax, s=8)
ax.set(xlabel="Fertilizer concentration", ylabel="Height (cm)")
sns.despine(fig=fig)

#### Your answer to question 2.2(b):
*Double click here*

### 2.2(c)
In connection with performing ANOVA, we need to calculate some
terms that measure the variance within groups and the variance
between groups.
Calculate these terms ($SST$, $SSA$, and $SSE$).
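
As a minimal sketch, assuming the standard definitions for a balanced one-way layout with $k$ groups and $n$ observations per group, the terms are $SST = \sum_{i}\sum_{j} (y_{ij} - \bar{y})^2$, $SSA = n \sum_{i} (\bar{y}_{i} - \bar{y})^2$, and $SSE = \sum_{i}\sum_{j} (y_{ij} - \bar{y}_{i})^2$, where $\bar{y}$ is the grand mean and $\bar{y}_i$ the mean of group $i$. The notation in the lecture notebook may differ. The toy array below is hypothetical and is not the fertilizer data.

```python
import numpy as np

# Hypothetical toy data: rows are replicates (n = 2), columns are groups (k = 3).
toy = np.array(
    [
        [1.0, 2.0, 6.0],
        [3.0, 4.0, 8.0],
    ]
)
n, k = toy.shape

grand_mean = toy.mean()  # Mean over all observations
group_means = toy.mean(axis=0)  # One mean per column (group)

sst = ((toy - grand_mean) ** 2).sum()  # Total variation
ssa = n * ((group_means - grand_mean) ** 2).sum()  # Variation between groups
sse = ((toy - group_means) ** 2).sum()  # Variation within groups

print(f"SST = {sst:.3f}, SSA = {ssa:.3f}, SSE = {sse:.3f}")
print("SST equals SSA + SSE:", np.isclose(sst, ssa + sse))
```

Checking that $SST = SSA + SSE$ is a useful test of the implementation, since the identity holds for any data set.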

In [None]:
# Your code here. Hint: See the example notebook for lecture 2.

#### Your answer to question 2.2(c):
*Double click here*

### 2.2(d)
Using the terms you calculated in the previous point, obtain
the two estimates for the variance:

* (i)  $s_1^2 = \frac{SSA}{k-1}$,

* (ii)  $s^2 = \frac{SSE}{k(n-1)}$,


and calculate the $f$-statistic: $f=s_1^2/s^2$.
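
The two variance estimates and their ratio can be wrapped in a small function, as sketched below. This assumes that $k$ is the number of groups and $n$ the number of observations per group. The function name `f_statistic` and the input values are hypothetical and are not results for the fertilizer data.

```python
def f_statistic(ssa, sse, k, n):
    """Return (s1_squared, s_squared, f) for a balanced one-way ANOVA."""
    s1_squared = ssa / (k - 1)  # Variance estimate from between-group variation
    s_squared = sse / (k * (n - 1))  # Variance estimate from within-group variation
    return s1_squared, s_squared, s1_squared / s_squared


# Hypothetical sums of squares for k = 3 groups with n = 2 observations each:
s1_squared, s_squared, f_value = f_statistic(ssa=28.0, sse=6.0, k=3, n=2)
print(f"s1^2 = {s1_squared}, s^2 = {s_squared}, f = {f_value}")
```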

In [None]:
# Your code here

#### Your answer to question 2.2(d):
*Double click here*

### 2.2(e)
To perform the actual hypothesis test, we need the critical value
from the $F$-distribution. For a specified significance level, $\alpha$,
and with $3$ and $36$ degrees of freedom, we
label the critical value as $f_{\alpha}(3, 36)$. Check that I am using
the correct degrees of freedom here, consistent with the
raw data given in Table 1.

In [None]:
# Your code here

#### Your answer to question 2.2(e):
*Double click here*

### 2.2(f)
Critical $f$ values can be calculated from the distribution
function or found in statistical tables. With Python, the calculation can be done with
[`scipy.stats.f.ppf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html)
from the
[SciPy library](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f.html),
and one such table can be found
[online](https://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm).
For the significance levels $\alpha=0.05$ and $\alpha=0.10$, I have found the
following two values:

* (i)  $f_{\alpha = 0.05}(3, 36) = 2.866$

* (ii)  $f_{\alpha=0.10}(3, 36) = 2.243$


Check that these two values are correct.

In [None]:
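# Your code here. Here is an example of how to use scipy.stats.f.ppf:
import scipy.stats  # Import the submodule explicitly so scipy.stats is available

alpha = 0.1  # Significance level
dof1 = 3  # Degrees of freedom for the numerator
dof2 = 36  # Degrees of freedom for the denominator
f_critical = scipy.stats.f.ppf(1 - alpha, dof1, dof2)  # Upper-tail critical value
print(f_critical)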

#### Your answer to question 2.2(f):
*Double click here*

### 2.2(g)
Based on the calculations you have done,
can you conclude at the $0.05$ level of significance that
different concentrations of the fertilizer affect the mean attained height
of the plants? Which concentration, if any, appears to give the tallest plants?
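
The decision rule itself can be sketched in a few lines: the null hypothesis is rejected when the observed $f$ exceeds the critical value, which is equivalent to the p-value being smaller than $\alpha$. The values of `f_value`, `dof1`, and `dof2` below are hypothetical and are not the results for the fertilizer data.

```python
from scipy import stats

# Hypothetical inputs, chosen only to illustrate the decision rule:
f_value = 7.0
dof1, dof2 = 2, 3
alpha = 0.05

f_critical = stats.f.ppf(1 - alpha, dof1, dof2)  # Upper-tail critical value
p_value = stats.f.sf(f_value, dof1, dof2)  # P(F >= f_value) under the null hypothesis
reject = f_value > f_critical  # Same decision as p_value < alpha

print(f"f = {f_value:.3f}, critical value = {f_critical:.3f}, p-value = {p_value:.4f}")
print("Reject the null hypothesis:", reject)
```

Reporting the p-value alongside the decision shows how close the result is to the chosen threshold.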

In [None]:
# Your code here

#### Your answer to question 2.2(g):
*Double click here*

### 2.2(h)
Would your conclusion change with a significance level of $0.10$?

In [None]:
# Your code here

#### Your answer to question 2.2(h):
*Double click here*

### 2.2(i)
(Optional) Redo this exercise, but use the [`anova_lm`](https://www.statsmodels.org/stable/generated/statsmodels.stats.anova.anova_lm.html) function
from the Python package [statsmodels](https://www.statsmodels.org), or the
[`scipy.stats.f_oneway`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) function from the Python package [SciPy](https://www.scipy.org/), to run ANOVA.
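
As a minimal sketch of the SciPy route, `scipy.stats.f_oneway` takes one sequence of measurements per group and returns the $f$-statistic together with its p-value. The three toy groups below are hypothetical and are not the fertilizer data.

```python
from scipy import stats

# Hypothetical toy groups, one sequence per group:
group_a = [1.0, 3.0]
group_b = [2.0, 4.0]
group_c = [6.0, 8.0]

result = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
```

If each column of the `data` table holds one concentration (an assumption about the file layout), the groups could be passed as `stats.f_oneway(*[data[col] for col in data.columns])`.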

In [None]:
# Your code here

#### Your answer to question 2.2(i):
*Double click here*