**Exercise set 6**
==============


>In connection with experimental design, we have seen two approaches
>for checking whether the effects we have determined are important:
>creating a probability plot and performing ANOVA.
>The goal of this exercise is to learn how to use these two approaches in
>practice.


**Exercise 6.1**

In this part of the exercise, we will deal with the technical aspects of
creating a normal probability plot. Our final aim is
to check whether the data for some measured quantities (given in
the data files
[data1.txt](Data/data1.txt) (located at `Data/data1.txt`),
[data2.txt](Data/data2.txt) (located at `Data/data2.txt`),
[data3.txt](Data/data3.txt) (located at `Data/data3.txt`), and
[data4.txt](Data/data4.txt) (located at `Data/data4.txt`))
come from a normal distribution.


**(a)**  Before we begin creating normal probability plots, we should
inspect the raw data. Plot histograms for the raw data. Based
on this, would you say that any of the data files contain numbers
that might come from a normal distribution?
You can also compare each data set directly with
a normal distribution: obtain
the mean and standard deviation of the data set, and plot
a normal distribution with these values in the same figure
as the histogram.
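
As a starting point, the comparison could be set up roughly as sketched below. The sketch uses a synthetic sample instead of the data files, since their layout is not described here; if the files contain plain whitespace-separated numbers (an assumption), something like `np.loadtxt("Data/data1.txt")` could take the place of the synthetic sample. The names `sample`, `grid`, and `fitted_pdf` are only illustrative.

```python
import numpy as np
from scipy.stats import norm

# Synthetic stand-in for one data file (assumption: not the real data).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)

# Sample mean and standard deviation (ddof=1: divide by n - 1).
mean = sample.mean()
std = sample.std(ddof=1)

# Normalized histogram: with density=True the bar areas sum to one,
# so the bars can be compared directly with a probability density.
heights, edges = np.histogram(sample, bins=30, density=True)

# Normal density with the same mean and standard deviation as the sample.
grid = np.linspace(edges[0], edges[-1], 200)
fitted_pdf = norm.pdf(grid, loc=mean, scale=std)
```

With `matplotlib`, calling `plt.hist(sample, bins=30, density=True)` followed by `plt.plot(grid, fitted_pdf)` would draw both in the same figure.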

In [None]:
# Your code here

**Your answer to question 6.1(a):** *Double click here*

**(b)**  For creating the normal probability plot, we need to be able
to obtain certain parameters for the standard normal distribution.
Here, we will investigate some of the functions that can give us
such parameters.

The standard normal probability density function ($\operatorname{PDF}$) is given by,

$\operatorname{PDF}(x) = \frac{1}{\sqrt{2 \pi}} \operatorname{e}^{-\tfrac{x^2}{2}},$

and the cumulative distribution function (CDF) is,

$
\operatorname{CDF}(x) = \int_{-\infty}^{x} \operatorname{PDF} (t)\,\operatorname{d}t = 
\frac{1}{2} \left[ 1 + \operatorname{erf} \left( \frac{x}{\sqrt{2}} \right) \right],
$

where $\operatorname{erf}(\ldots)$ is the error function.
The cumulative distribution function gives the probability of observing a
value less than or equal to $x$: $P(X \leq x) = \operatorname{CDF}(x)$.

We can also turn this equation around: given a probability $P$, what is
the value of $x$ that gives this probability? To answer this question,
we need the quantile function (also known as the percent-point function),
which is the inverse of the cumulative distribution function.
For the standard normal distribution, the percent-point
function ($\operatorname{PPF}$)
is given by,

$
\operatorname{PPF}(P) = \sqrt{2} \operatorname{erf}^{-1}(2P - 1).
$

If we make use of the `scipy` package, all these functions
are available to us:
```python
from scipy.stats import norm
import numpy as np

x = np.linspace(-2, 2, 100)  # Values of x.
p = np.linspace(0.01, 0.99, 99)  # Probabilities; the PPF is only defined on [0, 1].
pdf = norm.pdf(x)  # Probability density function.
cdf = norm.cdf(x)  # Cumulative distribution function.
ppf = norm.ppf(p)  # Quantile function/percent-point function.
```
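
As a quick consistency check, we can compare the error-function formulas above with the `scipy` functions, and verify that the $\operatorname{PPF}$ undoes the $\operatorname{CDF}$. This is only a sketch; the value $x = 0.5$ is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.special import erf, erfinv
from scipy.stats import norm

x = 0.5  # Arbitrary illustration value.

# CDF from the error-function formula versus scipy's implementation.
cdf_formula = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
cdf_scipy = norm.cdf(x)

# PPF from the inverse error function; applied to CDF(x) it should return x.
ppf_formula = np.sqrt(2.0) * erfinv(2.0 * cdf_scipy - 1.0)
ppf_scipy = norm.ppf(cdf_scipy)

print(cdf_formula, cdf_scipy)  # The two CDF values should agree.
print(ppf_formula, ppf_scipy)  # Both should be close to x.
```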


Use these methods to answer the following questions for the standard normal distribution:

* (i)  What is the probability of observing $x \leq 1$?

* (ii)  What is the probability of observing $x \leq 0$?

* (iii)  What is the probability of observing $x \leq -2$?

* (iv)  Given that the probability of observing $x \leq \alpha$ is
$10$\%, what is $\alpha$?

* (v)  Given that the probability of observing $x \leq \alpha$ is
$90$\%, what is $\alpha$?

* (vi)  Given that the probability of observing $x \leq \alpha$ is
$99$\%, what is $\alpha$?






In [None]:
# Your code here

**Your answer to question 6.1(b):** *Double click here*

**(c)**  To construct the normal probability plot, we will make use of
the $\operatorname{PPF}$.
If the data we are to investigate contain $n$ points, then
we need to figure out how these $n$ points would be placed
in the distribution we are going to compare with (here: the standard normal distribution).
In the following, we will denote the $n$ points we have measured
by $y_1$, $y_2$, $\ldots$, $y_n$, and we assume that we have
sorted them so that $y_1 \leq y_2 \leq \ldots \leq y_n$.

We now need to check
how $n$ points drawn from a normal distribution would
be distributed, and compare this with how our measured data is distributed.
One way of doing that is to find the most probable location ($x_1$) of the
smallest value, the most probable location ($x_2$) of the second smallest
value, and so on, up to the most probable location ($x_n$) for the
largest value. There is no simple formula for finding $x_i$ and we have
to rely on a result from statistics:  These locations, the so-called
order statistic medians, for the normal distribution are exactly
related to order statistic medians from a *uniform distribution*, $m_i$, by,

\begin{equation}
x_i = \operatorname{PPF}(m_i) .
\label{eq:orderstat}
\tag{1}
\end{equation}

Thus, we can find $x_i$ by first obtaining the corresponding
$m_i$. Unfortunately, no analytical expression for $m_i$ exists, and
we have to rely on approximate estimates. One such approximation
was suggested by [Filliben](https://doi.org/10.1080/00401706.1975.10489279),

\begin{equation}
m_i = 
\begin{cases}
1 - 0.5^{1/n} & \text{if } i = 1, \\
\frac{i - 0.3175}{n + 0.365} & \text{if } i = 2, 3, \ldots, n-1, \\
0.5^{1/n} & \text{if } i=n,
\end{cases}
\label{eq:uniformorderstat}
\tag{2}
\end{equation}
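
The three cases in Eq. (2) can be written as a short function. This is a minimal sketch; the function name `filliben_medians` is our own choice and not taken from any library.

```python
import numpy as np


def filliben_medians(n):
    """Approximate uniform order statistic medians m_1, ..., m_n from Eq. (2)."""
    i = np.arange(1, n + 1)
    m = (i - 0.3175) / (n + 0.365)  # Middle case, valid for i = 2, ..., n - 1.
    m[-1] = 0.5 ** (1.0 / n)        # Largest point, i = n.
    m[0] = 1.0 - m[-1]              # Smallest point, i = 1.
    return m


m = filliben_medians(5)
print(m)
```

Note that the values are symmetric around $0.5$, that is, $m_i + m_{n+1-i} = 1$, which is a convenient sanity check when implementing the formula.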

Thus, in summary, to create the normal probability plot we do the following:

* (i)  We sort our original data ($y_1$, $y_2$, $\ldots$, $y_n$).

* (ii)  For each sorted data point, we calculate its
uniform order statistic median, $m_i$,
using Eq. \eqref{eq:uniformorderstat}.

* (iii)  For each sorted data point, we calculate its most
probable location, $x_i$, in a normal distribution using
Eq. \eqref{eq:orderstat} and
the $m_i$ value we found in the previous step. 

* (iv)  We plot the sorted data against the most probable locations
found in the previous step. That is, we plot the pairs ($x_i$, $y_i$),
and if the data is from a normal distribution, we expect that these
points fall on a straight line.
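
The four steps above can be sketched as follows, here applied to a synthetic sample rather than the data files. The function name `normal_plot_points` is hypothetical. For comparison, `scipy.stats.probplot` is documented to use the same Filliben estimate, so its output should agree with ours.

```python
import numpy as np
from scipy.stats import norm, probplot


def normal_plot_points(data):
    """Return (x, y): the normal locations x_i and the sorted data y_i."""
    y = np.sort(np.asarray(data, dtype=float))  # Step (i): sort the data.
    n = len(y)
    i = np.arange(1, n + 1)
    m = (i - 0.3175) / (n + 0.365)              # Step (ii): Eq. (2).
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    x = norm.ppf(m)                             # Step (iii): Eq. (1).
    return x, y


# Synthetic sample (assumption: stands in for the measured data).
rng = np.random.default_rng(seed=1)
data = rng.normal(loc=10.0, scale=3.0, size=200)
x, y = normal_plot_points(data)

# Step (iv) is the plot itself, e.g. plt.scatter(x, y) with matplotlib.
# The correlation coefficient summarizes how straight the line is.
r = np.corrcoef(x, y)[0, 1]

# Cross-check against scipy's own probability plot routine.
(osm, osr), _ = probplot(data, dist="norm")
print(r, np.allclose(osm, x), np.allclose(osr, y))
```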



Create the normal probability plots for the four data sets given
in [data1.txt](Data/data1.txt) (located at `Data/data1.txt`),
[data2.txt](Data/data2.txt) (located at `Data/data2.txt`),
[data3.txt](Data/data3.txt) (located at `Data/data3.txt`), and
[data4.txt](Data/data4.txt) (located at `Data/data4.txt`). Which
of these would you say are numbers that could originate from a
normal distribution?

In [None]:
# Your code here

**Your answer to question 6.1(c):** *Double click here*

**(d)**  The method we have described above works for any distribution,
not just the normal distribution. We can create similar plots
for other distributions by changing the $\operatorname{PPF}$ function
in Eq. [(1)](#mjx-eqn-eq:orderstat)
to the corresponding function for the distribution we wish to check for.
Repeat the previous step, but use the Gumbel distribution
(in `scipy` this is available by
`from scipy.stats import gumbel_r`) in place of the
normal distribution. Based on the plots you now create, would you say
that any of the data sets may contain numbers from a Gumbel distribution?
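
To illustrate the idea, the sketch below swaps the $\operatorname{PPF}$ while keeping everything else unchanged. It uses a synthetic Gumbel sample (not the data files), and the helper `plot_points` is a hypothetical name. The correlation coefficient of the plotted pairs is used here as a rough measure of straightness; this is our own choice for the illustration.

```python
import numpy as np
from scipy.stats import gumbel_r, norm


def plot_points(data, ppf):
    """Return probability plot pairs (x, y) for the distribution with the given PPF."""
    y = np.sort(np.asarray(data, dtype=float))
    n = len(y)
    i = np.arange(1, n + 1)
    m = (i - 0.3175) / (n + 0.365)  # Eq. (2), middle case.
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]
    return ppf(m), y                # Eq. (1) with the chosen PPF.


# Synthetic sample drawn from a (right-skewed) Gumbel distribution.
rng = np.random.default_rng(seed=2)
data = rng.gumbel(loc=2.0, scale=1.5, size=2000)

x_norm, y = plot_points(data, norm.ppf)
x_gumbel, _ = plot_points(data, gumbel_r.ppf)

r_normal = np.corrcoef(x_norm, y)[0, 1]
r_gumbel = np.corrcoef(x_gumbel, y)[0, 1]
print(r_normal, r_gumbel)  # The Gumbel plot should be the straighter one.
```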

In [None]:
# Your code here

**Your answer to question 6.1(d):** *Double click here*

**Exercise 6.2**

After running a set of experiments, you determine the effects
given in Table 1 for $4$ factors: A, B, C, and D.
Use a normal probability plot to identify the important effects among
the ones listed in this table. (Note: These numbers were also used in lecture $6$.)

|**Factor** | **Effect** |
|:---------:|:----------:|
|A          |  -8.00     |
|B          |  24.00     |
|C          |  -2.25     |
|D          |  -5.50     |
|AB         |   1.00     |
|AC         |   0.75     |
|AD         |   0.00     |
|BC         |  -1.25     |
|BD         |   4.50     |
|CD         |  -0.25     |
|ABC        |  -0.75     |
|ABD        |   0.50     |
|ACD        |  -0.25     |
|BCD        |  -0.75     |
|ABCD       |  -0.25     |

| |
|---|
|**Table 1:** *Effects determined in a set of experiments.*|

In [None]:
# Your code here

**Your answer to question 6.2:** *Double click here*

**Exercise 6.3**

From a $2^2$ factorial experiment replicated three times, you have obtained
the data given in Table 2. We use here a short-hand notation
for the $4$ possible combinations of the variables: $(1)$, $a$, $b$, and $ab$.
In this notation $(1)$ is the experiment where all factors were at their low levels. For the
other cases, the absence of a letter means that the corresponding factor was at a low level, and
the presence of a letter means that the corresponding factor was at a high level (e.g. "$a$" is the
same as saying that factor A was at the high level and B at the low level). 
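
The notation can be made concrete with a small helper that translates a label into factor levels. This is only an illustration: the function name `factor_levels` is hypothetical, and coding the low level as $-1$ and the high level as $+1$ is a common convention that we assume here.

```python
def factor_levels(label, factors="AB"):
    """Map a label such as "(1)", "a", or "ab" to -1 (low) / +1 (high) levels."""
    # "(1)" means that no factor is at its high level.
    present = "" if label == "(1)" else label.upper()
    return {factor: (1 if factor in present else -1) for factor in factors}


for label in ["(1)", "a", "b", "ab"]:
    print(label, factor_levels(label))
```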


|**Experiment** | **Replicate 1** | **Replicate 2** | **Replicate 3** |
|:---:|:---:|:---:|:---:|
|$(1)$ | $9$  | $10$ | $11$ |
|$a$   | $30$ | $31$ | $29$ |
|$b$   | $19$ | $20$ | $21$ |
|$ab$  | $5$  | $6$  | $4$  |

| |
|---|
|**Table 2:** *Results from a $2^2$ factorial experiment, repeated $3$ times.*|

**(a)**  Calculate the effects (A, B, and AB).

In [None]:
# Your code here

**Your answer to question 6.3(a):** *Double click here*

**(b)**  Use ANOVA to investigate which effects are important in this case.
Use a significance level of $\alpha = 0.01$. For
this significance level, the relevant critical
$f$-value is $f_{\alpha=0.01}(1, 8) = 11.259$ with $1$ and $8$ degrees
of freedom. (Note: The numbers in Table 2
are the same as for the example on
page $96$ in the textbook.)
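
The critical value quoted above can be looked up with `scipy` instead of a table. The sketch below assumes the usual definition: the critical value is the $(1 - \alpha)$ quantile of the $F$ distribution. The $8$ error degrees of freedom correspond to $4$ treatment combinations with $3 - 1$ degrees of freedom each.

```python
from scipy.stats import f

alpha = 0.01
dfn, dfd = 1, 8  # Numerator and denominator degrees of freedom.

# Critical value: the (1 - alpha) quantile of the F distribution.
f_critical = f.ppf(1.0 - alpha, dfn, dfd)

# Sanity check: the probability of exceeding the critical value is alpha.
tail_probability = f.sf(f_critical, dfn, dfd)

print(f_critical, tail_probability)
```

An effect would then be judged important at this significance level if its computed $f$-value exceeds `f_critical`.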

In [None]:
# Your code here

**Your answer to question 6.3(b):** *Double click here*