
# <span style="color:brown">Hypothesis testing on two populations</span>


## <span style=color:brown;>Contents</span>

This Notebook covers the contents from Lesson 3, *"Hypothesis testing on two populations"*, in the Statistics II subject. These contents cover the motivation and description of hypothesis testing procedures when our goal is to compare values of parameters from two populations.

They are structured as follows:

- Hypothesis testing for the difference between two population means
   - Samples composed of matched pairs
   - Independent samples
      - Two normal populations with equal (unknown) variances
      - Two populations with unknown variances and large sample sizes
      - Two Bernoulli populations (comparing proportions)
- Hypothesis testing for the ratio of two population variances
   - Independent samples from normal populations


## <span style="color:brown;">Introduction</span>

---

### <span style=color:brown;>Goals (i)</span>

In Lesson 2 we introduced the concept of hypothesis testing, with the aim of comparing the values of population parameters of interest to some prespecified reference values. In many practical situations we would prefer to compare the values of population parameters to the corresponding ones from other populations, based on information from one sample from each population, instead of comparing them to given reference values.

We will introduce the main considerations we should take into account when extending the hypothesis testing procedure of Lesson 2 to the comparison of two populations. Overall, the procedure will remain unchanged except for some details. These details will be mostly related to  two main aspects we will take into account for the adaptation of the procedure: the relationship between the samples, and the selection of test statistics to conduct the corresponding tests.



### <span style=color:brown;>Goals (ii)</span>

Our learning goals for this lesson are:

- Know how to perform a test of hypothesis for the difference between two population means and for the ratio of two population variances.
   - Use this knowledge to construct confidence intervals for the difference of means/ratio of variances.
- Identify situations where a test based on matched pairs is more suitable than a test based on independent samples, or vice versa.
- Be able to select an appropriate test statistic for tests based on independent samples from two populations.
- Know how to calculate the power of a test for the difference of means/ratio of variances.


## <span style="color:brown;">Hypothesis tests for two populations</span>

---

We wish to consider the case when we have obtained two samples from (supposedly) different populations, and we are interested in obtaining information on the relationship between some of the parameters for these two populations, based on the information in the samples. We will conduct, as in the preceding lesson, comparisons for the values of the means, proportions or variances from both populations. Our basic tool will be the application of the hypothesis testing procedures we introduced in Lesson 2 to formal hypotheses of interest for the comparison of these parameters.

For example, we may compare the means of two populations corresponding to random variables $X$ and $Y$ by conducting a test of the form

$$
\begin{array}{rl}
  H_0 : & \mu_X \geq \mu_Y \\
  H_1 : & \mu_X < \mu_Y
\end{array} \quad \mbox{ or alternatively }\quad 
\begin{array}{rl}
  H_0 : & \mu_X - \mu_Y \geq 0 \\
  H_1 : & \mu_X - \mu_Y < 0
\end{array}
$$

or for a comparison of variances,

$$
\begin{array}{rl}
  H_0 : & \sigma^2_X \geq \sigma^2_Y \\
  H_1 : & \sigma^2_X < \sigma^2_Y
\end{array} \quad \mbox{ or alternatively }\quad 
\begin{array}{rl}
  H_0 : & \displaystyle \frac{\sigma^2_X}{\sigma^2_Y} \geq 1 \\
  H_1 : & \displaystyle \frac{\sigma^2_X}{\sigma^2_Y} < 1
\end{array}
$$

The basic structure of the procedures is very similar to the scheme introduced in Lesson 2:

1. Formulate the hypotheses to test for the parameters of interest. Optionally, specify a significance level $\alpha$.
2. Identify a test statistic with known distribution, which would relate the parameter and the data available for this test given our assumptions on the data.
3. Compute the value of the test statistic for the available samples, and either a critical value or a p-value for the test.
4. Reach a conclusion for the null hypothesis based on the preceding data and the computed values.

The main differences with respect to the procedure introduced in Lesson 2 are:

- The hypotheses of the test now involve a comparison of the values of two parameters based on data from two samples, instead of comparing one parameter with a reference value.
- We will continue to assume that both samples are simple random samples, but we will need to make assumptions on the dependence or independence properties between the two samples.


### <span style="color:brown">Dependence between samples</span>

When considering the two samples from both populations, these samples might be independent of each other or they might present some form of dependence. Which of these two cases holds in a given situation will affect our choice of a statistic for the test. For example, in the case of tests on the mean or the proportion, the statistic will be defined as a difference between the values of estimates from the two samples, scaled by a measure of variability. This joint measure of variability will need to combine the variabilities from both samples, and this combination will depend on the dependence properties of the samples.
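
To illustrate why the dependence matters, consider the special case of two samples of the same size $n$ whose observations come in pairs, and write $\sigma_{XY}$ for the covariance between $X$ and $Y$ (a symbol introduced here only for this illustration). Under these assumptions, the variance of the difference of sample means is

$$
\text{Var}(\bar X - \bar Y) = \frac{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}{n}
$$

When the samples are independent we have $\sigma_{XY} = 0$ and the two variances simply add up, while a positive covariance reduces the variability of the difference.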

Note that while the property of independence between two samples is a property that has a clear definition, any dependence between these samples can take many different forms.

In this lesson we will consider only two cases:

1. Paired samples, having any dependence structure between them, and
2. Independent samples


#### <span style="color:brown">Definition of paired samples</span>

In this case, we are given a sample defined in terms of pairs of values from both populations (random variables), $\{ (x_i,y_i) \}_{i=1}^n$. That is, we have samples from both populations where each value from one sample is associated with exactly one value from the other sample. This association is usually based on each pair of observations sharing a common characteristic: for example, both observations may correspond to the same individual, or to the same date, etc.

Having a paired sample does not need to imply anything about the dependence between the samples for the two variables, and in particular both samples can be independent. But this paired structure can be exploited in some cases to simplify the tests of interest so that the selected test statistic can be applied to any dependence structure present in the data.

The only cases for <span style="color:brown;">paired samples</span> considered in this lesson are tests for <span style="color:brown;">comparisons of means</span> of two populations. In particular, we will not study tests for a comparison of proportions or variances between paired samples.


## <span style="color:brown;">Test statistics and dependence</span>

In this cell we introduce the specific cases we will consider in this lesson, and the test statistics to be used in each of the cases. As in Lesson 2, we will discuss each of the cases, including the assumptions required in each one, but we will provide limited theoretical justification for our results.

### <span style="color:brown">Paired samples</span>

In this case we can reduce the tests of interest on the <span style="color:brown;">means</span> of two populations to tests on one population by introducing a <span style="color:brown;">new random variable</span> defined as <span style="color:brown;">$D = X - Y$.</span>

In this case our first step will be to generate a new sample $\{ d_i \}_{i=1}^n$, where $d_i \equiv x_i - y_i$. Then, we redefine the test in terms of the mean of $D$, $\mu_D$.

#### <span style="color:brown">Test definitions</span>

For a comparison of means, we can write the test as

$$
\begin{array}{rl}
  H_0 : & \mu_X \geq \mu_Y + \delta_0 \\
  H_1 : & \mu_X < \mu_Y + \delta_0
\end{array} \quad \mbox{ equivalent to }\quad 
\begin{array}{rl}
  H_0 : & \mu_D \geq \delta_0 \\
  H_1 : & \mu_D < \delta_0
\end{array}
$$

where $D = X - Y$.

Usually we compare the values of the means directly, that is, we study the case when $\delta_0 = 0$.

#### <span style="color:brown">Test statistics</span>

These tests can be carried out by using the statistics introduced in Lesson 2, as we have reduced our test to one based on a single sample. For the definitions indicated above we will have:

$$
\small
\begin{array}{lllccc}
\text{Parameter} & \text{Estimator} & \text{Assumptions} & \text{Statistic} & \text{Distribution} & \text{Statistic under } H_0 \\
\hline
\text{Mean } \mu_D & \text{Sample mean } \bar D & \text{Large sample size} & \frac{\bar D - \mu_D}{\frac{S_D}{\sqrt{n}}} & N(0,1) & t_0 = \frac{\bar d - \delta_0}{\frac{s_D}{\sqrt{n}}} \\
& & \text{Normal data} & \frac{\bar D - \mu_D}{\frac{S_D}{\sqrt{n}}} & t_{n-1} & t_0 = \frac{\bar d - \delta_0}{\frac{s_D}{\sqrt{n}}}
\end{array}
$$
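
As a minimal sketch of this reduction, the following R code uses simulated data to check that a paired test on $(X,Y)$ coincides with a one-sample t test on the differences. The data, the seed and the value of `delta.0` are illustrative assumptions, not values from the lesson.

```r
# Hypothetical paired data: the same 20 units observed twice (simulated values)
set.seed(1)
n = 20
x = rnorm(n, mean = 300, sd = 40)
y = x - 110 + rnorm(n, mean = 0, sd = 30)   # y is linked to x through the shared unit

delta.0 = 100   # assumed reference value for mu_D under H_0
d = x - y       # reduce the two samples to a single sample of differences

# One-sample statistic and p-value computed by hand, for H_1 : mu_D > delta_0
t.0 = (mean(d) - delta.0)/(sd(d)/sqrt(n))
p.val = pt(t.0, n - 1, lower.tail = FALSE)

# The same test using the built-in paired t test
res = t.test(x, y, paired = TRUE, mu = delta.0, alternative = "greater")

cat(sprintf("Statistic (by hand / t.test) : %10.4f %10.4f\n", t.0, res$statistic))
cat(sprintf("P-value   (by hand / t.test) : %10.4f %10.4f\n", p.val, res$p.value))
```

Both approaches should give the same statistic and p-value, since `t.test` with `paired = TRUE` works with the differences internally.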


##### <span style="color:green;">Questions</span>

<span style="color:green">Answer the following questions:</span>
- <span style="color:green">Consider two paired samples with 20 observations for each variable. Could you apply the CLT in this case?</span>
- <span style="color:green">Assume you are given two paired samples to test the equality of their variances. What would you need to do to apply the preceding procedure of reducing both samples to a single sample in order to test this property?</span>
- <span style="color:green">Would you obtain the same results for a hypothesis test on the difference of the means of two paired samples if you define the test in terms of $D = X - Y$, or in terms of $D' = Y - X$?</span>



#### <span style="color:red">Exercise</span>

*You are conducting a study on the seasonal variations in the sales of shellfish in a city. You have collected sales data from 20 fish markets in the city, corresponding to two days in two different periods of interest: December 20th (Christmastime), and April 17th (Spring); both days are Wednesdays. The following table presents a summary of the shellfish sales income in each one of the days, as well as the value for the difference in sales income between both periods:*

|   | December | April | December$-$April |
| --- | ---: | ---: | ---: |
| *Average sales* | 300 euros | 180 euros | 120 euros |
| *Quasi standard deviation* | 44 euros | 29 euros | 44 euros |

*For a significance level of 5%, conduct a hypothesis test to determine if the average daily sales in December are at least 100 euros greater than the sales in April.*



##### <span style="color:red">Exercise. Solution</span>

We define the variables of interest as $X =$ "shellfish sales on December 20", $Y =$ "shellfish sales on April 17". We have two paired samples of size $n = 20$, corresponding to sales in the same markets on these two different dates.

As we only have information for the 20 fish markets in the city, we cannot assume that we have a large sample; we will need to assume that the two populations follow a normal distribution. We will also assume that the observations corresponding to $X$ and $Y$ for the 20 markets are simple random samples.

1. The null and alternative hypotheses for the test are:

$$
\begin{array}{rcl}
H_0 & : & \mu_X - \mu_Y \leq 100 = \delta_0 \\
H_1 & : & \mu_X - \mu_Y > 100
\end{array}
$$

As we have paired samples, we define a new variable $D = X - Y$, and we redefine our test in terms of this variable

$$
\begin{array}{rcl}
H_0 & : & \mu_D \leq 100 = \delta_0 \\
H_1 & : & \mu_D > 100
\end{array}
$$

2. The statistic for this test will be given by

$$
T = \frac{\bar D - \delta_0}{S_D/\sqrt{n}} \sim t_{n-1}
$$

as we have normal data and we wish to test the value of the population mean for $D$. From the last column of the table in the description of the problem we have that $\bar d = 120$ and $s_D = 44$. The value of this statistic under the null hypothesis will be

$$
   t_0 = \frac{\bar d - \delta_0}{s_D/\sqrt{n}} = \frac{120-100}{44/\sqrt{20}} = 2.0328
$$

3. The rejection region for this test will be given by

$$
\text{RR}_{\alpha} = \{ T > t_{n-1;\alpha} \} = \{ T > t_{19;0.05} \} = \{ T > 1.7291 \}
$$

And the p-value for the test is
   
$$
\text{p-value} = \Pr(T_{19} > t_0 ) = \Pr(T_{19} > 2.0328 ) = 0.02814
$$

4. As the value of the test statistic under the null hypothesis satisfies $t_0 = 2.0328 > 1.7291 = t_{19;0.05}$, this value lies in the rejection region. Equivalently, it holds that $\text{p-value} = 0.02814 < 0.05 = \alpha$. Both criteria lead us to reject $H_0$ for a significance level of $0.05$. That is, we conclude that the average shellfish sales income in December exceeds that in April by more than 100 euros.


```r
# Numerical calculations for the exercise

sig.lvl = 0.05
cat(sprintf("Significance level               : %10.4f\n",sig.lvl))

# Statistic under H_0: (mean of d - delta_0)/(s_D/sqrt(n))
t.0 = (120 - 100)/(44/sqrt(20))
cat(sprintf("Value of the statistic under H_0 : %10.4f\n",t.0))

# Right quantile of the Student t with n - 1 = 19 degrees of freedom
crit.v = qt(sig.lvl,19,lower.tail=FALSE)
cat(sprintf("Critical value for the test      : %10.4f\n",crit.v))

# Right-tail probability for the observed statistic
p.val = pt(t.0,19,lower.tail=FALSE)
cat(sprintf("P-value for the test             : %10.4f\n",p.val))
```


### <span style="color:brown">Independent samples</span>

In the case of independent samples, we have more alternatives to select statistics we could use to conduct the tests. For example, we can conduct tests for a comparison of the proportions or variances from both populations.

In this case both samples may have different sizes. In general, the available samples will be denoted by $\{ x_i \}_{i=1}^{n_x}$ and $\{ y_i \}_{i=1}^{n_y}$.

#### <span style="color:brown">Test definitions</span>

In a similar manner to the preceding lesson, the cases we will consider are a comparison of means,

$$
\begin{array}{rl}
  H_0 : & \mu_X \geq \mu_Y + \delta_0 \\
  H_1 : & \mu_X < \mu_Y + \delta_0
\end{array} \quad \mbox{ equivalent to }\quad 
\begin{array}{rl}
  H_0 : & \mu_X - \mu_Y \geq \delta_0 \\
  H_1 : & \mu_X - \mu_Y < \delta_0
\end{array}
$$

a comparison of proportions,

$$
\begin{array}{rl}
  H_0 : & p_X \geq p_Y + \delta_0 \\
  H_1 : & p_X < p_Y + \delta_0
\end{array} \quad \mbox{ equivalent to }\quad 
\begin{array}{rl}
  H_0 : & p_X - p_Y \geq \delta_0 \\
  H_1 : & p_X - p_Y < \delta_0
\end{array}
$$

or a comparison of variances,

$$
\begin{array}{rl}
  H_0 : & \sigma^2_X \geq \rho_0 \sigma^2_Y \\
  H_1 : & \sigma^2_X < \rho_0 \sigma^2_Y
\end{array} \quad \mbox{ or alternatively }\quad 
\begin{array}{rl}
  H_0 : & \displaystyle \frac{\sigma^2_X}{\sigma^2_Y} \geq \rho_0 \\
  H_1 : & \displaystyle \frac{\sigma^2_X}{\sigma^2_Y} < \rho_0
\end{array}
$$

The most usual case is one in which we conduct a direct comparison of the values of the population means, proportions or variances. In these cases, $\delta_0 = 0$ or $\rho_0 = 1$.

#### <span style="color:brown">Test statistics</span>

These tests can be carried out by adapting the statistics introduced in Lesson 2 by introducing a combination of values from the two samples. For the tests indicated above these statistics would have the form:

$$
\scriptsize
\begin{array}{llccc}
\text{Parameter} & \text{Estimator} & \text{Statistic} & \text{Distribution} & \text{Statistic under } H_0 \\
\hline
\text{Means } \mu_X, \mu_Y & \text{Difference of sample means} & \displaystyle \frac{\bar X - \bar Y - (\mu_X - \mu_Y)}{\sqrt{S_X^2/n_x + S_Y^2/n_y}} & N(0,1) & \displaystyle t_0 = \frac{\bar x - \bar y - \delta_0}{\sqrt{s_x^2/n_x + s_y^2/n_y}} \\
& & \displaystyle \frac{\bar X - \bar Y - (\mu_X - \mu_Y)}{S_R\sqrt{1/n_x + 1/n_y}} & t_{n_x+n_y-2} & \displaystyle t_0 = \frac{\bar x - \bar y - \delta_0}{s_R\sqrt{1/n_x + 1/n_y}} \\
\hline
\text{Proportions } p_X, p_Y & \text{Difference of sample proportions} & \displaystyle \frac{\hat P_X - \hat P_Y - (p_X - p_Y)}{\sqrt{\hat P_0(1 - \hat P_0)(1/n_x + 1/n_y)}} & N(0,1) & \displaystyle t_0 = \frac{\hat p_X - \hat p_Y - \delta_0}{\sqrt{\hat p_0(1 - \hat p_0)(1/n_x + 1/n_y)}} \\
\hline
\text{Variances } \sigma_X^2, \sigma_Y^2 & \text{Ratio of quasivariances} & \displaystyle \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} & F_{n_x-1,n_y-1} & t_0 = \displaystyle \frac{s_X^2}{\rho_0 s_Y^2} \\
\hline
\end{array}
$$

where the cases with normal distributions indicated above correspond to approximations to the true distributions when $n_x$ and $n_y$ are large.

In the preceding formulas we have made use of the following values:

$$
s_R^2 = \frac{(n_x-1)s_X^2 + (n_y - 1)s_Y^2}{n_x + n_y - 2} , \qquad \hat p_0 = \frac{n_x \hat p_X + n_y \hat p_Y}{n_x + n_y}
$$
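
These two pooled values are simple to compute from summary data. The sketch below wraps them in two small R functions; the function names `pooled.var` and `pooled.prop` and the numerical inputs are assumptions made for illustration.

```r
# Pooled quasivariance s_R^2 from the sample sizes and quasi standard deviations
pooled.var = function(n.x, s.x, n.y, s.y) {
  ((n.x - 1)*s.x^2 + (n.y - 1)*s.y^2)/(n.x + n.y - 2)
}

# Pooled proportion p_0 from the sample sizes and sample proportions
pooled.prop = function(n.x, p.x, n.y, p.y) {
  (n.x*p.x + n.y*p.y)/(n.x + n.y)
}

# Illustrative summary values (assumed for this example)
s.R2 = pooled.var(10, 2, 15, 3)
p.0 = pooled.prop(40, 0.30, 60, 0.50)

cat(sprintf("Pooled quasivariance s_R^2 : %10.4f\n", s.R2))
cat(sprintf("Pooled proportion p_0      : %10.4f\n", p.0))
```

Both values are weighted averages, so $s_R^2$ always lies between $s_X^2$ and $s_Y^2$, and $\hat p_0$ between $\hat p_X$ and $\hat p_Y$.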

The assumptions we require for each of the preceding cases are:

| Parameter | Assumptions |
| --- | --- |
| Means | Large sample size for both samples |
|   | Normal data, same population variance |
| Proportions | Large sample size for both samples |
| Variances | Normal data |

When we conduct tests on the mean or the proportion, the distributions for the preceding test statistics are the same ones we used in the case of one population: the normal distribution or the Student t distribution. But in the case of tests for the comparison of variances we need to introduce a new distribution, the Fisher F, denoted as $F_{n_1,n_2}$. It depends on two parameters (two numbers of degrees of freedom), and it is described in some detail in [Appendix 1](#App3_1), at the end of this lesson.
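
As a brief illustration of how to work with this distribution in R, the sketch below computes right quantiles with `qf` and checks the reciprocal property $F_{n_1,n_2;1-\alpha} = 1/F_{n_2,n_1;\alpha}$. The degrees of freedom and the value of `alpha` are assumed values chosen for the example.

```r
# Illustrative degrees of freedom and tail probability (assumed values)
n.1 = 8
n.2 = 12
alpha = 0.05

# Right quantile F_{n1,n2;alpha}: Pr(F > q.right) = alpha
q.right = qf(alpha, n.1, n.2, lower.tail = FALSE)

# Left-tail critical value F_{n1,n2;1-alpha}: Pr(F > q.left) = 1 - alpha
q.left = qf(1 - alpha, n.1, n.2, lower.tail = FALSE)

# Reciprocal property, swapping the degrees of freedom
q.left.recip = 1/qf(alpha, n.2, n.1, lower.tail = FALSE)

cat(sprintf("Right quantile                 : %10.4f\n", q.right))
cat(sprintf("Left quantile (direct / recip) : %10.4f %10.4f\n", q.left, q.left.recip))
```

The reciprocal property is what allows left-tail critical values to be obtained from a table that only lists right-tail quantiles.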


##### <span style="color:green;">Questions</span>

<span style="color:green">Answer the following questions:</span>
- <span style="color:green">If you have two independent samples with sizes 5 and 60, can you apply the CLT approximation in this case? Why?</span>
- <span style="color:green">Can you use the preceding formulas for two paired independent samples?</span>
- <span style="color:green">Why is the case of normal data with different population variances not included in the preceding list of statistics?</span>
- <span style="color:green">For the test on the ratio of variances, how would you write the null hypothesis if you only have a table for the right tail of the distribution?</span>



#### <span style="color:red">Exercise</span>

*An insurance company TV commercial was shown in both a normal (30-second) version and a time-compressed (24-second) version. For a random sample of 60 people who had seen the normal version, 15 could recall the brand 2 days later. For an independent sample of 75 people who had seen the compressed version, 30 could recall the brand 2 days later.*

*At a 5% significance level, test the hypothesis that the population recall proportion for viewers of the normal version, $p_n$, is lower than the population recall proportion for viewers of the compressed version, $p_c$.*



##### <span style="color:red">Exercise. Solution</span>

Let $p_n$ and $p_c$ denote the proportions of the viewers that would be able to recall the brand 2 days after watching the commercial, for its normal and compressed versions respectively.

We have collected two samples that we assume are independent and have different sizes. From the sample data we have been given, we obtain the corresponding sample proportions as: $\hat p_n = 15/60 = 0.25$ and $\hat p_c = 30/75 = 0.4$.

1. The null and alternative hypotheses for this test will be:

$$
\begin{array}{rl}
H_0 : & p_n - p_c \geq 0 = \delta_0 \\
H_1 : & p_n - p_c < 0
\end{array}
$$

We have selected the alternative hypothesis corresponding to the sample data we have been given. This option would correspond to a case when we wish to ensure that we have enough evidence that the recall from the compressed version is better than for the normal version.

2. Our test statistic, and its approximate distribution, are:

$$
\frac{\hat P_n - \hat P_c - (p_n - p_c)}{\sqrt{\hat P_0(1-\hat P_0)}\sqrt{\frac{1}{n_n}+\frac{1}{n_c}}} \sim_{\scriptsize \text{approx.}} N(0,1) 
$$

The value for the statistic corresponding to our data, under the null hypothesis, is
   
$$
\begin{array}{rcl}
\hat p_0 & = & \displaystyle \frac{n_n \hat p_n + n_c \hat p_c}{n_n + n_c} = \frac{15 + 30}{60 + 75} = 0.3333 , \\
t_0 & = & \displaystyle \frac{\hat p_n - \hat p_c}{\sqrt{\hat p_0(1-\hat p_0)}\sqrt{\frac{1}{n_n}+\frac{1}{n_c}}} = \frac{0.25 - 0.4}{\sqrt{0.3333\times 0.6667}\sqrt{\frac{1}{60}+\frac{1}{75}}} = -1.8372
\end{array}
$$

3. The critical region for this test is given by:

$$
\text{RR}_{\alpha} = \{ Z < -z_\alpha \} = \{ Z < -z_{0.05} \} = \{ Z < -1.645 \}
$$

The p-value can be obtained from the value of the test statistic under the null hypothesis, as:

$$
\text{p-value} = \Pr (Z < t_0 ) = \Pr(Z < -1.8372 ) = 0.03309
$$

4. Our conclusion for the test would be that the evidence in our samples leads us to reject $H_0$ for a significance level of 5%, as the value of our test statistic $t_0 = -1.8372 < -1.645$ lies in the rejection region. Equivalently, our p-value $0.03309 < 0.05 = \alpha$ is smaller than the significance level for our test. As a consequence, we conclude (at this significance level) that the compressed commercial has a higher recall proportion than the normal commercial.


```r
## Numerical calculations for the exercise

sig.lvl = 0.05
cat(sprintf("Significance level                : %10.4f\n",sig.lvl))

# Pooled proportion from both samples
p.0 = (15+30)/(60+75)
cat(sprintf("Value of the joint proportion p_0 : %10.4f\n",p.0))

# Statistic under H_0 (delta_0 = 0)
t.0 = (0.25-0.4)/sqrt(p.0*(1-p.0)*(1/60+1/75))
cat(sprintf("Value of the statistic under H_0  : %10.4f\n",t.0))

# Left-tail critical value of the standard normal
crit.v = qnorm(sig.lvl)
cat(sprintf("Critical value for the test       : %10.4f\n",crit.v))

# Left-tail probability for the observed statistic
p.val = pnorm(t.0)
cat(sprintf("P-value for the test              : %10.4f\n",p.val))
```
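
As a possible cross-check of these calculations, the built-in function `prop.test` can be applied to the counts from the exercise. This is a sketch, assuming that the continuity correction is switched off (`correct = FALSE`) so that the result matches the statistic used in this lesson; `prop.test` reports a chi-squared statistic, which in this two-sample case equals $t_0^2$.

```r
# Cross-check of the exercise with prop.test, without continuity correction
res = prop.test(x = c(15, 30), n = c(60, 75), alternative = "less", correct = FALSE)

# Statistic computed by hand, as in the solution of the exercise
p.0 = (15 + 30)/(60 + 75)
t.0 = (15/60 - 30/75)/sqrt(p.0*(1 - p.0)*(1/60 + 1/75))

cat(sprintf("t_0^2 and X-squared from prop.test : %10.4f %10.4f\n", t.0^2, res$statistic))
cat(sprintf("P-values (by hand / prop.test)     : %10.4f %10.4f\n", pnorm(t.0), res$p.value))
```

With the default `correct = TRUE` the reported values would differ slightly, as a continuity correction would be applied to the statistic.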


#### <span style="color:brown">Summary of test conclusions</span>

The summary below presents the conditions under which you should reject the null hypothesis, for a certain significance level $\alpha$ and each of the preceding cases corresponding to independent samples. For the paired samples case, you should refer to the table presented in Lesson 2.

$$
\small
\begin{array}{cccc}
H_1 & \text{Critical value} & \text{Rejection region} & \text{P-value def} \\
\hline
\mu_X - \mu_Y > \delta_0 & z_{\alpha},t_{n_x+n_y-2;\alpha} & \{ t > \text{cv} \} & \Pr(T > t) \\
\mu_X - \mu_Y < \delta_0 & -z_{\alpha},-t_{n_x+n_y-2;\alpha} & \{ t < \text{cv} \} & \Pr(T < t) \\
\mu_X - \mu_Y \not= \delta_0 & z_{\alpha/2},t_{n_x+n_y-2;\alpha/2} & \{ t < -\text{cv} \}\cup \{ t > \text{cv} \} & 2\Pr(T > |t|) \\
\sigma_X^2/\sigma_Y^2 > \rho_0 & F_{n_x-1,n_y-1;\alpha} & \{ t > \mbox{cv} \} & \Pr(T > t) \\
\sigma_X^2/\sigma_Y^2 < \rho_0 & F_{n_x-1,n_y-1;1-\alpha} & \{ t < \mbox{cv} \} & \Pr(T < t) \\
\sigma_X^2/\sigma_Y^2 \not= \rho_0 & F_{n_x-1,n_y-1;\alpha/2},F_{n_x-1,n_y-1;1-\alpha/2} & \{ t < F_{n_x-1,n_y-1;1-\alpha/2} \}\cup \{ t > F_{n_x-1,n_y-1;\alpha/2} \} & 2\min(\Pr(T < t),\Pr(T > t))
\end{array}
$$

As in Lesson 2, we have that:
   - The symbol $t$ represents the value of the test statistic for the available sample, under the null hypothesis.
   - The symbol $\mbox{cv}$ represents the critical value as specified in the corresponding column of the table.
   - When two critical values from different distributions are indicated, you should use the one corresponding to the assumptions reasonable for your test.
   - For the p-value you always reject when $\mbox{p-value} < \alpha$.
   - All quantiles are right-quantiles, for example, $\Pr(Z > z_\alpha ) = \alpha$.
   - For tests on the proportion, just replace $\mu$ with $p$ and use the quantiles from a Normal distribution.
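
The p-value rules for tests on means or proportions can be sketched as a small R function. The function name `p.value.test` and its arguments are hypothetical, introduced only for this illustration. For the two-sided case the sketch uses the symmetry of the normal and Student t distributions and computes $2\Pr(T > |t|)$.

```r
# P-value for a test on means or proportions.
# t.0: observed statistic; alt: "greater", "less" or "two.sided";
# df: degrees of freedom for the Student t case, NULL for the normal case.
p.value.test = function(t.0, alt, df = NULL) {
  # Right-tail probability under the reference distribution
  right.tail = function(t) {
    if (is.null(df)) pnorm(t, lower.tail = FALSE) else pt(t, df, lower.tail = FALSE)
  }
  switch(alt,
         greater   = right.tail(t.0),
         less      = 1 - right.tail(t.0),
         two.sided = 2*right.tail(abs(t.0)))
}

# Illustrative call (assumed values): two-sided test with the normal approximation
alpha = 0.05
p.val = p.value.test(-2.1, "two.sided")
cat(sprintf("P-value : %10.4f, reject H_0 : %s\n", p.val, p.val < alpha))
```

In all three cases the decision rule is the same: reject $H_0$ when the p-value is smaller than $\alpha$.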



#### <span style="color:red">Exercise</span>

*We have collected a simple random sample of 42 observations of the daily electricity prices from the Spanish market, $X$. This sample can be split into two subsamples, one corresponding to 30 working days ($D$) and one to 12 weekend days ($W$). For these two subsamples we have $\bar d = 59.863$, $\bar w = 53.678$, $s_D = 3.107$ and $s_W = 4.442$.*

*Before we proceed with a test for the equality of the means for these populations (daily weekday vs. weekend prices), we wish to test if the variances of the two populations are the same. Conduct a hypothesis test with a significance level of 5%, to determine if it would be reasonable to assume that the variances of the daily market prices on weekdays and weekends are equal.*

*You may assume that these daily prices follow a normal distribution, and that the two subsamples are independent.*



##### <span style="color:red">Exercise. Solution</span>

We will conduct a two-sided hypothesis test on the equality of the variances for the two populations: average daily prices for weekdays $D$ and weekends $W$, under the assumptions of normality and independence between the samples.

1. The null and alternative hypotheses for this test will be:

$$
\begin{array}{rl}
H_0 : & \sigma^2_D = \sigma^2_W \\
H_1 : & \sigma^2_D \not= \sigma^2_W
\end{array}
$$

We have chosen to conduct a two-sided test, although it would also be possible to define our test as a one-sided one, with alternative hypothesis $H_1 : \sigma^2_D < \sigma^2_W$.

2. Our test statistic, and its distribution, are:

$$
T = \frac{S_D^2/\sigma_D^2}{S_W^2/\sigma_W^2} \sim F_{n_D-1,n_W-1} 
$$

The value for the statistic corresponding to our data, under the null hypothesis $\sigma_D^2 = \sigma_W^2$, is
   
$$
t_0 = \frac{s_D^2}{s_W^2} = \frac{3.107^2}{4.442^2} = 0.4892
$$

3. For $\alpha = 0.05$, the critical region for this test is given by:

$$
\begin{array}{rcl}
\text{RR}_{0.05} & = & \{ F_{29,11} < F_{29,11;0.975} \} \cup \{ F_{29,11} > F_{29,11;0.025} \} \\
& = & \{ F_{29,11} < 0.4040 \} \cup \{ F_{29,11} > 3.1253 \}
\end{array}
$$

The p-value can be obtained from the value of the test statistic under the null hypothesis, as:
   
$$
\text{p-value} = 2 \Pr (F_{29,11} < t_0 ) = 2\Pr(F_{29,11} < 0.4892 ) = 0.1214
$$

4. Our conclusion for the test would be that the evidence in our samples is not sufficient to reject $H_0$ for a significance level of 5%, as the value of our test statistic satisfies $0.4040 < t_0 = 0.4892 < 3.1253$ and lies outside of the rejection region. Also, our p-value $0.1214 > 0.05 = \alpha$ is larger than the significance level for our test. As a consequence, we do not have enough evidence to conclude (at this significance level) that the variances of the two populations are different.


```r
## Numerical calculations for the exercise

sig.lvl = 0.05
cat(sprintf("Significance level                : %10.4f\n",sig.lvl))

# Statistic under H_0: ratio of the quasivariances
t.0 = 3.107^2/4.442^2
cat(sprintf("Value of the statistic under H_0  : %10.4f\n",t.0))

# Lower and upper critical values for the two-sided test
crit.lo = qf(sig.lvl/2,29,11)
crit.up = qf(sig.lvl/2,29,11,lower.tail=FALSE)
cat(sprintf("Critical values for the test      : %10.4f %10.4f\n",crit.lo,crit.up))

# Two-sided p-value (the observed statistic lies in the left tail)
p.val = 2*pf(t.0,29,11)
cat(sprintf("P-value for the test              : %10.4f\n",p.val))
```
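
The same quantiles can also be used to build a confidence interval for the ratio $\sigma_D^2/\sigma_W^2$. The sketch below inverts the $F$ statistic of this exercise; the variable names are assumptions made for the illustration, and the interval is valid under the same normality and independence assumptions as the test.

```r
# Summary data from the exercise
n.d = 30; s.d = 3.107
n.w = 12; s.w = 4.442
conf.lvl = 0.95          # confidence level matching alpha = 0.05

ratio = s.d^2/s.w^2      # point estimate of sigma_D^2/sigma_W^2

# Upper and lower quantiles of the F distribution with (n.d - 1, n.w - 1) d.f.
f.up = qf((1 - conf.lvl)/2, n.d - 1, n.w - 1, lower.tail = FALSE)
f.lo = qf((1 - conf.lvl)/2, n.d - 1, n.w - 1)

# Two-sided interval: dividing by the upper quantile gives the lower end
ci = c(ratio/f.up, ratio/f.lo)
cat(sprintf("Confidence interval for the ratio : [%8.4f, %8.4f]\n", ci[1], ci[2]))
```

The two-sided test at level $\alpha$ rejects the equality of variances exactly when the value 1 falls outside this interval.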



#### <span style="color:red">Exercise</span>

*Consider again the preceding exercise, and the two samples we have collected where $n_d = 30$, $n_w = 12$, $\bar d = 59.863$, $\bar w = 53.678$, $s_D = 3.107$ and $s_W = 4.442$.*

*Conduct a hypothesis test with a significance level of 5%, to determine if the mean of the daily electricity prices would be different for weekdays and weekends.*

*You may assume that these daily prices follow a normal distribution, and that the two subsamples are independent.*



##### <span style="color:red">Exercise. Solution</span>

In this case we will choose to conduct a one-sided hypothesis test on the values of the means for the two populations: average daily prices for weekdays $D$ and weekends $W$, under the assumptions of normality and independence between the samples.

1. The null and alternative hypotheses for this test will be:

$$
\begin{array}{rl}
H_0 : & \mu_D \leq \mu_W \\
H_1 : & \mu_D > \mu_W
\end{array}
$$

We have chosen to conduct a one-sided test, as the general evidence for the market indicates that the prices on weekends are consistently lower than those on weekdays, and we wish to test if that evidence is strong enough to support a statistically significant difference between the two prices.

2. As we have not rejected the equality of the variances in the preceding exercise, we will assume that the population variances are equal, and that the data follow normal distributions. Our test statistic, and its distribution, are:

$$
T = \frac{\bar D - \bar W - (\mu_D - \mu_W)}{S_R\displaystyle \sqrt{\frac{1}{n_d} + \frac{1}{n_w}}} \sim t_{n_d + n_w - 2} 
$$

The value for the statistic corresponding to our data, under the null hypothesis $\mu_D = \mu_W$, is
   
$$
\begin{array}{rcl}
s_R^2 & = & \displaystyle \frac{(n_d-1)s_D^2 + (n_w - 1)s_W^2}{n_d + n_w - 2} = \frac{29\times 3.107^2 + 11\times 4.442^2}{30 + 12 - 2} = 12.4249 \\
t_0 & = & \displaystyle \frac{\bar d - \bar w}{s_R\displaystyle \sqrt{\frac{1}{n_d} + \frac{1}{n_w}}} = \frac{59.863 - 53.678}{\displaystyle\sqrt{12.4249\left(\frac{1}{30} + \frac{1}{12}\right)}} = 5.1371
\end{array}
$$

3. For $\alpha = 0.05$, the critical region for this test is given by:

$$
\text{RR}_{0.05} = \{ T_{40} > t_{40;0.05} \} = \{ T_{40} > 1.6839 \}
$$

The p-value can be obtained from the value of the test statistic under the null hypothesis, as:

$$
\text{p-value} = \Pr (T_{n_d + n_w - 2} > t_0 ) = \Pr(T_{40} > 5.1371 ) = 3.8208 \times 10^{-6}
$$

4. Our conclusion for the test would be that the evidence in our samples is clearly sufficient to reject $H_0$ for a significance level of 5%, as the value of our test statistic $t_0 = 5.1371 \gg 1.6839$ lies well inside the rejection region. Also, our p-value $3.8208 \times 10^{-6} \ll 0.05 = \alpha$ is much smaller than the significance level for our test. As a consequence, we do have enough evidence to conclude (at nearly any significance level) that the mean price on weekdays is higher than the mean price on weekends.
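
The values in this solution can be reproduced with a short R cell in the style of the preceding exercises. This is a sketch; the variable names (`n.d`, `d.bar`, `s.R2`, etc.) are introduced here for illustration.

```r
## Numerical calculations for the exercise

sig.lvl = 0.05
n.d = 30; d.bar = 59.863; s.d = 3.107   # weekdays
n.w = 12; w.bar = 53.678; s.w = 4.442   # weekends

# Pooled quasivariance, assuming equal population variances
s.R2 = ((n.d - 1)*s.d^2 + (n.w - 1)*s.w^2)/(n.d + n.w - 2)
cat(sprintf("Pooled quasivariance s_R^2        : %10.4f\n", s.R2))

# Statistic under H_0 (mu_D = mu_W)
t.0 = (d.bar - w.bar)/sqrt(s.R2*(1/n.d + 1/n.w))
cat(sprintf("Value of the statistic under H_0  : %10.4f\n", t.0))

# Right quantile of the Student t with n.d + n.w - 2 degrees of freedom
crit.v = qt(sig.lvl, n.d + n.w - 2, lower.tail = FALSE)
cat(sprintf("Critical value for the test       : %10.4f\n", crit.v))

# Right-tail probability, printed in scientific notation as it is very small
p.val = pt(t.0, n.d + n.w - 2, lower.tail = FALSE)
cat(sprintf("P-value for the test              : %10.4e\n", p.val))
```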



#### <span style="color:red">Exercise</span>

*A midterm exam for a given subject has been completed in different sections of the subject. The exams for each one of the sections were different. Some students felt that their exam was more difficult than that of other sections, and that this situation put them at an unfair disadvantage with respect to other students in the class.*

*The following information about the exam results has been collected both for the section that complained to the coordinators, which we will denote as section $A$, and for another section with similar student backgrounds, section $B$:*

$$
n_A = 41 , \ \bar x_A = 6.46 ,\ s_A = 1.79 , \qquad n_B = 36 ,\ \bar x_B = 6.88 ,\ s_B = 1.66
$$

*How would you proceed to study if there was a significant difference between the results of the evaluations conducted in these sections?*



##### <span style="color:red">Exercise. Solution</span>

We present the main steps to be completed to conduct this study.

1. We need to formulate the null and alternative hypotheses for this problem. These hypotheses will be based on the comparison of the mean grades for the two groups. Some relevant questions on the formulation of these hypotheses would be:

   - Should the null hypothesis be written as $H_0 : \mu_A \leq \mu_B$? 
   
     If we proceed with this null hypothesis, we will only be able to answer questions such as "Is the mean of A smaller than the mean of B?", which has a trivial answer, or "Is the mean of A significantly larger than the mean of B?", which is clearly not true given the data.
     
   - Should we use a two-sided test?
   
     This is a more informative option, but it would have the consequence that, for the values of interest on the left tail of the distribution, the rejection region will be smaller than that for a one-sided test.
     
<span style="color:brown;">Discuss why the values on the left tail are more relevant, why the critical value for the left tail will be smaller for a two-sided test than for a one-sided test, what consequences this could have on your conclusion for the test, and whether rejecting less might be a desirable outcome.</span>

As a consequence of these comments, the null and alternative hypotheses for this test will be defined as:

$$
\begin{array}{rl}
H_0 : & \mu_A \geq \mu_B \\
H_1 : & \mu_A < \mu_B
\end{array}
$$

2. We need to assume that the samples are independent. Also, we must decide if we would consider our samples to be sufficiently large or if the data follow a normal distribution.

We will assume that we have large samples, that is, that the CLT provides a reasonable approximation to the distribution of the test statistic. The alternative choice of assuming normality is in general far more restrictive, and it provides worse approximations unless we are certain that our data present a very good fit to this distribution.
   
Our test statistic will be

$$
T = \frac{\bar x_A - \bar x_B - (\mu_A - \mu_B)}{\displaystyle \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}} \sim_{\scriptsize \text{approx.}} N(0,1)
$$

The value for this statistic corresponding to our data, under the null hypothesis $\mu_A = \mu_B$, is

$$
\begin{array}{rcl}
t_0 & = & \displaystyle \frac{6.46 - 6.88}{\displaystyle \sqrt{\frac{1.79^2}{41} + \frac{1.66^2}{36}}} = -1.0679
\end{array}
$$

3. To conduct the test we could fix a significance level. However, as our goal is to understand whether the evidence suggests a problem with the homogeneity of the exams across sections, it is more informative to compute the p-value for this test.

The p-value is obtained from the value of the test statistic under the null hypothesis, as:

$$
\text{p-value} = \Pr (Z < t_0 ) = \Pr(Z < -1.0679 ) = 0.143
$$

4. Our conclusion for this test would be that the evidence in our samples is not sufficient to reject $H_0$ for any significance level of 14% or less. Based on this information we must decide if this significance level is reasonable for us to conclude that there is a lack of homogeneity in the tests, or if the evidence is not sufficient to reach this conclusion. Note that selecting a significance level also has implications on the probability of a type I error.
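
As a rough cross-check of the choice made in step 2, the following sketch computes the large-sample statistic from the summary values of the exercise and, for comparison, the pooled-variance statistic that would apply if we were willing to assume normal populations with equal variances. The function names <span style="color:blue;font-family:monospace;font-size:90%;">two_sample_z</span> and <span style="color:blue;font-family:monospace;font-size:90%;">two_sample_pooled_t</span> are hypothetical and are introduced only for this illustration.

```r
# Sketch: left-tailed tests of H0: mu_A >= mu_B from summary statistics only.
# The function names are illustrative; they are not part of the lesson's code.

two_sample_z <- function(m1, s1, n1, m2, s2, n2) {
  # Large-sample (CLT) statistic and left-tail p-value
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  t0 <- (m1 - m2) / se
  list(statistic = t0, p.value = pnorm(t0))
}

two_sample_pooled_t <- function(m1, s1, n1, m2, s2, n2) {
  # Pooled-variance statistic (assumes normal data with equal variances)
  df <- n1 + n2 - 2
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df
  t0 <- (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
  list(statistic = t0, df = df, p.value = pt(t0, df))
}

# Summary values taken from the statement of the exercise
res_z <- two_sample_z(6.46, 1.79, 41, 6.88, 1.66, 36)
res_t <- two_sample_pooled_t(6.46, 1.79, 41, 6.88, 1.66, 36)

cat(sprintf("CLT statistic      : %8.4f   p-value: %6.4f\n",
            res_z$statistic, res_z$p.value))
cat(sprintf("Pooled t statistic : %8.4f   p-value: %6.4f   (df = %d)\n",
            res_t$statistic, res_t$p.value, res_t$df))
```

With these sample sizes the two approaches should give very similar p-values, so the conclusion of the exercise does not hinge on which of the two assumptions we adopt.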

<span style="color:green;">Some additional questions for this exercise:</span>

- <span style="color:green;">What would be the p-value of this test if the test were two-sided? How would that value influence your conclusion?</span>

- <span style="color:green;">Obtain the value of the mean grade for the first group that would make the difference between the grades significant for a 1% significance level. Assume that all other values remain the same.</span>


In [None]:
## Numerical calculations for the exercise

cat(sprintf("\n- Testing results:\n"))
t.0 = (6.46 - 6.88)/sqrt(1.79^2/41 + 1.66^2/36)
cat(sprintf("Value of the statistic under H_0  : %10.4f\n",t.0))

p.value = pnorm(t.0)
cat(sprintf("P-value for the test              : %10.4f\n",p.value))

cat(sprintf("\n- Additional questions\n"))
cat(sprintf("P-value for a two-sided test      : %10.4f\n",2*p.value))
sig.lvl = 0.01
mean.a.sig = 6.88 + qnorm(sig.lvl)*sqrt(1.79^2/41 + 1.66^2/36)
cat(sprintf("Mean grade A - significant result : %10.4f\n",mean.a.sig))


## <span style="color:blue;">Examples based on selected data sets</span>

---

To illustrate the concepts we have introduced, and to motivate possible choices of good estimators, we will consider specific examples, mostly based on real data, which we will process using <span style="color:blue;font-family:monospace;font-size:90%;">R</span>.

### <span style="color:blue;">Preparing R and the data</span>

We start by preparing <span style="color:blue;font-family:monospace;font-size:90%;">R</span> to read and manipulate the data mentioned above. In the following <span style="color:blue;font-family:monospace;font-size:90%;">R</span> <span style="color:brown">code cell</span> we:

1. Load the <span style="color:blue;font-family:monospace;font-size:90%;">R</span> libraries we are going to need for our examples.
2. Define a function, <span style="color:blue;font-family:monospace;font-size:90%;">table_prnt</span>, specifying the format for the tables that will present the numerical results in this lesson.
3. Introduce information to work with the available data sets.

The <span style="color:brown;">available data sets</span> and their identifying codes are:

1. Hourly prices for the Iberian electricity market
2. Grades for a Statistics subject in UC3M
3. Share prices for a company (Iberdrola) from the IBEX index
4. Simulated data from a N(80,30) distribution (var 1), an Exp(lambda=1/30) distribution (var 2) and a Binom(20,0.4) distribution (var 3)
5. Data from the Sustainable Development Report 2021, with the scores by country for goals 1 and 2

In order to add another data set to this collection, you should include information for each of the following variables: the <span style="color:blue;font-family:monospace;font-size:90%;">.csv</span> file containing the data and a text with a short description for the data.

It is also important to ensure that the <span style="color:brown;">working directory</span> has been <span style="color:brown;">selected correctly,</span> as the directory that includes all the data sets that could be used in this lesson.

To execute the commands in the cell, select the cell by clicking on it, and then <span style="color:blue;">press the **RUN** button</span> in the menu bar, or press <span style="color:blue;">Shift-Enter.</span>


In [None]:
#options(jupyter.plot_mimetypes = c("text/plain","image/png"))

# Load libraries with R functions

suppressMessages(library(tidyverse))
suppressMessages(library(huxtable))
library(knitr)
suppressMessages(library(kableExtra))
library(IRdisplay)
suppressMessages(library(sjPlot))
suppressMessages(library(gridExtra))
library(grid)
suppressMessages(library(qqplotr))

# Define a function to format and print the results of interest

table_prnt <- function(p.df,p.capt) {
    # A function to control the presentation of tables with numerical results
    p.df %>% kable("html",caption=paste0('<em>',p.capt,'</em>'),align='r') %>%
    kable_styling(full_width = F, position = "left") %>% as.character() %>% display_html()
    }

# Define the dataset of interest

## Datasets that are available for this lesson

v.pref = data.frame(file = c("Dat_PreciosOMIE.csv",     # Name of the .csv data file
                            "Dat_Calificaciones.csv",
                            "Dat_PreciosIBE_MC.csv",
                            "Dat_SimulatedData.csv",
                            "Dat_SDR21.csv"))
v.pref$title = c("Electricity prices",         # Short title for the data
                "Grades",
                "Share returns",
                "Simulated data",
                "SDG 2021 Scores")



#### <span style="color:blue;">Selecting and displaying the data set and the variables of interest</span>

We select one of these data sets and the variable of interest in the following cell.

1. We assign the corresponding number to the variable <span style="color:blue;font-family:monospace;font-size:90%;">sel.data</span>, at the start of the following code cell.
2. We read the file and include the data in a <span style="color:brown;">data frame</span> with the name <span style="color:blue;font-family:monospace;font-size:90%;">Data.fr</span>.
3. We assign the number corresponding to the order of the variable of interest in the data set, to the variable <span style="color:blue;font-family:monospace;font-size:90%;">sel.col</span>.
4. We assign the values of the variable to an <span style="color:blue;font-family:monospace;font-size:90%;">R</span> data frame with the name <span style="color:blue;font-family:monospace;font-size:90%;">data.sel</span>.

If the two samples are composed of paired values, we extract them as two variables from one of the data sets. In the case of independent samples, we will extract the values from one variable in a data set, and we will split these values into two groups. In practice, these values will not be independent but we will assume that their dependence is not too strong.

Finally, we print the names of the selected data set and variables, to check that these values are the correct ones. Then we display a part of the values from the <span style="color:blue;font-family:monospace;font-size:90%;">.csv</span> file, keeping the same structure of the file.
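
The distinction between paired and independent samples determines which statistic we use. As a minimal sketch, based on simulated values rather than on the data sets of this lesson, a paired test reduces to a one-sample test on the differences $D_i = X_i - Y_i$. The manual computation below is cross-checked against <span style="color:blue;font-family:monospace;font-size:90%;">t.test</span> with <span style="color:blue;font-family:monospace;font-size:90%;">paired = TRUE</span>; all the parameter values are assumptions chosen for illustration.

```r
# Sketch: a paired test is a one-sample test on the differences.
# The data are simulated; the means and standard deviations are arbitrary.

set.seed(123)
n <- 50
x <- rnorm(n, mean = 80, sd = 30)
y <- x + rnorm(n, mean = 2, sd = 5)      # y is strongly dependent on x

# Manual paired statistic for H1: mu_D < 0
d <- x - y
t0_paired <- mean(d) / sqrt(var(d) / n)
p_paired <- pt(t0_paired, df = n - 1)

# Cross-check with the built-in test
chk <- t.test(x, y, paired = TRUE, alternative = "less")

# Statistic we would obtain if we (wrongly) treated the samples as independent
t0_indep <- (mean(x) - mean(y)) / sqrt(var(x) / n + var(y) / n)

cat(sprintf("Paired statistic (manual) : %8.4f\n", t0_paired))
cat(sprintf("Paired statistic (t.test) : %8.4f\n", unname(chk$statistic)))
cat(sprintf("Independent-samples value : %8.4f\n", t0_indep))
```

The numerator is the same in both cases, but the denominator changes: ignoring a strong positive dependence between the samples inflates the estimated variance of the difference and makes the test much less sensitive.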


In [None]:
# Define the data set of interest

## Indicate the data set and variable to select
## These values can be modified

sel.data = 1
data.type = 1          # data.type = 1 for paired data, = 2 for independent data
if (data.type == 1) {
    sel.col = c(6,8)
} else {
    sel.col = c(6)
    p.dat.x = 0.4    # Indicate how to partition the data (for the independent case)
}

## Read the data

s.pref = v.pref[sel.data,]
Data.fr = read.csv2(s.pref$file)

if (data.type == 1) {
### Extract variables of interest (paired samples)
    n.dat.x = n.dat.y = nrow(Data.fr)
    data.sel.x = Data.fr[,sel.col[1]]
    data.sel.y = Data.fr[,sel.col[2]]
    c.names = colnames(Data.fr)[c(sel.col[1],sel.col[2])]
} else {
### Extract variables of interest (independent samples)
    n.dat.all = nrow(Data.fr)
    n.dat.x = floor(p.dat.x*n.dat.all)
    n.dat.y = n.dat.all - n.dat.x
    data.sel.x = Data.fr[1:n.dat.x,sel.col[1]]
    data.sel.y = Data.fr[(n.dat.x+1):n.dat.all,sel.col[1]]
    c.names = colnames(Data.fr)[c(sel.col[1],sel.col[1])]
}
vr.1 = c(rep("VarX",n.dat.x),rep("VarY",n.dat.y))
vr.2 = c(data.sel.x,data.sel.y)
val.melt = data.frame(variable = vr.1, value = vr.2)

## Summary of the selected data

descr.df = as.data.frame(c(s.pref$title,c.names))
colnames(descr.df) <- c("Selection")
rownames(descr.df) <- c("Data set","Variable X","Variable Y")

Data.hux.0 <-
  hux(descr.df) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)
table_prnt(Data.hux.0[-1,],"")

# Print a part of the data we have selected

max.row.show = 8       # Max number of individual values to show
max.col.show = 8       # Max number of variables to show

n.row.show = min(nrow(Data.fr),max.row.show)
n.col.show = min(ncol(Data.fr),max.col.show)

Data.hux.1 <-
  hux(Data.fr[1:n.row.show,1:n.col.show]) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)
rownames(Data.hux.1) <- c(0:n.row.show)
table_prnt(Data.hux.1[-1,],s.pref$title)



### <span style="color:blue;">A numerical example for a two-population test</span>

In order to illustrate how a two-population test would be conducted, we apply the general procedure introduced in Lesson 2, including the modifications we have described previously, to the preceding data. In this way, we compare the parameter values corresponding to two subsamples from the values of the first selected variable. A possible justification for this test is that it may allow us to identify a lack of homogeneity in the data, provided that our assumptions about the data hold.

In the preceding cell we have already selected a variable of interest from the data set. The number for this variable is assigned to the R variable <span style="color:blue;font-family:monospace;font-size:90%;">sel.col</span>.

The procedure we will follow will be based on splitting the data into two subsamples of similar size and testing for the equality of the means for these subsamples. The size of the subsamples is defined by the value of the variable <span style="color:blue;font-family:monospace;font-size:90%;">smp.sz.frac</span>.

In the following cell we extract the subsamples we wish to study and we present some summaries for these samples.


In [None]:
# Two-population test

smp.sz.frac = 0.45

## Extract the data by splitting the values in the first variable

smp0.sz = nrow(Data.fr)
smp.sz.x = floor(smp.sz.frac*smp0.sz)
smp.sz.y = smp0.sz - smp.sz.x

v.smp.x = Data.fr[1:smp.sz.x,sel.col[1]]
v.smp.y = Data.fr[(smp.sz.x+1):smp0.sz,sel.col[1]]

## Create data frame to plot the variables and extract some numerical summaries

vr.1 = c(rep("VarX",smp.sz.x),rep("VarY",smp.sz.y))
vr.2 = c(v.smp.x,v.smp.y)
val.melt = data.frame(variable = vr.1, value = vr.2)

mn.smp.x = mean(v.smp.x)
var.smp.x = var(v.smp.x)
mn.smp.y = mean(v.smp.y)
var.smp.y = var(v.smp.y)

## Print the collected information

descr.df = as.data.frame(rbind(sprintf("%6.0f",smp.sz.x),sprintf("%6.0f",smp.sz.y),
                                sprintf("%9.3f",mn.smp.x),sprintf("%9.3f",mn.smp.y),
                                sprintf("%9.3f",mn.smp.x - mn.smp.y),
                                sprintf("%9.3f",sqrt(var.smp.x)),sprintf("%9.3f",sqrt(var.smp.y))))

colnames(descr.df) <- c("Values")
rownames(descr.df) <- c("Sample size X","Sample size Y","Mean X","Mean Y","Mean X - Y","Quasi-sd X","Quasi-sd Y")

Data.hux.0 <-
  hux(descr.df) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)
table_prnt(Data.hux.0[-1,],"")

## Draw boxplots for the two subsamples

box.2 = val.melt %>%
  ggplot(aes(x=variable,y=value)) + geom_boxplot(outlier.colour="black",notch=FALSE) + theme_bw() +
    ggtitle(sprintf("%s subsample boxplots",s.pref$title)) +
    xlab("Variables") + ylab("Values") +
    theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))
plot(box.2)

## Draw normal QQplots for the two subsamples

v.smp.x.df = data.frame(Val = v.smp.x)

qqplot.1 = v.smp.x.df %>% ggplot(aes(sample = Val)) +
  stat_qq_band() + stat_qq_line(color="red") + stat_qq_point() +
  ggtitle("Normal QQ-plot X") +
  xlab("Theoretical quantiles") + ylab("Sample quantiles") +
  theme(plot.title = element_text(color="blue", size=13, face="bold.italic", hjust=0.5))

v.smp.y.df = data.frame(Val = v.smp.y)
qqplot.2 = v.smp.y.df %>% ggplot(aes(sample = Val)) +
  stat_qq_band() + stat_qq_line(color="red") + stat_qq_point() +
  ggtitle("Normal QQ-plot Y") +
  xlab("Theoretical quantiles") + ylab("Sample quantiles") +
  theme(plot.title = element_text(color="blue", size=13, face="bold.italic", hjust=0.5))

suppressWarnings(grid.arrange(qqplot.1,qqplot.2,nrow = 1,
                              top=textGrob("QQplots for both subsamples",gp=gpar(fontsize=15,col="blue"))))


#### <span style="color:blue;">Conducting the test</span>

For the preceding data, we are going to carry out the test defined by

$$
\begin{array}{rl}
  H_0 : & \mu_X - \mu_Y \geq 0 \\
  H_1 : & \mu_X - \mu_Y < 0
\end{array}
$$

where $X$ denotes the observations selected from the first part of the sample, and $Y$ denotes the observations from the second part of the sample. Assuming we have a reasonably large sample, we will apply the procedure based on using the CLT to define our statistic.

In the next cell we will implement the following steps:

1. We define the significance level for the test, $\alpha$, in the variable <span style="color:blue;font-family:monospace;font-size:90%;">sig.lvl</span>.

2. We identify the test statistic $T$ corresponding to the parameters we wish to compare and the properties of the samples we will use. We replace the sample values in the formula for the test statistic $T$ to obtain the value of $t_0$. This test statistic is given by

$$
T = \frac{\bar X - \bar Y - (\mu_X - \mu_Y)}{\sqrt{S_X^2/n_x + S_Y^2/n_y}} \sim_{\scriptsize\text{approx.}} N(0,1)
$$

3. We compute the critical value for this test, $z_{\alpha}$, and/or the p-value of the test, $\Pr(Z < t_0)$.

4. We print all these values, as well as the conclusion for the test obtained from comparing the relevant values; for example, we compare the p-value of the test and the significance level.
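
The steps above can be sketched on simulated data to show that, for a left-tailed test, comparing $t_0$ with the critical value and comparing the p-value with $\alpha$ lead to the same decision. The samples below are generated from exponential distributions with arbitrarily chosen rates; they are an assumption made for the illustration and are unrelated to the data sets of the lesson.

```r
# Sketch: critical-value rule and p-value rule for a left-tailed large-sample test.
# Simulated, non-normal data; sample sizes and rates are arbitrary choices.

set.seed(2024)
x <- rexp(200, rate = 1 / 30)
y <- rexp(250, rate = 1 / 33)

alpha <- 0.05

# Test statistic under H0: mu_X - mu_Y = 0
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t0 <- (mean(x) - mean(y)) / se

# Left-tail critical value and p-value
z_crit <- qnorm(alpha)
p_val <- pnorm(t0)

# Both rules must agree, because pnorm is an increasing function
reject_by_crit <- t0 < z_crit
reject_by_pval <- p_val < alpha

cat(sprintf("t0 = %7.3f, critical value = %7.3f, p-value = %5.3f\n",
            t0, z_crit, p_val))
cat(sprintf("Reject H0 (critical value): %s\n", reject_by_crit))
cat(sprintf("Reject H0 (p-value)       : %s\n", reject_by_pval))
```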


In [None]:
# Conducting a two-population test

## Value of the significance level of the test

sig.lvl = 0.05

## Value of the test statistic

var.smp.all = var.smp.x/smp.sz.x + var.smp.y/smp.sz.y
ht.val.1 = (mn.smp.x - mn.smp.y)/sqrt(var.smp.all)

## Critical value and p-value for the test

ht.val.2 = qnorm(sig.lvl, lower.tail = TRUE)
ht.val.3 = pnorm(ht.val.1, lower.tail = TRUE)

## Conclusion for the test

if (ht.val.3 < sig.lvl) ht.val.4 = "Reject H0" else ht.val.4 = "Fail to reject H0"
    
## Print the results

df.c.1 = c("Value of the test statistic","Critical value","P-value for the test","Test result")
df.c.2 = c(sprintf("%7.3f",ht.val.1),sprintf("%7.3f",ht.val.2), sprintf("%5.3f",ht.val.3),ht.val.4)
mn.val = as.data.frame(cbind(df.c.1,df.c.2))

Data.hux <-
  hux(mn.val) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)

Data.hux = Data.hux[-1,]
rownames(Data.hux) = NULL
colnames(Data.hux) = NULL

Data.hux %>% kable("html",caption=paste0('<em>',"Test results",'</em>'),align='lr') %>%
    kable_styling(full_width = F, position = "left") %>% as.character() %>% display_html()


#### <span style="color:blue;">An extension to the preceding test</span>

In the following cell we are going to extend the preceding test to different partitions of the data. Our goal would be to check if by looking at shorter time periods it is still possible to identify heterogeneity in the data, or equivalently if there are some subperiods in the data where heterogeneity might be more clearly visible.

In the cell we include the instructions to plot both the value of the test statistic and the p-value of the hypothesis test corresponding to these different pairs of subsamples.
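
To get a feeling for what these plots look like when heterogeneity is actually present, the following sketch applies the same moving-window idea to a simulated series whose mean increases at a known position. The series, the window length <span style="color:blue;font-family:monospace;font-size:90%;">w</span> and the size of the shift are assumptions made only for this illustration.

```r
# Sketch: moving-window comparison of means on a simulated series with a known shift.
# The mean changes from 50 to 60 after observation 300 (values chosen arbitrarily).

set.seed(42)
v <- c(rnorm(300, mean = 50, sd = 10), rnorm(300, mean = 60, sd = 10))
w <- 100          # length of each of the two adjacent windows
alpha <- 0.05

# For each starting point, compare the window (s+1):(s+w) with the next w values
starts <- 0:(length(v) - 2 * w)
p_vals <- sapply(starts, function(s) {
  x <- v[(s + 1):(s + w)]
  y <- v[(s + w + 1):(s + 2 * w)]
  t0 <- (mean(x) - mean(y)) / sqrt(var(x) / w + var(y) / w)
  pnorm(t0)       # left-tail p-value, H1: mu_X - mu_Y < 0
})

split_at <- starts + w    # index of the last observation in the first window

cat(sprintf("Windows tested: %d, rejections at alpha = %.2f: %d\n",
            length(p_vals), alpha, sum(p_vals < alpha)))
cat(sprintf("Smallest p-value at dividing observation %d\n",
            split_at[which.min(p_vals)]))
```

Note that many tests are carried out on overlapping windows, so the individual p-values are strongly dependent and should be read as an exploratory tool rather than as separate formal tests.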


In [None]:
# Testing homogeneity for different time periods

## Values of general parameters

smp.sz.z = 168

sig.lvl = 0.05

## Generate all the samples that we wish to study and their p-values

smp0.sz = nrow(Data.fr)
ht.val.2 = qnorm(sig.lvl,lower.tail = TRUE)

ht.val.01 = ht.val.03 = NULL
ix.sta = 0
ix.end = ix.sta + 2*smp.sz.z
while (ix.end < smp0.sz) {
    v.smp.x = Data.fr[(ix.sta+1):(ix.sta+smp.sz.z),sel.col[1]]
    v.smp.y = Data.fr[(ix.sta+smp.sz.z+1):ix.end,sel.col[1]]
    
    smp.sz.x = length(v.smp.x)
    smp.sz.y = length(v.smp.y)
    mn.smp.x = mean(v.smp.x)
    var.smp.x = var(v.smp.x)
    mn.smp.y = mean(v.smp.y)
    var.smp.y = var(v.smp.y)
    
    var.smp.all = var.smp.x/smp.sz.x + var.smp.y/smp.sz.y
    ht.val.1 = (mn.smp.x - mn.smp.y)/sqrt(var.smp.all)
    ht.val.3 = pnorm(ht.val.1, lower.tail = TRUE)
    ht.val.01 = c(ht.val.01,ht.val.1)
    ht.val.03 = c(ht.val.03,ht.val.3)
    
    ix.sta = ix.sta + 1
    ix.end = ix.end + 1
}

ht.val.df = data.frame(x = smp.sz.z + seq(1,length(ht.val.01)),y1 = ht.val.01,y3 = ht.val.03)

## Plot the resulting test statistics and p-values

plt.1 = ht.val.df %>%
   ggplot(aes(x=x,y=y1)) + geom_line() +
   geom_hline(yintercept=ht.val.2,color="brown") +
   ggtitle("Test statistics for mean comp") +
   xlab("Dividing observation") + ylab("Test statistic") +
   theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

plt.3 = ht.val.df %>%
   ggplot(aes(x=x,y=y3)) + geom_line() +
   geom_hline(yintercept=sig.lvl,color="brown") +
   ggtitle("P-values for mean comp") +
   xlab("Dividing observation") + ylab("P-value") +
   theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5))

suppressWarnings(grid.arrange(plt.1,plt.3,nrow = 1,
                              top=textGrob("Plots of hypothesis tests comparing means of subsamples",
                                           gp=gpar(fontsize=15,col="blue"))))


### <span style="color:blue;">Examples: numerical application of hypothesis testing</span>

In the following cells we define a group of R commands that will allow us to conduct different hypothesis tests for predefined data sets or for randomly selected data.

These cells are structured as a sequence of tasks to be conducted on the data, to select it, preprocess it, apply the hypothesis testing procedure and show the results. This sequence corresponds to the general steps we have introduced to conduct a hypothesis testing procedure.

#### <span style="color:blue;">1. Define the data to be used for the example</span>

In the following cell we select both the data set and the variables that we would like to study.

We start by specifying the data set to be used. Then, we indicate if we are going to work with paired samples or with independent samples. We will assign the value 1 to the variable <span style="color:blue;font-family:monospace;font-size:90%;">data.type</span> in the case of paired data.

We will consider two options to work with this data:

1. Conduct a test using all the available data, or
2. Conduct a test on a random subsample obtained from the data.

This second case is introduced to allow for the generation of random examples to be used for hypothesis tests. The two cases are controlled by specifying the value of two variables <span style="color:blue;font-family:monospace;font-size:90%;">smp.sz.frac.1</span> and <span style="color:blue;font-family:monospace;font-size:90%;">smp.sz.frac.2</span>. If both values are set equal to 1, all the selected data are used; otherwise, random subsamples with the specified sizes are used.


In [None]:
# Define the data set of interest

f.pref = 5

## Read the data set

s.pref = v.pref[f.pref,]

Data.fr.1 <- read.csv2(s.pref$file)

# Define the variables of interest

sel.col = c(6,8)

## Indicate if you are going to work with paired data (=1) or independent data (=2)

data.type = 2

## If the samples are randomly selected, specify their sizes (as a proportion of the total size)
## For paired samples, only smp.sz.frac.1 is used

smp.sz.frac.1 = 0.3
smp.sz.frac.2 = 0.4

## Extract values from the data and assign them to variables with the desired structure

smp0.sz = nrow(Data.fr.1)
if (data.type == 1) {
### Extract variables of interest (paired samples)
    if (smp.sz.frac.1 == 1) {
        smp.sz.lbl = "No"
        smp.sz.x = smp.sz.y = smp0.sz
        v.smp.x = Data.fr.1[,sel.col[1]]
        v.smp.y = Data.fr.1[,sel.col[2]]
    } else {
        smp.sz.lbl = "Yes"
        smp.sz.x = smp.sz.y = floor(smp.sz.frac.1*smp0.sz)        
        v.smp.ix = sample(seq(1:smp0.sz), smp.sz.x, replace = FALSE)
        v.smp.x = Data.fr.1[v.smp.ix,sel.col[1]]
        v.smp.y = Data.fr.1[v.smp.ix,sel.col[2]]
    }
    c.names = colnames(Data.fr.1)[c(sel.col[1],sel.col[2])]
    data.type.lbl = "Yes"
} else {
### Extract variables of interest (independent samples)
    smp.sz.frac = smp.sz.frac.1 + smp.sz.frac.2
    smp.sz.x = floor(smp.sz.frac.1*smp0.sz)
    ref.frac = smp.sz.frac.1/smp.sz.frac
    smp.sz.ref.x = floor(ref.frac*smp0.sz)
    if (smp.sz.frac == 1) {
        smp.sz.lbl = "No"
        smp.sz.y = smp0.sz - smp.sz.x
        v.smp.x = Data.fr.1[1:smp.sz.x,sel.col[1]]
        v.smp.y = Data.fr.1[(smp.sz.x+1):smp0.sz,sel.col[1]]
    } else {
        smp.sz.lbl = "Yes"
        smp.sz.y = floor(smp.sz.frac.2*smp0.sz)
        v.smp.ix = sample(seq(1,smp.sz.ref.x), smp.sz.x, replace = FALSE)
        v.smp.iy = sample(seq(smp.sz.ref.x+1,smp0.sz), smp.sz.y, replace = FALSE)
        v.smp.x = Data.fr.1[v.smp.ix,sel.col[1]]
        v.smp.y = Data.fr.1[v.smp.iy,sel.col[1]]
    }
    c.names = colnames(Data.fr.1)[c(sel.col[1],sel.col[1])]
    data.type.lbl = "No"
}
vr.1 = c(rep("VarX",smp.sz.x),rep("VarY",smp.sz.y))
vr.2 = c(v.smp.x,v.smp.y)
val.melt = data.frame(variable = vr.1, value = vr.2)

## Summary of values from the full sample and the subsample

mn.smp.x = mean(v.smp.x)
var.smp.x = var(v.smp.x)
mn.smp.y = mean(v.smp.y)
var.smp.y = var(v.smp.y)

# Print information on the selected data

descr.df = as.data.frame(c(s.pref$title,c.names,data.type.lbl,smp.sz.lbl))
colnames(descr.df) <- c("Selection")
rownames(descr.df) <- c("Data set","Variable X","Variable Y","Paired samples?","Random subsample?")

Data.hux.0 <-
  hux(descr.df) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)
table_prnt(Data.hux.0[-1,],"")


#### <span style="color:blue;">2. Define the type of test that you would like to conduct</span>

We select the significance level of the test as the value of the variable <span style="color:blue;font-family:monospace;font-size:90%;">sig.lvl</span>, at the beginning of the following cell.

We then specify the test we would like to carry out, by indicating:

- The parameter on which we wish to conduct the test and the assumptions we are willing to make about the test, in variable <span style="color:blue;font-family:monospace;font-size:90%;">ht.par</span>.
- The form of the null hypothesis of the test, in variable <span style="color:blue;font-family:monospace;font-size:90%;">ht.type</span>.
- The reference value we will use for the test ($\delta_0$, $\rho_0$).

The reference value is selected as 0 for tests on the differences of means and proportions, or 1 for a test on the equality of variances. It can be introduced directly by giving a value to the variable <span style="color:blue;font-family:monospace;font-size:90%;">ref.val</span> in the following cell.

At the end of the cell we print a summary with all the preceding values plus some basic sample statistics.
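
As an illustration of how the reference value enters a test on the ratio of variances, the following sketch computes the statistic $S_X^2/(\rho_0 S_Y^2)$ and its two-sided p-value by hand, and compares them with the output of <span style="color:blue;font-family:monospace;font-size:90%;">var.test</span>. The data are simulated from normal distributions with arbitrarily chosen parameters, and the variable names are hypothetical.

```r
# Sketch: two-sided F test for the ratio of two variances, with reference value rho0.
# Simulated normal data; the sample sizes and parameters are arbitrary choices.

set.seed(7)
x <- rnorm(40, mean = 80, sd = 30)
y <- rnorm(55, mean = 80, sd = 25)

rho0 <- 1         # reference value for sigma_X^2 / sigma_Y^2
alpha <- 0.05

# Test statistic and degrees of freedom
f0 <- var(x) / (rho0 * var(y))
df1 <- length(x) - 1
df2 <- length(y) - 1

# Two-sided critical values and p-value
crit <- qf(c(alpha / 2, 1 - alpha / 2), df1, df2)
p_two <- min(1, 2 * min(pf(f0, df1, df2),
                        pf(f0, df1, df2, lower.tail = FALSE)))

# Cross-check with the built-in test
chk <- var.test(x, y, ratio = rho0, alternative = "two.sided")

cat(sprintf("F statistic     : %7.3f (var.test: %7.3f)\n", f0, unname(chk$statistic)))
cat(sprintf("Critical values : %7.3f, %7.3f\n", crit[1], crit[2]))
cat(sprintf("P-value         : %7.3f (var.test: %7.3f)\n", p_two, chk$p.value))
```

Because the $F$ distribution is not symmetric, the two critical values are not opposites of each other, and the two-sided p-value is obtained by doubling the smaller of the two tail probabilities.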


In [None]:
# Significance level for the test

sig.lvl = 0.05

# Choice of type of hypothesis test to conduct
## If data.type == 1, the only acceptable values are ht.par = 1, 2

ht.par = 4         # = 1, test for the difference of means (CLT)
                    # = 2, test for the difference of means (normal data)
                    # = 3, test for the difference of proportions
                    # = 4, test for the ratio of variances
ht.type = 3        # = 1, one-sided, right-hand test, H_1 : mu_X - mu_Y > mu_0 (for example)
                   # = 2, one-sided, left-hand test, H_1 : mu_X - mu_Y < mu_0 (for example)
                   # = 3, two-sided, H_1 : mu_X - mu_Y <> mu_0 (for example)

# Reference value for the test
## You may specify the reference value directly by giving a value to ref.val

if (ht.par <= 2) ref.val = 0 else
    if (ht.par == 3) {
        smp0.mn = mean(Data.fr.1[,sel.col[1]])
        smp0.var = var(Data.fr.1[,sel.col[1]])
        frac.shft = 1 + runif(1)
        ref.cutoff = smp0.mn - frac.shft*sqrt(smp0.var/smp0.sz)
        ref.val = 0
    } else ref.val = 1

# Output of sample data summaries and reference values

df.c.1 = c("Test"," "," "," ","Sample")
df.c.2 = c("Population Parameter","Alternative hypothesis","Reference value","Significance level",
            "Sample size")

if (ht.par <= 2) {
    df.c.3.1 = "Mean"
    add.vals = as.data.frame(rbind(c(" ","Mean X",sprintf("%9.3f",mn.smp.x)),
                                   c(" ","Mean Y",sprintf("%9.3f",mn.smp.y)),
                                   c(" ","Mean X - Y",sprintf("%9.3f",mn.smp.x - mn.smp.y)),
                                   c(" ","Quasivariance X",sprintf("%9.3f",var.smp.x)),
                                   c(" ","Quasivariance Y",sprintf("%9.3f",var.smp.y))))
    if (data.type == 1) {
        if (ht.type == 1) df.c.3.2 = "$$H_1 : \\mu_D > \\mu_0$$" else
            if (ht.type == 2) df.c.3.2 = "$$H_1 : \\mu_D < \\mu_0$$" else
                df.c.3.2 = "$$H_1 : \\mu_D \\not= \\mu_0$$"
    } else {
        if (ht.type == 1) df.c.3.2 = "$$H_1 : \\mu_X - \\mu_Y > \\mu_0$$" else
            if (ht.type == 2) df.c.3.2 = "$$H_1 : \\mu_X - \\mu_Y < \\mu_0$$" else
                df.c.3.2 = "$$H_1 : \\mu_X - \\mu_Y \\not= \\mu_0$$"
    }
} else if (ht.par == 3) {
    df.c.3.1 = "Proportion"
    pr.smp.x = mean(v.smp.x > ref.cutoff)
    pr.smp.y = mean(v.smp.y > ref.cutoff)
    add.vals = as.data.frame(rbind(c(" ","Cutoff value",sprintf("%5.3f",ref.cutoff)),
                                   c(" ","Proportion X",sprintf("%5.3f",pr.smp.x)),
                                   c(" ","Proportion Y",sprintf("%5.3f",pr.smp.y)),
                                   c(" ","Proportion X - Y",sprintf("%9.3f",pr.smp.x - pr.smp.y))))
    if (ht.type == 1) df.c.3.2 = "$$H_1 : p_X - p_Y > p_0$$" else
        if (ht.type == 2) df.c.3.2 = "$$H_1 : p_X - p_Y < p_0$$" else
            df.c.3.2 = "$$H_1 : p_X - p_Y \\not= p_0$$"
} else {
    df.c.3.1 = "Variance"
    add.vals = as.data.frame(rbind(c(" ","Quasivariance X",sprintf("%9.3f",var.smp.x)),
                                   c(" ","Quasivariance Y",sprintf("%9.3f",var.smp.y))))
    if (ht.type == 1) df.c.3.2 = "$$H_1 : \\sigma_X^2 > \\rho^2_0 \\sigma_Y^2$$" else
        if (ht.type == 2) df.c.3.2 = "$$H_1 : \\sigma_X^2 < \\rho^2_0 \\sigma_Y^2$$" else
            df.c.3.2 = "$$H_1 : \\sigma_X^2 \\not= \\rho^2_0 \\sigma_Y^2$$"
}

df.c.3 = c(df.c.3.1,df.c.3.2,sprintf("%9.3f",ref.val),sprintf("%5.2f",sig.lvl),sprintf("%5.0f",smp.sz.x))

mn.val = as.data.frame(cbind(df.c.1,df.c.2,df.c.3))
mn.val = data.frame(Map(c,mn.val,add.vals))

Data.hux <-
  hux(mn.val) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)

Data.hux = Data.hux[-1,]
rownames(Data.hux) = NULL
colnames(Data.hux) = NULL

Data.hux %>% kable("html",caption=paste0('<em>',"Sample and test values",'</em>'),align='llr') %>%
    kable_styling(full_width = F, position = "left") %>% as.character() %>% display_html()


#### <span style="color:blue;">3. Perform the required calculations for the test</span>

The following cell conducts the calculations required for the test, that is, it computes the value of the statistic, the critical value, and the p-value of the test. The conclusion for this test is also printed at the end of the cell.
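
As one concrete instance of these calculations, the sketch below carries out the two-sided test for the difference of two proportions with the pooled estimate of the common proportion, and compares the result with <span style="color:blue;font-family:monospace;font-size:90%;">prop.test</span> (without continuity correction), whose chi-squared statistic equals the square of our $z$ statistic. The Bernoulli samples are simulated with arbitrarily chosen sizes and probabilities.

```r
# Sketch: large-sample test for the difference of two proportions (pooled estimate).
# Simulated Bernoulli data; sample sizes and probabilities are arbitrary choices.

set.seed(11)
x <- rbinom(120, size = 1, prob = 0.40)
y <- rbinom(150, size = 1, prob = 0.50)

n_x <- length(x)
n_y <- length(y)
p_x <- mean(x)
p_y <- mean(y)

# Pooled estimate of the common proportion under H0: p_X = p_Y
p_pool <- (sum(x) + sum(y)) / (n_x + n_y)

# Test statistic and two-sided p-value
z0 <- (p_x - p_y) / sqrt(p_pool * (1 - p_pool) * (1 / n_x + 1 / n_y))
p_two <- min(1, 2 * pnorm(abs(z0), lower.tail = FALSE))

# Cross-check: prop.test reports X-squared = z0^2 when correct = FALSE
chk <- prop.test(x = c(sum(x), sum(y)), n = c(n_x, n_y), correct = FALSE)

cat(sprintf("z statistic : %7.3f   (z^2 = %7.3f, prop.test X-squared = %7.3f)\n",
            z0, z0^2, unname(chk$statistic)))
cat(sprintf("P-value     : %7.3f   (prop.test: %7.3f)\n", p_two, chk$p.value))
```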


In [None]:
# Hypothesis testing: Conducting the test and printing the results

## Computing the values required for the testing procedure

if (data.type == 1) {
## Paired samples, only for ht.par = 1,2 
    var.smp.d = var(v.smp.x - v.smp.y)
    ht.val.1 = (mn.smp.x - mn.smp.y - ref.val)/sqrt(var.smp.d/smp.sz.x)
    if (ht.par == 1) {
        if (ht.type == 1) {
            ht.val.2 = qnorm(sig.lvl,lower.tail = FALSE)
            ht.val.3 = pnorm(ht.val.1, lower.tail = FALSE)
        } else if (ht.type == 2) {
            ht.val.2 = qnorm(sig.lvl,lower.tail = TRUE)
            ht.val.3 = pnorm(ht.val.1, lower.tail = TRUE)
        } else {
            ht.val.2a = qnorm(sig.lvl/2,lower.tail = FALSE)
            ht.val.2b = -ht.val.2a
            ht.val.3 = min(1,2*pnorm(abs(ht.val.1), lower.tail = FALSE))
        }
    } else {
        if (ht.type == 1) {
            ht.val.2 = qt(sig.lvl, smp.sz.x-1, lower.tail = FALSE)
            ht.val.3 = pt(ht.val.1, smp.sz.x-1, lower.tail = FALSE)
        } else if (ht.type == 2) {
            ht.val.2 = qt(sig.lvl, smp.sz.x-1, lower.tail = TRUE)
            ht.val.3 = pt(ht.val.1, smp.sz.x-1, lower.tail = TRUE)
        } else {
            ht.val.2a = qt(sig.lvl/2, smp.sz.x-1, lower.tail = FALSE)
            ht.val.2b = -ht.val.2a
            ht.val.3 = min(1,2*pt(abs(ht.val.1), smp.sz.x-1, lower.tail = FALSE))
        }
    }
} else {
## Independent samples
    if (ht.par == 1) {
        var.smp.all = var.smp.x/smp.sz.x + var.smp.y/smp.sz.y
        ht.val.1 = (mn.smp.x - mn.smp.y - ref.val)/sqrt(var.smp.all)
        if (ht.type == 1) {
            ht.val.2 = qnorm(sig.lvl,lower.tail = FALSE)
            ht.val.3 = pnorm(ht.val.1, lower.tail = FALSE)
        } else if (ht.type == 2) {
            ht.val.2 = qnorm(sig.lvl,lower.tail = TRUE)
            ht.val.3 = pnorm(ht.val.1, lower.tail = TRUE)
        } else {
            ht.val.2a = qnorm(sig.lvl/2,lower.tail = FALSE)
            ht.val.2b = -ht.val.2a
            ht.val.3 = min(1,2*pnorm(abs(ht.val.1), lower.tail = FALSE))
        }
    } else if (ht.par == 2) {
        var.smp.r = ((smp.sz.x - 1)*var.smp.x + (smp.sz.y - 1)*var.smp.y)/(smp.sz.x + smp.sz.y - 2) 
        var.smp.all = var.smp.r*(1/smp.sz.x + 1/smp.sz.y)
        ht.val.1 = (mn.smp.x - mn.smp.y - ref.val)/sqrt(var.smp.all)
        if (ht.type == 1) {
            ht.val.2 = qt(sig.lvl, smp.sz.x + smp.sz.y - 2, lower.tail = FALSE)
            ht.val.3 = pt(ht.val.1, smp.sz.x + smp.sz.y - 2, lower.tail = FALSE)
        } else if (ht.type == 2) {
            ht.val.2 = qt(sig.lvl, smp.sz.x + smp.sz.y - 2, lower.tail = TRUE)
            ht.val.3 = pt(ht.val.1, smp.sz.x + smp.sz.y - 2, lower.tail = TRUE)
        } else {
            ht.val.2a = qt(sig.lvl/2, smp.sz.x + smp.sz.y - 2, lower.tail = FALSE)
            ht.val.2b = -ht.val.2a
            ht.val.3 = min(1,2*pt(abs(ht.val.1), smp.sz.x + smp.sz.y - 2, lower.tail = FALSE))
        }
    } else if (ht.par == 3) {
        pr.smp.0 = (smp.sz.x*pr.smp.x + smp.sz.y*pr.smp.y)/(smp.sz.x + smp.sz.y)
        ht.val.1 = (pr.smp.x - pr.smp.y - ref.val)/sqrt(pr.smp.0*(1-pr.smp.0)*(1/smp.sz.x + 1/smp.sz.y))
        if (ht.type == 1) {
            ht.val.2 = qnorm(sig.lvl,lower.tail = FALSE)
            ht.val.3 = pnorm(ht.val.1, lower.tail = FALSE)
        } else if (ht.type == 2) {
            ht.val.2 = qnorm(sig.lvl,lower.tail = TRUE)
            ht.val.3 = pnorm(ht.val.1, lower.tail = TRUE)
        } else {
            ht.val.2a = qnorm(sig.lvl/2,lower.tail = FALSE)
            ht.val.2b = -ht.val.2a
            ht.val.3 = min(1,2*pnorm(abs(ht.val.1), lower.tail = FALSE))
        }
    } else {
        ht.val.1 = var.smp.x/(ref.val*var.smp.y)
        if (ht.type == 1) {
            ht.val.2 = qf(sig.lvl, smp.sz.x-1, smp.sz.y-1, lower.tail = FALSE)
            ht.val.3 = pf(ht.val.1, smp.sz.x-1, smp.sz.y-1, lower.tail = FALSE)
        } else if (ht.type == 2) {
            ht.val.2 = qf(sig.lvl, smp.sz.x-1, smp.sz.y-1, lower.tail = TRUE)
            ht.val.3 = pf(ht.val.1, smp.sz.x-1, smp.sz.y-1, lower.tail = TRUE)
        } else {
            ht.val.2a = qf(sig.lvl/2, smp.sz.x-1, smp.sz.y-1, lower.tail = FALSE)
            ht.val.2b = qf(sig.lvl/2, smp.sz.x-1, smp.sz.y-1, lower.tail = TRUE)
            ht.val.3a = pf(ht.val.1, smp.sz.x-1, smp.sz.y-1, lower.tail = FALSE)
            ht.val.3b = pf(ht.val.1, smp.sz.x-1, smp.sz.y-1, lower.tail = TRUE)
            ht.val.3 = min(1,2*min(ht.val.3a,ht.val.3b))
        }
    }
}

## Printing a summary of the results

if (ht.val.3 < sig.lvl) ht.val.4 = "Reject H0" else ht.val.4 = "Fail to reject H0"

if (ht.type <= 2) {
    df.c.1 = c("Value of the test statistic","Critical value","P-value for the test","Test result")
    df.c.2 = c(sprintf("%7.3f",ht.val.1),sprintf("%7.3f",ht.val.2), sprintf("%5.3f",ht.val.3),ht.val.4)
} else {
    df.c.1 = c("Value of the test statistic","Critical value 1","Critical value 2","P-value for the test","Test result")
    df.c.2 = c(sprintf("%7.3f",ht.val.1),sprintf("%7.3f",ht.val.2b),sprintf("%7.3f",ht.val.2a),sprintf("%5.3f",ht.val.3),ht.val.4)
}
mn.val = as.data.frame(cbind(df.c.1,df.c.2))

Data.hux <-
  hux(mn.val) %>%
  set_bold(row = 1, value = T) %>%
  set_all_borders(TRUE)

Data.hux = Data.hux[-1,]
rownames(Data.hux) = NULL
colnames(Data.hux) = NULL

Data.hux %>% kable("html",caption=paste0('<em>',"Test results",'</em>'),align='lr') %>%
    kable_styling(full_width = F, position = "left") %>% as.character() %>% display_html()
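
As a cross-check of the pooled-variance branch above (`ht.par == 2`), the following sketch recomputes the statistic by hand on simulated data and compares it with R's built-in `t.test` using `var.equal = TRUE`. The data, the seed and the names `x.chk`, `y.chk`, `var.pool`, `t.manual`, `p.manual` and `t.builtin` are hypothetical and only serve the illustration.

```r
# Hypothetical data: two independent normal samples with a common variance
set.seed(123)
x.chk = rnorm(12, mean = 5.5, sd = 1)
y.chk = rnorm(15, mean = 5.0, sd = 1)
n.x = length(x.chk)
n.y = length(y.chk)

# Pooled quasivariance and test statistic for H0: mu_X - mu_Y <= 0
var.pool = ((n.x - 1)*var(x.chk) + (n.y - 1)*var(y.chk))/(n.x + n.y - 2)
t.manual = (mean(x.chk) - mean(y.chk) - 0)/sqrt(var.pool*(1/n.x + 1/n.y))
p.manual = pt(t.manual, n.x + n.y - 2, lower.tail = FALSE)

# Same right-tailed test with the built-in function
t.builtin = t.test(x.chk, y.chk, alternative = "greater", mu = 0, var.equal = TRUE)

print(c(manual = t.manual, builtin = unname(t.builtin$statistic)))
print(c(manual = p.manual, builtin = t.builtin$p.value))
```

Both rows should agree up to rounding, since `t.test` with `var.equal = TRUE` applies the same pooled-variance statistic and the same $t_{n_x+n_y-2}$ reference distribution.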



## <span style="color:brown;">The power function for a two-population test</span>

---

We introduced the power function in Lesson 2 as a way to evaluate the behavior of the test with regard to Type II errors. The power function is equally useful in the case of tests on two populations, and it is defined in a very similar manner.

In this case, the power of a test is a function of the true values of the parameters, defined as

$$
\mbox{power} (\theta_1,\theta_2) = \Pr( \mbox{reject } H_0 | \theta = (\theta_1,\theta_2) ) .
$$

For example, assume that the null hypothesis of the test is given by $H_0 : \mu_X - \mu_Y \leq 0$ and we have normal data, independent samples and the population variances are assumed to be the same. The definition of the power as a function of the true difference of population means $\delta \equiv \mu_X - \mu_Y$ would be:

$$
\begin{array}{rcl}
\mbox{power} (\delta) & = & \displaystyle \Pr\left( \frac{\bar X - \bar Y}{S_R\sqrt{1/n_x + 1/n_y}} > t_{n_x+n_y-2;\alpha} \; \Bigl| \; \delta = \mu_X - \mu_Y \right) , \\
& & \displaystyle \text{where } \quad \frac{\bar X - \bar Y - \delta}{S_R\sqrt{1/n_x + 1/n_y}} \sim t_{n_x+n_y - 2}
\end{array}
$$
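
The definition above can be approximated by simulation: generate many pairs of samples with a fixed true difference, apply the test, and record how often $H_0$ is rejected. The sketch below does this under assumed values ($n_x = n_y = 10$, $\sigma^2 = 2$, $\delta = 1$, $\alpha = 0.05$, all chosen only for illustration) and compares the result with `power.t.test` from the `stats` package, which evaluates the same probability through the noncentral t distribution.

```r
set.seed(2024)
n.x = 10; n.y = 10
sig2 = 2; delta = 1; alpha = 0.05   # assumed values
n.rep = 5000

# Critical value of the right-tailed pooled-variance test
crit.val = qt(alpha, n.x + n.y - 2, lower.tail = FALSE)

reject = logical(n.rep)
for (i in 1:n.rep) {
    x = rnorm(n.x, mean = delta, sd = sqrt(sig2))
    y = rnorm(n.y, mean = 0, sd = sqrt(sig2))
    s2.r = ((n.x - 1)*var(x) + (n.y - 1)*var(y))/(n.x + n.y - 2)
    t.stat = (mean(x) - mean(y))/sqrt(s2.r*(1/n.x + 1/n.y))
    reject[i] = t.stat > crit.val
}

# Rejection frequency versus the reference value
pwr.sim = mean(reject)
pwr.ref = power.t.test(n = n.x, delta = delta, sd = sqrt(sig2), sig.level = alpha,
                       type = "two.sample", alternative = "one.sided")$power
print(c(simulated = pwr.sim, reference = pwr.ref))
```

The simulated frequency should differ from the reference value only by Monte Carlo error, which shrinks as `n.rep` grows.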



### <span style="color:brown;">Computing the power of a test for two populations</span>

The procedure to compute the power of a test for two populations is identical to the one presented in Lesson 2 for the case when we have one population. The only modification that needs to be introduced corresponds to the statistics to use and their distributions.

To illustrate the procedure, consider an example in which we wish to compare the power of two tests corresponding to a comparison of means for two paired samples with independent observations; we will additionally assume that the population variances are the same. We will study the test with null hypothesis $H_0 : \mu_X - \mu_Y \leq 0$, and we assume that the true value of the difference of population means is $\delta = \mu_X - \mu_Y > 0$. Note that for these observations it holds that $n_x = n_y = n$.

Based on the information we have introduced above, we can conduct two different tests: i) a test for paired samples of normal observations for the difference of the population means, and ii) a test for independent samples with normal data and the same population variances.

The powers of these two tests can be computed as follows:

1. Paired samples test.

The definition of the power for this case where $D = X - Y$ is given by
$$
\mbox{power}_p (\delta) = \Pr\left( \frac{\bar X - \bar Y}{S_D\sqrt{1/n}} > t_{n-1;\alpha} \; \Bigl| \; \delta \right) , \qquad \frac{\bar D - \delta}{S_D\sqrt{1/n}} \sim t_{n - 1}
$$

where $S_D^2 = \text{Var} (X - Y) = 2 S_X^2$, as we are assuming independence and equal population variances; also $(n - 1)S_X^2/\sigma^2 \sim \chi^2_{n-1}$. We have that

$$
\mbox{power}_p (\delta;\sigma^2) = \Pr\left( \frac{\bar D - \delta}{S_D\sqrt{1/n}} + \frac{\delta}{S_X\sqrt{2/n}} > t_{n-1;\alpha} \; \Bigl| \; \delta \right) = \Pr\left( T_{n-1} + \frac{\delta}{\displaystyle \sqrt{\frac{2\sigma^2 \chi^2_{n-1}}{n(n-1)}}} > t_{n-1;\alpha} \right)
$$

2. Independent samples test.

In this case we have

$$
\mbox{power}_i (\delta) = \Pr\left( \frac{\bar X - \bar Y}{S_R\sqrt{1/n_x + 1/n_y}} > t_{n_x+n_y-2;\alpha} \; \Bigl| \; \delta \right) , \qquad \frac{\bar X - \bar Y - \delta}{S_R\sqrt{1/n_x + 1/n_y}} \sim t_{n_x + n_y - 2}
$$

where $S_R^2 = ((n_x-1)S_X^2 + (n_y-1)S_Y^2)/(n_x+n_y-2) = S_X^2$, as we are assuming equal population variances; and as before $(n - 1)S_X^2/\sigma^2 \sim \chi^2_{n-1}$. It holds that

$$
\mbox{power}_i (\delta;\sigma^2) = \Pr\left( \frac{\bar X - \bar Y - \delta}{S_R\sqrt{1/n_x + 1/n_y}} + \frac{\delta}{S_X\sqrt{2/n}} > t_{2n-2;\alpha} \; \Bigl| \; \delta \right) = \Pr\left( T_{2n-2} + \frac{\delta}{\displaystyle \sqrt{\frac{2\sigma^2 \chi^2_{n-1}}{n(n-1)}}} > t_{2n-2;\alpha} \right)
$$

These two power functions are very similar for $n$ large.
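
To see how the two powers approach each other, one option is to evaluate them exactly. When the observations are independent normals with common variance $\sigma^2$, both statistics follow noncentral t distributions with the same noncentrality parameter $\delta/\sqrt{2\sigma^2/n}$, and with $n-1$ and $2n-2$ degrees of freedom respectively. This is a different route from the formulas above (it relies on the `ncp` argument of `pt`), and the parameter values below are assumptions made for the illustration.

```r
sig2 = 2; delta = 1; alpha = 0.05   # assumed values
n.val = c(5, 10, 20, 50, 200)

# Common noncentrality parameter for both tests
ncp.val = delta/sqrt(2*sig2/n.val)

# Exact powers of the right-tailed tests
pwr.paired = pt(qt(alpha, n.val - 1, lower.tail = FALSE), n.val - 1,
                ncp = ncp.val, lower.tail = FALSE)
pwr.indep = pt(qt(alpha, 2*n.val - 2, lower.tail = FALSE), 2*n.val - 2,
               ncp = ncp.val, lower.tail = FALSE)

pwr.cmp = data.frame(n = n.val, paired = pwr.paired, indep = pwr.indep,
                     gap = pwr.indep - pwr.paired)
print(round(pwr.cmp, 4))
```

The column `gap` shows the power lost by the paired test because it has fewer degrees of freedom; that loss becomes negligible for the largest values of $n$.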


<h5 style="color:green;">Questions</h5>

<span style="color:green">Answer the following questions:</span>
- <span style="color:green">Is the power of a test with alternative hypothesis $H_1 : \mu_X - \mu_Y < \delta_0$ an increasing function of $\delta = \mu_X - \mu_Y$?</span>
- <span style="color:green">If you increase the significance level $\alpha$ of a test $H_0 : \mu_X - \mu_Y \leq \delta_0$, does the probability of a type II error $\beta$ for $\delta_1 < \delta_0$ increase?</span>


#### <span style="color:blue;">Representation of power functions</span>

The following cell represents the shape of the power functions for the two preceding tests, as a function of the values of $\delta$.

These functions have been obtained from the preceding formulas, by conditioning on the possible values of the sample quasivariance. In particular, the following formulas have been applied (and computed) to obtain the representations in <span style="color:blue;font-family:monospace;font-size:90%;">R</span>,

$$
\begin{array}{rcl}
\mbox{power}_p (\delta;\sigma^2) & = & \displaystyle \int_0^\infty \Pr\left( T_{n-1} + \frac{\delta}{\displaystyle \sqrt{\frac{2\sigma^2 x}{n(n-1)}}} > t_{n-1;\alpha} \right) f_{\chi^2_{n-1}} (x) dx \\
\mbox{power}_i (\delta;\sigma^2) & = & \displaystyle \int_0^\infty \Pr \left( T_{2n-2} + \frac{\delta}{\displaystyle \sqrt{\frac{2\sigma^2 x}{n(n-1)}}} > t_{2n-2;\alpha} \right) f_{\chi^2_{n-1}} (x) dx
\end{array}
$$

where $f_{\chi^2_{n-1}} (x)$ denotes the density of a chi squared distribution with $n-1$ degrees of freedom computed at $x$.


In [None]:
# Hypothesis testing: Power functions

## These parameters can be modified

smp.sz = 10
sig.lvl = 0.05
sig2.0 = 2

low.lim = 0
up.lim = 4

## Computing the values of the power functions for the mean

crit.val.1 = qt(sig.lvl,2*smp.sz-2,lower.tail = FALSE)   # independent samples test (2n-2 df)
crit.val.2 = qt(sig.lvl,smp.sz-1,lower.tail = FALSE)     # paired samples test (n-1 df)

x.seq = seq(low.lim,up.lim,length.out=100)

int.fn1 = function(x,delta) {pt(crit.val.1 - delta/sqrt(2*sig2.0*x/(smp.sz*(smp.sz-1))),
                                2*smp.sz-2,lower.tail=FALSE)*dchisq(x,smp.sz-1)}
int.fn2 = function(x,delta) {pt(crit.val.2 - delta/sqrt(2*sig2.0*x/(smp.sz*(smp.sz-1))),
                                smp.sz-1,lower.tail=FALSE)*dchisq(x,smp.sz-1)}

pwr.v1 = pwr.v2 = NULL
for (ix in x.seq) {
    pwr.v1 = c(pwr.v1,integrate(int.fn1, lower = 0, upper = Inf, delta = ix)$value)
    pwr.v2 = c(pwr.v2,integrate(int.fn2, lower = 0, upper = Inf, delta = ix)$value)
}

pwr.df = data.frame(x = x.seq, y1 = pwr.v1, y2 = pwr.v2)

## Plotting the power functions

plt.pwr = pwr.df %>% ggplot(aes(x=x)) + 
    geom_line(aes(y=y1, colour="red")) +
    geom_line(aes(y=y2, colour="blue")) +
    ggtitle("Power functions for means comparison") +
    xlab("True difference") + ylab("Power") +
    theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5)) +
    scale_colour_identity("Power F", guide = "legend", 
                          labels = c("Indep smp", "Paired smp"), 
                          breaks = c("red", "blue"))
plot(plt.pwr)



### <span style="color:brown;">Impact of the sample size on the power function</span>

The power function for a given test can also be used to find a sample size that would ensure that the errors associated with this test are reduced to acceptable levels, as we commented in Lesson 2. One difference is that, for independent samples, we have two sample sizes, $n_x$ and $n_y$, and their combined impact on the power of the test can be complex to analyze.

In the following cells we present the values corresponding to the smallest sample sizes that would guarantee a given smallest value of the power function, for the particular case of a test on the difference of the proportions in two samples, assuming sufficiently large sample sizes.

The test we will consider for this illustration is the one defined as

$$
\begin{array}{rl}
  H_0 : & p_X - p_Y \leq 0 \\
  H_1 : & p_X - p_Y > 0
\end{array}
$$

whose test statistic $T$ and (approximate) distribution are given by

$$
T = \frac{\hat P_X - \hat P_Y - (p_X - p_Y)}{\sqrt{\hat P_0 (1 - \hat P_0) (1/n_x + 1/n_y)}} \sim_{\scriptsize\text{approx.}} N(0,1)
$$

The estimator $\hat P_0$ corresponds to the common estimate for the population proportion under the null assumption, and has the following definition and (approximate) distribution:

$$
\hat P_0 = \frac{n_x \hat P_X + n_y \hat P_Y}{n_x + n_y} \sim_{\scriptsize\text{approx.}} N \left( \mu_p , \sigma^2_p \right) , \quad \mu_p = \frac{n_x p_X + n_y p_Y}{n_x + n_y} , \quad \sigma^2_p = \frac{n_x p_X (1-p_X) + n_y p_Y (1-p_Y)}{(n_x + n_y)^2}
$$

The power function for $\delta = p_X - p_Y$ is given (approximately) in this case by

$$
\begin{array}{rcl}
\mbox{power} (p_X,p_Y) & = & \displaystyle \Pr\left( \frac{\hat P_X - \hat P_Y}{\sqrt{\hat P_0 (1 - \hat P_0)(1/n_x + 1/n_y)}} > z_{\alpha} \; \Bigl| \; (p_X,p_Y) \right) \\
& = & \displaystyle \Pr\left( \frac{\hat P_X - \hat P_Y - (p_X - p_Y)}{\sqrt{\hat P_0 (1 - \hat P_0)(1/n_x + 1/n_y)}} + \frac{p_X - p_Y}{\sqrt{\hat P_0 (1 - \hat P_0)(1/n_x + 1/n_y)}} > z_{\alpha} \; \Bigl| \; (p_X,p_Y) \right) \\
& = & \displaystyle \Pr\left( Z + \frac{p_X - p_Y}{\sqrt{\hat P_0 (1 - \hat P_0)(1/n_x + 1/n_y)}} > z_{\alpha} \right)
\end{array}
$$

The relationship between the power and the sample sizes is complicated, as the values $n_x$ and $n_y$ appear in several places, both in the preceding expression and in the distribution of $\hat P_0$, as part of nonlinear terms. Nevertheless, the general pattern of the dependence is still monotonic in $n_x$ and $n_y$, that is, larger sample sizes imply higher values for the power.

To compute this power function in the following code cell we have again used conditioning on the value of $\hat P_0$, taking into account that this variable follows (approximately) a normal distribution,

$$
\mbox{power} (p_X,p_Y) \approx \int_0^1 \Pr\left( Z + \frac{p_X - p_Y}{\sqrt{w (1 - w)(1/n_x + 1/n_y)}} > z_{\alpha} \right) f_{N(\mu_p,\sigma^2_p)} (w) dw
$$

where $f_{N(\mu_p,\sigma^2_p)}$ represents the density function of a normal random variable with mean $\mu_p$ and variance $\sigma^2_p$.



#### <span style="color:blue;">Sample size and the power function of a test</span>

The following cell computes a plot to illustrate the dependence between the power and the sample sizes for the case of the test on the difference of proportions we have described above.


In [None]:
# Hypothesis testing: Power functions and sample sizes

## These parameters can be modified

sig.lvl = 0.05

p.x = 0.3
p.y = 0.2

## Values of the power for different sample sizes

sz.x.low = 100
sz.x.high = 500
sz.x.inc = 10
sz.y.low = 100
sz.y.high = 500
sz.y.inc = 100

sz.x.val = seq(sz.x.low,sz.x.high,sz.x.inc)
sz.y.val = seq(sz.y.low,sz.y.high,sz.y.inc)

crit.val = qnorm(sig.lvl,lower.tail = FALSE)
int.fn = function(x,delta,cv,n1,n2,b.mn,b.sd) {pnorm(cv - delta/sqrt(x*(1-x)*(1/n1 + 1/n2)),lower.tail=FALSE)*
                                          dnorm(x,mean = b.mn,sd = b.sd)}

pwr.val = NULL
for (n.x in sz.x.val) {
    for (n.y in sz.y.val) {
        ber.mn = (n.x * p.x + n.y * p.y)/(n.x + n.y)
        ber.sd = sqrt(n.x * p.x * (1-p.x) + n.y * p.y*(1-p.y))/(n.x + n.y)   # sd of the pooled proportion estimator
        pwr.v = integrate(int.fn, lower = 0, upper = 1, delta = p.x - p.y, cv = crit.val, n1 = n.x, n2 = n.y, b.mn = ber.mn, b.sd = ber.sd)$value
        pwr.val = rbind(pwr.val,c(n.x,n.y,pwr.v))
    }
}

pwr.df2 = as.data.frame(pwr.val)
colnames(pwr.df2) = c("nx","ny","pwr")
pwr.df2$ny = as.factor(pwr.df2$ny)

plt.pwr2 = pwr.df2 %>% ggplot(aes(x=nx, y=pwr, color=ny)) + geom_line() +
    ggtitle("Power functions for different sample sizes") +
    xlab("Sample size x") + ylab("Power") +
    theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5)) +
    scale_colour_discrete("Sample size y")
plot(plt.pwr2)
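
If the goal is the smallest sample size that reaches a target power, a quick reference value can be obtained from `power.prop.test` in the `stats` package. This sketch assumes equal sample sizes ($n_x = n_y$) and a target power of 0.8, both chosen for illustration. The function uses its own normal approximation rather than the conditioning integral of the cell above, so its values need not coincide exactly with those in the plot.

```r
p.x = 0.3; p.y = 0.2
sig.lvl = 0.05
pwr.target = 0.8   # assumed target power

# Common sample size required for the right-tailed test on p_X - p_Y
sz.res = power.prop.test(p1 = p.x, p2 = p.y, power = pwr.target,
                         sig.level = sig.lvl, alternative = "one.sided")
n.min = ceiling(sz.res$n)

# Power attained at the rounded-up sample size
pwr.at.n = power.prop.test(n = n.min, p1 = p.x, p2 = p.y,
                           sig.level = sig.lvl, alternative = "one.sided")$power
print(c(n.min = n.min, power = round(pwr.at.n, 4)))
```

Rounding the sample size up keeps the attained power at or above the target.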


##### <span style="color:green;">Questions</span>

<span style="color:green">Answer the following questions, assuming that both $p_X,p_Y \in [0.05;0.95]$:</span>
- <span style="color:green">Should the power function for this test go to 1 for finite values of the sample sizes? Why? Can you say something about the limit value?</span>
- <span style="color:green">Is it enough that the sample size of one of the samples goes to infinity for the power to increase to one? Why?</span>
- <span style="color:green">Given two values of the sample size, $n_x \gg n_y$, which of the two sample sizes is most relevant to define the value of the power of this test?</span>


---
---

<a id='App3_1'></a>

## <span style="color:orange;">Appendix 1: Distributions associated to the normal (ii)</span>

---

In this lesson we have introduced a new distribution, the Fisher F distribution, associated with the test comparing the variances of two normal populations when the samples are independent. This distribution is closely related to the ones introduced in Lesson 1, and in particular to the chi squared distribution.

<h3 style="color:orange;">The Fisher F distribution</h3>

As we have mentioned, the most usual statistic corresponding to the test for the equality of two population variances when the data are normal and the samples are independent is defined as

$$
T = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2}
$$

The Fisher F distribution is defined as the distribution of the ratio of two independent chi squared random variables, each divided by its number of degrees of freedom. For example,

$$
F = \frac{W_1/n_1}{W_2/n_2}
$$

would follow a Fisher F distribution with $n_1$ and $n_2$ degrees of freedom if $W_1$ and $W_2$ are independent and follow chi squared distributions with $n_1$ and $n_2$ degrees of freedom respectively.

In Lesson 1 we indicated that $(n-1)S^2/\sigma^2$ follows a chi squared distribution with $n-1$ degrees of freedom when $S^2$ represents a random variable corresponding to the quasivariance of a sample of size $n$ from a normal population and $\sigma^2$ is the variance of the population. As a consequence, if the two samples are independent,

$$
\frac{((n_x-1)S_X^2/\sigma_X^2)/(n_x-1)}{((n_y-1)S_Y^2/\sigma_Y^2)/(n_y-1)} = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} = T
$$

follows a Fisher F distribution with $n_x-1$ and $n_y-1$ degrees of freedom.
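
This result can be checked by simulation. The sketch below draws independent normal samples with assumed sizes and standard deviations (illustrative values), computes $T$ for each replication and compares its empirical quantiles with those of the F distribution given by `qf`.

```r
set.seed(42)
n.x = 8; n.y = 12      # assumed sample sizes
sd.x = 2; sd.y = 3     # assumed population standard deviations
n.rep = 10000

# One value of T per replication
t.ratio = replicate(n.rep, {
    x = rnorm(n.x, sd = sd.x)
    y = rnorm(n.y, sd = sd.y)
    (var(x)/sd.x^2)/(var(y)/sd.y^2)
})

# Empirical versus theoretical quantiles
prob.v = c(0.05, 0.25, 0.5, 0.75, 0.95)
q.cmp = rbind(empirical = quantile(t.ratio, prob.v),
              theoretical = qf(prob.v, n.x - 1, n.y - 1))
print(round(q.cmp, 3))
```

The two rows should be close, with differences attributable to simulation error.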



#### <span style="color:orange;">Summary of statistics for the Fisher F distribution</span>

$$
\small
\begin{array}{cccc}
\text{Deg of freedom} & \text{Domain} & \text{Mean} & \text{Variance} \\
n_1,n_2 & [0 , \infty) & \displaystyle \frac{n_2}{n_2-2}, \text{ for } n_2 > 2 & \displaystyle \frac{2n_2^2(n_1+n_2-2)}{n_1(n_2-2)^2(n_2-4)}, \text{ for } n_2 > 4 \\
\hline
& & \text{Median (approx.)} & \text{Asymmetry coefficient} \\
& & \displaystyle \frac{3n_1-2}{n_1}\frac{n_2}{3n_2-2}, \text{ for } n_1,n_2 > 2 & \displaystyle \frac{(2n_1+n_2-2)\sqrt{8(n_2-4)}}{(n_2-6)\sqrt{n_1(n_1 + n_2 - 2)}}, \text{ for } n_2 > 6 \\
\hline
\end{array}
$$
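
The expressions in the table can be verified numerically for particular degrees of freedom. The sketch below uses $n_1 = 5$ and $n_2 = 15$ (illustrative values) and compares the formulas for the mean, the variance and the approximate median with values obtained from `integrate`, `df` (the F density in R) and `qf`.

```r
n1 = 5; n2 = 15   # assumed degrees of freedom

# Moments by numerical integration of the F density
mn.num = integrate(function(x) x*df(x, n1, n2), lower = 0, upper = Inf)$value
m2.num = integrate(function(x) x^2*df(x, n1, n2), lower = 0, upper = Inf)$value
var.num = m2.num - mn.num^2

# Expressions from the summary table
mn.frm = n2/(n2 - 2)
var.frm = 2*n2^2*(n1 + n2 - 2)/(n1*(n2 - 2)^2*(n2 - 4))
med.frm = ((3*n1 - 2)/n1)*(n2/(3*n2 - 2))

f.chk = data.frame(quantity = c("mean", "variance", "median"),
                   formula = c(mn.frm, var.frm, med.frm),
                   numerical = c(mn.num, var.num, qf(0.5, n1, n2)))
print(f.chk)
```

The mean and variance should match up to the integration tolerance, while the median row only agrees approximately, as the table indicates.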


##### <span style="color:green;">Questions</span>

<span style="color:green">Answer the following questions:</span>
- <span style="color:green">Where is the approximate center of a Fisher F distribution? Does it depend much on the values of $n_1$ and $n_2$?</span>
- <span style="color:green">Symmetry properties of a Fisher F distribution. If you know the value of $F_{n_1,n_2;\alpha}$ for any $\alpha$, how could you compute the value of $F_{n_2,n_1;\alpha}$?</span>
- <span style="color:green">Does the variance of a Fisher F distribution increase/decrease with the values of $n_1$ and $n_2$? What is its limit when $n_1 \rightarrow \infty$, or when $n_2 \rightarrow \infty$?</span>


#### <span style="color:orange;">Representation of the density function of a Fisher F distribution</span>

The following cell shows some plots with the densities for this distribution, for different values of the degrees of freedom of the distribution. The values of these degrees of freedom can be selected by introducing the desired values for the variables <span style="color:blue;font-family:monospace;font-size:90%;">deg.fr.1</span> and <span style="color:blue;font-family:monospace;font-size:90%;">deg.fr.2</span> (corresponding to the values of $n_1$ and $n_2$).


In [None]:
# Graphical representations of the Fisher F density function

## These values can be modified
## Numbers of degrees of freedom for the densities to be plotted (include between 1 and 6 pairs of values)

deg.fr.1 = c(5,5,15,15)
deg.fr.2 = c(5,15,5,15)

## Plot the densities

mx.df = 6
l.df = min(mx.df,length(deg.fr.1),length(deg.fr.2))
deg.fr.1 = deg.fr.1[1:l.df]
deg.fr.2 = deg.fr.2[1:l.df]

col.v.gen = c("#619CFF", "#00BFC4", "#00BF7D", "#33CCCC", "#33FFCC", "#99CCFF")
col.v = col.v.gen[1:l.df]

xlm.l = 0
xlm.u = 3

plot.f1 = data.frame(x = c(xlm.l, xlm.u), c.fn = factor(1))

plt.c = ggplot(NULL, aes(x = x, color = c.fn)) +
    stat_function(data = plot.f1, fun = df, args=list(df1=deg.fr.1[1],df2=deg.fr.2[1])) +
    stat_function(data = plot.f1, fun = df, args=list(df1=10,df2=10), color="red") +
    annotate(geom="text",label=c("Fisher F 10,10 df"),color="red",
             x=c(xlm.l+0.65),y=c(0.01),vjust = -1) +
    ggtitle("Density of a Fisher F distribution") +
    xlab("Value") + ylab("Density") +
    scale_colour_manual("df", values = col.v, labels = sprintf("n1 = %4.0f, n2 = %4.0f",deg.fr.1,deg.fr.2)) +
    theme(plot.title = element_text(color="blue", size=14, face="bold.italic", hjust=0.5)) +
    labs(fill = "df")

if (l.df > 1) {
    for (i in 2:l.df) {
        plot.fi <- data.frame(x = c(xlm.l, xlm.u), c.fn = factor(i))
        plt.c = plt.c + stat_function(data = plot.fi, fun = df, args=list(df1=deg.fr.1[i],df2=deg.fr.2[i]))
    }
}

suppressWarnings(grid.arrange(plt.c,nrow = 1,
                              top=textGrob("Densities of a Fisher F distribution",gp=gpar(fontsize=15,col="blue"))))
