In [1]:
from IPython.core.display import HTML
HTML("""
<style>
div.text_cell_render { /* Customize text cells */
font-family: 'Times New Roman';
font-size:1.3em;
line-height:1.4em;
padding-left:1.5em;
padding-right:1.5em;
}
</style>
""")

<h1><center>Hypothesis Testing</center></h1>

<b>Hypothesis testing</b> is a method for testing an assumption made about a population parameter. For example, the statement $\mu > 10$ is an assumption, or <b>hypothesis</b>, about the population mean $\mu$. To check the validity of this hypothesis, we must conduct a <b>hypothesis test</b>. Hypothesis tests are closely related to confidence intervals.

Prior to performing a hypothesis test, we need to formulate the <b>null</b> and <b>alternate hypotheses</b>. The <b>null hypothesis</b> states that the effect indicated by the sample is only due to chance or random variation. The <b>alternate hypothesis</b> states that the effect indicated by the sample is real and accurately represents the whole population. For example, if we want to test whether the population mean $\mu > 10$ by analysing the data from a sample, the null and alternate hypotheses will be:

$$H_0 : \mu \leq 10$$

$$H_A : \mu > 10$$

While performing a hypothesis test, we put the null hypothesis on trial. We begin by assuming that the null hypothesis is true. We then use the random sample (the evidence) to assess how plausible the null hypothesis is. This plausibility is expressed as a number between 0 and 1 called the <b>p-value</b>. <b>The p-value is the probability, under the assumption that the null hypothesis is true, of observing a value of the test statistic (here the sample mean $\overline{X}$) whose disagreement with $H_0$ is at least as great as that of the value actually observed</b>. Hence, <b>the smaller the p-value, the stronger the evidence against the null hypothesis $H_0$</b>. If the p-value is sufficiently small, we can <b>reject the null hypothesis</b> and conclude that the data support the alternate hypothesis.

#### Tests for a Population Mean (Large-Sample) :

Below are a few examples which show how to conduct a hypothesis test for a population mean with large samples.

<b>Example: </b> A certain type of automobile engine emits a mean of 100 mg of oxides of nitrogen per second at 100 horsepower. A modification to the engine design has been proposed that may reduce the emissions. The new design will be put into production if it can be demonstrated that its mean emission rate is less than 100 mg/s. A sample of 50 modified engines are built and tested. The sample mean emission is 92 mg/s, and the sample standard deviation is 21 mg/s. Conduct a test to validate the claim (new mean emission is less than 100mg/s).

<b>Sol:</b> We are interested in establishing whether the mean emission rate is less than 100 mg/s, so this statement becomes the alternate hypothesis. Generally, <b>the claim to be demonstrated is taken as the alternate hypothesis</b>. Hence,

$$H_0: \mu \geq 100$$

$$H_A: \mu < 100$$

Now, we assume that the null hypothesis is true and compute the distribution of the sample mean $\overline{X}$. As $\overline{X}$ is the mean of a large sample ($n = 50$), by the central limit theorem it approximately follows a normal distribution whose mean is $\mu$ and variance is $\sigma^2/50$, where $\mu$ is the population mean and $\sigma^2$ is the population variance. Now, we need to find the values of $\mu$ and $\sigma^2$.

We are assuming that the null hypothesis is true, i.e. $\mu \geq 100$. This does not provide a specific value of $\mu$, so we take the value that is as close as possible to the alternate hypothesis, $\mu = 100$. As the sample is large, we can estimate the population standard deviation by the sample standard deviation $s = 21$ mg/s. Hence the distribution of $\overline{X}$ under the assumption that the null hypothesis is true (called the <b>null distribution</b>) is $\overline{X} \sim N(100, (\frac{21}{\sqrt{50}})^2) = N(100, 2.97^2)$.

Now, we need to calculate the <b>p-value</b> for the test. Recall that the <b>p-value is the probability, under the assumption that the null hypothesis is true, of observing a value of the test statistic whose disagreement with $H_0$ is at least as great as that of the observed value</b>. Here, the observed value is $\overline{X} = 92$, and we need to find the probability that a number drawn from the null distribution (which is $N(100, 2.97^2)$) is <b>less than or equal to</b> 92. This can be computed by determining the <b>z-score</b>:

$$z = \frac{92 - 100}{2.97} = -2.69$$

The corresponding p-value is <b>0.0036</b>. This leaves two possibilities: <b>either $H_0$ is false</b>, or <b>$H_0$ is true and we have observed an unusual sample: of all the samples that can be drawn from the null distribution, only 0.36% have a sample mean as small as the observed value</b>. In practice, the second explanation is implausible, so we <b>reject the null hypothesis</b>. Here, the p-value is calculated from the z-score, which is therefore called the <b>test statistic</b>. The p-value is also called the <b>observed significance level</b>.
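The calculation above can be reproduced in a few lines. The following is a minimal sketch using only the Python standard library; the helper name `z_test_mean` and its `alternative` argument are illustrative choices, not part of any library. Because the code does not round $z$ before looking up the probability, the p-value it prints may differ in the last digit from the table-based value.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, s, n, alternative="less"):
    """Large-sample z test for a population mean; returns (z, p_value)."""
    standard_error = s / sqrt(n)          # estimated std. dev. of the sample mean
    z = (sample_mean - mu0) / standard_error
    lower_tail = NormalDist().cdf(z)      # P(Z <= z) for a standard normal Z
    if alternative == "less":             # H_A: mu < mu0
        p_value = lower_tail
    elif alternative == "greater":        # H_A: mu > mu0
        p_value = 1 - lower_tail
    else:                                 # H_A: mu != mu0
        p_value = 2 * min(lower_tail, 1 - lower_tail)
    return z, p_value


# Engine emissions example: n = 50, sample mean 92, s = 21, H_A: mu < 100
z, p_value = z_test_mean(92, 100, 21, 50, alternative="less")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```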

<b>Example:</b> A scale is to be calibrated by weighing a 1000 g test weight 60 times. The 60 scale readings have mean 1000.6 g and standard deviation 2 g. Find the p-value for testing $H_0 :\mu = 1000$ versus $H_A : \mu \neq 1000$.

<b>Sol:</b> Here, the null distribution has mean 1000 and standard deviation $\frac{2}{\sqrt{60}} = 0.258$. Hence the z-score is:

$$z = \frac{1000.6 - 1000}{0.258} = 2.32$$

The upper-tail probability corresponding to this z-score is <b>0.0102</b>. As $H_0$ is specified by $\mu = 1000$, regions in both tails of the normal curve are in disagreement with $H_0$. Hence, the p-value is the sum of the two tail probabilities, which is <b>0.0204</b>. We therefore have strong evidence against $H_0$ and it can be rejected. This is called a <b>two-sided</b> or <b>two-tailed</b> test. The test conducted in the previous example was a <b>one-sided</b> or <b>one-tailed</b> test.
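As a quick check of the two-tailed calculation, here is a small sketch using the standard library's `NormalDist`. The variable names are illustrative. The code uses the unrounded z-score, so its result can differ slightly from the table-based 0.0204.

```python
from math import sqrt
from statistics import NormalDist

# Scale calibration example: H_0: mu = 1000 versus H_A: mu != 1000
n, sample_mean, s, mu0 = 60, 1000.6, 2.0, 1000.0

z = (sample_mean - mu0) / (s / sqrt(n))

# Area in one tail beyond |z|, then doubled because both tails disagree with H_0
one_tail = 1 - NormalDist().cdf(abs(z))
p_two_sided = 2 * one_tail

print(f"z = {z:.2f}, one tail = {one_tail:.4f}, p-value = {p_two_sided:.4f}")
```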

The conclusions that can be drawn from a hypothesis test are often misunderstood. Only two conclusions are possible: either <b>$H_0$ is false</b> or <b>$H_0$ is plausible</b>. We can never conclude that $H_0$ is true. To reject $H_0$, we choose a <b>significance level</b> for the test and reject the null hypothesis if the p-value is <b>less than</b> the significance level. In general, the smaller the p-value, the less plausible the null hypothesis. It is good practice to report the p-value rather than only the outcome of the test (whether the null hypothesis is rejected or not). It should be noted that <b>the p-value is not the probability of $H_0$ being true</b>.

One common misconception about hypothesis testing is that a result which is statistically significant is practically significant as well. Let us look at this with the help of an example.

<b>Example:</b> Assume that a process used to manufacture synthetic fibers is known to produce fibers with a mean breaking strength of 50 N. A new process, which would require considerable retooling to implement, has been developed. In a sample of 1000 fibers produced by this new method, the average breaking strength was 50.1 N, and the standard deviation was 1 N. Can we conclude that the new process produces fibers with greater mean breaking strength?

<b>Sol:</b> The null and alternate hypothesis can be given as:

$$H_0: \mu \leq 50$$

$$H_A: \mu > 50$$

The calculated z-score is:

$$z = \frac{50.1 - 50}{1/\sqrt{1000}} = \frac{0.1}{0.0316} = 3.16$$

Hence, the p-value is <b>0.0008</b>, the null hypothesis can be rejected, and the result is statistically significant. But as the increase in the breaking strength is almost negligible, the result is not practically significant. The main reason for the small p-value is the <b>small standard deviation of the sample mean</b> ($1/\sqrt{1000} = 0.0316$), which results from the large sample size.
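To see how the sample size drives this result, the sketch below repeats the same calculation for the same observed difference of 0.1 N at several sample sizes. Only $n = 1000$ comes from the example; the sizes 10 and 100 are hypothetical values added for comparison.

```python
from math import sqrt
from statistics import NormalDist

mu0, sample_mean, s = 50.0, 50.1, 1.0

# n = 1000 is the value from the example; 10 and 100 are hypothetical
results = {}
for n in (10, 100, 1000):
    z = (sample_mean - mu0) / (s / sqrt(n))
    results[n] = 1 - NormalDist().cdf(z)   # upper-tail p-value for H_A: mu > 50
    print(f"n = {n:5d}: z = {z:.2f}, p-value = {results[n]:.4f}")
```

The observed difference is identical in every row; only the standard deviation of the sample mean shrinks as $n$ grows, which is what makes the p-value small.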

#### Tests for a Population Mean (Small-Sample) :

When the sample size is small, the sample standard deviation $s$ may not be close to the population standard deviation $\sigma$, and $\overline{X}$ may not be approximately normal, so the large-sample method can not be applied. However, when the population is approximately normal, the <b>Student's t distribution</b> can be used.

<b>Example:</b> Spacer collars for a transmission countershaft have a thickness specification of 38.98–39.02 mm. The process that manufactures the collars is supposed to be calibrated so that the mean thickness is 39.00 mm, which is in the center of the specification window. A sample of six collars is drawn and measured for thickness. The six thicknesses are 39.030, 38.997, 39.012, 39.008, 39.019, and 39.002. Assume that the population of thicknesses of the collars is approximately normal. Can we conclude that the process needs recalibration?

<b>Sol:</b> Here, $\overline{X} = 39.01133$ and $s = 0.011928$. We need to determine whether the process needs recalibration. The process needs recalibration if the population mean differs from 39.00 mm. Hence, the null and alternate hypotheses can be given as:

$$H_0: \mu = 39.00$$

$$H_A: \mu \neq 39.00$$

The t-statistic is:

$$t = \frac{\overline{X} - \mu}{s/\sqrt{n}} = 2.327$$

This is a <b>two-tailed</b> test with $n - 1 = 5$ degrees of freedom, and hence the p-value is the sum of the areas under the curve for $t > 2.327$ and $t < -2.327$, i.e. $0.05 < \text{p-value} < 0.1$. Hence, we can not reject the null hypothesis at the 5% significance level and can not conclusively state that the process is out of calibration.
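The same test can be run directly on the six measurements. The sketch below assumes SciPy is available and uses `scipy.stats.ttest_1samp`, which performs a two-sided one-sample t test by default and returns an exact p-value rather than a range from the t table.

```python
from scipy.stats import ttest_1samp

# Collar thicknesses (mm) from the example
thickness = [39.030, 38.997, 39.012, 39.008, 39.019, 39.002]

# Two-sided test of H_0: mu = 39.00 against H_A: mu != 39.00
result = ttest_1samp(thickness, popmean=39.00)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```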

When a small sample is taken from a normal population whose standard deviation $\sigma$ is <b>known</b>, we can use <b>z-statistic</b> instead of t-statistic as we are not approximating $\sigma$ with $s$.

#### Tests for a Population Proportion :

Hypothesis testing for a population proportion can be conducted in a similar manner. Here are some examples to illustrate it.

<b>Example:</b> A supplier of semiconductor wafers claims that of all the wafers he supplies, no more than 10% are defective. A sample of 400 wafers is tested, and 50 of them, or 12.5%, are defective. Can we conclude that the claim is false?

<b>Sol:</b> The supplier claims that no more than 10% of the wafers are defective. Let us denote the proportion of defective wafers by $p$; then the supplier's claim is $p \leq 0.1$. As the sample size is large, we can apply the central limit theorem and say that the sample proportion approximately follows a normal distribution:

$$\widehat{p} \sim N\bigg(p, \frac{p(1-p)}{n}\bigg)$$

where $p$ is the population proportion and $n = 400$ is the sample size. Our goal is to determine whether the supplier's claim is false. Hence, the null and alternate hypotheses can be defined as:

$$H_0: p \leq 0.1$$

$$H_A: p > 0.1$$

To perform the hypothesis test, we assume that the null hypothesis is true and take the boundary value $p=0.1$. The <b>null distribution</b> comes out to be:

$$\widehat{p} \sim N(0.1, 2.25 \times 10^{-4})$$

The standard deviation of the null distribution is $\sqrt{2.25 \times 10^{-4}} = 0.015$ and the observed value of $\widehat{p}$ is $50/400 = 0.125$. Hence, the <b>z-score</b> is:

$$z = \frac{0.125 - 0.1}{0.015} = 1.67$$

The corresponding <b>p-value</b> is <b>0.0475</b>. For a <b>significance level</b> of 5%, we can <b>reject the null hypothesis</b> and say that the supplier's claim is false.

The only necessary condition to conduct the mentioned test is that the sample size must be large: there should be <b>more than 10</b> expected outcomes in each class, i.e. $np_0 > 10$ and $n(1-p_0) > 10$, where $p_0$ is the value of $p$ specified by $H_0$.
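The wafer example, including the sample-size check, can be sketched as follows with the standard library alone. Variable names are illustrative. The code keeps the unrounded z-score, so the p-value may differ slightly from the table-based 0.0475.

```python
from math import sqrt
from statistics import NormalDist

n, defective, p0 = 400, 50, 0.10

# Large-sample condition: more than 10 expected outcomes in each class
large_enough = n * p0 > 10 and n * (1 - p0) > 10

p_hat = defective / n
standard_error = sqrt(p0 * (1 - p0) / n)   # std. dev. of p_hat under H_0
z = (p_hat - p0) / standard_error
p_value = 1 - NormalDist().cdf(z)          # upper tail for H_A: p > 0.1

print(f"condition met: {large_enough}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```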


#### Tests for the Difference Between Two Means (Large-Sample) :

The basic idea to conduct the hypothesis test for difference between two means is to find the distribution for the difference of two means and test whether it is equal to 0 or not. Here is an example.

<b>Example:</b> Suppose that a production manager for a manufacturer of industrial machinery is concerned that <b>ball bearings produced in environments with low ambient temperatures may have smaller diameters than those produced under higher temperatures</b>. To investigate this concern, she samples 120 ball bearings that were manufactured early in the morning, before the shop was fully heated, and finds their mean diameter to be 5.068 mm and their standard deviation to be 0.011 mm. She independently samples 65 ball bearings manufactured during the afternoon and finds their mean diameter to be 5.072 mm and their standard deviation to be 0.007 mm. Can she conclude that ball bearings manufactured in the morning have smaller diameters, on average, than ball bearings manufactured in the afternoon?

<b>Sol:</b> Let $X$ and $Y$ denote the samples from the morning and afternoon respectively. Then, $\overline{X} = 5.068$, $n_X = 120$, $s_X = 0.011$, $\overline{Y} = 5.072$, $n_Y = 65$, $s_Y = 0.007$. We can formulate the null and alternate hypotheses as:

$$H_0: \mu_X - \mu_Y \geq 0$$

$$H_A: \mu_X - \mu_Y < 0$$

Since both sample sizes are large, $\overline{X}$ and $\overline{Y}$ are approximately normally distributed and hence the distribution of $\overline{X} - \overline{Y}$ can be given as:

$$\overline{X} - \overline{Y} \sim N(\mu_X - \mu_Y, \sigma_{\overline{X}}^2 + \sigma_{\overline{Y}}^2) = N\bigg(\mu_X - \mu_Y, \frac{\sigma_{{X}}^2}{n_X} + \frac{\sigma_{{Y}}^2}{n_Y} \bigg)$$

As the sample sizes are large, we can estimate the population standard deviations by the sample standard deviations, i.e. $\sigma_X \approx s_X = 0.011$ and $\sigma_Y \approx s_Y = 0.007$. Hence, the standard deviation of the <b>null distribution</b> is $\sqrt{0.011^2/120 + 0.007^2/65} = 0.001327$. The null distribution is given as:

$$\overline{X} - \overline{Y} \sim N(0, 0.001327^2)$$

The observed value is $\overline{X} - \overline{Y} = -0.004$, which gives the <b>z-score</b> as:

$$z = \frac{-0.004 - 0}{0.001327} = -3.01$$

and hence the <b>p-value</b> is <b>0.0013</b>, so the null hypothesis is rejected. The data support the alternate hypothesis, which means that the manager's suspicion appears to be correct.
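The ball-bearing calculation can be checked from the summary statistics alone. This is a minimal sketch using the standard library; the name `delta_0` for the hypothesised difference is an illustrative choice, set to 0 for this example.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics: morning sample (X) and afternoon sample (Y)
x_bar, s_x, n_x = 5.068, 0.011, 120
y_bar, s_y, n_y = 5.072, 0.007, 65
delta_0 = 0.0   # hypothesised difference under H_0

# Standard deviation of X_bar - Y_bar, with s used in place of sigma
standard_error = sqrt(s_x**2 / n_x + s_y**2 / n_y)
z = ((x_bar - y_bar) - delta_0) / standard_error
p_value = NormalDist().cdf(z)   # lower tail for H_A: mu_X - mu_Y < 0

print(f"SE = {standard_error:.6f}, z = {z:.2f}, p-value = {p_value:.4f}")
```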

We can also test whether the two population means differ by some specified value $\Delta_0$. The structure of the test is as follows (everything else remains the same):

$$H_0: \mu_X - \mu_Y \leq \Delta_0$$

$$H_A: \mu_X - \mu_Y > \Delta_0$$

$$test \ statistic (z) = \frac{(\overline{X} - \overline{Y}) - \Delta_0}{\sqrt{\sigma_X^2/n_X + \sigma_Y^2/n_Y}}$$

It should be noted that it is more appropriate to use <b>t-statistic</b> than the <b>z-statistic</b> whenever sample standard deviation is used to estimate the population standard deviation. For a large sample size, <b>z-distribution</b> approximates the <b>t-distribution</b> and hence can be used instead.

#### Tests for the Difference Between Two Means (Small-Sample) :

The hypothesis test for the difference between two means for a small sample size can be conducted in a similar manner. As the sample size is small, the central limit theorem does not apply and if both the populations are approximately normal, <b>t-distribution</b> can be used instead. Here is an example:

<b>Example:</b> Assume that in an experiment, six identically prepared DNA specimens are prepared without the enzyme, and the numbers of a specific molecule counted are 33, 30, 26, 22, 37, and 34. Assume that in four identically prepared specimens with the enzyme present, the counts were 22, 29, 25, and 23. Can we conclude that the counts are lower when the enzyme is present?

<b>Sol:</b> As the sample sizes are small, we can not apply the central limit theorem. Let the sample without the enzyme be denoted by $X$ and the one with the enzyme by $Y$. Then, $\overline{X} = 30.333$, $n_X = 6$, $s_X = 5.538$, $\overline{Y} = 24.750$, $n_Y = 4$, $s_Y = 3.096$. The null and alternate hypotheses can be formulated as:

$$H_0: \mu_X - \mu_Y \leq 0$$

$$H_A: \mu_X - \mu_Y > 0$$

As the sample size is small, the following quantity approximately follows the <b>Student's t distribution</b>

$$\frac{(\overline{X} - \overline{Y}) - (\mu_X - \mu_Y)}{\sqrt{s_X^2/n_X + s_Y^2/n_Y}}$$

with degrees of freedom $\nu$ (rounded down to the nearest integer) given as

$$\nu = \frac{(s_X^2/n_X + s_Y^2/n_Y)^2}{\bigg(\frac{(s_X^2/n_X)^2}{n_X-1} + \frac{(s_Y^2/n_Y)^2}{n_Y-1}\bigg)}$$

Under the assumption that the null hypothesis is true, the test statistic is given as:

$$t = \frac{(\overline{X} - \overline{Y}) - 0}{\sqrt{s_X^2/n_X + s_Y^2/n_Y}} = \frac{5.583 - 0}{2.740} = 2.038$$

From the t table, we can find that <b>0.025 < p-value < 0.05</b>, and hence the null hypothesis can be rejected at the 5% significance level: the data indicate that the mean count is lower when the enzyme is present.
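The enzyme example can be worked through numerically as a check. The sketch below assumes SciPy is available for the t distribution and computes the degrees of freedom with the Welch-Satterthwaite expression, in which each term $s^2/n$ in the denominator is squared. Unlike a table lookup, `scipy.stats.t.sf` accepts a fractional number of degrees of freedom, so $\nu$ is not rounded down here.

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t as t_dist

x = [33, 30, 26, 22, 37, 34]   # counts without the enzyme
y = [22, 29, 25, 23]           # counts with the enzyme

# Estimated variances of the two sample means
var_x_bar = stdev(x) ** 2 / len(x)
var_y_bar = stdev(y) ** 2 / len(y)

t_stat = (mean(x) - mean(y)) / sqrt(var_x_bar + var_y_bar)

# Approximate degrees of freedom for unequal variances
nu = (var_x_bar + var_y_bar) ** 2 / (
    var_x_bar ** 2 / (len(x) - 1) + var_y_bar ** 2 / (len(y) - 1)
)

p_value = t_dist.sf(t_stat, df=nu)   # upper tail for H_A: mu_X - mu_Y > 0

print(f"t = {t_stat:.3f}, nu = {nu:.2f}, p-value = {p_value:.4f}")
```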

We can also test whether the two population means differ by some specified value $\Delta_0$. The structure of the test is as follows (everything else remains the same):

$$H_0: \mu_X - \mu_Y \leq \Delta_0$$

$$H_A: \mu_X - \mu_Y > \Delta_0$$

$$test \ statistic (t) = \frac{(\overline{X} - \overline{Y}) - \Delta_0}{\sqrt{s_X^2/n_X + s_Y^2/n_Y}}$$

When the populations have <b>equal variances</b>, <b>pooled variance</b> can be used, which is given as:

$$s_p^2 = \frac{(n_X-1)s_X^2 + (n_Y-1)s_Y^2}{n_X + n_Y - 2}$$

The <b>test statistic</b> for the testing of equality of mean can be given as:

$$t = \frac{\overline{X} - \overline{Y}}{s_p \sqrt{1/n_X + 1/n_Y}}$$

Under the assumption that the null hypothesis is true, the test statistic follows Student's t distribution with <b>degree of freedom</b> $n_X + n_Y - 2$.

<b>Example:</b> Two methods have been developed to determine the nickel content of steel. In a sample of five replications of the first method on a certain kind of steel, the average measurement (in percent) was $\overline{X} = 3.16$ and the standard deviation was $s_X = 0.042$. The average of seven replications of the second method was $\overline{Y} = 3.24$ and the standard deviation was $s_Y = 0.048$. Assume that it is known that the population variances are nearly equal. Can we conclude that there is a difference in the mean measurements between the two methods?

<b>Sol:</b> As we have to determine whether the means are equal or not, the null and alternate hypotheses will be:

$$H_0: \mu_X = \mu_Y$$

$$H_A: \mu_X \neq \mu_Y$$

The pooled standard deviation can be calculated as:

$$s_p = \sqrt{\frac{(n_X-1)s_X^2 + (n_Y-1)s_Y^2}{n_X + n_Y - 2}} = 0.0457$$

Hence, the <b>t-statistic</b> is:

$$t = \frac{3.16-3.24}{0.0457 \sqrt{1/5 + 1/7}} = -2.990$$

As this is a <b>two-tailed</b> test with $n_X + n_Y - 2 = 10$ degrees of freedom, we need to sum the areas in the left and right tails, which gives the p-value as: <b>0.01 < p-value < 0.02</b>. Hence, we can reject the null hypothesis and conclude that the two means are different.
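Since only summary statistics are given for the nickel example, one way to check the result is `scipy.stats.ttest_ind_from_stats`, which accepts means, standard deviations and sample sizes directly. The sketch below assumes SciPy is available; setting `equal_var=True` selects the pooled-variance test, and the returned p-value is two-sided.

```python
from scipy.stats import ttest_ind_from_stats

# Nickel content example: pooled-variance (equal variances) two-sample t test
result = ttest_ind_from_stats(
    mean1=3.16, std1=0.042, nobs1=5,   # first method
    mean2=3.24, std2=0.048, nobs2=7,   # second method
    equal_var=True,
)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```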

We should refrain from assuming that the population variances are equal merely because the sample variances are nearly equal. The test for the difference of two means by a specific quantity can be conducted in a similar way as described above.

#### Tests with Paired Data :

The test with paired data can be conducted in a similar way to the calculation of the confidence interval. We can treat the collection of differences as a single random sample from the population of differences. The rest of the procedure remains the same.

<b>Example:</b> The emission of the automobiles for two conditions is shown below.

<img src="images/paired_data.PNG"  width="800px">

Can we conclude that the mean level of emissions is less for highway driving than for stop-and-go driving?

<b>Sol:</b> The mean and standard deviation of the differences (stop-and-go minus highway) are: $\overline{D} = 190.5$, $s_D = 284.1$. The null and alternate hypotheses can be formulated as:

$$H_0: \mu_D \leq 0$$

$$H_A: \mu_D > 0$$

The test statistic is given as (the test is for small samples):

$$t = \frac{\overline{D} - 0}{s_D/\sqrt{n}} = \frac{190.5 - 0}{284.1/\sqrt{8}} = 1.897$$

which gives the p-value barely less than 0.05.
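The p-value for the paired test can be computed from the summary statistics of the differences. This sketch assumes SciPy is available for the t distribution and uses $n - 1 = 7$ degrees of freedom.

```python
from math import sqrt
from scipy.stats import t as t_dist

# Summary statistics of the n = 8 paired differences
d_bar, s_d, n = 190.5, 284.1, 8

t_stat = (d_bar - 0) / (s_d / sqrt(n))
p_value = t_dist.sf(t_stat, df=n - 1)   # upper tail for H_A: mu_D > 0

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```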

#### Distribution-Free Tests :

One assumption of the Student's t tests performed above is that the samples come from a normal distribution. In <b>distribution-free tests</b>, this restriction is relaxed, i.e. the samples are not required to come from any specific distribution. Distribution-free tests are sometimes called <b>nonparametric tests</b>. There are two main types of distribution-free tests: the <b>Wilcoxon signed-rank test</b> (a test for a population mean) and the <b>Wilcoxon rank-sum test / Mann-Whitney test</b> (analogous to the two-sample t test).

##### The Wilcoxon Signed-Rank Test :

Let us see an example to understand the Wilcoxon signed-rank test. The table below shows the measurements of a quantity in a particular experiment. A test of $H_0: \mu \geq 12$ against $H_A: \mu < 12$ needs to be conducted. The data set contains two outliers, 0.9 and 21.7, which suggests that the sample does not come from a normal population. This makes the Student's t test inappropriate.


<img src="images/Wilcoxon_1.PNG"  width="300px">

The Wilcoxon signed-rank test does not require the population to be normal. Instead, it requires that the <b>population has a continuous and symmetric probability density function</b>. As outliers are present on both sides, it is reasonable to assume that the population is continuous and symmetric. As the population is symmetric, the median is the same as the mean. The next step is to make the data points <b>median-centered</b>. Under $H_0$ we take the boundary value $\mu = 12$, so the median is also 12 and we subtract 12 from each observation. Next, we rank the median-centered observations by absolute value (ignoring the signs), giving rank 1 to the point closest to the median. Each rank is then given the same sign as its median-centered observation. Now, we calculate the sums of the <b>positive</b> and <b>negative</b> ranks: $S_+ = 2+5 = 7$ and $S_- = 1 + 3 + 4 + 6 = 14$. It can be shown that $S_+ + S_- = \frac{n(n+1)}{2}$, where $n$ is the sample size. Hence, for a larger sample size, instead of finding both sums, we can find one of them and compute the other as $S_+ = n(n+1)/2 - S_-$.

When $\mu > 12$, positive ranks are more probable than negative ranks and tend to be larger in magnitude as well, i.e. $S_+$ is likely to be large. If $\mu < 12$, the situation is reversed: the positive ranks are likely to be fewer in number and smaller in magnitude, so $S_+$ is likely to be small. In a nutshell, <b>large values of $S_+$ provide evidence against a null hypothesis of the form $H_0: \mu \leq \mu_0$, while small values of $S_+$ provide evidence against a null hypothesis of the form $H_0: \mu \geq \mu_0$</b>. Here $S_+$ is the smaller sum and hence it provides evidence against the null hypothesis $H_0: \mu \geq 12$. The p-value for the test can be obtained from a table or from various scientific packages.

One complication in assigning the ranks arises when a tie occurs. In this case, the average of the tied ranks is assigned to each of the tied observations. For example, if there is a tie for the third and fourth ranks, a rank of 3.5 is assigned to both observations. Another concern is the case when one of the observations is exactly equal to the median (or mean). In this case, the difference is 0 and the observation can not be assigned a signed rank. One way to deal with this is to drop the observation.
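The ranking procedure, including the handling of ties and zero differences, can be sketched in plain Python. The function name `signed_rank_sums` and the data below are hypothetical: the sample is made up purely to exercise a tie and a zero difference, and is not the data set from the table above.

```python
def signed_rank_sums(data, mu0):
    """Return (S_plus, S_minus) for the Wilcoxon signed-rank test."""
    # Median-centre the data, dropping observations exactly equal to mu0
    diffs = sorted((x - mu0 for x in data if x != mu0), key=abs)

    s_plus = s_minus = 0.0
    i = 0
    while i < len(diffs):
        # Find the group of observations tied in absolute value
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        average_rank = (i + 1 + j) / 2   # mean of the ranks i+1, ..., j
        for d in diffs[i:j]:
            if d > 0:
                s_plus += average_rank
            else:
                s_minus += average_rank
        i = j
    return s_plus, s_minus


# Hypothetical sample: 12 is dropped, and 9 and 15 tie at distance 3 from 12
sample = [14, 9, 21, 12, 8, 15, 1]
s_plus, s_minus = signed_rank_sums(sample, mu0=12)
print(s_plus, s_minus)
```

After the zero difference is dropped, six observations remain, so the two sums add up to $6 \times 7 / 2 = 21$.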

When the sample size is <b>large</b>, the test statistic $S_+$ is approximately normally distributed with mean $n(n+1)/4$ and variance $n(n+1)(2n+1)/24$. In this case, the Wilcoxon signed-rank test can be performed by computing the <b>z-score</b> of $S_+$ and then using the normal table to find the <b>p-value</b>. The z-score is:

$$z = \frac{S_+ - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}$$

##### The Wilcoxon Rank-Sum Test :

The <b>Wilcoxon rank-sum test</b> can be used to test the difference in population mean in the case when populations are not normal. It makes two assumptions: <b>the populations must be continuous</b> and <b>their probability density functions must be identical in shape and size (location does not matter)</b>.

Let the random samples from two different populations be $X_1, X_2, ..., X_m$ and $Y_1, Y_2, ..., Y_n$ and the population means be $\mu_X$ and $\mu_Y$. The Wilcoxon rank-sum test is performed by combining the two samples and assigning a rank of $1, 2, ..., m+n$ to them. The test statistic, $W$, is the sum of the ranks of the samples from $X$ (where $m \leq n$). As the populations are identical apart from the location, if $\mu_X < \mu_Y$, the sum of the rank of the observations from population $X$, which is the test statistic, will be smaller. If $\mu_X > \mu_Y$, $W$ will tend to be larger.

Let us look at an example. The following code performs a <b>two-tailed</b> Wilcoxon rank-sum test on the samples from two populations $X$ and $Y$. For a <b>one-tailed</b> test, such as $H_0: \mu_X \geq \mu_Y$ against $H_A: \mu_X < \mu_Y$, the p-value is half of the two-tailed p-value, which gives <b>0.0223</b>.

In [3]:
from scipy.stats import ranksums

X = [36, 28, 29, 20, 38]
Y = [34, 41, 35, 47, 49, 46]

print(ranksums(X, Y))

RanksumsResult(statistic=-2.008316044185609, pvalue=0.044609718024939606)


For larger sample sizes ($m, n > 8$), it can be shown that the null distribution of the test statistic $W$ is approximately normal with mean $\mu_W = m(m+n+1)/2$ and variance $\sigma_W^2 = mn(m+n+1)/12$. In this case, we can compute the <b>z-score</b> of $W$ and perform a <b>z-test</b>. The test statistic is given as:

$$z = \frac{W - m(m+n+1)/2}{\sqrt{mn(m+n+1)/12}}$$
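To make the rank-sum statistic concrete, the sketch below computes $W$ by hand for the samples used in the code cell above and then standardises it with mean $m(m+n+1)/2$ and variance $mn(m+n+1)/12$. It uses only the standard library and assumes there are no ties, which holds for this data. The resulting z-score matches the `statistic` reported by `ranksums` above, which is based on this normal approximation.

```python
from math import sqrt

X = [36, 28, 29, 20, 38]
Y = [34, 41, 35, 47, 49, 46]
m, n = len(X), len(Y)

# Rank the combined sample from 1 to m + n (no ties in this data)
combined = sorted(X + Y)
rank = {value: position + 1 for position, value in enumerate(combined)}

W = sum(rank[x] for x in X)          # sum of the ranks of the X sample

mean_W = m * (m + n + 1) / 2
sd_W = sqrt(m * n * (m + n + 1) / 12)
z = (W - mean_W) / sd_W

print(f"W = {W}, mean = {mean_W}, z = {z:.4f}")
```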

There is a widespread misconception that distribution-free test methods are free of restrictions. <b>Before applying distribution-free methods, one should always remember the restriction of symmetry for the signed-rank test and of identical shape and size for the rank-sum test</b>.

#### Tests with Categorical Data :

A generalization of the Bernoulli trial is the <b>multinomial trial</b>, which is an experiment that can result in any one of $k$ outcomes, where $k \geq 2$. Let the probabilities of the $k$ outcomes be denoted by $p_1, p_2, p_3, ..., p_k$ and the prespecified values be $p_{01}, p_{02}, ..., p_{0k}$. We need to conduct a test to check whether the probabilities are equal to the prespecified values or not. The null hypothesis will be

$$H_0: p_1 = p_{01}, p_2 = p_{02}, ..., p_k = p_{0k}$$

Suppose we want to test whether a die is fair (i.e. we need to check whether the probability of each outcome is $\frac{1}{6}$). We conduct an experiment in which a die is rolled $600$ times. The outcomes are shown below.

<img src="images/die.PNG"  width="400px">

The results consist of <b>observed</b> and <b>expected</b> values. The observed values are the results of the experiment, and the expected values are calculated from the probabilities for a fair die. We need to test whether the observed and expected values are close to each other. The statistic used for this is called the <b>chi-square statistic</b> and is given as:

$$\chi^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i}$$

The larger the value of the test statistic, the farther the observed values are from the expected values and hence the stronger the evidence against the null hypothesis $H_0$. To determine the <b>p-value</b> for the test, we need to find the distribution of the test statistic. For <b>sufficiently large expected values</b>, the test statistic follows the <b>chi-square distribution</b> with $k-1$ degrees of freedom. The test statistic for the given experiment is $\chi^2 = 6.12$ and its number of degrees of freedom is $5$. For the chi-square distribution with 5 degrees of freedom, the upper 10% point is 9.236, and hence we can conclude that $\text{p-value} > 0.10$. We therefore can not reject the null hypothesis: there is no evidence that the die is unfair. The test described above determines how well a given multinomial distribution fits the data and hence is also called a <b>goodness-of-fit test</b>.
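Instead of bracketing the p-value with a table, it can be computed from the reported statistic. The sketch below assumes SciPy is available and uses only the values stated in the text ($\chi^2 = 6.12$ with $k = 6$ outcomes); it does not recompute the statistic from the observed counts.

```python
from scipy.stats import chi2

chi2_stat, k = 6.12, 6      # reported statistic and number of outcomes
dof = k - 1

p_value = chi2.sf(chi2_stat, dof)        # upper-tail probability
upper_10_point = chi2.ppf(0.90, dof)     # value exceeded with probability 0.10

print(f"upper 10% point = {upper_10_point:.3f}, p-value = {p_value:.3f}")
```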

#### The Chi-Square Test for Homogeneity :

Let us look at an example to understand the meaning of the test of homogeneity. Four machines manufacture cylindrical steel pins. The pins are subjected to a diameter specification. A pin may meet the specification, or it may be too thin or too thick. The results of the experiment are shown below.

<img src="images/hom.PNG"  width="400px">

The above table is called a <b>contingency table</b>. For the mentioned experiment, the <b>null hypothesis</b> is that the proportions of pins that are too thin, OK, or too thick are the same for all the machines. Let $I$ denote the number of rows in the table and $J$ the number of columns. Let $p_{ij}$ denote the probability of the outcome in the cell $(i, j)$. Then the null hypothesis can be given as:

$$H_0: \text{For each column } j,\ p_{1j} = p_{2j} = ... = p_{Ij}$$

Let $O_{ij}$ denote the observed value in the cell $(i, j)$, $O_{i.}$ the sum of the observed values in row $i$, $O_{.j}$ the sum of the observed values in column $j$ and $O_{..}$ the sum of the observed values in all the cells. To calculate the test statistic, we need to find the expected number of observations in each of the cells. This can be calculated as:

$$E_{ij} = \frac{O_{i.}O_{.j}}{O_{..}}$$

The <b>test statistic</b> is given as:

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Under the null hypothesis, the test statistic has a chi-square distribution with $(I-1)(J-1)$ degrees of freedom. The test results for the above experiment are shown below.

In [14]:
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[10, 102, 8], [34, 161, 5], [12, 79, 9], [10, 60, 10]])
chi2, p, dof, ex = chi2_contingency(obs, correction=False)
print("The expected values are: \n" + str(ex))
print("The degree of freedom is: " + str(dof))
print("The test statistic is: " + str(chi2))
print("The p-value is: " + str(p))

The expected values are: 
[[ 15.84  96.48   7.68]
 [ 26.4  160.8   12.8 ]
 [ 13.2   80.4    6.4 ]
 [ 10.56  64.32   5.12]]
The degree of freedom is: 6
The test statistic is: 15.584353328056686
The p-value is: 0.01616760116149423


#### The Chi-Square Test for Independence :

In the above experiment, the row totals (the number of pins manufactured by each machine) were fixed. There may be cases in which both the row and column totals are random. In such cases, we need to conduct a test of independence, and the test procedure is the same as described above.

#### Tests for Variances of Normal Populations :

Let $X_1, X_2, ..., X_n$ be a simple random sample from a normal population given as $N(\mu, \sigma^2)$. The sample variance $s^2$ is given as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})^2$$

Then, when $\sigma^2 = \sigma_0^2$, the test statistic $\frac{(n-1)s^2}{\sigma_0^2}$ follows a <b>chi-square distribution</b> with $n-1$ degrees of freedom. The null hypothesis can take any of the following forms:

$$H_0: \sigma^2 \leq \sigma_0^2;\ \sigma^2 = \sigma_0^2;\ \sigma^2 \geq \sigma_0^2$$

<b>Example:</b> To check the reliability of a scale in a butcher shop, a test weight known to weigh 400 grams was weighed 16 times. For the scale to be considered reliable, the variance of repeated measurements must be less than 1. The sample variance of the 16 measured weights was $s^2$ = 0.81. Assume that the measured weights are independent and follow a normal distribution. Can we conclude that the population variance of the measurements is less than 1?

<b>Sol:</b> The null and alternate hypotheses are given as:

$$H_0: \sigma^2 \geq 1$$

$$H_A: \sigma^2 < 1$$

The test statistic is $\frac{(n-1)s^2}{\sigma_0^2} = \frac{15 \times 0.81}{1^2} = 12.15$

From the table of the chi-square distribution with $n-1 = 15$ degrees of freedom, we find that the lower 10% point is 8.547. As the test statistic is larger than this value, $\text{p-value} > 0.10$ and we can not reject the null hypothesis: <b>we can not conclude that the scale is reliable</b>.
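The butcher-scale test can be checked numerically. The sketch below assumes SciPy is available; because the alternate hypothesis is that the variance is smaller than 1, the p-value is the lower-tail probability of the chi-square distribution.

```python
from scipy.stats import chi2

n, sample_variance, sigma0_sq = 16, 0.81, 1.0

test_stat = (n - 1) * sample_variance / sigma0_sq
p_value = chi2.cdf(test_stat, n - 1)        # lower tail for H_A: sigma^2 < 1
lower_10_point = chi2.ppf(0.10, n - 1)      # value below which 10% of the area lies

print(f"statistic = {test_stat:.2f}, lower 10% point = {lower_10_point:.3f}, "
      f"p-value = {p_value:.3f}")
```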

#### The F Test for Equality of Variance :

Let $X_1, X_2, ..., X_m$ and $Y_1, Y_2, ..., Y_n$ be simple random samples from normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. Let $s_1^2$ and $s_2^2$ be the sample variances which are given as:

$$s_1^2 = \frac{1}{m-1} \sum_{i=1}^{m} (X_i - \overline{X})^2$$

$$s_2^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \overline{Y})^2$$

The null and alternate hypothesis for the equality of variance can be formulated as:

$$H_0: \frac{\sigma_1^2}{\sigma_2^2} \leq 1; \ \frac{\sigma_1^2}{\sigma_2^2} \geq 1; \ \frac{\sigma_1^2}{\sigma_2^2} = 1$$

$$H_A: \frac{\sigma_1^2}{\sigma_2^2} > 1; \ \frac{\sigma_1^2}{\sigma_2^2} < 1; \ \frac{\sigma_1^2}{\sigma_2^2} \neq 1$$

The test statistic is given as:

$$F = \frac{s_1^2}{s_2^2}$$

When $\sigma_1^2 = \sigma_2^2$, the statistic $F$ follows the <b>F distribution</b>. The F distribution has two degrees of freedom: one associated with the numerator ($m-1$) and another with the denominator ($n-1$). Let us look at an example of the test for equality of variance.

<b>Example:</b> In a series of experiments to determine the absorption rate of certain pesticides into skin, measured amounts of two pesticides were applied to several skin specimens. After a time, the amounts absorbed (in μg) were measured. For pesticide A, the variance of the amounts absorbed in 6 specimens was 2.3, while for pesticide B, the variance of the amounts absorbed in 10 specimens was 0.6. Assume that for each pesticide, the amounts absorbed are a simple random sample from a normal population. Can we conclude that the variance in the amount absorbed is greater for
pesticide A than for pesticide B?

<b>Sol:</b> We want to test whether the variance in the amount absorbed is greater for pesticide A than for pesticide B. Hence, the null and alternate hypotheses are formulated as:

$$H_A: \sigma_A^2 > \sigma_B^2$$

$$H_0: \sigma_A^2 \leq \sigma_B^2 \equiv \frac{\sigma_A^2}{\sigma_B^2} \leq 1$$

The test statistic is given as:

$$F = \frac{s_A^2}{s_B^2} = \frac{2.3}{0.6} = 3.83$$

Under the null hypothesis (at the boundary $\sigma_A^2 = \sigma_B^2$), the test statistic follows the F distribution with 5 and 9 degrees of freedom $(F_{5, 9})$. If $H_0$ is true, $s_A^2$ will on average be no larger than $s_B^2$. Hence, the larger the value of $F$, the stronger the evidence against the null hypothesis. From the F distribution table, we find that the upper 5% point is 3.48 and the upper 1% point is 6.06. Hence, we can conclude that $0.01 < p\text{-value} < 0.05$, i.e. we can reject the null hypothesis at the 5% level: there is evidence that the variance in the amount absorbed is greater for pesticide A than for pesticide B.
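The same conclusion can be checked without a table. This is a minimal sketch, assuming SciPy is available. It uses `scipy.stats.f` to obtain the upper-tail p-value and the two critical points for the pesticide example. The variable names are illustrative.

```python
from scipy.stats import f

var_a, n_a = 2.3, 6    # pesticide A: sample variance and sample size
var_b, n_b = 0.6, 10   # pesticide B: sample variance and sample size

# Test statistic and its degrees of freedom (numerator, denominator)
F = var_a / var_b
dfn, dfd = n_a - 1, n_b - 1

# H_A is sigma_A^2 > sigma_B^2, so the p-value is the upper-tail area.
p_value = f.sf(F, dfn, dfd)

# Upper 5% and upper 1% points of the F distribution with (5, 9) degrees of freedom.
upper_5 = f.ppf(0.95, dfn, dfd)
upper_1 = f.ppf(0.99, dfn, dfd)

print("The test statistic is:", F)
print("The upper 5% and 1% points are:", upper_5, upper_1)
print("The p-value is:", p_value)
```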

As the F table contains only large values (greater than 1) of the F statistic, it cannot be used to obtain a p-value for a test in which the F statistic is less than 1. In this case we can flip the test statistic (use $\frac{s_2^2}{s_1^2}$ instead of $\frac{s_1^2}{s_2^2}$). The new test statistic follows the F distribution with $n-1$ and $m-1$ degrees of freedom instead.

For a two-tailed test, both small and large values of the test statistic provide evidence against the null hypothesis. We can use either $\frac{s_1^2}{s_2^2}$ or $\frac{s_2^2}{s_1^2}$ (whichever is greater than 1) as the test statistic. The <b>p-value</b> for the two-tailed test is twice the p-value for the corresponding one-tailed test. The F test is quite sensitive to departures from normality.
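The flipping rule and the doubling of the p-value can be sketched in a few lines. The helper `f_test_two_tailed` below is hypothetical (it is not part of SciPy), and the pesticide variances are reused purely to illustrate a two-tailed calculation.

```python
from scipy.stats import f

def f_test_two_tailed(s1_sq, m, s2_sq, n):
    """Two-tailed F test of H0: sigma_1^2 = sigma_2^2 (illustrative helper).

    s1_sq, s2_sq are sample variances; m, n are the sample sizes.
    Returns the statistic (always >= 1) and the two-tailed p-value.
    """
    if s1_sq >= s2_sq:
        F, dfn, dfd = s1_sq / s2_sq, m - 1, n - 1
    else:
        # Flip the ratio so that it is at least 1; the degrees of freedom swap too.
        F, dfn, dfd = s2_sq / s1_sq, n - 1, m - 1
    p_one_tailed = f.sf(F, dfn, dfd)
    return F, min(1.0, 2 * p_one_tailed)

# The result does not depend on which sample is labelled first.
F1, p1 = f_test_two_tailed(2.3, 6, 0.6, 10)
F2, p2 = f_test_two_tailed(0.6, 10, 2.3, 6)
print(F1, p1)
print(F2, p2)
```

Flipping works because the lower-tail area of $\frac{s_1^2}{s_2^2}$ under $F_{m-1, n-1}$ equals the upper-tail area of $\frac{s_2^2}{s_1^2}$ under $F_{n-1, m-1}$.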

#### Power :

<b>Type-2 error</b> is the error when the null hypothesis $H_0$ is not rejected when it is false. The <b>power</b> of a test is the probability of rejecting $H_0$ when it is false. Hence, it can be given as:

$$\text{Power} = 1 - P(\text{Type 2 Error})$$

<b>Type-1 error</b> is the error of rejecting the null hypothesis when it is true. We can keep the probability of a type-1 error small by choosing a small value for the significance level $\alpha$. For a test to be useful, the probabilities of both type-1 and type-2 errors should be small, i.e. the significance level should be small and the power should be large. To compute the power, we first determine the rejection region. Then we compute the probability that the test statistic falls in the rejection region given that the alternate hypothesis is true. This probability is the power of the test. Let us look at an example.

<b>Example:</b> Find the power of the 5% level test of $H_0 : \mu \leq 80$ versus $H_A : \mu > 80$ for the mean yield of the new process under the alternative $\mu = 82$, assuming $n = 50$ and $\sigma = 5$.

<b>Sol:</b> First of all we need to find the rejection region. Here, $\sigma = 5$, hence $\sigma_{\overline{X}} = \frac{\sigma}{\sqrt{50}} = 0.707$. A 5% level test rejects $H_0$ when $z \geq 1.645$, so in terms of the sample mean the rejection region is $\overline{X} \geq 80+1.645(0.707) = 81.16$. Now, under the assumption that the alternate hypothesis $\mu = 82$ is true, the z-score of the critical point 81.16 is:

$$z = \frac{81.16 - 82}{0.707} = -1.19$$

The area to the right of this is <b>0.8830</b>, which is the power of the test.
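The steps of this calculation can be reproduced with `scipy.stats.norm`, as in the sketch below (assuming SciPy is available; variable names are illustrative). Because the code does not round the intermediate values, the result may differ from the table-based figure in the third decimal place.

```python
from math import sqrt
from scipy.stats import norm

mu0, mu_alt = 80.0, 82.0          # null mean and alternate mean
sigma, n, alpha = 5.0, 50, 0.05   # population sd, sample size, significance level

# Standard deviation of the sample mean
se = sigma / sqrt(n)

# Boundary of the rejection region: reject H0 when the sample mean >= x_crit
x_crit = mu0 + norm.ppf(1 - alpha) * se

# Power: probability that the sample mean falls in the rejection region
# when the true mean is mu_alt
power = norm.sf((x_crit - mu_alt) / se)

print("The rejection region starts at:", x_crit)
print("The power is:", power)
```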

When the alternate mean is far away from the null mean, the power of the test will be almost equal to 1. When the alternate mean is close to the null mean, the power will be almost equal to the significance level of the test. The power of a test also increases with the sample size.
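These three observations can be checked numerically. The function `power_upper_tailed` below is a hypothetical helper written for this illustration. It assumes an upper-tailed z-test with known $\sigma$, and the alternate means and sample sizes other than those in the example are chosen arbitrarily.

```python
from math import sqrt
from scipy.stats import norm

def power_upper_tailed(mu0, mu_alt, sigma, n, alpha=0.05):
    """Power of the level-alpha z-test of H0: mu <= mu0 against H_A: mu > mu0."""
    se = sigma / sqrt(n)
    x_crit = mu0 + norm.ppf(1 - alpha) * se   # start of the rejection region
    return norm.sf((x_crit - mu_alt) / se)    # P(sample mean >= x_crit | mu_alt)

# Alternate mean equal to the null mean: power equals the significance level.
print(power_upper_tailed(80, 80, 5, 50))

# Alternate mean far from the null mean: power is close to 1.
print(power_upper_tailed(80, 85, 5, 50))

# Larger sample size: power increases for the same alternate mean.
print(power_upper_tailed(80, 82, 5, 50), power_upper_tailed(80, 82, 5, 100))
```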