# Frequently Applied Statistical Tests
___

## Equality of Means - One Sample  

__Null Hypothesis :__  
$H_0$: Population mean($\mu$) = Hypothesized mean ($\mu_0$)  
  
__Alternate Hypothesis :__  
$H_1$: $\mu$ > $\mu_0$  (One tailed test)  
$H_2$: $\mu$ $\neq$ $\mu_0$  (Two tailed test)  

__Assumptions :__  
  * The data is a continuous variable.
  * Data comes from a normal distribution.
  * Observations are independent.  
  
If population variance known : Use one sample Z-test  

[<img src="https://miro.medium.com/max/287/1*b7iZyQyP8SJ-W51x_L5ekg.png" width="250"/>](https://miro.medium.com/max/287/1*b7iZyQyP8SJ-W51x_L5ekg.png)

If population variance unknown : Use one sample t-test  

[<img src="https://miro.medium.com/max/1400/1*a63Vr-UYiXs0g8PHV9zqEA.jpeg" width="400"/>](https://miro.medium.com/max/1400/1*a63Vr-UYiXs0g8PHV9zqEA.jpeg)

__Example :__  

Quality control for a machine manufacturing screws. A random sample of 50 screws is taken.     
Given, Expected (hypothesized) average diameter, $\mu_0$ = 1.5 cm  and Sample mean, $\overline{X}$ = 1.51 cm

Assuming Two-tailed test,  
$H_0$: $\mu$ = 1.5 cm    
$H_2$: $\mu$ $\neq$ 1.5 cm  

__Case 1 :__(If population variance known)  

Given, Population Standard deviation($\sigma$) = 0.03 cm and level of significance($\alpha$) = 0.01    
So, Z = $\frac{(1.51 - 1.5)}{\frac{0.03}{\sqrt{50}}}$ = 2.357  
P-value = 0.018 (two-tailed, from the standard normal table)  

As, P-value (0.018) > $\alpha$ (0.01)  
__Conclusion :__ Failed to reject null hypothesis  
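
The Case 1 arithmetic can be reproduced with a short Python sketch. It uses only the standard library, and the variable names are illustrative choices rather than anything defined in these notes.

```python
from math import sqrt
from statistics import NormalDist

# Values from Case 1 of the screw example
sample_mean = 1.51   # cm
mu_0 = 1.5           # hypothesized mean, cm
sigma = 0.03         # known population standard deviation, cm
n = 50
alpha = 0.01

# Z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Two-tailed P-value: probability of a value at least as extreme as |z|
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Z = {z:.3f}, P-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

For a one-tailed alternative, the factor of 2 would be dropped and only the relevant tail used.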

__Case 2 :__(If population variance unknown)  

Given, Sample Standard deviation(S) = 0.025 cm and level of significance($\alpha$) = 0.01  

So, t = $\frac{(1.51 - 1.5)}{\frac{0.025}{\sqrt{50}}}$ = 2.828  
Degrees of freedom(df) = n-1 = 49  
P-value = 0.007 (two-tailed, from the t-distribution table)  

As, P-value < $\alpha$  
__Conclusion :__ Reject Null Hypothesis  
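
As a rough check of Case 2, the sketch below computes the t statistic and approximates the two-tailed P-value by numerically integrating the t density with Simpson's rule. The helper names `t_pdf` and `t_upper_tail` are hypothetical and written only for this illustration; in practice a statistics library (for example SciPy) would normally supply the t distribution.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t].

    steps must be even. The density is symmetric, so P(T > 0) = 0.5.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Values from Case 2 of the screw example
sample_mean, mu_0 = 1.51, 1.5   # cm
s = 0.025                       # sample standard deviation, cm
n = 50
alpha = 0.01

t_stat = (sample_mean - mu_0) / (s / sqrt(n))
df = n - 1
p_value = 2 * t_upper_tail(abs(t_stat), df)   # two-tailed

print(f"t = {t_stat:.3f}, df = {df}, P-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```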

## Equality of Means - Two Independent Samples  

__Null Hypothesis :__  
$H_0$: $\mu_1$ = $\mu_2$  

__Alternate Hypothesis :__  
$H_1$: $\mu_1$ $\neq$ $\mu_2$  

__<u> 3 Scenarios </u> :__  

  * Population variances known : Use two sample Z-test  
  
  [<img src="http://www.stat.yale.edu/Courses/1997-98/101/zstat2.gif" width="250"/>](http://www.stat.yale.edu/Courses/1997-98/101/zstat2.gif)
  * Population variances unknown but equal : Use two sample pooled t-test  
  
  [<img src="http://www.stat.yale.edu/Courses/1997-98/101/tstatp.gif" width="250"/>](http://www.stat.yale.edu/Courses/1997-98/101/tstatp.gif)  
  
  [<img src="http://www.stat.yale.edu/Courses/1997-98/101/tpool.gif" width="250"/>](http://www.stat.yale.edu/Courses/1997-98/101/tpool.gif)  
  * Population variances unknown and not equal : Use two sample t-test with unpooled variances (Welch's t-test)  
  
  [<img src="http://www.stat.yale.edu/Courses/1997-98/101/tstat2.gif" width="250"/>](http://www.stat.yale.edu/Courses/1997-98/101/tstat2.gif)

__Example :__  
Performance of employees (Computer Engineers(sample 1) vs. Other Engineers(sample 2))      

__Null Hypothesis :__  
$H_0$: $\mu_1$ = $\mu_2$  

__Alternate Hypothesis :__  
$H_1$: $\mu_1$ $\neq$ $\mu_2$  

__Case 1 :__(If population variance known)  
Given, Population variance for sample 1 ($\sigma_1^2$) = 3  
Population variance for sample 2 ($\sigma_2^2$) = 2.2  
Sample mean for sample 1 ($\bar{X_1}$) = 6.94  
Sample mean for sample 2 ($\bar{X_2}$) = 5.69  
Sample 1 size ($n_1$) = 15    
Sample 2 size ($n_2$) = 20  
Level of significance($\alpha$) = 0.05

So, Z = $\frac{(6.94 - 5.69)}{\sqrt{\frac{3}{15}+\frac{2.2}{20}}}$ = 2.245  
P-value = 0.025 (Calculated using statistic table)  

As, P-value < $\alpha$  
__Conclusion :__ Reject null hypothesis  
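
A minimal Python sketch of the two sample Z-test for Case 1 follows. The function name `two_sample_z` is a hypothetical helper introduced here, not something from these notes.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean_1, mean_2, var_1, var_2, n_1, n_2):
    """Return (Z, two-tailed P-value) for two independent samples
    whose population variances var_1 and var_2 are known."""
    standard_error = sqrt(var_1 / n_1 + var_2 / n_2)
    z = (mean_1 - mean_2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Values from Case 1 (Computer Engineers vs. Other Engineers)
alpha = 0.05
z, p_value = two_sample_z(6.94, 5.69, var_1=3.0, var_2=2.2, n_1=15, n_2=20)

print(f"Z = {z:.3f}, P-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```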

__Case 2 :__(If population variance unknown but equal)  
Given, Sample variance for sample 1 ($S_1^2$) = 2.95  
Sample variance for sample 2 ($S_2^2$) = 2.16  
Sample mean for sample 1 ($\bar{X_1}$) = 6.94  
Sample mean for sample 2 ($\bar{X_2}$) = 5.69  
Sample 1 size ($n_1$) = 15    
Sample 2 size ($n_2$) = 20  
Level of significance($\alpha$) = 0.05  

Pooled variance($S_p^2$) = $\frac{(14 \times 2.95) + (19 \times 2.16)}{33}$ = $\frac{82.34}{33}$ = 2.495  

Pooled standard deviation($S_p$) = $\sqrt{2.495}$ = 1.58  

t = $\frac{(6.94-5.69)}{1.58 \times \sqrt{(\frac{1}{15}+\frac{1}{20})}}$ = 2.32    

Degrees of freedom(df) = $n_1 + n_2 - 2$ = 33  
P-value $\approx$ 0.03 (two-tailed, from the t-distribution table)  

As, P-value < $\alpha$  
__Conclusion :__ Reject null hypothesis  
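
The pooled calculation can be recomputed from the given inputs with the sketch below. The variable names are illustrative, and the critical value 2.035 is an assumed t-table entry (two-tailed, $\alpha$ = 0.05, df = 33) used in place of an exact P-value.

```python
from math import sqrt

# Values from Case 2 (population variances unknown but assumed equal)
var_1, var_2 = 2.95, 2.16      # sample variances
mean_1, mean_2 = 6.94, 5.69    # sample means
n_1, n_2 = 15, 20

# Pooled variance weights each sample variance by its degrees of freedom
df = n_1 + n_2 - 2
pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / df
pooled_sd = sqrt(pooled_var)

t_stat = (mean_1 - mean_2) / (pooled_sd * sqrt(1 / n_1 + 1 / n_2))

# Assumption: critical value read from a t-table (two-tailed, alpha = 0.05, df = 33)
t_critical = 2.035

print(f"Pooled variance = {pooled_var:.3f}, pooled SD = {pooled_sd:.3f}")
print(f"t = {t_stat:.3f}, df = {df}")
print("Reject H0" if abs(t_stat) > t_critical else "Fail to reject H0")
```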

__Case 3 :__(If population variance unknown and not equal)  
Given, Sample variance for sample 1 ($S_1^2$) = 2.95  
Sample variance for sample 2 ($S_2^2$) = 2.16  
Sample mean for sample 1 ($\bar{X_1}$) = 6.94  
Sample mean for sample 2 ($\bar{X_2}$) = 5.69  
Sample 1 size ($n_1$) = 15    
Sample 2 size ($n_2$) = 20  
Level of significance($\alpha$) = 0.05  

So, t = $\frac{(6.94-5.69)}{\sqrt{\frac{2.95}{15} + \frac{2.16}{20}}}$ = 2.264    
Degrees of freedom(df) $\approx$ 27 (Welch-Satterthwaite approximation)  
P-value $\approx$ 0.03 (two-tailed, from the t-distribution table)  

As, P-value < $\alpha$  
__Conclusion :__ Reject null hypothesis  
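
When the variances are not pooled, the degrees of freedom are usually approximated with the Welch-Satterthwaite formula. The sketch below computes the statistic and that approximation for Case 3; the variable names are illustrative, and the critical value 2.052 is an assumed t-table entry (two-tailed, $\alpha$ = 0.05, df = 27, i.e. the approximate df rounded down).

```python
from math import floor, sqrt

# Values from Case 3 (population variances unknown and not assumed equal)
var_1, var_2 = 2.95, 2.16      # sample variances
mean_1, mean_2 = 6.94, 5.69    # sample means
n_1, n_2 = 15, 20

# Squared standard error contributed by each sample
se_1_sq = var_1 / n_1
se_2_sq = var_2 / n_2

t_stat = (mean_1 - mean_2) / sqrt(se_1_sq + se_2_sq)

# Welch-Satterthwaite approximation to the degrees of freedom
df = (se_1_sq + se_2_sq) ** 2 / (
    se_1_sq ** 2 / (n_1 - 1) + se_2_sq ** 2 / (n_2 - 1)
)

# Assumption: critical value read from a t-table (two-tailed, alpha = 0.05, df = 27)
t_critical = 2.052

print(f"t = {t_stat:.3f}, df = {df:.2f} (rounded down to {floor(df)})")
print("Reject H0" if abs(t_stat) > t_critical else "Fail to reject H0")
```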

## Equality of Means - Two Dependent Samples  

For two dependent samples, we use the paired t-test.  

__Null Hypothesis :__  
$H_0$: $\mu_1$ = $\mu_2$  

__Alternate Hypothesis :__  
$H_1$: $\mu_1$ < $\mu_2$ (one-tailed test)    
$H_2$: $\mu_1$ $\neq$ $\mu_2$ (two-tailed test)  

[<img src="https://infostatconsulting.files.wordpress.com/2017/04/img_20170428_175956-1.jpg" width="250"/>](https://infostatconsulting.files.wordpress.com/2017/04/img_20170428_175956-1.jpg)

Where $d_i$ = $y_i$ - $x_i$ (difference between the two dependent samples at the $i^{th}$ position)    
$\overline{d}$ = $\frac{1}{n}\sum d_i$  
$S_d$ = $\sqrt{\frac{\sum{(d_i - \overline{d})^2}}{n-1}}$

__Example :__  

An FMCG company wants to compare its market share before (pre campaign) and after (post campaign) a marketing campaign.  
Given, Level of significance($\alpha$) = 0.01

| City    | Pre campaign | Post campaign  |
|:-------:|:------------:|:--------------:|
| City A  | 19.02        | 13.53          |
| City B  | 16.56        | 26.85          |
| City C  | 27.61        | 23.86          |
| City D  | 45.24        | 45.69          |
| City E  | 25.87        | 26.92          |
| City F  | 29.77        | 30.86          |
| City G  | 27.70        | 29.76          |
| City H  | 24.82        | 23.85          |
| City I  | 22.32        | 27.20          |
| City J  | 25.51        | 35.99          |
| City K  | 26.59        | 36.37          |
| City L  | 37.45        | 37.21          |
| City M  | 19.74        | 25.23          |
| City N  | 37.06        | 40.12          |
| City O  | 23.89        | 25.72          |

__Solution :__  

__Null Hypothesis :__  
$H_0$: $\mu_1$ = $\mu_2$    

__Alternate Hypothesis :__  
$H_1$: $\mu_1$ < $\mu_2$ (one-tailed test)      

Here, the same cities are measured twice (before and after the campaign), so the two samples are dependent and we use the paired t-test.  

| City    | Pre campaign($x_i$) | Post campaign($y_i$)  | $d_i$ = $y_i$ - $x_i$ |
|:-------:|:---------------------------:|:-----------------------------:|:-------------------:|
| City A  | 19.02        | 13.53          | -5.49 |
| City B  | 16.56        | 26.85          | 10.29 |
| City C  | 27.61        | 23.86          | -3.75 |
| City D  | 45.24        | 45.69          | 0.45  |
| City E  | 25.87        | 26.92          | 1.05  |
| City F  | 29.77        | 30.86          | 1.09  |
| City G  | 27.70        | 29.76          | 2.06  |
| City H  | 24.82        | 23.85          | -0.97 |
| City I  | 22.32        | 27.20          | 4.88  |
| City J  | 25.51        | 35.99          | 10.48 |
| City K  | 26.59        | 36.37          | 9.78  |
| City L  | 37.45        | 37.21          | -0.24 |
| City M  | 19.74        | 25.23          | 5.49  |
| City N  | 37.06        | 40.12          | 3.06  |
| City O  | 23.89        | 25.72          | 1.83  |

$\overline{d}$ = $\frac{1}{n}\sum d_i$ = $\frac{40.01}{15}$ = 2.67     

$S_d$ = $\sqrt{\frac{\sum{(d_i - \overline{d})^2}}{n-1}}$ = 4.81  

Sample size(n) = 15  

So, t = $\frac{2.67}{\frac{4.81}{\sqrt{15}}}$ = 2.15    

degree of freedom(df) = n-1 = 15 - 1 = 14  
P-value = 0.025 (Calculated using Statistic table)  

As, P-value > $\alpha$  

__Conclusion :__ Failed to reject null hypothesis  
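
The paired calculation can be reproduced from the table with the sketch below, using only the Python standard library. The list names are illustrative, and the critical value 2.624 is an assumed t-table entry (one-tailed, $\alpha$ = 0.01, df = 14) used in place of an exact P-value.

```python
from math import sqrt
from statistics import mean, stdev

# Market share per city (City A .. City O), taken from the table above
pre = [19.02, 16.56, 27.61, 45.24, 25.87, 29.77, 27.70, 24.82,
       22.32, 25.51, 26.59, 37.45, 19.74, 37.06, 23.89]
post = [13.53, 26.85, 23.86, 45.69, 26.92, 30.86, 29.76, 23.85,
        27.20, 35.99, 36.37, 37.21, 25.23, 40.12, 25.72]

# Paired differences d_i = y_i - x_i (post minus pre)
d = [y - x for x, y in zip(pre, post)]
n = len(d)

d_bar = mean(d)    # mean of the differences
s_d = stdev(d)     # sample standard deviation (n - 1 in the denominator)
t_stat = d_bar / (s_d / sqrt(n))
df = n - 1

# Assumption: critical value read from a t-table (one-tailed, alpha = 0.01, df = 14)
t_critical = 2.624

print(f"d_bar = {d_bar:.2f}, S_d = {s_d:.2f}, t = {t_stat:.2f}, df = {df}")
# H1 is mu_1 < mu_2, i.e. a positive mean difference, so the upper tail is used
print("Reject H0" if t_stat > t_critical else "Fail to reject H0")
```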

## Equality of Variances - Two Independent Samples  

To compare the variances of two independent samples, we use the F-test.  

[<img src="https://images.slideplayer.com/23/6766509/slides/slide_8.jpg" width="350"/>](https://images.slideplayer.com/23/6766509/slides/slide_8.jpg)

__Null Hypothesis :__  
$H_0$: $\sigma_1^2$ = $\sigma_2^2$    

__Alternate Hypothesis :__  
$H_1$: $\sigma_1^2$ $\neq$ $\sigma_2^2$ (two-tailed test)   

__Assumption :__  
  * The data is a continuous variable
  * Data comes from a normal distribution  
  * Observations are independent  
  
__Example :__  

Performance of employees (Computer Engineers(sample 1) vs. Other Engineers(Sample 2))    
Given, Sample variance for sample 1($S_1^2$) = 2.95 (the larger of the two)   
Sample variance for sample 2($S_2^2$) = 2.16  
Sample size for sample 1($n_1$) = 15  
Sample size for sample 2($n_2$) = 20  
Level of significance($\alpha$) = 0.05  

__Null Hypothesis :__  
$H_0$: $\sigma_1^2$ = $\sigma_2^2$    

__Alternate Hypothesis :__  
$H_1$: $\sigma_1^2$ $\neq$ $\sigma_2^2$ (two-tailed test)   

So, F = $\frac{2.95}{2.16}$ = 1.37  
Degrees of freedom(df) = $n_1-1$ = 14 (numerator) and $n_2-1$ = 19 (denominator)  
Upper-tail P-value = 0.26 (from the F-distribution table), so the two-tailed P-value $\approx$ 2 $\times$ 0.26 = 0.52  

As, P-value > $\alpha$  

__Conclusion :__ Failed to reject null hypothesis  
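
The F statistic and an approximate P-value can be checked with the sketch below. It integrates the F density numerically with Simpson's rule so that only the standard library is needed; `f_pdf` and `f_upper_tail` are hypothetical helpers written for this illustration, and a statistics library would normally be used instead. Doubling the upper-tail probability is one common convention for the two-tailed P-value when the larger variance is placed in the numerator.

```python
from math import gamma

def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    beta = gamma(d1 / 2) * gamma(d2 / 2) / gamma((d1 + d2) / 2)
    return ((d1 / d2) ** (d1 / 2) * x ** (d1 / 2 - 1)
            * (1 + d1 * x / d2) ** (-(d1 + d2) / 2)) / beta

def f_upper_tail(f, d1, d2, steps=2000):
    """Approximate P(F > f) using Simpson's rule on [0, f].

    steps must be even; assumes d1 >= 2 so the density is finite at 0.
    """
    h = f / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1 - total * h / 3

# Values from the example (larger sample variance in the numerator)
var_1, var_2 = 2.95, 2.16
n_1, n_2 = 15, 20
alpha = 0.05

f_stat = var_1 / var_2
d1, d2 = n_1 - 1, n_2 - 1

p_upper = f_upper_tail(f_stat, d1, d2)
p_two_tailed = min(1.0, 2 * p_upper)

print(f"F = {f_stat:.2f}, df = ({d1}, {d2})")
print(f"Upper-tail P = {p_upper:.2f}, two-tailed P = {p_two_tailed:.2f}")
print("Reject H0" if p_two_tailed < alpha else "Fail to reject H0")
```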

## Chi-Square Test of Independence 

__Chi-Square__ :  
  * Non-parametric test of statistical significance  
  * Different forms of the test are available  
  * This section covers the chi-square test of independence


__Example__ :  

Survey of a recently released movie(Gender vs. Rating) shown in the below table :  

| Rating       | Males | Females | Total |
|:------------:|:-----:|:-------:|:-----:|
| Excellent    | 34    | 32      | 66    |
| Very Good    | 23    | 25      | 48    |
| Good         | 21    | 23      | 44    |
| Not so good  | 17    | 17      | 34    |
| Bad/Very Bad | 10    | 8       | 18    |
| Total        | 105   | 105     | 210   |

We want to know whether there is an association between gender and rating.    
Level of significance($\alpha$) = 0.05  

__Solution :__  

__Null Hypothesis :__  
$H_0$: There is no association between gender and rating  

__Alternate Hypothesis :__  
$H_1$: There is an association between gender and rating  

__Step 1:__  
Observed frequency(O)  

| Rating       | Males | Females | Total |
|:------------:|:-----:|:-------:|:-----:|
| Excellent    | 34    | 32      | 66    |
| Very Good    | 23    | 25      | 48    |
| Good         | 21    | 23      | 44    |
| Not so good  | 17    | 17      | 34    |
| Bad/Very Bad | 10    | 8       | 18    |
| Total        | 105   | 105     | 210   |

__Step 2:__  
Expected frequency(E) : $\frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$  

| Rating       | Males | Females | Total |
|:------------:|:-----------:|:-------:|:-----:|
| Excellent    | $\frac{(66 \times 105)}{210}$ = 33 | 33      | 66    |
| Very Good    | 24    | 24      | 48    |
| Good         | 22    | 22      | 44    |
| Not so good  | 17    | 17      | 34    |
| Bad/Very Bad | 9     | 9       | 18    |
| Total        | 105   | 105     | 210   |

__Step 3:__  
Test-Statistic = $\sum \sum \frac{(O-E)^2}{E}$    

where, O = observed frequency  
E = Expected frequency  

degrees of freedom(df) = (rows - 1) $\times$ (column - 1)  

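So, test-statistic = 0.457 (Calculated using above formula)  
df = $(5-1) \times (2-1)$ = 4  
P-value = 0.98 (upper-tail probability of the chi-square distribution with 4 df)  
$\alpha$ = 0.05  

As, P > $\alpha$  
__Conclusion :__ Failed to reject null hypothesis, i.e. the data show no evidence of an association between gender and rating  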
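
The three steps can be sketched in Python as follows. The names are illustrative, and `chi2_upper_tail` is a hypothetical helper that relies on the closed-form upper-tail probability available when df is even (as it is here, df = 4); a statistics library would normally be used for the general case.

```python
from math import exp, factorial

# Step 1: observed frequencies, rows = ratings, columns = (Males, Females)
observed = [
    [34, 32],   # Excellent
    [23, 25],   # Very Good
    [21, 23],   # Good
    [17, 17],   # Not so good
    [10, 8],    # Bad/Very Bad
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Step 2: expected frequency = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Step 3: test statistic = sum over all cells of (O - E)^2 / E
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(col_totals) - 1)

def chi2_upper_tail(x, df):
    """P(X > x) for a chi-square variable; closed form valid for even df only."""
    if df % 2:
        raise ValueError("this closed form needs an even df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))

alpha = 0.05
p_value = chi2_upper_tail(chi_sq, df)

print(f"Chi-square = {chi_sq:.3f}, df = {df}, P-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The P-value is the upper-tail area, because large values of the statistic are the ones that indicate a departure from independence.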

## Parametric test vs. Non-Parametric test  

|     | Parametric test | Non-Parametric test  |
|:-------:|:------------:|:--------------:|
| Data type  | Interval/Ratio scale data     | Nominal or ordinal data    |
| Distributional assumptions  | Required        | Not required          |
| Sample size  | Specific guidelines to be satisfied        | Can work with small samples          |
| Power of discrimination  | High        | Comparatively lower          |
| Outliers  | Significantly affected        | Less affected          |
| Example  | Z-test, t-test        | Chi-squared test         |
