<table align="left" width=100%>
    <tr>
        <td width="20%">
            <img src="GL-2.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> Faculty Notebook <br> ( Day 1) </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Table of Contents

1. **[Import Libraries](#lib)**
2. **[Large Sample Test](#z)**
    - 2.1 - **[Two Sample Z Test](#2z)**
3. **[Small Sample Test](#t)**
    - 3.1 - **[Two Sample t Test (Unpaired)](#2t)**
    - 3.2 - **[Paired t Test](#paired)**

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import statsmodels
import statsmodels.api as sm

# import 'stats' package from scipy library
from scipy import stats

# import statistics to perform statistical computations
import statistics

# to test the normality 
from scipy.stats import shapiro

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

<a id="z"></a>
# 2. Large Sample Test

If the sample size is sufficiently large (usually, n > 30) then we use the `Z-test`. If population standard deviation ($\sigma$) is unknown, then the sample standard deviation (s) is used to calculate the test statistic.

<a id="2z"></a>
## 2.1 Two Sample Z Test

Let us perform a two sample Z test for the population mean. We compare the means of two independent populations. The samples are assumed to be drawn from normally distributed populations. Also, the two populations are assumed to have equal variances.

The `Shapiro-Wilk Test` is used to check the normality of the data. The assumption of equal variances of the populations is tested using the `Levene's Test`. 
The hypothesis of the Levene's test is given as:
<p style='text-indent:25em'> <strong> H<sub>0</sub>:  The variances are equal</strong> </p>
<p style='text-indent:25em'> <strong> H<sub>1</sub>:  The variances are not equal </strong> </p>

The `levene()` from scipy library performs a Levene's test. 

The null and alternative hypothesis of two sample Z-test is given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu_{1} - \mu_{2} = \mu_{0}$ or $\mu_{1} - \mu_{2} \geq \mu_{0}$ or $\mu_{1} -\mu_{2} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{1} - \mu_{2} \neq \mu_{0} $ or $\mu_{1} - \mu_{2} < \mu_{0}$ or $\mu_{1} -\mu_{2} > \mu_{0}$</strong></p>

Consider two normally distributed independent populations. Let us take a sample of size ($n_{1}$) from the first population with standard deviation ($\sigma_1$) and sample of size ($n_{2}$) from a second population with standard deviation ($\sigma_2$) such that $n_{1}, n_{2} > 30$. 

The test statistic for two sample Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{(\overline{X_{1}} - \overline{X_{2}})  - \mu_{0}} {\sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}}$</strong></p>

Where, <br>
$\overline{X_{1}}$, $\overline{X_{2}}$ : Mean of both the samples<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$\sigma_{1}, \sigma_{2}$: Standard deviation of both the populations<br>
$n_{1}, n_{2}$: Size of samples from both the populations

Under $H_{0}$ the test statistic follows a standard normal distribution.

If standard deviations of populations are unknown, use the standard deviations of samples ($s_{1}, s_{2}$) instead of $\sigma_{1}$ and $\sigma_{2}$ to calculate the test statistic.
i.e. <p style='text-indent:25em'> <strong> $Z = \frac{(\overline{X_{1}} - \overline{X_{2}})  - \mu_{0}} {\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}}$</strong></p>
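The formula above can be sketched in a few lines of standard-library Python. The helper name `two_sample_z` and the summary statistics are hypothetical, chosen only so that the arithmetic is easy to follow; this is an illustration, not a replacement for a library routine.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(xbar1, xbar2, s1, s2, n1, n2, mu0=0.0):
    """Z statistic for H0: mu1 - mu2 = mu0, using sample standard deviations."""
    standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    return ((xbar1 - xbar2) - mu0) / standard_error

# hypothetical summary statistics (not taken from any dataset in this notebook)
z = two_sample_z(xbar1=52, xbar2=50, s1=6, s2=8, n1=100, n2=100)

# two-tailed p-value: twice the area beyond |z| under the standard normal curve
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))

print(z, p_two_tailed)
```

Here the standard error is $\sqrt{0.36 + 0.64} = 1$, so the statistic is simply the difference in sample means.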


### Example:

#### 1. The training institute <i>Nature Learning</i> claims that the students trained in their institute have overall better performance than the students trained in their competitor institute <i>Speak Global Learning</i>. We have a sample data of 500 students from each institute along with their total score collected from independent normal populations. Frame a hypothesis and test the Nature Learning's claim with 99% confidence.

Consider the total score for students given in the CSV file `StudentsPerformance.csv`. 

In [3]:
# read the students performance data 
df_student = pd.read_csv('StudentsPerformance.csv')

# display the first observation
df_student.head(1)

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning


In [4]:
# Sample-1: Total Score of Nature Learning
nl = df_student.loc[df_student["training institute"]=="Nature Learning",
                   "total score"]
# Sample-2: Total Score of Speak Global Learning
sgl = df_student.loc[df_student["training institute"]=="Speak Global Learning",
                   "total score"]

Let us check the normality of total score for the students trained from both the institutes.

* **Ho: Data is Normal**
* **Ha: Data is Not Normal**

In [5]:
# shapiro test 
stats.shapiro(nl),stats.shapiro(sgl)

# both p-values are greater than 0.05, so we fail to reject Ho: the data can be treated as normal

(ShapiroResult(statistic=0.997671365737915, pvalue=0.7214037179946899),
 ShapiroResult(statistic=0.9957306981086731, pvalue=0.19211649894714355))

Let us check the equality of variances.

* **Ho: That the Data has Equal Variance**
* **Ha: Ho is False**

In [6]:
# levene test
stats.levene(nl,sgl) # levene() accepts two or more samples

# p-value > 0.05: fail to reject Ho, so the equal-variance assumption holds

LeveneResult(statistic=0.6422721347822817, pvalue=0.42307998325221574)

The null and alternative hypothesis is:

H<sub>0</sub>: $\mu_{1} - \mu_{2} \leq 0$ <br>
H<sub>1</sub>: $\mu_{1} - \mu_{2} > 0$ 

Here ⍺ = 0.01, for a one-tailed test calculate the critical z-value.

In [8]:
stats.norm.isf(0.01)

2.3263478740408408

i.e. if z is greater than 2.33 then we reject the null hypothesis.

In [9]:
# calculate test statistics
xbar1=np.mean(nl)
xbar2=np.mean(sgl)

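# np.std uses ddof=0 by default; with n = 500 the difference from the sample formula (ddof=1) is negligible
sigma1=np.std(nl)
sigma2=np.std(sgl)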

n1=len(nl)
n2=len(sgl)

In [10]:
# Test Stats
num=(xbar1-xbar2)
deno=(sigma1**2/n1) + (sigma2**2/n2)
test_stats=num/np.sqrt(deno)
test_stats

0.15140659491350816

In [11]:
# p value
1-stats.norm.cdf(test_stats)

# p-value (0.44) > 0.01: fail to reject Ho; no evidence that Nature Learning students score higher

0.4398274937243183

In [12]:
# 99% confidence interval for the difference in mean scores
stats.norm.interval(0.99,loc=num,scale=np.sqrt(deno))

(-3.5548110808810587, 3.9988110808810187)

The test statistic (0.15) is below the critical value (2.33), so it lies outside the rejection region, and the p-value (0.44) is greater than 0.01. Thus, we fail to reject $H_{0}$: the data do not support Nature Learning's claim.

#### 2. A study was carried out to understand the amount of haemoglobin in blood for males and females. Random samples of 160 males and 180 females have means of 13 g/dl and 15 g/dl respectively. The two samples have standard deviations of 4.1 g/dl for male donors and 3.5 g/dl for female donors. Can it be said that the population mean haemoglobin concentrations are the same for men and women? Use α = 0.01.

The null and alternative hypothesis is:

H<sub>0</sub>: $\mu_{1} - \mu_{2} = 0$<br>
H<sub>1</sub>: $\mu_{1} - \mu_{2} \neq 0$

Here ⍺ = 0.01, for a two-tailed test calculate the critical z-value.

In [13]:
# upper critical value for a two-tailed test: area alpha/2 = 0.005 in each tail
stats.norm.ppf(0.995)

2.5758293035489004

i.e. if z is less than -2.58 or z is greater than 2.58 then we reject the null hypothesis.
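As a quick cross-check of the critical values quoted in this section, the sketch below recomputes them with the standard library's `statistics.NormalDist` instead of scipy. The variable names are illustrative only.

```python
from statistics import NormalDist

alpha = 0.01
std_normal = NormalDist()  # mean 0, standard deviation 1

# right-tailed test: reject H0 when z > z_right
z_right = std_normal.inv_cdf(1 - alpha)

# two-tailed test: reject H0 when |z| > z_two
z_two = std_normal.inv_cdf(1 - alpha / 2)

print(round(z_right, 2), round(z_two, 2))  # prints 2.33 2.58
```

A one-tailed test puts the whole of α in one tail, while a two-tailed test splits it into α/2 per tail, which is why the two-tailed critical value is larger.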

In [14]:
xbar1=13
xbar2=15
sigma1=4.1
sigma2=3.5
n1=160
n2=180

In [15]:
num=(xbar1-xbar2)
deno=(sigma1**2/n1) + (sigma2**2/n2)
test_stats=num/np.sqrt(deno)
test_stats

-4.806830552525058

In [16]:
# p-value for a two-tailed test: double the tail area beyond |z|
p_value = 2*stats.norm.sf(abs(test_stats))
print(f"{p_value:.3e}")

1.533e-06
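A two-tailed p-value doubles the tail area beyond $|z|$, so the sign of the statistic has to be removed first; otherwise a negative statistic yields a value above 1, which cannot be a probability. A minimal standard-library sketch (the helper name `two_tailed_p` is hypothetical):

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic."""
    # doubling the upper-tail area beyond |z| works for either sign of z
    return 2 * (1 - NormalDist().cdf(abs(z)))

z = -4.806830552525058  # test statistic computed above for the haemoglobin data
p = two_tailed_p(z)

print(p < 0.01)  # prints True
```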

In [21]:
stats.norm.interval(0.99,loc=num,scale=np.sqrt(deno))

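(-3.0717370938718864, -0.9282629061281136)

The test statistic (-4.81) is less than -2.58 and the 99% confidence interval does not contain 0. Thus, we reject $H_{0}$ and conclude that the mean haemoglobin levels of males and females differ.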

<a id="t"></a>
# 3. Small Sample Test

If the sample size is small (usually, n < 30) then we use the `t-test`. These tests are also known as `exact tests`.

<a id="2t"></a>
## 3.1 Two Sample t Test (Unpaired)

The two sample t-test is used to compare the means of two independent populations. This test assumes that the populations are normally distributed from which the samples are taken.

The null and alternative hypothesis is given as:
<p style='text-indent:25em'> <strong> $H_{0}: \mu_{1} - \mu_{2} = \mu_{0}$ or $\mu_{1} - \mu_{2} \geq \mu_{0}$ or $\mu_{1} -\mu_{2} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{1} - \mu_{2} \neq \mu_{0} $ or $\mu_{1} - \mu_{2} < \mu_{0}$ or $\mu_{1} -\mu_{2} > \mu_{0}$</strong></p>

Let us take a sample of size ($n_{1}$) from the first population and a sample of size ($n_{2}$) from a second independent population. If both $n_{1}$ and $n_{2}$ are less than 30 and the population standard deviations are unknown, we use the two-sample t-test.

Assume that both populations have equal variances. The test statistic for the two sample t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{(\overline{X_{1}} - \overline{X_{2}}) - \mu_{0}} {s \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}$</strong></p>

Where, <br>
$\overline{X_{1}}$, $\overline{X_{2}}$: Mean of both the samples<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$s$: Pooled standard deviation<br>
$n_{1}, n_{2}$: Size of samples from both the populations

The pooled standard deviation is defined as:
$s = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}}$ $\hspace{2cm}$  Where, $s_{1}, s_{2}$: Standard deviation of both the samples

Under $H_{0}$, the test statistic follows a t-distribution with $(n_{1}+n_{2}-2)$ degrees of freedom.
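The pooled statistic can be computed by hand from summary statistics and cross-checked against `scipy.stats.ttest_ind_from_stats`. The numbers below are hypothetical and serve only as an illustration of the formulas above.

```python
import numpy as np
from scipy import stats

# hypothetical summary statistics for two small independent samples
n1, mean1, std1 = 12, 25.0, 4.0
n2, mean2, std2 = 15, 22.0, 5.0

# pooled standard deviation and degrees of freedom
dof = n1 + n2 - 2
s_pooled = np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / dof)

# test statistic for H0: mu1 - mu2 = 0
t_manual = (mean1 - mean2) / (s_pooled * np.sqrt(1 / n1 + 1 / n2))

# two-tailed p-value from the t-distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# cross-check against scipy's summary-statistics version of the pooled t-test
t_scipy, p_scipy = stats.ttest_ind_from_stats(mean1, std1, n1, mean2, std2, n2, equal_var=True)

print(t_manual, p_manual)
```

The standard deviations passed to `ttest_ind_from_stats` are sample standard deviations (computed with `ddof=1`), matching $s_{1}$ and $s_{2}$ in the formula.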

If the population variances are equal and also the sample size is the same for both the samples then the test statistic is given as:
<p style='text-indent:25em'> <strong> $t = \frac{(\overline{X_{1}} - \overline{X_{2}}) - \mu_{0}} {s \sqrt{\frac{2}{n}}}$</strong></p>

Where the pooled standard deviation $s = \sqrt{\frac{s_{1}^{2} + s_{2}^{2}}{2}}$

Under $H_{0}$, the test statistic follows a t-distribution with $(2n-2)$ degrees of freedom.
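The equal-sample-size formulas follow from the general ones by substituting $n_{1} = n_{2} = n$:

$$s^{2} = \frac{(n - 1)s_{1}^{2} + (n - 1)s_{2}^{2}}{n + n - 2} = \frac{(n - 1)(s_{1}^{2} + s_{2}^{2})}{2(n - 1)} = \frac{s_{1}^{2} + s_{2}^{2}}{2}, \qquad \sqrt{\frac{1}{n} + \frac{1}{n}} = \sqrt{\frac{2}{n}}$$

and the degrees of freedom become $n + n - 2 = 2n - 2$.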

If the population variances are not equal, Welch's t-test is used instead; it does not pool the variances and remains valid for unequal sample sizes.
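For comparison, here is a rough sketch of the unpooled (Welch) statistic with the Welch-Satterthwaite approximation for the degrees of freedom, checked against `scipy.stats.ttest_ind_from_stats` with `equal_var=False`. The summary statistics are hypothetical.

```python
import numpy as np
from scipy import stats

# hypothetical summary statistics with clearly different spreads
n1, mean1, std1 = 10, 30.0, 2.0
n2, mean2, std2 = 14, 27.0, 6.0

# per-sample variance of the mean
v1, v2 = std1**2 / n1, std2**2 / n2

# Welch statistic: the variances are not pooled
t_welch = (mean1 - mean2) / np.sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom (generally not an integer)
dof_welch = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# two-tailed p-value
p_welch = 2 * stats.t.sf(abs(t_welch), dof_welch)

# cross-check against scipy
t_scipy, p_scipy = stats.ttest_ind_from_stats(mean1, std1, n1, mean2, std2, n2, equal_var=False)

print(t_welch, dof_welch, p_welch)
```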

### Example: 

#### 1. The teachers' association claims that the total score of the students who completed the test preparation course is different than the total score of the students who have not completed the course. The sample data consists of 15 students who completed the course and 18 students who have not completed the course. Test the association's claim with ⍺ = 0.05.

The total scores of the students who have/have not completed the preparation course are given in the CSV file `totalmarks_2ttest.csv`.

In [22]:
# read the students performance data 
test = pd.read_csv('totalmarks_2ttest.csv')

# display the first two observations
test.head(2)

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,male,group E,standard,completed,84,83,78,245,Speak Global Learning
1,male,group C,free/reduced,completed,79,77,75,231,Speak Global Learning


In [24]:
# extract the samples

completed=test.loc[test['test preparation course']=='completed','total score']
none=test.loc[test['test preparation course']=='none','total score']

Let us check the normality of the total marks of students who have/ have not completed the test preparation course.

In [25]:
# shapiro test

stats.shapiro(completed),stats.shapiro(none)

(ShapiroResult(statistic=0.9055536389350891, pvalue=0.11574102193117142),
 ShapiroResult(statistic=0.9481862187385559, pvalue=0.3972780704498291))

Let us check the equality of variances.

In [27]:
# levene test

# Ho: equality of variance exist
# Ha: Ho is false


stats.levene(completed,none)

LeveneResult(statistic=0.045113770764648356, pvalue=0.8331854285659768)


The null and alternative hypothesis is:

H<sub>0</sub>: $\mu_{1} - \mu_{2} = 0$<br>
H<sub>1</sub>: $\mu_{1} - \mu_{2} \neq 0$

* **Ho: The test preparation course has no effect on the total score**

* **Ha: The test preparation course has an effect on the total score**

Here ⍺ = 0.05 and degrees of freedom = 31, for a two-tailed test let us calculate the critical t-value.

In [33]:
# lower critical t-value for the two-tailed test; by symmetry the upper critical value is +2.04
stats.t.isf(0.975,len(none)+len(completed)-2)

-2.0395134463964077

In [35]:
# use 'ttest_ind()' to calculate the test statistic and corresponding p-value for 2 sample test
# Welch Test: stats.ttest_ind(completed, none, equal_var=False)

stats.ttest_ind(completed,none)

# p-value (0.16) > 0.05: fail to reject Ho; no significant difference in total score between the two groups

Ttest_indResult(statistic=1.4385323319823262, pvalue=0.16030339806989594)

In [20]:
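# calculate pooled standard deviation from the two samples (pandas .std() uses ddof=1)
n_1, n_2 = len(completed), len(none)
samp_std_1, samp_std_2 = completed.std(), none.std()
s = np.sqrt((((n_1-1)*samp_std_1**2) + ((n_2-1)*samp_std_2**2)) / (n_1 + n_2 - 2))

# calculate the 95% confidence interval for the difference in population means
stats.t.interval(0.95, n_1+n_2-2, loc=completed.mean()-none.mean(), scale=s*np.sqrt(1/n_1 + 1/n_2))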

#### 2. An experiment was conducted to compare the pain relieving hours of two new medicines. Two groups of 14 and 15 patients were selected and were given comparable doses. Group 1 was given medicine 1 and group 2 was given medicine 2. The following data is obtained from the two samples. Test whether the two populations give the same mean hours of relief. Assume the data come from normal distributions with equal variances. Use α = 0.01.

<img src='unpaired_2t.PNG'>

In [None]:
# size of first sample
n1=14

# mean hours for medicine 1  

# standard deviation of hours for medicine 1

# size of second sample

# mean hours for medicine 2 

# standard deviation of hours for medicine 2

# degrees of freedom for 2 sample t-test


The null and alternative hypothesis is:

H<sub>0</sub>: $\mu_{1} - \mu_{2} = 0$<br>
H<sub>1</sub>: $\mu_{1} - \mu_{2} \neq 0$

Here ⍺ = 0.01 and degrees of freedom = 27, for a two-tailed test let us calculate the critical t-value.

In [None]:
# calculate the t-value for 99% of confidence level



In [None]:
# calculate pooled standard deviation

# calculate the test statistic 

# print the test statistic value 


In [None]:
# calculate the corresponding p-value for the test statistic
# use 'cdf()' to calculate P(t <= t_stat)
# pass the degrees of freedom to the parameter, 'df'

# for a two-tailed test multiply the p-value by 2


In [None]:
# calculate the 99% confidence interval for the population mean


We can see that the test statistic value is greater than -2.77, the p-value is greater than 0.01, and the confidence interval contains the value in the null hypothesis (i.e. 0). Thus, we fail to reject the null hypothesis: there is no significant evidence that the two medicines differ in mean hours of relief.

<a id="paired"></a>
## 3.2 Paired t Test

A paired t-test is used to compare the mean of the population for two dependent samples. The dependent samples can be the scores before and after a specific treatment. 

Let $X_{i}$ be the sample before the treatment and $Y_{i}$ be the sample after the treatment. Let $\mu_{X}$, $\mu_{Y}$ be the mean of the data X and Y respectively. The mean difference $\mu_{d} = \mu_{Y} - \mu_{X}$.

The null and alternative hypothesis is given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu_{d} = \mu_{0}$ or $\mu_{d} \geq \mu_{0}$ or $\mu_{d} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{d} \neq \mu_{0}$ or $\mu_{d} < \mu_{0}$ or $\mu_{d} > \mu_{0}$</strong></p>

The test statistic for paired t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{\overline{X_{D}} - \mu_{0}} {\frac{s_{D}}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X_{D}}$: Mean difference between the pairs<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$s_{D}$: Standard deviation of differences between the pairs<br>
$n$: Sample size

Under $H_{0}$, the test statistic follows a t-distribution with (n-1) degrees of freedom.
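The paired statistic is just a one-sample t statistic on the differences. The sketch below computes it by hand and compares it with `scipy.stats.ttest_rel`; the before/after scores are hypothetical and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# hypothetical before/after scores for the same six subjects
before = np.array([61, 58, 70, 65, 59, 72])
after = np.array([66, 57, 75, 69, 64, 74])

# differences d = after - before, matching mu_d = mu_Y - mu_X
d = after - before
n = len(d)

# t statistic for H0: mu_d = 0, using the sample standard deviation (ddof=1)
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# two-tailed p-value with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# ttest_rel(a, b) tests the mean of a - b, so 'after' goes first
res = stats.ttest_rel(after, before)

print(t_manual, p_manual)
```

Swapping the argument order in `ttest_rel` flips the sign of the statistic but leaves the two-sided p-value unchanged.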

### Example:

#### 1. A training institute wants to check if their writing training program was effective or not. 17 students are selected to check the hypothesis. Consider 0.05 as the level of significance.

The writing scores before and after training are provided in the CSV file `WritingScores.csv`. 

In [37]:
# read the file containing writing scores  
df_score = pd.read_csv('WritingScores.csv')

# display the first two observations
df_score.head(2)

Unnamed: 0,score_before,score_after
0,59,50
1,62,67


Let us check the normality of the scores before and after the training.

In [38]:
# shapiro test
stats.shapiro(df_score['score_before']),stats.shapiro(df_score['score_after'])

(ShapiroResult(statistic=0.947382390499115, pvalue=0.41645893454551697),
 ShapiroResult(statistic=0.9686525464057922, pvalue=0.7944169044494629))

Let us check the equality of variances of the scores before and after the training.

In [39]:
# levene test
stats.levene(df_score['score_before'],df_score['score_after'])

LeveneResult(statistic=0.4612497504491918, pvalue=0.5019236019309768)

From the above results, we can see that the p-values of the Shapiro-Wilk tests and the Levene's test are all greater than 0.05. Thus, the scores before and after training can be treated as normally distributed with equal variances.

The null and alternative hypothesis is:

H<sub>0</sub>: The training was not effective ($\mu_{d} = 0$)<br>
H<sub>1</sub>: The training was effective ($\mu_{d} \neq 0$)

Here ⍺ = 0.05 and degrees of freedom = 16, for a two-tailed test let us calculate the critical t-value.

In [41]:
# calculate the upper critical t-value for 95% confidence level (two-tailed, so alpha/2 in each tail)
stats.t.isf(0.05/2, df=len(df_score)-1)


In [40]:
# use 'ttest_rel()' to calculate the t-statistic and corresponding p-value for paired samples

stats.ttest_rel(df_score['score_before'],df_score['score_after'])

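TtestResult(statistic=-1.4394882729049499, pvalue=0.16929012896279846, df=16)

The p-value (0.169) is greater than 0.05, so we fail to reject $H_{0}$: there is no significant evidence that the training changed the writing scores.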

In [None]:
# calculate the 95% confidence interval for the population mean


#### 2. An energy drink distributor claims that a new advertisement poster, featuring a life-size picture of a well-known athlete, will increase the product sales in outlets by an average of 50 bottles in a week. For a random sample of 9 outlets, the following data was collected. Test the claim that the advertisement was effective in increasing sales, using the critical region technique. Use α = 0.05.

Given data:

        sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
        sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]
        
* Ho: The poster featuring the athlete does not increase sales
* Ha: The poster featuring the athlete increases sales

In [42]:
sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

Let us check the normality of the sales before and after the advertisement, and the equality of their variances.

In [44]:
stats.shapiro(sales_before),stats.shapiro(sales_after)

(ShapiroResult(statistic=0.9187210202217102, pvalue=0.3817565143108368),
 ShapiroResult(statistic=0.9118874669075012, pvalue=0.3293103277683258))

In [45]:
stats.levene(sales_before,sales_after)

LeveneResult(statistic=0.09467455621301783, pvalue=0.7622856002535852)

The null and alternative hypothesis is:

H<sub>0</sub>: The advertisement was not effective in increasing sales ($\mu_{d} \leq 0$)<br>
H<sub>1</sub>: The advertisement was effective in increasing sales ($\mu_{d} > 0$)

Here ⍺ = 0.05 and degrees of freedom = 8, for a one-tailed test let us calculate the critical t-value.

In [None]:
# calculate the t-value for 95% of confidence level


In [46]:
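# ttest_rel(a, b) tests the mean of (a - b) and reports a two-sided p-value by default;
# for H1: mu_d > 0 with d = after - before, the one-tailed p-value is 0.922/2 = 0.461
stats.ttest_rel(sales_before,sales_after)

TtestResult(statistic=-0.10085458113185983, pvalue=0.9221477146925299, df=8)

The one-tailed p-value (0.461) is greater than 0.05, so we fail to reject $H_{0}$: the data do not show that the advertisement increased sales.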
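One way to run the right-tailed test directly is to pass the "after" sample first and request the upper tail. This is a sketch that assumes a scipy version in which `ttest_rel` accepts the `alternative` argument (1.6 or later); the critical-region comparison uses `stats.t.isf`.

```python
from scipy import stats

sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

# H1: mu_d > 0 with d = after - before, so pass 'after' first and ask for the upper tail
res = stats.ttest_rel(sales_after, sales_before, alternative='greater')

# critical value for a right-tailed test at alpha = 0.05 with n - 1 = 8 degrees of freedom
t_crit = stats.t.isf(0.05, df=len(sales_before) - 1)

# critical region technique: reject H0 only if the statistic exceeds the critical value
print(res.statistic > t_crit)  # prints False
```

The statistic has the same magnitude as the two-sided result above with the opposite sign, and the one-tailed p-value is half of the two-sided one because the statistic falls on the side named in $H_{1}$.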