<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

# Test of normality

In [2]:
df = pd.read_csv('mathscore_1ttest.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group C,standard,none,60,72,74,206,Nature Learning
1,female,group C,standard,none,59,72,68,199,Nature Learning
2,female,group E,standard,none,100,100,100,300,Speak Global Learning
3,female,group D,standard,none,69,74,74,217,Speak Global Learning
4,female,group A,free/reduced,none,47,59,50,156,Speak Global Learning


In [3]:
# Ho : Data is drawn from a normal distribution
# Ha : Data is not drawn from a normal distribution

In [4]:
stats.shapiro(df['total score'])

ShapiroResult(statistic=0.9745057225227356, pvalue=0.7773985862731934)

In [11]:
pvalue=0.7773985862731934  # p-value returned by stats.shapiro
sig_lvl = 0.05  # default : 0.05
if pvalue>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ho is selected


In [7]:
# Ho is not rejected: total score can be treated as normally distributed.

In [None]:
# Practice:

# Test whether math score is normally distributed or not.

In [8]:
# Ho : Data is drawn from a normal distribution
# Ha : Data is not drawn from a normal distribution

In [10]:
stats.shapiro(df['math score'])

ShapiroResult(statistic=0.9368310570716858, pvalue=0.13859796524047852)

In [6]:
pvalue=0.13859796524047852  # p-value returned by stats.shapiro(df['math score'])
sig_lvl = 0.05  # default : 0.05
if pvalue>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ho is selected


In [7]:
# Math score is normally distributed.
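The cells above copy the p-value by hand into the decision step, which is easy to get wrong when the test is repeated for another column. A minimal sketch of a helper that reads the p-value straight from the result object is shown below. The name `is_normal` and the two illustrative samples are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
import scipy.stats as stats

def is_normal(sample, sig_lvl=0.05):
    """Return (p-value, decision) for a Shapiro-Wilk test.

    Hypothetical helper: the decision is True when Ho (normality) is not rejected.
    """
    result = stats.shapiro(sample)
    pvalue = result.pvalue  # read the p-value from the result object
    return pvalue, bool(pvalue >= sig_lvl)

# Illustrative data only: evenly spaced normal quantiles and a sample with one extreme value.
bell_shaped = stats.norm.ppf((np.arange(30) + 0.5) / 30)
skewed = [1, 1, 2, 2, 3, 3, 4, 4, 5, 500]

print(is_normal(bell_shaped))
print(is_normal(skewed))
```

With the notebook data, the call would look like `is_normal(df['math score'])`.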

<a id="t"></a>
# 3. t Test

<a id="1t"></a>
## 3.1 One Sample t Test

Let us perform a one sample t-test for the population mean. We compare the population mean with a specific value. 

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu = \mu_{0}$ or $\mu \geq \mu_{0}$ or $\mu \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu \neq \mu_{0}$ or $\mu < \mu_{0}$ or $\mu > \mu_{0}$</strong></p>

The test statistic is given as:
<p style='text-indent:25em'> <strong> $t = \frac{\overline{X} -  \mu_{0}}{\frac{s}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X}$: Sample mean<br>
$s$: Sample standard deviation<br>
$n$: Sample size
 
Under $H_{0}$ the test statistic follows a t-distribution with n-1 degrees of freedom.
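The formula can be sketched as a small function that works from summary statistics and picks the tail that matches the alternative hypothesis. The function name `one_sample_t`, its `tail` argument, and the numbers used are assumptions for illustration only.

```python
import scipy.stats as stats

def one_sample_t(x_bar, s, n, mu0, tail="two-sided"):
    """t statistic and p-value from summary statistics (hypothetical helper)."""
    t = (x_bar - mu0) / (s / n ** 0.5)
    dof = n - 1
    if tail == "right":      # Ha: mu > mu0
        pval = stats.t.sf(t, df=dof)
    elif tail == "left":     # Ha: mu < mu0
        pval = stats.t.cdf(t, df=dof)
    else:                    # Ha: mu != mu0
        pval = 2 * stats.t.sf(abs(t), df=dof)
    return t, pval

# Made-up numbers: sample mean 52, sample std 4, n = 16, hypothesised mean 50.
t, pval = one_sample_t(x_bar=52.0, s=4.0, n=16, mu0=50.0, tail="right")
print(t, pval)
```

Here the statistic is (52 - 50) / (4 / 4) = 2.0 with 15 degrees of freedom.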


#### 1. A survey claims that in a math test female students tend to score marks greater than 75. Consider a sample of 24 female students and perform a hypothesis test to check the claim with 90% confidence.

Use the dataset available in the CSV file `mathscore_1ttest.csv`.

In [12]:
df = pd.read_csv('mathscore_1ttest.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group C,standard,none,60,72,74,206,Nature Learning
1,female,group C,standard,none,59,72,68,199,Nature Learning
2,female,group E,standard,none,100,100,100,300,Speak Global Learning
3,female,group D,standard,none,69,74,74,217,Speak Global Learning
4,female,group A,free/reduced,none,47,59,50,156,Speak Global Learning


In [None]:
# Ho : mu(female_mark) <=75
# Ha : mu (female_mark) > 75

In [None]:
# Test of Normality

In [8]:
# Ho : Data is drawn from a normal distribution
# Ha : Data is not drawn from a normal distribution

In [13]:
stats.shapiro(df['math score'])

ShapiroResult(statistic=0.9368310570716858, pvalue=0.13859796524047852)

In [14]:
pvalue=0.13859796524047852  # p-value returned by stats.shapiro(df['math score'])
sig_lvl = 0.05  # default : 0.05
if pvalue>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ho is selected


In [7]:
# Math score is normally distributed.

In [None]:
# Data is normal
# Pop std is not known 

# One sample t test (right tailed)

In [15]:
x_bar = np.mean(df['math score'])
s = np.std(df['math score'],ddof=1)
n = len(df['math score'])
mu = 75

t = (x_bar-mu)/(s/n**0.5)
print(t)

-3.606738075702319


In [21]:
pval = stats.t.sf(t,df=n-1)
print(pval)

0.9992573386042322


In [None]:
#  Inbuilt:

# stats.ttest_1samp(sample_data,popmean)

# The returned pvalue is two sided (two tail test).
# The inbuilt function needs the raw sample data; summary statistics alone are not enough.

In [19]:
stats.ttest_1samp(df['math score'],popmean=75)

Ttest_1sampResult(statistic=-3.6067380757023204, pvalue=0.0014853227915357337)

In [22]:
t = -3.6067380757023204
pval= stats.t.sf(t,df=n-1)
print(pval)

0.9992573386042322


In [17]:
sig_lvl = 0.1
if pval>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ho is selected


In [18]:
# Ho is not rejected: there is no evidence that the avg female math score is greater than 75.
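Newer SciPy releases (1.6 and later, to the best of my knowledge) accept an `alternative` argument in `stats.ttest_1samp`, which returns the one-sided p-value directly. The sketch below uses made-up scores, not the values in `mathscore_1ttest.csv`, and checks the result against the manual right-tailed calculation.

```python
import numpy as np
import scipy.stats as stats

# Made-up scores, for illustration only.
scores = np.array([60, 59, 100, 69, 47, 72, 81, 55, 66, 90])

# alternative='greater' tests Ha: mu > 75 and returns the right-tailed p-value.
result = stats.ttest_1samp(scores, popmean=75, alternative='greater')

# Same p-value computed by hand from the t statistic.
manual_pval = stats.t.sf(result.statistic, df=len(scores) - 1)
print(result.pvalue, manual_pval)
```

On an older SciPy version the argument is not available, and the manual `stats.t.sf` step remains the way to get the right-tailed value.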

#### 2. A researcher is studying the growth of bacteria in waters of Lake Beach. The mean bacteria count of 100 per unit volume of water is within the safety level. The researcher collected 10 water samples of unit volume and found the mean bacteria count to be 94.8 with a sample variance of 72.66. Does the data indicate that the bacteria count is within the safety level? Test at the α = .05 level. Assume that the measurements constitute a sample from a normal population.

In [None]:
# Ho : mu(bacteria growth) >=100
# Ha : mu(bacteria growth) < 100

In [None]:
# Data is normal
# Pop std is unknown

# One sample t test(left tailed)

In [25]:
x_bar = 94.8
s = np.sqrt(72.66)
n = 10
mu = 100

t = (x_bar-mu)/(s/n**0.5)
print(t)

-1.9291040236750068


In [26]:
pval = stats.t.cdf(t,df=n-1)
print(pval)

0.04289782134327503


In [27]:
sig_lvl = 0.05
if pval>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ha is selected


In [28]:
# Ho is rejected: the mean bacteria count is below 100, so it is within the safety level.
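The same decision can be reached with a critical value instead of a p-value. The following is a minimal sketch for this left-tailed test, reusing the test statistic computed above. The variable name `t_crit` is an assumption for this example.

```python
import scipy.stats as stats

t = -1.9291040236750068   # test statistic computed above
n = 10
sig_lvl = 0.05

# Left-tailed test: the rejection region is t < t_crit.
t_crit = stats.t.ppf(sig_lvl, df=n - 1)
print(t_crit)

if t < t_crit:
    print('Ha is selected')
else:
    print('Ho is selected')
```

For a right-tailed test the critical value would be `stats.t.ppf(1 - sig_lvl, df=n - 1)` and the rejection region `t > t_crit`.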

<a id="2t"></a>
## 3.2 Two Sample t Test (Unpaired)

The two sample t-test is used to compare the means of two independent populations. This test assumes that the populations are normally distributed from which the samples are taken.

The null and alternative hypotheses are given as:
<p style='text-indent:25em'> <strong> $H_{0}: \mu_{1} - \mu_{2} = \mu_{0}$ or $\mu_{1} - \mu_{2} \geq \mu_{0}$ or $\mu_{1} -\mu_{2} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{1} - \mu_{2} \neq \mu_{0} $ or $\mu_{1} - \mu_{2} < \mu_{0}$ or $\mu_{1} -\mu_{2} > \mu_{0}$</strong></p>

Let us take a sample of size $n_{1}$ from the first population and a sample of size $n_{2}$ from a second, independent population. If both $n_{1}$ and $n_{2}$ are less than 30 and the population standard deviations are unknown, we use the two-sample t-test.

Assume equal variances for both populations. The test statistic for the two-sample t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{(\overline{X_{1}} - \overline{X_{2}}) - \mu_{0}} {s \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}$</strong></p>

Where, <br>
$\overline{X_{1}}$, $\overline{X_{2}}$: Mean of both the samples<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$s$: Pooled standard deviation<br>
$n_{1}, n_{2}$: Size of samples from both the populations

The pooled standard deviation is defined as:
$s = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}}$ $\hspace{2cm}$  Where, $s_{1}, s_{2}$: Standard deviation of both the samples

Under $H_{0}$, the test statistic follows a t-distribution with $(n_{1}+n_{2}-2)$ degrees of freedom.
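The pooled formulas can be sketched as one function and compared with `stats.ttest_ind`, which pools the variances by default. The function name `pooled_t` and both samples are made up for illustration.

```python
import numpy as np
import scipy.stats as stats

def pooled_t(sample1, sample2, mu0=0.0):
    """Two-sample t statistic with pooled variance, and its two-sided p-value."""
    x1 = np.asarray(sample1, dtype=float)
    x2 = np.asarray(sample2, dtype=float)
    n1, n2 = len(x1), len(x2)
    dof = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp_2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / dof
    t = ((x1.mean() - x2.mean()) - mu0) / np.sqrt(sp_2 * (1 / n1 + 1 / n2))
    pval = 2 * stats.t.sf(abs(t), df=dof)
    return t, pval

# Made-up samples, for illustration only.
group_a = [245, 231, 257, 130, 210, 198]
group_b = [279, 205, 188, 240, 176]

print(pooled_t(group_a, group_b))
print(stats.ttest_ind(group_a, group_b))  # equal_var=True by default
```

Both lines should report the same statistic and p-value, up to floating-point rounding.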

### Example: 

#### 1. The teachers' association claims that the total score of the students who completed the test preparation course is different than the total score of the students who have not completed the course. The sample data consists of 15 students who completed the course and 18 students who have not completed the course. Test the association's claim with ⍺ = 0.05.

Consider the total score of the students who have/ have not completed the preparation course are given in the CSV file `totalmarks_2ttest.csv`.

In [54]:
df = pd.read_csv('totalmarks_2ttest.csv')
df.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,male,group E,standard,completed,84,83,78,245,Speak Global Learning
1,male,group C,free/reduced,completed,79,77,75,231,Speak Global Learning
2,male,group A,standard,none,91,96,92,279,Nature Learning
3,female,group B,free/reduced,completed,76,94,87,257,Speak Global Learning
4,male,group A,standard,completed,46,41,43,130,Nature Learning


In [None]:
# Ho: mu1(completed) = mu2(not completed)
# Ha : mu1(completed) != mu2(not completed)

# Ho : mu1-mu2=0
# Ha : mu1-mu2 !=0

In [32]:
comp_score = df[df['test preparation course']=='completed']['total score']
no_comp_score = df[df['test preparation course']=='none']['total score']

In [None]:
# Test of normality

# Ho : Data is normal
# Ha : Data is not normal

In [31]:
stats.shapiro(comp_score)

ShapiroResult(statistic=0.9055534601211548, pvalue=0.11574020981788635)

In [33]:
stats.shapiro(no_comp_score)

ShapiroResult(statistic=0.948186457157135, pvalue=0.39728137850761414)

In [34]:
# pval>0.05
# Both scores are normal

In [35]:
# Data is normal
# Pop std is not known

# Two sample t test- unpaired - two tailed

In [None]:
# Manual Calculation

In [51]:
x1_bar = np.mean(comp_score)
x2_bar = np.mean(no_comp_score)
s1 = np.std(comp_score,ddof=1)
s2 = np.std(no_comp_score,ddof=1)
n1 = len(comp_score)
n2 = len(no_comp_score)
df1 = n1-1
df2 = n2-1
df_total =df1+df2

# pooled variance

sp_2 = ((df1*s1**2) + (df2*s2**2))/(df1+df2)


# t_Stat

num = (x1_bar-x2_bar)-0
den = np.sqrt(sp_2*(1/n1+1/n2))
t= num/den
print(t)

1.4385323319823262


In [52]:
pval = stats.t.sf(abs(t),df=df_total)*2
print(pval)

0.16030339806989594


In [37]:
# Inbuilt function

# stats.ttest_ind(sample_data1,sample_data2)

In [38]:
stats.ttest_ind(comp_score,no_comp_score)

Ttest_indResult(statistic=1.4385323319823262, pvalue=0.16030339806989594)

In [39]:
pval=0.16030339806989594 # (two tail probability)

In [40]:
sig_lvl = 0.05
if pval>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ho is selected


In [28]:
# Ho is not rejected: there is no evidence that the avg score differs between students who completed the course and those who did not.
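The pooled test above relies on the equal-variance assumption. One possible way to check it, sketched here on made-up samples, is Levene's test (`stats.levene`), falling back to Welch's t-test (`equal_var=False`) when equal variances are rejected. Using Levene's test at a 0.05 level is an assumption of this sketch, not a step taken in the original analysis.

```python
import numpy as np
import scipy.stats as stats

# Made-up samples, for illustration only.
group_a = np.array([245, 231, 257, 130, 210, 198, 222])
group_b = np.array([279, 205, 188, 240, 176, 201])

# Levene's test. Ho: the two populations have equal variances.
lev_stat, lev_pval = stats.levene(group_a, group_b)
equal_var = bool(lev_pval >= 0.05)

# equal_var=True gives the pooled test, equal_var=False gives Welch's test.
result = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(lev_pval, result.statistic, result.pvalue)
```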

## Practice:

1. The teachers' association claims that the total score of Speak Global Learning is greater than the total score of Nature Learning.  Test the association's claim with ⍺ = 0.05.

In [53]:
# Ho: mu1(sgl) <= mu2(nl)
# Ha : mu1(sgl) > mu2(nl)

# Ho : mu1-mu2<=0
# Ha : mu1-mu2 >0

In [58]:
sgl = df[df['training institute']=='Speak Global Learning']['total score']
nl = df[df['training institute']=='Nature Learning']['total score']

In [59]:
# Test of normality

# Ho : Data is normal
# Ha : Data is not normal

In [60]:
stats.shapiro(sgl)

ShapiroResult(statistic=0.940517246723175, pvalue=0.26940712332725525)

In [61]:
stats.shapiro(nl)

ShapiroResult(statistic=0.960299015045166, pvalue=0.7280198335647583)

In [34]:
# pval>0.05
# Both scores are normal

In [62]:
# Data is normal
# Pop std is not known

# Two sample t test- unpaired - right tailed

In [63]:
# Inbuilt function

# stats.ttest_ind(sample_data1,sample_data2)

In [64]:
stats.ttest_ind(sgl,nl)

Ttest_indResult(statistic=0.9984458677537893, pvalue=0.32579344760218754)

In [85]:
t = 0.9984458677537893
n1 =len(sgl)
n2 = len(nl)
df1 = n1-1
df2 = n2-1
df_total = df1+df2
pval= stats.t.sf(t,df=df_total)
print(pval)

0.16289672380109377


In [67]:
sig_lvl = 0.05
if pval>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ho is selected


In [28]:
# Ho is not rejected: there is no evidence that the avg total score of sgl is greater than that of nl.
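Because the t-distribution is symmetric, the one-sided p-value can also be derived from the two-sided one reported by the inbuilt function, using the sign of the statistic. A minimal sketch follows. The helper name `one_sided_pvalue` and its `tail` argument are assumptions for this example.

```python
def one_sided_pvalue(t, two_sided_pval, tail):
    """Convert a two-sided t-test p-value to a one-sided one (hypothetical helper).

    tail='right' for Ha: mu1 > mu2, tail='left' for Ha: mu1 < mu2.
    """
    half = two_sided_pval / 2
    if tail == 'right':
        return half if t > 0 else 1 - half
    if tail == 'left':
        return half if t < 0 else 1 - half
    raise ValueError("tail must be 'right' or 'left'")

# Values reported by stats.ttest_ind(sgl, nl) above.
print(one_sided_pvalue(0.9984458677537893, 0.32579344760218754, 'right'))
```

The result should agree with the right-tailed value obtained from `stats.t.sf` above.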

<a id="paired"></a>
## 3.3 Paired t Test

A paired t-test is used to compare the mean of the population for two dependent samples. The dependent samples can be the scores before and after a specific treatment. 

Let $X_{i}$ be the sample before the treatment and $Y_{i}$ be the sample after the treatment. Let $\mu_{X}$, $\mu_{Y}$ be the mean of the data X and Y respectively. The mean difference $\mu_{d} = \mu_{Y} - \mu_{X}$.

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: \mu_{d} = \mu_{0}$ or $\mu_{d} \geq \mu_{0}$ or $\mu_{d} \leq \mu_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: \mu_{d} \neq \mu_{0}$ or $\mu_{d} < \mu_{0}$ or $\mu_{d} > \mu_{0}$</strong></p>

The test statistic for paired t-test is given as:
<p style='text-indent:25em'> <strong> $t = \frac{\overline{X_{D}} - \mu_{0}} {\frac{s_{D}}{\sqrt{n}}}$</strong></p>

Where, <br>
$\overline{X_{D}}$: Mean difference between the pairs<br>
$\mu_{0}$: Mean difference given in the null hypothesis<br>
$s_{D}$: Standard deviation of differences between the pairs<br>
$n$: Sample size

Under $H_{0}$, the test statistic follows a t-distribution with (n-1) degrees of freedom.
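The statistic can be sketched directly from the pairwise differences and compared with `stats.ttest_rel`. The function name `paired_t` and the scores are made up for illustration. The differences are taken as after minus before, following the definition of $\mu_{d}$ above.

```python
import numpy as np
import scipy.stats as stats

def paired_t(before, after, mu0=0.0):
    """Paired t statistic on the differences (after - before), two-sided p-value."""
    d = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    n = len(d)
    t = (d.mean() - mu0) / (d.std(ddof=1) / np.sqrt(n))
    pval = 2 * stats.t.sf(abs(t), df=n - 1)  # n - 1 degrees of freedom, n = number of pairs
    return t, pval

# Made-up scores, for illustration only.
before = [59, 62, 76, 32, 61, 69]
after = [50, 67, 92, 75, 98, 91]

print(paired_t(before, after))
print(stats.ttest_rel(after, before))  # same statistic and p-value
```

Swapping the argument order in `stats.ttest_rel` flips the sign of the statistic, so the tail used for a one-sided test has to match the order of subtraction.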

### Example:

#### 1. A training institute wants to check if their writing training program was effective or not. 17 students are selected to check the hypothesis. Consider 0.05 as the level of significance.

The writing scores before and after training are provided in the CSV file `WritingScores.csv`. 

In [68]:
df = pd.read_csv('WritingScores.csv')
df 

Unnamed: 0,score_before,score_after
0,59,50
1,62,67
2,76,92
3,32,75
4,61,98
5,69,91
6,67,65
7,61,41
8,77,63
9,60,77


In [None]:
# Ho : mu1(before) >= mu2(after)
# Ha : mu1(before) < mu2(after)

In [69]:
bfr_score = df['score_before']
aftr_sfcore = df['score_after']

In [70]:
# Test of normality

# Ho : Data is normal
# Ha : Data is not normal

In [71]:
stats.shapiro(bfr_score)

ShapiroResult(statistic=0.9473825097084045, pvalue=0.416460782289505)

In [72]:
stats.shapiro(aftr_sfcore)

ShapiroResult(statistic=0.9686523675918579, pvalue=0.7944130897521973)

In [73]:
# pval>0.05
# Both scores are normal

In [74]:
# Data is normal
# Pop std is not known

# Two sample t test- paired - left tailed

In [80]:
# Equivalent calculation: one sample t test on the paired differences

difference = bfr_score-aftr_sfcore
stats.ttest_1samp(difference,popmean=0)

Ttest_1sampResult(statistic=-1.4394882729049499, pvalue=0.16929012896279846)

In [82]:
t = -1.4394882729049499
n = len(bfr_score)
df_paired = n-1  # paired test: degrees of freedom = number of pairs - 1
pval= stats.t.cdf(t,df=df_paired)
print(round(pval,4))

0.0846


In [75]:
# Inbuilt function

# stats.ttest_rel(sample_data1,sample_data2)

In [76]:
stats.ttest_rel(bfr_score,aftr_sfcore)

Ttest_relResult(statistic=-1.4394882729049499, pvalue=0.16929012896279846)

In [83]:
t = -1.4394882729049499
n = len(bfr_score)
df_paired = n-1  # paired test: degrees of freedom = number of pairs - 1
pval= stats.t.cdf(t,df=df_paired)
print(round(pval,4))

0.0846


In [84]:
sig_lvl = 0.05
if pval>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ho is selected


In [79]:
# Ho is not rejected: there is no evidence that the avg score after the program is greater than before.
# The program shows no significant effect.

#### 2. An energy drink distributor claims that a new advertisement poster, featuring a life-size picture of a well-known athlete, will increase the product sales in outlets by an average of 50 bottles in a week. For a random sample of 10 outlets, the following data was collected. Test whether the advertisement was effective in increasing sales. Test the hypothesis using critical region technique. Use α = 0.05.

Given data:

        sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
        sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

In [86]:
# Ho : mu1(before) >= mu2(after)
# Ha : mu1(before) < mu2(after)

In [87]:
sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

In [88]:
# Test of normality

# Ho : Data is normal
# Ha : Data is not normal

In [89]:
stats.shapiro(sales_before)

ShapiroResult(statistic=0.9187208414077759, pvalue=0.38175514340400696)

In [90]:
stats.shapiro(sales_after)

ShapiroResult(statistic=0.9118874073028564, pvalue=0.3293096721172333)

In [73]:
# pval>0.05
# Both scores are normal

In [91]:
# Data is normal
# Pop std is not known

# Two sample t test- paired - left tailed

In [92]:
stats.ttest_rel(sales_before,sales_after)

Ttest_relResult(statistic=-0.10085458113185983, pvalue=0.9221477146925299)

In [93]:
t = -0.10085458113185983
n = len(sales_before)
df_paired = n-1  # paired test: degrees of freedom = number of pairs - 1
pval= stats.t.cdf(t,df=df_paired)
print(round(pval,4))

0.4611


In [94]:
sig_lvl = 0.05
if pval>=sig_lvl:
    print('Ho is selected')
else:
    print('Ha is selected')

Ho is selected


In [95]:
# Ho is not rejected: there is no evidence that avg sales after the ad are greater than before.
# The ad shows no significant effect.
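A paired t-test is a one-sample t-test on the pairwise differences, so the normality assumption applies to those differences and not to each sample separately. The sketch below, using the sales data given above, checks the differences with Shapiro-Wilk and then runs the one-sample test on them. The variable name `difference` is an assumption for this example.

```python
import numpy as np
import scipy.stats as stats

sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

# The paired t test is a one sample t test on these differences.
difference = np.array(sales_before) - np.array(sales_after)

# Ho : the differences come from a normal distribution
print(stats.shapiro(difference))

# One sample t test on the differences matches stats.ttest_rel.
print(stats.ttest_1samp(difference, popmean=0))
```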