<table align="left" width=100%>
    <tr>
        <td width="20%">
            <img src="GL-2.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                  <b> Faculty Notebook <br> (Day1) </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Table of Contents
1. **[Small Sample Test](#t)**
    - 1.1 - **[One Sample t Test](#1t)**
2. **[Z Proportion Test](#prop)**
    - 2.1 - **[One Sample Test](#1_p)**

<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [1]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

### Example:


#### 1. A survey claims that in a math test female students tend to score fewer marks than the average marks of 75 out of 100. Consider a sample of 24 female students and perform a hypothesis test to check the claim with 90% confidence.

Use the dataset available in the CSV file `totalmarks_2ttest.csv` (the file read in the cell below) and consider the math scores of the female students.

In [3]:
# read the students performance data 
df_female_scores = pd.read_csv('totalmarks_2ttest.csv')

# display the first two observations
df_female_scores.head(2)

| Unnamed: 0 | gender | race/ethnicity | lunch | test preparation course | math score | reading score | writing score | total score | training institute |
|---|---|---|---|---|---|---|---|---|---|
| 0 | male | group E | standard | completed | 84 | 83 | 78 | 245 | Speak Global Learning |
| 1 | male | group C | free/reduced | completed | 79 | 77 | 75 | 231 | Speak Global Learning |


In [4]:
# check the dimensions (rows, columns) of the data
df_female_scores.shape

(33, 9)

The null and alternative hypotheses are:

H<sub>0</sub>: $\mu \geq 75$<br>
H<sub>1</sub>: $\mu < 75$

Here ⍺ = 0.1 and the degrees of freedom are n - 1 (23 for a sample of 24). For a one-tailed (left-tailed) test, let us calculate the critical value.

In [5]:
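# critical value for the left-tailed test at alpha = 0.10
# note: 'stats.norm.ppf' returns the standard normal (z) quantile;
# the corresponding t critical value is 'stats.t.ppf(alpha, n-1)'
alpha=0.10
stats.norm.ppf(alpha)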

-1.2815515655446004
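The value above is a quantile of the standard normal distribution. For a small sample, the critical value is usually taken from the t distribution with n - 1 degrees of freedom. The following is a minimal sketch, assuming the sample size of 24 stated in the problem; the helper name `left_tail_critical_values` is hypothetical and not part of the notebook.

```python
from scipy import stats

def left_tail_critical_values(alpha, n):
    """Return (t critical, z critical) for a left-tailed test with sample size n."""
    # Student's t quantile with n - 1 degrees of freedom
    t_crit = stats.t.ppf(alpha, df=n - 1)
    # standard normal quantile, shown only for comparison
    z_crit = stats.norm.ppf(alpha)
    return t_crit, z_crit

# n = 24 is the sample size stated in the problem (assumption for this sketch)
t_crit, z_crit = left_tail_critical_values(alpha=0.10, n=24)
print(f"t critical: {t_crit:.4f}, z critical: {z_crit:.4f}")
```

The t critical value lies further from zero than the z critical value, so the t-test needs a more extreme statistic to reject H<sub>0</sub> at the same level of significance.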

In [6]:
# calculate the inputs for the test statistic
# H0: mu >= 75
# Ha: mu < 75
cl=0.9
alpha=1-cl
mew=75

# math scores of the female students
df=df_female_scores.loc[df_female_scores['gender']=='female','math score']
print(df.shape)

xbar=df.mean()
# population standard deviation: 'ddof=0' divides by n
sigma=df.std(ddof=0)
print(sigma)

# sample standard deviation: 'ddof=1' divides by n-1
samplesd=np.std(df,ddof=1)
print(samplesd)

(18,)
10.003085943600691
10.29309051831862
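The one sample t-test assumes that the scores come from an approximately normal population, and the notebook imports `shapiro` for this purpose. The following is a minimal sketch of such a check on hypothetical scores; the array `scores` is generated only for illustration and is not the notebook's data.

```python
import numpy as np
from scipy.stats import shapiro

# hypothetical scores, generated only for illustration
rng = np.random.default_rng(0)
scores = rng.normal(loc=60, scale=10, size=18)

# H0 of the Shapiro-Wilk test: the data is drawn from a normal distribution
stat, p_value = shapiro(scores)
print('Test statistic:', stat)
print('P-value:', p_value)

# a p-value above the level of significance gives no evidence against normality
if p_value > 0.05:
    print('Fail to reject H0: no evidence against normality')
else:
    print('Reject H0: the data does not look normal')
```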


In [7]:
# Z statistic: uses 'sigma' (computed with ddof=0) as the standard deviation
mew=75
n=df.shape[0]
num=xbar-mew
deno=sigma/np.sqrt(n)
teststats=num/deno
print('TestStats:',teststats)

TestStats: -6.691879119078237


In [8]:
# one sample t-test (two-sided by default)
stats.ttest_1samp(df,mew)

TtestResult(statistic=-6.50333753824416, pvalue=5.409677011466925e-06, df=17)

In [9]:
# left-tailed test, matching H1: mu < 75
stats.ttest_1samp(df,mew,alternative='less')

TtestResult(statistic=-6.50333753824416, pvalue=2.7048385057334625e-06, df=17)

In [10]:
# p value using cdf function
stats.t.cdf(-6.50333753824416,df=n-1)

2.7048385057334625e-06
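The statistic and p-value returned by `stats.ttest_1samp` can be reproduced by hand from the sample mean, the sample standard deviation (`ddof=1`) and the cdf of the t distribution. The following is a minimal sketch on a hypothetical array `scores`; the values are made up for illustration and are not the notebook's data.

```python
import numpy as np
from scipy import stats

# hypothetical math scores, used only for illustration
scores = np.array([52, 61, 47, 70, 58, 66, 49, 73, 55, 62])
mu0 = 75        # hypothesized population mean
alpha = 0.10    # level of significance

n = scores.size
xbar = scores.mean()
s = scores.std(ddof=1)                       # sample standard deviation (n - 1 in the denominator)

# t statistic and left-tailed p-value computed by hand
t_manual = (xbar - mu0) / (s / np.sqrt(n))
p_manual = stats.t.cdf(t_manual, df=n - 1)

# the same test using scipy
res = stats.ttest_1samp(scores, mu0, alternative='less')

print('Manual:', t_manual, p_manual)
print('Scipy :', res.statistic, res.pvalue)

# decision rule: reject H0 when the p-value is below alpha
print('Reject H0' if p_manual < alpha else 'Fail to reject H0')
```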

<a id="prop"></a>
# 2. Z Proportion Test

<a id="1_p"></a>
## 2.1 One Sample Test

Perform one sample Z test for the population proportion. We compare the population proportion ($P$) with a specific value ($P_{0}$).

The null and alternative hypotheses are given as:

<p style='text-indent:25em'> <strong> $H_{0}: P = P_{0}$ or $P \geq P_{0}$ or $P \leq P_{0}$</strong></p>
<p style='text-indent:25em'> <strong> $H_{1}: P \neq P_{0}$ or $P < P_{0}$ or $P > P_{0}$</strong></p>

The test statistic for proportion Z-test is given as:
<p style='text-indent:25em'> <strong> $Z = \frac{p -  P_{0}}{\sqrt{\frac{P_{0}(1-P_{0})}{n}}}$</strong></p>

Where, <br>
$p$: Sample proportion<br>
$n$: Sample size

Under $H_{0}$, the test statistic follows a standard normal distribution.
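The test statistic above can be wrapped in a small function. The following is a minimal sketch that uses only the Python standard library; the function name `one_sample_prop_ztest`, its `alternative` labels and the counts in the usage line are hypothetical and chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_prop_ztest(successes, n, p0, alternative="larger"):
    """Return (Z, p-value) for a one sample Z test for the population proportion."""
    p_hat = successes / n                          # sample proportion
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # standard error uses P0 under H0
    std_normal = NormalDist()
    if alternative == "larger":                    # H1: P > P0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "smaller":                 # H1: P < P0
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":               # H1: P != P0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'larger', 'smaller' or 'two-sided'")
    return z, p_value

# hypothetical example: 60 successes out of 100 trials, H0: P <= 0.5
z, p_value = one_sample_prop_ztest(60, 100, 0.5)
print('Z:', z)
print('P-value:', p_value)
```

The standard error is computed from $P_{0}$ rather than from the sample proportion because the statistic is evaluated under $H_{0}$.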

### Example:

#### 1. In previous years, people believed that at most 80% of male students score more than 50 marks out of 100 in Mathematics. Perform a test to check whether this percentage is more than 80. Consider the level of significance as 0.05.

Consider the sample of math scores of male students available in the CSV file `StudentsPerformance.csv`.

In [12]:
# read the students performance data 
df_student = pd.read_csv('StudentsPerformance.csv')

# display the first two observations
df_student.head(2)

| Unnamed: 0 | gender | race/ethnicity | lunch | test preparation course | math score | reading score | writing score | total score | training institute |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | standard | none | 89 | 55 | 56 | 200 | Nature Learning |
| 1 | female | group C | standard | completed | 55 | 63 | 72 | 190 | Nature Learning |


The null and alternative hypotheses are:

H<sub>0</sub>: $P \leq 0.8$<br>
H<sub>1</sub>: $P > 0.8$ 

Here ⍺ = 0.05. For a one-tailed (right-tailed) test, calculate the critical z-value.

In [17]:
# select male students 
df_student['gender'].unique()
males=df_student.loc[df_student.gender=='male']
males.shape

(483, 9)

In [35]:
# using the male dataset, apply the condition
# count the male students with math score > 50
count_above_50=males.loc[males['math score']>50].shape[0]

# sample proportion
samp_prop=count_above_50/males.shape[0]

# sample size and hypothesized proportion
n=males.shape[0]
hyp_prop=0.80

In [29]:
# critical z-value for the right-tailed test at alpha = 0.05
stats.norm.isf(0.05)

1.6448536269514729

In [36]:
# Test Statistic
test_stats=(samp_prop-hyp_prop)/np.sqrt((hyp_prop*(1-hyp_prop))/n)
test_stats

4.163394160018601

In [40]:
# p value for the right-tailed test
1-stats.norm.cdf(test_stats)

# the p value is less than 0.05, so we reject the null hypothesis: more than 80% of male students score more than 50 marks

1.5677570141203745e-05
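The critical value approach and the p-value approach lead to the same decision. The following is a minimal sketch that applies both rules to the test statistic printed above, using only the Python standard library; the names `reject_by_critical` and `reject_by_p_value` are hypothetical.

```python
from statistics import NormalDist

alpha = 0.05
test_stat = 4.163394160018601     # test statistic printed in the notebook output

std_normal = NormalDist()

# critical value approach: reject H0 when Z exceeds the upper-tail critical value
z_crit = std_normal.inv_cdf(1 - alpha)
reject_by_critical = test_stat > z_crit

# p-value approach: reject H0 when the upper-tail probability is below alpha
p_value = 1 - std_normal.cdf(test_stat)
reject_by_p_value = p_value < alpha

print('Critical value:', z_crit, '| reject H0:', reject_by_critical)
print('P-value:', p_value, '| reject H0:', reject_by_p_value)
```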

In [41]:
# calculate the 95% confidence interval
# note: the standard error here is computed from the hypothesized proportion 'hyp_prop'
stats.norm.interval(0.95,loc=samp_prop,scale=np.sqrt((hyp_prop*(1-hyp_prop))/n))

(0.8401038178124423, 0.9114489772186136)

#### 2. A sample of 361 business owners had gone into bankruptcy due to recession. On taking a survey, it was found that 105 of them had not consulted any professional for managing their finances before opening the business. Test the null hypothesis that at most 25% of all businesses had not consulted a professional before opening the business. Test the claim using the p-value technique. Use α = 0.05.

The null and alternative hypotheses are:

H<sub>0</sub>: $P \leq 0.25$<br>
H<sub>1</sub>: $P > 0.25$ 

In [37]:
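# note: this evaluates the standard normal cdf at 0.05; it is not the p value of the test
# the p value is computed from the test statistic in a later cell
stats.norm.cdf(0.05)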

0.5199388058383725

In [47]:
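# sample size and number of owners who had not consulted a professional
n=361
x=105

# sample proportion and hypothesized proportion
samp_prop=x/n
hyp_prop=0.25

# critical z-value for the right-tailed test at alpha = 0.05
stats.norm.isf(0.05)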

1.6448536269514729

In [48]:
test_stats=(samp_prop-hyp_prop)/np.sqrt((hyp_prop*(1-hyp_prop))/n)
test_stats

1.7928245201151534

In [54]:
# p value
1-stats.norm.cdf(test_stats)

0.03650049373124953

In [51]:
# 95% confidence interval (standard error computed from the hypothesized proportion 'hyp_prop')
stats.norm.interval(0.95,loc=samp_prop,scale=np.sqrt((hyp_prop*(1-hyp_prop))/n))

(0.24619086783771343, 0.33552658368583227)
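The interval above uses the hypothesized proportion in the standard error. A common alternative for a confidence interval is to compute the standard error from the sample proportion itself. The following is a minimal sketch of that variant with the same counts (105 out of 361); the name `se_sample` is hypothetical.

```python
import numpy as np
from scipy import stats

n, x = 361, 105
samp_prop = x / n

# standard error based on the sample proportion instead of the hypothesized one
se_sample = np.sqrt(samp_prop * (1 - samp_prop) / n)

# two-sided 95% confidence interval centered at the sample proportion
lower, upper = stats.norm.interval(0.95, loc=samp_prop, scale=se_sample)
print('95% confidence interval:', (lower, upper))
```

Since the sample proportion is further from 0.5 than it is from 0.25, this interval is slightly wider, and it also contains 0.25.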

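### Based on the confidence interval

* We observe that the 95% confidence interval contains 0.25, while the one-tailed test rejects H<sub>0</sub> at the 5% level. The interval is two-sided, so it corresponds to a one-tailed test at the 2.5% level, and the p-value (0.0365) is greater than 0.025.
* We also notice that the p-value rejects H<sub>0</sub> at the 5% level but fails to reject it at the 1% level, which suggests that the evidence against H<sub>0</sub> is borderline and the conclusion depends on the chosen level of significance.
* We need more data to generate more reliable results.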
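One way to read the observation about 0.25 is to compare a one-sided 95% lower confidence bound with the lower limit of the two-sided 95% interval. The following is a minimal sketch, assuming the same standard error (based on the hypothesized proportion) that the notebook uses; the names `one_sided_lower` and `two_sided_lower` are hypothetical.

```python
import numpy as np
from scipy import stats

n, x = 361, 105
hyp_prop = 0.25
alpha = 0.05

samp_prop = x / n
se_null = np.sqrt(hyp_prop * (1 - hyp_prop) / n)

# one-sided 95% lower bound: puts all of alpha in one tail, like the right-tailed test
one_sided_lower = samp_prop - stats.norm.isf(alpha) * se_null

# lower limit of the two-sided 95% interval: puts alpha/2 in each tail
two_sided_lower = samp_prop - stats.norm.isf(alpha / 2) * se_null

print('One-sided lower bound:', one_sided_lower)
print('Two-sided lower limit:', two_sided_lower)
print('0.25 below one-sided bound:', hyp_prop < one_sided_lower)
print('0.25 inside two-sided interval:', hyp_prop > two_sided_lower)
```

Under these assumptions the one-sided bound lies above 0.25, which agrees with rejecting H<sub>0</sub> at the 5% level, while the two-sided limit lies below 0.25.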