<a id="lib"></a>
# 1. Import Libraries

**Let us import the required libraries.**

In [5]:
# import 'pandas' 
import pandas as pd 

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import 'random' to generate random sample
import random

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

<a id="chisq"></a>
# 2. Chi-Square Test

The chi-square test is a non-parametric test. `Non-parametric tests` make no assumptions about the parameters of the population from which the sample is drawn. They can be applied to ordinal or nominal data, and they remain usable when the data contain outliers.

The chi-square test statistic follows a Chi-square ($\chi^{2}$) distribution under the null hypothesis. It can be used to check the relationship between the categorical variables. 

Let us calculate the right-tailed $\chi^{2}$ values for different levels of significance ($\alpha$).
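A minimal sketch of this calculation, assuming 3 degrees of freedom (an illustrative choice, not a value taken from the text). `stats.chi2.isf` returns the value that cuts off an upper-tail area of $\alpha$ under the chi-square distribution.

```python
from scipy import stats

dof = 3  # assumed degrees of freedom, for illustration only

# right-tailed critical value: P(chi-square > critical value) = alpha
critical_values = {alpha: stats.chi2.isf(alpha, df=dof) for alpha in (0.10, 0.05, 0.01)}

for alpha, crit in critical_values.items():
    print(f"alpha = {alpha:.2f} -> critical chi-square = {crit:.4f}")
```

A test statistic larger than the critical value for the chosen $\alpha$ falls in the rejection region.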

<a id="goodness"></a>
## 2.1 Chi-Square Test for Goodness of Fit

This test compares the observed distribution of a categorical variable with an expected distribution.

<p style='text-indent:6em'> <strong> $H_{0}$: There is no significant difference between the observed frequencies and the frequencies expected under the assumed distribution</strong></p>
<p style='text-indent:6em'> <strong> $H_{1}$: There is a significant difference between the observed frequencies and the frequencies expected under the assumed distribution</strong></p>

### Example:



#### 1. At an emporium, the manager wants to know which age groups visit the mall during the day so that he can plan his inventory accordingly. He defines four categories: children, teenagers, adults and senior citizens. He claims that, of all visitors, 5% are children, 38% are teenagers, 2% are senior citizens and the remaining 55% are adults. In a sample of 180 people, 25 were children, 50 were teenagers, 90 were adults and 15 were senior citizens. Test the manager’s claim at a 95% confidence level.


In [6]:
# Ho : Observed = Expected
# Ha : Observed != Expected

In [7]:
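# claimed proportions: children, teenagers, senior citizens, adults
exp_per = np.array([0.05, 0.38, 0.02, 0.55])
n = 180
# observed counts in the same category order
obs_val = [25, 50, 15, 90]
# expected counts under the manager's claim
exp_val = exp_per * n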

In [8]:
stats.chisquare(f_exp=exp_val,f_obs = obs_val)

Power_divergenceResult(statistic=70.31233386496545, pvalue=3.659118590746868e-15)

In [9]:
# The chi-square goodness-of-fit test is right-tailed, and stats.chisquare already returns the right-tail p-value, so it can be used directly.

In [11]:
pval = 3.659118590746868e-15
sig_lvl = 0.05
if pval<sig_lvl:
    print('Ha is selected')
else:
    print('Ho is selected')

Ha is selected


In [None]:
# The observed proportions of visitors differ from the proportions the manager claimed
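As a cross-check, the statistic can be computed by hand as $\sum (O - E)^{2} / E$ and the p-value read from the upper tail of the chi-square distribution. This is a sketch using the counts and claimed proportions from the example above; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# data from the emporium example (children, teenagers, senior citizens, adults)
observed = np.array([25, 50, 15, 90])
expected = np.array([0.05, 0.38, 0.02, 0.55]) * observed.sum()

# chi-square statistic: sum of (O - E)^2 / E over all categories
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# right-tail p-value with (number of categories - 1) degrees of freedom
dof = len(observed) - 1
p_value = stats.chi2.sf(chi2_stat, df=dof)

print("statistic:", chi2_stat)
print("p-value:", p_value)
print("Ha is selected" if p_value < 0.05 else "Ho is selected")
```

Reading the p-value from the computed result, rather than copying it by hand, avoids transcription errors. Note that one expected count (3.6 for senior citizens) is below 5, the usual rule-of-thumb minimum for the chi-square approximation, so the p-value should be read with some caution.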

### Practice:

1) In a school, the sports teacher wants to check the proportion of
students participating in different sports. He expects all the sports
to have equal participation. After observation, he found the following proportions:

cricket - 35% <br>
volley ball - 25% <br>
foot ball - 20% <br>
basket ball - 20%

Total number of students in the school - 200

Check the hypothesis at a 95% confidence level.

In [6]:
# Ho : Observed = Expected
# Ha : Observed != Expected

In [16]:
n = 200
exp_per = np.array([0.25,0.25,0.25,0.25])  # cricket,volley,foot,basket
obs_per = np.array([0.35,0.25,0.20,0.20])  # cricket,volley,foot,basket
obs_val = obs_per*n
exp_val = exp_per*n

In [19]:
stats.chisquare(f_exp=exp_val,f_obs = obs_val)

Power_divergenceResult(statistic=12.0, pvalue=0.007383160505359769)

In [9]:
# The chi-square goodness-of-fit test is right-tailed, and stats.chisquare already returns the right-tail p-value, so it can be used directly.

In [20]:
pval = 0.007383160505359769
sig_lvl = 0.05
if pval<sig_lvl:
    print('Ha is selected')
else:
    print('Ho is selected')

Ha is selected


In [None]:
# The observed participation is not equal across the four sports

<a id="ind"></a>
## 2.2 Chi-Square Test for Independence

This test is used to test whether the categorical variables are independent or not.

<p style='text-indent:20em'> <strong> $H_{0}$: The variables are independent</strong></p>
<p style='text-indent:20em'> <strong> $H_{1}$: The variables are not independent (i.e. variables are dependent)</strong></p>

Consider a categorical variable `A` with `r` levels and variable `B` with `c` levels. Let us test the independence of variables A and B.

The test statistic is given as:
<p style='text-indent:25em'> <strong> $\chi^{2} = \sum_{i= 1}^{r}\sum_{j = 1}^{c}\frac{O_{ij}^{2}}{E_{ij}} - N$</strong></p>

Where, <br>
$O_{ij}$: Observed frequency for category (i,j) <br>
$E_{ij}$: Expected frequency for category (i,j)<br>
$N$: Total number of observations

Under $H_{0}$, the test statistic follows a chi-square distribution with $(r-1)(c-1)$ degrees of freedom.
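The formula can be checked numerically. The sketch below uses a hypothetical 2 x 3 table of counts (not data from this notebook), builds the expected frequencies as (row total x column total) / N, and confirms that the shortcut form $\sum O_{ij}^{2}/E_{ij} - N$ gives the same value as the more familiar $\sum (O_{ij} - E_{ij})^{2}/E_{ij}$.

```python
import numpy as np
from scipy import stats

# hypothetical 2 x 3 table of observed counts (illustration only)
observed = np.array([[30, 20, 10],
                     [20, 25, 35]])

N = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# expected frequency under independence: (row total * column total) / N
expected = row_totals * col_totals / N

# shortcut form of the statistic, and the equivalent sum of (O - E)^2 / E
chi2_shortcut = (observed ** 2 / expected).sum() - N
chi2_classic = ((observed - expected) ** 2 / expected).sum()

# degrees of freedom and right-tail p-value
r, c = observed.shape
dof = (r - 1) * (c - 1)
p_value = stats.chi2.sf(chi2_shortcut, df=dof)

print(chi2_shortcut, chi2_classic, dof, p_value)
```

The two forms agree because the observed and expected frequencies both sum to $N$.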

In [22]:
df = pd.read_csv('students_data.csv')
df.head()

Unnamed: 0,gender,ethnicity,education,lunch,test_prep_course,math_score,reading_score,writing_score,total_score,training_institute
0,female,group B,bachelor's degree,standard,none,89,55,56,200,Nature Learning
1,female,group C,college,standard,completed,55,63,72,190,Nature Learning
2,female,group B,master's degree,standard,none,64,71,56,191,Nature Learning
3,male,group A,associate's degree,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,college,standard,none,75,66,51,192,Nature Learning


**Check the relation between gender and test_prep_course at a 95% confidence level.**

In [None]:
# Ho : No relation between the categorical features (they are independent)
# Ha : Significant relation between the categorical features (they are dependent)

In [23]:
obs_val = pd.crosstab(df['gender'],df['test_prep_course'])
obs_val

test_prep_course,completed,none
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
female,184,333
male,175,308


In [26]:
stats.chi2_contingency(obs_val) # Test of independence (for a 2x2 table, Yates' continuity correction is applied by default)

(0.02117195064612695,
 0.8843115019893263,
 1,
 array([[185.603, 331.397],
        [173.397, 309.603]]))

In [25]:
test_stat,pval,dof,exp = stats.chi2_contingency(obs_val)
print('Test stat:',test_stat)
print('pval:',pval)
print('Degrees of freedom:',dof)
print('Expected:',exp)

Test stat: 0.02117195064612695
pval: 0.8843115019893263
Degrees of freedom: 1
Expected: [[185.603 331.397]
 [173.397 309.603]]


In [27]:
pval = 0.8843115019893263
sig_lvl = 0.05
if pval<sig_lvl:
    print('Ha is selected')
else:
    print('Ho is selected')

Ho is selected


In [None]:
# No significant relation between gender and test_prep_course was found
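For a 2 x 2 table, `stats.chi2_contingency` applies Yates' continuity correction by default, so the statistic it returns is smaller than the uncorrected $\chi^{2}$ value. A short sketch comparing the two, using the counts from the crosstab above (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

# observed counts from the gender x test_prep_course crosstab above
observed = np.array([[184, 333],
                     [175, 308]])

# default behaviour: Yates' continuity correction for 2 x 2 tables
stat_corrected, p_corrected, dof, expected = stats.chi2_contingency(observed)

# same test without the continuity correction
stat_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

print("with correction   :", stat_corrected, p_corrected)
print("without correction:", stat_plain, p_plain)
```

Both versions lead to the same decision here, since both p-values are far above 0.05.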

###  Practice:

**Check the relation between gender and lunch at a 95% confidence level.**

In [28]:
obs_val = pd.crosstab(df['gender'],df['lunch'])
obs_val

lunch,free/reduced,standard
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
female,188,329
male,166,317


In [29]:
stats.chi2_contingency(obs_val) # Test of independence (for a 2x2 table, Yates' continuity correction is applied by default)

(0.35177927582800844,
 0.5531076492252226,
 1,
 array([[183.018, 333.982],
        [170.982, 312.018]]))

In [30]:
test_stat,pval,dof,exp = stats.chi2_contingency(obs_val)
print('Test stat:',test_stat)
print('pval:',pval)
print('Degrees of freedom:',dof)
print('Expected:',exp)

Test stat: 0.35177927582800844
pval: 0.5531076492252226
Degrees of freedom: 1
Expected: [[183.018 333.982]
 [170.982 312.018]]


In [31]:
pval = 0.5531076492252226
sig_lvl = 0.05
if pval<sig_lvl:
    print('Ha is selected')
else:
    print('Ho is selected')

Ho is selected


In [32]:
# No significant relation between gender and lunch was found

<a id="1way"></a>
# 3. One-way ANOVA

One-way ANOVA is used to check the equality of population means across more than two independent samples. Each group is considered a `treatment`. The test assumes that the samples are drawn from normally distributed populations; this assumption can be checked with the `Shapiro-Wilk Test`. The population variances should also be equal; this can be tested with `Levene's Test`.

The null and alternative hypotheses are given as:
<p style='text-indent:20em'> <strong> $H_{0}$: The averages of all treatments are the same. </strong></p>
<p style='text-indent:20em'> <strong> $H_{1}$: At least one treatment has a different average. </strong></p>

Consider there are `t` treatments and `N` number of total observations. The test statistic is given as:
<p style='text-indent:28em'> <strong> $F = \frac{MTrSS}{MESS} $</strong></p>

Where,<br>
MTrSS = $\frac{TrSS}{df_{Tr}}$<br>

TrSS = $\sum_{i=1}^{t}n_{i}{(\bar{x_{i}}. - \bar{x}..)}^{2}$<br> $n_{i}$ is the number of observations in $i^{th}$ treatment. <br>$\bar{x_{i}}.$ is the mean over $i^{th}$ treatment <br> $\bar{x}..$ is the grand mean (i.e. mean of all the observations). <br>

$df_{Tr}$ is the degrees of freedom for treatments (= $t-1$)

MESS = $\frac{ESS}{df_{e}}$<br>

ESS = $\sum_{i}^{t}\sum_{j}^{n_{i}}{(x_{ij} - \bar{x_{i}}.)}^{2}$

$df_{e}$ is the degrees of freedom for error (= $N-t$)

Under $H_{0}$, the test statistic follows F-distribution with ($t-1,  N-t$) degrees of freedom.
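The pieces of the F statistic can be computed step by step. The sketch below uses three small hypothetical samples (not data from this notebook): TrSS is the sum over treatments of $n_{i}$ times the squared deviation of the treatment mean from the grand mean, and ESS is the sum of squared deviations of each observation from its own treatment mean.

```python
import numpy as np
from scipy import stats

# hypothetical samples for three treatments (illustration only)
groups = [np.array([23.0, 25.0, 21.0, 27.0]),
          np.array([30.0, 28.0, 33.0, 31.0, 29.0]),
          np.array([22.0, 20.0, 24.0])]

t = len(groups)                      # number of treatments
N = sum(len(g) for g in groups)      # total number of observations
grand_mean = np.concatenate(groups).mean()

# treatment sum of squares: n_i * (treatment mean - grand mean)^2
trss = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# error sum of squares: squared deviations from each treatment's own mean
ess = sum(((g - g.mean()) ** 2).sum() for g in groups)

# mean squares and the F statistic with (t - 1, N - t) degrees of freedom
f_stat = (trss / (t - 1)) / (ess / (N - t))
p_value = stats.f.sf(f_stat, t - 1, N - t)

print("F:", f_stat, "p-value:", p_value)
```

The result should match `stats.f_oneway(*groups)`, which performs the same computation.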

Let us calculate the F values for different levels of significance ($\alpha$).
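A minimal sketch of this calculation, assuming 5 treatments and 1000 observations (illustrative values). `stats.f.isf` returns the right-tailed critical value for the given numerator and denominator degrees of freedom.

```python
from scipy import stats

t, N = 5, 1000  # assumed number of treatments and total observations, for illustration
dfn, dfd = t - 1, N - t  # degrees of freedom for treatments and for error

# right-tailed critical value: P(F > critical value) = alpha
f_critical = {alpha: stats.f.isf(alpha, dfn, dfd) for alpha in (0.10, 0.05, 0.01)}

for alpha, crit in f_critical.items():
    print(f"alpha = {alpha:.2f} -> critical F = {crit:.4f}")
```

An F statistic above the critical value for the chosen $\alpha$ leads to rejecting $H_{0}$.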

### Example:

#### 1. Total marks in aptitude exam are recorded for students with different race/ethnicity. Test whether all the races/ethnicities have an equal average score with 0.05 level of significance. 

Use the performance dataset of students available in the CSV file `students_data.csv`.

In [33]:
df = pd.read_csv('students_data.csv')
df.head()

Unnamed: 0,gender,ethnicity,education,lunch,test_prep_course,math_score,reading_score,writing_score,total_score,training_institute
0,female,group B,bachelor's degree,standard,none,89,55,56,200,Nature Learning
1,female,group C,college,standard,completed,55,63,72,190,Nature Learning
2,female,group B,master's degree,standard,none,64,71,56,191,Nature Learning
3,male,group A,associate's degree,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,college,standard,none,75,66,51,192,Nature Learning


In [34]:
df['ethnicity'].value_counts()

group C    319
group D    261
group B    190
group E    140
group A     90
Name: ethnicity, dtype: int64

In [36]:
grp_a = df[df['ethnicity'] =='group A']['total_score']
grp_b = df[df['ethnicity'] =='group B']['total_score']
grp_c = df[df['ethnicity'] =='group C']['total_score']
grp_d = df[df['ethnicity'] =='group D']['total_score']
grp_e = df[df['ethnicity'] =='group E']['total_score']

In [None]:
# Ho : All means are equal
# Ha : At least one mean is not equal

In [37]:
# Test of Normality - Shapiro

# Ho : Data is normal
# Ha : Data is not normal

print(stats.shapiro(grp_a))
print(stats.shapiro(grp_b))
print(stats.shapiro(grp_c))
print(stats.shapiro(grp_d))
print(stats.shapiro(grp_e))

ShapiroResult(statistic=0.9894436001777649, pvalue=0.6901752352714539)
ShapiroResult(statistic=0.9947066307067871, pvalue=0.7402700185775757)
ShapiroResult(statistic=0.9973903298377991, pvalue=0.8950209617614746)
ShapiroResult(statistic=0.9948431253433228, pvalue=0.5269628167152405)
ShapiroResult(statistic=0.991719126701355, pvalue=0.5859840512275696)


In [38]:
# All p-values > 0.05, so normality is not rejected for any group

In [40]:
# Test of equality of Variance - Levene's test

# Ho : All variances are equal
# Ha : At least one variance is not equal

print(stats.levene(grp_a,grp_b,grp_c,grp_d,grp_e))

LeveneResult(statistic=1.8006030590828939, pvalue=0.12649444001357793)


In [None]:
# pval> 0.05
# Ho is selected.
# Samples have equal variance

In [None]:
# more than two samples
# population standard deviation unknown
# data are normal
# groups have equal variance

# ANOVA F-test is right-tailed; f_oneway returns the right-tail p-value, so it can be used directly

In [41]:
stats.f_oneway(grp_a,grp_b,grp_c,grp_d,grp_e)

F_onewayResult(statistic=0.789109595922189, pvalue=0.5322937031083035)

In [42]:
pval = 0.5322937031083035
sig_lvl = 0.05
if pval<sig_lvl:
    print('Ha is selected')
else:
    print('Ho is selected')

Ho is selected


In [None]:
# No significant difference was found between the average scores of the ethnicity groups

### Practice:


#### 1. Ryan is a production manager at a company that manufactures alloy seals. They have 4 machines - A, B, C and D. Ryan wants to study whether all the machines have equal efficiency. Ryan collects tensile strength data from all 4 machines, as given. Perform the post-hoc test to find out which machine has a different average. Test at a 5% level of significance.

<img src='1_ANOVA.png'>

In [43]:
# 20 readings laid out as 5 rows x 4 machines; transpose so that each row holds one machine
x = np.array([68.7, 62.7, 55.9, 80.7, 75.4, 68.5, 56.1, 70.3, 70.9, 63.1, 57.3, 80.9, 79.1,
              62.2, 59.2, 85.4, 78.2, 60.3, 50.1, 82.3]).reshape(5, 4).T
grp_a = x[0, :]
grp_b = x[1, :]
grp_c = x[2, :]
grp_d = x[3, :]

In [44]:
# Ho : All means are equal
# Ha : At least one mean is not equal

In [45]:
# Test of Normality - Shapiro

# Ho : Data is normal
# Ha : Data is not normal

print(stats.shapiro(grp_a))
print(stats.shapiro(grp_b))
print(stats.shapiro(grp_c))
print(stats.shapiro(grp_d))


ShapiroResult(statistic=0.9147661328315735, pvalue=0.4967544972896576)
ShapiroResult(statistic=0.8534730076789856, pvalue=0.2057477980852127)
ShapiroResult(statistic=0.8795409202575684, pvalue=0.3072359263896942)
ShapiroResult(statistic=0.8367964029312134, pvalue=0.15625961124897003)


In [38]:
# All p-values > 0.05, so normality is not rejected for any group

In [46]:
# Test of equality of Variance - Levene's test

# Ho : All variances are equal
# Ha : At least one variance is not equal

print(stats.levene(grp_a,grp_b,grp_c,grp_d))

LeveneResult(statistic=0.3969333650936478, pvalue=0.7570021212992085)


In [None]:
# pval> 0.05
# Ho is selected.
# Samples have equal variance

In [None]:
# more than two samples
# population standard deviation unknown
# data are normal
# groups have equal variance

# ANOVA F-test is right-tailed; f_oneway returns the right-tail p-value, so it can be used directly

In [47]:
stats.f_oneway(grp_a,grp_b,grp_c,grp_d)

F_onewayResult(statistic=32.03072350199285, pvalue=5.375613532781072e-07)

In [48]:
pval = 5.375613532781072e-07
sig_lvl = 0.05
if pval<sig_lvl:
    print('Ha is selected')
else:
    print('Ho is selected')

Ha is selected


In [None]:
# At least one machine has a different mean tensile strength
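The practice question also asks for a post-hoc test. One possible approach, sketched below, is Tukey's HSD via `scipy.stats.tukey_hsd`. This is an assumption about which post-hoc method is intended, and the function requires SciPy 1.8 or later. The readings are the same values as in the array `x` above, regrouped by machine; the dictionary `machines` is introduced here for illustration.

```python
import numpy as np
from scipy import stats

# tensile strength per machine, same values as in the practice data above
machines = {
    "A": np.array([68.7, 75.4, 70.9, 79.1, 78.2]),
    "B": np.array([62.7, 68.5, 63.1, 62.2, 60.3]),
    "C": np.array([55.9, 56.1, 57.3, 59.2, 50.1]),
    "D": np.array([80.7, 70.3, 80.9, 85.4, 82.3]),
}
names = list(machines)

# Tukey's HSD compares every pair of machines while controlling the family-wise error rate
result = stats.tukey_hsd(*machines.values())

sig_lvl = 0.05
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        p = result.pvalue[i, j]  # p-value for the comparison of machine i with machine j
        verdict = "different" if p < sig_lvl else "not significantly different"
        print(f"{names[i]} vs {names[j]}: p = {p:.4f} -> {verdict}")
```

Pairs with a p-value below the significance level are the ones whose mean tensile strengths differ.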