In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

### About P Value

* The p-value is the probability of observing a test statistic at least as extreme as the one computed from the sample, assuming that Ho is true. It is not the probability that Ho is correct.
* If the p-value > alpha, the results are not statistically significant and we fail to reject Ho.

* If, on the other hand, the p-value is less than alpha, the results are statistically significant and we reject Ho.

* In other words, a smaller p-value means stronger evidence against Ho and in favour of Ha.

* The lower the p-value, the greater the statistical significance, and vice versa.
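
The decision rule above can be sketched in a few lines. The observed statistic `t_obs`, the degrees of freedom `df` and the helper name `decide` are hypothetical, chosen only for illustration; they do not come from the datasets used in this notebook.

```python
from scipy import stats

def decide(pvalue, alpha):
    """Apply the decision rule: reject Ho only when pvalue < alpha."""
    return "Reject Ho" if pvalue < alpha else "Fail to reject Ho"

# Hypothetical observed t statistic and degrees of freedom (assumed values)
t_obs, df = 2.1, 30

# Right-tail p-value: P(T >= t_obs) when Ho is true
p_right = stats.t.sf(t_obs, df)
# Two-tail p-value: both tails at least as extreme as |t_obs|
p_two = 2 * stats.t.sf(abs(t_obs), df)

for alpha in (0.05, 0.01):
    print(alpha, decide(p_right, alpha), decide(p_two, alpha))
```

The same statistic can lead to different decisions depending on alpha and on whether the test is one-tailed or two-tailed, which is why both are fixed before looking at the data.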

The training institute Nature Learning claims that the students trained in their institute perform better overall than the students trained in their competitor institute, Speak Global Learning. We have sample data for 500 students from each institute along with their **total score**, collected from independent normal populations. Frame a hypothesis and test Nature Learning's claim with 99% confidence.


* **Step 01: Define the Null Hypothesis**
Ho: Total Score(NL) = Total Score(Speak GL), i.e. μ1 - μ2 = 0 or μ1 = μ2

* **Step 02: Define the Alternate Hypothesis**
Ha: Total Score(NL) > Total Score(Speak GL), i.e. μ1 > μ2 (right-tailed test)

* **Step 03: Define the Level of Significance and Alpha**
We will use a 0.99 confidence level to test the claim, i.e. alpha = 1 - 0.99 = 0.01

In [2]:
# Import the Datasets

student = pd.read_csv("~/Downloads/StudentsPerformance.csv")

student.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning
2,female,group B,standard,none,64,71,56,191,Nature Learning
3,male,group A,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,standard,none,75,66,51,192,Nature Learning


In [3]:
# Generate the Total Scores of Nature Learning & Speak Global...
nature = student.loc[student["training institute"]=="Nature Learning", 
                    "total score"]

speakgl = student.loc[student["training institute"]=="Speak Global Learning", 
                    "total score"]

### Points to Remember

* Data should be normal ~ Shapiro-Wilk test of normality
Note: if the sample size is > 5000, use the Jarque-Bera test instead.

* The groups should have equal variance (equality of variance) ~ Levene's test

Note: Both assumptions should hold for the t-test to be applied. If these assumptions are not met, we apply a non-parametric test ~ **Mann-Whitney U Test**

* The non-parametric alternative to Anova ~ Kruskal-Wallis test
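
The checklist above can be sketched as a small helper. The function name `compare_two_groups` and the synthetic exponential samples are hypothetical; the helper only mirrors the rule stated in these notes (t-test when both checks pass, Mann-Whitney U otherwise) and is not a definitive workflow.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Run the assumption checks, then pick the test (hypothetical helper)."""
    # Shapiro-Wilk: Ho is "data is normal", so a p-value above alpha passes
    normal = (stats.shapiro(a).pvalue > alpha) and (stats.shapiro(b).pvalue > alpha)
    # Levene: Ho is "equal variances", so a p-value above alpha passes
    equal_var = stats.levene(a, b).pvalue > alpha
    if normal and equal_var:
        return "t-test", stats.ttest_ind(a, b).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(a, b, alternative="two-sided").pvalue

# Synthetic, strongly skewed samples (assumed data, not from the notebook)
rng = np.random.default_rng(0)
skewed_a = rng.exponential(scale=1.0, size=200)
skewed_b = rng.exponential(scale=1.5, size=200)

test_name, pvalue = compare_two_groups(skewed_a, skewed_b)
print(test_name, pvalue)
```

Because the samples are far from normal, the helper is expected to fall back to the non-parametric test here.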

In [5]:
# Shapiro
# Ho: Data is Normal
# Ha: Data is Not Normal
print(stats.shapiro(nature)) # test statistic, pvalue
print(stats.shapiro(speakgl))
# Inference: If the PValue<Alpha => Reject the Ho else 
# We Fail to Reject the Ho

ShapiroResult(statistic=0.997671365737915, pvalue=0.7214037179946899)
ShapiroResult(statistic=0.9957306981086731, pvalue=0.19211649894714355)


In [6]:
# Levene's Test
# Ho: That the Samples have Equal Variance
# Ha: That the Samples DONT have Equal Variance

stats.levene(nature, speakgl)

LeveneResult(statistic=0.6422721347822817, pvalue=0.42307998325221574)

In [7]:
# Two Sample TTest

xbar1 = np.mean(nature)
xbar2 = np.mean(speakgl)

std1 = np.std(nature, ddof = 1)
std2 = np.std(speakgl, ddof = 1)

n1 = len(nature)
n2 = len(speakgl)

# Ho: mu1 = mu2
# Ha: mu1 > mu2 (Right Tail Test)

# Calculate the Test Statistic
num = (xbar1 - xbar2) - 0
deno = np.sqrt((std1**2/n1) + (std2**2/n2))
teststats = num/deno
print("Test Stats: ", teststats)

Test Stats:  0.1512551125394991
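
As a cross-check of the formula used above, the sketch below wraps it in a hypothetical helper `two_sample_t` and compares it with `stats.ttest_ind` on synthetic data. The formula uses the unpooled standard error, so it always matches the Welch statistic (`equal_var=False`); with equal group sizes, as assumed here, it also matches the default pooled statistic.

```python
import numpy as np
from scipy import stats

def two_sample_t(a, b):
    """t statistic for Ho: mu1 - mu2 = 0 using the unpooled standard error."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

# Synthetic scores with equal group sizes (assumed data, not from the notebook)
rng = np.random.default_rng(42)
group1 = rng.normal(200, 40, size=50)
group2 = rng.normal(195, 40, size=50)

manual = two_sample_t(group1, group2)
welch = stats.ttest_ind(group1, group2, equal_var=False).statistic
pooled = stats.ttest_ind(group1, group2).statistic
print(manual, welch, pooled)
```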


In [11]:
# PValue ~ cdf and sf
# cdf returns the probability to the left of the test statistic,
# whereas sf returns the probability to the right.
# For a right-tailed test the p-value is therefore 1 - cdf, or equivalently sf
print("PValue_CDF:", 1 - stats.t.cdf(teststats, df = (n1+n2)-2))
print("PValue_SF:", stats.t.sf(teststats, df = (n1+n2)-2))

PValue_CDF: 0.43990252216938375
PValue_SF: 0.43990252216938386
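
The two results above agree to about 15 digits, but the two routes are not always interchangeable. The sketch below uses a hypothetical, very large statistic `z` to show that `1 - cdf` can lose the whole tail probability to floating-point rounding, while `sf` keeps it.

```python
from scipy import stats

z = 10.0  # hypothetical, very large test statistic

# cdf(z) rounds to 1.0 in double precision, so the subtraction loses the tail
p_from_cdf = 1 - stats.norm.cdf(z)
# sf computes the right-tail area directly and stays accurate
p_from_sf = stats.norm.sf(z)

print(p_from_cdf, p_from_sf)
```

For moderate statistics either form works; for extreme ones `sf` is the safer choice.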


In [16]:
1 - stats.norm.cdf(0.01)

0.4960106436853684

In [13]:
# Direct Function...
stats.ttest_ind(nature, speakgl, alternative = "greater")

Ttest_indResult(statistic=0.15125511253949914, pvalue=0.43990252216938386)

A survey claims that in a math test female students tend to score fewer marks than the average marks of 75 out of 100. Consider a sample of 24 female students and perform a hypothesis test to check the claim with 90% confidence.


* **Step01: Define the Null Hypothesis**
* Ho: Avg Marks >= 75 | Avg Marks = 75 (both expressions are acceptable)


* **Step02: Define the Alternate Hypothesis**
* Ha: Avg Marks < 75

Note: Test the claim at a 0.90 confidence level, i.e. alpha = 0.10.

In [17]:
maths = pd.read_csv("~/Downloads/mathscore_1ttest.csv")
maths.head()

maths.shape

(24, 9)

In [12]:
maths.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group C,standard,none,60,72,74,206,Nature Learning
1,female,group C,standard,none,59,72,68,199,Nature Learning
2,female,group E,standard,none,100,100,100,300,Speak Global Learning
3,female,group D,standard,none,69,74,74,217,Speak Global Learning
4,female,group A,free/reduced,none,47,59,50,156,Speak Global Learning


In [19]:
# Ho: Female students score 75 on average
# Ha: Female students score < 75 on average

# One Sample TTest
stats.ttest_1samp(maths["math score"], 75, alternative="less")

# If the p-value < alpha, we reject the Ho
# Else we fail to reject the Ho

Ttest_1sampResult(statistic=-3.6067380757023204, pvalue=0.0007426613957678669)
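
What `ttest_1samp` computes for a left-tailed test can be sketched by hand. The `scores` array below is hypothetical (it is not the `mathscore_1ttest.csv` data); the point is only to show the formula t = (x̄ - μ0) / (s / √n) and the left-tail p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical math scores (assumed values for illustration)
scores = np.array([62, 58, 91, 70, 49, 73, 66, 55, 80, 68], dtype=float)
mu0 = 75  # hypothesised mean under Ho

n = len(scores)
# One-sample t statistic with the sample standard deviation (ddof=1)
t_manual = (scores.mean() - mu0) / (scores.std(ddof=1) / np.sqrt(n))
# Left-tail p-value for Ha: mean < mu0
p_manual = stats.t.cdf(t_manual, df=n - 1)

result = stats.ttest_1samp(scores, mu0, alternative="less")
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```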

The teachers' association claims that the total score of the students who completed the test preparation course is different than the total score of the students who have not completed the course. The sample data consists of 15 students who completed the course and 18 students who have not completed the course. Test the association's claim with ⍺ = 0.05. **totalmarks_2ttest**


* **Step01: Define the Null Hypothesis**
* Ho: TotalScore(Completed Test Prep) = Total Score(Not Completed the Test Prep)


* **Step02: Define the Alternate Hypothesis**
* Ha: TotalScore(Completed Test Prep) != Total Score(Not Completed the Test Prep)

Note: This is a two-tailed test. Because we only have sample data (the population standard deviation is unknown), we apply the t-test.

Note: Test the claim at a 0.95 confidence level (alpha = 0.05).

In [20]:
testprep = pd.read_csv("~/Downloads/totalmarks_2ttest.csv")
testprep.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,male,group E,standard,completed,84,83,78,245,Speak Global Learning
1,male,group C,free/reduced,completed,79,77,75,231,Speak Global Learning
2,male,group A,standard,none,91,96,92,279,Nature Learning
3,female,group B,free/reduced,completed,76,94,87,257,Speak Global Learning
4,male,group A,standard,completed,46,41,43,130,Nature Learning


In [23]:
# Split the Data in Completed and None...

complete = testprep.loc[testprep["test preparation course"]=="completed", 
                        "total score"]

none = testprep.loc[testprep["test preparation course"]=="none", 
                        "total score"]

In [26]:
# Normality Test and Levene Test
print("Shapiro: ", stats.shapiro(complete), stats.shapiro(none))
print("Levene: ", stats.levene(complete, none))

Shapiro:  ShapiroResult(statistic=0.9055536389350891, pvalue=0.11574102193117142) ShapiroResult(statistic=0.9481862187385559, pvalue=0.3972780704498291)
Levene:  LeveneResult(statistic=0.045113770764648356, pvalue=0.8331854285659768)


In [27]:
# Two Sample T Test
stats.ttest_ind(complete, none)

# Inf: The p-value (0.16) > alpha (0.05), so we fail to reject the Ho.
# There is no significant evidence of a difference in the performance of both groups

Ttest_indResult(statistic=1.4385323319823262, pvalue=0.16030339806989594)

### Two Sample T Test

* The two-sample t-test comes in two forms. The first is the **Independent Samples T-Test**.

* Independent samples t-tests are t-tests where the samples are not related; they are independent in nature. For example, a doctor testing a drug gives the medicine to one group and a placebo to another group. These groups are not related.

* VLCC: VLCC tries a particular diet plan on one group, and the same diet plan is tried on another group in a different city. The people involved are not related to each other, so comparing the measured results is an example of an independent samples t-test.


* The **Related T-Test, aka Paired T-Test**, is the other form, where we check the before-and-after effect on the same sample. The subjects are the same in both measurements, which helps us see whether there is some cause-and-effect relationship...
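
To see why the distinction matters, the sketch below analyses the same before/after scores both ways. The data is synthetic: 17 hypothetical subjects whose scores vary a lot between subjects but improve by a small, consistent amount. Treating the samples as independent ignores the pairing and can hide that improvement.

```python
import numpy as np
from scipy import stats

# Synthetic before/after scores for the same 17 subjects (assumed data)
rng = np.random.default_rng(7)
before = rng.normal(60, 10, size=17)
after = before + rng.normal(2, 1, size=17)  # small, consistent gain per subject

# Ha in both calls: mean(after) > mean(before)
p_independent = stats.ttest_ind(after, before, alternative="greater").pvalue
p_paired = stats.ttest_rel(after, before, alternative="greater").pvalue

print(p_independent, p_paired)
```

The paired test works on the per-subject differences, so the large between-subject spread cancels out and the p-value is much smaller.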

A training institute wants to check if their writing training program was effective or not. 17 students are selected to check the hypothesis. Consider 0.05 as the level of significance.


* **Step01: Define the Null Hypothesis**
* Ho: Training Program has no Effect on the Writing Skills of the People. Mean Difference between the Scores = 0. Mean(scores_before) = Mean(score_after)


* **Step02: Define the Alternate Hypothesis**
* Ha: Mean(scores_after) > Mean(score_before). This means that the Scores have improved.

Note: This is a right-tailed test.

Note: Because the sample holds before and after scores for the same students, we apply the paired t-test.

Note: Test the claim at a 0.95 confidence level (alpha = 0.05).

In [28]:
writing = pd.read_csv("~/Downloads/WritingScores.csv")
writing.head()

writing.columns

Index(['score_before', 'score_after'], dtype='object')

In [30]:
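# Normality Test and Levene Test
# Note: the Shapiro results are computed but not displayed here;
# only the printed Levene result appears in the output
stats.shapiro(writing.score_before), 
stats.shapiro(writing.score_after)
print(stats.levene(writing.score_before, writing.score_after))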

LeveneResult(statistic=0.4612497504491918, pvalue=0.5019236019309768)


In [31]:
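# T Test Related to check the effect of the Training Program on Writing
# Note: with score_before passed first, alternative="greater" tests before > after.
# For Ha (after > before), swap the arguments or use alternative="less";
# the p-value for that direction is 1 - 0.9154 = 0.0846
stats.ttest_rel(writing.score_before, writing.score_after,
               alternative = "greater")

# Inf: Either way the PValue > alpha (0.05), so we fail to reject the Ho
# There is no significant evidence that the scores improved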

Ttest_relResult(statistic=-1.4394882729049499, pvalue=0.9153549355186007)

An energy drink distributor claims that a new advertisement poster, featuring a life-size picture of a well-known athlete, will increase the product sales in outlets by an average of 50 bottles in a week. For a random sample of 10 outlets, the following data was collected. Test whether the advertisement was effective in increasing sales. Test the hypothesis using the critical region technique. Use α = 0.05.

Given data:
    * sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
    * sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]
    
* Ho: There is no difference in the sales irrespective of the celeb (μ1 = μ2)

* Ha: Sales increased due to the celeb (μ2 > μ1)

In [32]:
sales_before = [33, 32, 38, 45, 37, 47, 48, 41, 45]
sales_after = [42, 35, 31, 41, 37, 36, 49, 49, 48]

In [34]:
# Normality Test and Levene Test
# Note: only the last expression (Levene) is displayed; the Shapiro results are not shown
stats.shapiro(sales_before), stats.shapiro(sales_after)
stats.levene(sales_before, sales_after)

LeveneResult(statistic=0.09467455621301783, pvalue=0.7622856002535852)

In [41]:
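# Apply the Related T Test
# Note: with sales_before passed first, alternative="greater" tests before > after.
# For Ha (after > before) the p-value is 1 - 0.5389 = 0.4611
stats.ttest_rel(sales_before, sales_after, alternative="greater")

# Inf: Either way the PValue > alpha (0.05), so we fail to reject the Ho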

Ttest_relResult(statistic=-0.10085458113185983, pvalue=0.5389261426537351)
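
A paired t-test is a one-sample t-test on the per-pair differences. The sketch below uses the sales data given above, computes the statistic by hand for Ha: mean(after - before) > 0, and compares it with `ttest_rel`, passing `sales_after` first so that `alternative="greater"` matches that Ha. It is an illustration of the mechanics, not a replacement for the cell above.

```python
import numpy as np
from scipy import stats

sales_before = np.array([33, 32, 38, 45, 37, 47, 48, 41, 45])
sales_after = np.array([42, 35, 31, 41, 37, 36, 49, 49, 48])

# Per-outlet change; Ha says the mean change is positive
diff = sales_after - sales_before
n = len(diff)

# One-sample t statistic on the differences against 0
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
# Right-tail p-value for Ha: mean(diff) > 0
p_manual = stats.t.sf(t_manual, df=n - 1)

result = stats.ttest_rel(sales_after, sales_before, alternative="greater")
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

The statistic is the one shown above with its sign flipped, and the p-value is the complement of the one shown above, because the argument order is reversed.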

### One Sample Z Proportion Tests

In previous years, people believed that at most 80% of male students score more than 50 marks out of 100 in Mathematics. Perform a test to check whether this percentage is more than 80. Consider the level of significance as 0.05 **Student Performance.csv**

In [43]:
student = pd.read_csv("~/Downloads/StudentsPerformance.csv")

student.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning
2,female,group B,standard,none,64,71,56,191,Nature Learning
3,male,group A,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,standard,none,75,66,51,192,Nature Learning


In [45]:
# Find the Sample Proportion...
hyp_prop = 0.80

# Sample Prop = Count of Male Students scoring >50/total male students
count_50 = student.loc[(student.gender=="male") & 
                       (student["math score"]>50), ].shape[0]

total_males = student.loc[student.gender=="male"].shape[0]

samp_prop = count_50/total_males
print(samp_prop)

0.8757763975155279


In [52]:
# Null Hypothesis: That the Proportion is Less than Equal to 80%
# Alternate Hypothesis: That the Proportion is Greater than 80% | Right Tail Test

# Critical Value
stats.norm.isf(0.05)

1.6448536269514729

In [47]:
# Apply the Test Statistic
num = samp_prop - hyp_prop
deno = np.sqrt((hyp_prop*(1-hyp_prop))/total_males)

teststats = num/deno
print("TestStats: ", teststats)

TestStats:  4.163394160018601


In [50]:
# Pvalue
print("SF", stats.norm.sf(teststats))
print("CDF", 1 - stats.norm.cdf(teststats))

SF 1.5677570141208797e-05
CDF 1.5677570141203745e-05


In [51]:
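# Confidence Interval
# Note: scale = deno is the standard error computed under Ho (using hyp_prop);
# an interval based on the sample alone would use samp_prop in that formula instead
stats.norm.interval(0.95, loc = samp_prop, scale = deno)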

(0.8401038178124423, 0.9114489772186136)

A sample of 361 business owners had gone into bankruptcy due to the recession. A survey found that 105 of them had not consulted any professional for managing their finances before opening the business. Test the null hypothesis that at most 25% of all such business owners had not consulted a professional before opening the business. Test the claim using the p-value technique. Use α = 0.05.


* Ho: The proportion of owners who did not consult any professional finance service is at most 25% (p <= 0.25)
* Ha: The proportion of owners who did not consult any professional finance service is more than 25% (p > 0.25)

In [53]:
hyp_prop = 0.25
samp_prop = 105/361

In [54]:
num = samp_prop - hyp_prop
deno = np.sqrt((hyp_prop*(1-hyp_prop))/361)
teststats = num/deno
print("TestStats: ", teststats)

TestStats:  1.7928245201151534


In [55]:
# PValue
stats.norm.sf(teststats) # Reject the Ho

0.03650049373124949
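
The steps above can be collected into one small function. The name `one_prop_ztest` is hypothetical; the formula is the same one used in the cells above, z = (p̂ - p0) / √(p0(1 - p0)/n), for a right-tailed test. The sketch also compares z with the critical value, so the p-value technique and the critical region technique can be seen side by side.

```python
import numpy as np
from scipy import stats

def one_prop_ztest(count, nobs, p0):
    """Right-tailed one-sample z test for a proportion (hypothetical helper)."""
    p_hat = count / nobs
    # Standard error computed under Ho, i.e. with the hypothesised proportion
    se = np.sqrt(p0 * (1 - p0) / nobs)
    z = (p_hat - p0) / se
    return z, stats.norm.sf(z)

# Numbers from the bankruptcy problem above: 105 of 361, Ho: p <= 0.25
z, pvalue = one_prop_ztest(105, 361, 0.25)

alpha = 0.05
z_crit = stats.norm.isf(alpha)  # right-tail critical value
print(z, pvalue)
print("reject by p-value:", pvalue < alpha, "| reject by critical region:", z > z_crit)
```

Both techniques always give the same decision, since z > z_crit exactly when the right-tail p-value is below alpha.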

In [56]:
# Confidence
stats.norm.interval(0.95, loc = samp_prop, scale = deno)
# Since the PValue (0.0365) < Alpha (0.05), We Reject the Ho
# However, the evidence is borderline: this two-sided 95% interval still
# contains 0.25, so we need more evidence to support the claim...

(0.24619086783771343, 0.33552658368583227)
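
The interval above uses the standard error computed under Ho. A common alternative is the Wald interval, which uses the sample proportion in the standard error instead. The sketch below computes both for the same numbers (105 of 361); the variable names are hypothetical and this is only one of several interval methods for a proportion.

```python
import numpy as np
from scipy import stats

count, nobs, p0 = 105, 361, 0.25
p_hat = count / nobs

# Standard error under Ho (as in the cell above) and under the sample estimate
se_null = np.sqrt(p0 * (1 - p0) / nobs)
se_sample = np.sqrt(p_hat * (1 - p_hat) / nobs)

ci_null_se = stats.norm.interval(0.95, loc=p_hat, scale=se_null)
ci_wald = stats.norm.interval(0.95, loc=p_hat, scale=se_sample)

print(ci_null_se)
print(ci_wald)
```

Here the sample proportion is closer to 0.5 than the hypothesised one, so the Wald interval comes out slightly wider.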

A team of nutritionists believes that each institute provides 'standard' lunch to an equal proportion of students. A sample of students from institutes Nature Learning and Speak Global Learning is given. Consider the null hypothesis as equality of proportion with 0.1 level of significance.

* Ho: Both institutes provide 'standard' lunch to the same proportion of students (p1 = p2)
* Ha: The proportions of students receiving 'standard' lunch differ (p1 != p2, Two Tail Test)

In [88]:
student.head()

Unnamed: 0,gender,race/ethnicity,lunch,test preparation course,math score,reading score,writing score,total score,training institute
0,female,group B,standard,none,89,55,56,200,Nature Learning
1,female,group C,standard,completed,55,63,72,190,Nature Learning
2,female,group B,standard,none,64,71,56,191,Nature Learning
3,male,group A,free/reduced,none,60,99,72,231,Nature Learning
4,male,group C,standard,none,75,66,51,192,Nature Learning


In [89]:
student["training institute"].unique()

array(['Nature Learning', 'Speak Global Learning'], dtype=object)

In [60]:
# Count of Standard Lunch where the Institute is Nature Learning
nature_lunch_count = student.loc[(student["training institute"]=="Nature Learning")&
                                 (student.lunch=="standard"), ].shape[0] 

# Count of Standard Lunch where the Institute is Speakgl

sgl_lunch_count = student.loc[(student["training institute"]==
                               "Speak Global Learning")&
                                 (student.lunch=="standard")].shape[0] 
print(nature_lunch_count, sgl_lunch_count)

329 317


In [61]:
# Total Students in NL and sgl
nl = student.loc[student["training institute"]=="Nature Learning"].shape[0]
sgl = student.loc[student["training institute"]=="Speak Global Learning"].shape[0]

In [63]:
import statsmodels.api as sma

teststats, pvalue = sma.stats.proportions_ztest(count = [nature_lunch_count,
                                                        sgl_lunch_count],
                                               nobs=np.array([nl, sgl]))

print(pvalue)

0.4274690915859791


Steve owns a kiosk where he sells two magazines, A and B. In a month he buys 100 copies of magazine A, of which 78 are sold, and 70 copies of magazine B, of which 65 are sold. Is there enough evidence to say that magazine B is more popular? Test the claim using the p-value technique with α = 0.05.

* Ho: The proportion of copies sold is the same for both magazines (p_A = p_B)
* Ha: Magazine B sells a higher proportion of its copies (p_B > p_A)

In [64]:
# 100 copies of magazine A out of which 78 were sold
mag_a = 100
mag_b = 70
# 70 copies of magazine B out of which 65 were sold
sold_a = 78
sold_b = 65
teststats, pvalue = sma.stats.proportions_ztest(count = [sold_a, sold_b],
                                    nobs = [mag_a, mag_b])

print(pvalue)

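# Note: proportions_ztest is two-sided by default. For Ha (B > A), pass
# alternative="smaller"; that halves the p-value here because the statistic
# already points in the direction of Ha
# Since the Pvalue is less than 0.05 either way, We Reject the Ho
# It means that Mag B is indeed more popular...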

0.009099103201094605
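
The two-proportion z test can also be written out by hand, which makes the one-sided p-value for Ha explicit. The helper name `two_prop_ztest` is hypothetical; it uses the pooled proportion for the standard error, which is assumed here to be what produces the two-sided p-value shown above.

```python
import numpy as np
from scipy import stats

def two_prop_ztest(count1, nobs1, count2, nobs2):
    """z statistic for Ho: p1 = p2 using the pooled proportion (hypothetical helper)."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    return (p1 - p2) / se

# Magazine A first, magazine B second
z = two_prop_ztest(78, 100, 65, 70)

p_two_sided = 2 * stats.norm.sf(abs(z))
# Ha: p_A < p_B (B is more popular), so the p-value is the left tail
p_one_sided = stats.norm.cdf(z)

print(z, p_two_sided, p_one_sided)
```

The statistic is negative because A's sold proportion is lower than B's, so the left-tail p-value is half the two-sided one.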


# Summary of the Tests

* The fundamental difference between the tests: if the population parameters (in particular the population standard deviation) are given, use a Z test; otherwise use a T test.

* T tests come in three types: one-sample t-test, two-sample independent t-test and paired t-test.

* These tests are performed on the sample data.

* Proportion tests are generally Z proportion tests.

* Proportion tests are ratio based, so the raw data is not needed. If you are given the raw data, convert it into counts and proportions first...



### Application of T Test on the Dataset

In [2]:
sales = pd.read_csv("~/Downloads/sales.csv")

In [3]:
sales.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Profit
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8,OUT049,1999,Medium,Tier 2,Supermarket Type1,3735.138,11.5
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.3,OUT018,2009,Medium,Tier 2,Supermarket Type2,443.4228,14.3
2,FDN15,17.5,Low Fat,0.01676,Meat,141.6,OUT049,1999,Medium,Tier 2,Supermarket Type1,2097.27,14.5
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.1,OUT010,1998,,,Grocery Store,732.38,13.6
4,NCD19,8.93,Low Fat,0.0,Household,53.9,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,14.1


In [4]:
sales.loc[:, ["Item_MRP", "Item_Outlet_Sales"]].corr()

Unnamed: 0,Item_MRP,Item_Outlet_Sales
Item_MRP,1.0,0.567555
Item_Outlet_Sales,0.567555,1.0


In [5]:
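# Statistical Test to Verify if Item MRP is a Significant Feature
# For the Target....

# Ho: Item MRP and Sales have the same mean
# Ha: Ho is False

# Caution: ttest_ind compares the means of the two columns. It does not test
# whether the two variables are related; a test of association between two
# numeric variables is a correlation test such as stats.pearsonr
stats.ttest_ind(sales.Item_MRP, sales.Item_Outlet_Sales)

# Reject the Ho: the two means differ, which mostly reflects their different scales.
# On its own this does not show that MRP is a strong predictor for the Target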

Ttest_indResult(statistic=-110.30447571388359, pvalue=0.0)
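
For two numeric variables, a correlation test addresses the question of association directly. The sketch below uses `stats.pearsonr`, whose null hypothesis is that the correlation is zero. The `price` and `revenue` arrays are synthetic stand-ins, not the `sales.csv` columns.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for a price column and a sales column (assumed data)
rng = np.random.default_rng(1)
price = rng.uniform(30, 270, size=500)
revenue = 15 * price + rng.normal(0, 1500, size=500)

# Ho: the population correlation is 0 (no linear association)
r, pvalue = stats.pearsonr(price, revenue)
print(r, pvalue)
```

A small p-value here says the linear association is unlikely to be zero; the size of `r` says how strong it is.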

In [13]:
# Ho: Item Type has no Relation with the Sales
# Ha: Item Type has Relation with the Sales

import statsmodels.formula.api as sfa

model = sfa.ols("Item_Outlet_Sales~Outlet_Location_Type+Item_Type+Outlet_Identifier", 
                data = sales).fit()

In [14]:
from statsmodels.stats.anova import anova_lm
anova_lm(model)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
Outlet_Location_Type,6.0,867323200.0,144553900.0,61.448587,2.178052e-74
Item_Type,15.0,105576500.0,7038432.0,2.991976,8.327391e-05
Outlet_Identifier,9.0,3413822000.0,379313500.0,161.242868,3.170758e-276
Residual,6429.0,15123810000.0,2352436.0,,
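
For a single categorical feature, a one-way Anova can also be run directly with SciPy, alongside its non-parametric counterpart, the Kruskal-Wallis test. This is a simpler technique than the multi-factor `ols` model above, shown on synthetic groups; the names `tier1`, `tier2` and `tier3` are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic sales for three hypothetical location tiers (assumed data)
rng = np.random.default_rng(3)
tier1 = rng.normal(2200, 600, size=80)
tier2 = rng.normal(2300, 600, size=80)
tier3 = rng.normal(2900, 600, size=80)

# Ho for both tests: all groups come from the same distribution / have equal means
f_stat, p_anova = stats.f_oneway(tier1, tier2, tier3)
h_stat, p_kruskal = stats.kruskal(tier1, tier2, tier3)

print(f_stat, p_anova)
print(h_stat, p_kruskal)
```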


#### Summary of Tests

* Num vs Num(tgt) - Correlation Test (e.g. Pearson)
* Cat (2 groups) vs Num(tgt) - TTest
* Cat (3+ groups) vs Num(tgt) - Anova
* Cat vs Cat(tgt) - Chi Square Test
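
The Chi Square test is the only one in this summary not demonstrated above, so here is a minimal sketch with `stats.chi2_contingency`. The 2x2 table of counts is hypothetical; with a DataFrame, such a table would typically come from `pd.crosstab` of the two categorical columns.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = lunch type, columns = test prep status
table = np.array([[120, 80],
                  [60, 90]])

# Ho: the two categorical variables are independent
chi2, pvalue, dof, expected = stats.chi2_contingency(table)

print(chi2, pvalue, dof)
print(expected)  # counts expected under independence
```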