In [1]:
import numpy as np
import pandas as pd

from scipy import stats

In [3]:
# How to calculate a cumulative probability P(Z <= z) from a z-score
# (a p-value is derived from this tail probability)
# Syntax : stats.norm.cdf(z-score)

stats.norm.cdf(1.2)

0.8849303297782918
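`stats.norm.cdf` returns the cumulative probability P(Z <= z), which is not yet a p-value. A minimal sketch of how the tail probabilities used in a test could be derived from the same z-score (the variable names are illustrative, not from the original notebook):

```python
from scipy import stats

z = 1.2

# Cumulative probability P(Z <= z), the value computed in the cell above
left_tail = stats.norm.cdf(z)

# Upper-tail probability P(Z >= z); sf is the survival function, 1 - cdf
right_tail = stats.norm.sf(z)

# Two-sided p-value: probability of a value at least this extreme in either tail
p_two_sided = 2 * stats.norm.sf(abs(z))

print(left_tail, right_tail, p_two_sided)
```

Using `sf` instead of `1 - cdf` tends to be more accurate for large z-scores, where `cdf` is very close to 1.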

**Case study 1 :** The lengths of 25 fabric samples are measured at random. The historic mean (i.e., the population mean) and standard deviation are 150 cm and 5.5 cm respectively. The company manager wants to test whether the current mean is equal to the historic mean. Assume a level of significance of alpha = 0.05 (5%).

**Step 1 : Formation of H0, H1**

- H0 : mean ==  150 cm (No action)
- H1 : mean != 150 cm  (action)

**Step 2 : Check the level of significance**

- alpha = 0.05

**Step 3 : Check for Normality**

In [4]:
df1 = pd.read_excel("fabric data.xlsx")
df1.head()

Unnamed: 0,Fabric_length
0,151.2
1,160.3
2,147.5
3,149.2
4,159.2


In [7]:
df1["Fabric_length"].skew()

0.29650632012590666
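A skewness close to 0 suggests a roughly symmetric distribution, but it is not a formal normality test. One possible complement is the Shapiro-Wilk test from SciPy. This is a sketch of a swapped-in technique, not the notebook's own method, and it uses only the five rows displayed by `df1.head()`, so it illustrates the call rather than the result for the full 25 rows:

```python
import pandas as pd
from scipy import stats

# Assumption: only the five rows shown by df1.head(); the real file has 25 rows
sample = pd.Series([151.2, 160.3, 147.5, 149.2, 159.2], name="Fabric_length")

skewness = sample.skew()

# Shapiro-Wilk: H0 is that the data come from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(sample)

alpha = 0.05
# Failing to reject H0 means no evidence against normality at this alpha
looks_normal = shapiro_p > alpha

print(skewness, shapiro_stat, shapiro_p, looks_normal)
```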

**Step 4 : Select the statistical test and check p-value**

- 1-sample Z-test with two-tailed rejection (the population standard deviation is given, so the Z-test is selected)

In [8]:
from statsmodels.stats.weightstats import ztest
ztest(df1["Fabric_length"],value=150, alternative = "two-sided", ddof=0)

(4.581553935379281, 4.615335331740764e-06)

**p-value = 4.615335331740764e-06**
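Note that `ztest` from statsmodels estimates the standard deviation from the sample and has no argument for a known population value, so the given 5.5 cm is not used in the call above. A minimal sketch of a Z-test that uses the known standard deviation directly is shown here. The helper name `z_test_known_sigma` is hypothetical, and the data are only the five rows displayed by `df1.head()`, so the numbers will differ from the full 25-row result:

```python
import math
from scipy import stats

def z_test_known_sigma(values, pop_mean, pop_sd):
    """Two-sided one-sample Z-test with a known population standard deviation."""
    n = len(values)
    sample_mean = sum(values) / n
    # Standard error uses the known population SD, not the sample SD
    standard_error = pop_sd / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Assumption: only the five rows shown by df1.head()
lengths = [151.2, 160.3, 147.5, 149.2, 159.2]
z_stat, p_value = z_test_known_sigma(lengths, pop_mean=150, pop_sd=5.5)
print(z_stat, p_value)
```

With the full column, `df1["Fabric_length"].tolist()` could be passed in place of `lengths`.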

**Step 5 : Based on p-value, Accept or Reject H0**

- p < alpha
- 4.615335331740764e-06 < 0.05
- p low -- Null go
- Reject H0 
- Accept H1 --> mean != 150 (Action required)

**Case Study 2 :** The mean diameter of the manufactured bolts should be 10 mm for them to fit into the nut. 20 samples are taken at random from the production line by a quality inspector. Conduct a test to check with 90% confidence that the mean is not different from the specification value.

**Step 1 : Formation of H0, H1**

- H0 : mean ==  10 mm (No action)
- H1 : mean != 10 mm  (action)

**Step 2 : Check the level of significance**

- alpha = 0.1

**Step 3 : Check for Normality**

In [11]:
df2 = pd.read_excel("Bolt diameter.xlsx")
df2.head()

Unnamed: 0,Diameter
0,10.22
1,11.0
2,10.98
3,12.26
4,10.96


In [12]:
df2["Diameter"].skew()

0.009738124305753312

**Step 4 : Select the statistical test and check p-value**

- 1-sample t-test with two-tailed rejection (the population standard deviation is not given, so the t-test is selected)

In [14]:
from scipy import stats

ttest = stats.ttest_1samp(df2["Diameter"],popmean=10,alternative="two-sided")
ttest

TtestResult(statistic=4.994230312195523, pvalue=8.05307069702133e-05, df=19)
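To make the reported statistic less of a black box, the sketch below recomputes the t-statistic by hand as (sample mean - hypothesised mean) / (sample SD / sqrt(n)) and compares it with `stats.ttest_1samp`. It assumes only the five rows displayed by `df2.head()`, so the values will not match the 20-row output above:

```python
import numpy as np
from scipy import stats

# Assumption: only the five rows shown by df2.head()
diameters = np.array([10.22, 11.0, 10.98, 12.26, 10.96])
n = diameters.size

# Sample standard deviation (ddof=1) because the population SD is unknown
sample_sd = diameters.std(ddof=1)
t_manual = (diameters.mean() - 10) / (sample_sd / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(diameters, popmean=10, alternative="two-sided")
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```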

**Step 5 : Based on p-value, Accept or Reject H0**

- p < alpha
- 8.05307069702133e-05 < 0.1
- p low -- Null go
- Reject H0 
- Accept H1 --> mean != 10mm (Action required)

# ANOVA (F-test)

**Case study 3 : Credit card promotion**

A financial analyst at a financial institute wants to evaluate a recent credit card promotion. After this promotion, 500 cardholders were randomly selected. Half received an ad promoting a full waiver of the interest rate on purchases made over the next three months, and half received a standard Christmas advertisement. Did the ad promoting the full interest rate waiver increase purchases?

**Step 1 : Formation of H0, H1**

- H0 : No difference (Avg of purchases made by FIW == Avg purchases made by SC)
- H1 : difference    (Avg of purchases made by FIW != Avg purchases made by SC)

**Step 2 : Check the level of significance**

- alpha = 0.05

**Step 3 : Check for Normality**

In [15]:
df3 = pd.read_excel("Promotion.xlsx")
df3.head()

Unnamed: 0,InterestRateWaiver,StandardPromotion
0,1989.1,1272.25
1,1808.38,1250.38
2,1153.75,1474.78
3,1745.64,2064.89
4,1008.24,2030.87


In [16]:
df3["InterestRateWaiver"].skew(), df3["StandardPromotion"].skew()

(-0.186848927819777, 0.1556560407787534)

**Step 4 : Select the statistical test and check p-value**

- ANOVA test --> Analysis of variance (F-test) (here we have multiple columns holding a continuous variable, so the ANOVA test is selected)

In [17]:
anova_test = stats.f_oneway(df3["InterestRateWaiver"],df3["StandardPromotion"])
anova_test

F_onewayResult(statistic=5.109521918142664, pvalue=0.02422584468584416)
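With only two groups, one-way ANOVA is equivalent to a two-sample t-test with equal variances (F equals t squared, and the two-sided p-values match). Because the question asks whether the waiver increased purchases, a one-sided t-test is one option worth considering. A sketch under the assumption that only the five rows displayed by `df3.head()` are used, so the numbers are illustrative only:

```python
import numpy as np
from scipy import stats

# Assumption: only the five rows shown by df3.head()
waiver = np.array([1989.1, 1808.38, 1153.75, 1745.64, 1008.24])
standard = np.array([1272.25, 1250.38, 1474.78, 2064.89, 2030.87])

anova = stats.f_oneway(waiver, standard)

# Pooled-variance t-test: its squared statistic equals the ANOVA F statistic
t_two_sided = stats.ttest_ind(waiver, standard, equal_var=True)

# One-sided alternative: mean purchases with the waiver are greater
t_greater = stats.ttest_ind(waiver, standard, equal_var=True, alternative="greater")

print(anova.statistic, t_two_sided.statistic ** 2)
print(anova.pvalue, t_two_sided.pvalue)
print(t_greater.pvalue)
```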

**Step 5 : Based on p-value, Accept or Reject H0**

- p < alpha
- 0.02422584468584416 < 0.05
- p low -- Null go
- Reject H0 
- Accept H1 --> there is a difference

# 1-proportion test

**Case study 4 :** A poll is carried out to find the acceptability of a new football coach. It was decided that if the support rate for the coach across the entire population was truly less than 90%, the coach would be fired. 2000 people participated and 496 people supported the new coach. Conduct a test to check whether the new coach should be fired at a 95% level of confidence.

**Step 1 : Formation of H0, H1**

- H0 : support rate >= 90% --> Coach not to be fired (No action)
- H1 : support rate < 90%  --> Coach to be fired     (action)

**Step 2 : Level of significance**

- alpha = 0.05

**Step 3 : Check for Normality** (For a discrete variable we will not check for skewness)

**Step 4 : Select the statistical test and check p-value**

- 1-proportion test (here we selected the 1-proportion test because we have a single column holding a discrete variable)

In [22]:
binom_test = stats.binomtest(496,2000,0.9)
binom_test

BinomTestResult(k=496, n=2000, alternative='two-sided', statistic=0.248, pvalue=0.0)
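The decision rule in the case study is one-directional (fire the coach only if support is below 90%), so a one-sided alternative may describe it more closely than the default two-sided test. A minimal sketch using the same counts, with `alternative="less"` and the confidence interval from the result object:

```python
from scipy import stats

# H0: support rate >= 0.9, H1: support rate < 0.9
result_less = stats.binomtest(k=496, n=2000, p=0.9, alternative="less")

# For a one-sided "less" test the interval has only an upper bound (lower bound is 0)
ci = result_less.proportion_ci(confidence_level=0.95)

print(result_less.statistic, result_less.pvalue)
print(ci.low, ci.high)
```

If the upper bound of the interval lies below 0.9, that agrees with rejecting H0.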

**Step 5 : Based on p-value, Accept or Reject H0**

- p < alpha
- 0.0 < 0.05
- p low -- Null go
- Reject H0 
- Accept H1 --> Coach to be fired (take Action)

# Chi-square test

**Case study 5 :** The sales manager of the Johmmie Talkers soft drinks division has been planning to launch a new sales incentive program for the sales executives. The sales executives felt that adults (>40 yrs) won't buy while children will, and hence requested the sales manager not to launch the program. Analyze the data and determine whether there is evidence at the 5% significance level to support the hypothesis.

**Step 1 : Formation of H0, H1**

- H0 : Proportion of adults who purchase == Proportion of children who purchase (No difference)
- H1 : Proportion of adults who purchase != Proportion of children who purchase (difference)

**Step 2 : Level of significance**

- alpha = 0.05

**Step 3 : Check for Normality** (For a discrete variable we will not check for skewness)

**Step 4 : Select the statistical test and check p-value**

- Chi-Square test (here we selected the chi-square test because we have multiple columns holding discrete variables)

In [24]:
df5 = pd.read_excel("JohnyTalkers.xlsx")
df5.head()

Unnamed: 0,Person,Drinks
0,Adults,Did Not Purchase
1,Adults,Did Not Purchase
2,Adults,Did Not Purchase
3,Adults,Did Not Purchase
4,Adults,Did Not Purchase


In [25]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1220 entries, 0 to 1219
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Person  1220 non-null   object
 1   Drinks  1220 non-null   object
dtypes: object(2)
memory usage: 19.2+ KB


In [26]:
df5["Person"].unique()

array(['Adults', 'Children'], dtype=object)

In [27]:
df5["Person"].value_counts()

Person
Children    740
Adults      480
Name: count, dtype: int64

In [28]:
df5["Drinks"].unique()

array(['Did Not Purchase', 'Purchased'], dtype=object)

In [30]:
df5["Drinks"].value_counts()

Drinks
Did Not Purchase    1010
Purchased            210
Name: count, dtype: int64

In [32]:
count=pd.crosstab(df5["Person"],df5["Drinks"],margins=True)
count

Person / Drinks,Did Not Purchase,Purchased,All
Adults,422,58,480
Children,588,152,740
All,1010,210,1220


In [33]:
chisquare_test = stats.chi2_contingency(count)
chisquare_test

Chi2ContingencyResult(statistic=14.613224681541515, pvalue=0.005574456386286158, dof=4, expected_freq=array([[ 397.37704918,   82.62295082,  480.        ],
       [ 612.62295082,  127.37704918,  740.        ],
       [1010.        ,  210.        , 1220.        ]]))
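The `count` table was built with `margins=True`, so it includes the `All` row and column. Passing it to `chi2_contingency` treats the totals as extra categories, which is why the output reports `dof=4` instead of the 1 degree of freedom of a 2x2 table. The statistic itself is unaffected, since the totals match their expected values exactly, but the p-value is computed against the wrong degrees of freedom. A sketch of the test on the four inner counts only, copied from the crosstab above:

```python
import numpy as np
from scipy import stats

# Rows: Adults, Children; columns: Did Not Purchase, Purchased (no totals)
observed = np.array([[422, 58],
                     [588, 152]])

# correction=False gives the plain Pearson chi-square statistic
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# By default SciPy applies Yates' continuity correction to 2x2 tables
chi2_yates, p_yates, dof_yates, _ = stats.chi2_contingency(observed)

print(chi2, p, dof)
print(chi2_yates, p_yates, dof_yates)
```

With the DataFrame at hand, the same table could be obtained by calling `pd.crosstab(df5["Person"], df5["Drinks"])` without `margins=True`.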

**Step 5 : Based on p-value, Accept or Reject H0**

- p < alpha
- 0.005574456386286158 < 0.05
- p low -- Null go
- Reject H0 
- Accept H1 --> there is a difference