# Hypothesis Testing

In [1]:
import pandas as pd
import numpy as np

### Question 1:

An F&B manager wants to determine whether there is any significant difference in the diameter of the cutlet between two units. A randomly selected sample of cutlets was collected from both units and measured. Analyze the data and draw inferences at the 5% significance level. Please state the assumptions and the tests that you carried out to check the validity of the assumptions.

In [2]:
cutlets = pd.read_csv('Cutlets.csv')  # data for a two-sample t-test

In [3]:
cutlets.head()

Unnamed: 0,Unit A,Unit B
0,6.809,6.7703
1,6.4376,7.5093
2,6.9157,6.73
3,7.3012,6.7878
4,7.4488,7.1522


In [4]:
cutlets.shape

(35, 2)

#stating null and alternative hypothesis

Null Hypothesis: 
    There is no significant difference in mean cutlet diameter between the two units.
    
    H0: μ1 = μ2
Alternative Hypothesis:
     There is a significant difference in mean cutlet diameter between the two units.
     
     H1: μ1 != μ2

In [5]:
from scipy import stats

In [6]:
mean1 = np.mean(cutlets['Unit A'])
mean2 = np.mean(cutlets['Unit B'])

mean1,mean2

(7.0190914285714285, 6.964297142857142)

In [7]:
std1 = np.std(cutlets['Unit A'], ddof=1)  # ddof=1 gives the sample standard deviation (ddof = delta degrees of freedom)
std2 = np.std(cutlets['Unit B'], ddof=1)

std1,std2

(0.2884084841815496, 0.343400647063108)

In [8]:
# Degrees of freedom
n1 = len(cutlets['Unit A'])
n2 = len(cutlets['Unit B'])
df = n1 + n2 - 2
df

68

In [9]:
denominator = np.sqrt((std1**2 / n1) + (std2**2 / n2))
denominator

0.07580114174625611

In [10]:
t_statistic = (mean1 - mean2) / denominator
t_statistic

0.7228688704678063
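
For reference, the statistic computed in cells 9 and 10 can be written out as below. The second expression is a short derivation added here (it is not part of the original notebook): when both samples have the same size n, as in this data with n1 = n2 = 35, the unpooled standard error used in the denominator equals the pooled one, so pairing it with n1 + n2 - 2 degrees of freedom is consistent.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
s_p^2\left(\frac{1}{n} + \frac{1}{n}\right)
= \frac{(n-1)\,s_1^2 + (n-1)\,s_2^2}{2n-2}\cdot\frac{2}{n}
= \frac{s_1^2}{n} + \frac{s_2^2}{n}.
```

With unequal sample sizes the two forms differ, and the choice between the pooled and unpooled denominator would matter.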

In [11]:
# Calculate the two-tailed p-value; abs() keeps it valid when the t-statistic is negative
p_value = 2 * (1 - stats.t.cdf(abs(t_statistic), df))
p_value

0.4722394724599501

In [12]:

alpha = 0.05


if p_value < alpha:
    print("Reject the null hypothesis: there is a significant difference between Unit A and Unit B")
else:
    print("Fail to reject the null hypothesis: there is no significant difference between Unit A and Unit B")


Fail to reject the null hypothesis: there is no significant difference between Unit A and Unit B


Alternatively, SciPy's built-in two-sample t-test gives the same result:

In [13]:
stats.ttest_ind(cutlets['Unit A'],cutlets['Unit B'])

Ttest_indResult(statistic=0.7228688704678063, pvalue=0.4722394724599501)

Since the p-value (0.472) is greater than alpha (0.05), we fail to reject the null hypothesis: there is no significant difference in diameter between the two units.
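
Question 1 also asks for the assumptions to be stated and checked. A two-sample t-test assumes independent samples, approximately normal data in each group and, for the pooled version that `stats.ttest_ind` runs by default, equal variances. The sketch below shows one possible way to check the last two with the Shapiro-Wilk and Levene tests. The helper name `check_ttest_assumptions` and the synthetic arrays are assumptions made for illustration; they are not the Cutlets.csv measurements.

```python
import numpy as np
from scipy import stats


def check_ttest_assumptions(a, b):
    """Return p-values for normality (Shapiro-Wilk) and equal variance (Levene).

    Hypothetical helper, not part of the original notebook.
    """
    _, p_normal_a = stats.shapiro(a)   # H0: sample a comes from a normal distribution
    _, p_normal_b = stats.shapiro(b)   # H0: sample b comes from a normal distribution
    _, p_equal_var = stats.levene(a, b)  # H0: both samples have equal variance
    return {
        "normal_a": float(p_normal_a),
        "normal_b": float(p_normal_b),
        "equal_var": float(p_equal_var),
    }


# Synthetic stand-in data (assumed values, NOT the real cutlet diameters)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=7.0, scale=0.3, size=35)
sample_b = rng.normal(loc=7.0, scale=0.3, size=35)

assumption_p_values = check_ttest_assumptions(sample_a, sample_b)
for name, p in assumption_p_values.items():
    # A p-value above 0.05 means the assumption is not rejected at the 5% level
    print(f"{name}: p = {p:.3f}")
```

On the real data, the same helper could be called as `check_ttest_assumptions(cutlets['Unit A'], cutlets['Unit B'])`. If the equal-variance check fails, `stats.ttest_ind(..., equal_var=False)` runs Welch's t-test instead.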

### Question 2:

A hospital wants to determine whether there is any difference in the average Turn Around Time (TAT) of reports of the laboratories on their preferred list. They collected a random sample and recorded the TAT for reports of 4 laboratories. TAT is defined as the time from sample collection to report dispatch.

Analyze the data and determine whether there is any difference in average TAT among the different laboratories at the 5% significance level.


In [14]:
TAT = pd.read_csv('LabTAT.csv')  # data for a one-way analysis of variance (ANOVA)

In [15]:
TAT

Unnamed: 0,Laboratory 1,Laboratory 2,Laboratory 3,Laboratory 4
0,185.35,165.53,176.70,166.13
1,170.49,185.91,198.45,160.79
2,192.77,194.92,201.23,185.18
3,177.33,183.00,199.61,176.42
4,193.41,169.57,204.63,152.60
...,...,...,...,...
115,178.49,170.66,193.80,172.68
116,176.08,183.98,215.25,177.64
117,202.48,174.54,203.99,170.27
118,182.40,197.18,194.52,150.87


#stating null and alternative hypothesis

Null Hypothesis: There is no significant difference in the average TAT among the four laboratories

H0: μ1 = μ2 = μ3 = μ4 
Alternative Hypothesis: At least one laboratory's average TAT differs from the others

H1: not all of μ1, μ2, μ3, μ4 are equal

In [16]:
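anova_result = stats.f_oneway(TAT['Laboratory 1'], TAT['Laboratory 2'], TAT['Laboratory 3'], TAT['Laboratory 4'])
anova_result

F_onewayResult(statistic=118.70421654401437, pvalue=2.1156708949992414e-57)

In [17]:
alpha = 0.05

# Use the ANOVA p-value here, not the p_value left over from the t-test in Question 1
if anova_result.pvalue < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Reject the null hypothesis


Since the p-value (about 2.1e-57) is far below alpha (0.05), we conclude that there is a significant difference in the average TAT among the four laboratories: at least one laboratory's mean differs from the others.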
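
To make the F statistic less of a black box, the sketch below computes a one-way ANOVA by hand (between-group and within-group sums of squares) and compares it with `stats.f_oneway`. The function name `one_way_f` and the synthetic groups are assumptions for illustration; the values are not taken from LabTAT.csv.

```python
import numpy as np
from scipy import stats


def one_way_f(*groups):
    """Manual one-way ANOVA: returns (F statistic, p-value). Hypothetical helper."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                            # number of groups
    n_total = sum(len(g) for g in groups)      # total number of observations
    grand_mean = np.concatenate(groups).mean()

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    p_val = stats.f.sf(f_stat, df_between, df_within)  # upper-tail probability
    return f_stat, p_val


# Synthetic stand-in data (assumed values, NOT the real laboratory TATs)
rng = np.random.default_rng(1)
toy_groups = [rng.normal(loc=m, scale=15.0, size=30) for m in (180, 180, 200, 165)]

f_manual, p_manual = one_way_f(*toy_groups)
f_scipy, p_scipy = stats.f_oneway(*toy_groups)
print(f_manual, f_scipy)
```

A large F means the group means are spread out much more than the within-group noise would explain, which is the situation in this question.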

### Question 3:

Sales of products in four different regions are tabulated for males and females. Find out whether the male-female buyer ratios are similar across regions.

In [18]:
data=pd.read_csv('BuyerRatio.csv')

In [19]:
data

Unnamed: 0,Observed Values,East,West,North,South
0,Males,50,142,131,70
1,Females,435,1523,1356,750


#stating null and alternative hypothesis

Null Hypothesis: The observed frequencies are equal to the expected frequencies/There is no association between gender and region; the male-female buyer ratios are similar across regions.

Alternative Hypothesis: The observed frequencies are not equal to the expected frequencies/There is an association between gender and region; the male-female buyer ratios are not similar across regions.


> Since both variables are categorical, they involve categories or groups rather than continuous numerical values. The Chi-Square Test is specifically designed for categorical data.

In [20]:
observed_data = [[50, 142, 131, 70], [435, 1523, 1356, 750]] 

In [21]:
#chi square test of independence
chi2, p, dof, expected = stats.chi2_contingency(observed_data)

In [22]:
print(f"Chi-Square Test Statistic: {chi2:.3f}")
print(f"P-value:{p:.3f}")
print(f"Degrees of Freedom: {dof}")# df=(R-1)*(C-1)=((2-1)*(4-1))
print("Expected Frequencies:")
print(expected)

Chi-Square Test Statistic: 1.596
P-value:0.660
Degrees of Freedom: 3
Expected Frequencies:
[[  42.76531299  146.81287862  131.11756787   72.30424052]
 [ 442.23468701 1518.18712138 1355.88243213  747.69575948]]
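
The expected frequencies printed above come from the usual rule, expected = (row total × column total) / grand total. A minimal sketch reproducing them and the test statistic with NumPy alone, using the observed counts from this question (the variable names are chosen here for illustration):

```python
import numpy as np

# Observed counts from the BuyerRatio table: rows = Males, Females; columns = East, West, North, South
observed = np.array([[50, 142, 131, 70],
                     [435, 1523, 1356, 750]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 4)
grand_total = observed.sum()

# Expected count in each cell if gender and region were independent
expected_manual = row_totals @ col_totals / grand_total

# Pearson chi-square statistic and its degrees of freedom
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected_manual)
print(round(chi2_manual, 3), dof_manual)
```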


In [23]:
alpha = 0.05

if p < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis


Since the p-value (0.660) is greater than alpha (0.05), we find no evidence of an association between gender and region; the male-female buyer ratios are similar across regions.

### Question 4:

TeleCall uses 4 centers around the globe to process customer order forms. They audit a certain percentage of the customer order forms. Any error in an order form renders it defective, and it has to be reworked before processing. The manager wants to check whether the defective percentage varies by center. Please analyze the data at the 5% significance level and help the manager draw appropriate inferences.


In [24]:
dt=pd.read_csv('Costomer+OrderForm.csv')

In [25]:
dt

Unnamed: 0,Phillippines,Indonesia,Malta,India
0,Error Free,Error Free,Defective,Error Free
1,Error Free,Error Free,Error Free,Defective
2,Error Free,Defective,Defective,Error Free
3,Error Free,Error Free,Error Free,Error Free
4,Error Free,Error Free,Defective,Error Free
...,...,...,...,...
295,Error Free,Error Free,Error Free,Error Free
296,Error Free,Error Free,Error Free,Error Free
297,Error Free,Error Free,Defective,Error Free
298,Error Free,Error Free,Error Free,Error Free


#stating null and alternative hypothesis

Null Hypothesis: The observed frequencies are equal to the expected frequencies/The defective percentage is the same across all centers.

Alternative Hypothesis: The observed frequencies are not equal to the expected frequencies/The defective percentage differs for at least one center.

In [26]:
dt.columns

Index(['Phillippines', 'Indonesia', 'Malta', 'India'], dtype='object')

In [27]:
for i in dt.columns:
    print(i)
    print(dt[i].value_counts())
    print()

Phillippines
Error Free    271
Defective      29
Name: Phillippines, dtype: int64

Indonesia
Error Free    267
Defective      33
Name: Indonesia, dtype: int64

Malta
Error Free    269
Defective      31
Name: Malta, dtype: int64

India
Error Free    280
Defective      20
Name: India, dtype: int64



In [28]:
obs_data = [[271,267,269,280],[29,33,31,20]]
obs_data

[[271, 267, 269, 280], [29, 33, 31, 20]]
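
Typing the counts by hand works but is easy to get wrong. One possible alternative, sketched below, is to build the contingency table directly from the wide DataFrame with `melt` and `pd.crosstab`. The tiny `toy` frame and its column names are made up for illustration; they stand in for `dt`, which has the same shape of data (one column per center, one status per row).

```python
import pandas as pd

# Made-up miniature of the order-form data (NOT the real audit records)
toy = pd.DataFrame({
    "Center A": ["Error Free", "Defective", "Error Free", "Error Free"],
    "Center B": ["Defective", "Defective", "Error Free", "Error Free"],
})

# Reshape to long form: one row per (center, status) observation
long_form = toy.melt(var_name="center", value_name="status")

# Count statuses per center; rows = status, columns = center
table = pd.crosstab(long_form["status"], long_form["center"])

# Fix the row order explicitly so it matches the hand-typed layout
observed_counts = table.loc[["Error Free", "Defective"]].to_numpy()
print(table)
print(observed_counts)
```

The resulting array can be passed straight to `stats.chi2_contingency`.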

In [29]:
chi2, p, dof, expected = stats.chi2_contingency(obs_data)

In [30]:
print(f"Chi-Square Test Statistic: {chi2:.3f}")
print(f"P-value:{p:.3f}")
print(f"Degrees of Freedom: {dof}")# df=(R-1)*(C-1)=((2-1)*(4-1))
print("Expected Frequencies:")
print(expected)

Chi-Square Test Statistic: 3.859
P-value:0.277
Degrees of Freedom: 3
Expected Frequencies:
[[271.75 271.75 271.75 271.75]
 [ 28.25  28.25  28.25  28.25]]


In [31]:
alpha = 0.05

if p < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis


Since the p-value (0.277) is greater than alpha (0.05), we find no significant evidence that the defective percentage differs across centers; the observed frequencies are consistent with the expected frequencies.