# Assignment 3: Hypothesis Testing

## Question 1: An F&B manager wants to determine whether there is any significant difference in the diameter of the cutlet between two units. A randomly selected sample of cutlets was collected from both units and measured. Analyze the data and draw inferences at 5% significance level. Please state the assumptions and tests that you carried out to check validity of the assumptions.

### Dataset: Cutlets.csv

### Hypotheses and Assumptions
### Null Hypothesis Ho : μ1 = μ2 (There is no difference in the mean diameter of cutlets between the two units).
### Alternate Hypothesis Ha : μ1 ≠ μ2 (There is a significant difference in the mean diameter of cutlets between the two units).
### As this is a two-sample problem with a two-sided alternative, a two-sample, two-tailed test is applicable.
### Also, as more than 30 samples were collected from each unit, we use a z-test instead of a t-test to calculate the p-value. As the number of samples increases, the t-distribution approaches the standard normal distribution, so the z-test is a reasonable approximation here. The test also assumes that the two samples are independent.
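The statement that the t-distribution approaches the standard normal distribution can be checked numerically. The following is a minimal sketch, assuming SciPy is available; the list of degrees of freedom is illustrative, with 68 corresponding to two samples of 35 (35 + 35 - 2).

```python
from scipy import stats

alpha = 0.05

# Two-sided critical value of the standard normal distribution
z_crit = stats.norm.ppf(1 - alpha / 2)

# Two-sided critical values of the t-distribution for growing degrees of freedom
dfs = [5, 10, 30, 68, 200, 1000]
t_crits = [stats.t.ppf(1 - alpha / 2, df) for df in dfs]

print(f"z critical value: {z_crit:.4f}")
for df, t_crit in zip(dfs, t_crits):
    print(f"df={df:>4}: t critical value = {t_crit:.4f} (gap to z: {t_crit - z_crit:.4f})")
```

The gap shrinks as the degrees of freedom grow but never reaches zero, so at this sample size a z-test is slightly more liberal than a t-test.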

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from scipy import stats
from statsmodels.stats import weightstats as stests
import statsmodels.api as sm

In [2]:
# Load Dataset
cutlets_data = pd.read_csv('Cutlets.csv')

#### EDA


In [6]:
cutlets_data.shape

(35, 2)

In [10]:
# description of dataset 
cutlets_data.describe()

| | Unit A | Unit B |
|---|---|---|
| count | 35.0 | 35.0 |
| mean | 7.019091 | 6.964297 |
| std | 0.288408 | 0.343401 |
| min | 6.4376 | 6.038 |
| 25% | 6.8315 | 6.7536 |
| 50% | 6.9438 | 6.9399 |
| 75% | 7.28055 | 7.195 |
| max | 7.5169 | 7.5459 |


In [14]:
# Checking for any null values 
cutlets_data.isna().sum()

Unit A    0
Unit B    0
dtype: int64

### Preparing data to be used in problem solving

In [16]:
unit_a = cutlets_data['Unit A']
unit_b = cutlets_data['Unit B']
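The question asks for the assumptions to be checked. One common way to do this, sketched below, is a Shapiro-Wilk test for normality of each sample and a Levene test for equality of variances. The arrays `sample_a` and `sample_b` are randomly generated placeholders, not the values from Cutlets.csv; with the real data they would be replaced by `unit_a` and `unit_b`.

```python
import numpy as np
from scipy import stats

# Placeholder samples standing in for unit_a and unit_b (NOT the Cutlets.csv values)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=7.0, scale=0.3, size=35)
sample_b = rng.normal(loc=7.0, scale=0.35, size=35)

alpha = 0.05

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
for name, sample in [("sample_a", sample_a), ("sample_b", sample_b)]:
    w_stat, p_norm = stats.shapiro(sample)
    verdict = "no evidence against normality" if p_norm > alpha else "normality is doubtful"
    print(f"{name}: Shapiro-Wilk p = {p_norm:.3f} ({verdict})")

# Levene: H0 = the two samples have equal variances
lev_stat, p_var = stats.levene(sample_a, sample_b)
verdict = "no evidence against equal variances" if p_var > alpha else "variances appear to differ"
print(f"Levene p = {p_var:.3f} ({verdict})")
```

If either check fails on the real data, a Welch test or a non-parametric alternative may be more appropriate than a pooled test.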

### Applying ztest from statsmodels.stats.weightstats

In [17]:
# This returns two values: 1) the z-test statistic, 2) the p-value
ztest, pvalue = stests.ztest(x1 = unit_a, x2=unit_b, value=0,alternative='two-sided')
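To make the calculation behind a two-sample z-test explicit, the sketch below recomputes the statistic from the summary statistics printed by `describe()`, using only the Python standard library. It assumes a pooled variance estimate, which is what the default `usevar='pooled'` argument of `ztest` suggests; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics copied from cutlets_data.describe()
n_a, mean_a, std_a = 35, 7.019091, 0.288408
n_b, mean_b, std_b = 35, 6.964297, 0.343401

# Pooled variance (the sample standard deviations use ddof=1)
var_pooled = ((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / (n_a + n_b - 2)

# Standard error of the difference in means
std_err = sqrt(var_pooled * (1 / n_a + 1 / n_b))

# z statistic for H0: mean_a - mean_b = 0, and its two-sided p-value
z_stat = (mean_a - mean_b) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"z = {z_stat:.4f}, p = {p_value:.4f}")
```

The result should agree with the output of `ztest` up to the rounding of the summary statistics.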

In [23]:
# significance level given is α = 0.05
pvalue = float(pvalue)
α = 0.05

print("P-value: ", pvalue)

if pvalue <= α:
    print("Rejecting null hypothesis")
    print("Conclusion: There is significant difference in diameters of cutlets between two units")
else:
    print("Fail to reject null hypothesis")
    print("Conclusion: There is no significant difference in diameters of cutlets between two units")
    

P-value:  0.46976045023906077
Fail to reject null hypothesis
Conclusion: There is no significant difference in diameters of cutlets between two units


## Result: Fail to reject null hypothesis
## Inference: There is no significant difference in diameters of cutlets between two units
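Since 35 observations per unit is only slightly above the usual threshold of 30, a t-test is a reasonable cross-check of the z-test result. The sketch below is not part of the original analysis; it runs Welch's t-test, which does not assume equal variances, on the summary statistics from `describe()` using `scipy.stats.ttest_ind_from_stats`.

```python
from scipy import stats

# Summary statistics copied from cutlets_data.describe()
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=7.019091, std1=0.288408, nobs1=35,
    mean2=6.964297, std2=0.343401, nobs2=35,
    equal_var=False,  # Welch's t-test: does not assume equal variances
)

alpha = 0.05
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

The t-test p-value is expected to be marginally larger than the z-test p-value, and both lead to the same decision at the 5% level.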


## Question 2: A hospital wants to determine whether there is any difference in the average Turn Around Time (TAT) of reports of the laboratories on their preferred list. They collected a random sample and recorded TAT for reports of 4 laboratories. TAT is defined as the time from sample collection to report dispatch.
   
## Analyze the data and determine whether there is any difference in average TAT among the different laboratories at 5% significance level.


### Dataset: LabTAT.csv

### Hypothesis Test: One-way ANOVA (F-test)
### Hypotheses
### Null Hypothesis Ho : μ1 = μ2 = μ3 = μ4 (The population mean Turn Around Time (TAT) is the same for all four laboratories)
### Alternate Hypothesis Ha : At least one laboratory's population mean Turn Around Time (TAT) is different

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [2]:
# Load Dataset
libtat_data = pd.read_csv('LabTAT.csv')

#### EDA

In [4]:
libtat_data.shape

(120, 4)

In [5]:
# Description of Dataset
libtat_data.describe()

| | Laboratory 1 | Laboratory 2 | Laboratory 3 | Laboratory 4 |
|---|---|---|---|---|
| count | 120.0 | 120.0 | 120.0 | 120.0 |
| mean | 178.361583 | 178.902917 | 199.91325 | 163.68275 |
| std | 13.173594 | 14.957114 | 16.539033 | 15.08508 |
| min | 138.3 | 140.55 | 159.69 | 124.06 |
| 25% | 170.335 | 168.025 | 188.2325 | 154.05 |
| 50% | 178.53 | 178.87 | 199.805 | 164.425 |
| 75% | 186.535 | 189.1125 | 211.3325 | 172.8825 |
| max | 216.39 | 217.86 | 238.7 | 205.18 |


In [6]:
# Check for null values
libtat_data.isna().sum()

Laboratory 1    0
Laboratory 2    0
Laboratory 3    0
Laboratory 4    0
dtype: int64

### Preparing data to be used in problem solving

In [7]:
lab1 = libtat_data['Laboratory 1']
lab2 = libtat_data['Laboratory 2']
lab3 = libtat_data['Laboratory 3']
lab4 = libtat_data['Laboratory 4']

### Applying the one-way ANOVA F-test

In [15]:
import scipy.stats as stats
α = 0.05

f_test, pvalue = stats.f_oneway(lab1, lab2, lab3, lab4)

print("P-value: ", pvalue)

if pvalue <= α:
    print("Rejecting null hypothesis")
    print("Conclusion: At least one laboratory's population mean TAT is different")
else:
    print("Fail to reject null hypothesis")
    print("Conclusion: No significant difference in population mean TAT among the laboratories")

P-value:  2.1156708949992414e-57
Rejecting null hypothesis
Conclusion: At least one laboratory's population mean TAT is different
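The F statistic behind a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. As a cross-check, the sketch below recomputes it from the means and standard deviations printed by `describe()`. It relies on all four laboratories having the same number of observations (120); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from libtat_data.describe()
n = 120  # observations per laboratory
means = np.array([178.361583, 178.902917, 199.91325, 163.68275])
stds = np.array([13.173594, 14.957114, 16.539033, 15.08508])
k = len(means)

# Between-group mean square: spread of the group means around the grand mean
grand_mean = means.mean()  # valid because all groups have the same size
ms_between = n * np.sum((means - grand_mean) ** 2) / (k - 1)

# Within-group mean square: pooled variance inside the groups
ms_within = np.sum((n - 1) * stds**2) / (k * (n - 1))

f_stat = ms_between / ms_within
p_value = stats.f.sf(f_stat, k - 1, k * (n - 1))  # upper-tail probability

print(f"F = {f_stat:.2f}, p = {p_value:.3e}")
```

The p-value should be of the same order of magnitude as the one returned by `f_oneway`, with small differences due to rounding in the summary table.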


## Result: Reject the null hypothesis
## Inference: At least one laboratory's population mean TAT is different
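ANOVA only indicates that some mean differs, not which one. One possible follow-up, which is not part of the original analysis, is a set of pairwise Welch t-tests with a Bonferroni correction. The sketch below uses the summary statistics from `describe()`; a dedicated post-hoc procedure such as Tukey's HSD on the raw data would be a stricter alternative.

```python
from itertools import combinations
from scipy import stats

# Summary statistics copied from libtat_data.describe(): (mean, std) per laboratory
n = 120
labs = {
    "Laboratory 1": (178.361583, 13.173594),
    "Laboratory 2": (178.902917, 14.957114),
    "Laboratory 3": (199.91325, 16.539033),
    "Laboratory 4": (163.68275, 15.08508),
}

pairs = list(combinations(labs, 2))
alpha_adj = 0.05 / len(pairs)  # Bonferroni correction for 6 comparisons

not_significant = []
for lab_x, lab_y in pairs:
    (mean_x, std_x), (mean_y, std_y) = labs[lab_x], labs[lab_y]
    _, p = stats.ttest_ind_from_stats(mean_x, std_x, n, mean_y, std_y, n, equal_var=False)
    differs = p <= alpha_adj
    if not differs:
        not_significant.append((lab_x, lab_y))
    label = "different" if differs else "no significant difference"
    print(f"{lab_x} vs {lab_y}: p = {p:.3g} -> {label}")
```

Any pair collected in `not_significant` cannot be distinguished at the adjusted significance level.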


## Question 3:

 ![alt text](Capture.png "Title")

### Dataset: BuyerRatio.csv

### Hypothesis Test: Chi-square test of independence (chi2_contingency)
### Hypotheses
### Null Hypothesis Ho : Male-female buyer ratios are similar across regions (gender and region are independent)
### Alternate Hypothesis Ha : Male-female buyer ratios are NOT similar across regions (gender and region are related)

In [16]:
# Loading Libraries
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

In [17]:
# Load the dataset
buyer_data = pd.read_csv('BuyerRatio.csv')

In [33]:
# Build a 2-D array of observed counts from the data (rows: gender, columns: region)
obs=np.array([[50,142,131,70],
              [435,1523,1356,750]])
obs

array([[  50,  142,  131,   70],
       [ 435, 1523, 1356,  750]])

#### Applying chi2_contingency test

In [38]:
chi2, pvalue, dof, ex = chi2_contingency(obs)
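To show what the chi-square test of independence computes, the sketch below derives the expected counts, the test statistic, and the p-value by hand for the same observed table. The variable names are illustrative. For a table larger than 2x2, `chi2_contingency` applies no continuity correction, so the manual result should match it.

```python
import numpy as np
from scipy import stats

# Observed counts, same values as `obs` above
observed = np.array([[50, 142, 131, 70],
                     [435, 1523, 1356, 750]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts if the row and column variables were independent
expected = row_totals @ col_totals / grand_total

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, dof)  # upper-tail probability

print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4f}")
```

Running the test and printing its p-value in the same cell avoids picking up a `pvalue` left over from another cell executed earlier in the session.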

In [58]:
α = 0.05

print(pvalue)

if pvalue <= α:
    print("Rejecting null hypothesis")
    print("Conclusion: Male-female buyer ratios are NOT similar across regions")
else:
    print("Fail to reject null hypothesis")
    print("Conclusion: Male-female buyer ratios are similar across regions")

0.6603094907091882
Fail to reject null hypothesis
Conclusion: Male-female buyer ratios are similar across regions


## Result: Fail to reject null hypothesis
## Inference: Male-female buyer ratios are similar across regions


## Question 4: TeleCall uses 4 centres around the globe to process customer order forms. They audit a certain % of the customer order forms. Any error in an order form renders it defective, and it has to be reworked before processing. The manager wants to check whether the defective % varies by centre. Please analyze the data at 5% significance level and help the manager draw appropriate inferences.

### Dataset: CustomerOrderForm.csv

### Hypothesis Test: Chi2 contingency test
### Null Hypothesis: The defective % of customer order forms does not vary by centre
### Alternate Hypothesis: The defective % of customer order forms varies by centre

In [42]:
# Loading Libraries
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

In [44]:
# Load the dataset
cust_data = pd.read_csv('Costomer+OrderForm.csv')

#### EDA

In [46]:
cust_data.shape

(300, 4)

In [47]:
# description of dataset
cust_data.describe()

| | Phillippines | Indonesia | Malta | India |
|---|---|---|---|---|
| count | 300 | 300 | 300 | 300 |
| unique | 2 | 2 | 2 | 2 |
| top | Error Free | Error Free | Error Free | Error Free |
| freq | 271 | 267 | 269 | 280 |


In [48]:
# checking for null values
cust_data.isna().sum()

Phillippines    0
Indonesia       0
Malta           0
India           0
dtype: int64

In [54]:
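# Getting the counts of the two categories (Error Free and Defective) for each centre.
# value_counts() sorts by frequency, so this unpacking assumes Error Free is the
# more frequent label in every column (which describe() above confirms).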
error_free_Phillippines, defective_Phillippines = cust_data.Phillippines.value_counts()
error_free_Indonesia, defective_Indonesia = cust_data.Indonesia.value_counts()
error_free_Malta, defective_Malta = cust_data.Malta.value_counts()
error_free_India, defective_India = cust_data.India.value_counts()
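Unpacking `value_counts()` by position works here only because "Error Free" is the majority label in every column. A more defensive variant, sketched below on made-up toy data, selects the counts by label instead. The centre names and counts are hypothetical, and the label "Defective" is assumed from the comment above rather than confirmed from the file.

```python
import pandas as pd

# Toy data standing in for the order forms; centre names and counts are made up
toy = pd.DataFrame({
    "Centre X": ["Error Free"] * 8 + ["Defective"] * 2,
    "Centre Y": ["Error Free"] * 3 + ["Defective"] * 7,  # Defective is the majority here
})

# Count each label per column; rows of the result are indexed by label
counts = toy.apply(pd.Series.value_counts).fillna(0).astype(int)

# Select rows by label so the order never depends on frequency
obs_toy = counts.loc[["Error Free", "Defective"]].to_numpy()

print(counts)
print(obs_toy)
```

With positional unpacking, the counts for "Centre Y" would have been swapped, because its most frequent label is "Defective".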

In [55]:
# Converting the count of data into 2D array
obs = np.array([[error_free_Phillippines, error_free_Indonesia, error_free_Malta, error_free_India],
               [defective_Phillippines, defective_Indonesia, defective_Malta, defective_India]])
obs

array([[271, 267, 269, 280],
       [ 29,  33,  31,  20]])

#### Applying chi2_contingency test

In [56]:
chi2, pvalue, dof, ex = chi2_contingency(obs)

In [57]:
α = 0.05


if pvalue <= α:
    print("Rejecting null hypothesis")
    print("Conclusion: Customer order forms defective % varies by centre")
else:
    print("Fail to reject null hypothesis")
    print("Conclusion: Customer order forms defective % does not vary by centre")

Fail to reject null hypothesis
Conclusion: Customer order forms defective % does not vary by centre


## Result: Fail to reject null hypothesis
## Inference: Customer order forms defective % does not vary by centre
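To put the inference in context, it can help to look at the defective percentage of each centre alongside the test result. The sketch below rebuilds the observed table from the counts shown above and prints both; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

centres = ["Phillippines", "Indonesia", "Malta", "India"]

# Observed counts, same values as `obs` above: row 0 = error free, row 1 = defective
observed = np.array([[271, 267, 269, 280],
                     [29, 33, 31, 20]])

# Defective percentage per centre
defect_pct = 100 * observed[1] / observed.sum(axis=0)
for centre, pct in zip(centres, defect_pct):
    print(f"{centre}: {pct:.2f}% defective")

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4f}")
```

The percentages differ somewhat between centres, but the test indicates that differences of this size are compatible with chance at the 5% level.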