## Module 3: Exploratory data analysis  

### Lesson 5: Comparing two groups: hypothesis testing
- A hypothesis is a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.

**Examples of hypotheses:**
- "Increasing the amount of sunlight a plant receives will result in faster growth."
- "Higher levels of education are associated with increased income."
- "An increase in the minimum wage will lead to a decrease in unemployment."

Hypotheses must be tested using the data available and so should be testable and, ideally, falsifiable. The hypothesis is usually about the population parameters such as mean or proportion and we seek evidence from a sample for the support of the claim.

### Part 3.5.1  : Sampling our data
- Before getting deeper into hypothesis testing, we must understand some sampling techniques.
- A ***population*** is the group of all items of interest. *E.g.*: All the students on university campus
- A ***sample*** is a set of data drawn from the studied population. *E.g.*: 500 students from the campus
- Few sampling techniques are:
    - ***Random Sampling***: Probability of each record being selected into your sample will be equal. If there are $n$ records, probability of choosing any one record is $1/n$.
    - ***Stratified Sampling***: Involves dividing the entire population into homogeneous groups called strata. It can be better understood using the below given image:

<img src="Stratified_sampling.png" alt="Drawing" style="width: 400px;"/>

*Image Source: By Dan Kernler - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=36506021*

#### Draw a random sample from POS data

In [None]:
# Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#read the data file
pos_data= pd.read_csv('POS_CleanData.csv')
pos_data.head()

In [None]:
#create a sample of 5 records
pos_sample = pos_data.sample(n = 5)
pos_sample

In [None]:
#create a sample of 1000 records
pos_sample = pos_data.sample(n = 200, random_state = 466)
pos_sample.head()

***NOTE:***
- When we draw the random sample from the dataset, there is no guarantee that we will get a sample which has a similar distribution of the data as that of the population.
- To illustrate this, let us check the percentage share of records belonging to each of sectors in the population and in the random sample:

In [None]:
# % of records belonging to each sector in POS data
round(pos_data.Sector.value_counts(normalize = True),3)*100

In [None]:
# % of records belonging to each sector in the random sample
round(pos_sample.Sector.value_counts(normalize=True),3)*100

**To have a sample which represents the true nature of the population, we can use stratified sampling**

In [None]:
# stratified sampling
from sklearn.model_selection import train_test_split
sample1, sample2 = train_test_split(pos_data, test_size = 0.3, \
                                    stratify = pos_data['Sector'], random_state = 1)

In [None]:
# print sample sizes
print("Size of sample1: ", sample1.shape)
print("Size of sample2: ", sample2.shape)


In [None]:
# Check value counts as proportions 
round(sample1.Sector.value_counts(normalize=True),3)*100

In [None]:
# Check value counts as proportions 
round(sample2.Sector.value_counts(normalize=True),3)*100

***We can observe that both the samples drawn have same ratio of sectors and it is same as that of the population***

### Part 3.5.2  : Introduction to hypothesis testing
- Hypothesis testing is a statistical method used to make decisions or draw conclusions about a population based on a sample of data. 
- It involves formulating a hypothesis (a statement or assumption about a population parameter) and then using sample data to determine whether there is enough evidence to support or reject the hypothesis.
- Hypothesis testing is widely used in various fields, including science, engineering, social sciences, and business, to make data-driven decisions, validate assumptions, and test hypotheses about populations. 
- It's important to note that failing to reject the null hypothesis doesn't prove that the null hypothesis is true; it simply means that there isn't enough evidence to conclude otherwise based on the available data.
- Hypothesis testing helps researchers make informed conclusions based on statistical evidence from collected data.

The process of hypothesis testing typically involves the following steps:

**1. Formulate hypotheses:**
- Null Hypothesis ($H_0$): This is the default hypothesis that there is no significant effect or difference. It often represents the *status quo*.
- Alternative Hypothesis ($H_1$): This is the hypothesis that contradicts the null hypothesis and suggests that there is a significant effect or difference. It reflects what the researcher is trying to find evidence for.
    
**2. Choose a significance level (alpha):** The significance level (often denoted as α) is the threshold that defines how strong the evidence needs to be in order to reject the null hypothesis. Commonly used values are 0.05 or 0.01.
    
**3. Collect and analyze data:** 
- Collect a sample of data relevant to the hypothesis.
- Perform statistical analysis on the sample to calculate test statistics and p-values.
    
**4. Calculate test statistic:** The test statistic is a numerical value that summarizes the data and is used to assess the compatibility of the sample data with the null hypothesis.
    
**5. Calculate the *p-value*:** The p-value is a probability that measures the likelihood of observing the sample data, or something more extreme, if the null hypothesis is true. It indicates how strong the evidence against the null hypothesis is.
    
**6. Make a decision:** 
- If the *p-value* is less than or equal to the significance level (α), then there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
- If the *p-value* is greater than the significance level, then there is not enough evidence to reject the null hypothesis.
    
**7. Draw a conclusion:** Based on the decision, you can draw a conclusion about the hypothesis and its implications for the population.


**Example** showing the formulation of hypothesis:
- Let the population mean = 350 and population standard deviation = 11.3 
- Sample size n = 120, Sample mean = 320
- $H_0$: Sample mean is same as population mean (the sample is representative of the population)
- $H_1$: Sample mean is less than the population mean (the sample is not representative of the population)

#### Parametric and Non-parametric tests
- A parametric test makes assumptions about a population’s parameters, and a non-parametric test does not assume anything about the underlying distribution.
- A parametric test makes assumptions about a population’s parameters:
    - Normality : data in each group should be normally distributed.
    - Independence : data in each group should be sampled randomly and independently.
    - No outliers : no extreme outliers in the data.
    - Equal variance : data in each group should have approximately equal variance.
- However, a non-parametric test (sometimes referred to as a distribution free test) does not assume anything about the underlying distribution (for ex. data follows normal distribution is an assumption)
- For ordinal and categorical data, we use non-parametric tests.
- Also, if we doubt the normality of our data, then we use non-parametric tests
- Examples of parametric tests: z-test, t-test
- Examples of non-parametric tests: $\chi^2$ test, Kruskal Walli's test etc.

##### One-tailed and two-tailed tests
- One-tailed and two-tailed hypothesis testing refer to the ways in which the critical region (the region of extreme values) is defined when performing a hypothesis test. 
- These terms are used to specify the directionality of the alternative hypothesis and influence how you interpret the test results.
- **One-Tailed Hypothesis Testing:**
    - In a one-tailed hypothesis test, the alternative hypothesis specifies a directional difference or effect. 
    - This means that you are interested in whether the parameter of interest is significantly greater than or less than the null hypothesis value. 
    - The critical region is only on one side of the distribution.
    - For a one-tailed test, the hypotheses can be expressed as:
        - Null Hypothesis ($H_0$): The parameter is equal to a specific value.
        - Alternative Hypothesis ($H_1$): 
            - One-tailed greater test (positive): The parameter is greater than a specific value.
            - One-tailed lesser test (negative): The parameter is less than a specific value.

- **Two-Tailed Hypothesis Testing:**
    - In a two-tailed hypothesis test, the alternative hypothesis does not specify a directional difference or effect. 
    - Instead, it is concerned with whether the parameter of interest is significantly different from the null hypothesis value in any direction. 
    - The critical region is divided between both tails of the distribution.
    - For a two-tailed test, the hypotheses can be expressed as:
        - Null Hypothesis ($H_0$): The parameter is equal to a specific value.
        - Alternative Hypothesis ($H_1$): The parameter is not equal to a specific value.

- **How to choose the right test among one-tailed and two-tailed tests?:**
    - The choice between one-tailed and two-tailed tests depends on the specific business question and the hypothesis you want to test. 
    - If you have a clear expectation about the direction of the effect (e.g., you expect a treatment to increase a measurement), you might use a one-tailed test. 
    - On the other hand, if you are simply looking for a significant difference in either direction without a specific expectation, a two-tailed test is appropriate.
    - When interpreting the results of a hypothesis test, consider the context of your study and whether the chosen type of test aligns with your research objectives.

<img src="One-Two-TailedTest.png" alt="Drawing" style="width: 600px;"/>

***NOTE:***
- If we choose the level of significance $\alpha$ as 0.05, then on one-tailed test, the critical region will be 5% on either positive or negative side of the distribution. 
- However, when $\alpha = 0.05$ for a two-tailed test, then the critical region will be 2.5% each on both the sides of the distribution.

#### Type I and Type II Errors
- When we do hypothesis testing, there is a possibility of two types of errors resulting from the test.
- These are dipicted in the image below:

<img src="Type1-2Errors.png" alt="Drawing" style="width: 600px;"/>

*Image Scource: https://www.scribbr.com/statistics/type-i-and-type-ii-errors/*

**NOTE:**
- In the upcoming lectures, we will perform several hypothesis testing like one-sample z-test, two-sample z-test, t-test, chi-square test etc.
- The significance of each of these tests will be explained as we progress. 

### Part 3.5.3  : Hypothesis testing using the Z-test
- A z-test is a statistical test used to determine whether a sample mean is significantly different from a known or hypothesized population mean when the population standard deviation is known. 
- It's commonly used when you have a large sample size and you know the population standard deviation.

### One-sample Z-test
- A one-sample z-test is a statistical test used to determine whether the mean of a sample is significantly different from a known population mean or a hypothesized population mean. 
- This test is appropriate when you have a single sample of data and you want to test whether it is consistent with a specific population parameter.
- We assume that the population mean and standard deviation are known and population data aproximately follows normal distribution.
- The test statistic is given by:

$$z=\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt(n)}}$$


In [None]:
# check the mean of the revenue column
pos_data['Revenue($)'].mean()

#### Example 1: Two-tailed hypothesis test on the revenue of the brand *Ariel*
**Let us formulate the hypothesis:**
- Consider the POS dataset with 31057 records as a sample drawn from the larger sales data from few years.
- Let us check whether the revenue generated from the brand *Ariel* is 14350 or not at 5% level of significance.
- Let us define the hypothesis:
    - $H_0$: Average revenue for the *Ariel* brand is 14350 (sample mean is same as population mean) 
    - $H_1$: Average revenue for the *Ariel* brand is not 14350 (sample mean is not equal to 14350)
    - This can be written as -
        - $H_0 : \mu = 14350 $  
        - $H_1 : \mu \ne 14350 $
- **NOTE** that the value 14350 is used here for illustration purpose. And we are using ***two-tailed test.***

In [None]:
# one-sample two-tailed Z-test
from statsmodels.stats import weightstats as wst

print('Null Hypothesis (H0): Average revenue for the Ariel brand is 14350')
print('Alternative Hypothesis (H1): Average revenue for the Ariel brand is not 14350')

alpha = 0.05
# subset for brand Ariel
df = pos_data.loc[pos_data['Brand']=='Ariel']

z_stat, p_value = wst.ztest(df['Revenue($)'],value = 14350)

print(f'Test_statistic value is : {round(z_stat,4)}, p-value is : {round(p_value,4)}')

if p_value <= alpha:
    print(f'Final Conclusion: Reject the null h
          ypothesis')
else:
    print(f'Final Conclusion: Do not reject the null hypothesis')

***Inference:***
- As the computed p-value is greater than the level of significance ($\alpha$), we don't have enough evidence to reject the null hypothesis.


In [None]:
pos_data['Page_traffic'].mean()

#### Example 2: one-tailed hypothesis test on the page traffic of sales data
**Let us formulate the hypothesis:**
- Consider the POS dataset with 31057 records as a sample drawn from the larger sales data from few years.
- Let us check whether the page traffic for various products is more than 2000 or not at 5% level of significance.
- Let us define the hypothesis:
    - $H_0$: Average page traffic for the products is 2000 
    - $H_1$: Average page traffic for the products is greater than 2000 
    - This can be written as -
        - $H_0 : \mu = 2000 $  
        - $H_1 : \mu > 2000 $


In [None]:
# one-sample one-tailed Z-test
print('Null Hypothesis (H0): Average page traffic for the products is 2000')
print('Alternative Hypothesis (H1): Average page traffic for the products is greater than 2000')

alpha=0.05
# Right tailed test
z_stat, p_value = wst.ztest(pos_data['Page_traffic'],value=2000,alternative='larger')  

print(f'Test_statistic value is : {round(z_stat,4)}, p-value is : {round(p_value,4)}')

if p_value <= alpha:
    print(f'Final Conclusion: Reject the null hypothesis')
else:
    print(f'Final Conclusion: Do not reject the null hypothesis')

***Inference:***
- As the computed p-value is smaller than the level of significance ($\alpha$), we reject the null hypothesis.
- Remember, rejection of null hypothesis do not mean that we accept the alternative hypothesis.
- Because, we have not tested for the acceptance of the statement given in alternative hypothesis.
- That is, meaning of rejecting the null hypothesis implies that we don't have enough evidence to accept the null hypothesis.

### Two-sample Z-test
- A two-sample z-test is a statistical test used to determine whether the means of two independent samples are significantly different from each other. 
- This test is appropriate when you have two separate samples and you want to compare whether their means suggest a significant difference in the populations they come from.
- The test statistic is given by:

$$z=\frac{\bar{x_1}- \bar{x_2}}{\sqrt(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2})}$$

- Here, 
    - $\bar{x_1}$ and $\bar{x_2}$ represents the sample means
    - $n_1$ and $n_2$ and the sizes of two samples
    - $s_1^2$ and $s_2^2$ are the sample standard deviations

#### Example 1: Two-tailed hypothesis test on the revenue of the brand *Ariel* and *Tide*
**Let us formulate the hypothesis:**
- Consider the POS dataset with 31057 records as a sample drawn from the larger sales data from few years.
- Take a first sample of the data where the brand is *Ariel*, and the second sample as the records where the brand is *Tide*
- Let us check whether the revenue generated from the brand *Ariel* is same as that of the brand *Tide* at 5% level of significance.
- Let us define the hypothesis:
    - $H_0$: Average revenue for the *Ariel* brand is same as the average revenue for the *Tide* brand
    - $H_1$: Average revenue for the *Ariel* brand is not same as the average revenue for the *Tide* brand
    - This can be written as -
        - $H_0 : \mu_1 = \mu_2 $  
        - $H_1 : \mu_1\ne \mu_2 $


In [70]:
# two-sample two-tailed Z-test
print('Null Hypothesis (H0): Average revenue for the Ariel brand is same as the average revenue for the Tide brand')
print('Alternative Hypothesis (H1): Average revenue for the Ariel brand is not same as the average revenue for the Tide brand')

alpha = 0.05
#take a data where Brand is Ariel
df1 = pos_data.loc[pos_data['Brand']=='Ariel']
df2 = pos_data.loc[pos_data['Brand']=='Tide']

z_stat, p_value = wst.ztest(df1['Revenue($)'],df2['Revenue($)'])

print(f'Test_statistic value is : {round(z_stat,4)}, p-value is : {round(p_value,4)}')

if p_value <= alpha:
    print(f'Final Conclusion: Reject the null hypothesis')
else:
    print(f'Final Conclusion: Do not reject the null hypothesis')

Null Hypothesis (H0): Average revenue for the Ariel brand is same as the average revenue for the Tide brand
Alternative Hypothesis (H1): Average revenue for the Ariel brand is not same as the average revenue for the Tide brand
Test_statistic value is : 1.2295, p-value is : 0.2189
Final Conclusion: Do not reject the null hypothesis


***Inference:***
- As we see, we don't have sufficient evidence to reject the null hypothesis.
- So, based on the data available, we can conclude that the revenue generated by the brand *Ariel* and *Tide* are not different.

#### Example 2: One-tailed hypothesis test to check whether the online promotion helped to increase the sales.
- To perform this test, let us use the dataset obtained from merging POS data and Online data.

**Let us formulate the hypothesis:**
- Let the sample1 be the products where promotion was not done, and sample2 be the products which were promoted online. 
- Let us check whether the revenue generated from the promoted products is more than the products without the promotion.
- Let us define the hypothesis:
    - $H_0$: Average revenue for the promoted and non-promoted products are same
    - $H_1$: Average revenue for the non-promoted products are lesser than that of the promoted products
    - This can be written as -
        - $H_0 : \mu_1 = \mu_2 $  
        - $H_1 : \mu_1< \mu_2 $


In [71]:
pos_online_merge=pd.read_csv('POS_Online_MergeData.csv')
pos_online_merge.head()

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic,Online_Clicks,Online_Cost,Online_Impressions
0,SKU1009,2021-05-01,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0.0,0.0,0.0,0.0
1,SKU1009,2021-05-08,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0.0,0.0,0.0,0.0
2,SKU1009,2022-01-08,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0.0,0.0,0.0,0.0
3,SKU1009,2022-11-05,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0.0,0.0,0.0,0.0
4,SKU1009,2022-12-10,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0.0,0.0,0.0,0.0


In [72]:
# two-sample one-tailed Z-test
print('Null Hypothesis (H0): Average revenue for the promoted and non-promoted products are same')
print('Alternative Hypothesis (H1): Average revenue for the non-promoted products are lesser than that of the promoted products')

alpha=0.05
#take a data where Brand is Ariel
df1 = pos_online_merge.loc[pos_online_merge['Online_Impressions']==0]
df2 = pos_online_merge.loc[pos_online_merge['Online_Impressions']!=0]

z_stat, p_value = wst.ztest(df1['Revenue($)'],df2['Revenue($)'], alternative='smaller')

print(f'Test_statistic value is : {round(z_stat,4)}, p-value is : {round(p_value,4)}')

if p_value <= alpha:
    print(f'Final Conclusion: Reject the null hypothesis')
else:
    print(f'Final Conclusion: Do not reject the null hypothesis')

Null Hypothesis (H0): Average revenue for the promoted and non-promoted products are same
Alternative Hypothesis (H1): Average revenue for the non-promoted products are lesser than that of the promoted products
Test_statistic value is : 0.6602, p-value is : 0.7454
Final Conclusion: Do not reject the null hypothesis


***Inference:***
- As we see, we don't have enough evidance to say that promotions helped improve sales
- So, we may want to look at the strategies with online promotions and ask why promotions are not helping to improve sales, and what we can do to better utilize the spend on online promotions.

### Part 3.5.4  : Hypothesis testing using the t-test
- A t-test is a statistical test used to determine whether the means of two groups are significantly different from each other. 
- It is commonly used when comparing the means of a continuous variable between two independent groups. 
- The t-test is especially useful when the sample size is small and the population standard deviation is unknown.
- The concepts of *Z-test* and the *t-test* are the same with respect to hypothesis testing.
- However there are few differences:
    1. The *Z-test* is used for large samples (when sample size is $\ge$ 30) and the *t-test* is used for small samples (when sample size is < 30)
    2. The *Z-test* is used when the population standard deviation (variance) is known. This is often an idealized assumption that might not hold in many real-world cases. The *t-test* is used when the population standard deviation is unknown, and the sample standard deviation is used as an estimate. This is more common in practice as population standard deviations are not often known.
- The *t-test* assumes that the population mean is known. 

#### Example 1: Two-tailed one-sample t-test on the revenues from *Sensitivity Toothpaste*
- From the POS data, consider the records belonging to the segment *Sensitivity Toothpaste*
- Compute the average revenue of these records, which we can consider as the population mean.
- Draw a small random sample from this data. 
- Check whether the sample has the mean as specific value. 

**Let us formulate the hypothesis:**
- $H_0$: Average revenue for the *Sensitivity Toothpaste* brand is 14500 
- $H_1$: Average revenue for the *Sensitivity Toothpaste* brand is not equal to 14500 
- This can be written as -
    - $H_0 : \mu = 14500 $  
    - $H_1 : \mu \ne 14500 $


In [73]:
# subset the data
df=pos_data.loc[pos_data['Segment']=='Sensitivity Toothpaste']
df.shape

(623, 10)

In [74]:
# find the mean
pop_mean = round(df['Revenue($)'].mean(),2)
print(pop_mean)

14719.37


In [75]:
#create a small sample of sensitivity toothpaste records
sample_data = df.sample(n = 12)
sample_data.shape

(12, 10)

In [76]:
# one-sample t-test

from scipy.stats import ttest_1samp

print('Null Hypothesis (H0): Average revenue for the Sensitivity Toothpaste brand is 14500')
print('Alternative Hypothesis (H1): Average revenue for the Sensitivity Toothpaste brand is not equal to 14500')

alpha=0.05

#perform one sample t-test
t_statistic,p_value = ttest_1samp(sample_data['Revenue($)'], popmean = pop_mean)

print(f'Test_statistic value is : {round(t_statistic,4)}, p-value is : {round(p_value,4)}')

if p_value < alpha:
    print(f'Final Conclusion: Reject null hypothesis')
else:
    print(f'Final Conclusion: Do not reject null hypothesis.')

Null Hypothesis (H0): Average revenue for the Sensitivity Toothpaste brand is 14500
Alternative Hypothesis (H1): Average revenue for the Sensitivity Toothpaste brand is not equal to 14500
Test_statistic value is : -1.2817, p-value is : 0.2263
Final Conclusion: Do not reject null hypothesis.


***Inference:***
- As per the result of the above test, we do not have enough evidence to reject the null hypothesis. 
- That is, we may accept that the average revenue of Sensitivity Toothpaste is 14500.

#### Example 2: Two-tailed two-sample test on the revenues from *Sensitivity Toothpaste* and *Whitening Toothpaste*
- Here, we are going to test whether the revenues from sensitivity toothpaste and the whitening toothpaste are same or not
- Draw a sample 1 of size 12 from the POS data where the segment is sensitivity toothpaste
- Draw a sample 2 of size 15 from the POS data where the segment is whitening toothpaste
- Check whether the revenues of these two samples are same

**Let us formulate the hypothesis:**
- $H_0$: Average revenues for *Sensitivity Toothpaste* and *Whitening Toothpaste* are the same
- $H_1$: Average revenues for *Sensitivity Toothpaste* and *Whitening Toothpaste* are not the same
- This can be written as -
    - $H_0 : \mu_1 = \mu_2$  
    - $H_1 : \mu_1\ne \mu_2 $

In [77]:
# create two subsets of the data
from scipy.stats import ttest_ind
sample1 = pos_data.loc[pos_data['Segment']=='Sensitivity Toothpaste'].sample(12)
sample2 = pos_data.loc[pos_data['Segment']=='Whitening Toothpaste'].sample(15)

In [78]:
# check their sizes
sample1.shape, sample2.shape

((12, 10), (15, 10))

In [79]:
# two-sample t-test

print('Null Hypothesis (H0): Average revenues of the Sensitivity Toothpaste and Whitening Toothpaste are same')
print('Alternative Hypothesis (H1): Average revenues of the Sensitivity Toothpaste and Whitening Toothpaste are not same')

alpha=0.05

t_statistic,p_value = ttest_ind(sample1['Revenue($)'],sample2['Revenue($)'])

print(f'Test_statistic value is : {round(t_statistic,4)}, p-value is : {round(p_value,4)}')

if p_value < alpha:
    print(f'Final Conclusion: Reject null hypothesis')
else:
    print(f'Final Conclusion: Do not reject null hypothesis.')

Null Hypothesis (H0): Average revenues of the Sensitivity Toothpaste and Whitening Toothpaste are same
Alternative Hypothesis (H1): Average revenues of the Sensitivity Toothpaste and Whitening Toothpaste are not same
Test_statistic value is : 0.968, p-value is : 0.3423
Final Conclusion: Do not reject null hypothesis.


***Inference:***
- As per the results, we may need to say that the revenue of sensitivity toothpaste and whitening toothpaste are same.

### Part 3.5.5  :  Chi-square ($\chi^2$) test
- A chi-squared ($\chi^2$) test is a statistical test used to determine whether there is a significant association or relationship between categorical variables. 
- It is also called the **chi-square test for independence** or the **contingency table test**
- It is often applied to data presented in a contingency table, where the rows represent one variable's categories and the columns represent another variable's categories.
- The null hypothesis states that the two variables are independent, meaning there is no association between them.
- To implement $\chi^2$ test for independence, we need to keep the data in the form of a ***contingency table (or cross-tab)*** that shows the frequencies in each category for the two variables.

**NOTE:**
- The POS data has 5 categorical variables (SKU ID, Sector, Category, Segment and Brand). 
- However, SKU ID is a product ID and all other variables have a hierarchical relationship. 
- Hence, this data is not suitable to test the association between the variables. 
- So let us illustrate this concept with a different data set.

Note that in another context, the $\chi^2$ test can also be used to to analyze frequency data and assess whether the observed frequencies differ from the expected frequencies based on a null hypothesis of independence.


#### Problem statement
- A campaign was conducted for credit card holders to increase their credit limit.
- The data consists of information like credit card ID, gender of the card holder, birth date, the date of registration of the credit card and whether they have positively responded to the campaign or not.
- Let us now test whether there is an association between the gender of the card holder and their response (True or False) for the campaign.
**Formulate the hypothesis:**
    - $H_0$: There is no relationship between Gender and Campaign Response
    - $H_1$: There is a relationship between Gender and Campaign Response


In [80]:
# read the data
data=pd.read_excel("CreditCardData.xlsx")
data.head()

Unnamed: 0,Card_ID,Campaign_Response,Registration_Date,Gender,Birth_Date
0,100005950,False,1998-11-18,M,1984-02-06
1,100022191,True,1999-09-15,F,1959-09-11
2,100025442,False,1998-05-12,M,1970-08-25
3,100026513,False,1999-02-12,M,1951-03-12
4,100039145,False,2000-08-12,M,1949-06-08


In [81]:
# value counts
data.Campaign_Response.value_counts()

False    248
True      49
Name: Campaign_Response, dtype: int64

In [82]:
# create a contingency table
obs=pd.crosstab(data['Gender'],data['Campaign_Response'])
obs

Campaign_Response,False,True
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1
F,102,25
M,146,24


In [83]:
# test for association using a chi-square test

from scipy.stats import chi2_contingency,chisquare

print('Null Hypothesis (H0): There is no relationship between Gender and Campaign Responce. (independent)')
print('Alternative Hypothesis (H1): There is a relatinoship between Gender and Campaign Responce. (dependent)')


chi_sq_Stat, p_value, deg_freedom, exp_freq = chi2_contingency(obs)

print(f'Test_statistic value is : {round(chi_sq_Stat,4)}, p-value is : {round(p_value,4)}')

alpha = 0.05
if p_value < alpha:
    print(f'Final Conclusion: Reject null hypothesis')
else:
    print(f'Final Conclusion: Do not reject null hypothesis.')
    

Null Hypothesis (H0): There is no relationship between Gender and Campaign Responce. (independent)
Alternative Hypothesis (H1): There is a relatinoship between Gender and Campaign Responce. (dependent)
Test_statistic value is : 1.2564, p-value is : 0.2623
Final Conclusion: Do not reject null hypothesis.


***Inference:***
- We do not have enough evidance to say that gender has an influence on campaign response. 
- That means, gender and response are independent of each other.