# Week 4 Assignment: Hypothesis Testing

<img align="right" style="padding-left:50px;" src="figures_wk4/hypothesis_testing.png" width=350><br>
## Assignment Requirements

**Dataset Name:** shoe.csv (located in the assing_wk4 folder)<br>
This data contains global sales information for a shoe company.

**Assignment Requirements:** <br>
As demonstrated in the <u>Demo: Analyzing with Hypothesis Testing</u> section of the lecture notebook, you will provide an analysis of the shoe.csv dataset.

For your analysis you will need to:
1) Develop a scenario to provide an overall understanding of the organization represented by the dataset.
    * Define a problem statement to investigate with either a t-test or a z-test.
    * Describe your dataset, including the columns that will support your analysis.
2) Describe your test approach
    * Justify your selection of which test you will use.
    * Define an H0/H1 couplet to support your problem statement.
3) Conduct your analysis according to your defined parameters above.
    * Include a descriptive analysis as appropriate
    * Prep your data for analysis
    * Conduct your hypothesis testing
4) Based on the results from your analysis in step 3, redefine your H0/H1 couplet to dive deeper into the analysis and repeat steps 2 and 3.
5) Provide a recap of the insights you have gained throughout your analysis.


## Deliverables
Upload your Jupyter Notebook to the corresponding location in WorldClass. 

**Note:** Make sure you have clearly indicated each assignment requirement within your notebook. Also, I <i>highly encourage</i> you to use markdown text to create a notebook that integrates your analysis within your code. The narrative within your notebook will count for 50% of your total grade on this assignment.

Based on the initial overview of the dataset from a shoe company, here's a proposed scenario and problem statement for analysis:

### Scenario
The dataset represents sales data from an international shoe company. This company operates in multiple countries and offers a range of shoe products catering to different genders. Their business model includes both full-priced and discounted sales, which can be vital for understanding purchasing trends and consumer behavior.

### Problem Statement
A key question for the company is whether there is a statistically significant difference in the average sale price between discounted and non-discounted shoes. Such insights could inform future pricing and discount strategies.

### Dataset Description
The dataset includes the following relevant columns for our analysis:

1. **Country**: Location of the sale, which could be interesting for geographical analysis.
2. **Gender**: The gender for which the shoe is designed.
3. **Size (US), Size (Europe), Size (UK)**: Different sizing standards, useful for regional analysis.
4. **UnitPrice**: The original price of the shoe.
5. **Discount**: The percentage discount applied to the shoe.
6. **SalePrice**: The final price of the shoe after applying the discount.

To proceed with the analysis, we will focus on the `Discount` and `SalePrice` columns. We can use a t-test or z-test to compare the mean SalePrice between discounted and non-discounted shoes. The choice between a t-test and a z-test would depend on the sample size and whether the population standard deviation is known.
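As a quick sanity check on the column definitions, the sketch below recomputes the sale price from the unit price and discount for three rows that appear in the notebook's data previews. It assumes the relationship `SalePrice = UnitPrice × (1 − Discount)`, which those rows are consistent with but which is not stated explicitly anywhere in the dataset description; the helper name `expected_sale_price` is hypothetical.

```python
# (unit_price, discount_fraction, listed_sale_price) copied from rows in the data previews
rows = [
    (199.00, 0.20, 159.20),
    (149.00, 0.20, 119.20),
    (189.00, 0.30, 132.30),
]

def expected_sale_price(unit_price, discount):
    """Assumed relationship: unit price reduced by a fractional discount."""
    return round(unit_price * (1 - discount), 2)

# Collect any row whose listed sale price differs from the recomputed one by a cent or more
mismatches = [
    row for row in rows
    if abs(expected_sale_price(row[0], row[1]) - row[2]) >= 0.005
]
print(mismatches)  # an empty list means every checked row matches
```

If the relationship holds across the full dataset, any difference in mean `SalePrice` between the two groups combines the effect of the discount itself with any difference in the underlying unit prices.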


In [10]:
import pandas as pd

import seaborn as sns  
sns.set()

from statsmodels.stats.weightstats import ztest, ttest_ind

import warnings
warnings.filterwarnings("ignore")

In [11]:
shoes_df = pd.read_csv('/Users/vincentgunti/Documents/Data_Analytics_Project/shoes.csv')

In [12]:
shoes_df

Unnamed: 0,InvoiceNo,Date,Country,ProductID,Shop,Gender,Size (US),Size (Europe),Size (UK),UnitPrice,Discount,Year,Month,SalePrice
0,52389.0,1/1/2014,United Kingdom,2152.0,UK2,Male,11.0,44,10.5,$159.00,0%,2014,1,$159.00
1,52390.0,1/1/2014,United States,2230.0,US15,Male,11.5,44-45,11.0,$199.00,20%,2014,1,$159.20
2,52391.0,1/1/2014,Canada,2160.0,CAN7,Male,9.5,42-43,9.0,$149.00,20%,2014,1,$119.20
3,52392.0,1/1/2014,United States,2234.0,US6,Female,9.5,40,7.5,$159.00,0%,2014,1,$159.00
4,52393.0,1/1/2014,United Kingdom,2222.0,UK4,Female,9.0,39-40,7.0,$159.00,0%,2014,1,$159.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14972,,,Germany,2156.0,GER1,Female,6.5,37,4.5,$139.00,10%,1900,1,$125.10
14973,,,Germany,2156.0,GER1,Female,6.5,37,4.5,$139.00,10%,1900,1,$125.10
14974,,,Germany,2156.0,GER1,Female,6.5,37,4.5,$139.00,10%,1900,1,$125.10
14975,,,Germany,2156.0,GER1,Female,6.5,37,4.5,$139.00,10%,1900,1,$125.10


In [13]:
shoes_df.shape

(14977, 14)

In [14]:
shoes_df.sample(10)

Unnamed: 0,InvoiceNo,Date,Country,ProductID,Shop,Gender,Size (US),Size (Europe),Size (UK),UnitPrice,Discount,Year,Month,SalePrice
5300,57076.0,7/30/2015,United States,2227.0,US12,Male,10.5,43-44,10.0,$189.00,30%,2015,7,$132.30
14157,65030.0,11/8/2016,Canada,2177.0,CAN6,Male,11.0,44,10.5,$169.00,0%,2016,11,$169.00
2021,54112.0,9/25/2014,Germany,2211.0,GER3,Male,10.0,43,9.5,$159.00,0%,2014,9,$159.00
12054,63167.0,8/9/2016,United States,2194.0,US12,Female,8.0,38-39,6.0,$139.00,20%,2016,8,$111.20
14872,65686.0,12/24/2016,United States,2226.0,US14,Female,8.5,39,6.5,$139.00,10%,2016,12,$125.10
12237,63322.0,8/16/2016,Germany,2165.0,GER2,Female,8.5,39,6.5,$139.00,30%,2016,8,$97.30
8834,60307.0,3/10/2016,Canada,2183.0,CAN7,Male,9.5,42-43,9.0,$179.00,50%,2016,3,$89.50
6813,58477.0,11/9/2015,Germany,2201.0,GER1,Male,10.0,43,9.5,$149.00,0%,2015,11,$149.00
14691,65518.0,12/12/2016,United States,2205.0,US14,Female,7.5,38,5.5,$149.00,20%,2016,12,$119.20
6679,58346.0,10/31/2015,United States,2193.0,US14,Female,9.5,40,7.5,$139.00,0%,2015,10,$139.00


In [15]:
# Preparing data for analysis
# First, we need to clean and format the 'SalePrice' and 'Discount' columns for numerical analysis.

# Removing the '$' sign and thousands separators, then converting 'SalePrice' to a numeric type
# (raw strings avoid invalid-escape warnings for the regex patterns)
shoes_df['SalePrice'] = shoes_df['SalePrice'].replace(r'[\$,]', '', regex=True).astype(float)

# Converting 'Discount' to a numerical representation (e.g., 20% becomes 0.20)
shoes_df['Discount'] = shoes_df['Discount'].replace(r'[\%,]', '', regex=True).astype(float) / 100

# Splitting the dataset into two groups: Discounted and Non-Discounted
discounted = shoes_df[shoes_df['Discount'] > 0]['SalePrice']
non_discounted = shoes_df[shoes_df['Discount'] == 0]['SalePrice']

# Checking the sample sizes for both groups to decide between t-test and z-test
sample_size_discounted = discounted.shape[0]
sample_size_non_discounted = non_discounted.shape[0]

sample_size_discounted, sample_size_non_discounted


(6680, 8296)
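The cleaning step above can be checked on individual strings before it is applied to whole columns. The following is a dependency-free sketch rather than the notebook's method: `parse_price` and `parse_percent` are hypothetical helpers that mirror what the regular-expression replacements do to values such as `$159.20` and `20%`.

```python
def parse_price(text):
    """Convert a currency string such as '$1,159.20' to a float."""
    return float(text.replace("$", "").replace(",", ""))

def parse_percent(text):
    """Convert a percentage string such as '20%' to a fraction (0.20)."""
    return float(text.replace("%", "").replace(",", "")) / 100

print(parse_price("$159.20"))
print(parse_percent("20%"))
```

Checking a handful of representative strings this way makes it easier to spot unexpected formats (for example, missing values) before calling `astype(float)` on the full column.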

In [16]:
from scipy import stats

# Performing an independent t-test between the two groups
t_stat, p_value = stats.ttest_ind(discounted, non_discounted, equal_var=False)

t_stat, p_value


(-106.31884280444672, 0.0)

### Test Approach Description

#### Test Selection Justification
I chose the **independent t-test** for the following reasons:

1. **Objective of the Analysis**: The goal is to compare the means of two independent groups (discounted vs. non-discounted shoes) to determine if there is a significant difference in their average sale prices.

2. **Sample Size**: Both groups have large samples (discounted shoes: 6,680; non-discounted shoes: 8,296). Large samples often suggest a z-test, but a z-test assumes the population standard deviation is known, which is not the case here; the t-test estimates it from the sample.

3. **Data Characteristics**: The two groups are independent samples with different spreads (standard deviations of $29.99 and $22.73), so the test is run with `equal_var=False`, which does not assume equal variances.

4. **Robustness**: With samples this large, the t distribution is nearly identical to the standard normal distribution, so the t-test and the z-test lead to essentially the same conclusion.
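For reference, the notebook calls `stats.ttest_ind` with `equal_var=False`, which is Welch's form of the independent t-test. The standard statistic and its approximate degrees of freedom are written out below, where $\bar{x}_i$, $s_i$, and $n_i$ denote the sample mean, sample standard deviation, and size of group $i$ (notation introduced here for illustration only).

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

With thousands of observations per group, $\nu$ is large and the reference distribution is practically the standard normal.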

#### Hypotheses Definition
To support the problem statement, the following hypothesis couplet (H0/H1) is defined:

- **Null Hypothesis (H0)**: There is no difference in the average sale price between discounted and non-discounted shoes. Mathematically, this can be stated as: \( H_0: \mu_{\text{discounted}} = \mu_{\text{non-discounted}} \).

- **Alternative Hypothesis (H1)**: There is a difference in the average sale price between discounted and non-discounted shoes. This can be represented as: \( H_1: \mu_{\text{discounted}} \neq \mu_{\text{non-discounted}} \).

In this setup, we are testing for any difference (higher or lower) in the average sale prices, making it a two-tailed test. The p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true. A very low p-value (as we obtained) leads us to reject the null hypothesis in favor of the alternative hypothesis, suggesting a significant difference in the average sale prices between the two groups.

### Descriptive Analysis

#### Discounted Shoes
- **Count**: 6,680
- **Mean Sale Price**: $117.85
- **Standard Deviation**: $29.99
- **Minimum Sale Price**: $64.50
- **25th Percentile**: $94.50
- **Median**: $118.30
- **75th Percentile**: $139.30
- **Maximum Sale Price**: $179.10

#### Non-Discounted Shoes
- **Count**: 8,296
- **Mean Sale Price**: $165.02
- **Standard Deviation**: $22.73
- **Minimum Sale Price**: $129.00
- **25th Percentile**: $149.00
- **Median**: $169.00
- **75th Percentile**: $189.00
- **Maximum Sale Price**: $199.00

### Hypothesis Testing

- **t-statistic**: -106.32
- **p-value**: reported as 0.0 (too small to be represented in floating point)
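The t-statistic can be reproduced by hand from the descriptive statistics of the two groups. This is a minimal sketch using only the standard library; `welch_t` is a hypothetical helper, and the inputs are the means, standard deviations, and counts from the notebook's `describe()` output.

```python
import math

def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t-statistic from group means, sample standard deviations, and sizes."""
    standard_error = math.sqrt(std1**2 / n1 + std2**2 / n2)
    return (mean1 - mean2) / standard_error

# Discounted vs. non-discounted sale prices
t_stat = welch_t(117.846243, 29.989981, 6680, 165.023385, 22.726488, 8296)
print(round(t_stat, 2))  # -106.32
```

The negative sign reflects the argument order: the discounted group is passed first and has the lower mean.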

### Analysis and Conclusion
The descriptive statistics show a notable difference in the mean sale prices between discounted and non-discounted shoes. Discounted shoes have a lower average sale price and a wider range of prices compared to non-discounted shoes.

The hypothesis test results (t-statistic of -106.32 and a p-value reported as 0.0, meaning it is too small to be represented in floating point) lead us to reject the null hypothesis. This means that there is a statistically significant difference in the average sale prices of discounted and non-discounted shoes, with discounted shoes having a lower average sale price.

Because `SalePrice` is the price after the discount has been applied, a lower mean for discounted shoes is expected; the test confirms that the gap (about $47 on average) is far larger than chance variation would produce. The company could use this information to inform its pricing strategies, possibly considering more nuanced discounting approaches to optimize revenue and profitability.
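A p-value printed as 0.0 is not literally zero; it is smaller than a double-precision float can hold. The sketch below illustrates this under a normal approximation to the t distribution (an assumption made here because the samples are large), using the leading-order tail formula p ≈ 2·φ(z)/z to estimate the order of magnitude on a log scale.

```python
import math

z = 106.31884280444672  # magnitude of the t-statistic from the notebook output

# Two-sided tail probability under the normal approximation
p_two_sided = math.erfc(z / math.sqrt(2))
print(p_two_sided)  # 0.0: the true value underflows double precision

# Leading-order tail approximation on a log scale: p ~ 2 * pdf(z) / z
log10_p = (
    math.log(2) - z**2 / 2 - math.log(z) - 0.5 * math.log(2 * math.pi)
) / math.log(10)

# Normal double-precision values bottom out near 1e-308, so anything
# with a much smaller exponent prints as 0.0
print(log10_p < -308)  # True
```

When reporting such a result, it is more accurate to write "p < 0.001" than "p = 0".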

In [17]:
# Descriptive Analysis
# Calculating basic statistics for each group to understand their distribution
desc_discounted = discounted.describe()
desc_non_discounted = non_discounted.describe()

# Preparing the data for hypothesis testing
# Data is already prepared in the earlier steps (conversion of SalePrice to numeric, splitting the dataset)

# Re-performing the t-test as part of this complete analysis
t_stat, p_value = stats.ttest_ind(discounted, non_discounted, equal_var=False)

desc_discounted, desc_non_discounted, t_stat, p_value


(count    6680.000000
 mean      117.846243
 std        29.989981
 min        64.500000
 25%        94.500000
 50%       118.300000
 75%       139.300000
 max       179.100000
 Name: SalePrice, dtype: float64,
 count    8296.000000
 mean      165.023385
 std        22.726488
 min       129.000000
 25%       149.000000
 50%       169.000000
 75%       189.000000
 max       199.000000
 Name: SalePrice, dtype: float64,
 -106.31884280444672,
 0.0)

Based on the previous analysis, we observed a significant difference in average sale prices between discounted and non-discounted shoes. To dive deeper, we can refine our hypothesis to investigate whether the average discount rate itself differs significantly based on the gender for which the shoe is designed. This will help us understand if there's a gender-based bias in discounting strategies.

### Refined Hypotheses

#### New Null Hypothesis (H0)
There is no difference in the average discount rate between men's and women's shoes. Mathematically, this can be stated as: 
\( H_0: \mu_{\text{discount, men}} = \mu_{\text{discount, women}} \).

#### New Alternative Hypothesis (H1)
There is a difference in the average discount rate between men's and women's shoes. This can be represented as: 
\( H_1: \mu_{\text{discount, men}} \neq \mu_{\text{discount, women}} \).

### Steps 2 and 3: Data Preparation and Hypothesis Testing

For this analysis, we'll focus on the `Discount` column, and categorize the data based on the `Gender` column.

Let's proceed with this refined approach.

### Descriptive Analysis

#### Men's Shoes
- **Count**: 8,919
- **Mean Discount Rate**: 12.42%
- **Standard Deviation**: 17.02%
- **Minimum Discount**: 0%
- **25th Percentile**: 0%
- **Median**: 0%
- **75th Percentile**: 20%
- **Maximum Discount**: 50%

#### Women's Shoes
- **Count**: 6,057
- **Mean Discount Rate**: 12.37%
- **Standard Deviation**: 16.99%
- **Minimum Discount**: 0%
- **25th Percentile**: 0%
- **Median**: 0%
- **75th Percentile**: 20%
- **Maximum Discount**: 50%

### Hypothesis Testing

- **t-statistic**: 0.168
- **p-value**: 0.866

### Analysis and Conclusion
The descriptive statistics reveal that both men's and women's shoes have similar average discount rates and distributions. 

The results of the hypothesis test (t-statistic of 0.168 and a p-value of 0.866) lead us to fail to reject the null hypothesis. This means that there is no statistically significant difference in the average discount rates between men's and women's shoes.

The data therefore provide no evidence that the company discounts men's and women's shoes differently. Failing to reject the null hypothesis does not prove the rates are identical, but the observed means (12.42% vs. 12.37%) are nearly the same, which is consistent with a discounting strategy applied uniformly across gender categories.
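A confidence interval for the difference in mean discount rates gives a sense of how large a gender gap the data could plausibly hide. This is a rough sketch, assuming the large-sample normal critical value of 1.96 for 95% confidence; the inputs are taken from the notebook's `describe()` output for the `Discount` column.

```python
import math

# Summary statistics for the Discount column, by gender
mean_men, std_men, n_men = 0.124184, 0.170179, 8919
mean_women, std_women, n_women = 0.123708, 0.169911, 6057

diff = mean_men - mean_women
standard_error = math.sqrt(std_men**2 / n_men + std_women**2 / n_women)

# 1.96 is the normal critical value for 95% confidence (large-sample assumption)
ci_low = diff - 1.96 * standard_error
ci_high = diff + 1.96 * standard_error
print(f"difference: {diff:.4f}, 95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```

The interval spans zero and is only about one percentage point wide, so any true difference in average discount between men's and women's shoes would have to be small.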


In [18]:
# Preparing data for the refined analysis
# Filtering the dataset based on gender and focusing on the 'Discount' column
discount_men = shoes_df[shoes_df['Gender'] == 'Male']['Discount']
discount_women = shoes_df[shoes_df['Gender'] == 'Female']['Discount']

# Descriptive Analysis
desc_discount_men = discount_men.describe()
desc_discount_women = discount_women.describe()

# Conducting an independent t-test for the new hypothesis
t_stat_gender, p_value_gender = stats.ttest_ind(discount_men, discount_women, equal_var=False)

desc_discount_men, desc_discount_women, t_stat_gender, p_value_gender


(count    8919.000000
 mean        0.124184
 std         0.170179
 min         0.000000
 25%         0.000000
 50%         0.000000
 75%         0.200000
 max         0.500000
 Name: Discount, dtype: float64,
 count    6057.000000
 mean        0.123708
 std         0.169911
 min         0.000000
 25%         0.000000
 50%         0.000000
 75%         0.200000
 max         0.500000
 Name: Discount, dtype: float64,
 0.16822751311129247,
 0.8664069108647354)

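To extend the analysis to regional pricing, a third test compares the average unit price of shoes sold in the United Kingdom and the United States, with \( H_0: \mu_{\text{UK}} = \mu_{\text{US}} \) and \( H_1: \mu_{\text{UK}} \neq \mu_{\text{US}} \).

### Descriptive Analysis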

#### United Kingdom (UK)
- **Count**: 1,737
- **Mean Unit Price**: $165.61
- **Standard Deviation**: $23.61
- **Minimum Price**: $129.00
- **25th Percentile**: $149.00
- **Median**: $169.00
- **75th Percentile**: $189.00
- **Maximum Price**: $199.00

#### United States (US)
- **Count**: 5,886
- **Mean Unit Price**: $163.49
- **Standard Deviation**: $22.74
- **Minimum Price**: $129.00
- **25th Percentile**: $139.00
- **Median**: $159.00
- **75th Percentile**: $179.00
- **Maximum Price**: $199.00

### Hypothesis Testing Results

- **t-statistic**: 3.323
- **p-value**: 0.0009

### Analysis and Conclusion
The descriptive statistics show that the average unit price of shoes is slightly higher in the UK compared to the US. The t-test results, with a t-statistic of 3.323 and a p-value of approximately 0.0009, indicate that this difference is statistically significant.

Thus, we reject the null hypothesis (equal mean unit prices in both countries) and conclude that there is a statistically significant difference in the average unit prices of shoes between the UK and the US. The gap is small in absolute terms (about $2.12), and it may reflect differences in the mix of products sold, market demand, economic conditions, or regional pricing policies rather than a deliberate difference in pricing strategy.

This insight could be crucial for the company's regional pricing strategy and marketing efforts.
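The UK versus US result can also be cross-checked from the summary statistics alone. The sketch below recomputes Welch's t-statistic, estimates the degrees of freedom with the Welch-Satterthwaite approximation, and approximates the two-sided p-value with a normal distribution (an assumption that is reasonable only because the degrees of freedom are large). Variable names are illustrative.

```python
import math

# Summary statistics for UnitPrice, by country (from the describe() output)
mean_uk, std_uk, n_uk = 165.614853, 23.609917, 1737
mean_us, std_us, n_us = 163.490316, 22.741654, 5886

var_term_uk = std_uk**2 / n_uk
var_term_us = std_us**2 / n_us

t_stat = (mean_uk - mean_us) / math.sqrt(var_term_uk + var_term_us)

# Welch-Satterthwaite approximation to the degrees of freedom
dof = (var_term_uk + var_term_us) ** 2 / (
    var_term_uk**2 / (n_uk - 1) + var_term_us**2 / (n_us - 1)
)

# Two-sided p-value under a normal approximation (reasonable for large dof)
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))
print(round(t_stat, 3), round(dof), p_approx)
```

The normal approximation gives a p-value close to, but not identical with, the value reported by `stats.ttest_ind`, which uses the t distribution.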

In [19]:
# Filtering the dataset for sales in the United Kingdom and the United States
uk_prices = shoes_df[shoes_df['Country'] == 'United Kingdom']['UnitPrice'].replace(r'[\$,]', '', regex=True).astype(float)
us_prices = shoes_df[shoes_df['Country'] == 'United States']['UnitPrice'].replace(r'[\$,]', '', regex=True).astype(float)

# Conducting an independent t-test between the UK and US prices
t_stat_country, p_value_country = stats.ttest_ind(uk_prices, us_prices, equal_var=False)

# Descriptive statistics for each group for better understanding
desc_uk = uk_prices.describe()
desc_us = us_prices.describe()

desc_uk, desc_us, t_stat_country, p_value_country


(count    1737.000000
 mean      165.614853
 std        23.609917
 min       129.000000
 25%       149.000000
 50%       169.000000
 75%       189.000000
 max       199.000000
 Name: UnitPrice, dtype: float64,
 count    5886.000000
 mean      163.490316
 std        22.741654
 min       129.000000
 25%       139.000000
 50%       159.000000
 75%       179.000000
 max       199.000000
 Name: UnitPrice, dtype: float64,
 3.3229133993061826,
 0.0009024629493922783)