In [46]:
import pandas as pd
import scipy.stats as stats
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Task 2 of the Week 2 Assignment:
## Hypothesis Testing and Statistical Analysis

In this notebook, we formulate two hypotheses based on exploratory data analysis (EDA) and test them with statistical tests (a t-test and an ANOVA).

We will:
1. Clean the dataset by handling missing values and non-numeric entries.
2. Perform a t-test to compare the discount prices of high-rated vs. low-rated products.
3. Perform an ANOVA to determine if the number of ratings varies significantly across different subcategories.
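
The cleaning step in item 1 can be sketched roughly as follows. This is a minimal illustration, not the cleaning actually applied to `final_cleaned.csv`: the `raw` frame, its values, and the helper `to_numeric_column` are hypothetical and assume that non-numeric entries are strings with currency symbols, separators, or placeholder text.

```python
import pandas as pd

# Hypothetical raw data; the values are made up for illustration only.
raw = pd.DataFrame({
    "ratings": ["4.1", "n/a", "3.9", None],
    "discount_price": ["₹1,219", "₹599", None, "₹2,499"],
})

def to_numeric_column(series: pd.Series) -> pd.Series:
    """Keep only digits and decimal points, then coerce to float (NaN on failure)."""
    stripped = series.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

# Convert every column, then drop rows where any value could not be parsed.
clean = raw.apply(to_numeric_column).dropna()
print(clean)
```

Coercing with `errors="coerce"` turns unparseable entries into `NaN`, so a single `dropna()` handles both missing values and non-numeric entries.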


In [62]:
# Loading the data
data = pd.read_csv("../data/final_cleaned.csv")
data.head()

Unnamed: 0,name,main_category,sub_category,ratings,no_of_ratings,discount_price,actual_price
0,Small Personal Air Conditioners Mini Air Condi...,appliances,Air Conditioners,3.1,85.0,2499.0,2999.0
1,ALLWIN TAFTA High Quality Metal Special Coated...,appliances,Air Conditioners,4.6,7.0,647.0,1500.0
2,Hexzone Floor AC Stand Certified Special Coate...,appliances,Air Conditioners,4.1,9.0,679.0,1249.0
3,AC Stand/Heavy Duty Air Conditioner Outdoor Un...,appliances,Air Conditioners,5.0,2.0,599.0,999.0
4,ThreeH USB Fan Silent Mini Air Conditioner Tra...,appliances,Air Conditioners,4.0,1400.0,1219.0,1449.0


# Hypothesis 1: T-test for discount prices between high-rated and low-rated products

In [71]:
# Check for missing values in the actual_price and discount_price columns
data[['actual_price', 'discount_price']].isna().sum()

actual_price      0
discount_price    0
dtype: int64

In [72]:
# T-test for discount prices between high-rated and low-rated products

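# Assumption: data_clean is the loaded frame without rows missing a rating or discount price
data_clean = data.dropna(subset=['ratings', 'discount_price'])

# Separate the high-rated and low-rated products based on a threshold rating of 4.0
threshold = 4.0
high_rated = data_clean.loc[data_clean['ratings'] >= threshold, 'discount_price']
low_rated = data_clean.loc[data_clean['ratings'] < threshold, 'discount_price']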

# Perform Welch's t-test (two-sample t-test with unequal variance)
t_stat, p_value = stats.ttest_ind(high_rated, low_rated, equal_var=False)

print(f"T-test result: t-statistic = {t_stat}, p-value = {p_value}")

T-test result: t-statistic = 33.68404553514181, p-value = 3.12098162720897e-248


### Interpretation of T-test Results:

The t-test yields a **t-statistic of 33.68** and a **p-value of about 3.1e-248**, indicating a statistically significant difference in mean discount prices between high-rated and low-rated products. Because the statistic is positive, the mean discount price of high-rated products (rating >= 4.0) is higher than that of low-rated products. We reject the null hypothesis that the two groups have equal mean discount prices.
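
To make the statistic concrete, the sketch below computes Welch's t by hand and compares it with `stats.ttest_ind(..., equal_var=False)`. The two samples are synthetic (drawn with arbitrary means and spreads), so the numbers do not correspond to the notebook's data; only the formula is the point.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups; parameters are arbitrary assumptions.
rng = np.random.default_rng(0)
high = rng.normal(loc=1200.0, scale=400.0, size=500)
low = rng.normal(loc=1000.0, scale=250.0, size=300)

# Welch's t: difference in means divided by the unpooled standard error.
standard_error = np.sqrt(high.var(ddof=1) / high.size + low.var(ddof=1) / low.size)
t_manual = (high.mean() - low.mean()) / standard_error

# SciPy's Welch test should give the same statistic.
t_scipy, p_scipy = stats.ttest_ind(high, low, equal_var=False)

print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_scipy:.3g}")
```

The sign of the statistic follows the order of the arguments: a positive value means the first sample has the larger mean.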

# Hypothesis 2 (ANOVA Test)

In [69]:
# Perform ANOVA test to check if the number of ratings differs significantly across subcategories
# Assumption: the test runs on the frame loaded above as `data`
anova_stat, p_value_anova = stats.f_oneway(*[data[data['sub_category'] == cat]['no_of_ratings']
    for cat in data['sub_category'].unique()])

print("ANOVA statistic:", anova_stat)
print("P-value:", p_value_anova)

ANOVA statistic: 144.66464846384974
P-value: 0.0


### Interpretation of ANOVA Results:

The ANOVA yields an **F-statistic of 144.66** and a **p-value reported as 0.0** (too small to represent in floating point), meaning that the mean number of ratings differs significantly across the subcategories. The large F-statistic indicates that the variance between the group means is much larger than the variance within the groups, making it unlikely that the observed differences are due to random variation.
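
The ratio of between-group to within-group variance can be written out directly. The sketch below uses three synthetic groups (arbitrary sizes and means, not the notebook's subcategories) and checks the hand-computed F against `stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Synthetic groups; means and sizes are arbitrary assumptions for illustration.
rng = np.random.default_rng(1)
groups = [rng.normal(loc=mean, scale=10.0, size=size)
          for mean, size in [(50.0, 40), (55.0, 60), (70.0, 50)]]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)              # number of groups
n_total = all_values.size    # total number of observations

# Sum of squares between groups and within groups.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}, p = {p_scipy:.3g}")
```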


### Conclusion for Hypothesis 1 (T-test):
The t-test shows a statistically significant difference in discount prices between high-rated and low-rated products. Based on the very low p-value, we can reject the null hypothesis.

### Conclusion for Hypothesis 2 (ANOVA):
The ANOVA test shows a significant difference in the number of ratings across different subcategories, suggesting that certain subcategories are more popular or attract more feedback from customers.