### Hypothesis Testing – World Happiness Analysis

#### Goal  
This notebook investigates whether specific factors significantly influence a country's **Happiness Score** using hypothesis testing techniques. The objective is to provide statistical evidence to support or reject the relationships between happiness and key variables such as **Family**, **Health**, **Generosity**, and **Continent**.

#### What’s Included
- Formulation of null and alternative hypotheses  
- Selection and application of appropriate statistical tests
- Step-by-step hypothesis testing process  
- Interpretation of p-values and conclusions  
- Data visualisation to support findings

#### Factors Tested
1. **Family support** – Is there a significant difference in Happiness Score based on family support levels?  
2. **Healthy life expectancy** – Does better health significantly impact happiness?  
3. **Continent** – Does happiness significantly differ across continents?  
4. **Generosity** – Despite low correlation, does generosity still show a significant effect on happiness?

---

#### Hypothesis 1 Goal

Is higher family support (as measured in the dataset) associated with significantly higher happiness scores across countries?

We'll compare:

- **Group A:** Countries with high family support (above the median)
- **Group B:** Countries with low family support (at or below the median)



In [3]:
# Import required libraries
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, shapiro, levene, mannwhitneyu
import matplotlib.pyplot as plt
import seaborn as sns


In [6]:
# Load your dataset
df = pd.read_csv('../data/cleaned/winsorized_df_all.csv')

In [9]:
# Find the median value of the 'Family' column
family_median = df['Family'].median()

# Split the data into two groups based on Family support
high_family = df[df['Family'] > family_median]['Happiness Score']
low_family = df[df['Family'] <= family_median]['Happiness Score']

# Check for normal distribution in both groups using Shapiro-Wilk test
shapiro_high = shapiro(high_family)
shapiro_low = shapiro(low_family)

print("Shapiro Test - High Family Support:", shapiro_high)
print("Shapiro Test - Low Family Support:", shapiro_low)

Shapiro Test - High Family Support: ShapiroResult(statistic=0.9863836115386565, pvalue=8.70840576689903e-06)
Shapiro Test - Low Family Support: ShapiroResult(statistic=0.9880601031402853, pvalue=3.5036370130799594e-05)


We ran the Shapiro-Wilk test to check whether the Happiness Score is normally distributed within the high and low Family support groups. Both p-values are below 0.05 (about 8.7e-06 for the high group and 3.5e-05 for the low group), so we reject the null hypothesis of normality for each group. The test gives evidence that the data is **not normally distributed**, so we use a **non-parametric test** (the Mann-Whitney U test) to compare happiness scores between the two groups.
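The decision rule described above can be sketched as a small helper. This is only an illustration on synthetic data: `pick_two_sample_test` is a hypothetical name that does not appear in the notebook, and the rule of falling back to a rank-based test when either group fails Shapiro-Wilk is one reasonable reading of the procedure, not the only one.

```python
import numpy as np
from scipy.stats import norm, shapiro


def pick_two_sample_test(group_a, group_b, alpha=0.05):
    """Return the name of a two-sample test based on Shapiro-Wilk results.

    Hypothetical helper: not part of the original notebook.
    """
    # Shapiro-Wilk null hypothesis: the sample comes from a normal distribution
    p_a = shapiro(group_a).pvalue
    p_b = shapiro(group_b).pvalue
    # If normality is rejected for either group, fall back to a rank-based test
    if p_a < alpha or p_b < alpha:
        return "mannwhitneyu"
    return "ttest_ind"


# Synthetic data for illustration only
rng = np.random.default_rng(0)
skewed_a = rng.exponential(scale=1.0, size=300)
skewed_b = rng.exponential(scale=2.0, size=300)
# Evenly spaced normal quantiles: an idealised, almost perfectly normal sample
ideal = norm.ppf((np.arange(1, 101) - 0.5) / 100)

choice_skewed = pick_two_sample_test(skewed_a, skewed_b)
choice_ideal = pick_two_sample_test(ideal, ideal + 1.0)
print(choice_skewed, choice_ideal)
```

In the notebook, the same helper could be called with `high_family` and `low_family`. With large groups, Shapiro-Wilk can flag even mild departures from normality, so a histogram or Q-Q plot is a useful companion check.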


In [11]:
from scipy.stats import mannwhitneyu

# Perform the Mann-Whitney U test (non-parametric).
# The alternative is two-sided by default in recent SciPy releases.
stat, p_value = mannwhitneyu(high_family, low_family)

print(f"\nMann-Whitney U Test Statistic: {stat}")
print(f"P-value: {p_value}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("\nConclusion: There is a statistically significant difference in happiness scores between high and low family support groups.")
else:
    print("\nConclusion: There is no statistically significant difference in happiness scores between the groups.")



Mann-Whitney U Test Statistic: 343382.0
P-value: 4.86236910903912e-79

Conclusion: There is a statistically significant difference in happiness scores between high and low family support groups.


We used the Mann-Whitney U test to compare happiness scores between countries with high and low family support because normality was rejected for both groups. The test returned a statistic of 343,382 and a p-value of about 4.86e-79, far below 0.05, indicating a statistically significant difference between the two groups. Because the test is two-sided, it shows only that the two distributions differ. The expected direction (higher happiness with higher family support) should be confirmed by comparing the group medians.
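A significant two-sided result says nothing about direction or size. One way to check both is a one-sided test together with the rank-biserial correlation, sketched below on toy numbers that are not taken from the dataset. The helper `rank_biserial` is a hypothetical name, and the sketch assumes SciPy 1.7 or later, where `mannwhitneyu` reports the U statistic of the first sample.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def rank_biserial(u_stat, n1, n2):
    """Rank-biserial correlation from the U statistic of the first sample."""
    return 2.0 * u_stat / (n1 * n2) - 1.0


# Toy scores for illustration only (not taken from the dataset)
high = np.array([6.1, 6.8, 7.2, 7.5])
low = np.array([3.9, 4.4, 5.0, 5.6])

# One-sided alternative: values in `high` tend to be larger than in `low`
res = mannwhitneyu(high, low, alternative="greater")
effect = rank_biserial(res.statistic, len(high), len(low))
median_gap = np.median(high) - np.median(low)

print(f"U = {res.statistic}, one-sided p = {res.pvalue:.4f}")
print(f"rank-biserial r = {effect:.2f}, median difference = {median_gap:.2f}")
```

In the notebook, `high` and `low` would be replaced by `high_family` and `low_family`. A rank-biserial value near 1 means almost every high-group score exceeds almost every low-group score, while a value near 0 means the groups overlap heavily.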

#### Hypothesis 2 Goal

Is higher healthy life expectancy associated with significantly higher happiness scores across countries?

We'll compare:

- **Group A:** Countries with high healthy life expectancy (above the median)
- **Group B:** Countries with low healthy life expectancy (at or below the median)

In [13]:
# Find the median value of the 'Healthy life expectancy' column
life_exp_median = df['Healthy life expectancy'].median()

# Split the data into two groups based on Healthy life expectancy
high_life_exp = df[df['Healthy life expectancy'] > life_exp_median]['Happiness Score']
low_life_exp = df[df['Healthy life expectancy'] <= life_exp_median]['Happiness Score']

# Check for normal distribution in both groups using Shapiro-Wilk test
shapiro_high = shapiro(high_life_exp)
shapiro_low = shapiro(low_life_exp)

print("Shapiro Test - High Healthy Life Expectancy:", shapiro_high)
print("Shapiro Test - Low Healthy Life Expectancy:", shapiro_low)

Shapiro Test - High Healthy Life Expectancy: ShapiroResult(statistic=0.9813847613930053, pvalue=2.1781921935745383e-07)
Shapiro Test - Low Healthy Life Expectancy: ShapiroResult(statistic=0.9929052081580103, pvalue=0.0033308354812517154)


The Shapiro-Wilk test was used to check whether the Happiness Score is normally distributed within the high and low Healthy Life Expectancy groups. Both groups returned p-values below 0.05 (about 2.2e-07 for high and 0.0033 for low), so we reject the null hypothesis of normality for each group. Given this evidence against normality, a non-parametric test, the Mann-Whitney U test, is used for the comparison.

In [14]:
# Shapiro-Wilk rejected normality for both groups, so use the Mann-Whitney U test
stat, p_value = mannwhitneyu(high_life_exp, low_life_exp)

print(f"\nMann-Whitney U Test Statistic: {stat}")
print(f"P-value: {p_value}")

# Interpret the result
alpha = 0.05
if p_value < alpha:
    print("\nConclusion: There is a statistically significant difference in happiness scores between high and low healthy life expectancy groups.")
else:
    print("\nConclusion: There is no statistically significant difference in happiness scores between the two groups.")


Mann-Whitney U Test Statistic: 345632.5
P-value: 9.286679437728698e-82

Conclusion: There is a statistically significant difference in happiness scores between high and low healthy life expectancy groups.


The Mann-Whitney U test was performed to compare happiness scores between countries with high and low Healthy Life Expectancy. The test returned a statistic of 345,632.5 and a very small p-value (9.29e-82), far below the 0.05 significance level. This indicates a statistically significant difference in happiness scores between the two groups. The two-sided test does not establish direction on its own. The result is consistent with higher Healthy Life Expectancy being associated with higher happiness scores, which a comparison of group medians would confirm.
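Both hypotheses follow the same steps: split at the median of a factor, then compare Happiness Score between the halves. A minimal sketch of that procedure as one reusable function is shown below. `compare_by_median_split` and the toy frame are hypothetical and used only for illustration. The column names match those used in the notebook.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def compare_by_median_split(data, factor, outcome="Happiness Score"):
    """Split `data` at the median of `factor` and compare `outcome` between halves.

    Hypothetical helper: not part of the original notebook.
    """
    cutoff = data[factor].median()
    high = data.loc[data[factor] > cutoff, outcome]
    low = data.loc[data[factor] <= cutoff, outcome]
    # Two-sided Mann-Whitney U test, stated explicitly
    stat, p_value = mannwhitneyu(high, low, alternative="two-sided")
    return {
        "cutoff": cutoff,
        "n_high": len(high),
        "n_low": len(low),
        "median_high": high.median(),
        "median_low": low.median(),
        "u_stat": stat,
        "p_value": p_value,
    }


# Toy frame for illustration only; the real notebook would pass `df`
toy = pd.DataFrame({
    "Healthy life expectancy": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "Happiness Score": [3.5, 4.0, 4.2, 4.8, 5.9, 6.3, 6.8, 7.4],
})
summary = compare_by_median_split(toy, "Healthy life expectancy")
print(summary)
```

Returning the two medians alongside the p-value makes the direction of the difference visible in the same summary, and calling the function with `"Family"` would reproduce the first comparison.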