# Descriptive Analysis – World Happiness Health Analytics

## Notebook Overview
This descriptive analysis examines the **merged World Happiness dataset from 2015–2024**. As loaded, the file has **1,318 rows**; after the Congo clean-up described below it has **1,310 rows** covering **131 countries** (10 years each). The analysis focuses on understanding patterns, distributions, and relationships in global happiness and health metrics.


**Dataset Columns**:  
- Country  
- Happiness Rank  
- Happiness Score  
- Economy  
- Family  
- Healthy Life Expectancy  
- Freedom to Make Life Choices  
- Perceptions of Corruption  
- Generosity  
- Continent  
- Year  

---

## Analysis Framework: Four Key Descriptive Charts

### 1. Health vs Happiness Relationship Analysis
- **Type of Analysis**: Correlation Analysis (bivariate: a statistical technique that examines the relationship between two continuous variables)  
- **Chart Type**: Scatter plot with trend line and continent color-coding  
- **Purpose**: Explores the direct relationship between healthy life expectancy and happiness scores across countries  

**Variables**:
- X-axis: Healthy Life Expectancy  
- Y-axis: Happiness Score  
- Color: Continent  

**Expected Insights**:
- Strength of health-happiness correlation  
- Regional patterns in health-happiness relationship  
- Identification of outlier countries  
- Linear vs non-linear relationship patterns  
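
As a rough illustration of what the scatter plot's trend line and correlation strength summarize, the sketch below computes a Pearson coefficient and a least-squares line. The `toy` frame and the helper name `health_happiness_summary` are assumptions made for this example; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the notebook itself loads the merged CSV.
toy = pd.DataFrame({
    "Healthy life expectancy": [0.30, 0.45, 0.60, 0.75, 0.90, 1.00],
    "Happiness Score": [3.6, 4.4, 5.1, 5.9, 6.8, 7.4],
    "Continent": ["Africa", "Africa", "Asia", "Asia", "Europe", "Europe"],
})


def health_happiness_summary(frame):
    """Return Pearson r plus the slope and intercept of a least-squares line."""
    x = frame["Healthy life expectancy"].to_numpy()
    y = frame["Happiness Score"].to_numpy()
    r = np.corrcoef(x, y)[0, 1]                  # strength of the linear relationship
    slope, intercept = np.polyfit(x, y, deg=1)   # trend line: y = slope * x + intercept
    return {"r": r, "slope": slope, "intercept": intercept}


summary = health_happiness_summary(toy)

# Per-continent coefficients hint at regional differences.
# Each group needs at least two rows for r to be defined.
r_by_continent = {
    name: health_happiness_summary(group)["r"]
    for name, group in toy.groupby("Continent")
}
```

A straight line is only one candidate shape; if the scatter bends, a rank-based coefficient or a fitted curve may describe it better.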

---

### 2. Happiness Score Trends Over Time (2015–2024)
- **Type of Analysis**: Time Series Analysis (Temporal Trend)  
- **Chart Type**: Multi-line chart or area chart  
- **Purpose**: Reveals temporal patterns and global happiness trajectory over the decade  

**Variables**:
- X-axis: Year (2015–2024)  
- Y-axis: Happiness Score  
- Multiple lines: Top countries or continental averages  

**Expected Insights**:
- Global happiness trend direction (improving/declining)  
- Year-over-year changes and volatility  
- Identification of significant events affecting happiness  
- Comparative country/regional performance over time  
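
The lines in such a chart are typically yearly averages per group. A minimal sketch, assuming a toy frame with the dataset's `Continent`, `Year`, and `Happiness Score` columns, shows how those averages and their year-over-year changes could be derived:

```python
import pandas as pd

# Toy rows for illustration only (two continents, two years).
toy = pd.DataFrame({
    "Continent": ["Africa", "Africa", "Africa", "Africa",
                  "Europe", "Europe", "Europe", "Europe"],
    "Year": [2015, 2015, 2016, 2016, 2015, 2015, 2016, 2016],
    "Happiness Score": [4.0, 5.0, 4.5, 5.5, 6.0, 7.0, 6.0, 6.0],
})

# One row per year, one column per continent: the shape a multi-line chart expects.
yearly = toy.pivot_table(
    index="Year", columns="Continent", values="Happiness Score", aggfunc="mean"
)

# Year-over-year change; the first year has no predecessor, so it is NaN.
yoy_change = yearly.diff()
```

With the real data, `yearly.plot()` would draw one line per continent.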

---

### 3. Key Happiness Factors Correlation Matrix
- **Type of Analysis**: Correlation Matrix Analysis (multivariate: examines the pairwise relationships among several variables at once)  
- **Chart Type**: Correlation heatmap  
- **Purpose**: Identifies which factors are most strongly associated with happiness, and how the factors relate to one another  

**Variables**:
- Happiness Score  
- Economy  
- Family  
- Healthy Life Expectancy  
- Freedom to Make Life Choices  
- Perceptions of Corruption  
- Generosity  

**Expected Insights**:
- Factors most strongly correlated with happiness  
- Factor interdependencies  
- Potential multicollinearity issues  
- Comprehensive relationship mapping  
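
The heatmap is a picture of a pairwise correlation table. The sketch below builds one on synthetic data (the numbers are generated, not taken from the dataset), ranks factors by their absolute correlation with the score, and flags factor pairs that might signal multicollinearity. The 0.8 cut-off is an arbitrary assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: health is built to track economy closely, generosity is independent.
rng = np.random.default_rng(0)
n = 200
economy = rng.normal(size=n)
health = 0.8 * economy + 0.2 * rng.normal(size=n)
generosity = rng.normal(size=n)
score = economy + 0.5 * health + 0.1 * rng.normal(size=n)

toy = pd.DataFrame({
    "Happiness Score": score,
    "Economy": economy,
    "Healthy life expectancy": health,
    "Generosity": generosity,
})

corr = toy.corr(method="pearson")  # the table a heatmap would colour

# Rank factors by absolute correlation with the score.
ranked = (
    corr["Happiness Score"]
    .drop("Happiness Score")
    .abs()
    .sort_values(ascending=False)
)

# Flag factor pairs whose mutual correlation exceeds the (assumed) threshold.
factors = corr.drop(index="Happiness Score", columns="Happiness Score")
high_pairs = [
    (a, b)
    for i, a in enumerate(factors.columns)
    for b in factors.columns[i + 1:]
    if abs(factors.loc[a, b]) > 0.8
]
```

With seaborn available, `sns.heatmap(corr, annot=True)` would render the same table as a heatmap.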

---

### 4. Continental Health & Happiness Comparison
- **Type of Analysis**: Comparative Analysis (Cross-sectional)  
- **Chart Type**: Grouped bar chart or dual-axis chart  
- **Purpose**: Compares both health and happiness metrics across continents  

**Variables**:
- X-axis: Continent  
- Y-axis 1: Average Happiness Score  
- Y-axis 2: Average Healthy Life Expectancy  
- Representation: Grouped bars or dual axes  

**Expected Insights**:
- Regional performance rankings  
- Health-happiness gaps by continent  
- Identification of best/worst performing regions  
- Regional disparities and patterns  
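
The bars in this comparison are per-continent means. A minimal sketch on toy values (not taken from the dataset) shows the aggregation and a simple ranking:

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Continent": ["Africa", "Africa", "Asia", "Asia", "Europe", "Europe"],
    "Happiness Score": [4.0, 5.0, 5.0, 6.0, 6.5, 7.5],
    "Healthy life expectancy": [0.30, 0.40, 0.60, 0.70, 0.90, 1.00],
})

# Mean of both metrics per continent, best happiness first.
by_continent = (
    toy.groupby("Continent")[["Happiness Score", "Healthy life expectancy"]]
    .mean()
    .sort_values("Happiness Score", ascending=False)
)

# 1 = highest average happiness.
by_continent["Happiness rank"] = (
    by_continent["Happiness Score"].rank(ascending=False).astype(int)
)
```

The two metrics sit on different scales, which is why a dual-axis chart (or rescaling before grouping the bars) is worth considering.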

---

### 📊 Chart Outputs
- **Scatter Plot**: Health vs Happiness with continental grouping  
- **Time Series Plot**: Happiness trends over 2015–2024  
- **Correlation Heatmap**: All factor relationships  
- **Comparative Bar Chart**: Continental performance comparison  


---

## Next Steps
This descriptive analysis will provide the foundation for:
- **Diagnostic Analysis**: Investigating why certain patterns exist  
- **Predictive Analysis**: Forecasting future happiness trends  
- **Prescriptive Analysis**: Recommending policy interventions  
- **Machine Learning Models**: Building predictive algorithms  

---

## Tools and Libraries
- **Python**: Primary programming language  
- **Pandas**: Data manipulation and analysis  
- **Matplotlib / Seaborn**: Statistical visualizations  
- **Plotly**: Interactive charts  
- **NumPy**: Numerical computations  
- **SciPy**: Statistical analysis functions  


In [13]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats  # For correlation analysis

In [36]:
df = pd.read_csv('../data/cleaned/cleaned_df_all_2015_to_2024.csv')

In [38]:
# Initial data exploration
df.info()
df.describe()
df.isnull().sum()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1318 entries, 0 to 1317
Data columns (total 11 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country                       1318 non-null   object 
 1   Happiness Rank                1318 non-null   int64  
 2   Happiness Score               1318 non-null   float64
 3   Economy                       1318 non-null   float64
 4   Family                        1318 non-null   float64
 5   Healthy life expectancy       1318 non-null   float64
 6   Freedom to make life choices  1318 non-null   float64
 7   Perceptions of corruption     1318 non-null   float64
 8   Generosity                    1318 non-null   float64
 9   Continent                     1318 non-null   object 
 10  Year                          1318 non-null   int64  
dtypes: float64(7), int64(2), object(2)
memory usage: 113.4+ KB


|   | Country | Happiness Rank | Happiness Score | Economy | Family | Healthy life expectancy | Freedom to make life choices | Perceptions of corruption | Generosity | Continent | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 153 | 3.575 | 0.31982 | 0.30285 | 0.30335 | 0.23414 | 0.09719 | 0.3651 | Asia | 2015 |
| 1 | Afghanistan | 154 | 3.36 | 0.38227 | 0.11037 | 0.17344 | 0.1643 | 0.07112 | 0.31268 | Asia | 2016 |
| 2 | Afghanistan | 141 | 3.794 | 0.401477 | 0.581543 | 0.180747 | 0.10618 | 0.061158 | 0.311871 | Asia | 2017 |
| 3 | Afghanistan | 145 | 3.632 | 0.332 | 0.537 | 0.255 | 0.085 | 0.036 | 0.191 | Asia | 2018 |
| 4 | Afghanistan | 154 | 3.203 | 0.35 | 0.517 | 0.361 | 0.0 | 0.025 | 0.158 | Asia | 2019 |


# Fixing Congo Variants (Row Mismatch 1318 → 1310)

## 🎯 Goal
- **Expected rows:** 1,310 (131 countries × 10 years, 2015–2024)  
- **Actual rows:** 1,318  
- **Cause:** *Congo* appears **18 times** instead of the expected 10.  
  - Two source names were both mapped to `Congo`:
    1. **Congo (Brazzaville)**  → mapped to `Congo`
    2. **Congo (Kinshasa)**  → mapped to `Congo`
- One of these variants is absent from 2 of the 10 years, so it contributes only 8 rows (10 + 8 = 18). Removing those 8 rows restores the 1,310-row target.

---

## 🗺️ Plan of Attack

1. **Locate both original variants** before mapping.  
2. **Count each variant** by year (2015–2024).  
3. **Identify the variant that does *not* appear in all 10 years.**  
4. **Drop all rows** of that incomplete variant (8 rows).  
5. **Verify final row count** = 1,310.
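
Steps 1 to 3 can be sketched as a loop over the raw yearly frames. The `raw` dictionary below is a hypothetical three-year stand-in for the real files, and `congo_presence` is a helper name invented for this example; the three candidate name columns are the ones the raw files use.

```python
import pandas as pd

# Hypothetical stand-ins for the raw yearly files. The name column differs by year.
raw = {
    2019: pd.DataFrame({"Country or region": ["Congo (Brazzaville)", "Congo (Kinshasa)", "Norway"]}),
    2020: pd.DataFrame({"Country name": ["Congo (Brazzaville)", "Congo (Kinshasa)", "Norway"]}),
    2021: pd.DataFrame({"Country name": ["Congo (Brazzaville)", "Norway"]}),
}

NAME_COLUMNS = ("Country", "Country or region", "Country name")


def congo_presence(frames):
    """Return a variant-by-year table counting rows whose name contains 'Congo'."""
    records = []
    for year, frame in frames.items():
        # Pick whichever name column this year's file uses.
        name_col = next(c for c in NAME_COLUMNS if c in frame.columns)
        names = frame[name_col]
        for variant in names[names.str.contains("Congo", case=False, na=False)]:
            records.append({"Year": year, "Variant": variant})
    found = pd.DataFrame(records)
    return pd.crosstab(found["Variant"], found["Year"])


presence = congo_presence(raw)

# A variant is incomplete if it is missing from at least one year.
years_present = (presence > 0).sum(axis=1)
incomplete = years_present[years_present < len(raw)].index.tolist()
```

On the toy input, `incomplete` contains only `Congo (Kinshasa)`, the variant with a missing year.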

In [39]:
# Total number of unique countries
unique_countries_count = df['Country'].nunique()
print(f"Total unique countries: {unique_countries_count}")

# Number of times each country appears
country_counts = df['Country'].value_counts()

print(country_counts.to_string())



Total unique countries: 131
Country
Congo                      18
Afghanistan                10
Netherlands                10
Poland                     10
Philippines                10
Peru                       10
Paraguay                   10
Panama                     10
Palestinian Territories    10
Pakistan                   10
Norway                     10
North Macedonia            10
Nigeria                    10
Niger                      10
Nicaragua                  10
New Zealand                10
Nepal                      10
Romania                    10
Myanmar                    10
Morocco                    10
Montenegro                 10
Mongolia                   10
Moldova                    10
Mexico                     10
Mauritius                  10
Mauritania                 10
Malta                      10
Mali                       10
Malaysia                   10
Malawi                     10
Madagascar                 10
Luxembourg                 10
Port

In [40]:
# Find Congo rows that share a (Country, Year) pair with another row
congo = df[df['Country'] == 'Congo']
duplicate_rows = congo.duplicated(subset=['Country', 'Year'], keep=False)

# Show duplicated rows only (mask and frame share the same index)
print(congo[duplicate_rows])


    Country  Happiness Rank  Happiness Score   Economy    Family  \
240   Congo             120            4.517  0.000000  1.001200   
241   Congo             139            3.989  0.678660  0.662900   
242   Congo             125            4.272  0.056610  0.806760   
243   Congo             127            4.236  0.771090  0.477990   
244   Congo             124            4.291  0.808964  0.832044   
245   Congo             126            4.280  0.092102  1.229023   
246   Congo             114            4.559  0.682000  0.811000   
247   Congo             132            4.245  0.069000  1.136000   
248   Congo             103            4.812  0.673000  0.799000   
249   Congo             127            4.418  0.094000  1.125000   
250   Congo              88            5.190  0.630000  0.760000   
251   Congo             131            4.310  0.060000  0.830000   
254   Congo              86            5.267  0.921000  0.665000   
255   Congo             133            3.207  0.

In [41]:
# Load each CSV file into a separate DataFrame
df_2015 = pd.read_csv('../data/raw/2015.csv')  # Load the 2015 World Happiness data
df_2016 = pd.read_csv('../data/raw/2016.csv')  # Load the 2016 World Happiness data
df_2017 = pd.read_csv('../data/raw/2017.csv')  # Load the 2017 World Happiness data
df_2018 = pd.read_csv('../data/raw/2018.csv')  # Load the 2018 World Happiness data
df_2019 = pd.read_csv('../data/raw/2019.csv')  # Load the 2019 World Happiness data
df_2020 = pd.read_csv('../data/raw/2020.csv')  # Load the 2020 World Happiness data
df_2021 = pd.read_csv('../data/raw/2021.csv')  # Load the 2021 World Happiness data
df_2022 = pd.read_csv('../data/raw/2022.csv')  # Load the 2022 World Happiness data
df_2023 = pd.read_csv('../data/raw/2023.csv')  # Load the 2023 World Happiness data
df_2024 = pd.read_csv('../data/raw/2024.csv')  # Load the 2024 World Happiness data

In [42]:
# Congo variants in each dataset from 2015 to 2024

congo_2015 = df_2015[df_2015['Country'].str.contains('Congo', case=False, na=False)]
print("Congo variants in 2015 dataset:")
print(congo_2015[['Country','Happiness Rank']])

congo_2016 = df_2016[df_2016['Country'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2016 dataset:")
print(congo_2016[['Country','Happiness Rank']])

congo_2017 = df_2017[df_2017['Country'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2017 dataset:")
print(congo_2017[['Country','Happiness.Rank']])

congo_2018 = df_2018[df_2018['Country or region'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2018 dataset:")
print(congo_2018[['Country or region','Overall rank']])

congo_2019 = df_2019[df_2019['Country or region'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2019 dataset:")
print(congo_2019[['Country or region','Overall rank']])

congo_2020 = df_2020[df_2020['Country name'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2020 dataset:")
print(congo_2020[['Country name','Happiness Rank']])

congo_2021 = df_2021[df_2021['Country name'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2021 dataset:")
print(congo_2021[['Country name','Happiness Rank']])

congo_2022 = df_2022[df_2022['Country name'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2022 dataset:")
print(congo_2022[['Country name','Happiness Rank']])

congo_2023 = df_2023[df_2023['Country name'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2023 dataset:")
print(congo_2023[['Country name','Happiness Rank']])

congo_2024 = df_2024[df_2024['Country name'].str.contains('Congo', case=False, na=False)]
print("\nCongo variants in 2024 dataset:")
print(congo_2024[['Country name','Happiness Rank']])


Congo variants in 2015 dataset:
                 Country  Happiness Rank
119     Congo (Kinshasa)             120
138  Congo (Brazzaville)             139

Congo variants in 2016 dataset:
                 Country  Happiness Rank
124     Congo (Kinshasa)             125
126  Congo (Brazzaville)             127

Congo variants in 2017 dataset:
                 Country  Happiness.Rank
123  Congo (Brazzaville)             124
125     Congo (Kinshasa)             126

Congo variants in 2018 dataset:
       Country or region  Overall rank
113  Congo (Brazzaville)           114
131     Congo (Kinshasa)           132

Congo variants in 2019 dataset:
       Country or region  Overall rank
102  Congo (Brazzaville)           103
126     Congo (Kinshasa)           127

Congo variants in 2020 dataset:
            Country name  Happiness Rank
87   Congo (Brazzaville)              88
130     Congo (Kinshasa)             131

Congo variants in 2021 dataset:
           Country name  Happiness Rank
82  

**Action:**  
To resolve this, I compared the **Happiness Rank** values of both variants across the years. **Congo (Kinshasa)** is the incomplete variant: it has no rows for 2021 and 2022. I therefore dropped its 8 remaining rows from the merged dataset, identifying them by their (Year, Happiness Rank) pairs. This leaves only the complete **Congo (Brazzaville)** series, mapped to **`Congo`**.

In [47]:


#  Define the (Year, Happiness Rank) pairs to remove
rows_to_drop = [
    (2015, 120),
    (2016, 125),
    (2017, 126),
    (2018, 132),
    (2019, 127),
    (2020, 131),
    (2023, 133),
    (2024, 139),
]

# Build a boolean mask that is True only for those rows
mask = (
    (df['Country'] == 'Congo') &
    df.set_index(['Year', 'Happiness Rank']).index.isin(rows_to_drop)
)

#  Drop the rows
df = df[~mask]

#  Verify the result
print("Expected rows:", 131 * 10)        # 1 310
print("Actual rows after drop:", len(df))


Expected rows: 1310
Actual rows after drop: 1310


In [48]:
# Total number of unique countries
unique_countries_count = df['Country'].nunique()
print(f"Total unique countries: {unique_countries_count}")

# Number of times each country appears
country_counts = df['Country'].value_counts()

print(country_counts.to_string())


Total unique countries: 131
Country
Afghanistan                10
Romania                    10
Poland                     10
Philippines                10
Peru                       10
Paraguay                   10
Panama                     10
Palestinian Territories    10
Pakistan                   10
Norway                     10
North Macedonia            10
Nigeria                    10
Niger                      10
Nicaragua                  10
New Zealand                10
Netherlands                10
Nepal                      10
Myanmar                    10
Morocco                    10
Montenegro                 10
Mongolia                   10
Moldova                    10
Mexico                     10
Mauritius                  10
Mauritania                 10
Malta                      10
Mali                       10
Malaysia                   10
Malawi                     10
Madagascar                 10
Luxembourg                 10
Portugal                   10
Russ

In [49]:
congo_rows = df[df['Country'] == 'Congo']
print("Total 'Congo' rows:", len(congo_rows))
display(congo_rows)


Total 'Congo' rows: 10


|   | Country | Happiness Rank | Happiness Score | Economy | Family | Healthy life expectancy | Freedom to make life choices | Perceptions of corruption | Generosity | Continent | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 241 | Congo | 139 | 3.989 | 0.67866 | 0.6629 | 0.31051 | 0.41466 | 0.11686 | 0.12388 | Africa | 2015 |
| 243 | Congo | 127 | 4.236 | 0.77109 | 0.47799 | 0.28212 | 0.37938 | 0.09753 | 0.12077 | Africa | 2016 |
| 244 | Congo | 124 | 4.291 | 0.808964 | 0.832044 | 0.289957 | 0.435026 | 0.079618 | 0.120852 | Africa | 2017 |
| 246 | Congo | 114 | 4.559 | 0.682 | 0.811 | 0.343 | 0.514 | 0.077 | 0.091 | Africa | 2018 |
| 248 | Congo | 103 | 4.812 | 0.673 | 0.799 | 0.508 | 0.372 | 0.093 | 0.105 | Africa | 2019 |
| 250 | Congo | 88 | 5.19 | 0.63 | 0.76 | 0.46 | 0.39 | 0.12 | 0.12 | Africa | 2020 |
| 252 | Congo | 83 | 5.342 | 0.518 | 0.392 | 0.307 | 0.381 | 0.124 | 0.144 | Africa | 2021 |
| 253 | Congo | 99 | 5.075 | 0.95 | 0.405 | 0.355 | 0.431 | 0.146 | 0.13 | Africa | 2022 |
| 254 | Congo | 86 | 5.267 | 0.921 | 0.665 | 0.145 | 0.464 | 0.136 | 0.134 | Africa | 2023 |
| 256 | Congo | 89 | 5.221 | 0.892 | 0.622 | 0.306 | 0.523 | 0.138 | 0.124 | Africa | 2024 |


The unwanted Congo (Kinshasa) rows were dropped successfully.

Final dataset row count: 1,310 rows, as expected (131 countries × 10 years).

Exactly one `Congo` entry per year remains for analysis.
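
The same guarantees can be checked mechanically for every country, not just Congo. The sketch below is one way to do that; `check_panel` is a helper name invented for this example, and the toy frame uses two years instead of ten.

```python
import pandas as pd


def check_panel(frame, n_years=10):
    """Return (rows sharing a Country/Year pair, countries whose row count != n_years)."""
    dupes = frame[frame.duplicated(subset=["Country", "Year"], keep=False)]
    counts = frame.groupby("Country").size()
    off = counts[counts != n_years]
    return dupes, off


# Toy frame: Congo has two rows for 2015, mimicking the merged-variant problem.
toy = pd.DataFrame({
    "Country": ["Congo", "Congo", "Congo", "Norway", "Norway"],
    "Year": [2015, 2015, 2016, 2015, 2016],
})

dupes, off = check_panel(toy, n_years=2)

# After removing the unwanted row (index 0 here), both checks come back empty.
cleaned = toy.drop(index=0)
dupes_after, off_after = check_panel(cleaned, n_years=2)
```

On the cleaned notebook data, `check_panel(df)` returning two empty results would confirm one row per country per year.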