# World Happiness – Regional Interpretation of Well-Being
**Student:** Bakhtiyor Sohibnazarov  
**Student Number:** Z22590018  
**Module:** Data Visualization   
**Date Updated:** December 19, 2025  

This notebook documents the data preparation, exploratory analysis, research questions, and visualisation workflow used in the assessment.

## 1. Importing Libraries and Loading the Dataset  
This section loads all required Python libraries and imports the World Happiness dataset from the working directory.  
Basic inspection steps are also performed to understand the structure and quality of the data.


In [29]:
!pip -q install pandas numpy matplotlib seaborn

In [74]:
# Importing essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure default visual styles
sns.set(style="whitegrid", context="paper")

# Load dataset
df = pd.read_csv("dataset/world-happiness-report.csv")

# Display the first rows
df.head()

Unnamed: 0,Country name,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect
0,Afghanistan,2008,3.724,7.37,0.451,50.8,0.718,0.168,0.882,0.518,0.258
1,Afghanistan,2009,4.402,7.54,0.552,51.2,0.679,0.19,0.85,0.584,0.237
2,Afghanistan,2010,4.758,7.647,0.539,51.6,0.6,0.121,0.707,0.618,0.275
3,Afghanistan,2011,3.832,7.62,0.521,51.92,0.496,0.162,0.731,0.611,0.267
4,Afghanistan,2012,3.783,7.705,0.521,52.24,0.531,0.236,0.776,0.71,0.268


## 2. Data Preparation

This section performs light cleaning and ensures that key variables are available in a consistent format.

Rename the columns for easier analysis. The Positive affect and Negative affect columns are not used, so they are removed from the dataset being analysed.

In [75]:
# Rename the original columns to shorter names
df = df.rename(columns={
    "Country name": "Country",
    "year": "Year",
    "Life Ladder": "Happiness",
    "Log GDP per capita": "GDP",
    "Social support": "SocialSupport",
    "Healthy life expectancy at birth": "Health",
    "Freedom to make life choices": "Freedom",
    "Perceptions of corruption": "Corruption"
})

# Drop Positive and Negative affect columns
try:
    df = df.drop(columns=["Positive affect", "Negative affect"])
    print("Specified cols deleted successfully...")
except KeyError:
    # Raised when the columns were already dropped (e.g. on a cell re-run)
    print("Cols don't exist. Skipping...")

num_countries = df["Country"].nunique()

print("Initial number of countries: ", num_countries, "\n\n\n")  # blank lines for readability

# Recheck data structure
df.head()

Specified cols deleted successfully...
Initial number of countries:  166 





Unnamed: 0,Country,Year,Happiness,GDP,SocialSupport,Health,Freedom,Generosity,Corruption
0,Afghanistan,2008,3.724,7.37,0.451,50.8,0.718,0.168,0.882
1,Afghanistan,2009,4.402,7.54,0.552,51.2,0.679,0.19,0.85
2,Afghanistan,2010,4.758,7.647,0.539,51.6,0.6,0.121,0.707
3,Afghanistan,2011,3.832,7.62,0.521,51.92,0.496,0.162,0.731
4,Afghanistan,2012,3.783,7.705,0.521,52.24,0.531,0.236,0.776


The data is reported per country over a range of years. Before starting the analysis, year coverage should be checked for consistency, so that uneven reporting does not introduce bias into the analysis pipeline.

In [76]:
# Per-country year coverage
country_coverage = (
    df.groupby("Country")["Year"]
      .agg(
          YearsReported="nunique",
          YearMin="min",
          YearMax="max"
      )
      .assign(
          TotalSpan=lambda x: x["YearMax"] - x["YearMin"] + 1,
          MissingYears=lambda x: x["TotalSpan"] - x["YearsReported"]
      )
      .sort_values("MissingYears", ascending=False)
)

summary = pd.DataFrame({
    "Years Reported": country_coverage["YearsReported"].describe(),
    "Missing Years": country_coverage["MissingYears"].describe()
}).round(2)

summary

Unnamed: 0,Years Reported,Missing Years
count,166.0,166.0
mean,11.74,1.32
std,3.92,1.77
min,1.0,0.0
25%,10.0,0.0
50%,13.0,1.0
75%,15.0,2.0
max,15.0,8.0
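
To make the coverage metrics concrete, here is a minimal sketch that applies the same aggregation to a made-up panel. The countries `A` and `B` and their years are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy panel: "A" reports 2010, 2011, 2014; "B" reports 2012-2014.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year":    [2010, 2011, 2014, 2012, 2013, 2014],
})

toy_coverage = (
    toy.groupby("Country")["Year"]
       .agg(YearsReported="nunique", YearMin="min", YearMax="max")
       .assign(
           # Calendar years between first and last report, inclusive
           TotalSpan=lambda x: x["YearMax"] - x["YearMin"] + 1,
           # Years inside that span with no report
           MissingYears=lambda x: x["TotalSpan"] - x["YearsReported"],
       )
)

print(toy_coverage)
```

Country `A` spans five calendar years but reports only three of them, so it has two missing years; `B` has none. Note that `MissingYears` only counts gaps between a country's first and last reported year, so a country that starts reporting late or stops early is not penalised by this metric.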


To keep temporal coverage consistent, we find the contiguous 3-year window that retains the most countries and restrict the analysis to countries that report in all three years of that window. This keeps the panel balanced and limits bias from uneven reporting as far as possible.

In [77]:
MIN_YEARS = 3

# ------------------------------------------------------------
# 1. Find best contiguous MIN_YEARS window (max country retention)
# ------------------------------------------------------------

years = np.sort(df["Year"].unique())

windows = [
    (years[i], years[i + MIN_YEARS - 1])
    for i in range(len(years) - MIN_YEARS + 1)
]

best_start, best_end = max(
    windows,
    key=lambda w: df[
        (df["Year"] >= w[0]) & (df["Year"] <= w[1])
    ].groupby("Country")["Year"].nunique().ge(MIN_YEARS).sum()
)

best_window = pd.Series({
    "StartYear": best_start,
    "EndYear": best_end,
    "WindowLength": MIN_YEARS
})

# ------------------------------------------------------------
# 2. Lock dataset to window and enforce consistency
# ------------------------------------------------------------

df_window = df[
    (df["Year"] >= best_start) &
    (df["Year"] <= best_end)
]

df_balanced = (
    df_window
    .groupby("Country")
    .filter(lambda x: x["Year"].nunique() >= MIN_YEARS)
)

# ------------------------------------------------------------
# 3. Country accounting (who stayed, who was lost, and why)
# ------------------------------------------------------------

countries_all = set(df["Country"].unique())
countries_in_window = set(df_window["Country"].unique())
countries_final = set(df_balanced["Country"].unique())

lost_by_window = countries_all - countries_in_window
lost_by_consistency = countries_in_window - countries_final
lost_total = countries_all - countries_final

# ------------------------------------------------------------
# 4. Sanity checks (distribution of years per country)
# ------------------------------------------------------------

check_locked = (
    df_window
    .groupby("Country")["Year"]
    .nunique()
    .value_counts()
    .sort_index()
)

check_balanced = (
    df_balanced
    .groupby("Country")["Year"]
    .nunique()
    .value_counts()
    .sort_index()
)

# ------------------------------------------------------------
# 5. Print clear summary
# ------------------------------------------------------------

print("Best window:")
print(best_window, "\n")

print("=== Country Retention Summary ===")
print(f"Total countries in original dataset: {len(countries_all)}")
print(f"Countries in selected year window:   {len(countries_in_window)}")
print(f"Countries in final dataset:          {len(countries_final)}\n")

print("Lost countries:")
print(f"Excluded by year window:      {len(lost_by_window)}")
print(f"Excluded by < {MIN_YEARS} yrs:          {len(lost_by_consistency)}")
print(f"Total excluded overall:       {len(lost_total)}\n")

# ------------------------------------------------------------
# 6. Export excluded country lists
# ------------------------------------------------------------

pd.Series(sorted(lost_total), name="ExcludedCountries") \
  .to_csv("excluded_countries.csv", index=False)

print("***Successfully exported. Check root directory to see excluded countries list")

Best window:
StartYear       2015
EndYear         2017
WindowLength       3
dtype: int64 

=== Country Retention Summary ===
Total countries in original dataset: 166
Countries in selected year window:   153
Countries in final dataset:          135

Lost countries:
Excluded by year window:      13
Excluded by < 3 yrs:          18
Total excluded overall:       31

***Successfully exported. Check root directory to see excluded countries list
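
The window search can be illustrated on a small made-up panel. This is a sketch only: the helper `count_complete` and the toy countries are hypothetical, and it assumes, like the cell above, that the years in the data are consecutive calendar years.

```python
import numpy as np
import pandas as pd

def count_complete(panel, start, end):
    """Count countries that report every year in [start, end] (hypothetical helper)."""
    in_window = panel[panel["Year"].between(start, end)]
    n_years = end - start + 1
    years_per_country = in_window.groupby("Country")["Year"].nunique()
    return int(years_per_country.eq(n_years).sum())

# Hypothetical toy panel: A reports 2010-2013, B reports 2011-2013, C reports 2010-2011.
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 3 + ["C"] * 2,
    "Year": [2010, 2011, 2012, 2013, 2011, 2012, 2013, 2010, 2011],
})

k = 3  # window length in years
years = np.sort(toy["Year"].unique())
candidates = [
    (int(years[i]), int(years[i + k - 1]))
    for i in range(len(years) - k + 1)
]

# Score each candidate window by how many countries it retains
scores = {w: count_complete(toy, *w) for w in candidates}
best = max(scores, key=scores.get)

print(scores)
print("Best window:", best)
```

In this toy panel the 2011 to 2013 window retains two countries (`A` and `B`), while 2010 to 2012 retains only `A`, so the later window is chosen. When two windows tie, `max` returns the earliest one.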


## 3. Interpolation
### Check missing data
We need to check whether the balanced dataset contains missing cells and interpolate them where possible.

In [81]:
# Check missing data by grouping
missing_by_country_var = (df_balanced.set_index(["Country", "Year"]).isna().groupby("Country").sum())

# Display
missing_by_country_var[missing_by_country_var.sum(axis=1) > 0]

Unnamed: 0_level_0,Happiness,GDP,SocialSupport,Health,Freedom,Generosity,Corruption
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Bahrain,0,0,0,0,0,0,3
China,0,0,0,0,2,0,3
Egypt,0,0,0,0,0,0,1
Jordan,0,0,0,0,0,0,3
Kosovo,0,0,0,3,0,0,0
Kuwait,0,0,0,0,0,0,3
Libya,0,0,0,0,0,0,2
Palestinian Territories,0,0,0,3,0,0,0
Saudi Arabia,0,0,0,0,0,0,3
South Sudan,0,3,0,0,0,3,0


### Interpolate
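Missing values are filled by linear interpolation within each country. With `limit_direction="both"`, gaps at the start or end of a country's series are filled as well. A variable that is missing for every year of a country cannot be filled this way and stays NaN.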

In [82]:
num_cols = [
    "GDP",
    "SocialSupport",
    "Health",
    "Freedom",
    "Generosity",
    "Corruption"]

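# Interpolate within each country. transform keeps the original index
# and operates only on the numeric columns, not on the grouping column.
df_balanced = df_balanced.sort_values(["Country", "Year"]).copy()
df_balanced[num_cols] = (
    df_balanced
    .groupby("Country")[num_cols]
    .transform(
        lambda s: s.interpolate(
            method="linear",
            limit_direction="both"
        )
    )
)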


missing_by_country_var = (
    df_balanced
    .set_index(["Country", "Year"])
    .isna()
    .groupby("Country")
    .sum()
)

missing_by_country_var[missing_by_country_var.sum(axis=1) > 0]

Unnamed: 0_level_0,Happiness,GDP,SocialSupport,Health,Freedom,Generosity,Corruption
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Bahrain,0,0,0,0,0,0,3
China,0,0,0,0,0,0,3
Jordan,0,0,0,0,0,0,3
Kosovo,0,0,0,3,0,0,0
Kuwait,0,0,0,0,0,0,3
Palestinian Territories,0,0,0,3,0,0,0
Saudi Arabia,0,0,0,0,0,0,3
South Sudan,0,3,0,0,0,3,0
Taiwan Province of China,0,0,0,3,0,0,0
Turkmenistan,0,0,0,0,0,0,3
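
The behaviour of per-country linear interpolation can be seen on a small made-up example. This is a sketch under stated assumptions: the three countries and their `Freedom` values are hypothetical, chosen to show an interior gap, edge gaps, and a fully missing series.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data:
#   A has an interior gap, B has gaps at both ends, C has no observations at all.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Year": [2015, 2016, 2017] * 3,
    "Freedom": [0.6, np.nan, 0.8,
                np.nan, 0.5, np.nan,
                np.nan, np.nan, np.nan],
})

filled = toy.copy()
filled["Freedom"] = (
    toy.groupby("Country")["Freedom"]
       .transform(lambda s: s.interpolate(method="linear", limit_direction="both"))
)

print(filled)
```

The interior gap for `A` is filled with the midpoint of its neighbours, the edge gaps for `B` take the nearest observed value, and `C` stays entirely NaN because there is nothing to interpolate from. This is why countries with three missing values for a variable still appear in the table above.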


### Further Cleaning
Interpolation fills isolated gaps, but it cannot help where a country has no observations of a variable anywhere in the window; inventing values there would corrupt the dataset. The most severe cases of missingness are therefore removed to keep the dataset as clean as possible, and any remaining NaNs are handled during analysis.

In [85]:
# Remove South Sudan and the Corruption column, where values are missing for the whole window

# 1. Remove South Sudan (structural GDP/Generosity missingness)
df_clean = df_balanced[df_balanced["Country"] != "South Sudan"].copy()

# 2. Drop Corruption variable (structural missingness across key regions)
df_clean = df_clean.drop(columns=["Corruption"])

# 3. Quick sanity check
print(df_clean.isna().sum().sort_values(ascending=False))
df_clean.head()

Health           9
Country          0
Happiness        0
Year             0
GDP              0
SocialSupport    0
Freedom          0
Generosity       0
dtype: int64


Unnamed: 0,Country,Year,Happiness,GDP,SocialSupport,Health,Freedom,Generosity
7,Afghanistan,2015,3.983,7.702,0.529,53.2,0.389,0.08
8,Afghanistan,2016,4.22,7.697,0.559,53.0,0.523,0.042
9,Afghanistan,2017,2.662,7.697,0.491,52.8,0.427,-0.121
19,Albania,2015,4.607,9.403,0.639,67.8,0.704,-0.081
20,Albania,2016,4.511,9.437,0.638,68.1,0.73,-0.017
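
The choice between dropping a country and dropping a variable can be made explicit by counting fully missing series along both axes. The sketch below uses a made-up panel; the countries and values are hypothetical and it is one possible way to summarise the pattern, not the procedure used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: Corruption is absent for A and B, GDP is absent for B.
toy = pd.DataFrame({
    "Country": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "Year": [2015, 2016, 2017] * 3,
    "GDP": [9.1, 9.2, 9.3, np.nan, np.nan, np.nan, 8.0, 8.1, 8.2],
    "Corruption": [np.nan] * 3 + [np.nan] * 3 + [0.7, 0.7, 0.8],
})
value_cols = ["GDP", "Corruption"]

# True where a country has no observation of a variable in any year
fully_missing = toy[value_cols].isna().groupby(toy["Country"]).all()

# Countries affected per variable, and variables affected per country
per_variable = fully_missing.sum()
per_country = fully_missing.sum(axis=1)

print(per_variable)
print(per_country)
```

A variable that is fully missing for many countries is a candidate for dropping as a column, while a country that is fully missing on several variables is a candidate for dropping as a row.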
