# World Happiness Dataset - ETL Pipeline

This ETL (Extract, Transform, Load) pipeline processes the World Happiness data from multiple CSV files spanning the years 2015 to 2024. The goal is to prepare a clean, consistent dataset ready for analysis, modeling, and visualization.

---

## Pipeline Steps

### Step 1. Extract
- Load all ten CSV files (`2015.csv`, `2016.csv`, `2017.csv`, `2018.csv`, `2019.csv`, `2020.csv`, `2021.csv`, `2022.csv`, `2023.csv`, `2024.csv`) into separate dataframes.
- Each file contains similar but not identical columns, with some variations in column names and structure.



In [57]:
# Import necessary libraries
import pandas as pd  # pandas is used for data loading, manipulation, and analysis (like handling DataFrames)
import numpy as np   # numpy is used for numerical operations, handling arrays, and working with missing values

In [58]:
# Load each CSV file into a separate DataFrame
df_2015 = pd.read_csv('../data/raw/2015.csv')  # Load the 2015 World Happiness data
df_2016 = pd.read_csv('../data/raw/2016.csv')  # Load the 2016 World Happiness data
df_2017 = pd.read_csv('../data/raw/2017.csv')  # Load the 2017 World Happiness data
df_2018 = pd.read_csv('../data/raw/2018.csv')  # Load the 2018 World Happiness data
df_2019 = pd.read_csv('../data/raw/2019.csv')  # Load the 2019 World Happiness data
df_2020 = pd.read_csv('../data/raw/2020.csv')  # Load the 2020 World Happiness data
df_2021 = pd.read_csv('../data/raw/2021.csv')  # Load the 2021 World Happiness data
df_2022 = pd.read_csv('../data/raw/2022.csv')  # Load the 2022 World Happiness data
df_2023 = pd.read_csv('../data/raw/2023.csv')  # Load the 2023 World Happiness data
df_2024 = pd.read_csv('../data/raw/2024.csv')  # Load the 2024 World Happiness data
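
The ten `read_csv` calls above follow one pattern, so they could also be written as a loop. The following is a minimal sketch rather than part of the original notebook: `load_yearly_frames` and its `raw_dir` parameter are hypothetical names, and it assumes the files are named `<year>.csv` as in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_yearly_frames(raw_dir, years):
    """Read '<year>.csv' from raw_dir for each year; return {year: DataFrame}."""
    raw_dir = Path(raw_dir)
    frames = {}
    for year in years:
        path = raw_dir / f"{year}.csv"
        if not path.exists():
            # Fail early with the exact path instead of a less specific pandas error.
            raise FileNotFoundError(f"Missing raw file: {path}")
        frames[year] = pd.read_csv(path)
    return frames


# Usage (assumes the same folder layout as the cell above):
# frames = load_yearly_frames("../data/raw", range(2015, 2025))
# df_2015 = frames[2015]
```

Keeping the frames in a dictionary keyed by year also removes the need to repeat the list of ten variable names in later loops.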

In [59]:
# Quickly inspect the structure and content of each DataFrame
# This helps us understand the shape, column names, data types, and missing values.
for year, df in zip(range(2015, 2025), [df_2015, df_2016, df_2017, df_2018, df_2019, df_2020, df_2021, df_2022, df_2023, df_2024]):
    print(f"\nYear: {year}")         # Print the year to label each output block
    print(df.info())                 # Show column names, data types, and non-null counts
    print(df.head())                # Display the first 5 rows of the DataFrame


Year: 2015
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
mem


## World Happiness Report Column Names by Year

### 2015
- Country  
- Region  
- Happiness Rank  
- Happiness Score  
- Standard Error  
- Economy (GDP per Capita)  
- Family  
- Health (Life Expectancy)  
- Freedom  
- Trust (Government Corruption)  
- Generosity  
- Dystopia Residual

---

### 2016
- Country  
- Region  
- Happiness Rank  
- Happiness Score  
- Lower Confidence Interval  
- Upper Confidence Interval  
- Economy (GDP per Capita)  
- Family  
- Health (Life Expectancy)  
- Freedom  
- Trust (Government Corruption)  
- Generosity  
- Dystopia Residual

---

### 2017
- Country  
- Happiness.Rank  
- Happiness.Score  
- Whisker.high  
- Whisker.low  
- Economy..GDP.per.Capita.  
- Family  
- Health..Life.Expectancy.  
- Freedom  
- Generosity  
- Trust..Government.Corruption.  
- Dystopia.Residual

---

### 2018
- Overall rank  
- Country or region  
- Score  
- GDP per capita  
- Social support  
- Healthy life expectancy  
- Freedom to make life choices  
- Generosity  
- Perceptions of corruption

---

### 2019
- Overall rank  
- Country or region  
- Score  
- GDP per capita  
- Social support  
- Healthy life expectancy  
- Freedom to make life choices  
- Generosity  
- Perceptions of corruption

---

### 2020
- Country name  
- Happiness Rank  
- Happiness score  
- Upperwhisker  
- Lowerwhisker  
- Economy (GDP per Capita)  
- Social support  
- Healthy life expectancy  
- Freedom to make life choices  
- Generosity  
- Perceptions of corruption

---

### 2021
- Country name  
- Happiness Rank  
- Happiness score  
- Upperwhisker  
- Lowerwhisker  
- Economy (GDP per Capita)  
- Social support  
- Healthy life expectancy  
- Freedom to make life choices  
- Generosity  
- Perceptions of corruption

---

### 2022
- Country name  
- Happiness Rank  
- Happiness score  
- Upperwhisker  
- Lowerwhisker  
- Economy (GDP per Capita)  
- Social support  
- Healthy life expectancy  
- Freedom to make life choices  
- Generosity  
- Perceptions of corruption

---

### 2023
- Country name  
- Happiness Rank  
- Happiness score  
- Upperwhisker  
- Lowerwhisker  
- Economy (GDP per Capita)  
- Social support  
- Healthy life expectancy  
- Freedom to make life choices  
- Generosity  
- Perceptions of corruption

---

### 2024
- Country name  
- Happiness Rank  
- Happiness score  
- Upperwhisker  
- Lowerwhisker  
- Economy (GDP per Capita)  
- Social support  
- Healthy life expectancy  
- Freedom to make life choices  
- Generosity  
- Perceptions of corruption


# World Happiness Dataset - Column Meanings

| Column Name                                   | Meaning / Description                                                                                   |
|----------------------------------------------|-------------------------------------------------------------------------------------------------------|
| **Country / Country or region / Country name** | The name of the country (or region) being evaluated.                                                    |
| **Region**                                    | Broad geographic grouping or continent used for aggregations (e.g., Asia, Europe, Africa, Oceania).     |
| **Happiness Rank / Overall rank**             | The position of the country in the list ranked by happiness score (1 = happiest).                      |
| **Happiness Score / Score**                    | A numerical score representing the overall happiness of the country; higher means happier.            |
| **Standard Error**                             | The margin of error or uncertainty in the happiness score estimate.                       |
| **Lower Confidence Interval**                  | The lower bound of the confidence interval for the happiness score, indicating uncertainty.|
| **Upper Confidence Interval**                  | The upper bound of the confidence interval for the happiness score.                       |
| **Whisker.high / Whisker.low / Upperwhisker / Lowerwhisker** | Upper and lower bounds of the uncertainty range around the happiness score, similar to a confidence interval. |
| **Economy (GDP per Capita) / GDP per capita** | A measure of economic output per person, indicating the wealth level of the country.                   |
| **Family / Social support**                    | The perceived level of social support from family, friends, and community.                            |
| **Health (Life Expectancy) / Healthy life expectancy** | Average expected lifespan or a proxy for health quality in the country.                            |
| **Freedom / Freedom to make life choices**    | The perceived freedom individuals have to make their own life decisions.                             |
| **Trust (Government Corruption) / Perceptions of corruption** | Measure of how much corruption is perceived in government and public institutions.          |
| **Generosity**                                 | A measure of the generosity or charitable behavior in the country’s population.                      |
| **Dystopia Residual**                          | The gap between the predicted happiness score based on the model and actual scores — a residual value representing unexplained factors. |
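
Because the column names differ from year to year, it can help to see at a glance which name appears in which file. The sketch below is an illustration under assumptions, not code from the notebook: `column_presence` is a hypothetical helper, and the example input uses only a handful of the column names listed above.

```python
import pandas as pd


def column_presence(columns_by_year):
    """Return a table with one row per column name and one boolean column per year."""
    all_columns = sorted({name for cols in columns_by_year.values() for name in cols})
    return pd.DataFrame(
        {
            year: [name in cols for name in all_columns]
            for year, cols in columns_by_year.items()
        },
        index=all_columns,
    )


# Small example using a few of the column names listed above.
example = {
    2015: ["Country", "Happiness Score", "Family", "Standard Error"],
    2018: ["Country or region", "Score", "Social support"],
}
presence = column_presence(example)
print(presence)
```

With the real data, the input could be built from the loaded frames, for example `{2015: df_2015.columns, 2018: df_2018.columns}`.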

In [60]:
# Print the number of rows and columns in each DataFrame
for year, df in zip(range(2015, 2025), [df_2015, df_2016, df_2017, df_2018, df_2019, df_2020, df_2021, df_2022, df_2023, df_2024]):
    print(f"\nYear: {year}")
    print(f"Shape: {df.shape}")  # Returns a tuple (rows, columns)
    print(f"Number of Rows: {df.shape[0]}")
    print(f"Number of Columns: {df.shape[1]}")


Year: 2015
Shape: (158, 12)
Number of Rows: 158
Number of Columns: 12

Year: 2016
Shape: (157, 13)
Number of Rows: 157
Number of Columns: 13

Year: 2017
Shape: (155, 12)
Number of Rows: 155
Number of Columns: 12

Year: 2018
Shape: (156, 9)
Number of Rows: 156
Number of Columns: 9

Year: 2019
Shape: (156, 9)
Number of Rows: 156
Number of Columns: 9

Year: 2020
Shape: (153, 11)
Number of Rows: 153
Number of Columns: 11

Year: 2021
Shape: (149, 11)
Number of Rows: 149
Number of Columns: 11

Year: 2022
Shape: (146, 11)
Number of Rows: 146
Number of Columns: 11

Year: 2023
Shape: (137, 11)
Number of Rows: 137
Number of Columns: 11

Year: 2024
Shape: (143, 11)
Number of Rows: 143
Number of Columns: 11


### Step 2. Transform

As part of the data transformation process, we clean and prepare the data for analysis:

- Dropped columns that are not needed for analysis or modeling, most of them uncertainty measures:
  - `Region` (2015, 2016): not an uncertainty measure, but present only in these two files
  - `Standard Error` (2015)
  - `Lower Confidence Interval` and `Upper Confidence Interval` (2016)
  - `Whisker.high` and `Whisker.low` (2017)
  - `Upperwhisker` and `Lowerwhisker` (2020-2024)

In addition to removing uncertainty-related columns, we also dropped:

- `Dystopia Residual` (2015, 2016, 2017; named `Dystopia.Residual` in 2017): This is a synthetic value used in the original happiness score calculation. Since we aim to build our own models, this residual component is not useful and may bias our interpretation or predictions.

Dropping this ensures our models are trained only on raw, interpretable variables.
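
The drops listed above can also be expressed as a single list applied to every year. This is a minimal sketch under assumptions: `COLUMNS_TO_DROP` and `drop_unused_columns` are hypothetical names, and the sample frame holds made-up values. It relies on `errors="ignore"` so that names absent from a given year are skipped.

```python
import pandas as pd

# Columns to drop, taken from the list above. Names missing from a given
# year are skipped because of errors="ignore".
COLUMNS_TO_DROP = [
    "Region",
    "Standard Error",
    "Lower Confidence Interval",
    "Upper Confidence Interval",
    "Whisker.high",
    "Whisker.low",
    "Upperwhisker",
    "Lowerwhisker",
    "Dystopia Residual",
    "Dystopia.Residual",
]


def drop_unused_columns(df):
    """Return a copy of df without the uncertainty, Region, and residual columns."""
    return df.drop(columns=COLUMNS_TO_DROP, errors="ignore")


# Tiny stand-in for a 2015-style frame (made-up values, for illustration only).
sample = pd.DataFrame(
    {"Country": ["A"], "Region": ["X"], "Happiness Score": [5.0], "Standard Error": [0.1]}
)
print(drop_unused_columns(sample).columns.tolist())
# ['Country', 'Happiness Score']
```

Returning a new frame instead of using `inplace=True` leaves the raw data untouched, which makes it easier to re-run a cell without hitting a `KeyError` on columns that were already dropped.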

In [61]:
print("2015:", df_2015.columns.tolist())
print("2016:", df_2016.columns.tolist())
print("2017:", df_2017.columns.tolist())
print("2018:", df_2018.columns.tolist())
print("2019:", df_2019.columns.tolist())
print("2020:", df_2020.columns.tolist())
print("2021:", df_2021.columns.tolist())
print("2022:", df_2022.columns.tolist())
print("2023:", df_2023.columns.tolist())
print("2024:", df_2024.columns.tolist())

2015: ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Standard Error', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']
2016: ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Lower Confidence Interval', 'Upper Confidence Interval', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']
2017: ['Country', 'Happiness.Rank', 'Happiness.Score', 'Whisker.high', 'Whisker.low', 'Economy..GDP.per.Capita.', 'Family', 'Health..Life.Expectancy.', 'Freedom', 'Generosity', 'Trust..Government.Corruption.', 'Dystopia.Residual']
2018: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
2019: ['Overall rank', 'Country or region', 'Score', 'GDP per capita', 'Social sup

In [62]:
# 2015: Drop Standard Error, Dystopia Residual, and Region
df_2015.drop(columns=["Standard Error", "Dystopia Residual", "Region"], inplace=True)

# 2016: Drop Confidence Intervals, Dystopia Residual, and Region
df_2016.drop(columns=["Lower Confidence Interval", "Upper Confidence Interval", "Dystopia Residual", "Region"], inplace=True)

# 2017: Drop Whiskers and Dystopia Residual
df_2017.drop(columns=["Whisker.high", "Whisker.low", "Dystopia.Residual"], inplace=True)

# 2018 and 2019: no columns need to be dropped

# 2020-2024: Drop Upperwhisker and Lowerwhisker (same column names in all five files)
for df in [df_2020, df_2021, df_2022, df_2023, df_2024]:
    df.drop(columns=["Upperwhisker", "Lowerwhisker"], inplace=True)

In [23]:
# Print column names for 2015 dataset to verify dropped columns
print("2015 columns:", df_2015.columns.tolist())

# Print column names for 2016 dataset to verify dropped columns
print("2016 columns:", df_2016.columns.tolist())

# Print column names for 2017 dataset to verify dropped columns
print("2017 columns:", df_2017.columns.tolist())

# Print column names for 2020 dataset to verify dropped columns
print("2020 columns:", df_2020.columns.tolist())

# Print column names for 2021 dataset to verify dropped columns
print("2021 columns:", df_2021.columns.tolist())

# Print column names for 2022 dataset to verify dropped columns
print("2022 columns:", df_2022.columns.tolist())

# Print column names for 2023 dataset to verify dropped columns
print("2023 columns:", df_2023.columns.tolist())

# Print column names for 2024 dataset to verify dropped columns
print("2024 columns:", df_2024.columns.tolist())


2015 columns: ['Country', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity']
2016 columns: ['Country', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity']
2017 columns: ['Country', 'Happiness.Rank', 'Happiness.Score', 'Economy..GDP.per.Capita.', 'Family', 'Health..Life.Expectancy.', 'Freedom', 'Generosity', 'Trust..Government.Corruption.']
2020 columns: ['Country name', 'Happiness Rank', 'Happiness score', 'Economy (GDP per Capita)\t', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
2021 columns: ['Country name', 'Happiness Rank', 'Happiness score', 'Economy (GDP per Capita)\t', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption']
202

Rename columns across all datasets for consistency and easier merging

- Standardized column names like 'Happiness Rank', 'Happiness Score', 'Economy', etc.
- Ensures consistent structure across 2015–2024 datasets
- Improves readability and supports seamless merging and analysis
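
One way to keep the renaming consistent is to hold every source spelling in a single mapping and apply it to each year. The sketch below is an assumption-laden illustration, not the notebook's method: `RENAME_MAP` and `standardize_columns` are hypothetical names. It also strips whitespace from column names first, which covers the trailing tab in `'Economy (GDP per Capita)\t'` seen in the 2020-2024 files.

```python
import pandas as pd

# One mapping covering every source spelling seen in the column listings above.
# Target names follow the notebook's choices (e.g. 'Social support' -> 'Family').
RENAME_MAP = {
    "Country or region": "Country",
    "Country name": "Country",
    "Overall rank": "Happiness Rank",
    "Happiness.Rank": "Happiness Rank",
    "Score": "Happiness Score",
    "Happiness score": "Happiness Score",
    "Happiness.Score": "Happiness Score",
    "Economy (GDP per Capita)": "Economy",
    "Economy..GDP.per.Capita.": "Economy",
    "GDP per capita": "Economy",
    "Social support": "Family",
    "Health (Life Expectancy)": "Healthy life expectancy",
    "Health..Life.Expectancy.": "Healthy life expectancy",
    "Freedom": "Freedom to make life choices",
    "Trust (Government Corruption)": "Perceptions of corruption",
    "Trust..Government.Corruption.": "Perceptions of corruption",
}


def standardize_columns(df):
    """Strip stray whitespace from column names, then apply RENAME_MAP."""
    cleaned = df.rename(columns=lambda name: name.strip())
    return cleaned.rename(columns=RENAME_MAP)


# Empty frame with 2020-style column names, including the trailing tab.
sample = pd.DataFrame(
    columns=["Country name", "Happiness score", "Economy (GDP per Capita)\t", "Social support"]
)
print(standardize_columns(sample).columns.tolist())
# ['Country', 'Happiness Score', 'Economy', 'Family']
```

Names that already match the target spelling need no entry, because `rename` leaves unmapped columns unchanged.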


In [63]:
# Rename columns in 2015 dataset
df_2015.rename(columns={
    'Country': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'Economy',
    'Family': 'Family',
    'Health (Life Expectancy)': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust (Government Corruption)': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2016 dataset
df_2016.rename(columns={
    'Country': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'Economy',
    'Family': 'Family',
    'Health (Life Expectancy)': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust (Government Corruption)': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2017 dataset
df_2017.rename(columns={
    'Country': 'Country',
    'Happiness.Rank': 'Happiness Rank',
    'Happiness.Score': 'Happiness Score',
    'Economy..GDP.per.Capita.': 'Economy',
    'Family': 'Family',
    'Health..Life.Expectancy.': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust..Government.Corruption.': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2018 dataset
df_2018.rename(columns={
    'Country or region': 'Country',
    'Overall rank': 'Happiness Rank',
    'Score': 'Happiness Score',
    'GDP per capita': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2019 dataset
df_2019.rename(columns={
    'Country or region': 'Country',
    'Overall rank': 'Happiness Rank',
    'Score': 'Happiness Score',
    'GDP per capita': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2020 dataset
df_2020.rename(columns={
    'Country name': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness score': 'Happiness Score',
    'Economy (GDP per Capita)\t': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2021 dataset
df_2021.rename(columns={
    'Country name': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness score': 'Happiness Score',
    'Economy (GDP per Capita)\t': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2022 dataset
df_2022.rename(columns={
    'Country name': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness score': 'Happiness Score',
    'Economy (GDP per Capita)\t': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2023 dataset
df_2023.rename(columns={
    'Country name': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness score': 'Happiness Score',
    'Economy (GDP per Capita)\t': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2024 dataset
df_2024.rename(columns={
    'Country name': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness score': 'Happiness Score',
    'Economy (GDP per Capita)\t': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)



In [64]:
# Print renamed column names for each dataset
print("2015 Columns:\n", df_2015.columns)
print("\n2016 Columns:\n", df_2016.columns)
print("\n2017 Columns:\n", df_2017.columns)
print("\n2018 Columns:\n", df_2018.columns)
print("\n2019 Columns:\n", df_2019.columns)
print("\n2020 Columns:\n", df_2020.columns)
print("\n2021 Columns:\n", df_2021.columns)
print("\n2022 Columns:\n", df_2022.columns)
print("\n2023 Columns:\n", df_2023.columns)
print("\n2024 Columns:\n", df_2024.columns)


2015 Columns:
 Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')

2016 Columns:
 Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')

2017 Columns:
 Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

2018 Columns:
 Index(['Happiness Rank', 'Country', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

2019 Columns:
 Index(['Happiness Rank', 'Country', 'Happiness Score', 'Econo

# 🌍 Adding Continent Column to All Datasets (2015–2024)

## 📌 Objective
To enhance geographic analysis by adding a new `Continent` column to each of the World Happiness datasets (2015–2024), based on the `Country` column. The continents used are:
- Africa  
- Asia  
- Europe  
- North America  
- South America  
- Oceania  
- Antarctica *(not present in 2015-2024 datasets)*

This addition enables consistent regional comparisons and supports continent-level aggregation and visualization.
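
A `Continent` column of this kind can be derived with a lookup dictionary and `Series.map`. The following is only a sketch under assumptions: `CONTINENT_BY_COUNTRY` and `add_continent` are hypothetical names, the lookup is deliberately tiny, and `Atlantis` is a made-up entry used to show how unmapped countries surface.

```python
import pandas as pd

# Deliberately tiny, illustrative lookup; a real one needs an entry for
# every country in the datasets.
CONTINENT_BY_COUNTRY = {
    "Finland": "Europe",
    "Japan": "Asia",
    "Kenya": "Africa",
    "Canada": "North America",
    "Brazil": "South America",
    "Australia": "Oceania",
}


def add_continent(df, lookup=CONTINENT_BY_COUNTRY):
    """Return a copy of df with a 'Continent' column; unmapped countries become NaN."""
    out = df.copy()
    out["Continent"] = out["Country"].map(lookup)
    return out


sample = pd.DataFrame({"Country": ["Finland", "Kenya", "Atlantis"]})
with_continent = add_continent(sample)

# Countries without a lookup entry are easy to list and fix.
unmapped = with_continent.loc[with_continent["Continent"].isna(), "Country"].tolist()
print(unmapped)  # ['Atlantis']
```

Checking for unmapped names after each run guards against silent gaps when a country is spelled differently in one of the yearly files.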


In [65]:
# Check unique Countries in 2015
print("2015 - Unique Countries:", df_2015['Country'].nunique())
print(sorted(df_2015['Country'].unique()))

# Check unique Countries in 2016
print("\n2016 - Unique Countries:", df_2016['Country'].nunique())
print(sorted(df_2016['Country'].unique()))

# Check unique Countries in 2017
print("\n2017 - Unique Countries:", df_2017['Country'].nunique())
print(sorted(df_2017['Country'].unique()))

# Check unique Countries in 2018
print("\n2018 - Unique Countries:", df_2018['Country'].nunique())
print(sorted(df_2018['Country'].unique()))

# Check unique Countries in 2019
print("\n2019 - Unique Countries:", df_2019['Country'].nunique())
print(sorted(df_2019['Country'].unique()))

# Check unique Countries in 2020
print("\n2020 - Unique Countries:", df_2020['Country'].nunique())
print(sorted(df_2020['Country'].unique()))

# Check unique Countries in 2021
print("\n2021 - Unique Countries:", df_2021['Country'].nunique())
print(sorted(df_2021['Country'].unique()))

# Check unique Countries in 2022
print("\n2022 - Unique Countries:", df_2022['Country'].nunique())
print(sorted(df_2022['Country'].unique()))

# Check unique Countries in 2023
print("\n2023 - Unique Countries:", df_2023['Country'].nunique())
print(sorted(df_2023['Country'].unique()))

# Check unique Countries in 2024
print("\n2024 - Unique Countries:", df_2024['Country'].nunique())
print(sorted(df_2024['Country'].unique()))





2015 - Unique Countries: 158
['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Belarus', 'Belgium', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Costa Rica', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Estonia', 'Ethiopia', 'Finland', 'France', 'Gabon', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Guatemala', 'Guinea', 'Haiti', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Lithua

## 🌍 Check Common and Missing Countries in All Datasets (2015–2024)

The code below helps identify:

- ✅ Countries that appear in **all datasets** from 2015 to 2024
- ❌ Countries that are **missing in one or more years**

In [66]:
# Step 1: Create a list of your DataFrames
dfs = [df_2015, df_2016, df_2017, df_2018, df_2019,
       df_2020, df_2021, df_2022, df_2023, df_2024]

# Step 2: Create a list of sets of country names
country_sets = [set(df['Country'].str.strip()) for df in dfs]

# Step 3: Find common countries (in all years)
common_countries = set.intersection(*country_sets)

# Step 4: Find all countries (in any year)
all_countries = set.union(*country_sets)

# Step 5: Find countries not common (i.e., missing in one or more years)
not_common_countries = all_countries - common_countries

# Step 6: Print results
print(f"✅ Common countries in ALL years ({len(common_countries)}):")
for country in sorted(common_countries):
    print(country)

print(f"\n❌ Countries NOT in all years ({len(not_common_countries)}):")
for country in sorted(not_common_countries):
    print(country)


✅ Common countries in ALL years (116):
Afghanistan
Albania
Algeria
Argentina
Armenia
Australia
Austria
Bahrain
Bangladesh
Belgium
Benin
Bolivia
Bosnia and Herzegovina
Brazil
Bulgaria
Burkina Faso
Cambodia
Cameroon
Canada
Chile
China
Colombia
Costa Rica
Croatia
Cyprus
Denmark
Dominican Republic
Ecuador
Egypt
El Salvador
Estonia
Ethiopia
Finland
France
Gabon
Georgia
Germany
Ghana
Greece
Guinea
Honduras
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Ivory Coast
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kosovo
Kyrgyzstan
Latvia
Lebanon
Lithuania
Malawi
Malaysia
Mali
Malta
Mauritius
Mexico
Moldova
Mongolia
Montenegro
Morocco
Myanmar
Nepal
Netherlands
New Zealand
Nicaragua
Nigeria
Norway
Pakistan
Panama
Paraguay
Peru
Philippines
Poland
Portugal
Romania
Russia
Saudi Arabia
Senegal
Serbia
Sierra Leone
Singapore
Slovakia
Slovenia
South Africa
South Korea
Spain
Sri Lanka
Sweden
Switzerland
Tajikistan
Tanzania
Thailand
Togo
Tunisia
Uganda
Ukraine
United Arab Emirates
United Kingdom
Un

# 🌍 Country Name Consistency Analysis (2015–2024)

## 1. Introduction
During data consolidation across 10 datasets (2015–2024), many country names appeared with slight variations or inconsistencies. To ensure accurate comparisons, a **country name standardization** was performed by mapping alternative or inconsistent names to a single standard format.

---

## 2. Country Name Standardization

Common inconsistencies fixed include:

- Azerbaijan / Azerbaijan*
- Belarus / Belarus*
- Chad / Chad*
- Hong Kong variations (3 different names)
- Congo variations (Brazzaville vs Kinshasa)
- Macedonia / North Macedonia
- Turkey / Turkiye

A mapping dictionary was used to unify these variations.

---

## 3. Analysis Results

### ✅ Countries Present in **All Years** (116 countries)

Afghanistan, Albania, Algeria, Argentina, Armenia, Australia, Austria, Bahrain, Bangladesh, Belgium, Benin, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Burkina Faso, Cambodia, Cameroon, Canada, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Denmark, Dominican Republic, Ecuador, Egypt, El Salvador, Estonia, Ethiopia, Finland, France, Gabon, Georgia, Germany, Ghana, Greece, Guinea, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jordan, Kazakhstan, Kenya, Kosovo, Kyrgyzstan, Latvia, Lebanon, Lithuania, Malawi, Malaysia, Mali, Malta, Mauritius, Mexico, Moldova, Mongolia, Montenegro, Morocco, Myanmar, Nepal, Netherlands, New Zealand, Nicaragua, Nigeria, Norway, Pakistan, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Romania, Russia, Saudi Arabia, Senegal, Serbia, Sierra Leone, Singapore, Slovakia, Slovenia, South Africa, South Korea, Spain, Sri Lanka, Sweden, Switzerland, Tajikistan, Tanzania, Thailand, Togo, Tunisia, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Venezuela, Vietnam, Zambia, Zimbabwe

---

### ❌ Countries **Not Present in All Years** (82 countries)

Angola, Azerbaijan, Azerbaijan*, Belarus, Belarus*, Belize, Bhutan, Botswana, Botswana*, Burundi, Central African Republic, Chad, Chad*, Comoros, Comoros*, Congo, Congo (Brazzaville), Congo (Kinshasa), Czech Republic, Czechia, Djibouti, Eswatini, Eswatini, Kingdom of*, Gambia, Gambia*, Guatemala, Guatemala*, Haiti, Hong Kong, Hong Kong S.A.R. of China, Hong Kong S.A.R., China, Kuwait, Kuwait*, Laos, Lesotho, Lesotho*, Liberia, Liberia*, Libya, Libya*, Luxembourg, Luxembourg*, Macedonia, Madagascar, Madagascar*, Maldives, Mauritania, Mauritania*, Mozambique, Namibia, Niger, Niger*, North Cyprus, North Cyprus*, North Macedonia, Northern Cyprus, Oman, Palestinian Territories, Palestinian Territories*, Puerto Rico, Qatar, Rwanda, Rwanda*, Somalia, Somaliland Region, Somaliland region, South Sudan, State of Palestine, Sudan, Suriname, Swaziland, Syria, Taiwan, Taiwan Province of China, Trinidad & Tobago, Trinidad and Tobago, Turkey, Turkiye, Turkmenistan, Turkmenistan*, Yemen, Yemen*

---

## 4. Conclusion

Standardizing country names resolves many of these discrepancies and enables consistent cross-year comparison. On the raw names, **116 countries** appear in all datasets and form a stable core for longitudinal analysis. The **82 names** not found in every year include spelling variants of the same country (for example, `Azerbaijan` and `Azerbaijan*`) as well as genuine gaps that may reflect data reporting differences, geopolitical changes, or dataset-specific exclusions.

This standardization approach ensures cleaner, more reliable data analysis across multi-year datasets.
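
To see why a name is missing from the common set, it helps to list the years in which it is absent. The sketch below is illustrative only: `years_missing` is a hypothetical helper and the example sets are made up, not real dataset contents.

```python
def years_missing(countries_by_year):
    """Map each country that is absent from at least one year to the years it is absent from."""
    all_countries = set().union(*countries_by_year.values())
    missing = {}
    for country in sorted(all_countries):
        absent = [year for year, names in countries_by_year.items() if country not in names]
        if absent:
            missing[country] = absent
    return missing


# Made-up miniature example (not real dataset contents).
example = {
    2015: {"Finland", "Macedonia"},
    2016: {"Finland", "Macedonia"},
    2019: {"Finland", "North Macedonia"},
}
print(years_missing(example))
# {'Macedonia': [2019], 'North Macedonia': [2015, 2016]}
```

Two names whose absent years complement each other, as in this example, are a strong hint that they refer to the same country under different spellings.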


## 🌍 Standardize Country Names to Fix Inconsistencies (2015–2024)

Before comparing countries across years, we fix inconsistent naming using a dictionary.

In [67]:
# Dictionary to map inconsistent names to standard names
country_mapping = {
    # Remove asterisks
    'Azerbaijan*': 'Azerbaijan',
    'Belarus*': 'Belarus',
    'Botswana*': 'Botswana',
    'Chad*': 'Chad',
    'Comoros*': 'Comoros',
    'Eswatini, Kingdom of*': 'Eswatini',
    'Gambia*': 'Gambia',
    'Guatemala*': 'Guatemala',
    'Kuwait*': 'Kuwait',
    'Lesotho*': 'Lesotho',
    'Liberia*': 'Liberia',
    'Libya*': 'Libya',
    'Luxembourg*': 'Luxembourg',
    'Madagascar*': 'Madagascar',
    'Mauritania*': 'Mauritania',
    'Niger*': 'Niger',
    'North Cyprus*': 'North Cyprus',
    'Palestinian Territories*': 'Palestinian Territories',
    'Rwanda*': 'Rwanda',
    'Turkmenistan*': 'Turkmenistan',
    'Yemen*': 'Yemen',
    
    # Standardize country names
    'Czech Republic': 'Czechia',
    'Macedonia': 'North Macedonia',
    'Northern Cyprus': 'North Cyprus',
    'Swaziland': 'Eswatini',
    'Turkey': 'Turkiye',
    'Trinidad & Tobago': 'Trinidad and Tobago',
    'Taiwan Province of China': 'Taiwan',
    'State of Palestine': 'Palestinian Territories',
    'Somaliland region': 'Somaliland Region',
    
    # Hong Kong variations
    'Hong Kong S.A.R. of China': 'Hong Kong',
    'Hong Kong S.A.R., China': 'Hong Kong',
    
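    # Congo variations (choose one standard)
    # NOTE: Congo (Brazzaville) and Congo (Kinshasa) are two different countries;
    # mapping both to 'Congo' yields two 'Congo' rows in any year that lists both.
    'Congo (Brazzaville)': 'Congo',
    'Congo (Kinshasa)': 'Congo'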
}
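
Most entries in the first group of `country_mapping` only remove a trailing asterisk, so that part could be handled generically. This is a sketch under assumptions: `strip_trailing_asterisk` is a hypothetical helper, and names such as `Eswatini, Kingdom of*` would still need a dictionary entry afterwards.

```python
import pandas as pd


def strip_trailing_asterisk(names):
    """Remove surrounding whitespace and any trailing '*' from each name in a Series."""
    return names.str.strip().str.rstrip("*").str.strip()


sample = pd.Series(["Azerbaijan*", " Chad* ", "Eswatini, Kingdom of*", "Finland"])
print(strip_trailing_asterisk(sample).tolist())
# ['Azerbaijan', 'Chad', 'Eswatini, Kingdom of', 'Finland']
```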

In [68]:
# Apply standardization to the 'Country' column in all datasets
for df in [df_2015, df_2016, df_2017, df_2018, df_2019,
           df_2020, df_2021, df_2022, df_2023, df_2024]:
    df['Country'] = df['Country'].str.strip().replace(country_mapping)
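
Since `country_mapping` sends both `Congo (Brazzaville)` and `Congo (Kinshasa)` to `Congo`, a year that lists both will contain two `Congo` rows after the replacement. A quick duplicate check makes such collisions visible. This is a minimal sketch: `duplicated_countries` is a hypothetical helper and the sample frame is a made-up miniature.

```python
import pandas as pd


def duplicated_countries(df):
    """Return the sorted country names that occur more than once in df['Country']."""
    counts = df["Country"].value_counts()
    return sorted(counts[counts > 1].index)


# Miniature example mirroring the two Congo entries in country_mapping.
mapping = {"Congo (Brazzaville)": "Congo", "Congo (Kinshasa)": "Congo"}
sample = pd.DataFrame({"Country": ["Congo (Brazzaville)", "Congo (Kinshasa)", "Finland"]})
sample["Country"] = sample["Country"].replace(mapping)
print(duplicated_countries(sample))  # ['Congo']
```

Running the same check on each yearly frame after standardization would show whether any year needs the two Congo entries kept apart.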


In [30]:
# Create sets of country names for each year
country_sets = [set(df['Country']) for df in [
    df_2015, df_2016, df_2017, df_2018, df_2019,
    df_2020, df_2021, df_2022, df_2023, df_2024
]]

# Find common and not-common countries
common_countries = set.intersection(*country_sets)
all_countries = set.union(*country_sets)
not_common_countries = all_countries - common_countries

# Print results
print(f"✅ Common countries in ALL years ({len(common_countries)}):")
for c in sorted(common_countries):
    print(c)

print(f"\n❌ Countries NOT in all years ({len(not_common_countries)}):")
for c in sorted(not_common_countries):
    print(c)


✅ Common countries in ALL years (131):
Afghanistan
Albania
Algeria
Argentina
Armenia
Australia
Austria
Bahrain
Bangladesh
Belgium
Benin
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Bulgaria
Burkina Faso
Cambodia
Cameroon
Canada
Chad
Chile
China
Colombia
Congo
Costa Rica
Croatia
Cyprus
Czechia
Denmark
Dominican Republic
Ecuador
Egypt
El Salvador
Estonia
Ethiopia
Finland
France
Gabon
Georgia
Germany
Ghana
Greece
Guatemala
Guinea
Honduras
Hong Kong
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Ivory Coast
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kosovo
Kyrgyzstan
Latvia
Lebanon
Liberia
Lithuania
Luxembourg
Madagascar
Malawi
Malaysia
Mali
Malta
Mauritania
Mauritius
Mexico
Moldova
Mongolia
Montenegro
Morocco
Myanmar
Nepal
Netherlands
New Zealand
Nicaragua
Niger
Nigeria
North Macedonia
Norway
Pakistan
Palestinian Territories
Panama
Paraguay
Peru
Philippines
Poland
Portugal
Romania
Russia
Saudi Arabia
Senegal
Serbia
Sierra Leone
Singapore
Slovakia
Slovenia
South Africa
South K

# ✅ Country Standardization Results and Final Dataset Decision

## 🔍 Summary of Improvements

Thanks to the country name standardization:
- ✅ **Common countries across all years increased from 116 → 131**  
  (15 additional countries now included in full analysis)
- ❌ **Inconsistently named or missing countries reduced from 82 → 33**

---

## 🧩 Analysis of the 33 Remaining Countries

These countries are still missing from **at least one year**. Likely reasons, grouped by theme:

### 🌍 Recent Countries / Political Changes
- **South Sudan** – Independent since 2011  

Kosovo, Czechia (formerly Czech Republic), and North Macedonia (formerly Macedonia) do not belong in this group: after standardization, all three appear in the 131 common countries.

### ⚠️ Conflict / Instability
- **Syria**, **Somalia**, **Yemen**, **Libya**, **Sudan**  
- **Central African Republic**, **Burundi**

### 🌐 Small or Special-status Nations
- **Maldives**, **Bhutan**, **Eswatini**, **Lesotho**, **Djibouti**  
- **Puerto Rico** – US territory  
- **Somaliland Region** – Not internationally recognized

### 💰 Economic / Political Factors
- **North Cyprus** – Limited international recognition  
- **Kuwait**, **Qatar**, **Oman** – May be missing in earlier years
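
The groupings above are easier to check when you can see which years each country is missing from. The following is a minimal sketch on hypothetical toy frames; `toy_frames` and `missing_years` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the yearly DataFrames (hypothetical data)
toy_frames = {
    2015: pd.DataFrame({'Country': ['Finland', 'Qatar', 'Bhutan']}),
    2016: pd.DataFrame({'Country': ['Finland', 'Qatar']}),
    2017: pd.DataFrame({'Country': ['Finland', 'Bhutan']}),
}

all_years = sorted(toy_frames)
all_names = set().union(*(set(df['Country']) for df in toy_frames.values()))

# For each country, collect the years in which it does not appear
missing_years = {
    name: [yr for yr in all_years if name not in set(toy_frames[yr]['Country'])]
    for name in sorted(all_names)
}

# Keep only countries with at least one gap
missing_years = {name: yrs for name, yrs in missing_years.items() if yrs}

for name, yrs in missing_years.items():
    print(f"{name}: missing in {yrs}")
# Bhutan: missing in [2016]
# Qatar: missing in [2017]
```

Applied to the real DataFrames, the same idea would show whether a country is absent only in early years, only in recent years, or intermittently.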

---

## ✅ Final Decision: Use the 131 Common Countries

To keep the data consistent across all years (2015–2024), the analysis proceeds with the **131 countries** present in every dataset and drops the **33 countries** that are missing from at least one year.

### Benefits:
- Complete time-series data
- Broad global coverage (131 countries)
- Easier comparison across years

---


In [69]:
# The 131 countries common to all years, copied from the intersection printed above
common_countries = {
    "Afghanistan", "Albania", "Algeria", "Argentina", "Armenia", "Australia", "Austria", "Bahrain",
    "Bangladesh", "Belgium", "Benin", "Bolivia", "Bosnia and Herzegovina", "Botswana", "Brazil",
    "Bulgaria", "Burkina Faso", "Cambodia", "Cameroon", "Canada", "Chad", "Chile", "China", "Colombia",
    "Congo", "Costa Rica", "Croatia", "Cyprus", "Czechia", "Denmark", "Dominican Republic", "Ecuador",
    "Egypt", "El Salvador", "Estonia", "Ethiopia", "Finland", "France", "Gabon", "Georgia", "Germany",
    "Ghana", "Greece", "Guatemala", "Guinea", "Honduras", "Hong Kong", "Hungary", "Iceland", "India",
    "Indonesia", "Iran", "Iraq", "Ireland", "Israel", "Italy", "Ivory Coast", "Jamaica", "Japan",
    "Jordan", "Kazakhstan", "Kenya", "Kosovo", "Kyrgyzstan", "Latvia", "Lebanon", "Liberia", "Lithuania",
    "Luxembourg", "Madagascar", "Malawi", "Malaysia", "Mali", "Malta", "Mauritania", "Mauritius",
    "Mexico", "Moldova", "Mongolia", "Montenegro", "Morocco", "Myanmar", "Nepal", "Netherlands",
    "New Zealand", "Nicaragua", "Niger", "Nigeria", "North Macedonia", "Norway", "Pakistan",
    "Palestinian Territories", "Panama", "Paraguay", "Peru", "Philippines", "Poland", "Portugal",
    "Romania", "Russia", "Saudi Arabia", "Senegal", "Serbia", "Sierra Leone", "Singapore", "Slovakia",
    "Slovenia", "South Africa", "South Korea", "Spain", "Sri Lanka", "Sweden", "Switzerland", "Taiwan",
    "Tajikistan", "Tanzania", "Thailand", "Togo", "Tunisia", "Turkiye", "Uganda", "Ukraine",
    "United Arab Emirates", "United Kingdom", "United States", "Uruguay", "Uzbekistan", "Venezuela",
    "Vietnam", "Zambia", "Zimbabwe"
}

# List of all yearly DataFrames
dfs = [df_2015, df_2016, df_2017, df_2018, df_2019, df_2020, df_2021, df_2022, df_2023, df_2024]

# Keep only the 131 common countries; .copy() makes each result an independent
# DataFrame so later column assignments do not trigger SettingWithCopyWarning
dfs = [df[df['Country'].isin(common_countries)].copy() for df in dfs]

# Assign back to the original names
df_2015, df_2016, df_2017, df_2018, df_2019, df_2020, df_2021, df_2022, df_2023, df_2024 = dfs


### ✅ Verifying Cleaned Country Coverage (131 countries across all datasets)

We filtered each dataset (2015–2024) to retain only the 131 standardized country names that appear in all years.

The following code confirms that each dataset now includes exactly 131 unique countries and no unexpected entries:

In [70]:
# Verify that each DataFrame has exactly 131 unique countries
dfs = [df_2015, df_2016, df_2017, df_2018, df_2019,
       df_2020, df_2021, df_2022, df_2023, df_2024]
year_labels = range(2015, 2025)

for df, yr in zip(dfs, year_labels):
    n_unique = df['Country'].nunique()
    print(f"{yr} – unique countries: {n_unique}")
    
    # Sanity‑check: list any countries NOT in the 131‑country set (should be empty)
    bad = set(df['Country']) - common_countries
    if bad:
        print(f"⚠️  {yr}: unexpected countries found → {sorted(bad)}")


2015 – unique countries: 131
2016 – unique countries: 131
2017 – unique countries: 131
2018 – unique countries: 131
2019 – unique countries: 131
2020 – unique countries: 131
2021 – unique countries: 131
2022 – unique countries: 131
2023 – unique countries: 131
2024 – unique countries: 131
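
Note that `nunique()` counts distinct labels, so it cannot reveal a year that holds more than one row with the same label. That can happen here because the mapping sends both `Congo (Brazzaville)` and `Congo (Kinshasa)` to `Congo`. A minimal sketch of an extra check, on hypothetical toy data:

```python
import pandas as pd

# Toy frame where two source names were collapsed to one label (hypothetical data)
toy = pd.DataFrame({
    'Country': ['Congo', 'Congo', 'Finland'],
    'Happiness Score': [4.5, 4.3, 7.8],
})

n_unique = toy['Country'].nunique()  # distinct labels
n_rows = len(toy)                    # total rows

# Rows whose country label occurs more than once
dupes = toy[toy['Country'].duplicated(keep=False)]

print(f"unique: {n_unique}, rows: {n_rows}")
print(dupes)
```

If `n_rows` exceeds `n_unique` for any real year, the duplicated rows need a decision (keep separate names, or drop one) before the data is treated as one row per country per year.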


In [None]:
# Check the number of rows and columns in each DataFrame
for year, df in zip(range(2015, 2025), [df_2015, df_2016, df_2017, df_2018, df_2019, df_2020, df_2021, df_2022, df_2023, df_2024]):
    print(f"\nYear: {year}")
    print(f"Shape: {df.shape}")  # Returns a tuple (rows, columns)
    print(f"Number of Rows: {df.shape[0]}")
    print(f"Number of Columns: {df.shape[1]}")

In [71]:
# Recheck the column names of each dataset
print("2015 Columns:\n", df_2015.columns)
print("\n2016 Columns:\n", df_2016.columns)
print("\n2017 Columns:\n", df_2017.columns)
print("\n2018 Columns:\n", df_2018.columns)
print("\n2019 Columns:\n", df_2019.columns)
print("\n2020 Columns:\n", df_2020.columns)
print("\n2021 Columns:\n", df_2021.columns)
print("\n2022 Columns:\n", df_2022.columns)
print("\n2023 Columns:\n", df_2023.columns)
print("\n2024 Columns:\n", df_2024.columns)


2015 Columns:
 Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')

2016 Columns:
 Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')

2017 Columns:
 Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

2018 Columns:
 Index(['Happiness Rank', 'Country', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

2019 Columns:
 Index(['Happiness Rank', 'Country', 'Happiness Score', 'Econo
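
Reading ten printed indexes by eye is error-prone, so a programmatic comparison can help. The sketch below compares column *sets*, ignoring order, against a reference year; the toy frames and the `mismatches` dictionary are hypothetical.

```python
import pandas as pd

# Empty toy frames that only carry column names (hypothetical)
toy_frames = {
    2017: pd.DataFrame(columns=['Country', 'Happiness Rank', 'Happiness Score']),
    2018: pd.DataFrame(columns=['Happiness Rank', 'Country', 'Happiness Score']),
    2019: pd.DataFrame(columns=['Happiness Rank', 'Country', 'Score']),
}

reference = set(toy_frames[2017].columns)

# Record, per year, which reference columns are missing and which are extra
mismatches = {}
for year, df in toy_frames.items():
    cols = set(df.columns)
    if cols != reference:
        mismatches[year] = {
            'missing': sorted(reference - cols),
            'extra': sorted(cols - reference),
        }

print(mismatches)
# {2019: {'missing': ['Happiness Score'], 'extra': ['Score']}}
```

Column order differences (as between 2017 and 2018 in the toy data) are not flagged, since pandas aligns columns by name when frames are concatenated.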

## 🌍 Adding `Continent` Column Based on Country Name

To enhance our datasets (`df_2015` to `df_2024`), we add a `Continent` column by mapping each country to its respective continent using a predefined dictionary.

In [74]:
# Define the Country-to-Continent Mapping
country_to_continent = {
    # Africa
    "Algeria": "Africa",
    "Benin": "Africa",
    "Botswana": "Africa",
    "Burkina Faso": "Africa",
    "Cameroon": "Africa",
    "Chad": "Africa",
    "Congo": "Africa",
    "Egypt": "Africa",
    "Ethiopia": "Africa",
    "Gabon": "Africa",
    "Ghana": "Africa",
    "Guinea": "Africa",
    "Ivory Coast": "Africa",
    "Kenya": "Africa",
    "Liberia": "Africa",
    "Madagascar": "Africa",
    "Malawi": "Africa",
    "Mali": "Africa",
    "Mauritania": "Africa",
    "Mauritius": "Africa",
    "Morocco": "Africa",
    "Niger": "Africa",
    "Nigeria": "Africa",
    "Senegal": "Africa",
    "Sierra Leone": "Africa",
    "South Africa": "Africa",
    "Tanzania": "Africa",
    "Togo": "Africa",
    "Tunisia": "Africa",
    "Uganda": "Africa",
    "Zambia": "Africa",
    "Zimbabwe": "Africa",
    
    # Asia
    "Afghanistan": "Asia",
    "Armenia": "Asia",
    "Bahrain": "Asia",
    "Bangladesh": "Asia",
    "Cambodia": "Asia",
    "China": "Asia",
    "Hong Kong": "Asia",
    "India": "Asia",
    "Indonesia": "Asia",
    "Iran": "Asia",
    "Iraq": "Asia",
    "Israel": "Asia",
    "Japan": "Asia",
    "Jordan": "Asia",
    "Kazakhstan": "Asia",
    "Kosovo": "Europe",  # Geographically in Europe
    "Kyrgyzstan": "Asia",
    "Lebanon": "Asia",
    "Malaysia": "Asia",
    "Myanmar": "Asia",
    "Nepal": "Asia",
    "Pakistan": "Asia",
    "Palestinian Territories": "Asia",
    "Philippines": "Asia",
    "Saudi Arabia": "Asia",
    "Singapore": "Asia",
    "South Korea": "Asia",
    "Sri Lanka": "Asia",
    "Taiwan": "Asia",
    "Tajikistan": "Asia",
    "Thailand": "Asia",
    "United Arab Emirates": "Asia",
    "Uzbekistan": "Asia",
    "Vietnam": "Asia",
    
    # Europe
    "Albania": "Europe",
    "Austria": "Europe",
    "Belgium": "Europe",
    "Bosnia and Herzegovina": "Europe",
    "Bulgaria": "Europe",
    "Croatia": "Europe",
    "Cyprus": "Europe",
    "Czechia": "Europe",
    "Denmark": "Europe",
    "Estonia": "Europe",
    "Finland": "Europe",
    "France": "Europe",
    "Georgia": "Europe",  # Transcontinental, but commonly considered Europe
    "Germany": "Europe",
    "Greece": "Europe",
    "Hungary": "Europe",
    "Iceland": "Europe",
    "Ireland": "Europe",
    "Italy": "Europe",
    "Latvia": "Europe",
    "Lithuania": "Europe",
    "Luxembourg": "Europe",
    "Malta": "Europe",
    "Moldova": "Europe",
    "Montenegro": "Europe",
    "Netherlands": "Europe",
    "North Macedonia": "Europe",
    "Norway": "Europe",
    "Poland": "Europe",
    "Portugal": "Europe",
    "Romania": "Europe",
    "Russia": "Europe",  # Transcontinental, but capital in Europe
    "Serbia": "Europe",
    "Slovakia": "Europe",
    "Slovenia": "Europe",
    "Spain": "Europe",
    "Sweden": "Europe",
    "Switzerland": "Europe",
    "Turkiye": "Asia",  # Transcontinental, but majority in Asia
    "Ukraine": "Europe",
    "United Kingdom": "Europe",
    
    # North America
    "Canada": "North America",
    "Costa Rica": "North America",
    "Dominican Republic": "North America",
    "El Salvador": "North America",
    "Guatemala": "North America",
    "Honduras": "North America",
    "Jamaica": "North America",
    "Mexico": "North America",
    "Nicaragua": "North America",
    "Panama": "North America",
    "United States": "North America",
    
    # South America
    "Argentina": "South America",
    "Bolivia": "South America",
    "Brazil": "South America",
    "Chile": "South America",
    "Colombia": "South America",
    "Ecuador": "South America",
    "Paraguay": "South America",
    "Peru": "South America",
    "Uruguay": "South America",
    "Venezuela": "South America",
    
    # Oceania
    "Australia": "Oceania",
    "New Zealand": "Oceania",
    
    # Mongolia (Asia)
    "Mongolia": "Asia"
}




In [75]:
# Apply the Mapping to All DataFrames

# List of all yearly DataFrames
dataframes = [df_2015, df_2016, df_2017, df_2018, df_2019,
              df_2020, df_2021, df_2022, df_2023, df_2024]

# Add the 'Continent' column using the mapping
for df in dataframes:
    df["Continent"] = df["Country"].map(country_to_continent)


In [78]:
# Verify All Countries Were Mapped

for year, df in zip(range(2015, 2025), dataframes):
    missing = df[df["Continent"].isnull()]
    if missing.empty:
        print(f"All countries are mapped in {year}.")
    else:
        print(f"Missing countries in {year}:", missing["Country"].unique())



All countries are mapped in 2015.
All countries are mapped in 2016.
All countries are mapped in 2017.
All countries are mapped in 2018.
All countries are mapped in 2019.
All countries are mapped in 2020.
All countries are mapped in 2021.
All countries are mapped in 2022.
All countries are mapped in 2023.
All countries are mapped in 2024.
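
The preprocessing that follows works on a single combined frame, `df_all`, that includes a `Year` column. One way such a frame could be assembled is sketched below on hypothetical toy data; this is an assumption about the approach, not a reproduction of the notebook's own cell.

```python
import pandas as pd

# Toy stand-ins for two yearly DataFrames (hypothetical values)
toy_frames = {
    2015: pd.DataFrame({'Country': ['Finland', 'Chad'], 'Happiness Score': [7.4, 3.7]}),
    2016: pd.DataFrame({'Country': ['Finland', 'Chad'], 'Happiness Score': [7.5, 3.8]}),
}

# Tag each frame with its year, then stack them with a fresh 0..n-1 index
combined = pd.concat(
    [df.assign(Year=year) for year, df in toy_frames.items()],
    ignore_index=True,
)

print(combined.shape)  # (4, 3)
```

Using `ignore_index=True` gives the combined frame a unique row index, which keeps later column-wise concatenation with other frames aligned row by row.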


## Data Preprocessing for Machine Learning

In this section, we prepare the dataset for machine learning by performing the following steps:

1. **Separate Features by Type**  
   - Categorical features: `Country`, `Continent`  
   - Numeric features: `Year`, `Happiness Rank`, `Happiness Score`, `Economy`, `Family`, `Healthy life expectancy`, `Freedom to make life choices`, `Generosity`, `Perceptions of corruption`

2. **One-Hot Encoding**  
   Convert each categorical variable into binary indicator columns so that machine learning algorithms can process it.  
   We use `drop='first'` to drop one category per feature, so the encoded columns are not perfectly collinear (the dummy variable trap).

3. **Feature Scaling**  
   Scale numeric features using `StandardScaler` so that each has mean 0 and standard deviation 1.  
   This matters most for models that are sensitive to feature scale, such as regularized linear models, models trained by gradient descent, and distance-based methods.

4. **Combine Processed Features**  
   Concatenate the scaled numeric features and one-hot encoded categorical features into a single DataFrame, ready for model training.
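
To show what these steps compute, here is a pandas-only sketch on a hypothetical toy frame. It swaps in `pd.get_dummies` and a manual formula in place of `OneHotEncoder` and `StandardScaler`, purely for illustration; the notebook's own cells use the scikit-learn classes.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Continent': ['Europe', 'Africa', 'Asia', 'Europe'],
    'Economy': [1.0, 2.0, 3.0, 4.0],
})

# One-hot encode, dropping the first category in sorted order ('Africa')
encoded = pd.get_dummies(toy[['Continent']], drop_first=True, dtype=float)

# Standardize: subtract the mean and divide by the population standard
# deviation (ddof=0), which is the convention StandardScaler follows
numeric = toy[['Economy']]
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Combine scaled numeric and encoded categorical columns
features = pd.concat([scaled, encoded], axis=1)

print(features.columns.tolist())
# ['Economy', 'Continent_Asia', 'Continent_Europe']
```

A row with zeros in every `Continent_*` column represents the dropped category, which is how the information is preserved with one column fewer.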


In [72]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler


# Step 1: Separate features by type
categorical_cols = ['Country', 'Continent']  # Categorical features to encode
numeric_cols = ['Year', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family', 
                'Healthy life expectancy', 'Freedom to make life choices', 
                'Generosity', 'Perceptions of corruption']  # Numeric features to scale



In [79]:
# Step 2: One-hot encode categorical columns
ohe = OneHotEncoder(sparse_output=False, drop='first')

encoded_cats = ohe.fit_transform(df_all[categorical_cols])

# Get new column names for encoded features
encoded_cat_cols = ohe.get_feature_names_out(categorical_cols)

In [80]:
# Step 3: Scale numeric columns
scaler = StandardScaler()
scaled_nums = scaler.fit_transform(df_all[numeric_cols])

In [81]:
# Step 4: Create DataFrames from transformed data
df_encoded = pd.DataFrame(encoded_cats, columns=encoded_cat_cols)
df_scaled = pd.DataFrame(scaled_nums, columns=numeric_cols)

In [83]:
# Step 5: Concatenate encoded categorical and scaled numeric features
df_all_encoded = pd.concat([df_scaled, df_encoded], axis=1)

In [85]:
# df_all_encoded is now ready for ML models
print(df_all_encoded.head())

       Year  Happiness Rank  Happiness Score   Economy    Family  \
0 -1.407459       -1.720758         1.959628  1.180267  0.823221   
1 -1.407459       -1.698611         1.936552  0.948887  0.983300   
2 -1.407459       -1.676464         1.906377  1.005780  0.856834   
3 -1.407459       -1.654318         1.901939  1.333775  0.766865   
4 -1.407459       -1.632171         1.817625  1.007770  0.741542   

   Healthy life expectancy  Freedom to make life choices  Generosity  \
0                 1.325870                      1.665629    0.639746   
1                 1.351701                      1.424764    1.781084   
2                 1.056718                      1.559661    1.004676   
3                 1.099313                      1.692857    1.050487   
4                 1.181602                      1.452254    1.959500   

   Perceptions of corruption  Country_Albania  ...  Country_Venezuela  \
0                   2.776523              0.0  ...                0.0   
1           

### Step 3. Load
- Save the cleaned and transformed dataset into a new CSV file.
---

In [74]:
# Save the cleaned DataFrame to a CSV file without including the index column
df_all.to_csv('cleaned_dataset.csv', index=False)


In [86]:
# Save the encoded cleaned dataset to a CSV file without the index column
df_all_encoded.to_csv('encoded_cleaned_dataset.csv', index=False)
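
A quick round-trip read can confirm that a saved file matches what was written. The sketch below uses a temporary directory and a hypothetical file name and toy data, so it does not touch the real output files.

```python
import os
import tempfile

import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Country': ['Finland', 'Chad'],
    'Year': [2015, 2015],
    'Happiness Score': [7.4, 3.7],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_cleaned.csv')  # hypothetical file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns in the same order, same number of rows
same_columns = list(reloaded.columns) == list(toy.columns)
same_rows = len(reloaded) == len(toy)

print(same_columns, same_rows)
```

Because the file is written with `index=False`, the reloaded frame has no extra unnamed index column.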

## Summary of Two Cleaned Datasets

- **Cleaned Dataset 1:**  
  - Combined data from 2015 through 2024 after standardizing country names and adding a `Continent` column.  
  - Removed or corrected inconsistent country names and handled missing values.  
  - Saved as `cleaned_dataset.csv` for general analysis and visualization.

- **Cleaned Dataset 2 (ML-Ready):**  
  - Applied one-hot encoding to categorical variables (`Country`, `Continent`).  
  - Standardized all numeric features using `StandardScaler` (mean 0, standard deviation 1).  
  - Ensured the dataset is fully numeric and standardized for machine learning models.  
  - Saved as `encoded_cleaned_dataset.csv` to be used directly in ML workflows.

Both datasets maintain consistency and quality, tailored for their respective use cases: exploratory data analysis and machine learning.