# World Happiness Dataset - ETL Pipeline

This ETL (Extract, Transform, Load) pipeline processes the World Happiness data from multiple CSV files spanning the years 2015 to 2019. The goal is to prepare a clean, consistent dataset ready for analysis, modeling, and visualization.

---

## Pipeline Steps

### Step 1. Extract
- Load all five CSV files (`2015.csv`, `2016.csv`, `2017.csv`, `2018.csv`, `2019.csv`) into separate dataframes.
- Each file contains similar but not identical columns, with some variations in column names and structure.



In [1]:
# Import necessary libraries
import pandas as pd  # pandas is used for data loading, manipulation, and analysis (like handling DataFrames)
import numpy as np   # numpy is used for numerical operations, handling arrays, and working with missing values

In [2]:
# Load each CSV file into a separate DataFrame
df_2015 = pd.read_csv('../data/raw/2015.csv')  # Load the 2015 World Happiness data
df_2016 = pd.read_csv('../data/raw/2016.csv')  # Load the 2016 World Happiness data
df_2017 = pd.read_csv('../data/raw/2017.csv')  # Load the 2017 World Happiness data
df_2018 = pd.read_csv('../data/raw/2018.csv')  # Load the 2018 World Happiness data
df_2019 = pd.read_csv('../data/raw/2019.csv')  # Load the 2019 World Happiness data

In [3]:
# Quickly inspect the structure and content of each DataFrame
# This helps us understand the shape, column names, data types, and missing values.
for year, df in zip(range(2015, 2020), [df_2015, df_2016, df_2017, df_2018, df_2019]):
    print(f"\nYear: {year}")         # Print the year to label each output block
    print(df.info())                 # Show column names, data types, and non-null counts
    print(df.head())                # Display the first 5 rows of the DataFrame


Year: 2015
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
mem

# World Happiness Dataset - Column Names by Year

### 2015.csv
- Country 
- Region
- Happiness Rank
- Happiness Score
- Standard Error
- Economy (GDP per Capita)
- Family
- Health (Life Expectancy)
- Freedom
- Trust (Government Corruption)
- Generosity
- Dystopia Residual

---

### 2016.csv
- Country
- Region
- Happiness Rank
- Happiness Score
- Lower Confidence Interval
- Upper Confidence Interval
- Economy (GDP per Capita)
- Family
- Health (Life Expectancy)
- Freedom
- Trust (Government Corruption)
- Generosity
- Dystopia Residual

---

### 2017.csv
- Country
- Happiness.Rank
- Happiness.Score
- Whisker.high
- Whisker.low
- Economy..GDP.per.Capita.
- Family
- Health..Life.Expectancy.
- Freedom
- Generosity
- Trust..Government.Corruption.
- Dystopia.Residual

---

### 2018.csv
- Overall rank
- Country or region
- Score
- GDP per capita
- Social support
- Healthy life expectancy
- Freedom to make life choices
- Generosity
- Perceptions of corruption

---

### 2019.csv
- Overall rank
- Country or region
- Score
- GDP per capita
- Social support
- Healthy life expectancy
- Freedom to make life choices
- Generosity
- Perceptions of corruption
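
The differences between the yearly headers can also be checked programmatically. The sketch below is illustrative only: it hardcodes the column lists shown above so that it runs on its own, and the name `columns_by_year` is an assumption. In the notebook, `set(df.columns)` for each loaded DataFrame would take the place of the hardcoded lists.

```python
# Column names copied from the per-year lists above.
columns_by_year = {
    2015: ["Country", "Region", "Happiness Rank", "Happiness Score", "Standard Error",
           "Economy (GDP per Capita)", "Family", "Health (Life Expectancy)", "Freedom",
           "Trust (Government Corruption)", "Generosity", "Dystopia Residual"],
    2016: ["Country", "Region", "Happiness Rank", "Happiness Score",
           "Lower Confidence Interval", "Upper Confidence Interval",
           "Economy (GDP per Capita)", "Family", "Health (Life Expectancy)", "Freedom",
           "Trust (Government Corruption)", "Generosity", "Dystopia Residual"],
    2017: ["Country", "Happiness.Rank", "Happiness.Score", "Whisker.high", "Whisker.low",
           "Economy..GDP.per.Capita.", "Family", "Health..Life.Expectancy.", "Freedom",
           "Generosity", "Trust..Government.Corruption.", "Dystopia.Residual"],
    2018: ["Overall rank", "Country or region", "Score", "GDP per capita", "Social support",
           "Healthy life expectancy", "Freedom to make life choices", "Generosity",
           "Perceptions of corruption"],
    2019: ["Overall rank", "Country or region", "Score", "GDP per capita", "Social support",
           "Healthy life expectancy", "Freedom to make life choices", "Generosity",
           "Perceptions of corruption"],
}

# Columns whose exact spelling appears in every file.
shared = set.intersection(*(set(cols) for cols in columns_by_year.values()))
print("Shared by all years:", sorted(shared))

# Columns in each year that are not shared by all five files.
for year, cols in columns_by_year.items():
    print(year, "->", sorted(set(cols) - shared))
```

Only `Generosity` is spelled identically in all five files, which is why the renaming step later in the pipeline is needed before the years can be combined.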


# World Happiness Dataset - Column Meanings

| Column Name | Meaning / Description |
|---|---|
| **Country / Country or region** | The name of the country (or region) being evaluated. |
| **Region** | Broad geographic grouping or continent used for aggregations (e.g., Asia, Europe, Africa, Oceania). Present in 2015 and 2016 only. |
| **Happiness Rank / Overall rank** | The position of the country in the list ranked by happiness score (1 = happiest). |
| **Happiness Score / Score** | A numerical score representing the overall happiness of the country; higher means happier. |
| **Standard Error** | The margin of error or uncertainty in the happiness score estimate (2015 only). |
| **Lower Confidence Interval** | The lower bound of the confidence interval for the happiness score (2016 only). |
| **Upper Confidence Interval** | The upper bound of the confidence interval for the happiness score (2016 only). |
| **Whisker.high / Whisker.low** | Upper and lower bounds on the happiness score, similar to a confidence interval (2017 only). |
| **Economy (GDP per Capita) / GDP per capita** | A measure of economic output per person, indicating the wealth level of the country. |
| **Family / Social support** | The perceived level of social support from family, friends, and community. |
| **Health (Life Expectancy) / Healthy life expectancy** | Average expected lifespan or a proxy for health quality in the country. |
| **Freedom / Freedom to make life choices** | The perceived freedom individuals have to make their own life decisions. |
| **Trust (Government Corruption) / Perceptions of corruption** | A measure of how much corruption is perceived in government and public institutions. |
| **Generosity** | A measure of generosity or charitable behavior in the country’s population. |
| **Dystopia Residual** | The score of a hypothetical least-happy country ("Dystopia") plus the country's residual, i.e., the part of its happiness score that the six factors do not explain (2015 to 2017 only). |

In [3]:
# Print the number of rows and columns in each DataFrame
for year, df in zip(range(2015, 2020), [df_2015, df_2016, df_2017, df_2018, df_2019]):
    print(f"\nYear: {year}")
    print(f"Shape: {df.shape}")  # Returns a tuple (rows, columns)
    print(f"Number of Rows: {df.shape[0]}")
    print(f"Number of Columns: {df.shape[1]}")


Year: 2015
Shape: (158, 12)
Number of Rows: 158
Number of Columns: 12

Year: 2016
Shape: (157, 13)
Number of Rows: 157
Number of Columns: 13

Year: 2017
Shape: (155, 12)
Number of Rows: 155
Number of Columns: 12

Year: 2018
Shape: (156, 9)
Number of Rows: 156
Number of Columns: 9

Year: 2019
Shape: (156, 9)
Number of Rows: 156
Number of Columns: 9


### Step 2. Transform

As part of the transformation, we clean and prepare the data for analysis.

- Drop the uncertainty-related columns, which are not needed for analysis or modeling:
  - `Standard Error` (2015)
  - `Lower Confidence Interval` and `Upper Confidence Interval` (2016)
  - `Whisker.high` and `Whisker.low` (2017)
- Drop `Dystopia Residual` (2015 and 2016; spelled `Dystopia.Residual` in 2017). This is a synthetic value used in the original happiness score calculation. Since we aim to build our own models, this residual component is not useful and may bias our interpretation or predictions.

Dropping these columns ensures our models are trained only on raw, interpretable variables.
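
The same drops could also be expressed once and applied to every year. The following is a minimal sketch, not the notebook's own method: `COLUMNS_TO_DROP` and `drop_unneeded` are hypothetical names, and the demo frame holds made-up values purely to show the behavior. It relies on the `errors="ignore"` option of `DataFrame.drop`, which skips labels that a given frame does not contain.

```python
import pandas as pd

# Every uncertainty or residual column named in this step, across all years.
COLUMNS_TO_DROP = [
    "Standard Error",
    "Lower Confidence Interval", "Upper Confidence Interval",
    "Whisker.high", "Whisker.low",
    "Dystopia Residual", "Dystopia.Residual",
]

def drop_unneeded(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns listed above.

    errors="ignore" skips names that are absent from a given year's file.
    """
    return df.drop(columns=COLUMNS_TO_DROP, errors="ignore")

# Tiny stand-in frame (made-up values) shaped like a slice of the 2015 file.
demo = pd.DataFrame({
    "Country": ["A", "B"],
    "Happiness Score": [7.0, 6.5],
    "Standard Error": [0.03, 0.04],
    "Dystopia Residual": [2.5, 2.1],
})
cleaned = drop_unneeded(demo)
print(cleaned.columns.tolist())
```

Because the helper returns a new frame instead of using `inplace=True`, the original frame stays untouched, which makes it easier to re-run a cell without hitting a "column not found" error.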

In [4]:
# 2015: Drop Standard Error and Dystopia Residual
df_2015.drop(columns=["Standard Error", "Dystopia Residual"], inplace=True)

# 2016: Drop Confidence Intervals and Dystopia Residual
df_2016.drop(columns=["Lower Confidence Interval", "Upper Confidence Interval", "Dystopia Residual"], inplace=True)

# 2017: Drop Whiskers and Dystopia Residual
df_2017.drop(columns=["Whisker.high", "Whisker.low", "Dystopia.Residual"], inplace=True)

# 2018 and 2019: no columns need to be dropped

In [5]:
# Print column names for 2015 dataset to verify dropped columns
print("2015 columns:", df_2015.columns.tolist())

# Print column names for 2016 dataset to verify dropped columns
print("2016 columns:", df_2016.columns.tolist())

# Print column names for 2017 dataset to verify dropped columns
print("2017 columns:", df_2017.columns.tolist())

2015 columns: ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity']
2016 columns: ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity']
2017 columns: ['Country', 'Happiness.Rank', 'Happiness.Score', 'Economy..GDP.per.Capita.', 'Family', 'Health..Life.Expectancy.', 'Freedom', 'Generosity', 'Trust..Government.Corruption.']


Next, rename the columns in every dataset to one shared set of names so that the years can be merged:

- Standardize on names such as 'Happiness Rank', 'Happiness Score', and 'Economy'; 'Social support' (2018 and 2019) is mapped to 'Family'.
- This gives the 2015–2019 datasets a consistent structure.
- It also improves readability and supports seamless merging and analysis.
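
One possible way to keep the renaming in a single place is a combined lookup applied to every year. This is a sketch under assumptions: `RENAME_MAP` and `standardize_columns` are hypothetical names, and the demo frames are empty frames that only carry the post-drop 2017 and 2018 headers. `DataFrame.rename` leaves any column that is not in the mapping unchanged, so no identity entries are needed.

```python
import pandas as pd

# One lookup covering every raw spelling that needs to change.
RENAME_MAP = {
    # 2015 / 2016
    "Economy (GDP per Capita)": "Economy",
    "Health (Life Expectancy)": "Healthy life expectancy",
    "Freedom": "Freedom to make life choices",
    "Trust (Government Corruption)": "Perceptions of corruption",
    # 2017
    "Happiness.Rank": "Happiness Rank",
    "Happiness.Score": "Happiness Score",
    "Economy..GDP.per.Capita.": "Economy",
    "Health..Life.Expectancy.": "Healthy life expectancy",
    "Trust..Government.Corruption.": "Perceptions of corruption",
    # 2018 / 2019
    "Country or region": "Country",
    "Overall rank": "Happiness Rank",
    "Score": "Happiness Score",
    "GDP per capita": "Economy",
    "Social support": "Family",
}

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the shared column names."""
    return df.rename(columns=RENAME_MAP)

# Empty frames carrying only the headers of two differently named years.
demo_2017 = pd.DataFrame(columns=[
    "Country", "Happiness.Rank", "Happiness.Score", "Economy..GDP.per.Capita.",
    "Family", "Health..Life.Expectancy.", "Freedom", "Generosity",
    "Trust..Government.Corruption.",
])
demo_2018 = pd.DataFrame(columns=[
    "Overall rank", "Country or region", "Score", "GDP per capita", "Social support",
    "Healthy life expectancy", "Freedom to make life choices", "Generosity",
    "Perceptions of corruption",
])

renamed_2017 = standardize_columns(demo_2017)
renamed_2018 = standardize_columns(demo_2018)

# After renaming, both years expose the same set of column names.
print(set(renamed_2017.columns) == set(renamed_2018.columns))
```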


In [6]:
# Rename columns in 2015 dataset
df_2015.rename(columns={
    'Country': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'Economy',
    'Family': 'Family',
    'Health (Life Expectancy)': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust (Government Corruption)': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2016 dataset
df_2016.rename(columns={
    'Country': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'Economy',
    'Family': 'Family',
    'Health (Life Expectancy)': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust (Government Corruption)': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2017 dataset
df_2017.rename(columns={
    'Country': 'Country',
    'Happiness.Rank': 'Happiness Rank',
    'Happiness.Score': 'Happiness Score',
    'Economy..GDP.per.Capita.': 'Economy',
    'Family': 'Family',
    'Health..Life.Expectancy.': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust..Government.Corruption.': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2018 dataset
df_2018.rename(columns={
    'Country or region': 'Country',
    'Overall rank': 'Happiness Rank',
    'Score': 'Happiness Score',
    'GDP per capita': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2019 dataset
df_2019.rename(columns={
    'Country or region': 'Country',
    'Overall rank': 'Happiness Rank',
    'Score': 'Happiness Score',
    'GDP per capita': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)


In [7]:
# Print renamed column names for each dataset
print("2015 Columns:\n", df_2015.columns)
print("\n2016 Columns:\n", df_2016.columns)
print("\n2017 Columns:\n", df_2017.columns)
print("\n2018 Columns:\n", df_2018.columns)
print("\n2019 Columns:\n", df_2019.columns)


2015 Columns:
 Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy',
       'Family', 'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')

2016 Columns:
 Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy',
       'Family', 'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')

2017 Columns:
 Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

2018 Columns:
 Index(['Happiness Rank', 'Country', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

2019 Columns:
 Index(['Happiness Rank', 'Country', 'Happ

# 🌍 Standardizing Country and Region Columns (2015 & 2016)

## 📌 Objective
To ensure consistency in country names across the World Happiness datasets (2015–2019), and to replace or standardize the `Region` column to reflect the **7 continents of the world**:
- Africa
- Asia
- Europe
- North America
- South America
- Oceania
- Antarctica *(not present in these datasets)*

---


In [8]:
# Check unique Countries and Regions in 2015
print("2015 - Unique Countries:", df_2015['Country'].nunique())
print(sorted(df_2015['Country'].unique()))

print("\n2015 - Unique Regions:", df_2015['Region'].nunique())
print(sorted(df_2015['Region'].unique()))

# Check unique Countries and Regions in 2016
print("\n2016 - Unique Countries:", df_2016['Country'].nunique())
print(sorted(df_2016['Country'].unique()))

print("\n2016 - Unique Regions:", df_2016['Region'].nunique())
print(sorted(df_2016['Region'].unique()))


2015 - Unique Countries: 158
['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Belarus', 'Belgium', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Costa Rica', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Estonia', 'Ethiopia', 'Finland', 'France', 'Gabon', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Guatemala', 'Guinea', 'Haiti', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Lithua

In [9]:

# Use 'Country' columns
countries_2015 = set(df_2015['Country'].str.strip())
countries_2016 = set(df_2016['Country'].str.strip())

# Countries only in 2015
only_in_2015 = countries_2015 - countries_2016

# Countries only in 2016
only_in_2016 = countries_2016 - countries_2015

# Countries in both years
common_countries = countries_2015.intersection(countries_2016)

print(f"Countries only in 2015 ({len(only_in_2015)}):")
for c in sorted(only_in_2015):
    print(c)

print(f"\nCountries only in 2016 ({len(only_in_2016)}):")
for c in sorted(only_in_2016):
    print(c)

print(f"\nCommon countries ({len(common_countries)}):")
for c in sorted(common_countries):
    print(c)


Countries only in 2015 (7):
Central African Republic
Djibouti
Lesotho
Mozambique
Oman
Somaliland region
Swaziland

Countries only in 2016 (6):
Belize
Namibia
Puerto Rico
Somalia
Somaliland Region
South Sudan

Common countries (151):
Afghanistan
Albania
Algeria
Angola
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahrain
Bangladesh
Belarus
Belgium
Benin
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Chad
Chile
China
Colombia
Comoros
Congo (Brazzaville)
Congo (Kinshasa)
Costa Rica
Croatia
Cyprus
Czech Republic
Denmark
Dominican Republic
Ecuador
Egypt
El Salvador
Estonia
Ethiopia
Finland
France
Gabon
Georgia
Germany
Ghana
Greece
Guatemala
Guinea
Haiti
Honduras
Hong Kong
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Ivory Coast
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kosovo
Kuwait
Kyrgyzstan
Laos
Latvia
Lebanon
Liberia
Libya
Lithuania
Luxembourg
Macedonia
Madagascar
Malawi
Malaysia
Mali
Malta
Mauritania
Mauritius

# Country Name Standardization in World Happiness Data (2015 & 2016)

## Findings

- The 2015 dataset contains 7 countries not found in 2016:  
  *Central African Republic, Djibouti, Lesotho, Mozambique, Oman, Somaliland region, Swaziland*

- The 2016 dataset contains 6 countries not found in 2015:  
  *Belize, Namibia, Puerto Rico, Somalia, Somaliland Region, South Sudan*

- On review, **“Somaliland region” (2015)** and **“Somaliland Region” (2016)** are the same entity; the two names differ only in letter case.

- **“Swaziland” (2015)** was officially renamed **“Eswatini”** in 2018. It does not appear in the 2016 file under either name.

- The remaining differences reflect coverage: countries or territories that are included in one year's file but not the other.

- For consistent analysis, country names are standardized to fix naming inconsistencies and account for renamed countries.

- To improve clarity and enable better visualizations, the `Region` column is dropped from both the 2015 and 2016 datasets and replaced with a `Continent` column.
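
A casing mismatch like the Somaliland one can also be detected automatically rather than by eye. The sketch below is one possible approach, with `find_case_variants` as a hypothetical helper name; it groups names by their whitespace-stripped, case-folded form and reports any group with more than one spelling. The two input sets are the "only in 2015" and "only in 2016" lists printed above.

```python
from collections import defaultdict

def find_case_variants(*name_sets):
    """Group names that differ only by letter case or surrounding whitespace."""
    groups = defaultdict(set)
    for names in name_sets:
        for name in names:
            cleaned = name.strip()
            groups[cleaned.casefold()].add(cleaned)
    # Keep only keys that collected more than one distinct spelling.
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}

only_in_2015 = {"Central African Republic", "Djibouti", "Lesotho", "Mozambique",
                "Oman", "Somaliland region", "Swaziland"}
only_in_2016 = {"Belize", "Namibia", "Puerto Rico", "Somalia",
                "Somaliland Region", "South Sudan"}

variants = find_case_variants(only_in_2015, only_in_2016)
print(variants)
```

This check only catches case and whitespace differences. Renames such as Swaziland to Eswatini still need an explicit mapping.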



---



In [10]:
# Mapping of inconsistent or renamed countries to standard names
standardize_map = {
    'Somaliland region': 'Somaliland Region',  # unify casing/spelling
    'Swaziland': 'Eswatini',                   # rename country to current official name
    # Add more mappings here as needed
}
# Function to standardize country names by:
# 1. Removing leading and trailing whitespace from the input name.
# 2. Replacing the name with a standardized version if it exists in the mapping dictionary.
# 3. Returning the original name unchanged if no mapping is found.
def standardize_country(name):
    name = name.strip()  # Remove extra spaces
    return standardize_map.get(name, name)  # Map if in dictionary else keep original

# Apply standardization
df_2015['Country'] = df_2015['Country'].apply(standardize_country)
df_2016['Country'] = df_2016['Country'].apply(standardize_country)

# Optional: re-check differences after standardizing
countries_2015 = set(df_2015['Country'])
countries_2016 = set(df_2016['Country'])

only_in_2015 = countries_2015 - countries_2016
only_in_2016 = countries_2016 - countries_2015
common_countries = countries_2015.intersection(countries_2016)

print(f"Countries only in 2015 after standardizing ({len(only_in_2015)}):")
for c in sorted(only_in_2015):
    print(c)

print(f"\nCountries only in 2016 after standardizing ({len(only_in_2016)}):")
for c in sorted(only_in_2016):
    print(c)

print(f"\nCommon countries after standardizing ({len(common_countries)}):")
for c in sorted(common_countries):
    print(c)

Countries only in 2015 after standardizing (6):
Central African Republic
Djibouti
Eswatini
Lesotho
Mozambique
Oman

Countries only in 2016 after standardizing (5):
Belize
Namibia
Puerto Rico
Somalia
South Sudan

Common countries after standardizing (152):
Afghanistan
Albania
Algeria
Angola
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahrain
Bangladesh
Belarus
Belgium
Benin
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Chad
Chile
China
Colombia
Comoros
Congo (Brazzaville)
Congo (Kinshasa)
Costa Rica
Croatia
Cyprus
Czech Republic
Denmark
Dominican Republic
Ecuador
Egypt
El Salvador
Estonia
Ethiopia
Finland
France
Gabon
Georgia
Germany
Ghana
Greece
Guatemala
Guinea
Haiti
Honduras
Hong Kong
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Ivory Coast
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kosovo
Kuwait
Kyrgyzstan
Laos
Latvia
Lebanon
Liberia
Libya
Lithuania
Luxembourg
Macedonia
Madagascar
Malawi
Malaysia
Mali
Mal

In [12]:
# Drop 'Region' column
df_2015 = df_2015.drop(columns=['Region'])
df_2016 = df_2016.drop(columns=['Region'])

In [14]:
print(df_2015.columns)
print(df_2016.columns)


Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')
Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')


In [15]:
# Manual mapping of countries to continents
country_to_continent = {
    # Africa
    "Algeria": "Africa", "Angola": "Africa", "Benin": "Africa", "Botswana": "Africa", "Burkina Faso": "Africa",
    "Burundi": "Africa", "Cameroon": "Africa", "Chad": "Africa", "Comoros": "Africa", "Congo (Brazzaville)": "Africa",
    "Congo (Kinshasa)": "Africa", "Egypt": "Africa", "Ethiopia": "Africa", "Gabon": "Africa", "Ghana": "Africa",
    "Guinea": "Africa", "Ivory Coast": "Africa", "Kenya": "Africa", "Liberia": "Africa", "Libya": "Africa",
    "Madagascar": "Africa", "Malawi": "Africa", "Mali": "Africa", "Mauritania": "Africa", "Mauritius": "Africa",
    "Morocco": "Africa", "Niger": "Africa", "Nigeria": "Africa", "Rwanda": "Africa", "Senegal": "Africa",
    "Sierra Leone": "Africa", "Somaliland Region": "Africa", "South Africa": "Africa", "Sudan": "Africa",
    "Tanzania": "Africa", "Togo": "Africa", "Tunisia": "Africa", "Uganda": "Africa", "Zambia": "Africa", "Zimbabwe": "Africa",

    # Asia
    "Afghanistan": "Asia", "Armenia": "Asia", "Azerbaijan": "Asia", "Bahrain": "Asia", "Bangladesh": "Asia",
    "Bhutan": "Asia", "Cambodia": "Asia", "China": "Asia", "Georgia": "Asia", "Hong Kong": "Asia",
    "India": "Asia", "Indonesia": "Asia", "Iran": "Asia", "Iraq": "Asia", "Israel": "Asia", "Japan": "Asia",
    "Jordan": "Asia", "Kazakhstan": "Asia", "Kuwait": "Asia", "Kyrgyzstan": "Asia",
    "Laos": "Asia", "Lebanon": "Asia", "Malaysia": "Asia", "Myanmar": "Asia", "Nepal": "Asia", "Pakistan": "Asia",
    "Palestinian Territories": "Asia", "Philippines": "Asia", "Qatar": "Asia", "Russia": "Asia", "Saudi Arabia": "Asia",
    "Singapore": "Asia", "South Korea": "Asia", "Sri Lanka": "Asia", "Syria": "Asia", "Taiwan": "Asia",
    "Tajikistan": "Asia", "Thailand": "Asia", "Turkey": "Asia", "Turkmenistan": "Asia", "United Arab Emirates": "Asia",
    "Uzbekistan": "Asia", "Vietnam": "Asia", "Yemen": "Asia",

    # Europe (Kosovo lies in the Balkans, so it belongs here rather than under Asia)
    "Albania": "Europe", "Austria": "Europe", "Belarus": "Europe", "Belgium": "Europe", "Bosnia and Herzegovina": "Europe",
    "Bulgaria": "Europe", "Croatia": "Europe", "Cyprus": "Europe", "Czech Republic": "Europe", "Denmark": "Europe",
    "Estonia": "Europe", "Finland": "Europe", "France": "Europe", "Germany": "Europe", "Greece": "Europe",
    "Hungary": "Europe", "Iceland": "Europe", "Ireland": "Europe", "Italy": "Europe", "Kosovo": "Europe", "Latvia": "Europe",
    "Lithuania": "Europe", "Luxembourg": "Europe", "Macedonia": "Europe", "Malta": "Europe", "Moldova": "Europe",
    "Montenegro": "Europe", "Netherlands": "Europe", "North Cyprus": "Europe", "Norway": "Europe", "Poland": "Europe",
    "Portugal": "Europe", "Romania": "Europe", "Serbia": "Europe", "Slovakia": "Europe", "Slovenia": "Europe",
    "Spain": "Europe", "Sweden": "Europe", "Switzerland": "Europe", "Ukraine": "Europe", "United Kingdom": "Europe",

    # North America
    "Canada": "North America", "Costa Rica": "North America", "Dominican Republic": "North America",
    "El Salvador": "North America", "Guatemala": "North America", "Haiti": "North America", "Honduras": "North America",
    "Jamaica": "North America", "Mexico": "North America", "Nicaragua": "North America", "Panama": "North America",
    "Trinidad and Tobago": "North America", "United States": "North America",

    # South America
    "Argentina": "South America", "Bolivia": "South America", "Brazil": "South America", "Chile": "South America",
    "Colombia": "South America", "Ecuador": "South America", "Paraguay": "South America", "Peru": "South America",
    "Suriname": "South America", "Uruguay": "South America", "Venezuela": "South America",

    # Oceania
    "Australia": "Oceania", "New Zealand": "Oceania"
}


In [16]:
# Apply mapping to both 2015 and 2016 dataframes
df_2015["Continent"] = df_2015["Country"].map(country_to_continent)
df_2016["Continent"] = df_2016["Country"].map(country_to_continent)


In [17]:
# Check which countries could not be mapped
missing_2015 = df_2015[df_2015["Continent"].isnull()]["Country"].unique()
missing_2016 = df_2016[df_2016["Continent"].isnull()]["Country"].unique()

print("Unmapped countries in 2015:", missing_2015)
print("Unmapped countries in 2016:", missing_2016)


Unmapped countries in 2015: ['Oman' 'Mozambique' 'Lesotho' 'Mongolia' 'Eswatini' 'Djibouti'
 'Central African Republic']
Unmapped countries in 2016: ['Puerto Rico' 'Belize' 'Somalia' 'Mongolia' 'Namibia' 'South Sudan']


In [18]:
# Update for previously unmapped countries
country_to_continent.update({
    "Central African Republic": "Africa",
    "Djibouti": "Africa",
    "Eswatini": "Africa",
    "Lesotho": "Africa",
    "Mozambique": "Africa",
    "Oman": "Asia",
    "Mongolia": "Asia",
    "Puerto Rico": "North America",
    "Belize": "North America",
    "Somalia": "Africa",
    "Namibia": "Africa",
    "South Sudan": "Africa",
})


In [19]:
df_2015["Continent"] = df_2015["Country"].map(country_to_continent)
df_2016["Continent"] = df_2016["Country"].map(country_to_continent)

# Check unmapped countries again
print("Unmapped countries in 2015:", df_2015[df_2015["Continent"].isnull()]["Country"].unique())
print("Unmapped countries in 2016:", df_2016[df_2016["Continent"].isnull()]["Country"].unique())


Unmapped countries in 2015: []
Unmapped countries in 2016: []
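
The map-then-check pattern used above could be wrapped in a small helper that fails loudly when a country has no continent, so an incomplete mapping cannot slip through unnoticed. This is an illustrative sketch: `add_continent` is a hypothetical name, and `demo_map` is a two-entry stand-in for the full `country_to_continent` dictionary.

```python
import pandas as pd

def add_continent(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Return a copy of df with a 'Continent' column; raise if any country is unmapped."""
    out = df.copy()
    out["Continent"] = out["Country"].map(mapping)
    unmapped = sorted(out.loc[out["Continent"].isna(), "Country"].unique())
    if unmapped:
        raise ValueError(f"No continent for: {unmapped}")
    return out

# Two-entry stand-in for the full country_to_continent dictionary.
demo_map = {"Switzerland": "Europe", "Canada": "North America"}

demo = pd.DataFrame({"Country": ["Switzerland", "Canada"]})
print(add_continent(demo, demo_map)["Continent"].tolist())

# A country missing from the mapping triggers the error instead of a silent NaN.
try:
    add_continent(pd.DataFrame({"Country": ["Mongolia"]}), demo_map)
except ValueError as exc:
    print(exc)
```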


In [20]:
df_2015.head()

| | Country | Happiness Rank | Happiness Score | Economy | Family | Healthy life expectancy | Freedom to make life choices | Perceptions of corruption | Generosity | Continent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | Europe |
| 1 | Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.4363 | Europe |
| 2 | Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | Europe |
| 3 | Norway | 4 | 7.522 | 1.459 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | Europe |
| 4 | Canada | 5 | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | North America |


In [21]:
df_2016.head()

| | Country | Happiness Rank | Happiness Score | Economy | Family | Healthy life expectancy | Freedom to make life choices | Perceptions of corruption | Generosity | Continent |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Denmark | 1 | 7.526 | 1.44178 | 1.16374 | 0.79504 | 0.57941 | 0.44453 | 0.36171 | Europe |
| 1 | Switzerland | 2 | 7.509 | 1.52733 | 1.14524 | 0.86303 | 0.58557 | 0.41203 | 0.28083 | Europe |
| 2 | Iceland | 3 | 7.501 | 1.42666 | 1.18326 | 0.86733 | 0.56624 | 0.14975 | 0.47678 | Europe |
| 3 | Norway | 4 | 7.498 | 1.57744 | 1.1269 | 0.79579 | 0.59609 | 0.35776 | 0.37895 | Europe |
| 4 | Finland | 5 | 7.413 | 1.40598 | 1.13464 | 0.81091 | 0.57104 | 0.41004 | 0.25492 | Europe |


In [22]:
# Quickly inspect the structure and content of each DataFrame
# This helps us understand the shape, column names, data types, and missing values.
for year, df in zip(range(2015, 2020), [df_2015, df_2016, df_2017, df_2018, df_2019]):
    print(f"\nYear: {year}")         # Print the year to label each output block
    print(df.info())                 # Show column names, data types, and non-null counts
    print(df.head())                # Display the first 5 rows of the DataFrame


Year: 2015
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 10 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country                       158 non-null    object 
 1   Happiness Rank                158 non-null    int64  
 2   Happiness Score               158 non-null    float64
 3   Economy                       158 non-null    float64
 4   Family                        158 non-null    float64
 5   Healthy life expectancy       158 non-null    float64
 6   Freedom to make life choices  158 non-null    float64
 7   Perceptions of corruption     158 non-null    float64
 8   Generosity                    158 non-null    float64
 9   Continent                     158 non-null    object 
dtypes: float64(7), int64(1), object(2)
memory usage: 12.5+ KB
None
       Country  Happiness Rank  Happiness Score  Economy   Family  \
0  Switzerland               1        

### Step 3. Load
- Save the cleaned and transformed dataset into a new CSV file.
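
The notebook does not show the save step, so the following is only a sketch of how it might look, under several assumptions: the yearly frames are stacked into one table, a `Year` column is added to tell the rows apart, and the output file name is hypothetical. `combine_and_save` is an illustrative helper, not part of the original pipeline. The demo uses two one-row frames (the top-ranked rows shown earlier) and writes to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

def combine_and_save(frames_by_year: dict, out_path) -> pd.DataFrame:
    """Tag each yearly frame with its year, stack them, and write one CSV."""
    tagged = [df.assign(Year=year) for year, df in frames_by_year.items()]
    combined = pd.concat(tagged, ignore_index=True)
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    combined.to_csv(out_path, index=False)              # index=False avoids an extra unnamed column
    return combined

# One-row stand-ins for the cleaned 2015 and 2016 frames.
demo_frames = {
    2015: pd.DataFrame({"Country": ["Switzerland"], "Happiness Score": [7.587]}),
    2016: pd.DataFrame({"Country": ["Denmark"], "Happiness Score": [7.526]}),
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "processed" / "happiness_2015_2019.csv"  # hypothetical file name
    combined = combine_and_save(demo_frames, target)
    reloaded = pd.read_csv(target)

print(combined)
```

Note that `pd.concat` fills columns missing from a given year with `NaN`. At this point in the notebook only the 2015 and 2016 frames have a `Continent` column, so the 2017 to 2019 frames would need the same mapping applied before saving.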
---