# World Happiness Dataset - ETL Pipeline

This ETL (Extract, Transform, Load) pipeline processes the World Happiness data from five CSV files, one per year from 2015 to 2019. The goal is a clean, consistent dataset ready for analysis, modeling, and visualization.

---

## Pipeline Steps

### Step 1. Extract
- Load all five CSV files (`2015.csv`, `2016.csv`, `2017.csv`, `2018.csv`, `2019.csv`) into separate dataframes.
- Each file contains similar but not identical columns, with some variations in column names and structure.



In [3]:
# Import necessary libraries
import pandas as pd  # pandas is used for data loading, manipulation, and analysis (like handling DataFrames)
import numpy as np   # numpy is used for numerical operations, handling arrays, and working with missing values

In [5]:
# Load each CSV file into a separate DataFrame
df_2015 = pd.read_csv('../data/raw/2015.csv')  # Load the 2015 World Happiness data
df_2016 = pd.read_csv('../data/raw/2016.csv')  # Load the 2016 World Happiness data
df_2017 = pd.read_csv('../data/raw/2017.csv')  # Load the 2017 World Happiness data
df_2018 = pd.read_csv('../data/raw/2018.csv')  # Load the 2018 World Happiness data
df_2019 = pd.read_csv('../data/raw/2019.csv')  # Load the 2019 World Happiness data

In [6]:
# Quickly inspect the structure and content of each DataFrame
# This helps us understand the shape, column names, data types, and missing values.
for year, df in zip(range(2015, 2020), [df_2015, df_2016, df_2017, df_2018, df_2019]):
    print(f"\nYear: {year}")         # Print the year to label each output block
    df.info()                        # Prints column names, dtypes, and non-null counts itself (returns None, so no print() wrapper)
    print(df.head())                 # Display the first 5 rows of the DataFrame


Year: 2015
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
mem

# World Happiness Dataset - Column Names by Year

### 2015.csv
- Country 
- Region
- Happiness Rank
- Happiness Score
- Standard Error
- Economy (GDP per Capita)
- Family
- Health (Life Expectancy)
- Freedom
- Trust (Government Corruption)
- Generosity
- Dystopia Residual

---

### 2016.csv
- Country
- Region
- Happiness Rank
- Happiness Score
- Lower Confidence Interval
- Upper Confidence Interval
- Economy (GDP per Capita)
- Family
- Health (Life Expectancy)
- Freedom
- Trust (Government Corruption)
- Generosity
- Dystopia Residual

---

### 2017.csv
- Country
- Happiness.Rank
- Happiness.Score
- Whisker.high
- Whisker.low
- Economy..GDP.per.Capita.
- Family
- Health..Life.Expectancy.
- Freedom
- Generosity
- Trust..Government.Corruption.
- Dystopia.Residual

---

### 2018.csv
- Overall rank
- Country or region
- Score
- GDP per capita
- Social support
- Healthy life expectancy
- Freedom to make life choices
- Generosity
- Perceptions of corruption

---

### 2019.csv
- Overall rank
- Country or region
- Score
- GDP per capita
- Social support
- Healthy life expectancy
- Freedom to make life choices
- Generosity
- Perceptions of corruption
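
The differences between the per-year column lists can also be found programmatically rather than by eye. The following is a minimal sketch, assuming the five DataFrames are collected in a dict keyed by year; the helper name `compare_columns` and the tiny stand-in frames are illustrative and not part of the original notebook.

```python
import pandas as pd

# Stand-in frames holding a few of the real headers. In the notebook these
# would be df_2015 ... df_2019 loaded from the raw CSV files.
frames = {
    2015: pd.DataFrame(columns=["Country", "Region", "Happiness Score", "Family", "Generosity"]),
    2017: pd.DataFrame(columns=["Country", "Happiness.Score", "Family", "Generosity"]),
    2018: pd.DataFrame(columns=["Country or region", "Score", "Social support", "Generosity"]),
}

def compare_columns(frames):
    """Return the headers shared by every frame and, per year, the headers that are not."""
    column_sets = {year: set(df.columns) for year, df in frames.items()}
    shared = set.intersection(*column_sets.values())
    not_shared = {year: sorted(cols - shared) for year, cols in column_sets.items()}
    return shared, not_shared

shared, not_shared = compare_columns(frames)
print("Shared by all years:", sorted(shared))
for year, cols in not_shared.items():
    print(year, "missing elsewhere or spelled differently:", cols)
```

Headers that show up in the per-year lists are the candidates for renaming or dropping in the Transform step.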


# World Happiness Dataset - Column Meanings

| Column Name                                   | Meaning / Description                                                                                   |
|----------------------------------------------|-------------------------------------------------------------------------------------------------------|
| **Country**               | The name of the country being evaluated.                                                    |
| **Region**                                    | Broad geographic grouping or continent used for aggregations (e.g., Asia, Europe, Africa, Oceania).     |
| **Happiness Rank / Overall rank**             | The position of the country in the list ranked by happiness score (1 = happiest).                      |
| **Happiness Score / Score**                    | A numerical score representing the overall happiness of the country; higher means happier.            |
| **Standard Error**                             | The margin of error or uncertainty in the happiness score estimate (2015 only).                       |
| **Lower Confidence Interval**                  | The lower bound of the confidence interval for the happiness score, indicating uncertainty (2016 only).|
| **Upper Confidence Interval**                  | The upper bound of the confidence interval for the happiness score (2016 only).                       |
| **Whisker.high / Whisker.low**                 | Similar to confidence intervals, indicating the range of uncertainty in the happiness score (2017 only).|
| **Economy (GDP per Capita) / GDP per capita** | A measure of economic output per person, indicating the wealth level of the country.                   |
| **Family / Social support**                    | The perceived level of social support from family, friends, and community.                            |
| **Health (Life Expectancy) / Healthy life expectancy** | Average expected lifespan or a proxy for health quality in the country.                            |
| **Freedom / Freedom to make life choices**    | The perceived freedom individuals have to make their own life decisions.                             |
| **Trust (Government Corruption) / Perceptions of corruption** | Measure of how much corruption is perceived in government and public institutions.          |
| **Generosity**                                 | A measure of the generosity or charitable behavior in the country’s population.                      |
| **Dystopia Residual**                          | The score of Dystopia, a hypothetical country with the lowest values on every factor, plus the residual: the part of a country's happiness score that the listed factors do not explain (2015–2017 only). |

In [10]:
# Print the number of rows and columns in each DataFrame
for year, df in zip(range(2015, 2020), [df_2015, df_2016, df_2017, df_2018, df_2019]):
    print(f"\nYear: {year}")
    print(f"Shape: {df.shape}")  # Returns a tuple (rows, columns)
    print(f"Number of Rows: {df.shape[0]}")
    print(f"Number of Columns: {df.shape[1]}")


Year: 2015
Shape: (158, 12)
Number of Rows: 158
Number of Columns: 12

Year: 2016
Shape: (157, 13)
Number of Rows: 157
Number of Columns: 13

Year: 2017
Shape: (155, 12)
Number of Rows: 155
Number of Columns: 12

Year: 2018
Shape: (156, 9)
Number of Rows: 156
Number of Columns: 9

Year: 2019
Shape: (156, 9)
Number of Rows: 156
Number of Columns: 9


### Step 2. Transform

As part of the transformation step, we clean and prepare the data for analysis.

- Drop the uncertainty columns, which are not needed for analysis or modeling:
  - `Standard Error` (2015)
  - `Lower Confidence Interval` and `Upper Confidence Interval` (2016)
  - `Whisker.high` and `Whisker.low` (2017)
- Drop `Dystopia Residual` (2015–2017, spelled `Dystopia.Residual` in 2017). This is a synthetic value used in the original happiness score calculation. Since we aim to build our own models (e.g., linear regression), this residual component is not useful and may bias our interpretation or predictions.

The 2018 and 2019 files contain none of these columns. After the drops, every dataset holds only the directly reported, interpretable variables that our models will be trained on.
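
The drops described above can also be expressed as one shared list applied to every year. This is a minimal sketch rather than the notebook's own code: the constant `UNCERTAINTY_AND_RESIDUAL`, the helper `drop_unneeded`, and the demo rows (made-up numbers) are assumptions for illustration. It relies on `DataFrame.drop(..., errors="ignore")`, which skips labels that a given frame does not have.

```python
import pandas as pd

# Columns named in the text, in every spelling used by the raw files.
UNCERTAINTY_AND_RESIDUAL = [
    "Standard Error",
    "Lower Confidence Interval", "Upper Confidence Interval",
    "Whisker.high", "Whisker.low",
    "Dystopia Residual", "Dystopia.Residual",
]

def drop_unneeded(df):
    """Return a copy without the listed columns; columns that are absent are skipped."""
    return df.drop(columns=UNCERTAINTY_AND_RESIDUAL, errors="ignore")

# Stand-in rows (made-up values) shaped like the 2015 and 2018 files.
demo_2015 = pd.DataFrame({"Country": ["A"], "Happiness Score": [7.0],
                          "Standard Error": [0.03], "Dystopia Residual": [2.5]})
demo_2018 = pd.DataFrame({"Country or region": ["A"], "Score": [7.0]})

print(drop_unneeded(demo_2015).columns.tolist())
print(drop_unneeded(demo_2018).columns.tolist())
```

Returning a new frame instead of using `inplace=True` keeps the raw data available for re-running the cell, at the cost of one extra assignment per year.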

In [11]:
# 2015: Drop Standard Error and Dystopia Residual
df_2015.drop(columns=["Standard Error", "Dystopia Residual"], inplace=True)

# 2016: Drop Confidence Intervals and Dystopia Residual
df_2016.drop(columns=["Lower Confidence Interval", "Upper Confidence Interval", "Dystopia Residual"], inplace=True)

# 2017: Drop Whiskers and Dystopia Residual
df_2017.drop(columns=["Whisker.high", "Whisker.low", "Dystopia.Residual"], inplace=True)

# 2018 and 2019: no columns need to be dropped

In [13]:
# Print column names for 2015 dataset to verify dropped columns
print("2015 columns:", df_2015.columns.tolist())

# Print column names for 2016 dataset to verify dropped columns
print("2016 columns:", df_2016.columns.tolist())

# Print column names for 2017 dataset to verify dropped columns
print("2017 columns:", df_2017.columns.tolist())

2015 columns: ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity']
2016 columns: ['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity']
2017 columns: ['Country', 'Happiness.Rank', 'Happiness.Score', 'Economy..GDP.per.Capita.', 'Family', 'Health..Life.Expectancy.', 'Freedom', 'Generosity', 'Trust..Government.Corruption.']


Next, we rename the columns in all five datasets so that they share one naming scheme:

- Each year's spelling is mapped to a single standardized name, for example `Happiness Rank`, `Happiness Score`, and `Economy`.
- `Social support` (2018–2019) is stored under `Family`, the name used in 2015–2017.
- Every year ends up with the same column names, which makes the datasets straightforward to merge and compare.
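
Because `DataFrame.rename` ignores keys that are not present in a frame, the per-year renames can be collapsed into a single lookup. The sketch below is one possible way to do that, not the notebook's method; `RENAME_MAP`, `standardize_columns`, and the empty demo frames are names introduced here for illustration.

```python
import pandas as pd

# Every raw header that needs a new name, mapped to the standardized name.
# Headers that are already standard (e.g. 'Country', 'Generosity') are omitted.
RENAME_MAP = {
    "Country or region": "Country",
    "Happiness.Rank": "Happiness Rank",
    "Overall rank": "Happiness Rank",
    "Happiness.Score": "Happiness Score",
    "Score": "Happiness Score",
    "Economy (GDP per Capita)": "Economy",
    "Economy..GDP.per.Capita.": "Economy",
    "GDP per capita": "Economy",
    "Social support": "Family",
    "Health (Life Expectancy)": "Healthy life expectancy",
    "Health..Life.Expectancy.": "Healthy life expectancy",
    "Freedom": "Freedom to make life choices",
    "Trust (Government Corruption)": "Perceptions of corruption",
    "Trust..Government.Corruption.": "Perceptions of corruption",
}

def standardize_columns(df):
    """Return a copy of df with raw headers replaced by the standardized names."""
    return df.rename(columns=RENAME_MAP)

# Header-only stand-ins for two of the yearly files.
demo_2017 = pd.DataFrame(columns=["Country", "Happiness.Rank", "Happiness.Score", "Freedom"])
demo_2019 = pd.DataFrame(columns=["Overall rank", "Country or region", "Score",
                                  "Freedom to make life choices"])

renamed_2017 = standardize_columns(demo_2017)
renamed_2019 = standardize_columns(demo_2019)

# After renaming, both frames should expose the same set of column names.
print(set(renamed_2017.columns) == set(renamed_2019.columns))
print(sorted(renamed_2017.columns))
```

A set comparison like the one at the end is a cheap check to run on the real frames before merging them.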


In [14]:
# Rename columns in 2015 dataset
df_2015.rename(columns={
    'Country': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'Economy',
    'Family': 'Family',
    'Health (Life Expectancy)': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust (Government Corruption)': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2016 dataset
df_2016.rename(columns={
    'Country': 'Country',
    'Happiness Rank': 'Happiness Rank',
    'Happiness Score': 'Happiness Score',
    'Economy (GDP per Capita)': 'Economy',
    'Family': 'Family',
    'Health (Life Expectancy)': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust (Government Corruption)': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2017 dataset
df_2017.rename(columns={
    'Country': 'Country',
    'Happiness.Rank': 'Happiness Rank',
    'Happiness.Score': 'Happiness Score',
    'Economy..GDP.per.Capita.': 'Economy',
    'Family': 'Family',
    'Health..Life.Expectancy.': 'Healthy life expectancy',
    'Freedom': 'Freedom to make life choices',
    'Trust..Government.Corruption.': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2018 dataset
df_2018.rename(columns={
    'Country or region': 'Country',
    'Overall rank': 'Happiness Rank',
    'Score': 'Happiness Score',
    'GDP per capita': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)

# Rename columns in 2019 dataset
df_2019.rename(columns={
    'Country or region': 'Country',
    'Overall rank': 'Happiness Rank',
    'Score': 'Happiness Score',
    'GDP per capita': 'Economy',
    'Social support': 'Family',
    'Healthy life expectancy': 'Healthy life expectancy',
    'Freedom to make life choices': 'Freedom to make life choices',
    'Perceptions of corruption': 'Perceptions of corruption'
}, inplace=True)


In [15]:
# Print renamed column names for each dataset
print("2015 Columns:\n", df_2015.columns)
print("\n2016 Columns:\n", df_2016.columns)
print("\n2017 Columns:\n", df_2017.columns)
print("\n2018 Columns:\n", df_2018.columns)
print("\n2019 Columns:\n", df_2019.columns)


2015 Columns:
 Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy',
       'Family', 'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')

2016 Columns:
 Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score', 'Economy',
       'Family', 'Healthy life expectancy', 'Freedom to make life choices',
       'Perceptions of corruption', 'Generosity'],
      dtype='object')

2017 Columns:
 Index(['Country', 'Happiness Rank', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

2018 Columns:
 Index(['Happiness Rank', 'Country', 'Happiness Score', 'Economy', 'Family',
       'Healthy life expectancy', 'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')

2019 Columns:
 Index(['Happiness Rank', 'Country', 'Happ

# Adding Region Column to World Happiness Dataset (2017–2019)

The 2015 and 2016 files already include a `Region` column, but the 2017–2019 files do not. To support analysis by geographic region, we add a **`Region`** column to those three datasets by mapping each country to its world region.
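
An alternative to a hand-written dictionary is to reuse the `Region` values that the 2015 and 2016 files already carry. The following is a sketch under assumptions: the helper `region_lookup_from` is hypothetical, the demo rows are placeholders (including the invented country name 'Atlantis'), and the region labels in the real 2015/2016 files may not match the hand-written labels used in this notebook, so they would need to be checked before mixing the two.

```python
import pandas as pd

def region_lookup_from(*frames):
    """Build a Country -> Region dict from frames that have both columns."""
    pairs = pd.concat([df[["Country", "Region"]] for df in frames], ignore_index=True)
    # If a country appears in several years, keep the first region seen.
    pairs = pairs.drop_duplicates(subset="Country", keep="first")
    return dict(zip(pairs["Country"], pairs["Region"]))

# Placeholder rows standing in for df_2015, df_2016, and df_2017.
demo_2015 = pd.DataFrame({"Country": ["Norway", "Chile"],
                          "Region": ["Europe", "Latin America"]})
demo_2016 = pd.DataFrame({"Country": ["Norway", "Belize"],
                          "Region": ["Europe", "Latin America"]})
demo_2017 = pd.DataFrame({"Country": ["Norway", "Belize", "Atlantis"]})

lookup = region_lookup_from(demo_2015, demo_2016)
# Countries absent from both earlier years fall back to 'Unknown'.
demo_2017["Region"] = demo_2017["Country"].map(lookup).fillna("Unknown")
print(demo_2017)
```

Countries that only appear from 2017 onward would still need manual entries, so this reduces the size of the hand-written mapping rather than replacing it.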

In [16]:
country_to_region = {
    # North America
    'United States': 'North America',
    'Canada': 'North America',
    'Mexico': 'North America',

    # Latin America and Caribbean
    'Brazil': 'Latin America',
    'Argentina': 'Latin America',
    'Colombia': 'Latin America',
    'Chile': 'Latin America',
    'Costa Rica': 'Latin America',
    'Guatemala': 'Latin America',
    'Panama': 'Latin America',
    'El Salvador': 'Latin America',
    'Honduras': 'Latin America',
    'Nicaragua': 'Latin America',
    'Paraguay': 'Latin America',
    'Peru': 'Latin America',
    'Uruguay': 'Latin America',
    'Venezuela': 'Latin America',

    # Europe
    'United Kingdom': 'Europe',
    'Germany': 'Europe',
    'France': 'Europe',
    'Italy': 'Europe',
    'Spain': 'Europe',
    'Netherlands': 'Europe',
    'Sweden': 'Europe',
    'Norway': 'Europe',
    'Finland': 'Europe',
    'Denmark': 'Europe',
    'Belgium': 'Europe',
    'Austria': 'Europe',
    'Switzerland': 'Europe',
    'Ireland': 'Europe',
    'Poland': 'Europe',
    'Czech Republic': 'Europe',
    'Portugal': 'Europe',
    'Greece': 'Europe',
    'Hungary': 'Europe',
    'Slovakia': 'Europe',
    'Croatia': 'Europe',
    'Lithuania': 'Europe',
    'Slovenia': 'Europe',
    'Latvia': 'Europe',
    'Estonia': 'Europe',
    'Bulgaria': 'Europe',
    'Romania': 'Europe',
    'Moldova': 'Europe',
    'Belarus': 'Europe',
    'Ukraine': 'Europe',
    'Russia': 'Europe',
    'Luxembourg': 'Europe',
    'Iceland': 'Europe',

    # Asia and the Middle East
    'India': 'South Asia',
    'China': 'East Asia',
    'Japan': 'East Asia',
    'South Korea': 'East Asia',
    'Taiwan': 'East Asia',
    'Hong Kong': 'East Asia',
    'Indonesia': 'Southeast Asia',
    'Thailand': 'Southeast Asia',
    'Malaysia': 'Southeast Asia',
    'Singapore': 'Southeast Asia',
    'Philippines': 'Southeast Asia',
    'Vietnam': 'Southeast Asia',
    'Bangladesh': 'South Asia',
    'Pakistan': 'South Asia',
    'Sri Lanka': 'South Asia',
    'Nepal': 'South Asia',
    'Mongolia': 'East Asia',
    'Cambodia': 'Southeast Asia',
    'Myanmar': 'Southeast Asia',
    'Laos': 'Southeast Asia',
    'Israel': 'Middle East',
    'Turkey': 'Middle East',
    'Saudi Arabia': 'Middle East',
    'United Arab Emirates': 'Middle East',
    'Qatar': 'Middle East',
    'Kuwait': 'Middle East',
    'Jordan': 'Middle East',
    'Lebanon': 'Middle East',
    'Oman': 'Middle East',
    'Bahrain': 'Middle East',
    'Iran': 'Middle East',
    'Iraq': 'Middle East',

    # Africa
    'South Africa': 'Sub-Saharan Africa',
    'Nigeria': 'Sub-Saharan Africa',
    'Kenya': 'Sub-Saharan Africa',
    'Ghana': 'Sub-Saharan Africa',
    'Tanzania': 'Sub-Saharan Africa',
    'Uganda': 'Sub-Saharan Africa',
    'Zimbabwe': 'Sub-Saharan Africa',
    'Ethiopia': 'Sub-Saharan Africa',
    'Botswana': 'Sub-Saharan Africa',
    'Mozambique': 'Sub-Saharan Africa',
    'Senegal': 'Sub-Saharan Africa',
    'Cameroon': 'Sub-Saharan Africa',
    'Malawi': 'Sub-Saharan Africa',
    'Rwanda': 'Sub-Saharan Africa',
    'Zambia': 'Sub-Saharan Africa',
    'Madagascar': 'Sub-Saharan Africa',
    'Algeria': 'North Africa',
    'Egypt': 'North Africa',
    'Morocco': 'North Africa',
    'Tunisia': 'North Africa',
    'Libya': 'North Africa',

    # Oceania
    'Australia': 'Oceania',
    'New Zealand': 'Oceania',

    # Alternate spellings of names already listed above
    'Hong Kong SAR': 'East Asia',
    'Taiwan Province of China': 'East Asia'
}


Map countries to their respective regions for the 2017, 2018, and 2019 datasets. This adds a `Region` column based on the country-to-region mapping dictionary. Countries not found in the mapping are assigned `'Unknown'` so that they can be identified and fixed afterwards.

In [18]:
df_2017['Region'] = df_2017['Country'].map(country_to_region)
df_2018['Region'] = df_2018['Country'].map(country_to_region)
df_2019['Region'] = df_2019['Country'].map(country_to_region)

# Fill missing regions with 'Unknown'
df_2017['Region'] = df_2017['Region'].fillna('Unknown')
df_2018['Region'] = df_2018['Region'].fillna('Unknown')
df_2019['Region'] = df_2019['Region'].fillna('Unknown')


In [19]:
# Check unique regions present in each dataset
print("2017 Regions:", df_2017['Region'].unique())
print("2018 Regions:", df_2018['Region'].unique())
print("2019 Regions:", df_2019['Region'].unique())

# Check countries that did NOT get mapped (if any)
print("2017 countries with unknown region:", df_2017[df_2017['Region'] == 'Unknown']['Country'].tolist())
print("2018 countries with unknown region:", df_2018[df_2018['Region'] == 'Unknown']['Country'].tolist())
print("2019 countries with unknown region:", df_2019[df_2019['Region'] == 'Unknown']['Country'].tolist())

# Preview some rows with country and region columns side-by-side
print(df_2017[['Country', 'Region']].head())


2017 Regions: ['Europe' 'North America' 'Oceania' 'Middle East' 'Latin America'
 'Southeast Asia' 'Unknown' 'East Asia' 'North Africa' 'South Asia'
 'Sub-Saharan Africa']
2018 Regions: ['Europe' 'North America' 'Oceania' 'Latin America' 'Middle East'
 'Unknown' 'East Asia' 'Southeast Asia' 'North Africa' 'South Asia'
 'Sub-Saharan Africa']
2019 Regions: ['Europe' 'Oceania' 'North America' 'Latin America' 'Middle East'
 'Unknown' 'East Asia' 'Southeast Asia' 'South Asia' 'North Africa'
 'Sub-Saharan Africa']
2017 countries with unknown region: ['Malta', 'Trinidad and Tobago', 'Ecuador', 'Uzbekistan', 'Belize', 'Bolivia', 'Turkmenistan', 'Kazakhstan', 'North Cyprus', 'Mauritius', 'Cyprus', 'Hong Kong S.A.R., China', 'Serbia', 'Jamaica', 'Kosovo', 'Montenegro', 'Azerbaijan', 'Dominican Republic', 'Bosnia and Herzegovina', 'Macedonia', 'Somalia', 'Tajikistan', 'Bhutan', 'Kyrgyzstan', 'Palestinian Territories', 'Sierra Leone', 'Albania', 'Namibia', 'Gabon', 'Armenia', 'Mauritania', 'Congo (

In [20]:
country_to_region.update({
    # Add regions for countries that the first mapping pass left as 'Unknown'
    'Trinidad and Tobago': 'Latin America',
    'North Macedonia': 'Europe',
    'Macedonia': 'Europe',
    'Northern Cyprus': 'Middle East',
    'Kosovo': 'Europe',
    'Bosnia and Herzegovina': 'Europe',
    'Serbia': 'Europe',
    'Montenegro': 'Europe',
    'Albania': 'Europe',
    'Armenia': 'Europe',
    'Azerbaijan': 'Europe',
    'Georgia': 'Europe',
    'Cyprus': 'Middle East',
    'Malta': 'Europe',
    'Ecuador': 'Latin America',
    'Belize': 'Latin America',
    'Bolivia': 'Latin America',
    'Dominican Republic': 'Latin America',
    'Jamaica': 'Latin America',
    'Mauritius': 'Sub-Saharan Africa',
    'Kazakhstan': 'Asia',
    'Kyrgyzstan': 'Asia',
    'Tajikistan': 'Asia',
    'Turkmenistan': 'Asia',
    'Uzbekistan': 'Asia',
    'Bhutan': 'Asia',
    'Palestinian Territories': 'Middle East',
    'Somalia': 'Sub-Saharan Africa',
    'Sierra Leone': 'Sub-Saharan Africa',
    'Namibia': 'Sub-Saharan Africa',
    'Gabon': 'Sub-Saharan Africa',
    'Mauritania': 'Sub-Saharan Africa',
    'Congo (Brazzaville)': 'Sub-Saharan Africa',
    'Congo (Kinshasa)': 'Sub-Saharan Africa',
    'Mali': 'Sub-Saharan Africa',
    'Ivory Coast': 'Sub-Saharan Africa',
    'Sudan': 'Sub-Saharan Africa',
    'Burkina Faso': 'Sub-Saharan Africa',
    'Niger': 'Sub-Saharan Africa',
    'Chad': 'Sub-Saharan Africa',
    'Lesotho': 'Sub-Saharan Africa',
    'Angola': 'Sub-Saharan Africa',
    'Afghanistan': 'Asia',
    'Benin': 'Sub-Saharan Africa',
    'Haiti': 'Latin America',
    'Yemen': 'Middle East',
    'South Sudan': 'Sub-Saharan Africa',
    'Liberia': 'Sub-Saharan Africa',
    'Guinea': 'Sub-Saharan Africa',
    'Togo': 'Sub-Saharan Africa',
    'Syria': 'Middle East',
    'Burundi': 'Sub-Saharan Africa',
    'Central African Republic': 'Sub-Saharan Africa',
    'Gambia': 'Sub-Saharan Africa',
    'Comoros': 'Sub-Saharan Africa',
    'Swaziland': 'Sub-Saharan Africa',  # now Eswatini
})


In [None]:
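# Normalize country names before mapping so that punctuation differences
# (e.g. 'Hong Kong S.A.R., China') do not prevent a match
def normalize_country_name(name):
    name = name.replace('&', 'and')  # 'A & B' -> 'A and B'
    name = name.replace('.', '')     # Remove periods
    name = name.replace(',', '')     # Remove commas
    name = name.strip()              # Trim surrounding whitespace
    return name

# Re-apply the extended mapping to the datasets that had no Region column
for df in [df_2017, df_2018, df_2019]:
    df['Country'] = df['Country'].apply(normalize_country_name)
    df['Region'] = df['Country'].map(country_to_region)
    df['Region'] = df['Region'].fillna('Unknown')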

In [None]:
# List and count the countries (and rows) whose region is still 'Unknown'
for year, df in zip([2017, 2018, 2019], [df_2017, df_2018, df_2019]):
    unknown_countries = df[df['Region'] == 'Unknown']['Country'].unique()
    print(f"{year} unknown countries ({len(unknown_countries)}): {list(unknown_countries)}")
    unknown_rows = df[df['Region'] == 'Unknown'].shape[0]
    print(f"{year} rows with unknown region: {unknown_rows}\n")


2017 unknown countries (2): ['North Cyprus', 'Hong Kong SAR China']
2017 rows with unknown region: 2

2018 unknown countries (0): []
2018 rows with unknown region: 0

2019 unknown countries (0): []
2019 rows with unknown region: 0



In [23]:
# Manually assign regions for the two 2017 names that still did not match a mapping key
df_2017.loc[df_2017['Country'] == 'North Cyprus', 'Region'] = 'Europe'
df_2017.loc[df_2017['Country'] == 'Hong Kong SAR China', 'Region'] = 'East Asia'
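
Name mismatches like these two can also be handled with an alias table applied before the lookup, so the fix is reused for any year in which the spelling appears. This is a sketch, not the notebook's approach: `COUNTRY_ALIASES` is a hypothetical table, and `country_to_region` below is only a three-entry excerpt of the mapping built earlier.

```python
import pandas as pd

# Hypothetical alias table: spelling found in the data -> spelling used as a mapping key.
COUNTRY_ALIASES = {
    "North Cyprus": "Northern Cyprus",
    "Hong Kong SAR China": "Hong Kong",
}

# Three-entry excerpt of the mapping built earlier in the notebook.
country_to_region = {
    "Northern Cyprus": "Middle East",
    "Hong Kong": "East Asia",
    "Norway": "Europe",
}

demo = pd.DataFrame({"Country": ["Norway", "North Cyprus", "Hong Kong SAR China"]})

# Translate aliases for the lookup only; the Country column keeps its original spelling.
lookup_names = demo["Country"].replace(COUNTRY_ALIASES)
demo["Region"] = lookup_names.map(country_to_region).fillna("Unknown")
print(demo)
```

Note that the mapping lists `Northern Cyprus` under `Middle East`, while the manual assignment above places `North Cyprus` in `Europe`; whichever approach is used, one of the two labels should be chosen and applied consistently.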


In [None]:
# Recheck: after the manual fixes, no country should have an 'Unknown' region
for year, df in zip([2017, 2018, 2019], [df_2017, df_2018, df_2019]):
    unknown_countries = df[df['Region'] == 'Unknown']['Country'].unique()
    print(f"{year} unknown countries ({len(unknown_countries)}): {list(unknown_countries)}")
    unknown_rows = df[df['Region'] == 'Unknown'].shape[0]
    print(f"{year} rows with unknown region: {unknown_rows}\n")


2017 unknown countries (0): []
2017 rows with unknown region: 0

2018 unknown countries (0): []
2018 rows with unknown region: 0

2019 unknown countries (0): []
2019 rows with unknown region: 0



### Step 3. Load
- Save the cleaned and transformed dataset into a new CSV file.
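
The Load step could look like the following minimal sketch. It assumes the five cleaned frames are combined into one table with an added `Year` column; that column, the helper name `combine_and_save`, and the output file name are assumptions made here, since the notebook does not show this step. `pd.concat` aligns frames by column name, so the differing column order between years is not a problem. The demo writes to a temporary directory and reads the file back to confirm the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

def combine_and_save(frames_by_year, output_path):
    """Stack the yearly frames, tag each row with its year, and write one CSV."""
    tagged = [df.assign(Year=year) for year, df in frames_by_year.items()]
    combined = pd.concat(tagged, ignore_index=True)
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)  # Create the folder if needed
    combined.to_csv(output_path, index=False)               # Leave out the row index
    return combined

# Stand-in frames (made-up values); in the notebook these would be
# {2015: df_2015, ..., 2019: df_2019} after the Transform step.
demo = {
    2018: pd.DataFrame({"Country": ["A"], "Happiness Score": [7.1], "Region": ["Europe"]}),
    2019: pd.DataFrame({"Region": ["Europe"], "Country": ["A"], "Happiness Score": [7.2]}),
}

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "processed" / "happiness_demo.csv"  # Hypothetical file name
    combined = combine_and_save(demo, out_file)
    reloaded = pd.read_csv(out_file)

print(reloaded)
```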
---