<h1 align="center"><font size="5">Global Unemployment and GDP Analysis: Data Cleaning, Transformation, and Merging </font></h1>

### Introduction

This notebook documents the full pipeline for preparing a clean, analysis-ready dataset that combines global unemployment statistics with GDP data across countries and years.
The cleaned dataset lays the foundation for exploring economic relationships, demographic disparities, and trends over time, especially the relationship between unemployment rates and GDP across regions and age groups.

#### Preparing and Merging GDP with Unemployment Data

This section covers the basic data wrangling needed to clean and merge the GDP and unemployment datasets:

- Import the required libraries and load the raw datasets from their GitHub raw URLs.
- Standardize the country names in the unemployment dataset with a custom mapping so that they match the names used in the GDP records.
- Drop regions and territories that have no GDP data, since they cannot be matched in the merge.
- Reshape both datasets into long format for time-series analysis:
  - GDP: Country, Year, GDP
  - Unemployment: country_std, Year, Unemployment + demographics
- Convert the year columns to the integer data type so that both datasets use the same key type in the merge.
- Merge the datasets on standardized country name and year.
- Drop unnecessary columns and filter out rows without GDP or unemployment values.
- Preview the cleaned, merged dataset and optionally save it for future use.
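
The reshape-and-merge steps above can be sketched on toy data. The frames and values here are made up for illustration; only the column names `Country`, `country_name`, and `country_std` follow the datasets used in this notebook, and `name_map`, `gdp_toy`, and `unemp_toy` are hypothetical names.

```python
import pandas as pd

# Toy wide-format tables standing in for the real files (values are made up).
gdp_wide = pd.DataFrame({
    "Country": ["Laos", "Vietnam"],
    "2020": [100.0, 200.0],
    "2021": [110.0, 210.0],
})
unemp_wide = pd.DataFrame({
    "country_name": ["Lao People's Democratic Republic", "Viet Nam"],
    "sex": ["Female", "Female"],
    "2020": [1.5, 2.5],
    "2021": [1.4, 2.4],
})

# Standardize names so both tables share the same country keys.
name_map = {"Lao People's Democratic Republic": "Laos", "Viet Nam": "Vietnam"}
unemp_wide["country_std"] = unemp_wide["country_name"].replace(name_map)

# Wide -> long: one row per (country, year).
gdp_toy = gdp_wide.melt(id_vars=["Country"], var_name="Year", value_name="GDP")
unemp_toy = unemp_wide.melt(
    id_vars=["country_std", "sex"],
    value_vars=["2020", "2021"],
    var_name="Year",
    value_name="Unemployment",
)

# melt produces year labels as strings; cast so both merge keys are integers.
gdp_toy["Year"] = gdp_toy["Year"].astype(int)
unemp_toy["Year"] = unemp_toy["Year"].astype(int)

# Left merge keeps every unemployment row, then drop the duplicate key column.
toy_merged = unemp_toy.merge(
    gdp_toy,
    left_on=["country_std", "Year"],
    right_on=["Country", "Year"],
    how="left",
).drop(columns=["Country"])
print(toy_merged)
```

Without the name mapping, the two toy countries would not match and every `GDP` value in the merged frame would be missing, which is why standardization comes before the merge.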

In [None]:
# Importing Libraries
import pandas as pd

# Load the datasets from GitHub (raw URLs)
gdp_url = "https://raw.githubusercontent.com/evidentart/global-economic-insights-ml/main/data/raw_gdp_per_country.csv"
unemp_url = "https://raw.githubusercontent.com/evidentart/global-economic-insights-ml/main/data/raw_global_unemployment_data.csv"

gdp_df = pd.read_csv(gdp_url)
unemp_df = pd.read_csv(unemp_url)

# Standardize country names in unemployment dataset
country_mapping = {
    'Sao Tome and Principe': 'São Tomé and Príncipe',
    'Congo, Democratic Republic of the': 'Democratic Republic of the Congo',
    'Congo': 'Republic of the Congo',
    "Korea, Democratic People's Republic of": 'North Korea',
    'Korea, Republic of': 'South Korea',
    'Iran, Islamic Republic of': 'Iran',
    "Lao People's Democratic Republic": 'Laos',
    'Moldova, Republic of': 'Moldova',
    'Macau, China': 'Macau',
    'Hong Kong, China': 'Hong Kong',
    'Viet Nam': 'Vietnam',
    'Palestinian Territories': 'Palestine',
    'Russian Federation': 'Russia',
    'Brunei Darussalam': 'Brunei'
}

unemp_df['country_std'] = unemp_df['country_name'].replace(country_mapping)

# Drop regions and territories with no GDP data
drop_countries = [
    'South America', 'Channel Islands', 'French Polynesia',
    'United States Virgin Islands', 'New Caledonia', 'Guam', 'Cuba'
]
unemp_df = unemp_df[~unemp_df['country_std'].isin(drop_countries)]

# Reshape GDP dataset to long format (country, year, GDP)
gdp_long = gdp_df.melt(id_vars=['Country'], var_name='Year', value_name='GDP')

# Reshape unemployment dataset to long format, keeping the demographic columns
unemp_df_long = unemp_df.melt(
    id_vars=['country_std', 'indicator_name', 'sex', 'age_group', 'age_categories'],
    value_vars=[str(y) for y in range(2014, 2025)],
    var_name='Year',
    value_name='Unemployment'
)

# Convert Year columns to integer for both datasets so the merge keys match
gdp_long['Year'] = gdp_long['Year'].astype(int)
unemp_df_long['Year'] = unemp_df_long['Year'].astype(int)

# Merge datasets on standardized country name and year
merged_df = pd.merge(
    unemp_df_long,
    gdp_long,
    left_on=['country_std', 'Year'],
    right_on=['Country', 'Year'],
    how='left'
)

# Drop duplicate GDP country column
merged_df.drop(columns=['Country'], inplace=True)

# Keep only rows where both GDP and Unemployment are present
merged_df_combined = merged_df.dropna(subset=['GDP', 'Unemployment'])

# Check the filtered combined dataframe
print("Filtered combined dataframe sample (rows with both GDP and Unemployment):")
print(merged_df_combined.head())

# Save filtered merged dataset (optional)
# You can skip saving locally or use this to export if needed:
# merged_df_combined.to_csv("unemployment_gdp_combined_filtered.csv", index=False)
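
Before relying on the merge, it can help to list the country names that appear in the unemployment data but not in the GDP data, since those rows end up with missing GDP. The following is a minimal sketch under that assumption; `unmatched_countries` is a hypothetical helper, not part of the original notebook, and the demo frames are made up.

```python
import pandas as pd

def unmatched_countries(left, right, left_key="country_std", right_key="Country"):
    """Return the sorted names found in `left[left_key]` but not in `right[right_key]`."""
    left_names = set(left[left_key].dropna().unique())
    right_names = set(right[right_key].dropna().unique())
    return sorted(left_names - right_names)

# Toy frames (made up): 'Viet Nam' was not standardized and 'Cuba' has no GDP row.
demo_unemp = pd.DataFrame({"country_std": ["Laos", "Viet Nam", "Cuba"]})
demo_gdp = pd.DataFrame({"Country": ["Laos", "Vietnam"]})

print(unmatched_countries(demo_unemp, demo_gdp))  # ['Cuba', 'Viet Nam']
```

Applied to the real frames, a call such as `unmatched_countries(unemp_df, gdp_df)` would show which names still need an entry in `country_mapping` or in `drop_countries`.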

#### Checking for Missing GDP Data

This section reloads the combined dataset from its GitHub raw URL and focuses on identifying gaps in the GDP data:

1. **Preview the dataset** to confirm successful loading.
2. **Count missing GDP values** to assess the extent of missing economic data.
3. **Inspect missing GDP entries** by country and year to identify patterns or specific regions/time periods that may require imputation or exclusion in analysis.
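
Step 3 can be made easier to read by counting missing rows per country and year instead of printing every row. This is a minimal sketch on made-up data; `summarize_missing` is a hypothetical helper, and the column names are assumed to match the combined dataset.

```python
import numpy as np
import pandas as pd

def summarize_missing(df, value_col="GDP", country_col="country_std", year_col="Year"):
    """Count rows with a missing `value_col` for each (country, year) pair."""
    missing = df[df[value_col].isna()]
    return (
        missing.groupby([country_col, year_col])
        .size()
        .reset_index(name="missing_rows")
    )

# Toy data (made up): country A lacks GDP in 2021, country B lacks it twice in 2020.
demo = pd.DataFrame({
    "country_std": ["A", "A", "B", "B"],
    "Year": [2020, 2021, 2020, 2020],
    "GDP": [1.0, np.nan, np.nan, np.nan],
})

missing_summary = summarize_missing(demo)
print(missing_summary)
```

A country that is missing GDP for every year points to a name mismatch, while isolated years point to gaps in the source data that may need imputation or exclusion.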

In [4]:
import pandas as pd

# Load the combined dataset
file_url = "https://raw.githubusercontent.com/evidentart/global-economic-insights-ml/main/data/clean_merge_unemployment_gdp.csv"
df = pd.read_csv(file_url)

# Check first few rows
print(df.head())

# Count how many rows have missing GDP
missing_gdp_count = df['GDP'].isna().sum()
print(f"Number of rows with missing GDP: {missing_gdp_count}")

# Optionally, check which years/countries are missing GDP
missing_gdp_info = df[df['GDP'].isna()][['country_std', 'Year']]
print(missing_gdp_info)

   country_std                    indicator_name     sex age_group  \
0  Afghanistan  Unemployment rate by sex and age  Female     15-24   
1  Afghanistan  Unemployment rate by sex and age  Female       25+   
2  Afghanistan  Unemployment rate by sex and age  Female  Under 15   
3  Afghanistan  Unemployment rate by sex and age    Male     15-24   
4  Afghanistan  Unemployment rate by sex and age    Male       25+   

  age_categories  Year  Unemployment      GDP  
0          Youth  2020        21.228  20136.0  
1         Adults  2020        14.079  20136.0  
2       Children  2020        16.783  20136.0  
3          Youth  2020        14.452  20136.0  
4         Adults  2020         8.732  20136.0  
Number of rows with missing GDP: 0
Empty DataFrame
Columns: [country_std, Year]
Index: []
