# World Bank - Population Growth metadata

## ETL - Part 2

Load and cleanse the source 'metadata country' dataset (the metadata for the Population Growth dataset from Part 1). Because we want to present country-level values only, exclude any codes that are not true ISO 3166 alpha-3 country codes, such as regional aggregates (e.g. EMU = 'Euro area') or economic groupings (e.g. OED = 'OECD members'). Write the cleansed dataset out to a new file.

In [43]:
# Dependencies
import pandas as pd

In [44]:
# Load data file
pop_growth_md_df = pd.read_csv('source_data/Metadata_Country_API_SP.POP.GROW_DS2_en_csv_v2_3404396.csv', delimiter=',')
pop_growth_md_df.head()

Unnamed: 0,Country Code,Region,IncomeGroup,SpecialNotes,TableName,Unnamed: 5
0,ABW,Latin America & Caribbean,High income,,Aruba,
1,AFE,,,"26 countries, stretching from the Red Sea in t...",Africa Eastern and Southern,
2,AFG,South Asia,Low income,The reporting period for national accounts dat...,Afghanistan,
3,AFW,,,"22 countries, stretching from the westernmost ...",Africa Western and Central,
4,AGO,Sub-Saharan Africa,Lower middle income,The World Bank systematically assesses the app...,Angola,


In [45]:
# Check number of data records after initial load
initial_num = len(pop_growth_md_df)
print(initial_num)

265


In [46]:
# Rename column - TableName to Country Name
pop_growth_md_df = pop_growth_md_df.rename(columns={'TableName': 'Country Name'})
pop_growth_md_df.head(10)

Unnamed: 0,Country Code,Region,IncomeGroup,SpecialNotes,Country Name,Unnamed: 5
0,ABW,Latin America & Caribbean,High income,,Aruba,
1,AFE,,,"26 countries, stretching from the Red Sea in t...",Africa Eastern and Southern,
2,AFG,South Asia,Low income,The reporting period for national accounts dat...,Afghanistan,
3,AFW,,,"22 countries, stretching from the westernmost ...",Africa Western and Central,
4,AGO,Sub-Saharan Africa,Lower middle income,The World Bank systematically assesses the app...,Angola,
5,ALB,Europe & Central Asia,Upper middle income,,Albania,
6,AND,Europe & Central Asia,High income,,Andorra,
7,ARB,,,Arab World aggregate. Arab World is composed o...,Arab World,
8,ARE,Middle East & North Africa,High income,,United Arab Emirates,
9,ARG,Latin America & Caribbean,Upper middle income,The World Bank systematically assesses the app...,Argentina,


In [47]:
# Drop unwanted columns
#
# 'SpecialNotes' is dropped because it is not required here.
#
# The data file includes a comma at the end of each row, which is interpreted as an extra 'unnamed' column.
# A trailing comma on each line of a CSV file is not part of the normal CSV-format definition (refer to
# https://www.rfc-editor.org/rfc/rfc4180).
# This empty 'Unnamed: 5' column in the dataframe can simply be dropped.
#
pop_growth_md_df = pop_growth_md_df.drop(columns=["SpecialNotes", "Unnamed: 5"])
pop_growth_md_df

Unnamed: 0,Country Code,Region,IncomeGroup,Country Name
0,ABW,Latin America & Caribbean,High income,Aruba
1,AFE,,,Africa Eastern and Southern
2,AFG,South Asia,Low income,Afghanistan
3,AFW,,,Africa Western and Central
4,AGO,Sub-Saharan Africa,Lower middle income,Angola
...,...,...,...,...
260,XKX,Europe & Central Asia,Upper middle income,Kosovo
261,YEM,Middle East & North Africa,Low income,"Yemen, Rep."
262,ZAF,Sub-Saharan Africa,Upper middle income,South Africa
263,ZMB,Sub-Saharan Africa,Lower middle income,Zambia
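
The trailing-comma behaviour described in the cell above can be reproduced on a small in-memory sample. The sketch below is illustrative only: the two-row CSV text and the names `sample_csv`, `sample_df` and `empty_unnamed` are assumptions, not part of the source data. It also shows one possible way to drop such columns without hard-coding the `Unnamed: N` label.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the trailing comma in the source file
sample_csv = (
    '"Country Code","Region","TableName",\n'
    '"ABW","Latin America & Caribbean","Aruba",\n'
    '"AFG","South Asia","Afghanistan",\n'
)

# The trailing comma in the header produces an empty field name,
# which pandas labels 'Unnamed: <position>'
sample_df = pd.read_csv(io.StringIO(sample_csv))
print(sample_df.columns.tolist())

# Select columns that are both auto-named and completely empty
empty_unnamed = [
    col for col in sample_df.columns
    if col.startswith('Unnamed:') and sample_df[col].isna().all()
]
sample_df = sample_df.drop(columns=empty_unnamed)
print(sample_df.columns.tolist())
```

Checking that the column is entirely empty before dropping it guards against discarding a real but unlabelled column.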


In [48]:
# NOTE regarding 'null' (missing) data values
#
# Rows with a missing 'IncomeGroup' value were noted and are listed here:
pop_growth_md_df[pop_growth_md_df["IncomeGroup"].isna()]

# If any of these cases remain after cross-filtering against valid international country codes (see below), they will be
# addressed with the help of a data validation library in a subsequent ETL stage.

Unnamed: 0,Country Code,Region,IncomeGroup,Country Name
1,AFE,,,Africa Eastern and Southern
3,AFW,,,Africa Western and Central
7,ARB,,,Arab World
36,CEB,,,Central Europe and the Baltics
49,CSS,,,Caribbean small states
61,EAP,,,East Asia & Pacific (excluding high income)
62,EAR,,,Early-demographic dividend
63,EAS,,,East Asia & Pacific
64,ECA,,,Europe & Central Asia (excluding high income)
65,ECS,,,Europe & Central Asia
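
As a complement to listing the affected rows, a per-column count of missing values gives a quick overview. This is a minimal sketch on made-up data: the sample dataframe and the code `XXA` are hypothetical, chosen to include a row that has a `Region` but no `IncomeGroup`.

```python
import pandas as pd

# Hypothetical sample: two aggregates with no Region/IncomeGroup, and one
# entry ('XXA', a made-up code) with a Region but no IncomeGroup
sample_df = pd.DataFrame({
    'Country Code': ['ABW', 'AFE', 'ARB', 'XXA'],
    'Region': ['Latin America & Caribbean', None, None, 'South Asia'],
    'IncomeGroup': ['High income', None, None, None],
})

# Number of missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Rows where IncomeGroup is missing although Region is present
income_only_missing = sample_df[
    sample_df['IncomeGroup'].isna() & sample_df['Region'].notna()
]
print(income_only_missing['Country Code'].tolist())
```

Separating the two cases matters because rows missing both values are likely to be aggregates, whereas a row missing only `IncomeGroup` may be a real country that survives the later filter.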


In [49]:
# Load official Country Codes reference data file
country_codes_df = pd.read_csv('source_data/CountryCodes_ISO3166.csv')
country_codes_df.head()

Unnamed: 0,English short name,French short name,Alpha-2 code,Alpha-3 code,Numeric
0,Afghanistan,Afghanistan (l'),AF,AFG,4
1,Albania,Albanie (l'),AL,ALB,8
2,Algeria,Algérie (l'),DZ,DZA,12
3,American Samoa,Samoa américaines (les),AS,ASM,16
4,Andorra,Andorre (l'),AD,AND,20


In [50]:
# Look for 3-letter entries in the dataset that DON'T match official ISO 3166 country codes
#
# Return all rows in pop_growth_md_df that do NOT have a matching code in country_codes_df, following the approach in:
# https://www.statology.org/pandas-anti-join/
#
# (1) perform outer join
outer = pd.merge(pop_growth_md_df, country_codes_df, left_on='Country Code', right_on='Alpha-3 code', how='outer', indicator=True)

# (2) perform anti-join: keep only the rows that were found in the left dataframe alone
anti_join = outer[outer['_merge'] == 'left_only'].drop(columns='_merge')

# View results
anti_join[["Country Name", "Country Code"]]


Unnamed: 0,Country Name,Country Code
1,Africa Eastern and Southern,AFE
3,Africa Western and Central,AFW
9,Arab World,ARB
44,Central Europe and the Baltics,CEB
46,Channel Islands,CHI
58,Caribbean small states,CSS
71,East Asia & Pacific (excluding high income),EAP
72,Early-demographic dividend,EAR
73,East Asia & Pacific,EAS
74,Europe & Central Asia (excluding high income),ECA
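
The anti-join above can be illustrated on a toy example. The sketch below uses hypothetical miniature dataframes (`md_df`, `iso_df`) and compares two possible formulations: a left merge with `indicator=True`, and a plain membership test with `isin`. Both are offered as alternatives under these assumptions, not as the method the notebook uses.

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes used above
md_df = pd.DataFrame({
    'Country Code': ['ABW', 'AFE', 'AFG', 'ARB'],
    'Country Name': ['Aruba', 'Africa Eastern and Southern', 'Afghanistan', 'Arab World'],
})
iso_df = pd.DataFrame({'Alpha-3 code': ['ABW', 'AFG', 'ALB']})

# Anti-join via merge: a left join is sufficient here, because rows that
# exist only in iso_df (e.g. 'ALB') are not of interest
merged = md_df.merge(
    iso_df, left_on='Country Code', right_on='Alpha-3 code',
    how='left', indicator=True,
)
anti_join_merge = merged.loc[merged['_merge'] == 'left_only', 'Country Code'].tolist()

# Equivalent membership test without a merge
not_in_iso = ~md_df['Country Code'].isin(iso_df['Alpha-3 code'])
anti_join_isin = md_df.loc[not_in_iso, 'Country Code'].tolist()

print(anti_join_merge)
print(anti_join_isin)
```

The `isin` form keeps the original row index, which can make the result easier to relate back to the source dataframe.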


In [51]:
# Extract the "non-country codes" so we can use those as a filter
non_country_codes = anti_join["Country Code"]
num_non_country_codes = len(non_country_codes)
print(num_non_country_codes)

50


In [52]:
# Filter out non-country codes from the population growth metadata dataset so we are left with ISO 3166 country codes only
pop_growth_md_df = pop_growth_md_df[~pop_growth_md_df["Country Code"].isin(non_country_codes)]
pop_growth_md_df.head()

Unnamed: 0,Country Code,Region,IncomeGroup,Country Name
0,ABW,Latin America & Caribbean,High income,Aruba
2,AFG,South Asia,Low income,Afghanistan
4,AGO,Sub-Saharan Africa,Lower middle income,Angola
5,ALB,Europe & Central Asia,Upper middle income,Albania
6,AND,Europe & Central Asia,High income,Andorra


In [53]:
# Check number of data records after filtering
filtered_num = len(pop_growth_md_df)
print(filtered_num)

# Each non-country code should have removed exactly one row
post_filter_check_OK = filtered_num == (initial_num - num_non_country_codes)
print('Filter check passed: ', post_filter_check_OK)

215
Filter check passed:  True
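
The count comparison above implicitly assumes that every country code appears exactly once. A few additional checks could make that assumption explicit. The helper `check_filtered_codes` and the sample data below are hypothetical, shown only as a sketch of what such checks might look like.

```python
import pandas as pd

def check_filtered_codes(filtered_df, iso_codes):
    """Return simple pass/fail checks on the 'Country Code' column."""
    codes = filtered_df['Country Code']
    return {
        'all_codes_are_iso': bool(codes.isin(iso_codes).all()),
        'no_duplicate_codes': bool(codes.is_unique),
        'no_missing_codes': bool(codes.notna().all()),
    }

# Hypothetical sample data standing in for the filtered dataframe
# and the 'Alpha-3 code' reference column
sample_filtered = pd.DataFrame({'Country Code': ['ABW', 'AFG', 'AGO']})
sample_iso = pd.Series(['ABW', 'AFG', 'AGO', 'ALB'])

results = check_filtered_codes(sample_filtered, sample_iso)
print(results)
```

In the notebook, a call along the lines of `check_filtered_codes(pop_growth_md_df, country_codes_df['Alpha-3 code'])` would apply the same checks to the real data.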


In [54]:
# Write out cleansed 'population growth metadata' dataset
pop_growth_md_df.to_csv('./data/ETL_Metadata_Country_POP_GROW.csv', encoding='utf8', index=False)
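
After writing the file, it can be worth reading it back to confirm that it round-trips cleanly. The sketch below does this with a hypothetical two-row dataframe and a temporary directory, so it does not touch the real output path. The file name `sample_output.csv` is an assumption for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample standing in for the cleansed dataframe
sample_df = pd.DataFrame({
    'Country Code': ['ABW', 'AFG'],
    'Region': ['Latin America & Caribbean', 'South Asia'],
    'IncomeGroup': ['High income', 'Low income'],
    'Country Name': ['Aruba', 'Afghanistan'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'sample_output.csv')
    # Same write options as the cell above
    sample_df.to_csv(out_path, encoding='utf8', index=False)
    reloaded_df = pd.read_csv(out_path)

# Compare against the original with a fresh 0..n-1 index, because
# index=False means the original index is not stored in the file
round_trip_ok = reloaded_df.equals(sample_df.reset_index(drop=True))
print('Round trip OK: ', round_trip_ok)
```

Because the filtered dataframe in the notebook keeps gaps in its index (0, 2, 4, ...), resetting the index before comparing is what makes such a check meaningful with `index=False`.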