# World Bank - Population Growth dataset

## ETL - Part 4 - Data/Schema Validation & Cleansing

Load the intermediate dataset (metadata dataset from Part 2) and perform additional data validation / cleansing.
Write out the cleansed dataset to a new file.

In [26]:
import pandas as pd

In [27]:
# Load the CSV data
uncleaned_df = pd.read_csv("data/ETL_Metadata_Country_POP_GROW.csv")

uncleaned_df.head()

,Country Code,Region,IncomeGroup,Country Name
0,ABW,Latin America & Caribbean,High income,Aruba
1,AFG,South Asia,Low income,Afghanistan
2,AGO,Sub-Saharan Africa,Lower middle income,Angola
3,ALB,Europe & Central Asia,Upper middle income,Albania
4,AND,Europe & Central Asia,High income,Andorra


In [28]:
# Count the rows and columns
num_rows, num_columns = uncleaned_df.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

Number of rows: 215
Number of columns: 4
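
Beyond counting rows and columns, a schema check can confirm that the expected columns are actually present. The following is a minimal sketch rather than part of the original notebook: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names, the column list is taken from the `head()` output above, and a small stand-in DataFrame is used so the example runs without the CSV file.

```python
import pandas as pd

# Column names copied from the head() output above; treating them as the
# required schema is an assumption made for this sketch.
EXPECTED_COLUMNS = ["Country Code", "Region", "IncomeGroup", "Country Name"]


def check_schema(df, expected):
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


# Small stand-in frame so the sketch runs without the real file
sample_df = pd.DataFrame(
    {
        "Country Code": ["ABW", "AFG"],
        "Region": ["Latin America & Caribbean", "South Asia"],
        "IncomeGroup": ["High income", "Low income"],
        "Country Name": ["Aruba", "Afghanistan"],
    }
)

missing_columns = check_schema(sample_df, EXPECTED_COLUMNS)
print(f"Missing columns: {missing_columns}")
```

In the notebook itself, the same helper could be called on `uncleaned_df` right after loading.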


In [29]:
# Remove duplicates - none expected at this point in the ETL
uncleaned_df2 = uncleaned_df.drop_duplicates()
num_rows2, num_columns2 = uncleaned_df2.shape
print(f"Number of rows: {num_rows2}")
print(f"Number of columns: {num_columns2}")

Number of rows: 215
Number of columns: 4
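
Note that `drop_duplicates()` with no arguments only removes rows that are identical in every column. If `Country Code` is meant to be unique (an assumption, not something the notebook states), a repeated code with differing values elsewhere would slip through. The sketch below uses made-up rows to show how `duplicated(subset=...)` could flag such cases.

```python
import pandas as pd

# Illustrative rows only: the repeated code and its missing value are invented
sample_df = pd.DataFrame(
    {
        "Country Code": ["ABW", "AFG", "AFG"],
        "Region": ["Latin America & Caribbean", "South Asia", "South Asia"],
        "IncomeGroup": ["High income", "Low income", None],
    }
)

# Both AFG rows survive, because they differ in the IncomeGroup column
fully_deduped = sample_df.drop_duplicates()
print(f"Rows after drop_duplicates(): {len(fully_deduped)}")

# keep=False marks every row whose 'Country Code' appears more than once
repeated_codes = sample_df[sample_df.duplicated(subset=["Country Code"], keep=False)]
print(repeated_codes)
```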


In [30]:
# Drop the 'Country Name' column as that's already present in the main population growth dataset
uncleaned_df2 = uncleaned_df2.drop(columns=['Country Name'], errors='ignore')
uncleaned_df2

,Country Code,Region,IncomeGroup
0,ABW,Latin America & Caribbean,High income
1,AFG,South Asia,Low income
2,AGO,Sub-Saharan Africa,Lower middle income
3,ALB,Europe & Central Asia,Upper middle income
4,AND,Europe & Central Asia,High income
...,...,...,...
210,WSM,East Asia & Pacific,Lower middle income
211,YEM,Middle East & North Africa,Low income
212,ZAF,Sub-Saharan Africa,Upper middle income
213,ZMB,Sub-Saharan Africa,Lower middle income


In [31]:
# Data cleansing
#
# Fill missing values (particularly noted in the 'IncomeGroup' column, but applied to the whole dataset)
# with a placeholder that makes clear to end users that these values were not specified in the original source data.
validated_df = uncleaned_df2.fillna('(Not Specified)')
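
Before filling, it can help to see which columns actually contain missing values. This is a minimal sketch on invented placeholder rows (the `XX1` to `XX3` codes are not real data); it counts NaN values per column and then applies the same `fillna` call as the cell above.

```python
import pandas as pd

# Placeholder rows for illustration; the codes and values are invented
sample_df = pd.DataFrame(
    {
        "Country Code": ["XX1", "XX2", "XX3"],
        "Region": ["Region A", "Region B", "Region A"],
        "IncomeGroup": ["High income", None, "Low income"],
    }
)

# Number of missing values in each column, before filling
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Same placeholder string as used in the notebook
filled_df = sample_df.fillna("(Not Specified)")
print(filled_df)
```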

In [32]:
# Check that no NaN values now remain in the dataset

# Boolean mask: True for each row that contains at least one NaN (blank) value
has_blank = validated_df.isna().any(axis=1)

# Count the rows with at least one blank (NaN) value
blank_rows_count = has_blank.sum()

# Display the total count
print(f"Total number of rows with at least one blank (NaN) value: {blank_rows_count}")

# Select the rows with NaN (blank) values
blank_rows = validated_df[has_blank]

# Display the rows with blank values
print("Rows with Blank (NaN) Values:")
print(blank_rows)


Total number of rows with at least one blank (NaN) value: 0
Rows with Blank (NaN) Values:
Empty DataFrame
Columns: [Country Code, Region, IncomeGroup]
Index: []
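
The check above prints its result, which relies on someone reading the output. If the notebook is run unattended, one option is to fail loudly instead. The helper below, `assert_no_missing`, is a hypothetical name and only a sketch of that idea.

```python
import pandas as pd


def assert_no_missing(df):
    """Raise ValueError if df contains any NaN values (hypothetical helper)."""
    missing_per_column = df.isna().sum()
    columns_with_missing = missing_per_column[missing_per_column > 0]
    if not columns_with_missing.empty:
        raise ValueError(f"Missing values remain: {columns_with_missing.to_dict()}")


# Stand-in frame with no missing values
clean_df = pd.DataFrame(
    {
        "Country Code": ["ABW"],
        "Region": ["Latin America & Caribbean"],
        "IncomeGroup": ["High income"],
    }
)

assert_no_missing(clean_df)  # returns without raising for a clean frame
```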


In [33]:
# Write the validated dataset to a new CSV file (without the index column)
validated_df.to_csv('./data/Cleansed_Metadata_Country_POP_GROW.csv', encoding='utf8', index=False)
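
As a final sanity check, the written file can be read back to confirm that the placeholder survives the round trip. The sketch below writes a small stand-in frame with invented rows to a temporary directory rather than touching the `data/` folder. The string `(Not Specified)` is not among the values that `pd.read_csv` converts to NaN by default, so it should come back as ordinary text.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows for illustration; the codes and values are invented
sample_df = pd.DataFrame(
    {
        "Country Code": ["XX1", "XX2"],
        "Region": ["Region A", "Region B"],
        "IncomeGroup": ["High income", "(Not Specified)"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleansed_sample.csv")
    # Same options as the notebook's to_csv call
    sample_df.to_csv(path, encoding="utf8", index=False)
    reloaded_df = pd.read_csv(path)

print(list(reloaded_df.columns))
print(f"Missing values after reload: {reloaded_df.isna().sum().sum()}")
```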