## Demographics - ETL Process

This notebook is organized in the following sections:
* [Step 0 - Preliminary: Viewing the data](#0)
* [Step 1 - Checking for duplicates](#1)
* [Step 2 - Checking for missing values](#2)
* [Step 3 - Imputing/dropping missing values](#3)
* [Step 4 - Ensuring correct datatypes](#4)


<a id='0'></a>
## Step 0 - Preliminary: Viewing the data

In [1]:
import pandas as pd

In [2]:
demographics = pd.read_csv('data/demographics.zip') #KEEP FOR SCRIPT -- DEMOGRAPHICS 1

In [None]:
#Viewing the data
demographics.head()

In [None]:
demographics.shape

The demographics dataset has 5097 rows and 19 columns.

<a id='1'></a>
## Step 1 - Checking for duplicates

In [None]:
#Checking for duplicates
demographics.duplicated().any()

There are no duplicate rows in the dataset, so we can move on to checking for missing values.
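As a minimal sketch of what this check covers, the toy frame below (made-up values, not the real data) shows how `duplicated()` flags repeated rows, and how checking the key column on its own is a stricter test than full-row duplication. The column names are assumed for illustration only.

```python
import pandas as pd

# Toy frame standing in for `demographics`; the values are made up for illustration.
toy = pd.DataFrame({
    "location_key": ["AD", "AE", "AE", "AF"],
    "population": [100.0, 200.0, 200.0, 300.0],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = int(toy.duplicated().sum())

# Checking the key column alone also catches rows that share a key but differ elsewhere.
key_is_unique = toy["location_key"].is_unique

# If duplicates were present, they could be removed like this.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, key_is_unique, len(deduplicated))
```

On the real dataset, `demographics.duplicated().any()` returning `False` corresponds to `n_duplicates` being zero.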

<a id='2'></a>
## Step 2 - Checking for missing values

The cell below shows the number of missing values per column.

In [None]:
demographics.isna().sum()

The cell below shows the proportion of missing values in each column, relative to the number of rows.
Any column that is entirely null will be dropped.

In [None]:
demographics.isna().sum() / len(demographics)

There are missing values, so we will proceed to either drop or impute them.
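The proportions above can also be turned into explicit lists of candidate columns. The sketch below uses a toy frame with made-up values; `empty_column` is a hypothetical fully-null column, and the 0.9 threshold is only an illustrative cut-off.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `demographics`; the values are made up for illustration.
toy = pd.DataFrame({
    "location_key": ["AD", "AE", "AF"],
    "population": [100.0, np.nan, 300.0],
    "empty_column": [np.nan, np.nan, np.nan],  # hypothetical fully-null column
})

# The mean of a boolean mask is the share of True values,
# so this matches isna().sum() / len(toy).
missing_share = toy.isna().mean()

# Columns in which every value is missing.
fully_null = toy.columns[toy.isna().all()].tolist()

# Columns above an illustrative 90% missing threshold.
mostly_null = missing_share[missing_share > 0.9].index.tolist()

print(fully_null, mostly_null)
```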

<a id='3'></a>
## Step 3 - Imputing/dropping missing values

We drop all columns that are completely null.

In [3]:
demographics = demographics.dropna(axis = 'columns', how = 'all')  #KEEP FOR SCRIPT -- DEMOGRAPHICS 2

The checks below confirm that these columns were dropped correctly.

In [None]:
#Check 1
demographics.head()

In [None]:
#Check 2
demographics.isna().sum() / len(demographics)

We also drop the population density column, for two reasons: over 90% of its values are missing, and it cannot be meaningfully aggregated at the country level. Density is a ratio rather than an absolute population count, so, unlike the other population columns, its values cannot simply be summed across regions.

In [4]:
demographics = demographics.drop(columns = ['population_density'])  #KEEP FOR SCRIPT -- DEMOGRAPHICS 3

In [None]:
#Check 1
demographics.head()

In [None]:
#Check 2
demographics.isna().sum() / len(demographics)
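To illustrate why a density column does not aggregate the way absolute counts do, here is a small worked example with two hypothetical regions of one country. All numbers are made up, and the `area_km2` field is an assumption introduced only for the arithmetic.

```python
# Two hypothetical regions of one country (made-up numbers).
regions = [
    {"population": 1_000_000, "area_km2": 100.0},   # dense city region
    {"population": 10_000, "area_km2": 10_000.0},   # sparse rural region
]

# Per-region density in people per square kilometre.
densities = [r["population"] / r["area_km2"] for r in regions]

# Summing (or averaging) the densities does not give the country's density.
sum_of_densities = sum(densities)
mean_of_densities = sum_of_densities / len(densities)

# The country-level density has to be rebuilt from the absolute totals.
total_population = sum(r["population"] for r in regions)
total_area = sum(r["area_km2"] for r in regions)
country_density = total_population / total_area

print(sum_of_densities, mean_of_densities, country_density)
```

Absolute population counts, by contrast, can be summed directly, which is why the other population columns are kept.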

We will clean the remaining columns that contain null values once all the datasets have been merged. These are absolute population variables that can be aggregated at the country level, so it does not matter if we leave them uncleaned for now: once we group by country name in the macrotable and aggregate with a sum, the null values are skipped and no longer appear in the result.
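The behaviour relied on here can be sketched on a toy frame. The column name `country_name` and the values are assumptions for illustration. One caveat worth noting: with pandas' default settings, a group whose values are all missing sums to 0 rather than to null, and `min_count=1` can be passed if such groups should stay null instead.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names and values are made up for illustration.
toy = pd.DataFrame({
    "country_name": ["Spain", "Spain", "Andorra", "Andorra"],
    "population": [100.0, np.nan, np.nan, np.nan],
})

# Default behaviour: NaN values are skipped, and an all-NaN group sums to 0.
default_sum = toy.groupby("country_name")["population"].sum()

# With min_count=1, a group needs at least one non-null value, otherwise it stays NaN.
strict_sum = toy.groupby("country_name")["population"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```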

<a id='4'></a>
## Step 4 - Ensuring correct datatypes

In [None]:
#Ensuring the datatypes we have are correct
demographics.info()

The datatypes of all the columns appear to be correct.
All columns are of float datatype except for location_key, which is an object (string) column, as expected.
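The visual check of `info()` could also be expressed as a small programmatic check. This is a sketch on a toy frame with made-up values, assuming that `location_key` is the only column expected not to be a float.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Toy frame standing in for `demographics`; the values are made up for illustration.
toy = pd.DataFrame({
    "location_key": ["AD", "AE"],
    "population": [100.0, 200.0],
    "population_male": [48.0, 110.0],
})

# Collect every column that is not stored as a float.
non_float = [col for col in toy.columns if not is_float_dtype(toy[col])]

# Only the key column is expected in this list.
dtypes_ok = non_float == ["location_key"]

print(non_float, dtypes_ok)
```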