# Health - ETL Process

This notebook is organized in the following sections:
* [Step 0 - Preliminary: Viewing the data](#0)
* [Step 1 - Checking for duplicates](#1)
* [Step 2 - Checking for missing values](#2)
* [Step 3 - Imputing/dropping missing values](#3)
* [Step 4 - Ensuring correct datatypes](#4)


<a id='0'></a>
## Step 0 - Preliminary: Viewing the data

In [19]:
import pandas as pd

In [20]:
health = pd.read_csv('data/health.zip') #KEEP FOR SCRIPT - HEALTH 1

In [None]:
health.head()

In [None]:
health.shape

The health dataset has 3022 rows and 14 columns.

<a id='1'></a>
## Step 1 - Checking for duplicates

In [None]:
#Checking for duplicates
health.duplicated().any() 

There are no duplicate rows in the dataset, so we can move on to checking for missing values.
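As a minimal sketch of what the duplicate check reports, the snippet below runs the same `duplicated()` call on a small toy frame. The frame `toy` and its values are made up for illustration and are not taken from the health data.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "location_key": ["AA", "BB", "BB", "CC"],
        "life_expectancy": [70.1, 81.3, 81.3, 75.0],
    }
)

# duplicated() flags each row that repeats an earlier row;
# the first occurrence itself is not flagged.
dup_mask = toy.duplicated()
has_duplicates = bool(dup_mask.any())
n_duplicates = int(dup_mask.sum())

# If duplicates had been found, drop_duplicates() would keep
# the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Since `any()` only answers yes or no, summing the mask is a convenient way to see how many rows would be affected before deciding to drop them.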

<a id='2'></a>
## Step 2 - Checking for missing values

The cell below shows the number of missing values per column.

In [None]:
health.isna().sum()

The cell below shows the proportion of missing values with respect to the total number of values per column.

In [None]:
health.isna().sum() / len(health)

There are missing values, so we will proceed to either drop or impute them.
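The two checks above can also be sketched on a toy frame to show how fully null columns might be singled out. The frame `toy` and the column `empty_col` are hypothetical; `isna().mean()` is used here as an equivalent of `isna().sum() / len(...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; empty_col stands in for a fully null column.
toy = pd.DataFrame(
    {
        "location_key": ["AA", "BB", "CC", "DD"],
        "life_expectancy": [70.1, np.nan, 75.0, 81.3],
        "empty_col": [np.nan, np.nan, np.nan, np.nan],
    }
)

# Number of missing values per column.
missing_count = toy.isna().sum()

# Fraction of missing values per column; the mean of a boolean mask
# equals the count of True values divided by the number of rows.
missing_share = toy.isna().mean()

# Columns where every value is missing.
fully_null = missing_share[missing_share == 1.0].index.tolist()
```

Listing the fully null columns explicitly makes it easier to tell columns that can simply be dropped from columns that are only partially missing and may need imputation.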

<a id='3'></a>
## Step 3 - Imputing/dropping missing values

Every column except location_key and life_expectancy is entirely null. These columns add no information to the dataset, so they can be dropped.

In [21]:
health = health.dropna(axis = 'columns', how = 'all') #KEEP FOR SCRIPT - HEALTH 2

In [None]:
#Check 1
health.head()

In [None]:
#Check 2
health.isna().sum() / len(health)

No imputation is needed for the health dataset, as none of the remaining columns have missing values.
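The reason Check 2 is still worth running can be sketched on a toy frame: `dropna(axis='columns', how='all')` removes only columns in which every value is missing, so a partially missing column would survive. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete column, one partially missing
# column, and one fully null column.
toy = pd.DataFrame(
    {
        "location_key": ["AA", "BB", "CC"],
        "life_expectancy": [70.1, np.nan, 75.0],
        "empty_col": [np.nan, np.nan, np.nan],
    }
)

# how="all" drops a column only if all of its values are missing.
cleaned = toy.dropna(axis="columns", how="all")

# The partially missing column is kept, so its gaps are still there.
remaining_missing = cleaned.isna().sum()
```

In this toy case `life_expectancy` keeps its one missing value after the drop, which is the situation the proportion check is meant to catch.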

<a id='4'></a>
## Step 4 - Ensuring correct datatypes

In [None]:
#Ensuring the datatypes we have are correct
health.info()

The datatypes of the columns appear to be correct.
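If a column had come through with the wrong type, one possible fix is sketched below. It assumes that life_expectancy is expected to be numeric, which the notebook does not state explicitly, and it uses a hypothetical toy frame in which the numbers were read as text.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data: life_expectancy stored as text
# (assumption: it should be numeric).
toy = pd.DataFrame(
    {
        "location_key": ["AA", "BB"],
        "life_expectancy": ["70.1", "81.3"],
    }
)

# Programmatic version of reading the info() output by eye.
numeric_before = is_numeric_dtype(toy["life_expectancy"])

# to_numeric with errors="raise" fails loudly on values that cannot
# be parsed instead of silently turning them into NaN.
toy["life_expectancy"] = pd.to_numeric(toy["life_expectancy"], errors="raise")

numeric_after = is_numeric_dtype(toy["life_expectancy"])
```

A check like `is_numeric_dtype` can also be turned into an assertion in a script, so that a change in the source file's types is noticed before the merge.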

Having gone through all the data quality check steps and performed the appropriate transformations, the health dataset is now ready to be merged!