# Index - ETL Process

This notebook is organized in the following sections:
* [Step 0 - Preliminary: Viewing the data](#0)
* [Step 1 - Checking for duplicates](#1)
* [Step 2 - Checking for missing values](#2)
* [Step 3 - Imputing/dropping missing values](#3)
* [Step 4 - Ensuring correct datatypes](#4)


<a id='0'></a>
## Step 0 - Preliminary: Viewing the data

In [34]:
import pandas as pd

In [35]:
index = pd.read_csv('data/index.zip') #KEEP FOR SCRIPT - INDEX 1

In [None]:
index.head()

In [None]:
index.shape

The index dataset has 5121 rows and 15 columns.

<a id='1'></a>
## Step 1 - Checking for duplicates

In [None]:
#Checking for duplicates
index.duplicated().any() 

There are no duplicate rows in the dataset, so we can move on to checking for missing values.
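
`duplicated().any()` only says whether at least one fully repeated row exists. As a minimal sketch on made-up data (the values below are not from the index dataset), the same method can also count repeated rows, or look for repeats in a single column:

```python
import pandas as pd

# Toy frame with made-up values standing in for the index dataset.
toy = pd.DataFrame({
    "place_id": ["p1", "p2", "p2", "p3"],
    "aggregation_level": [0, 1, 1, 1],
})

# Rows that repeat an earlier row in every column.
n_full_duplicates = int(toy.duplicated().sum())

# Rows that repeat an earlier value of one column only.
n_place_id_duplicates = int(toy.duplicated(subset=["place_id"]).sum())

# drop_duplicates keeps the first occurrence of each repeated row by default.
deduplicated = toy.drop_duplicates()
```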

<a id='2'></a>
## Step 2 - Checking for missing values

The cell below shows the number of missing values per column.

In [None]:
index.isna().sum()

The cell below shows the proportion of missing values with respect to the total number of values per column.

In [None]:
index.isna().sum() / len(index)

There are missing values, so we will proceed to either drop or impute them.
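
The counts and proportions above can also be read side by side in one table. This is a sketch on made-up data, not the actual index values; note that `isna().mean()` gives the same result as `isna().sum() / len(df)`:

```python
import pandas as pd

# Toy frame with made-up values; only the column names echo the index dataset.
toy = pd.DataFrame({
    "place_id": ["a", None, "c", "d"],
    "locality_name": [None, None, None, "x"],
    "aggregation_level": [0, 1, 2, 3],
})

# One row per column: absolute count and share of missing values,
# sorted so the columns with the most gaps come first.
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("share_missing", ascending=False)
```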

<a id='3'></a>
## Step 3 - Imputing/dropping missing values

Where locality is missing, subregion2 is not, and vice versa, so we cannot assume that the two are equivalent.
We drop both the locality and subregion2 columns, as they add nothing to a country-level analysis (and we already have subregion1).
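
One way to inspect how the missing values of the two fields overlap is to cross-tabulate their missingness. The sketch below uses made-up values, so the counts say nothing about the real dataset:

```python
import pandas as pd

# Toy frame with made-up values for the two columns being compared.
toy = pd.DataFrame({
    "locality_code": ["L1", None, None, "L2"],
    "subregion2_code": [None, "S1", None, None],
})

# Label each cell as "missing" or "present" so the table is easy to read.
labels = {True: "missing", False: "present"}
locality_state = toy["locality_code"].isna().map(labels)
subregion2_state = toy["subregion2_code"].isna().map(labels)

# Rows: state of locality_code; columns: state of subregion2_code.
overlap = pd.crosstab(locality_state, subregion2_state)

# Rows where both fields are filled in at the same time.
both_present = int(
    (toy["locality_code"].notna() & toy["subregion2_code"].notna()).sum()
)
```

If `both_present` is zero, the two fields are never filled in on the same row and cannot be compared directly.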

In [36]:
index = index.drop(columns = ['locality_code', 'locality_name']) 
index = index.drop(columns = ['subregion2_code', 'subregion2_name']) #KEEP FOR SCRIPT - INDEX 2

In [None]:
#Check 1
index.head()

In [None]:
#Check 2
index.isna().sum() / len(index)

The columns with null values that we deal with next are place_id, wikidata_id and datacommons_id.

Exploring the data in these columns in more detail:

In [None]:
#place_id: number of unique values
index.place_id.nunique()

In [None]:
#place_id: number of non null values
index.place_id.count()

In [13]:
#All place_id values except for 1 are unique

In [None]:
#wikidata_id: number of unique values
index.wikidata_id.nunique()

In [None]:
#wikidata_id: number of non null values
index.wikidata_id.count()

In [None]:
#All wikidata_id values except for 1 are unique

In [None]:
#datacommons_id: number of unique values
index.datacommons_id.nunique()

In [None]:
#datacommons_id: number of non null values
index.datacommons_id.count()

In [None]:
#All datacommons_id values are unique
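
The six cells above compare `nunique()` with `count()` one column at a time. As a sketch on made-up data, the same comparison can be collected in a single table, where the difference between the two numbers is the number of repeated non-null values:

```python
import pandas as pd

# Toy frame with made-up identifiers; the real values will differ.
toy = pd.DataFrame({
    "place_id": ["p1", "p2", "p2", None],
    "wikidata_id": ["Q1", "Q2", None, None],
    "datacommons_id": ["d1", "d2", "d3", "d4"],
})

id_columns = ["place_id", "wikidata_id", "datacommons_id"]

# nunique() and count() both ignore nulls, so the two are directly comparable.
uniqueness = pd.DataFrame({
    "n_unique": toy[id_columns].nunique(),
    "n_non_null": toy[id_columns].count(),
})
uniqueness["n_repeated"] = uniqueness["n_non_null"] - uniqueness["n_unique"]
```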

Given the objective of the assignment (building a weekly COVID-19 death predictor by country), the characteristics of these three columns, and the limited value they could provide, we have decided to drop them entirely. Dropping the rows with null values would not make sense, as we would be discarding perfectly good country-level data (country-level data is completely non-null and is necessary for our objective). Imputation also seems impractical, because nearly every value is unique, so a missing one cannot be inferred from the other rows.

In [37]:
index = index.drop(columns = ['place_id', 'wikidata_id', 'datacommons_id']) #KEEP FOR SCRIPT - INDEX 3

In [None]:
#Check 1
index.head()

In [None]:
#Check 2
index.isna().sum() / len(index)

The columns iso_3166_1_alpha_2, iso_3166_1_alpha_3 and aggregation_level do not add any value to our analysis, so we drop them as well.

In [38]:
index = index.drop(columns = ['iso_3166_1_alpha_2', 'iso_3166_1_alpha_3', 'aggregation_level']) #KEEP FOR SCRIPT - INDEX 4

In [None]:
#Check 1
index.head()

In [None]:
#Check 2
index.isna().sum() / len(index)

Now that the dataset contains no null values, we proceed to check that the datatype of each column is correct.

<a id='4'></a>
## Step 4 - Ensuring correct datatypes

In [None]:
index.info()

The datatypes of all the columns appear to be correct.
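
Reading `info()` is a visual check. If the check needs to be repeatable in a script, one option is to cast to an explicit schema and list any column that does not match. This is only a sketch: the column names and the expected types below are assumptions for illustration, not the actual output of `index.info()`:

```python
import pandas as pd

# Toy frame; the column names are hypothetical stand-ins for the remaining columns.
toy = pd.DataFrame({
    "country_code": ["ES", "FR"],
    "country_name": ["Spain", "France"],
})

# Assumed schema: every remaining column should hold text.
expected_dtypes = {"country_code": "string", "country_name": "string"}

# Cast explicitly, then list any column whose dtype differs from the schema.
typed = toy.astype(expected_dtypes)
mismatched = [
    column
    for column, dtype in expected_dtypes.items()
    if typed[column].dtype != dtype
]
```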

Having gone through all the data quality check steps and performed the appropriate transformations, the index dataset is now ready to be merged!
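
The cells marked `KEEP FOR SCRIPT` could be gathered into one function for the final script. The following is a sketch under assumptions: the function name `clean_index` and the toy input are hypothetical, while the dropped columns are the ones removed in Step 3:

```python
import pandas as pd

# Columns removed in Step 3 of this notebook.
COLUMNS_TO_DROP = [
    "locality_code", "locality_name",
    "subregion2_code", "subregion2_name",
    "place_id", "wikidata_id", "datacommons_id",
    "iso_3166_1_alpha_2", "iso_3166_1_alpha_3", "aggregation_level",
]


def clean_index(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the raw index frame without the Step 3 columns."""
    return raw.drop(columns=COLUMNS_TO_DROP)


# Toy input: the dropped columns plus one hypothetical column that is kept.
toy = pd.DataFrame({column: [None, None] for column in COLUMNS_TO_DROP})
toy["country_name"] = ["Spain", "France"]

cleaned = clean_index(toy)
```

In the script this would be called as `index = clean_index(pd.read_csv('data/index.zip'))`. Because `drop` returns a new frame, the raw data stays untouched.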