# Vaccinations - ETL Process

This notebook is organized in the following sections:
* [Step 0 - Preliminary: Viewing the data](#0)
* [Step 1 - Checking for duplicates](#1)
* [Step 2 - Checking for missing values](#2)
* [Step 3 - Imputing/dropping missing values](#3)
* [Step 4 - Ensuring correct datatypes](#4)
* [Step 5 - Preparation for merging](#5)


<a id='0'></a>
## Step 0 - Preliminary: Viewing the data

In [78]:
import pandas as pd

In [79]:
vaccinations = pd.read_csv('data/vaccinations.zip') #KEEP FOR SCRIPT - VACCINATIONS 1

In [None]:
vaccinations.head()

In [None]:
vaccinations.shape

The vaccinations dataset has 1,562,414 rows and 32 columns.

<a id='1'></a>
## Step 1 - Checking for duplicates

In [None]:
#Checking for duplicates
vaccinations.duplicated().any()

There are no duplicate rows in the dataset, so we can move on to checking for missing values.
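
As a small illustration of what this check reports, the sketch below runs the same call on a toy frame. The values are made up for the example and are not taken from the vaccinations data.

```python
import pandas as pd

# Toy frame (hypothetical values) with one fully duplicated row
toy = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "location_key": ["ES", "ES", "ES"],
    "new_persons_fully_vaccinated": [10, 10, 12],
})

has_duplicates = toy.duplicated().any()   # True if any row repeats an earlier one
n_duplicates = toy.duplicated().sum()     # number of repeated rows
deduplicated = toy.drop_duplicates()      # keeps the first occurrence of each row
```
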

<a id='2'></a>
## Step 2 - Checking for missing values

The cell below shows the number of missing values per column.

In [None]:
vaccinations.isna().sum()

The cell below shows the proportion of missing values with respect to the total number of values per column.

In [None]:
vaccinations.isna().sum() / len(vaccinations)

Because the dataset has many columns, the output is truncated. In the next step we therefore drop the columns that are entirely null (there are many) and then repeat this check.
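
One possible way to read the full list without truncation is sketched below on a toy frame. The column `empty_column` and all values are hypothetical; `isna().mean()` is simply a shorter way of writing the proportion computed above.

```python
import pandas as pd

# Toy frame (hypothetical values): one partly missing and one fully missing column
toy = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"],
    "new_persons_fully_vaccinated": [None, 5.0, 7.0, 9.0],
    "empty_column": [None, None, None, None],
})

# Same result as toy.isna().sum() / len(toy), sorted so the worst columns come first
missing_share = toy.isna().mean().sort_values(ascending=False)

# to_string() renders every row, so a long Series is not abbreviated
print(missing_share.to_string())

# Names of the columns that contain no values at all
fully_null = missing_share[missing_share == 1.0].index.tolist()
```
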

<a id='3'></a>
## Step 3 - Imputing/dropping missing values

In [80]:
vaccinations = vaccinations.dropna(axis = 'columns', how = 'all') #KEEP FOR SCRIPT - VACCINATIONS 2
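
To make the effect of `how = 'all'` concrete, here is a minimal sketch on a toy frame (hypothetical column `unused_metric`, made-up values): only columns in which every value is missing are removed, while partly missing columns are kept.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "location_key": ["ES", "FR"],
    "new_persons_fully_vaccinated": [np.nan, 5.0],   # partly missing: kept
    "unused_metric": [np.nan, np.nan],               # fully missing: dropped
})

cleaned = toy.dropna(axis="columns", how="all")
remaining_columns = list(cleaned.columns)
```
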

In [None]:
#Check 1
vaccinations.head()

In [None]:
#Check 2
vaccinations.isna().sum()

All missing values in the new_persons_fully_vaccinated column fall on the first day of vaccinations (or the day before it) within the time period of the dataset. These values must therefore be 0, so we impute them with 0.

In [81]:
vaccinations = vaccinations.fillna(0) #KEEP FOR SCRIPT - VACCINATIONS 3
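
Note that `fillna(0)` without further arguments replaces missing values in every column, not only in new_persons_fully_vaccinated. If only selected columns should be imputed, a dictionary can be passed instead. The sketch below shows the difference on a toy frame; `other_metric` is a hypothetical column and the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "new_persons_fully_vaccinated": [np.nan, 5.0, 7.0],
    "other_metric": [np.nan, 1.0, 2.0],   # hypothetical column
})

# Fills missing values in all columns
filled_everywhere = toy.fillna(0)

# Fills missing values only in the named column
filled_selected = toy.fillna({"new_persons_fully_vaccinated": 0})
```
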

In [None]:
#Check 1
vaccinations.head()

In [None]:
#Check 2
vaccinations.isna().sum()

Now that the dataset contains no null values, we proceed to ensure that the datatype of each column is correct.

<a id='4'></a>
## Step 4 - Ensuring correct datatypes

In [None]:
vaccinations.info()

From this output, we can infer that the only column whose datatype needs to change is date: from object to datetime64[ns].

In [82]:
vaccinations['date'] = pd.to_datetime(vaccinations['date'], format="%Y-%m-%d") #KEEP FOR SCRIPT - VACCINATIONS 4
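
A minimal sketch of this conversion on made-up strings follows. With an explicit `format`, a value that does not match raises an error by default; passing `errors="coerce"` is one option for turning such values into NaT so they can be inspected afterwards.

```python
import pandas as pd

# Well-formed ISO dates parse directly
dates = pd.Series(["2021-01-04", "2021-01-05"])
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# errors="coerce" turns unparseable strings into NaT instead of raising
messy = pd.Series(["2021-01-04", "not a date"])
coerced = pd.to_datetime(messy, format="%Y-%m-%d", errors="coerce")
n_unparsed = int(coerced.isna().sum())
```
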

In [None]:
#Check 1
vaccinations.head()

In [None]:
#Check 2
vaccinations.info()

With all data quality checks done and the appropriate transformations applied, the vaccinations dataset is now ready to be prepared for merging.

<a id='5'></a>
## Step 5 - Preparation for merging

Converting the date column from daily to weekly granularity:

In [83]:
vaccinations['date'] = vaccinations['date'].dt.to_period("W") #KEEP FOR SCRIPT - VACCINATIONS 5
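
The sketch below shows what the weekly period looks like for a few made-up dates. In pandas, "W" is shorthand for "W-SUN", so each period runs from Monday to Sunday.

```python
import pandas as pd

# 2021-01-04 is a Monday, 2021-01-10 the following Sunday, 2021-01-11 the next Monday
days = pd.Series(pd.to_datetime(["2021-01-04", "2021-01-10", "2021-01-11"]))
weeks = days.dt.to_period("W")   # "W" is an alias for "W-SUN": weeks end on Sunday

same_week = weeks.iloc[0] == weeks.iloc[1]   # Monday and Sunday share one period
next_week = weeks.iloc[2]                    # the following Monday starts a new one
week_start = weeks.iloc[0].start_time        # first timestamp covered by the period
```
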

We group by date (week) and location_key, as these are the indices we want, and aggregate each group with a sum.

In [72]:
vaccinations = vaccinations.groupby(['date', 'location_key'])[['new_persons_fully_vaccinated', 'cumulative_persons_fully_vaccinated']].sum()
#KEEP FOR SCRIPT - VACCINATIONS 6
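
One caveat, offered as an observation rather than a change to the pipeline: summing cumulative_persons_fully_vaccinated over a week adds up the daily running totals, which is not the same as the running total at the end of the week. If the end-of-week total is what the merged dataset needs, one possible alternative is a per-column aggregation. The sketch below uses made-up values and assumes the cumulative column never decreases, so that `max` picks the end-of-week value.

```python
import pandas as pd

# Three days in the same week for one location (hypothetical values)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]).to_period("W"),
    "location_key": ["ES", "ES", "ES"],
    "new_persons_fully_vaccinated": [10, 20, 30],
    "cumulative_persons_fully_vaccinated": [10, 30, 60],
})
keys = ["date", "location_key"]

# Sum for both columns: the cumulative column becomes 10 + 30 + 60
summed = toy.groupby(keys)[["new_persons_fully_vaccinated",
                            "cumulative_persons_fully_vaccinated"]].sum()

# Per-column aggregation: sum the daily counts, keep the largest running total
per_column = toy.groupby(keys).agg(
    new_persons_fully_vaccinated=("new_persons_fully_vaccinated", "sum"),
    cumulative_persons_fully_vaccinated=("cumulative_persons_fully_vaccinated", "max"),
)
```
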

Resetting the index so that date and location_key become regular columns again:

In [74]:
vaccinations = vaccinations.reset_index()
#KEEP FOR SCRIPT - VACCINATIONS 7
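
A minimal sketch of what `reset_index()` does after the group by, on made-up values: the two grouping keys move from the index back into ordinary columns, which is the shape a later merge on those keys would typically expect.

```python
import pandas as pd

# Two days in the same week for one location (hypothetical values)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-05"]).to_period("W"),
    "location_key": ["ES", "ES"],
    "new_persons_fully_vaccinated": [10, 20],
})

grouped = toy.groupby(["date", "location_key"])[["new_persons_fully_vaccinated"]].sum()
index_names = list(grouped.index.names)   # the keys live in the index after groupby

flat = grouped.reset_index()              # the keys become regular columns
flat_columns = list(flat.columns)
```
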