# Epidemiology - ETL Process

This notebook is organized in the following sections:
* [Step 0 - Preliminary: Viewing the data](#0)
* [Step 1 - Checking for duplicates](#1)
* [Step 2 - Checking for missing values](#2)
* [Step 3 - Imputing/dropping missing values](#3)
* [Step 4 - Ensuring correct datatypes](#4)
* [Step 5 - Preparation for merging](#5)


<a id='0'></a>
## Step 0 - Preliminary: Viewing the data

In [1]:
import pandas as pd

In [2]:
epidemiology = pd.read_csv('data/epidemiology.zip') #KEEP FOR SCRIPT -- EPIDEMIOLOGY 1

In [None]:
#Viewing the data
epidemiology.head()

In [None]:
epidemiology.shape

The epidemiology dataset has 3,161,033 rows and 10 columns.

<a id='1'></a>
## Step 1 - Checking for duplicates

In [None]:
#Checking for duplicates
epidemiology.duplicated().any() 

There are no fully duplicated rows in the dataset, so we can move on to checking for missing values.
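
The check above only catches rows that are identical in every column. Since Step 5 groups the data by date and location_key, it may also be worth checking whether that pair repeats. The sketch below illustrates the difference on a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Illustrative data only: two rows share the same key but differ in value
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02"],
    "location_key": ["ES", "ES", "ES"],
    "new_confirmed": [5.0, 6.0, 7.0],
})

# Row-level check: no two rows are identical in every column
full_row_duplicates = toy.duplicated().any()

# Key-level check: the same (location_key, date) pair appears twice
key_duplicates = toy.duplicated(subset=["location_key", "date"]).any()

print(full_row_duplicates, key_duplicates)
```

On the real data, the same `subset` argument could be passed to `epidemiology.duplicated(...)`.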

<a id='2'></a>
## Step 2 - Checking for missing values

The cell below shows the number of missing values per column.

In [None]:
epidemiology.isna().sum()

The cell below shows the proportion of missing values with respect to the total number of values per column.

In [None]:
epidemiology.isna().sum() / len(epidemiology)
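
The two views above can also be read side by side. This is a minimal sketch on made-up data; the `toy` frame and the `missing_summary` name are assumptions for illustration. It relies on the fact that the mean of a boolean mask equals the count of `True` values divided by the number of rows:

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "new_confirmed": [1.0, None, 3.0, 4.0],
    "new_tested": [None, None, None, 10.0],
})

# One table with the count and the share of missing values per column,
# sorted so the emptiest columns come first
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_share": toy.isna().mean(),
}).sort_values("missing_share", ascending=False)

print(missing_summary)
```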

There are missing values, so we will proceed to either drop or impute them.

<a id='3'></a>
## Step 3 - Imputing/dropping missing values

We drop the new_tested and cumulative_tested columns because both are almost entirely null.
They also cannot be calculated or imputed: that would require a column with the confirmed negative COVID-19 tests, which this dataset does not have.

In [81]:
epidemiology = epidemiology.drop(columns = ['new_tested', 'cumulative_tested']) #KEEP FOR SCRIPT -- EPIDEMIOLOGY 2

As the recovered columns are over 90% null, we also drop them.

In [82]:
epidemiology = epidemiology.drop(columns = ['new_recovered', 'cumulative_recovered']) #KEEP FOR SCRIPT -- EPIDEMIOLOGY 3
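
The columns above are dropped by name. As an alternative, the same decision could be derived from the share of nulls. The sketch below is only an illustration on made-up data: the `toy` frame is hypothetical, and the `null_threshold` of 0.9 is an assumption chosen to mirror the "over 90% null" reasoning above.

```python
import pandas as pd

# Illustrative data only: 'new_recovered' is null in 19 of 20 rows
n = 20
toy = pd.DataFrame({
    "new_confirmed": [float(i) for i in range(n)],
    "new_recovered": [None] * (n - 1) + [3.0],
})

null_threshold = 0.9  # assumed cut-off, not a value fixed by the notebook

# Share of nulls per column, then the columns above the threshold
null_share = toy.isna().mean()
mostly_null = null_share[null_share > null_threshold].index.tolist()

toy = toy.drop(columns=mostly_null)
print(mostly_null, list(toy.columns))
```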

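Imputation: we fill the remaining gaps in the confirmed and deceased columns. Missing new_confirmed values are derived from consecutive cumulative_confirmed values, and missing new_deceased values are estimated from the ratio of deaths to confirmed cases for the same region and date.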

In [None]:
# Step 1: Sort by 'location_key' and 'date' so consecutive rows are consecutive days of the same location
epidemiology = epidemiology.sort_values(by=['location_key', 'date']).reset_index(drop=True)

# Step 2: Replace nulls in 'new_confirmed' with the day-over-day change in 'cumulative_confirmed'
# (diff() is NaN on the first row of each location, which Step 3 handles)
cumulative_change = epidemiology.groupby('location_key')['cumulative_confirmed'].diff()
epidemiology['new_confirmed'] = epidemiology['new_confirmed'].fillna(cumulative_change)

# Step 3: On each location's earliest date, fall back to 'cumulative_confirmed' itself
min_date_indices = epidemiology.groupby('location_key')['date'].idxmin()
min_date_rows = epidemiology.loc[min_date_indices].copy()
min_date_rows.loc[min_date_rows['new_confirmed'].isnull(), 'new_confirmed'] = min_date_rows['cumulative_confirmed']
epidemiology.update(min_date_rows)

# Step 4: Forward-fill 'cumulative_deceased', leaving all-null groups as NaN
epidemiology['cumulative_deceased'] = epidemiology.groupby('location_key')['cumulative_deceased'].transform(
    lambda group: group.ffill() if group.notna().any() else group
)

# Step 5: Calculate 'new_deceased', leaving all-null groups as NaN
epidemiology['new_deceased'] = epidemiology.groupby('location_key')['cumulative_deceased'].transform(
    lambda group: group.diff() if group.notna().any() else group
)

# Step 6: Clip negative values
epidemiology['new_deceased'] = epidemiology['new_deceased'].clip(lower=0)
epidemiology['new_confirmed'] = epidemiology['new_confirmed'].clip(lower=0)

# Step 7: Extract region code
epidemiology['region'] = epidemiology['location_key'].str[3:5]

# Step 8: Calculate 'x' (total new_deceased divided by total new_confirmed) per region and date
region_date_avg = epidemiology.groupby(['region', 'date']).apply(
    lambda group: group['new_deceased'].sum() / group['new_confirmed'].sum()
    if group['new_confirmed'].sum() > 0 else 0
).rename('x').reset_index()

# Step 9: Merge 'x' into the original DataFrame
epidemiology = epidemiology.merge(region_date_avg, on=['region', 'date'], how='left')

# Step 10: Impute missing 'new_deceased' values using 'x * new_confirmed'
# Round up when the fractional part exceeds 0.05, otherwise round down;
# rows where the estimate itself is NaN stay NaN instead of raising an error
estimate = epidemiology['x'] * epidemiology['new_confirmed']
rounded_estimate = estimate // 1 + (estimate % 1 > 0.05).astype(int)
epidemiology['new_deceased'] = epidemiology['new_deceased'].fillna(rounded_estimate)

# Step 11: Drop temporary columns
epidemiology.drop(columns=['region', 'x'], inplace=True)

# Step 12: Recalculate 'cumulative_deceased'
epidemiology['cumulative_deceased'] = epidemiology.groupby('location_key')['new_deceased'].transform(lambda group: group.cumsum())

# Final sorting
epidemiology = epidemiology.sort_values(by=['location_key', 'date']).reset_index(drop=True)
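
To make the first part of the imputation easier to follow, here is a small sketch of the idea on made-up data: a missing new_confirmed is taken from the change in cumulative_confirmed within the same location, and the first day of each location falls back to the cumulative value itself. The `toy` frame and its values are illustrative assumptions, and the vectorized form shown here is one possible way to express the step, not necessarily the one used in the final script.

```python
import pandas as pd

# Illustrative data only: two locations with gaps in 'new_confirmed'
toy = pd.DataFrame({
    "location_key": ["AA", "AA", "AA", "BB", "BB"],
    "date": ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-01", "2020-03-02"],
    "new_confirmed": [None, 3.0, None, None, None],
    "cumulative_confirmed": [2.0, 5.0, 9.0, 1.0, 4.0],
})

# Sort so that consecutive rows are consecutive days of one location
toy = toy.sort_values(["location_key", "date"]).reset_index(drop=True)

# Day-over-day change within each location (NaN on each location's first row)
daily_change = toy.groupby("location_key")["cumulative_confirmed"].diff()
toy["new_confirmed"] = toy["new_confirmed"].fillna(daily_change)

# First row per location has no previous day: use the cumulative value
is_first = ~toy["location_key"].duplicated()
needs_fill = is_first & toy["new_confirmed"].isna()
toy.loc[needs_fill, "new_confirmed"] = toy.loc[needs_fill, "cumulative_confirmed"]

print(toy["new_confirmed"].tolist())  # [2.0, 3.0, 4.0, 1.0, 3.0]
```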

Below are some checks to confirm that the columns were dropped and the missing values imputed correctly.

In [None]:
#Check 1
epidemiology.head()

In [None]:
#Check 2
epidemiology.isna().sum() / len(epidemiology)

In [None]:
#Check 3
epidemiology.isna().sum()

<a id='4'></a>
## Step 4 - Ensuring correct datatypes

In [None]:
#Ensuring the datatypes we have are correct
epidemiology.info() 

The datatype of date is incorrect: it should be datetime64[ns], not object.

Therefore, we change the datatype of the column 'date':

In [70]:
epidemiology['date'] = pd.to_datetime(epidemiology['date'], format="%Y-%m-%d") #KEEP FOR SCRIPT -- EPIDEMIOLOGY 5
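
With an explicit `format`, `pd.to_datetime` raises an error as soon as one value does not match. If malformed dates are suspected, one option is to coerce them to `NaT` first and count them. The sketch below uses a made-up series (`raw_dates` and its values are assumptions for illustration):

```python
import pandas as pd

# Illustrative data only: the last value does not match the expected format
raw_dates = pd.Series(["2020-03-01", "2020-03-02", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Values that were present in the input but could not be parsed
unparseable = parsed.isna() & raw_dates.notna()

print(parsed.dtype, int(unparseable.sum()))
```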

Checking the datatypes of each of the columns again to be 100% sure.

In [None]:
epidemiology.info()

The rest of the columns appear to have the appropriate datatype.

In [None]:
#One final check

epidemiology.count()

<a id='5'></a>
## Step 5 - Preparation for merging

Changing the column date from day to week:

In [56]:
epidemiology['date'] =  epidemiology['date'].dt.to_period("W") #KEEP FOR SCRIPT -- EPIDEMIOLOGY 6
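
As a quick illustration of what this conversion produces, the sketch below applies `to_period("W")` to a few made-up dates. With the plain `"W"` frequency, pandas uses weeks that end on Sunday, so each date is mapped to its Monday-to-Sunday week:

```python
import pandas as pd

# Illustrative dates only: a Monday, a Wednesday, a Sunday, and the next Monday
dates = pd.to_datetime(pd.Series(["2020-03-02", "2020-03-04", "2020-03-08", "2020-03-09"]))

weeks = dates.dt.to_period("W")

print(weeks.astype(str).tolist())
# ['2020-03-02/2020-03-08', '2020-03-02/2020-03-08',
#  '2020-03-02/2020-03-08', '2020-03-09/2020-03-15']
```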

We group by date (week) and location_key, as these are the indices we want, and aggregate each group with sum.

In [61]:
epidemiology = epidemiology.groupby(['date', 'location_key'])[['new_confirmed', 'new_deceased', 'cumulative_confirmed', 'cumulative_deceased']].sum()
#KEEP FOR SCRIPT -- EPIDEMIOLOGY 7
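
Note that sum adds up the daily cumulative values within each week, so the weekly cumulative columns are no longer running totals. If the running total at the end of the week is wanted instead, one possible variant is to aggregate the cumulative columns with max. The sketch below compares both on made-up data; the `toy` frame is hypothetical, and using max assumes the cumulative values never decrease within a location.

```python
import pandas as pd

# Illustrative data only: three days of one location within the same week
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-04"]),
    "location_key": ["ES", "ES", "ES"],
    "new_confirmed": [2.0, 3.0, 4.0],
    "cumulative_confirmed": [2.0, 5.0, 9.0],
})
toy["date"] = toy["date"].dt.to_period("W")

# Same aggregation as above: every column is summed
summed = toy.groupby(["date", "location_key"])[["new_confirmed", "cumulative_confirmed"]].sum()

# Variant: sum the daily counts, keep the highest cumulative value of the week
week_end = toy.groupby(["date", "location_key"]).agg(
    new_confirmed=("new_confirmed", "sum"),
    cumulative_confirmed=("cumulative_confirmed", "max"),
)

print(summed["cumulative_confirmed"].iloc[0], week_end["cumulative_confirmed"].iloc[0])  # 16.0 9.0
```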

Formatting the index

In [63]:
epidemiology = epidemiology.reset_index() #KEEP FOR SCRIPT -- EPIDEMIOLOGY 8