## Hospitalizations - ETL Process

This notebook is organized in the following sections:
* [Step 0 - Preliminary: Viewing the data](#0)
* [Step 1 - Checking for duplicates](#1)
* [Step 2 - Checking for missing values](#2)
* [Step 3 - Imputing/dropping missing values](#3)
* [Step 4 - Ensuring correct datatypes](#4)


<a id='0'></a>
## Step 0 - Preliminary: Viewing the data

In [17]:
import pandas as pd

In [18]:
hospitalizations = pd.read_csv('data/hospitalizations.zip')

In [None]:
#Viewing the data
hospitalizations.head()

In [None]:
hospitalizations.shape

The hospitalizations dataset has 6297 rows and 11 columns.

<a id='1'></a>
## Step 1 - Checking for duplicates

In [None]:
#Checking for duplicates
hospitalizations.duplicated().any()

There are no duplicate rows in the dataset, so we can move on to checking for missing values.
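
Note that `duplicated()` with no arguments only flags rows that match an earlier row in every column. As a complementary check, the sketch below also looks for repeated (`location_key`, `date`) pairs. It runs on a small made-up frame rather than the real data, and it assumes that those two columns are meant to identify a row uniquely, which the notebook does not state.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "location_key": ["US_CA", "US_CA", "US_NY"],
    "date": ["2020-03-01", "2020-03-01", "2020-03-01"],
    "new_hospitalized_patients": [10, 12, 7],
})

# Full-row duplicates: every column must match an earlier row.
full_row_dupes = toy.duplicated().any()

# Key duplicates: assumes (location_key, date) should be unique per row.
key_dupes = toy.duplicated(subset=["location_key", "date"]).any()

print(full_row_dupes, key_dupes)
```

In this toy frame the first two rows share a key but differ in value, so only the key-based check flags them.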

<a id='2'></a>
## Step 2 - Checking for missing values

The cell below shows the number of missing values per column.

In [None]:
hospitalizations.isna().sum()

The cell below shows the proportion of missing values with respect to the total number of values per column.
If a column is fully null, it will be dropped.

In [None]:
hospitalizations.isna().sum() / len(hospitalizations)

In [None]:
hospitalizations.info()
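
The fully null columns can also be listed programmatically instead of being read off the output. The following is a minimal sketch on a made-up frame; the column names are borrowed from this dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "location_key": ["US_CA", "US_NY", "US_TX"],
    "new_hospitalized_patients": [5.0, np.nan, 3.0],
    "new_ventilator_patients": [np.nan, np.nan, np.nan],
})

# Share of missing values per column (equivalent to isna().sum() / len(df)).
null_share = toy.isna().mean()

# Columns in which every value is missing.
fully_null = null_share[null_share == 1.0].index.tolist()

print(fully_null)
```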

<a id='3'></a>
## Step 3 - Imputing/dropping missing values

The ventilator patient columns are dropped because they are fully null.

In [19]:
hospitalizations = hospitalizations.drop(columns = ['new_ventilator_patients', 'cumulative_ventilator_patients', 'current_ventilator_patients']) #-- KEEP FOR SCRIPT

In [None]:
#Check
hospitalizations.info()

Exploring intensive care patients columns:

In [None]:
intensive_df = hospitalizations[['new_intensive_care_patients', 'cumulative_intensive_care_patients', 'current_intensive_care_patients']]
intensive_df 

We also drop the intensive care patient columns, as they are almost entirely null.

In [20]:
hospitalizations = hospitalizations.drop(columns = ['new_intensive_care_patients', 'cumulative_intensive_care_patients', 'current_intensive_care_patients']) # -- KEEP for script

In [None]:
hospitalizations.info()
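
For columns that are almost, but not completely, null, the drop decision can be made explicit with a cutoff on the share of missing values. The sketch below uses a made-up frame and a hypothetical `NULL_THRESHOLD`; the notebook itself does not define a numeric cutoff, so the value here is purely an assumption.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "location_key": ["a", "b", "c", "d"],
    "current_intensive_care_patients": [np.nan, np.nan, np.nan, 2.0],
    "new_hospitalized_patients": [1.0, 2.0, np.nan, 4.0],
})

NULL_THRESHOLD = 0.7  # hypothetical cutoff, not taken from the notebook

# Share of missing values per column.
null_share = toy.isna().mean()

# Drop every column whose missing share exceeds the cutoff.
to_drop = null_share[null_share > NULL_THRESHOLD].index.tolist()
trimmed = toy.drop(columns=to_drop)

print(to_drop)
```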

Exploring hospitalized patients columns:

In [None]:
# The null proportions of two of the columns sum to 1 (0.13959 + 0.86041 = 1)
hospitalizations.isna().sum() / len(hospitalizations)
# These two columns are definitely related --> there should be a way to impute one from the other

In [None]:
patients_df = hospitalizations[['new_hospitalized_patients', 'cumulative_hospitalized_patients', 'current_hospitalized_patients']]
patients_df[patients_df.new_hospitalized_patients.notnull()]

In [None]:
patients_df
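
Null proportions that sum to 1 suggest that one column is populated exactly where the other is missing, but the sum alone does not prove it. A row-wise check can confirm it. The sketch below uses a made-up frame; the notebook does not state which two columns are involved, so the pair chosen here is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values and the column pair are assumptions.
toy = pd.DataFrame({
    "new_hospitalized_patients": [4.0, np.nan, np.nan, 6.0],
    "current_hospitalized_patients": [np.nan, 30.0, 28.0, np.nan],
})

a_missing = toy["new_hospitalized_patients"].isna()
b_missing = toy["current_hospitalized_patients"].isna()

# True only if, in every row, exactly one of the two columns is missing.
complementary = (a_missing ^ b_missing).all()

# The null shares sum to 1 in that case.
share_sum = a_missing.mean() + b_missing.mean()

print(complementary, share_sum)
```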

In [None]:
hospitalizations.location_key.value_counts()

The hospitalizations dataset contains only 7 location keys, all corresponding to locations within the United States, whereas other tables (such as index) contain 5121 location keys across all 4 countries.

If we merged the hospitalizations dataset into the macrotable, every row for a country other than the US would have null hospitalization values. This is not acceptable, because the assignment instructions require a macrotable with no null or missing values. We could remove the rows containing nulls, but that would leave a macrotable with very few rows, covering the US only.

We recognize the value of the hospitalization data, but:
 1. It is not essential for predicting deaths (which is our final objective).
 2. We prioritize a large dataset covering all countries over a small macrotable covering the US only.

In conclusion, we decided not to merge the hospitalizations table into our macrotable, and to drop it from our analysis.
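
The effect described above can be reproduced on a small scale. The sketch below is not the actual macrotable merge: the frames `index_toy` and `hosp_toy`, their keys, and the choice of a left merge on `location_key` are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the index and hospitalizations tables; keys and values are made up.
index_toy = pd.DataFrame({"location_key": ["US_CA", "US_NY", "ES_MD", "IT_25"]})
hosp_toy = pd.DataFrame({
    "location_key": ["US_CA", "US_NY"],
    "new_hospitalized_patients": [10, 7],
})

# A left merge keeps every location from the larger table.
merged = index_toy.merge(hosp_toy, on="location_key", how="left", indicator=True)

# Locations without a hospitalization match get NaN in the merged column.
unmatched_share = (merged["_merge"] == "left_only").mean()

# Dropping the nulls instead leaves only locations present in both tables.
complete_rows = merged.dropna(subset=["new_hospitalized_patients"])

print(unmatched_share, len(complete_rows))
```

Either way something is lost: keeping all rows introduces nulls, while dropping the nulls shrinks the table to the locations that have hospitalization data.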

<a id='4'></a>
## Step 4 - Ensuring correct datatypes

In [22]:
hospitalizations['date'] = pd.to_datetime(hospitalizations['date'], format="%Y-%m-%d") #KEEP FOR SCRIPT
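
With an explicit `format`, `pd.to_datetime` raises an error by default if any value does not match. If malformed dates are a possibility, one option is to coerce them to `NaT` and count them, as sketched below on made-up values. Whether the real `date` column contains any malformed entries is not known from this notebook.

```python
import pandas as pd

# Toy data for illustration only; the last value is deliberately malformed.
toy = pd.DataFrame({"date": ["2020-03-01", "2020-03-02", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(toy["date"], format="%Y-%m-%d", errors="coerce")

# Number of values that could not be parsed.
n_bad = int(parsed.isna().sum())

# Confirm the resulting dtype is a datetime type.
is_datetime = pd.api.types.is_datetime64_any_dtype(parsed)

print(n_bad, is_datetime)
```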