# DFx ETL Pipeline

## healthdata.org

An ETL pipeline for the [Global Burden of Disease Study dataset](https://ghdx.healthdata.org/gbd-2021) from the Global Health Data Exchange (GHDx).

### Libraries

In [1]:
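from dotenv import load_dotenv
from tqdm import tqdm

# Load environment variables from .env before importing dfpp
# (assumption: the storage backend reads its credentials from them).
load_dotenv()
from dfpp.storage import AzureStorage as Storage
from dfpp.sources import healthdata_org as source

storage = Storage()
SOURCE_NAME = "healthdata_org"  # folder the datasets are published to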

### Extract

In [2]:
df_raw = storage.read_dataset(
    storage.join_path("manual/IHME-GBD_2021_DATA-c13547d7-1.csv")
)
print("Shape:", df_raw.shape)
display(df_raw.head())

Shape: (303400, 16)


| | measure_id | measure_name | location_id | location_name | sex_id | sex_name | age_id | age_name | cause_id | cause_name | metric_id | metric_name | year | val | upper | lower |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Deaths | 349 | Greenland | 1 | Male | 22 | All ages | 526 | Digestive diseases | 1 | Number | 1980 | 6.965934 | 8.394566 | 5.592003 |
| 1 | 1 | Deaths | 349 | Greenland | 2 | Female | 22 | All ages | 526 | Digestive diseases | 1 | Number | 1980 | 7.915022 | 9.458732 | 6.676566 |
| 2 | 1 | Deaths | 349 | Greenland | 1 | Male | 22 | All ages | 526 | Digestive diseases | 3 | Rate | 1980 | 25.770021 | 31.055151 | 20.687252 |
| 3 | 1 | Deaths | 349 | Greenland | 2 | Female | 22 | All ages | 526 | Digestive diseases | 3 | Rate | 1980 | 34.754251 | 41.532561 | 29.316287 |
| 4 | 1 | Deaths | 84 | Ireland | 1 | Male | 22 | All ages | 491 | Cardiovascular diseases | 1 | Number | 1980 | 8929.646298 | 9166.850692 | 8651.144217 |
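Because the raw file is a manual export, it may be worth confirming that it has the expected columns before transforming it. The sketch below is one way to do that. `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration (they are not part of `dfpp`), and the column list is copied from the preview above.

```python
# Hypothetical helper, not part of dfpp: column names copied from the raw preview.
EXPECTED_COLUMNS = (
    "measure_id", "measure_name", "location_id", "location_name",
    "sex_id", "sex_name", "age_id", "age_name",
    "cause_id", "cause_name", "metric_id", "metric_name",
    "year", "val", "upper", "lower",
)


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`, sorted."""
    return sorted(set(expected) - set(columns))
```

In the notebook this could be called as `missing_columns(df_raw.columns)`; an empty list means every expected column is present.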


### Transform

In [3]:
df_transformed = source.transform(df_raw)
print("Shape:", df_transformed.shape)
display(df_transformed.head())



Shape: (303400, 12)


| | source | series_id | series_name | disagr_sex | disagr_age | disagr_cause | alpha_3_code | prop_unit | prop_observation_type | year | value | prop_value_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.healthdata.org/ | deaths_digestive_diseases | Deaths, Digestive diseases | male | All ages | Digestive diseases | GRL | Number | | 1980 | 6.965934 | |
| 1 | https://www.healthdata.org/ | deaths_digestive_diseases | Deaths, Digestive diseases | female | All ages | Digestive diseases | GRL | Number | | 1980 | 7.915022 | |
| 2 | https://www.healthdata.org/ | deaths_digestive_diseases | Deaths, Digestive diseases | male | All ages | Digestive diseases | GRL | Rate | | 1980 | 25.770021 | |
| 3 | https://www.healthdata.org/ | deaths_digestive_diseases | Deaths, Digestive diseases | female | All ages | Digestive diseases | GRL | Rate | | 1980 | 34.754251 | |
| 4 | https://www.healthdata.org/ | deaths_cardiovascular_diseases | Deaths, Cardiovascular diseases | male | All ages | Cardiovascular diseases | IRL | Number | | 1980 | 8929.646298 | |
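The implementation of `source.transform` is not shown in this notebook. Judging only from the previewed rows, `series_id` looks like a lowercase, underscore-separated combination of `measure_name` and `cause_name`, and `series_name` looks like the two joined by a comma. The helpers below are a minimal sketch of that naming pattern. `make_series_id` and `make_series_name` are hypothetical names, not `dfpp` functions, and the real transform may handle edge cases differently.

```python
import re


def make_series_id(measure_name: str, cause_name: str) -> str:
    """Build a slug such as 'deaths_digestive_diseases' (assumed naming pattern)."""
    label = f"{measure_name} {cause_name}".lower()
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", label).strip("_")


def make_series_name(measure_name: str, cause_name: str) -> str:
    """Build a display name such as 'Deaths, Digestive diseases' (assumed pattern)."""
    return f"{measure_name}, {cause_name}"
```

For the first previewed row, `make_series_id("Deaths", "Digestive diseases")` gives `deaths_digestive_diseases`, which matches the `series_id` shown in the transformed output.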


In [4]:
# Copy the unique (source, series_id, series_name) combinations to the clipboard.
df_transformed[['source', 'series_id', 'series_name']].drop_duplicates().to_clipboard(index=False)

### Load

In [8]:
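# Publish one dataset per series.
for series_id, df in tqdm(df_transformed.groupby("series_id")):
    # Attach the series ID to the frame
    # (assumption: publish_dataset uses df.name to name the published file).
    df.name = series_id
    storage.publish_dataset(df, folder_path=SOURCE_NAME)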

100%|██████████| 10/10 [00:15<00:00,  1.54s/it]
