# Notebook to Import and Merge Together Public Data

In this notebook, we merge data coming from several public datasets.
We end up with 36 x 2 final datasets: 36 covering the years 2017 to 2021 and 36 covering the years 1980 to 2021, so that the later analysis is free to use a shorter or longer time span, depending on its final scope.
The 36 datasets are the result of combining each disease with each measure and each metric. In particular:
- disease: **COPD** (Chronic obstructive pulmonary disease) or **asthma** or **tuberculosis**
- measure: **deaths**, **DALYs**, **incidence**, **prevalence**
- metric: **number**, **percent**, **rate**

So in total we have 3 x 4 x 3 = 36 datasets for each of the two year ranges.
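
As a rough illustration of where the 36 combinations come from, the sketch below enumerates them with `itertools.product` on a small made-up frame. The toy data, the `split_by_combination` helper and the dictionary layout are assumptions for illustration only; the notebook itself builds one named variable per combination.

```python
from itertools import product

import pandas as pd

# Toy frame using the column names the notebook assigns after renaming.
# The values are invented for the example.
toy = pd.DataFrame({
    "Disease": ["Asthma", "Asthma", "Tuberculosis", "Asthma"],
    "Measure": ["Deaths", "Deaths", "Incidence", "Prevalence"],
    "Metric": ["Number", "Rate", "Rate", "Percent"],
    "Value": [27.4, 2.1, 13.7, 0.01],
})

diseases = ["Chronic obstructive pulmonary disease", "Asthma", "Tuberculosis"]
measures = ["Deaths", "DALYs (Disability-Adjusted Life Years)", "Incidence", "Prevalence"]
metrics = ["Number", "Percent", "Rate"]


def split_by_combination(df):
    """Return a dict mapping (disease, measure, metric) to the matching rows."""
    subsets = {}
    for disease, measure, metric in product(diseases, measures, metrics):
        mask = (
            (df["Disease"] == disease)
            & (df["Measure"] == measure)
            & (df["Metric"] == metric)
        )
        subsets[(disease, measure, metric)] = df[mask]
    return subsets


subsets = split_by_combination(toy)
print(len(subsets))
print(subsets[("Asthma", "Deaths", "Rate")])
```

Keying the subsets by tuple keeps all 36 frames in one object, which can be convenient when the same merge has to be applied to each of them.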

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce

## Respiratory disease data - 2017 to 2021

In [3]:
# Your code here:
df_diseases = pd.read_csv("../Data/health/IHME/IHME-data-2017to2021/IHME-GBD_2021_DATA-db2c71f9-1.csv")
df_diseases = df_diseases[["measure", "location", "sex", "age", "cause", "metric", "year", "val"]]
df_diseases.columns = ["Measure", "Country Name", "Sex", "Age Class", "Disease", "Metric", "Year", "Value"]
df_diseases

Unnamed: 0,Measure,Country Name,Sex,Age Class,Disease,Metric,Year,Value
0,Deaths,Italy,Male,15-49 years,Chronic obstructive pulmonary disease,Number,2017,72.918995
1,Deaths,Italy,Female,15-49 years,Chronic obstructive pulmonary disease,Number,2017,35.326758
2,Deaths,Italy,Male,15-49 years,Chronic obstructive pulmonary disease,Percent,2017,0.006337
3,Deaths,Italy,Female,15-49 years,Chronic obstructive pulmonary disease,Percent,2017,0.005591
4,Deaths,Italy,Male,15-49 years,Chronic obstructive pulmonary disease,Rate,2017,0.553510
...,...,...,...,...,...,...,...,...
385555,Incidence,Sudan,Female,75+ years,Interstitial lung disease and pulmonary sarcoi...,Number,2021,19.755892
385556,Incidence,Sudan,Male,75+ years,Interstitial lung disease and pulmonary sarcoi...,Percent,2021,0.000032
385557,Incidence,Sudan,Female,75+ years,Interstitial lung disease and pulmonary sarcoi...,Percent,2021,0.000021
385558,Incidence,Sudan,Male,75+ years,Interstitial lung disease and pulmonary sarcoi...,Rate,2021,13.682119


In [3]:
# different diseases
print(df_diseases["Disease"].unique())

df_COPD = df_diseases[df_diseases["Disease"] == "Chronic obstructive pulmonary disease"]
df_asthma = df_diseases[df_diseases["Disease"] == "Asthma"]
df_tuberculosis = df_diseases[df_diseases["Disease"] == "Tuberculosis"]

['Chronic obstructive pulmonary disease' 'Asthma'
 'Interstitial lung disease and pulmonary sarcoidosis' 'Tuberculosis']


In [4]:
# combining with different measures
print(df_diseases["Measure"].unique())

df_COPD_deaths = df_COPD[df_COPD["Measure"] == "Deaths"]
df_COPD_DALYs = df_COPD[df_COPD["Measure"] == "DALYs (Disability-Adjusted Life Years)"]
df_COPD_prevalence = df_COPD[df_COPD["Measure"] == "Prevalence"]
df_COPD_incidence = df_COPD[df_COPD["Measure"] == "Incidence"]

df_asthma_deaths = df_asthma[df_asthma["Measure"] == "Deaths"]
df_asthma_DALYs = df_asthma[df_asthma["Measure"] == "DALYs (Disability-Adjusted Life Years)"]
df_asthma_prevalence = df_asthma[df_asthma["Measure"] == "Prevalence"]
df_asthma_incidence = df_asthma[df_asthma["Measure"] == "Incidence"]

df_tuberculosis_deaths = df_tuberculosis[df_tuberculosis["Measure"] == "Deaths"]
df_tuberculosis_DALYs = df_tuberculosis[df_tuberculosis["Measure"] == "DALYs (Disability-Adjusted Life Years)"]
df_tuberculosis_prevalence = df_tuberculosis[df_tuberculosis["Measure"] == "Prevalence"]
df_tuberculosis_incidence = df_tuberculosis[df_tuberculosis["Measure"] == "Incidence"]

['Deaths' 'DALYs (Disability-Adjusted Life Years)' 'Prevalence'
 'Incidence']


In [5]:
# combining with different metrics
print(df_diseases["Metric"].unique())

# Note: every filter below must test the "Metric" column; filtering "Measure"
# for "Percent" or "Rate" would always return an empty frame.
df_COPD_deaths_nb = df_COPD_deaths[df_COPD_deaths["Metric"] == "Number"]
df_COPD_deaths_pct = df_COPD_deaths[df_COPD_deaths["Metric"] == "Percent"]
df_COPD_deaths_rate = df_COPD_deaths[df_COPD_deaths["Metric"] == "Rate"]

df_COPD_DALYs_nb = df_COPD_DALYs[df_COPD_DALYs["Metric"] == "Number"]
df_COPD_DALYs_pct = df_COPD_DALYs[df_COPD_DALYs["Metric"] == "Percent"]
df_COPD_DALYs_rate = df_COPD_DALYs[df_COPD_DALYs["Metric"] == "Rate"]

df_COPD_prevalence_nb = df_COPD_prevalence[df_COPD_prevalence["Metric"] == "Number"]
df_COPD_prevalence_pct = df_COPD_prevalence[df_COPD_prevalence["Metric"] == "Percent"]
df_COPD_prevalence_rate = df_COPD_prevalence[df_COPD_prevalence["Metric"] == "Rate"]

df_COPD_incidence_nb = df_COPD_incidence[df_COPD_incidence["Metric"] == "Number"]
df_COPD_incidence_pct = df_COPD_incidence[df_COPD_incidence["Metric"] == "Percent"]
df_COPD_incidence_rate = df_COPD_incidence[df_COPD_incidence["Metric"] == "Rate"]

df_asthma_deaths_nb = df_asthma_deaths[df_asthma_deaths["Metric"] == "Number"]
df_asthma_deaths_pct = df_asthma_deaths[df_asthma_deaths["Metric"] == "Percent"]
df_asthma_deaths_rate = df_asthma_deaths[df_asthma_deaths["Metric"] == "Rate"]

df_asthma_DALYs_nb = df_asthma_DALYs[df_asthma_DALYs["Metric"] == "Number"]
df_asthma_DALYs_pct = df_asthma_DALYs[df_asthma_DALYs["Metric"] == "Percent"]
df_asthma_DALYs_rate = df_asthma_DALYs[df_asthma_DALYs["Metric"] == "Rate"]

df_asthma_prevalence_nb = df_asthma_prevalence[df_asthma_prevalence["Metric"] == "Number"]
df_asthma_prevalence_pct = df_asthma_prevalence[df_asthma_prevalence["Metric"] == "Percent"]
df_asthma_prevalence_rate = df_asthma_prevalence[df_asthma_prevalence["Metric"] == "Rate"]

df_asthma_incidence_nb = df_asthma_incidence[df_asthma_incidence["Metric"] == "Number"]
df_asthma_incidence_pct = df_asthma_incidence[df_asthma_incidence["Metric"] == "Percent"]
df_asthma_incidence_rate = df_asthma_incidence[df_asthma_incidence["Metric"] == "Rate"]

df_tuberculosis_deaths_nb = df_tuberculosis_deaths[df_tuberculosis_deaths["Metric"] == "Number"]
df_tuberculosis_deaths_pct = df_tuberculosis_deaths[df_tuberculosis_deaths["Metric"] == "Percent"]
df_tuberculosis_deaths_rate = df_tuberculosis_deaths[df_tuberculosis_deaths["Metric"] == "Rate"]

df_tuberculosis_DALYs_nb = df_tuberculosis_DALYs[df_tuberculosis_DALYs["Metric"] == "Number"]
df_tuberculosis_DALYs_pct = df_tuberculosis_DALYs[df_tuberculosis_DALYs["Metric"] == "Percent"]
df_tuberculosis_DALYs_rate = df_tuberculosis_DALYs[df_tuberculosis_DALYs["Metric"] == "Rate"]

df_tuberculosis_prevalence_nb = df_tuberculosis_prevalence[df_tuberculosis_prevalence["Metric"] == "Number"]
df_tuberculosis_prevalence_pct = df_tuberculosis_prevalence[df_tuberculosis_prevalence["Metric"] == "Percent"]
df_tuberculosis_prevalence_rate = df_tuberculosis_prevalence[df_tuberculosis_prevalence["Metric"] == "Rate"]

df_tuberculosis_incidence_nb = df_tuberculosis_incidence[df_tuberculosis_incidence["Metric"] == "Number"]
df_tuberculosis_incidence_pct = df_tuberculosis_incidence[df_tuberculosis_incidence["Metric"] == "Percent"]
df_tuberculosis_incidence_rate = df_tuberculosis_incidence[df_tuberculosis_incidence["Metric"] == "Rate"]

['Number' 'Percent' 'Rate']


## Respiratory disease data - 1980 to 2021

Here we repeat the same steps on the full time range (1980 to 2021).

In [None]:
# Your code here:
df_diseases1_8021 = pd.read_csv("../Data/health/IHME/IHME-data-1980to2021/IHME-GBD_2021_DATA-bd7f55df-1.csv")
df_diseases2_8021 = pd.read_csv("../Data/health/IHME/IHME-data-1980to2021/IHME-GBD_2021_DATA-bd7f55df-2.csv")
df_diseases3_8021 = pd.read_csv("../Data/health/IHME/IHME-data-1980to2021/IHME-GBD_2021_DATA-bd7f55df-3.csv")
df_diseases4_8021 = pd.read_csv("../Data/health/IHME/IHME-data-1980to2021/IHME-GBD_2021_DATA-bd7f55df-4.csv")
df_diseases5_8021 = pd.read_csv("../Data/health/IHME/IHME-data-1980to2021/IHME-GBD_2021_DATA-bd7f55df-5.csv")
df_diseases6_8021 = pd.read_csv("../Data/health/IHME/IHME-data-1980to2021/IHME-GBD_2021_DATA-bd7f55df-6.csv")
df_diseases7_8021 = pd.read_csv("../Data/health/IHME/IHME-data-1980to2021/IHME-GBD_2021_DATA-bd7f55df-7.csv")
df_diseases8_8021 = pd.read_csv("../Data/health/IHME/IHME-data-1980to2021/IHME-GBD_2021_DATA-bd7f55df-8.csv")
df_diseases_8021 = pd.concat([df_diseases1_8021, df_diseases2_8021, df_diseases3_8021, df_diseases4_8021, df_diseases5_8021,\
                             df_diseases6_8021, df_diseases7_8021, df_diseases8_8021])
df_diseases_8021 = df_diseases_8021[["measure", "location", "sex", "age", "cause", "metric", "year", "val"]]
df_diseases_8021.columns = ["Measure", "Country Name", "Sex", "Age Class", "Disease", "Metric", "Year", "Value"]
df_diseases_8021

Unnamed: 0,Measure,Country Name,Sex,Age Class,Disease,Metric,Year,Value
0,Deaths,Georgia,Male,15-49 years,Asthma,Number,1980,27.367289
1,Deaths,Georgia,Female,15-49 years,Asthma,Number,1980,14.762167
2,Deaths,Georgia,Male,15-49 years,Asthma,Percent,1980,0.006820
3,Deaths,Georgia,Female,15-49 years,Asthma,Percent,1980,0.009002
4,Deaths,Georgia,Male,15-49 years,Asthma,Rate,1980,2.141352
...,...,...,...,...,...,...,...,...
451067,Incidence,United States Virgin Islands,Female,75+ years,Chronic obstructive pulmonary disease,Number,2021,35.932975
451068,Incidence,United States Virgin Islands,Male,75+ years,Chronic obstructive pulmonary disease,Percent,2021,0.002453
451069,Incidence,United States Virgin Islands,Female,75+ years,Chronic obstructive pulmonary disease,Percent,2021,0.001791
451070,Incidence,United States Virgin Islands,Male,75+ years,Chronic obstructive pulmonary disease,Rate,2021,1043.998350
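
The eight part files read above could also be loaded in a loop rather than one variable per file. The following is a minimal sketch under assumptions: `load_parts` is a hypothetical helper, and the demo writes two throwaway CSV files to a temporary folder instead of touching the real IHME export.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_parts(folder, pattern):
    """Read every CSV in `folder` matching `pattern` and stack them into one frame."""
    paths = sorted(Path(folder).glob(pattern))
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    # instead of repeating each file's own row numbers.
    return pd.concat(frames, ignore_index=True)


# Demo on two tiny made-up files (not the real data).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"year": [1980], "val": [1.0]}).to_csv(Path(tmp) / "part-1.csv", index=False)
    pd.DataFrame({"year": [1981], "val": [2.0]}).to_csv(Path(tmp) / "part-2.csv", index=False)
    combined = load_parts(tmp, "part-*.csv")

print(combined)
```

Note that `sorted` orders paths as text, so a file numbered 10 would sort before 2; with single-digit part numbers this does not matter.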


In [27]:
# different diseases
print(df_diseases_8021["Disease"].unique())

df_8021_COPD = df_diseases_8021[df_diseases_8021["Disease"] == "Chronic obstructive pulmonary disease"]
df_8021_asthma = df_diseases_8021[df_diseases_8021["Disease"] == "Asthma"]
df_8021_tuberculosis = df_diseases_8021[df_diseases_8021["Disease"] == "Tuberculosis"]

['Asthma' 'Interstitial lung disease and pulmonary sarcoidosis'
 'Chronic obstructive pulmonary disease' 'Pneumoconiosis' 'COVID-19'
 'Tuberculosis']


In [28]:
# combining with different measures
print(df_diseases_8021["Measure"].unique())

df_8021_COPD_deaths = df_8021_COPD[df_8021_COPD["Measure"] == "Deaths"]
df_8021_COPD_DALYs = df_8021_COPD[df_8021_COPD["Measure"] == "DALYs (Disability-Adjusted Life Years)"]
df_8021_COPD_prevalence = df_8021_COPD[df_8021_COPD["Measure"] == "Prevalence"]
df_8021_COPD_incidence = df_8021_COPD[df_8021_COPD["Measure"] == "Incidence"]

df_8021_asthma_deaths = df_8021_asthma[df_8021_asthma["Measure"] == "Deaths"]
df_8021_asthma_DALYs = df_8021_asthma[df_8021_asthma["Measure"] == "DALYs (Disability-Adjusted Life Years)"]
df_8021_asthma_prevalence = df_8021_asthma[df_8021_asthma["Measure"] == "Prevalence"]
df_8021_asthma_incidence = df_8021_asthma[df_8021_asthma["Measure"] == "Incidence"]

df_8021_tuberculosis_deaths = df_8021_tuberculosis[df_8021_tuberculosis["Measure"] == "Deaths"]
df_8021_tuberculosis_DALYs = df_8021_tuberculosis[df_8021_tuberculosis["Measure"] == "DALYs (Disability-Adjusted Life Years)"]
df_8021_tuberculosis_prevalence = df_8021_tuberculosis[df_8021_tuberculosis["Measure"] == "Prevalence"]
df_8021_tuberculosis_incidence = df_8021_tuberculosis[df_8021_tuberculosis["Measure"] == "Incidence"]

['Deaths' 'DALYs (Disability-Adjusted Life Years)' 'Prevalence'
 'Incidence']


In [29]:
# combining with different metrics
print(df_diseases_8021["Metric"].unique())

# Note: every filter below must test the "Metric" column; filtering "Measure"
# for "Percent" or "Rate" would always return an empty frame.
df_8021_COPD_deaths_nb = df_8021_COPD_deaths[df_8021_COPD_deaths["Metric"] == "Number"]
df_8021_COPD_deaths_pct = df_8021_COPD_deaths[df_8021_COPD_deaths["Metric"] == "Percent"]
df_8021_COPD_deaths_rate = df_8021_COPD_deaths[df_8021_COPD_deaths["Metric"] == "Rate"]

df_8021_COPD_DALYs_nb = df_8021_COPD_DALYs[df_8021_COPD_DALYs["Metric"] == "Number"]
df_8021_COPD_DALYs_pct = df_8021_COPD_DALYs[df_8021_COPD_DALYs["Metric"] == "Percent"]
df_8021_COPD_DALYs_rate = df_8021_COPD_DALYs[df_8021_COPD_DALYs["Metric"] == "Rate"]

df_8021_COPD_prevalence_nb = df_8021_COPD_prevalence[df_8021_COPD_prevalence["Metric"] == "Number"]
df_8021_COPD_prevalence_pct = df_8021_COPD_prevalence[df_8021_COPD_prevalence["Metric"] == "Percent"]
df_8021_COPD_prevalence_rate = df_8021_COPD_prevalence[df_8021_COPD_prevalence["Metric"] == "Rate"]

df_8021_COPD_incidence_nb = df_8021_COPD_incidence[df_8021_COPD_incidence["Metric"] == "Number"]
df_8021_COPD_incidence_pct = df_8021_COPD_incidence[df_8021_COPD_incidence["Metric"] == "Percent"]
df_8021_COPD_incidence_rate = df_8021_COPD_incidence[df_8021_COPD_incidence["Metric"] == "Rate"]

df_8021_asthma_deaths_nb = df_8021_asthma_deaths[df_8021_asthma_deaths["Metric"] == "Number"]
df_8021_asthma_deaths_pct = df_8021_asthma_deaths[df_8021_asthma_deaths["Metric"] == "Percent"]
df_8021_asthma_deaths_rate = df_8021_asthma_deaths[df_8021_asthma_deaths["Metric"] == "Rate"]

df_8021_asthma_DALYs_nb = df_8021_asthma_DALYs[df_8021_asthma_DALYs["Metric"] == "Number"]
df_8021_asthma_DALYs_pct = df_8021_asthma_DALYs[df_8021_asthma_DALYs["Metric"] == "Percent"]
df_8021_asthma_DALYs_rate = df_8021_asthma_DALYs[df_8021_asthma_DALYs["Metric"] == "Rate"]

df_8021_asthma_prevalence_nb = df_8021_asthma_prevalence[df_8021_asthma_prevalence["Metric"] == "Number"]
df_8021_asthma_prevalence_pct = df_8021_asthma_prevalence[df_8021_asthma_prevalence["Metric"] == "Percent"]
df_8021_asthma_prevalence_rate = df_8021_asthma_prevalence[df_8021_asthma_prevalence["Metric"] == "Rate"]

df_8021_asthma_incidence_nb = df_8021_asthma_incidence[df_8021_asthma_incidence["Metric"] == "Number"]
df_8021_asthma_incidence_pct = df_8021_asthma_incidence[df_8021_asthma_incidence["Metric"] == "Percent"]
df_8021_asthma_incidence_rate = df_8021_asthma_incidence[df_8021_asthma_incidence["Metric"] == "Rate"]

df_8021_tuberculosis_deaths_nb = df_8021_tuberculosis_deaths[df_8021_tuberculosis_deaths["Metric"] == "Number"]
df_8021_tuberculosis_deaths_pct = df_8021_tuberculosis_deaths[df_8021_tuberculosis_deaths["Metric"] == "Percent"]
df_8021_tuberculosis_deaths_rate = df_8021_tuberculosis_deaths[df_8021_tuberculosis_deaths["Metric"] == "Rate"]

df_8021_tuberculosis_DALYs_nb = df_8021_tuberculosis_DALYs[df_8021_tuberculosis_DALYs["Metric"] == "Number"]
df_8021_tuberculosis_DALYs_pct = df_8021_tuberculosis_DALYs[df_8021_tuberculosis_DALYs["Metric"] == "Percent"]
df_8021_tuberculosis_DALYs_rate = df_8021_tuberculosis_DALYs[df_8021_tuberculosis_DALYs["Metric"] == "Rate"]

df_8021_tuberculosis_prevalence_nb = df_8021_tuberculosis_prevalence[df_8021_tuberculosis_prevalence["Metric"] == "Number"]
df_8021_tuberculosis_prevalence_pct = df_8021_tuberculosis_prevalence[df_8021_tuberculosis_prevalence["Metric"] == "Percent"]
df_8021_tuberculosis_prevalence_rate = df_8021_tuberculosis_prevalence[df_8021_tuberculosis_prevalence["Metric"] == "Rate"]

df_8021_tuberculosis_incidence_nb = df_8021_tuberculosis_incidence[df_8021_tuberculosis_incidence["Metric"] == "Number"]
df_8021_tuberculosis_incidence_pct = df_8021_tuberculosis_incidence[df_8021_tuberculosis_incidence["Metric"] == "Percent"]
df_8021_tuberculosis_incidence_rate = df_8021_tuberculosis_incidence[df_8021_tuberculosis_incidence["Metric"] == "Rate"]

['Number' 'Percent' 'Rate']


## World Development Indicators Data - 1974 to 2021

In [None]:
df_wdi_1 = pd.read_csv("../Data/economic/WorldBankGroup/World_Development_Indicators/wdi_1.csv")
df_wdi_2 = pd.read_csv("../Data/economic/WorldBankGroup/World_Development_Indicators/wdi_2.csv", encoding="cp1252", engine="python")
df_wdi = pd.concat([df_wdi_1, df_wdi_2])
df_wdi


Unnamed: 0,Country Name,Country Code,Series Name,Series Code,1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,GDP (current US$),NY.GDP.MKTP.CD,..,..,..,..,..,..,...,19907329777.5872,20146416757.5987,20497128555.6972,19134221644.7325,18116572395.0772,18753456497.8159,18053222687.4126,18799444490.1128,19955929052.1496,14259995441.0759
1,Afghanistan,AFG,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,..,..,..,..,..,..,...,568.929021458341,580.603833333096,575.146245808546,565.569730408751,563.872336723147,562.769574140988,553.125151688293,557.861533207459,527.834554499306,408.625855217403
2,Afghanistan,AFG,"Population, total",SP.POP.TOTL,12469127,12773954,13059851,13340756,13611441,13655567,...,30560034,31622704,32792523,33831764,34700612,35688935,36743039,37856121,39068979,40000412
3,Afghanistan,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,39.469,39.994,40.518,41.082,40.086,38.844,...,61.735,62.188,62.26,62.27,62.646,62.406,62.443,62.941,61.454,60.417
4,Afghanistan,AFG,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,210.7,207.5,204.1,200.4,196.6,192.9,...,71.3,68.7,66.4,64.2,62.3,60.4,58.6,56.9,55.3,53.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3988,World,WLD,People using at least basic sanitation service...,SH.STA.BASS.ZS,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
3989,World,WLD,Surface area (sq. km),AG.SRF.TOTL.K2,139735046.037,139735046.037,139735046.037,139732926.037,139732926.037,139734966.737,...,140021205.18,140378206.158,140376282.955,140383797.195,140384998.854,140388337.658,140388117.186,140489642.695,140496038.92,140490177.614
3990,,,,,,,,,,,...,,,,,,,,,,
3991,,,,,,,,,,,...,,,,,,,,,,


In [8]:
# Optional: rename the year columns to just the year number
year_cols = [c for c in df_wdi.columns if "[" in c]   # picks the YR… columns
rename_map = {c: c.split("[")[0].strip() for c in year_cols}
df_wdi = df_wdi.rename(columns=rename_map)

print("\nAfter renaming:", df_wdi.columns.tolist(), "\n")

# Melt - Collapse all year columns into one
# Identify the columns that hold the yearly values
year_columns = [c for c in df_wdi.columns if c.isdigit()]   # e.g. ['1974','1975',...]

# Melt (wide → long)
df_long = df_wdi.melt(
    id_vars=['Country Name', 'Country Code', 'Series Name', 'Series Code'],
    value_vars=year_columns,
    var_name='Year',          # name of the new column that will hold the year
    value_name='Value'        # name of the column that will hold the measurement
)

print("\nShape after melt:", df_long.shape)

# Pivot - Spread the different series into separate columns
df_tidy = df_long.pivot_table(
    index=['Country Name', 'Country Code', 'Year'],   # what defines a unique row
    columns='Series Name',                            # each distinct series becomes a column
    values='Value',                                   # fill cells with the measurement
    aggfunc='first'                                   # there should be only one value per cell
).reset_index()

# After pivot_table, the columns axis is a plain Index that still carries the name "Series Name".
# Remove that axis name for easier use (the two lines below are equivalent):
df_tidy.columns.name = None          # drop the name of the columns axis
df_tidy = df_tidy.rename_axis(None, axis=1)   # also removes the axis name

print("\nFinal shape:", df_tidy.shape)
df_wdi = df_tidy
df_wdi


After renaming: ['Country Name', 'Country Code', 'Series Name', 'Series Code', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'] 


Shape after melt: (383424, 6)

Final shape: (12768, 18)


Unnamed: 0,Country Name,Country Code,Year,Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),Carbon dioxide (CO2) emissions excluding LULUCF per capita (t CO2e/capita),"Compulsory education, duration (years)",GDP (current US$),GDP per capita (constant 2015 US$),Gini index,"Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",People using at least basic sanitation services (% of population),"Population, total",Poverty headcount ratio at national poverty lines (% of population),Renewable electricity output (% of total electricity output),Surface area (sq. km),"Unemployment, total (% of total labor force) (national estimate)"
0,Afghanistan,AFG,1974,..,..,0.175657846776282,..,..,..,..,39.469,210.7,..,12469127,..,..,652860,..
1,Afghanistan,AFG,1975,..,..,0.158838837215165,..,..,..,..,39.994,207.5,..,12773954,..,..,652860,..
2,Afghanistan,AFG,1976,..,..,0.144917426699585,..,..,..,..,40.518,204.1,..,13059851,..,..,652860,..
3,Afghanistan,AFG,1977,..,..,0.17109974876986,..,..,..,..,41.082,200.4,..,13340756,..,..,652860,..
4,Afghanistan,AFG,1978,..,..,0.142093698969859,..,..,..,..,40.086,196.6,..,13611441,..,..,652860,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12763,Zimbabwe,ZWE,2017,30.1,44,0.714627028745081,7,51074726484.0037,1422.1934603003,44.3,60.263,49.2,36.9416742711943,14812482,30.4,55.1814329227917,390760,..
12764,Zimbabwe,ZWE,2018,30.3,45.4,0.816125522899006,7,34156057417.3285,1471.39488971183,..,60.906,47.4,36.3571601293685,15034452,..,63.0333013128402,390760,..
12765,Zimbabwe,ZWE,2019,30.3,46.7,0.731381759643275,7,25715657177.4682,1356.83821089692,50.3,61.06,46,35.7743358079873,15271368,38.3,68.8452182208443,390760,7.373
12766,Zimbabwe,ZWE,2020,30.5,52.7,0.584283212450557,7,26868564055.12,1230.19155671068,..,61.53,44.9,35.1923618234591,15526888,..,60.7855537239622,390760,..
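
The `..` entries in the table above appear to be the placeholder this export uses for missing values, which means the indicator columns hold strings rather than numbers, and `Year` is a string as well because it came from column names. Before merging or computing anything, these could be converted as in the sketch below. The toy frame and the variable names `toy_wdi` and `clean` are assumptions for illustration; only two indicator columns are shown.

```python
import pandas as pd

# Toy frame mimicking the pivoted WDI layout, values stored as strings.
toy_wdi = pd.DataFrame({
    "Country Name": ["Afghanistan", "Afghanistan"],
    "Country Code": ["AFG", "AFG"],
    "Year": ["1974", "2017"],
    "GDP (current US$)": ["..", "18753456497.8159"],
    "Population, total": ["12469127", "35688935"],
})

id_cols = ["Country Name", "Country Code"]
value_cols = [c for c in toy_wdi.columns if c not in id_cols + ["Year"]]

clean = toy_wdi.copy()
# Integer years match the integer "Year" column of the disease data.
clean["Year"] = clean["Year"].astype(int)
# errors="coerce" turns anything non-numeric, including "..", into NaN.
clean[value_cols] = clean[value_cols].apply(pd.to_numeric, errors="coerce")

print(clean.dtypes)
print(clean)
```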


## Environment Air Emissions Data - OECD

In [None]:
df_air_emissions = pd.read_csv("../Data/enivronment/OECD/air_emissions.csv")
pollutant_col = df_air_emissions.pivot(columns='Pollutants', values='OBS_VALUE')[["Carbon dioxide", "Carbon monoxide"]]
df_air_emissions = pd.concat([df_air_emissions, pollutant_col], axis = 1)
df_air_emissions_CO2 = df_air_emissions[df_air_emissions["Carbon dioxide"].notnull()][["REF_AREA", "Reference area", "TIME_PERIOD", "Carbon dioxide"]]
df_air_emissions_CO = df_air_emissions[df_air_emissions["Carbon monoxide"].notnull()][["REF_AREA", "Reference area", "TIME_PERIOD", "Carbon monoxide"]]
df_air_emissions_CO2.columns = ["Country Code", "Country Name", "Year", "Carbon dioxide (tonnes)"]
df_air_emissions_CO.columns = ["Country Code", "Country Name", "Year", "Carbon monoxide (tonnes)"]
display(df_air_emissions_CO)
display(df_air_emissions_CO2)

Unnamed: 0,Country Code,Country Name,Year,Carbon monoxide (tonnes)
27,FRA,France,2015,0.00
30,FRA,France,2016,0.00
31,FRA,France,2017,0.00
34,FRA,France,2018,0.00
35,FRA,France,2019,0.00
...,...,...,...,...
2675,FRA,France,2023,829.29
2676,FRA,France,2023,3672.59
2677,FRA,France,2023,413.41
2678,FRA,France,2023,23035.91


Unnamed: 0,Country Code,Country Name,Year,Carbon dioxide (tonnes)
0,FRA,France,2015,0.00
1,FRA,France,2016,0.00
2,FRA,France,2017,0.00
3,FRA,France,2018,0.00
4,FRA,France,2019,0.00
...,...,...,...,...
3413,FRA,France,2023,418864.51
3414,FRA,France,2023,13970993.14
3415,FRA,France,2023,182809.47
3416,FRA,France,2023,37375417.58
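
The pivot, concat and `notnull` sequence used for the OECD files amounts to selecting the rows of one category and renaming the value column. A more direct version could look like the sketch below. `extract_series` is a hypothetical helper, and the toy frame only reuses the column names seen in the code above with invented values.

```python
import pandas as pd


def extract_series(df, category_col, category, new_name):
    """Keep the non-null rows of one category and return tidy country/year/value columns."""
    rows = df[(df[category_col] == category) & df["OBS_VALUE"].notna()]
    out = rows[["REF_AREA", "Reference area", "TIME_PERIOD", "OBS_VALUE"]].copy()
    out.columns = ["Country Code", "Country Name", "Year", new_name]
    return out


# Toy data in the long OECD layout (values are made up).
toy_oecd = pd.DataFrame({
    "REF_AREA": ["FRA", "FRA", "FRA"],
    "Reference area": ["France", "France", "France"],
    "TIME_PERIOD": [2015, 2015, 2016],
    "Pollutants": ["Carbon dioxide", "Carbon monoxide", "Carbon dioxide"],
    "OBS_VALUE": [10.0, 2.0, 11.0],
})

co2 = extract_series(toy_oecd, "Pollutants", "Carbon dioxide", "Carbon dioxide (tonnes)")
print(co2)
```

The outputs above show several rows for France in 2023, which suggests the file contains further dimensions beyond pollutant, country and year. If so, those would need to be filtered or aggregated before merging on country and year.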


## Environment Air Pollutants Emissions Data - OECD

In [None]:
df_air_poll_emissions = pd.read_csv("../Data/enivronment/OECD/air_pollutants_emissions.csv")
pollutant_col = df_air_poll_emissions.pivot(columns='Pollutant', values='OBS_VALUE')
df_air_poll_emissions = pd.concat([df_air_poll_emissions, pollutant_col], axis = 1)
df_air_poll_emissions = df_air_poll_emissions[["REF_AREA", "Reference area", "TIME_PERIOD", "Sulphur oxides"]] #unit measure is T (tonnes) for all
df_air_poll_emissions.columns = ["Country Code", "Country Name", "Year", "Sulphur oxides (tonnes)"]
df_air_poll_emissions

Unnamed: 0,Country Code,Country Name,Year,Sulphur oxides (tonnes)
0,OECD,OECD,2015,14293.0000
1,OECD,OECD,2016,13498.9300
2,OECD,OECD,2017,13012.7900
3,OECD,OECD,2018,12739.2900
4,OECD,OECD,2019,12058.2600
...,...,...,...,...
508,UKR,Ukraine,2019,840.3821
509,UKR,Ukraine,2020,755.6960
510,UKR,Ukraine,2021,568.1924
511,UKR,Ukraine,2022,334.9731


## Exposure to Air Pollution - OECD

In [None]:
df_exposure_pollution = pd.read_csv("../Data/enivronment/OECD/exposure_to_air_pollution.csv")
pollutant_col = df_exposure_pollution.pivot(columns='Pollutant', values='OBS_VALUE')
df_exposure_pollution = pd.concat([df_exposure_pollution, pollutant_col], axis = 1)
df_exposure_pollution = df_exposure_pollution[["REF_AREA", "Reference area", "TIME_PERIOD", "Fine particulate matter (PM2.5)"]] #unit measure is MCG_M3 for all
df_exposure_pollution.columns = ["Country Code", "Country Name", "Year", "Fine particulate matter (PM2.5) (MCR_M3)"]
df_exposure_pollution

Unnamed: 0,Country Code,Country Name,Year,Fine particulate matter (PM2.5) (MCR_M3)
0,AUS,Australia,2020,8.110881
1,AU1,New South Wales,2020,9.794581
2,AU2,Victoria,2020,8.289856
3,AU3,Queensland,2020,6.342767
4,AU4,South Australia,2020,5.730050
...,...,...,...,...
1329,USA,United States,2016,7.303724
1330,USA,United States,2017,7.675848
1331,USA,United States,2018,7.705240
1332,USA,United States,2019,7.071832


## Greenhouse Gas Emissions - OECD

In [None]:
df_greenhouse_gas = pd.read_csv("../Data/enivronment/OECD/greenhouse_gas_emissions.csv")
pollutant_col = df_greenhouse_gas.pivot(columns='Pollutant', values='OBS_VALUE')
df_greenhouse_gas = pd.concat([df_greenhouse_gas, pollutant_col], axis = 1)
df_greenhouse_gas = df_greenhouse_gas[["REF_AREA", "Reference area", "TIME_PERIOD", "Greenhouse gases"]] #unit measure is Kg of CO2-equivalent per person for all
df_greenhouse_gas.columns = ["Country Code", "Country Name", "Year", "Greenhouse gases (Kg CO2-equivalent Per Person)"]
df_greenhouse_gas

Unnamed: 0,Country Code,Country Name,Year,Greenhouse gases (Kg CO2-equivalent Per Person)
0,OECDE,OECD Europe,2014,8.223466
1,OECDE,OECD Europe,2015,8.240490
2,OECDE,OECD Europe,2016,8.185606
3,OECDE,OECD Europe,2017,8.193707
4,OECDE,OECD Europe,2018,8.002245
...,...,...,...,...
797,THA,Thailand,2020,5.205036
798,UKR,Ukraine,2020,7.278898
799,URY,Uruguay,2020,10.706970
800,UZB,Uzbekistan,2020,5.595903


## Intensity Use of Forests Resources - OECD

In [None]:
df_use_forests_resources = pd.read_csv("../Data/enivronment/OECD/intensity_use_forests_resources.csv")
measure_col = df_use_forests_resources.pivot(columns='Measure', values='OBS_VALUE')
df_use_forests_resources = pd.concat([df_use_forests_resources, measure_col], axis = 1)
df_use_forests_resources = df_use_forests_resources[["REF_AREA", "Reference area", "TIME_PERIOD", "Intensity of use of forest resources"]] #unit measure is Percentage Points for all
df_use_forests_resources.columns = ["Country Code", "Country Name", "Year", "Intensity of use of forest resources (Percentage Points)"]
df_use_forests_resources

Unnamed: 0,Country Code,Country Name,Year,Intensity of use of forest resources (Percentage Points)
0,AUS,Australia,2015,0.830061
1,AUS,Australia,2016,0.915014
2,AUS,Australia,2017,1.010009
3,AUS,Australia,2018,1.001892
4,AUS,Australia,2019,0.991555
...,...,...,...,...
283,NZL,New Zealand,2013,0.657170
284,SVK,Slovak Republic,2017,0.781253
285,CHE,Switzerland,2010,0.713892
286,CHE,Switzerland,2015,0.695307


## Land Use - OECD

In [None]:
df_land_use = pd.read_csv("../Data/enivronment/OECD/land_use.csv")
measure_col = df_land_use.pivot(columns='Measure', values='OBS_VALUE')
df_land_use = pd.concat([df_land_use, measure_col], axis = 1)
df_land_use = df_land_use[["REF_AREA", "Reference area", "TIME_PERIOD", "Total area"]] #unit measure is Square Km for all
df_land_use.columns = ["Country Code", "Country Name", "Year", "Total area (Square Km)"]
df_land_use

Unnamed: 0,Country Code,Country Name,Year,Total area (Square Km)
0,SHN,Saint Helena,2010,390.0
1,SHN,Saint Helena,2011,390.0
2,SHN,Saint Helena,2012,390.0
3,SHN,Saint Helena,2013,390.0
4,SHN,Saint Helena,2014,390.0
...,...,...,...,...
3211,ZWE,Zimbabwe,2019,390760.0
3212,ZWE,Zimbabwe,2020,390760.0
3213,ZWE,Zimbabwe,2021,390760.0
3214,ZWE,Zimbabwe,2022,390760.0


## Pesticides Use - OECD

In [None]:
df_pesticides_use = pd.read_csv("../Data/enivronment/OECD/pesticides_use.csv")
measure_col = df_pesticides_use.pivot(columns='Measure', values='OBS_VALUE')[["Total molluscicides", "Total sales of agricultural pesticides"]]
df_pesticides_use = pd.concat([df_pesticides_use, measure_col], axis = 1)

df_pesticides_use_total_pesticides = df_pesticides_use[df_pesticides_use["Total sales of agricultural pesticides"].notnull()][["REF_AREA", "Reference area", "TIME_PERIOD", "Total sales of agricultural pesticides"]]
df_pesticides_use_total_molluscicides = df_pesticides_use[df_pesticides_use["Total molluscicides"].notnull()][["REF_AREA", "Reference area", "TIME_PERIOD", "Total molluscicides"]]
df_pesticides_use_total_pesticides.columns = ["Country Code", "Country Name", "Year", "Total sales of agricultural pesticides (tonnes)"]
df_pesticides_use_total_molluscicides.columns = ["Country Code", "Country Name", "Year", "Total molluscicides (tonnes)"]
display(df_pesticides_use_total_pesticides)
display(df_pesticides_use_total_molluscicides)

Unnamed: 0,Country Code,Country Name,Year,Total sales of agricultural pesticides (tonnes)
0,AUS,Australia,2012,48687.875
1,AUS,Australia,2013,45177.187
2,AUS,Australia,2014,49857.349
3,AUS,Australia,2015,50921.602
4,AUS,Australia,2016,63416.482
...,...,...,...,...
481,VNM,Viet Nam,2016,19154.000
482,VNM,Viet Nam,2017,19154.000
483,VNM,Viet Nam,2018,19154.000
484,VNM,Viet Nam,2019,19154.000


Unnamed: 0,Country Code,Country Name,Year,Total molluscicides (tonnes)
2000,AUT,Austria,2012,23.653
2001,AUT,Austria,2013,13.471
2002,AUT,Austria,2014,16.180
2003,AUT,Austria,2015,21.214
2004,AUT,Austria,2016,10.679
...,...,...,...,...
2240,ROU,Romania,2017,4.981
2241,ROU,Romania,2018,4.829
2242,ROU,Romania,2019,4.263
2243,ROU,Romania,2020,9.304


## Tobacco Consumption - OECD

In [None]:
df_tobacco_consumption = pd.read_csv("../Data/health/OECD/tobacco_consumption.csv")
measure_col = df_tobacco_consumption.pivot(columns='Measure', values='OBS_VALUE')
df_tobacco_consumption = pd.concat([df_tobacco_consumption, measure_col], axis = 1)

df_tobacco_consumption_pct = df_tobacco_consumption[df_tobacco_consumption["Share of population who are daily smokers"].notnull()][["REF_AREA", "Reference area", "TIME_PERIOD", "Share of population who are daily smokers"]]
df_tobacco_consumption_pct.columns = ["Country Code", "Country Name", "Year", "Share of population who are daily smokers (Pct population)"]

df_tobacco_consumption = df_tobacco_consumption[df_tobacco_consumption["Tobacco consumption"].notnull()]
df_tobacco_consumption_nbcigarettes = df_tobacco_consumption[df_tobacco_consumption["Unit of measure"] == 'Cigarettes per smoker per day'][["REF_AREA", "Reference area", "TIME_PERIOD", "Tobacco consumption"]]
df_tobacco_consumption_grperperson = df_tobacco_consumption[df_tobacco_consumption["Unit of measure"] == 'Grammes per person'][["REF_AREA", "Reference area", "TIME_PERIOD", "Tobacco consumption"]]
df_tobacco_consumption_nbcigarettes.columns = ["Country Code", "Country Name", "Year", "Tobacco consumption (Cigarettes per smoker per day)"]
df_tobacco_consumption_grperperson.columns = ["Country Code", "Country Name", "Year", "Tobacco consumption (Grammes per person)"]

display(df_tobacco_consumption_pct)
display(df_tobacco_consumption_nbcigarettes)
display(df_tobacco_consumption_grperperson)

Unnamed: 0,Country Code,Country Name,Year,Share of population who are daily smokers (Pct population)
0,AUS,Australia,2010,14.1
1,AUS,Australia,2013,11.3
2,AUS,Australia,2016,10.8
3,AUS,Australia,2019,10.0
4,AUS,Australia,2022,7.8
...,...,...,...,...
2515,RUS,Russia,2016,30.3
2516,RUS,Russia,2016,49.5
2517,RUS,Russia,2017,45.9
2518,RUS,Russia,2017,27.5


Unnamed: 0,Country Code,Country Name,Year,Tobacco consumption (Cigarettes per smoker per day)
1526,AUS,Australia,2010,15.9
1527,AUS,Australia,2013,13.7
1528,AUS,Australia,2016,13.4
1529,AUS,Australia,2019,12.9
1530,AUS,Australia,2022,13.1
...,...,...,...,...
2461,ISR,Israel,2016,17.2
2463,LVA,Latvia,2016,13.5
2464,NLD,Netherlands,2010,10.5
2465,NLD,Netherlands,2014,10.7


Unnamed: 0,Country Code,Country Name,Year,Tobacco consumption (Grammes per person)
1756,AUS,Australia,2010,1009.0
1757,AUS,Australia,2012,964.7
1758,AUS,Australia,2013,915.8
1759,AUS,Australia,2014,830.8
1760,AUS,Australia,2015,790.0
...,...,...,...,...
2456,AUS,Australia,2011,971.7
2459,DEU,Germany,2022,1350.0
2462,ISR,Israel,2023,753.0
2466,NZL,New Zealand,2010,675.0


## Environment Air Quality Data - WHO

In [None]:
df_air_quality_who = pd.read_csv("../Data/enivronment/WHO/air_quality.csv")
df_air_quality_who = df_air_quality_who[["iso3", "country_name", "city", "year", "pm10_concentration", "pm25_concentration", "no2_concentration"]]
df_air_quality_who.columns = ["Country Code", "Country Name", "City", "Year", "PM10_Concentration",  "PM25_Concentration", "NO2_Concentration"]
df_air_quality_who = df_air_quality_who.groupby(["Country Code", "Country Name", "Year"])[["PM10_Concentration",  "PM25_Concentration", "NO2_Concentration"]].mean().reset_index()
df_air_quality_who.columns = ["Country Code", "Country Name", "Year", "PM10_ConcentrationAvg",  "PM25_ConcentrationAvg", "NO2_ConcentrationAvg"]
df_air_quality_who["Year"] = df_air_quality_who["Year"].astype(int)
df_air_quality_who

Unnamed: 0,Country Code,Country Name,Year,PM10_ConcentrationAvg,PM25_ConcentrationAvg,NO2_ConcentrationAvg
0,AFG,Afghanistan,2019,,119.774000,
1,ALB,Albania,2014,28.181800,12.910750,22.387800
2,ALB,Albania,2015,25.467500,16.355750,18.094000
3,ALB,Albania,2016,26.074250,16.847500,17.044200
4,ALB,Albania,2017,28.301250,17.493750,17.376000
...,...,...,...,...,...,...
856,ZAF,South Africa,2017,31.283000,15.886000,20.182500
857,ZAF,South Africa,2018,40.729000,22.498250,30.163000
858,ZAF,South Africa,2019,44.204471,21.539143,19.648000
859,ZAF,South Africa,2020,38.231467,23.539231,17.292067
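
The city-to-country averaging above can be checked on a toy table. This is a minimal sketch with made-up city values (the numbers and city names are illustrative assumptions, not WHO data); it shows that `mean()` skips missing readings and only returns NaN when a country-year has no reading at all, which seems to be the case for the PM10 column of Afghanistan in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical city-level readings (illustrative values only)
cities = pd.DataFrame({
    "Country Code": ["ALB", "ALB", "AFG"],
    "Country Name": ["Albania", "Albania", "Afghanistan"],
    "City": ["Tirana", "Durres", "Kabul"],
    "Year": [2014, 2014, 2019],
    "PM10_Concentration": [30.0, 26.0, np.nan],
    "PM25_Concentration": [14.0, np.nan, 119.0],
})

# Average all cities of a country within a year; NaN readings are ignored
country_avg = (
    cities
    .groupby(["Country Code", "Country Name", "Year"])[["PM10_Concentration", "PM25_Concentration"]]
    .mean()
    .reset_index()
)

alb = country_avg[country_avg["Country Code"] == "ALB"].iloc[0]
afg = country_avg[country_avg["Country Code"] == "AFG"].iloc[0]
print(country_avg)
```

Because missing readings are skipped, a country-year average may rest on a single city; counting the non-null readings per group (for example with `.count()`) would make that visible.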


## Putting data together

In [18]:
def normalize_country_name(name):
    return (
        str(name).lower()
        .replace(" ", "")
        .replace("-", "")
        .replace(".", "")
        .strip()
    )

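def smart_merge(df1, df2):
    """Outer-merge two dataframes on Country Code + Year, with a name-based fallback."""
    # Work on copies so the caller's dataframes are not modified
    df1, df2 = df1.copy(), df2.copy()

    # Add normalized names
    for df in [df1, df2]:
        df["Country Name Clean"] = df["Country Name"].apply(normalize_country_name)

    # Try merge on Country Code + Year
    merged = pd.merge(
        df1, df2,
        on=["Country Code", "Year"],
        how="outer",
        suffixes=("_1", "_2"),
        indicator=True
    )

    # Find unmatched and try name-based merge.
    # Note: an outer merge keeps every Country Code from both inputs, so these
    # two selections are empty and the name-based merge below adds no rows.
    unmatched_left = df1[~df1["Country Code"].isin(merged["Country Code"])]
    unmatched_right = df2[~df2["Country Code"].isin(merged["Country Code"])]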

    merged_name = pd.merge(
        unmatched_left, unmatched_right,
        on=["Country Name Clean", "Year"],
        how="inner",
        suffixes=("_1", "_2")
    )

    # Combine both
    final = pd.concat([merged, merged_name], ignore_index=True)

    # Keep one version of name/code/year
    final["Country Name"] = final.get("Country Name_1", None).combine_first(final.get("Country Name_2", None))
    final["Country Code"] = final.get("Country Code", None)
    final["Year"] = final.get("Year", None)

    # Drop redundant columns
    final = final.drop(columns=[
        "Country Name_1", "Country Name_2",
        "Country Code_1", "Country Code_2",
        "Country Name Clean", "Country Name Clean_1", "Country Name Clean_2",
        "_merge"
    ], errors="ignore")

    return final
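
A name-based fallback needs to know which rows found no partner on `Country Code` + `Year`. The sketch below is one possible way to get them, using the `_merge` column that `indicator=True` already produces. The two toy dataframes, the non-ISO code `KR`, and the column names `GDP`, `PM25` and `key` are assumptions made for illustration; this is not the notebook's implementation.

```python
import pandas as pd

def normalize_country_name(name):
    # Same normalization as in the notebook cell above
    return str(name).lower().replace(" ", "").replace("-", "").replace(".", "").strip()

# Hypothetical inputs: the second source uses a different code and spelling for South Korea
left = pd.DataFrame({
    "Country Code": ["ITA", "KOR"],
    "Country Name": ["Italy", "South Korea"],
    "Year": ["2020", "2020"],
    "GDP": [1.0, 2.0],
})
right = pd.DataFrame({
    "Country Code": ["ITA", "KR"],
    "Country Name": ["Italy", "south-korea"],
    "Year": ["2020", "2020"],
    "PM25": [15.0, 24.0],
})

merged = pd.merge(left, right, on=["Country Code", "Year"], how="outer",
                  suffixes=("_1", "_2"), indicator=True)

# An outer merge keeps every code from both sides, so no code is ever missing from it
all_codes = set(left["Country Code"]) | set(right["Country Code"])
missing_codes = all_codes - set(merged["Country Code"])

# Rows without a partner are flagged by the indicator column instead
left_only = merged.loc[merged["_merge"] == "left_only", ["Country Name_1", "Year"]]
right_only = merged.loc[merged["_merge"] == "right_only", ["Country Name_2", "Year"]]

# Candidate pairs for a name-based match
left_only = left_only.assign(key=left_only["Country Name_1"].map(normalize_country_name))
right_only = right_only.assign(key=right_only["Country Name_2"].map(normalize_country_name))
name_matches = left_only.merge(right_only, on=["key", "Year"], how="inner")
print(name_matches)
```

In this toy case the two South Korea rows pair up on the normalized name. Folding such pairs back into the merged table (and removing the two half-empty rows they replace) is left out of the sketch.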

In [19]:
# Dataframes to merge, in merge order
dataframes = [
    df_wdi, df_air_emissions_CO, df_air_emissions_CO2,
    df_pesticides_use_total_pesticides, df_pesticides_use_total_molluscicides,
    df_tobacco_consumption_pct, df_tobacco_consumption_nbcigarettes, df_tobacco_consumption_grperperson,
    df_air_poll_emissions, df_exposure_pollution, df_greenhouse_gas,
    df_use_forests_resources, df_land_use, df_air_quality_who,
]

# Cast Year to string everywhere so that the merge keys share one dtype
for df in dataframes:
    df['Year'] = df['Year'].astype(str)

# Apply smart_merge cumulatively
df_merged = reduce(smart_merge, dataframes)
df_merged

Unnamed: 0,Country Code,Year,Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),Carbon dioxide (CO2) emissions excluding LULUCF per capita (t CO2e/capita),"Compulsory education, duration (years)",GDP (current US$),GDP per capita (constant 2015 US$),Gini index,"Life expectancy at birth, total (years)",...,Tobacco consumption (Grammes per person),Sulphur oxides (tonnes),Fine particulate matter (PM2.5) (MCR_M3),Greenhouse gases (Kg CO2-equivalent Per Person),Intensity of use of forest resources (Percentage Points),Total area (Square Km),PM10_ConcentrationAvg,PM25_ConcentrationAvg,NO2_ConcentrationAvg,Country Name
0,A9,1990,,,,,,,,,...,,,24.543554,,,,,,,Latin America and the Caribbean
1,A9,1995,,,,,,,,,...,,,24.430119,,,,,,,Latin America and the Caribbean
2,A9,2000,,,,,,,,,...,,,24.324705,,,,,,,Latin America and the Caribbean
3,A9,2001,,,,,,,,,...,,,23.030916,,,,,,,Latin America and the Caribbean
4,A9,2002,,,,,,,,,...,,,23.079172,,,,,,,Latin America and the Caribbean
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
879633,ZWE,2019,30.3,46.7,0.731381759643275,7,25715657177.4682,1356.83821089692,50.3,61.06,...,,,,,,390760.0,,,,Zimbabwe
879634,ZWE,2020,30.5,52.7,0.584283212450557,7,26868564055.12,1230.19155671068,..,61.53,...,,,,,,390760.0,,,,Zimbabwe
879635,ZWE,2021,30.5,49,0.67247950745733,7,27240507841.6729,1311.53099950094,..,60.135,...,,,,,,390760.0,,,,Zimbabwe
879636,ZWE,2022,,,,,,,,,...,,,,,,390760.0,,,,Zimbabwe
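
The merged table above is very long, and one possible reason is that some inputs contain several rows per country and year: the daily-smokers table, for instance, lists Russia 2016 twice (30.3 and 49.5). When both sides of a merge repeat a key, every combination is kept. The sketch below illustrates this with a plain `pd.merge` standing in for `smart_merge`; the frames `b` and `c` and the columns `x` and `y` are hypothetical.

```python
from functools import reduce

import pandas as pd

keys = ["Country Code", "Year"]

# Two rows for the same key, as in the daily-smokers output above
a = pd.DataFrame({"Country Code": ["RUS", "RUS"], "Year": ["2016", "2016"],
                  "smokers_pct": [30.3, 49.5]})
# Hypothetical second and third sources
b = pd.DataFrame({"Country Code": ["RUS", "RUS"], "Year": ["2016", "2016"],
                  "x": [1.0, 2.0]})
c = pd.DataFrame({"Country Code": ["RUS"], "Year": ["2016"], "y": [5.0]})

def outer_merge(left, right):
    # Stand-in for smart_merge: outer join on the shared keys
    return pd.merge(left, right, on=keys, how="outer")

# reduce folds the list pairwise: merge(merge(a, b), c)
merged = reduce(outer_merge, [a, b, c])

# Count repeated keys per input before merging
dup_counts = {name: int(df.duplicated(subset=keys).sum())
              for name, df in {"a": a, "b": b, "c": c}.items()}
print(len(merged), dup_counts)
```

Checking `duplicated(subset=keys)` on each input before the merge is a cheap way to spot tables that still need to be filtered or aggregated to one row per country and year.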


## Final datasets creation, initial filtering and saving

In [20]:
def filter_high_nan(df, group_col, threshold=0.6):
    """
    Remove groups in `group_col` where the average NaN ratio > threshold.
    """
    nan_ratio = (
        df.drop(columns=[group_col])
          .groupby(df[group_col], observed=True)
          .agg(lambda x: x.isna().sum() / x.size)
          .mean(axis=1)
    )
    
    groups_to_drop = nan_ratio[nan_ratio > threshold].index
    return df[~df[group_col].isin(groups_to_drop)]
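
To see what the filter does, here is a small run on a made-up table (the countries `A` and `B` and the columns `x` and `y` are illustrative assumptions). The function is repeated so that the block runs on its own. Note that the average is taken over every column except the grouping one, so key columns that are never missing, such as `Year` here, lower the ratio.

```python
import numpy as np
import pandas as pd

def filter_high_nan(df, group_col, threshold=0.6):
    """Remove groups in `group_col` where the average NaN ratio > threshold."""
    nan_ratio = (
        df.drop(columns=[group_col])
          .groupby(df[group_col], observed=True)
          .agg(lambda x: x.isna().sum() / x.size)
          .mean(axis=1)
    )
    groups_to_drop = nan_ratio[nan_ratio > threshold].index
    return df[~df[group_col].isin(groups_to_drop)]

# Hypothetical data: country B has no indicator values at all
toy = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Year": ["2020", "2021", "2020", "2021"],
    "x": [1.0, 2.0, np.nan, np.nan],
    "y": [1.0, np.nan, np.nan, np.nan],
})

# Per-country NaN ratios: A = (0 + 0 + 0.5) / 3, B = (0 + 1 + 1) / 3
kept = filter_high_nan(toy, "Country Name")
print(kept)
```

Country `B` has a ratio of about 0.67 and is removed with the default threshold of 0.6, while `A` (about 0.17) is kept.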

In [None]:
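def save_final_csv(df_left, df_right, df_name):
    """Left-merge a health dataset with the merged indicators, drop sparse groups, and save it."""
    # Cast Year on a new dataframe instead of writing into the caller's one
    df_left = df_left.assign(Year=df_left['Year'].astype(str))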
    
    # Merge once
    df_merge = pd.merge(
        df_left, df_right,
        on=["Country Name", "Year"],
        how="left",
        suffixes=("_1", "_2")
    )
    
    # Filter first by Country Name, then by Year
    final = filter_high_nan(df_merge, "Country Name")
    final = filter_high_nan(final, "Year")
    
    print(f"Saving file {df_name}")
    final.to_csv(f"../Data/Refined/{df_name}")
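
The merge step inside `save_final_csv` can be tried on two tiny tables. This is a minimal sketch with hypothetical values; it shows that `assign` casts `Year` on a new dataframe and that a left merge keeps every health row, leaving NaN where no indicator is available.

```python
import pandas as pd

# Hypothetical health data with integer years
health = pd.DataFrame({
    "Country Name": ["Italy", "Italy"],
    "Year": [2020, 2021],
    "Value": [1.5, 1.7],
})
# Hypothetical indicators with string years, as in the merged table
indicators = pd.DataFrame({
    "Country Name": ["Italy"],
    "Year": ["2020"],
    "PM25_ConcentrationAvg": [15.0],
})

# assign returns a new dataframe, so `health` itself keeps its integer Year
health_str = health.assign(Year=health["Year"].astype(str))

merged = pd.merge(health_str, indicators, on=["Country Name", "Year"], how="left")
print(merged)
```

The 2021 row survives the merge with a missing `PM25_ConcentrationAvg`; rows like this are what the NaN filter evaluates afterwards.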

### 36 Datasets (Data from 2017 to 2021)

In [None]:
df_health = [df_COPD_deaths_nb, df_COPD_deaths_pct, df_COPD_deaths_rate, df_COPD_DALYs_nb, df_COPD_DALYs_pct, df_COPD_DALYs_rate,\
            df_COPD_prevalence_nb, df_COPD_prevalence_pct, df_COPD_prevalence_rate, df_COPD_incidence_nb, df_COPD_incidence_pct, df_COPD_incidence_rate,\
            df_asthma_deaths_nb, df_asthma_deaths_pct, df_asthma_deaths_rate, df_asthma_DALYs_nb, df_asthma_DALYs_pct, df_asthma_DALYs_rate,\
            df_asthma_prevalence_nb, df_asthma_prevalence_pct, df_asthma_prevalence_rate, df_asthma_incidence_nb, df_asthma_incidence_pct, df_asthma_incidence_rate,\
            df_tuberculosis_deaths_nb, df_tuberculosis_deaths_pct, df_tuberculosis_deaths_rate, df_tuberculosis_DALYs_nb, df_tuberculosis_DALYs_pct, df_tuberculosis_DALYs_rate,\
            df_tuberculosis_prevalence_nb, df_tuberculosis_prevalence_pct, df_tuberculosis_prevalence_rate, df_tuberculosis_incidence_nb, df_tuberculosis_incidence_pct, df_tuberculosis_incidence_rate]

df_health_names = ["1721/COPD_deaths_nb", "1721/COPD_deaths_pct", "1721/COPD_deaths_rate", "1721/COPD_DALYs_nb", "1721/COPD_DALYs_pct", "1721/COPD_DALYs_rate",\
            "1721/COPD_prevalence_nb", "1721/COPD_prevalence_pct", "1721/COPD_prevalence_rate", "1721/COPD_incidence_nb", "1721/COPD_incidence_pct", "1721/COPD_incidence_rate",\
            "1721/asthma_deaths_nb", "1721/asthma_deaths_pct", "1721/asthma_deaths_rate", "1721/asthma_DALYs_nb", "1721/asthma_DALYs_pct", "1721/asthma_DALYs_rate",\
            "1721/asthma_prevalence_nb", "1721/asthma_prevalence_pct", "1721/asthma_prevalence_rate", "1721/asthma_incidence_nb", "1721/asthma_incidence_pct", "1721/asthma_incidence_rate",\
            "1721/tuberculosis_deaths_nb", "1721/tuberculosis_deaths_pct", "1721/tuberculosis_deaths_rate", "1721/tuberculosis_DALYs_nb", "1721/tuberculosis_DALYs_pct", "1721/tuberculosis_DALYs_rate",\
            "1721/tuberculosis_prevalence_nb", "1721/tuberculosis_prevalence_pct", "1721/tuberculosis_prevalence_rate", "1721/tuberculosis_incidence_nb", "1721/tuberculosis_incidence_pct", "1721/tuberculosis_incidence_rate"]

for df, name in zip(df_health, df_health_names):
    save_final_csv(df, df_merged, name)

Saving file COPD_deaths_nb
Saving file COPD_deaths_pct
Saving file COPD_deaths_rate
Saving file COPD_DALYs_nb
Saving file COPD_DALYs_pct
Saving file COPD_DALYs_rate
Saving file COPD_prevalence_nb
Saving file COPD_prevalence_pct
Saving file COPD_prevalence_rate
Saving file COPD_incidence_nb
Saving file COPD_incidence_pct
Saving file COPD_incidence_rate
Saving file asthma_deaths_nb
Saving file asthma_deaths_pct
Saving file asthma_deaths_rate
Saving file asthma_DALYs_nb
Saving file asthma_DALYs_pct
Saving file asthma_DALYs_rate
Saving file asthma_prevalence_nb
Saving file asthma_prevalence_pct
Saving file asthma_prevalence_rate
Saving file asthma_incidence_nb
Saving file asthma_incidence_pct
Saving file asthma_incidence_rate
Saving file tuberculosis_deaths_nb
Saving file tuberculosis_deaths_pct
Saving file tuberculosis_deaths_rate
Saving file tuberculosis_DALYs_nb
Saving file tuberculosis_DALYs_pct
Saving file tuberculosis_DALYs_rate
Saving file tuberculosis_prevalence_nb
Saving file tuberculosis_prevalence_pct
Saving file tuberculosis_prevalence_rate
Saving file tuberculosis_incidence_nb
Saving file tuberculosis_incidence_pct
Saving file tuberculosis_incidence_rate
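
Since the 36 file names follow the pattern disease x measure x metric, they can also be generated rather than typed by hand. The sketch below is one way to do it with `itertools.product`; the variable names are assumptions, and the order of the three lists is chosen to reproduce the order of the hand-written list.

```python
from itertools import product

diseases = ["COPD", "asthma", "tuberculosis"]
measures = ["deaths", "DALYs", "prevalence", "incidence"]
metrics = ["nb", "pct", "rate"]

# The last list varies fastest, giving COPD_deaths_nb, COPD_deaths_pct, ...
names_1721 = [f"1721/{d}_{m}_{k}" for d, m, k in product(diseases, measures, metrics)]
names_8021 = [f"8021/{d}_{m}_{k}" for d, m, k in product(diseases, measures, metrics)]

print(len(names_1721), names_1721[:3])
```

The same triples could key a dictionary of dataframes, so that each name and its dataframe cannot drift apart.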


### 36 Datasets (Data from 1980 to 2021)

In [31]:
df_8021_health = [df_8021_COPD_deaths_nb, df_8021_COPD_deaths_pct, df_8021_COPD_deaths_rate, df_8021_COPD_DALYs_nb, df_8021_COPD_DALYs_pct, df_8021_COPD_DALYs_rate,\
            df_8021_COPD_prevalence_nb, df_8021_COPD_prevalence_pct, df_8021_COPD_prevalence_rate, df_8021_COPD_incidence_nb, df_8021_COPD_incidence_pct, df_8021_COPD_incidence_rate,\
            df_8021_asthma_deaths_nb, df_8021_asthma_deaths_pct, df_8021_asthma_deaths_rate, df_8021_asthma_DALYs_nb, df_8021_asthma_DALYs_pct, df_8021_asthma_DALYs_rate,\
            df_8021_asthma_prevalence_nb, df_8021_asthma_prevalence_pct, df_8021_asthma_prevalence_rate, df_8021_asthma_incidence_nb, df_8021_asthma_incidence_pct, df_8021_asthma_incidence_rate,\
            df_8021_tuberculosis_deaths_nb, df_8021_tuberculosis_deaths_pct, df_8021_tuberculosis_deaths_rate, df_8021_tuberculosis_DALYs_nb, df_8021_tuberculosis_DALYs_pct, df_8021_tuberculosis_DALYs_rate,\
            df_8021_tuberculosis_prevalence_nb, df_8021_tuberculosis_prevalence_pct, df_8021_tuberculosis_prevalence_rate, df_8021_tuberculosis_incidence_nb, df_8021_tuberculosis_incidence_pct, df_8021_tuberculosis_incidence_rate]

df_8021_health_names = ["8021/COPD_deaths_nb", "8021/COPD_deaths_pct", "8021/COPD_deaths_rate", "8021/COPD_DALYs_nb", "8021/COPD_DALYs_pct", "8021/COPD_DALYs_rate",\
            "8021/COPD_prevalence_nb", "8021/COPD_prevalence_pct", "8021/COPD_prevalence_rate", "8021/COPD_incidence_nb", "8021/COPD_incidence_pct", "8021/COPD_incidence_rate",\
            "8021/asthma_deaths_nb", "8021/asthma_deaths_pct", "8021/asthma_deaths_rate", "8021/asthma_DALYs_nb", "8021/asthma_DALYs_pct", "8021/asthma_DALYs_rate",\
            "8021/asthma_prevalence_nb", "8021/asthma_prevalence_pct", "8021/asthma_prevalence_rate", "8021/asthma_incidence_nb", "8021/asthma_incidence_pct", "8021/asthma_incidence_rate",\
            "8021/tuberculosis_deaths_nb", "8021/tuberculosis_deaths_pct", "8021/tuberculosis_deaths_rate", "8021/tuberculosis_DALYs_nb", "8021/tuberculosis_DALYs_pct", "8021/tuberculosis_DALYs_rate",\
            "8021/tuberculosis_prevalence_nb", "8021/tuberculosis_prevalence_pct", "8021/tuberculosis_prevalence_rate", "8021/tuberculosis_incidence_nb", "8021/tuberculosis_incidence_pct", "8021/tuberculosis_incidence_rate"]

for df, name in zip(df_8021_health, df_8021_health_names):
    save_final_csv(df, df_merged, name)

Saving file 8021/COPD_deaths_nb
Saving file 8021/COPD_deaths_pct
Saving file 8021/COPD_deaths_rate
Saving file 8021/COPD_DALYs_nb
Saving file 8021/COPD_DALYs_pct
Saving file 8021/COPD_DALYs_rate
Saving file 8021/COPD_prevalence_nb
Saving file 8021/COPD_prevalence_pct
Saving file 8021/COPD_prevalence_rate
Saving file 8021/COPD_incidence_nb
Saving file 8021/COPD_incidence_pct
Saving file 8021/COPD_incidence_rate
Saving file 8021/asthma_deaths_nb
Saving file 8021/asthma_deaths_pct
Saving file 8021/asthma_deaths_rate
Saving file 8021/asthma_DALYs_nb
Saving file 8021/asthma_DALYs_pct
Saving file 8021/asthma_DALYs_rate
Saving file 8021/asthma_prevalence_nb
Saving file 8021/asthma_prevalence_pct
Saving file 8021/asthma_prevalence_rate
Saving file 8021/asthma_incidence_nb
Saving file 8021/asthma_incidence_pct
Saving file 8021/asthma_incidence_rate
Saving file 8021/tuberculosis_deaths_nb
Saving file 8021/tuberculosis_deaths_pct
Saving file 8021/tuberculosis_deaths_rate
Saving file 8021/tuberculosis_DALYs_nb
Saving file 8021/tuberculosis_DALYs_pct
Saving file 8021/tuberculosis_DALYs_rate
Saving file 8021/tuberculosis_prevalence_nb
Saving file 8021/tuberculosis_prevalence_pct
Saving file 8021/tuberculosis_prevalence_rate
Saving file 8021/tuberculosis_incidence_nb
Saving file 8021/tuberculosis_incidence_pct
Saving file 8021/tuberculosis_incidence_rate


Any of the 72 final datasets (36 for 2017-2021 and 36 for 1980-2021) can be used in our study. Each one pairs a single disease, measure and metric with the merged indicator data.