# Notebook to Import and Merge Together Public Data

In this notebook, we merge data coming from several public datasets.
We end up with 6 final datasets covering the years 2015 to 2021, since these are the years common to most of our sources. Not all of the initial datasets are used: some are missing too many countries and years, and keeping them would mean dealing with many missing values, a potential source of bias.
In the later analysis we are free to use any one of these datasets, depending on the final scope (which disease, which measure).
The 6 datasets result from crossing each disease with each measure. We kept only the **rate** as the final metric.
In particular:
- disease: **COPD** (Chronic obstructive pulmonary disease) or **asthma** or **tuberculosis**
- measure: **incidence**, **prevalence**

So in the end we have 3 x 2 = 6 datasets.
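
The disease-by-measure split described above could also be expressed as a single `groupby`. The following is a minimal sketch on a toy frame with made-up values, not the notebook's own data; `toy` and `datasets` are illustrative names that do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-in for the disease table (values are invented for illustration)
toy = pd.DataFrame({
    "Disease": ["Asthma", "Asthma", "Tuberculosis", "Tuberculosis"],
    "Measure": ["Incidence", "Prevalence", "Incidence", "Prevalence"],
    "Year": [2015, 2015, 2015, 2015],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# One sub-dataset per (disease, measure) pair, keyed by that pair
datasets = {
    key: group.reset_index(drop=True)
    for key, group in toy.groupby(["Disease", "Measure"])
}

print(sorted(datasets))
```

With three diseases and two measures, the same dictionary would hold the six datasets, each reachable as for example `datasets[("Asthma", "Incidence")]`.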

In [74]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cfgrib
import xarray as xr
from functools import reduce
import pycountry  # used as a fallback by name_to_iso3 further down

## Respiratory disease data - 1990 to 2022 (no sex or age)

In [75]:
# Load the IHME GBD export, keep the relevant columns, and rename them
df_diseases = pd.read_csv("../Data/health/IHME/IHME-data-allCountries-allYears/IHME-GBD_2023_DATA-8de3a169-1.csv")
df_diseases = df_diseases[["measure", "location", "cause", "metric", "year", "val"]]
df_diseases.columns = ["Measure", "Country Name", "Disease", "Metric", "Year", "Value"]
df_diseases

Unnamed: 0,Measure,Country Name,Disease,Metric,Year,Value
0,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1990,1366.915295
1,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1991,1398.950832
2,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1992,1432.737779
3,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1993,1465.397615
4,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1994,1504.531328
...,...,...,...,...,...,...
41611,Incidence,Niue,Tuberculosis,Rate,2019,39.523161
41612,Incidence,Niue,Tuberculosis,Rate,2020,39.049782
41613,Incidence,Niue,Tuberculosis,Rate,2021,38.504013
41614,Incidence,Niue,Tuberculosis,Rate,2022,37.588972


In [76]:
df_countries = pd.read_csv("../Data/economic/wikipedia-iso-country-codes.csv")
df_countries = df_countries[['English short name lower case', 'Alpha-3 code']]
df_countries.rename(columns={'English short name lower case': 'Country Name'}, inplace=True)
df_diseases = pd.merge(
        df_diseases, df_countries,
        on=["Country Name"],
        how="left",
        suffixes=("_1", "_2")
    )
df_diseases.rename(columns={'Alpha-3 code': 'Country Code'}, inplace=True)
df_diseases

Unnamed: 0,Measure,Country Name,Disease,Metric,Year,Value,Country Code
0,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1990,1366.915295,TWN
1,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1991,1398.950832,TWN
2,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1992,1432.737779,TWN
3,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1993,1465.397615,TWN
4,Prevalence,Taiwan,Chronic obstructive pulmonary disease,Rate,1994,1504.531328,TWN
...,...,...,...,...,...,...,...
41611,Incidence,Niue,Tuberculosis,Rate,2019,39.523161,NIU
41612,Incidence,Niue,Tuberculosis,Rate,2020,39.049782,NIU
41613,Incidence,Niue,Tuberculosis,Rate,2021,38.504013,NIU
41614,Incidence,Niue,Tuberculosis,Rate,2022,37.588972,NIU


In [77]:
countries_to_fix = df_diseases[df_diseases["Country Code"].isna()]["Country Name"].unique()

EXCEPTIONS = {
    # key = name as it appears in dataframe
    # value = correct ISO‑3 alpha‑3 code
    "Democratic People's Republic of Korea": "PRK",
    "Viet Nam":                               "VNM",
    "Micronesia (Federated States of)":       "FSM",
    "Czechia":                                "CZE",
    "North Macedonia":                        "MKD",
    "United States of America":               "USA",
    "Republic of Korea":                      "KOR",
    "Russian Federation":                     "RUS",
    "Republic of Moldova":                    "MDA",
    "Bolivia (Plurinational State of)":       "BOL",
    "Venezuela (Bolivarian Republic of)":     "VEN",
    "Palestine":                              "PSE",  
    "Iran (Islamic Republic of)":             "IRN",
    "Libya":                                  "LBY",
    "Türkiye":                                "TUR",
    "Democratic Republic of the Congo":       "COD",
    "United Republic of Tanzania":            "TZA",
    "Eswatini":                               "SWZ",
    "Cabo Verde":                             "CPV",
    "United States Virgin Islands":            "VIR",
    "South Sudan":                            "SSD"
}

def name_to_iso3(name: str) -> str | None:
    """
    Return the ISO‑3 alpha‑3 code for *name*.
    First tries the EXCEPTIONS dict, then falls back to pycountry.
    Returns None if no match is found.
    """
    # Exception list (covers the non‑standard spellings you gave)
    if name in EXCEPTIONS:
        return EXCEPTIONS[name]

    # Try a direct lookup via pycountry (matches the official ISO name)
    try:
        country = pycountry.countries.lookup(name)
        return country.alpha_3
    except LookupError:
        # No exact match
        return None

# Apply the mapping to the dataframe
for cn in countries_to_fix:
    iso3 = name_to_iso3(cn)

    if iso3 is None:
        print(f"⚠️  No ISO‑3 code found for '{cn}'. Skipping.")
        continue

    df_diseases.loc[
        df_diseases["Country Name"] == cn,
        "Country Code"
    ] = iso3
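
After a fix-up like the one above, it may be worth confirming that no rows are left without a code. The sketch below shows one way to do that on a toy frame; `toy`, `toy_exceptions`, and `still_missing` are hypothetical names, and "Atlantis" is a deliberately unmatched placeholder.

```python
import pandas as pd

# Toy frame: one row already has a code, three do not
toy = pd.DataFrame({
    "Country Name": ["France", "Viet Nam", "Czechia", "Atlantis"],
    "Country Code": ["FRA", None, None, None],
})
toy_exceptions = {"Viet Nam": "VNM", "Czechia": "CZE"}  # hypothetical subset

# Fill only the missing codes from the mapping; unknown names stay missing
mask = toy["Country Code"].isna()
toy.loc[mask, "Country Code"] = toy.loc[mask, "Country Name"].map(toy_exceptions)

# Names that still have no code after the mapping
still_missing = toy.loc[toy["Country Code"].isna(), "Country Name"].unique().tolist()
print(still_missing)
```

An empty `still_missing` list would indicate that every country name was resolved.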

In [78]:
# different diseases
print(df_diseases["Disease"].unique())

df_COPD = df_diseases[df_diseases["Disease"] == "Chronic obstructive pulmonary disease"]
df_asthma = df_diseases[df_diseases["Disease"] == "Asthma"]
df_tuberculosis = df_diseases[df_diseases["Disease"] == "Tuberculosis"]

['Chronic obstructive pulmonary disease' 'Tuberculosis' 'Asthma']


In [79]:
# combining with different measures
print(df_diseases["Measure"].unique())

df_COPD_prevalence = df_COPD[df_COPD["Measure"] == "Prevalence"]
df_COPD_incidence = df_COPD[df_COPD["Measure"] == "Incidence"]

df_asthma_prevalence = df_asthma[df_asthma["Measure"] == "Prevalence"]
df_asthma_incidence = df_asthma[df_asthma["Measure"] == "Incidence"]

df_tuberculosis_prevalence = df_tuberculosis[df_tuberculosis["Measure"] == "Prevalence"]
df_tuberculosis_incidence = df_tuberculosis[df_tuberculosis["Measure"] == "Incidence"]

['Prevalence' 'Incidence']


In [80]:
# combining with different metrics
print(df_diseases["Metric"].unique())

df_COPD_prevalence_rate = df_COPD_prevalence[df_COPD_prevalence["Metric"] == "Rate"]
df_COPD_incidence_rate = df_COPD_incidence[df_COPD_incidence["Metric"] == "Rate"]

df_asthma_prevalence_rate = df_asthma_prevalence[df_asthma_prevalence["Metric"] == "Rate"]
df_asthma_incidence_rate = df_asthma_incidence[df_asthma_incidence["Metric"] == "Rate"]

df_tuberculosis_prevalence_rate = df_tuberculosis_prevalence[df_tuberculosis_prevalence["Metric"] == "Rate"]
df_tuberculosis_incidence_rate = df_tuberculosis_incidence[df_tuberculosis_incidence["Metric"] == "Rate"]

['Rate']


## World Development Indicators Data - 1974 to 2021

In [81]:
df_wdi_1 = pd.read_csv("../Data/economic/WorldBankGroup/World_Development_Indicators/wdi_1.csv")
df_wdi_2 = pd.read_csv("../Data/economic/WorldBankGroup/World_Development_Indicators/wdi_2.csv", encoding="cp1252", engine="python")
df_wdi = pd.concat([df_wdi_1, df_wdi_2])
df_wdi.iloc[:-3, :]


Unnamed: 0,Country Name,Country Code,Series Name,Series Code,1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,Afghanistan,AFG,GDP (current US$),NY.GDP.MKTP.CD,..,..,..,..,..,..,...,19907329777.5872,20146416757.5987,20497128555.6972,19134221644.7325,18116572395.0772,18753456497.8159,18053222687.4126,18799444490.1128,19955929052.1496,14259995441.0759
1,Afghanistan,AFG,GDP per capita (constant 2015 US$),NY.GDP.PCAP.KD,..,..,..,..,..,..,...,568.929021458341,580.603833333096,575.146245808546,565.569730408751,563.872336723147,562.769574140988,553.125151688293,557.861533207459,527.834554499306,408.625855217403
2,Afghanistan,AFG,"Population, total",SP.POP.TOTL,12469127,12773954,13059851,13340756,13611441,13655567,...,30560034,31622704,32792523,33831764,34700612,35688935,36743039,37856121,39068979,40000412
3,Afghanistan,AFG,"Life expectancy at birth, total (years)",SP.DYN.LE00.IN,39.469,39.994,40.518,41.082,40.086,38.844,...,61.735,62.188,62.26,62.27,62.646,62.406,62.443,62.941,61.454,60.417
4,Afghanistan,AFG,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,210.7,207.5,204.1,200.4,196.6,192.9,...,71.3,68.7,66.4,64.2,62.3,60.4,58.6,56.9,55.3,53.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3985,World,WLD,Renewable electricity output (% of total elect...,EG.ELC.RNEW.ZS,24.3995400383505,24.1421558217958,22.2708064372341,21.8056902760591,22.183018520084,22.4624702326194,...,21.2031409070651,21.7907962916827,22.4313850607087,22.9852849250343,23.8879015015538,24.5284507220852,25.1733904780905,26.1929651576066,28.099222284205,27.8784853731492
3986,World,WLD,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,..,..,..,..,..,..,...,59.5495688485653,60.9543345446539,62.2904088976113,63.6969537121187,65.2272562828093,66.6831170081874,68.2440148869911,69.685504021452,71.1422843390958,72.4370389532721
3987,World,WLD,Access to electricity (% of population),EG.ELC.ACCS.ZS,..,..,..,..,..,..,...,84.9363734839566,85.7108410210463,86.1958012620982,86.9242629695724,88.1047725764967,88.9320882686089,89.797731809362,90.108768771248,90.3960882790908,91.3346473455962
3988,World,WLD,People using at least basic sanitation service...,SH.STA.BASS.ZS,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..


In [82]:
# Rename the year columns to just the year number (the isdigit() filter below relies on this)
year_cols = [c for c in df_wdi.columns if "[" in c]   # picks the YR… columns
rename_map = {c: c.split("[")[0].strip() for c in year_cols}
df_wdi = df_wdi.rename(columns=rename_map)

print("\nAfter renaming:", df_wdi.columns.tolist(), "\n")

# Melt - Collapse all year columns into one
# Identify the columns that hold the yearly values
year_columns = [c for c in df_wdi.columns if c.isdigit()]   # e.g. ['1974','1975',...]

# Melt (wide → long)
df_long = df_wdi.melt(
    id_vars=['Country Code', 'Series Name', 'Series Code'],
    value_vars=year_columns,
    var_name='Year',          # name of the new column that will hold the year
    value_name='Value'        # name of the column that will hold the measurement
)

print("\nShape after melt:", df_long.shape)

# Pivot - Spread the different series into separate columns
df_tidy = df_long.pivot_table(
    index=['Country Code', 'Year'],   # what defines a unique row
    columns='Series Name',                            # each distinct series becomes a column
    values='Value',                                   # fill cells with the measurement
    aggfunc='first'                                   # there should be only one value per cell
).reset_index()

# After the pivot, the columns form a flat Index that still carries the name "Series Name".
# Drop that name for easier use:
df_tidy.columns.name = None          # drop the name of the columns axis
df_tidy = df_tidy.rename_axis(None, axis=1)   # same effect as the line above

print("\nFinal shape:", df_tidy.shape)
df_wdi = df_tidy
df_wdi


After renaming: ['Country Name', 'Country Code', 'Series Name', 'Series Code', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'] 


Shape after melt: (383424, 5)

Final shape: (12768, 17)


Unnamed: 0,Country Code,Year,Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),Carbon dioxide (CO2) emissions excluding LULUCF per capita (t CO2e/capita),"Compulsory education, duration (years)",GDP (current US$),GDP per capita (constant 2015 US$),Gini index,"Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",People using at least basic sanitation services (% of population),"Population, total",Poverty headcount ratio at national poverty lines (% of population),Renewable electricity output (% of total electricity output),Surface area (sq. km),"Unemployment, total (% of total labor force) (national estimate)"
0,ABW,1974,..,..,0.745514061937651,..,..,..,..,69.278,..,..,58349,..,..,180,..
1,ABW,1975,..,..,0.984647053778197,..,..,..,..,69.564,..,..,58295,..,..,180,..
2,ABW,1976,..,..,0.966282894736842,..,..,..,..,69.808,..,..,58368,..,..,180,..
3,ABW,1977,..,..,1.14544213041994,..,..,..,..,70.054,..,..,58580,..,..,180,..
4,ABW,1978,..,..,1.22328841704097,..,..,..,..,70.271,..,..,58776,..,..,180,..
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12763,ZWE,2017,30.1,44,0.714627028745081,7,51074726484.0037,1422.1934603003,44.3,60.263,49.2,36.9416742711943,14812482,30.4,55.1814329227917,390760,..
12764,ZWE,2018,30.3,45.4,0.816125522899006,7,34156057417.3285,1471.39488971183,..,60.906,47.4,36.3571601293685,15034452,..,63.0333013128402,390760,..
12765,ZWE,2019,30.3,46.7,0.731381759643275,7,25715657177.4682,1356.83821089692,50.3,61.06,46,35.7743358079873,15271368,38.3,68.8452182208443,390760,7.373
12766,ZWE,2020,30.5,52.7,0.584283212450557,7,26868564055.12,1230.19155671068,..,61.53,44.9,35.1923618234591,15526888,..,60.7855537239622,390760,..
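
The output above still contains the `..` placeholder for missing values, stored as strings. Because `notna()` treats such strings as present, they would count as data in any later completeness check. A minimal sketch of one way to convert them, assuming a toy frame that only mimics the tidy layout (`toy`, `key_cols`, and `value_cols` are illustrative names):

```python
import pandas as pd

# Toy frame mimicking the tidy WDI layout, where ".." marks a missing value
toy = pd.DataFrame({
    "Country Code": ["ABW", "ABW"],
    "Year": ["1974", "1975"],
    "Gini index": ["..", "29.0"],
    "Population, total": ["58349", "58295"],
})

key_cols = ["Country Code", "Year"]
value_cols = [c for c in toy.columns if c not in key_cols]

# errors="coerce" turns anything non-numeric (including "..") into NaN
toy[value_cols] = toy[value_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```

Whether to apply such a conversion before or after the merge is a design choice; doing it early keeps the indicator columns numeric throughout.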


In [83]:
df_wdi["Country Code"].unique()

array(['ABW', 'AFE', 'AFG', 'AFW', 'AGO', 'ALB', 'AND', 'ARB', 'ARE',
       'ARG', 'ARM', 'ASM', 'ATG', 'AUS', 'AUT', 'AZE', 'BDI', 'BEL',
       'BEN', 'BFA', 'BGD', 'BGR', 'BHR', 'BHS', 'BIH', 'BLR', 'BLZ',
       'BMU', 'BOL', 'BRA', 'BRB', 'BRN', 'BTN', 'BWA', 'CAF', 'CAN',
       'CEB', 'CHE', 'CHI', 'CHL', 'CHN', 'CIV', 'CMR', 'COD', 'COG',
       'COL', 'COM', 'CPV', 'CRI', 'CSS', 'CUB', 'CUW', 'CYM', 'CYP',
       'CZE', 'DEU', 'DJI', 'DMA', 'DNK', 'DOM', 'DZA', 'EAP', 'EAR',
       'EAS', 'ECA', 'ECS', 'ECU', 'EGY', 'EMU', 'ERI', 'ESP', 'EST',
       'ETH', 'EUU', 'FCS', 'FIN', 'FJI', 'FRA', 'FRO', 'FSM', 'GAB',
       'GBR', 'GEO', 'GHA', 'GIB', 'GIN', 'GMB', 'GNB', 'GNQ', 'GRC',
       'GRD', 'GRL', 'GTM', 'GUM', 'GUY', 'HIC', 'HKG', 'HND', 'HPC',
       'HRV', 'HTI', 'HUN', 'IBD', 'IBT', 'IDA', 'IDB', 'IDN', 'IDX',
       'IMN', 'IND', 'INX', 'IRL', 'IRN', 'IRQ', 'ISL', 'ISR', 'ITA',
       'JAM', 'JOR', 'JPN', 'KAZ', 'KEN', 'KGZ', 'KHM', 'KIR', 'KNA',
       'KOR', 'KWT',

In [84]:
df_wdi[df_wdi["Country Code"]=="GBR"]

Unnamed: 0,Country Code,Year,Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),Carbon dioxide (CO2) emissions excluding LULUCF per capita (t CO2e/capita),"Compulsory education, duration (years)",GDP (current US$),GDP per capita (constant 2015 US$),Gini index,"Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",People using at least basic sanitation services (% of population),"Population, total",Poverty headcount ratio at national poverty lines (% of population),Renewable electricity output (% of total electricity output),Surface area (sq. km),"Unemployment, total (% of total labor force) (national estimate)"
3888,GBR,1974,..,..,11.4643517352507,..,206131369798.971,21413.6423467213,29.0,72.5243902439024,16.1,..,56229974,..,..,243610,2.6
3889,GBR,1975,..,..,11.092011140793,..,241756637168.142,21099.6465690657,27.9,72.7243902439024,15.4,..,56225800,..,..,243610,4.0
3890,GBR,1976,..,..,11.2042207808842,..,232614555256.065,21719.0454960571,27.6,72.7756097560976,14.8,..,56211968,..,..,243610,5.5
3891,GBR,1977,..,..,11.3591890676593,..,263066457352.172,22260.1620046358,27.1,73.2243902439024,14.1,..,56193492,..,..,243610,5.8
3892,GBR,1978,..,..,11.325919847256,..,335883029721.956,23194.7939594163,27.2,73.1756097560976,13.4,..,56196504,..,..,243610,5.7
3893,GBR,1979,..,..,11.9591122370349,..,438994070309.191,24042.7876941013,27.4,73.2756097560976,12.7,..,56246951,..,..,243610,5.3
3894,GBR,1980,..,..,10.8023416325285,..,564947710899.373,23526.2555043517,28.5,73.6756097560976,12.1,..,56314216,..,..,243610,6.8
3895,GBR,1981,..,..,10.4559020832758,..,540765675241.158,23332.8025676938,29.7,74.0268292682927,11.4,..,56333829,..,..,243610,10.4
3896,GBR,1982,..,..,10.1769480684085,..,515048916841.37,23806.7980916528,29.9,74.1780487804878,10.8,..,56313641,..,..,243610,10.9
3897,GBR,1983,..,..,10.0521333485571,..,489618008185.539,24803.4271519669,29.8,74.3780487804878,10.3,..,56332848,..,..,243610,11.088


## Environment Air Pollutants Emissions Data - OECD

In [85]:
df_air_poll_emissions = pd.read_csv("../Data/enivronment/OECD/air_pollutants_emissions.csv")
pollutant_col = df_air_poll_emissions.pivot(columns='Pollutant', values='OBS_VALUE')
df_air_poll_emissions = pd.concat([df_air_poll_emissions, pollutant_col], axis = 1)
df_air_poll_emissions = df_air_poll_emissions[["REF_AREA", "TIME_PERIOD", "Sulphur oxides"]] #unit measure is T (tonnes) for all
df_air_poll_emissions.columns = ["Country Code", "Year", "Sulphur oxides (tonnes)"]
df_air_poll_emissions

Unnamed: 0,Country Code,Year,Sulphur oxides (tonnes)
0,OECD,2015,14293.0000
1,OECD,2016,13498.9300
2,OECD,2017,13012.7900
3,OECD,2018,12739.2900
4,OECD,2019,12058.2600
...,...,...,...
508,UKR,2019,840.3821
509,UKR,2020,755.6960
510,UKR,2021,568.1924
511,UKR,2022,334.9731


In [86]:
# OECD aggregates, not individual countries
exclude_ensemble_countries = ['OECD', 'OECDE']
df_air_poll_emissions = df_air_poll_emissions[~df_air_poll_emissions['Country Code'].isin(exclude_ensemble_countries)]
df_air_poll_emissions["Country Code"].unique()

array(['AUS', 'AUT', 'BEL', 'CAN', 'CHL', 'COL', 'CZE', 'DNK', 'EST',
       'FIN', 'FRA', 'DEU', 'GRC', 'HUN', 'ISL', 'IRL', 'ISR', 'ITA',
       'JPN', 'KOR', 'LVA', 'LTU', 'LUX', 'MEX', 'NLD', 'NZL', 'NOR',
       'POL', 'PRT', 'SVK', 'SVN', 'ESP', 'SWE', 'CHE', 'TUR', 'GBR',
       'USA', 'ALB', 'ARM', 'AZE', 'BLR', 'BIH', 'BGR', 'HRV', 'CYP',
       'GEO', 'KAZ', 'KGZ', 'MLT', 'MCO', 'MNE', 'MKD', 'ROU', 'RUS',
       'SRB', 'TJK', 'UKR'], dtype=object)
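
The OECD files in this notebook share a long format (`REF_AREA`, `TIME_PERIOD`, a category column, `OBS_VALUE`), and the same pivot-and-concat pattern is repeated for each of them. As an alternative sketch, the selection could be written as a plain row filter. `select_series` is a hypothetical helper, not part of the notebook, and the toy values are invented. Unlike the pivot approach, it drops rows belonging to other categories instead of leaving them as NaN.

```python
import pandas as pd

# Toy long-format table in the shape of the OECD exports
toy = pd.DataFrame({
    "REF_AREA": ["AUS", "AUS", "AUT"],
    "TIME_PERIOD": [2015, 2015, 2015],
    "Pollutant": ["Sulphur oxides", "Nitrogen oxides", "Sulphur oxides"],
    "OBS_VALUE": [10.0, 20.0, 30.0],
})

def select_series(df, category_col, category, new_name):
    """Keep one category from a long-format table and rename its value column."""
    out = df.loc[df[category_col] == category, ["REF_AREA", "TIME_PERIOD", "OBS_VALUE"]]
    out = out.rename(columns={
        "REF_AREA": "Country Code",
        "TIME_PERIOD": "Year",
        "OBS_VALUE": new_name,
    })
    return out.reset_index(drop=True)

sox = select_series(toy, "Pollutant", "Sulphur oxides", "Sulphur oxides (tonnes)")
print(sox)
```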

## Greenhouse Gas Emissions - OECD

In [87]:
df_greenhouse_gas = pd.read_csv("../Data/enivronment/OECD/greenhouse_gas_emissions.csv")
pollutant_col = df_greenhouse_gas.pivot(columns='Pollutant', values='OBS_VALUE')
df_greenhouse_gas = pd.concat([df_greenhouse_gas, pollutant_col], axis = 1)
df_greenhouse_gas = df_greenhouse_gas[["REF_AREA", "TIME_PERIOD", "Greenhouse gases"]] #unit measure is Kg of CO2-equivalent per person for all
df_greenhouse_gas.columns = ["Country Code", "Year", "Greenhouse gases (Kg CO2-equivalent Per Person)"]
df_greenhouse_gas

Unnamed: 0,Country Code,Year,Greenhouse gases (Kg CO2-equivalent Per Person)
0,OECDE,2014,8.223466
1,OECDE,2015,8.240490
2,OECDE,2016,8.185606
3,OECDE,2017,8.193707
4,OECDE,2018,8.002245
...,...,...,...
797,THA,2020,5.205036
798,UKR,2020,7.278898
799,URY,2020,10.706970
800,UZB,2020,5.595903


In [88]:
# EU and OECD aggregates, not individual countries
exclude_ensemble_countries = ['EU27_2020', 'OECD', 'OECDE', 'OECDA', 'OECDSO']
df_greenhouse_gas = df_greenhouse_gas[~df_greenhouse_gas['Country Code'].isin(exclude_ensemble_countries)]
df_greenhouse_gas["Country Code"].unique()

array(['AUS', 'AUT', 'BEL', 'CAN', 'CZE', 'DNK', 'EST', 'FIN', 'FRA',
       'DEU', 'GRC', 'HUN', 'ISL', 'IRL', 'ISR', 'ITA', 'JPN', 'LVA',
       'LTU', 'LUX', 'NLD', 'NZL', 'NOR', 'POL', 'PRT', 'SVK', 'SVN',
       'ESP', 'SWE', 'CHE', 'TUR', 'GBR', 'BLR', 'BGR', 'HRV', 'CYP',
       'KAZ', 'LIE', 'MLT', 'MCO', 'ROU', 'RUS', 'SRB', 'UKR', 'CHL',
       'COL', 'CRI', 'KOR', 'MEX', 'USA', 'DZA', 'ARG', 'BRA', 'BRN',
       'CIV', 'CUB', 'EGY', 'GEO', 'GHA', 'GTM', 'GNB', 'GUY', 'VAT',
       'IDN', 'KEN', 'MYS', 'MDV', 'MUS', 'MDA', 'NAM', 'NPL', 'NGA',
       'PAN', 'ZAF', 'UZB', 'VEN', 'MAR', 'AZE', 'BTN', 'LBN', 'THA',
       'URY', 'SAU', 'IND', 'CHN'], dtype=object)

## Intensity Use of Forests Resources - OECD

In [89]:
df_use_forests_resources = pd.read_csv("../Data/enivronment/OECD/intensity_use_forests_resources.csv")
measure_col = df_use_forests_resources.pivot(columns='Measure', values='OBS_VALUE')
df_use_forests_resources = pd.concat([df_use_forests_resources, measure_col], axis = 1)
df_use_forests_resources = df_use_forests_resources[["REF_AREA", "TIME_PERIOD", "Intensity of use of forest resources"]] #unit measure is Percentage Points for all
df_use_forests_resources.columns = ["Country Code", "Year", "Intensity of use of forest resources (Percentage Points)"]
df_use_forests_resources

Unnamed: 0,Country Code,Year,Intensity of use of forest resources (Percentage Points)
0,AUS,2015,0.830061
1,AUS,2016,0.915014
2,AUS,2017,1.010009
3,AUS,2018,1.001892
4,AUS,2019,0.991555
...,...,...,...
283,NZL,2013,0.657170
284,SVK,2017,0.781253
285,CHE,2010,0.713892
286,CHE,2015,0.695307


In [90]:
df_use_forests_resources["Country Code"].unique()

array(['AUS', 'BEL', 'CHL', 'CZE', 'DNK', 'FIN', 'FRA', 'DEU', 'HUN',
       'ISL', 'IRL', 'ITA', 'KOR', 'LVA', 'LTU', 'LUX', 'NLD', 'NZL',
       'NOR', 'SVK', 'SVN', 'ESP', 'SWE', 'CHE', 'TUR', 'GBR', 'EST',
       'POL', 'USA', 'BGR', 'HRV', 'AUT'], dtype=object)

## Land Use - OECD

In [91]:
df_land_use = pd.read_csv("../Data/enivronment/OECD/land_use.csv")
measure_col = df_land_use.pivot(columns='Measure', values='OBS_VALUE')
df_land_use = pd.concat([df_land_use, measure_col], axis = 1)
df_land_use = df_land_use[["REF_AREA", "TIME_PERIOD", "Total area"]] #unit measure is Square Km for all
df_land_use.columns = ["Country Code", "Year", "Total area (Square Km)"]
df_land_use

Unnamed: 0,Country Code,Year,Total area (Square Km)
0,SHN,2010,390.0
1,SHN,2011,390.0
2,SHN,2012,390.0
3,SHN,2013,390.0
4,SHN,2014,390.0
...,...,...,...
3211,ZWE,2019,390760.0
3212,ZWE,2020,390760.0
3213,ZWE,2021,390760.0
3214,ZWE,2022,390760.0


In [92]:
# OECD aggregates, not individual countries
exclude_ensemble_countries = ['OECD', 'OECDE', 'OECDA', 'OECDSO']
df_land_use = df_land_use[~df_land_use['Country Code'].isin(exclude_ensemble_countries)]
df_land_use["Country Code"].unique()

array(['SHN', 'MNP', 'PLW', 'SYC', 'GIB', 'TKL', 'COK', 'SPM', 'DMA',
       'TON', 'NIU', 'KNA', 'VCT', 'NRU', 'ABW', 'MHL', 'VIR', 'AUS',
       'AUT', 'BEL', 'CAN', 'CHL', 'COL', 'CRI', 'CZE', 'DNK', 'EST',
       'FIN', 'FRA', 'DEU', 'GRC', 'HUN', 'ISL', 'IRL', 'ISR', 'ITA',
       'JPN', 'KOR', 'LVA', 'LTU', 'LUX', 'MEX', 'NLD', 'NZL', 'NOR',
       'POL', 'PRT', 'SVK', 'SVN', 'ESP', 'SWE', 'CHE', 'TUR', 'GBR',
       'USA', 'AFG', 'ALB', 'DZA', 'ASM', 'AGO', 'AND', 'AIA', 'ATG',
       'ARG', 'ARM', 'AZE', 'BHS', 'BHR', 'BGD', 'BRB', 'BLR', 'BLZ',
       'BEN', 'BMU', 'BTN', 'BOL', 'BIH', 'BWA', 'BRA', 'VGB', 'BRN',
       'BGR', 'BFA', 'BDI', 'CPV', 'KHM', 'CMR', 'CYM', 'CAF', 'TCD',
       'CHN', 'COM', 'COG', 'CIV', 'HRV', 'CUB', 'CUW', 'CYP', 'PRK',
       'COD', 'DJI', 'DOM', 'ECU', 'EGY', 'SLV', 'GNQ', 'ERI', 'SWZ',
       'ETH', 'FLK', 'FRO', 'FJI', 'GUF', 'PYF', 'GAB', 'GMB', 'GEO',
       'GHA', 'GRL', 'GRD', 'GLP', 'GUM', 'GTM', 'GIN', 'GNB', 'GUY',
       'HTI', 'VAT',

## Meteorological Data - ERA5

In [93]:
df_meteo_data = pd.read_csv("../Data/enivronment/era5/era5_climate_country.csv")
df_meteo_data

Unnamed: 0,Year,Country Code,u10,v10,d2m,t2m,sst,sp,skt,blh
0,1980,IDN,-0.731723,0.000275,295.007338,298.317703,34.691068,98360.698946,298.493671,426.010201
1,1980,MYS,0.317648,0.064888,294.381366,297.039004,71.816384,96500.498069,297.242948,342.285277
2,1980,CYP,0.641971,0.409293,285.697660,289.763028,290.739371,102027.155390,290.599272,867.391694
3,1980,IND,-0.736435,0.067963,288.873006,295.436724,149.975586,94774.793878,296.593739,629.126892
4,1980,CHN,0.685689,0.316234,283.616753,288.801672,166.547391,99887.850986,289.777521,703.330091
...,...,...,...,...,...,...,...,...,...,...
5791,2021,STP,-6.380368,0.477989,295.925461,299.805618,300.596619,100975.863281,300.393341,849.644623
5792,2021,ALA,-0.523110,0.754011,268.164795,270.973999,0.000000,95983.742188,270.478149,398.638153
5793,2021,SLB,-6.612650,3.167327,293.739627,298.514514,299.679770,101396.002459,299.436417,928.301249
5794,2021,VUT,-6.237249,2.279653,290.641373,295.628057,296.759196,101710.388737,296.538135,935.802239


## Pesticides Use - OECD

In [94]:
df_pesticides_use = pd.read_csv("../Data/enivronment/OECD/pesticides_use.csv")
measure_col = df_pesticides_use.pivot(columns='Measure', values='OBS_VALUE')[["Total molluscicides", "Total sales of agricultural pesticides"]]
df_pesticides_use = pd.concat([df_pesticides_use, measure_col], axis = 1)

df_pesticides_use_total_pesticides = df_pesticides_use[df_pesticides_use["Total sales of agricultural pesticides"].notnull()][["REF_AREA", "TIME_PERIOD", "Total sales of agricultural pesticides"]]
df_pesticides_use_total_molluscicides = df_pesticides_use[df_pesticides_use["Total molluscicides"].notnull()][["REF_AREA", "TIME_PERIOD", "Total molluscicides"]]
df_pesticides_use_total_pesticides.columns = ["Country Code", "Year", "Total sales of agricultural pesticides (tonnes)"]
df_pesticides_use_total_molluscicides.columns = ["Country Code", "Year", "Total molluscicides (tonnes)"]
display(df_pesticides_use_total_pesticides)
display(df_pesticides_use_total_molluscicides)

Unnamed: 0,Country Code,Year,Total sales of agricultural pesticides (tonnes)
0,AUS,2012,48687.875
1,AUS,2013,45177.187
2,AUS,2014,49857.349
3,AUS,2015,50921.602
4,AUS,2016,63416.482
...,...,...,...
481,VNM,2016,19154.000
482,VNM,2017,19154.000
483,VNM,2018,19154.000
484,VNM,2019,19154.000


Unnamed: 0,Country Code,Year,Total molluscicides (tonnes)
2000,AUT,2012,23.653
2001,AUT,2013,13.471
2002,AUT,2014,16.180
2003,AUT,2015,21.214
2004,AUT,2016,10.679
...,...,...,...
2240,ROU,2017,4.981
2241,ROU,2018,4.829
2242,ROU,2019,4.263
2243,ROU,2020,9.304


In [95]:
df_pesticides_use_total_pesticides["Country Code"].unique()

array(['AUS', 'AUT', 'BEL', 'CAN', 'CHL', 'COL', 'CRI', 'CZE', 'DNK',
       'EST', 'FIN', 'FRA', 'DEU', 'GRC', 'HUN', 'ISL', 'IRL', 'ISR',
       'ITA', 'JPN', 'KOR', 'LVA', 'LTU', 'LUX', 'MEX', 'NLD', 'NOR',
       'POL', 'PRT', 'SVK', 'SVN', 'ESP', 'SWE', 'CHE', 'TUR', 'GBR',
       'USA', 'ARG', 'BRA', 'BGR', 'CHN', 'HRV', 'CYP', 'IND', 'IDN',
       'KAZ', 'MLT', 'ROU', 'RUS', 'ZAF', 'UKR', 'VNM'], dtype=object)

## Tobacco Consumption - OECD

In [96]:
df_tobacco_consumption = pd.read_csv("../Data/health/OECD/tobacco_consumption.csv")
df_tobacco_consumption = df_tobacco_consumption[~df_tobacco_consumption['SEX'].isin(['M', 'F'])]
measure_col = df_tobacco_consumption.pivot(columns='Measure', values='OBS_VALUE')
df_tobacco_consumption = pd.concat([df_tobacco_consumption, measure_col], axis = 1)

df_tobacco_consumption_pct = df_tobacco_consumption[df_tobacco_consumption["Share of population who are daily smokers"].notnull()][["REF_AREA", "TIME_PERIOD", "Share of population who are daily smokers"]]
df_tobacco_consumption_pct.columns = ["Country Code", "Year", "Share of population who are daily smokers (Pct population)"]

df_tobacco_consumption = df_tobacco_consumption[df_tobacco_consumption["Tobacco consumption"].notnull()]
df_tobacco_consumption_nbcigarettes = df_tobacco_consumption[df_tobacco_consumption["Unit of measure"] == 'Cigarettes per smoker per day'][["REF_AREA", "TIME_PERIOD", "Tobacco consumption"]]
df_tobacco_consumption_grperperson = df_tobacco_consumption[df_tobacco_consumption["Unit of measure"] == 'Grammes per person'][["REF_AREA", "TIME_PERIOD", "Tobacco consumption"]]
df_tobacco_consumption_nbcigarettes.columns = ["Country Code", "Year", "Tobacco consumption (Cigarettes per smoker per day)"]
df_tobacco_consumption_grperperson.columns = ["Country Code", "Year", "Tobacco consumption (Grammes per person)"]

display(df_tobacco_consumption_pct)
display(df_tobacco_consumption_nbcigarettes)
display(df_tobacco_consumption_grperperson)

Unnamed: 0,Country Code,Year,Share of population who are daily smokers (Pct population)
554,AUS,2010,15.3
555,AUS,2013,13.0
556,AUS,2016,12.4
557,AUS,2019,11.2
558,AUS,2022,8.5
...,...,...,...
2506,NLD,2010,20.9
2509,NLD,2014,19.1
2513,SWE,2022,8.7
2515,RUS,2016,30.3


Unnamed: 0,Country Code,Year,Tobacco consumption (Cigarettes per smoker per day)
1526,AUS,2010,15.9
1527,AUS,2013,13.7
1528,AUS,2016,13.4
1529,AUS,2019,12.9
1530,AUS,2022,13.1
...,...,...,...
2461,ISR,2016,17.2
2463,LVA,2016,13.5
2464,NLD,2010,10.5
2465,NLD,2014,10.7


Unnamed: 0,Country Code,Year,Tobacco consumption (Grammes per person)
1756,AUS,2010,1009.0
1757,AUS,2012,964.7
1758,AUS,2013,915.8
1759,AUS,2014,830.8
1760,AUS,2015,790.0
...,...,...,...
2456,AUS,2011,971.7
2459,DEU,2022,1350.0
2462,ISR,2023,753.0
2466,NZL,2010,675.0


In [97]:
df_tobacco_consumption_pct['Country Code'].unique()

array(['AUS', 'AUT', 'BEL', 'CAN', 'CHL', 'CRI', 'CZE', 'DNK', 'EST',
       'FIN', 'FRA', 'DEU', 'GRC', 'HUN', 'ISL', 'IRL', 'ISR', 'ITA',
       'JPN', 'KOR', 'LVA', 'LTU', 'LUX', 'MEX', 'NLD', 'NZL', 'POL',
       'PRT', 'SVK', 'SVN', 'ESP', 'SWE', 'CHE', 'TUR', 'GBR', 'USA',
       'BRA', 'BGR', 'HRV', 'PER', 'ROU', 'RUS', 'COL', 'ARG', 'CHN',
       'IND', 'IDN', 'ZAF', 'UKR', 'NOR'], dtype=object)

## Environment Air Quality Data - WHO

In [98]:
df_air_quality_who = pd.read_csv("../Data/enivronment/WHO/air_quality.csv")
df_air_quality_who = df_air_quality_who[["iso3", "city", "year", "pm10_concentration", "pm25_concentration", "no2_concentration"]]
df_air_quality_who.columns = ["Country Code", "City", "Year", "PM10_Concentration",  "PM25_Concentration", "NO2_Concentration"]
df_air_quality_who = df_air_quality_who.groupby(["Country Code", "Year"])[["PM10_Concentration",  "PM25_Concentration", "NO2_Concentration"]].mean().reset_index()
df_air_quality_who.columns = ["Country Code", "Year", "PM10_ConcentrationAvg",  "PM25_ConcentrationAvg", "NO2_ConcentrationAvg"]
df_air_quality_who["Year"] = df_air_quality_who["Year"].astype(int)
df_air_quality_who

Unnamed: 0,Country Code,Year,PM10_ConcentrationAvg,PM25_ConcentrationAvg,NO2_ConcentrationAvg
0,AFG,2019,,119.774000,
1,ALB,2014,28.181800,12.910750,22.387800
2,ALB,2015,25.467500,16.355750,18.094000
3,ALB,2016,26.074250,16.847500,17.044200
4,ALB,2017,28.301250,17.493750,17.376000
...,...,...,...,...,...
846,ZAF,2017,31.283000,15.886000,20.182500
847,ZAF,2018,40.729000,22.498250,30.163000
848,ZAF,2019,44.204471,21.539143,19.648000
849,ZAF,2020,38.231467,23.539231,17.292067


In [99]:
df_air_quality_who['Country Code'].unique()

array(['AFG', 'ALB', 'AND', 'ARE', 'ARG', 'AUS', 'AUT', 'BEL', 'BEN',
       'BGD', 'BGR', 'BHR', 'BHS', 'BIH', 'BLR', 'BOL', 'BRA', 'BTN',
       'CAN', 'CHE', 'CHL', 'CHN', 'CMR', 'COL', 'CRI', 'CUB', 'CYP',
       'CZE', 'DEU', 'DNK', 'DOM', 'DZA', 'ECU', 'EGY', 'ESP', 'EST',
       'ETH', 'FIN', 'FJI', 'FRA', 'GBR', 'GEO', 'GHA', 'GRC', 'GTM',
       'HND', 'HRV', 'HUN', 'IDN', 'IND', 'IRL', 'IRN', 'IRQ', 'ISL',
       'ISR', 'ITA', 'JAM', 'JOR', 'JPN', 'KAZ', 'KEN', 'KGZ', 'KHM',
       'KOR', 'KWT', 'LAO', 'LBN', 'LIE', 'LKA', 'LTU', 'LUX', 'LVA',
       'MAR', 'MCO', 'MDG', 'MDV', 'MEX', 'MKD', 'MLT', 'MMR', 'MNE',
       'MNG', 'MUS', 'MYS', 'NGA', 'NLD', 'NOR', 'NPL', 'NZL', 'PAK',
       'PAN', 'PER', 'PHL', 'POL', 'PRI', 'PRT', 'PRY', 'QAT', 'ROU',
       'RUS', 'SAU', 'SEN', 'SGP', 'SLV', 'SRB', 'SVK', 'SVN', 'SWE',
       'THA', 'TJK', 'TKM', 'TTO', 'TUN', 'TUR', 'TZA', 'UGA', 'UKR',
       'URY', 'USA', 'UZB', 'VEN', 'VNM', 'ZAF'], dtype=object)
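
The WHO table is reported per city, and the cell above averages it to one row per country and year. The toy sketch below (invented values, illustrative names) shows how that aggregation behaves with gaps: `mean()` skips NaN within a group, and a group with no readings at all stays NaN.

```python
import numpy as np
import pandas as pd

# Two cities for ALB in 2015 (one lacks an NO2 reading), one city for AFG in 2019
toy = pd.DataFrame({
    "Country Code": ["ALB", "ALB", "AFG"],
    "Year": [2015, 2015, 2019],
    "PM25_Concentration": [10.0, 20.0, 119.774],
    "NO2_Concentration": [18.0, np.nan, np.nan],
})

# Average the city-level readings per country and year
country_avg = (
    toy.groupby(["Country Code", "Year"], as_index=False)
       [["PM25_Concentration", "NO2_Concentration"]]
       .mean()
)
print(country_avg)
```

A consequence worth keeping in mind is that a country average may rest on a single city, so the averaged columns are not equally reliable across countries.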

## Putting data together

In [100]:
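def inner_merge(df1, df2):
    # NOTE: despite its name, this performs an OUTER join, so every
    # (Country Code, Year) pair found in either frame is kept.
    merged = pd.merge(
        df1, df2,
        on=["Country Code", "Year"],
        how="outer"
    )
    return merged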
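
To make the effect of the join type concrete, here is a small sketch on two toy frames (invented values, illustrative names). It uses the `indicator=True` option of `pd.merge` to show where each row of an outer join comes from, and compares the row count with an inner join.

```python
import pandas as pd

left = pd.DataFrame({"Country Code": ["AUS", "AUT"], "Year": ["2015", "2015"], "a": [1, 2]})
right = pd.DataFrame({"Country Code": ["AUT", "BEL"], "Year": ["2015", "2015"], "b": [3, 4]})

keys = ["Country Code", "Year"]

# Outer join keeps all key pairs; the _merge column records their origin
outer = pd.merge(left, right, on=keys, how="outer", indicator=True)

# Inner join keeps only key pairs present in both frames
inner = pd.merge(left, right, on=keys, how="inner")

print(outer["_merge"].value_counts())
print(len(outer), len(inner))
```

With an outer join, rows missing from one source appear with NaN in that source's columns, which is what makes a completeness filter necessary afterwards.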

In [101]:
df_wdi = df_wdi.astype({'Year': 'string'})
df_air_quality_who = df_air_quality_who.astype({'Year': 'string'})
df_land_use = df_land_use.astype({'Year': 'string'})
df_greenhouse_gas = df_greenhouse_gas.astype({'Year': 'string'})
df_air_poll_emissions = df_air_poll_emissions.astype({'Year': 'string'})
df_pesticides_use_total_pesticides = df_pesticides_use_total_pesticides.astype({'Year': 'string'})
df_tobacco_consumption_pct = df_tobacco_consumption_pct.astype({'Year': 'string'})
df_meteo_data = df_meteo_data.astype({'Year': 'string'})

dataframes = [df_wdi, df_land_use, df_air_quality_who, df_greenhouse_gas, df_air_poll_emissions,
              df_pesticides_use_total_pesticides, df_tobacco_consumption_pct, df_meteo_data]

for df in dataframes:
    df['Country Code'] = df['Country Code'].str.strip().str.upper()
    df['Year'] = df['Year'].astype(str).str.strip()

# Apply inner_merge cumulatively (an outer join on Country Code and Year)
df_merged = reduce(inner_merge, dataframes)

# Compute completeness
non_key_cols = [c for c in df_merged.columns if c not in ["Country Code", "Year"]]
df_merged["data_completeness"] = df_merged[non_key_cols].notna().mean(axis=1)

threshold = 0.9
df_filtered = df_merged[df_merged["data_completeness"] >= threshold]

if df_filtered.empty:
    print(f"No rows with >= {threshold*100:.0f}% completeness, trying 0.6...")
    threshold = 0.6
    df_filtered = df_merged[df_merged["data_completeness"] >= threshold]

print(f"Keeping {len(df_filtered)} rows with >= {threshold*100:.0f}% completeness")

# Drop the helper column once the filtering is done
df_filtered = df_filtered.drop(columns=["data_completeness"])

print(df_filtered['Year'].unique())
print(df_filtered['Country Code'].unique())

Keeping 493 rows with >= 90% completeness
['2015' '2016' '2017' '2018' '2019' '2020' '2012' '2013' '2014' '2021'
 '2010' '2011']
['ALB' 'AUS' 'AUT' 'BEL' 'BGR' 'BIH' 'BLR' 'CHE' 'CHN' 'CYP' 'CZE' 'DEU'
 'DNK' 'ESP' 'EST' 'FIN' 'GBR' 'GEO' 'GRC' 'HRV' 'HUN' 'IDN' 'IND' 'ISR'
 'ITA' 'JPN' 'KAZ' 'KOR' 'LTU' 'LUX' 'LVA' 'MKD' 'MYS' 'NLD' 'NZL' 'POL'
 'ROU' 'RUS' 'SRB' 'SVK' 'SVN' 'SWE' 'THA' 'TUR' 'UKR' 'USA' 'ZAF']
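
To make the completeness score concrete, here is a minimal sketch on a toy frame. The columns `x1` to `x4`, the country codes, and the 0.75 threshold are illustrative assumptions, not values from the notebook. The score of a row is the share of its non-key columns that are not NaN.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two key columns and four feature columns.
df = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "Year": ["2015", "2015", "2015"],
    "x1": [1.0, np.nan, np.nan],
    "x2": [2.0, 5.0, np.nan],
    "x3": [3.0, 6.0, np.nan],
    "x4": [4.0, 7.0, 8.0],
})

key_cols = ["Country Code", "Year"]
value_cols = [c for c in df.columns if c not in key_cols]

# Row-wise share of non-missing values among the feature columns.
completeness = df[value_cols].notna().mean(axis=1)

# Keep rows at or above an illustrative threshold.
kept = df[completeness >= 0.75]

print(completeness.tolist())
print(kept["Country Code"].tolist())
```

Here the three rows score 1.0, 0.75 and 0.25, so only the first two are kept.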


In [102]:
df_filtered

Unnamed: 0,Country Code,Year,Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),Carbon dioxide (CO2) emissions excluding LULUCF per capita (t CO2e/capita),"Compulsory education, duration (years)",GDP (current US$),GDP per capita (constant 2015 US$),Gini index,"Life expectancy at birth, total (years)",...,Total sales of agricultural pesticides (tonnes),Share of population who are daily smokers (Pct population),u10,v10,d2m,t2m,sst,sp,skt,blh
343,ALB,2015,76.6,100,1.68830316766428,9,11470171826.9575,3981.72662261867,32.8,78.358,...,,,3.129017,1.121436,282.771270,286.020931,286.654837,101536.890591,286.564666,837.423412
344,ALB,2016,78.3,99.9,1.57515330650766,9,11988668784.6628,4143.98988316994,33.7,78.643,...,,,4.840293,1.204059,281.794459,285.207172,285.960591,101593.640812,285.867100,922.500902
345,ALB,2017,79.6,99.9,1.85891071277559,9,13258268435.6048,4283.98262744437,33.1,78.9,...,,,2.685428,0.199308,281.950864,285.313433,286.164351,101697.791112,286.068191,844.233317
346,ALB,2018,80.7,100,1.85097837827277,9,15379508329.7568,4452.23714652476,30.1,79.238,...,,,2.727313,0.766708,282.979019,286.166477,286.772324,101830.483277,286.686765,820.588157
347,ALB,2019,82,100,1.73551805047385,9,15585111614.0376,4563.4673625169,30.1,79.467,...,,,2.887687,1.093749,282.999927,286.175235,286.662972,101649.299121,286.577131,834.104547
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13826,ZAF,2017,85.8,84.4,8.15147357441279,9,381448814653.456,6125.6920507497,..,65.422,...,26857.0,,-1.545739,0.537433,289.005088,293.605999,295.018212,101652.492466,294.828647,895.751978
13827,ZAF,2018,86.6,84.7,8.08182130104548,9,405260723892.517,6117.2701406047,..,65.726,...,26857.0,20.3,-1.596374,0.674547,288.913919,293.405261,294.757231,101637.891350,294.569259,880.144254
13828,ZAF,2019,87.4,85,8.0316393172874,9,389330032224.269,6032.82972598668,..,66.071,...,26857.0,20.2,-2.176614,0.399050,288.893624,293.465185,294.767808,101678.802298,294.580473,904.959140
13829,ZAF,2020,88.1,90,6.91474134743811,9,337974655408.055,5569.58483319272,..,65.15,...,26857.0,20.2,-1.380377,0.060903,289.394558,293.841927,295.153098,101690.402977,294.970789,878.903095


## Final datasets creation, initial filtering and saving

In [103]:
def filter_high_nan(df, group_col, threshold=0.6):
    """
    Drop every group of `group_col` whose NaN ratio, averaged over all
    other columns, exceeds `threshold`.
    """
    nan_ratio = (
        df.drop(columns=[group_col])
          .groupby(df[group_col], observed=True)
          .agg(lambda x: x.isna().sum() / x.size)
          .mean(axis=1)
    )
    
    groups_to_drop = nan_ratio[nan_ratio > threshold].index
    return df[~df[group_col].isin(groups_to_drop)]
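
A small worked example may help to see what `filter_high_nan` removes. The sketch below repeats the function so that it runs on its own and applies it to a made-up frame (the columns `a` and `b` and the country codes are hypothetical). Note that the average is taken over all remaining columns, including key columns such as `Year` that are never missing, which lowers each group's ratio somewhat.

```python
import numpy as np
import pandas as pd

def filter_high_nan(df, group_col, threshold=0.6):
    """Drop groups whose NaN ratio, averaged over the other columns, exceeds threshold."""
    nan_ratio = (
        df.drop(columns=[group_col])
          .groupby(df[group_col], observed=True)
          .agg(lambda x: x.isna().sum() / x.size)
          .mean(axis=1)
    )
    groups_to_drop = nan_ratio[nan_ratio > threshold].index
    return df[~df[group_col].isin(groups_to_drop)]

# Toy frame (hypothetical values).
toy = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "Year": ["2015", "2016", "2015", "2016"],
    "a": [1.0, 2.0, np.nan, np.nan],
    "b": [3.0, np.nan, np.nan, np.nan],
})

# AAA: ratios (Year=0, a=0, b=0.5) average to about 0.17 -> kept
# BBB: ratios (Year=0, a=1, b=1)   average to about 0.67 -> dropped
result = filter_high_nan(toy, "Country Code")
print(result)
```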

In [104]:
def save_final_csv(df_left, df_right, df_name):
    # Work on a new frame so the caller's DataFrame is not modified in place
    df_left = df_left.assign(Year=df_left['Year'].astype(str))
    
    # Merge once
    df_merge = pd.merge(
        df_left, df_right,
        on=["Country Code", "Year"],
        how="inner",
        suffixes=("_1", "_2")
    )
    
    # Filter first by Country Code, then by Year
    df_filtered = filter_high_nan(df_merge, "Country Code")
    df_filtered = filter_high_nan(df_filtered, "Year")
    df_filtered = df_filtered.drop_duplicates()
    
    print(f"Saving file {df_name} with {df_filtered.shape[0]} rows")
    df_filtered.to_csv(f"../Data/Refined/{df_name}.csv", index=False)
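
The function above removes duplicated rows after the merge. As an optional safeguard, and only as a sketch that is not part of the original pipeline, the `validate` argument of `pd.merge` can flag duplicated keys before they multiply rows. The frames below are made up for illustration.

```python
import pandas as pd

keys = ["Country Code", "Year"]

# Toy frames (hypothetical values); `health` contains a duplicated key.
health = pd.DataFrame({
    "Country Code": ["ALB", "ALB"],
    "Year": ["2015", "2015"],
    "Value": [1.0, 1.0],
})
features = pd.DataFrame({
    "Country Code": ["ALB"],
    "Year": ["2015"],
    "pm25": [10.0],
})

# validate="one_to_one" raises MergeError if either side repeats a key.
try:
    pd.merge(health, features, on=keys, how="inner", validate="one_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# After removing duplicated keys, the same merge goes through.
deduped = health.drop_duplicates(subset=keys)
merged = pd.merge(deduped, features, on=keys, how="inner", validate="one_to_one")

print(duplicate_keys_detected, len(merged))
```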

### 6 Datasets (Data from 2010 to 2021)

In [105]:
df_COPD_prevalence_rate = df_COPD_prevalence_rate.astype({'Year': 'string'})
df_COPD_incidence_rate = df_COPD_incidence_rate.astype({'Year': 'string'})
df_asthma_prevalence_rate = df_asthma_prevalence_rate.astype({'Year': 'string'})
df_asthma_incidence_rate = df_asthma_incidence_rate.astype({'Year': 'string'})
df_tuberculosis_prevalence_rate = df_tuberculosis_prevalence_rate.astype({'Year': 'string'})
df_tuberculosis_incidence_rate = df_tuberculosis_incidence_rate.astype({'Year': 'string'})

df_health = [df_COPD_prevalence_rate, df_COPD_incidence_rate, df_asthma_prevalence_rate, df_asthma_incidence_rate,
             df_tuberculosis_prevalence_rate, df_tuberculosis_incidence_rate]

df_health_names = ["1021/COPD_prevalence_rate", "1021/COPD_incidence_rate", "1021/asthma_prevalence_rate",
                   "1021/asthma_incidence_rate", "1021/tuberculosis_prevalence_rate",
                   "1021/tuberculosis_incidence_rate"]

# Merge each health dataset with the filtered features and save it
for df_h, name in zip(df_health, df_health_names):
    save_final_csv(df_h, df_filtered, name)

Saving file 1021/COPD_prevalence_rate with 493 rows
Saving file 1021/COPD_incidence_rate with 493 rows
Saving file 1021/asthma_prevalence_rate with 493 rows
Saving file 1021/asthma_incidence_rate with 493 rows
Saving file 1021/tuberculosis_prevalence_rate with 493 rows
Saving file 1021/tuberculosis_incidence_rate with 493 rows


Any of the 6 final datasets can be used for our study. Each one pairs a disease (COPD, asthma, or tuberculosis) with a measure (incidence or prevalence), always expressed as a rate.