<a href="https://colab.research.google.com/github/annakpke/World-Bank-API/blob/main/API_Worldbank_Diarrhea.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

For this project, we explored global health data using the World Bank API to uncover interesting trends and patterns. The goal was to apply what we’ve learned in the Practical Introduction to Programming in Python course to analyze real-world data. We chose the World Bank API because of its rich collection of health-related indicators, which allowed us to investigate meaningful global health topics.

We focused on the following research question:

- To what extent does the gap in diarrhea prevalence among children under the age of 5 between the poorest and richest quintiles vary across countries in different income groups (low income, lower middle income, upper middle income)? How does the treatment of diarrhea cases correlate with the income group and the quintile considered?

To answer this question, we retrieved and cleaned the data using Python, tackled missing or inconsistent values, and created visualizations to make the findings more accessible. We also included interactive features, like ipywidgets, so users could engage with the data and explore different aspects dynamically.


## Research Question
To what extent does the gap in diarrhea prevalence among children under the age of 5 between the poorest and richest quintiles vary across countries in different income groups (low income, lower middle income, upper middle income)? How does the treatment of diarrhea cases correlate with the income group and the quintile considered?

To address this question, we examined the prevalence of diarrhea among children under the age of 5 and compared countries by income group. We also included the treatment of diarrhea for the same age group.

The figures (boxplots) will represent the diarrhea prevalence and treatment (%) among children under 5, comparing the poorest and richest quintiles within the countries across the three income groups.

First, we need to import all the necessary libraries and fetch the data from the World Bank API.

In [None]:
import requests
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt
import pandas as pd
import ipywidgets as widgets
from IPython.display import display, clear_output
from scipy.stats import ttest_ind
import csv

income_level_codes = {
    'Low income': "LIC",
    'Lower middle income': "LMC",
    'Upper middle income': "UMC"
}

# fetch income groups
# per page = 300 to get all countries in one request
url = "https://api.worldbank.org/v2/country?format=xml&per_page=300"
response = requests.get(url)

root = ET.fromstring(response.content)

# Define the namespace
namespace = {'wb': 'http://www.worldbank.org'}

# Extract country codes and income levels
country_dict = {}
country_dict['LIC'] = []
country_dict['LMC'] = []
country_dict['UMC'] = []

for country in root.findall('wb:country', namespace):
    country_id = country.get('id')
    income_level = country.find('wb:incomeLevel', namespace).text

    for description, code in income_level_codes.items():
        if description == income_level:
            country_dict[code].append(country_id)
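
The grouping step can be tried offline. The sketch below applies the same namespace-aware lookup to a small hand-written XML snippet that only imitates the shape of the country endpoint's response. The three sample countries and the helper name `group_by_income` are invented for illustration, and a dictionary lookup stands in for the inner loop over `income_level_codes`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that imitates the shape of the country endpoint's XML
SAMPLE_XML = """<wb:countries xmlns:wb="http://www.worldbank.org">
  <wb:country id="AAA">
    <wb:name>Country A</wb:name>
    <wb:incomeLevel id="LIC">Low income</wb:incomeLevel>
  </wb:country>
  <wb:country id="BBB">
    <wb:name>Country B</wb:name>
    <wb:incomeLevel id="HIC">High income</wb:incomeLevel>
  </wb:country>
  <wb:country id="CCC">
    <wb:name>Country C</wb:name>
    <wb:incomeLevel id="UMC">Upper middle income</wb:incomeLevel>
  </wb:country>
</wb:countries>"""

NAMESPACE = {"wb": "http://www.worldbank.org"}
INCOME_LEVEL_CODES = {
    "Low income": "LIC",
    "Lower middle income": "LMC",
    "Upper middle income": "UMC",
}


def group_by_income(xml_text):
    """Return {income code: [country ids]} for the three groups of interest."""
    root = ET.fromstring(xml_text)
    groups = {code: [] for code in INCOME_LEVEL_CODES.values()}
    for country in root.findall("wb:country", NAMESPACE):
        level = country.find("wb:incomeLevel", NAMESPACE)
        if level is None:
            continue  # skip entries without an income level element
        code = INCOME_LEVEL_CODES.get(level.text)
        if code is not None:  # ignore groups outside the analysis, e.g. high income
            groups[code].append(country.get("id"))
    return groups


sample_groups = group_by_income(SAMPLE_XML)
print(sample_groups)
```

Countries whose income level is not one of the three groups are dropped, which matches the behavior of the loop above.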

Next, we fetched the indicators needed to answer our research question, retrieving the most recent available value of each indicator for every country.

In [None]:
# fetch indicators
# List of indicators (for diarrhea prevalence and treatment in poorest vs richest quintiles)
indicators = {
    "Prevalence": {
        "Diarrhea prevalence of children under the age of 5 - Poorest quintile": "SH.STA.DIRH.Q1.ZS",
        "Diarrhea prevalence of children under the age of 5 - Richest quintile": "SH.STA.DIRH.Q5.ZS",
    },
    "Treatment": {
        "Diarrhea treatment of children under the age of 5 - Poorest quintile": "SH.STA.ORHF.Q1.ZS",
        "Diarrhea treatment of children under the age of 5 - Richest quintile": "SH.STA.ORHF.Q5.ZS",
    },
}


# fetch the most recent non-empty value of an indicator for all countries
def fetch_latest_data(indicator):
    # mrnev=1 returns the most recent non-empty value per country
    url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator}?mrnev=1&format=xml&per_page=300"
    response = requests.get(url)
    if response.status_code == 200:
        try:
            return ET.fromstring(response.content)
        except ET.ParseError as e:
            print(f"Error processing data for {indicator}: {e}")
    # request failed or the response could not be parsed
    return None


# Fetch data for each indicator and country
indicator_dict = {}
for income_group, income_code in income_level_codes.items():
    indicator_dict[income_code] = {}

    for prev_treat, indicator_codes in indicators.items():
        indicator_dict[income_code][prev_treat] = {}

        for indicator_name, indicator_code in indicator_codes.items():
            indicator_dict[income_code][prev_treat][indicator_code] = {}

            record = fetch_latest_data(indicator_code)
            if record is None:
                # request or parsing failed; leave this indicator empty
                continue

            for data in record.findall("wb:data", namespace):
                iso3code = data.find("wb:countryiso3code", namespace).text
                value = data.find("wb:value", namespace).text

                # keep only countries that belong to the current income group
                if value is not None and iso3code in country_dict[income_code]:
                    indicator_dict[income_code][prev_treat][indicator_code][iso3code] = value
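
To inspect the result, the nested dictionary can be flattened into rows. This is a minimal sketch that assumes the layout built above (income code, then category, then indicator code, then country, with values stored as strings). The helper `flatten_indicator_dict` and all sample values are invented for illustration.

```python
def flatten_indicator_dict(nested):
    """Turn the four-level nested dictionary into a flat list of row tuples."""
    rows = []
    for income_code, categories in nested.items():
        for category, by_indicator in categories.items():
            for indicator_code, by_country in by_indicator.items():
                for iso3code, value in by_country.items():
                    # values arrive as strings from the XML, so convert them here
                    rows.append((income_code, category, indicator_code, iso3code, float(value)))
    return rows


# Hypothetical values, only to show the shape of the data
sample_indicator_dict = {
    "LIC": {
        "Prevalence": {
            "SH.STA.DIRH.Q1.ZS": {"AAA": "18.2", "BBB": "21.5"},
            "SH.STA.DIRH.Q5.ZS": {"AAA": "12.0"},
        }
    }
}

rows = flatten_indicator_dict(sample_indicator_dict)
for row in rows:
    print(row)
```

A list of tuples like this can be passed to `pd.DataFrame(rows, columns=[...])` if a tabular view is preferred.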

The following code includes a function that saves the created dictionary to a CSV file, which can be helpful for later use of the data. We also added a t-test and widgets to make the data easier to explore.


In [None]:
# Function to save dictionary data to a CSV file
def save_data_to_csv(data, filename="data.csv"):
    with open(filename, mode="w", newline="") as file:
        writer = csv.writer(file)
        # Write the header
        writer.writerow(["Indicator", "Country", "Percentage"])

        # Write data rows
        for indicator, countries in data.items():
            for country, percentage in countries.items():
                writer.writerow([indicator, country, percentage])
    print(f"Data saved to {filename}")

def get_indicator_data(selected_income_group, selected_indicator):
    rich_vs_poor = {}

    for _, indicator_code in indicators[selected_indicator].items():
        rich_vs_poor[indicator_code] = indicator_dict[selected_income_group][selected_indicator][indicator_code]

    return rich_vs_poor

def calculate_t_test(data):
    # Extract values as floats
    vals = []
    for subdict in data.values():
        vals.append(list(map(float, subdict.values())))

    if len(vals) != 2:
        raise ValueError("The data should contain exactly two groups for a t-test.")

    # Perform a two-sample independent t-test
    return ttest_ind(vals[0], vals[1])

# Create GUI with ipywidgets
income_group_selector = widgets.Dropdown(
    options=list(income_level_codes.keys()),
    description='Income Group:',
    style={'description_width': 'initial'}
)

indicator_selector = widgets.Dropdown(
    options=list(indicators.keys()),
    description='Indicator:',
    style={'description_width': 'initial'}
)

fetch_button = widgets.Button(
    description="Get Data",
    button_style='success',
    tooltip='Display visualization',
    icon='search'
)

output_area = widgets.Output()

# Fetch and visualize data on button click
def on_fetch_button_click(b):
    with output_area:
        clear_output()
        selected_income_group = income_group_selector.value
        selected_income_code = income_level_codes[selected_income_group]
        selected_indicator_name = indicator_selector.value
        # selected_indicator_code = indicators[selected_indicator_name]

        # Display boxplot
        data = get_indicator_data(selected_income_code, selected_indicator_name)

        # Prepare boxplot data
        values = [list(map(float, subdict.values())) for subdict in data.values()]
        labels = []
        for key in data.keys():
            if 'Q1' in key:
                labels.append('poorest quintile')
            if 'Q5' in key:
                labels.append('richest quintile')

        # Create the boxplot
        fig, ax = plt.subplots()
        ax.boxplot(values, labels=labels)
        ax.set_ylabel("Percentage (%)")
        ax.set_title(f"{selected_indicator_name} of diarrhea in {selected_income_group.lower()} countries")
        plt.show()

        # Save CSV data
        download_button = widgets.Button(
            description="Download CSV",
            button_style='info'
        )

        def on_download_button_click(d):
            save_data_to_csv(data, filename = f"{selected_income_code}_{selected_indicator_name}.csv")

        download_button.on_click(on_download_button_click)
        display(download_button)

        # ttest
        t_test_button = widgets.Button(
            description='t test',
            button_style='info',
            tooltip='Calculate t statistic'
        )

        def on_ttest_button_click(d):
            t_stat, p_value = calculate_t_test(data)

            # Print the results
            with output_area:
                display(f"T-statistic: {t_stat}")
                display(f"P-value: {p_value}")

                # Interpretation
                alpha = 0.05
                if p_value < alpha:
                    display("The difference between the two groups is statistically significant.")
                else:
                    display("The difference between the two groups is not statistically significant.")

        t_test_button.on_click(on_ttest_button_click)
        display(t_test_button)


fetch_button.on_click(on_fetch_button_click)

# Display widgets
display(widgets.VBox([income_group_selector, indicator_selector, fetch_button, output_area]))
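
To make the statistic behind the t-test button concrete: with its default `equal_var=True`, `scipy.stats.ttest_ind` performs Student's pooled-variance two-sample test. The sketch below computes only that t statistic with the standard library; it does not compute the p-value. The helper `pooled_t_statistic` and the sample percentages are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    # weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    # difference in means divided by its estimated standard error
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Hypothetical prevalence values (%) for a handful of countries
poorest = [20.0, 22.0, 24.0]
richest = [10.0, 12.0, 14.0]

t_stat = pooled_t_statistic(poorest, richest)
print(f"t statistic: {t_stat:.4f}")
```

A positive statistic means the first group has the higher mean. The test treats the two quintiles as independent samples, which is the assumption made in `calculate_t_test` as well.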


# Summary and Conclusion
## Variations in Diarrhea Prevalence across Income Groups
### Low-Income Countries

For the poorest quintile, the median diarrhea prevalence is higher than for the richest quintile. The values are spread over a wide range, which indicates variability between countries. Prevalence in the richest quintile is lower and less variable, with a smaller interquartile range.

Children in the poorest quintile typically experience a higher prevalence of diarrhea due to a lack of access to clean water, sanitation and basic healthcare. The richest quintile still experiences a higher prevalence than in lower- or upper-middle-income countries, because it remains constrained by systemic underdevelopment and overall infrastructure challenges. The gap between the poorest and richest quintiles is fairly small, which the t-test confirms: with a p-value above 0.05, the difference is not statistically significant.

### Lower-Middle-Income Countries

The median diarrhea prevalence is lower than in low-income countries. Some variability remains, along with an outlier above the upper whisker. The interquartile range suggests moderate consistency in prevalence. In the richest quintile, the median is lower than in the poorest quintile and the variability is smaller still. This trend aligns with improved conditions for wealthier populations as overall income increases.

In general, the gap between the richest and poorest quintiles begins to widen in lower-middle-income countries, as these countries can invest more in public health initiatives. Urbanization in lower-middle-income countries brings improvements in infrastructure. However, disparities remain, which the t-test also confirms.

### Upper-Middle-Income Countries

Compared to the other income groups, the median prevalence in the poorest quintile is considerably lower. Some variability remains, and an outlier suggests that certain countries still experience higher diarrhea rates in the poorest quintile. With minimal variability, the richest quintile of upper-middle-income countries has the lowest prevalence of all. This indicates that the wealthier populations in this income group experience the best health outcomes in terms of diarrhea prevalence.

Overall, the prevalence of diarrhea is lower in both quintiles due to better sanitation and water access. The gap between the poorest and richest quintiles is the largest of all income groups. These countries typically have better infrastructure, while remote communities in the poorest quintile may still lag behind. The t-test, with a p-value below 0.05, confirms that the difference is statistically significant.

## Correlation between Diarrhea Treatment Rates and Income/Quintile
### Low-Income Countries

The richest quintile within low-income countries shows a slightly lower rate of diarrhea treatment than the poorest quintile, with higher variability and an outlier above the upper whisker. This can be interpreted as inequality arising from interventions in developing countries. The higher variability and the outlier in treatment rates can be linked to development aid. Development aid improves healthcare, especially in the poorest quintile, where it is needed most. This leads to localized success and uneven improvements, as not all groups benefit equally from such interventions.

Treatment rates are low for both quintiles, but the poorest quintile has slightly higher treatment rates. This may be because development aid focuses more on the poorest quintile, leading to unequal treatment rates and opportunities. For this income group, no correlation between treatment rates and income can be found. The t-test supports this interpretation: the p-value is above 0.05, so there is no statistical significance.

### Lower-Middle-Income Countries

In the poorest quintile, the median appears to be higher than in the richest quintile. The median is around 60%, meaning that in a typical country of this group roughly 60% of children under the age of 5 with diarrhea receive treatment. The interquartile range is narrower in the poorest quintile. The whiskers in the poorest quintile extend from 20% to almost 100%, showing that treatment rates range widely. In the richest quintile, the median is lower (around 50%) and the IQR is wider, indicating higher variability in diarrhea treatment.
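
The quantities read off the boxplots (median, interquartile range, whisker limits) can also be computed numerically. The sketch below assumes the linear-interpolation quartile convention and the common 1.5 × IQR whisker rule, which is understood to be matplotlib's default. The helper `boxplot_summary` and the sample treatment rates are invented for illustration.

```python
from statistics import median, quantiles


def boxplot_summary(values):
    """Median, quartiles, IQR and the 1.5 * IQR limits used to flag outliers."""
    # "inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,  # points below this count as outliers
        "upper_fence": q3 + 1.5 * iqr,  # points above this count as outliers
    }


# Hypothetical treatment rates (%) for seven countries
sample_rates = [20.0, 45.0, 55.0, 60.0, 65.0, 70.0, 95.0]

summary = boxplot_summary(sample_rates)
outliers = [v for v in sample_rates if v < summary["lower_fence"] or v > summary["upper_fence"]]
print(summary)
print("outliers:", outliers)
```

Whiskers extend only to the most extreme data points inside the fences, so values beyond them appear as separate outlier markers.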

In general, the poorest quintile has a higher rate of diarrhea treatment. This could be due to public health interventions targeting poorer populations, as already seen in low-income countries. Another reason could be a difference in the perception of diarrhea as a health emergency. The higher variability in diarrhea treatment in the richest quintile may reflect differing health-seeking behaviours, with some households relying less on treatment. According to the t-test, the difference between the poorest and richest quintiles is not statistically significant.

### Upper-Middle-Income Countries

In upper-middle-income countries, the median in the richest quintile is around 20% higher than in the poorest quintile. This reflects better access to health care and treatment options.

The gap in this income group is larger than in the low-income and lower-middle-income countries (median difference around 20%). In this income range, development aid is uncommon, leading to a bigger gap between the poorest and richest quintiles. The richest quintile has more universal access to treatment, while the poorest quintile may face residual access or cultural barriers. Still, the p-value is too high for the difference between the poorest and richest quintiles to be significant.

## Conclusion
### To what extent does the gap in diarrhea prevalence for children under the age of 5 between the poorest and richest quintiles vary across countries in different income groups (low income, lower middle income, upper middle income)?

There is a marked difference in diarrhea prevalence between the poorest and richest quintiles in lower-middle-income (LMC) and upper-middle-income (UMC) countries, where the gap is statistically significant (p-value < 0.05). This suggests growing inequality in access to clean water and sanitation as countries develop, leaving the poorest quintile behind.

In contrast, this gap is not significant (p-value > 0.05) in low-income countries (LICs), indicating that both the poorest and richest quintiles face similarly high rates of diarrhea. This can be attributed to widespread underdevelopment, where even the wealthiest households lack consistent access to clean water and sanitation.

### How does the treatment of diarrhea cases correlate with the income group and the quintile considered?

Treatment rates for diarrhea show no significant correlation with wealth quintiles in any income group. Contrary to expectations, the p-values in all income groups (LIC, LMC, UMC) indicate no statistically significant difference between the poorest and richest quintiles. However, the high variability in the LMC and UMC groups makes the data difficult to evaluate statistically. For this reason, the research question cannot be answered clearly, and further investigation is needed.
