# Python Web APIs: Accessing World Bank Data

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Introduction to World Bank API](#worldbank)
2. [Exploring Countries and Indicators](#explore)
3. [Retrieving Economic and Social Data](#data)
4. [Time Series Analysis](#timeseries)
5. [Demo: Global Development Comparisons](#demo)

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import seaborn as sns
import requests
import json

<a id='worldbank'></a>

# World Bank Data API

The World Bank provides free access to comprehensive development data about countries worldwide. The API offers:

- **Economic indicators**: GDP, inflation, unemployment, trade data
- **Social indicators**: Education, health, poverty, population statistics
- **Environmental data**: Climate, energy, sustainability metrics
- **Historical data**: Time series going back decades for most indicators
- **Country metadata**: Geographic, income level, regional classifications

💡 **Tip**: The World Bank API is completely free and doesn't require an API key, making it perfect for development projects and research!
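
Because no key is needed, a request is just a URL. The sketch below shows roughly how such a URL is assembled for version 2 of the REST API; `build_indicator_url` is a hypothetical helper written for this illustration (it is not part of `wbdata`), and it only builds the string without sending a request.

```python
from urllib.parse import urlencode

# Base URL of version 2 of the World Bank REST API
BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(indicator, countries, start_year, end_year, per_page=1000):
    """Return the request URL for one indicator (hypothetical helper)."""
    # Several ISO codes are joined with semicolons; "all" selects every entity
    country_part = ";".join(countries) if countries else "all"
    query = urlencode(
        {
            "format": "json",  # the API answers in XML unless asked otherwise
            "date": f"{start_year}:{end_year}",  # inclusive year range
            "per_page": per_page,  # rows per page of results
        },
        safe=":",  # keep the colon in the date range unescaped
    )
    return f"{BASE_URL}/country/{country_part}/indicator/{indicator}?{query}"


url = build_indicator_url("NY.GDP.PCAP.CD", ["USA", "CHN"], 2000, 2022)
print(url)
```

Pasting the printed URL into a browser is a quick way to see the raw data before involving any library.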

## Installing World Bank Data Library

We'll use the `wbdata` library, which provides a convenient Python interface to the World Bank API:

In [None]:
%pip install wbdata

## Setting Up the World Bank API Client

In [None]:
import wbdata

# The wbdata library handles all API calls for us
print("World Bank Data API ready to use!")
print(f"wbdata version: {wbdata.__version__ if hasattr(wbdata, '__version__') else 'Unknown'}")
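
To see what the library saves us from, here is a minimal sketch of parsing a raw JSON reply by hand. The reply is a two-element list: paging information first, then the records. The payload below is illustrative only: its structure follows the API, but the country ("Example Land") and the numbers are placeholders, and `flatten_response` is a hypothetical helper.

```python
import json

# Illustrative payload: real structure, placeholder country and values
sample_payload = json.dumps([
    {"page": 1, "pages": 1, "per_page": 50, "total": 2},
    [
        {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
         "country": {"id": "XX", "value": "Example Land"},
         "countryiso3code": "XXX", "date": "2021", "value": 1000},
        {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
         "country": {"id": "XX", "value": "Example Land"},
         "countryiso3code": "XXX", "date": "2020", "value": None},
    ],
])


def flatten_response(payload_text):
    """Split a raw API reply into paging info and flat row dicts."""
    page_info, records = json.loads(payload_text)
    rows = [
        {
            "country": rec["country"]["value"],  # nested dict -> plain name
            "iso3": rec["countryiso3code"],
            "year": int(rec["date"]),  # years arrive as strings
            "value": rec["value"],  # None marks a missing observation
        }
        for rec in (records or [])  # the record list can be missing
    ]
    return page_info, rows


page_info, rows = flatten_response(sample_payload)
print(page_info["total"], rows[0])
```

Note the two details that matter later in this notebook: years are strings, and missing observations are explicit nulls.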

<a id='explore'></a>

# Exploring Countries and Indicators

Let's start by exploring what data is available through the World Bank API.

## Available Countries

In [None]:
# Get all countries
countries = wbdata.get_countries()

print(f"Total entities: {len(countries)}")

# Convert to DataFrame for easier analysis
df_countries = pd.DataFrame(countries)
print(f"\nColumns: {df_countries.columns.tolist()}")
df_countries.head()

In [None]:
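# Filter to actual countries (exclude aggregates like regions).
# Aggregates report an empty capital city, so test for a non-empty string
# rather than relying on missing values alone.
has_capital = df_countries['capitalCity'].fillna('').str.strip() != ''
actual_countries = df_countries[has_capital]
print(f"Actual countries: {len(actual_countries)}")

# 'region' and 'incomeLevel' hold nested dicts, so extract their 'value' field
print("\nCountries by region:")
print(actual_countries['region'].map(lambda d: d['value']).value_counts())

print("\nCountries by income level:")
print(actual_countries['incomeLevel'].map(lambda d: d['value']).value_counts())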
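
The same filtering can be sketched without pandas, which makes the shape of the metadata easier to see. The records below are trimmed by hand to the fields used here, and `is_aggregate` is a hypothetical helper; the assumption is that aggregates carry the region value `"Aggregates"` and an empty capital city.

```python
from collections import Counter

# Hand-trimmed records in the shape of the country metadata
sample_countries = [
    {"id": "USA", "name": "United States", "capitalCity": "Washington D.C.",
     "region": {"id": "NAC", "value": "North America"},
     "incomeLevel": {"id": "HIC", "value": "High income"}},
    {"id": "JPN", "name": "Japan", "capitalCity": "Tokyo",
     "region": {"id": "EAS", "value": "East Asia & Pacific"},
     "incomeLevel": {"id": "HIC", "value": "High income"}},
    {"id": "IND", "name": "India", "capitalCity": "New Delhi",
     "region": {"id": "SAS", "value": "South Asia"},
     "incomeLevel": {"id": "LMC", "value": "Lower middle income"}},
    {"id": "WLD", "name": "World", "capitalCity": "",
     "region": {"id": "NA", "value": "Aggregates"},
     "incomeLevel": {"id": "NA", "value": "Aggregates"}},
]


def is_aggregate(record):
    """True for regional or income-group aggregates (assumed markers)."""
    return record["region"]["value"] == "Aggregates" or not record["capitalCity"]


actual = [c for c in sample_countries if not is_aggregate(c)]
region_counts = Counter(c["region"]["value"] for c in actual)
income_counts = Counter(c["incomeLevel"]["value"] for c in actual)
print(len(actual), dict(income_counts))
```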

## Available Indicators

The World Bank has thousands of indicators. Let's explore some key categories:

In [None]:
# Get indicators (this might take a moment as there are many)
print("Fetching indicators... (this may take a moment)")
indicators = wbdata.get_indicators()

print(f"Total indicators: {len(indicators)}")

# Convert to DataFrame
df_indicators = pd.DataFrame(indicators)
df_indicators.head()

In [None]:
# Search for specific types of indicators
def search_indicators(keyword, df=df_indicators):
    """Search indicators by keyword"""
    mask = df['name'].str.contains(keyword, case=False, na=False)
    results = df[mask][['id', 'name']]
    return results

# Find GDP-related indicators
gdp_indicators = search_indicators('GDP')
print(f"Found {len(gdp_indicators)} GDP-related indicators")
print("\nTop GDP indicators:")
print(gdp_indicators.head(10))

In [None]:
# Key indicators we'll use in this lesson
key_indicators = {
    'NY.GDP.PCAP.CD': 'GDP per capita (current US$)',
    'SP.POP.TOTL': 'Population, total',
    'SP.DYN.LE00.IN': 'Life expectancy at birth, total (years)',
    'SE.ADT.LITR.ZS': 'Literacy rate, adult total (% of people ages 15 and above)',
    'SH.MED.BEDS.ZS': 'Hospital beds (per 1,000 people)',
    'EN.ATM.CO2E.PC': 'CO2 emissions (metric tons per capita)',
    'SL.UEM.TOTL.ZS': 'Unemployment, total (% of total labor force)',
    'FP.CPI.TOTL.ZG': 'Inflation, consumer prices (annual %)'
}

print("Key indicators for analysis:")
for code, name in key_indicators.items():
    print(f"  {code}: {name}")

<a id='data'></a>

# Retrieving Economic and Social Data

Now let's retrieve some actual data for analysis.

## Single Indicator, Multiple Countries

In [None]:
# Get GDP per capita for specific countries
countries_of_interest = ['USA', 'CHN', 'JPN', 'DEU', 'IND', 'BRA', 'GBR', 'FRA']

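# Request the last two calendar years; figures for the current year are
# often not published yet, so the previous year acts as a fallback.
# The date range is passed as datetime objects.
this_year = datetime.now().year
gdp_data = wbdata.get_dataframe(
    {'NY.GDP.PCAP.CD': 'GDP_per_capita'},
    country=countries_of_interest,
    date=(datetime(this_year - 1, 1, 1), datetime(this_year, 1, 1))
)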

print(f"Retrieved GDP data: {gdp_data.shape}")
print("\nMost recent GDP per capita data:")
print(gdp_data.dropna().sort_values('GDP_per_capita', ascending=False))

## Multiple Indicators, Single Country

In [None]:
# Get multiple indicators for the United States
usa_indicators = {
    'NY.GDP.PCAP.CD': 'GDP_per_capita',
    'SP.DYN.LE00.IN': 'Life_expectancy',
    'SE.ADT.LITR.ZS': 'Literacy_rate',
    'SL.UEM.TOTL.ZS': 'Unemployment_rate'
}

usa_data = wbdata.get_dataframe(
    usa_indicators,
    country='USA',
    date=(datetime(2020, 1, 1), datetime(2023, 1, 1))  # date range as datetime objects
)

print("USA Development Indicators (Recent Years):")
print(usa_data.dropna())

## 🥊 Challenge: Compare Development Indicators

- Choose 5 countries from different regions
- Compare their life expectancy, GDP per capita, and literacy rates
- Which country performs best on each indicator?

In [None]:
# YOUR CODE HERE



<a id='timeseries'></a>

# Time Series Analysis

One of the most powerful features of World Bank data is the historical time series. Let's analyze trends over time.

## GDP Growth Over Time

In [None]:
# Get historical GDP per capita data
emerging_economies = ['CHN', 'IND', 'BRA', 'RUS', 'ZAF']  # BRICS countries
developed_economies = ['USA', 'JPN', 'DEU', 'GBR', 'FRA']

# Get data from 2000 to 2022 (date range passed as datetime objects)
historical_gdp = wbdata.get_dataframe(
    {'NY.GDP.PCAP.CD': 'GDP_per_capita'},
    country=emerging_economies + developed_economies,
    date=(datetime(2000, 1, 1), datetime(2022, 1, 1))
)

print(f"Historical GDP data shape: {historical_gdp.shape}")
historical_gdp.head()

In [None]:
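# Prepare data for plotting
# wbdata labels rows with country names and string years, so convert them to
# ISO codes and integer years; otherwise the per-country filters below match nothing
name_to_code = dict(zip(df_countries['name'], df_countries['id']))
gdp_plot_data = (
    historical_gdp.reset_index()  # country and date become columns
    .dropna()
    .assign(
        country=lambda d: d['country'].replace(name_to_code),
        date=lambda d: d['date'].astype(int),
    )
    .sort_values('date')  # plot lines in chronological order
)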

# Create the plot
plt.figure(figsize=(15, 8))

# Plot emerging economies
plt.subplot(1, 2, 1)
for country in emerging_economies:
    country_data = gdp_plot_data[gdp_plot_data['country'] == country]
    if not country_data.empty:
        plt.plot(country_data['date'], country_data['GDP_per_capita'], 
                marker='o', label=country, linewidth=2)

plt.xlabel('Year')
plt.ylabel('GDP per Capita (USD)')
plt.title('Emerging Economies - GDP per Capita Over Time')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot developed economies
plt.subplot(1, 2, 2)
for country in developed_economies:
    country_data = gdp_plot_data[gdp_plot_data['country'] == country]
    if not country_data.empty:
        plt.plot(country_data['date'], country_data['GDP_per_capita'], 
                marker='s', label=country, linewidth=2)

plt.xlabel('Year')
plt.ylabel('GDP per Capita (USD)')
plt.title('Developed Economies - GDP per Capita Over Time')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
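
The curves above can be condensed into a single number per country with the compound annual growth rate, CAGR = (end / start)^(1 / years) - 1. The sketch below is a plain-Python illustration; `cagr` is a hypothetical helper and the input values are placeholders, not World Bank figures.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two observations."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Placeholder example: a value that doubles over ten years
growth = cagr(100.0, 200.0, 10)
print(f"{growth:.2%}")  # prints 7.18%
```

With the notebook's data, `start_value` and `end_value` would be a country's first and last available GDP per capita, and `years` the gap between those two observations.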

## Life Expectancy Trends

In [None]:
# Get life expectancy data for selected countries
life_exp_data = wbdata.get_dataframe(
    {'SP.DYN.LE00.IN': 'Life_expectancy'},
    country=['USA', 'CHN', 'IND', 'BRA', 'JPN', 'ETH', 'NGA'],
    date=(datetime(1990, 1, 1), datetime(2021, 1, 1))  # date range as datetime objects
)

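# Prepare for plotting: convert country names to ISO codes and string years
# to integers so the filters and year comparisons below work
name_to_code = dict(zip(df_countries['name'], df_countries['id']))
life_exp_plot = (
    life_exp_data.reset_index()
    .dropna()
    .assign(
        country=lambda d: d['country'].replace(name_to_code),
        date=lambda d: d['date'].astype(int),
    )
    .sort_values('date')
)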

plt.figure(figsize=(12, 6))

countries_to_plot = ['USA', 'CHN', 'IND', 'BRA', 'JPN', 'ETH', 'NGA']
colors = plt.cm.Set1(np.linspace(0, 1, len(countries_to_plot)))

for i, country in enumerate(countries_to_plot):
    country_data = life_exp_plot[life_exp_plot['country'] == country]
    if not country_data.empty:
        plt.plot(country_data['date'], country_data['Life_expectancy'], 
                marker='o', label=country, linewidth=2, color=colors[i])

plt.xlabel('Year')
plt.ylabel('Life Expectancy (Years)')
plt.title('Life Expectancy Trends (1990-2021)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Compare the 1990-1995 average with the 2015-2021 average
print("Life Expectancy Change (1990-1995 average vs 2015-2021 average):")
for country in countries_to_plot:
    country_data = life_exp_plot[life_exp_plot['country'] == country]
    if len(country_data) > 10:  # Ensure we have enough data
        years = country_data['date'].astype(int)  # years may arrive as strings
        early_data = country_data[years <= 1995]['Life_expectancy'].mean()
        recent_data = country_data[years >= 2015]['Life_expectancy'].mean()
        if pd.notna(early_data) and pd.notna(recent_data):
            improvement = recent_data - early_data
            # The + format shows the sign, so declines print correctly too
            print(f"  {country}: {improvement:+.1f} years")

## Economic Indicators Analysis

In [None]:
# Get multiple economic indicators for analysis
economic_indicators = {
    'NY.GDP.PCAP.CD': 'GDP_per_capita',
    'FP.CPI.TOTL.ZG': 'Inflation_rate',
    'SL.UEM.TOTL.ZS': 'Unemployment_rate'
}

economic_data = wbdata.get_dataframe(
    economic_indicators,
    country=['USA', 'DEU', 'JPN', 'GBR'],
    date=(datetime(2010, 1, 1), datetime(2022, 1, 1))  # date range as datetime objects
)

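# Convert country names to ISO codes and string years to integers so the
# per-country filters below match
name_to_code = dict(zip(df_countries['name'], df_countries['id']))
econ_plot = (
    economic_data.reset_index()
    .dropna()
    .assign(
        country=lambda d: d['country'].replace(name_to_code),
        date=lambda d: d['date'].astype(int),
    )
    .sort_values('date')
)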

plt.figure(figsize=(15, 5))

countries = ['USA', 'DEU', 'JPN', 'GBR']

# Plot 1: GDP per capita
plt.subplot(1, 3, 1)
for country in countries:
    country_data = econ_plot[econ_plot['country'] == country]
    if not country_data.empty:
        plt.plot(country_data['date'], country_data['GDP_per_capita'], 
                marker='o', label=country, linewidth=2)
plt.xlabel('Year')
plt.ylabel('GDP per Capita (USD)')
plt.title('GDP per Capita')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Inflation rate
plt.subplot(1, 3, 2)
for country in countries:
    country_data = econ_plot[econ_plot['country'] == country]
    if not country_data.empty:
        plt.plot(country_data['date'], country_data['Inflation_rate'], 
                marker='s', label=country, linewidth=2)
plt.xlabel('Year')
plt.ylabel('Inflation Rate (%)')
plt.title('Inflation Rate')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Unemployment rate
plt.subplot(1, 3, 3)
for country in countries:
    country_data = econ_plot[econ_plot['country'] == country]
    if not country_data.empty:
        plt.plot(country_data['date'], country_data['Unemployment_rate'], 
                marker='^', label=country, linewidth=2)
plt.xlabel('Year')
plt.ylabel('Unemployment Rate (%)')
plt.title('Unemployment Rate')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<a id='demo'></a>

# 🎬 Demo: Global Development Comparisons

Let's create a comprehensive analysis comparing development indicators across different regions and income levels.

In [None]:
# Select representative countries from different regions and income levels
country_profiles = {
    'High Income': {
        'North America': 'USA',
        'Europe': 'DEU', 
        'East Asia': 'JPN',
        'Oceania': 'AUS'
    },
    'Upper Middle Income': {
        'East Asia': 'CHN',
        'Latin America': 'BRA',
        'Europe & Central Asia': 'RUS',
        'Middle East': 'TUR'
    },
    'Lower Middle Income': {
        'South Asia': 'IND',
        'Sub-Saharan Africa': 'NGA',
        'East Asia': 'VNM',
        'Latin America': 'BOL'
    },
    'Low Income': {
        'Sub-Saharan Africa': 'ETH',
        'South Asia': 'AFG',
        'Sub-Saharan Africa 2': 'TCD',
        'Sub-Saharan Africa 3': 'MDG'
    }
}

# Flatten to get all countries
all_countries = []
country_income_map = {}
for income_level, regions in country_profiles.items():
    for region, country in regions.items():
        all_countries.append(country)
        country_income_map[country] = income_level

print(f"Analyzing {len(all_countries)} countries across 4 income levels")

In [None]:
# Get comprehensive development data
development_indicators = {
    'NY.GDP.PCAP.CD': 'GDP_per_capita',
    'SP.DYN.LE00.IN': 'Life_expectancy', 
    'SE.ADT.LITR.ZS': 'Literacy_rate',
    'SH.MED.BEDS.ZS': 'Hospital_beds_per_1000',
    'EN.ATM.CO2E.PC': 'CO2_emissions_per_capita',
    'SP.POP.TOTL': 'Population'
}

print("Fetching development data...")
dev_data = wbdata.get_dataframe(
    development_indicators,
    country=all_countries,
    date=(datetime(2019, 1, 1), datetime(2022, 1, 1))  # Most recent data
)

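# Get the most recent available value of each indicator for every country.
# Rows are sorted by year first so that .last() picks the latest non-missing
# value, and country names are mapped to ISO codes to match country_income_map.
name_to_code = dict(zip(df_countries['name'], df_countries['id']))
dev_summary = (
    dev_data.reset_index()
    .assign(country=lambda d: d['country'].replace(name_to_code))
    .sort_values('date')
    .groupby('country')
    .last()
)
dev_summary['income_level'] = dev_summary.index.map(country_income_map)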

print(f"Development data summary: {dev_summary.shape}")
dev_summary.head()
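
Picking "the most recent available value" is worth spelling out, because each indicator can be missing in different years. The sketch below shows the idea in plain Python for one indicator; `latest_non_missing` is a hypothetical helper, and the rows are toy values for two made-up countries.

```python
def latest_non_missing(rows):
    """Map each country to its (year, value) pair from the latest year with data.

    rows: iterable of (country, year, value) tuples in any order;
    value is None when the observation is missing.
    """
    latest = {}
    for country, year, value in rows:
        if value is None:
            continue  # skip missing observations
        year = int(year)  # years may arrive as strings
        if country not in latest or year > latest[country][0]:
            latest[country] = (year, value)
    return latest


# Toy rows: country "A" has no value for 2021, so 2020 is its latest
toy_rows = [
    ("A", "2021", None),
    ("A", "2020", 2.0),
    ("A", "2019", 1.0),
    ("B", "2019", 5.0),
    ("B", "2021", 7.0),
]
summary = latest_non_missing(toy_rows)
print(summary)  # {'A': (2020, 2.0), 'B': (2021, 7.0)}
```

One consequence to keep in mind when reading the comparison plots: values for different countries, and for different indicators of the same country, may come from different years.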

In [None]:
# Create comprehensive comparison visualizations
plt.figure(figsize=(20, 12))

# Plot 1: GDP vs Life Expectancy
plt.subplot(2, 3, 1)
income_colors = {'High Income': 'darkgreen', 'Upper Middle Income': 'orange', 
                'Lower Middle Income': 'red', 'Low Income': 'darkred'}

for income_level in income_colors.keys():
    subset = dev_summary[dev_summary['income_level'] == income_level]
    plt.scatter(subset['GDP_per_capita'], subset['Life_expectancy'], 
               c=income_colors[income_level], label=income_level, alpha=0.7, s=100)

plt.xlabel('GDP per Capita (USD)')
plt.ylabel('Life Expectancy (Years)')
plt.title('GDP vs Life Expectancy')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Development by Income Level - Life Expectancy
plt.subplot(2, 3, 2)
income_order = ['Low Income', 'Lower Middle Income', 'Upper Middle Income', 'High Income']
life_exp_by_income = [dev_summary[dev_summary['income_level'] == level]['Life_expectancy'].dropna().tolist() 
                     for level in income_order]
plt.boxplot(life_exp_by_income, labels=income_order)
plt.ylabel('Life Expectancy (Years)')
plt.title('Life Expectancy by Income Level')
plt.xticks(rotation=45)

# Plot 3: Literacy Rate by Income Level
plt.subplot(2, 3, 3)
literacy_by_income = [dev_summary[dev_summary['income_level'] == level]['Literacy_rate'].dropna().tolist() 
                     for level in income_order]
plt.boxplot(literacy_by_income, labels=income_order)
plt.ylabel('Literacy Rate (%)')
plt.title('Literacy Rate by Income Level')
plt.xticks(rotation=45)

# Plot 4: CO2 Emissions vs GDP
plt.subplot(2, 3, 4)
for income_level in income_colors.keys():
    subset = dev_summary[dev_summary['income_level'] == income_level]
    plt.scatter(subset['GDP_per_capita'], subset['CO2_emissions_per_capita'], 
               c=income_colors[income_level], label=income_level, alpha=0.7, s=100)

plt.xlabel('GDP per Capita (USD)')
plt.ylabel('CO2 Emissions per Capita (metric tons)')
plt.title('Economic Development vs Environmental Impact')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 5: Healthcare Infrastructure
plt.subplot(2, 3, 5)
healthcare_by_income = [dev_summary[dev_summary['income_level'] == level]['Hospital_beds_per_1000'].dropna().tolist() 
                       for level in income_order]
plt.boxplot(healthcare_by_income, labels=income_order)
plt.ylabel('Hospital Beds per 1,000 people')
plt.title('Healthcare Infrastructure by Income Level')
plt.xticks(rotation=45)

# Plot 6: Population vs GDP per capita
plt.subplot(2, 3, 6)
for income_level in income_colors.keys():
    subset = dev_summary[dev_summary['income_level'] == income_level]
    plt.scatter(subset['Population'], subset['GDP_per_capita'], 
               c=income_colors[income_level], label=income_level, alpha=0.7, s=100)

plt.xlabel('Population (log scale)')
plt.ylabel('GDP per Capita (USD)')
plt.title('Population vs Economic Development')
plt.xscale('log')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Statistical summary by income level
print("Development Statistics by Income Level:")
print("=" * 50)

summary_stats = dev_summary.groupby('income_level')[[
    'GDP_per_capita', 'Life_expectancy', 'Literacy_rate', 
    'Hospital_beds_per_1000', 'CO2_emissions_per_capita'
]].agg(['mean', 'median', 'std']).round(2)

print(summary_stats)

# Correlation analysis
print("\n\nCorrelation Matrix:")
print("=" * 30)
correlation_matrix = dev_summary[[
    'GDP_per_capita', 'Life_expectancy', 'Literacy_rate', 
    'Hospital_beds_per_1000', 'CO2_emissions_per_capita'
]].corr().round(3)

print(correlation_matrix)

# Create a correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Correlation Matrix: Development Indicators')
plt.tight_layout()
plt.show()
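
Each cell of the matrix above is a Pearson correlation coefficient computed from the country pairs where both indicators are present. As a rough sketch of that calculation, assuming missing values are represented as `None`, a hypothetical `pearson_r` helper could look like this:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Toy series: the pair containing None is dropped before computing
r = pearson_r([1, 2, 3, None, 5], [2, 4, 6, 8, 10])
print(round(r, 3))  # prints 1.0
```

With only 16 countries and gaps in some indicators, each coefficient rests on a small sample, so treat the heatmap as indicative rather than conclusive.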

## Collecting Data for Your Final Project

Here's a comprehensive template for collecting World Bank data for research projects:

In [None]:
def collect_worldbank_data(indicators, countries='all', start_year=2000, end_year=None, 
                          include_metadata=True):
    """Collect comprehensive World Bank data for analysis
    
    Parameters:
    indicators (dict): Dictionary mapping indicator codes to friendly names
    countries (list or 'all'): List of country codes or 'all' for all countries
    start_year (int): Starting year for data collection
    end_year (int): Ending year (defaults to current year)
    include_metadata (bool): Whether to include country metadata
    """
    
    if end_year is None:
        end_year = datetime.now().year
    
    print(f"Collecting World Bank data from {start_year} to {end_year}")
    print(f"Indicators: {list(indicators.values())}")
    
    # Get the main data (the date range is passed as datetime objects)
    date_range = (datetime(start_year, 1, 1), datetime(end_year, 1, 1))
    if countries == 'all':
        data = wbdata.get_dataframe(indicators, date=date_range)
    else:
        data = wbdata.get_dataframe(indicators, country=countries, date=date_range)
    
    # Reset index to get country and date as columns
    df = data.reset_index()
    
    # Add metadata if requested
    if include_metadata:
        print("Adding country metadata...")
        countries_meta = wbdata.get_countries()
        meta_df = pd.DataFrame(countries_meta)
        
        # Create a mapping of country names to metadata. The 'country' column
        # returned by wbdata holds names, so the lookup is keyed by name.
        country_info = {}
        for _, row in meta_df.iterrows():
            country_info[row['name']] = {
                'country_code': row['id'],
                'region': row['region']['value'] if row['region'] else None,
                'income_level': row['incomeLevel']['value'] if row['incomeLevel'] else None,
                'capital_city': row['capitalCity'],
                'longitude': row['longitude'],
                'latitude': row['latitude']
            }
        
        # Add metadata to main dataframe
        for col in ['country_code', 'region', 'income_level', 'capital_city', 'longitude', 'latitude']:
            df[col] = df['country'].map(lambda x: country_info.get(x, {}).get(col))
    
    # Add derived features (years arrive as strings, so convert before dividing)
    df['date'] = df['date'].astype(int)
    df['decade'] = (df['date'] // 10) * 10
    
    # Remove rows where all indicator values are NaN
    indicator_cols = list(indicators.values())
    df = df.dropna(subset=indicator_cols, how='all')
    
    print(f"Final dataset: {len(df)} observations for {df['country'].nunique()} countries")
    
    return df

# Example usage for different research projects

# Example 1: Climate and Development
# climate_indicators = {
#     'EN.ATM.CO2E.PC': 'CO2_emissions_per_capita',
#     'EG.USE.PCAP.KG.OE': 'Energy_use_per_capita',
#     'NY.GDP.PCAP.CD': 'GDP_per_capita',
#     'SP.POP.TOTL': 'Population',
#     'SP.URB.TOTL.IN.ZS': 'Urban_population_percent'
# }
# climate_data = collect_worldbank_data(climate_indicators, start_year=1990)
# climate_data.to_csv('climate_development_data.csv', index=False)

# Example 2: Health and Economic Development
# health_indicators = {
#     'SP.DYN.LE00.IN': 'Life_expectancy',
#     'SH.MED.BEDS.ZS': 'Hospital_beds_per_1000',
#     'SH.XPD.CHEX.PC.CD': 'Health_expenditure_per_capita',
#     'NY.GDP.PCAP.CD': 'GDP_per_capita',
#     'SP.DYN.IMRT.IN': 'Infant_mortality_rate'
# }
# health_data = collect_worldbank_data(health_indicators, start_year=2000)
# health_data.to_csv('health_development_data.csv', index=False)

print("Data collection templates ready for use!")
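
The `decade` column added by the template is meant for coarse aggregation. As a small illustration of that idea outside pandas, the sketch below averages toy observations per decade; `decade_of` and `mean_by_decade` are hypothetical helpers and the numbers are placeholders.

```python
from collections import defaultdict
from statistics import mean


def decade_of(year):
    """Return the first year of the decade, e.g. 1998 -> 1990."""
    return (int(year) // 10) * 10


def mean_by_decade(observations):
    """Average (year, value) pairs per decade, skipping missing values."""
    buckets = defaultdict(list)
    for year, value in observations:
        if value is not None:
            buckets[decade_of(year)].append(value)
    return {decade: mean(values) for decade, values in sorted(buckets.items())}


# Toy observations: string years and one missing value
toy = [("1998", 1.0), ("1999", 3.0), ("2000", 10.0), ("2005", None), ("2009", 20.0)]
result = mean_by_decade(toy)
print(result)  # {1990: 2.0, 2000: 15.0}
```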

<div class="alert alert-success">

## ❗ Key Points

* The World Bank API provides free access to comprehensive development data for 200+ countries
* Data covers economic, social, environmental, and governance indicators spanning decades
* No API key required, making it ideal for educational and research projects
* Historical time series data enables trend analysis and policy impact studies
* Country metadata includes regional and income level classifications for comparative analysis
* Strong correlations exist between economic development and social outcomes
* Environmental impact often increases with economic development, highlighting sustainability challenges
* Data quality and availability vary by country and indicator
* Perfect for research on development economics, public policy, and global trends
  
</div>