# World Bank Data Extraction for Urban Mobility Analysis

**Project:** Urban Mobility Optimization: A Data-Driven Framework for Sustainable Transportation Investment

**Objective:** Extract and combine multiple World Bank indicators to create a custom dataset for analyzing transportation efficiency and sustainability.

**Data Period:** 2010-2022

---

## Indicator Selection Strategy

We extract 13 indicators across 5 categories:

### 1. Transportation Infrastructure
- Road density - Measures land transport accessibility
- Rail lines - Alternative to road-based transport
- Air passengers - Aviation infrastructure utilization
- Container traffic - Maritime logistics capability

### 2. Energy & Emissions
- Energy intensity - Efficiency of energy use relative to economic output
- CO2 per capita - Environmental impact per person
- Transport emissions share - Transport's share of CO2 emissions from fuel combustion

### 3. Economic Performance
- GDP per capita - Economic development level
- Trade volume - Economic openness and connectivity needs
- Urban population share - Demand driver for urban mobility

### 4. Logistics Efficiency
- LPI Overall - Comprehensive logistics performance
- LPI Infrastructure - Transport infrastructure quality perception

### 5. Investment
- Capital formation - Proxy for infrastructure investment capacity

---

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import wbdata
import datetime
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Step 1: Define Indicators and Time Period
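
Before sending any requests, it can help to sanity-check the mapping itself. The sketch below is not part of the original workflow: `validate_indicator_map` is a hypothetical helper, the two-entry `sample` dictionary is only a stand-in for the full mapping, and the code-format pattern is an assumption based on the codes used in this notebook.

```python
import re


def validate_indicator_map(mapping):
    """Return a list of problems found in a {code: column_name} mapping."""
    problems = []
    names = list(mapping.values())
    # Duplicate column names would collide when the indicators are merged
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append(f"duplicate column names: {duplicates}")
    # Assumed format: dot-separated upper-case tokens, e.g. NY.GDP.PCAP.PP.KD
    code_pattern = re.compile(r"^[A-Z0-9]+(\.[A-Z0-9]+)+$")
    for code in mapping:
        if not code_pattern.match(code):
            problems.append(f"unexpected code format: {code!r}")
    return problems


sample = {
    "NY.GDP.PCAP.PP.KD": "gdp_per_capita_ppp",
    "SP.URB.TOTL.IN.ZS": "urban_population_pct",
}
print(validate_indicator_map(sample))
```

An empty list means no problems were found; any message in the list points at an entry worth fixing before extraction starts.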

In [None]:
# Define World Bank indicator codes
indicators = {
    # Transportation Infrastructure
    'IS.ROD.DNST.K2': 'road_density_km_per_100sqkm',
    'IS.RRS.TOTL.KM': 'rail_lines_total_km',
    'IS.AIR.PSGR': 'air_transport_passengers',
    'IS.SHP.GOOD.TU': 'container_port_traffic_teu',
    
    # Energy & Emissions
    'EG.USE.COMM.GD.PP.KD': 'energy_use_per_gdp',
    'EN.ATM.CO2E.PC': 'co2_emissions_per_capita',
    'EN.CO2.TRAN.ZS': 'co2_transport_pct_total',
    
    # Economic Performance
    'NY.GDP.PCAP.PP.KD': 'gdp_per_capita_ppp',
    'NE.TRD.GNFS.ZS': 'trade_pct_gdp',
    'SP.URB.TOTL.IN.ZS': 'urban_population_pct',
    
    # Logistics Performance
    'LP.LPI.OVRL.XQ': 'lpi_overall_score',
    'LP.LPI.INFR.XQ': 'lpi_infrastructure_score',
    
    # Investment
    'NE.GDI.TOTL.ZS': 'gross_capital_formation_pct_gdp'
}

# Define time period
start_date = datetime.datetime(2010, 1, 1)
end_date = datetime.datetime(2022, 12, 31)

print(f"Total indicators to extract: {len(indicators)}")
print(f"Time period: {start_date.year} - {end_date.year}")
print("\nIndicator mapping:")
for code, name in indicators.items():
    print(f"  {code:25} -> {name}")

## Step 2: Extract Data from World Bank API

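We'll extract each indicator for all available economies and reshape it into three columns: economy, year, and the indicator value. Note that the World Bank's economy list typically includes regional and income-group aggregates (for example, "World") alongside individual countries, so the raw extract may contain more than sovereign states.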
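
The reshaping step can be tried offline before touching the API. The following is a minimal sketch on made-up values, assuming the raw frame arrives with a two-level (country, date) index and one value column, which is the shape the extraction function in this notebook expects. `tidy_indicator` is a hypothetical name used only for illustration.

```python
import pandas as pd

# Synthetic stand-in for one indicator: a (country, date) MultiIndex and a
# single value column. The numbers are invented for the example.
index = pd.MultiIndex.from_product(
    [["Economy A", "Economy B"],
     pd.to_datetime(["2009-01-01", "2010-01-01", "2011-01-01"])],
    names=["country", "date"],
)
raw = pd.DataFrame({"gdp_per_capita_ppp": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2]}, index=index)


def tidy_indicator(frame, value_name, first_year, last_year):
    """Flatten the index and keep economy, year and value within the year range."""
    out = frame.reset_index()
    out.columns = ["economy", "date", value_name]
    out["year"] = out["date"].dt.year
    in_range = out["year"].between(first_year, last_year)  # inclusive on both ends
    return out.loc[in_range, ["economy", "year", value_name]].reset_index(drop=True)


tidy = tidy_indicator(raw, "gdp_per_capita_ppp", 2010, 2022)
print(tidy)
```

Here the 2009 rows fall outside the range and are dropped, leaving two years for each synthetic economy.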

In [None]:
def extract_indicator_data(indicator_code, indicator_name, start_date, end_date):
    """
    Extract data for a single indicator from World Bank API
    
    Parameters:
    -----------
    indicator_code : str
        World Bank indicator code
    indicator_name : str
        Custom name for the indicator
    start_date : datetime
        Start date for data extraction
    end_date : datetime
        End date for data extraction
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with columns: economy, year, and a value column named after indicator_name
    """
    try:
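        # Fetch every available year for this indicator; the date range is
        # applied further down. Note: `convert_date` is the argument name in
        # older wbdata releases; newer ones may call it `parse_dates`.
        data = wbdata.get_dataframe(
            {indicator_code: indicator_name},
            convert_date=True
        )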
        
        # Reset index to get economy and date as columns
        data = data.reset_index()
        
        # Rename columns
        data.columns = ['economy', 'date', indicator_name]
        
        # Extract year from date
        data['year'] = data['date'].dt.year
        
        # Filter by date range
        data = data[(data['year'] >= start_date.year) & (data['year'] <= end_date.year)]
        
        # Select final columns
        data = data[['economy', 'year', indicator_name]]
        
        return data
    
    except Exception as e:
        print(f"Error extracting {indicator_name}: {str(e)}")
        return None

print("Data extraction function defined successfully")

In [None]:
# Extract all indicators
print("Starting data extraction from World Bank API...\n")

dataframes = {}

for indicator_code, indicator_name in tqdm(indicators.items(), desc="Extracting indicators"):
    print(f"\nExtracting: {indicator_name}")
    df = extract_indicator_data(indicator_code, indicator_name, start_date, end_date)
    
    if df is not None:
        dataframes[indicator_name] = df
        print(f"  ✓ Extracted {len(df)} records for {df['economy'].nunique()} economies")
    else:
        print(f"  ✗ Failed to extract {indicator_name}")

print(f"\n{'='*60}")
print(f"Extraction complete: {len(dataframes)}/{len(indicators)} indicators successfully extracted")
print(f"{'='*60}")

## Step 3: Merge All Indicators into Single Dataset

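We'll perform a sequential outer join on economy and year to create our master dataset, so an economy-year pair that appears in any indicator is kept even if other indicators have no value for it.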
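
To see how the join behaves on partially overlapping keys, here is a small sketch with invented frames. It uses `functools.reduce` instead of an explicit loop and adds pandas' `validate="one_to_one"` check, which raises if either side repeats an economy-year pair; both choices are illustrative assumptions rather than the notebook's own method.

```python
from functools import reduce

import pandas as pd

# Two tiny made-up indicator frames with partially overlapping keys
gdp = pd.DataFrame(
    {"economy": ["A", "A", "B"], "year": [2010, 2011, 2010], "gdp": [1.0, 1.1, 2.0]}
)
urban = pd.DataFrame(
    {"economy": ["A", "B", "B"], "year": [2010, 2010, 2011], "urban": [50.0, 60.0, 61.0]}
)

KEYS = ["economy", "year"]


def merge_indicators(frames):
    """Outer-join frames on the composite key, failing fast on duplicate keys."""
    return reduce(
        lambda left, right: left.merge(right, on=KEYS, how="outer", validate="one_to_one"),
        frames,
    )


merged = merge_indicators([gdp, urban]).sort_values(KEYS).reset_index(drop=True)
print(merged)
```

The result holds the union of the keys from both frames, with missing values wherever one indicator has no observation for a given economy and year.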

In [None]:
# Start with the first dataframe
indicator_list = list(dataframes.keys())
merged_data = dataframes[indicator_list[0]].copy()

print(f"Starting merge with: {indicator_list[0]}")
print(f"Initial shape: {merged_data.shape}")

# Merge remaining dataframes
for indicator_name in indicator_list[1:]:
    merged_data = merged_data.merge(
        dataframes[indicator_name],
        on=['economy', 'year'],
        how='outer'
    )
    print(f"After merging {indicator_name}: {merged_data.shape}")

print(f"\nFinal merged dataset shape: {merged_data.shape}")
print(f"Economies: {merged_data['economy'].nunique()}")
print(f"Years: {sorted(merged_data['year'].unique())}")

## Step 4: Add Income Group Classification

We'll add World Bank income group classifications to enable segmented analysis.
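
Because the mapping is keyed on economy names, any name missing from the metadata silently becomes a missing value. The sketch below uses hypothetical metadata records shaped like the ones this step iterates over, and assumes that aggregates carry the income label "Aggregates"; check that label against the actual API response before relying on it.

```python
import pandas as pd

# Hypothetical metadata records (names and labels invented for the example)
economies = [
    {"name": "Economy A", "incomeLevel": {"value": "High income"}},
    {"name": "Economy B", "incomeLevel": {"value": "Low income"}},
    {"name": "Region X", "incomeLevel": {"value": "Aggregates"}},
]
income_mapping = {e["name"]: e["incomeLevel"]["value"] for e in economies}

data = pd.DataFrame(
    {"economy": ["Economy A", "Region X", "Economy C"], "year": [2010, 2010, 2010]}
)
data["income_group"] = data["economy"].map(income_mapping)

# Names missing from the metadata map to NaN; list them instead of ignoring them
unmatched = sorted(data.loc[data["income_group"].isna(), "economy"].unique())

# Assumed label for regional and income aggregates; drop them for country-level work
countries_only = data[data["income_group"].notna() & (data["income_group"] != "Aggregates")]

print(unmatched)
print(countries_only)
```

Listing the unmatched names makes spelling mismatches between the data and the metadata visible early, before they distort group-level statistics.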

In [None]:
# Get economy metadata including income groups
# (get_country() is the name in older wbdata releases; newer ones may call it get_countries())
try:
    economies = wbdata.get_country()
    
    # Create income group mapping
    income_mapping = {}
    for economy in economies:
        income_mapping[economy['name']] = economy['incomeLevel']['value']
    
    # Add income group to merged data
    merged_data['income_group'] = merged_data['economy'].map(income_mapping)
    
    print("Income group classification added successfully")
    print("\nIncome group distribution:")
    print(merged_data['income_group'].value_counts())
    
except Exception as e:
    print(f"Warning: Could not fetch income groups: {str(e)}")
    merged_data['income_group'] = 'Unknown'

## Step 5: Data Quality Assessment
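
Overall missing-value counts can hide indicators that are reported only in some years. As a complementary view, the sketch below computes the share of non-missing values per indicator and year on an invented panel; `coverage_by_year` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Invented panel: two economies, two years, two indicators
panel = pd.DataFrame({
    "economy": ["A", "A", "B", "B"],
    "year": [2010, 2011, 2010, 2011],
    "gdp": [1.0, 1.1, 2.0, np.nan],
    "lpi": [3.0, np.nan, 2.5, np.nan],
})


def coverage_by_year(frame, keys=("economy", "year")):
    """Share of non-missing values per indicator, one row per year."""
    value_cols = [c for c in frame.columns if c not in keys]
    return frame[value_cols].notna().groupby(frame["year"]).mean()


coverage = coverage_by_year(panel)
print(coverage)
```

A year with coverage near zero for an indicator suggests a publication gap rather than scattered missing values, which matters when choosing an imputation strategy later.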

In [None]:
# Data quality overview
print("DATA QUALITY REPORT")
print("="*60)

print("\n1. Dataset Dimensions:")
print(f"   Total rows: {len(merged_data):,}")
print(f"   Total columns: {len(merged_data.columns)}")
print(f"   Unique economies: {merged_data['economy'].nunique()}")
print(f"   Years covered: {merged_data['year'].min()} - {merged_data['year'].max()}")

print("\n2. Missing Data Analysis:")
missing_summary = pd.DataFrame({
    'Column': merged_data.columns,
    'Missing_Count': merged_data.isnull().sum().values,
    'Missing_Percent': (merged_data.isnull().sum().values / len(merged_data) * 100).round(2)
})
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Percent', ascending=False)
print(missing_summary.to_string(index=False))

print("\n3. Sample Data:")
print(merged_data.head(10))

In [None]:
# Descriptive statistics
print("\n4. Descriptive Statistics:")
print(merged_data.describe().round(2))

## Step 6: Save Raw Data
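
A quick round trip confirms that what was written can be read back unchanged. This is a minimal sketch using a temporary directory and invented data; the file names `sample.csv` and `sample_metadata.json` are placeholders, not the notebook's actual output paths.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame({"economy": ["A", "B"], "year": [2010, 2010], "gdp": [1.0, None]})
meta = {"time_period": "2010-2022", "total_records": len(frame)}

with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders

    csv_path = raw_dir / "sample.csv"
    meta_path = raw_dir / "sample_metadata.json"
    frame.to_csv(csv_path, index=False)
    meta_path.write_text(json.dumps(meta, indent=4))

    # Read both files back while the temporary directory still exists
    reloaded = pd.read_csv(csv_path)
    reloaded_meta = json.loads(meta_path.read_text())

print(reloaded.shape == frame.shape)
print(reloaded_meta["total_records"] == len(reloaded))
```

Comparing the reloaded shape with the record count stored in the metadata catches truncated writes and mismatched files.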

In [None]:
# Save to CSV
import os

output_path = '../data/raw/world_bank_transport_data_raw.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # create the folder if it is missing
merged_data.to_csv(output_path, index=False)
print(f"Raw data saved to: {output_path}")
print(f"Dataset shape: {len(merged_data):,} rows x {len(merged_data.columns)} columns")

# Save metadata
metadata = {
    'extraction_date': datetime.datetime.now().strftime('%Y-%m-%d'),
    'data_source': 'World Bank Open Data API',
    'time_period': f"{start_date.year}-{end_date.year}",
    'total_indicators': len(indicators),
    'total_economies': merged_data['economy'].nunique(),
    'total_records': len(merged_data),
    'indicators': indicators
}

import json
with open('../data/raw/extraction_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=4)

print("\nMetadata saved successfully")

## Summary

### What We Accomplished:

1. **Extracted 13 indicators** from the World Bank API across 5 key categories
2. **Merged into single dataset** with economy-year as composite key
3. **Added income group classification** for segmented analysis
4. **Assessed data quality** including missingness patterns
5. **Saved raw data** for subsequent processing

### Next Steps:

1. Data cleaning and preprocessing (handling missing values, outliers)
2. Exploratory Data Analysis (EDA)
3. Feature engineering
4. Machine learning modeling
5. Business insights and recommendations

---

**Note:** This dataset is custom-built by combining multiple World Bank indicators. It does not exist as a pre-packaged dataset on Kaggle or other platforms.