## Phase I Project Proposal
### What determines whether a developing country successfully transitions to high-income status or falls into the middle-income trap?

#### Name: Haojin He, DS 3000


### Introduction

What determines whether a developing country successfully transitions to high-income status or falls into the middle-income trap? The middle-income trap is a critical phenomenon in development economics in which countries reach middle-income status (roughly $4,000 to $12,000 GDP per capita) but then experience prolonged stagnation, failing to break through to high-income levels for decades. I'm interested in examining which economic indicators, such as manufacturing share of GDP, R&D expenditure, education levels, and export complexity, can predict whether a country will escape this trap. I also want to investigate what distinguishes countries that transitioned successfully (such as South Korea and Singapore) from those that remained stuck (such as Brazil, South Africa, and Malaysia).

### Data Collection

I plan to use the World Bank's World Development Indicators (WDI) API to collect economic and social data for countries over the past 30 years (1993 to 2023). The World Bank API suits this project because it provides comprehensive, standardized country-level data and does not require an API key, which makes it straightforward to access. I will focus on middle-income countries and on those that have recently transitioned, or failed to transition, to high-income status, so that I can analyze the factors that distinguish successful transitions from failures.

In [17]:
import requests
import pandas as pd
import numpy as np

def get_world_bank_data(indicator, start_year=1993, end_year=2023):
    """
    Fetch World Bank data for one indicator across all countries.

    Only the first page of results (up to `per_page` rows) is requested,
    and rows with a null value are dropped.
    """
    url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator}"
    params = {
        'format': 'json',
        'date': f'{start_year}:{end_year}',
        'per_page': 500
    }

    # Fail fast on network or HTTP errors instead of parsing an error body
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    data = response.json()

    # A successful response is a two-element list: [metadata, rows]
    if len(data) > 1 and data[1]:
        records = []
        for item in data[1]:
            if item['value'] is not None:  # Only keep non-null values
                records.append({
                    'country_name': item['country']['value'],
                    'country_code': item['countryiso3code'],
                    'year': int(item['date']),
                    'indicator': indicator,
                    'value': float(item['value'])
                })
        return pd.DataFrame(records)
    return pd.DataFrame()

In [18]:
gdp_df = get_world_bank_data('NY.GDP.PCAP.CD')
print(f"Successfully retrieved {len(gdp_df)} records")
gdp_df.head(50)

Successfully retrieved 500 records


| | country_name | country_code | year | indicator | value |
|---|---|---|---|---|---|
| 0 | Africa Eastern and Southern | AFE | 2023 | NY.GDP.PCAP.CD | 1568.159891 |
| 1 | Africa Eastern and Southern | AFE | 2022 | NY.GDP.PCAP.CD | 1628.318944 |
| 2 | Africa Eastern and Southern | AFE | 2021 | NY.GDP.PCAP.CD | 1522.393346 |
| 3 | Africa Eastern and Southern | AFE | 2020 | NY.GDP.PCAP.CD | 1344.10321 |
| 4 | Africa Eastern and Southern | AFE | 2019 | NY.GDP.PCAP.CD | 1493.817938 |
| 5 | Africa Eastern and Southern | AFE | 2018 | NY.GDP.PCAP.CD | 1538.901679 |
| 6 | Africa Eastern and Southern | AFE | 2017 | NY.GDP.PCAP.CD | 1520.212231 |
| 7 | Africa Eastern and Southern | AFE | 2016 | NY.GDP.PCAP.CD | 1329.807285 |
| 8 | Africa Eastern and Southern | AFE | 2015 | NY.GDP.PCAP.CD | 1479.61526 |
| 9 | Africa Eastern and Southern | AFE | 2014 | NY.GDP.PCAP.CD | 1656.167709 |
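
The call above requests a single page of 500 rows, so it returns only the start of the full result set (the rows shown all belong to one regional aggregate rather than an individual country). A minimal sketch of how the remaining pages could be collected is given here. It relies on the first element of the API response reporting a `pages` count and on a `page` query parameter selecting the page. `fetch_all_pages` and `fetch_page_http` are hypothetical helper names introduced for illustration, not part of the notebook above.

```python
def fetch_page_http(indicator, page, start_year, end_year, per_page):
    """Fetch one page from the World Bank API and return (metadata, rows)."""
    import requests  # imported lazily so the paging logic can be tested offline

    url = f"https://api.worldbank.org/v2/country/all/indicator/{indicator}"
    params = {
        "format": "json",
        "date": f"{start_year}:{end_year}",
        "per_page": per_page,
        "page": page,
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    data = response.json()
    # Anything other than [metadata, rows] is treated as "no data".
    if not isinstance(data, list) or len(data) < 2:
        return {"pages": 0}, []
    return data[0], data[1] or []


def fetch_all_pages(indicator, fetch_page, start_year=1993, end_year=2023,
                    per_page=500):
    """Walk every page for one indicator and return a list of flat records."""
    records = []
    page, pages = 1, 1  # the real page count is read from the first response
    while page <= pages:
        meta, rows = fetch_page(indicator, page, start_year, end_year, per_page)
        pages = int(meta.get("pages", 1))
        for item in rows:
            if item["value"] is None:  # keep only non-null observations
                continue
            records.append({
                "country_name": item["country"]["value"],
                "country_code": item["countryiso3code"],
                "year": int(item["date"]),
                "indicator": indicator,
                "value": float(item["value"]),
            })
        page += 1
    return records
```

With network access, `pd.DataFrame(fetch_all_pages('NY.GDP.PCAP.CD', fetch_page_http))` would produce a frame with the same columns as `gdp_df`. Passing the page fetcher in as an argument is a design choice that lets the loop be checked against canned responses.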


### Data Usage and Remaining Issues

The dataset above is already mostly clean, but some issues remain. The main one is that data completeness varies widely across indicators: GDP and demographic data have excellent coverage (85-95%), while manufacturing share (60%) and R&D expenditure (30%) have substantial missing values. I plan to address this either by forward-filling slowly changing variables such as education enrollment, or by restricting the analysis to countries with at least 70% data completeness across all key indicators. I also need to define clear criteria for labeling countries as "escaped," "trapped," or "at risk" based on their income transition history. One option I'm considering is a 10-year window: a country whose average GDP per capita growth stays below 2% over the 10 years after it reaches middle-income status would be classified as "trapped."
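
The two options for handling missing values can be sketched with pandas as follows. This is an illustration under stated assumptions, not a finished pipeline: it assumes long-format input with the columns produced by `get_world_bank_data`, it reads the 70% rule as applying to each key indicator separately, and the helper names `to_wide`, `filter_by_completeness`, and `forward_fill_within_country` are hypothetical.

```python
import pandas as pd


def to_wide(long_df, years):
    """Pivot long records to one row per country-year, one column per indicator.

    Years with no observation become explicit NaN rows so that coverage
    can be measured against the full year range.
    """
    wide = long_df.pivot_table(index=["country_code", "year"],
                               columns="indicator", values="value")
    countries = wide.index.get_level_values("country_code").unique()
    full_index = pd.MultiIndex.from_product([countries, list(years)],
                                            names=["country_code", "year"])
    wide = wide.reindex(full_index).reset_index()
    wide.columns.name = None
    return wide


def filter_by_completeness(wide, key_cols, min_share=0.70):
    """Keep countries whose non-missing share is >= min_share for every key column."""
    present = wide[key_cols].notna().astype(float)
    share = present.groupby(wide["country_code"]).mean()
    keep = share.index[(share >= min_share).all(axis=1)]
    return wide[wide["country_code"].isin(keep)].copy()


def forward_fill_within_country(wide, cols):
    """Carry the last observed value forward, never across countries."""
    out = wide.sort_values(["country_code", "year"]).copy()
    out[cols] = out.groupby("country_code")[cols].ffill()
    return out
```

Grouping by `country_code` before filling matters: without it, the last value of one country would leak into the first missing years of the next. Forward fill also leaves leading gaps untouched, so a country with no early observations still has missing values afterward.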


However, I have plenty of numeric features, including GDP per capita growth rates, manufacturing share of GDP, education enrollment rates, and urban population percentages, that should be useful in predicting middle-income trap risk. We have not covered any ML models in class yet, but from what I've read about supervised machine learning, both of my questions could reasonably be addressed with classification models (predicting whether a country will escape the trap within 10 years) or regression models (predicting the number of years needed to transition to high-income status). Unsupervised techniques such as clustering might also help identify different development pathways or country archetypes, but I'm less familiar with those and would need to investigate further.
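
A classification model needs a target label for each country, so the 10-year, 2% rule considered earlier has to be turned into something computable. The sketch below is one possible reading of that rule, with several assumptions: the middle-income entry level is set to $4,000 (the lower bound quoted in the introduction), growth is computed year over year from the GDP per capita level series, and the average is a simple mean of the ten annual rates. `label_trapped` is a hypothetical helper, and it deliberately does not attempt to define "escaped" or "at risk," which the proposal leaves open.

```python
def label_trapped(gdp_by_year, entry_level=4000.0, window=10, min_growth_pct=2.0):
    """Label one country from a {year: GDP per capita} mapping.

    Returns "trapped" or "not trapped", or None when the country never
    reaches the entry level or the window after entry is incomplete.
    """
    years = sorted(gdp_by_year)
    # First year at or above the (assumed) middle-income entry level.
    entry = next((y for y in years if gdp_by_year[y] >= entry_level), None)
    if entry is None:
        return None

    growth_rates = []
    for year in range(entry + 1, entry + window + 1):
        if year not in gdp_by_year or (year - 1) not in gdp_by_year:
            return None  # a gap inside the window: do not guess
        previous, current = gdp_by_year[year - 1], gdp_by_year[year]
        growth_rates.append(100.0 * (current / previous - 1.0))

    average_growth = sum(growth_rates) / window
    return "trapped" if average_growth < min_growth_pct else "not trapped"
```

Returning `None` instead of forcing a label keeps countries with short or patchy histories out of the training set rather than mislabeling them. Because `NY.GDP.PCAP.CD` is in current US dollars, growth computed this way mixes real growth with inflation and exchange-rate movements, so an inflation-adjusted series would likely be a better input for the final labels.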