# 1. WDI Package (World Bank Data Access)

 Installation:

In [1]:
install.packages("WDI")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Load:

In [2]:
library(WDI)

 Example: Downloading Multiple Indicators

In [3]:
# Load WDI package
library(WDI)

# Define indicators to download
indicators <- c(
  edu_exp_gdp = "SE.XPD.TOTL.GD.ZS",         # Education expenditure (% of GDP)
  gdp_per_capita = "NY.GDP.PCAP.CD",         # GDP per capita
  gov_exp_gdp = "NE.CON.GOVT.ZS",            # Gov final consumption expenditure (% of GDP)
  unemployment_rate = "SL.UEM.TOTL.ZS",      # Unemployment rate
  inflation_rate = "FP.CPI.TOTL.ZG",         # Inflation rate
  population = "SP.POP.TOTL",                # Total population
  urban_pop_percent = "SP.URB.TOTL.IN.ZS",   # Urban population %
  literacy_rate = "SE.ADT.LITR.ZS",          # Literacy rate
  education_index = "SE.SEC.ENRR",           # Secondary school enrollment (%)
  hdi_proxy = "NY.GNP.PCAP.CD"               # GNI per capita (proxy for HDI)

)

# Download data for all countries from 1990 to 2023
data <- WDI(
  country = "all",
  indicator = indicators,
  start = 1990,
  end = 2023,
  extra = TRUE,     # includes region, income level, etc.
  cache = NULL
)

# View the first few rows
head(data)

“cannot open URL 'https://api.worldbank.org/v2/en/country/all/indicator/SL.UEM.TOTL.ZS?format=json&date=1990:2023&per_page=32500&page=5': HTTP status was '400 Bad Request'”
“cannot open URL 'https://api.worldbank.org/v2/en/country/all/indicator/SP.URB.TOTL.IN.ZS?format=json&date=1990:2023&per_page=32500&page=10': HTTP status was '400 Bad Request'”


,country,iso2c,iso3c,year,status,lastupdated,edu_exp_gdp,gdp_per_capita,gov_exp_gdp,unemployment_rate,⋯,urban_pop_percent,literacy_rate,education_index,hdi_proxy,region,capital,longitude,latitude,income,lending
,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,Afghanistan,AF,AFG,1991,,2025-07-01,,,,8.07,⋯,21.266,,19.6126,,South Asia,Kabul,69.1761,34.5228,Low income,IDA
2,Afghanistan,AF,AFG,2019,,2025-07-01,,496.6025,,11.185,⋯,25.754,,,520.0,South Asia,Kabul,69.1761,34.5228,Low income,IDA
3,Afghanistan,AF,AFG,2015,,2025-07-01,3.2558,565.5697,,9.052,⋯,24.803,33.75384,53.28514,590.0,South Asia,Kabul,69.1761,34.5228,Low income,IDA
4,Afghanistan,AF,AFG,2000,,2025-07-01,,174.931,,7.935,⋯,22.078,,,,South Asia,Kabul,69.1761,34.5228,Low income,IDA
5,Afghanistan,AF,AFG,2003,,2025-07-01,,198.8711,,7.88,⋯,22.353,,14.07805,190.0,South Asia,Kabul,69.1761,34.5228,Low income,IDA
6,Afghanistan,AF,AFG,2017,,2025-07-01,4.34319,525.4698,,11.184,⋯,25.25,,55.40215,530.0,South Asia,Kabul,69.1761,34.5228,Low income,IDA
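
The two warnings above show that some paged requests came back with HTTP 400, so parts of a bulk download can fail while the call still returns a data frame. One possible way to make such failures visible is to request each indicator separately and record which ones raised a warning or an error. The sketch below is only an illustration: `fetch_each` and its `fetch` argument are hypothetical names that are not part of the WDI package, and treating any warning as a failure is a choice made for this sketch.

```r
# Hypothetical helper (not part of WDI): request indicators one at a time
# and keep track of the ones whose request warned or failed.
fetch_each <- function(indicators, fetch = WDI::WDI, ...) {
  ok <- list()
  failed <- character(0)
  for (nm in names(indicators)) {
    res <- tryCatch(
      fetch(indicator = indicators[nm], ...),
      warning = function(w) NULL,  # assumption: any warning counts as a failure
      error   = function(e) NULL
    )
    if (is.null(res)) {
      failed <- c(failed, nm)
    } else {
      ok[[nm]] <- res
    }
  }
  list(ok = ok, failed = failed)
}

# Possible usage with the indicators defined earlier (needs network access):
# res <- fetch_each(indicators, country = "all", start = 1990, end = 2023)
# res$failed   # names of indicators worth retrying
```

Indicators listed in `failed` could then be retried later or requested over a shorter year range.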


Filter Example: Only Keep Countries (Exclude Aggregates)

In [4]:
# Keep only countries (remove regions like "World" or "Europe & Central Asia")
data_clean <- subset(data, region != "Aggregates")

# Preview cleaned dataset
head(data_clean)

,country,iso2c,iso3c,year,status,lastupdated,edu_exp_gdp,gdp_per_capita,gov_exp_gdp,unemployment_rate,⋯,urban_pop_percent,literacy_rate,education_index,hdi_proxy,region,capital,longitude,latitude,income,lending
,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,Afghanistan,AF,AFG,1991,,2025-07-01,,,,8.07,⋯,21.266,,19.6126,,South Asia,Kabul,69.1761,34.5228,Low income,IDA
2,Afghanistan,AF,AFG,2019,,2025-07-01,,496.6025,,11.185,⋯,25.754,,,520.0,South Asia,Kabul,69.1761,34.5228,Low income,IDA
3,Afghanistan,AF,AFG,2015,,2025-07-01,3.2558,565.5697,,9.052,⋯,24.803,33.75384,53.28514,590.0,South Asia,Kabul,69.1761,34.5228,Low income,IDA
4,Afghanistan,AF,AFG,2000,,2025-07-01,,174.931,,7.935,⋯,22.078,,,,South Asia,Kabul,69.1761,34.5228,Low income,IDA
5,Afghanistan,AF,AFG,2003,,2025-07-01,,198.8711,,7.88,⋯,22.353,,14.07805,190.0,South Asia,Kabul,69.1761,34.5228,Low income,IDA
6,Afghanistan,AF,AFG,2017,,2025-07-01,4.34319,525.4698,,11.184,⋯,25.25,,55.40215,530.0,South Asia,Kabul,69.1761,34.5228,Low income,IDA
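
One detail of the filter above is worth keeping in mind: `subset()` drops rows where the condition evaluates to `NA`, so any row with a missing `region` would be removed along with the aggregates. Whether the downloaded data contains such rows is not shown here, so the toy data frame below (invented values) only demonstrates the behaviour.

```r
# Toy data: one aggregate, one country, one row with a missing region
toy <- data.frame(
  country = c("World", "India", "Unknown"),
  region  = c("Aggregates", "South Asia", NA),
  stringsAsFactors = FALSE
)

# subset() treats an NA condition as FALSE, so the NA-region row is dropped too
kept_subset <- subset(toy, region != "Aggregates")

# Explicit indexing that keeps rows whose region is missing
kept_index <- toy[is.na(toy$region) | toy$region != "Aggregates", ]

nrow(kept_subset)  # 1
nrow(kept_index)   # 2
```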


In [5]:
# Number of columns: length() of a data frame counts columns (use nrow() for rows)
length(data_clean)

The `data_clean` data frame contains World Bank World Development Indicators (WDI) data for individual countries from 1990 to 2023. The indicators cover education, the economy, population, and more. Aggregate regions have been filtered out, so only individual countries remain.
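
When checking the size of a data frame such as `data_clean`, note that `length()` reports the number of columns, while `nrow()` and `dim()` report rows. The small example below uses an invented data frame to show the difference.

```r
# Toy data frame with 3 rows and 2 columns
toy <- data.frame(
  country = c("A", "B", "C"),
  year    = c(2000L, 2001L, 2002L)
)

length(toy)  # 2  (columns, same as ncol(toy))
nrow(toy)    # 3  (rows)
dim(toy)     # 3 2 (rows, then columns)
```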

In [6]:
# Calculate the number of NA values in each column
na_counts <- colSums(is.na(data_clean))

# Print the number of NA values for each column
print(na_counts)

          country             iso2c             iso3c              year 
                0                 0                 0                 0 
           status       lastupdated       edu_exp_gdp    gdp_per_capita 
                0                 0              3257               370 
      gov_exp_gdp unemployment_rate    inflation_rate        population 
             1799              1210              1522                 0 
urban_pop_percent     literacy_rate   education_index         hdi_proxy 
               68              6393              2766               798 
           region           capital         longitude          latitude 
                0                 0                 0                 0 
           income           lending 
                0                 0 
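
Raw NA counts are easier to judge relative to the number of rows. As a small sketch on invented data, `colMeans(is.na(...))` gives the share of missing values per column, which can help when deciding whether to drop rows or a whole column.

```r
# Toy data: half of edu_exp_gdp is missing, population is complete
toy <- data.frame(
  edu_exp_gdp = c(3.2, NA, 4.5, NA),
  population  = c(10, 20, 30, 40)
)

# Percentage of missing values per column, largest first
na_share <- round(100 * colMeans(is.na(toy)), 1)
na_share <- sort(na_share, decreasing = TRUE)
print(na_share)
```

The same two lines can be applied to `data_clean` in place of `toy`.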


In [7]:
# Remove rows where 'edu_exp_gdp' is NA
data_clean <- data_clean[!is.na(data_clean$edu_exp_gdp), ]

# Display the first few rows of the updated dataframe
head(data_clean)

,country,iso2c,iso3c,year,status,lastupdated,edu_exp_gdp,gdp_per_capita,gov_exp_gdp,unemployment_rate,⋯,urban_pop_percent,literacy_rate,education_index,hdi_proxy,region,capital,longitude,latitude,income,lending
,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
3,Afghanistan,AF,AFG,2015,,2025-07-01,3.2558,565.5697,,9.052,⋯,24.803,33.75384,53.28514,590,South Asia,Kabul,69.1761,34.5228,Low income,IDA
6,Afghanistan,AF,AFG,2017,,2025-07-01,4.34319,525.4698,,11.184,⋯,25.25,,55.40215,530,South Asia,Kabul,69.1761,34.5228,Low income,IDA
7,Afghanistan,AF,AFG,2016,,2025-07-01,4.54397,522.0822,,10.133,⋯,25.02,,53.50634,560,South Asia,Kabul,69.1761,34.5228,Low income,IDA
8,Afghanistan,AF,AFG,2014,,2025-07-01,3.69522,625.0549,,7.915,⋯,24.587,,54.23548,640,South Asia,Kabul,69.1761,34.5228,Low income,IDA
16,Afghanistan,AF,AFG,2009,,2025-07-01,4.81064,452.0537,,7.754,⋯,23.528,,44.39717,460,South Asia,Kabul,69.1761,34.5228,Low income,IDA
17,Afghanistan,AF,AFG,2013,,2025-07-01,3.45446,637.0871,,7.93,⋯,24.373,,54.75422,670,South Asia,Kabul,69.1761,34.5228,Low income,IDA


In [8]:
nrow(data_clean)

Check Remaining NAs, Drop literacy_rate, Save as CSV

In [9]:
# Calculate the number of NA values in each column
na_counts <- colSums(is.na(data_clean))

# Print the number of NA values for each column
print(na_counts)

          country             iso2c             iso3c              year 
                0                 0                 0                 0 
           status       lastupdated       edu_exp_gdp    gdp_per_capita 
                0                 0                 0                26 
      gov_exp_gdp unemployment_rate    inflation_rate        population 
              554               333               302                 0 
urban_pop_percent     literacy_rate   education_index         hdi_proxy 
                0              3392               962               109 
           region           capital         longitude          latitude 
                0                 0                 0                 0 
           income           lending 
                0                 0 


In [10]:
# Remove the 'literacy_rate' column
data_clean <- subset(data_clean, select = -literacy_rate)

# Display the first few rows of the updated dataframe to confirm
head(data_clean)

,country,iso2c,iso3c,year,status,lastupdated,edu_exp_gdp,gdp_per_capita,gov_exp_gdp,unemployment_rate,⋯,population,urban_pop_percent,education_index,hdi_proxy,region,capital,longitude,latitude,income,lending
,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
3,Afghanistan,AF,AFG,2015,,2025-07-01,3.2558,565.5697,,9.052,⋯,33831764,24.803,53.28514,590,South Asia,Kabul,69.1761,34.5228,Low income,IDA
6,Afghanistan,AF,AFG,2017,,2025-07-01,4.34319,525.4698,,11.184,⋯,35688935,25.25,55.40215,530,South Asia,Kabul,69.1761,34.5228,Low income,IDA
7,Afghanistan,AF,AFG,2016,,2025-07-01,4.54397,522.0822,,10.133,⋯,34700612,25.02,53.50634,560,South Asia,Kabul,69.1761,34.5228,Low income,IDA
8,Afghanistan,AF,AFG,2014,,2025-07-01,3.69522,625.0549,,7.915,⋯,32792523,24.587,54.23548,640,South Asia,Kabul,69.1761,34.5228,Low income,IDA
16,Afghanistan,AF,AFG,2009,,2025-07-01,4.81064,452.0537,,7.754,⋯,27466101,23.528,44.39717,460,South Asia,Kabul,69.1761,34.5228,Low income,IDA
17,Afghanistan,AF,AFG,2013,,2025-07-01,3.45446,637.0871,,7.93,⋯,31622704,24.373,54.75422,670,South Asia,Kabul,69.1761,34.5228,Low income,IDA


In [11]:
write.csv(data_clean, "worldbank_data.csv", row.names = FALSE)
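
After writing a CSV it can be useful to read it back and confirm that the dimensions and missing values survived the round trip. The sketch below uses an invented data frame and a temporary file rather than `worldbank_data.csv`, so it runs without the downloaded data.

```r
# Toy data standing in for data_clean (values are illustrative only)
toy <- data.frame(
  country     = c("Afghanistan", "India"),
  year        = c(2015L, 2016L),
  edu_exp_gdp = c(3.2558, NA),
  stringsAsFactors = FALSE
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)

# Same shape, and the missing value is still missing
identical(dim(back), dim(toy))
is.na(back$edu_exp_gdp[2])
```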

In [None]:
colnames(data_clean)

## 1. Country Identifiers

| Column  | Meaning                                                |
|---------|--------------------------------------------------------|
| country | Full country name (e.g., India, Germany)               |
| iso2c   | ISO 3166-1 alpha-2 country code (2-letter, e.g., IN)   |
| iso3c   | ISO 3166-1 alpha-3 country code (3-letter, e.g., IND)  |

## 2. Time & Metadata

| Column      | Meaning                                                 |
|-------------|---------------------------------------------------------|
| year        | Observation year (e.g., 2022)                           |
| status      | Observation status flag returned by the World Bank API (empty in the rows shown above) |
| lastupdated | Last date updated in World Bank database (format: YYYY-MM-DD) |

## 3. Key Variables for Your Research

| Column            | Meaning                                                             |
|-------------------|---------------------------------------------------------------------|
| edu_exp_gdp       | Education expenditure as % of GDP (SE.XPD.TOTL.GD.ZS)               |
| gdp_per_capita    | GDP per capita in current US dollars (NY.GDP.PCAP.CD)               |
| gov_exp_gdp       | General government final consumption expenditure as % of GDP (NE.CON.GOVT.ZS) |
| unemployment_rate | Unemployment rate (% of labor force, SL.UEM.TOTL.ZS)              |
| inflation_rate    | Inflation rate (consumer prices, annual %, FP.CPI.TOTL.ZG)          |
| population        | Total population of the country (SP.POP.TOTL)                       |
| urban_pop_percent | % of population living in urban areas (SP.URB.TOTL.IN.ZS)           |
| literacy_rate     | Adult literacy rate (% age 15+, SE.ADT.LITR.ZS)                     |
| education_index   | Secondary school enrollment, % gross (SE.SEC.ENRR), used as a proxy for the Education Index |
| hdi_proxy         | GNI per capita, Atlas method, current US$ (NY.GNP.PCAP.CD), used as a proxy for HDI |

## 4. Geographic & Economic Metadata

| Column    | Meaning                                                                                    |
|-----------|--------------------------------------------------------------------------------------------|
| region    | World Bank region classification (e.g., East Asia & Pacific)                               |
| capital   | Capital city of the country (e.g., Beijing, Tokyo)                                         |
| longitude | Longitude of the capital city                                                              |
| latitude  | Latitude of the capital city                                                               |
| income    | World Bank income level: Low income, Lower middle income, Upper middle income, or High income |
| lending   | World Bank lending category: IBRD, IDA, Blend, or Not classified                           |