# Education Statistics exploration
## by Pedro Cruz

## Introduction

> This notebook explores the Education Statistics dataset published in the World Bank Data Catalog.
> The World Bank Data Catalog provides an API, and WBGAPI, an open-source module by Tim Herzog (https://github.com/tgherzog), streamlines its use. All calls to the World Bank API in this report are made through the WBGAPI module.

In [1]:
import wbgapi as wb
import pandas as pd

pd.set_option('display.max_colwidth', None)

In [2]:
# series for future exploration
# {'EG.GDP.PUSE.KO.PP': 'GDP per unit of energy use (PPP $ per kg of oil equivalent)',
# 'SL.GDP.PCAP.EM.KD': 'GDP per person employed (constant 2017 PPP $)'
# }

In [3]:
# list available sources
wb.source.info()

id,name,code,concepts,lastupdated
1.0,Doing Business,DBS,3.0,2021-08-18
2.0,World Development Indicators,WDI,3.0,2022-12-22
3.0,Worldwide Governance Indicators,WGI,3.0,2022-09-23
5.0,Subnational Malnutrition Database,SNM,3.0,2016-03-21
6.0,International Debt Statistics,IDS,4.0,2022-12-06
11.0,Africa Development Indicators,ADI,3.0,2013-02-22
12.0,Education Statistics,EDS,3.0,2020-12-20
13.0,Enterprise Surveys,ESY,3.0,2022-03-25
14.0,Gender Statistics,GDS,3.0,2022-06-23
15.0,Global Economic Monitor,GEM,3.0,2020-07-27


In [4]:
# search for 'education' databases
wb.source.info(q='education')

id,name,code,concepts,lastupdated
12.0,Education Statistics,EDS,3.0,2020-12-20
34.0,Global Partnership for Education,GPE,3.0,2013-04-12
84.0,Education Policy,EDP,3.0,2022-07-19
,3 elements,,,


In [5]:
# change database to 'Education Statistics' (id 12)
wb.db = 12

In [6]:
wb.db

12

In [7]:
# get, in a DataFrame, all series in the education statistics database
ed_series = pd.DataFrame(wb.series.list())
ed_series

,id,value
0,BAR.NOED.1519.FE.ZS,Barro-Lee: Percentage of female population age 15-19 with no education
1,BAR.NOED.1519.ZS,Barro-Lee: Percentage of population age 15-19 with no education
2,BAR.NOED.15UP.FE.ZS,Barro-Lee: Percentage of female population age 15+ with no education
3,BAR.NOED.15UP.ZS,Barro-Lee: Percentage of population age 15+ with no education
4,BAR.NOED.2024.FE.ZS,Barro-Lee: Percentage of female population age 20-24 with no education
...,...,...
4396,UIS.YR.END.MON.6T8,End month of the academic school year (tertiary education)
4397,UIS.YR.ST.01T5,Start of the academic school year (pre-primary to post-secondary non tertiary education)
4398,UIS.YR.ST.6T8,Start of the academic school year (tertiary education)
4399,UIS.YR.ST.MON.01T5,Start month of the academic school year (pre-primary to post-secondary non-tertiary education)


In [8]:
# select/filter the series of interest
# series on government expenditure
df_temp_a = ed_series[ed_series['value'].str.lower().str.contains('government expenditure')]
# keep only the government expenditure series reported in US$
# (raw string, so that `\$` is a regex escape and not an invalid Python string escape)
df_temp_b = df_temp_a[df_temp_a['value'].str.contains(r'US\$')]

goverment_expenses = list(df_temp_b['id'])
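
In a regular expression `$` is an end-of-string anchor, so a literal match needs either an escaped pattern or `regex=False`. The following is a minimal sketch of the second option on a hypothetical three-row frame; the ids and the last two labels are invented for illustration and do not come from the API.

```python
import pandas as pd

# Hypothetical stand-in for ed_series; ids are illustrative only
demo = pd.DataFrame({
    'id': ['A.1', 'A.2', 'A.3'],
    'value': [
        'Government expenditure on education, US$ (millions)',
        'Government expenditure on education as % of GDP (%)',
        'Enrolment in primary education, both sexes (number)',
    ],
})

# case=False replaces the .str.lower() step; regex=False treats the pattern literally
is_gov = demo['value'].str.contains('government expenditure', case=False, regex=False)
is_usd = demo['value'].str.contains('US$', regex=False)

# combine both conditions in a single boolean mask
usd_ids = demo.loc[is_gov & is_usd, 'id'].tolist()
print(usd_ids)
```

Combining the two masks avoids the intermediate frames and keeps the whole selection in one expression.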

In [9]:
# select the series with test scores
ed_series[ed_series['value'].str.lower().str.contains('pisa')]


,id,value
1202,LO.PISA.MAT,PISA: Mean performance on the mathematics scale
1203,LO.PISA.MAT.0,PISA: 15-year-olds by mathematics proficiency level (%). Below Level 1
1204,LO.PISA.MAT.0.FE,PISA: Female 15-year-olds by mathematics proficiency level (%). Below Level 1
1205,LO.PISA.MAT.0.MA,PISA: Male 15-year-olds by mathematics proficiency level (%). Below Level 1
1206,LO.PISA.MAT.1,PISA: 15-year-olds by mathematics proficiency level (%). Level 1
...,...,...
1299,LO.PISA.SCI.P25,PISA: Distribution of Science Scores: 25th Percentile Score
1300,LO.PISA.SCI.P50,PISA: Distribution of Science Scores: 50th Percentile Score
1301,LO.PISA.SCI.P75,PISA: Distribution of Science Scores: 75th Percentile Score
1302,LO.PISA.SCI.P90,PISA: Distribution of Science Scores: 90th Percentile Score
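
The `'pisa'` filter returns about a hundred series, most of them proficiency levels and percentiles. If only the mean scores are of interest, one possible narrowing step is to match the label prefix. A small sketch, using three rows copied from the output above as a stand-in for `ed_series`:

```python
import pandas as pd

# Three rows copied from the notebook output; a small stand-in for ed_series
pisa_demo = pd.DataFrame({
    'id': ['LO.PISA.MAT', 'LO.PISA.MAT.0', 'LO.PISA.SCI.P25'],
    'value': [
        'PISA: Mean performance on the mathematics scale',
        'PISA: 15-year-olds by mathematics proficiency level (%). Below Level 1',
        'PISA: Distribution of Science Scores: 25th Percentile Score',
    ],
})

# keep only the series whose label starts with the mean-performance prefix
mean_scores = pisa_demo[pisa_demo['value'].str.startswith('PISA: Mean performance')]
mean_ids = mean_scores['id'].tolist()
print(mean_ids)
```

The prefix is taken from the one mean-performance label visible in the output; whether every mean-score series in the database follows the same wording is an assumption to check against the full list.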


In [13]:

wb.series.info(goverment_expenses)

id,value
UIS.X.USCONST.UK.FSGOV,"Government expenditure on education not specified by level, constant US$ (millions)"
UIS.X.US.UK.FSGOV,"Government expenditure on education not specified by level, US$ (millions)"
UIS.X.USCONST.FSGOV,"Government expenditure on education, constant US$ (millions)"
UIS.X.US.FSGOV,"Government expenditure on education, US$ (millions)"
UIS.X.USCONST.2.FSGOV,"Government expenditure on lower secondary education, constant US$ (millions)"
UIS.X.US.2.FSGOV,"Government expenditure on lower secondary education, US$ (millions)"
UIS.X.USCONST.4.FSGOV,"Government expenditure on post-secondary non-tertiary education, constant US$ (millions)"
UIS.X.US.4.FSGOV,"Government expenditure on post-secondary non-tertiary education, US$ (millions)"
UIS.X.USCONST.02.FSGOV,"Government expenditure on pre-primary education, constant US$ (millions)"
UIS.X.US.02.FSGOV,"Government expenditure on pre-primary education, US$ (millions)"


In [14]:
# get the series data for Brazil
# select the government expenditure on education, US$
serie_id = 'UIS.X.US.FSGOV'
# transform to list
# get the first element (a dictionary)
# get the value of the key called 'value'
serie_desc = list(wb.series.list(serie_id))[0]['value']
print(f'{serie_id}: {serie_desc}')

df = wb.data.DataFrame(serie_id, 'BRA').reset_index()

UIS.X.US.FSGOV: Government expenditure on education, US$ (millions)


## Data Cleaning

In [15]:
df_clean = df.copy()

In [16]:

df_clean = df_clean.melt(id_vars=['economy'], value_name=serie_desc, var_name='year')

In [18]:
# remove extra characters in the column 'year'

df_clean['year'] = df_clean['year'].str.replace('YR', '')
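
The reshape above turns one wide row per economy into one row per economy and year. A minimal sketch of the same two steps on a hypothetical frame (the column layout mimics the `YR`-prefixed columns, the numbers are invented, and `expenditure` is a shortened column name used only here); it also converts the year to an integer, which the notebook does not do:

```python
import pandas as pd

# Hypothetical wide frame shaped like the API result after reset_index();
# the values are invented for illustration
wide = pd.DataFrame({
    'economy': ['BRA'],
    'YR2000': [float('nan')],
    'YR2001': [100.0],
    'YR2002': [120.5],
})

# wide -> long: one row per (economy, year)
long_df = wide.melt(id_vars=['economy'], var_name='year', value_name='expenditure')

# strip the literal 'YR' prefix and store the year as an integer
long_df['year'] = long_df['year'].str.replace('YR', '', regex=False).astype(int)

print(long_df)
```

An integer year sorts and plots numerically, whereas the string version would need converting later.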

In [22]:
# check for NaNs
df_clean.isna().sum()

economy                                                 0
year                                                    0
Government expenditure on education, US$ (millions)    47
dtype: int64

In [23]:
# dropping the NaNs
df_clean.dropna(inplace=True)

In [25]:
# check
df_clean.isna().sum()

economy                                                0
year                                                   0
Government expenditure on education, US$ (millions)    0
dtype: int64
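
Dropping the NaN rows leaves gaps in the time series, so it can help to record which years had no reported value before discarding them. A minimal sketch on a hypothetical long-format frame (values invented, `expenditure` is an illustrative column name):

```python
import pandas as pd

# Hypothetical long-format frame; the values are invented for illustration
demo = pd.DataFrame({
    'economy': ['BRA'] * 4,
    'year': [1998, 1999, 2000, 2001],
    'expenditure': [float('nan'), 50.0, float('nan'), 62.5],
})

# record which years have no reported value before discarding them
missing_years = demo.loc[demo['expenditure'].isna(), 'year'].tolist()

# dropna returns a new frame; reset_index gives a clean 0..n-1 index
demo_clean = demo.dropna(subset=['expenditure']).reset_index(drop=True)

print(missing_years)
print(demo_clean)
```

Passing `subset` restricts the check to the value column, so rows would not be dropped because of a NaN in some unrelated column added later.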