# Initial Data Exploration: Barro-Lee Dataset

This notebook explores the Barro-Lee dataset on educational attainment for the population aged 25–64. The goal is to understand the dataset's structure, identify usable metrics, and prepare the data for analysis.

## Step 1: Import Libraries & Load Data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
df = pd.read_csv("../data/BL_v3_MF2564.csv")

## Step 2: Initial Inspection & Clean-Up

### Shape & Columns

In [23]:
print("---Shape:\n", df.shape)
print("---Columns:\n", df.columns)

---Shape:
 (2044, 20)
---Columns:
 Index(['BLcode', 'country', 'year', 'sex', 'agefrom', 'ageto', 'lu', 'lp',
       'lpc', 'ls', 'lsc', 'lh', 'lhc', 'yr_sch', 'yr_sch_pri', 'yr_sch_sec',
       'yr_sch_ter', 'WBcode', 'region_code', 'pop'],
      dtype='object')


In [49]:
# Do the column names need changing/cleaning?
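
One way to answer this, as a minimal sketch: flag any name that is not plain lowercase snake_case, then normalise. The column list is copied from the output above; the helper `flag_unclean` and its naming rule are assumptions, not part of the dataset or of pandas.

```python
import re

import pandas as pd

# Column names copied from the df.columns output above.
columns = pd.Index([
    "BLcode", "country", "year", "sex", "agefrom", "ageto", "lu", "lp",
    "lpc", "ls", "lsc", "lh", "lhc", "yr_sch", "yr_sch_pri", "yr_sch_sec",
    "yr_sch_ter", "WBcode", "region_code", "pop",
])


def flag_unclean(names):
    """Return names that are not lowercase snake_case (hypothetical rule)."""
    return [n for n in names if not re.fullmatch(r"[a-z][a-z0-9_]*", n)]


unclean = flag_unclean(columns)

# Simplest normalisation: strip whitespace and lowercase every name.
cleaned = [n.strip().lower() for n in columns]

print("---Unclean names:\n", unclean)
```

On the real frame, the cleaned names could be applied with `df.columns = cleaned`. Whether the short codes such as `lu` or `lhc` should also be renamed to something more descriptive depends on the dataset's codebook.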


### Preview Data

In [36]:
print("---Head:\n", df.head())
print("---Tail:\n", df.tail())

---Head:
    BLcode  country  year sex  agefrom  ageto     lu     lp   lpc    ls   lsc  \
0       1  Algeria  1950  MF       25     64  79.68  18.33  3.50  1.63  0.54   
1       1  Algeria  1955  MF       25     64  80.31  17.57  3.70  1.71  0.50   
2       1  Algeria  1960  MF       25     64  84.61  13.38  2.91  1.68  0.65   
3       1  Algeria  1965  MF       25     64  87.12  10.53  2.59  1.93  0.87   
4       1  Algeria  1970  MF       25     64  83.25  13.82  3.63  2.47  1.36   

     lh   lhc  yr_sch  yr_sch_pri  yr_sch_sec  yr_sch_ter WBcode  \
0  0.35  0.23   0.892       0.774       0.106       0.012    DZA   
1  0.37  0.24   0.883       0.762       0.109       0.012    DZA   
2  0.34  0.22   0.731       0.610       0.110       0.011    DZA   
3  0.44  0.28   0.683       0.535       0.134       0.014    DZA   
4  0.36  0.22   0.869       0.693       0.165       0.012    DZA   

                    region_code     pop  
0  Middle East and North Africa  3226.0  
1  Middle East a

In [48]:
# Are names clean?

# Are years in columns or rows?

# Are tertiary columns named well?

# Are NaNs present?
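
The four questions above could be checked along these lines. This is a sketch on a small sample built from the first rows of the preview, trimmed to a few columns. Reading `lh`/`lhc` and the `_ter` suffix as the tertiary columns is an assumption to confirm against the dataset's codebook.

```python
import pandas as pd

# First three rows of the preview above, trimmed to a few columns.
sample = pd.DataFrame({
    "country": ["Algeria", "Algeria", "Algeria"],
    "year": [1950, 1955, 1960],
    "lh": [0.35, 0.37, 0.34],
    "lhc": [0.23, 0.24, 0.22],
    "yr_sch_ter": [0.012, 0.012, 0.011],
})

# Names: do any country labels carry leading/trailing whitespace?
names_clean = sample["country"].eq(sample["country"].str.strip()).all()

# Layout: long format means "year" is a column and each
# (country, year) pair appears on exactly one row.
is_long = ("year" in sample.columns
           and not sample.duplicated(["country", "year"]).any())

# Tertiary columns: matched by name pattern (assumed convention).
tertiary_cols = [c for c in sample.columns
                 if c.startswith("lh") or c.endswith("_ter")]

# NaNs: count per column.
nan_counts = sample.isna().sum()

print("---Tertiary columns:\n", tertiary_cols)
print("---NaN counts:\n", nan_counts)
```

The same expressions should work unchanged on the full `df`, since they only rely on the `country` and `year` columns shown in the preview.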


### Data Types & Missing Data

In [37]:
print("---Info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("---Describe:\n", df.describe())

---Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2044 entries, 0 to 2043
Data columns (total 20 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   BLcode       2044 non-null   int64  
 1   country      2044 non-null   object 
 2   year         2044 non-null   int64  
 3   sex          2044 non-null   object 
 4   agefrom      2044 non-null   int64  
 5   ageto        2044 non-null   int64  
 6   lu           2044 non-null   float64
 7   lp           2044 non-null   float64
 8   lpc          2044 non-null   float64
 9   ls           2044 non-null   float64
 10  lsc          2044 non-null   float64
 11  lh           2044 non-null   float64
 12  lhc          2044 non-null   float64
 13  yr_sch       2044 non-null   float64
 14  yr_sch_pri   2044 non-null   float64
 15  yr_sch_sec   2044 non-null   float64
 16  yr_sch_ter   2044 non-null   float64
 17  WBcode       2044 non-null   object 
 18  region_code  2044 non-null   object 
 19  pop   

In [50]:
# Are there missing values?

# Are there columns with the wrong types?

# Are there placeholder values standing in for missing data (e.g. a string such as "NA" instead of NaN)?
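
A sketch of how these three checks might look. The frame below is toy data with one NaN and one `"NA"` string planted on purpose; it is not taken from the Barro-Lee file. The `PLACEHOLDERS` set is an assumed list of common stand-ins and should be adapted to whatever the source actually uses.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only (not real data).
toy = pd.DataFrame({
    "country": ["A", "B", "NA"],
    "year": [1950, 1955, 1960],
    "pop": [1.0, np.nan, 3.0],
})

# Assumed list of strings that sometimes stand in for missing values.
PLACEHOLDERS = {"", "NA", "N/A", "-", "?"}

# Missing values: true NaNs per column.
missing = toy.isna().sum()

# Non-numeric columns are the ones that can hide placeholders or numbers-as-text.
text_cols = toy.select_dtypes(exclude="number").columns

# Placeholders: count text cells that match the assumed list.
placeholder_hits = {
    c: int(toy[c].astype(str).str.strip().isin(PLACEHOLDERS).sum())
    for c in text_cols
}

# Wrong types: text columns whose every value parses as a number.
wrong_type = [
    c for c in text_cols
    if pd.to_numeric(toy[c], errors="coerce").notna().all()
]

print("---Missing:\n", missing)
print("---Placeholder hits:\n", placeholder_hits)
print("---Numeric stored as text:\n", wrong_type)
```

For the real frame, the `df.info()` output above already reports 2044 non-null entries for each column it lists, so the placeholder check is the one most likely to turn up something new.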


### Unique Values & Duplicates

In [51]:
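print("---Unique countries:\n", df["country"].nunique())
# Repeat for other categorical columns (e.g. region_code) as needed
print("---Duplicate rows:\n", df.duplicated().sum())

---Unique countries:
 146
---Duplicate rows:
 0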

In [52]:
# Are there the correct number of unique values?

# Are there any duplicates?
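
A quick arithmetic check plus a sketch of a fuller report. The row and country counts come from the outputs above; `panel_report` is a hypothetical helper, and the toy frame is illustration only, with a repeated (country, year) key planted on purpose.

```python
import pandas as pd

# Figures reported above: 2044 rows, 146 unique countries.
n_rows, n_countries = 2044, 146
rows_per_country, remainder = divmod(n_rows, n_countries)
# A zero remainder is consistent with every country having the same number of rows.


def panel_report(df, keys=("country", "year")):
    """Summarise duplicates and per-country coverage (hypothetical helper)."""
    counts = df.groupby(keys[0])[keys[1]].nunique()
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_keys": int(df.duplicated(subset=list(keys)).sum()),
        "balanced": counts.nunique() == 1,
    }


# Toy frame (not real data): country "B" repeats the year 1950.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1950, 1955, 1950, 1950],
    "pop": [1.0, 2.0, 3.0, 4.0],
})
report = panel_report(toy)

print("---Rows per country, remainder:\n", rows_per_country, remainder)
print("---Toy report:\n", report)
```

Checking duplicates on the (country, year) key is stricter than `df.duplicated()`, which only flags rows that match in every column. Calling `panel_report(df)` on the real frame would show whether each country actually has the same number of distinct years.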


## Step 3: Reformat & Clean Up