## EDA on World Bank (WB) population data

Use this notebook to answer the questions on the board. For each cell:
- Can you explain what the code is doing?
- Is there anything you can change or adjust?
- Try writing the same or your own version in a new cell below.

### Import libraries and data

In [3]:
# import libraries
import pandas as pd
import os

Remember: relative paths are better for collaboration.


In [2]:
# importing with absolute path
# you won't be able to run this directly - you can get your equivalent by right-clicking on the csv in your explorer

df_direct = pd.read_csv('/Users/margheritaphilipp/Documents/margherita/GitHub/brushup_2025/data/WB_pop_clean.csv')

FileNotFoundError: [Errno 2] No such file or directory: '/Users/margheritaphilipp/Documents/margherita/GitHub/brushup_2025/data/WB_pop_clean.csv'

In [5]:
# importing with relative path:
# if you have saved this notebook in a folder that has a data subfolder with the same csv inside, you can run this directly

# get current working directory (the folder the notebook runs from)
cwd = os.getcwd()
print(cwd)

# build the path from the working directory, so the name of the project folder does not matter
df_og = pd.read_csv(os.path.join(cwd, 'data', 'WB_pop_clean.csv'))

d:\Users\Eric\Desktop\BSE\BrushUp\brushup_2025
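The same idea can be sketched with `pathlib`, which joins path pieces with `/` and works on Windows and macOS alike. This is only a sketch: the helper name `data_path` is hypothetical (not part of the notebook), and it assumes the csv sits in a `data` subfolder of the working directory.

```python
from pathlib import Path

def data_path(filename, base_dir=None, subfolder="data"):
    """Build <base_dir>/<subfolder>/<filename>; base_dir defaults to the current working directory."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return base / subfolder / filename

csv_path = data_path("WB_pop_clean.csv")
print(csv_path)           # full path on your machine, built from the relative layout
print(csv_path.exists())  # True only if the csv is really in that place

# once the file exists, it can be passed straight to pandas:
# df_og = pd.read_csv(csv_path)
```

Checking `csv_path.exists()` before reading is a quick way to see whether a `FileNotFoundError` comes from the path or from something else.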


In [6]:
# good practice to make a copy before manipulating - so you can quickly revert to the original without importing again
# NB in this notebook we are not yet making changes to the df, so we don't need the copy here
df = df_og.copy()
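To see why the copy matters, here is a small sketch on a toy data frame (the two rows reuse the 2023 values for Tuvalu and Nauru; the variable names `original` and `working` are made up for this example). Changing the copy leaves the original untouched.

```python
import pandas as pd

# toy stand-in for the real data
original = pd.DataFrame({"Country Code": ["TUV", "NRU"], "2023": [9816, 11875]})
working = original.copy()  # .copy() makes a deep copy by default

working.loc[0, "2023"] = 0  # change the copy only

print(original.loc[0, "2023"])  # 9816, the original is unchanged
print(working.loc[0, "2023"])   # 0
```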

### Start inspection

Addresses the following questions from class:
- Display the head, check for missing values
- Find the min and max values - overall and just for 2023
- Which countries do they belong to?
- Inspect the values in the “Country Code” column

In [7]:
# show the dimensions (rows and columns) of the data set and display first few rows
print(df.shape)

# other options:
# df.tail(2)
# df.sample(4)

df_og.head() # default is 5

(218, 16)


Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1990,2000,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,"Population, total",SP.POP.TOTL,Afghanistan,AFG,12045660,20130327,33831764,34700612,35688935,36743039,37856121,39068979,40000412,40578842,41454761,42647492
1,"Population, total",SP.POP.TOTL,Albania,ALB,3286542,3089027,2880703,2876101,2873457,2866376,2854191,2837849,2811666,2777689,2745972,2714617
2,"Population, total",SP.POP.TOTL,Algeria,DZA,25375810,30903893,40019529,40850721,41689299,42505035,43294546,44042091,44761099,45477389,46164219,46814308
3,"Population, total",SP.POP.TOTL,American Samoa,ASM,46640,56855,52878,52245,51586,50908,50209,49761,49225,48342,47521,46765
4,"Population, total",SP.POP.TOTL,Andorra,AND,52597,65685,72174,72181,73763,75162,76474,77380,78364,79705,80856,81938


In [8]:
# sometimes not all columns are visible so it can be useful to get the full list
df_og.columns

Index(['Series Name', 'Series Code', 'Country Name', 'Country Code', '1990',
       '2000', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       '2023', '2024'],
      dtype='object')

Note that while .head() is a ***method*** that I call on the data frame (with parentheses), .shape and .columns are ***attributes*** of the data frame object that I access without parentheses.
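A quick way to check the difference yourself, sketched on a toy data frame (the name `toy` and its columns are made up for this example): methods are callable, attributes are plain values.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(toy.shape)            # attribute, no parentheses: (3, 2)
print(list(toy.columns))    # attribute: ['a', 'b']
print(callable(toy.head))   # True, head is a method
print(callable(toy.shape))  # False, shape is just a tuple
print(len(toy.head(2)))     # calling the method with an argument returns 2 rows
```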


In [9]:
# the info method also tells us which columns are present and what data type they contain
# we know from the shape attribute that there are 218 rows, and every column shows 218 non-null entries, i.e. we don't have missing values

df_og.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Series Name   218 non-null    object
 1   Series Code   218 non-null    object
 2   Country Name  218 non-null    object
 3   Country Code  218 non-null    object
 4   1990          218 non-null    int64 
 5   2000          218 non-null    int64 
 6   2015          218 non-null    int64 
 7   2016          218 non-null    int64 
 8   2017          218 non-null    int64 
 9   2018          218 non-null    int64 
 10  2019          218 non-null    int64 
 11  2020          218 non-null    int64 
 12  2021          218 non-null    int64 
 13  2022          218 non-null    int64 
 14  2023          218 non-null    int64 
 15  2024          218 non-null    int64 
dtypes: int64(12), object(4)
memory usage: 27.4+ KB
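Missing values can also be counted directly instead of reading them off the info output. A minimal sketch on a toy data frame with one deliberately missing value (the data is invented for this example):

```python
import numpy as np
import pandas as pd

# toy data with one missing value in the '2023' column
toy = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC"],
    "2023": [100.0, np.nan, 300.0],
})

missing_per_column = toy.isna().sum()  # number of missing values in each column
print(missing_per_column)
print(toy.isna().any().any())          # True if there is at least one missing value anywhere
```

On the population data, `df.isna().sum()` should return 0 for every column if the info output above is right.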


In [10]:
# statistical summary of the numerical columns - we can already see a suspiciously high maximum value...

df_og.describe()

Unnamed: 0,1990,2000,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
count,218.0,218.0,218.0,218.0,218.0,218.0,218.0,218.0,218.0,218.0,218.0,218.0
mean,48523920.0,56429340.0,68166020.0,68966170.0,69752040.0,70511060.0,71252370.0,71956730.0,72560720.0,73200060.0,73883390.0,74590370.0
std,371013300.0,431155300.0,519422900.0,525415300.0,531309900.0,536982800.0,542504000.0,547733300.0,552197400.0,556890000.0,561917500.0,567111300.0
min,8798.0,9544.0,10954.0,10930.0,10869.0,10751.0,10581.0,10399.0,10194.0,9992.0,9816.0,9646.0
25%,484404.8,613108.0,745158.8,751786.5,759621.5,776581.5,791849.8,803492.5,816155.0,824774.8,832361.5,839972.2
50%,4351596.0,5120452.0,6199723.0,6286093.0,6327446.0,6435730.0,6552634.0,6584503.0,6664462.0,6697552.0,6723398.0,6751671.0
75%,12965700.0,16607500.0,23552850.0,24006720.0,24476590.0,24924040.0,25463280.0,26014550.0,26095750.0,26250230.0,26594130.0,27161710.0
max,5299247000.0,6161885000.0,7441827000.0,7529067000.0,7614749000.0,7697492000.0,7778304000.0,7855075000.0,7920862000.0,7990400000.0,8064977000.0,8142056000.0


In [11]:
# if I just want to find the min and max values for a specific column:

print('min and max vals for 2023: ', df['2023'].min(), df['2023'].max())

min and max vals for 2023:  9816 8064976601
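For the overall min and max (across all years, not just 2023), one option is to take the min of the column-wise mins. A sketch on a toy subset of the data, assuming the year columns are exactly the ones whose names consist of digits; only two year columns are included here, so the result differs from the full data set:

```python
import pandas as pd

# three rows and two year columns taken from the data above
toy = pd.DataFrame({
    "Country Name": ["Tuvalu", "India", "World"],
    "1990": [8798, 864972221, 5299246757],
    "2023": [9816, 1438069596, 8064976601],
})

# assumption: year columns are the ones with purely numeric names
year_cols = [c for c in toy.columns if c.isdigit()]

overall_min = toy[year_cols].min().min()  # min per column, then min of those
overall_max = toy[year_cols].max().max()
print(year_cols, overall_min, overall_max)
```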


In [13]:
# one way to get the whole row for these values is to use loc

df.loc[df['2023'] == df['2023'].min()]

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1990,2000,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
201,"Population, total",SP.POP.TOTL,Tuvalu,TUV,8798,9544,10963,10930,10869,10751,10581,10399,10194,9992,9816,9646
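Another possible route to the same row is `idxmin` / `idxmax`, which return the index label of the smallest / largest value. A sketch on a toy subset (three rows copied from the data, default 0, 1, 2 index rather than the real row numbers):

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["Tuvalu", "India", "World"],
    "2023": [9816, 1438069596, 8064976601],
})

label_min = toy["2023"].idxmin()  # index label of the row with the smallest value
label_max = toy["2023"].idxmax()  # index label of the row with the largest value

print(toy.loc[label_min, "Country Name"])  # Tuvalu
print(toy.loc[label_max, "Country Name"])  # World
```

Note that `idxmin` returns only the first match, while the boolean filter with `loc` returns every row that ties for the minimum.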


In [14]:
# but this method is a bit more elegant and flexible

df_og.nlargest(2, '2023')

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1990,2000,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
217,"Population, total",SP.POP.TOTL,World,WLD,5299246757,6161884811,7441826877,7529066617,7614748582,7697492379,7778303912,7855075060,7920861888,7990399768,8064976601,8142056446
89,"Population, total",SP.POP.TOTL,India,IND,864972221,1057922733,1328024498,1343944296,1359657400,1374659064,1389030312,1402617695,1414203896,1425423212,1438069596,1450935791
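The largest row is the World aggregate, not a country, which explains the suspiciously high maximum in the summary. One way to look at countries only is to filter that row out first. A sketch on a toy subset, assuming `WLD` is the only aggregate code in the data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["Tuvalu", "India", "World"],
    "Country Code": ["TUV", "IND", "WLD"],
    "2023": [9816, 1438069596, 8064976601],
})

# assumption: 'WLD' is the only aggregate row
countries_only = toy[toy["Country Code"] != "WLD"]

largest_country = countries_only.nlargest(1, "2023")
print(largest_country["Country Name"].iloc[0])  # India
```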


In [15]:
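# with a list of columns, rows are ranked by the first column ('2024'); the second column ('2000') is only used to break ties
df_og.nsmallest(5, ['2024', '2000'])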

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1990,2000,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
201,"Population, total",SP.POP.TOTL,Tuvalu,TUV,8798,9544,10963,10930,10869,10751,10581,10399,10194,9992,9816,9646
137,"Population, total",SP.POP.TOTL,Nauru,NRU,9622,10168,10954,11150,11324,11477,11587,11643,11709,11801,11875,11947
150,"Population, total",SP.POP.TOTL,Palau,PLW,15259,19178,17770,17797,17812,17814,17798,17792,17783,17759,17727,17695
183,"Population, total",SP.POP.TOTL,St. Martin (French part),MAF,28224,29996,37369,37175,36837,36012,34267,31786,29961,28870,27515,26129
164,"Population, total",SP.POP.TOTL,San Marino,SMR,23475,26799,32897,33101,33825,34522,34663,34770,34252,33755,33860,33977
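In the population data there are no ties in 2024, so the second column makes no visible difference. A toy example with a deliberate tie (invented values, not from the data set) shows what the extra column does:

```python
import pandas as pd

# A and B tie in '2024'; '2000' decides between them
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "2024": [10, 10, 20],
    "2000": [5, 3, 1],
})

smallest_one = toy.nsmallest(1, ["2024", "2000"])
smallest_two = toy.nsmallest(2, ["2024", "2000"])

print(list(smallest_one["name"]))  # ['B']: tied with A on '2024', smaller on '2000'
print(list(smallest_two["name"]))  # ['B', 'A']
```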


In [16]:
# inspecting the country column: note that the number of unique values is 218, the same as the number of rows, so each country appears only once

print(df_og['Country Name'].nunique()) # same as len(df['Country Name'].unique())

print(df_og['Country Name'].unique())

df_og['Country Name'].value_counts() 

218
['Afghanistan' 'Albania' 'Algeria' 'American Samoa' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Aruba' 'Australia' 'Austria'
 'Azerbaijan' 'Bahamas, The' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'British Virgin Islands'
 'Brunei Darussalam' 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cabo Verde'
 'Cambodia' 'Cameroon' 'Canada' 'Cayman Islands'
 'Central African Republic' 'Chad' 'Channel Islands' 'Chile' 'China'
 'Colombia' 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Costa Rica'
 "Cote d'Ivoire" 'Croatia' 'Cuba' 'Curacao' 'Cyprus' 'Czechia' 'Denmark'
 'Djibouti' 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt, Arab Rep.'
 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Eswatini'
 'Ethiopia' 'Faroe Islands' 'Fiji' 'Finland' 'France' 'French Polynesia'
 'Gabon' 'Gambia, The' 'Georgia' 'Germany' 'Ghana' 'Gibraltar' 'Greece'
 'Greenland' 'Grenada' 'Guam' 'Guate

Country Name
Afghanistan           1
Albania               1
Algeria               1
American Samoa        1
Andorra               1
                     ..
West Bank and Gaza    1
Yemen, Rep.           1
Zambia                1
Zimbabwe              1
World                 1
Name: count, Length: 218, dtype: int64
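The same kind of check works for the "Country Code" column. A sketch on a toy subset (three rows copied from the data), assuming that a valid code is a unique string of three characters:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Name": ["Tuvalu", "India", "World"],
    "Country Code": ["TUV", "IND", "WLD"],
})

codes = toy["Country Code"]

print(codes.is_unique)                 # True if no code appears twice
print(codes.duplicated().sum())        # number of repeated codes
print(codes.str.len().value_counts())  # how many codes there are of each length

all_three_chars = codes.str.len().eq(3).all()
print(all_three_chars)
```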