## EDA on World Bank (WB) population data

Use this notebook to answer the questions on the board. For each cell:
- Can you explain what the code is doing?
- Is there anything you can change or adjust?
- Try writing the same or your own version in a new cell below.

### Import libraries and data

In [30]:
# import libraries
import pandas as pd
import os

Remember: relative paths are better for collaboration.


In [18]:
# importing with absolute path
# you won't be able to run this directly - you can get your equivalent by right-clicking on the csv in your explorer

df_direct = pd.read_csv('/Users/margheritaphilipp/Documents/margherita/GitHub/brushup_2025/data/WB_pop_clean.csv')

In [None]:
# importing with relative path:
# if this notebook is saved in a folder whose parent folder has a data subfolder with the same csv inside, you can run this directly

# get current working directory
cwd = os.getcwd()
print(cwd)
parent_path = os.path.dirname(cwd)

df_og = pd.read_csv(parent_path + '/data/WB_pop_clean.csv')
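The same relative path can also be built with `pathlib`, which handles the path separators for you. This is only a sketch: the folder layout (a `data` folder next to the notebook's folder) is an assumption taken from the cell above, and `csv_path` is a name I made up.

```python
from pathlib import Path

# assumed layout: <project>/<notebook folder>/ and <project>/data/WB_pop_clean.csv
cwd = Path.cwd()
csv_path = cwd.parent / "data" / "WB_pop_clean.csv"

print(csv_path)
# check that the file is really there before trying to read it
print(csv_path.exists())
```

If the file exists, `pd.read_csv(csv_path)` accepts the `Path` object directly.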

In [20]:
# good practice to make a copy before manipulating - so you can quickly revert to the original without importing again
# NB in this notebook we are not yet making changes to the df, so we don't need the copy here
df = df_og.copy()

### Start inspection

Addresses the following questions from class:
- Display the head, check for missing values
- Find the min and max values - overall and just for 2023
- Which countries do they belong to?
- Inspect the values in the “Country Code” column

In [None]:
# show the dimensions (rows and columns) of the data set and display first few rows
print(df.shape)

# other options:
# df.tail(2)
# df.sample(4)

df_og.head() # default is 5

In [None]:
# sometimes not all columns are visible so it can be useful to get the full list
df_og.columns

Note that while .head() is a ***method*** that I call on the data frame (with parentheses), .shape and .columns are ***attributes*** of the data frame object that I access without parentheses.
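A quick way to see the difference between a method and an attribute, using a tiny made-up data frame (`toy` and its values are invented for illustration, not taken from the WB file):

```python
import pandas as pd

# made-up toy data, not the real WB file
toy = pd.DataFrame({"Country Name": ["A", "B", "C"], "2023": [10, 20, 30]})

print(toy.shape)    # attribute: no parentheses -> (3, 2)
print(toy.head(2))  # method: parentheses, can take arguments

# a method is callable, an attribute like shape is just a stored value
print(callable(toy.head))   # True
print(callable(toy.shape))  # False
```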


In [None]:
# the info method also tells us which columns are present and what data type they contain
# we know from the shape attribute that there are 218 rows and it seems that all rows contain data (are non-null), i.e. we don't have missing values

df_og.info()
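Missing values can also be counted explicitly instead of reading them off the info output. A minimal sketch on made-up data (the country names and numbers are invented so that there is one gap to find):

```python
import pandas as pd

# made-up toy data with one missing value
toy = pd.DataFrame({
    "Country Name": ["Freedonia", "Borduria", "Syldavia"],
    "2022": [1.0, float("nan"), 3.0],
    "2023": [1.5, 2.5, 3.5],
})

# number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# total number of missing values in the whole data frame
total_missing = int(missing_per_column.sum())
print(total_missing)

# rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```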

In [None]:
# statistical summary of the numerical columns - we can already see a suspiciously high maximum value...

df_og.describe()

In [None]:
# if I just want to find the min and max values for a specific column:

print('min and max vals for 2023: ', df['2023'].min(), df['2023'].max())
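For the overall min and max (across all years rather than just 2023), one option is to select the year columns first. This sketch uses made-up data and assumes that the year columns are exactly the ones whose names consist only of digits, like '2023'; `year_cols` is a helper name I introduced.

```python
import pandas as pd

# made-up toy data, not the real WB file
toy = pd.DataFrame({
    "Country Name": ["Freedonia", "Borduria", "Syldavia"],
    "2022": [100, 2000, 30],
    "2023": [110, 2100, 25],
})

# assumption: year columns are the ones whose name is all digits
year_cols = [c for c in toy.columns if c.isdigit()]

# first min/max gives one value per column, the second reduces that to a single number
overall_min = toy[year_cols].min().min()
overall_max = toy[year_cols].max().max()
print(overall_min, overall_max)
```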

In [None]:
# one way to get the whole row for these values is to use loc

df_og.loc[df['2023'] == df['2023'].min()]
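Another option is idxmin / idxmax, which return the index label of the smallest / largest value, so loc can fetch that row. A sketch on made-up data; note that if several rows tie, idxmin and idxmax only return the first one, whereas the boolean comparison above returns all of them.

```python
import pandas as pd

# made-up toy data, not the real WB file
toy = pd.DataFrame({
    "Country Name": ["Freedonia", "Borduria", "Syldavia"],
    "2023": [110, 2100, 25],
})

# index label of the smallest / largest 2023 value, then the full row
smallest_row = toy.loc[toy["2023"].idxmin()]
largest_row = toy.loc[toy["2023"].idxmax()]

print(smallest_row["Country Name"], largest_row["Country Name"])
```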

In [None]:
# but this method is a bit more elegant and flexible

df_og.nlargest(2, '2023')

In [None]:
# with a list of columns, rows are ordered by the first column and ties are broken by the next one
df_og.nsmallest(5, ['2024', '2000'])

In [None]:
# inspecting the country column: note that the number of unique values is 218, same as the number of rows, so each country only appears once

print(df_og['Country Name'].nunique()) # same as len(df['Country Name'].unique())

print(df_og['Country Name'].unique())

df_og['Country Name'].value_counts() 
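The same kind of check can be applied to the "Country Code" column from the class questions. This is only a sketch on made-up data: the codes below are invented, and treating three letters as the expected length is my assumption, so adjust it to what you actually see in the file.

```python
import pandas as pd

# made-up toy data; the last code is deliberately too short
toy = pd.DataFrame({
    "Country Name": ["Freedonia", "Borduria", "Syldavia"],
    "Country Code": ["FRE", "BOR", "SY"],
})

codes = toy["Country Code"]

# True if no code appears more than once
print(codes.is_unique)

# how many codes there are of each length
print(codes.str.len().value_counts())

# assumption: codes should have 3 characters, so show the rows that do not
odd_codes = toy[codes.str.len() != 3]
print(odd_codes)
```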