# Data Preparation

## Introduction

**Context & Motivation**

Vaccination is a cornerstone of public health, contributing significantly to the prevention of disease and the reduction of child mortality worldwide. Despite its proven effectiveness, global access to and trust in vaccines remain uneven, shaped by political, economic, and cultural factors. This project explores these dynamics from a data-driven perspective.

**Objectives**
1. Read data from various sources (REST Countries, UNICEF, World Bank, OECD, World Values Survey, V-Dem) using REST APIs and CSV/Excel files
2. Handle missing data (or exclude a dataset if needed)
3. Ensure consistency (country names, data formats, etc.)
4. Merge the data into a final dataset

The final dataset will include the following variables:
1. **iso3** - ISO alpha-3 standardized code - REST countries
2. **country** - common name - REST countries
3. **region** - region name - REST countries
4. **subregion** - sub-region name - REST countries
5. **year** - applicable year
6. **vac_index** - vaccination index, calculated as the average % of the population vaccinated with globally recommended vaccines (BCG, DTP3, POL3, IPV2, MCV2, RCV1, HEPB3, HEPBB, HIB3, PCV3, ROTAC) - UNICEF
7. **gdp** - GDP per capita (current US$) - World Bank
8. **health_exp** - health expenditure (% of GDP) - World Bank
9. **child_mort** - under-5 mortality rate (per 1,000 live births) - World Bank
10. **internet_use** - % of population using the internet - World Bank
11. **gov_trust** - % of population trusting the government (yes/no) - OECD
12. **polarization** - score based on expert assessments of societal divisions - V-Dem

The separate datasets will be used in a relational Power BI data model, whereas the merged DataFrame will be used for exploratory analysis and predictions.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import wbgapi as wb
import requests

## Country names

In [2]:
# Fetch data from REST Countries API
url = "https://restcountries.com/v3.1/all?fields=cca3,name,region,subregion,area,population"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the API returns an HTTP error
data = response.json()

In [3]:
# Extract relevant fields
country_list = [
    {'iso3': country['cca3'],
     'country': country['name']['common'],
     'region': country['region'],
     'subregion': country['subregion']}
    for country in data]

# Convert to pandas DataFrame
geo_df = pd.DataFrame(country_list)
geo_df.head(5)

Unnamed: 0,iso3,country,region,subregion
0,TGO,Togo,Africa,Western Africa
1,MYT,Mayotte,Africa,Eastern Africa
2,GEO,Georgia,Asia,Western Asia
3,VUT,Vanuatu,Oceania,Melanesia
4,KGZ,Kyrgyzstan,Asia,Central Asia
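
Since `iso3` is the key used to join every later dataset, it can help to check each source against this reference table before merging. The following is a minimal sketch on made-up data; the helper `unmatched_codes` and both toy tables are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical reference table, shaped like geo_df above
geo_ref = pd.DataFrame({
    "iso3": ["TGO", "GEO", "VUT"],
    "country": ["Togo", "Georgia", "Vanuatu"],
})

# Hypothetical source table with one code that is absent from the reference
source = pd.DataFrame({"iso3": ["TGO", "GEO", "XKX"], "value": [1.0, 2.0, 3.0]})


def unmatched_codes(df, reference, key="iso3"):
    """Return the sorted codes in df[key] that do not appear in reference[key]."""
    return sorted(set(df[key]) - set(reference[key]))


missing = unmatched_codes(source, geo_ref)
print(missing)
```

Any code reported here would end up with empty `country`, `region` and `subregion` values after a left merge on `iso3`.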


## Vaccinations - Target Variable

The vaccinations Excel file contains UNICEF data on the percentage of the population vaccinated in a given year, with each vaccine on a separate sheet. Because some vaccines are recommended only in a few regions, they will be excluded from the analysis for consistency.

### All Vaccines DataFrame

In [4]:
#reading the file to get sheet names
vaccines = pd.ExcelFile('01_Vaccinations.xlsx')

#listing sheet names
vaccines.sheet_names

['ReadMe',
 'BCG',
 'DTP1',
 'DTP3',
 'HEPB3',
 'HEPBB',
 'HIB3',
 'IPV1',
 'IPV2',
 'MCV1',
 'MCV2',
 'MENGA',
 'PCV3',
 'POL3',
 'RCV1',
 'ROTAC',
 'YFV',
 'regional_global']

In [5]:
# Read in each sheet for vaccines to be included in the analysis
sheet_names = ['BCG' , 'DTP3', 'POL3', 'IPV2', 'MCV2', 'RCV1', 'HEPB3', 'HEPBB', 'HIB3', 'PCV3', 'ROTAC']
sheets = [vaccines.parse(name) for name in sheet_names] 

# Combine them into one dataframe
vac_df = pd.concat(sheets, ignore_index=True)
vac_df.head()

Unnamed: 0,unicef_region,iso3,country,vaccine,2023,2022,2021,2020,2019,2018,...,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000
0,ROSA,AFG,Afghanistan,BCG,68.0,69.0,65.0,72.0,74.0,82.0,...,64.0,66.0,60.0,60.0,57.0,51.0,44.0,46.0,43.0,30.0
1,ECAR,ALB,Albania,BCG,99.0,99.0,99.0,98.0,99.0,99.0,...,97.0,99.0,98.0,97.0,98.0,97.0,95.0,94.0,93.0,93.0
2,MENA,DZA,Algeria,BCG,99.0,98.0,98.0,99.0,99.0,99.0,...,99.0,99.0,99.0,99.0,98.0,98.0,98.0,98.0,97.0,97.0
3,ESAR,AGO,Angola,BCG,73.0,60.0,56.0,58.0,69.0,72.0,...,70.0,73.0,75.0,54.0,51.0,63.0,54.0,76.0,70.0,53.0
4,LACR,ARG,Argentina,BCG,69.0,81.0,80.0,75.0,85.0,93.0,...,99.0,99.0,99.0,99.0,99.0,99.0,99.0,95.0,92.0,95.0


In [6]:
#keeping iso3, vaccine and the years 2012-2022, reordered to ascending year
vac_df = vac_df.iloc[:, [1, 3, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5]]

In [7]:
#getting number of rows and columns
vac_df.shape

(1782, 13)

In [8]:
#listing datatypes
vac_df.dtypes

iso3        object
vaccine     object
2012       float64
2013       float64
2014       float64
2015       float64
2016       float64
2017       float64
2018       float64
2019       float64
2020       float64
2021       float64
2022       float64
dtype: object

### Vaccination Index

Because the per-vaccine data may still be useful in the analysis, a separate DataFrame is created for the vaccination index: the mean coverage across the included vaccines for each country and year.

In [9]:
#removing vaccine type column
v_index_df = vac_df.drop(['vaccine'], axis=1)

#calculating vaccination index 
v_index_df = v_index_df.groupby(['iso3']).mean().round(2)
v_index_df.reset_index(inplace=True)
v_index_df.head()

Unnamed: 0,iso3,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,AFG,62.5,60.83,52.75,56.25,57.12,57.75,60.44,59.33,58.44,51.8,56.4
1,AGO,56.8,48.33,50.71,51.75,48.75,50.25,55.78,54.56,47.89,40.0,36.5
2,ALB,98.33,99.0,98.33,98.67,98.0,98.56,98.0,97.89,96.4,95.82,95.27
3,AND,96.29,94.86,94.88,94.38,95.12,98.0,97.57,97.57,97.29,97.71,96.71
4,ARE,95.78,97.33,93.0,97.2,98.2,96.3,97.7,96.0,88.1,95.91,94.73
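
To make the index calculation concrete, here is a small sketch on invented coverage values (the countries `AAA` and `BBB` and all numbers are hypothetical). It mirrors the drop, group and mean steps above and shows that `mean()` skips missing values, so a country's index is averaged over the vaccines that actually have data for that year.

```python
import numpy as np
import pandas as pd

# Hypothetical coverage values (%) for two countries and three vaccines
toy = pd.DataFrame({
    "iso3": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "vaccine": ["BCG", "DTP3", "MCV2"] * 2,
    2021: [90.0, 80.0, np.nan, 60.0, 70.0, 80.0],
    2022: [92.0, 84.0, 70.0, 62.0, 72.0, 82.0],
})

# Same steps as the notebook cell: drop the vaccine label, average per country
toy_index = (
    toy.drop(columns="vaccine")
       .groupby("iso3")
       .mean()          # NaN values are ignored by default
       .round(2)
       .reset_index()
)
print(toy_index)
```

For `AAA` in 2021 the index is the mean of the two available values only, which is worth keeping in mind when comparing countries with different vaccine schedules.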


In [10]:
#checking number of countries in index df
v_index_df.shape

(195, 12)

In [11]:
#checking missing values
v_index_df.isna().sum()

iso3    0
2012    0
2013    0
2014    0
2015    0
2016    0
2017    0
2018    0
2019    0
2020    0
2021    0
2022    0
dtype: int64

## Socio-Economic Variables

The data in this section comes from the World Bank; all indicators share the same structure.

In [12]:
# List of World Bank indicators
indicators = {
    'NY.GDP.PCAP.CD': 'gdp',
    'SI.POV.GINI': 'gini',
    'SH.XPD.CHEX.GD.ZS': 'health_exp',
    'SH.DYN.MORT': 'child_mort',
    'IT.NET.USER.ZS': 'internet_use'}

In [13]:
# Importing data using wbgapi package for World Bank API access
wb_df = wb.data.DataFrame(indicators.keys(), time=range(2012, 2023), labels=True)

wb_df = wb_df.reset_index()
wb_df.head(5)

Unnamed: 0,economy,series,Country,Series,YR2012,YR2013,YR2014,YR2015,YR2016,YR2017,YR2018,YR2019,YR2020,YR2021,YR2022
0,ZWE,NY.GDP.PCAP.CD,Zimbabwe,GDP per capita (current US$),1238.60109,1362.300668,1372.212781,1386.422847,1407.415364,3448.082537,2271.853335,1684.027904,1730.413489,1724.387731,2040.552459
1,ZMB,NY.GDP.PCAP.CD,Zambia,GDP per capita (current US$),1710.050613,1820.718548,1707.485731,1295.877887,1239.085279,1483.465773,1463.899979,1258.986198,951.644317,1127.160779,1447.123101
2,YEM,NY.GDP.PCAP.CD,"Yemen, Rep.",GDP per capita (current US$),1245.050683,1378.75003,1430.16421,1362.173794,975.359407,811.165964,633.887206,623.376165,559.564673,522.173513,615.702079
3,PSE,NY.GDP.PCAP.CD,West Bank and Gaza,GDP per capita (current US$),3067.438727,3315.297539,3352.112595,3272.154324,3527.613824,3620.360487,3562.330943,3656.858271,3233.568638,3678.635657,3799.95527
4,VIR,NY.GDP.PCAP.CD,Virgin Islands (U.S.),GDP per capita (current US$),37795.319259,34597.976694,33045.36438,34007.352941,35324.974887,35365.069304,36663.208755,38633.529892,39787.374165,42571.077737,44320.909186


In [14]:
#choosing applicable columns and set names
wb_df = wb_df.drop(['series', 'Country'], axis=1)
wb_df = wb_df.set_axis(['iso3', 'indicator', 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022], axis=1)

In [15]:
# Keep rows with at most 4 missing values (drop rows with more than 4)
wb_df = wb_df.dropna(thresh=wb_df.shape[1] - 4)

# Interpolate missing values
wb_df.iloc[:, 2:] = wb_df.iloc[:, 2:].interpolate(method='linear', axis=1)

# Drop any remaining NaNs
wb_df = wb_df.dropna()

# Round to 2 decimal places
wb_df = wb_df.round(2)

wb_df.head(5)

Unnamed: 0,iso3,indicator,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,ZWE,GDP per capita (current US$),1238.6,1362.3,1372.21,1386.42,1407.42,3448.08,2271.85,1684.03,1730.41,1724.39,2040.55
1,ZMB,GDP per capita (current US$),1710.05,1820.72,1707.49,1295.88,1239.09,1483.47,1463.9,1258.99,951.64,1127.16,1447.12
2,YEM,GDP per capita (current US$),1245.05,1378.75,1430.16,1362.17,975.36,811.17,633.89,623.38,559.56,522.17,615.7
3,PSE,GDP per capita (current US$),3067.44,3315.3,3352.11,3272.15,3527.61,3620.36,3562.33,3656.86,3233.57,3678.64,3799.96
4,VIR,GDP per capita (current US$),37795.32,34597.98,33045.36,34007.35,35324.97,35365.07,36663.21,38633.53,39787.37,42571.08,44320.91
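
The cleaning steps in the cell above can be traced on a toy table. This is only a sketch with invented values and a hypothetical `max_missing` of 2 instead of 4; it illustrates that `thresh` counts required non-missing cells, and that row-wise linear interpolation with default settings does not fill a gap at the start of a series.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per country and indicator
toy = pd.DataFrame({
    "iso3": ["AAA", "BBB", "CCC"],
    "indicator": ["gdp", "gdp", "gdp"],
    2012: [1.0, np.nan, np.nan],
    2013: [np.nan, np.nan, 2.0],
    2014: [3.0, np.nan, np.nan],
    2015: [4.0, 8.0, 4.0],
})
year_cols = [2012, 2013, 2014, 2015]

# thresh is the minimum number of non-missing cells a row needs to be kept
max_missing = 2  # assumption for this toy example
kept = toy.dropna(thresh=toy.shape[1] - max_missing).copy()

# Linear interpolation along each row (across years)
kept[year_cols] = kept[year_cols].interpolate(method="linear", axis=1)

# Rows that still contain a gap (here: a missing first year) are removed
filled = kept.dropna()
print(filled)
```

`BBB` is dropped by the threshold, `CCC` survives the threshold but loses out in the final `dropna()` because its 2012 value cannot be interpolated from earlier years.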


### Government Trust

The data in this section comes from the OECD trust survey.

In [16]:
#reading data
trust = pd.read_csv('06_gov_trust_oecd.csv')
trust.head()

Unnamed: 0,STRUCTURE,STRUCTURE_ID,STRUCTURE_NAME,ACTION,REF_AREA,Reference area,MEASURE,Measure,UNIT_MEASURE,Unit of measure,...,OBS_VALUE,Observation value,OBS_STATUS,Observation status,UNIT_MULT,Unit multiplier,DECIMALS,Decimals,BASE_PER,Base period
0,DATAFLOW,OECD.WISE.WDP:DSD_HSL@DF_HSL_FWB(1.1),Future well-being,I,CZE,Czechia,14_3,Trust in government,PT_POP_Y_GE15,Percentage of population aged 15 years or over,...,44.579404,,A,Normal value,0,Units,2,Two,,
1,DATAFLOW,OECD.WISE.WDP:DSD_HSL@DF_HSL_FWB(1.1),Future well-being,I,CZE,Czechia,14_3,Trust in government,PT_POP_Y_GE15,Percentage of population aged 15 years or over,...,44.579404,,A,Normal value,0,Units,2,Two,,
2,DATAFLOW,OECD.WISE.WDP:DSD_HSL@DF_HSL_FWB(1.1),Future well-being,I,CZE,Czechia,14_3,Trust in government,PT_POP_Y_GE15,Percentage of population aged 15 years or over,...,44.579404,,A,Normal value,0,Units,2,Two,,
3,DATAFLOW,OECD.WISE.WDP:DSD_HSL@DF_HSL_FWB(1.1),Future well-being,I,CZE,Czechia,14_3,Trust in government,PT_POP_Y_GE15,Percentage of population aged 15 years or over,...,22.121649,,A,Normal value,0,Units,2,Two,,
4,DATAFLOW,OECD.WISE.WDP:DSD_HSL@DF_HSL_FWB(1.1),Future well-being,I,CZE,Czechia,14_3,Trust in government,PT_POP_Y_GE15,Percentage of population aged 15 years or over,...,22.121649,,A,Normal value,0,Units,2,Two,,


In [17]:
#listing columns in a dataframe
trust.columns

Index(['STRUCTURE', 'STRUCTURE_ID', 'STRUCTURE_NAME', 'ACTION', 'REF_AREA',
       'Reference area', 'MEASURE', 'Measure', 'UNIT_MEASURE',
       'Unit of measure', 'AGE', 'Age', 'SEX', 'Sex', 'EDUCATION_LEV',
       'Education level', 'DOMAIN', 'Domain', 'TIME_PERIOD', 'Time period',
       'OBS_VALUE', 'Observation value', 'OBS_STATUS', 'Observation status',
       'UNIT_MULT', 'Unit multiplier', 'DECIMALS', 'Decimals', 'BASE_PER',
       'Base period'],
      dtype='object')

In [18]:
#filtering education level for total
trust = trust[trust['Education level'] == 'Total']

In [19]:
#choosing applicable columns
trust = trust[['REF_AREA','TIME_PERIOD', 'OBS_VALUE','BASE_PER','Base period']]
trust.head()

Unnamed: 0,REF_AREA,TIME_PERIOD,OBS_VALUE,BASE_PER,Base period
0,CZE,2016,44.579404,,
1,CZE,2015,44.579404,,
2,CZE,2014,44.579404,,
3,CZE,2013,22.121649,,
4,CZE,2012,22.121649,,


In [20]:
#checking shape
trust.shape

(449, 5)

In [21]:
#dropping columns that contain missing values and checking the resulting shape
trust = trust.dropna(axis=1)
trust.shape

(449, 3)

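The two base-period columns were dropped because they contain NaNs; the remaining three columns have no missing values, so all 449 rows are kept.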
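
The difference between dropping columns and dropping rows matters here, so a short sketch on a made-up three-column table may help. The values are illustrative only; the column names are borrowed from the OECD file.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the trust table with one empty column
toy = pd.DataFrame({
    "REF_AREA": ["CZE", "CZE"],
    "OBS_VALUE": [44.58, 22.12],
    "BASE_PER": [np.nan, np.nan],
})

cols_dropped = toy.dropna(axis=1)  # removes BASE_PER, keeps both rows
rows_dropped = toy.dropna(axis=0)  # removes every row, since each row has a NaN

print(cols_dropped.shape, rows_dropped.shape)
```

With `axis=0` the whole table would have been emptied, which is why the column-wise variant is the appropriate one for this file.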

In [22]:
#renaming columns
trust_df = trust.set_axis(['iso3', 'year', 'gov_trust'], axis=1)
trust_df.head()

Unnamed: 0,iso3,year,gov_trust
0,CZE,2016,44.579404
1,CZE,2015,44.579404
2,CZE,2014,44.579404
3,CZE,2013,22.121649
4,CZE,2012,22.121649


In [23]:
#rounding values to 2 decimals and checking datatypes
trust_df['gov_trust'] = trust_df['gov_trust'].round(2)
trust_df.dtypes

iso3          object
year           int64
gov_trust    float64
dtype: object

In [24]:
#check the number of countries
trust_df[['iso3']].nunique()

iso3    38
dtype: int64

Data only for OECD countries is available.

### Polarization

The data in this section comes from the V-Dem (Varieties of Democracy) project, which includes many variables measuring democracy. For the immunization project, only the polarization data will be extracted.

In [25]:
pol = pd.read_csv('07_v_dem.csv')
pol.head(5)

Unnamed: 0,country_name,country_text_id,country_id,year,historical_date,project,historical,histname,codingstart,codingend,...,v2xex_elecleg,v2xps_party,v2xps_party_codelow,v2xps_party_codehigh,v2x_divparctrl,v2x_feduni,v2xca_academ,v2xca_academ_codelow,v2xca_academ_codehigh,v2xca_academ_sd
0,Mexico,MEX,3,1789,1789-12-31,1,1,Viceroyalty of New Spain,1789,2024,...,0.0,,,,,0.0,,,,
1,Mexico,MEX,3,1790,1790-12-31,1,1,Viceroyalty of New Spain,1789,2024,...,0.0,,,,,0.0,,,,
2,Mexico,MEX,3,1791,1791-12-31,1,1,Viceroyalty of New Spain,1789,2024,...,0.0,,,,,0.0,,,,
3,Mexico,MEX,3,1792,1792-12-31,1,1,Viceroyalty of New Spain,1789,2024,...,0.0,,,,,0.0,,,,
4,Mexico,MEX,3,1793,1793-12-31,1,1,Viceroyalty of New Spain,1789,2024,...,0.0,,,,,0.0,,,,


In [26]:
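# Selecting relevant columns
# (the V-Dem codebook labels v2pepwrsoc "Power distributed by social group";
#  its dedicated polarization variables are v2cacamps and v2smpolsoc)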
pol_df = pol[['country_text_id', 'year', 'v2pepwrsoc']]

# Filter data for years 2012 to 2022
pol_df = pol_df[(pol_df['year'] >= 2012) & (pol_df['year'] <= 2022)]

#changing column names
pol_df = pol_df.set_axis(['iso3', 'year', 'polarization'], axis=1)

#rounding values to 2 decimals 
pol_df['polarization'] = pol_df['polarization'].round(2)

In [27]:
#checking number of countries
pol_df[['iso3']].nunique()

iso3    179
dtype: int64

In [28]:
#checking data types
pol_df.dtypes

iso3             object
year              int64
polarization    float64
dtype: object

In [29]:
#checking shape
pol_df.shape

(1969, 3)

In [30]:
#dropping empty rows and checking if any were dropped
pol_df = pol_df.dropna()
pol_df.shape

(1969, 3)

In [31]:
pol_df.head()

Unnamed: 0,iso3,year,polarization
223,MEX,2012,1.0
224,MEX,2013,1.0
225,MEX,2014,1.0
226,MEX,2015,1.0
227,MEX,2016,1.0


In [32]:
#checking values of polarization as first 5 were '1.0' only
pol_df['polarization'].value_counts()

polarization
 1.43    29
 1.58    24
 1.05    24
 2.03    23
 0.74    23
         ..
 1.57     1
 1.93     1
-1.54     1
-0.83     1
-1.15     1
Name: count, Length: 366, dtype: int64

### Trust - World Values Survey

The World Values Survey is a global research project that explores people's values and beliefs, how they change over time, and what social and political impact they have. For this project, data on trust (in people, churches, the press, the government, universities, major companies, and the WHO) will be extracted.

The file contains individual-level responses, so average country scores will be calculated during data clean-up.

In [33]:
#reading the data
wvs = pd.read_csv('06_WVS7.csv', low_memory=False)
wvs.head()

Unnamed: 0,version,doi,A_WAVE,A_YEAR,A_STUDY,B_COUNTRY,B_COUNTRY_ALPHA,C_COW_NUM,C_COW_ALPHA,D_INTERVIEW,...,WVS_Polmistrust_PartyVoter,WVS_LR_MedianVoter,WVS_LibCon_MedianVoter,v2psbars,v2psorgs,v2psprbrch,v2psprlnks,v2psplats,v2xnp_client,v2xps_party
0,6-0-0 (2024-04-30),doi.org/10.14281/18241.24,7,2018,2,20,AND,232,AND,20070001,...,62.434211,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
1,6-0-0 (2024-04-30),doi.org/10.14281/18241.24,7,2018,2,20,AND,232,AND,20070002,...,62.434211,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
2,6-0-0 (2024-04-30),doi.org/10.14281/18241.24,7,2018,2,20,AND,232,AND,20070003,...,62.434211,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
3,6-0-0 (2024-04-30),doi.org/10.14281/18241.24,7,2018,2,20,AND,232,AND,20070004,...,,,,,,,,,,
4,6-0-0 (2024-04-30),doi.org/10.14281/18241.24,7,2018,2,20,AND,232,AND,20070005,...,66.964286,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0


In [34]:
#choosing applicable columns
wvs_df = wvs[['B_COUNTRY_ALPHA', 'A_YEAR', 'Q57', 'Q64', 'Q66', 'Q71', 'Q75', 'Q77', 'Q88']]

#renaming columns based on the codebook
wvs_df = wvs_df.set_axis(['iso3', 'year', 'people_trust', 'church_trust', 'press_trust', 'gov_trust', 'uni_trust', 'comp_trust', 'who_trust'], axis=1)

In [35]:
wvs_df.dtypes

iso3            object
year             int64
people_trust     int64
church_trust     int64
press_trust      int64
gov_trust        int64
uni_trust        int64
comp_trust       int64
who_trust        int64
dtype: object

In [36]:
#replacing WVS missing-value codes (negative numbers) with NaN
wvs_df.replace([-1, -2, -4, -5], np.nan, inplace=True)

#calculating missing values
wvs_df.isna().sum()

iso3                0
year                0
people_trust     1337
church_trust     2036
press_trust      2210
gov_trust        3137
uni_trust        3925
comp_trust       5367
who_trust       15849
dtype: int64

In [37]:
wvs_df.shape

(97220, 9)

In [38]:
#calculating mean values for the variables
wvs_df = wvs_df.groupby(['iso3', 'year']).mean().round(2).reset_index()
wvs_df.head()

Unnamed: 0,iso3,year,people_trust,church_trust,press_trust,gov_trust,uni_trust,comp_trust,who_trust
0,AND,2018,1.74,3.0,2.76,2.56,2.09,2.6,2.25
1,ARG,2017,1.79,2.43,2.91,2.94,2.07,2.91,2.41
2,ARM,2021,1.92,1.91,3.39,2.9,2.31,2.78,2.64
3,AUS,2018,1.46,2.82,3.02,2.82,2.14,2.74,2.14
4,BGD,2018,1.87,1.07,2.11,1.89,1.64,2.21,2.05
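
The reason for recoding the negative codes before averaging can be seen in a tiny sketch. The responses below are invented; only the missing-value codes and the group-and-mean steps are taken from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical individual responses; -1 stands for a missing-answer code
toy = pd.DataFrame({
    "iso3": ["AND", "AND", "AND", "ARG"],
    "year": [2018, 2018, 2018, 2017],
    "gov_trust": [2, 4, -1, 3],
})

# Averaging the raw codes lets the -1 pull the country score down
naive = toy.groupby(["iso3", "year"])["gov_trust"].mean().round(2)

# Recoding to NaN first means the missing answer is simply ignored
cleaned = toy.replace([-1, -2, -4, -5], np.nan)
country_means = cleaned.groupby(["iso3", "year"]).mean().round(2).reset_index()

print(naive)
print(country_means)
```

In this toy case the uncleaned mean for `AND` is 1.67, while the cleaned mean is 3.0.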


WVS data were collected in different years for different countries. As COVID-19 could have affected trust levels, I'm calculating the number of countries surveyed before 2020 and from 2020 onwards:

In [39]:
pre_cov_wvs = wvs_df[wvs_df['year'] < 2020]
len(pre_cov_wvs)

38

In [40]:
post_cov_wvs = wvs_df[wvs_df['year'] > 2019]
len(post_cov_wvs)

28

*Post-exploratory analysis note: there is no direct link between trust and COVID-19. The relationship might be moderated by a different variable; in any case, the survey variables will be included in the trends analysis, as the insights might be valuable (trends over time cannot be analysed because data for different years is not available).*

## Data Merge

In [41]:
# Changing wide tables to long tables
v_index_long = v_index_df.melt(id_vars=['iso3'], var_name='year', value_name='vac_index')
wb_long = wb_df.melt(id_vars=['iso3', 'indicator'], var_name='year', value_name='value')\
        .pivot(index=['iso3', 'year'],columns='indicator',values='value')\
        .reset_index()

#changing year values to integer
v_index_long['year'] = v_index_long['year'].astype(int)

# Left-merging on the iso3 and year keys, keeping only rows present in the target v_index table
from functools import reduce
dfs_to_merge = [v_index_long, wb_long, trust_df, pol_df]
df = reduce(lambda left, right: pd.merge(left, right, on=['iso3', 'year'], how='left'), dfs_to_merge)

#merging with geo data on iso3
df = df.merge(geo_df, on='iso3', how='left')

#reordering columns
df = df[['iso3', 'country', 'region', 'subregion'] + [col for col in df.columns if col not in ['iso3', 'country', 'region', 'subregion']]]

df.columns

Index(['iso3', 'country', 'region', 'subregion', 'year', 'vac_index',
       'Current health expenditure (% of GDP)', 'GDP per capita (current US$)',
       'Gini index', 'Individuals using the Internet (% of population)',
       'Mortality rate, under-5 (per 1,000 live births)', 'gov_trust',
       'polarization'],
      dtype='object')
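
The reshaping and merging pattern used in this cell is easier to follow on a few rows. The sketch below uses invented tables (`index_wide`, `wb_wide`, `trust_long` are hypothetical names) and assumes the same layout as the notebook: wide year columns, one row per indicator in the World Bank table, and `iso3` plus `year` as join keys.

```python
from functools import reduce

import pandas as pd

# Hypothetical wide inputs
index_wide = pd.DataFrame({"iso3": ["AAA", "BBB"], 2021: [90.0, 60.0], 2022: [92.0, 62.0]})
wb_wide = pd.DataFrame({
    "iso3": ["AAA", "AAA", "BBB"],
    "indicator": ["gdp", "health_exp", "gdp"],
    2021: [1000.0, 5.0, 300.0],
    2022: [1100.0, 5.5, 320.0],
})
trust_long = pd.DataFrame({"iso3": ["AAA"], "year": [2022], "gov_trust": [44.58]})

# Wide to long: one row per country and year
index_long = index_wide.melt(id_vars="iso3", var_name="year", value_name="vac_index")
index_long["year"] = index_long["year"].astype(int)

# Wide to long, then spread indicators into their own columns
wb_long = (
    wb_wide.melt(id_vars=["iso3", "indicator"], var_name="year", value_name="value")
           .pivot(index=["iso3", "year"], columns="indicator", values="value")
           .reset_index()
)
wb_long["year"] = wb_long["year"].astype(int)
wb_long.columns.name = None

# Chain of left merges: rows of the first table are always preserved
merged = reduce(
    lambda left, right: left.merge(right, on=["iso3", "year"], how="left"),
    [index_long, wb_long, trust_long],
)
print(merged)
```

Because every merge is a left join, the result keeps one row per country and year of the target table, and any value that a source lacks appears as NaN.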

In [42]:
#changing names
df = df.set_axis(['iso3', 'country', 'region', 'subregion', 'year', 'vac_index',
       'health_exp', 'gdp', 'gini', 'internet_use', 'child_mort', 
       'gov_trust','polarization'], axis=1)
df.head()

Unnamed: 0,iso3,country,region,subregion,year,vac_index,health_exp,gdp,gini,internet_use,child_mort,gov_trust,polarization
0,AFG,Afghanistan,Asia,Southern Asia,2012,62.5,7.9,651.42,,5.45,81.2,,1.02
1,AGO,Angola,Africa,Middle Africa,2012,56.8,2.4,5086.03,,7.7,104.8,,-1.14
2,ALB,Albania,Europe,Southeast Europe,2012,98.33,6.16,4247.63,29.0,49.4,11.2,,1.4
3,AND,Andorra,Europe,Southern Europe,2012,96.29,6.71,41500.54,,82.7,4.1,,
4,ARE,United Arab Emirates,Asia,Western Asia,2012,95.78,3.34,52034.48,,85.0,8.4,,-0.61


Checking which countries have missing values for each variable (except gov_trust, which is a supplementary variable available for OECD countries only):

In [43]:
# List of columns to check
cols_to_check = ['gdp', 'gini', 'health_exp', 'child_mort', 'internet_use', 'polarization']

# Dictionary to store countries with missing values per column
missing_by_column = {}

for col in cols_to_check:
    missing_countries = df[df[col].isna()]['country'].unique()
    missing_by_column[col] = missing_countries

# Print the results
for col, countries in missing_by_column.items():
    print(f"\nMissing in '{col}': ({len(countries)} countries)")
    print(sorted(countries))


Missing in 'gdp': (6 countries)
['Cook Islands', 'Eritrea', 'Niue', 'North Korea', 'South Sudan', 'Venezuela']

Missing in 'gini': (129 countries)
['Afghanistan', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Australia', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belize', 'Benin', 'Bhutan', 'Bosnia and Herzegovina', 'Botswana', 'Brunei', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Cape Verde', 'Central African Republic', 'Chad', 'Chile', 'Comoros', 'Cook Islands', 'Cuba', 'DR Congo', 'Djibouti', 'Dominica', 'Egypt', 'Equatorial Guinea', 'Eritrea', 'Eswatini', 'Ethiopia', 'Fiji', 'Gabon', 'Gambia', 'Ghana', 'Grenada', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'India', 'Iraq', 'Ivory Coast', 'Jamaica', 'Jordan', 'Kenya', 'Kiribati', 'Kuwait', 'Laos', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Micronesia', 'Monaco', 'Mongo

Polarization is not crucial for the analysis; it will be used as a moderator instead. Thus, for the final analysis only countries with missing values in gdp, health_exp, child_mort or internet_use will be excluded. Because the Gini index is missing for 129 countries, that variable will be dropped from the analysis.

In [44]:
# Creating a list of countries with any missing values in selected columns
countries_with_missing = df[df[['gdp', 'health_exp', 'child_mort', 'internet_use']].isna().any(axis=1)]['country'].unique()

# Removing those countries
df_cleaned = df[~df['country'].isin(countries_with_missing)].copy()

#dropping gini index
df_cleaned = df_cleaned.drop(['gini'], axis=1)

#final number of countries
print(f"Number of countries: {df_cleaned['country'].nunique()}")

Number of countries: 181
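
The exclusion rule removes a whole country as soon as one of its years lacks a required value. A minimal sketch on invented data shows the effect; the `required` list here is shortened to a single column for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: country B has one missing gdp value
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2021, 2022, 2021, 2022],
    "gdp": [1.0, 2.0, np.nan, 4.0],
    "polarization": [np.nan, 0.5, 0.1, 0.2],
})

required = ["gdp"]  # columns that must be complete for a country to be kept

# Countries with at least one missing value in any required column
incomplete = toy.loc[toy[required].isna().any(axis=1), "country"].unique()

# Drop all rows of those countries, not just the rows with gaps
complete = toy[~toy["country"].isin(incomplete)].copy()
print(complete)
```

Country `A` keeps both rows even though one `polarization` value is missing, because that column is not in the required list.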


This is the final DataFrame for exploratory analysis and predictions.

In [45]:
df_cleaned.to_csv("00_Immunization_db.csv", index=False)