<a href="https://colab.research.google.com/github/asolovey83/HomeExpenses/blob/master/GDP_and_Covid_Vaccination_Correlation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Correlation Between Nations' GDP and Percent of COVID-19 Vaccination

#### The purpose of this project is to determine whether there is a correlation between a nation's GDP (Gross Domestic Product) per capita and its current level of vaccination against COVID-19

**Resources:**<br>
https://ourworldindata.org/grapher/gdp-per-capita-worldbank <br>
https://ourworldindata.org/covid-vaccinations


In [2]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from google.colab import files

In [3]:
#Uploading GDP dataset from the local drive
uploaded = files.upload()

Saving gdp-per-capita-worldbank.csv to gdp-per-capita-worldbank.csv


In [11]:
# Reading dataset into DataFrame object
gdp = pd.read_csv("gdp-per-capita-worldbank.csv")

In [12]:
# GDP DataFrame preview. First 25 rows
gdp.head(25)

Unnamed: 0,Entity,Code,Year,"GDP per capita, PPP (constant 2017 international $)"
0,Afghanistan,AFG,2002,1189.784668
1,Afghanistan,AFG,2003,1235.810063
2,Afghanistan,AFG,2004,1200.278013
3,Afghanistan,AFG,2005,1286.793659
4,Afghanistan,AFG,2006,1315.789117


In [13]:
# Removing all years except 2020, the latest one
gdp = gdp[gdp.Year == 2020]
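
Filtering on `Year == 2020` drops any entity that has no 2020 row. As a minimal sketch, assuming made-up data with the same `Entity` and `Year` columns (the `GDP` column name and all values are illustrative only), the most recent available year per entity could be kept instead:

```python
import pandas as pd

# Made-up sample with the same layout as the GDP dataset (values are illustrative)
sample = pd.DataFrame({
    "Entity": ["A", "A", "B", "B", "C"],
    "Year":   [2019, 2020, 2018, 2019, 2020],
    "GDP":    [100.0, 110.0, 200.0, 210.0, 300.0],
})

# Strict filter: entity B has no 2020 row, so it disappears
only_2020 = sample[sample.Year == 2020]

# Alternative: keep the latest available year for every entity
latest = (
    sample.sort_values("Year")
          .groupby("Entity")
          .tail(1)
          .sort_values("Entity")
)

print(only_2020.Entity.tolist())
print(latest[["Entity", "Year"]].values.tolist())
```

This mixes GDP figures from different years, so it trades comparability for coverage.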

In [14]:
gdp.head()

Unnamed: 0,Entity,Code,Year,"GDP per capita, PPP (constant 2017 international $)"
18,Afghanistan,AFG,2020,1978.961579
49,Africa Eastern and Southern,,2020,3387.59467
80,Africa Western and Central,,2020,4003.158913
111,Albania,ALB,2020,13295.410885
142,Algeria,DZA,2020,10681.679297


In [15]:
# Removing unnecessary columns
gdp.drop(columns = ['Code', 'Year'], inplace = True)

In [16]:
gdp.head()

Unnamed: 0,Entity,"GDP per capita, PPP (constant 2017 international $)"
18,Afghanistan,1978.961579
49,Africa Eastern and Southern,3387.59467
80,Africa Western and Central,4003.158913
111,Albania,13295.410885
142,Algeria,10681.679297


In [45]:
# Renaming columns for better readability and further merging
gdp.rename(columns = {"Entity": "Country", "GDP per capita, PPP (constant 2017 international $)": "GDP"}, inplace = True)

In [46]:
gdp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 224 entries, 18 to 7108
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Country  224 non-null    object 
 1   GDP      224 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.2+ KB


In [51]:
gdp.head()

Unnamed: 0,Country,GDP
18,Afghanistan,1978.961579
49,Africa Eastern and Southern,3387.59467
80,Africa Western and Central,4003.158913
111,Albania,13295.410885
142,Algeria,10681.679297


In [19]:
# Uploading COVID dataset
uploaded = files.upload()

Saving owid-covid-data.csv to owid-covid-data.csv


In [21]:
# Reading Covid dataset into DataFrame object
covid = pd.read_csv("owid-covid-data.csv")

In [23]:
# COVID dataset preview
covid.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,0.126,0.126,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,0.126,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,39835428.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,,,,


In [43]:
# Previewing the share of people vaccinated with at least one dose, including missing values
covid.people_vaccinated_per_hundred.value_counts(dropna=False)

NaN      161
63.09      1
54.38      1
57.95      1
54.82      1
68.36      1
7.66       1
68.19      1
64.48      1
64.77      1
67.57      1
67.54      1
13.67      1
25.06      1
70.76      1
35.17      1
66.16      1
63.36      1
59.46      1
22.52      1
77.58      1
18.23      1
75.41      1
64.99      1
43.41      1
40.28      1
75.36      1
21.30      1
78.95      1
60.36      1
42.14      1
68.70      1
76.51      1
95.23      1
38.60      1
53.72      1
66.95      1
57.14      1
54.36      1
70.56      1
49.81      1
47.51      1
37.18      1
Name: people_vaccinated_per_hundred, dtype: int64

As the output shows, this vaccination figure is missing (NaN) for 161 locations. We will simply remove these locations from the analysis later.
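
The number and share of missing values can also be computed directly rather than read off `value_counts`. A small sketch on a made-up Series (the values are illustrative, not the real column):

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a vaccination column
s = pd.Series([63.09, np.nan, 7.66, np.nan, np.nan],
              name="people_vaccinated_per_hundred")

n_missing = int(s.isna().sum())   # count of NaN entries
share_missing = s.isna().mean()   # fraction of NaN entries

print(n_missing, share_missing)
```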

In [26]:
# Previewing info on COVID dataset columns
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124199 entries, 0 to 124198
Data columns (total 65 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   iso_code                                 124199 non-null  object 
 1   continent                                118567 non-null  object 
 2   location                                 124199 non-null  object 
 3   date                                     124199 non-null  object 
 4   total_cases                              117538 non-null  float64
 5   new_cases                                117534 non-null  float64
 6   new_cases_smoothed                       116520 non-null  float64
 7   total_deaths                             106658 non-null  float64
 8   new_deaths                               106811 non-null  float64
 9   new_deaths_smoothed                      116520 non-null  float64
 10  total_cases_per_million         

In [27]:
# Converting Date column from string to datetime format
covid.date = pd.to_datetime(covid.date)

In [31]:
# Leaving only the most recent information (the latest date)
covid = covid[covid.date == covid.date.max()]

In [32]:
covid.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
600,AFG,Asia,Afghanistan,2021-10-16,155739.0,51.0,39.0,7238.0,0.0,2.429,3909.56,1.28,0.979,181.698,0.0,0.061,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,39835430.0,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,,,,
1212,OWID_AFR,,Africa,2021-10-16,8427524.0,4618.0,6061.0,215310.0,224.0,229.429,6135.862,3.362,4.413,156.762,0.163,0.167,,,,,,,,,,,,,,,,,,,171780334.0,105179629.0,69031700.0,,20468.0,845594.0,12.51,7.66,5.03,,616.0,,1373486000.0,,,,,,,,,,,,,,,,,,
1812,ALB,Europe,Albania,2021-10-16,177536.0,428.0,413.286,2810.0,3.0,7.286,61796.059,148.977,143.855,978.094,1.044,2.536,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2872934.0,104.871,38.0,13.188,8.643,11803.431,1.1,304.195,10.08,7.1,51.2,,2.89,78.57,0.795,,,,
2412,DZA,Africa,Algeria,2021-10-16,205199.0,93.0,101.286,5870.0,3.0,2.857,4599.16,2.084,2.27,131.565,0.067,0.064,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,44616630.0,17.348,29.1,6.211,3.857,13913.839,0.5,278.364,6.73,0.7,30.4,83.741,1.9,76.88,0.748,,,,
3006,AND,Europe,Andorra,2021-10-16,15338.0,0.0,6.714,130.0,0.0,0.0,198283.217,0.0,86.799,1680.585,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,77354.0,163.755,,,,,,109.135,7.97,29.0,37.8,,,83.73,0.868,,,,


In [34]:
# Counting how many countries have data on fully vaccinated people
covid.people_fully_vaccinated_per_hundred.count()

45

It turns out that, of the 203 locations present on the latest date, only 45 have a value for fully vaccinated people. We will work with what is available.
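
One possible reason so few values survive is that keeping only the latest date discards every location that did not report vaccinations on that exact day. A hedged sketch, on made-up rows with the same column names, of taking each location's most recent non-missing value instead (this is a different method from the one used in this notebook, so its results would differ):

```python
import numpy as np
import pandas as pd

col = "people_fully_vaccinated_per_hundred"

# Made-up rows: location Y did not report on the latest date
sample = pd.DataFrame({
    "location": ["X", "X", "Y", "Y"],
    "date": pd.to_datetime(["2021-10-15", "2021-10-16",
                            "2021-10-15", "2021-10-16"]),
    col: [40.0, 41.0, 10.0, np.nan],
})

# Approach used in the notebook: only the latest date
latest_day = sample[sample.date == sample.date.max()]
n_latest_day = int(latest_day[col].count())

# Alternative: last reported (non-missing) value per location
latest_reported = (
    sample.dropna(subset=[col])
          .sort_values("date")
          .groupby("location", as_index=False)
          .last()
)

print(n_latest_day)
print(latest_reported)
```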

In [37]:
# Creating a separate DataFrame containing only information about vaccinated people.
vac = covid[['location', 'people_fully_vaccinated_per_hundred']]
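
A DataFrame produced by selecting columns from another DataFrame can trigger pandas' `SettingWithCopyWarning` when it is later modified in place. A minimal sketch, on made-up data (the `source` and `subset` names are hypothetical), of taking an explicit copy so that later in-place changes clearly apply to an independent object:

```python
import pandas as pd

# Made-up stand-in for the filtered COVID DataFrame
source = pd.DataFrame({
    "location": ["X", "Y"],
    "people_fully_vaccinated_per_hundred": [41.0, 10.0],
    "other": [1, 2],
})

# .copy() makes the selection an independent DataFrame
subset = source[["location", "people_fully_vaccinated_per_hundred"]].copy()
subset.rename(columns={"location": "Country"}, inplace=True)

print(list(subset.columns))
print(list(source.columns))  # the original is unchanged
```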

In [38]:
# Previewing new DataFrame
vac.head()

Unnamed: 0,location,people_fully_vaccinated_per_hundred
600,Afghanistan,
1212,Africa,5.03
1812,Albania,
2412,Algeria,
3006,Andorra,


In [39]:
vac.tail()

Unnamed: 0,location,people_fully_vaccinated_per_hundred
121651,Vietnam,
122489,World,35.95
123044,Yemen,
123622,Zambia,2.6
124198,Zimbabwe,16.39


In [47]:
# Renaming columns for better readability and further merging
# (assigning the result avoids pandas' SettingWithCopyWarning on a sliced DataFrame)
vac = vac.rename(columns = {"location": "Country", "people_fully_vaccinated_per_hundred": "% Vaccinated"})


In [50]:
# Previewing the renamed DataFrame
vac.head()

Unnamed: 0,Country,% Vaccinated
600,Afghanistan,
1212,Africa,5.03
1812,Albania,
2412,Algeria,
3006,Andorra,


In [48]:
# Comparing how many countries are in the GDP DataFrame and Vaccination DataFrame
gdp.Country.nunique()

224

In [52]:
vac.Country.nunique()

203

We can see that the vaccination DataFrame contains fewer countries. Let's merge both DataFrames so that only the countries present in both remain in the output.
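
An inner merge on names also drops any country whose name is spelled differently in the two sources. A hedged sketch, on made-up frames with invented names, of using an outer merge with `indicator=True` to list the rows that would not match:

```python
import pandas as pd

# Made-up frames; "Beta" and "Beta Republic" imitate a naming mismatch
gdp_s = pd.DataFrame({"Country": ["Alpha", "Beta", "Gamma"],
                      "GDP": [1.0, 2.0, 3.0]})
vac_s = pd.DataFrame({"Country": ["Alpha", "Beta Republic"],
                      "% Vaccinated": [30.0, 56.0]})

# The outer merge keeps every row; the _merge column says where each came from
check = pd.merge(gdp_s, vac_s, on="Country", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Country", "_merge"]])

# The default inner merge keeps only exact name matches
inner = pd.merge(gdp_s, vac_s, on="Country")
print(inner)
```

Inspecting `unmatched` before merging makes it possible to spot names that need harmonizing.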

In [53]:
# Merging the two DataFrames into one (inner join on Country, the default)
result = pd.merge(gdp, vac, on="Country")

In [55]:
# Previewing resulting DataFrame
result.head(25)

Unnamed: 0,Country,GDP,% Vaccinated
0,Afghanistan,1978.961579,
1,Albania,13295.410885,
2,Algeria,10681.679297,
3,Angola,6198.083841,
4,Antigua and Barbuda,17956.315716,
5,Argentina,19686.523659,53.75
6,Armenia,12592.635368,
7,Australia,48697.837028,55.24
8,Austria,51935.603862,
9,Azerbaijan,13699.66559,


In [56]:
result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 177
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country       178 non-null    object 
 1   GDP           178 non-null    float64
 2   % Vaccinated  39 non-null     float64
dtypes: float64(2), object(1)
memory usage: 5.6+ KB


In [58]:
# Removing rows where either GDP or % Vaccinated is missing, because correlation cannot be calculated on incomplete rows
result.dropna(inplace = True)

In [59]:
result.head()

Unnamed: 0,Country,GDP,% Vaccinated
5,Argentina,19686.523659,53.75
7,Australia,48697.837028,55.24
11,Bahrain,40933.352664,64.8
12,Bangladesh,4818.094737,11.23
24,Bulgaria,22383.805544,20.07


In [60]:
result.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39 entries, 5 to 177
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country       39 non-null     object 
 1   GDP           39 non-null     float64
 2   % Vaccinated  39 non-null     float64
dtypes: float64(2), object(1)
memory usage: 1.2+ KB


Only 39 entries remain that have both GDP and % Vaccinated. However, a common rule of thumb holds that a sample of more than about 25 observations is enough to calculate a correlation.
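
Whether a sample is large enough also depends on how strong the correlation turns out to be. One common check (a sketch, not part of the original analysis) is the t-statistic for a Pearson coefficient, t = r * sqrt((n - 2) / (1 - r^2)), which is compared against a Student t distribution with n - 2 degrees of freedom. The function name `pearson_t_statistic` is hypothetical, and the values r = 0.5 and n = 27 are illustrative only:

```python
import math

def pearson_t_statistic(r: float, n: int) -> float:
    """t-statistic for testing whether a Pearson correlation differs from zero."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

# Illustrative values, not taken from the data
t = pearson_t_statistic(0.5, 27)
print(t)
```

For samples of a few dozen observations, a t-statistic well above roughly 2 in absolute value is usually taken as significant at the 5% level.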

In [61]:
# Correlation between GDP and % of Fully Vaccinated people in countries
print("Correlation between nation's GDP and % of Vaccinated people is ", result.GDP.corr(result['% Vaccinated']))

Correlation between nation's GDP and % of Vaccinated people is  0.7207865452418307


With a coefficient of about 0.72 (Pearson, the pandas default), there is a strong positive correlation between a nation's GDP per capita and its percentage of fully vaccinated people at the moment.
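
Pearson correlation measures linear association and can be sensitive to a few very rich or very poor countries. As a robustness sketch (not part of the original analysis), a rank-based Spearman coefficient can be computed as the Pearson correlation of the ranks. The data below is made up:

```python
import pandas as pd

# Made-up data: the relationship is monotonic but not linear
df = pd.DataFrame({
    "GDP": [1000.0, 5000.0, 20000.0, 40000.0, 90000.0],
    "% Vaccinated": [5.0, 20.0, 55.0, 60.0, 62.0],
})

# Pearson on the raw values (the pandas default)
pearson = df["GDP"].corr(df["% Vaccinated"])

# Spearman: Pearson correlation of the ranks
spearman = df["GDP"].rank().corr(df["% Vaccinated"].rank())

print(pearson, spearman)
```

If the two coefficients differ substantially on the real data, the relationship is probably monotonic rather than strictly linear.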

In [64]:
# Printing all the countries just for visibility
result.head(39)

Unnamed: 0,Country,GDP,% Vaccinated
5,Argentina,19686.523659,53.75
7,Australia,48697.837028,55.24
11,Bahrain,40933.352664,64.8
12,Bangladesh,4818.094737,11.23
24,Bulgaria,22383.805544,20.07
29,Canada,45856.625626,72.72
42,Czechia,38319.337663,56.2
55,European Union,41504.159149,64.12
57,Finland,47260.800458,66.75
64,Greece,27287.083401,60.76


## **Conclusion:** Although a strong correlation was found between nations' GDP and the percentage of people fully vaccinated against COVID-19, this does not mean that there is a direct dependency. My bet is that there is a variable, or a number of variables, on which both of these figures depend. I would guess that the level of education is among those variables.