# Dataset Information
The main [dataset](https://www.kaggle.com/datasets/gpreda/covid-world-vaccination-progress) used in this project (collected from the [Our World in Data COVID-19 repository](https://github.com/owid/covid-19-data)) provides detailed information on the COVID-19 vaccination status of different countries. The data is updated daily, and a second file containing per-manufacturer information is also available.
Among the information available, we find:
* total_vaccinations: The absolute number of doses administered in the country.
* people_vaccinated: The number of people who received at least one dose. Depending on the immunization scheme, a person receives one or more (typically 2) doses, so at any given moment the number of vaccinations may be larger than the number of people vaccinated.
* people_fully_vaccinated: The number of people who received the entire set of doses required by the immunization scheme (typically 2). At any given moment, some people have received one dose while a smaller number have received all doses in the scheme.
* daily_vaccinations_raw: For a given data entry, the number of vaccinations for that date/country, taken as the day-to-day change in the reported total.
* daily_vaccinations: For a given data entry, the number of vaccinations for that date/country. In the upstream Our World in Data source this column is a 7-day smoothed version of the raw daily figure, which is why it has far fewer missing values.
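
As a rough illustration of how these columns relate, the sketch below uses a made-up three-day series for one hypothetical country. It assumes a strict two-dose scheme with no boosters, under which total doses equal first doses plus second doses; real data with single-dose vaccines or boosters will not satisfy this identity. The column names `daily_raw` and `doses_check` are illustrative only.

```python
import pandas as pd

# Toy cumulative counts for one hypothetical country (not real data).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02", "2021-03-03"]),
    "total_vaccinations": [100.0, 160.0, 250.0],
    "people_vaccinated": [80.0, 120.0, 170.0],
    "people_fully_vaccinated": [20.0, 40.0, 80.0],
})

# Day-over-day change in cumulative doses, analogous to daily_vaccinations_raw.
toy["daily_raw"] = toy["total_vaccinations"].diff()

# Under the two-dose assumption: doses = first doses + second doses.
toy["doses_check"] = toy["people_vaccinated"] + toy["people_fully_vaccinated"]

print(toy)
```

The first `daily_raw` value is NaN because there is no previous day to subtract, which mirrors one of the reasons the raw daily column contains gaps.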

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Data Loading

In [2]:
countries_df = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv')
manufacturer_df = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations_by_manufacturer.csv')

In [3]:
countries_df.head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,0.0,0.0,,,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,,,,34.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,,,,34.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,,,,34.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,,,,34.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi...",World Health Organization,https://covid19.who.int/


In [4]:
manufacturer_df.head()

Unnamed: 0,location,date,vaccine,total_vaccinations
0,Argentina,2020-12-29,Moderna,2
1,Argentina,2020-12-29,Oxford/AstraZeneca,3
2,Argentina,2020-12-29,Sinopharm/Beijing,1
3,Argentina,2020-12-29,Sputnik V,20481
4,Argentina,2020-12-30,Moderna,2


# Data Cleaning
The data cleaning process consisted of checking for duplicates, meaningless NaN, and missing values, as well as casting appropriate data types to facilitate the analysis.

Starting with the manufacturer dataset, the `.info()` method provides a quick view of the dataset metadata

In [5]:
manufacturer_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35623 entries, 0 to 35622
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   location            35623 non-null  object
 1   date                35623 non-null  object
 2   vaccine             35623 non-null  object
 3   total_vaccinations  35623 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 1.1+ MB


Calling `.isna().sum()` and `.duplicated().sum()` returns the number of NaN values per column and the number of duplicated rows.

In [6]:
manufacturer_df.isna().sum()


location              0
date                  0
vaccine               0
total_vaccinations    0
dtype: int64

In [7]:
manufacturer_df.duplicated().sum()

0

Since the manufacturer dataset has no missing values or duplicates, we repeat the process for the vaccination dataset.

In [8]:
countries_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86512 entries, 0 to 86511
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   country                              86512 non-null  object 
 1   iso_code                             86512 non-null  object 
 2   date                                 86512 non-null  object 
 3   total_vaccinations                   43607 non-null  float64
 4   people_vaccinated                    41294 non-null  float64
 5   people_fully_vaccinated              38802 non-null  float64
 6   daily_vaccinations_raw               35362 non-null  float64
 7   daily_vaccinations                   86213 non-null  float64
 8   total_vaccinations_per_hundred       43607 non-null  float64
 9   people_vaccinated_per_hundred        41294 non-null  float64
 10  people_fully_vaccinated_per_hundred  38802 non-null  float64
 11  daily_vaccinations_per_milli

Checking the data types, we notice that the date column is stored as object. We will parse it as a datetime and downcast the float64 columns to float32 to reduce memory usage. First, we select the columns that are relevant to our analysis and load them into a separate data frame, leaving the original dataset intact.

In [9]:
# Let's check the complete list of the columns.
countries_df.columns

Index(['country', 'iso_code', 'date', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated',
       'daily_vaccinations_raw', 'daily_vaccinations',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred', 'daily_vaccinations_per_million',
       'vaccines', 'source_name', 'source_website'],
      dtype='object')

In [10]:
#Creating new df preserves the original data
# Selected columns
selected_cols = ['country', 'iso_code', 'date', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated',
       'daily_vaccinations_raw', 'daily_vaccinations','vaccines']

# Redefining data types
selected_dtypes = {
    'total_vaccinations': 'float32',
    'people_vaccinated': 'float32',
    'people_fully_vaccinated': 'float32',
    'daily_vaccinations_raw': 'float32',
    'daily_vaccinations': 'float32'  
}

# New dataset with corrected types
vac_df = pd.read_csv('../input/covid-world-vaccination-progress/country_vaccinations.csv', 
                            usecols=selected_cols, 
                            dtype=selected_dtypes, 
                            parse_dates=['date'])

In [11]:
vac_df.head()

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,vaccines
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi..."
1,Afghanistan,AFG,2021-02-23,,,,,1367.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi..."
2,Afghanistan,AFG,2021-02-24,,,,,1367.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi..."
3,Afghanistan,AFG,2021-02-25,,,,,1367.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi..."
4,Afghanistan,AFG,2021-02-26,,,,,1367.0,"Johnson&Johnson, Oxford/AstraZeneca, Pfizer/Bi..."


In [12]:
# New info
vac_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86512 entries, 0 to 86511
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   country                  86512 non-null  object        
 1   iso_code                 86512 non-null  object        
 2   date                     86512 non-null  datetime64[ns]
 3   total_vaccinations       43607 non-null  float32       
 4   people_vaccinated        41294 non-null  float32       
 5   people_fully_vaccinated  38802 non-null  float32       
 6   daily_vaccinations_raw   35362 non-null  float32       
 7   daily_vaccinations       86213 non-null  float32       
 8   vaccines                 86512 non-null  object        
dtypes: datetime64[ns](1), float32(5), object(3)
memory usage: 4.3+ MB
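
One caveat with `float32` is worth keeping in mind: it stores only 24 bits of precision, so integers above 2**24 (about 16.8 million) are not all exactly representable. The sketch below demonstrates this with a hypothetical count just above that limit and shows the gap between neighbouring `float32` values near 3.26 billion, the order of magnitude of the largest totals in this dataset. For the bar and line charts in this notebook the rounding is negligible, but `float64` may be the safer choice if exact counts matter.

```python
import numpy as np

# Hypothetical dose count just above 2**24, the float32 exact-integer limit.
count = 2 ** 24 + 1  # 16_777_217

as_f32 = np.float32(count)
as_f64 = np.float64(count)

print(int(as_f32))  # float32 rounds to a neighbouring representable value
print(int(as_f64))  # float64 keeps the value exact

# Distance between adjacent float32 values near 3.26 billion.
ulp_at_large_total = np.spacing(np.float32(3.263129e9))
print(ulp_at_large_total)
```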


Next, we count the duplicated rows and the NaN values.

In [13]:
vac_df.duplicated().sum()

0

In [14]:
vac_df.isna().sum()


country                        0
iso_code                       0
date                           0
total_vaccinations         42905
people_vaccinated          45218
people_fully_vaccinated    47710
daily_vaccinations_raw     51150
daily_vaccinations           299
vaccines                       0
dtype: int64

A closer look at the dataset documentation reveals that the NaN values are meaningful in this dataset (countries started vaccinating on different dates, use different dose schemes, and do not report every day), so they will be kept as is.
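
Keeping the NaN values is safe for aggregations such as `max()`, because pandas skips them by default. If a later step needs a cumulative value on every date, one option (an assumption on our part, not something this notebook does) is a per-country forward fill. The sketch below uses made-up data for two hypothetical countries, and `total_filled` is an illustrative column name.

```python
import numpy as np
import pandas as pd

# Toy data: two hypothetical countries with gaps in the cumulative column.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "total_vaccinations": [np.nan, 10.0, np.nan, 5.0, np.nan, 9.0],
})

# max() ignores NaN, so the per-country maximum is still well defined.
per_country_max = toy.groupby("country")["total_vaccinations"].max()

# Forward fill within each country so values never leak across countries.
toy = toy.sort_values(["country", "date"])
toy["total_filled"] = toy.groupby("country")["total_vaccinations"].ffill()

print(per_country_max)
print(toy)
```

Rows before a country's first report stay NaN after the fill, which preserves the information that vaccination had not yet been reported there.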

# Data Visualization
Based on the main dataset (vac_df), we can create different data frames to support visualization. We can group by country and select the maximum value for each data point (as the counters increase daily).

In [15]:
countries_vac = vac_df.groupby('country')[['total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated']].max().sort_values(['total_vaccinations'], ascending=False).reset_index()
countries_vac

Unnamed: 0,country,total_vaccinations,people_vaccinated,people_fully_vaccinated
0,China,3.263129e+09,1.275541e+09,1.240777e+09
1,India,1.834501e+09,9.848381e+08,8.282294e+08
2,United States,5.601818e+08,2.553624e+08,2.174990e+08
3,Brazil,4.135596e+08,1.810781e+08,1.602729e+08
4,Indonesia,3.771089e+08,1.962409e+08,1.588305e+08
...,...,...,...,...
218,Falkland Islands,4.407000e+03,2.632000e+03,1.775000e+03
219,Montserrat,4.211000e+03,1.897000e+03,1.804000e+03
220,Niue,4.161000e+03,1.650000e+03,1.417000e+03
221,Tokelau,1.936000e+03,9.680000e+02,9.680000e+02


The bar chart below shows the 20 countries with the highest total vaccination counts.

In [16]:
# Plot
fig = px.bar(countries_vac.head(20), x='country', y='total_vaccinations',
            hover_data=['total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated'], color='country')

# Set style and labels
fig.update_layout(template='presentation')
fig.update_layout(autosize=False, width=900, height=900)

fig.update_layout(title="Top 20 Countries by Total Vaccinations",
                 xaxis_title= "",
                 yaxis_title= "Total Vaccinations")

# Show plot
fig.show()


Total counts favor populous nations, so it is also worth comparing the number of fully vaccinated people with the number of people who received at least one dose, which gives insight into how each country's vaccination program is progressing. Note that the slices in the plot below start at index 1, so China is left out.

In [17]:
# Set figure
fig = go.Figure()

# Plot
fig.add_trace(go.Bar(x=countries_vac[1:20].country, y=countries_vac[1:20].people_fully_vaccinated,
                    name='Fully vaccinated'))
fig.add_trace(go.Bar(x=countries_vac[1:20].country, y=countries_vac[1:20].people_vaccinated,
                    name='Partially Vaccinated'))

# Set style and labels
fig.update_layout(template='simple_white')
fig.update_layout(title="Fully vaccinated vs Partially Vaccinated",
                 xaxis_title = "Country Name", yaxis_title= "Population")

# Show
fig.show()
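
The same comparison can be condensed into a single number per country. The sketch below computes the share of people with at least one dose who completed the scheme, using the values shown for rows 1 to 4 in the `countries_vac` output earlier. The column name `completion_ratio` is an illustrative choice, not part of the dataset.

```python
import pandas as pd

# Values copied from the countries_vac output (rows 1-4).
sample = pd.DataFrame({
    "country": ["India", "United States", "Brazil", "Indonesia"],
    "people_vaccinated": [9.848381e8, 2.553624e8, 1.810781e8, 1.962409e8],
    "people_fully_vaccinated": [8.282294e8, 2.174990e8, 1.602729e8, 1.588305e8],
})

# Share of people with at least one dose who are also fully vaccinated.
sample["completion_ratio"] = (
    sample["people_fully_vaccinated"] / sample["people_vaccinated"]
)

print(sample.sort_values("completion_ratio", ascending=False))
```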

We can check how the cumulative total vaccinations and fully vaccinated counts evolve over time. We will limit this visualization to the 10 nations with the highest total vaccination numbers.

In [18]:
# Select top 10
top_10 = list(countries_vac.country.head(10))
top_10_country = vac_df[vac_df['country'].isin(top_10)]

# Plot
fig = px.line(top_10_country, x='date', y='total_vaccinations', 
              color='country', line_group='country', hover_name='country')

# Set style and labels
fig.update_layout(title="Total Vaccination Over Time",
                  xaxis_title="Date",
                  yaxis_title="Total Vaccinations")

# Show
fig.show()


In [19]:
# Plot
fig = px.line(top_10_country, x='date', y='people_fully_vaccinated', 
              color='country', line_group='country', hover_name='country')

# Set style and labels
fig.update_layout(title="Fully Vaccinated People Over Time",
                 xaxis_title="Date",
                 yaxis_title="Fully Vaccinated")

# Show
fig.show()

To compare vaccination progress against each country's entire population, we add another dataset to the project. We rename its columns so that both data frames share a `country` key for the pandas `merge` method.

In [20]:
population_df = pd.read_csv('../input/population-by-country-2020/population_by_country_2020.csv', usecols=['Country (or dependency)', 'Population (2020)'])
population_df= population_df.rename(columns={'Country (or dependency)':'country',
                               'Population (2020)':'population'})

In [21]:
population_df.head()

Unnamed: 0,country,population
0,China,1440297825
1,India,1382345085
2,United States,331341050
3,Indonesia,274021604
4,Pakistan,221612785


In [22]:
merged_df = countries_vac.merge(population_df,left_on='country', right_on='country', how='left')
merged_df.head()

Unnamed: 0,country,total_vaccinations,people_vaccinated,people_fully_vaccinated,population
0,China,3263129000.0,1275541000.0,1240777000.0,1440298000.0
1,India,1834501000.0,984838100.0,828229400.0,1382345000.0
2,United States,560181800.0,255362400.0,217499000.0,331341000.0
3,Brazil,413559600.0,181078100.0,160272900.0,212822000.0
4,Indonesia,377108900.0,196240900.0,158830500.0,274021600.0
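
Because the two datasets come from different sources, country names may not always match, and a left merge silently leaves `population` as NaN for unmatched rows. A minimal sketch of how such mismatches could be surfaced with the `indicator` option of `merge` follows. The frames and the mismatched spelling are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy frames; the differing spelling in the last row is a made-up example.
left = pd.DataFrame({
    "country": ["China", "India", "Czechia"],
    "total_vaccinations": [3.0, 2.0, 1.0],
})
right = pd.DataFrame({
    "country": ["China", "India", "Czech Republic (Czechia)"],
    "population": [10, 20, 30],
})

# indicator=True adds a _merge column recording where each row was found.
checked = left.merge(right, on="country", how="left", indicator=True)

# Rows present only in the left frame found no population match.
unmatched = checked.loc[checked["_merge"] == "left_only", "country"].tolist()
print(unmatched)
```

Any names listed in `unmatched` could then be harmonized with a small renaming dictionary before merging again.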


We can now compare each country's total population with its vaccinated and fully vaccinated counts. The plot covers the ten countries that follow China in total vaccinations.

In [23]:
# Plot
fig = go.Figure(data=[
    go.Bar(x=merged_df[1:11].country, y=merged_df[1:11].people_fully_vaccinated, name='fully vaccinated'),
    go.Bar(x=merged_df[1:11].country, y=merged_df[1:11].people_vaccinated, name='partially vaccinated'),
    go.Bar(x=merged_df[1:11].country, y=merged_df[1:11].population, name='population')
])

# Set style and labels
fig.update_layout(barmode='group')
fig.update_layout(template='simple_white')
fig.update_layout(title="A comparative study of fully vaccinated, partially vaccinated, and total population per country",
                  xaxis_title= "Country Name",
                  yaxis_title= "Population")

# Show
fig.show()
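
Absolute bars are hard to compare across countries of very different sizes, so a percentage of the population can complement the chart. The sketch below uses the values shown for rows 1 to 4 in the `merged_df` output. The column name `fully_vaccinated_pct` is an illustrative choice, and the result depends on the 2020 population estimates being a reasonable denominator.

```python
import pandas as pd

# Values copied from the merged_df output (rows 1-4).
sample = pd.DataFrame({
    "country": ["India", "United States", "Brazil", "Indonesia"],
    "people_fully_vaccinated": [828229400.0, 217499000.0, 160272900.0, 158830500.0],
    "population": [1382345000.0, 331341000.0, 212822000.0, 274021600.0],
})

# Percentage of the population that is fully vaccinated, rounded to 0.1.
sample["fully_vaccinated_pct"] = (
    100 * sample["people_fully_vaccinated"] / sample["population"]
).round(1)

print(sample[["country", "fully_vaccinated_pct"]])
```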

For the ten most populous countries, we can visualize the daily vaccination progress through an interactive dropdown.

In [24]:
# Set initial visualization to match the first dropdown entry (Bangladesh, alphabetically first)
country_name = 'Bangladesh' 
country_vac = vac_df[vac_df['country']== country_name]

In [25]:
# Select top 10 highest population countries as label
top_10 = list(merged_df.sort_values('population', ascending=False)['country'].head(10))
top_10 = sorted(top_10)
print(top_10)

['Bangladesh', 'Brazil', 'China', 'India', 'Indonesia', 'Mexico', 'Nigeria', 'Pakistan', 'Russia', 'United States']


In [26]:
# Set Figure
fig = go.Figure()

# Plot
fig.add_trace(go.Scatter(x=country_vac.date,
                        y=country_vac.daily_vaccinations, visible=True))

# Interactive (through different countries)
buttons = []
for x in top_10:
    buttons.append(dict(method='restyle',
                        label=x,
                        visible=True,
                        args=[{'x':[vac_df[vac_df['country']== x].date],
                               'y':[vac_df[vac_df['country']== x].daily_vaccinations],
                               'type':'scatter'}, [0]],
                       )
                  )
updatemenu = []
your_menu = dict()
updatemenu.append(your_menu)

updatemenu[0]['buttons'] = buttons
updatemenu[0]['direction'] = 'down'
updatemenu[0]['showactive'] = True

# add dropdown menus to the figure
fig.update_layout(showlegend=False, updatemenus=updatemenu)

# Set style and Labels
fig.update_layout(template="presentation")
fig.update_layout(title="Daily vaccination graph")
fig.show()

Since the manufacturer data is available, we can check the distribution for different vaccines.

In [27]:
title = "Vaccine distribution over countries"
# Plot
fig = go.Figure(data=[go.Pie(labels=manufacturer_df.vaccine, values=manufacturer_df.total_vaccinations, hole=.3)])

# Set style and labels
fig.update_layout(template="presentation")
fig.update_layout(title=title)

# Show
fig.show()
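
One caveat about the pie chart: the `head()` output of the manufacturer file suggests that `total_vaccinations` is cumulative per date, and a pie chart sums the values of repeated labels, so passing every row adds up all daily snapshots. If the column is indeed cumulative, one alternative is to take the latest (maximum) value per location and vaccine before summing per vaccine. The sketch below uses made-up locations and vaccine names to show the difference.

```python
import pandas as pd

# Toy cumulative data; locations and vaccine names are made up.
toy = pd.DataFrame({
    "location": ["X", "X", "X", "X", "Y", "Y"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01",
             "2021-01-02", "2021-01-01", "2021-01-02"],
    "vaccine": ["Alpha", "Alpha", "Beta", "Beta", "Alpha", "Alpha"],
    "total_vaccinations": [10, 30, 5, 8, 100, 150],
})

# Latest cumulative value per (location, vaccine); max works because
# cumulative counts do not decrease.
latest = toy.groupby(["location", "vaccine"])["total_vaccinations"].max()

# Doses per vaccine across all locations.
per_vaccine = latest.groupby(level="vaccine").sum()

# Summing every row instead counts each daily snapshot again.
naive = toy.groupby("vaccine")["total_vaccinations"].sum()

print(per_vaccine)
print(naive)
```

The resulting `per_vaccine` index and values could be passed as `labels` and `values` to the pie chart in place of the raw columns.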

# Conclusions
With a rapid increase starting in April 2021, China has the largest number of vaccinated people of any nation. However, the United States was the first country whose vaccination curve flattened, with roughly half of its population fully vaccinated by June 2021. Among the vaccines in the manufacturer dataset, Pfizer/BioNTech leads in distribution by a wide margin.

As of May 29, 2022, Brazil has fully vaccinated over 75% of its population. After this first look through the dataset, the metrics for specific countries, such as Brazil, can be explored in future projects.

**Special Credits**

**This notebook was closely inspired by [Ankit Pratap Singh](https://ankit-pratap-singh.medium.com/)'s work**