# COVID-19 Vaccination Dashboard Data Preparation

This notebook downloads, cleans, analyzes, and exports COVID-19 vaccination data for the Streamlit dashboard.

In [1]:
import pandas as pd
import plotly.express as px

## 1. Load the Data
Load the latest vaccination data from Our World in Data.

In [2]:
url = "https://raw.githubusercontent.com/owid/covid-19-data/refs/heads/master/public/data/vaccinations/vaccinations.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,location,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,daily_vaccinations_per_million,daily_people_vaccinated,daily_people_vaccinated_per_hundred
0,Afghanistan,AFG,2021-02-22,0.0,0.0,,,,,0.0,0.0,,,,,
1,Afghanistan,AFG,2021-02-23,,,,,,1367.0,,,,,33.0,1367.0,0.003
2,Afghanistan,AFG,2021-02-24,,,,,,1367.0,,,,,33.0,1367.0,0.003
3,Afghanistan,AFG,2021-02-25,,,,,,1367.0,,,,,33.0,1367.0,0.003
4,Afghanistan,AFG,2021-02-26,,,,,,1367.0,,,,,33.0,1367.0,0.003
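Because the CSV is fetched from a remote source, its columns could change between runs. The following is a minimal sketch of a check that the columns used later are present. The helper `missing_columns`, the `REQUIRED_COLUMNS` list, and the `sample` frame are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical list of columns that later cells rely on; adjust as needed.
REQUIRED_COLUMNS = ["location", "iso_code", "date",
                    "total_vaccinations", "daily_vaccinations"]

def missing_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]

# Small stand-in frame so the example runs without a network call.
sample = pd.DataFrame({
    "location": ["Afghanistan"],
    "iso_code": ["AFG"],
    "date": ["2021-02-22"],
})

missing = missing_columns(sample, REQUIRED_COLUMNS)
print(missing)
```

In the notebook itself, the same helper could be called on `df` right after `pd.read_csv(url)`, raising an error if the returned list is not empty.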


## 2. Advanced Data Cleaning
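We perform the following steps:
1. Sort by location and date.
2. Convert date to datetime.
3. **Forward Fill** cumulative columns within each location: cumulative totals are reported intermittently, so each gap should carry the last reported value.
4. Fill the cumulative values still missing at the start of each series with 0.
5. Fill missing daily statistics with 0.
6. Fill missing text values with 'Unknown'.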
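The effect of a per-location forward fill can be seen on a toy frame. This is a small sketch with made-up values, not real vaccination figures. It shows why the fill is grouped by `location`: without the grouping, the last value of one location would leak into the leading gap of the next.

```python
import pandas as pd

nan = float("nan")

# Toy data (hypothetical values): two locations with gaps in a cumulative column.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "total_vaccinations": [nan, 100.0, nan, nan, 50.0, nan],
})
toy = toy.sort_values(["location", "date"])

# Forward fill within each location so values never carry over from A into B.
toy["total_vaccinations"] = toy.groupby("location")["total_vaccinations"].ffill()

# Gaps before the first report have nothing to carry forward; set them to 0.
toy["total_vaccinations"] = toy["total_vaccinations"].fillna(0)

print(toy["total_vaccinations"].tolist())
# -> [0.0, 100.0, 100.0, 0.0, 50.0, 50.0]
```

Note that the first row of location B becomes 0 rather than 100, which is what an ungrouped `ffill()` would have produced.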

In [7]:
# 1. Sort by location and date to ensure correct time-series filling
df = df.sort_values(['location', 'date'])

# 2. Correct data types: convert 'date' column to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# 3. Forward fill cumulative columns (propagates the last valid observation forward)
cumulative_cols = [
    'total_vaccinations', 
    'people_vaccinated', 
    'people_fully_vaccinated', 
    'total_boosters',
    'total_vaccinations_per_hundred',
    'people_vaccinated_per_hundred',
    'people_fully_vaccinated_per_hundred',
    'total_boosters_per_hundred'
]
df[cumulative_cols] = df.groupby('location')[cumulative_cols].ffill()

# 4. Fill remaining NaNs in cumulative columns with 0 (for the start of the series)
df[cumulative_cols] = df[cumulative_cols].fillna(0)

# 5. Fill daily stats with 0 (if missing, we assume no vaccinations reported that day)
daily_cols = [
    'daily_vaccinations_raw', 
    'daily_vaccinations', 
    'daily_vaccinations_per_million', 
    'daily_people_vaccinated', 
    'daily_people_vaccinated_per_hundred'
]
df[daily_cols] = df[daily_cols].fillna(0)

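# 6. Fill missing values in the text columns with 'Unknown'
# (restricted to these columns so a NaT in 'date' is not replaced by a string)
text_cols = ['location', 'iso_code']
df[text_cols] = df[text_cols].fillna('Unknown')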

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196246 entries, 0 to 196245
Data columns (total 16 columns):
 #   Column                               Non-Null Count   Dtype         
---  ------                               --------------   -----         
 0   location                             196246 non-null  object        
 1   iso_code                             196246 non-null  object        
 2   date                                 196246 non-null  datetime64[ns]
 3   total_vaccinations                   196246 non-null  float64       
 4   people_vaccinated                    196246 non-null  float64       
 5   people_fully_vaccinated              196246 non-null  float64       
 6   total_boosters                       196246 non-null  float64       
 7   daily_vaccinations_raw               196246 non-null  float64       
 8   daily_vaccinations                   196246 non-null  float64       
 9   total_vaccinations_per_hundred       196246 non-null  float64       
 

## 3. Exploratory Data Analysis (EDA)

In [8]:
# Daily Vaccinations Trend (Worldwide)
world_df = df[df['location'] == 'World']
fig = px.line(world_df, x='date', y='daily_vaccinations', title='Global Daily Vaccinations Trend')
fig.show()

In [9]:
# Top 10 Countries by Total Vaccinations (Latest Data)
latest_df = df.sort_values('date').groupby('location').tail(1)
# Filter out aggregates
aggregates = ['World', 'Europe', 'Asia', 'Africa', 'North America', 'South America', 
              'European Union', 'High income', 'Low income', 'Lower middle income', 'Upper middle income']
countries_df = latest_df[~latest_df['location'].isin(aggregates)]

top_10 = countries_df.nlargest(10, 'total_vaccinations')
fig = px.bar(top_10, x='location', y='total_vaccinations', title='Top 10 Countries by Total Vaccinations')
fig.show()
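A hand-written list of aggregate names is easy to leave incomplete. As an alternative sketch, the aggregates can be filtered by their `iso_code` instead. This assumes that aggregate rows carry an `OWID_` prefix in `iso_code`, which should be verified against the loaded data first; some non-aggregate entities may carry the same prefix, so it is worth inspecting the dropped rows. The toy values below are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values). The 'OWID_' prefix for aggregate rows is an
# assumption about the dataset; check df['iso_code'] before relying on it.
toy = pd.DataFrame({
    "location": ["World", "World", "France", "France", "Peru"],
    "iso_code": ["OWID_WRL", "OWID_WRL", "FRA", "FRA", "PER"],
    "date": pd.to_datetime(["2021-03-01", "2021-03-02",
                            "2021-03-01", "2021-03-02", "2021-03-01"]),
    "total_vaccinations": [900.0, 1000.0, 40.0, 60.0, 10.0],
})

# Latest row per location: sort by date, then keep the last row of each group.
latest = toy.sort_values("date").groupby("location").tail(1)

# Drop rows whose iso_code carries the assumed aggregate prefix.
countries = latest[~latest["iso_code"].str.startswith("OWID_")]

# Rank the remaining locations by their latest cumulative total.
top = countries.nlargest(2, "total_vaccinations")
print(top[["location", "total_vaccinations"]])
```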

## 4. Export Data
Export to Parquet format for faster loading in the application.

In [13]:
# Requires a Parquet engine such as pyarrow or fastparquet to be installed
df.to_parquet('cleaned_vaccinations.parquet', index=False)