## **Having explored the COVID deaths data, let's now explore the COVID vaccinations data.**

In [1]:
from google.colab import drive

drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [2]:
%cd /content/gdrive/MyDrive/SLIIT/Data_Science/Data Analyst Projects/Sample Projects

/content/gdrive/MyDrive/SLIIT/Data_Science/Data Analyst Projects/Sample Projects


In [3]:
import numpy as np
import pandas as pd

# Import the CSV and parse the 'date' column as datetime (parse_dates=True alone would only try to parse the index)
df = pd.read_csv('owid-covid-data.csv', parse_dates=['date'])
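
## **A small sketch of how `parse_dates` behaves, using a made-up three-line CSV held in memory instead of the real file. The names `csv_text`, `df_default` and `df_parsed` are only for illustration.**

```python
import io

import pandas as pd

# Tiny made-up CSV, only to illustrate the parsing behaviour
csv_text = "location,date,new_cases\nA,2020-01-03,0\nA,2020-01-04,5\n"

# parse_dates=True only tries to parse the index, so 'date' is not converted
df_default = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Naming the column converts it to datetime while loading
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df_default["date"].dtype)
print(df_parsed["date"].dtype)
```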

In [4]:
# Make sure the 'date' column is datetime (converts it if it was read as object)
df['date'] = pd.to_datetime(df['date'])
df['date'].info()

#add year & month columns
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

df.head()

<class 'pandas.core.series.Series'>
RangeIndex: 365398 entries, 0 to 365397
Series name: date
Non-Null Count   Dtype         
--------------   -----         
365398 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 2.8 MB


Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,year,month
0,AFG,Asia,Afghanistan,2020-01-03,,0.0,,,0.0,,...,0.5,64.83,0.511,41128772.0,,,,,2020,1
1,AFG,Asia,Afghanistan,2020-01-04,,0.0,,,0.0,,...,0.5,64.83,0.511,41128772.0,,,,,2020,1
2,AFG,Asia,Afghanistan,2020-01-05,,0.0,,,0.0,,...,0.5,64.83,0.511,41128772.0,,,,,2020,1
3,AFG,Asia,Afghanistan,2020-01-06,,0.0,,,0.0,,...,0.5,64.83,0.511,41128772.0,,,,,2020,1
4,AFG,Asia,Afghanistan,2020-01-07,,0.0,,,0.0,,...,0.5,64.83,0.511,41128772.0,,,,,2020,1


In [5]:
df.dropna(subset='continent', inplace=True) # remove all records where 'continent' is null
df['continent'].isnull().sum() # check whether any null values are left in the 'continent' field

# Divide the dataset into df_deaths and df_vaccinations (.copy() so that columns can be added later without a SettingWithCopyWarning)
df_deaths = df.iloc[:, :25].copy()
df_vaccinations = df.iloc[:, 25:].copy()
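
## **Why drop the rows with a null continent? As far as I know, in the OWID file these are aggregate rows such as "World" or the continents themselves, so keeping them would double count when summing by country. The sketch below uses made-up rows (`rows`, `countries_only` are illustrative names) to show the effect.**

```python
import numpy as np
import pandas as pd

# Made-up rows: two countries plus an aggregate row with no continent
rows = pd.DataFrame({
    "continent": ["Asia", "Asia", np.nan],
    "location": ["Country A", "Country B", "World"],
    "people_vaccinated": [10.0, 20.0, 30.0],
})

# Keep only rows that belong to a continent, i.e. individual countries
countries_only = rows.dropna(subset=["continent"])

# Summing now counts each person once (30 instead of 60)
total = countries_only["people_vaccinated"].sum()
print(total)
```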

## **Since I had already performed the data cleansing and pre-processing in the exploration of COVID deaths, I simply repeated the same steps here. The dataframe is now split into df_deaths and df_vaccinations, and I will explore df_vaccinations.**

## **First of all, we have to add the date, continent and location columns to the df_vaccinations dataframe.**

In [6]:
# Add date, continent and location to the df_vaccinations dataframe
df_vaccinations['date'] = df['date']
df_vaccinations['continent'] = df['continent']
df_vaccinations['location'] = df['location']

df_vaccinations.head()

Unnamed: 0,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,...,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,year,month,date,continent,location
0,,,,,,,,,,,...,41128772.0,,,,,2020,1,2020-01-03,Asia,Afghanistan
1,,,,,,,,,,,...,41128772.0,,,,,2020,1,2020-01-04,Asia,Afghanistan
2,,,,,,,,,,,...,41128772.0,,,,,2020,1,2020-01-05,Asia,Afghanistan
3,,,,,,,,,,,...,41128772.0,,,,,2020,1,2020-01-06,Asia,Afghanistan
4,,,,,,,,,,,...,41128772.0,,,,,2020,1,2020-01-07,Asia,Afghanistan
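
## **Splitting by column position works, but it depends on the column order of the file. A hedged alternative is to select by name and keep the key columns in the same step. The frame `df_demo` and the lists `key_cols` and `vacc_cols` below are made up for illustration.**

```python
import pandas as pd

# Made-up frame standing in for the full dataset
df_demo = pd.DataFrame({
    "location": ["A", "A", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
    "continent": ["Asia", "Asia", "Europe"],
    "new_deaths": [0.0, 1.0, 2.0],
    "people_vaccinated": [10.0, 20.0, 5.0],
})

key_cols = ["date", "continent", "location"]
vacc_cols = ["people_vaccinated"]

# Select by name and copy, so the result is independent of df_demo
df_vacc_demo = df_demo[key_cols + vacc_cols].copy()

# Adding a column to the copy leaves the original frame untouched
df_vacc_demo["year"] = df_vacc_demo["date"].dt.year
print(df_vacc_demo)
```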


## **There are a lot of columns, so let's narrow them down to the ones we need for this exploration. If I need another column later, I will add it here.**

In [7]:
df_vaccinations_narrowed = df_vaccinations[['date', 'year', 'month', 'continent', 'location', 'population', 'total_tests', 'positive_rate', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'median_age', 'aged_65_older', 'aged_70_older', 'diabetes_prevalence']]
df_vaccinations_narrowed

Unnamed: 0,date,year,month,continent,location,population,total_tests,positive_rate,total_vaccinations,people_vaccinated,people_fully_vaccinated,median_age,aged_65_older,aged_70_older,diabetes_prevalence
0,2020-01-03,2020,1,Asia,Afghanistan,41128772.0,,,,,,18.6,2.581,1.337,9.59
1,2020-01-04,2020,1,Asia,Afghanistan,41128772.0,,,,,,18.6,2.581,1.337,9.59
2,2020-01-05,2020,1,Asia,Afghanistan,41128772.0,,,,,,18.6,2.581,1.337,9.59
3,2020-01-06,2020,1,Asia,Afghanistan,41128772.0,,,,,,18.6,2.581,1.337,9.59
4,2020-01-07,2020,1,Asia,Afghanistan,41128772.0,,,,,,18.6,2.581,1.337,9.59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
365393,2023-12-15,2023,12,Africa,Zimbabwe,16320539.0,,,,,,19.6,2.822,1.882,1.82
365394,2023-12-16,2023,12,Africa,Zimbabwe,16320539.0,,,,,,19.6,2.822,1.882,1.82
365395,2023-12-17,2023,12,Africa,Zimbabwe,16320539.0,,,,,,19.6,2.822,1.882,1.82
365396,2023-12-18,2023,12,Africa,Zimbabwe,16320539.0,,,,,,19.6,2.822,1.882,1.82


# **Let's start by evaluating the vaccine rollout progress globally.**

## **First, let's try to understand the vaccination progress by continent. It's a bit challenging because we cannot directly sum people_vaccinated and people_fully_vaccinated continent-wise. I had to use a two-level grouping operation to get the job done, and it worked nicely!**

In [8]:
# First group by continent and location to get the total for each location, then group again by continent to sum those totals.
# 'max' is used at the location level because the counts are cumulative and updated daily, so each location's maximum is its latest reported total (the last rows can be NaN).
df_vaccinations_narrowed[['continent', 'location','people_vaccinated', 'people_fully_vaccinated']].groupby(['continent', 'location']).agg({
    'people_vaccinated': 'max',
    'people_fully_vaccinated': 'max'
}).groupby('continent').agg({
    'people_vaccinated': 'sum',
    'people_fully_vaccinated': 'sum'
}).sort_values('people_fully_vaccinated', ascending= False).reset_index()


Unnamed: 0,continent,people_vaccinated,people_fully_vaccinated
0,Asia,3688363000.0,3461818000.0
1,Europe,577153300.0,544189000.0
2,Africa,556106700.0,463457300.0
3,North America,458546000.0,394459300.0
4,South America,375452300.0,336932800.0
5,Oceania,28959950.0,27965120.0
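
## **To see why the first level uses 'max' and not 'sum', here is a minimal sketch on made-up cumulative counts. The frame `toy` and the names `per_country`, `per_continent` and `naive` are only for illustration.**

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts for two countries in one continent;
# the last report of each country is missing, as often happens in the data
toy = pd.DataFrame({
    "continent": ["X"] * 6,
    "location": ["A", "A", "A", "B", "B", "B"],
    "people_vaccinated": [10.0, 25.0, np.nan, 5.0, 8.0, np.nan],
})

# Level 1: latest cumulative total per country (max skips NaN)
per_country = toy.groupby(["continent", "location"])["people_vaccinated"].max()

# Level 2: add the country totals up per continent
per_continent = per_country.groupby(level="continent").sum()

# Summing the daily rows directly would count the same people repeatedly
naive = toy.groupby("continent")["people_vaccinated"].sum()

print(per_continent["X"], naive["X"])
```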


## **It appears that more people have been fully vaccinated in Asia than on any other continent. But Asia's population is also larger than that of the other continents, so this is to be expected. Let's take the counts as a proportion of population and compare them to get a better understanding of the vaccine rollout.**

In [9]:
#Let's get the output into a dataframe, so we can get the percentage of people fully vaccinated & vaccinated as a percentage of the population of the continent
df_vaccinated_by_continent = df_vaccinations_narrowed[['date','continent', 'location', 'population','people_vaccinated', 'people_fully_vaccinated']].groupby(['continent', 'location']).agg({
    'people_vaccinated': 'max',
    'people_fully_vaccinated': 'max',
    'population' : 'max'
}).groupby('continent').agg({
    'people_vaccinated': 'sum',
    'people_fully_vaccinated': 'sum',
    'population' : 'sum'
}).sort_values('people_fully_vaccinated', ascending= False).reset_index()

In [10]:
df_vaccinated_by_continent.sort_values('population', ascending=False)

Unnamed: 0,continent,people_vaccinated,people_fully_vaccinated,population
0,Asia,3688363000.0,3461818000.0,4721838000.0
2,Africa,556106700.0,463457300.0,1426737000.0
1,Europe,577153300.0,544189000.0,814493300.0
3,North America,458546000.0,394459300.0,600323700.0
4,South America,375452300.0,336932800.0,436816700.0
5,Oceania,28959950.0,27965120.0,45038910.0


In [11]:
# Let's add two columns for the vaccinated and fully vaccinated percentages
df_vaccinated_by_continent['people_vaccinated_percentage'] = (df_vaccinated_by_continent['people_vaccinated']/df_vaccinated_by_continent['population'])*100
df_vaccinated_by_continent['people_fully_vaccinated_percentage'] = (df_vaccinated_by_continent['people_fully_vaccinated']/df_vaccinated_by_continent['population'])*100

df_vaccinated_by_continent.sort_values('people_fully_vaccinated_percentage', ascending=False).to_csv('vaccination_rollout_by_continent.csv')

## **As I expected, the picture changes significantly when the fully vaccinated counts are compared as a percentage of population.**

## **But Asia hasn't performed badly either, securing 2nd place in the list. The significantly lower percentage for the African continent is shocking!**
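
## **As a quick cross-check, the ranking can be recomputed from the rounded totals in the continent table printed above. This is only a sketch: `continent_totals` and `fully_vaccinated_pct` are names I made up here, and because the inputs are the rounded values shown earlier, the percentages are approximate.**

```python
import pandas as pd

# Rounded totals copied from the continent table shown earlier in this notebook
continent_totals = pd.DataFrame({
    "continent": ["Asia", "Africa", "Europe", "North America", "South America", "Oceania"],
    "people_fully_vaccinated": [3461818000.0, 463457300.0, 544189000.0,
                                394459300.0, 336932800.0, 27965120.0],
    "population": [4721838000.0, 1426737000.0, 814493300.0,
                   600323700.0, 436816700.0, 45038910.0],
})

# Fully vaccinated people as a percentage of each continent's population
continent_totals["fully_vaccinated_pct"] = (
    continent_totals["people_fully_vaccinated"] / continent_totals["population"] * 100
).round(1)

ranking = continent_totals.sort_values("fully_vaccinated_pct", ascending=False)
print(ranking[["continent", "fully_vaccinated_pct"]].to_string(index=False))
```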

## **In the same way, we could explore total vaccinations, meaning the total number of vaccine doses used in a continent. But I don't think that would give us any meaningful information. The important thing is how many people are vaccinated or fully vaccinated.**

## **Now let's explore the percentage of people vaccinated and fully vaccinated in each country.**

In [12]:
# Let's get the number of people vaccinated and fully vaccinated in each country
df_vaccinated_by_location = df_vaccinations_narrowed[['location', 'people_vaccinated', 'people_fully_vaccinated', 'population']].groupby('location').agg({'people_vaccinated' : 'max', 'people_fully_vaccinated' : 'max', 'population': 'max'})

# Let's get that as a percentage of the population
df_vaccinated_by_location['people_vaccinated_percentage'] = (df_vaccinated_by_location['people_vaccinated']/df_vaccinated_by_location['population'])*100
df_vaccinated_by_location['people_fully_vaccinated_percentage'] = (df_vaccinated_by_location['people_fully_vaccinated']/df_vaccinated_by_location['population'])*100

df_vaccinated_by_location.sort_values('people_fully_vaccinated_percentage', ascending=False).to_csv('vaccination_by_country.csv')

## **It is interesting that some countries, like the UAE and Qatar, have vaccinated more people than their population. Migrants? I'm not sure.**
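
## **One way to handle this is to list the countries above 100% and, if needed, cap the value for charts. The sketch below uses made-up numbers, not the real figures, and `demo`, `pct`, `over_100` and `pct_capped` are illustrative names. My guess is that the population estimate and the vaccination counts do not cover exactly the same group of people, but I have not verified this.**

```python
import pandas as pd

# Made-up figures purely for illustration (not taken from the dataset)
demo = pd.DataFrame(
    {
        "people_fully_vaccinated": [105.0, 60.0, 99.0],
        "population": [100.0, 100.0, 100.0],
    },
    index=pd.Index(["P", "Q", "R"], name="location"),
)

demo["pct"] = demo["people_fully_vaccinated"] / demo["population"] * 100

# Countries whose reported count exceeds the population estimate
over_100 = demo[demo["pct"] > 100]

# Optional: cap at 100 for plotting, keeping the raw value in 'pct'
demo["pct_capped"] = demo["pct"].clip(upper=100)

print(over_100)
```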