## Libraries

In [3]:
import pandas as pd

## Read in Data and display info about the data

In [9]:
data = pd.read_csv('country_vaccinations.csv')
print(list(data))
data.head()

['country', 'iso_code', 'date', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'daily_vaccinations_raw', 'daily_vaccinations', 'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred', 'people_fully_vaccinated_per_hundred', 'daily_vaccinations_per_million', 'vaccines', 'source_name', 'source_website']


Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
0,Albania,ALB,2021-01-10,0.0,0.0,,,,0.0,0.0,,,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
1,Albania,ALB,2021-01-11,,,,,64.0,,,,22.0,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
2,Albania,ALB,2021-01-12,128.0,128.0,,,64.0,0.0,0.0,,22.0,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
3,Albania,ALB,2021-01-13,188.0,188.0,,60.0,63.0,0.01,0.01,,22.0,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...
4,Albania,ALB,2021-01-14,266.0,266.0,,78.0,66.0,0.01,0.01,,23.0,Pfizer/BioNTech,Ministry of Health,https://shendetesia.gov.al/covid19-ministria-e...


In [12]:
data.dtypes

country                                 object
iso_code                                object
date                                    object
total_vaccinations                     float64
people_vaccinated                      float64
people_fully_vaccinated                float64
daily_vaccinations_raw                 float64
daily_vaccinations                     float64
total_vaccinations_per_hundred         float64
people_vaccinated_per_hundred          float64
people_fully_vaccinated_per_hundred    float64
daily_vaccinations_per_million         float64
vaccines                                object
source_name                             object
source_website                          object
dtype: object

## Data Completeness

The code below shows that 11 of the 15 columns in the dataframe have missing values.

In [14]:
data.isnull().sum()

country                                   0
iso_code                                332
date                                      0
total_vaccinations                     1937
people_vaccinated                      2331
people_fully_vaccinated                3282
daily_vaccinations_raw                 2476
daily_vaccinations                      184
total_vaccinations_per_hundred         1937
people_vaccinated_per_hundred          2331
people_fully_vaccinated_per_hundred    3282
daily_vaccinations_per_million          184
vaccines                                  0
source_name                               6
source_website                            0
dtype: int64
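
Raw counts are easier to judge next to the share of rows they represent. Here is a minimal sketch of that summary, run on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from `country_vaccinations.csv`); the same expression should work on `data`.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'iso_code': ['AAA', 'AAA', None, None],
    'daily_vaccinations': [10.0, None, 5.0, 7.0],
})

# Count and percentage of missing values per column, largest share first.
missing_summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': (toy.isnull().mean() * 100).round(1),
}).sort_values('pct_missing', ascending=False)

print(missing_summary)
```

Sorting by `pct_missing` puts the least complete columns at the top, which helps decide where to start cleaning.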

### Qualitative Variables (ISO_Code & Source_Name) Missing Values

The `iso_code` column, a three-letter abbreviation of the country name, is missing for 332 rows, so let's see which countries are affected.

In [32]:
missing_iso_countries = list(data[data['iso_code'].isna()]['country'].unique())
print(missing_iso_countries)

['England', 'Northern Ireland', 'Scotland', 'Wales']


For the countries with missing values, do any of their rows contain an ISO code?

In [35]:
for country in missing_iso_countries:
    print(country, data[data['country'] == country]['iso_code'].notna().sum())

England 0
Northern Ireland 0
Scotland 0
Wales 0


After some digging on the internet, I found that England, Northern Ireland, Scotland, and Wales together make up the United Kingdom, which has its own rows in the dataset. We therefore need to confirm whether the United Kingdom vaccination numbers equal the sum of the numbers reported separately for England, Northern Ireland, Scotland, and Wales. If they do, we can drop the England/Northern Ireland/Scotland/Wales rows as duplicates. If they don't, we can either adjust the United Kingdom numbers to match the true values, or drop the United Kingdom rows and use the separate regions' numbers.

In [40]:
uk_data = data[data['country'] == 'United Kingdom']
eng_data = data[data['country'] == 'England']
ni_data = data[data['country'] == 'Northern Ireland']
scot_data = data[data['country'] == 'Scotland']
wales_data = data[data['country'] == 'Wales']

In [42]:
uk_data['daily_vaccinations'].sum() 

21898050.0

In [43]:
sum(df['daily_vaccinations'].sum() for df in (eng_data, ni_data, scot_data, wales_data))

21898043.0

The two totals differ by only 7 doses out of roughly 21.9 million, likely because of a few missing values, so we can use the United Kingdom numbers.
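
Matching grand totals could in principle hide offsetting differences on individual days, so a per-date comparison is a stricter check. The sketch below runs on made-up numbers (`toy` is an illustrative assumption, not the real dataset) and lines up the United Kingdom value against the sum of the four nations for each date.

```python
import pandas as pd

# Toy rows standing in for `data`; the numbers are made up for illustration.
toy = pd.DataFrame({
    'country': ['United Kingdom', 'England', 'Scotland', 'Wales', 'Northern Ireland'] * 2,
    'date': ['2021-01-11'] * 5 + ['2021-01-12'] * 5,
    'daily_vaccinations': [100.0, 80.0, 10.0, 6.0, 4.0,
                           120.0, 95.0, 12.0, 7.0, 5.0],
})

nations = ['England', 'Northern Ireland', 'Scotland', 'Wales']

# United Kingdom value per date.
uk_by_date = (toy[toy['country'] == 'United Kingdom']
              .set_index('date')['daily_vaccinations'])

# Sum of the four nations per date (NaN values are skipped by sum()).
nations_by_date = (toy[toy['country'].isin(nations)]
                   .groupby('date')['daily_vaccinations'].sum())

# Align both series on date and compute the gap.
comparison = pd.DataFrame({'uk': uk_by_date, 'nations_sum': nations_by_date})
comparison['difference'] = comparison['uk'] - comparison['nations_sum']

print(comparison)
```

Dates with a non-zero `difference` are the ones worth inspecting for missing values.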

In [59]:
uk_nations = ['England', 'Northern Ireland', 'Scotland', 'Wales']
drop_countries_indices = data[data['country'].isin(uk_nations)].index
data.drop(drop_countries_indices, inplace=True)

I have now resolved the missing values in the `iso_code` column. Now let's look at the other qualitative column with missing values: `source_name`.

In [60]:
data.isnull().sum()

country                                   0
iso_code                                  0
date                                      0
total_vaccinations                     1839
people_vaccinated                      2233
people_fully_vaccinated                3172
daily_vaccinations_raw                 2357
daily_vaccinations                      180
total_vaccinations_per_hundred         1839
people_vaccinated_per_hundred          2233
people_fully_vaccinated_per_hundred    3172
daily_vaccinations_per_million          180
vaccines                                  0
source_name                               6
source_website                            0
dtype: int64

In [63]:
data[data['source_name'].isna()]

Unnamed: 0,country,iso_code,date,total_vaccinations,people_vaccinated,people_fully_vaccinated,daily_vaccinations_raw,daily_vaccinations,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,daily_vaccinations_per_million,vaccines,source_name,source_website
571,Belize,BLZ,2021-02-28,0.0,0.0,,,,0.0,0.0,,,Oxford/AstraZeneca,,https://www.facebook.com/1410454029253008/post...
572,Belize,BLZ,2021-03-01,,,,,208.0,,,,523.0,Oxford/AstraZeneca,,https://www.facebook.com/1410454029253008/post...
573,Belize,BLZ,2021-03-02,417.0,417.0,,,208.0,0.1,0.1,,523.0,Oxford/AstraZeneca,,https://www.facebook.com/1410454029253008/post...
574,Belize,BLZ,2021-03-03,673.0,673.0,,256.0,224.0,0.17,0.17,,563.0,Oxford/AstraZeneca,,https://www.facebook.com/1410454029253008/post...
575,Belize,BLZ,2021-03-04,817.0,817.0,,144.0,204.0,0.21,0.21,,513.0,Oxford/AstraZeneca,,https://www.facebook.com/1410454029253008/post...
576,Belize,BLZ,2021-03-05,996.0,996.0,,179.0,199.0,0.25,0.25,,500.0,Oxford/AstraZeneca,,https://www.facebook.com/1410454029253008/post...


All the missing values for the source name belong to Belize, which reports its statistics on Facebook. However, the Facebook links are dead, so we can't tell which Facebook page they came from. Let's simply set the source to 'Facebook'.

Note: This is a temporary fix. If future data has missing sources that are not Belize Facebook pages, we will need more specific NA fills.

In [75]:
data['source_name'] = data['source_name'].fillna('Facebook')
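
One way to make the fill more specific is to key it on the country, so that an unexpected missing source stays visible instead of being labelled 'Facebook'. This is only a sketch on made-up rows: `toy`, `fallback_sources`, and the country 'Atlantis' are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy rows standing in for `data`; 'Atlantis' is a made-up country.
toy = pd.DataFrame({
    'country': ['Belize', 'Belize', 'Albania', 'Atlantis'],
    'source_name': [None, None, 'Ministry of Health', None],
})

# Hypothetical per-country fallbacks; only Belize is known from this notebook.
fallback_sources = {'Belize': 'Facebook'}

# Map each row's country to its fallback (NaN when no fallback is defined),
# then fill only the missing source names with that row-aligned value.
fallback = toy['country'].map(fallback_sources)
toy['source_name'] = toy['source_name'].fillna(fallback)

print(toy)
```

Rows for countries without an entry in `fallback_sources` keep their missing value, so a later `isnull().sum()` check would still flag them.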

In [76]:
data.isnull().sum()

country                                   0
iso_code                                  0
date                                      0
total_vaccinations                     1839
people_vaccinated                      2233
people_fully_vaccinated                3172
daily_vaccinations_raw                 2357
daily_vaccinations                      180
total_vaccinations_per_hundred         1839
people_vaccinated_per_hundred          2233
people_fully_vaccinated_per_hundred    3172
daily_vaccinations_per_million          180
vaccines                                  0
source_name                               0
source_website                            0
dtype: int64

### Quantitative Variables Missing Values