<h1>COVID-19 and Vaccine Mortality - Visualized</h1>

<h3>Summary</h3>

   This project seeks to quantify the mortality rates of COVID-19 and of the Pfizer/BioNTech, Moderna, and Johnson & Johnson vaccines against it. The internet and media are full of conflicting views on the safety of the COVID-19 vaccines, so the VAERS (Vaccine Adverse Event Reporting System) data provided by the CDC (Centers for Disease Control and Prevention) will be used to analyze the mortality of the various COVID-19 vaccines. As of the date of this report, the latest VAERS update was on 10/02/2021; therefore, this report only uses data from the beginning of the COVID-19 pandemic until that date. Since VAERS is limited to the U.S., the project considers only U.S. COVID-19 data.
   
   The goal of the project is to discover and provide evidence of the COVID-19 vaccines' effectiveness in keeping people safe during the pandemic, and to answer the question: "Is one better off getting the vaccine or chancing a COVID-19 infection?" The U.S. public seems to be divided on vaccine/no-vaccine, with many heated arguments in workplaces, schools, and homes across the nation. This division is a potential source of bias, as much for the viewer of the visualizations as for the author. Throughout this report, potential avenues for bias and data shortcomings will be addressed.

<h3>The Data</h3>

Two data sources will be used. Instead of loading them via URL, the URLs are provided and the displayed code references the downloaded .csv files. This is because the CDC only allows humans to interact with the VAERS data, to ensure everyone using the data sees the disclaimer. The reader is encouraged to view the disclaimer to understand the shortcomings of the VAERS data:
    
   VAERS accepts reports of adverse events and reactions that occur following vaccination. Healthcare providers, vaccine manufacturers, and the public can submit reports to VAERS. While very important in monitoring vaccine safety, VAERS reports alone cannot be used to determine if a vaccine caused or contributed to an adverse event or illness. The reports may contain information that is incomplete, inaccurate, coincidental, or unverifiable. Most reports to VAERS are voluntary, which means they are subject to biases. This creates specific limitations on how the data can be used scientifically. Data from VAERS reports should always be interpreted with these limitations in mind.
    
   The strengths of VAERS are that it is national in scope and can quickly provide an early warning of a safety problem with a vaccine. As part of CDC and FDA's multi-system approach to post-licensure vaccine safety monitoring, VAERS is designed to rapidly detect unusual or unexpected patterns of adverse events, also known as "safety signals." If a safety signal is found in VAERS, further studies can be done in safety systems such as the CDC's Vaccine Safety Datalink (VSD) or the Clinical Immunization Safety Assessment (CISA) project. These systems do not have the same limitations as VAERS, and can better assess health risks and possible connections between adverse events and a vaccine.

Key considerations and limitations of VAERS data:

1. Vaccine providers are encouraged to report any clinically significant health problem following vaccination to VAERS, whether or not they believe the vaccine was the cause.
    
2. Reports may include incomplete, inaccurate, coincidental and unverified information.
    
3. The number of reports alone cannot be interpreted or used to reach conclusions about the existence, severity, frequency, or rates of problems associated with vaccines.
    
4. VAERS data are limited to vaccine adverse event reports received between 1990 and the most recent date for which data are available.
    
5. VAERS data do not represent all known safety information for a vaccine and should be interpreted in the context of other scientific information.
    
6. VAERS data available to the public include only the initial report data to VAERS. Updated data which contains data from medical records and corrections reported during follow up are used by the government for analysis. However, for numerous reasons including data consistency, these amended data are not available to the public.
    
Source for VAERS Data: https://wonder.cdc.gov/vaers.html

The COVID-19 data that is unrelated to vaccines comes from the aggregated COVID-19 data published on GitHub.
    
Source for the COVID-19 Data: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports_us/10-02-2021.csv


<h3>Bringing in the Data</h3>

   To begin, the data will be loaded into pandas DataFrame objects. The data will be queried with pandasql and visualized with Altair.

In [1]:
import pandas as pd
import numpy as np
from pandasql import sqldf
import altair as alt

In [2]:
# Open the files
df_data = pd.read_csv('2021VAERSDATA.csv', encoding='iso-8859-1')
df_vax = pd.read_csv('2021VAERSVAX.csv', encoding='iso-8859-1')
df_covid = pd.read_csv('10-02-2021.csv')

# Join vaccine and VAERS data on the common 'VAERS_ID' column
df = pd.merge(df_data,df_vax, on = ['VAERS_ID'])

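
Because a single VAERS report can list more than one vaccine, a join on `VAERS_ID` can repeat a report across several rows. The sketch below uses made-up rows rather than the real files (the toy IDs and values are assumptions for illustration only) to show one way such repeats could be detected before counting anything.

```python
import pandas as pd

# Toy stand-ins for 2021VAERSDATA.csv and 2021VAERSVAX.csv (made-up values)
toy_data = pd.DataFrame({'VAERS_ID': [1, 2, 3], 'DIED': ['Y', 'N', 'Y']})
toy_vax = pd.DataFrame({
    'VAERS_ID': [1, 1, 2, 3],
    'VAX_TYPE': ['COVID19', 'FLU3', 'COVID19', 'COVID19'],
})

toy_merged = pd.merge(toy_data, toy_vax, on='VAERS_ID')

# Reports that appear on more than one row after the join
repeated_ids = toy_merged.loc[toy_merged['VAERS_ID'].duplicated(), 'VAERS_ID'].unique()

# Counting rows versus counting distinct reports can give different totals
deaths_by_rows = int((toy_merged['DIED'] == 'Y').sum())
deaths_by_report = int(toy_merged.loc[toy_merged['DIED'] == 'Y', 'VAERS_ID'].nunique())
print(list(repeated_ids), deaths_by_rows, deaths_by_report)
```

In this toy case, report 1 is counted twice when rows are counted, which is why counting distinct `VAERS_ID` values may be the safer choice.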


<h3>Low-Fidelity Query and Visualization</h3>

In [3]:
# Query for Deaths by COVID Vaccine Manufacturer
pysqldf = lambda q: sqldf(q, globals())
vax_deaths = pysqldf("SELECT VAX_TYPE,VAX_MANU from df WHERE DIED = 'Y';")
deaths_covid_vax = pysqldf("SELECT VAX_MANU FROM vax_deaths WHERE VAX_TYPE = 'COVID19';")
deaths_covid_vax['Deaths'] = deaths_covid_vax.VAX_MANU.map(deaths_covid_vax.VAX_MANU.value_counts())
deaths_covid_vax = deaths_covid_vax.drop_duplicates().reset_index()
print(deaths_covid_vax.head())

   index              VAX_MANU  Deaths
0      0               MODERNA    3592
1      4       PFIZER\BIONTECH    4036
2    202  UNKNOWN MANUFACTURER      33
3   1674               JANSSEN     795
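
The count per manufacturer above is built with `map`, `value_counts`, and `drop_duplicates`. As a rough alternative, the same kind of table could be produced with a single `groupby`. The rows below are made up for illustration and are not taken from the VAERS files.

```python
import pandas as pd

# Made-up rows standing in for the merged VAERS frame
toy = pd.DataFrame({
    'VAX_TYPE': ['COVID19', 'COVID19', 'FLU3', 'COVID19', 'COVID19'],
    'VAX_MANU': ['MODERNA', 'JANSSEN', 'SANOFI PASTEUR', 'MODERNA', 'MODERNA'],
    'DIED': ['Y', 'Y', 'Y', 'N', 'Y'],
})

# Keep only death reports for COVID-19 vaccines, then count rows per manufacturer
covid_deaths_toy = toy[(toy['DIED'] == 'Y') & (toy['VAX_TYPE'] == 'COVID19')]
deaths_by_manu = (
    covid_deaths_toy.groupby('VAX_MANU')
    .size()
    .reset_index(name='Deaths')
)
print(deaths_by_manu)
```

This form also avoids the leftover `index` column that `reset_index()` adds after `drop_duplicates()`.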


In [4]:
# Create visualization
alt.Chart(deaths_covid_vax).mark_bar().encode(x='VAX_MANU',y='Deaths')

In [5]:
# Add in COVID Data
covid_deaths = pysqldf("SELECT SUM(Deaths) from df_covid;")
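# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
covid_row = pd.DataFrame([{'VAX_MANU':'COVID-19','Deaths':covid_deaths['SUM(Deaths)'][0]}])
total_deaths = pd.concat([deaths_covid_vax, covid_row], ignore_index = True)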
print(total_deaths.head())

    index              VAX_MANU  Deaths
0     0.0               MODERNA    3592
1     4.0       PFIZER\BIONTECH    4036
2   202.0  UNKNOWN MANUFACTURER      33
3  1674.0               JANSSEN     795
4     NaN              COVID-19  701451


In [6]:
# Create visualization
alt.Chart(total_deaths).mark_bar().encode(x='VAX_MANU',y='Deaths')

<h3>Initial Feedback - Experimental Evaluation</h3>

Though the visualization does demonstrate the massive difference between the raw number of U.S. COVID-19 deaths and the number of deaths reported to VAERS following the three major vaccines available in the U.S., the low-fidelity concept visualization leaves much to be desired. Three people were recruited for feedback, and the main points of the feedback were:

1. What about percentage (mortality rate)?
2. How does age play a factor?
3. What are the common factors in the deaths?
4. The vaccine has only been out for less than a year. How about COVID-19 deaths and vaccine deaths since the vaccine?
5. Bar chart makes it hard to see the data.

When viewing the results, the participants were initially confused by the metric. The bars for the vaccine-related deaths were large in the first chart, but barely visible in the second. This will need to be corrected in the final visualization.

While some of the points were deemed to be beyond the scope of this project, it was decided to pursue the mortality rate according to age for all vaccines combined and the COVID-19 virus.

<h3>Second Round Design - More features</h3>

In [7]:
# Query for age of vaccine deaths
# Note: unlike the query in In [3], this one does not filter on VAX_TYPE = 'COVID19'
vax_deaths_ages = pysqldf("SELECT AGE_YRS FROM df WHERE DIED = 'Y';")
vax_deaths_ages['Deaths'] = vax_deaths_ages.AGE_YRS.map(vax_deaths_ages.AGE_YRS.value_counts())
vax_deaths_ages = vax_deaths_ages.drop_duplicates().dropna().reset_index()
print(vax_deaths_ages)

     index  AGE_YRS  Deaths
0        0    78.00   206.0
1        1    82.00   241.0
2        2    90.00   196.0
3        4    64.00   141.0
4        5    65.00   168.0
..     ...      ...     ...
103   5209    14.00     1.0
104   5434    13.00     5.0
105   6297     0.50     1.0
106   6426     1.17     4.0
107   8068    11.00     1.0

[108 rows x 3 columns]


To calculate the mortality rate, the number of vaccines administered had to be retrieved. That data turned out to be grouped by age: 12+, 18+, and 65+. The age data from the death reports will therefore be sorted into the corresponding three groups.

New Data Source: https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-Jurisdi/unsk-b7fc

In [8]:
vax_age_groups = pd.DataFrame({'Groups':['12-17','18-64','65+'],'Count':[0,0,0]})
i=0
while i < vax_deaths_ages.shape[0]:
    if int(vax_deaths_ages.iloc[i]['AGE_YRS']) <= 17:
        vax_age_groups.loc[0,'Count'] = vax_age_groups.loc[0,'Count']+vax_deaths_ages.loc[i,'Deaths']
    elif int(vax_deaths_ages.iloc[i]['AGE_YRS']) <= 64:
        vax_age_groups.loc[1,'Count'] = vax_age_groups.loc[1,'Count']+vax_deaths_ages.loc[i,'Deaths']
    else:
        vax_age_groups.loc[2,'Count'] = vax_age_groups.loc[2,'Count']+vax_deaths_ages.loc[i,'Deaths']
    i+=1

print(vax_age_groups.head())

  Groups   Count
0  12-17    95.0
1  18-64  1927.0
2    65+  6004.0
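
The loop above could also be written without explicit indexing. The sketch below uses `pd.cut` on made-up ages and counts (not the VAERS figures) and mirrors the loop's behavior, including truncating fractional ages and placing ages under 12 in the first bucket.

```python
import pandas as pd

# Made-up ages and death counts, one row per distinct age
toy_ages = pd.DataFrame({
    'AGE_YRS': [0.5, 13.0, 17.0, 18.0, 64.0, 65.0, 90.0],
    'Deaths': [1, 5, 2, 3, 10, 20, 30],
})

# Right-closed bins on the integer part of the age: (-inf, 17], (17, 64], (64, inf)
bins = [-float('inf'), 17, 64, float('inf')]
labels = ['12-17', '18-64', '65+']
toy_ages['Groups'] = pd.cut(toy_ages['AGE_YRS'].astype(int), bins=bins, labels=labels)

# Sum the death counts inside each bucket
toy_groups = (
    toy_ages.groupby('Groups', observed=False)['Deaths']
    .sum()
    .reset_index(name='Count')
)
print(toy_groups)
```

Changing the first bin edge from `-inf` to `11` would leave ages under 12 out of the '12-17' bucket, if that is the intended definition.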


In [9]:
# Bring in Vaccination Data
df_vax_admin = pd.read_csv('COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv')
group2 = pysqldf("SELECT SUM(Series_Complete_65plus) from df_vax_admin WHERE Date = '10/02/2021';")
group1 = pysqldf("SELECT SUM(Series_Complete_18plus) from df_vax_admin WHERE Date = '10/02/2021';")
group0 = pysqldf("SELECT SUM(Series_Complete_12plus) from df_vax_admin WHERE Date = '10/02/2021';")

group2_admin = group2['SUM(Series_Complete_65plus)'][0]
group1_admin = group1['SUM(Series_Complete_18plus)'][0] - group2_admin
group0_admin = group0['SUM(Series_Complete_12plus)'][0] - group2_admin - group1_admin

vax_age_groups['Administered'] = [group0_admin, group1_admin, group2_admin]
print(vax_age_groups.head())

  Groups   Count  Administered
0  12-17    95.0      22913560
1  18-64  1927.0     258136419
2    65+  6004.0      92883566


In [10]:
# Calculate Mortality Rate by Age Group
i = 0
mortality = []
while i < vax_age_groups.shape[0]:
    mortality.append(vax_age_groups.loc[i,'Count']/vax_age_groups.loc[i,'Administered']*100)
    i += 1
vax_age_groups['Mortality Rate'] = mortality
print(vax_age_groups.head())

  Groups   Count  Administered  Mortality Rate
0  12-17    95.0      22913560        0.000415
1  18-64  1927.0     258136419        0.000747
2    65+  6004.0      92883566        0.006464
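
The same percentages can be computed without a loop by dividing the columns directly. The sketch below reuses the counts printed above and, as an optional extra that is not part of the original analysis, also expresses the rate per 100,000 people, which may be easier to read than very small percentages.

```python
import pandas as pd

# Counts and denominators copied from the printed table above
rates = pd.DataFrame({
    'Groups': ['12-17', '18-64', '65+'],
    'Count': [95.0, 1927.0, 6004.0],
    'Administered': [22913560, 258136419, 92883566],
})

# Column-wise arithmetic replaces the while loop
rates['Mortality Rate'] = rates['Count'] / rates['Administered'] * 100
rates['Per 100k'] = rates['Count'] / rates['Administered'] * 100_000
print(rates.round(6))
```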


In [11]:
# Create visualization
alt.Chart(vax_age_groups).mark_bar().encode(x='Groups',y='Mortality Rate')

Now that the mortality rate of the vaccines has been categorized by age, the mortality rate of the virus will be calculated for the same age groups. Another data source from the CDC will be retrieved.

New Data Source: https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-by-Sex-and-Age/9bhg-hcku/data

In [12]:
df_covid = pd.read_csv('Provisional_COVID-19_Deaths_by_Sex_and_Age.csv')
df_covid = df_covid.rename(columns={'COVID-19 Deaths':'Covid_Deaths','Age Group':'Age_Group','Start Date':'Start_Date','End Date':'End_Date'})
df_covid = pysqldf("SELECT Age_Group,Covid_Deaths from df_covid WHERE Start_Date = '01/01/2020' AND End_Date = '10/02/2021' AND Sex = 'All Sexes';")

group01 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = 'Under 1 year';")
group02 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '0-17 years';")
group03 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '5-14 years';")
group04 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '1-4 years';")
group0 = int(group01['SUM(Covid_Deaths)'][0])+int(group02['SUM(Covid_Deaths)'][0])+int(group03['SUM(Covid_Deaths)'][0])+int(group04['SUM(Covid_Deaths)'][0])

group11 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '18-29 years';")
group12 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '30-39 years';")
group13 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '40-49 years';")
group14 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '50-64 years';")
group1 = int(group11['SUM(Covid_Deaths)'][0])+int(group12['SUM(Covid_Deaths)'][0])+int(group13['SUM(Covid_Deaths)'][0])+int(group14['SUM(Covid_Deaths)'][0])

group21 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '65-74 years';")
group22 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '75-84 years';")
group23 = pysqldf("SELECT SUM(Covid_Deaths) from df_covid WHERE Age_Group = '85 years and over';")
group2 = int(group21['SUM(Covid_Deaths)'][0])+int(group22['SUM(Covid_Deaths)'][0])+int(group23['SUM(Covid_Deaths)'][0])

covid_age_groups = pd.DataFrame({'Groups':['0-24','18-64','65+'],'Count':[group0//2,group1//2,group2//2]})
print(covid_age_groups.head())

  Groups   Count
0   0-24     678
1  18-64  164735
2    65+  537191


For case data by age, the public surveillance data was pulled from the CDC website.

New Data Source: https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-Profile/xigx-wn5e

In [13]:
cases_20s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data_20s.csv').shape[0]
cases_30s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data30s.csv').shape[0]
cases_40s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data40s.csv').shape[0]
cases_50s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data50s.csv').shape[0]
cases_10s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data10s.csv').shape[0]
cases_60s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data60s.csv').shape[0]
cases_0s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data0s.csv').shape[0]
cases_70s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data70s.csv').shape[0]
cases_80s = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data80s.csv').shape[0]

covid_age_groups['Cases'] = [cases_0s+cases_10s,cases_20s+cases_30s+cases_40s+cases_50s+cases_60s,cases_70s+cases_80s]
print(covid_age_groups.head())

  Groups   Count     Cases
0   0-24     678   5709977
1  18-64  164735  23905567
2    65+  537191   2853228
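
Only the row counts of the case files are used, so the files do not need to be held in memory in full. A minimal sketch, assuming the files are plain CSVs, is to read them in chunks and add up the chunk lengths. The helper name `count_rows` and the toy column names are hypothetical.

```python
import io
import pandas as pd

def count_rows(source, chunksize=100_000):
    """Count data rows in a CSV by reading it in chunks of `chunksize` rows."""
    return sum(len(chunk) for chunk in pd.read_csv(source, chunksize=chunksize))

# Small in-memory CSV standing in for one of the case surveillance files
toy_csv = io.StringIO(
    "age_group,sex\n"
    "20 - 29 Years,Female\n"
    "20 - 29 Years,Male\n"
    "20 - 29 Years,Female\n"
)
toy_count = count_rows(toy_csv, chunksize=2)
print(toy_count)
```

The same helper could be called with a file path in place of the `StringIO` object.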


In [16]:
# Calculate Mortality Rate by Age Group
i = 0
mortality = []
while i < covid_age_groups.shape[0]:
    mortality.append(covid_age_groups.loc[i,'Count']/covid_age_groups.loc[i,'Cases']*100)
    i += 1
covid_age_groups['Mortality Rate'] = mortality
print(covid_age_groups.head())

  Groups   Count     Cases  Mortality Rate
0   0-24     678   5709977        0.011874
1  18-64  164735  23905567        0.689107
2    65+  537191   2853228       18.827482


In [17]:
# Create visualization
alt.Chart(covid_age_groups).mark_bar().encode(x='Groups',y='Mortality Rate')

<h3>New and Improved Visualization</h3>

In [64]:
# New Chart
Age_Group = list(vax_age_groups['Groups'])+list(covid_age_groups['Groups'])
Mortality_Rate = list(vax_age_groups['Mortality Rate'])+list(covid_age_groups['Mortality Rate'])
Cause = ['Vaccine']*3+['Covid']*3
df = pd.DataFrame({'Age Group':Age_Group,'Mortality Rate':Mortality_Rate,'Cause':Cause})
print(df)

alt.Chart(df).mark_circle(size=300).encode(
    alt.X('Age Group',
          axis = alt.Axis(tickCount=6)),
    alt.Y('Mortality Rate',
          scale = alt.Scale(type='log'),
          sort = 'ascending',
          axis = alt.Axis(tickCount=6)),
    alt.Color('Cause',
              scale = alt.Scale(scheme='goldred')),
    tooltip = ['Cause', 'Age Group', 'Mortality Rate']
).interactive().properties(width=250)

  Age Group  Mortality Rate    Cause
0     12-17        0.000415  Vaccine
1     18-64        0.000747  Vaccine
2       65+        0.006464  Vaccine
3      0-24        0.011874    Covid
4     18-64        0.689107    Covid
5       65+       18.827482    Covid
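
To help read the log-scale chart, the gap between the two causes can be expressed as a ratio and as orders of magnitude. The sketch below uses the rates from the table above for the two age groups that share a label. It is only a rough comparison: the two rates use different denominators (people vaccinated versus reported cases), and VAERS reports alone do not establish that a vaccine caused a death.

```python
import math

# Rates (percent) copied from the table above for the two shared age groups
vaccine_rate = {'18-64': 0.000747, '65+': 0.006464}
covid_rate = {'18-64': 0.689107, '65+': 18.827482}

# Ratio of the COVID-19 rate to the vaccine rate in each group
ratios = {group: covid_rate[group] / vaccine_rate[group] for group in vaccine_rate}

for group, ratio in ratios.items():
    # log10 of the ratio is the vertical distance between the dots on a log axis
    print(f"{group}: ratio {ratio:,.0f}x, log10 = {math.log10(ratio):.2f}")
```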


<h3>Final Evaluation</h3>

For the final evaluation, the same three people from the initial feedback were approached and asked to view the visualization. The feedback was better overall. The logarithmic scale, though not entirely intuitive, did help put the two mortality rates in perspective. The following questions were asked, and they are accompanied by the resulting answers:

1. What do you see in this chart? Did you learn anything?
    Person A: "I see that the COVID vaccine is safer than catching COVID!"
    Person B: "I learned that the vaccine does have risk."
    Person C: "I would rather take the vaccine than risk COVID."

2. Did you learn something new from this visualization?
    Person A: "I will reconsider getting the vaccine."
    Person B: "I learned that the COVID vaccine isn't 100% safe."
    Person C: "That COVID is really dangerous for old people."
    
3. What improvements would you like to see to this visualization?
    Person A: "Could you connect the dots to show a line?"
    Person B: "It's difficult for me to see the difference in the colors."
    Person C: "Add lines between the points."
    
From this evaluation, future iterations of the design should focus on more salient data points, especially when highlighting the difference between two datasets. People like to see groups, so adding a line connecting the data points could have made the design more intuitive. Also, logarithmic scales are not intuitive for most people. The COVID-19 mortality rate in ages 65+ is over 18%! That is a massive shock to most people, and it could have been highlighted better.

<h3>Final Thoughts</h3>

Exploring this data was tedious, but really fun. I set out with my own bias that the vaccine is incredible and that everyone should take it, but I did learn that deaths following vaccination have been reported. An accurate conclusion would be that people should consult their family physicians before taking one of the available COVID-19 vaccines. The vaccines were approved for emergency use, and, therefore, they should be thoroughly vetted by an expert who knows the individual's medical history.

Getting other people's feedback early in the process really helped flesh out a final design. If I were to pursue it further and add in the second-round suggestions, I believe the visualization could be very impactful. Perhaps I will pursue that route after the conclusion of this course.

In the future, I will spend more time learning the data and sketching possible tasks and visualizations. I jumped straight into the coding, and that led to a lot of frustration and straying from the task.