# Final Project: Vaccine Hesitancy


### Abby Wolfe

## Problem Statement

### Research question: What factors determine vaccine hesitancy?

+ Lack of information about the vaccine?    
+ Concern about potential side effects?   
+ Belief that COVID-19 is not a large threat?  
+ Distrust in medicine or the government? 
+ Echo chamber effect? Can the people we're surrounded by influence our beliefs?

## Background

### Data information
   
+ COVID-19 Vaccination Data from the CDC  
    - Contains data on vaccination rates by state and date
        
+ Household Pulse Survey from the Census Bureau 
    - Contains survey data from July 21st, 2021
    - Respondents answered questions about demographic information, location, vaccination status, and reasons for not wanting to be vaccinated
    - Contains the outcome variable, a dichotomous indicator of whether the respondent has been vaccinated

## Methods Explored and Considered

### Alternative data source
+ Google Trends data in place of the Census Bureau survey data
    - Lets people look at search trends and history over a period of time and by location
    - Data is downloadable as a .csv file
    - Search query phrasing is not as specific as it could be
    - U.S. results are often limited to certain states based on where there's significant data, so there is lots of missing data!

### Data Wrangling
+ Data exploration requires 2 different datasets that need to be merged
+ After merging the datasets, variables need to be transformed and scaled

## More Methods Considered and Explored

### Statistical Learning
+ Different types of models considered for a modeling pipeline:
    - Ordinary Least Squares regression
    - Logistic regression
    - Gaussian Naive Bayes classification
    - Random Forest model
    - K Nearest Neighbors model
    - Decision Tree model
+ Metrics considered for assessing models during tuning:
    - Mean Squared Error
    - Log-likelihood (from Maximum Likelihood Estimation)
    - ROC curve (area under the curve)
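
As a rough illustration of how such a pipeline might compare the classifiers listed above, here is a minimal sketch using scikit-learn and cross-validated ROC AUC. The data are synthetic stand-ins (not the survey data), and the model settings (`max_depth`, `n_estimators`, 5 folds) are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): 500 rows, 4 binary predictors,
# and a binary outcome that depends on the first two predictors.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4)).astype(float)
logits = 1.5 - 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

candidates = {
    "logit": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated ROC AUC for each candidate model.
auc_by_model = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# Print the candidates from highest to lowest AUC.
for name, auc in sorted(auc_by_model.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {auc:.3f}")
```

Scoring every candidate with the same folds and the same metric keeps the comparison fair. The actual ranking on the survey data could of course differ.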

## Methods and Tools Used

### Data Sources
+ Chose to not use data from Google Trends and used the Census Bureau dataset instead
    - Original concern with Census Bureau data was that the date-level information wasn't specific enough to merge with CDC data
    - During initial research, only one week of the latest phase of the Census Bureau data had been published 
    - Now all of the latest phase of the Census Bureau data has been published with the weeks recorded, which allows for date-level analysis
    - Google Trends data not specific or complete enough for analysis
    - Google Trends data also does not allow for desired unit of analysis (individual vaccination status)

### Data Wrangling
+ The datasets still need to be merged by state and date-level data
+ Some data needs to be transformed because the survey data codes missing values as numbers (-88 and -99)
+ Most of the variables are on a 0-1 scale, but continuous variables and categorical variables with more than two levels need to be scaled
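
One simple way to put the non-dichotomous variables on the same 0-1 scale is min-max scaling. The sketch below uses plain pandas on made-up values. The helper name `min_max_scale` is hypothetical, and it is only one of several reasonable scaling choices.

```python
import pandas as pd

# Toy data (made up): one 0/1 column plus two non-dichotomous columns.
df = pd.DataFrame({
    "RECVDVACC": [1, 0, 1, 0],
    "GETVACRV": [None, 3.0, None, 5.0],
    "Series_Complete_Pop_Pct": [33.9, 33.9, 45.1, 60.0],
})

def min_max_scale(frame, columns):
    """Return a copy with each listed column rescaled to the 0-1 range."""
    scaled = frame.copy()
    for col in columns:
        lo, hi = scaled[col].min(), scaled[col].max()
        # Skip constant columns to avoid dividing by zero.
        if hi > lo:
            scaled[col] = (scaled[col] - lo) / (hi - lo)
    return scaled

scaled_df = min_max_scale(df, ["GETVACRV", "Series_Complete_Pop_Pct"])
print(scaled_df)
```

Missing values stay missing after scaling, and the dichotomous column is left untouched.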

## More Methods and Tools Used
### Statistical Learning
+ Statistical learning component not yet complete
+ While there are no preliminary results yet, I've removed the K Nearest Neighbors model from my list and may also remove the OLS regression
    - Most variables of interest (including the outcome variable) are dichotomous, so the Logit, Naive Bayes, and Decision Tree/Random Forest models should be a better fit
+ The fit criterion of interest will most likely be Maximum Likelihood Estimation, as the outcome variable is dichotomous
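
For a dichotomous outcome, maximum likelihood estimation chooses the parameters that maximize the Bernoulli log-likelihood. The average negative log-likelihood (often called log loss) is one way to compare fitted models on that basis. A minimal sketch with made-up outcomes and predicted probabilities follows; the function name is hypothetical.

```python
import numpy as np

def mean_negative_log_likelihood(y_true, p_pred, eps=1e-12):
    """Average Bernoulli negative log-likelihood (log loss); lower is better."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities so log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Made-up outcomes and predicted probabilities, for illustration only.
y_obs = [1, 0, 1, 1, 0]
confident = [0.9, 0.1, 0.8, 0.7, 0.2]
uninformed = [0.5, 0.5, 0.5, 0.5, 0.5]

nll_confident = mean_negative_log_likelihood(y_obs, confident)
nll_uninformed = mean_negative_log_likelihood(y_obs, uninformed)  # equals ln(2)

print(nll_confident, nll_uninformed)
```

A model that always predicts 0.5 scores ln(2), so a useful model should score below that.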

## Results

While there are no statistical learning results yet, we can show results of the merge and initial data exploration. Here are our initial 2 datasets and the merged dataset.

In [3]:
import pandas as pd
import numpy as np
import missingno as miss
from plotnine import *
import matplotlib.pyplot as plt

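# Load the Household Pulse Survey (HPS) and CDC vaccination data
hps = pd.read_csv("hps_data.csv")
cdc = pd.read_csv("COVID-19_Vaccinations_in_the_United_States_Jurisdiction.csv")
# Keep only the CDC records for the survey date
cdc_new = cdc[cdc["Date"] == '7/21/2021']

# Create a shared `state` column in both datasets to merge on
hps = hps.assign(state = hps.EST_ST)
cdc_new = cdc_new.assign(state = cdc_new.Location)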

states_key = {1:'AL', 2:'AK', 4:'AZ', 5:'AR', 6:'CA', 8:'CO', 9:'CT', 10:'DE', 11:'DC', 12:'FL', 13:'GA', 15:'HI',16:'ID', 
              17:'IL', 18:'IN', 19:'IA', 20:'KS', 21:'KY', 22:'LA', 23:'ME', 24:'MD', 25:'MA', 26:'MI', 27:'MN', 28:'MS',
              29:'MO', 30:'MT', 31:'NE', 32:'NV', 33:'NH', 34:'NJ', 35:'NM', 36:'NY', 37:'NC', 38:'ND', 39:'OH', 40:'OK',
              41:'OR', 42:'PA', 44:'RI', 45:'SC', 46:'SD', 47:'TN', 48:'TX', 49:'UT', 50:'VT', 51:'VA', 53:'WA', 54:'WV',
              55:'WI', 56:'WY'}

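# Map the numeric state codes to postal abbreviations so they match the CDC data.
# Assigning the result back avoids relying on an in-place replace through attribute access.
hps["state"] = hps["state"].replace(states_key)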

cdc_vars = cdc_new[['state','Distributed','Dist_Per_100K', 'Administered', 'Admin_Per_100K', 'Series_Complete_Pop_Pct', 
                    'Administered_Dose1_Pop_Pct']]
combined_df = hps.merge(cdc_vars, how="left", on="state")
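
A quick way to sanity-check a left merge like the one above is to ask pandas to validate the key relationship and to flag survey rows that found no state-level match. The sketch below uses toy stand-ins for the two tables; the values and the `ZZ` state code are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the survey and CDC tables (values are made up).
survey = pd.DataFrame({"state": ["AL", "AL", "AK", "ZZ"], "RECVDVACC": [1, 2, 1, 1]})
rates = pd.DataFrame({"state": ["AL", "AK"], "Series_Complete_Pop_Pct": [33.9, 45.1]})

merged = survey.merge(
    rates,
    how="left",
    on="state",
    validate="m:1",   # raise if a state appears more than once in `rates`
    indicator=True,   # add a `_merge` column recording each row's match status
)

# Survey rows with no matching state-level record.
unmatched = merged[merged["_merge"] == "left_only"]

print(len(merged) == len(survey))   # a many-to-one left merge keeps the row count
print(unmatched["state"].tolist())
```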

In [4]:
combined_df = combined_df.replace(-88, np.nan) # Missing response to question
combined_df = combined_df.replace(-99, np.nan) # Question seen but not answered

combined_df.RECVDVACC = combined_df.RECVDVACC.replace(2, 0)

# Reasons that were not selected (or not asked) are coded as 0
combined_df.WHYNORV1 = combined_df.WHYNORV1.fillna(0) # side effects concern
combined_df.WHYNORV2 = combined_df.WHYNORV2.fillna(0) # unsure of vaccine protection
combined_df.WHYNORV3 = combined_df.WHYNORV3.fillna(0) # doesn't believe vaccine is necessary
combined_df.WHYNORV6 = combined_df.WHYNORV6.fillna(0) # cost concern
combined_df.WHYNORV7 = combined_df.WHYNORV7.fillna(0) # distrust of the vaccine
combined_df.WHYNORV8 = combined_df.WHYNORV8.fillna(0) # distrust of the government
combined_df.WHYNORV9 = combined_df.WHYNORV9.fillna(0) # doesn't see COVID-19 as a threat
combined_df.WHYNORV11 = combined_df.WHYNORV11.fillna(0) # believes that one dose is enough protection

combined_df = combined_df[['Series_Complete_Pop_Pct','Administered_Dose1_Pop_Pct','RECVDVACC','GETVACRV','WHYNORV1','WHYNORV2','WHYNORV3','WHYNORV6','WHYNORV7','WHYNORV8','WHYNORV9','WHYNORV11']]
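
The same recoding steps can also be written once over a list of column names, which may be easier to maintain if more reason variables are added later. This is a sketch on a toy extract with made-up values; `reason_cols` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy survey extract (made-up values) using the -88 / -99 missing codes.
toy = pd.DataFrame({
    "RECVDVACC": [1, 2, 2, -99],
    "WHYNORV1": [-88, 1, -99, -88],
    "WHYNORV7": [-88, -99, 1, -88],
})

reason_cols = ["WHYNORV1", "WHYNORV7"]

# Treat both sentinel codes as missing.
toy = toy.replace([-88, -99], np.nan)

# Recode the outcome from 1 = yes / 2 = no to 1 / 0.
toy["RECVDVACC"] = toy["RECVDVACC"].replace(2, 0)

# A reason that was not selected (or not asked) becomes 0.
toy[reason_cols] = toy[reason_cols].fillna(0)

print(toy)
```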

In [6]:
# This is a snapshot of the CDC data
cdc.head()

|  | Date | MMWR_week | Location | Distributed | Distributed_Janssen | Distributed_Moderna | Distributed_Pfizer | Distributed_Unk_Manuf | Dist_Per_100K | Distributed_Per_100k_12Plus | ... | Additional_Doses_18Plus | Additional_Doses_18Plus_Vax_Pct | Additional_Doses_50Plus | Additional_Doses_50Plus_Vax_Pct | Additional_Doses_65Plus | Additional_Doses_65Plus_Vax_Pct | Additional_Doses_Moderna | Additional_Doses_Pfizer | Additional_Doses_Janssen | Additional_Doses_Unk_Manuf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11/16/2021 | 46 | HI | 2738800 | 111700 | 1025160 | 1601940 | 0 | 193436 | 226140 | ... | 102827.0 | 12.8 | 80786.0 | 18.4 | 58275.0 | 24.6 | 24145.0 | 78252.0 | 570.0 | 2.0 |
| 1 | 11/16/2021 | 46 | IL | 20805235 | 1033300 | 7663400 | 12108535 | 0 | 164185 | 191969 | ... | 1384001.0 | 19.2 | 1066406.0 | 28.7 | 745263.0 | 41.9 | 528502.0 | 844422.0 | 12021.0 | 457.0 |
| 2 | 11/16/2021 | 46 | OH | 17728785 | 833000 | 6991380 | 9904405 | 0 | 151669 | 177275 | ... | 1191580.0 | 20.6 | 977371.0 | 29.1 | 722517.0 | 41.5 | 465407.0 | 710412.0 | 16840.0 | 222.0 |
| 3 | 11/16/2021 | 46 | TX | 47765235 | 2325500 | 17998260 | 27441475 | 0 | 164731 | 198187 | ... | 2236905.0 | 15.5 | 1695704.0 | 24.9 | 1101393.0 | 35.8 | 951389.0 | 1264475.0 | 25356.0 | 68.0 |
| 4 | 11/16/2021 | 46 | BP2 | 285240 | 15600 | 122220 | 147420 | 0 | 0 | 0 | ... | 8686.0 | 7.0 | 3062.0 | 10.9 | 548.0 | 15.4 | 4184.0 | 4386.0 | 96.0 | 20.0 |


In [7]:
# This is a snapshot of the Census Bureau data
hps.head()

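|  | SCRAM | WEEK | EST_ST | EST_MSA | REGION | HWEIGHT | PWEIGHT | TBIRTH_YEAR | ABIRTH_YEAR | RHISPANIC | ... | PSWHYCHG2 | PSWHYCHG3 | PSWHYCHG4 | PSWHYCHG5 | PSWHYCHG6 | PSWHYCHG7 | PSWHYCHG8 | PSWHYCHG9 | INCOME | state |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | V340000001S34010804300113 | 34 | 1 | NaN | 2 | 1548.941305 | 2889.966484 | 1986 | 2 | 1 | ... | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | 7 | AL |
| 1 | V340000001S37010632600113 | 34 | 1 | NaN | 2 | 1080.178856 | 3023.046143 | 1967 | 2 | 2 | ... | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | 4 | AL |
| 2 | V340000001S52011057710113 | 34 | 1 | NaN | 2 | 1542.97903 | 4318.263387 | 1941 | 2 | 1 | ... | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | 4 | AL |
| 3 | V340000001S79010365210123 | 34 | 1 | NaN | 2 | 1111.825305 | 2074.409055 | 1962 | 2 | 1 | ... | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | 8 | AL |
| 4 | V340000002S01021059400123 | 34 | 2 | NaN | 4 | 87.544682 | 173.361259 | 1975 | 2 | 1 | ... | -88 | -88 | -88 | -88 | -88 | -88 | -88 | -88 | 5 | AK |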


In [8]:
# Here is a snapshot of the combined dataframe after transforming the data as necessary and dropping irrelevant variables
combined_df.head()

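|  | Series_Complete_Pop_Pct | Administered_Dose1_Pop_Pct | RECVDVACC | GETVACRV | WHYNORV1 | WHYNORV2 | WHYNORV3 | WHYNORV6 | WHYNORV7 | WHYNORV8 | WHYNORV9 | WHYNORV11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 33.9 | 41.5 | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 33.9 | 41.5 | 0.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 33.9 | 41.5 | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 33.9 | 41.5 | 0.0 | 5.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 45.1 | 50.7 | 1.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |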


## More Results
The variables in the combined dataframe represent all of our variables of interest:
+ RECVDVACC (our outcome variable): Respondent's vaccination status
+ Series_Complete_Pop_Pct: Full vaccination rate of respondent's state on the day they responded to the survey
+ Administered_Dose1_Pop_Pct: Share of the population in the respondent's state with at least one dose (partially or fully vaccinated) on the day they responded to the survey
+ GETVACRV: Respondent's propensity to get the COVID-19 vaccine if they have not already
+ WHYNORV1: Respondent doesn't want the COVID-19 vaccine due to a side effects concern
+ WHYNORV2: Respondent doesn't want the COVID-19 vaccine due to being unsure of vaccine protection
+ WHYNORV3: Respondent doesn't want the COVID-19 vaccine due to a belief that it's unnecessary
+ WHYNORV6: Respondent doesn't want the COVID-19 vaccine due to a cost concern
+ WHYNORV7: Respondent doesn't want the COVID-19 vaccine due to distrust of the vaccine
+ WHYNORV8: Respondent doesn't want the COVID-19 vaccine due to distrust of the government
+ WHYNORV9: Respondent doesn't want the COVID-19 vaccine due to a belief that COVID-19 does not pose a large threat
+ WHYNORV11: Respondent doesn't want the COVID-19 vaccine due to a belief that one dose of the vaccine is enough

Here is how all these variables relate to one another in a correlation matrix:

In [9]:
# Pairwise Pearson correlations, shaded from blue (negative) to red (positive)
corr = combined_df.corr()
corr.style.background_gradient(cmap='coolwarm')

|  | Series_Complete_Pop_Pct | Administered_Dose1_Pop_Pct | RECVDVACC | GETVACRV | WHYNORV1 | WHYNORV2 | WHYNORV3 | WHYNORV6 | WHYNORV7 | WHYNORV8 | WHYNORV9 | WHYNORV11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Series_Complete_Pop_Pct | 1.0 | 0.948558 | 0.120929 | -0.013485 | -0.085339 | -0.05885 | -0.056763 | -0.019153 | -0.0694 | -0.061017 | -0.043485 | 0.004612 |
| Administered_Dose1_Pop_Pct | 0.948558 | 1.0 | 0.125459 | -0.029115 | -0.09038 | -0.059586 | -0.060933 | -0.020803 | -0.07567 | -0.065251 | -0.047542 | 0.004903 |
| RECVDVACC | 0.120929 | 0.125459 | 1.0 | NaN | -0.689274 | -0.419458 | -0.458443 | -0.131383 | -0.585403 | -0.513511 | -0.3894 | 0.006727 |
| GETVACRV | -0.013485 | -0.029115 | NaN | 1.0 | 0.146483 | 0.091243 | 0.31898 | -0.005315 | 0.38067 | 0.320808 | 0.275616 | NaN |
| WHYNORV1 | -0.085339 | -0.09038 | -0.689274 | 0.146483 | 1.0 | 0.504006 | 0.443247 | 0.142914 | 0.609411 | 0.5204 | 0.418081 | 0.034331 |
| WHYNORV2 | -0.05885 | -0.059586 | -0.419458 | 0.091243 | 0.504006 | 1.0 | 0.351801 | 0.167576 | 0.454165 | 0.39844 | 0.325541 | 0.019545 |
| WHYNORV3 | -0.056763 | -0.060933 | -0.458443 | 0.31898 | 0.443247 | 0.351801 | 1.0 | 0.10465 | 0.503519 | 0.477476 | 0.6173 | 0.027811 |
| WHYNORV6 | -0.019153 | -0.020803 | -0.131383 | -0.005315 | 0.142914 | 0.167576 | 0.10465 | 1.0 | 0.113084 | 0.114596 | 0.09853 | 0.016264 |
| WHYNORV7 | -0.0694 | -0.07567 | -0.585403 | 0.38067 | 0.609411 | 0.454165 | 0.503519 | 0.113084 | 1.0 | 0.623026 | 0.473357 | 0.02473 |
| WHYNORV8 | -0.061017 | -0.065251 | -0.513511 | 0.320808 | 0.5204 | 0.39844 | 0.477476 | 0.114596 | 0.623026 | 1.0 | 0.480911 | 0.028695 |


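Based on this matrix, it appears that many of the beliefs driving vaccine hesitancy are positively correlated with one another (for example, 0.62 between distrust of the vaccine and distrust of the government) and negatively correlated with the respondent's own vaccination status (as strongly as -0.69 for side effects concerns). For most of these beliefs, the correlations with state-level vaccination rates are also negative but much weaker (roughly -0.02 to -0.09). Because the WHYNORV variables were filled with 0 wherever a response was missing, part of the negative relationship with RECVDVACC may reflect how the data were coded rather than the beliefs themselves.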
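
The missing cells in the matrix (for example, RECVDVACC against GETVACRV) are likely a consequence of how pandas computes correlations. `DataFrame.corr` uses only the rows where both columns are present, and if one column is constant on those rows the correlation is undefined. A toy illustration with made-up data, assuming GETVACRV is recorded only for unvaccinated respondents:

```python
import numpy as np
import pandas as pd

# Toy data (made up): GETVACRV is missing whenever RECVDVACC is 1.
toy = pd.DataFrame({
    "RECVDVACC": [1, 1, 0, 0, 0, 1],
    "GETVACRV": [np.nan, np.nan, 3.0, 5.0, 4.0, np.nan],
    "WHYNORV1": [0, 0, 1, 1, 0, 0],
})

toy_corr = toy.corr()

# On rows where GETVACRV is present, RECVDVACC is always 0, so it has zero
# variance there and the Pearson correlation cannot be computed.
print(toy_corr.loc["RECVDVACC", "GETVACRV"])

# Both columns are fully observed here, so an ordinary correlation is returned.
print(toy_corr.loc["RECVDVACC", "WHYNORV1"])
```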

## Lessons Learned and Plans to Mitigate Challenges
Lessons:
+ Survey data makes generating research questions easy but presents challenges with transforming data for analysis
+ Merging datasets may complicate analysis if the outcome variable doesn't relate to the merged data

Plans:
+ Trial and error?
+ Be flexible while doing analyses and visualizations