# Importing Packages

In [2]:
# packages for data collection and cleaning
import requests
import json
from io import StringIO
from bs4 import BeautifulSoup as bs
from datetime import date, timedelta
import covidcast
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

# packages for mapping
import geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl


# Data Collection


## NYT Mask Wearing Survey Data

- This data comes from a large number of online interviews conducted by the firm Dynata, which collected 250,000 survey responses between July 2 and July 14, 2020, enough data to provide estimates more detailed than the state level. 

- This survey was run only once, and there are no plans for a follow-up study. Each participant was asked the following question: 

<center><strong>How often do you wear a mask in public when you expect to be within six feet of another person?</strong></center>

- Here are the definitions for the column headings:

    - **COUNTYFP**: The county FIPS code.
    - **NEVER**: The estimated share of people in this county who would say never in response to the question: “How often do you wear a mask in public when you expect to be within six feet of another person?”
    - **RARELY**: The estimated share of people in this county who would say rarely
    - **SOMETIMES**: The estimated share of people in this county who would say sometimes
    - **FREQUENTLY**: The estimated share of people in this county who would say frequently
    - **ALWAYS**: The estimated share of people in this county who would say always
    

- In its analysis, the NYT assumed that each response corresponds to the following likelihood of wearing a mask:  
    - ALWAYS : 100%
    - FREQUENTLY : 80%
    - SOMETIMES : 50%
    - RARELY : 20%
    - NEVER : 0%
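
One way to use the weights above is to collapse the five response shares into a single expected mask-wearing rate per county. The sketch below assumes that reading of the weights; the name `expected_mask_rate` and the sample shares are illustrative and do not come from the NYT analysis.

```python
# Weights from the list above: assumed chance that a respondent in each
# category is wearing a mask in a given encounter.
WEIGHTS = {"ALWAYS": 1.0, "FREQUENTLY": 0.8, "SOMETIMES": 0.5, "RARELY": 0.2, "NEVER": 0.0}

def expected_mask_rate(shares):
    """Weighted sum of the five response shares for one county."""
    return sum(WEIGHTS[answer] * shares[answer] for answer in WEIGHTS)

# Illustrative shares for a single hypothetical county (they sum to 1).
county = {"NEVER": 0.05, "RARELY": 0.10, "SOMETIMES": 0.15, "FREQUENTLY": 0.30, "ALWAYS": 0.40}
rate = expected_mask_rate(county)
print(f"{rate:.3f}")
```

The same weighted sum could be applied to every row of a DataFrame with these five columns, for example with `frame[list(WEIGHTS)].mul(pd.Series(WEIGHTS)).sum(axis=1)`.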




In [4]:
# There was no API available here, just the file made available through their Github repo
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/mask-use/mask-use-by-county.csv'
s = requests.get(url).text
nymask = pd.read_csv(StringIO(s))
nymask.COUNTYFP = nymask.COUNTYFP.astype(str)
nymask.COUNTYFP = np.where(nymask['COUNTYFP'].str.len() == 4, '0' + nymask.COUNTYFP, nymask.COUNTYFP)
nymask.columns = ['FIPS', 'NEVER', 'RARELY', 'SOMETIMES', 'FREQUENTLY', 'ALWAYS']
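
The padding step in the cell above matters because a FIPS code read as a number loses its leading zero, so `01001` becomes `1001` and no longer matches other tables. A minimal standalone sketch of the same normalization follows; the helper name `normalize_fips` is hypothetical.

```python
def normalize_fips(code):
    """Return a county FIPS code as a five-character, zero-padded string."""
    return str(code).strip().zfill(5)

print(normalize_fips(1001))     # a code that was read as an integer
print(normalize_fips("36061"))  # already five digits, left unchanged
```

In pandas, the equivalent for a whole column is `.astype(str).str.zfill(5)`. Passing `dtype={'COUNTYFP': str}` to `pd.read_csv` should also keep the leading zeros from the start.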

In [5]:
nymask.describe()

Unnamed: 0,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
count,3142.0,3142.0,3142.0,3142.0,3142.0
mean,0.07994,0.082919,0.121318,0.207725,0.508094
std,0.058538,0.055464,0.058011,0.063571,0.152191
min,0.0,0.0,0.001,0.029,0.115
25%,0.034,0.04,0.079,0.164,0.39325
50%,0.068,0.073,0.115,0.204,0.497
75%,0.113,0.115,0.156,0.247,0.61375
max,0.432,0.384,0.422,0.549,0.889


**Observations:**

- There are 3142 counties in this dataset. Averaged across counties, an estimated 51% of people would say they always wear a mask.
- We do not know how many respondents came from each county. The count above is the number of counties, not the number of respondents.
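
Because each row is a county rather than a respondent, the mean of the county shares gives every county equal weight regardless of its population. The sketch below uses made-up shares and populations to show how far that figure can sit from a population-weighted mean; none of these numbers come from the NYT data.

```python
# Hypothetical counties: (share answering ALWAYS, population). Values are made up.
counties = [(0.30, 5_000), (0.40, 20_000), (0.80, 975_000)]

# Every county counts once, however small it is
unweighted = sum(share for share, _ in counties) / len(counties)

# Every county counts in proportion to its population
total_pop = sum(pop for _, pop in counties)
weighted = sum(share * pop for share, pop in counties) / total_pop

print(f"mean of county shares:    {unweighted:.3f}")
print(f"population-weighted mean: {weighted:.3f}")
```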


In [1]:
%%html
<div class='tableauPlaceholder' id='viz1625813706281' style='position: relative'><noscript><a href='https:&#47;&#47;www.nytimes.com&#47;interactive&#47;2020&#47;07&#47;17&#47;upshot&#47;coronavirus-face-mask-map.html'><img alt='&quot;How often do you wear a mask in public when you expect to be within 6 feet of another person?&quot;A study conducted by Dynata and New York Times with 250,000 respondents in July 2020 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;NY&#47;NYTMask-WearingStudyVisualization&#47;FacebookMaskStudy&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='NYTMask-WearingStudyVisualization&#47;FacebookMaskStudy' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;NY&#47;NYTMask-WearingStudyVisualization&#47;FacebookMaskStudy&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1625813706281');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1366px';vizElement.style.height='795px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1366px';vizElement.style.height='795px';} else { vizElement.style.width='100%';vizElement.style.height='1627px';}                     var scriptElement = document.createElement('script');    
                scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

## Carnegie Mellon Mask Wearing Facebook Survey Data

- In conjunction with Facebook and a survey firm, the Carnegie Mellon study collected 1,220,000 valid responses from respondents across the US.

- The survey asks about symptoms, mask wearing, testing, and other topics related to COVID-19, along with demographic details about the respondent. 
    - These demographics include age, gender, race, occupation, and education, allowing us to understand how different groups have been affected and which groups are currently most vulnerable to COVID-19. 
    
- The two relevant questions in relation to face masks are the following:
    1. In the past 5 days, did you wear a mask most or all of the time in public?
    2. In the past 7 days, when you were in public places where social distancing is not possible, did most or all other people wear masks?

- The four signals available for data collection through Covidcast are:
    1. `smoothed_wearing_mask`
    2. `smoothed_others_masked`
    3. `smoothed_wwearing_mask`
    4. `smoothed_wothers_masked`

- The smoothed versions of all the `fb-survey` signals (with the `smoothed_` prefix) are calculated using seven-day pooling.
- The weighted versions (with the `smoothed_w` prefix) adjust the data to be representative of the US population, adjusting for both:

    1. the differences between the US population and US Facebook users (according to a state-by-age-gender stratification of the US population from the 2018 Census March Supplement)
    2. the propensity of a Facebook user to take the survey in the first place
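
To make "seven-day pooling" concrete, the sketch below combines all responses from a trailing seven-day window before computing a percentage, rather than averaging seven daily percentages. This is a simplified illustration under that assumption only: the function name `pooled_percent` and the daily counts are hypothetical, and the weighting and other adjustments in the actual Delphi pipeline are not reproduced.

```python
def pooled_percent(daily_yes, daily_total, window=7):
    """Percent answering yes, pooling all responses in each trailing window."""
    pooled = []
    for end in range(window - 1, len(daily_total)):
        start = end - window + 1
        yes = sum(daily_yes[start:end + 1])
        total = sum(daily_total[start:end + 1])
        pooled.append(100 * yes / total)
    return pooled

# Hypothetical daily counts for one county (not real survey data).
yes = [40, 45, 50, 30, 60, 55, 70, 80]
total = [50, 50, 60, 40, 70, 60, 80, 90]
print([round(p, 2) for p in pooled_percent(yes, total)])
```

Pooling the raw responses means days with more respondents carry more weight in each estimate, which is part of why the smoothed values are more stable for small counties.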
    
**Dashboard website:** https://delphi.cmu.edu/covidcast/?date=20210102&region=42003


**First question link:** https://delphi.cmu.edu/covidcast/survey-results/?date=20210102&region=42003#self-reported-mask-use


**Second question link:** https://delphi.cmu.edu/covidcast/survey-results/?date=20210102&region=42003#other-people-wearing-masks


- Data for the first question starts on September 8, 2020, while data for the second question starts on November 24, 2020.

In [7]:
# create variable to hold date object for two days ago
two_days_ago = date.today() - timedelta(days=2)

# get the most recent survey data
mask_ind = covidcast.signal("fb-survey", "smoothed_wearing_mask", date(2020, 9, 8), two_days_ago, "county")
mask_oth = covidcast.signal('fb-survey', 'smoothed_others_masked', date(2020, 11, 24),two_days_ago, "county")

In [8]:
mask_ind.head()

Unnamed: 0,geo_value,signal,time_value,issue,lag,value,stderr,sample_size,geo_type,data_source
0,1000,smoothed_wearing_mask,2020-09-08,2020-12-09,92,87.874513,1.706219,366.0096,county,fb-survey
1,2000,smoothed_wearing_mask,2020-09-08,2020-12-09,92,78.712871,4.093374,100.0,county,fb-survey
2,4000,smoothed_wearing_mask,2020-09-08,2020-12-09,92,75.423725,3.754089,131.527,county,fb-survey
3,4013,smoothed_wearing_mask,2020-09-08,2020-12-09,92,90.151765,1.594534,349.1927,county,fb-survey
4,4019,smoothed_wearing_mask,2020-09-08,2020-12-09,92,88.463022,2.866823,124.1801,county,fb-survey


In [9]:
mask_ind.describe()

Unnamed: 0,lag,value,stderr,sample_size
count,67697.0,67697.0,67697.0,67697.0
mean,34.894397,89.673346,1.863504,406.89027
std,29.200339,6.247194,0.883028,465.783993
min,0.0,51.179474,0.251097,100.0
25%,5.0,86.347615,1.145676,148.0359
50%,27.0,91.14055,1.726723,240.76
75%,62.0,94.381144,2.464686,467.3302
max,92.0,99.677919,4.965824,4677.9669


In [10]:
mask_oth.head()

Unnamed: 0,geo_value,signal,time_value,issue,lag,value,stderr,sample_size,geo_type,data_source
0,1000,smoothed_others_masked,2020-11-24,2020-12-09,15,63.732745,2.362805,414.0201,county,fb-survey
1,4000,smoothed_others_masked,2020-11-24,2020-12-09,15,69.457396,3.936373,136.909,county,fb-survey
2,4013,smoothed_others_masked,2020-11-24,2020-12-09,15,82.923867,2.05172,336.382,county,fb-survey
3,4019,smoothed_others_masked,2020-11-24,2020-12-09,15,84.678313,2.96486,147.5946,county,fb-survey
4,5000,smoothed_others_masked,2020-11-24,2020-12-09,15,64.576451,2.575665,344.816,county,fb-survey


In [11]:
mask_oth.describe()

Unnamed: 0,lag,value,stderr,sample_size
count,25061.0,25061.0,25061.0,25061.0
mean,5.478752,81.796308,2.287603,410.989417
std,2.16942,12.298306,1.046096,475.412758
min,0.0,24.554959,0.418443,100.0
25%,5.0,75.53885,1.423795,149.0
50%,5.0,85.466091,2.10683,238.2069
75%,5.0,91.157208,3.08561,475.8048
max,15.0,98.75,4.992825,4377.1022


In [13]:
# remove data that isn't at the county level since they lumped data into **000 where responses were too few
## FIPS code is state code + county code, 2 and 3 digit codes
# .copy() gives independent frames, so the in-place edits below do not raise SettingWithCopyWarning
mask_ind = mask_ind.loc[~mask_ind.geo_value.str.endswith('000')].copy()
mask_oth = mask_oth.loc[~mask_oth.geo_value.str.endswith('000')].copy()

# check number of counties represented 
print(mask_ind.geo_value.value_counts().shape)
print(mask_oth.geo_value.value_counts().shape)

# change four digit FIPS code to all five digit FIPS code
mask_ind.rename(columns = {'geo_value': 'FIPS'}, inplace=True)
mask_oth.rename(columns = {'geo_value': 'FIPS'}, inplace=True)
mask_ind.FIPS = mask_ind.FIPS.apply(lambda x: x.zfill(5))
mask_oth.FIPS = mask_oth.FIPS.apply(lambda x: x.zfill(5))

(687,)
(659,)
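
The filter in the cell above relies on the structure of a county FIPS code: two digits for the state followed by three for the county, with a county part of `000` marking a pooled remainder rather than a real county. A small standalone sketch of that check follows; the helper names `split_fips` and `is_pooled_remainder` are illustrative.

```python
def split_fips(code):
    """Split a county FIPS code into (state, county) parts after zero-padding."""
    padded = str(code).zfill(5)
    return padded[:2], padded[2:]

def is_pooled_remainder(code):
    """True when the county part is 000, i.e. not a single real county."""
    return split_fips(code)[1] == "000"

for code in ["1000", "4013", "42003"]:
    print(code, split_fips(code), is_pooled_remainder(code))
```

Checking the county part after padding gives the same answer as `str.endswith('000')` on the unpadded codes, since padding only adds characters on the left.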


In [20]:
# find means of values for each county to concatenate with NYT dataset
mask_ind_means = pd.DataFrame(mask_ind.groupby(['FIPS'])['value'].mean()).reset_index()
mask_ind_means.columns = ['FIPS', 'ind_mask']
mask_oth_means = pd.DataFrame(mask_oth.groupby(['FIPS'])['value'].mean()).reset_index()
mask_oth_means.columns = ['FIPS', 'oth_mask']

# merge both datasets together
mask_means = mask_ind_means.merge(mask_oth_means, on = 'FIPS', how = 'outer')

<img src='images/delphi_dec.png'>

<center>Data from Delphi COVIDcast. Obtained via the Delphi Epidata API. <a href="https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html">https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html</a></center>

**Observations:**

- The graphic provided by the website gives us a sense of which areas the data points in this dataset represent. They are mostly from more metropolitan areas.
- Sample sizes for a single county and day run as high as about 4,700 (see `sample_size` in the `describe()` output above), but in order to match this dataset up with the one from NYT, we took the mean of the value for each county. Once we performed the grouping, we were left with 687 and 659 counties out of the 3142 counties in the NYT data.
- There is an issue of class imbalance, which we must address later.
- Perhaps we should consider imputation techniques to fill out the dataset.
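
The coverage gap described above can be measured directly when the tables are joined. The sketch below uses the `indicator` option of `DataFrame.merge` on two toy frames standing in for the NYT table and the per-county survey means; the FIPS codes and values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the NYT table and the per-county survey means (values are made up).
nyt = pd.DataFrame({"FIPS": ["01001", "01003", "04013", "42003"]})
survey = pd.DataFrame({"FIPS": ["04013", "42003"], "ind_mask": [90.2, 93.5]})

# indicator=True adds a _merge column recording where each row came from
check = nyt.merge(survey, on="FIPS", how="left", indicator=True)
missing = check.loc[check["_merge"] == "left_only", "FIPS"].tolist()
coverage = (check["_merge"] == "both").mean()

print(missing)
print(f"{coverage:.0%} of counties have survey data")
```

Listing the unmatched counties before imputing anything makes it easier to see whether the gaps are concentrated in particular states or population sizes.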


## 2020 Election Data

In [14]:
# upload election data
df2020 = pd.read_csv('data/2020_Results.csv')
df2020.rename(columns={'county_fips': 'FIPS'}, inplace=True)
df2020.FIPS = df2020.FIPS.astype(str)
df2020.FIPS = df2020.FIPS.apply(lambda x: x.zfill(5))

# upload shapefile for USA counties
shapefile = 'data/USA_Counties.shp'
gdf = gpd.read_file(shapefile)
gdf = gdf[~gdf.STATE_NAME.isin(['Puerto Rico', 'Alaska', 'Hawaii'])]
gdf.FIPS = gdf.FIPS.astype(str)
merged = gdf.merge(df2020, on='FIPS', how='left')

# Create visualization for election data using Geopandas
# fig, ax = plt.subplots(1, figsize=(20,20))
# merged.plot(column='per_gop', cmap='RdBu_r', linewidth=0.8, ax=ax, edgecolor='black')
# ax.axis('off')
# ax.set_title('Percentage By County for 2020 Election', fontsize=18)
# plt.savefig('images/election2020.png')

<img src='images/election2020.png'>

## Johns Hopkins COVID Data Geopandas

- This feature layer, loaded here with geopandas, contains the most up-to-date COVID-19 cases for the US. 
- Data is pulled from the Coronavirus COVID-19 Global Cases by the CSSE at Johns Hopkins University, the Red Cross, the Census American Community Survey, and the Bureau of Labor and Statistics, and aggregated at the US county level. 

**Dashboard Website:**  https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

**API Link:** https://services9.arcgis.com/6Hv9AANartyT7fJW/arcgis/rest/services/USCounties_cases_V1/FeatureServer

In [15]:
# Call API for geopandas file
geo = gpd.read_file('https://opendata.arcgis.com/datasets/4cb598ae041348fb92270f102a6783cb_0.geojson')
print(geo.shape)
geo.head()

(3331, 88)


Unnamed: 0,OBJECTID,Countyname,ST_Name,ST_Abbr,ST_ID,FIPS,FatalityRa,Confirmedb,DeathsbyPo,PCTPOVALL_,Unemployme,Med_HH_Inc,State_Fata,DateChecke,EM_type,EM_date,EM_notes,url,Thumbnail,Confirmed,Deaths,Age_85,Age_80_84,Age_75_79,Age_70_74,Age_65_69,Beds_Licen,Beds_Staff,Beds_ICU,Ventilator,POP_ESTIMA,POVALL_201,Unemployed,Median_Hou,Recovered,Active,State_Conf,State_Deat,State_Reco,State_Test,AgedPop,NewCases,NewDeaths,TotalPop,NonHispWhP,BlackPop,AmIndop,AsianPop,PacIslPop,OtherPop,TwoMorPop,HispPop,Wh_Alone,Bk_Alone,AI_Alone,As_Alone,NH_Alone,SO_Alone,Two_More,Not_Hisp,Age_Less15,Age_15_24,Age_25_34,Age_Over75,Agetotal,NonHisp,Age_35_64,Age_65_74,Day_1,Day_2,Day_3,Day_4,Day_5,Day_6,Day_7,Day_8,Day_9,Day_10,Day_11,Day_12,Day_13,Day_14,NewCasebyP,Inpat_Occ,ICU_Occ,Shape__Area,Shape__Length,geometry
0,1,Autauga,Alabama,AL,1,1001,1.062699,8462.08,89.92644,13.8,3.6,118.9591,1.305141,01/08/2021 11:30:53,Govt Ordered Community Quarantine,4/3/2020 10:40:51 PM,AL Governor issued a Shelter In Place order.,https://bao.arcgis.com/covid-19/jhu/county/010...,https://coronavirus.jhu.edu/static/media/dashb...,4705,50,815,1026,1498,2440,2271,85,55,6,2,55601,7587,942,59338,0,0,389230,5080,0,2009678,8050,60,0,55200,41412,10475,159,568,5,41,1012,1528,42437,10565,159,568,32,409,1030,53672,10842,7192,7064,3339,55200,1528,22052,4711,60.0,99.0,210.0,31.0,37.0,29.0,49.0,26.0,59.0,40.0,36.0,30.0,9.0,48.0,107.911728,96.320346,100.0,2209382000.0,246839.865479,"POLYGON ((-86.41312 32.70739, -86.41219 32.526..."
1,2,Baldwin,Alabama,AL,1,1003,1.151903,6808.95,78.432452,9.8,3.6,115.4508,1.305141,01/08/2021 11:30:53,Govt Ordered Community Quarantine,4/3/2020 10:41:19 PM,AL Governor issued a Shelter In Place order.,https://bao.arcgis.com/covid-19/jhu/county/010...,https://coronavirus.jhu.edu/static/media/dashb...,14845,171,3949,4792,7373,11410,13141,386,362,51,8,218022,21069,3393,57588,0,0,389230,5080,0,2009678,40665,189,0,208107,172768,19529,1398,1668,9,410,2972,9353,179526,19764,1522,1680,9,2034,3572,198754,37621,23497,23326,16114,208107,9353,82998,24551,189.0,216.0,253.0,123.0,109.0,132.0,222.0,209.0,220.0,210.0,137.0,117.0,42.0,145.0,86.688499,74.029401,120.089286,5770469000.0,728445.072448,"MULTIPOLYGON (((-87.78878 31.29877, -87.78849 ..."
2,3,Barbour,Alabama,AL,1,1005,2.168525,6486.88,140.669587,30.9,5.2,68.928,1.305141,01/08/2021 11:30:53,Govt Ordered Community Quarantine,4/3/2020 10:43:21 PM,AL Governor issued a Shelter In Place order.,https://bao.arcgis.com/covid-19/jhu/county/010...,https://coronavirus.jhu.edu/static/media/dashb...,1614,35,422,551,841,1305,1515,74,30,5,2,24881,6788,433,34382,0,0,389230,5080,0,2009678,4634,17,0,25782,11898,12199,63,85,1,86,344,1106,12216,12266,72,96,1,778,353,24676,4517,3092,3675,1814,25782,1106,9864,2820,17.0,22.0,42.0,3.0,2.0,11.0,3.0,22.0,30.0,45.0,11.0,8.0,2.0,6.0,68.325228,56.711409,88.571429,3258643000.0,307285.15451,"POLYGON ((-85.25609 32.13767, -85.25569 32.135..."
3,4,Bibb,Alabama,AL,1,1007,2.423019,8843.75,214.285714,21.8,4.0,92.3478,1.305141,01/08/2021 11:30:53,Govt Ordered Community Quarantine,4/3/2020 10:40:51 PM,AL Governor issued a Shelter In Place order.,https://bao.arcgis.com/covid-19/jhu/county/010...,https://coronavirus.jhu.edu/static/media/dashb...,1981,48,427,488,624,842,1280,35,25,4,1,22400,4400,344,46064,0,0,389230,5080,0,2009678,3661,37,0,22527,16801,4974,8,37,0,0,160,547,17268,5018,8,37,0,9,187,21980,3742,3005,3075,1539,22527,547,9044,2122,37.0,21.0,38.0,3.0,19.0,9.0,20.0,17.0,25.0,30.0,16.0,7.0,14.0,14.0,165.178571,19.047619,,2310715000.0,227886.96384,"POLYGON ((-87.02685 33.24646, -87.02572 33.209..."
4,5,Blount,Alabama,AL,1,1009,1.452491,8570.19,124.481328,13.2,3.5,101.0645,1.305141,01/08/2021 11:30:53,Govt Ordered Community Quarantine,4/3/2020 10:40:51 PM,AL Governor issued a Shelter In Place order.,https://bao.arcgis.com/covid-19/jhu/county/010...,https://coronavirus.jhu.edu/static/media/dashb...,4957,72,866,1459,1776,2802,3330,25,25,6,2,57840,7527,878,50412,0,0,389230,5080,0,2009678,10233,59,5,57645,50232,820,124,198,18,174,818,5261,55054,862,141,198,18,437,935,52384,11112,6906,6786,4101,57645,5261,22608,6132,59.0,49.0,78.0,25.0,17.0,36.0,52.0,57.0,49.0,52.0,18.0,19.0,5.0,36.0,102.005533,76.190476,92.857143,2456058000.0,286306.840721,"POLYGON ((-86.44507 34.24954, -86.40902 34.205..."


The features below are the ones we wanted to keep for our analysis or for feature engineering:

- FatalityRa = County Fatality Rate
- Confirmedb = County Confirmed divided by County Population * 100,000
- DeathsbyPo = County Deaths divided by County Population * 100,000
- Unemployme = Unemployment Rate
- EM_type, EM_date, EM_notes = Could be turned into a column about Emergency Declaration
- Confirmed = County Level Confirmed Cases
- Deaths = County Level Deaths
- Beds_Licen = Number of Licensed Beds 
- Beds_Staff = Number of Staffed Beds 
- Beds_ICU = Number of ICU Beds 
- Ventilator = Average Ventilation Used Per Hospital
- POP_ESTIMA = Total Population
- POVALL_201 = Number of People in Poverty (the rate is in PCTPOVALL_)
- Median_Hou = Median Household Income
- Recovered = Number of Recovered Cases
- Active = Number of Active Cases
- AgedPop = Total Population Aged 65 Plus
- NewCases = New cases since yesterday
- NewDeaths = New deaths since yesterday
- Wh_Alone, Bk_Alone, AI_Alone, As_Alone, NH_Alone, SO_Alone, Two_More, Not_Hisp = Population numbers of each group
- Age_? = Population numbers of age groups
- NewCasebyP = Likely new cases as a percentage (not confirmed)
- ICU_Occ, Inpat_Occ = Percent Occupied of ICUs and Inpatient Beds
- Shape__Area, Shape__Length = for mapping
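
The rate definitions above can be checked against the first row of the `geo.head()` output, which lists `Confirmedb` as 8462.08, `DeathsbyPo` as 89.92644, and `FatalityRa` as 1.062699 for Autauga County. The sketch below assumes that `POP_ESTIMA` is the population used as the denominator and that `FatalityRa` is deaths as a percentage of confirmed cases; both are inferences from the numbers rather than documented facts.

```python
# Values for Autauga County, Alabama, taken from the first row of geo.head() above.
confirmed, deaths, population = 4705, 50, 55601

confirmed_per_100k = confirmed / population * 100_000
deaths_per_100k = deaths / population * 100_000
fatality_rate = deaths / confirmed * 100  # assumed: deaths as a percentage of confirmed cases

print(round(confirmed_per_100k, 2), round(deaths_per_100k, 2), round(fatality_rate, 2))
```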



In [16]:
# remove all the rows with Unassigned and Out of [State] designations for county name, as well as Puerto Rico
geo = geo[~geo.Countyname.str.contains("Out of")]
geo = geo[~geo.Countyname.str.contains("Unassigned")]
geo = geo[~geo.ST_Name.str.contains("Puerto Rico")]
geo.drop(geo.tail(7).index, inplace=True)

In [21]:
# remove Day_1 to Day_14 columns, unclear what the days represent
geo.drop(columns=['Day_1', 'Day_2', 'Day_3', 'Day_4', 'Day_5', 'Day_6', 'Day_7', 'Day_8', 'Day_9', 'Day_10', 'Day_11', 'Day_12', 'Day_13', 'Day_14'], inplace=True)

# remove state level data columns, as well as ID columns and miscellaneous data
geo.drop(columns=['OBJECTID', 'ST_ID', 'Med_HH_Inc', 'State_Fata', 'DateChecke', 'url', 'Thumbnail', 'State_Conf', 'State_Deat', 'State_Reco', 'State_Test'], inplace=True)

# remove raw confirmed and death counts (the per-population versions are kept) and the notes column for Emergency Management
geo.drop(columns=['Confirmed', 'Deaths', 'EM_notes'], inplace=True)

# remove columns because most values were null
geo.drop(columns=['Recovered', 'Active'], inplace=True)

# replace long values with short abbreviations and extract the date from the string
geo.EM_type = geo.EM_type.replace({'Govt Ordered Community Quarantine': 'CQ', 'Govt Directed Social Distancing': 'SD'})
# month and day can be one or two digits (e.g. 4/3/2020); expand=False returns a Series
geo.EM_date = geo.EM_date.str.extract(r'(\d{1,2}/\d{1,2}/\d{4})', expand=False)

# create percentages for races and removed original race columns as well as the other set of racial columns
geo['Wht_Per'] = geo.NonHispWhP / geo.TotalPop
geo['His_Per'] = geo.HispPop / geo.TotalPop
geo['Blk_Per'] = geo.BlackPop / geo.TotalPop
geo['Asn_Per'] = (geo.AsianPop + geo.PacIslPop) / geo.TotalPop
geo['Nat_Per'] = geo.AmIndop / geo.TotalPop

# group ages into larger categories
geo['Yth_Per'] = (geo.Age_Less15 + geo.Age_15_24) / geo.Agetotal
geo['Adt_Per'] = (geo.Age_25_34 + geo.Age_35_64) / geo.Agetotal
geo['Sen_Per'] = (geo.Age_65_74 + geo.Age_Over75) / geo.Agetotal

# created some other features
geo['Aged_Per'] = geo.AgedPop / geo.POP_ESTIMA
geo['Staf_Per'] = geo['Beds_Staff'] / geo['Beds_Licen']
geo['Pov_Per'] = geo.POVALL_201 / geo.POP_ESTIMA
geo['Beds_Per'] = geo.Beds_ICU / geo.POP_ESTIMA
geo['Pop_Dens'] = geo['TotalPop'] / geo['Shape__Area']

# drop columns with original data
geo.drop(columns=['Age_85', 'Age_80_84', 'Age_75_79', 'Age_70_74', 'Age_65_69', 'Agetotal', 'Age_Less15', 'Age_15_24', 'Age_25_34', 'Age_35_64', 'Age_65_74', 'Age_Over75'], inplace=True)
geo.drop(columns=['Wh_Alone', 'Bk_Alone', 'AI_Alone', 'As_Alone', 'NH_Alone', 'SO_Alone', 'Two_More', 'Not_Hisp', 'NonHispWhP', 'BlackPop', 'AmIndop', 'PacIslPop', 'OtherPop', 'TwoMorPop', 'HispPop', 'NonHisp', 'AsianPop'], inplace=True)
geo.drop(columns=['NewCases', 'NewDeaths', 'AgedPop', 'POVALL_201', 'Unemployed'], inplace=True)
geo.drop(columns=['Beds_Licen', 'Beds_Staff', 'Beds_ICU', 'Ventilator'], inplace=True)
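
The `EM_date` values shown in the `geo.head()` output earlier look like `4/3/2020 10:40:51 PM`. As a standalone sketch of pulling the calendar date out of such a string with the standard library, the code below assumes the order is month/day/year, which the source does not state explicitly. The names `DATE_PATTERN` and `extract_date` are hypothetical.

```python
import re
from datetime import datetime

# Month and day may have one or two digits; the year has four.
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

def extract_date(stamp):
    """Return the date part of an EM_date string as a date object, or None if absent."""
    match = DATE_PATTERN.search(stamp)
    return datetime.strptime(match.group(), "%m/%d/%Y").date() if match else None

print(extract_date("4/3/2020 10:40:51 PM"))
print(extract_date("no date recorded"))
```

Allowing one or two digits for both month and day is the important detail, since a pattern that insists on two-digit days would miss a value such as `4/3/2020`.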

### Confirmed Cases by County in the United States Per County Population

<img src='images/county_conf_by_pop.png'>

Visualization: Centers for Civic Impact. Automation Support: Esri Living Atlas team, JHU APL, and JHU Sheridan Libraries. Data sources: Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University; the Red Cross; the Census American Community Survey; the Bureau of Labor and Statistics.

### Total Confirmed Cases by County in the United States since Outbreak 

<img src='images/county_confirmed.png'>

Visualization: Centers for Civic Impact. Automation Support: Esri Living Atlas team, JHU APL, and JHU Sheridan Libraries. Data sources: Coronavirus COVID-19 Global Cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University; the Red Cross; the Census American Community Survey; the Bureau of Labor and Statistics.

# Feature Engineering

Comparing raw numbers skews results toward higher-population areas, and normalizing the data corrects for this. The two images above both show confirmed cases, one in raw numbers and the other normalized.

In our feature engineering and selection process, to ensure that communities with lower populations are not left out or misrepresented, we dropped raw-count columns or converted them into population-adjusted features.

For confirmed cases and deaths, the dataset already had the following columns, so we dropped the latter two, which hold raw counts:
- Confirmedb = County Confirmed divided by County Population * 100,000
- DeathsbyPo = County Deaths divided by County Population * 100,000
- Confirmed = County Level Confirmed Cases
- Deaths = County Level Deaths

Turning the numbers into percentages allows us to look at the racial distributions within each county:
- `geo['Wht_Per'] = geo.NonHispWhP / geo.TotalPop`
- `geo['His_Per'] = geo.HispPop / geo.TotalPop`
- `geo['Blk_Per'] = geo.BlackPop / geo.TotalPop`
- `geo['Asn_Per'] = (geo.AsianPop + geo.PacIslPop) / geo.TotalPop`
- `geo['Nat_Per'] = geo.AmIndop / geo.TotalPop`
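
As a worked example of these share features, the sketch below applies the same formulas to the counts in the first row of the `geo.head()` output (Autauga County). It also shows that the five shares do not sum to 1, because `OtherPop` and `TwoMorPop` are not assigned to any of them.

```python
# Counts for Autauga County, Alabama, from the first row of geo.head() above.
row = {"TotalPop": 55200, "NonHispWhP": 41412, "HispPop": 1528, "BlackPop": 10475,
       "AsianPop": 568, "PacIslPop": 5, "AmIndop": 159, "OtherPop": 41, "TwoMorPop": 1012}

shares = {
    "Wht_Per": row["NonHispWhP"] / row["TotalPop"],
    "His_Per": row["HispPop"] / row["TotalPop"],
    "Blk_Per": row["BlackPop"] / row["TotalPop"],
    "Asn_Per": (row["AsianPop"] + row["PacIslPop"]) / row["TotalPop"],
    "Nat_Per": row["AmIndop"] / row["TotalPop"],
}

# Share of the population covered by the five engineered columns
covered = sum(shares.values())
print({name: round(value, 4) for name, value in shares.items()})
print(round(covered, 4))
```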

Because we are interested in voting behavior, we lump the age groups into broader categories and remove the 5-year groupings:
- `geo['Yth_Per'] = (geo.Age_Less15 + geo.Age_15_24) / geo.Agetotal`
- `geo['Adt_Per'] = (geo.Age_25_34 + geo.Age_35_64) / geo.Agetotal`
- `geo['Sen_Per'] = (geo.Age_65_74 + geo.Age_Over75) / geo.Agetotal`
- `geo['Aged_Per'] = geo.AgedPop / geo.POP_ESTIMA`

A population density column gives an indication of how urban or rural a county is. We later found a dedicated metric for this, the Rural-Urban Continuum Code (RUCC), and incorporated it into the dataset as well:
- `geo['Pop_Dens'] = geo['TotalPop'] / geo['Shape__Area']`
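
The resulting `Pop_Dens` values are very small numbers because the area is large in its native units. The sketch below converts the ratio to people per square kilometer, assuming `Shape__Area` is expressed in square meters of the layer's projection; the source does not state the unit, so treat that as an assumption. The helper name `density_per_sq_km` is illustrative.

```python
def density_per_sq_km(population, area_sq_m):
    """People per square kilometer, given an area in square meters (assumed unit)."""
    return population / (area_sq_m / 1_000_000)

# TotalPop and Shape__Area for the first row of geo.head() above.
autauga = density_per_sq_km(55200, 2209382000.0)
print(round(autauga, 2))
```

Rescaling does not change how counties rank against each other, but it makes the column easier to read in summary tables.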

Lastly, we wanted to capture hospitalization and bed data in some hopefully meaningful way and decided on the following:
- `geo['Staf_Per'] = geo['Beds_Staff'] / geo['Beds_Licen']`
- `geo['Beds_Per'] = geo.Beds_ICU / geo.POP_ESTIMA`
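
Ratios like these need care when the denominator is zero. In the `geof.describe()` output, `Staf_Per` has 2477 non-null values against 3142 counties, and its maximum of 20.88 shows that staffed beds can exceed licensed beds. The sketch below shows one way to guard such a ratio in plain Python; the helper `safe_ratio` is hypothetical and is not what the notebook itself does.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero or missing."""
    if denominator is None or denominator == 0 or math.isnan(denominator):
        return math.nan
    return numerator / denominator

# First row of geo.head(): 55 staffed beds out of 85 licensed
print(round(safe_ratio(55, 85), 3))
# Hypothetical county with no licensed beds
print(safe_ratio(0, 0))
```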


In [22]:
rural_urban = pd.read_excel('data/ru_code.xls')
# take an explicit copy so the FIPS assignment below does not raise SettingWithCopyWarning
fip_rur = rural_urban[['FIPS', 'RUCC_2013']].copy()
# pad to five digits so codes with a leading zero line up with the other FIPS columns
fip_rur['FIPS'] = fip_rur['FIPS'].astype(str).str.zfill(5)
geof = geo.merge(fip_rur, how = 'left', on = 'FIPS')


In [23]:
geof.describe()

Unnamed: 0,FatalityRa,Confirmedb,DeathsbyPo,PCTPOVALL_,Unemployme,POP_ESTIMA,Median_Hou,TotalPop,NewCasebyP,Inpat_Occ,ICU_Occ,Shape__Area,Shape__Length,Wht_Per,His_Per,Blk_Per,Asn_Per,Nat_Per,Yth_Per,Adt_Per,Sen_Per,Aged_Per,Staf_Per,Pov_Per,Beds_Per,Pop_Dens,RUCC_2013
count,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,2423.0,1552.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3142.0,3141.0,2477.0,3141.0,3141.0,3142.0,2825.0
mean,1.752863,7025.453342,124.294096,15.157829,4.132018,104127.1,52777.610439,102769.9,76.861698,53.976173,65.56021,6808623000.0,371488.6,0.764951,0.092623,0.089207,0.014375,0.018288,0.311678,0.504682,0.18364,0.141358,0.874776,0.145583,0.000283,6.210886e-05,5.00531
std,1.108569,2846.660721,90.595127,6.130698,1.503802,333486.3,13909.834889,329907.7,81.358994,34.142614,29.709178,55569680000.0,739740.8,0.201837,0.137899,0.144647,0.030766,0.075532,0.047807,0.032847,0.045859,0.082917,0.459133,0.056476,0.00035,0.0004004684,2.706416
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75.0,-1418.439716,0.0,0.0,8763938.0,15636.78,0.007278,0.0,0.0,0.0,0.0,0.08,0.231646,0.037986,0.0,0.0,0.026183,0.0,2.429692e-09,1.0
25%,1.012532,5227.3825,60.766296,10.8,3.1,10926.5,43679.5,10948.0,35.825191,34.330776,50.33986,1860524000.0,205800.0,0.647585,0.021121,0.00633,0.003158,0.001257,0.284786,0.486504,0.154482,0.110546,0.754026,0.105443,8.3e-05,3.850014e-06,2.0
50%,1.544236,6897.985,106.99877,14.1,3.9,25758.5,50566.5,25736.0,68.932516,54.614412,72.340426,2680092000.0,249354.3,0.839246,0.04101,0.021594,0.006433,0.002771,0.308982,0.505499,0.18038,0.164017,0.938776,0.136557,0.000205,1.044782e-05,6.0
75%,2.288105,8603.9975,164.100587,18.3,4.8,67820.5,58845.5,67209.0,103.590743,72.14168,88.000071,3944540000.0,336767.0,0.926928,0.095586,0.098571,0.013624,0.00675,0.333725,0.523774,0.208046,0.195383,1.0,0.175465,0.000356,2.739642e-05,7.0
max,13.043478,28456.81,842.266462,54.0,19.9,10105520.0,140382.0,10098050.0,1579.586877,1218.285714,200.0,2228678000000.0,17964230.0,1.0,0.990688,0.874123,0.533333,0.909612,0.611674,0.76,0.555963,0.522477,20.88,0.537284,0.004112,0.01586326,9.0


**Observations:**

- There are too many columns to analyze from summary statistics alone, so it is better to examine relationships visually with Seaborn.

# Visualizing Combined Data

In [24]:
# combine all the dataframes together
merged = geof.merge(df2020, on='FIPS', how='left')
merged_2 = merged.merge(nymask, on='FIPS', how='left')
merged_3 = merged_2.merge(mask_means, on='FIPS', how='left')
merged_3['per_dem_cat'] = merged_3['per_dem']
choices = [0,1,2,3]
condition = [(merged_3['per_dem_cat']) <= .25, 
             (merged_3['per_dem_cat'] > .25) & (merged_3['per_dem_cat'] <= .50),
             (merged_3['per_dem_cat'] > .50) & (merged_3['per_dem_cat'] <= .75),
             (merged_3['per_dem_cat']) > .75]
merged_3['per_dem_cat'] = np.select(condition, choices, merged_3['per_dem_cat'])
merged_3.to_file("data/combined.geojson", driver='GeoJSON')
# turn geopandas dataframe into pandas dataframe
df = pd.DataFrame(merged_3)
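
The four-way binning of `per_dem` in the cell above can also be written with `pd.cut`, which takes the bin edges directly. The sketch below is an alternative formulation on made-up vote shares, not the notebook's own method. It uses the same right-closed intervals, and with `labels=False` a missing share stays missing in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical Democratic vote shares, including a county with no election result.
per_dem = pd.Series([0.10, 0.25, 0.40, 0.75, 0.90, np.nan])

# Right-closed bins: (-inf, .25], (.25, .50], (.50, .75], (.75, inf)
edges = [-np.inf, 0.25, 0.50, 0.75, np.inf]
per_dem_cat = pd.cut(per_dem, bins=edges, labels=False)

print(per_dem_cat.tolist())
```

Keeping the edges in one list makes it harder for the conditions to drift out of sync if the cut points are changed later.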

In [25]:
# view column names to select for features
df.columns

Index(['Countyname', 'ST_Name', 'ST_Abbr', 'FIPS', 'FatalityRa', 'Confirmedb',
       'DeathsbyPo', 'PCTPOVALL_', 'Unemployme', 'EM_type', 'EM_date',
       'POP_ESTIMA', 'Median_Hou', 'TotalPop', 'NewCasebyP', 'Inpat_Occ',
       'ICU_Occ', 'Shape__Area', 'Shape__Length', 'geometry', 'Wht_Per',
       'His_Per', 'Blk_Per', 'Asn_Per', 'Nat_Per', 'Yth_Per', 'Adt_Per',
       'Sen_Per', 'Aged_Per', 'Staf_Per', 'Pov_Per', 'Beds_Per', 'Pop_Dens',
       'RUCC_2013', 'state_name', 'county_name', 'votes_gop', 'votes_dem',
       'total_votes', 'diff', 'per_gop', 'per_dem', 'per_point_diff', 'NEVER',
       'RARELY', 'SOMETIMES', 'FREQUENTLY', 'ALWAYS', 'ind_mask', 'oth_mask',
       'per_dem_cat'],
      dtype='object')

In [26]:
# Create scatterplots of features
df.FIPS = df.FIPS.astype(int)
features = df[['FIPS', 'FatalityRa', 'Confirmedb',
       'DeathsbyPo', 'PCTPOVALL_', 'Unemployme', 'POP_ESTIMA',
       'Median_Hou', 'TotalPop', 'NewCasebyP', 'Inpat_Occ', 'ICU_Occ',
       'Wht_Per', 'His_Per', 'Blk_Per', 'Asn_Per', 'Nat_Per', 'Yth_Per', 'Adt_Per',
       'Sen_Per', 'Staf_Per', 'Aged_Per', 'Pov_Per', 'Beds_Per', 'Pop_Dens',
       'RUCC_2013', 'ind_mask', 'oth_mask', 'NEVER', 'RARELY',
       'SOMETIMES', 'FREQUENTLY', 'ALWAYS']]
target = df['per_dem_cat']
feat_col = features.columns

In [27]:
# con = pd.melt(df, id_vars='per_dem_cat', value_vars=feat_col)
# g = sns.FacetGrid(con, col='variable', col_wrap=4, sharex=False, sharey=False, height=4)
# g = g.map(sns.regplot, 'value', 'per_dem_cat', color='dodgerblue')
# g.set_xticklabels(rotation=45)
# plt.savefig('images/regplot.png')

<img src='images/regplot.png'>

In [28]:
# con_2 = pd.melt(df, value_vars = feat_col)
# g = sns.FacetGrid(con_2, col='variable', col_wrap=4, sharex=False, sharey=False, height=4)
# g = g.map(sns.distplot, 'value', color='mediumslateblue')
# g.set_xticklabels(rotation=45)
# plt.savefig('images/distplot.png')

<img src='images/distplot.png'>

**Observations:**

- Among the masking variables, `ALWAYS` and `FREQUENTLY` seem evenly distributed, but there is skew in the others: `SOMETIMES`, `NEVER`, `RARELY`, `ind_mask`, and `oth_mask`.
- Interestingly, the distribution is very left-skewed for the white population share but very right-skewed for the minority population shares.
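
Skew read off a plot can be backed up with a number. The sketch below computes the population (Fisher-Pearson) skewness in plain Python on made-up shares, where a negative value indicates a long left tail and a positive value a long right tail. The function name `skewness` and the sample values are illustrative.

```python
def skewness(values):
    """Population (Fisher-Pearson) skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical county shares: mostly high values with a few low ones, and the mirror image.
mostly_high = [0.95, 0.93, 0.90, 0.88, 0.85, 0.60, 0.30]
mostly_low = [1 - v for v in mostly_high]

print(round(skewness(mostly_high), 2), round(skewness(mostly_low), 2))
```

On a DataFrame, `df[column].skew()` reports a bias-corrected version of the same statistic, so its values will differ slightly from this formula on small samples.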

## Tableau EDA Visualizations

<img src='images/tableau1.png'>
<img src='images/tableau2.png'>
<img src='images/tableau3.png'>
<img src='images/tableau4.png'>