# Covid-19 Data Aggregation

This notebook is in conjunction with the ongoing research on Covid-19 and human mobility at the CyberGIS Center *Social Media and Viz Team* led by Dr. Su Han and serves the main purpose of preprocessing Covid-19 data by taking the input data of cumulative covid-19 data at county level from New York Times to Metropolitan Statistical Area (MSA) level for further study. This notebook also offers functionality to output global daily covid data and output in js format.

Created at The University of Illinois CyberGIS Center. Created: 12/6/2020. Last updated: 7/31/2021.

## Notebook Outline
- [Data preparation](#Data_preparation)
- [Cumulative to daily](#Cumulative_to_daily)
- [Transpose](#Transpose)
- [Aggregate](#Aggregate)

In [12]:
import sys
import argparse
import os
import pandas as pd
import datetime as dt
from datetime import timedelta
from datetime import datetime
import numpy as np
import json
import geopandas as gpd
from urllib.request import urlopen

<a id = 'Data_preparation'></a>
### Data preparation

In [13]:
# change the parameters below to specify a beginning and an ending date 
startDate = dt.datetime(2020, 2, 14)
endDate = dt.datetime(2020, 4, 14)
inputCovid = './data/Covid-19 cumulative.csv'
inputMetro = './data/metro_county.csv'

# change the parameters below to specify an interval (for example interval = 1
# means daily cases and interval = 7 means weekly cases)
interval = 1

In [14]:
# input cumulative data
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
data = pd.read_csv(url)

# filter out data rows whose dates are not within the range
data["date"] = pd.to_datetime(data["date"])
data = data[data.date <= endDate]
data = data[data.date >= startDate]

data = data.sort_values(by = ['county', 'state'])
copy = data
data.to_csv("./data/daily count.csv", index = False)
data = pd.read_csv("./data/daily count.csv")
copy.to_csv("./data/daily count_copy.csv", index = False)
copy = pd.read_csv("./data/daily count_copy.csv")

In [15]:
# input MSA to county reference table
metro = pd.read_csv(inputMetro)
metro_initial = metro

In [16]:
metro_initial.head()

Unnamed: 0,states_msa_code,states_msa,states_msa_full,geoid_msa,name_msa,statefp10_county,countyfp10,countyns10,geoid10_county,name10_county,countyid
0,1,AL,AL,33860,Montgomery,1,1,161526,1001,Autauga,1001
1,1,AL,AL,19300,Daphne-Fairhope-Foley,1,3,161527,1003,Baldwin,1003
2,1,AL,AL,13820,Birmingham-Hoover,1,7,161529,1007,Bibb,1007
3,1,AL,AL,13820,Birmingham-Hoover,1,9,161530,1009,Blount,1009
4,1,AL,AL,11500,Anniston-Oxford-Jacksonville,1,15,161533,1015,Calhoun,1015


In [17]:
len(metro_initial)

2340

In [18]:
# input US state abbreviations reference table
state_abbr = pd.read_csv('./data/us states abbreviations.csv')

In [19]:
state_abbr.head()

Unnamed: 0,State,Abbreviation
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA


In [20]:
len(state_abbr)

52

<a id = 'Cumulative_to_daily'></a>
### Cumulative to daily
This section is for converting cumulative cases to daily cases. The basic idea is that the daily case for a specific day is the cumulative cases on the day minus the cumulative cases on the day before it.

In [21]:
i = 1
data['fips'] = data['fips'].fillna(-1)
while i < len(data.index):
    if data.at[i, 'county'] == data.at[i - 1, 'county'] and data.at[i, 'state'] == data.at[i - 1, 'state']:
        if (data.at[i, 'cases'] < copy.at[i - 1, 'cases']):
            data.at[i, 'cases'] = copy.at[i - 1, 'cases']
        if (data.at[i, 'deaths'] < copy.at[i - 1, 'deaths']):
            data.at[i, 'deaths'] = copy.at[i - 1, 'deaths']
        data.at[i, 'cases'] = data.at[i, 'cases'] - copy.at[i - 1, 'cases']
        data.at[i, 'deaths'] = data.at[i, 'deaths'] - copy.at[i - 1, 'deaths']
    i += 1
data = data.astype({'fips': int})
data = data.sort_values(by = ['date'])

data.to_csv("./data/daily count.csv", index = False)

data.head()

Unnamed: 0,date,county,state,fips,cases,deaths
13404,2020-02-14,Dane,Wisconsin,55025,1,0.0
45762,2020-02-14,San Diego,California,6073,1,0.0
45823,2020-02-14,San Francisco,California,6075,2,0.0
11901,2020-02-14,Cook,Illinois,17031,2,0.0
32742,2020-02-14,Maricopa,Arizona,4013,1,0.0


In [22]:
len(data)

59008

<a id = 'Aggregate'></a>
### Aggregate
This section is for aggregating

In [23]:
# merge the metro reference table with state reference table
metro_initial = metro_initial.merge(state_abbr, how = 'left', 
                     left_on = 'states_msa', 
                     right_on = 'Abbreviation', indicator = True)
metro_initial['states_msa'] = metro_initial['State']


In [24]:
metro_initial.head()

Unnamed: 0,states_msa_code,states_msa,states_msa_full,geoid_msa,name_msa,statefp10_county,countyfp10,countyns10,geoid10_county,name10_county,countyid,State,Abbreviation,_merge
0,1,Alabama,AL,33860,Montgomery,1,1,161526,1001,Autauga,1001,Alabama,AL,both
1,1,Alabama,AL,19300,Daphne-Fairhope-Foley,1,3,161527,1003,Baldwin,1003,Alabama,AL,both
2,1,Alabama,AL,13820,Birmingham-Hoover,1,7,161529,1007,Bibb,1007,Alabama,AL,both
3,1,Alabama,AL,13820,Birmingham-Hoover,1,9,161530,1009,Blount,1009,Alabama,AL,both
4,1,Alabama,AL,11500,Anniston-Oxford-Jacksonville,1,15,161533,1015,Calhoun,1015,Alabama,AL,both


In [25]:
len(metro_initial)

2340

In [26]:
metro_initial[metro_initial["_merge"] != "both"]

Unnamed: 0,states_msa_code,states_msa,states_msa_full,geoid_msa,name_msa,statefp10_county,countyfp10,countyns10,geoid10_county,name10_county,countyid,State,Abbreviation,_merge
2264,72,,PR,10260,Adjuntas,72,1,1804480,72001,Adjuntas,72001,,,left_only
2265,72,,PR,10380,Aguadilla-Isabela,72,3,1804481,72003,Aguada,72003,,,left_only
2266,72,,PR,10380,Aguadilla-Isabela,72,5,1804482,72005,Aguadilla,72005,,,left_only
2267,72,,PR,41980,San Juan-Carolina-Caguas,72,7,1804483,72007,Aguas Buenas,72007,,,left_only
2268,72,,PR,41980,San Juan-Carolina-Caguas,72,9,1804484,72009,Aibonito,72009,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2335,72,,PR,41980,San Juan-Carolina-Caguas,72,145,1804553,72145,Vega Baja,72145,,,left_only
2336,72,,PR,41980,San Juan-Carolina-Caguas,72,147,1804554,72147,Vieques,72147,,,left_only
2337,72,,PR,38660,Ponce,72,149,1804555,72149,Villalba,72149,,,left_only
2338,72,,PR,41980,San Juan-Carolina-Caguas,72,151,1804556,72151,Yabucoa,72151,,,left_only


In [27]:
metro_initial = metro_initial.drop(columns = ['_merge'])

In [28]:
# merge again with the covid data
merged = metro_initial.merge(data, how = 'right', 
                     left_on = ["states_msa", "name10_county"], 
                     right_on = ["state","county"], indicator = True)
merged = merged[['date', 'county', 'cases', 
                 'deaths', 'name_msa', 'states_msa_code', 'states_msa', 'states_msa_full',
                 'geoid_msa', '_merge']]  


In [29]:
merged.head()

Unnamed: 0,date,county,cases,deaths,name_msa,states_msa_code,states_msa,states_msa_full,geoid_msa,_merge
0,2020-03-24,Autauga,1,0.0,Montgomery,1.0,Alabama,AL,33860.0,both
1,2020-03-25,Autauga,3,0.0,Montgomery,1.0,Alabama,AL,33860.0,both
2,2020-03-26,Autauga,2,0.0,Montgomery,1.0,Alabama,AL,33860.0,both
3,2020-03-27,Autauga,0,0.0,Montgomery,1.0,Alabama,AL,33860.0,both
4,2020-03-28,Autauga,0,0.0,Montgomery,1.0,Alabama,AL,33860.0,both


In [30]:
merged[merged["_merge"] == "right_only"]

Unnamed: 0,date,county,cases,deaths,name_msa,states_msa_code,states_msa,states_msa_full,geoid_msa,_merge
43099,2020-03-01,Unknown,2,0.0,,,,,,right_only
43100,2020-03-02,Unknown,0,0.0,,,,,,right_only
43101,2020-03-03,Unknown,0,0.0,,,,,,right_only
43102,2020-03-04,Unknown,0,0.0,,,,,,right_only
43103,2020-03-05,Unknown,0,0.0,,,,,,right_only
...,...,...,...,...,...,...,...,...,...,...
60295,2020-04-14,Garrard,1,0.0,,,,,,right_only
60296,2020-04-14,Washburn,1,0.0,,,,,,right_only
60297,2020-04-14,Clay,1,0.0,,,,,,right_only
60298,2020-04-14,Johnston,2,0.0,,,,,,right_only


In [31]:
len(merged)

60300

In [32]:
# aggregate for each interval of dates
merged["date"] = pd.to_datetime(merged["date"])
iterate_start = startDate

interval_data = merged
output = pd.DataFrame()

while iterate_start <= endDate:
    iterate_end = iterate_start + timedelta(days = interval)

    eachInterval = interval_data[interval_data.date >= iterate_start]
    eachInterval = eachInterval[eachInterval.date < iterate_end]


    eachInterval = eachInterval.groupby(['name_msa', 'states_msa'])['cases', 'deaths'].sum()
    eachInterval = eachInterval.merge(metro_initial, left_on='name_msa', right_on='name_msa')[['states_msa_code', 'states_msa', 
                                'states_msa_full', "geoid_msa",
                               'name_msa', 'cases', 'deaths']].sort_values(by = 'states_msa_code')
    eachInterval['interval_start'] = iterate_start
    eachInterval = eachInterval.drop_duplicates(subset = ['name_msa', 'states_msa'])

    output = output.append(eachInterval)
    iterate_start = iterate_start + timedelta(days=interval)

output.to_csv("./data/output.csv", index = False)
output.head()


  eachInterval = eachInterval.groupby(['name_msa', 'states_msa'])['cases', 'deaths'].sum()


Unnamed: 0,states_msa_code,states_msa,states_msa_full,geoid_msa,name_msa,cases,deaths,interval_start
63,4,Arizona,AZ,38060,Phoenix-Mesa-Scottsdale,1,0.0,2020-02-14
80,6,California,CA,41940,San Jose-Sunnyvale-Santa Clara,2,0.0,2020-02-14
78,6,California,CA,41860,San Francisco-Oakland-Hayward,2,0.0,2020-02-14
73,6,California,CA,41740,San Diego-Carlsbad,1,0.0,2020-02-14
57,6,California,CA,31080,Los Angeles-Long Beach-Anaheim,2,0.0,2020-02-14


<a id = 'Transpose'></a>
### Transpose
Here we transpose the aggregated output from the previous cell so that each date is its own column and we will have a separate output cases and output deaths data

#### transpose MSA

In [33]:
# initilize some dictionaries to be used for the transposition
geoid_msa = {}
MSA_all_cases = {}
MSA_all_deaths = {}
all_dates = {}

for index, row in output.iterrows():
    MSA_all_cases[row['name_msa'] + '/' + row['states_msa_full']] = []
    MSA_all_deaths[row['name_msa'] + '/' + row['states_msa_full']] = []
    all_dates[row['interval_start']] = 0
    geoid_msa[row['geoid_msa']] = 0

for index, row in output.iterrows():
    MSA_all_cases[row['name_msa'] + '/' + row['states_msa_full']].append(row['cases'])
    MSA_all_deaths[row['name_msa'] + '/' + row['states_msa_full']].append(row['deaths'])

In [34]:
# filling up the dictionaries so that the keys are MSA names and the values
# will be the cases/deaths on each date
MSA_all_cases_list = []
for value in MSA_all_cases.values():
    MSA_all_cases_list.append(value)

MSA_all_deaths_list = []
for value in MSA_all_deaths.values():
    for index, v in enumerate(value):
        value[index] = int(v)
    MSA_all_deaths_list.append(value)

for i in MSA_all_cases_list:
    if (len(i) < len(all_dates)):
        diff = len(all_dates) - len(i)
        for j in range(diff):
            i.insert(j, 0)

for i in MSA_all_deaths_list:
    if (len(i) < len(all_dates)):
        diff = len(all_dates) - len(i)
        for j in range(diff):
            i.insert(j, 0)

In [35]:
# format the dates into strings
dates = list(all_dates.keys())
for i in range(len(dates)):
    dates[i] = str(dates[i])[:-9]

In [36]:
# convert the dictionaries back to lists for dataframe constructions
output_cases = pd.DataFrame(MSA_all_cases_list, columns = dates)
print(output_cases.shape)
print(len(list(geoid_msa.keys())))
#output_cases.insert(0, 'geoid', list(geoid_msa.keys()))
output_cases.insert(0, 'name', list(MSA_all_cases.keys()))

output_deaths = pd.DataFrame(MSA_all_deaths_list, columns = dates)
#output_deaths.insert(0, 'geoid', list(geoid_msa.keys()))
output_deaths.insert(0, 'name', list(MSA_all_deaths.keys()))

(964, 61)
893


In [37]:
# add the state column and remove the state name from the MSA name column
states = []
i = 0
while i < len(output_cases.index):
    state = output_cases.at[i, 'name'].split('/')[1]
    states.append(state)
    i+=1
    
    
i = 0
while i < len(output_cases.index):
    output_cases.at[i, 'name'] = output_cases.at[i, 'name'].split('/')[0]
    i+=1

i = 0
while i < len(output_deaths.index):
    output_deaths.at[i, 'name'] = output_deaths.at[i, 'name'].split('/')[0]
    i+=1

# print(len(states))
output_cases.insert(1, 'state', states)
output_deaths.insert(1, 'state', states)

In [38]:
output_deaths.head()

Unnamed: 0,name,state,2020-02-14,2020-02-15,2020-02-16,2020-02-17,2020-02-18,2020-02-19,2020-02-20,2020-02-21,...,2020-04-05,2020-04-06,2020-04-07,2020-04-08,2020-04-09,2020-04-10,2020-04-11,2020-04-12,2020-04-13,2020-04-14
0,Phoenix-Mesa-Scottsdale,AZ,0,0,0,0,0,0,0,0,...,3,0,4,2,5,2,6,2,4,3
1,San Jose-Sunnyvale-Santa Clara,CA,0,0,0,0,0,0,0,0,...,0,0,4,3,1,2,2,2,7,0
2,San Francisco-Oakland-Hayward,CA,0,0,0,0,0,0,0,0,...,1,5,11,2,2,7,3,3,1,1
3,San Diego-Carlsbad,CA,0,0,0,0,0,0,0,0,...,1,0,12,5,4,4,1,0,2,6
4,Los Angeles-Long Beach-Anaheim,CA,0,0,0,0,0,0,0,0,...,15,15,23,31,25,18,25,32,24,40


In [39]:
output_cases.head()

Unnamed: 0,name,state,2020-02-14,2020-02-15,2020-02-16,2020-02-17,2020-02-18,2020-02-19,2020-02-20,2020-02-21,...,2020-04-05,2020-04-06,2020-04-07,2020-04-08,2020-04-09,2020-04-10,2020-04-11,2020-04-12,2020-04-13,2020-04-14
0,Phoenix-Mesa-Scottsdale,AZ,1,0,0,0,0,0,0,0,...,169,111,75,68,143,53,158,86,65,43
1,San Jose-Sunnyvale-Santa Clara,CA,2,0,0,0,0,0,0,0,...,59,24,62,97,63,43,82,55,48,1
2,San Francisco-Oakland-Hayward,CA,2,0,0,0,0,0,0,0,...,219,94,122,146,130,137,154,72,196,126
3,San Diego-Carlsbad,CA,1,0,0,0,0,0,0,0,...,117,78,50,76,98,68,65,43,43,83
4,Los Angeles-Long Beach-Anaheim,CA,2,0,0,0,0,0,0,0,...,711,468,599,705,488,534,526,375,234,643


In [40]:
output_cases.to_csv("./data/output_cases.csv", index = False)
output_deaths.to_csv("./data/output_deaths.csv", index = False)