# Covid-19 Data Aggregation

This notebook supports the ongoing research on Covid-19 and human mobility by the CyberGIS Center *Social Media and Viz Team*, led by Dr. Su Han. Its main purpose is to preprocess Covid-19 data: it takes the cumulative county-level Covid-19 counts published by The New York Times, converts them to daily counts, and aggregates them to the Metropolitan Statistical Area (MSA) level for further study. The notebook can also output global daily Covid-19 data and produce output in js format.

Created at The University of Illinois CyberGIS Center. Created: 12/6/2020. Last updated: 7/31/2021.

## Notebook Outline
- [Data preparation](#Data_preparation)
- [Cumulative to daily](#Cumulative_to_daily)
- [Aggregate](#Aggregate)
- [Transpose](#Transpose)

In [12]:
import sys
import argparse
import os
import pandas as pd
import datetime as dt
from datetime import timedelta
from datetime import datetime
import numpy as np
import json
import geopandas as gpd
from urllib.request import urlopen

<a id = 'Data_preparation'></a>
### Data preparation

In [17]:
# change the parameters below to specify a beginning and an ending date 
startdate = dt.datetime(2020, 2, 14)
endDate = dt.datetime(2020, 3, 14)
inputCovid = './data/Covid-19 cumulative.csv'
inputMetro = './data/metro_county.csv'

# change the parameters below to specify an interval (for example interval = 1
# means daily cases and interval = 7 means weekly cases)
interval = 1

In [18]:
# input cumulative data
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
data = pd.read_csv(url)

# filter out data rows whose dates are not within the range
data["date"] = pd.to_datetime(data["date"])
data = data[data.date <= endDate]
data = data[data.date >= startdate]

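# sort so that the rows of each county are contiguous and in date order
data = data.sort_values(by = ['county', 'state', 'date'])
copy = data
# writing to CSV and reading back resets the index to 0..n-1 and makes
# `data` and `copy` two independent DataFrames
data.to_csv("./data/daily count.csv", index = False)
data = pd.read_csv("./data/daily count.csv")
copy.to_csv("./data/daily count_copy.csv", index = False)
copy = pd.read_csv("./data/daily count_copy.csv")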

In [19]:
# input MSA to county reference table
metro = pd.read_csv(inputMetro)
metro_initial = metro

In [20]:
# input US state abbreviations reference table
state_abbr = pd.read_csv('./data/us states abbreviations.csv')

<a id = 'Cumulative_to_daily'></a>
### Cumulative to daily
This section converts cumulative counts to daily counts. The basic idea is that the daily count for a given day is the cumulative count on that day minus the cumulative count on the previous day for the same county.
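
As a minimal sketch of the same idea, assuming a small hand-made table rather than the New York Times file, the subtraction can be written with `groupby(...).diff()`. The helper name `cumulative_to_daily` and the toy values are hypothetical. Negative differences (a cumulative count that was revised downward) are clipped to zero, which mirrors the clamping done in the loop in the next cell.

```python
import pandas as pd

def cumulative_to_daily(df, keys=("county", "state"), cols=("cases", "deaths")):
    """Hypothetical helper: turn cumulative counts into per-day counts."""
    df = df.sort_values([*keys, "date"]).reset_index(drop=True)
    for col in cols:
        # difference with the previous day inside each county
        daily = df.groupby(list(keys))[col].diff()
        # the first day of each county has no previous day: keep its value
        daily = daily.fillna(df[col])
        # a downward revision would give a negative count: clip to zero
        df[col] = daily.clip(lower=0)
    return df

# toy cumulative data (made up for illustration)
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-01", "2020-03-02"],
    "county": ["Cook", "Cook", "Cook", "Orange", "Orange"],
    "state": ["Illinois", "Illinois", "Illinois", "California", "California"],
    "cases": [2, 5, 4, 1, 1],
    "deaths": [0, 1, 1, 0, 0],
})
daily = cumulative_to_daily(toy)
print(daily[["county", "date", "cases", "deaths"]])
```

Grouping by county and state keeps the first row of one county from being subtracted from the last row of another, which is what the explicit county/state comparison in the loop guards against.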

In [21]:
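i = 1
# rows without a FIPS code get -1 so the column can be cast to int later
data['fips'] = data['fips'].fillna(-1)
while i < len(data.index):
    # only subtract when the previous row belongs to the same county and state
    if data.at[i, 'county'] == data.at[i - 1, 'county'] and data.at[i, 'state'] == data.at[i - 1, 'state']:
        # a cumulative count that drops below the previous day is treated as
        # unchanged, so the daily count becomes 0 instead of negative
        if (data.at[i, 'cases'] < copy.at[i - 1, 'cases']):
            data.at[i, 'cases'] = copy.at[i - 1, 'cases']
        if (data.at[i, 'deaths'] < copy.at[i - 1, 'deaths']):
            data.at[i, 'deaths'] = copy.at[i - 1, 'deaths']
        # daily = cumulative today - cumulative yesterday (from the untouched copy)
        data.at[i, 'cases'] = data.at[i, 'cases'] - copy.at[i - 1, 'cases']
        data.at[i, 'deaths'] = data.at[i, 'deaths'] - copy.at[i - 1, 'deaths']
    i += 1
data = data.astype({'fips': int})
data = data.sort_values(by = ['date'])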

data.to_csv("./data/daily count.csv", index = False)

data.head()

Unnamed: 0,date,county,state,fips,cases,deaths
1727,2020-02-14,Snohomish,Washington,53061,1,0.0
990,2020-02-14,Los Angeles,California,6037,1,0.0
339,2020-02-14,Cook,Illinois,17031,2,0.0
1310,2020-02-14,Orange,California,6059,1,0.0
1863,2020-02-14,Suffolk,Massachusetts,25025,1,0.0


<a id = 'Aggregate'></a>
### Aggregate
This section aggregates the daily county-level counts to the MSA level, summing cases and deaths over each interval of dates.
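
The interval aggregation can also be sketched without an explicit loop, by assigning every row to the start date of its interval and grouping once. This is only an illustration under stated assumptions: the helper name `aggregate_by_interval` and the toy table are hypothetical, and all rows are assumed to fall on or after `start`.

```python
import datetime as dt
import pandas as pd

def aggregate_by_interval(daily, start, interval_days, group_col="name_msa"):
    """Hypothetical helper: sum daily counts per group and per interval."""
    out = daily.copy()
    out["date"] = pd.to_datetime(out["date"])
    # number of whole intervals elapsed since `start` for every row
    offset = (out["date"] - pd.Timestamp(start)).dt.days // interval_days
    # first day of the interval each row belongs to
    out["interval_start"] = pd.Timestamp(start) + pd.to_timedelta(offset * interval_days, unit="D")
    return (out.groupby([group_col, "interval_start"], as_index=False)[["cases", "deaths"]]
               .sum())

# toy daily data (made up for illustration)
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-08"],
    "name_msa": ["A", "A", "A", "A"],
    "cases": [1, 2, 3, 4],
    "deaths": [0, 0, 1, 0],
})
weekly = aggregate_by_interval(toy, dt.datetime(2020, 3, 1), interval_days=7)
print(weekly)
```

With `interval_days=7` the first three rows fall into the interval starting 2020-03-01 and the last row into the one starting 2020-03-08.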

In [137]:
# merge the metro reference table with state reference table
metro = metro.merge(state_abbr, how ='inner', 
                     left_on = 'states_msa', 
                     right_on = 'Abbreviation')
metro['states_msa'] = metro['State']


In [138]:
# merge again with the covid data
merged = metro.merge(data, how='inner', 
                     left_on=["states_msa", "name10_county"], 
                     right_on=["state","county"])
merged = merged[['date', 'county', 'cases', 
                 'deaths', 'name_msa', 'states_msa_code', 'states_msa', 'states_msa_full',
                 'geoid_msa']]  


In [139]:
# aggregate for each interval of dates
merged["date"] = pd.to_datetime(merged["date"])
iterate_start = startdate

interval_data = merged
frames = []  # one aggregated DataFrame per interval

while iterate_start <= endDate:
    iterate_end = iterate_start + timedelta(days = interval)

    # keep the rows with iterate_start <= date < iterate_end
    eachInterval = interval_data[interval_data.date >= iterate_start]
    eachInterval = eachInterval[eachInterval.date < iterate_end]

    # sum cases and deaths per MSA, then re-attach the MSA attributes
    eachInterval = eachInterval.groupby(['name_msa'])[['cases', 'deaths']].sum()
    eachInterval = eachInterval.merge(metro_initial, left_on='name_msa', right_on='name_msa')[['states_msa_code', 'states_msa', 
                                'states_msa_full', "geoid_msa",
                               'name_msa', 'cases', 'deaths']].sort_values(by = 'states_msa_code')
    eachInterval['interval_start'] = iterate_start
    eachInterval = eachInterval.drop_duplicates(subset=['name_msa'])

    frames.append(eachInterval)
    iterate_start = iterate_start + timedelta(days=interval)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
output = pd.concat(frames) if frames else pd.DataFrame()

output.to_csv("./data/output.csv", index = False)
output.head()




Unnamed: 0,states_msa_code,states_msa,states_msa_full,geoid_msa,name_msa,cases,deaths,interval_start
63,4,AZ,AZ,38060,Phoenix-Mesa-Scottsdale,1,0.0,2020-02-14
80,6,CA,CA,41940,San Jose-Sunnyvale-Santa Clara,2,0.0,2020-02-14
78,6,CA,CA,41860,San Francisco-Oakland-Hayward,2,0.0,2020-02-14
73,6,CA,CA,41740,San Diego-Carlsbad,1,0.0,2020-02-14
57,6,CA,CA,31080,Los Angeles-Long Beach-Anaheim,2,0.0,2020-02-14


<a id = 'Transpose'></a>
### Transpose
Here we transpose the aggregated output from the previous section so that each MSA is one row and each date is its own column. Cases and deaths are written to two separate output tables.
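
As a compact illustration of the target layout, the same long-to-wide reshape can be sketched with `pivot_table`. The toy table below is made up; its column names follow the aggregated output, and missing MSA/date combinations are filled with 0.

```python
import pandas as pd

# toy aggregated output in long form (made up for illustration)
long = pd.DataFrame({
    "geoid_msa": [1, 1, 2],
    "name_msa": ["MSA A", "MSA A", "MSA B"],
    "interval_start": pd.to_datetime(["2020-02-14", "2020-02-15", "2020-02-15"]),
    "cases": [1, 0, 2],
})

# one row per MSA, one column per date; values are matched by date
wide = long.pivot_table(index=["geoid_msa", "name_msa"],
                        columns="interval_start",
                        values="cases",
                        aggfunc="sum",
                        fill_value=0).astype(int)

# use plain date strings as column names, as in the CSV output
wide.columns = [d.strftime("%Y-%m-%d") for d in wide.columns]
wide = wide.reset_index().rename(columns={"geoid_msa": "geoid", "name_msa": "name"})
print(wide)
```

Because the reshape is keyed on the date, an MSA that has no row for some date gets a 0 in exactly that column, regardless of where the gap falls.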

#### Transpose MSA

In [177]:
# initialize some dictionaries to be used for the transposition
geoid_msa = {}
MSA_all_cases = {}
MSA_all_deaths = {}
all_dates = {}

for index, row in output.iterrows():
    MSA_all_cases[row['name_msa']] = []
    MSA_all_deaths[row['name_msa']] = []
    all_dates[row['interval_start']] = 0
    geoid_msa[row['geoid_msa']] = 0

for index, row in output.iterrows():
    MSA_all_cases[row['name_msa']].append(row['cases'])
    MSA_all_deaths[row['name_msa']].append(row['deaths'])

In [178]:
# convert the dictionaries to lists (one list of per-date values per MSA)
# and left-pad with zeros any MSA that has no rows for the earliest dates
MSA_all_cases_list = []
for value in MSA_all_cases.values():
    MSA_all_cases_list.append(value)

MSA_all_deaths_list = []
for value in MSA_all_deaths.values():
    for index, v in enumerate(value):
        value[index] = int(v)
    MSA_all_deaths_list.append(value)

for i in MSA_all_cases_list:
    if (len(i) < len(all_dates)):
        diff = len(all_dates) - len(i)
        for j in range(diff):
            i.insert(j, 0)

for i in MSA_all_deaths_list:
    if (len(i) < len(all_dates)):
        diff = len(all_dates) - len(i)
        for j in range(diff):
            i.insert(j, 0)

In [None]:
# format the dates into strings
dates = list(all_dates.keys())
for i in range(len(dates)):
    dates[i] = str(dates[i])[:-9]

In [179]:
# build the output DataFrames: one row per MSA, one column per date
output_cases = pd.DataFrame(MSA_all_cases_list, columns = dates)
output_cases.insert(0, 'geoid', list(geoid_msa.keys()))
output_cases.insert(1, 'name', list(MSA_all_cases.keys()))

output_deaths = pd.DataFrame(MSA_all_deaths_list, columns = dates)
output_deaths.insert(0, 'geoid', list(geoid_msa.keys()))
output_deaths.insert(1, 'name', list(MSA_all_deaths.keys()))

In [180]:
output_cases.to_csv("./data/output_cases.csv", index = False)
output_deaths.to_csv("./data/output_deaths.csv", index = False)

#### Transpose world

In [205]:
# input global covid data
input_world = './data/covid_world.csv'
df = pd.read_csv(input_world)
df["date"] = pd.to_datetime(df["date"])
df = df[df.date <= endDate]
df = df[df.date >= startdate]
df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
0,AFG,Asia,Afghanistan,2020-02-24,1.0,1.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
1,AFG,Asia,Afghanistan,2020-02-25,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
2,AFG,Asia,Afghanistan,2020-02-26,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
3,AFG,Asia,Afghanistan,2020-02-27,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
4,AFG,Asia,Afghanistan,2020-02-28,1.0,0.0,,,,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498


In [206]:
# geoid data for reference
geoid = pd.read_csv('./data/geoid.csv')
geoid.head()

Unnamed: 0,Geographical location identifier (Hex),Geographical location identifier (decimal),Location (Short Name),Location (Long Name)
0,0x2,2,Antigua and Barbuda,Antigua and Barbuda
1,0x3,3,Afghanistan,Islamic Republic of Afghanistan
2,0x4,4,Algeria,Democratic and Popular Republic of Algeria
3,0x5,5,Azerbaijan,Republic of Azerbaijan
4,0x6,6,Albania,Republic of Albania


In [207]:
# merge the geoid and world data 
df = df.merge(geoid, how = 'inner', 
                     left_on = ['location'], 
                     right_on = ['Location (Short Name)'] )

df = df[['Geographical location identifier (decimal)', 'iso_code', 'date', 'continent', 'location', 
         'new_cases', 'new_deaths']]

In [208]:
# initialize some dictionaries to be used for the transposition
iso_country = {}
country_all_cases = {}
country_all_deaths = {}
all_dates = {}
continent = {}
locations = {}
geoid = {}

# for i in range(df.shape[0]):
#     transform = df.at[i, 'date']
#     print(transform)
#     transform = transform.split('/')
#     transform = transform[0] + '-' + transform[1] + '-' + transform[2]


# keep only the rows within the date range (already applied when the data
# was loaded, repeated here as a safeguard)
df = df[(df.date >= startdate) & (df.date <= endDate)]

for index, row in df.iterrows():
    country_all_cases[row['iso_code']] = []
    country_all_deaths[row['iso_code']] = []
    all_dates[row['date']] = 0
    iso_country[row['iso_code']] = 0

    continent[row['continent']] = 0
    locations[row['location']] = 0
    geoid[row['Geographical location identifier (decimal)']] = 0

    

In [209]:
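# fill the dictionaries so that the keys are the iso_code of each country
# and the values are the cases/deaths in row order
# NOTE: values are matched to the date columns by position, not by date, so
# a country whose records start on a different date than the first country is
# misaligned (the date columns in the output below are out of order for the
# same reason); a date-keyed reshape such as DataFrame.pivot avoids this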
for index, row in df.iterrows():
    country_all_cases[row['iso_code']].append(row['new_cases'])
    country_all_deaths[row['iso_code']].append(row['new_deaths'])

country_all_cases_list = []
for value in country_all_cases.values():
    country_all_cases_list.append(value)

country_all_deaths_list = []
for value in country_all_deaths.values():
    country_all_deaths_list.append(value)


dates = list(all_dates.keys())
for i in range(len(dates)):
    dates[i] = str(dates[i])[:-9]


output_cases_world = pd.DataFrame(country_all_cases_list, columns = dates)
output_cases_world.insert(0, 'geoid', list(geoid.keys()))
output_cases_world.insert(1, 'name', list(locations.keys()))


output_deaths_world = pd.DataFrame(country_all_deaths_list, columns = dates)
output_deaths_world.insert(0, 'geoid', list(geoid.keys()))
output_deaths_world.insert(1, 'name', list(locations.keys()))

output_cases_world.head()

Unnamed: 0,geoid,name,2020-02-24,2020-02-25,2020-02-26,2020-02-27,2020-02-28,2020-02-29,2020-03-01,2020-03-02,...,2020-02-14,2020-02-15,2020-02-16,2020-02-17,2020-02-18,2020-02-19,2020-02-20,2020-02-21,2020-02-22,2020-02-23
0,3,Afghanistan,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
1,6,Albania,,,,,,,,,...,,,,,,,,,,
2,4,Algeria,1.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,...,,,,,,,,,,
3,8,Andorra,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
4,2,Antigua and Barbuda,1.0,0.0,,,,,,,...,,,,,,,,,,


In [212]:
# merge the world output and MSA output
# rows are copied by position, so both tables must have the same number of columns
n_cases = output_cases.shape[0]
for i in range(output_cases_world.shape[0]):
    output_cases.loc[n_cases + i] = list(output_cases_world.loc[i])

n_deaths = output_deaths.shape[0]
for i in range(output_deaths_world.shape[0]):
    output_deaths.loc[n_deaths + i] = list(output_deaths_world.loc[i])

output_cases.to_csv("./data/output_cases.csv", index = False)
output_deaths.to_csv("./data/output_deaths.csv", index = False)

output_cases.head()

Unnamed: 0,geoid,name,2020-02-14,2020-02-15,2020-02-16,2020-02-17,2020-02-18,2020-02-19,2020-02-20,2020-02-21,...,2020-03-05,2020-03-06,2020-03-07,2020-03-08,2020-03-09,2020-03-10,2020-03-11,2020-03-12,2020-03-13,2020-03-14
0,38060,Phoenix-Mesa-Scottsdale,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,2.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0
1,41940,San Jose-Sunnyvale-Santa Clara,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,4.0,8.0,5.0,6.0,2.0,3.0,18.0,14.0,12.0
2,41860,San Francisco-Oakland-Hayward,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.0,4.0,6.0,5.0,35.0,2.0,8.0,13.0,21.0,17.0
3,41740,San Diego-Carlsbad,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,4.0,1.0
4,31080,Los Angeles-Long Beach-Anaheim,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,2.0,1.0,0.0,7.0,1.0,9.0,4.0,15.0,14.0
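
The row-by-row merge above copies values by position. A hedged alternative, shown here on made-up toy tables, is `pd.concat`, which lines the two tables up by column name, so a date column that appears in a different position in the world table still lands under the right date.

```python
import pandas as pd

# toy MSA and world tables (made up); note the different date column order
msa = pd.DataFrame({"geoid": [1], "name": ["MSA A"],
                    "2020-02-14": [1], "2020-02-15": [0]})
world = pd.DataFrame({"geoid": [3], "name": ["Country X"],
                      "2020-02-15": [4], "2020-02-14": [2]})

# columns are aligned by label; the column order of the first table is kept
combined = pd.concat([msa, world], ignore_index=True)
print(combined)
```

A date present in only one of the two tables would show up as a column with missing values for the other table's rows.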
