
covidestim-sources

This repository provides a way to clean various input data used for the covidestim model, producing the following easy-to-use outcomes:

  • Cases
  • Deaths
  • Vaccination-related risk ratio
  • Vaccinations and boosters administered
  • Hospitalizations

These data are offered at the following geographies:

| Outcome | County-level | State-level |
| --- | --- | --- |
| Cases | ✓ | ✓ |
| Deaths* | ✓ | ✓ |
| Risk-ratio | ✓ | ✓ |
| Vax-boost | ✓ | ✓ |
| Hospitalizations | ✓ | ✓ |

* Note that due to the discontinuation of JHU data, deaths data are marked NA or 0 beyond February 14, 2023.

Hospitalizations data were reported Friday-Thursday until June 19, 2023, when they switched to a Sunday-Saturday reporting scheme. The weekly CDC case data are reported on a Thursday-Wednesday schedule. We match each hospitalization week to the closest case-report week: for example, the hospitalization reports for Sunday, February 5 - Saturday, February 11 are matched to the case reports for Thursday, February 2 - Wednesday, February 8 (2023 reference dates).
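For illustration, here is a minimal sketch of this alignment in R, assuming hypothetical weekly tibbles hosp and cases that each carry a week_start Date column (this is not the actual cleaning code):

```r
library(dplyr)

# Match each hospitalization reporting week to the case-report week whose
# start date is closest (e.g. Sun 2023-02-05 -> Thu 2023-02-02).
match_hosp_to_case_week <- function(hosp, cases) {
  case_weeks <- sort(unique(cases$week_start))     # case-report week start dates
  hosp %>%
    mutate(case_week_start = case_weeks[
      vapply(
        as.numeric(week_start),                    # hospitalization week start dates
        function(d) which.min(abs(d - as.numeric(case_weeks))),
        integer(1)
      )
    ])
}
```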

Usage and dependencies

This repository is essentially a series of GNU Make targets (see makefile), which depend on the results of HTTP requests as well as data sources in data-sources/. None of the "cleaned" data are committed to the repository; you need to run make yourself to produce them.

First, install Git LFS.

Then, clone the repository and initialize the Git submodules, which track some of our external data sources:

git clone https://github.com/covidestim/covidestim-sources && cd covidestim-sources
git submodule init
git submodule update --remote # This will take 5-30 minutes

Then, make sure you have the necessary R packages installed. These are:

  • tidyverse
  • cli
  • docopt
  • sf
  • vaccineAdjust: Not on CRAN. To install: devtools::install_github("covidestim/vaccineAdjust")
  • raster
  • spdep

You can install the CRAN packages and the GitHub-only vaccineAdjust package from the R console:
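```r
# CRAN packages
install.packages(c("tidyverse", "cli", "docopt", "sf", "raster", "spdep"))

# vaccineAdjust is not on CRAN; install it from GitHub (requires the devtools package)
devtools::install_github("covidestim/vaccineAdjust")
```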

Finally, attempt to make the most important targets. Note that you will need GNU Make >= 4.3 installed, which does not ship with OS X.

# Make all primary outcomes
make -Bj data-products/{case-death-rr-boost-hosp.csv,case-death-rr-boost-hosp-state.csv}

Repository structure

  • makefile: The project makefile. If you're confused about how a piece of data gets cleaned, go here first. If you've never read a Makefile before, it's advisable to read an introduction to Make, like this one.
  • data-products/: All cleaned data is written to this directory. Some recipes will also produce metadata, which will always have a .json extension.
  • data-sources/: All git submodules are stored here, as well as static files used in recipes, like population sizes, polygons, and records of periods of nonreporting.
  • example-output/: Some example cleaned data, for reference
  • R/: All data cleaning scripts live here

Keeping your data sources up-to-date

Data sources do not update automatically, so make will normally do nothing if you attempt to remake a target. This is undesirable if you believe there may be newer versions of data sources available. To pull new data from sources backed by submodules, run:

git submodule update --remote

And, use make -B to force targets to be remade, like so:

make -B data-products/case-death-rr.csv

Data sources

All data sources for the cleaned data are either:

  • Committed to the repository in data-sources/
  • Committed to the repository in data-sources/, but backed by Git LFS
  • Accessed through HTTP requests in the makefile or from within R scripts
  • Committed to the repository as Git submodules in data-sources/

| Data | Used for | Accessed through | Frequency of update |
| --- | --- | --- | --- |
| Johns Hopkins CSSE | Cases, Deaths | Submodule data-sources/jhu-data | >Daily, until February 14, 2023 |
| Covid Tracking Project | Cases, Deaths (2021-02 - 2021-06) | HTTP, api.covidtracking.com | No longer updated |
| NYTimes covid data | Nothing, but a future merge will use it to supplement counties missing from the JHU dataset | Submodule, data-sources/nyt-data | >Daily |
| USCB county, state population estimates | Everything | data-sources/{fips,state}pop.csv, reformatted from .xls | 1/yr? |
| DHHS facility-level hospitalizations data | Hospitalizations | HTTP, healthdata.gov/api | 1/wk |
| Dartmouth Atlas Zip-HSA-HRR crosswalk | Hospitalizations (agg/disagg) | data-sources/ZipHsaHrr18.csv, downloaded from dartmouthatlas.org | 1/yr? |
| Dartmouth Atlas HSA polygons | Hospitalizations (agg/disagg) | data-sources/hsa-polygons/ (Git LFS), downloaded from dartmouthatlas.org | 1/yr? |
| Census Block Group polygons | Hospitalizations (agg/disagg) | data-sources/cbg-polygons/ (Git LFS), downloaded from TIGER | 1/yr? |
| Census Block Group popsize | Hospitalizations (agg/disagg) | data-sources/population_by_cbg.csv, extracted from TIGER | 1/yr? |
| Vaccination adjustments | Vaccination adjustments | data-sources/vaccines-counties.csv | Static, with data through Dec 2022 |
| Data.CDC.gov | Booster and vaccination data | data-sources/cdc-vax-boost-state.csv, extracted from cdc-states, and data-sources/cdc-vax-boost-county.csv, extracted from cdc-counties | Daily |
| Data.CDC.gov | Case data after February 14, 2023 | cdc-state-cases and cdc-counties-cases | Weekly |

Some data sources are tracked as git submodules, and other sources, being accessed through APIs, are fetched over HTTP. Large static files are stored through Git LFS.

The included makefile provides a common interface for directing the fetching and cleaning of all data sources.

Data sources

  • data-sources/vaccines-counties: A cached file containing daily RR adjustments to the transition probabilities for each state and county, up until December 31, 2022. This file is rendered by running vaccineAdjust::run() with data up until December 1, 2021, and freezing the final RR adjustment. After December 1, 2021, vaccination adjustments occur within the covidestim model.
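As an illustrative sketch only, regenerating this cache could look roughly like the following; the column names fips, date, and RR are assumptions, and the actual output format and freezing logic are defined by the vaccineAdjust package and our scripts:

```r
library(dplyr)
library(readr)

# Sketch: regenerate the cached adjustments (column names are assumed)
rr <- vaccineAdjust::run()

frozen <- rr %>%
  group_by(fips) %>%
  arrange(date, .by_group = TRUE) %>%
  # After 2021-12-01, carry the last pre-cutoff adjustment forward unchanged
  mutate(RR = if_else(date > as.Date("2021-12-01"),
                      last(RR[date <= as.Date("2021-12-01")]),
                      RR)) %>%
  ungroup()

write_csv(frozen, "data-sources/vaccines-counties.csv")
```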

Targets

  • make data-products/case-death-rr-boost-hosp.csv
    Reads cleaned archived JHU county-level case/death data. Joins by location and date with static vaccine IFR adjustments from data-sources/vaccines-counties and the cleaned CDC vax-boost-county.csv, and joins the weekly CDC case data from data-sources/cdc-cases-raw.csv. Weekly aggregates are joined with the cleaned hhs-hospitalizations-by-county.csv. Any metadata for counties included in the cleaned data is stored in case-death-rr-boost-hosp-metadata.json.

  • make data-products/case-death-rr-boost-hosp-state.csv
    Reads cleaned archived JHU state-level data, splicing in archived Covid Tracking Project data. For details on this, see the makefile. Joins by location and date with static vaccine IFR adjustments from data-sources/vaccines-counties and the cleaned CDC vax-boost-state.csv, and joins the weekly CDC case data from data-sources/cdc-cases-state-raw.csv. Weekly aggregates are joined with the cleaned hhs-hospitalizations-by-state.csv. Any metadata for states included in the cleaned data is stored in case-death-rr-boost-hosp-state-metadata.json.

  • make data-products/nyt-counties.csv
    Cleans NYT county-level case/death data. Writes nyt-counties-rejects.csv.

  • make data-products/hhs-hospitalizations-by-state.csv
    State-level aggregation of county-level hospitalization data.

  • make data-products/hhs-hospitalizations-by-county.csv
    County-level aggregation of facility-level hospitalization data. See the next section for details on how this is done.

  • make data-products/hhs-hospitalizations-by-facility.csv
    Cleans facility-level data from DHHS's API and annotates each facility with an HSA id. Also computes .min and .max columns to compensate for censoring done when there are 1-3 hospitalizations in a given week.
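For example, to force a rebuild of the full hospitalizations chain (assuming, as the makefile's dependency structure suggests, that the county- and state-level targets depend on the facility-level one):

# Force-remake the hospitalization targets, running jobs in parallel
make -Bj data-products/hhs-hospitalizations-by-{facility,county,state}.csv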

Hospitalizations data pipeline

Key document: COVID-19 Guidance for Hospital Reporting and FAQs For Hospitals, Hospital Laboratory, and Acute Care Facility Data Reporting, January 6, 2022 revision

We source hospitalizations data from the official HHS facility-level dataset. In order to be useful to our model, we transform these data into a county-level dataset.

Outcome format

Outcomes are presented across 3-4 variables. For an outcome {name}, the following variables may be present:

  • {name}: The outcome itself, including censored data. This means that if a facility reports -999999 for that week, the {name} outcome will be equal to -999999.

  • {name}_min: The smallest the outcome could be - all censored values, which each represent a possible range of 1-3, will be resolved to 1.

  • {name}_max: The largest the outcome could be if all censored values are resolved to 3.

  • {name}_max2: The largest the outcome could be if all censored values are resolved to 3 and any missing days are imputed using the average of the present days (see the sketch after this list).

    Note: This quantity is not meaningful for the following averaged prevalence outcomes, because they are already averaged across the number of days reported by the facility that week (which is not necessarily 7). For these outcomes, {name}_max == {name}_max2.

    • averageAdultICUPatientsConfirmed
    • averageAdultICUPatientsConfirmedSuspected
    • averageAdultInpatientsConfirmed
    • averageAdultInpatientsConfirmedSuspected
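As an illustration of how these variables relate, here is a minimal dplyr sketch for a single admissions outcome, assuming a hypothetical facility-week tibble facility_weeks in which censored values appear as -999999 and days_reported counts the days the facility reported that week (this is not the actual cleaning code):

```r
library(dplyr)

CENSORED <- -999999  # suppression code used when a weekly count is 1-3

facility_weeks <- facility_weeks %>%
  mutate(
    # Lower/upper bounds: resolve each censored value to 1 or 3
    admissionsAdultsConfirmed_min = if_else(
      admissionsAdultsConfirmed == CENSORED, 1, admissionsAdultsConfirmed),
    admissionsAdultsConfirmed_max = if_else(
      admissionsAdultsConfirmed == CENSORED, 3, admissionsAdultsConfirmed),
    # _max2: additionally impute any missing days using the average of the
    # days the facility actually reported that week
    admissionsAdultsConfirmed_max2 =
      admissionsAdultsConfirmed_max * 7 / pmax(days_reported, 1)
  )
```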

Outcomes table

| Variable | Meaning | min/max | max2 |
| --- | --- | --- | --- |
| fips | FIPS code of the county | | |
| weekstart | YYYY-MM-DD of the first date in the week | | |
| admissionsAdultsConfirmed | # admissions of adults with confirmed¹ Covid | ✓ | ✓ |
| admissionsAdultsSuspected | # admissions of adults with suspected² Covid | ✓ | ✓ |
| admissionsPedsConfirmed | # admissions of peds with confirmed¹ Covid | ✓ | ✓ |
| admissionsPedsSuspected | # admissions of peds with suspected² Covid | ✓ | ✓ |
| averageAdultICUPatientsConfirmed | Average number of ICU beds occupied by adults with confirmed¹ Covid that week | ✓ | equal to max |
| averageAdultICUPatientsConfirmedSuspected | Average number of ICU beds occupied by adults with confirmed¹ or suspected² Covid that week | ✓ | equal to max |
| averageAdultInpatientsConfirmed | Average number of inpatient beds occupied by adults with confirmed¹ Covid that week | ✓ | equal to max |
| averageAdultInpatientsConfirmedSuspected | Average number of inpatient beds occupied by adults with confirmed¹ or suspected² Covid that week | ✓ | equal to max |
| covidRelatedEDVisits | Total number of ED visits that week related to Covid³ | ✓ | ✓ |

Geographic aggregation/disaggregation

(Figure: map of county, HSA, and CBG boundaries.)

Simply identifying which county each facility lies within and then summing across all facilities in a county carries the following drawbacks:

  • It ignores the fact that the hospital may be treating patients from other counties.

  • It ignores the fact that patients from one county may be treated at a hospital in an adjacent (or even non-adjacent) county.

These two issues will cause particularly large problems (biases) when:

  • There are major medical centers in the area, which are more likely to take the lion's share of severe patients during times of peak Covid prevalence.

  • There are small or sparsely-populated counties, where residents may have to travel outside of their county to seek hospital care.

To help solve this problem, we rely on a dataset maintained by the [Dartmouth Atlas](https://www.dartmouthatlas.org/faq/) which defines geographic units called Hospital Service Areas (HSAs). These service areas are meant to represent a notion of a catchment area for each hospital:

Hospital service areas (HSAs) are local health care markets for hospital care. An HSA is a collection of ZIP codes whose residents receive most of their hospitalizations from the hospitals in that area. HSAs were defined by assigning ZIP codes to the hospital area where the greatest proportion of their Medicare residents were hospitalized. Minor adjustments were made to ensure geographic contiguity. Most hospital service areas contain only one hospital. The process resulted in 3,436 HSAs.

Importantly, HSAs are only an approximation of a catchment area: because a patient may well travel outside the "catchment area" for care, the notion that a patient always goes to a hospital within the HSA polygon has inherent limitations in capturing patient facility choice. See [this 2015 paper](https://pubmed.ncbi.nlm.nih.gov/25961661/) by Kilaru and Carr for a discussion of these problems.

Methodology

(Figure: diagram of the HSA => FIPS disaggregation process.)

Nonetheless, we use HSAs as our aggregate geographical unit because we believe they are a better representation of a catchment area than what we would get by simply drawing the county border enclosing each facility. To leverage these HSAs to create county-level hospitalizations data, we:

  1. Aggregate facility-level data to the HSA level
  2. Fracture the HSAs using county boundaries
  3. Use CBG population data to divide the outcomes from fractured HSAs among the intersecting counties in a population-proportional manner, as sketched below. Note that the CBG population is not present for D.C. in our dataset, so we manually add the D.C. population for the corresponding HSA.
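As an illustration of steps 2-3, here is a minimal sf/dplyr sketch; every object and column name below (hsa_outcomes, hsa_polys, county_polys, cbg, admissions, pop) is an assumption for illustration, and the actual implementation lives in the scripts under R/:

```r
library(dplyr)
library(sf)

# Assumed inputs:
#   hsa_outcomes : tibble with hsa_id, weekstart, admissions (HSA-level totals)
#   hsa_polys    : sf polygons with an hsa_id column
#   county_polys : sf polygons with a fips column
#   cbg          : sf points (CBG centroids) with a pop column

# Step 2: fracture HSAs along county boundaries
fragments <- st_intersection(hsa_polys, county_polys)

# Step 3a: attach CBG population to each HSA-county fragment and compute each
# fragment's population share within its HSA
frag_pop <- st_join(cbg, fragments, left = FALSE) %>%
  st_drop_geometry() %>%
  group_by(hsa_id, fips) %>%
  summarise(pop = sum(pop), .groups = "drop") %>%
  group_by(hsa_id) %>%
  mutate(share = pop / sum(pop)) %>%
  ungroup()

# Step 3b: split HSA-level outcomes across counties in proportion to population
county_outcomes <- hsa_outcomes %>%
  inner_join(frag_pop, by = "hsa_id") %>%
  group_by(fips, weekstart) %>%
  summarise(admissions = sum(admissions * share), .groups = "drop")
```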

Footnotes

  1. Definition of "Laboratory-confirmed Covid":
    Source: Page 44, HHS hospital reporting guidance

  2. Definition of "suspected Covid":
    "“Suspected” is defined as a person who is being managed as though he/she has COVID-19 because of signs and symptoms suggestive of COVID-19 but does not have a laboratory-positive COVID-19 test result."
    Source: Page 14, HHS hospital reporting guidance

  3. Definition of "related to Covid":
    "Enter the total number of ED visits who were seen on the previous calendar day who had a visit related to suspected or laboratory-confirmed COVID-19. Do not count patients who receive a COVID-19 test solely for screening purposes in the absence of COVID-19 symptoms."
    Source: Page 14, HHS hospital reporting guidance
