Data overview below, detailed documentation in the list of columns

The Yu group at UC Berkeley Statistics and EECS has compiled, cleaned and documented a large corpus of hospital- and county-level data from a variety of public sources to aid data science efforts to combat COVID-19. At the hospital level, our data include the location of the hospital, the number of ICU beds, the total number of employees, and the hospital type. At the county level, our data include COVID-19 cases/deaths from USA Facts and NYT, automatically updated every day, along with demographic information, health resource availability, COVID-19 health risk factors, and social mobility information. An overview of each data set in this corpus is provided in this file.

Options to download the data

  • clone the repo and load the data as documented below in the quickstart (recommended)
  • download the county_data_abridged.csv file
    • Note: this is an abridged data set, not the full county-level data set
  • download from the AWS Data Exchange.

Data overview

  • Hospital Level Data
  • Nursing Homes Level Data
    • nyt_nursinghomes: number of COVID-19-related cases and deaths from nursing homes, as reported by NYT
    • hifld_nursinghomes: database of nursing homes/assisted living facilities, populated via open source authoritative sources
  • County Level Data
    • COVID-19 Cases/Deaths Data
      • nytimes_infections: COVID-19-related death/case counts per day per county from NYT
      • usafacts_infections: COVID-19-related death/case counts per day per county from USA Facts
      • ccd_daily: COVID-19-related deaths, cases, hospitalizations, and testing statistics
    • Demographics and Health Resource Availability
      • ahrf_health: contains county-level information on health facilities, health professions, measures of resource scarcity, health status, economic activity, health training programs, and socioeconomic and environmental characteristics from Area Health Resources Files
      • cdc_svi: Social Vulnerability Index for counties from CDC
      • hpsa_shortage: information on areas with shortages of primary care, as designated by the Health Resources & Services Administration (HRSA)
      • khn_icu: information on number of ICU beds and hospitals per county from Kaiser Health News
      • usda_poverty: county-level poverty estimates from the United States Department of Agriculture, Economic Research Service
    • Health Risk Factors
      • chrr_health: contains estimates of various health outcomes and health behaviors (e.g., percentage of adult smokers) for each county from County Health Rankings & Roadmaps
      • dhdsp_heart: cardiovascular disease mortality rates from CDC DHDSP
      • dhdsp_stroke: stroke mortality rates from CDC DHDSP
      • ihme_respiratory: chronic respiratory disease mortality rates from IHME
      • medicare_chronic: Medicare claims data for 21 chronic conditions
      • nchs_mortality: overall mortality rates for each county from National Center for Health Statistics
      • usdss_diabetes: diagnosed diabetes in each county from CDC USDSS
      • kinsa_ili: measures of anomalous influenza-like illness incidence (ILI) outbreaks in real-time using Kinsa's county-level illness signals, developed from real-time geospatial thermometer data (private data)
      • cmu_covidcast: epidemiological data from the CMU Delphi COVIDcast, which includes data on COVID-like symptoms from Facebook surveys, estimated COVID-related doctor visits and hospital admissions, and other indicators
    • Social Distancing and Mobility/Miscellaneous
      • nytimes_masks: mask-wearing survey data from NYT and Dynata
      • google_mobility: community mobility reports from Google
      • apple_mobility: mobility trends from Apple maps direction requests
      • unacast_mobility: county-level estimates of the change in mobility from pre-COVID-19 baseline from Unacast (private data)
      • streetlight_vmt: estimates of total vehicle miles travelled (VMT) by residents of each county, each day; provided by Streetlight Data (private data)
      • safegraph_socialdistancing: aggregated daily views of USA foot-traffic summarizing movement between counties from SafeGraph (private data)
      • safegraph_weeklypatterns: place foot-traffic and demographic aggregations that answer: how often people visit, where they came from, where else they go, and more; from SafeGraph (private data)
      • jhu_interventions: contains the dates that counties (or states governing them) took measures to mitigate the spread by restricting gatherings (e.g., travel bans, stay at home orders)
      • mit_voting: county-level returns for presidential elections from 2000 to 2016 according to official state election data records
  • Miscellaneous Data
    • bts_airtravel: survey data including origin, destination, and itinerary details from a 10% sample of airline tickets from the Bureau of Transportation Statistics
    • fb_socialconnectedness: an anonymized snapshot of all active Facebook users and their friendship networks as a measure of social connectedness between two different places


Quickstart

To load the county-level data (daily COVID-19 cases/deaths data + other county-level features listed above) from the project root directory:

import data
# unabridged
df_unabridged = data.load_county_data(data_dir = "data", cached = False, abridged = False)
# abridged
df_abridged = data.load_county_data(data_dir = "data", cached = False, abridged = True)

To load the nursing homes data from the project root directory:

import data
nhomes = data.load_nursinghome_data(data_dir = "data", cached = False)

To load the hospital-level data from the project root directory:

import data
hosp = data.load_hospital_data(data_dir="data", with_private_data=False, load_cached_file=False)

To load the public social mobility data from the project root directory:

import data
# country-level data in long-format
mobility_country_long = data.load_socialmobility_data(data_dir = "data", level = "country", df_shape = "long")
# county-level data in wide-format
mobility_county_wide = data.load_socialmobility_data(data_dir = "data", level = "county", df_shape = "wide")
# level must be one of {"country", "state", "county", "city"}
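
Because the loader accepts only a fixed set of `level` values, it can help to validate arguments before calling it. The following is a minimal sketch, not part of the repo: `check_mobility_args` is a hypothetical helper, and the assumption that `df_shape` accepts only "long" and "wide" is inferred from the two examples above rather than confirmed by the source.

```python
# Documented in the comment above: the allowed values for `level`.
VALID_LEVELS = {"country", "state", "county", "city"}
# Assumption: only the two shapes shown in the examples above are accepted.
VALID_SHAPES = {"long", "wide"}


def check_mobility_args(level, df_shape):
    """Hypothetical helper: fail fast on arguments outside the documented options."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {sorted(VALID_LEVELS)}, got {level!r}")
    if df_shape not in VALID_SHAPES:
        raise ValueError(f"df_shape must be one of {sorted(VALID_SHAPES)}, got {df_shape!r}")
    return {"level": level, "df_shape": df_shape}


kwargs = check_mobility_args("county", "wide")
# From the project root, the validated arguments could then be passed on:
# mobility = data.load_socialmobility_data(data_dir="data", **kwargs)
```

A misspelled level such as "counties" then raises a ValueError immediately instead of failing somewhere inside the loader.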

Folder Structure

The data folder is structured as follows:

  • raw (contains raw data)
    • [datasource]_[shortname]/
      • (a script that loads the data)
      • (a script that downloads the data)
      • raw data
      • (metadata for the raw data)
  • processed (contains the processed data)
    • [datasource]_[shortname]/
      • (a script that cleans the data)
      • cleaned data
      • (metadata for the cleaned data)
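
The naming convention above can be sketched in a few lines of Python. This is an illustration only: `dataset_dir` and `split_dataset_name` are hypothetical helpers that do not exist in the repo, and the sketch assumes that `raw` and `processed` sit directly under the `data` directory used in the loading examples.

```python
from pathlib import Path

# The two top-level stages described in the folder structure.
STAGES = {"raw", "processed"}


def dataset_dir(data_dir, stage, datasource, shortname):
    """Hypothetical helper: build <data_dir>/<stage>/<datasource>_<shortname>."""
    if stage not in STAGES:
        raise ValueError(f"stage must be one of {sorted(STAGES)}, got {stage!r}")
    return Path(data_dir) / stage / f"{datasource}_{shortname}"


def split_dataset_name(name):
    """Hypothetical helper: split e.g. 'cdc_svi' into ('cdc', 'svi')."""
    datasource, sep, shortname = name.partition("_")
    if not sep or not datasource or not shortname:
        raise ValueError(f"expected '<datasource>_<shortname>', got {name!r}")
    return datasource, shortname


# Example: where the raw NYT infections data would live under this convention.
raw_dir = dataset_dir("data", "raw", "nytimes", "infections")
source, short = split_dataset_name("cdc_svi")
```

Under these assumptions, `raw_dir` points at `data/raw/nytimes_infections`, and the same helper with `"processed"` gives the matching cleaned-data folder.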

We prepared this data to support emergency medical supply distribution efforts through short-term (days) prediction of COVID-19 deaths (and cases) at the county level. We are using the predictions and hospital data to arrive at a covid Pandemic Severity Index (c-PSI) for each hospital. This project is in partnership with We will be adding more relevant data sets as they are found.