dathere/covid19-time-series-utilities


COVID-19 - time-series utilities

This repo contains several utilities for wrangling COVID-19 data from the Johns Hopkins University COVID-19 repository.

NOTE: The utilities currently do not work because of the new file formats. They will be updated shortly to work with the revised formats.

Requirements

Cloning

A note on cloning this repo, since the COVID19 directory is a git submodule:

  • after cloning, you must initialize the submodule. From the project's top-level directory, run git submodule init and git submodule update to clone the JHU repo as a submodule

Content

The files in this directory and how they're used:

  • covid-19_ingest.sh: script that converts the JHU COVID-19 daily-report data to a time-series database using TimescaleDB.
  • covid-refine: OpenRefine automation script that converts JHU COVID-19 time-series data into a normalized, enriched format and uploads it to TimescaleDB. (RECOMMENDED)
  • schema.sql: Data definition (DDL) to create the necessary tables & hypertables.
  • environment: Default environment values used in Docker containers.

Using the Timescale covid19-ingest script

  1. Create a TimescaleDB instance - download it or sign up for a hosted instance
  2. Create a database named covid_19, and an application user covid19_user
  psql
  CREATE DATABASE covid_19;
  CREATE USER covid19_user WITH PASSWORD 'your-password-here';
  ALTER DATABASE covid_19 OWNER TO covid19_user;
  \quit
  3. Run schema.sql as the covid19_user (VACUUM/ANALYZE require owner privileges)

    psql -U covid19_user -h <the.server.hostname> -f schema.sql covid_19
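    schema.sql itself is the source of truth for the schema; as a sketch of what a TimescaleDB DDL of this kind typically looks like (table and column names below are illustrative assumptions, not the repo's actual schema):

    ```sql
    -- Illustrative only: the actual tables live in schema.sql.
    CREATE TABLE daily_reports (
        observation_time TIMESTAMPTZ NOT NULL,
        country_region   TEXT,
        province_state   TEXT,
        confirmed        INTEGER,
        deaths           INTEGER,
        recovered        INTEGER
    );
    -- Turn the plain table into a hypertable partitioned by time.
    SELECT create_hypertable('daily_reports', 'observation_time');
    ```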

  4. Install csvkit

    • Ubuntu: sudo apt-get install csvkit
    • macOS: using Homebrew, run brew install csvkit
  5. Using a text editor, replace the environment variables PGHOST, PGUSER and PGPASSWORD in covid-19_ingest.sh
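  These are libpq's standard connection variables, so psql picks them up automatically; a sketch of the edit (hostname and password are placeholders, not real values):

```shell
# Placeholder values; substitute your own TimescaleDB instance's details.
PGHOST="your.timescale.hostname"
PGPORT="5432"
PGUSER="covid19_user"
PGPASSWORD="your-password-here"
PGDATABASE="covid_19"
export PGHOST PGPORT PGUSER PGPASSWORD PGDATABASE
```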

  6. Run the script

    bash covid-19_ingest.sh

  7. (OPTIONAL) add the shell script to crontab to run daily
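  Assuming the repo was cloned to ~/covid19-time-series-utilities (adjust the path to your checkout; the log path is also illustrative), a crontab entry that runs the ingest daily at 06:00 might look like:

    ```
    0 6 * * * cd "$HOME/covid19-time-series-utilities" && bash covid-19_ingest.sh >> "$HOME/.covid-19/ingest.log" 2>&1
    ```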

  8. Slice and dice the data using the full power of PostgreSQL along with Timescale's time-series capabilities!
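  Once the data is loaded, a typical time-series query uses Timescale's time_bucket function. For example (table and column names are illustrative, not necessarily those in schema.sql):

    ```sql
    -- Weekly confirmed-case totals per country (names are illustrative).
    SELECT time_bucket('7 days', observation_time) AS week,
           country_region,
           max(confirmed) AS confirmed
    FROM daily_reports
    GROUP BY week, country_region
    ORDER BY week, country_region;
    ```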

Using COVIDrefine

NOTE: Due to the changing file format of JHU's daily report data, covid-refine is recommended over covid-19_ingest.sh. COVIDrefine has the added benefit of producing fully normalized, non-sparse, geo-enriched data.

See the detailed README.

If you just want to download the COVIDrefine data, the latest version can be found here.

Using docker-compose

  1. Remember to initialize the submodule: run git submodule init and git submodule update
  2. Run docker-compose build
  3. Run docker-compose up
  4. That's all. You can now go to Swagger or PostgREST
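  PostgREST listens on port 3000 by default and exposes each table as a REST route, so a quick smoke test could be assembled like this (the table name and query columns are assumptions, not the project's actual API):

```shell
# PostgREST defaults: port 3000, one route per table/view.
BASE_URL="http://localhost:3000"
TABLE="daily_reports"                            # illustrative table name
QUERY="select=country_region,confirmed&limit=5"  # PostgREST query syntax
ENDPOINT="${BASE_URL}/${TABLE}?${QUERY}"
echo "$ENDPOINT"
# Then query it with: curl "$ENDPOINT"
```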

NOTES

  • the JHU COVID-19 repository is a git submodule. This was done to automate getting the latest data from their repo.
  • the scripts only work in *nix environments (Linux, Unix, macOS)
  • both scripts maintain a hidden directory called ~/.covid-19 in your home directory. covid-19_ingest.sh keeps a file named lastcsvprocessed there. Delete that file to process all daily-report files from the beginning, or change the date in the file to start processing files AFTER the entered date.
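The lastcsvprocessed bookkeeping can be sketched as follows (a simplified stand-in for what covid-19_ingest.sh does, using hard-coded file names instead of the real daily reports):

```shell
# Simplified sketch of the ingest script's bookkeeping; not the actual script.
STATE_DIR="${HOME}/.covid-19"
STATE_FILE="${STATE_DIR}/lastcsvprocessed"
mkdir -p "$STATE_DIR"

# JHU daily reports are named MM-DD-YYYY.csv; rewrite to YYYYMMDD so that
# plain string comparison orders the dates correctly.
sortable() {
  local f="${1%.csv}"               # e.g. 03-25-2020
  echo "${f:6:4}${f:0:2}${f:3:2}"   # -> 20200325
}

last="$(cat "$STATE_FILE" 2>/dev/null || echo 00000000)"
for csv in 03-24-2020.csv 03-25-2020.csv; do   # stand-ins for real report files
  if [ "$(sortable "$csv")" \> "$last" ]; then
    echo "processing $csv"                      # real script would ingest here
    last="$(sortable "$csv")"
  fi
done
echo "$last" > "$STATE_FILE"
```

Deleting the state file resets `last` to 00000000, which is why every daily report gets reprocessed from the beginning.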

TODO

  • use PostgREST to add a REST API in front of the TimescaleDB database
  • create a Grafana dashboard
  • create a Carto visualization
  • create a Superset visualization

ACKNOWLEDGEMENTS

  • thanks to Avtar Sewrathan (@avthars), Prashant Sridharan (@CoolAssPuppy) and Mike Freedman (@michaelfreedman) at Timescale for their help & support to implement this project from idea to implementation in 5 days!
  • thanks to Julian Simioni (@orangejulius) at Geocode.earth for allowing us to use the Geocode.earth API!


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
