COVID-19 - time-series utilities
This repo contains several utilities for wrangling COVID-19 data from the John Hopkins University COVID-19 repository.
- A working instance of TimescaleDB in PostgreSQL v10+
- Unix/Linux operating system with bash
- for OpenRefine time-series automation
- Docker with docker-compose in case you want to run it in containers (optional)
A note on cloning this repo, since the COVID19 directory is a git submodule:
- after cloning, you must initiate the submodule. In the top level directory for the project, run
git submodule initand
git submodule updateto clone the JHU Repo as a submodule
The files in this directory and how they're used:
covid-19_ingest.sh: script that converts the JHU COVID-19 daily-report data to a time-series database using TimescaleDB.
covid-refine: OpenRefine automation script that converts JHU COVID-19 time-series data into a normalized, enriched format and uploads it to TimescaleDB. (RECOMMENDED)
schema.sql: Data definition (DDL) to create the necessary tables & hypertables.
environment: Default environment values used in Docker containers.
Using the Timescale covid19-ingest script
- Create a TimescaleDB instance - download or signup
- Create a database named
covid_19, and an application user
psql create database covid_19; create user covid19_user WITH PASSWORD 'your-password-here'; alter database covid_19 OWNER TO covid19_user; \quit
covid19_user. VACUUM/ANALYZE require owner privs
psql -U covid19_user -h <the.server.hostname> -f schema.sql covid_19
sudo apt-get install csvkit
- MacOS: Using homebrew run
brew install csvkit
Using a text editor, replace the environment variables for
Run the script
(OPTIONAL) add shell script to crontab to run daily
Be able to slice-and-dice the data using the full power of PostgreSQL along with Timescale's time-series capabilities!
NOTE: Due to the changing file format of JHU's daily report data, covid-refine is recommended over
COVIDrefine has the added benefit of producing fully normalized, non-sparse, geo-enriched data.
See the detailed README.
If you just want to download the COVIDrefine data, the latest version can be found here.
- Remember initiate the submodule, run
git submodule init
- That's all. You can go to Swagger or PostgREST
- the JHU COVID-19 repository is a git submodule. This was done to automate getting the latest data from their repo.
- the script will only work in *nix environment (Linux, Unix, MacOS)
- both scripts maintain a hidden directory called
~/.covid-19in your home directory. -
lastcsvprocessed. Delete that file to process all daily-report files from the beginning, or change the date in the file to start processing files AFTER the entered date.
- use postgREST to add a REST API in front of TimescaleDB database
- create a Grafana dashboard
- create a Carto visualization
- create a Superset visualization
- thanks to Avtar Sewrathan (@avthars), Prashant Sridharan (@CoolAssPuppy) and Mike Freedman (@michaelfreedman) at Timescale for their help & support to implement this project from idea to implementation in 5 days!
- thanks to Julian Simioni (@orangejulius) at Geocode.earth for allowing us to use the Geocode.earth API!
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.