cleaning up for beta2 release
jqnatividad committed Mar 6, 2020
1 parent f2dffc9 commit 40c012a
Showing 5 changed files with 67 additions and 34 deletions.
33 changes: 14 additions & 19 deletions README.md
@@ -1,18 +1,14 @@
# COVID-19 - time-series utilities

This repo contains several utilities for wrangling COVID-19 data from Johns Hopkins University.

* A shell script that converts the JHU COVID-19 daily-report data to a time-series database using TimescaleDB.
* Several OpenRefine projects that "unpivot" (normalize) JHU's COVID-19 time-series csv file
This repo contains several utilities for wrangling COVID-19 data from the [Johns Hopkins University COVID-19 repository](https://github.com/CSSEGISandData/COVID-19).

## Requirements
* For time-series ingestion script
- A working instance of [TimescaleDB](https://docs.timescale.com) in PostgreSQL v10+
- [csvkit](https://csvkit.readthedocs.io/en/latest/)
- [git](https://git-scm.com/)
- Unix/Linux operating system with bash
* OpenRefine time-series automation
- [OpenRefine](http://openrefine.org) - installed automatically
* A working instance of [TimescaleDB](https://docs.timescale.com) in PostgreSQL v10+
* [csvkit](https://csvkit.readthedocs.io/en/latest/)
* [git](https://git-scm.com/)
* Unix/Linux operating system with bash
* for OpenRefine time-series automation
- [OpenRefine](http://openrefine.org) - installed automatically
- [openrefine-batch](https://github.com/opencultureconsulting/openrefine-batch) - included
- a [Geocode.earth](https://geocode.earth) API key - [install](https://github.com/pelias/pelias) or [free trial](https://geocode.earth/invite/request?referrer=datHere). Used to enrich geographic data.

@@ -24,9 +20,9 @@ A note on cloning this repo, since the COVID19 directory is a git submodule:
## Content
The files in this directory and how they're used:

* `covid19-ingest.sh`: Bash script to read daily-report data from [JHU COVID-19 Github](https://github.com/CSSEGISandData/COVID-19), and upsert the data into TimescaleDB.
* `covid19-ingest.sh`: script that converts the JHU COVID-19 [daily-report data](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports) to a time-series database using TimescaleDB.
* `covid-refine`: OpenRefine automation script that converts JHU COVID-19 [time-series data](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series) into a normalized, enriched format and uploads it to TimescaleDB.
* `schema.sql`: Data definition (DDL) to create the necessary tables & hypertables.
* `covid-refine`: OpenRefine automation to create fully normalized COVID-19 time-series data, with expanded geographic features

## Using the Timescale covid19-ingest script
1. Create a TimescaleDB instance - [download](https://docs.timescale.com/latest/getting-started/installation) or [signup](https://www.timescale.com/cloud-signup)
@@ -60,17 +56,16 @@ The files in this directory and how they're used:

8. Be able to slice-and-dice the data using the full power of PostgreSQL along with Timescale's time-series capabilities!

## Using the COVIDrefine
See the detailed [README](covid19-refine/README.md).
## Using COVIDrefine
See the detailed [README](covid19-refine/).

## NOTES
- the JHU COVID-19 repository is a git submodule. This was required to automate getting the latest data from their repo.
- the JHU COVID-19 repository is a git submodule. This was done to automate getting the latest data from their repo.
- the script will only work in a \*nix environment (Linux, Unix, macOS)
- the script creates a hidden directory called `~/.covid-19` in your home directory where it stores the date of the last csv it processed in a file named `lastcsvprocessed`. Delete that file to process all the files from the beginning, or change the date in the file to start processing files AFTER the entered date.
- if you need the normalized COVID-19 time-series data on another operating system, use the OpenRefine projects. OpenRefine is cross-platform and has pre-built binaries for Windows, Mac and Linux.
- both scripts maintain a hidden directory called `~/.covid-19` in your home directory.
- `covid19-ingest.sh` checks `lastcsvprocessed`. Delete that file to process all daily-report files from the beginning, or change the date in the file to start processing files AFTER the entered date.
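The checkpoint behavior described above can be sketched as follows. This is a self-contained illustration that uses a temp directory in place of `$HOME`; the real script keeps the file at `~/.covid-19/lastcsvprocessed`:

```shell
# Illustration of the lastcsvprocessed checkpoint file described above.
# A temp directory stands in for ~/.covid-19 so this runs anywhere.
state_dir="$(mktemp -d)"

# Simulate a previous run that stopped after the 03-04-2020 daily report.
echo "03-04-2020" > "$state_dir/lastcsvprocessed"

# Option 1: delete the file to reprocess every daily report from the start.
rm "$state_dir/lastcsvprocessed"

# Option 2: rewind the checkpoint; reports dated AFTER this date are reprocessed.
echo "03-01-2020" > "$state_dir/lastcsvprocessed"
cat "$state_dir/lastcsvprocessed"   # prints 03-01-2020
```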

## TODO
- add postgis to schema for spatial analysis
- use [postgREST](http://postgrest.org) to add a REST API in front of TimescaleDB database
- create a Grafana dashboard
- create a Carto visualization
4 changes: 4 additions & 0 deletions covid19-ingest.sh
@@ -47,8 +47,12 @@ psql \
-c "\COPY covid19_locations(province_state,country_region,latitude,longitude) \
FROM '/tmp/locations.csv' DELIMITER ',' CSV HEADER FORCE NOT NULL province_state;"

psql -c "UPDATE covid19_locations
SET location_geom = ST_Transform(ST_SetSRID(ST_MakePoint(longitude, latitude), 4326), 2163);"

echo -e -n "\nVacuuming/Analyzing database..."
psql -q -c "VACUUM FULL ANALYZE covid19_ts, import_covid19_ts, covid19_locations;"
psql -q -c "REFRESH MATERIALIZED VIEW daily_change;"
psql -t -c "SELECT count(*) || ' rows' from covid19_ts;"

end_time="$(date -u +%s)"
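The geometry update added above can be read inside-out: `ST_MakePoint` builds a point from the longitude/latitude columns (longitude first), `ST_SetSRID` declares those coordinates as WGS 84 (EPSG:4326), and `ST_Transform` reprojects them to the US National Atlas Equal Area projection (EPSG:2163). An annotated restatement (requires the PostGIS extension, which `schema.sql` creates):

```SQL
UPDATE covid19_locations
SET location_geom =
    ST_Transform(                              -- 3. reproject to EPSG:2163 (US National Atlas Equal Area)
        ST_SetSRID(
            ST_MakePoint(longitude, latitude), -- 1. build point: longitude (x) first, latitude (y) second
            4326),                             -- 2. tag coordinates as WGS 84 (EPSG:4326)
        2163);
```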
15 changes: 8 additions & 7 deletions covid19-refine/README.md
@@ -1,13 +1,13 @@
# covid19-refine.sh
This script automates the creation/population of a fully normalized, non-sparse, geo-enriched version of [JHU's COVID-19 time-series data](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series).

0. Be sure to run `schema.sql` as noted in the main README.
0. Be sure to follow steps 1 through 4 of the main [README](/README.md#using-the-timescale-covid19-ingest-script).

1. Run `openrefine-batch.sh` to initialize the environment. It will automatically download OpenRefine and openrefine-client. You need to do this only once, and you'll only need to run `covid19-refine.sh` for future invocations.
1. Run `openrefine-batch.sh` to initialize the environment. It will automatically download OpenRefine and openrefine-client. You need to do this only once.

` ./openrefine-batch.sh`

2. Modify the `covid19-refine.sh` file to add your Postgres connection parameters, and the Geocode.earth API key.
2. Modify the `covid19-refine.sh` file to add your Postgres connection parameters, and your Geocode.earth API key.

3. Run `covid19-refine.sh`. It will automate several OpenRefine projects to normalize and enrich the latest [JHU time-series data](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series) and insert it into Timescale. It is a relatively long-running automation (about 6 minutes), as it runs OpenRefine in headless mode and geographically enriches the location data to support filtering on additional facets (continent; for the US - locality, county, state).

@@ -17,7 +17,7 @@ This script automates the creation/population of a fully normalized,
4. Slice and dice the data using Postgres/Timescale! Note these tables/hypertables/continuous aggregates:

- `covid19_loclookup`
we assign a `loc_id` to help with joins. We also enrich the data with continent, and for the US - locality, county and state for additional aggregations ([sample](workdir/location-lookup/location-lookup.csv)).
we assign a `loc_id` to help with joins. We also geocode the data, adding `continent`; and for the US - `locality`, `county` and `state` for additional aggregations ([sample](workdir/location-lookup/location-lookup.csv)).

```SQL
CREATE TABLE IF NOT EXISTS covid19_loclookup (
@@ -30,11 +30,12 @@ This script automates the creation/population of a fully normalized,
us_state TEXT,
us_county TEXT,
continent TEXT,
location_geom geometry(POINT, 2163),
geocode_earth_json JSONB);
```

- `covid19_normalized_ts`
apart from the running totals compiled by JHU, we also compute the daily incidents for any specific date/location (e.g. how many confirmed, deaths, recoveries for each day/location). This will allow you to do aggregations for arbitrary date ranges, compute rates of confirmed/deaths/recoveries, and benchmarking across locations.
from the running totals compiled by JHU, we also derive the daily incidents for any specific date/location (e.g. how many confirmed, deaths, recoveries for each day/location). This will allow you to do aggregations for arbitrary date ranges, compute rates of confirmed/deaths/recoveries, and do benchmarking across locations.

```SQL
CREATE TABLE IF NOT EXISTS covid19_normalized_ts (
@@ -53,7 +54,7 @@ This script automates the creation/population of a fully normalized,

```SQL
CREATE VIEW confirmed_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -68,7 +69,7 @@ This script automates the creation/population of a fully normalized,
GROUP BY loc_id, bucket;

CREATE VIEW confirmed_weekly
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
AS
SELECT
loc_id,
11 changes: 10 additions & 1 deletion covid19-refine/covid19-refine.sh
@@ -20,6 +20,7 @@ if [ -f ~/.covid-19/covidrefinelastrun ]; then
if [[ $last_modified < $covidrefinelastrun ]]; then
echo "ABORTED. JHU's COVID-19 data has not been modified since the last run."
echo -e "JHU data last modified: $last_modified\nLast Run: $covidrefinelastrun"
echo -e "HINT: Delete or modify ~/.covid-19/covidrefinelastrun"
exit
fi
fi
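The guard above relies on `[[ ... < ... ]]`, which compares strings lexicographically; for ISO-8601 timestamps, lexicographic order matches chronological order. A sketch with hypothetical values:

```shell
# Hypothetical timestamps illustrating the last-run guard above.
last_modified="2020-03-05T10:00:00Z"       # assumed JHU repo commit time
covidrefinelastrun="2020-03-06T08:00:00Z"  # assumed contents of ~/.covid-19/covidrefinelastrun

# Lexicographic [[ < ]] orders ISO-8601 timestamps correctly,
# so a last_modified earlier than the last run triggers the abort.
if [[ $last_modified < $covidrefinelastrun ]]; then
  echo "ABORTED. JHU's COVID-19 data has not been modified since the last run."
fi
```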
@@ -52,6 +53,8 @@ csvcut -x \
psql -c \
"\COPY covid19_loclookup(loc_id,province_state,country_region,latitude,longitude,us_locality,us_state,us_county,continent,geocode_earth_json) \
FROM 'workdir/location-lookup/location-lookup.csv' DELIMITER ',' CSV HEADER ;"
psql -c "UPDATE covid19_loclookup
SET location_geom = ST_Transform(ST_SetSRID(ST_MakePoint(longitude, latitude), 4326), 2163);"

echo -e -n " Populating deaths table..."
psql -q -c "TRUNCATE TABLE import_covid19_deaths;"
@@ -72,7 +75,7 @@ psql -c \
FROM 'workdir/normalize/output/time_series_19-covid-Recovered.csv' DELIMITER ',' CSV HEADER;"

echo -e " Collating normalized data... "
psql -q -c "TRUNCATE TABLE covid19_normalized_ts;"
psql -q -t -c "SELECT drop_chunks(NOW() + interval '1 year', 'covid19_normalized_ts', cascade_to_materializations=>true);"
psql -q -c "INSERT INTO covid19_normalized_ts \
SELECT a.loc_id, a.observation_date, a.observation_count as confirmed_total, \
b.observation_count as deaths_total, c.observation_count as recovered_total, 0, 0, 0 \
@@ -86,6 +89,12 @@ psql -q -t -o /dev/null -c "select derive_daily_counts();"

echo -e -n "\nVacuuming/Analyzing database...\n"
psql -q -c "VACUUM FULL ANALYZE covid19_normalized_ts, covid19_loclookup;"
psql -q -c "REFRESH MATERIALIZED VIEW confirmed_3days;"
psql -q -c "REFRESH MATERIALIZED VIEW deaths_3days;"
psql -q -c "REFRESH MATERIALIZED VIEW recovered_3days;"
psql -q -c "REFRESH MATERIALIZED VIEW confirmed_weekly;"
psql -q -c "REFRESH MATERIALIZED VIEW deaths_weekly;"
psql -q -c "REFRESH MATERIALIZED VIEW recovered_weekly;"
psql -t -c "SELECT count(*) || ' rows' FROM covid19_normalized_ts;"
psql -t -c "SELECT count(*) || ' locations' FROM covid19_loclookup;"
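The six `REFRESH MATERIALIZED VIEW` statements added above could equally be generated in a loop. This sketch prints the SQL rather than executing it; piping its output to `psql -q` would have the same effect as the six separate calls:

```shell
# Generate the six REFRESH statements for the 3-day and weekly aggregates.
for view in confirmed deaths recovered; do
  for period in 3days weekly; do
    echo "REFRESH MATERIALIZED VIEW ${view}_${period};"
  done
done
```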

38 changes: 31 additions & 7 deletions schema.sql
@@ -3,6 +3,7 @@
\c covid_19

CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE IF NOT EXISTS covid19_ts (
province_state TEXT,
@@ -24,6 +25,7 @@ CREATE TABLE IF NOT EXISTS covid19_locations (
country_region TEXT NOT NULL,
latitude NUMERIC NOT NULL,
longitude NUMERIC NOT NULL,
location_geom geometry(POINT, 2163),
PRIMARY KEY (province_state, country_region));

-- OpenRefine tables
@@ -38,6 +40,7 @@ CREATE TABLE IF NOT EXISTS covid19_loclookup (
us_state TEXT,
us_county TEXT,
continent TEXT,
location_geom geometry(POINT, 2163),
geocode_earth_json JSONB);
ALTER TABLE covid19_loclookup OWNER TO covid19_user;
CREATE INDEX IF NOT EXISTS geocode_earth_json_idx ON covid19_loclookup USING GIN (geocode_earth_json);
@@ -76,9 +79,30 @@ select create_hypertable('covid19_normalized_ts', 'observation_date');
-- we need to DROP VIEW CASCADE as there are underlying Timescale structures.
-- CREATE OR REPLACE VIEW doesn't work with Timescale's continuous aggregates

DROP VIEW IF EXISTS daily_change CASCADE;
CREATE VIEW daily_change
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-2 days')
AS
SELECT
country_region,
province_state,
time_bucket('2 days', observation_date) as bucket,
first(confirmed, observation_date) as confirmed_yesterday,
last(confirmed, observation_date) as confirmed_today,
last(confirmed, observation_date) - first(confirmed, observation_date) as confirmed_change,
first(deaths, observation_date) as deaths_yesterday,
last(deaths, observation_date) as deaths_today,
last(deaths, observation_date) - first(deaths, observation_date) as deaths_change,
first(recovered, observation_date) as recovered_yesterday,
last(recovered, observation_date) as recovered_today,
last(recovered, observation_date) - first(recovered, observation_date) as recovered_change
FROM
covid19_ts
GROUP BY country_region, province_state, bucket;
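Once populated, the `daily_change` aggregate above supports quick leaderboard-style queries. A sketch, assuming data has already been ingested:

```SQL
-- Hypothetical query: largest 2-day increases in confirmed cases.
SELECT country_region, province_state, bucket, confirmed_change
FROM daily_change
ORDER BY bucket DESC, confirmed_change DESC
LIMIT 10;
```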

DROP VIEW IF EXISTS confirmed_3days CASCADE;
CREATE VIEW confirmed_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -94,7 +118,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS deaths_3days CASCADE;
CREATE VIEW deaths_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -110,7 +134,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS recovered_3days CASCADE;
CREATE VIEW recovered_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -126,7 +150,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS confirmed_3days CASCADE;
CREATE VIEW confirmed_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -142,7 +166,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS confirmed_weekly CASCADE;
CREATE VIEW confirmed_weekly
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
AS
SELECT
loc_id,
@@ -158,7 +182,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS deaths_weekly CASCADE;
CREATE VIEW deaths_weekly
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
AS
SELECT
loc_id,
@@ -174,7 +198,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS recovered_weekly CASCADE;
CREATE VIEW recovered_weekly
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
AS
SELECT
loc_id,
