cleaning up for beta2 release
jqnatividad committed Mar 6, 2020
1 parent f2dffc9 commit 40c012a
Showing 5 changed files with 67 additions and 34 deletions.
33 changes: 14 additions & 19 deletions README.md
@@ -1,18 +1,14 @@
# COVID-19 - time-series utilities

This repo contains several utilities for wrangling COVID-19 data from Johns Hopkins University.

* A shell script that converts the JHU COVID-19 daily-report data to a time-series database using TimescaleDB.
* Several OpenRefine projects that "unpivot" (normalize) JHU's COVID-19 time-series csv file
This repo contains several utilities for wrangling COVID-19 data from the [Johns Hopkins University COVID-19 repository](https://github.com/CSSEGISandData/COVID-19).

## Requirements
* For time-series ingestion script
- A working instance of [TimescaleDB](https://docs.timescale.com) in PostgreSQL v10+
- [csvkit](https://csvkit.readthedocs.io/en/latest/)
- [git](https://git-scm.com/)
- Unix/Linux operating system with bash
* OpenRefine time-series automation
- [OpenRefine](http://openrefine.org) - installed automatically
* A working instance of [TimescaleDB](https://docs.timescale.com) in PostgreSQL v10+
* [csvkit](https://csvkit.readthedocs.io/en/latest/)
* [git](https://git-scm.com/)
* Unix/Linux operating system with bash
* for OpenRefine time-series automation
- [OpenRefine](http://openrefine.org) - installed automatically
- [openrefine-batch](https://github.com/opencultureconsulting/openrefine-batch) - included
- a [Geocode.earth](https://geocode.earth) API key - [install](https://github.com/pelias/pelias) or [free trial](https://geocode.earth/invite/request?referrer=datHere). Used to enrich geographic data.

@@ -24,9 +20,9 @@ A note on cloning this repo, since the COVID19 directory is a git submodule:
## Content
The files in this directory and how they're used:

* `covid19-ingest.sh`: Bash script to read daily-report data from [JHU COVID-19 Github](https://github.com/CSSEGISandData/COVID-19), and upsert the data into TimescaleDB.
* `covid19-ingest.sh`: script that converts the JHU COVID-19 [daily-report data](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports) to a time-series database using TimescaleDB.
* `covid-refine`: OpenRefine automation script that converts JHU COVID-19 [time-series data](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series) into a normalized, enriched format and uploads it to TimescaleDB.
* `schema.sql`: Data definition (DDL) to create the necessary tables & hypertables.
* `covid-refine`: OpenRefine automation to create fully normalized COVID-19 time-series data, with expanded geographic features

## Using the Timescale covid19-ingest script
1. Create a TimescaleDB instance - [download](https://docs.timescale.com/latest/getting-started/installation) or [signup](https://www.timescale.com/cloud-signup)
@@ -60,17 +56,16 @@ The files in this directory and how they're used:

8. Be able to slice-and-dice the data using the full power of PostgreSQL along with Timescale's time-series capabilities!

## Using the COVIDrefine
See the detailed [README](covid19-refine/README.md).
## Using COVIDrefine
See the detailed [README](covid19-refine/).

## NOTES
- the JHU COVID-19 repository is a git submodule. This was required to automate getting the latest data from their repo.
- the JHU COVID-19 repository is a git submodule. This was done to automate getting the latest data from their repo.
- the script will only work in a \*nix environment (Linux, Unix, macOS)
- the script creates a hidden directory called `~/.covid-19` in your home directory where it stores the date of the last csv it processed in a file named `lastcsvprocessed`. Delete that file to process all the files from the beginning, or change the date in the file to start processing files AFTER the entered date.
- if you need the normalized COVID-19 time-series data on another operating system, use the OpenRefine projects. OpenRefine is cross-platform and has pre-built binaries for Windows, Mac and Linux.
- both scripts maintain a hidden directory called `~/.covid-19` in your home directory.
- `covid19-ingest.sh` checks `lastcsvprocessed`. Delete that file to process all daily-report files from the beginning, or change the date in the file to start processing files AFTER the entered date.
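The checkpoint behavior described above can be sketched as follows. This is a self-contained illustration that uses a temp directory in place of `$HOME`; the real script keeps the file at `~/.covid-19/lastcsvprocessed`:

```shell
# Illustration of the lastcsvprocessed checkpoint file described above.
# A temp directory stands in for ~/.covid-19 so this runs anywhere.
state_dir="$(mktemp -d)"

# Simulate a previous run that stopped after the 03-04-2020 daily report.
echo "03-04-2020" > "$state_dir/lastcsvprocessed"

# Option 1: delete the file to reprocess every daily report from the start.
rm "$state_dir/lastcsvprocessed"

# Option 2: rewind the checkpoint; reports dated AFTER this date are reprocessed.
echo "03-01-2020" > "$state_dir/lastcsvprocessed"
cat "$state_dir/lastcsvprocessed"   # prints 03-01-2020
```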

## TODO
- add postgis to schema for spatial analysis
- use [postgREST](http://postgrest.org) to add a REST API in front of TimescaleDB database
- create a Grafana dashboard
- create a Carto visualization
4 changes: 4 additions & 0 deletions covid19-ingest.sh
@@ -47,8 +47,12 @@ psql \
-c "\COPY covid19_locations(province_state,country_region,latitude,longitude) \
FROM '/tmp/locations.csv' DELIMITER ',' CSV HEADER FORCE NOT NULL province_state;"

psql -c "UPDATE covid19_locations
SET location_geom = ST_Transform(ST_SetSRID(ST_MakePoint(longitude, latitude), 4326), 2163);"

echo -e -n "\nVacuuming/Analyzing database..."
psql -q -c "VACUUM FULL ANALYZE covid19_ts, import_covid19_ts, covid19_locations;"
psql -q -c "REFRESH MATERIALIZED VIEW daily_change;"
psql -t -c "SELECT count(*) || ' rows' from covid19_ts;"

end_time="$(date -u +%s)"
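The geometry update added above can be read inside-out: `ST_MakePoint` builds a point from the longitude/latitude columns (longitude first), `ST_SetSRID` declares those coordinates as WGS 84 (EPSG:4326), and `ST_Transform` reprojects them to the US National Atlas Equal Area projection (EPSG:2163). An annotated restatement (requires the PostGIS extension, which `schema.sql` creates):

```SQL
UPDATE covid19_locations
SET location_geom =
    ST_Transform(                              -- 3. reproject to EPSG:2163 (US National Atlas Equal Area)
        ST_SetSRID(
            ST_MakePoint(longitude, latitude), -- 1. build point: longitude (x) first, latitude (y) second
            4326),                             -- 2. tag coordinates as WGS 84 (EPSG:4326)
        2163);
```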
15 changes: 8 additions & 7 deletions covid19-refine/README.md
@@ -1,13 +1,13 @@
# covid19-refine.sh
This script automates the creation/population of a fully normalized, non-sparse, geo-enriched version of [JHU's COVID-19 time-series data](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series).

0. Be sure to run `schema.sql` as noted in the main README.
0. Be sure to follow steps 1 through 4 of the main [README](/README.md#using-the-timescale-covid19-ingest-script).

1. Run `openrefine-batch.sh` to initialize the environment. It will automatically download OpenRefine and openrefine-client. You need to do this only once, and you'll only need to run `covid19-refine.sh` for future invocations.
1. Run `openrefine-batch.sh` to initialize the environment. It will automatically download OpenRefine and openrefine-client. You need to do this only once.

` ./openrefine-batch.sh`

2. Modify the `covid19-refine.sh` file to add your Postgres connection parameters, and the Geocode.earth API key.
2. Modify the `covid19-refine.sh` file to add your Postgres connection parameters, and your Geocode.earth API key.

3. Run `covid19-refine.sh`. It will automate several OpenRefine projects to normalize and enrich the latest [JHU time-series data](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series) and insert it into Timescale. It is a relatively long-running automation (about 6 minutes), as it runs OpenRefine in headless mode and geographically enriches the location data to support filtering on additional facets (continent; for the US - locality, county, state).

@@ -17,7 +17,7 @@ This script automates the creation/population of a fully normalized,
4. Slice and dice the data using Postgres/Timescale! Note these tables/hypertables/continuous aggregates:

- `covid19_loclookup`
we assign a `loc_id` to help with joins. We also enrich the data with continent, and for the US - locality, county and state for additional aggregations ([sample](workdir/location-lookup/location-lookup.csv)).
we assign a `loc_id` to help with joins. We also geocode the data, adding `continent`; and for the US - `locality`, `county` and `state` for additional aggregations ([sample](workdir/location-lookup/location-lookup.csv)).

```SQL
CREATE TABLE IF NOT EXISTS covid19_loclookup (
@@ -30,11 +30,12 @@ This script automates the creation/population of a fully normalized,
us_state TEXT,
us_county TEXT,
continent TEXT,
location_geom geometry(POINT, 2163),
geocode_earth_json JSONB);
```

- `covid19_normalized_ts`
apart from the running totals compiled by JHU, we also compute the daily incidents for any specific date/location (e.g. how many confirmed, deaths, recoveries for each day/location). This will allow you to do aggregations for arbitrary date ranges, compute rates of confirmed/deaths/recoveries, and benchmarking across locations.
from the running totals compiled by JHU, we also derive the daily incidents for any specific date/location (e.g. how many confirmed, deaths, recoveries for each day/location). This will allow you to do aggregations for arbitrary date ranges, compute rates of confirmed/deaths/recoveries, and do benchmarking across locations.

```SQL
CREATE TABLE IF NOT EXISTS covid19_normalized_ts (
@@ -53,7 +54,7 @@ This script automates the creation/population of a fully normalized,

```SQL
CREATE VIEW confirmed_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -68,7 +69,7 @@ This script automates the creation/population of a fully normalized,
GROUP BY loc_id, bucket;

CREATE VIEW confirmed_weekly
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
AS
SELECT
loc_id,
11 changes: 10 additions & 1 deletion covid19-refine/covid19-refine.sh
@@ -20,6 +20,7 @@ if [ -f ~/.covid-19/covidrefinelastrun ]; then
if [[ $last_modified < $covidrefinelastrun ]]; then
echo "ABORTED. JHU's COVID-19 data has not been modified since the last run."
echo -e "JHU data last modified: $last_modified\nLast Run: $covidrefinelastrun"
echo -e "HINT: Delete or modify ~/.covid-19/covidrefinelastrun"
exit
fi
fi
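The guard above relies on `[[ ... < ... ]]`, which compares strings lexicographically; for ISO-8601 timestamps, lexicographic order matches chronological order. A sketch with hypothetical values:

```shell
# Hypothetical timestamps illustrating the last-run guard above.
last_modified="2020-03-05T10:00:00Z"       # assumed JHU repo commit time
covidrefinelastrun="2020-03-06T08:00:00Z"  # assumed contents of ~/.covid-19/covidrefinelastrun

# Lexicographic [[ < ]] orders ISO-8601 timestamps correctly,
# so a last_modified earlier than the last run triggers the abort.
if [[ $last_modified < $covidrefinelastrun ]]; then
  echo "ABORTED. JHU's COVID-19 data has not been modified since the last run."
fi
```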
@@ -52,6 +53,8 @@ csvcut -x \
psql -c \
"\COPY covid19_loclookup(loc_id,province_state,country_region,latitude,longitude,us_locality,us_state,us_county,continent,geocode_earth_json) \
FROM 'workdir/location-lookup/location-lookup.csv' DELIMITER ',' CSV HEADER ;"
psql -c "UPDATE covid19_loclookup
SET location_geom = ST_Transform(ST_SetSRID(ST_MakePoint(longitude, latitude), 4326), 2163);"

echo -e -n " Populating deaths table..."
psql -q -c "TRUNCATE TABLE import_covid19_deaths;"
@@ -72,7 +75,7 @@ psql -c \
FROM 'workdir/normalize/output/time_series_19-covid-Recovered.csv' DELIMITER ',' CSV HEADER;"

echo -e " Collating normalized data... "
psql -q -c "TRUNCATE TABLE covid19_normalized_ts;"
psql -q -t -c "SELECT drop_chunks(NOW() + interval '1 year', 'covid19_normalized_ts', cascade_to_materializations=>true);"
psql -q -c "INSERT INTO covid19_normalized_ts \
SELECT a.loc_id, a.observation_date, a.observation_count as confirmed_total, \
b.observation_count as deaths_total, c.observation_count as recovered_total, 0, 0, 0 \
@@ -86,6 +89,12 @@ psql -q -t -o /dev/null -c "select derive_daily_counts();"

echo -e -n "\nVacuuming/Analyzing database...\n"
psql -q -c "VACUUM FULL ANALYZE covid19_normalized_ts, covid19_loclookup;"
psql -q -c "REFRESH MATERIALIZED VIEW confirmed_3days;"
psql -q -c "REFRESH MATERIALIZED VIEW deaths_3days;"
psql -q -c "REFRESH MATERIALIZED VIEW recovered_3days;"
psql -q -c "REFRESH MATERIALIZED VIEW confirmed_weekly;"
psql -q -c "REFRESH MATERIALIZED VIEW deaths_weekly;"
psql -q -c "REFRESH MATERIALIZED VIEW recovered_weekly;"
psql -t -c "SELECT count(*) || ' rows' FROM covid19_normalized_ts;"
psql -t -c "SELECT count(*) || ' locations' FROM covid19_loclookup;"
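The six `REFRESH MATERIALIZED VIEW` statements added above could equally be generated in a loop. This sketch prints the SQL rather than executing it; piping its output to `psql -q` would have the same effect as the six separate calls:

```shell
# Generate the six REFRESH statements for the 3-day and weekly aggregates.
for view in confirmed deaths recovered; do
  for period in 3days weekly; do
    echo "REFRESH MATERIALIZED VIEW ${view}_${period};"
  done
done
```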

38 changes: 31 additions & 7 deletions schema.sql
@@ -3,6 +3,7 @@
\c covid_19

CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE IF NOT EXISTS covid19_ts (
province_state TEXT,
@@ -24,6 +25,7 @@ CREATE TABLE IF NOT EXISTS covid19_locations (
country_region TEXT NOT NULL,
latitude NUMERIC NOT NULL,
longitude NUMERIC NOT NULL,
location_geom geometry(POINT, 2163),
PRIMARY KEY (province_state, country_region));

-- OpenRefine tables
@@ -38,6 +40,7 @@ CREATE TABLE IF NOT EXISTS covid19_loclookup (
us_state TEXT,
us_county TEXT,
continent TEXT,
location_geom geometry(POINT, 2163),
geocode_earth_json JSONB);
ALTER TABLE covid19_loclookup OWNER TO covid19_user;
CREATE INDEX IF NOT EXISTS geocode_earth_json_idx ON covid19_loclookup USING GIN (geocode_earth_json);
@@ -76,9 +79,30 @@ select create_hypertable('covid19_normalized_ts', 'observation_date');
-- we need to DROP VIEW CASCADE as there are underlying Timescale structures.
-- CREATE OR REPLACE VIEW doesn't work with Timescale's continuous aggregates

DROP VIEW IF EXISTS daily_change CASCADE;
CREATE VIEW daily_change
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-2 days')
AS
SELECT
country_region,
province_state,
time_bucket('2 days', observation_date) as bucket,
first(confirmed, observation_date) as confirmed_yesterday,
last(confirmed, observation_date) as confirmed_today,
last(confirmed, observation_date) - first(confirmed, observation_date) as confirmed_change,
first(deaths, observation_date) as deaths_yesterday,
last(deaths, observation_date) as deaths_today,
last(deaths, observation_date) - first(deaths, observation_date) as deaths_change,
first(recovered, observation_date) as recovered_yesterday,
last(recovered, observation_date) as recovered_today,
last(recovered, observation_date) - first(recovered, observation_date) as recovered_change
FROM
covid19_ts
GROUP BY country_region, province_state, bucket;
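Once populated, the `daily_change` aggregate above supports quick leaderboard-style queries. A sketch, assuming data has already been ingested:

```SQL
-- Hypothetical query: largest 2-day increases in confirmed cases.
SELECT country_region, province_state, bucket, confirmed_change
FROM daily_change
ORDER BY bucket DESC, confirmed_change DESC
LIMIT 10;
```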

DROP VIEW IF EXISTS confirmed_3days CASCADE;
CREATE VIEW confirmed_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -94,7 +118,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS deaths_3days CASCADE;
CREATE VIEW deaths_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -110,7 +134,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS recovered_3days CASCADE;
CREATE VIEW recovered_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -126,7 +150,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS confirmed_3days CASCADE;
CREATE VIEW confirmed_3days
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-6 days')
AS
SELECT
loc_id,
@@ -142,7 +166,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS confirmed_weekly CASCADE;
CREATE VIEW confirmed_weekly
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
AS
SELECT
loc_id,
@@ -158,7 +182,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS deaths_weekly CASCADE;
CREATE VIEW deaths_weekly
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
AS
SELECT
loc_id,
@@ -174,7 +198,7 @@ GROUP BY loc_id, bucket;

DROP VIEW IF EXISTS recovered_weekly CASCADE;
CREATE VIEW recovered_weekly
WITH (timescaledb.continuous)
WITH (timescaledb.continuous, timescaledb.refresh_lag = '-14 days')
AS
SELECT
loc_id,
