ETL is handled by drake, which is expected to be called from the top-level (root) directory of this repository. All paths assume that drake was invoked from the repository root.
There are four conceptual steps: extract, clean, deduplicate, and process.
The scripts here expect to find a zipped data dump in a `data` directory in the repository root. The pipeline extracts that archive and restores it into the database. The database configuration is specified in a `config` directory in the repository root.
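For orientation, the extract-and-restore step boils down to something like the sketch below. The archive and dump file names, the config file name and format, and the `pg_restore` options are all assumptions here, not the actual drake targets:

```python
import subprocess
import zipfile
from pathlib import Path

import yaml  # assumption: the database config is YAML; adjust to the real format

DATA_DIR = Path("data")                          # zipped data dump lives here
CONFIG_FILE = Path("config") / "database.yaml"   # hypothetical config file name


def restore_dump() -> None:
    """Unzip the data dump and restore it into the configured database."""
    archive = next(DATA_DIR.glob("*.zip"))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(DATA_DIR)

    db = yaml.safe_load(CONFIG_FILE.read_text())  # e.g. {"dbname": ..., "user": ..., "host": ...}

    # Restore each extracted dump file; the raw tables land in the public schema.
    for dump in sorted(DATA_DIR.glob("*.dump")):
        subprocess.run(
            ["pg_restore", "--no-owner",
             "-h", db["host"], "-U", db["user"], "-d", db["dbname"], str(dump)],
            check=True,
        )


if __name__ == "__main__":
    restore_dump()
```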
The raw tables, as restored from the database dump, are placed into the `public` schema of the database. Before being used by the pipeline, all tables go through a cleaning and normalization process: every script in the `etl/data_cleaning` directory is executed at this point.
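In effect, cleaning amounts to executing each of those scripts against the database, roughly as in this sketch (the connection parameters are placeholders for whatever lives in `config`, and alphabetical ordering is an assumption):

```python
from pathlib import Path

import psycopg2  # the restored dump implies a PostgreSQL database


def run_sql_directory(conn, directory: str) -> None:
    """Execute every .sql file in `directory`, one transaction per script."""
    for script in sorted(Path(directory).glob("*.sql")):
        with conn.cursor() as cur:
            cur.execute(script.read_text())
        conn.commit()


if __name__ == "__main__":
    # Placeholder connection parameters; the real ones come from config/.
    with psycopg2.connect(dbname="etl", user="etl", host="localhost") as conn:
        run_sql_directory(conn, "etl/data_cleaning")
```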
Deduplication is handled by superdeduper. A SQL script creates a master `entries` table by combining all relevant columns from all the data sources, and superdeduper is then run with the saved configuration in `etl/dedupe/config.yaml`. After a `dedupe_id` is appended to the `entries` table, the `apply_results.py` script goes back to the specified tables in the `clean` schema and appends a `dedupe_id` column to each.
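The shape of that hand-off is roughly as follows; the source tables, columns, and join keys below are invented for illustration, and this is not the actual SQL or `apply_results.py` logic:

```python
import psycopg2

# Illustrative only -- real table names, columns, and dedupe keys differ.
# The master table is built by a SQL script roughly of this shape:
#
#   CREATE TABLE entries AS
#   SELECT 'source_a' AS source, id AS source_id, name, address FROM clean.source_a
#   UNION ALL
#   SELECT 'source_b' AS source, id AS source_id, name, address FROM clean.source_b;
#
# superdeduper then appends a dedupe_id column to entries.


def apply_dedupe_results(conn, table: str, source_label: str) -> None:
    """Copy dedupe_id values from the entries table back onto a clean-schema table."""
    with conn.cursor() as cur:
        cur.execute(
            f"ALTER TABLE clean.{table} ADD COLUMN IF NOT EXISTS dedupe_id integer"
        )
        cur.execute(
            f"""
            UPDATE clean.{table} AS t
               SET dedupe_id = e.dedupe_id
              FROM entries AS e
             WHERE e.source = %s AND e.source_id = t.id
            """,
            (source_label,),
        )
    conn.commit()


if __name__ == "__main__":
    # Placeholder connection parameters; the real ones come from config/.
    with psycopg2.connect(dbname="etl", user="etl", host="localhost") as conn:
        for source_table in ("source_a", "source_b"):
            apply_dedupe_results(conn, source_table, source_table)
```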
Finally, a few SQL scripts create computed tables for convenience, such as a timeline of events. Every script in the `etl/data_processing` directory is executed at this point.
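For a sense of what these computed tables look like, a timeline script might reduce to something like the following; the source tables and columns are invented for illustration:

```python
import psycopg2

# Hypothetical computed table: a single timeline of events gathered from
# several clean-schema tables. Table and column names are illustrative.
BUILD_TIMELINE = """
CREATE TABLE timeline AS
SELECT entity_id, filing_date AS event_date, 'filing' AS event_type
  FROM clean.filings
UNION ALL
SELECT entity_id, inspection_date AS event_date, 'inspection' AS event_type
  FROM clean.inspections
ORDER BY entity_id, event_date;
"""

if __name__ == "__main__":
    # Placeholder connection parameters; the real ones come from config/.
    with psycopg2.connect(dbname="etl", user="etl", host="localhost") as conn:
        with conn.cursor() as cur:
            cur.execute(BUILD_TIMELINE)
```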