pgdedupe

pgdedupe is a work-in-progress that provides a standard interface for deduplicating large databases, with custom pre-processing and post-processing steps.

Free software: MIT license
Documentation: https://pgdedupe.readthedocs.io.

Interface

This package provides a simple command-line program, pgdedupe. Two configuration files specify the deduplication parameters and the database connection settings. To run deduplication on a generated dataset, first create a database.yml file that specifies the following connection parameters:

user:
password:
database:
host:
port:
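
As an illustration of how these five parameters fit together, the following minimal sketch reads a flat file of this shape and turns it into a libpq-style connection string. It is not pgdedupe's own loading code: the function names parse_flat_yaml and build_dsn are hypothetical, the hand-written parser only handles simple "key: value" lines (a real program would more likely use a YAML library), and treating all five keys as required is an assumption.

```python
# Hypothetical helper names; this is a sketch, not pgdedupe's implementation.
REQUIRED_KEYS = ("user", "password", "database", "host", "port")


def parse_flat_yaml(text):
    """Parse flat `key: value` lines (a tiny subset of YAML, for illustration)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def build_dsn(settings):
    """Check that all five parameters are set and return a connection string."""
    missing = [k for k in REQUIRED_KEYS if not settings.get(k)]
    if missing:
        raise ValueError("missing database settings: " + ", ".join(missing))
    # libpq names the database keyword `dbname`; the file calls it `database`.
    libpq_names = {"database": "dbname"}
    parts = []
    for key in REQUIRED_KEYS:
        # Quote every value, escaping backslashes and single quotes.
        value = settings[key].replace("\\", "\\\\").replace("'", "\\'")
        parts.append("{}='{}'".format(libpq_names.get(key, key), value))
    return " ".join(parts)
```

A string built this way could then be handed to a PostgreSQL driver, for example psycopg2.connect(dsn). Whether pgdedupe connects in this manner is not stated here.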

You can now create a sample CSV file with:

$ python generate_fake_dataset.py --csv people.csv
creating people: 100%|█████████████████████| 9500/9500 [00:21<00:00, 445.38it/s]
adding twins: 100%|█████████████████████████| 500/500 [00:00<00:00, 1854.72it/s]
writing csv:  47%|███████████▋             | 4666/10000 [00:42<00:55, 96.28it/s]

Once complete, store this example dataset in a database with:

$ python test/initialize_db.py --db database.yml --csv people.csv
CREATE SCHEMA
DROP TABLE
CREATE TABLE
COPY 197617
ALTER TABLE
ALTER TABLE
UPDATE 197617

Now you can deduplicate this dataset. The following command runs dedupe along with the custom pre-processing and post-processing steps defined in config.yml:

$ pgdedupe --config config.yml --db database.yml

Custom pre- and post-processing

In addition to running a database-level deduplication with dedupe, this script adds custom pre- and post-processing steps to improve the run-time and results, making this a hybrid between fuzzy matching and record linkage.

Pre-processing: Before running dedupe, this script performs an exact-match deduplication. Some systems create many identical rows, which makes it hard for dedupe to build an effective blocking strategy and generally makes the fuzzy matching much harder and more time-consuming.
Post-processing: After running dedupe, this script optionally performs an exact-match merge across subsets of columns. For example, in some instances an exact match on just the last name and social security number is sufficient evidence that two clusters are indeed the same identity.
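
The two steps can be sketched in plain Python on in-memory rows. This is only an illustration of the idea, not the code pgdedupe runs, which works at the database level. The function names, the "id" field, and the example column names (first_name, last_name, ssn) are all hypothetical, and skipping rows with missing values in the merge step is an assumption.

```python
from collections import defaultdict


def collapse_exact_duplicates(rows, columns):
    """Pre-processing sketch: group rows that are identical on `columns`.

    Returns {tuple of values: [ids of the rows sharing those values]}, so
    only one representative per group needs to go through fuzzy matching.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in columns)].append(row["id"])
    return dict(groups)


def merge_clusters_on_exact_match(rows, cluster_of, columns):
    """Post-processing sketch: merge clusters that share exact values.

    `cluster_of` maps row id -> cluster id (as a fuzzy matcher might assign).
    Clusters containing rows with identical, non-missing values on `columns`
    are merged with a small union-find; the lower cluster id is kept.
    """
    parent = {c: c for c in cluster_of.values()}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    first_cluster_for_key = {}
    for row in rows:
        key = tuple(row[c] for c in columns)
        if any(v is None for v in key):
            continue  # assumption: missing values are never evidence of a match
        root = find(cluster_of[row["id"]])
        other = find(first_cluster_for_key.setdefault(key, root))
        if other != root:
            parent[max(root, other)] = min(root, other)
    return {row_id: find(c) for row_id, c in cluster_of.items()}


rows = [
    {"id": 1, "first_name": "Ann", "last_name": "Smith", "ssn": "111"},
    {"id": 2, "first_name": "Ann", "last_name": "Smith", "ssn": "111"},
    {"id": 3, "first_name": "Anne", "last_name": "Smith", "ssn": "111"},
    {"id": 4, "first_name": "Bob", "last_name": "Jones", "ssn": None},
]
# Rows 1 and 2 are identical, so they collapse into one group.
groups = collapse_exact_duplicates(rows, ["first_name", "last_name", "ssn"])
# Suppose fuzzy matching left "Ann" and "Anne" in separate clusters 0 and 1.
merged = merge_clusters_on_exact_match(
    rows, {1: 0, 2: 0, 3: 1, 4: 2}, ["last_name", "ssn"]
)
print(groups)
print(merged)
```

In this example the merge on last name and social security number folds cluster 1 into cluster 0, while the row with a missing social security number stays in its own cluster.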

Further steps

This script is based on, and extends, the example in dedupe-examples. It would be nice to offer this common interface across all database types, and potentially to allow reading from flat CSV files as well.
