
Smarter automatic deduplication #31

@jqnatividad

Automatic deduplication works well (#25). However, when duplicates are found and removed, the datastore table and the resource file are no longer in sync.

Smarter dedup can be handled three ways. When dupes are found:

  1. Stop the DP+ job and show the dupe error in the Datastore tab.
  2. Replace the resource file with the deduplicated CSV.
  3. Take advantage of qsv dedup's --dupes-output option and create two new resources, RESOURCENAME_dupes.csv and RESOURCENAME_dedupped.csv, which are pushed to the Datastore. The original resource with dupes is NOT pushed. The Data Publisher can then use the CKAN interface to manage which resources to keep (e.g. delete the original and the _dupes resources, then rename the _dedupped resource to remove the _dedupped suffix).
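
To make option 3 concrete, here is a rough sketch of the split it describes. It is a pure-Python stand-in for illustration only, not how qsv or DP+ actually do it: `split_duplicates` and `derived_resource_names` are hypothetical helpers, and the sketch assumes a header row and treats a row as a dupe only when every column matches an earlier row.

```python
import csv
import io


def split_duplicates(csv_text):
    """Split CSV text into (deduped_text, dupes_text).

    Assumption: the first line is a header, and a row is a duplicate
    only if all of its columns match a row seen earlier.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    seen = set()
    deduped, dupes = [], []
    for row in reader:
        key = tuple(row)
        if key in seen:
            dupes.append(row)      # repeat of an earlier row
        else:
            seen.add(key)
            deduped.append(row)    # first occurrence is kept

    def to_text(rows):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
        return out.getvalue()

    return to_text(deduped), to_text(dupes)


def derived_resource_names(resource_name):
    """Return the (_dupes, _dedupped) file names proposed in option 3."""
    if resource_name.lower().endswith(".csv"):
        stem = resource_name[:-4]
    else:
        stem = resource_name
    return f"{stem}_dupes.csv", f"{stem}_dedupped.csv"


if __name__ == "__main__":
    deduped, dupes = split_duplicates("id,name\n1,a\n2,b\n1,a\n")
    print(derived_resource_names("parks.csv"))
    print(deduped)
    print(dupes)
```

In this sketch, an empty dupes output (header only) would mean no duplicates were found, in which case the original resource could be pushed unchanged and neither new resource would be needed.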


Labels: enhancement (New feature or request)
