Identify duplicate datasets #23

adamreichold · 2022-07-26T18:16:49Z

It is quite likely that we will harvest the same dataset from multiple sources, e.g. "Zoo Leipzig Jahreszahlen" can be harvested from both govdata.de and opendata.leipzig.de under different IDs.

The DCAT-AP.de implementation guide describes how to identify duplicates based on the dct:identifier field. In this case, that field forwards the ID from opendata.leipzig.de into the catalogue at govdata.de via a CKAN "extra" field called identifier. (Additionally, its full URL is available via the guid field.)
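A minimal sketch of identifier-based matching might look like the following. It assumes packages arrive as CKAN-style dictionaries whose `extras` are a list of `key`/`value` pairs, and that the package ID at the originating catalogue equals the identifier forwarded downstream; neither is confirmed for the catalogues named here. The function names and sample IDs are hypothetical.

```python
from collections import defaultdict


def extra_value(package, key):
    """Return the value of the CKAN "extra" named `key`, or None if absent."""
    for extra in package.get("extras", []):
        if extra.get("key") == key:
            return extra.get("value")
    return None


def duplicate_key(package):
    """Prefer the forwarded identifier; fall back to the package's own ID.

    Assumption: at the originating catalogue the package ID equals the
    identifier that downstream catalogues forward.
    """
    return extra_value(package, "identifier") or package["id"]


def group_duplicates(harvested):
    """Map each key to the (source, package ID) pairs that share it."""
    groups = defaultdict(list)
    for source, package in harvested:
        groups[duplicate_key(package)].append((source, package["id"]))
    # Only keys seen more than once indicate duplicates.
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical sample data; the IDs are made up for illustration.
harvested = [
    ("opendata.leipzig.de", {"id": "leipzig-zoo-1", "extras": []}),
    ("govdata.de", {"id": "govdata-abc",
                    "extras": [{"key": "identifier", "value": "leipzig-zoo-1"}]}),
    ("govdata.de", {"id": "govdata-xyz", "extras": []}),
]
duplicates = group_duplicates(harvested)
```

Here `duplicates` contains a single group keyed by the forwarded identifier, while the unrelated govdata.de package is left out.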

Since this will only work for catalogues participating in DCAT-AP.de pipelines, it might be simpler to resolve duplicates based on the URL of the data itself. In this case that is https://statistik.leipzig.de/opendata/api/values?kategorie_nr=11&rubrik_nr=4&periode=y&format=csv, which should identify the dataset independently of any intermediaries that publish and identify it.
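URL-based matching could be sketched roughly as below. The normalisation rules (lowercasing scheme and host, sorting query parameters, dropping the fragment) are an assumption, since the same resource may be written slightly differently by different catalogues; the record layout with `source`, `id` and `urls` keys and the sample IDs are hypothetical as well.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of `url` for comparison purposes."""
    parts = urlsplit(url.strip())
    # Sort query parameters so that their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )


def find_duplicates(datasets):
    """Map each normalised resource URL to the datasets that reference it."""
    by_url = defaultdict(list)
    for dataset in datasets:
        # A set avoids counting one dataset twice for the same resource.
        for url in {normalize_url(u) for u in dataset["urls"]}:
            by_url[url].append((dataset["source"], dataset["id"]))
    return {url: refs for url, refs in by_url.items() if len(refs) > 1}


# Hypothetical records; only the resource URL is taken from the issue text.
RESOURCE = (
    "https://statistik.leipzig.de/opendata/api/values"
    "?kategorie_nr=11&rubrik_nr=4&periode=y&format=csv"
)
datasets = [
    {"source": "opendata.leipzig.de", "id": "leipzig-zoo-1", "urls": [RESOURCE]},
    {"source": "govdata.de", "id": "govdata-abc", "urls": [RESOURCE]},
]
url_duplicates = find_duplicates(datasets)
```

One caveat with this design is that a dataset with several resources can match several groups, so the groups may need to be merged before deciding which copy to keep.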


adamreichold mentioned this issue Jul 30, 2022

Explicitly handle duplicate dataset identifiers per source #7

Closed
