This repository has been archived by the owner on Oct 6, 2022. It is now read-only.

Identify duplicate datasets #23

Open
adamreichold opened this issue Jul 26, 2022 · 0 comments

It is quite likely that we will harvest the same dataset from multiple sources, e.g. "Zoo Leipzig Jahreszahlen" can be harvested from both govdata.de and opendata.leipzig.de under different IDs.

The DCAT-AP.de implementation guide describes how to identify duplicates based on the dct:identifier field, which in this case forwards the ID from opendata.leipzig.de into the catalogue at govdata.de via a CKAN "extra" field called identifier. (Additionally, its full URL is available via the guid field.)
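
A minimal sketch of the identifier-based approach, assuming each harvested dataset is available as a CKAN package dict with an `extras` list of `{"key": ..., "value": ...}` entries. The function names are hypothetical, and falling back to the package's own `id` when no `identifier` extra is present is an assumption about how the originating catalogue would be matched, not something the guide prescribes:

```python
from collections import defaultdict


def extra_value(package, key):
    """Return the value of the CKAN "extra" with the given key, or None."""
    for extra in package.get("extras", []):
        if extra.get("key") == key:
            return extra.get("value")
    return None


def group_by_identifier(packages):
    """Group (source, package) pairs by their forwarded identifier.

    Assumption: a package without an "identifier" extra is the original
    and is matched via its own "id".
    """
    groups = defaultdict(list)
    for source, package in packages:
        identifier = extra_value(package, "identifier") or package["id"]
        groups[identifier].append((source, package["id"]))
    # Only identifiers seen more than once are duplicate candidates.
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Any identifier that ends up with more than one member is a candidate duplicate; which copy to keep is a separate decision.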

Since this will only work for catalogues participating in DCAT-AP.de pipelines, it might be simpler to resolve duplicates based on the URL of the data itself, e.g. https://statistik.leipzig.de/opendata/api/values?kategorie_nr=11&rubrik_nr=4&periode=y&format=csv in this case, which should identify the dataset independently of any intermediaries publishing and identifying it.
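
A rough sketch of what a URL-based key could look like, assuming that two catalogues may spell the same resource URL slightly differently (host casing, query parameter order, a trailing fragment). The helper name `resource_key` and the specific normalization rules are assumptions for illustration. Whether the server actually treats such variants as the same resource would need to be checked per source:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def resource_key(url):
    """Normalize a resource URL into a key for duplicate detection."""
    parts = urlsplit(url.strip())
    # Sort query parameters so that their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase scheme and host, drop the fragment. Lowercasing the whole
    # netloc is a simplification that would also affect any userinfo part.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )
```

Datasets whose resources map to the same key could then be treated as duplicates regardless of which catalogue published them.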
