Data validation using external databases

tamslo edited this page Jan 8, 2015 · 4 revisions

Motivation

Because Wikidata items store identifiers for many external databases, we considered cross-checking our data against theirs and, depending on their reliability, adding those databases as references.

Prototype

We have developed a small command-line tool that requests data from Wikidata and MusicBrainz via their APIs and compares the results.
The actual system should work with data dumps because API queries would cause too much traffic. See external validation/crosscheck.py.
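
The comparison step can be illustrated without any network access. The following is a minimal sketch, not the code in crosscheck.py: it assumes both sources have already been reduced to plain dictionaries that map a property name to a set of string values. The function names `normalize` and `crosscheck`, the property names, and the sample records are all hypothetical.

```python
from typing import Dict, Set


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case so trivial differences are not reported."""
    return " ".join(value.split()).casefold()


def crosscheck(wikidata: Dict[str, Set[str]], external: Dict[str, Set[str]]) -> Dict[str, str]:
    """Compare every property that both sources provide.

    Returns a dict mapping the property name to "match" or "mismatch".
    Properties that appear in only one source are skipped.
    """
    results = {}
    for prop in sorted(wikidata.keys() & external.keys()):
        ours = {normalize(v) for v in wikidata[prop]}
        theirs = {normalize(v) for v in external[prop]}
        # Assumption: one shared value is enough to count as a match.
        results[prop] = "match" if ours & theirs else "mismatch"
    return results


# Hypothetical, hand-written records standing in for data read from dumps.
wikidata_item = {"name": {"Example Band"}, "founded": {"1990"}, "country": {"Exampleland"}}
external_record = {"name": {"example  band "}, "founded": {"1991"}}

print(crosscheck(wikidata_item, external_record))
```

Treating a single shared value as a match is a deliberately lenient choice for multi-valued properties; a stricter variant could require the two value sets to be equal.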

Different possibilities for the implementation

The tool could be available as a live tool (mockups below) or run as a cron job, in which case any mismatches found could be treated as constraint violations (see Using constraints more effectively).

Mockups

Cross-checking button

A user can hit the cross-checking button to start the cross-checking for the current item.

Green bar

... appears if the information from Wikidata and the external databases match.

Yellow bar

... appears if there are references missing.

Red bar

... appears if there are mismatches.

Blue bar

... appears if there are no suitable identifiers for external databases or no properties that could be validated with them.
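
The four bars above amount to a small decision rule. Here is one possible sketch of it, assuming each validated property yields a record saying whether the values agree and whether the Wikidata statement already has a reference. The `Check` type and the `bar_color` function are hypothetical names, and the precedence of red over yellow is an assumption, since the mockups do not say which bar wins when both conditions hold.

```python
from typing import List, NamedTuple


class Check(NamedTuple):
    """Outcome of validating one property against an external database."""
    prop: str
    values_match: bool
    has_reference: bool


def bar_color(has_external_id: bool, checks: List[Check]) -> str:
    """Pick the bar to display for an item after cross-checking."""
    # Blue: no usable identifier, or nothing that could be validated.
    if not has_external_id or not checks:
        return "blue"
    # Red: at least one mismatch (assumed to take precedence over yellow).
    if any(not c.values_match for c in checks):
        return "red"
    # Yellow: everything matches, but some statement lacks a reference.
    if any(not c.has_reference for c in checks):
        return "yellow"
    # Green: everything matches and is referenced.
    return "green"


print(bar_color(True, [Check("name", True, True), Check("founded", True, False)]))
```

Checking mismatches before missing references means that the most serious finding determines the color shown to the user.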

Problems

We need data dumps in suitable formats (RDF, JSON, ...), which unfortunately few external databases provide.