Quality Issues

cmader edited this page Dec 15, 2014 · 32 revisions

The vocabulary quality issues defined in the following sections should be applicable to any SKOS vocabulary. Some of the issues are taken from already existing research (see section "Related Work"). Others reflect our thoughts when investigating real-world thesauri (see the data repository). If not stated otherwise, we treat the SKOS vocabularies as fully entailed RDFS graphs. We also enrich the vocabularies by entailment of owl:inverseOf properties as well as instances of owl:TransitiveProperty and owl:SymmetricProperty.

Labeling and Documentation Issues

Omitted or Invalid Language Tags

Description Some controlled vocabularies contain natural-language literals without stating which language is used. Language tags might also not conform to language-tag standards such as RFC 3066.
Example The network ontology defines a prefLabel of the resource http://river.styx.org/network#AddressFamilies that has no language information. Furthermore, the press contacts information dataset (http://data.southampton.ac.uk/dataset/pressinfo.html) defines labels for resources without stating the language.
Implementation Iteration over all triples in the vocabulary whose predicate is (a subproperty of) rdfs:label or skos:note, checking the language tag of each literal.
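
The check can be sketched in a few lines of Python. This is only an illustration: the tuple layout, the function name and the sample data are assumptions, and the regular expression merely approximates the syntax of language tags (it does not consult any subtag registry).

```python
import re

# Simplified well-formedness pattern for tags such as "en" or "en-US".
# Assumption: this approximates the tag syntax only; subtags are not
# validated against a registry.
LANG_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

def find_language_tag_issues(literals):
    """Return the (subject, text, tag) tuples whose tag is missing or malformed.

    `literals` is an iterable of (subject, text, tag) tuples; tag is None
    or "" when the literal carries no language information.
    """
    return [
        (subject, text, tag)
        for subject, text, tag in literals
        if not tag or not LANG_TAG.fullmatch(tag)
    ]

# Hypothetical sample data.
labels = [
    ("ex:AddressFamilies", "Address Families", None),
    ("ex:Router", "Router", "en"),
    ("ex:Switch", "Switch", "en_US"),  # underscore is not a valid separator
]
issues = find_language_tag_issues(labels)
```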

Incomplete Language Coverage

Description Some concepts in a thesaurus are labeled in only one language, some in multiple languages. It may be desirable to have each concept labeled in each of the languages that also are used on the other concepts. This is not always possible, but incompleteness of language coverage for some concepts can indicate shortcomings of the vocabulary.
Example The MARC List for Countries (http://id.loc.gov/vocabulary/countries.html) defines French labels for some but not all concepts.
Implementation Iteration over all concepts in the vocabulary and creation of a global set of language tags appearing in the vocabulary. In a second iteration, each concept having a set of language tags that is not equal to the global language tag set is returned.
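
The two passes can be sketched as follows, assuming the language tags of each concept have already been collected into a dictionary (the input shape and the function name are illustrative, not part of the original tool).

```python
def incomplete_language_coverage(concept_langs):
    """Map each incompletely covered concept to the languages it lacks.

    `concept_langs` maps a concept IRI to the set of language tags used
    on its literals.
    """
    # First pass: collect every language tag used anywhere in the vocabulary.
    all_langs = set().union(*concept_langs.values())
    # Second pass: report concepts whose tag set differs from the global set.
    return {
        concept: all_langs - langs
        for concept, langs in concept_langs.items()
        if langs != all_langs
    }
```

Returning the missing languages, rather than only the affected concepts, makes the result directly usable as a to-do list for translators.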

No Common Language

Description Checks if all concepts have at least one common language, i.e. they have assigned at least one literal in the same language.
Example
Preliminary ideas on computation

Undocumented Concepts

Description The SKOS "standard" defines a number of properties (section 7) for documenting the meaning of the concepts in a thesaurus in human-readable form. Intensive use of these properties leads to a well-documented thesaurus, which should also improve its quality.
Example The Library of Congress Thesaurus for Graphic Materials offers high coverage of documentation properties.
Implementation Iteration over all concepts in the vocabulary to find those that use none of skos:note, skos:changeNote, skos:definition, skos:editorialNote, skos:example, skos:historyNote, or skos:scopeNote.
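
A minimal sketch of this check, assuming the vocabulary is available as (subject, predicate, object) tuples with prefixed names as strings; the function name and data layout are illustrative.

```python
# The documentation properties listed above.
DOC_PROPERTIES = {
    "skos:note", "skos:changeNote", "skos:definition", "skos:editorialNote",
    "skos:example", "skos:historyNote", "skos:scopeNote",
}

def undocumented_concepts(concepts, triples):
    """Return the concepts that use none of the documentation properties."""
    documented = {s for s, p, _ in triples if p in DOC_PROPERTIES}
    return sorted(set(concepts) - documented)
```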

Overlapping Labels

Description This is a generalization of a recommendation in the SKOS primer, that “no two concepts have the same preferred lexical label in a given language when they belong to the same concept scheme”. Overlapping labels could indicate missing disambiguation information and thus lead to problems in autocompletion applications.
Example The concept http://purl.org/collections/nl/am/t-8523 defines the skos:prefLabel "streekvervoer" ("local transport"). Another concept, http://purl.org/collections/nl/am/t-6081, defines skos:altLabel "streekvervoer" and skos:prefLabel "openbaar vervoer" ("public transport").
Implementation Iteration over all authoritative concepts, collecting their respective labels. In a second pass, the similarity of all possible label pairs is checked by a similarity function. Concept labels with a similarity value at or above a given threshold are considered conflicting and are returned. In the current implementation, the similarity function is string equality with a threshold equal to 1.
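
The pairwise comparison might look like the sketch below. It treats a pair as conflicting when the similarity reaches the threshold, which is the reading consistent with string equality and a threshold of 1. Restricting the comparison to labels in the same language is an assumption taken from the primer quotation; the names and the tuple layout are illustrative.

```python
from itertools import combinations

def exact_match(a, b):
    """String equality expressed as a similarity in [0, 1]."""
    return 1.0 if a == b else 0.0

def overlapping_labels(labels, similarity=exact_match, threshold=1.0):
    """Return (concept1, concept2, label1, label2) for conflicting labels.

    `labels` is a list of (concept, text, lang) tuples.
    """
    conflicts = []
    for (c1, t1, l1), (c2, t2, l2) in combinations(labels, 2):
        # Only labels of different concepts in the same language can clash.
        if c1 != c2 and l1 == l2 and similarity(t1, t2) >= threshold:
            conflicts.append((c1, c2, t1, t2))
    return conflicts
```

Passing the similarity function as a parameter leaves room for fuzzier measures, at the cost of a quadratic number of comparisons.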

Missing Labels

Description To make the vocabulary more convenient for humans to use, instances of SKOS classes (Concept, ConceptScheme, Collection) should be labeled using e.g., prefLabel, altLabel, rdfs:label, dc:title.
Example
Preliminary ideas on computation

Unprintable Characters in Labels

Description Preferred, alternative, or hidden labels contain characters that are not alphanumeric characters or blanks.
Example Newline characters that have been left over from automated vocabulary conversion or invalid user input.
Preliminary ideas on computation A SPARQL query would be sufficient to find labels having characters that belong to the Unicode general categories "Zl", "Zp" and "C".
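
Outside of SPARQL, the same category test can be sketched with Python's standard unicodedata module. The function name is illustrative; "C" is interpreted here as every general category starting with "C" (Cc, Cf, and so on).

```python
import unicodedata

def unprintable_characters(label):
    """Return (index, character, category) for each offending character."""
    hits = []
    for index, ch in enumerate(label):
        category = unicodedata.category(ch)
        # Line separators (Zl), paragraph separators (Zp) and all "C" categories.
        if category in ("Zl", "Zp") or category.startswith("C"):
            hits.append((index, ch, category))
    return hits
```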

Empty Labels

Description Labels need to contain textual information to be useful, thus we find all SKOS labels with length 0 (after removing whitespace).
Example
Preliminary ideas on computation

Ambiguous Notation References

Description Concepts within the same concept scheme should not have identical skos:notation literals.
Example
Preliminary ideas on computation

Structural Issues

SKOS is based on RDF, which is a graph-based data model. Therefore we can concentrate on the vocabulary's graph-based structure for assessing the quality of SKOS vocabularies and apply graph- and network-analysis techniques.

Orphan Concepts

Description An orphan concept is a concept without any associative or hierarchical relations. It might have attached literals such as labels, but it is not connected to any other resource and therefore lacks valuable context information. A controlled vocabulary that contains many orphan concepts is less usable for search and retrieval use cases because, e.g., no hierarchical query expansion can be performed on search terms to find documents with more general content.
Example In the press contacts information dataset from the University of Southampton (http://data.southampton.ac.uk/dataset/pressinfo.html), SKOS concepts are defined but aren't linked to other resources using SKOS properties (e.g., http://id.southampton.ac.uk/pressinfo/subject/ActiveSourceSeismology). Similarly, the http://river.styx.org/network ontology defines the concept http://river.styx.org/network#AddressFamily, but doesn't link it using SKOS properties.
Implementation Iteration over all concepts in the vocabulary, returning those that are not associated with other resources through (subproperties of) skos:semanticRelation.

Disconnected Concept Clusters

Description By checking the connectivity of the graph, it is possible to identify all weakly connected components. These components form "islands" in the vocabulary and might be caused by incomplete data acquisition, "forgotten" test data, outdated terms and the like.
Example The dmGeo vocabulary consists of 5 weakly connected components. It was available at http://www.dismarc.org but now seems to be offline. Weakly connected components can also be found in the LVAk thesaurus.
Implementation Creation of an undirected graph that includes all non-orphan concepts as nodes and all semantic relations as edges. Tarjan's algorithm then finds and returns all weakly connected components.
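
As an illustration, the components of the undirected graph can be found with a plain depth-first traversal. Note that this sketch uses a simple traversal rather than the algorithm named above, and that the edge-list input and function name are assumptions.

```python
from collections import defaultdict

def weakly_connected_components(edges):
    """Return the connected components of the undirected graph as sets.

    `edges` is an iterable of (concept, concept) pairs, one per semantic
    relation; direction is ignored.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components
```

Orphan concepts never appear in the edge list, so they are excluded automatically, matching the "non-orphan concepts" restriction.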

Cyclic Hierarchical Relations

Description Although perfectly consistent with the SKOS data model, cyclic relations may reveal a logical problem in the thesaurus. Consider the following example: "decision" -> "problem resolution" -> "problem" (-> "decision": here the cycle is closed). The concepts are connected using skos:broader relationships (indicated with "->"). Because a thesaurus is in many cases a product of consensus between the contributors (or just the decision of one dedicated thesaurus manager), it will be almost impossible to resolve the cycle automatically (i.e., by deleting an edge).
Example DBpedia categories: http://dbpedia.org/resource/Category:Wikipedians_in_West_Virginia is related broader to itself. http://dbpedia.org/resource/Category:Republic_of_Macedonia has the broader concept http://dbpedia.org/resource/Category:Macedonia, which has the broader concept http://dbpedia.org/resource/Category:Geography_of_the_Republic_of_Macedonia. Furthermore, http://dbpedia.org/resource/Category:Graphics_software has the broader concept http://dbpedia.org/resource/Category:Application_software and vice versa.
Implementation Construction of a graph having all concepts as nodes and the skos:broader relations as edges; cycles in this graph are then detected and returned.
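
One way to detect cycles in such a graph is a depth-first search that reports every back edge. The sketch below is an assumption about how this could be done, not a description of the actual implementation; it reports one cycle per back edge rather than every elementary cycle, and its recursion depth grows with the depth of the hierarchy.

```python
def find_broader_cycles(broader):
    """Return cycles in the skos:broader graph as lists of concepts.

    `broader` maps a concept to an iterable of its broader concepts.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    cycles = []

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for parent in broader.get(node, ()):
            state = colour.get(parent, WHITE)
            if state == GREY:
                # Back edge: the cycle is the path segment starting at parent.
                cycles.append(path[path.index(parent):])
            elif state == WHITE:
                visit(parent, path)
        path.pop()
        colour[node] = BLACK

    for concept in list(broader):
        if colour.get(concept, WHITE) == WHITE:
            visit(concept, [])
    return cycles
```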

Valueless Associative Relations

Description Two concepts are siblings but are also connected by an associative relation. In that context, the associative relation is not necessary. See ISO_DIS_25964-1, 11.3.2.2.
Example The concepts http://eurovoc.europa.eu/2291 and http://eurovoc.europa.eu/4483 in the EuroVoc thesaurus are related associatively (skos:related) and hierarchically.
Implementation Identification of all pairs of concepts that have the same broader or narrower concepts, i.e. they are "sibling terms". All siblings that are related by a skos:related property are returned.
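
A sketch of the sibling test, under the simplifying assumption that siblings are concepts sharing a broader concept (the shared-narrower case mentioned above is not covered). Input shapes and names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def valueless_associative_relations(broader_pairs, related_pairs):
    """Return sibling pairs that are also linked by skos:related.

    `broader_pairs` holds (concept, broader concept) tuples and
    `related_pairs` holds (concept, concept) tuples.
    """
    children = defaultdict(set)
    for child, parent in broader_pairs:
        children[parent].add(child)

    # skos:related is symmetric, so pairs are stored without direction.
    related = {frozenset(pair) for pair in related_pairs}

    siblings = set()
    for kids in children.values():
        siblings.update(frozenset(p) for p in combinations(sorted(kids), 2))

    return sorted(tuple(sorted(pair)) for pair in siblings & related)
```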

Solely Transitively Related Concepts

Description skos:broaderTransitive and skos:narrowerTransitive are, according to the SKOS reference document, "not used to make assertions", so they should not be the only relations hierarchically relating two concepts.
Example The NAICS thesaurus contains 2189 concepts that are related directly by skos:broaderTransitive.
Implementation Identification of all concept pairs that are related by skos:broaderTransitive or skos:narrowerTransitive properties but not by their skos:broader and skos:narrower subproperties.
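
Expressed over plain (subject, predicate, object) tuples, the comparison reduces to a set difference. The function name and the string-based representation of the properties are assumptions made for this sketch.

```python
def solely_transitively_related(triples):
    """Return (narrower, broader) pairs linked only by transitive properties."""
    direct, transitive = set(), set()
    for s, p, o in triples:
        # Normalise every statement to the (narrower, broader) direction.
        if p == "skos:broader":
            direct.add((s, o))
        elif p == "skos:narrower":
            direct.add((o, s))
        elif p == "skos:broaderTransitive":
            transitive.add((s, o))
        elif p == "skos:narrowerTransitive":
            transitive.add((o, s))
    return sorted(transitive - direct)
```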

Unidirectionally Related Concepts

Description Reciprocal relations (e.g., broader/narrower, related, hasTopConcept/topConceptOf) should be included in controlled vocabularies to, e.g., achieve better search results using SPARQL in systems without reasoner support.
Example
Implementation This issue is checked WITHOUT inference of owl:inverseOf properties. We iterate over all triples and check for each property if an inverse property is defined in the SKOS ontology and if the respective statement using this property is included in the vocabulary. If not, the resources associated with this property are returned.
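
The lookup can be sketched as below. The table of inverses is hand-written from the property pairs named in the description and is only a subset; a real check would read the inverse declarations from the SKOS ontology itself. All names are illustrative.

```python
# Assumption: a hand-written subset of reciprocal SKOS properties.
INVERSE = {
    "skos:broader": "skos:narrower",
    "skos:narrower": "skos:broader",
    "skos:related": "skos:related",
    "skos:hasTopConcept": "skos:topConceptOf",
    "skos:topConceptOf": "skos:hasTopConcept",
}

def unidirectionally_related(triples):
    """Return asserted triples whose reciprocal statement is missing."""
    asserted = set(triples)
    return sorted(
        (s, p, o)
        for s, p, o in asserted
        if p in INVERSE and (o, INVERSE[p], s) not in asserted
    )
```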

Omitted Top Concepts

Description A vocabulary should provide "entry points" to the data to provide “efficient access” (SKOS primer) and guidance for human users.
Example In EuroVoc, the ConceptScheme http://eurovoc.europa.eu/100141 doesn't have a top concept defined.
Implementation For every ConceptScheme in the controlled vocabulary, a SPARQL query is issued finding resources that are associated with this ConceptScheme by one of the properties skos:hasTopConcept or skos:topConceptOf. TODO: extend notion of top concepts also by concepts having no broader concept (as suggested in [Abdul]).
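
Without a SPARQL engine, the same test can be sketched over plain triples; the function name and data layout are assumptions for illustration.

```python
def schemes_without_top_concept(schemes, triples):
    """Return concept schemes that are not linked to any top concept."""
    with_top = set()
    for s, p, o in triples:
        if p == "skos:hasTopConcept":
            with_top.add(s)   # scheme is the subject
        elif p == "skos:topConceptOf":
            with_top.add(o)   # scheme is the object
    return sorted(set(schemes) - with_top)
```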

Top Concepts Having Broader Concepts

Description Concepts "internal to the tree" should not be indicated as top concepts, as pointed out in [Allemang2011].
Example In the PXV vocabulary, http://www.peroxisomekb.nl/v1.6/pxv/C000850 is a top concept that is related to a broader concept.
Implementation A SPARQL query finds all top concepts (defined by one of the properties skos:hasTopConcept or skos:topConceptOf) that have a broader concept.
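
As a sketch over plain triples instead of SPARQL (names and data layout are illustrative assumptions):

```python
def top_concepts_with_broader(triples):
    """Return top concepts that also have a broader concept."""
    top, has_broader = set(), set()
    for s, p, o in triples:
        if p == "skos:hasTopConcept":
            top.add(o)
        elif p == "skos:topConceptOf":
            top.add(s)
        elif p == "skos:broader":
            has_broader.add(s)
        elif p == "skos:narrower":
            has_broader.add(o)   # o is narrower than s, so o has a broader concept
    return sorted(top & has_broader)
```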

Hierarchical Redundancy

Description As stated in the SKOS reference document, skos:broader and skos:narrower are not transitive properties. However, they are subproperties of skos:broaderTransitive and skos:narrowerTransitive, which enables inference of a "transitive closure". This, in fact, leaves it up to the user to interpret whether a vocabulary's hierarchical structure is seen as transitive or not. In the former case, this check can be useful. It finds pairs of concepts (A,B) that are directly hierarchically related but for which there also exists a hierarchical path through a concept C that connects A and B.
Example Concept A has a broader concept B. If a concept C exists such that A broader C and C broader B, the direct hierarchical relation A broader B is considered redundant.
Implementation These structures can be found by a single SPARQL query.
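
For illustration, the same structures can also be found procedurally: a direct broader link is reported when its target is reachable through another direct broader concept. This sketch is written in Python rather than SPARQL, assumes an acyclic hierarchy, and uses illustrative names.

```python
from collections import defaultdict

def redundant_broader_links(broader_pairs):
    """Return direct (concept, broader) links that a longer path duplicates."""
    parents = defaultdict(set)
    for child, parent in broader_pairs:
        parents[child].add(parent)

    def ancestors(start):
        """All concepts reachable from start via one or more broader steps."""
        seen, stack = set(), list(parents.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return seen

    redundant = []
    for child, direct in parents.items():
        for target in direct:
            # Is the target also reachable via a different direct parent?
            others = direct - {target, child}
            if any(target in ancestors(other) for other in others):
                redundant.append((child, target))
    return sorted(redundant)
```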

Reflexive Relations

Description Concepts related to themselves.
Example
Implementation These structures can be found by a single SPARQL query.

Linked Data Specific Issues

When publishing Linked Data, it is important to respect the following "rules" (also see http://www.w3.org/DesignIssues/LinkedData.html):

  • data is provided using standard formats (e.g., RDF which is obviously the case for SKOS vocabularies)
  • linked resources are dereferenceable and provide further information
  • data is linked to and from other resources

The issues introduced in this section can be used to create computable metrics for measuring data linkage.

Missing In-Links

Description The usage of a vocabulary's concepts can be an indicator of its quality. Usage can be determined by the number of external resources referencing these concepts.
Example Consider the concept "http://dbpedia.org/resource/Michael_Jackson". It is, for example, referenced by the following resources:
Implementation For each authoritative concept in the vocabulary, a SPARQL query (against, e.g., the Sindice endpoint) is issued that returns all triples in which the concept shows up as the object. An estimate of the number of other vocabularies referencing the concept can be obtained by examining whether the host part of the returned triple subject URIs doesn't match the publishing host of the vocabulary. Concepts for which no such matches can be found are returned.
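
The host comparison can be sketched with the standard urllib.parse module. The sketch assumes the query results have already been collected per concept; the input shape, the function name and the example hosts are hypothetical.

```python
from urllib.parse import urlparse

def concepts_without_external_inlinks(referencing_subjects, vocabulary_host):
    """Return concepts that are referenced from no host other than our own.

    `referencing_subjects` maps a concept IRI to the subject IRIs of all
    triples in which the concept occurs as the object.
    """
    missing = []
    for concept, subjects in referencing_subjects.items():
        hosts = {urlparse(subject).hostname for subject in subjects}
        # Discard the publishing host and subjects without a host part.
        external_hosts = hosts - {vocabulary_host, None}
        if not external_hosts:
            missing.append(concept)
    return missing
```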

Missing Out-Links

Description SKOS concepts can define links to other concepts within one and the same vocabulary, to concepts in other vocabularies, or to external resources on the Web. These external links are essential to, for example,
  • connect the vocabulary with other Web resources and benefit from other people's knowledge about the contained terms (by, e.g., using the link as starting point for a web crawling application)
  • act as some kind of bridge, connecting previously unconnected (unrelated) domains
  • provide information on the context of a term, serving a documentation purpose
  • prevent duplication of information
Example The New York Times People Vocabulary is aligned to dbpedia using owl:sameAs properties (e.g., http://data.nytimes.com/64870337666324078863).
Implementation For each authoritative concept in the vocabulary, a SPARQL query is issued that returns all IRIs that occur as subject or object in triples where this concept is involved. All IRIs that are HTTP URIs and refer to a non-authoritative resource for the concept are counted. Concepts with a count that equals zero are returned.

Broken Links

Description If concepts link to other resources (link targets) on the Web, it is important that these resources are dereferenceable. This check finds link targets that return a response code other than 200 after possible redirections.
Example New York Times People Vocabulary: Response 404 when dereferencing http://data.nytimes.com/elements/manual
Implementation A SPARQL query is issued that finds all HTTP URIs being part (as subject, predicate, or object) of a triple in the vocabulary. The found URIs are then dereferenced and returned if the HTTP response code (after possible redirections) is other than 200.
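
The dereferencing step might be sketched with Python's standard urllib, which follows redirects by default. This is an assumption-laden illustration: it sends no content-negotiation headers, and the status lookup is passed in as a parameter so that the filtering logic can be exercised without network access.

```python
import urllib.error
import urllib.request

def http_status(uri, timeout=10):
    """Return the final HTTP status code after redirects, or None on failure."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code          # e.g. 404 or 500
    except (urllib.error.URLError, OSError, ValueError):
        return None                # unreachable host, timeout, malformed URI

def broken_links(uris, status_of=http_status):
    """Return the distinct URIs whose final response code is not 200."""
    return [uri for uri in sorted(set(uris)) if status_of(uri) != 200]
```

A hypothetical call would be `broken_links(uris_from_vocabulary)`; in tests, a stub can replace `status_of`.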

Undefined SKOS Resources

Description The vocabulary should not invent any new terms within the SKOS namespace or use “deprecated” SKOS elements like those defined in Appendix D of the SKOS reference.
Example NSDL Registry Agents Vocabulary uses skos:status, which is not defined in http://www.w3.org/2009/08/skos-reference/skos.rdf. http://vocabularyserver.com/emergencias/?tema=575 uses skos:subjectIndicator
Implementation A SPARQL query finds all IRIs that appear in one of the vocabulary's triples in combination with a "deprecated" predicate. "Invented" new terms are found by a SPARQL query that selects all IRIs in the vocabulary's RDF triples that belong to the SKOS namespace but are not defined in the SKOS ontology. All terms found by the two queries are returned.

HTTP URI Scheme Violation

Description URIs should be dereferenceable. C. Bizer, How to Publish Linked Data on the Web: "In the context of Linked Data, we restrict ourselves to using HTTP URIs only and avoid other URI schemes such as URNs and DOIs."
Example In the CFR Thesaurus (a thesaurus in the legal domain by Cornell University), a concept has been identified by a file:// URI.
Implementation A SPARQL query is used to find all IRIs that occur as subject in the vocabulary's RDF triples. If their protocol identifier is other than http or https, the resource is returned.
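
The scheme test itself is a one-liner with urllib.parse. The sketch assumes the subjects are available as IRI strings (blank nodes excluded); names are illustrative.

```python
from urllib.parse import urlparse

def non_http_subjects(triples):
    """Return subject IRIs whose scheme is neither http nor https."""
    return sorted({
        s for s, _, _ in triples
        if urlparse(s).scheme not in ("http", "https")
    })
```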

SKOS Semi-Formal Consistency Issues

This category defines issues that relate to specific design decisions of the SKOS ontology. Some of them are also semi-formally expressed in the SKOS reference documentation.

Relation Clashes

Description Covers condition S27 from the SKOS reference document (skos:related is disjoint with skos:broaderTransitive), which has not been defined formally.
Example In the AGROVOC thesaurus, the concepts http://aims.fao.org/aos/agrovoc/c_118 and http://aims.fao.org/aos/agrovoc/c_2969 are affected by this issue.
Implementation In a first step, all pairs of concepts are found that are associatively connected, using a SPARQL query. In the second step, a graph is created, containing only hierarchically related concepts and the respective relations. For each concept pair from the first step, we check for a path in the graph from step two. If such a path is found, a clash has been identified and the causing concepts are returned.
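
The two steps can be sketched as a reachability test over the broader hierarchy. The pair-list inputs and names are assumptions; a path is searched in both directions because either concept of a related pair may be the more general one.

```python
from collections import defaultdict

def relation_clashes(related_pairs, broader_pairs):
    """Return related pairs that are also connected by a hierarchical path."""
    parents = defaultdict(set)
    for child, parent in broader_pairs:
        parents[child].add(parent)

    def reaches(start, goal):
        """True if goal is reachable from start via one or more broader steps."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for parent in parents.get(node, ()):
                if parent == goal:
                    return True
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return False

    return sorted({
        tuple(sorted((a, b)))
        for a, b in related_pairs
        if reaches(a, b) or reaches(b, a)
    })
```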

Mapping Clashes

Description Covers condition S46 from the SKOS reference document (skos:exactMatch is disjoint with skos:broadMatch and skos:relatedMatch), which has not been defined formally.
Example
Implementation Can be solved by issuing a SPARQL query.

Inconsistent Preferred Labels

Description According to the SKOS reference document, "A resource has no more than one value of skos:prefLabel per language tag".
Example For the concept http://dbpedia.org/resource/Income_tax, the STW thesaurus mappings define two German prefLabels: "Einkommensteuer" and "Einkommensteuer (Deutschland)".
Implementation A SPARQL query is used to find concepts with at least two prefLabels. In a second step, the language tags of these prefLabels are analyzed and an ambiguity is detected if they are equal.
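
Grouping the preferred labels by concept and language tag makes the ambiguity visible directly. A minimal sketch, with illustrative names and input layout:

```python
from collections import defaultdict

def inconsistent_pref_labels(pref_labels):
    """Map (concept, lang) to its labels where more than one prefLabel exists.

    `pref_labels` is an iterable of (concept, text, lang) tuples.
    """
    by_key = defaultdict(set)
    for concept, text, lang in pref_labels:
        by_key[(concept, lang)].add(text)
    return {key: sorted(texts) for key, texts in by_key.items() if len(texts) > 1}
```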

Disjoint Labels Violation

Description Covers condition S13 from the SKOS reference document (section 5.4) stating that "skos:prefLabel, skos:altLabel and skos:hiddenLabel are pairwise disjoint properties".
Example The concept http://aims.fao.org/aos/agrovoc/c_35337 in AGROVOC has the string literal "tüske" defined as both prefLabel and altLabel. Further cases: http://www.afp-ifm-thesaurus.net/t-pro/Merkzeichen has identical prefLabel and altLabel (marques@fr); http://lod.geospecies.org/kingdoms/Af?format=rdf has identical prefLabel and altLabel (Bacteria); http://www.afp-ifm-thesaurus.net/t-pro/Motto has identical prefLabel and hiddenLabel (motto@de).
Implementation A SPARQL query collects all labels of all concepts, building an in-memory structure. This structure is then checked for entries that violate disjointness, i.e. literals that are assigned to the same concept through more than one label property.
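
The in-memory structure could, for instance, be a dictionary from (concept, literal, language) to the label properties that carry it. The layout and names below are assumptions for illustration only.

```python
from collections import defaultdict

def disjoint_label_violations(label_statements):
    """Map (concept, text, lang) to the label properties sharing that literal.

    `label_statements` is an iterable of (concept, property, text, lang)
    tuples for skos:prefLabel, skos:altLabel and skos:hiddenLabel.
    """
    properties = defaultdict(set)
    for concept, prop, text, lang in label_statements:
        properties[(concept, text, lang)].add(prop)
    # A literal used by two or more label properties violates disjointness.
    return {key: sorted(props) for key, props in properties.items() if len(props) > 1}
```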

Mapping Relations Misuse

Description According to the SKOS reference documentation, mapping relations (e.g., skos:broadMatch or skos:relatedMatch) should be asserted between concepts that are members of different concept schemes. This check finds concepts that are related by a mapping property and are either members of the same concept scheme or members of no concept scheme at all.
Example The concept labeled "jaguar" is a member of the concept scheme labeled "animals". Furthermore, the concept "cat" is a member of the same concept scheme, and "jaguar" is related to "cat" by skos:broadMatch. Thus, this relation can be considered a misuse of a mapping relation.
Implementation