Vision

David Shorthouse edited this page Apr 7, 2014 · 21 revisions

The dwca-validator is split into modules that can (or eventually will) interact with one another. The following is an overview of the modules and a summary of their respective scopes:

Vision diagram

Modules definition

library

The library module will be used by the web component but may also be used directly by other projects, including some data aggregators. The library is responsible for validating that supplied data respects the Darwin Core standard. This includes validation against the schemas for the core and extensions, and against the terms in controlled vocabularies suggested by the standard. The library is also responsible for validating the structural and functional integrity of the data (e.g. uniqueness, invalid characters). It contains all base classes for the validation chain and for gathering results.
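As a rough illustration of such a chain, the sketch below runs two checks against a single record: a controlled-vocabulary check and an invalid-character check. All type and method names here (ValidationChainSketch, RecordEvaluator, vocabulary, noControlCharacters, validate) are hypothetical and are not taken from the dwca-validator code base. The definition of an "invalid character" is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types come from the dwca-validator source.
class ValidationChainSketch {

    /** One step of the chain: returns an error message, or null if the record passes. */
    interface RecordEvaluator {
        String evaluate(Map<String, String> record);
    }

    /** Checks a term against a controlled vocabulary; missing or empty values are skipped. */
    static RecordEvaluator vocabulary(final String term, final Set<String> allowed) {
        return record -> {
            String value = record.get(term);
            if (value == null || value.isEmpty() || allowed.contains(value)) {
                return null;
            }
            return term + ": '" + value + "' is not in the controlled vocabulary";
        };
    }

    /** Flags ISO control characters other than tab (an assumed definition of "invalid"). */
    static RecordEvaluator noControlCharacters(final String term) {
        return record -> {
            String value = record.get(term);
            if (value == null) {
                return null;
            }
            for (int i = 0; i < value.length(); i++) {
                char c = value.charAt(i);
                if (Character.isISOControl(c) && c != '\t') {
                    return term + ": contains a control character at index " + i;
                }
            }
            return null;
        };
    }

    /** Runs every evaluator against the record and gathers the error messages. */
    static List<String> validate(Map<String, String> record, List<RecordEvaluator> chain) {
        List<String> errors = new ArrayList<>();
        for (RecordEvaluator evaluator : chain) {
            String error = evaluator.evaluate(record);
            if (error != null) {
                errors.add(error);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        Set<String> basis = new HashSet<>(Arrays.asList("PreservedSpecimen", "HumanObservation"));
        Map<String, String> record = new HashMap<>();
        record.put("basisOfRecord", "Rock");
        record.put("locality", "Mont\bReal");
        List<RecordEvaluator> chain =
                Arrays.asList(vocabulary("basisOfRecord", basis), noControlCharacters("locality"));
        for (String error : validate(record, chain)) {
            System.out.println(error);
        }
    }
}
```

Keeping each check behind one small interface is what would let other modules append their own steps to the same list.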

web

The web module is the public-facing interface of the 'dwca-validator'. A user should be able to submit an archive or a URL pointing to an archive. An API should also be available to allow interoperability with other programming languages. The web component should allow the 'narwhal' and 'extensions' modules to be added to the validation chain.

narwhal

The narwhal module will make use of the narwhal processor in the validation chain. An InterpretedResultAccumulator will be added so that a ResultAccumulator can store both the validation result and a possible interpreted value. The narwhal processor is responsible for interpreting data such as dates and coordinates in different formats, country names and, eventually, scientific names, which may require external services. This module will also require a composite ChainableRecordEvaluator to allow different parts of the validation chain to execute in a logical sequence (e.g. validating a province name against a previously validated/interpreted country name).
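The province-after-country example could look roughly like the sketch below. Only the names InterpretedResultAccumulator and ChainableRecordEvaluator come from the text above. Their signatures, the Result class, the run method and the tiny lookup tables are assumptions made for illustration, and the narwhal processor itself is not called here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: the shapes of these types are assumed, not taken from the project.
class InterpretationSketch {

    /** Validation outcome for one term, with the interpreted value when there is one. */
    static final class Result {
        final String term;
        final boolean valid;
        final String interpreted; // null when nothing could be interpreted

        Result(String term, boolean valid, String interpreted) {
            this.term = term;
            this.valid = valid;
            this.interpreted = interpreted;
        }
    }

    /** Stores each result together with its interpreted value. */
    static final class InterpretedResultAccumulator {
        final List<Result> results = new ArrayList<>();

        void accumulate(Result result) {
            results.add(result);
        }
    }

    /** Evaluates one term; 'interpreted' holds values produced by earlier steps. */
    interface ChainableRecordEvaluator {
        Result evaluate(Map<String, String> record, Map<String, String> interpreted);
    }

    // Tiny illustrative lookup tables, not real reference data.
    static final Map<String, String> COUNTRY_CODES = new HashMap<>();
    static final Map<String, List<String>> PROVINCES = new HashMap<>();
    static {
        COUNTRY_CODES.put("canada", "CA");
        COUNTRY_CODES.put("ca", "CA");
        PROVINCES.put("CA", Arrays.asList("Ontario", "Quebec"));
    }

    /** Interprets the raw country value as a country code. */
    static final ChainableRecordEvaluator COUNTRY = (record, interpreted) -> {
        String raw = record.getOrDefault("country", "").trim().toLowerCase(Locale.ROOT);
        String code = COUNTRY_CODES.get(raw);
        return new Result("country", code != null, code);
    };

    /** Valid only if an earlier step interpreted the country and the province belongs to it. */
    static final ChainableRecordEvaluator STATE_PROVINCE = (record, interpreted) -> {
        String code = interpreted.get("country");
        String raw = record.getOrDefault("stateProvince", "").trim();
        boolean valid = code != null
                && PROVINCES.getOrDefault(code, Collections.<String>emptyList()).contains(raw);
        return new Result("stateProvince", valid, valid ? raw : null);
    };

    /** Runs the evaluators in order, feeding interpreted values forward. */
    static InterpretedResultAccumulator run(Map<String, String> record,
                                            List<ChainableRecordEvaluator> chain) {
        InterpretedResultAccumulator accumulator = new InterpretedResultAccumulator();
        Map<String, String> interpreted = new HashMap<>();
        for (ChainableRecordEvaluator evaluator : chain) {
            Result result = evaluator.evaluate(record, interpreted);
            accumulator.accumulate(result);
            if (result.valid && result.interpreted != null) {
                interpreted.put(result.term, result.interpreted);
            }
        }
        return accumulator;
    }

    public static void main(String[] args) {
        Map<String, String> record = new HashMap<>();
        record.put("country", "Canada");
        record.put("stateProvince", "Quebec");
        for (Result r : run(record, Arrays.asList(COUNTRY, STATE_PROVINCE)).results) {
            System.out.println(r.term + " valid=" + r.valid + " interpreted=" + r.interpreted);
        }
    }
}
```

In this sketch the order of the list is the "logical sequence": placing STATE_PROVINCE before COUNTRY would make every province fail, because no interpreted country would be available yet.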

extension

Extensions would be possible as a library extension or as a narwhal extension. Extensions may contain domain-specific knowledge that is outside the (limited) scope of the dwca-validator. Extensions are added to the validation chain; one could, for example, add a fitness-for-use calculation.

Modules usage

Possible usage of the dwca-validator and its modules:

  • Data aggregator: library + narwhal + extensions
  • Data publisher: web (with possibility for narwhal and extensions)
  • IPT: library (with possibility for extensions)

Validation Flow

Validation flow

Source

To ensure that the 'dwca-validator' can be used in other projects, the data to validate could also come from a user-defined source such as a database or a messaging system. With a user-defined source, the caller is responsible for managing the scope of the data.

Structure validation

Structural validation is executed only against a DwC-A source. A user-defined source is considered structurally valid.
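One way to picture the two kinds of source is a small abstraction that hands records to the validator and states whether the structural step applies. The sketch below is an assumption about how this could be modelled. RecordSource, userDefined and validate are hypothetical names, and the structural and content steps are reduced to log messages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types are not part of the dwca-validator.
class RecordSourceSketch {

    /** A source of records plus a flag saying whether it is a DwC-A. */
    static final class RecordSource {
        final Iterable<Map<String, String>> records;
        final boolean darwinCoreArchive;

        RecordSource(Iterable<Map<String, String>> records, boolean darwinCoreArchive) {
            this.records = records;
            this.darwinCoreArchive = darwinCoreArchive;
        }
    }

    /** Rows handed over by the caller, e.g. from a database query or a message queue. */
    static RecordSource userDefined(List<Map<String, String>> rows) {
        return new RecordSource(rows, false);
    }

    /** Structural validation runs only for an archive; the content chain always runs. */
    static List<String> validate(RecordSource source) {
        List<String> log = new ArrayList<>();
        log.add(source.darwinCoreArchive ? "structure checked" : "structure assumed valid");
        int count = 0;
        for (Map<String, String> record : source.records) {
            count++; // a real content validation chain would evaluate the record here
        }
        log.add("content chain run on " + count + " record(s)");
        return log;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = Arrays.asList(
                Collections.singletonMap("occurrenceID", "occ-1"),
                Collections.singletonMap("occurrenceID", "occ-2"));
        for (String line : validate(userDefined(rows))) {
            System.out.println(line);
        }
    }
}
```

The caller decides which rows go into the list, which is the sense in which the scope of the data is managed outside the validator.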

Content validation chain

The content validation chain is executed against each record regardless of the source. Raw and interpreted results are accumulated with a ResultAccumulator. Some validation chain elements will provide immediate results. Other elements, such as 'uniqueness', need to see all the records before declaring validity. Parallelization of the content validation chain is possible but depends on the chosen ResultAccumulator.
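The difference between immediate and deferred elements, and the reason the accumulator matters for parallel runs, can be sketched as follows. This is a minimal illustration under assumed names (ConcurrentResultAccumulator, UniquenessEvaluator, checkNotBlank, validate). It uses a parallel stream and thread-safe collections from java.util.concurrent, which is one possible choice and not necessarily the project's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: these types are not part of the dwca-validator.
class ContentChainSketch {

    /** Thread-safe stand-in for a ResultAccumulator, so records can be processed in parallel. */
    static final class ConcurrentResultAccumulator {
        private final Queue<String> results = new ConcurrentLinkedQueue<>();

        void accumulate(String result) {
            results.add(result);
        }

        /** Returns the results sorted, because parallel arrival order is not deterministic. */
        List<String> sorted() {
            List<String> copy = new ArrayList<>(results);
            Collections.sort(copy);
            return copy;
        }
    }

    /** Immediate element: reports as soon as it sees the record. */
    static void checkNotBlank(String occurrenceId, ConcurrentResultAccumulator accumulator) {
        if (occurrenceId.trim().isEmpty()) {
            accumulator.accumulate("blank occurrenceID");
        }
    }

    /** Deferred element: only counts while records stream by, reports after the last one. */
    static final class UniquenessEvaluator {
        private final Map<String, Integer> counts = new ConcurrentHashMap<>();

        void observe(String occurrenceId) {
            counts.merge(occurrenceId, 1, Integer::sum); // atomic on ConcurrentHashMap
        }

        void finish(ConcurrentResultAccumulator accumulator) {
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > 1) {
                    accumulator.accumulate("duplicate occurrenceID: " + entry.getKey()
                            + " (" + entry.getValue() + " occurrences)");
                }
            }
        }
    }

    /** Validates non-null identifiers in parallel, then lets the deferred element report. */
    static List<String> validate(List<String> occurrenceIds) {
        ConcurrentResultAccumulator accumulator = new ConcurrentResultAccumulator();
        UniquenessEvaluator uniqueness = new UniquenessEvaluator();
        occurrenceIds.parallelStream().forEach(id -> {
            checkNotBlank(id, accumulator);
            uniqueness.observe(id);
        });
        uniqueness.finish(accumulator);
        return accumulator.sorted();
    }

    public static void main(String[] args) {
        for (String result : validate(Arrays.asList("occ-1", "occ-2", "occ-1", ""))) {
            System.out.println(result);
        }
    }
}
```

With a plain ArrayList as the accumulator, the same parallel loop would not be safe, which illustrates why parallelization depends on the chosen accumulator.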