The following features are planned to support the improvement of data quality and structure:
(note: these features may require a powerful data hub or a processing unit that is suitable for such tasks)
Deduplication comprises two steps:
- Finding duplicates (only in the simplest case can this be done via a common identifier)
- Applying an appropriate strategy for merging the duplicates
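The two steps above can be illustrated with a minimal Python sketch. It is not d:swarm code: the record layout, the function names (`normalise`, `find_duplicates`, `merge`) and the choice of title plus year as a matching key are all assumptions made for illustration. The merge strategy shown ("first non-empty value wins") is only one of many possible strategies.

```python
from collections import defaultdict


def normalise(text):
    # Hypothetical normalisation: lowercase and keep only letters and digits.
    return "".join(ch for ch in text.lower() if ch.isalnum())


def find_duplicates(records, key_fields=("title", "year")):
    # Step 1: group records that share the same normalised key.
    # This stands in for the case where no common identifier exists.
    groups = defaultdict(list)
    for record in records:
        key = tuple(normalise(str(record.get(f, ""))) for f in key_fields)
        groups[key].append(record)
    # Only groups with more than one member are duplicate candidates.
    return [group for group in groups.values() if len(group) > 1]


def merge(group):
    # Step 2: one possible merge strategy, "first non-empty value wins".
    merged = {}
    for record in group:
        for field, value in record.items():
            if field not in merged or merged[field] in ("", None):
                merged[field] = value
    return merged
```

A real matching step would usually be fuzzier than an exact key comparison, and the merge strategy would typically depend on the provenance or trustworthiness of each source.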
FRBR-ization is a process that originates in the bibliographic domain. It makes it possible to create a graph of connected bibliographic resources at their various levels of abstraction (see also FRBR@Wikipedia). For example, it can relate concrete manifestations to their abstract works.
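As a rough illustration of relating manifestations to works, the following Python sketch groups manifestation records under a shared work node. The classes `Manifestation` and `Work`, the function `frbrize` and the author/title matching key are hypothetical and not taken from d:swarm. The sketch also skips the FRBR expression and item levels for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    # A concrete publication, e.g. a specific edition (hypothetical fields).
    identifier: str
    title: str
    author: str
    year: int


@dataclass
class Work:
    # The abstract work that one or more manifestations embody.
    title: str
    author: str
    manifestations: list = field(default_factory=list)


def work_key(manifestation):
    # Assumed matching rule: same author and title means same work.
    return (manifestation.author.strip().lower(),
            manifestation.title.strip().lower())


def frbrize(manifestations):
    # Build one Work per key and attach every matching manifestation to it.
    works = {}
    for m in manifestations:
        key = work_key(m)
        if key not in works:
            works[key] = Work(title=m.title, author=m.author)
        works[key].manifestations.append(m)
    return list(works.values())
```

In a graph store, each `Work` would become a node and each entry in `manifestations` an edge from the work to the manifestation node.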
Deduplication and FRBR-ization can happen in the data hub. Then we can refer to data from a specific version or with a specific provenance. Cleaned data can be stored in the data hub as well.
Filtering Statements by Qualified Attributes (Context)
The ability to filter statements by qualified attributes (context), such as provenance, version or trustworthiness, can be utilised when implementing deduplication or FRBR-ization algorithms. For example, a mapping used in a data quality procedure needs to select data based on the source it originates from.
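A minimal Python sketch of such a filter is given below. The `Statement` class, its fields (`provenance`, `version`, `confidence`) and the function `filter_statements` are assumptions for illustration only. They do not reflect the actual d:swarm data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    # A subject-predicate-object statement plus qualifying context
    # (hypothetical attribute names).
    subject: str
    predicate: str
    obj: str
    provenance: str
    version: int
    confidence: float


def filter_statements(statements, provenance=None, version=None,
                      min_confidence=0.0):
    # Keep only statements whose context matches every given criterion.
    # A criterion left at its default does not restrict the result.
    selected = []
    for s in statements:
        if provenance is not None and s.provenance != provenance:
            continue
        if version is not None and s.version != version:
            continue
        if s.confidence < min_confidence:
            continue
        selected.append(s)
    return selected
```

A mapping in a data quality procedure could, for instance, call `filter_statements(statements, provenance="source-a")` to work only on data from one source, where `"source-a"` is a placeholder name.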
While most entities in d:swarm are already modelled to support reuse and sharing, we are planning to make sharing a prominent feature that is easily accessible from various views in the d:swarm Back Office. It should be possible to share and discuss projects, transformations and mappings with other users who face the same data management tasks.