RDSS-Archivematica Test Data Corpus


This Index provides an overview of the collection contents with links to dataset source files and their SIP metadata.


This repository contains a collection of research data files that are used as a test data corpus for analyzing and testing the integration of Archivematica into JISC's Research Data Shared Service (RDSS), beginning with an initial Minimum Viable Product release.

New dataset additions to the collection are guided by the corpus appraisal criteria.

Only files with open access rights that are posted on public websites are collected. However, errors do occur. If you find a file in this collection that is in violation of a copyright, please file an issue, include a link to the rightsholder information, and it will be deleted.

|-- DOI
|--- crawlerInfo
|---- wget-log.txt
|--- SIPmetadata
|---- [last_4_DOI_characters]-SIP1-request.json
|---- [last_4_DOI_characters]-SIP2-request.json
|--- sourceFiles

The test datasets are organized under the /collection/ directory using their Digital Object Identifier. DOIs have broad adoption within the academic, public, and private research domains to identify and locate canonical versions of research articles and their related datasets.
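The layout shown in the tree above can be sketched as a small path helper. This is a minimal illustration, not part of the repository: the function name `dataset_layout`, the `sip_count` parameter, and the example directory name are hypothetical, and it assumes the last four characters of the directory name are the last four DOI characters used in the SIP file names.

```python
from pathlib import PurePosixPath


def dataset_layout(doi_dir_name: str, sip_count: int = 2) -> dict:
    """Return the expected relative paths for one dataset directory.

    doi_dir_name is the directory name as it appears under /collection/.
    How the DOI is written as a directory name is not specified here, so
    the caller passes the name verbatim (assumption).
    """
    base = PurePosixPath("collection") / doi_dir_name
    suffix = doi_dir_name[-4:]  # [last_4_DOI_characters] in the tree
    return {
        "crawler_log": str(base / "crawlerInfo" / "wget-log.txt"),
        "sip_requests": [
            str(base / "SIPmetadata" / f"{suffix}-SIP{n}-request.json")
            for n in range(1, sip_count + 1)
        ],
        "source_files": str(base / "sourceFiles"),
    }


# Hypothetical directory name, used only for illustration.
layout = dataset_layout("10.1234-example-abcd")
for path in layout["sip_requests"]:
    print(path)
```

A helper like this could be used to check that a newly added dataset directory contains the expected files before it is committed.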

An index is provided as a collection finding aid.

Under each /DOI/ test data directory, the /crawlerInfo/ sub-directory contains information about the web crawler (Wget) settings, URLs, HTTP requests, and HTTP responses that were used to collect the digital resources stored in the /sourceFiles/ sub-directory.

The /SIPmetadata/ sub-directory contains re-formatted and newly generated metadata. It is used to test a variety of preservation system submission scenarios using the research data source files and the request.json format that is specified by the RDSS Messaging API. Note: SIP = Submission Information Package in the ISO 14721 OAIS context.


A human-readable YAML template and yaml-2-json Python scripts are provided as helpers to organize harvested metadata and convert it to the request.json schema, which is used to send message payloads (i.e. metadata) between the JISC RDSS components, including Archivematica.
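The general idea of turning human-readable metadata into JSON can be sketched with the standard library alone. This is not the repository's yaml-2-json script and it does not produce the request.json schema: the function name `flat_metadata_to_json` is hypothetical, and the parser only handles flat `key: value` lines, which is a small subset of YAML. A real conversion would presumably use a full YAML parser and the field names defined by the RDSS Messaging API.

```python
import json


def flat_metadata_to_json(text: str) -> str:
    """Convert flat 'key: value' lines into a JSON object string.

    Only a tiny subset of YAML is handled (assumption): no nesting,
    no lists, no type conversion. Every value stays a string.
    """
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        record[key.strip()] = value.strip()
    return json.dumps(record, indent=2, sort_keys=True)


# Sample values taken from this collection's own attribution metadata.
sample = """
title: RDSS-Archivematica Test Data corpus
creator: Peter Van Garderen
publisher: Artefactual Systems
publicationYear: 2017
"""
print(flat_metadata_to_json(sample))
```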


A metadata crosswalk is provided to trace how metadata values move from DataCite properties, through the JISC-RDSS Canonical Data Model (CDM), to the Dublin Core properties stored as PREMIS Intellectual Entities in Archivematica's Archival Information Packages (AIP).
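A two-stage crosswalk of this kind can be sketched as two lookup tables applied in sequence. The sketch below is illustrative only: the CDM field names (`cdm.title` and so on) are hypothetical placeholders, and the specific pairings, such as publicationYear to dc:date, are assumptions rather than a copy of the repository's actual crosswalk.

```python
# Stage 1: DataCite property -> CDM field (CDM names are hypothetical).
DATACITE_TO_CDM = {
    "identifier": "cdm.identifier",
    "title": "cdm.title",
    "creator": "cdm.creator",
    "publisher": "cdm.publisher",
    "publicationYear": "cdm.date",
}

# Stage 2: CDM field -> Dublin Core element (pairings are assumptions).
CDM_TO_DC = {
    "cdm.identifier": "dc:identifier",
    "cdm.title": "dc:title",
    "cdm.creator": "dc:creator",
    "cdm.publisher": "dc:publisher",
    "cdm.date": "dc:date",
}


def crosswalk(datacite_record: dict) -> tuple:
    """Map DataCite properties to Dublin Core via the CDM.

    Returns (dublin_core, unmapped), where unmapped lists the DataCite
    properties that have no entry in one of the two tables.
    """
    dublin_core = {}
    unmapped = []
    for prop, value in datacite_record.items():
        cdm_field = DATACITE_TO_CDM.get(prop)
        dc_element = CDM_TO_DC.get(cdm_field)
        if dc_element is None:
            unmapped.append(prop)
            continue
        # Dublin Core elements are repeatable, so collect values in lists.
        dublin_core.setdefault(dc_element, []).append(value)
    return dublin_core, unmapped


dc, missing = crosswalk({
    "title": "RDSS-Archivematica Test Data corpus",
    "creator": "Peter Van Garderen",
    "resourceTypeGeneral": "collection",
})
print(dc)
print("unmapped:", missing)
```

Reporting unmapped properties separately makes it easier to see where a value would be lost on its way into the AIP.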


This collective work and additional files created in the course of acquiring and curating this collection are freely re-usable under a CC-BY-SA-4.0 license.

datacite.identifier: 10.5072/FK2JM29J5Z  (*)
datacite.identifierType: DOI
datacite.title: RDSS-Archivematica Test Data corpus
datacite.creator: Peter Van Garderen
datacite.publisher: Artefactual Systems
datacite.publicationYear: 2017
datacite.resourceTypeGeneral: collection

( * ) Please don't cite this particular DOI. It is currently valid and functioning, but it is a temporary EZID DOI that expires 02-07-2017. It is given here as an attribution illustration and will be replaced with a permanent DOI, at which point this note will be removed.


A collection of research dataset files used for testing Archivematica integration and functionality in the JISC Research Data Shared Service (RDSS).




