The CESSDA Data Catalogue (CDC) can harvest any XML content provided by an OAI-PMH endpoint. It uses different sets of XPath mappings to adapt the different flavours of the XML payloads to a standard format, namely the CESSDA Metadata Model.
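The idea of an XPath mapping can be illustrated with a minimal sketch. It assumes a DDI Codebook 2.5 payload; the mapping table, the field names and the `map_record` helper are hypothetical and are not taken from the CDC code base, which maintains its own sets of mappings.

```python
import xml.etree.ElementTree as ET

# Namespace used by DDI Codebook 2.5 documents.
NS = {"ddi": "ddi:codebook:2_5"}

# Hypothetical mapping from target field names to XPath expressions
# (illustrative only, not the mappings used by the CDC).
DDI25_MAPPING = {
    "title": "ddi:stdyDscr/ddi:citation/ddi:titlStmt/ddi:titl",
    "identifier": "ddi:stdyDscr/ddi:citation/ddi:titlStmt/ddi:IDNo",
    "abstract": "ddi:stdyDscr/ddi:stdyInfo/ddi:abstract",
}


def map_record(xml_text, mapping=DDI25_MAPPING):
    """Apply each XPath in the mapping and return a flat dictionary."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, path in mapping.items():
        element = root.find(path, NS)
        has_text = element is not None and element.text
        record[field] = element.text.strip() if has_text else None
    return record


SAMPLE = """<codeBook xmlns="ddi:codebook:2_5">
  <stdyDscr>
    <citation>
      <titlStmt><titl>Example Study</titl><IDNo>EX-001</IDNo></titlStmt>
    </citation>
  </stdyDscr>
</codeBook>"""

print(map_record(SAMPLE))
```

Supporting another XML flavour would then amount to supplying a different mapping table rather than changing the extraction logic.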
The CESSDA Metadata Validator (CMV) is part of the pipeline and performs bulk checks on the harvested files. DDI 2.5 metadata files are additionally validated against the XML Schema. Files are validated in the following sequence: XML Schema first, then CMV. The validated files are saved to a Google Cloud Storage bucket.
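The ordering of the two checks can be sketched as a small staged pipeline. This is an illustration under stated assumptions: both stage functions are stand-ins (a well-formedness check in place of real XML Schema validation, and a single title check in place of a CMV profile), and running the second stage even when the first reports problems is an assumption, not documented behaviour.

```python
import xml.etree.ElementTree as ET

NS = {"ddi": "ddi:codebook:2_5"}


def xml_schema_stage(xml_text):
    """Stand-in for XML Schema validation: only checks well-formedness."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return [f"not well-formed: {error}"]
    return []


def cmv_stage(xml_text):
    """Stand-in for a CMV profile check: requires a study title."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return ["cannot check profile: document is not well-formed"]
    title = root.find("ddi:stdyDscr/ddi:citation/ddi:titlStmt/ddi:titl", NS)
    return [] if title is not None else ["missing study title"]


# Stages run in list order: XML Schema first, then CMV.
STAGES = [("xml_schema", xml_schema_stage), ("cmv", cmv_stage)]


def validate(xml_text, stages=STAGES):
    """Run every stage in order and collect the problems each one reports."""
    return {name: stage(xml_text) for name, stage in stages}


print(validate('<codeBook xmlns="ddi:codebook:2_5"/>'))
```

Keeping the stages in an ordered list makes the sequence explicit and keeps each stage's findings separate in the report.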
The results of the validation checks are sent to an Elasticsearch index that feeds a Kibana dashboard. The dashboard shows both summary and detailed information on violations (non-conformance with the CDC DDI profiles) and on XML Schema validation results. The Aggregator component loads the validated files into its storage component and makes them available for aggregators such as OpenAIRE, B2Find and GoTriple to harvest.
The CDC product is made up of several components, which can be grouped as Data Gathering, User Facing, Public API and Management. There are also some repositories which are concerned with Documentation & Issue Tracking and QA & Deployment respectively.
The following Open Source code repositories are used to gather and index metadata:
- cessda.metadata.harvester (periodically harvests the configured endpoints).
- cessda.cdc.osmh-indexer.cmm (runs after the harvester has finished to update the Elasticsearch indices).
The following Open Source code repository is used to provide the user facing components:
- cessda.cdc.searchkit (user interface).
The following components are part of the Aggregator (an OAI-PMH endpoint for the CDC):
- cessda.cdc.aggregator.client (command-line client for synchronizing records to the CESSDA CDC Aggregator DocStore).
- cessda.cdc.aggregator.doc-store (HTTP server providing an API in front of a MongoDB cluster).
- cessda.cdc.aggregator.oai-pmh-repo-handler (HTTP server providing an OAI-PMH aggregator endpoint serving DocStore records).
- cessda.cdc.aggregator.shared-library (Python library containing shared code for the CDC Aggregator).
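Since the Aggregator exposes an OAI-PMH endpoint, a harvesting client would build requests along the following lines. This is a minimal sketch: the base URL is a placeholder, not the real endpoint address, and `list_records_url` is a hypothetical helper. The `verb`, `metadataPrefix`, `set` and `resumptionToken` arguments come from the OAI-PMH 2.0 protocol, which requires every repository to support the `oai_dc` prefix.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL, used for illustration only.
print(list_records_url("https://example.org/oai"))
```

A harvester would typically request the first page with a metadata prefix, then keep following the resumption token returned in each response until none is supplied.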
The following private source code repositories are used to build and deploy the management components:
- cessda.cdc.aggregator.deploy (deploys the CDC Aggregator components).
- cessda.cdc.reverse (reverse proxy used as part of the Certbot automated security certificate renewal process; also provides authentication for components, as needed).
- cessda.cdc.sitemapgenerator (generates a sitemap for use by the Google Dataset Search crawler).
The following public source code repository applies validation to the harvested metadata records:
- cessda.cmv.console (command line application of the CESSDA Metadata Validator).
The following private source code repositories are used to build the documentation components:
- cessda.cdc.userguide (Markdown source files which are compiled to static HTML using Jekyll with the Just the Docs theme).
- cessda.cdc.versions (contains an issue tracker used internally to record the backlog).
The following private source code repositories are used to test and deploy the product's components:
- cessda.cdc.deploy (contains all the scripts and infrastructure definitions needed to deploy the product).
- cessda.cdc.test (contains test scripts used to QA the product during the deployment process).
See CDC Developer documentation for details.
See CDC Operations documentation for details.
See CDC User guide for details.
The Jenkinsfile in each of the Data Gathering and User Facing component repositories defines the build pipeline for that component. See also the 'README.md' file in each of those repositories.
See the 'QA and Deployment' section, above.
Please read the CESSDA Software Development Guidelines for details on our code of conduct, and the process for submitting pull requests to us.
See Semantic Versioning for guidance.
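As a quick reminder of how Semantic Versioning numbers change, here is a minimal sketch. The `bump` helper is hypothetical and handles plain `MAJOR.MINOR.PATCH` strings only; pre-release and build metadata suffixes are out of scope.

```python
def bump(version, change):
    """Return the next version for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # incompatible API change
        return f"{major + 1}.0.0"
    if change == "minor":   # backwards-compatible new functionality
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


print(bump("1.4.2", "minor"))
```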
You can find the list of contributors in the CONTRIBUTORS.md file for each component repository.
See the LICENSE file for each component repository.
See the FAQ file.
None at present.