Darwin Core Archives – How-to Guide
Table of Contents
- What is Darwin Core Archive (DwC-A)?
- DwC-A Components
- DwC-A Data Publishing Solutions
- Publishing DwC-A using the IPT
- Registering your Dataset using IPT
- Publishing DwC-A using GBIF Spreadsheet Templates
- Publishing DwC-A Manually
- Validation of DwC-As
- Registration of DwC-As with GBIF
- Annex: Preparing Your Data
- Required and recommended terms
- Character Encoding
- Data From a Database
- DwC-A Examples
| Version | Description | Date of release | Author(s) |
|---|---|---|---|
| 1.0 | Content review and additions | April 2011 | David Remsen, Markus Döring |
| 2.0 | Transferred to wiki, major edits | 9 May 2017 | Kyle Braak |
GBIF (2017). Darwin Core Archives – How-to Guide, version 2.0, released on 9 May 2017 (contributed by Remsen D, Braak K, Döring M, Robertson T), Copenhagen: Global Biodiversity Information Facility, accessible online at: https://github.com/gbif/ipt/wiki/DwCAHowToGuide
Cover art credit: Kim Wismann, Cicindelinae
What is Darwin Core Archive (DwC-A)?
Darwin Core Archive (DwC-A) is a biodiversity informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset for sharing species-level (taxonomic) data, species-occurrence data, and sampling-event data. An archive is a set of text files, in standard comma- or tab-delimited format, with a simple descriptor file (called meta.xml) that tells others how your files are organised. The format is defined in the Darwin Core Text Guidelines. It is the preferred format for publishing data in the GBIF network.
The central idea of an archive is that its data files are logically arranged in a star-like manner, with one core data file surrounded by any number of ‘extension’ data files. Core and extension files contain data records, one per line. Each extension record (or ‘extension file row’) points to a record in the core file; in this way, many extension records can exist for each single core record. This is sometimes referred to as a “star schema”.
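The star-like arrangement can be sketched in a few lines of Python. This is only an illustration: the records, the `group_by_core_id` helper, and the choice of a Taxon core with a vernacular-name extension are assumptions made for the example, not part of the standard.

```python
from collections import defaultdict

# Hypothetical core (Taxon) records: one row per taxon, keyed by a core ID.
core_rows = [
    {"taxonID": "t1", "scientificName": "Puma concolor"},
    {"taxonID": "t2", "scientificName": "Panthera onca"},
]

# Hypothetical extension records: each row points back to one core record.
extension_rows = [
    {"taxonID": "t1", "vernacularName": "cougar", "language": "en"},
    {"taxonID": "t1", "vernacularName": "puma", "language": "es"},
    {"taxonID": "t2", "vernacularName": "jaguar", "language": "en"},
]


def group_by_core_id(rows, id_field):
    """Group extension rows by the core ID they reference."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_field]].append(row)
    return dict(grouped)


names_by_taxon = group_by_core_id(extension_rows, "taxonID")

# Many extension records can hang off a single core record.
for core in core_rows:
    linked = names_by_taxon.get(core["taxonID"], [])
    print(core["scientificName"], "->", [r["vernacularName"] for r in linked])
```

Here the first core record has two extension rows and the second has one, which is the one-to-many relationship the star schema describes.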
Sharing entire datasets as DwC-As instead of using pageable web services like DiGIR and TAPIR allows much simpler and more efficient data transfer. For example, retrieving 260,000 records via TAPIR takes about nine hours and involves issuing 1,300 HTTP requests to transfer 500 MB of XML-formatted data. The exact same dataset, when encoded as a DwC-A and zipped, becomes a 3 MB file. GBIF therefore highly recommends compressing an archive using ZIP or GZIP when generating a DwC-A. In addition, producing DwC-As does not require a data publisher to install any dedicated software, making it a much simpler option.
The production of a DwC-A requires the use of stable identifiers for core records, but not for extensions. For any kind of shared data it is therefore necessary to have some sort of local record identifiers. It is good practice to maintain, with the original data, identifiers that are stable over time and are not reused after a record is deleted. If possible, please provide globally unique identifiers (GUIDs) instead of local ones. Refer to A Beginner’s Guide to Persistent Identifiers for more information about GUIDs. This identifier is referred to as the “core ID” in DwC-As, and the specific Darwin Core term that it corresponds to depends on the data type being published.
A DwC-A may consist of a single data file or multiple files, depending on the scope of the published data. The specific types of files that may be included in an archive are the following:
A required core data file consisting of a standard set of Darwin Core terms. The data file is formatted as fielded text, where data records are expressed as rows of text, and data elements (columns) are separated with a standard delimiter such as a tab or comma (commonly referred to as CSV or ‘comma-separated value’ files). The first row of the data file may either contain data or serve as a ‘header row’. In general, if a header row is included, it contains the names of the Darwin Core terms represented in the succeeding rows of data.
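As a minimal sketch of such a core data file, the following Python writes a comma-delimited occurrence file with a header row of Darwin Core term names. The records, the `urn:example:` identifiers, and the `write_core` helper are made up for illustration; the selection of terms is an assumption, not a statement of which terms are required.

```python
import csv
import io

# Darwin Core term names used as column headers (an illustrative selection).
TERMS = ["occurrenceID", "basisOfRecord", "scientificName", "eventDate"]

# Hypothetical records; the identifiers are placeholders, not real GUIDs.
records = [
    {"occurrenceID": "urn:example:occ:1", "basisOfRecord": "PreservedSpecimen",
     "scientificName": "Puma concolor", "eventDate": "2016-05-09"},
    {"occurrenceID": "urn:example:occ:2", "basisOfRecord": "HumanObservation",
     "scientificName": "Panthera onca", "eventDate": "2017-01-15"},
]


def write_core(rows, terms, delimiter=","):
    """Return delimited text: one header row, then one record per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=terms,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


core_text = write_core(records, TERMS)
print(core_text)
```

Passing `delimiter="\t"` would produce the tab-delimited variant instead.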
GBIF currently supports the following three biodiversity data types as the basis for a core data file:
- Occurrence data - The category of information pertaining to evidence of an occurrence in nature, in a collection, or in a dataset (specimen, observation, etc.). Core files of this type are used to share information about a specific instance of a taxon such as a specimen or observation. The required core ID is represented by dwc:occurrenceID. The definitive list of Occurrence terms can be found in the Occurrence (Core) Extension.
- Checklist data - The category of information pertaining to taxa or taxon concepts, such as species. Core files of this type are used to share annotated species checklists, taxonomic catalogues, and other information about taxa. The required core ID is represented by dwc:taxonID. The definitive list of core Taxon terms can be found in the Taxon (Core) Extension.
- Sampling-event data - The category of information pertaining to a sampling event. Core files of this type are used to share information about ecological investigations that can be one off studies or monitoring programmes that are usually quantitative, calibrated and follow certain protocols so that changes and trends of populations can be detected. The required core ID is represented by dwc:eventID. The definitive list of core Event terms can be found in the Event (Core) Extension.
- Optional “extension” files support the exchange of additional, described classes of data that relate to the core data type (Occurrence or Taxon). An extension record points to a record in the core data file. An extension may apply only to Taxon cores, only to Occurrence cores, or to both. For example, the Vernacular Names extension (illustrated below) is an extension to the Taxon class, whereas an Images extension may be used with both. Extensions can be created and added to the GBIF Extension Repository following a consultation and development process with GBIF. The definitive list of supported Extensions can be found on the GBIF Extension Repository.
- A descriptor metafile describes how the files in your archive are organised. It lists the files in the archive and maps each data column to a corresponding standard Darwin Core or Extension term. The metafile is a relatively simple XML file format. GBIF provides an online tool for making this file, but the format is simple enough that many data administrators will be able to generate it manually. These options are described in the DwC-A Data Publishing Solutions section of this document.
A metafile is required when an archive includes any extension files or if a single core data file uses non-standard column names in the first (header) row of data. A complete reference guide to this metafile is available.
- Datasets require documentation. This is achieved in a DwC-A by including a resource metadata document that provides information about the dataset itself, such as a description (abstract) of the dataset, the agents responsible for authorship, publication and documentation, bibliographic and citation information, collection methods and much more. GBIF currently supports a metadata profile based on the Ecological Metadata Language, but other metadata standards exist and may be supported. The GBIF Metadata Profile's XML Schema description can be found on the GBIF Schema Repository.
This single, compressed file is the DwC-A file!
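To make the descriptor metafile more concrete, here is a hedged Python sketch that generates a minimal meta.xml for a single Occurrence core file. The element and attribute names follow the Darwin Core Text Guidelines as far as we are aware, but check them against the guidelines before relying on them. The file name `occurrence.csv`, the chosen terms, and the `build_meta` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"
DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def build_meta(core_file, core_terms, row_type=DWC_TERMS + "Occurrence"):
    """Build a minimal metafile that maps each column index to a term URI."""
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS, "metadata": "eml.xml"})
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",   # written literally as \n in the XML
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",     # the data file starts with a header row
        "rowType": row_type,
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = core_file
    # Column 0 holds the core ID; it is also mapped explicitly as a field.
    ET.SubElement(core, "id", {"index": "0"})
    for index, term in enumerate(core_terms):
        ET.SubElement(core, "field", {"index": str(index), "term": DWC_TERMS + term})
    return ET.tostring(archive, encoding="unicode")


meta_xml = build_meta("occurrence.csv",
                      ["occurrenceID", "scientificName", "eventDate"])
print(meta_xml)
```

An extension file would be described by an additional `extension` element with a `coreid` column in place of `id`; that part is left out to keep the sketch short.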
DwC-A Data Publishing Solutions
There are a number of different options for generating a DwC-A.
Answering the following questions will help you select the most appropriate solution for creating your own archive:
- Have your data been digitised? (If yes, it is assumed that you can easily convert the data into CSV or Tab format).
- Are your data stored in a relational database?
- How many separate datasets (DwC-Archives) do you plan to publish?
Publishing DwC-As using the IPT is most suitable when:
- Your data have been digitised already.
- Your data may or may not already be in a relational database.
- You need to create/manage multiple archives.
- You would like to document datasets using the GBIF Metadata Profile.
Publishing DwC-As using GBIF Spreadsheet Templates is most suitable when:
- Your data have not been digitised already.
- You already maintain data using spreadsheets.
- You need a simple solution to create/manage a limited number of datasets.
- You need extra guidance capturing and formatting the data.
Publishing DwC-As manually is most suitable when:
- Your data have been digitised already.
- Your data may be in a relational database.
- You only need to create/manage a small number of archives, and/or you have the technical skills to automate / script the archive generation process.
A more detailed discussion of these three options follows.
Publishing DwC-A using the IPT
Assumption: Your data are already stored as a CSV/Tab text file, or in one of the supported relational database management systems (MySQL, PostgreSQL, Microsoft SQL Server, Oracle, Sybase). Preferably, you are already using Darwin Core terms as column names, although this is not compulsory.
The Integrated Publishing Toolkit (IPT) is GBIF’s flagship tool for publishing DwC-As.
The simplest way to begin using the IPT is to request a free account on a trusted data hosting centre allowing you to manage your own datasets and publish them through GBIF.org without the hassle of setting up and maintaining the IPT on your own server.
Otherwise, if you want to set up your own instance of the IPT, the Getting Started Guide is your entry point.
The IPT can be used to publish resource metadata, occurrence data, checklist data, and sampling-event data. The guide How to publish biodiversity data through GBIF.org provides a simple set of instructions on how to do so.
The IPT outputs a DwC-A during publishing and supports automatic registration in the GBIF network. See the IPT User Manual for further details.
Publishing DwC-A using GBIF Spreadsheet Templates
Assumption: The occurrence data, simple taxonomic data, or sampling-event data to be published are not yet captured in digital format OR a simple solution for creating a metadata document to describe a dataset is desired.
GBIF provides a set of pre-configured Microsoft Excel spreadsheet files that serve as templates for capturing occurrence data, checklist data, and sampling-event data:
- Checklist data template: suitable for basic species checklists
- Occurrence data template: suitable for occurrence data (specimen, observation)
- Sampling-event data template: suitable for sampling-event data
- Resource metadata template: suitable for composing a metadata document - pending but imminent
Each template provides inline help and instructions in the worksheets.
To publish the data as a DwC-A, upload the templates to the IPT. Use the IPT's built-in metadata editor to enter dataset metadata. The guide How to publish biodiversity data through GBIF.org provides a simple set of instructions on how to do so. If you require an account on an IPT, it is highly recommended that you request an account on a trusted data hosting centre located in your country.
Publishing DwC-A Manually
Assumption: The data are already in a CSV/Tab text file (or can easily be exported to one), or are stored in one of the supported relational database management systems (MySQL, PostgreSQL, Microsoft SQL Server, Oracle, Sybase). The publisher does not wish to host an IPT instance but does have access to a web server.
DwC-As can be created without installing any dedicated software. These instructions target data managers who are familiar with the dataset to be published and are comfortable working with their data management system.
Below is a set of instructions on how to manually create a DwC-Archive:
- Unless the data are already stored in a CSV/Tab text file, the publisher needs to prepare one or more text files from the source. If the data are stored in a database, generate an output of delimited text from the source database into an outfile. Most database management systems support this process; an example is given in the Annex to this guide, below, in the section “Outputting Data From a MySQL Database Into a Textfile”. As the metafile maps the columns of the text file to Darwin Core terms, it is not necessary to use Darwin Core terms as column headers in the resultant text file, though it may help to reduce errors. A general recommendation is to produce a single core data file and a single file for each extension if the intention is to output data tied to an extension.
- Create a Metafile: The file can be generated in either of the following ways:
- Create it manually by using an XML editor and using a sample metafile as a guiding example. A complete description of the metafile format can be found in the Darwin Core Text Guide.
- Create it using the online application Darwin Core Archive Assistant. Simply select the fields of data to be published, provide some details about the files and save the resultant XML. This only needs to be done once unless the set of published fields changes at some later time. Warning: this tool is no longer supported by GBIF. Support for the Event core is missing. Publishers also need to manually add the term dwc:taxonID to the Taxon core and dwc:occurrenceID to the Occurrence core, to ensure they are explicitly included.
- Create a metadata file (eml.xml) that describes the data resource. Complete instructions on doing this are available in the GBIF Extended Metadata Profile: How-To Guide. It is best practice to include a metadata file and the simplest way to produce one is using the IPT's built-in metadata editor.
- Ensure the data files, the metafile (meta.xml) and metadata file (eml.xml) are in the same directory or folder. Compress the folder using one of the supported compression formats. The result is a DwC-A.
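The final compression step can be scripted. The sketch below uses Python's standard `zipfile` module to place every file from one folder at the top level of a ZIP archive. The folder layout, file names, and the `zip_archive` helper are assumptions for illustration, and the meta.xml and eml.xml contents written here are empty placeholders rather than valid documents.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_archive(folder, zip_path):
    """Zip every file in `folder` into `zip_path`, with no subdirectories."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                archive.write(path, arcname=path.name)
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "dwca"
    folder.mkdir()
    # Placeholder contents only; real files come from the earlier steps.
    (folder / "occurrence.csv").write_text(
        "occurrenceID,scientificName\nurn:example:occ:1,Puma concolor\n",
        encoding="utf-8")
    (folder / "meta.xml").write_text("<archive/>", encoding="utf-8")
    (folder / "eml.xml").write_text("<eml/>", encoding="utf-8")

    zip_path = zip_archive(folder, Path(tmp) / "dwca.zip")
    with zipfile.ZipFile(zip_path) as archive:
        members = sorted(archive.namelist())

print(members)
```

Writing each file with `arcname=path.name` keeps the data files, meta.xml and eml.xml side by side in the archive, matching the instruction to keep them in the same folder.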
Note: Metadata authored using the IPT can be output as an RTF document, which can then be submitted as a ‘Data Paper’ manuscript to ZooKeys, PhytoKeys and BioRisk. See the instructions to authors for ‘Data Paper’ submission to these journals.
Validation of DwC-As
GBIF provides an online DwC-Archive Validator that performs the following checks:
- The metafile (meta.xml) is valid XML and complies with the Darwin Core Text Guidelines.
- The content complies with the known extensions and terms registered within the GBIF network. Note that GBIF runs a production and a development registry that keep track of extensions, both of which are used by this validator.
- The metadata file (eml.xml) is valid XML and complies with the GBIF Metadata Profile schema and the official EML schema.
- Referential integrity - that mapped ID terms in extension files reference existing core records.
- All core IDs are unique.
- No verbatim null values, for example NULL or \N, are found in the data.
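Some of these checks are easy to run locally before uploading. The following Python sketch covers only three of them (unique core IDs, referential integrity, and verbatim nulls) on rows already loaded as dictionaries. It is not the GBIF validator and does not check the XML files. The `check_archive` helper and the sample rows are hypothetical.

```python
# Verbatim null markers named in the checks above: NULL and \N.
NULL_MARKERS = {"NULL", "\\N"}


def check_archive(core_rows, extension_rows, core_id, ext_core_id):
    """Return a list of human-readable problems found in the rows."""
    problems = []

    # 1. Every core ID must be unique.
    seen = set()
    for row in core_rows:
        value = row[core_id]
        if value in seen:
            problems.append(f"duplicate core ID: {value}")
        seen.add(value)

    # 2. Every extension row must reference an existing core record.
    for row in extension_rows:
        if row[ext_core_id] not in seen:
            problems.append(
                f"extension row references unknown core ID: {row[ext_core_id]}")

    # 3. No field may contain a verbatim null marker.
    for row in list(core_rows) + list(extension_rows):
        for field, value in row.items():
            if value in NULL_MARKERS:
                problems.append(f"verbatim null in {field}: {value}")

    return problems


# Deliberately broken sample data to show each check firing once.
core_rows = [
    {"taxonID": "t1", "scientificName": "Puma concolor"},
    {"taxonID": "t1", "scientificName": "NULL"},
]
extension_rows = [{"taxonID": "t9", "vernacularName": "jaguar"}]

problems = check_archive(core_rows, extension_rows,
                         core_id="taxonID", ext_core_id="taxonID")
for problem in problems:
    print(problem)
```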
To use the validator:
- Upload the DwC-A using the form provided in the Validator web page.
- Review the response and address any validation errors.
- Repeat the process until the file is successfully validated.
- Contact the GBIF Helpdesk if you get stuck (email@example.com).
Registration of DwC-As with GBIF
An entry for the resource must be made in the GBIF Registry that enables the resource to be discoverable and accessible. Each new registration needs to be associated with a publishing organization that has been formally endorsed by a GBIF Participant Node manager. This is a simple quality control step required by the GBIF Participant Node Managers Committee.
Fortunately, the IPT and the GBIF API support automatic registration of datasets. Otherwise, if you are publishing DwC-As manually, initiate registration by sending an email to mailto:firstname.lastname@example.org with the following information:
- Dataset title
- Dataset description (copied from metadata file)
- Publishing organisation name (must be registered in GBIF, otherwise register it by filling in this online questionnaire).
- Your relation to this organisation
- Dataset URL (publicly accessible address of zipped DwC-A)
You will receive a confirmation email, and a URL representing the resource entry in the Registry.
Annex: Preparing Your Data
Required and recommended terms
The guide How to publish biodiversity data through GBIF.org provides a set of required and recommended terms for each type of data:
- Checklist data: required terms / recommended terms
- Occurrence data: required terms / recommended terms
- Sampling-event data: required terms / recommended terms
- Resource metadata: required terms / recommended terms
Character Encoding
Recommended best practice is to encode text (data) files using UTF-8.
The following tools for Unix and Windows can be used to convert character encodings of files:
Example: convert a file from Windows-1252 to UTF-8 using iconv:
iconv -f CP1252 -t UTF-8 example.txt > exampleUTF8.txt
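Where iconv is not available, the same conversion can be sketched with Python's standard library. This is an illustrative alternative rather than a tool named by this guide; the `convert_to_utf8` helper and the sample file contents are assumptions.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(source, target, source_encoding="cp1252"):
    """Re-encode a text file as UTF-8, leaving line endings untouched."""
    raw = Path(source).read_bytes()
    Path(target).write_bytes(raw.decode(source_encoding).encode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "example.txt"
    target = Path(tmp) / "exampleUTF8.txt"
    # Create a small Windows-1252 file containing a non-ASCII character.
    source.write_bytes("Markus Döring\n".encode("cp1252"))

    convert_to_utf8(source, target)
    converted = target.read_bytes()

print(converted.decode("utf-8"))
```

Decoding will raise an error or produce wrong characters if `source_encoding` does not match the real encoding of the file, so it is worth checking a few records with accented characters after conversion.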
Data From a Database
It is easy to produce delimited text files from a database using SQL commands. For MySQL, use the SELECT ... INTO OUTFILE statement. The encoding of the resulting file will depend on the server variables and collations used, and might need to be modified before the operation is done. Note that MySQL will export NULL values as \N by default. Use the IFNULL() function as shown in the following example to avoid this:
SELECT IFNULL(id, ''), IFNULL(scientific_name, ''), IFNULL(count, '') INTO OUTFILE '/tmp/dwc.txt' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' FROM dwc;
Here are some other recommendations for generating data using SQL queries/functions:
- Concatenate or split strings as required, e.g. to construct the full scientific name string (watch out for autonyms)
- Format dates to conform to the ISO 8601 date and time format
- Create year/month/day by parsing native SQL date types
- Use a UNION to merge 2 or more tables, e.g. accepted taxa and synonyms, or specimen and observations
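The recommendations above can be tried out with a small, self-contained Python sketch. It uses an in-memory SQLite database as a stand-in for the source database, so the SQL functions shown (`strftime`, `||`) are SQLite's and would differ in MySQL or other systems. The table names, column names, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE specimen (id TEXT, genus TEXT, epithet TEXT, collected TEXT);
CREATE TABLE observation (id TEXT, genus TEXT, epithet TEXT, collected TEXT);
INSERT INTO specimen VALUES ('s1', 'Puma', 'concolor', '2016-05-09 14:30:00');
INSERT INTO observation VALUES ('o1', 'Panthera', 'onca', NULL);
""")

# Concatenate name parts, format dates as ISO 8601, derive the year,
# replace NULLs with empty strings, and merge two tables with UNION ALL.
QUERY = """
SELECT id AS occurrenceID,
       'PreservedSpecimen' AS basisOfRecord,
       genus || ' ' || epithet AS scientificName,
       IFNULL(strftime('%Y-%m-%d', collected), '') AS eventDate,
       IFNULL(strftime('%Y', collected), '') AS year
FROM specimen
UNION ALL
SELECT id,
       'HumanObservation',
       genus || ' ' || epithet,
       IFNULL(strftime('%Y-%m-%d', collected), ''),
       IFNULL(strftime('%Y', collected), '')
FROM observation
ORDER BY 1
"""

cursor = conn.execute(QUERY)
header = [column[0] for column in cursor.description]

# Write the result as comma-delimited text with a header row.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(cursor)
export_text = buffer.getvalue()
conn.close()

print(export_text)
```

Because the NULL collection date is replaced inside the query, the exported row for the observation ends with empty fields instead of a verbatim null marker.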
DwC-A Examples
The guide How to publish biodiversity data through GBIF.org provides a set of example DwC-As for each type of data: