GMPHowToGuide

Kyle Braak edited this page Sep 5, 2017 · 46 revisions

GBIF Metadata Profile – How-to Guide

Version 2.0

Table of Contents

  • Introduction
  • Metadata Publishing Solutions
    • Publishing metadata using the IPT
    • Publishing metadata using the GBIF Metadata Template
    • Publishing metadata manually
  • Validation of metadata
  • What changed in version 1.1 of the GMP since 1.0.2?
  • Example files
  • Annex
    • Background to the GBIF Metadata Profile
    • Metadata Elements
      • Dataset (Resource)
      • Project
      • People and Organisations
      • KeywordSet (General Keywords)
      • Coverage
      • Taxonomic Coverage
      • Geographic Coverage
      • Temporal Coverage
      • Methods
      • Intellectual Property Rights
      • Additional Metadata + NCD (Natural Collections Description Data) Related

Suggested citation

GBIF (2011). GBIF Metadata Profile – How-to Guide, (contributed by Ó Tuama, Eamonn, Braak, K. Remsen, D.), Copenhagen: Global Biodiversity Information Facility ISBN: 87-92020-24-0, accessible online at: https://github.com/gbif/ipt/wiki/GMPHowToGuide

Cover Art Credit: John Giez, Maidenhair fern sporophyte, Adiatum sp.

Document Control

Version Description Date of release Author(s)
1.0 Checked consistencies across relevant documents, updated links to production sites, updated text and pictures to reflect current functionalities. 1 Mar 2011 KB, BK, MR
2.0 Transferred to wiki, major edits 19 June 2017 Kyle Braak

Introduction

Documenting the provenance and scope of datasets is required in order to publish data through the GBIF network. Dataset documentation is referred to as ‘resource metadata’ that enable users to evaluate the fitness-for-use of a dataset.

There are various ways to write a metadata document conforming to the GBIF Metadata Profile (GMP). This How-To Guide will go through the most common ways, such as using the GBIF Integrated Publishing Toolkit (IPT) metadata editor, the GBIF Resource Metadata template (pending), or generating a metadata document manually. The guide also serves as a reference guide to the GBIF Metadata Profile itself.

If metadata describing a dataset are also being published using Darwin Core Archives (DwC-A), the metadata file will be included in the DwC-A file that bundles it together with the data (based on the Darwin Core terms) that it describes. For help with making the complete DwC-A, refer to the Darwin Core Archive: How-To Guide.

Once the metadata document has been written and validated, it is ready to be published.

Ultimately, the goal in publishing the metadata is that the data resource described therein can be fully documented and registered in the GBIF Registry. In so doing, the data resource becomes globally discoverable.

Metadata Publishing Solutions

If sampling-event data, occurrence data or checklist data are being published using the IPT, there is a built-in metadata authoring functionality that can be used to write an accompanying metadata document conforming to the GMP. It may be convenient to use this tool for authoring metadata even if no data is being published, especially if there is a need to author and manage several metadata documents. On the other hand, if only a few metadata documents are needed, it might be easiest to generate them manually, for example by modifying a sample document. Below is a description of each of these methodologies.

Publishing metadata using the IPT

The IPT contains a built-in metadata editor that allows you to easily fill in resource metadata, validate it, and produce an EML file that is always valid XML. Users are recommended to reuse an IPT data hosting centre instead of installing and maintaining their own installation.

In total, the IPT has 12 different metadata forms that logically organise metadata entry:

  1. Basic Metadata
  2. Geographic Coverage
  3. Taxonomic Coverage
  4. Temporal Coverage
  5. Other Keywords
  6. Associated Parties
  7. Project Data
  8. Sampling Methods
  9. Citations
  10. Collection Data
  11. Physical Data
  12. Additional Metadata

The IPT User Manual goes through each form and its respective fields in some depth. The form provides help dialogs to aid the user in understanding what an element means (Figure 1).

Figure 1. Screenshot of a help dialog for the term “Personnel Identifier”

To ensure suitable data are entered, the fields are validated and informative messages displayed back to the user to assist them in filling out the forms (Figure 2).

Figure 2. Screenshot of the field validation message displayed when an email field is submitted with an irregular email address.

For further reference, a description of each element in the GBIF Metadata Profile can be found below with an accompanying example.

The IPT publishes the metadata document and ensures that it is validated against the GBIF Metadata Profile so the user does not have to worry about validation.

If at any time the metadata are modified, the user only has to update the document and click the “Publish” button on the Manage Resource page to publish a new version of the document (resource) (Figure 3).

Figure 3. Screenshot of the Published Versions section of the Manage Resource page of the IPT.

At any point, the resource manager can choose to make the resource publicly available on the Internet and subsequently even register it with GBIF making it globally discoverable.

Publishing metadata using the GBIF Metadata Template

The GBIF Metadata Template is similar to a manuscript template that makes it easy to author resource metadata. Once data have been entered into the template, a metadata author will have to enter it into the IPT via the metadata editor. The required fields will all be clearly indicated. The IPT metadata editor ensures that all mandatory fields have been filled in and that any fields using controlled vocabularies get entered correctly, e.g. the country field. The IPT also ensures the generated metadata document is valid XML and validates against the GBIF Metadata Profile. Ultimately this two-step process (1. metadata template -> 2. IPT metadata editor) can be used to generate a valid resource metadata document.

Where there is doubt about what a field means, refer to this guide to look up the description of its corresponding element with an accompanying example.

Publishing metadata manually

Below is a simple set of instructions for non-IPT users wishing to generate their own custom EML XML file complying with the latest version of the GBIF Metadata Profile: 1.1. Refer to the following list to ensure it is completed properly:

Instructions

  1. Use the schema location for version 1.1 of the GBIF Metadata Profile in the <eml:eml> root element: <eml:eml ... xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml.xsd" ...>.
  2. Set the packageId attribute inside the <eml:eml> root element. Remember, the packageId should be any globally unique ID fixed for that document. Whenever the document changes, it must be assigned a new packageId. For example: packageId='619a4b95-1a82-4006-be6a-7dbe3c9b33c5/eml-1.xml' for the 1st version of the document, packageId='619a4b95-1a82-4006-be6a-7dbe3c9b33c5/eml-2.xml' for the 2nd version, and so on.
  3. Fill in all mandatory metadata elements specified by the schema, plus any additional metadata elements desired. When updating an existing EML file using an earlier version of the GBIF Metadata Profile, refer to the section below for a list of what's new in this version.
  4. Ensure the EML file is valid XML. For assistance, refer to this section.

Validation of metadata

It is essential the XML metadata document is valid, both as an XML document and as validating against the GML schema. There are several options for how to do this. The Oxygen XML Editor is an excellent tool with built-in validator you can use to do this. Java programmers could also do this for example by using the EmlValidator.java from the GBIF registry-metadata project.

What changed in version 1.1 of the GMP since 1.0.2?

  1. Support for a machine readable license. Note instructions on how to provide a machine readable license can be found here. I assume it would make sense to make this change when implementing issue #78.
  2. Support for multiple contacts, creators, metadataProvider and project personnel
  3. Support for userIds for any agent (e.g. ORCID)
  4. Support for providing information about the frequency with which changes are made to the dataset
  5. Support for providing a project identifier (e.g. to associate datasets under a common project)
  6. The description can be broken into separate paragraphs versus all lumped into one
  7. Support for providing information about multiple collections

Example files

An example EML complying with v1.1 of the GBIF Metadata Profile can be found here. Note this file has been generated by the ALA IPT.

Annex

Background to the GBIF Metadata Profile

Metadata, literally “data about data” are an essential component of a data management system, describing such aspects as the “who, what, where, when and how” pertaining to a resource. In the GBIF context, resources are datasets, loosely defined as collections of related data, the granularity of which is determined by the data custodian. Metadata can occur in several levels of completeness. In general, metadata should allow a prospective end user of data to:

  1. Identify/discover its existence,
  2. Learn how to access or acquire the data,
  3. Understand its fitness-for-use, and
  4. Learn how to transfer (obtain a copy of) the data.

The GBIF Metadata Profile (GMP) was developed in order to standardise how resources get described at the dataset level in the GBIF Data Portal. This profile can be transformed to other common metadata formats such as the ISO 19139 metadata profile.

In the GMP there is a minimum set of mandatory elements required for identification, but it is recommended that as many elements be used as possible to ensure the metadata are as descriptive and complete as possible.

Metadata Elements

The GBIF Metadata Profile is primarily based on the Ecological Metadata Language (EML). The GBIF profile utilises a subset of EML and extends it to include additional requirements that are not accommodated in the EML specification. The following tables provide short descriptions of the profile elements, and where relevant, links to more complete EML descriptions. The elements are categorised as follows:

  • Dataset (Resource)
  • Project
  • People and Organisations
  • Keyword Set (General Keywords)
  • Coverage
    • Taxonomic Coverage
    • Geographic Coverage
    • Temporal Coverage
  • Methods
  • Intellectual Property Rights
  • Additional Metadata + NCD (Natural Collections Descriptions Data) Related

Dataset (Resource)

The dataset field has elements relating to a single dataset (resource).

Term name Description
alternateIdentifier It is a Universally Unique Identifier (UUID) for the EML document and not for the dataset. This term is optional. A list of different identifiers can be supplied. E.g., 619a4b95-1a82-4006-be6a-7dbe3c9b33c5.
title A description of the resource that is being documented that is long enough to differentiate it from other similar resources. Multiple titles may be provided, particularly when trying to express the title in more than one language (use the "xml:lang" attribute to indicate the language if not English/en). E.g. Vernal pool amphibian density data, Isla Vista, 1990-1996.
creator The resource creator is the person or organization responsible for creating the resource itself.
metadataProvider The metadataProvider is the person or organization responsible for providing documentation for the resource. See section “People and Organisations” for more details.
associatedParty An associatedParty is another person or organisation that is associated with the resource. These parties might play various roles in the creation or maintenance of the resource, and these roles should be indicated in the "role" element. See section “People and Organisations” for more details.
contact The contact field contains contact information for this dataset. This is the person or institution to contact with questions about the use, interpretation of a data set.
pubDate The date that the resource was published. The format should be represented as: CCYY, which represents a 4 digit year, or as CCYY-MM-DD, which denotes the full year, month, and day. Note that month and day are optional components. Formats must conform to ISO 8601. E.g. 2010-09-20.
language The language in which the resource (not the metadata document) is written. This can be a well-known language name, or one of the ISO language codes to be more precise. GBIF recommendation is to use the ISO language code (http://vocabularies.gbif.org/vocabularies/lang). E.g., English.
additionalInfo Information regarding omissions, instructions or other annotations that resource managers may wish to include with a dataset. Basically, any information that is not characterized by the other resource metadata fields.
url The URL of the resource that is available online.
abstract A brief overview of the resource that is being documented.

Project

The project field contains information on the project in which this dataset was collected. It includes information such as project personnel, funding, study area, project design and related projects.

Term Definition
title A descriptive title for the research project. E.g., Species diversity in Tennessee riparian habitats
personnel The personnel field is used to document people involved in a research project by providing contact information and their role in the project.
funding The funding field is used to provide information about funding sources for the project such as: grant and contract numbers; names and addresses of funding sources.
studyAreaDescription The studyAreaDescription field documents the physical area associated with the research project. It can include descriptions of the geographic, temporal, and taxonomic coverage of the research location and descriptions of domains (themes) of interest such as climate, geology, soils or disturbances.
designDescription The field designDescription contains general textual descriptions of research design. It can include detailed accounts of goals, motivations, theory, hypotheses, strategy, statistical design, and actual work. Literature citations may also be used to describe the research design.

People and Organisations

There are several fields that could represent either a person or an organisation. Below is a list of the various fields used to describe a person or organisation.

Term Definition
givenName Subfield of individualName field. The given name field can be used for the first name of the individual associated with the resource, or for any other names that are not intended to be alphabetized (as appropriate). E.g., Jonny
surName Subfield of individualName field. The surname field is used for the last name of the individual associated with the resource. This is typically the family name of an individual, for example, the name by which s/he is referred to in citations. E.g. Carson
organizationName The full name of the organization that is associated with the resource. This field is intended to describe which institution or overall organization is associated with the resource being described. E.g., National Center for Ecological Analysis and Synthesis
positionName This field is intended to be used instead of a particular person or full organization name. If the associated person that holds the role changes frequently, then Position Name would be used for consistency. Note that this field, used in conjunction with 'organizationName' and 'individualName' make up a single logical originator. Because of this, an originator with only the individualName of 'Joe Smith' is NOT the same as an originator with the name of 'Joe Smith' and the organizationName of 'NSF'. Also, the positionName should not be used in conjunction with individualName unless only that individual at that position would be considered an originator for the data package. If a positionName is used in conjunction with an organizationName, then that implies that any person who currently occupies said positionName at organizationName is the originator of the data package. E.g., HAST herbarium data manager
electronicMailAddress The electronic mail address is the email address for the party. It is intended to be an Internet SMTP email address, which should consist of a username followed by the @ symbol, followed by the email server domain name address. E.g. jcuadra@gbif.org
deliveryPoint Subfield of the address field that describes the physical or electronic address of the responsible party for a resource. The delivery point field is used for the physical address for postal communication. E.g., GBIF Secretariat, Universitetsparken 15
role Use this field to describe the role the party played with respect to the resource. E.g. technician, reviewer, principal investigator, etc.
phone The phone field describes information about the responsible party's telephone, be it a voice phone, fax. E.g. +4530102040
postalCode Subfield of the address field that describes the physical or electronic address of the responsible party for a resource. The postal code is equivalent to a U.S. zip code, or the number used for routing to an international address. E.g., 52000.
city Subfield of the address field that describes the physical or electronic address of the responsible party for a resource. The city field is used for the city name of the contact associated with a particular resource. E.g. San Diego.
country Subfield of the address field that describes the physical or electronic address of the responsible party for a resource. The country field is used for the name of the contact's country. The country name is most often derived from the ISO 3166 country code list. E.g., Japan.
onlineUrl A link to associated online information, usually a web site. When the party represents an organization, this is the URL to a website or other online information about the organization. If the party is an individual, it might be their personal web site or other related online information about the party. E.g., http://www.yourdomain.edu/~doe.

KeywordSet (General Keywords)

The keywordSet field is a wrapper for the keyword and keywordThesaurus elements, both of which are required together.

Term Definition
keyword A keyword or key phrase that concisely describes the resource or is related to the resource. Each keyword field should contain one and only one keyword (i.e., keywords should not be separated by commas or other delimiters). E.g., biodiversity.
keywordThesaurus The name of the official keyword thesaurus from which keyword was derived. If an official thesaurus name does not exist, please keep a placeholder value such as “N/A” instead of removing this element as it is required together with the keyword element to constitute a keywordSet. E.g., IRIS keyword thesaurus.

Coverage

Describes the extent of the coverage of the resource in terms of its spatial extent, temporal extent, and taxonomic extent.

Taxonomic Coverage

A container for taxonomic information about a resource. It includes a list of species names (or higher level ranks) from one or more classification systems. Please note the taxonomic classifications should not be nested, just listed one after the other.

Term Definition
generalTaxonomicCoverage Taxonomic Coverage is a container for taxonomic information about a resource. It includes a list of species names (or higher level ranks) from one or more classification systems. A description of the range of taxa addressed in the data set or collection. Use a simple comma separated list of taxa. E.g., "All vascular plants were identified to family or species, mosses and lichens were identified as moss or lichen."
taxonomicClassification Information about the range of taxa addressed in the dataset or collection.
taxonRankName The name of the taxonomic rank for which the Taxon rank value is provided. E.g., phylum, class, genus, species.
taxonRankValue The name representing the taxonomic rank of the taxon being described. E.g. Acer would be an example of a genus rank value, and rubrum would be an example of a species rank value, together indicating the common name of red maple. It is recommended to start with Kingdom and include ranks down to the most detailed level possible.
commonName Applicable common names; these common names may be general descriptions of a group of organisms if appropriate. E.g., invertebrates, waterfowl.

Geographic Coverage

A container for spatial information about a resource; allows a bounding box for the overall coverage (in lat long), and also allows description of arbitrary polygons with exclusions.

Term Definition
geographicDescription A short text description of a dataset's geographic areal domain. A text description is especially important to provide a geographic setting when the extent of the dataset cannot be well described by the "boundingCoordinates". E.g., "Manistee River watershed", "extent of 7 1/2 minute quads containing any property belonging to Yellowstone National Park"
westBoundingCoordinate Subfield of boundingCoordinates field covering the W margin of a bounding box. The longitude in decimal degrees of the western-most point of the bounding box that is being described. E.g., -18.25, +25, 45.24755.
eastBoundingCoordinate Subfield of boundingCoordinates field covering the E margin of a bounding box. The longitude in decimal degrees of the eastern-most point of the bounding box that is being described. E.g., -18.25, +25, 45.24755.
northBoundingCoordinate Subfield of boundingCoordinates field covering the N margin of a bounding box. The longitude in decimal degrees of the northern-most point of the bounding box that is being described. E.g., -18.25, +25, 65.24755.
southBoundingCoordinate Subfield of boundingCoordinates field covering the S margin of a bounding box. The longitude in decimal degrees of the southern-most point of the bounding box that is being described. E.g., -118.25, +25, 84.24755.

Temporal Coverage

This container allows coverage to be a single point in time, multiple points in time, or a range of dates.

Term Definition
beginDate Subfield of rangeOfDates field: It may be used multiple times with a endDate field to document multiple date ranges. A single time stamp signifying the beginning of some time period. The calendar date field is used to express a date, giving the year, month, and day. The format should be one that complies with the International Standards Organization's standard 8601. The recommended format for EML is YYYY-MM-DD, where Y is the four digit year, M is the two digit month code (01 - 12, where January = 01), and D is the two digit day of the month (01 - 31). This field can also be used to enter just the year portion of a date. E.g. 2010-09-20
endDate Subfield of rangeOfDates field: It may be used multiple times with a beginDate field to document multiple date ranges. A single time stamp signifying the end of some time period. The calendar date field is used to express a date, giving the year, month, and day. The format should be one that complies with the International Standards Organization's standard 8601. The recommended format for EML is YYYY-MM-DD, where Y is the four digit year, M is the two digit month code (01 - 12, where January = 01), and D is the two digit day of the month (01 - 31). This field can also be used to enter just the year portion of a date. E.g. 2010-09-20.
singleDateTime The SingleDateTime field is intended to describe a single date and time for an event.

Methods

This field documents scientific methods used in the collection of the resource. It includes information on items such as tools, instrument calibration and software.

Term Definition
methodStep The methodStep field allows for repeated sets of elements that document a series of procedures followed to produce a data object. These include text descriptions of the procedures, relevant literature, software, instrumentation, source data and any quality control measures taken.
qualityControl The qualityControl field provides a location for the description of actions taken to either control or assess the quality of data resulting from the associated method step.
sampling Description of sampling procedures including the geographic, temporal and taxonomic coverage of the study.
studyExtent Subfield of the sampling field. The coverage field allows for a textual description of the specific sampling area, the sampling frequency (temporal boundaries, frequency of occurrence), and groups of living organisms sampled (taxonomic coverage). The field studyExtent represents both a specific sampling area and the sampling frequency (temporal boundaries, frequency of occurrence). The geographic studyExtent is usually a surrogate (representative area of) for the larger area documented in the "studyAreaDescription".
samplingDescription Subfield of the sampling field. The samplingDescription field allows for a text-based/human readable description of the sampling procedures used in the research project. The content of this element would be similar to a description of sampling procedures found in the methods section of a journal article.

Intellectual Property Rights

Contain a rights management statement for the resource, or a reference to a service providing such information.

Term Definition
purpose A description of the purpose of this dataset.
intellectualRights A rights management statement for the resource, or reference a service providing such information. Rights information encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. In the case of a data set, rights might include requirements for use, requirements for attribution, or other requirements the owner would like to impose. E.g., Copyright 2001 Regents of the University of California Santa Barbara. Free for use by all individuals provided that the owners are acknowledged in any use or publication.

Additional Metadata + Natural Collections Description Data (NCD) Related

The additionalMetadata field is a container for any other relevant metadata that pertains to the resource being described. This field allows EML to be extensible in that any XML-based metadata can be included in this element. The elements provided here in the GMP include those required for conformance with ISO 19139 and a subset of NCD (Natural Collections Descriptions) elements.

Term Definition
dateStamp The dateTime the metadata document was created or modified. E.g., 2002-10-23T18:13:51.235+01:00
metadataLanguage The language in which the metadata document (as opposed to the resource being described by the metadata) is written. Composed of an ISO639-2/T three-letter language code and an ISO3166-1 three-letter country code. E.g., en_UK
citation The citation for the work itself. See eml
bibliography A list of citations (see below) that form a bibliography on literature related / used in the dataset
resourceLogoUrl URL of the logo associated with a resource. E.g., http://www.gbif.org/logo.jpg
parentCollectionIdentifier Subfield of collection field. Is an optional field. Identifier for the parent collection for this sub-collection. Enables a hierarchy of collections and sub collections to be built.
collectionIdentifier Subfield of collection field. Is an optional field. The URI (LSID or URL) of the collection. In RDF, used as URI of the collection resource.
formationPeriod Text description of the time period during which the collection was assembled. E.g., "Victorian", or "1922 - 1932", or "c. 1750".
livingTimePeriod Time period during which biological material was alive (for palaeontological collections).
specimenPreservationMethod Picklist keyword indicating the process or technique used to prevent physical deterioration of non-living collections. Expected to contain an instance from the Specimen Preservation Method Type Term vocabulary. E.g., formaldehyde.
jgtiCuratorialUnit A quantitative descriptor (number of specimens, samples or batches). The actual quantification could be covered by 1) an exact number of “JGI-units” in the collection plus a measure of uncertainty (+/- x); 2) a range of numbers (x to x), with the lower value representing an exact number, when the higher value is omitted. The discussion concluded that the quantification should encompass all specimens, not only those that have not yet been digitised. This is to avoid having to update the numbers too often. The number of non-public data (not digitised or not accessible) can be calculated from the GBIF numbers as opposed to the JGTI-data.
Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.