GBIF Metadata Profile – How-to Guide
Table of Contents
- Metadata Publishing Solutions
- Publishing metadata using the IPT
- Publishing metadata using the GBIF Metadata Template
- Publishing metadata manually
- Validation of metadata
- What changed in version 1.1 of the GMP since 1.0.2?
- Example files
- Background to the GBIF Metadata Profile
- Metadata Elements
- Dataset (Resource)
- People and Organisations
- KeywordSet (General Keywords)
- Taxonomic Coverage
- Geographic Coverage
- Temporal Coverage
- Intellectual Property Rights
- Additional Metadata + NCD (Natural Collections Description Data) Related
GBIF (2011). GBIF Metadata Profile – How-to Guide (contributed by Ó Tuama, E., Braak, K., Remsen, D.), Copenhagen: Global Biodiversity Information Facility, ISBN: 87-92020-24-0, accessible online at: https://github.com/gbif/ipt/wiki/GMPHowToGuide
Cover Art Credit: John Giez, Maidenhair fern sporophyte, Adiantum sp.
|Version||Description||Date of release||Author(s)|
|1.0||Checked consistencies across relevant documents, updated links to production sites, updated text and pictures to reflect current functionalities.||1 Mar 2011||KB, BK, MR|
|2.0||Transferred to wiki, major edits||19 June 2017||Kyle Braak|
Documenting the provenance and scope of datasets is required in order to publish data through the GBIF network. Dataset documentation is referred to as ‘resource metadata’, which enables users to evaluate the fitness-for-use of a dataset.
There are various ways to write a metadata document conforming to the GBIF Metadata Profile (GMP). This How-To Guide will go through the most common ways, such as using the GBIF Integrated Publishing Toolkit (IPT) metadata editor, the GBIF Resource Metadata template (pending), or generating a metadata document manually. The guide also serves as a reference guide to the GBIF Metadata Profile itself.
If metadata describing a dataset are also being published using Darwin Core Archives (DwC-A), the metadata file will be included in the DwC-A file that bundles it together with the data (based on the Darwin Core terms) that it describes. For help with making the complete DwC-A, refer to the Darwin Core Archive: How-To Guide.
Once the metadata document has been written and validated, it is ready to be published.
Ultimately, the goal in publishing the metadata is that the data resource described therein can be fully documented and registered in the GBIF Registry. In so doing, the data resource becomes globally discoverable.
Metadata Publishing Solutions
If sampling-event data, occurrence data or checklist data are being published using the IPT, there is a built-in metadata authoring functionality that can be used to write an accompanying metadata document conforming to the GMP. It may be convenient to use this tool for authoring metadata even if no data is being published, especially if there is a need to author and manage several metadata documents. On the other hand, if only a few metadata documents are needed, it might be easiest to generate them manually, for example by modifying a sample document. Below is a description of each of these methodologies.
Publishing metadata using the IPT
The IPT contains a built-in metadata editor that allows you to easily fill in resource metadata, validate it, and produce an EML file that is always valid XML. Users are advised to use an existing IPT data hosting centre rather than installing and maintaining their own installation.
The IPT organises metadata entry into separate, logically grouped forms, including:
- Basic Metadata
- Geographic Coverage
- Taxonomic Coverage
- Temporal Coverage
- Other Keywords
- Associated Parties
- Project Data
- Sampling Methods
- Collection Data
- Physical Data
- Additional Metadata
The IPT User Manual goes through each form and its respective fields in some depth. The form provides help dialogs to aid the user in understanding what an element means (Figure 1).
Figure 1. Screenshot of a help dialog for the term “Personnel Identifier”
To ensure suitable data are entered, the fields are validated and informative messages displayed back to the user to assist them in filling out the forms (Figure 2).
Figure 2. Screenshot of the field validation message displayed when an email field is submitted with an irregular email address.
For further reference, a description of each element in the GBIF Metadata Profile can be found below with an accompanying example.
The IPT publishes the metadata document and ensures that it is validated against the GBIF Metadata Profile so the user does not have to worry about validation.
If at any time the metadata are modified, the user only has to update the document and click the “Publish” button on the Manage Resource page to publish a new version of the document (resource) (Figure 3).
Figure 3. Screenshot of the Published Versions section of the Manage Resource page of the IPT.
At any point, the resource manager can choose to make the resource publicly available on the Internet and subsequently even register it with GBIF making it globally discoverable.
Publishing metadata using the GBIF Metadata Template
The GBIF Metadata Template is similar to a manuscript template that makes it easy to author resource metadata. Once data have been entered into the template, a metadata author will have to enter it into the IPT via the metadata editor. The required fields will all be clearly indicated. The IPT metadata editor ensures that all mandatory fields have been filled in and that any fields using controlled vocabularies get entered correctly, e.g. the country field. The IPT also ensures the generated metadata document is valid XML and validates against the GBIF Metadata Profile. Ultimately this two-step process (1. metadata template -> 2. IPT metadata editor) can be used to generate a valid resource metadata document.
Where there is doubt about what a field means, refer to this guide to look up the description of its corresponding element with an accompanying example.
Publishing metadata manually
Below is a simple set of instructions for non-IPT users wishing to generate their own custom EML XML file complying with the latest version of the GBIF Metadata Profile: 1.1. Refer to the following list to ensure it is completed properly:
- Use the schema location for version 1.1 of the GBIF Metadata Profile in the <eml:eml> root element:
<eml:eml ... xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml.xsd" ...>
- Set the packageId attribute inside the <eml:eml> root element. Remember, the packageId should be any globally unique ID fixed for that document. Whenever the document changes, it must be assigned a new packageId. For example: packageId='619a4b95-1a82-4006-be6a-7dbe3c9b33c5/eml-1.xml' for the 1st version of the document, packageId='619a4b95-1a82-4006-be6a-7dbe3c9b33c5/eml-2.xml' for the 2nd version, and so on.
- Fill in all mandatory metadata elements specified by the schema, plus any additional metadata elements desired. When updating an existing EML file using an earlier version of the GBIF Metadata Profile, refer to the section below for a list of what's new in this version.
- Ensure the EML file is valid XML. For assistance, refer to the section Validation of metadata.
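The first two steps of the checklist above can be sketched in Python using only the standard library. This is a minimal illustration rather than an official GBIF tool: the helper names package_id and build_root are hypothetical, and the sketch sets only the packageId and xsi:schemaLocation attributes quoted in this guide. Any other attributes and all mandatory child elements required by the schema still have to be added.

```python
import xml.etree.ElementTree as ET

# Namespace and schema location as quoted in this guide for GMP 1.1.
EML_NS = "eml://ecoinformatics.org/eml-2.1.1"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA_LOCATION = (
    "eml://ecoinformatics.org/eml-2.1.1 "
    "http://rs.gbif.org/schema/eml-gbif-profile/1.1/eml.xsd"
)

# Ask ElementTree to serialise with the conventional prefixes.
ET.register_namespace("eml", EML_NS)
ET.register_namespace("xsi", XSI_NS)


def package_id(document_uuid: str, version: int) -> str:
    """Return a packageId that changes with every document version."""
    if version < 1:
        raise ValueError("version must be 1 or greater")
    return f"{document_uuid}/eml-{version}.xml"


def build_root(document_uuid: str, version: int) -> ET.Element:
    """Create an <eml:eml> root carrying packageId and schemaLocation."""
    root = ET.Element(f"{{{EML_NS}}}eml")
    root.set("packageId", package_id(document_uuid, version))
    root.set(f"{{{XSI_NS}}}schemaLocation", SCHEMA_LOCATION)
    # Placeholder only: the mandatory dataset children are not filled in here.
    ET.SubElement(root, "dataset")
    return root


root = build_root("619a4b95-1a82-4006-be6a-7dbe3c9b33c5", 2)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Deriving the packageId from a fixed UUID plus a version counter keeps the identifier globally unique while guaranteeing that each revision of the document gets a new value.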
Validation of metadata
It is essential that the XML metadata document is valid, both as a well-formed XML document and when validated against the GMP schema. There are several ways to check this. The Oxygen XML Editor is an excellent tool with a built-in validator that can be used for this purpose. Java programmers could also use, for example, EmlValidator.java from the GBIF registry-metadata project.
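As a rough first check before using one of the tools above, well-formedness can be tested with the Python standard library. The function name is_well_formed is hypothetical, and this sketch covers only the first half of the requirement: it does not validate the document against the profile schema, which needs a schema-aware validator.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML; schema rules are NOT checked."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<dataset><title>Example</title></dataset>"))
print(is_well_formed("<dataset><title>Example</dataset>"))
```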
What changed in version 1.1 of the GMP since 1.0.2?
- Support for a machine readable license. Note that instructions on how to provide a machine readable license can be found here.
- Support for multiple contacts, creators, metadataProvider and project personnel
- Support for userIds for any agent (e.g. ORCID)
- Support for providing information about the frequency with which changes are made to the dataset
- Support for providing a project identifier (e.g. to associate datasets under a common project)
- The description can be broken into separate paragraphs versus all lumped into one
- Support for providing information about multiple collections
Background to the GBIF Metadata Profile
Metadata, literally “data about data”, are an essential component of a data management system, describing such aspects as the “who, what, where, when and how” pertaining to a resource. In the GBIF context, resources are datasets, loosely defined as collections of related data, the granularity of which is determined by the data custodian. Metadata can occur at several levels of completeness. In general, metadata should allow a prospective end user of data to:
- Identify/discover its existence,
- Learn how to access or acquire the data,
- Understand its fitness-for-use, and
- Learn how to transfer (obtain a copy of) the data.
The GBIF Metadata Profile (GMP) was developed in order to standardise how resources get described at the dataset level in the GBIF Data Portal. This profile can be transformed to other common metadata formats such as the ISO 19139 metadata profile.
In the GMP there is a minimum set of mandatory elements required for identification, but it is recommended that as many elements be used as possible to ensure the metadata are as descriptive and complete as possible.
The GBIF Metadata Profile is primarily based on the Ecological Metadata Language (EML). The GBIF profile utilises a subset of EML and extends it to include additional requirements that are not accommodated in the EML specification. The following tables provide short descriptions of the profile elements, and where relevant, links to more complete EML descriptions. The elements are categorised as follows:
- Dataset (Resource)
- People and Organisations
- Keyword Set (General Keywords)
- Taxonomic Coverage
- Geographic Coverage
- Temporal Coverage
- Intellectual Property Rights
- Additional Metadata + NCD (Natural Collections Descriptions Data) Related
Dataset (Resource)
The dataset field has elements relating to a single dataset (resource).
|alternateIdentifier||It is a Universally Unique Identifier (UUID) for the EML document and not for the dataset. This term is optional. A list of different identifiers can be supplied. E.g., 619a4b95-1a82-4006-be6a-7dbe3c9b33c5.|
|title||A description of the resource that is being documented that is long enough to differentiate it from other similar resources. Multiple titles may be provided, particularly when trying to express the title in more than one language (use the "xml:lang" attribute to indicate the language if not English/en). E.g. Vernal pool amphibian density data, Isla Vista, 1990-1996.|
|creator||The resource creator is the person or organization responsible for creating the resource itself.|
|metadataProvider||The metadataProvider is the person or organization responsible for providing documentation for the resource. See section “People and Organisations” for more details.|
|associatedParty||An associatedParty is another person or organisation that is associated with the resource. These parties might play various roles in the creation or maintenance of the resource, and these roles should be indicated in the "role" element. See section “People and Organisations” for more details.|
|contact||The contact field contains contact information for this dataset. This is the person or institution to contact with questions about the use or interpretation of the dataset.|
|pubDate||The date that the resource was published. The format should be represented as: CCYY, which represents a 4 digit year, or as CCYY-MM-DD, which denotes the full year, month, and day. Note that month and day are optional components. Formats must conform to ISO 8601. E.g. 2010-09-20.|
|language||The language in which the resource (not the metadata document) is written. This can be a well-known language name, or one of the ISO language codes to be more precise. GBIF recommends using the ISO language code (http://vocabularies.gbif.org/vocabularies/lang). E.g., English.|
|additionalInfo||Information regarding omissions, instructions or other annotations that resource managers may wish to include with a dataset. Basically, any information that is not characterized by the other resource metadata fields.|
|url||The URL of the resource that is available online.|
|abstract||A brief overview of the resource that is being documented.|
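The pubDate format described in the table above (CCYY or CCYY-MM-DD) can be checked with a short Python function. This is a sketch under the assumption that only those two forms are accepted; the name is_valid_pub_date is hypothetical and not part of any GBIF library.

```python
import re
from datetime import date

# Either a four-digit year, or a full year-month-day date.
_PUB_DATE = re.compile(r"(\d{4})(?:-(\d{2})-(\d{2}))?")


def is_valid_pub_date(value: str) -> bool:
    """Accept CCYY or CCYY-MM-DD and reject impossible calendar dates."""
    match = _PUB_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # For a year-only value, 1 January is used just to check the year.
        date(int(year), int(month or "01"), int(day or "01"))
    except ValueError:
        return False
    return True


for candidate in ("2010-09-20", "2010", "2010-02-30", "20/09/2010"):
    print(candidate, is_valid_pub_date(candidate))
```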
The project field contains information on the project in which this dataset was collected. It includes information such as project personnel, funding, study area, project design and related projects.
|title||A descriptive title for the research project. E.g., Species diversity in Tennessee riparian habitats|
|personnel||The personnel field is used to document people involved in a research project by providing contact information and their role in the project.|
|funding||The funding field is used to provide information about funding sources for the project such as: grant and contract numbers; names and addresses of funding sources.|
|studyAreaDescription||The studyAreaDescription field documents the physical area associated with the research project. It can include descriptions of the geographic, temporal, and taxonomic coverage of the research location and descriptions of domains (themes) of interest such as climate, geology, soils or disturbances.|
|designDescription||The field designDescription contains general textual descriptions of research design. It can include detailed accounts of goals, motivations, theory, hypotheses, strategy, statistical design, and actual work. Literature citations may also be used to describe the research design.|
People and Organisations
There are several fields that could represent either a person or an organisation. Below is a list of the various fields used to describe a person or organisation.
|givenName||Subfield of individualName field. The given name field can be used for the first name of the individual associated with the resource, or for any other names that are not intended to be alphabetized (as appropriate). E.g., Jonny|
|surName||Subfield of individualName field. The surname field is used for the last name of the individual associated with the resource. This is typically the family name of an individual, for example, the name by which s/he is referred to in citations. E.g. Carson|
|organizationName||The full name of the organization that is associated with the resource. This field is intended to describe which institution or overall organization is associated with the resource being described. E.g., National Center for Ecological Analysis and Synthesis|
|positionName||This field is intended to be used instead of a particular person or full organization name. If the associated person that holds the role changes frequently, then Position Name would be used for consistency. Note that this field, used in conjunction with 'organizationName' and 'individualName' make up a single logical originator. Because of this, an originator with only the individualName of 'Joe Smith' is NOT the same as an originator with the name of 'Joe Smith' and the organizationName of 'NSF'. Also, the positionName should not be used in conjunction with individualName unless only that individual at that position would be considered an originator for the data package. If a positionName is used in conjunction with an organizationName, then that implies that any person who currently occupies said positionName at organizationName is the originator of the data package. E.g., HAST herbarium data manager|
|electronicMailAddress||The electronic mail address is the email address for the party. It is intended to be an Internet SMTP email address, which should consist of a username followed by the @ symbol, followed by the email server domain name address. E.g. email@example.com|
|deliveryPoint||Subfield of the address field that describes the physical or electronic address of the responsible party for a resource. The delivery point field is used for the physical address for postal communication. E.g., GBIF Secretariat, Universitetsparken 15|
|role||Use this field to describe the role the party played with respect to the resource. E.g. technician, reviewer, principal investigator, etc.|
|phone||The phone field describes information about the responsible party's telephone, be it a voice phone or fax. E.g. +4530102040|
|postalCode||Subfield of the address field that describes the physical or electronic address of the responsible party for a resource. The postal code is equivalent to a U.S. zip code, or the number used for routing to an international address. E.g., 52000.|
|city||Subfield of the address field that describes the physical or electronic address of the responsible party for a resource. The city field is used for the city name of the contact associated with a particular resource. E.g. San Diego.|
|country||Subfield of the address field that describes the physical or electronic address of the responsible party for a resource. The country field is used for the name of the contact's country. The country name is most often derived from the ISO 3166 country code list. E.g., Japan.|
|onlineUrl||A link to associated online information, usually a web site. When the party represents an organization, this is the URL to a website or other online information about the organization. If the party is an individual, it might be their personal web site or other related online information about the party. E.g., http://www.yourdomain.edu/~doe.|
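The nesting described in the table above (givenName and surName inside individualName; city and country inside address) can be sketched as follows. The function build_contact is hypothetical, the values are the examples from the table, and the element order is an assumption based on common EML usage. Check it against the schema before relying on it.

```python
import xml.etree.ElementTree as ET


def build_contact(given_name, sur_name, organization_name, city, country, email):
    """Build a <contact> element with the sub-elements described in this guide."""
    contact = ET.Element("contact")
    # givenName and surName are subfields of individualName.
    individual = ET.SubElement(contact, "individualName")
    ET.SubElement(individual, "givenName").text = given_name
    ET.SubElement(individual, "surName").text = sur_name
    ET.SubElement(contact, "organizationName").text = organization_name
    # city and country are subfields of address.
    address = ET.SubElement(contact, "address")
    ET.SubElement(address, "city").text = city
    ET.SubElement(address, "country").text = country
    ET.SubElement(contact, "electronicMailAddress").text = email
    return contact


contact = build_contact(
    "Jonny",
    "Carson",
    "National Center for Ecological Analysis and Synthesis",
    "San Diego",
    "Japan",
    "email@example.com",
)
print(ET.tostring(contact, encoding="unicode"))
```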
KeywordSet (General Keywords)
The keywordSet field is a wrapper for the keyword and keywordThesaurus elements, both of which are required together.
|keyword||A keyword or key phrase that concisely describes the resource or is related to the resource. Each keyword field should contain one and only one keyword (i.e., keywords should not be separated by commas or other delimiters). E.g., biodiversity.|
|keywordThesaurus||The name of the official keyword thesaurus from which keyword was derived. If an official thesaurus name does not exist, please keep a placeholder value such as “N/A” instead of removing this element as it is required together with the keyword element to constitute a keywordSet. E.g., IRIS keyword thesaurus.|
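The pairing rule above (one keyword per keyword element, always accompanied by a keywordThesaurus, with a placeholder when no thesaurus exists) can be illustrated with a small helper. The name build_keyword_set is hypothetical, and rejecting commas is a simple heuristic chosen for this sketch, not a rule stated by the schema.

```python
import xml.etree.ElementTree as ET


def build_keyword_set(keywords, thesaurus=None):
    """Build a <keywordSet> with one <keyword> per entry and a thesaurus."""
    keyword_set = ET.Element("keywordSet")
    for keyword in keywords:
        # Heuristic: a comma suggests several keywords were packed into one field.
        if "," in keyword:
            raise ValueError(f"put each keyword in its own element: {keyword!r}")
        ET.SubElement(keyword_set, "keyword").text = keyword.strip()
    # keywordThesaurus is required together with keyword, so keep a placeholder.
    ET.SubElement(keyword_set, "keywordThesaurus").text = thesaurus or "N/A"
    return keyword_set


keyword_set = build_keyword_set(["biodiversity", "amphibians"])
print(ET.tostring(keyword_set, encoding="unicode"))
```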
Describes the extent of the coverage of the resource in terms of its spatial extent, temporal extent, and taxonomic extent.
Taxonomic Coverage
A container for taxonomic information about a resource. It includes a list of species names (or higher level ranks) from one or more classification systems. Please note the taxonomic classifications should not be nested, just listed one after the other.
|generalTaxonomicCoverage||Taxonomic Coverage is a container for taxonomic information about a resource. It includes a list of species names (or higher level ranks) from one or more classification systems. A description of the range of taxa addressed in the data set or collection. Use a simple comma separated list of taxa. E.g., "All vascular plants were identified to family or species, mosses and lichens were identified as moss or lichen."|
|taxonomicClassification||Information about the range of taxa addressed in the dataset or collection.|
|taxonRankName||The name of the taxonomic rank for which the Taxon rank value is provided. E.g., phylum, class, genus, species.|
|taxonRankValue||The name representing the taxonomic rank of the taxon being described. E.g. Acer would be an example of a genus rank value, and rubrum would be an example of a species rank value, together indicating Acer rubrum, commonly known as red maple. It is recommended to start with Kingdom and include ranks down to the most detailed level possible.|
|commonName||Applicable common names; these common names may be general descriptions of a group of organisms if appropriate. E.g., invertebrates, waterfowl.|
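The instruction that taxonomic classifications should be listed one after the other rather than nested can be sketched like this. The helper build_taxonomic_coverage is hypothetical and the enclosing element name taxonomicCoverage is assumed from EML usage. The taxa are the examples from the table above.

```python
import xml.etree.ElementTree as ET


def build_taxonomic_coverage(description, taxa):
    """Build a flat list of taxonomicClassification elements.

    taxa is an iterable of (rank name, rank value, common name or None).
    """
    coverage = ET.Element("taxonomicCoverage")
    ET.SubElement(coverage, "generalTaxonomicCoverage").text = description
    for rank_name, rank_value, common_name in taxa:
        # Each classification is a direct child of the coverage: no nesting.
        classification = ET.SubElement(coverage, "taxonomicClassification")
        ET.SubElement(classification, "taxonRankName").text = rank_name
        ET.SubElement(classification, "taxonRankValue").text = rank_value
        if common_name:
            ET.SubElement(classification, "commonName").text = common_name
    return coverage


coverage = build_taxonomic_coverage(
    "All vascular plants were identified to family or species.",
    [("genus", "Acer", None), ("species", "rubrum", "red maple")],
)
print(ET.tostring(coverage, encoding="unicode"))
```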
Geographic Coverage
A container for spatial information about a resource; allows a bounding box for the overall coverage (in latitude and longitude), and also allows description of arbitrary polygons with exclusions.
|geographicDescription||A short text description of a dataset's geographic areal domain. A text description is especially important to provide a geographic setting when the extent of the dataset cannot be well described by the "boundingCoordinates". E.g., "Manistee River watershed", "extent of 7 1/2 minute quads containing any property belonging to Yellowstone National Park"|
|westBoundingCoordinate||Subfield of boundingCoordinates field covering the W margin of a bounding box. The longitude in decimal degrees of the western-most point of the bounding box that is being described. E.g., -18.25, +25, 45.24755.|
|eastBoundingCoordinate||Subfield of boundingCoordinates field covering the E margin of a bounding box. The longitude in decimal degrees of the eastern-most point of the bounding box that is being described. E.g., -18.25, +25, 45.24755.|
|northBoundingCoordinate||Subfield of boundingCoordinates field covering the N margin of a bounding box. The latitude in decimal degrees of the northern-most point of the bounding box that is being described. E.g., -18.25, +25, 65.24755.|
|southBoundingCoordinate||Subfield of boundingCoordinates field covering the S margin of a bounding box. The latitude in decimal degrees of the southern-most point of the bounding box that is being described. E.g., -18.25, +25, 84.24755.|
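A simple sanity check for the four bounding coordinates might look like the following. It assumes that the west and east values are longitudes in the range -180 to 180 and that the north and south values are latitudes in the range -90 to 90, all in decimal degrees. The class BoundingBox is a hypothetical helper, and the check deliberately does not require west to be smaller than east, since a box may cross the antimeridian.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    west: float   # longitude of the western margin
    east: float   # longitude of the eastern margin
    north: float  # latitude of the northern margin
    south: float  # latitude of the southern margin

    def problems(self) -> List[str]:
        """Return a list of human-readable problems; empty if none found."""
        found = []
        for name, value in (("west", self.west), ("east", self.east)):
            if not -180.0 <= value <= 180.0:
                found.append(f"{name} longitude {value} is outside -180..180")
        for name, value in (("north", self.north), ("south", self.south)):
            if not -90.0 <= value <= 90.0:
                found.append(f"{name} latitude {value} is outside -90..90")
        if self.north < self.south:
            found.append("north latitude is below south latitude")
        return found


box = BoundingBox(west=-18.25, east=25.0, north=65.24755, south=45.24755)
print(box.problems())
```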
Temporal Coverage
This container allows coverage to be a single point in time, multiple points in time, or a range of dates.
|beginDate||Subfield of rangeOfDates field: It may be used multiple times with an endDate field to document multiple date ranges. A single time stamp signifying the beginning of some time period. The calendar date field is used to express a date, giving the year, month, and day. The format should comply with the ISO 8601 standard. The recommended format for EML is YYYY-MM-DD, where Y is the four digit year, M is the two digit month code (01 - 12, where January = 01), and D is the two digit day of the month (01 - 31). This field can also be used to enter just the year portion of a date. E.g. 2010-09-20.|
|endDate||Subfield of rangeOfDates field: It may be used multiple times with a beginDate field to document multiple date ranges. A single time stamp signifying the end of some time period. The calendar date field is used to express a date, giving the year, month, and day. The format should comply with the ISO 8601 standard. The recommended format for EML is YYYY-MM-DD, where Y is the four digit year, M is the two digit month code (01 - 12, where January = 01), and D is the two digit day of the month (01 - 31). This field can also be used to enter just the year portion of a date. E.g. 2010-09-20.|
|singleDateTime||The SingleDateTime field is intended to describe a single date and time for an event.|
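A date range using the recommended YYYY-MM-DD format can be sketched as below. The function build_date_range is hypothetical, and the wrapping elements temporalCoverage and calendarDate are assumed from EML usage rather than taken from the table above. For simplicity this sketch accepts full dates only, although the fields may also hold just a year.

```python
import xml.etree.ElementTree as ET
from datetime import date


def build_date_range(begin: str, end: str) -> ET.Element:
    """Build a rangeOfDates block from two ISO 8601 calendar dates."""
    # date.fromisoformat raises ValueError for malformed or impossible dates.
    begin_date = date.fromisoformat(begin)
    end_date = date.fromisoformat(end)
    if end_date < begin_date:
        raise ValueError("endDate must not precede beginDate")
    coverage = ET.Element("temporalCoverage")
    date_range = ET.SubElement(coverage, "rangeOfDates")
    for tag, value in (("beginDate", begin), ("endDate", end)):
        # Each boundary wraps its value in a calendarDate element.
        ET.SubElement(ET.SubElement(date_range, tag), "calendarDate").text = value
    return coverage


temporal = build_date_range("2010-09-20", "2011-01-31")
print(ET.tostring(temporal, encoding="unicode"))
```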
This field documents scientific methods used in the collection of the resource. It includes information on items such as tools, instrument calibration and software.
|methodStep||The methodStep field allows for repeated sets of elements that document a series of procedures followed to produce a data object. These include text descriptions of the procedures, relevant literature, software, instrumentation, source data and any quality control measures taken.|
|qualityControl||The qualityControl field provides a location for the description of actions taken to either control or assess the quality of data resulting from the associated method step.|
|sampling||Description of sampling procedures including the geographic, temporal and taxonomic coverage of the study.|
|studyExtent||Subfield of the sampling field. The coverage field allows for a textual description of the specific sampling area, the sampling frequency (temporal boundaries, frequency of occurrence), and groups of living organisms sampled (taxonomic coverage). The field studyExtent represents both a specific sampling area and the sampling frequency (temporal boundaries, frequency of occurrence). The geographic studyExtent is usually a surrogate (representative area of) for the larger area documented in the "studyAreaDescription".|
|samplingDescription||Subfield of the sampling field. The samplingDescription field allows for a text-based/human readable description of the sampling procedures used in the research project. The content of this element would be similar to a description of sampling procedures found in the methods section of a journal article.|
Intellectual Property Rights
Contains a rights management statement for the resource, or a reference to a service providing such information.
|purpose||A description of the purpose of this dataset.|
|intellectualRights||A rights management statement for the resource, or a reference to a service providing such information. Rights information encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. In the case of a data set, rights might include requirements for use, requirements for attribution, or other requirements the owner would like to impose. E.g., Copyright 2001 Regents of the University of California Santa Barbara. Free for use by all individuals provided that the owners are acknowledged in any use or publication.|
Additional Metadata + NCD (Natural Collections Descriptions Data) Related
The additionalMetadata field is a container for any other relevant metadata that pertains to the resource being described. This field allows EML to be extensible in that any XML-based metadata can be included in this element. The elements provided here in the GMP include those required for conformance with ISO 19139 and a subset of NCD (Natural Collections Descriptions) elements.
|dateStamp||The dateTime the metadata document was created or modified. E.g., 2002-10-23T18:13:51.235+01:00|
|metadataLanguage||The language in which the metadata document (as opposed to the resource being described by the metadata) is written. Composed of an ISO639-2/T three-letter language code and an ISO3166-1 three-letter country code. E.g., en_UK|
|citation||The citation for the work itself. See eml|
|bibliography||A list of citations (see below) that form a bibliography of literature related to or used in the dataset.|
|resourceLogoUrl||URL of the logo associated with a resource. E.g., http://www.gbif.org/logo.jpg|
|parentCollectionIdentifier||Subfield of collection field. This field is optional. Identifier for the parent collection for this sub-collection. Enables a hierarchy of collections and sub-collections to be built.|
|collectionIdentifier||Subfield of collection field. This field is optional. The URI (LSID or URL) of the collection. In RDF, used as URI of the collection resource.|
|formationPeriod||Text description of the time period during which the collection was assembled. E.g., "Victorian", or "1922 - 1932", or "c. 1750".|
|livingTimePeriod||Time period during which biological material was alive (for palaeontological collections).|
|specimenPreservationMethod||Picklist keyword indicating the process or technique used to prevent physical deterioration of non-living collections. Expected to contain an instance from the Specimen Preservation Method Type Term vocabulary. E.g., formaldehyde.|
|jgtiCuratorialUnit||A quantitative descriptor (number of specimens, samples or batches). The actual quantification could be covered by 1) an exact number of “JGTI-units” in the collection plus a measure of uncertainty (+/- x); 2) a range of numbers (x to x), with the lower value representing an exact number when the higher value is omitted. The discussion concluded that the quantification should encompass all specimens, not only those that have not yet been digitised. This is to avoid having to update the numbers too often. The number of non-public data (not digitised or not accessible) can be calculated from the GBIF numbers as opposed to the JGTI-data.|