BestPracticesSamplingEventData

Andrea Hahn edited this page Aug 16, 2018 · 37 revisions

Best Practices in Publishing Sampling-event data

Version 3.0

Document control

Version Description Date of release Author(s)
1.0 Release Version 2015 Éamonn Ó Tuama
2.0 Transferred to wiki, major changes May 2017 Kyle Braak
3.0 Minor changes, clean up Aug 2018 Marie Grosjean

Table of content

  1. Suggested citation
  2. Introduction
  3. What is sampling-event data?
    1. How to express sampling-event data in DwC-A?
    2. Sampling-event Metadata
    3. Sampling-event Data
    4. Data file formatting recommendations
    5. Sample size
    6. Quantity and abundance
    7. How to uniquely identify sampling events
    8. How to capture hierarchy of events
    9. How to capture absence data
    10. How to include supplementary multimedia
  4. Publishing sampling-event data
    1. Using GUIDs for identifiers
    2. Protect location of sensitive species
    3. Preserving verbatim data
    4. Publishing project data as a single dataset
    5. Republishing occurrence data as sampling-event data
    6. Modeling continuous monitoring of live individuals
  5. Continuous data quality improvement
    1. Managing issues related to the dataset
    2. Sharing scripts and programs used to produce or clean the dataset
  6. Describing sampling-event data in dataset metadata
    1. Linking related datasets
    2. Listing related research
  7. Examples

Suggested citation

GBIF (2017). Best Practices in Publishing Sampling-event data. Accessible online at https://github.com/gbif/ipt/wiki/BestPracticesSamplingEventData

Introduction

This guide provides details on how to utilize the Darwin Core Archive (DwC-A) format as a means to share sampling-event information in a standard way. It focuses on specific components of the Darwin Core Archive format, and some of the supporting extensions to the core event data class, and provides recommendations on how to best utilize these components to maximize the value of the shared data. This guide does not provide a detailed overview of the Darwin Core Archive format, instead please refer to the Darwin Core Archives How-to Guide.

The DwC-A format and the specific profile described here represent an internationally recognized and ratified data exchange format for sharing sampling-event data. All data exchange standards must strike a balance between the technical scope and capacity on one hand, and social acceptance and uptake on the other. Simple solutions sacrifice coverage and complexity in favor of ease-of-use. Highly complex formats provide more complete solutions for representing any type of data but at the expense of simplicity and require supporting software and expertise. The Darwin Core Archive format represents an intermediate position between the two ends of this spectrum. It focuses on the key elements of sampling-event data and enables an enriched set of data types to be linked to this core structure. The data contained in an archive can be readily understood and used by many ecologists and data managers familiar with basic structured text files. With this international standard, GBIF hopes to facilitate sampling-event data sharing and promote common approaches to cite and recognize the work of the community creating and handling sampling-event data. A standard format also increases relevance and utility.

What is sampling-event data?

Sampling-event data is a type of data available from thousands of environmental, ecological, and natural resource investigations. These can be one-off studies or monitoring programs. Such data are usually quantitative, calibrated, and follow certain protocols so that changes and trends of populations can be detected. This is in contrast to opportunistic observation and collection data, which today form a significant proportion of openly accessible biodiversity data.

How to express sampling-event data in DwC-A?

Darwin Core Archive (DwC-A) is an informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset for checklist data. The collection of files in an archive form a self-contained dataset, which can be provided as a single compressed (Zip or GZIP) file. A dataset is composed of a descriptive metadata document and a set of one or more data files. For more information about DwC-A refer to the Darwin Core Archives How-to Guide.

Sampling-event Metadata

Documenting the provenance and scope of datasets is required in order to publish sampling-event data through the GBIF network. Dataset documentation is referred to as ‘resource metadata’ and enables users to evaluate the fitness-for-use of a dataset. It may describe the sampling methodologies used for its compilation, and the individuals and organizations involved in its creation and management. Metadata is shared in a Darwin Core Archive as an XML document. GBIF provides a metadata profile for sampling-event datasets based on the Ecological Metadata Language. A How-to guide describes all the options for describing a sampling-event dataset using this format. See http://links.gbif.org/gbif_metadata_profile_how-to_en_v1

Sampling-event Data

The Darwin Core Archive format provides the structural framework for publishing sampling-event data. Darwin Core Archives consist of a series of one or more text files, in standard comma- or tab-delimited format. The files are logically arranged in a star-like manner with one core file, listing the sampling events (sampling protocol, sample size, location, etc.) surrounded by a number of ‘extensions’, that describe related data types (such as species occurrences, measurements or facts related to the sampling event, etc). Links between core and extension records are made using an event identifier (eventID) data element. In this way, many extension records can exist for each single core event record. This “star-schema” provides a simple relational data model that supports many types of annotations that are common to sampling-event datasets.

Figure 1. Darwin Core Archive data files in 'star schema'

An alternative way to encode sampling-event data is listing the species occurrence in the core file, surrounded by a number of extensions, that describe the related data types (such as measurements relating to the species occurrences, etc). Note listing sampling events in the core file is preferable if a plot or site is the main focus of the study.

Be aware that the current DwC-A star schema such as does have limitations. For example, it does not allow to link measurements and facts to both events and occurrences in the same dataset. Discussion on this prototype extension are taking place on GitHub but there is no solution available yet.

Data file formatting recommendations

For ease in understanding, we may use the terms field in this guide to refer to the Darwin Core set of terms in the sampling-event publishing profile to which a users data will be mapped. For example, we will refer to the use of the dwc:scientificName field when referring to the Darwin Core term, scientificName.

  • It is recommended to use TAB or Comma-Separated-Values instead of custom field delimiters and quotes.
  • Be careful and consistent with quotation.
  • Encode text files as UTF-8
  • Make sure you replace all line breaks in a data field, i.e. \r \n or \r\n with either simple spaces or use 2 characters like “$$” to replace "\r" to escape the line break if the intention is to preserve them. Another option is to replace line breaks with the html <br> tag.
  • Encode NULLs as empty strings, i.e. no characters between 2 delimiters, or \N or \NULL, but no other text sequence!

Sample size

The following Darwin Core fields store the sample size of a sampling event:

  • sampleSizeValue: a numeric value for a measurement of the size (time duration, length, area, or volume) of a sample in a sampling event.
  • sampleSizeUnit: the unit of measurement of the size (time duration, length, area, or volume) of a sample in a sampling event.

The value of sampleSizeValue is a number and must have a corresponding sampleSizeUnit. The value of sampleSizeUnit should be restricted to use only SI units/derived units or other non-SI units accepted for use within the SI (e.g. minute, hour, day, litre) as per the Unit of Measurement Vocabulary. Examples are given in Table 1 below.

You may represent a sampling area by using an appropriate WKT shape or latitude/longitude point location. Done correctly, the direction sampling was carried out can also be derived. For example, an ocean trawl line represented using a WKT shape LINESTRING allows the direction of the trawl to be determined based on the standard notation for writing the start and end points.

Table 1. sampleSizeValue and sampleSizeUnit must be used together, e.g., 3 square meters, or 1 liter.

sampleSizeValue sampleSizeUnit
2 hour
3 m2
17 km
1 litre

Quantity and abundance

The following Darwin Core fields are also required to be used as a pair:

  • organismQuantity: a numeric or enumeration value for the quantity of organisms.
  • organismQuantityType: the type of quantification system used for the quantity of organisms.

Table 2 lists some example values. The value of organismQuantity is a number or enumeration, e.g., “27” for an organismQuantityType “individuals”, “12.5” for an organismQuantityType “%biomass”, or “r” for an organismQuantityType “BraunBlanquetScale”. The value of organismQuantityType (i.e., the entity being measured) is expected to be drawn from a small controlled vocabulary with terms such as “Individuals”, “%Biomass”, “%Biovolume”, “%Species”, “%Coverage”, “BraunBlanquetScale”, “DominScale”. Examples when combined with organismQuantity values: + on DominScale; 5 on BraunBlanquetScale; 45 for %Biomass.

Table 2. organismQuantity and organismQuantityType must be used together, e.g., a count of 14 individuals, or a code value “r” on the Braun Blanquet scale.

organismQuantity organismQuantityType
14 individuals
r BraunBlanquetScale
0.4 %Species
31 %Biomass

How to uniquely identify sampling events

Each event is uniquely identified using dwc:eventID and occasionally dwc:parentEventID. Although the type and format of identifier is arbitrary we recommend that publishers choose persistent globally unique identifier. In the absence of a GUID, publishers may reuse the original fieldNumber.

Make sure to reuse existing stable identifiers and do not create a new identifier for an event when one is already declared.

How to capture hierarchy of events

Sampling events can be related to each other (e.g., nested samples) via a common parent identifier. For example, several sub-sampling events within a Whittaker Plot[7] each with their own eventID (e.g., “A1:1”, “A1:2”) would share a common parentEventID (e.g., “A1”) thus enabling them to be linked together easily (see Table 4 and Figure 3).

Further information on the nature of the relationship (e.g. part of a monitoring series) can be described in the project section of the accompanying metadata.

You may also refer to the following FAQ.

How to capture absence data

Refer to the following FAQ.

How to include supplementary multimedia

You may choose to include supplementary media in order to make the data easier to interpret. For example for vegetation data, it is helpful to include a link to the original scanned releve sheet when interpreting the data.

The files associated have to be hosted on an external server and linked to the occurrence through the dwc:associatedMedia and dwc:associatedReferences. These files may be images, texts or a combination of both as long as the format type is specified. JPG, PNG, etc. images will be visible as thumbnails while PDFs will appear as clickable links.

Publishing sampling-event data

Using GUIDs for identifiers

A number of fields require the usage of unique identifiers: dwc:occurrenceID, dwc:eventID, dwc:organismID and dwc:locationID.

As mentioned previously, although no particular format is enforced, we advise publishers to use Global Unique IDtenifiers (GUIDs). There are a few online services which can provide such identifiers. For example, it is possible to use http://www.geonames.org/ to find (or even generate new) identifiers for dwc:locationID, e.g. http://sws.geonames.org/10793757/ is a GUID for a lake in Greenland.

Protect location of sensitive species

If your dataset contains sensitive species, there are several ways to handle it:

  • Simply removing these species from the dataset,
  • Publishing the species identifications at Genus level only,
  • Publishing the sensitive/protected species in a separate dataset,
  • Publish obfuscated sensitive data points in the main dataset and publish non-obfuscated details in an access-limited separate dataset, both datasets including all data records.

Preserving verbatim data

Although verbatim data and descriptions are not visible on the GBIF.org web interface, they are made available to the community through downloads. When entering verbatim description, make sure to link them to the original event or occurrence. For example, the ID or code given to the original event should be entered into dwc:fieldNumber; the ID or code given to the original occurrence observation should be entered into dwc:recordNumber.

Publishing project data as a single dataset

Data produced from a large sampling project should be published as a single dataset if possible. If you must publish multiple datasets, we encourage linking them using a common project identifier in the metadata.

Republishing occurrence data as sampling-event data

Sampling events provide better documentation and benefit both the scientific community and policy makers (read more). We encourage strongly the republishing of occurrence data as sampling-event data when possible.

In order to do so, you should create a new sampling-event dataset and send an email to GBIF's helpdesk (helpdesk@gbif.org). In this email, you should provide the UUIDs of both the occurrence dataset and the new dataset. We will then be able to link the first dataset to the newest one before de-indexing it thereby avoiding occurrence duplications and preserving citations.

Modeling continuous monitoring of live individuals

If your dataset contains continuous monitoring of live individuals, such as bird tracking data, you can use dwc:organismID to store the ID of the individual being tracked. In addition to that, you should represent each individual being tracked as a single event.

Continuous data quality improvement

Managing issues related to the dataset

If possible, we recommend tracking all the issues related to a given dataset using an issue management system such as the one provided by Github.

Sharing scripts and programs used to produce or clean the dataset

Ideally, custom scripts and programs used to transform data should be made publicly available on GitHub. Other publishers will benefit from accessing these scripts along with detailed instructions on how to use them.

Describing sampling-event data in dataset metadata

Publishers should document their dataset as much as possible with a particular emphasis on sampling methodologies.

Besides the mandatory requirements, the metadata should include information about the extent of study, the sampling methods, the quality control and limitations of the study. Although information about fieldwork can part of the data content, you may describe the sampling location and conditions in the metadata as well.

Linking related datasets

Some datasets may come from the same research project or may be related in certain context. The current recommendation is to link them using a project identifier.

Listing related research

You can make your dataset easier to interpret by including links to related published works such as journal articles, project notes, thesis, etc,. in the bibliographic citation or in the external link part of the metadata.

Examples

Following are some examples of typical sampling-event data sets. In each case, the key fields in the Event core and Occurrence extension are provided. For some examples, additional extensions such as Relevé and measurement-or-fact are also included.

Freshwater invertebrate survey

Core (Event) table

EventID samplingProtocol sampleSizeValue sampleSizeUnit eventDate location decimalLatitude decimalLongitude
C_1428 AQEM 1.25 m^2 21/06/2006 Kinzig O3 Rothenbergen 50.18689 9.100369
B_1538 AQEM 1.25 m^2 11/06/2008 Kinzig W3 Bulau 50.1316 8.9657

Extension (Occurrence) table

EventID scientificName organismQuantity organismQuantityType ...
C_1428 Baetis rhodani 14 individuals
C_1428 Ephemera danica 15 individuals
C_1428 Gyraulus albus 2 individuals
B_1538 Serratella ignita 318 individuals

Explanation

Ephemera danica : A total of 14 individuals from 1.25 square meters were obtained in this sampling event. The derived individuals per sq meter count is 11.2 (14/1.25).



Brackish water invertebrates survey

Core (Event) table

EventID samplingProtocol sampleSizeValue sampleSizeUnit startDayOfYear endDayOfYear year location decimalLatitude decimalLongitude ...
IA1 hand operated van Veen grab 0.04 m^2 147 154 1995 Gialova lagoon 36.9564 21.6661
IA3 hand operated van Veen grab 0.04 m^2 147 154 1995 Gialova lagoon 36.9564 21.6661

Extension (Occurrence) table

EventID scientificName organismQuantity organismQuantityType ...
IA1 Abra ovata 57 individuals
IA3 Bittium reticulatum 113 individuals

Extension (Measuremenr-or-Fact) table

EventID measurementType measurementValue measurementUnit measurementRemarks ...
IA1 Tmp (sed) 21.5 Degree C temperature at the bottom surface --
IA1 Rdx (sed)0 170 mv Eh value at the bottom surface (0cm) --

Explanation

Abra ovata : A total of 57 individuals from 0.04 square meters were obtained in sampling event IA1.

Each event can also have measurements or facts associated with it, e.g., environmental measurements like sediment temperature and redox potential (Eh).

Macrophyte survey

Note that this example is based on Dutch Vegetation Database (LVD) previous version republished as sampling-event dataset. The Releve extension underwent significant changes following the publication of the primer. For more information about LVD and the data model for vegetation sampling-event data see: http://gbif.blogspot.dk/2016/07/probably-turbovegs-best-kept-secret.html

EventID samplingProtocol sampleSizeValue sampleSizeUnit eventDate location decimalLatitude decimalLongitude ...
1001 Braun Blanquet 100 m^2 09/08/2012 Kinzig O3 Rothenbergen 50.18689 9.100369

Extension (Occurrence) table

EventID scientificName organismQuantity organismQuantityType ...
1001 Acer psuedoplatanus r BraunBlanquetScale

Extension (Relevé) table

EventID syntaxonCode inclination coverTotal treesCover coverShrubs highTreeLayerHeight highHerbLayerMeanHeight mossesIdentified ...
1001 843200 40 100 95 50 25 40 Y --

Explanation

** Acer psuedoplatanus** : In the 100 sq meters surveyed, the abundance of the species was reported as “r” on the Braun Blanquet scale.

Additional vegetation plot measurements such as vegetation community type (syntaxon) % coverage values that are typical of TurboVeg type databases are captured in a Relevé (vegetation-plot) extension.

Lepidoptera survey I

Core (Event) table

EventID samplingProtocol sampleSizeValue sampleSizeUnit startDayOfYear endDayOfYear year location decimalLatitude decimalLongitude ...
2320 Jalas-model light trap with 160W ML matt lamp 16 day 164 180 1999 Kungsmarken 55.72 13.28 ...

Extension (Occurrence) table

EventID scientificName organismQuantity organismQuantityType ...
2320 Opisthograptis luteolata 11 individuals

Explanation

Opisthograptis luteolata : 11 individuals were observed over the sampling period of 16 days. The derived number of individuals per day is 0.68 (11/16).

Lepidoptera survey II

Core (Event) table

EventID samplingProtocol sampleSizeValue sampleSizeUnit eventDate location decimalLatitude decimalLongitude ...
1014-tr023m Pollard walks 250 m^2 2012-10-11 Ramat Hanadiv botanik garden 32.553191 34.947492
1012-tr006-s5 Pollard walks 250 m^2 2012-05-02 Carmel Hurshan haarbaim 32.75789805 35.02697333

Extension (Occurrence) table

EventID scientificName organismQuantity organismQuantityType ...
1014-tr023m Pieris rapae 1 individuals
1014-tr023-s5 Maniola telmessia 2 individuals

Extension (Measuremenr-or-Fact) table

EventID measurementType measurementValue measurementUnit measurementRemarks ...
1014-tr023m Temp 20 Degree C
1014-tr023m Wind speed light
1014-tr023m Cloudiness 0 Level 1 of 8
1014-tr023m AvgAltitude 10 m Average altitude

Explanation

Pieras rapae : A total of 1 individual from 250 sq metres was obtained in this sampling event.
Several environmental measurements (e.g., temperature, wind speed, cloudiness) are included in a measurement-or-facts extension.

Reef fish survey

Core (Event) table

EventID samplingProtocol sampleSizeValue sampleSizeUnit eventDate location decimalLatitude decimalLongitude ...
506003329 Reef Life Survey methods 500 m^2 2006-09-02 Cocos Islands 5.56187 -87.04693
57003326 Reef Life Survey methods 500 m^2 2006-12-11 Panama Bight 4.008553 -81.605377

Extension (Occurrence) table

EventID scientificName organismQuantity organismQuantityType ...
506003329 Acanthurus nigricans 42 individuals
506003329 Acanthurus xanthopterus 1 individuals
506003329 Aulostomus chinensis 4 individuals
506003329 Axoclinus cocoensis 1 individuals

Explanation

Aulostomus chinensis : A total of 4 individuals from 500 sq metres were obtained in this sampling event.

Nested samples

Table 4. Several sub-plots may be related to the parentEventID as in this example of a Whittaker plot consisting of 13 sub-plots (see Figure 3 for plot layout).

EventID parentEventID samplingProtocol sampleSizeValue sampleSizeUnit eventDate location decimalLatitude decimalLongitude ...
A1 Modified Whittaker Plot 1000 m^2 18/03/84 Monarch 55.72 13.28
A1.1 A1 100 m^2
A1.2 A1 10 m^2
A1.3 A1 10 m^2
A1.4 A1 1 m^2
A1.5 A1 1 m^2
A1.6 A1 1 m^2
A1.7 A1 1 m^2
A1.8 A1 1 m^2
A1.9 A1 1 m^2
A1.10 A1 1 m^2
A1.11 A1 1 m^2
A1.12 A1 1 m^2
A1.13 A1 1 m^2


Figure 3.A Schematic of a Whittaker plot consisting of 13 sub-plots of varying area.

Additional information that could also be included or was previously included

The Event core elements are mainly drawn from the DwC classes Event, Location and Geological Context (Table 3). The Occurrence extension elements are drawn from the Occurrence, Taxon and Identification classes. For reasons of consistency, the Occurrence extension includes all terms found in the Occurrence core. Thus Event, Location and Geological Context terms are also listed for the Occurrence extension but are actually redundant. Note the IPT hides redundant terms by default when mapping for the user’s convenience.

Table 3. Placement of the sample related terms in the Event core and Occurrence extension.

Event Core eventID, parentEventID, samplingProtocol, sampleSizeValue, sampleSizeUnit, samplingTaxaRange*, siteTreatment*, siteID*, layer*
Occurrence Extension eventID, organismQuantity, organismQuantityType, siteID+, layer+

Planned additions and notes for revision:

https://github.com/gbif/ipt/wiki/Best-Practices-in-Publishing-Sampling-event-data---planned-additions

The + symbol indicates proposed new terms not yet ratified.
Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.