
EMBL adapter

Contains adapters for connecting EMBL content to GBIF.

This repository contains an EMBL API crawler that produces data later used to build DwC-A files suitable for ingestion into GBIF.

The results are the four EMBL-EBI datasets.

Expected use of the EMBL API by the crawler is described in this working document.

The adapter is configured to run once a week at a specific time. See the properties startTime and frequencyInDays in the gbif-configuration project.
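
For illustration, the scheduling properties might look like this (the exact format is an assumption; the authoritative values are in the private gbif-configuration project):

```
# Illustrative values only; the real ones live in gbif-configuration.
# Time at which the run starts (format assumed)
startTime=21:00
# Run the adapter once a week
frequencyInDays=7
```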

Basic steps of the adapter:

  1. Request data from the ENA portal API: two requests for each dataset, plus one optional taxonomy request
  2. Store the raw data in the database
  3. Process the data and store the processed data in the database (this performs the backend deduplication)
  4. Clean up temporary files

Requests

We get data from https://www.ebi.ac.uk/ena/portal/api. See the API documentation provided by EBI.

The requests requestUrl1 (sequence) and requestUrl2 (wgs_set) can be seen in the gbif-configuration project.
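
For orientation, a portal API search request has this general shape (illustrative only; the adapter's actual request URLs live in gbif-configuration):

```
https://www.ebi.ac.uk/ena/portal/api/search?result=<result>&query=<query>&fields=<field list>&format=tsv
```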

Sequence requests

Request with result=sequence

  1. a dataset for eDNA: environmental_sample=true, host="" (no host)
query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND environmental_sample=true AND host!="*"
  • include records with environmental_sample=true
  • include records with coordinates and/or specimen_voucher
  • exclude records with dataclass="CON" (see here)
  • exclude records with a host
  2. a dataset for sequenced organisms: environmental_sample=false, host="" (no host)
query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND environmental_sample=false AND host!="*"
  • include records with environmental_sample=false
  • include records with coordinates and/or specimen_voucher
  • exclude records with dataclass="CON" (see here)
  • exclude records with a host
  3. a dataset with hosts
query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND host="*" AND host!="human" AND host!="Homo sapiens" AND host!="Homo_sapiens"
  • include records with coordinates and/or specimen_voucher
  • include records with a host
  • exclude records with dataclass="CON" (see here)
  • exclude records with a human host
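
As a minimal, self-contained sketch, the eDNA query above could be issued from Java like this (the fields list is a placeholder; the adapter's real requests come from gbif-configuration):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative stand-alone fetch of the eDNA dataset query; the adapter's real
// request URLs and field list live in the private gbif-configuration project.
public final class EnaSequenceSearch {

    public static void main(String[] args) throws Exception {
        String query = "(specimen_voucher=\"*\" OR country=\"*\") AND dataclass!=\"CON\""
                + " AND environmental_sample=true AND host!=\"*\"";
        // Placeholder fields list; the actual list is configured elsewhere.
        String fields = "accession,sequence_md5,scientific_name,country,specimen_voucher";

        String url = "https://www.ebi.ac.uk/ena/portal/api/search"
                + "?result=sequence"
                + "&query=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&fields=" + URLEncoder.encode(fields, StandardCharsets.UTF_8)
                + "&format=tsv";

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                      HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```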

WGS_SET request

Request with result=wgs_set. These requests are largely the same as the sequence requests, with two differences:

  • the sequence_md5 field is not supported, so specimen_voucher is used twice to keep the number of fields the same
  • the dataclass filter is not used
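
Continuing the illustrative example above, a wgs_set request could then look like the following, with specimen_voucher standing in for the unsupported sequence_md5 and no dataclass clause (shown unencoded and wrapped for readability):

```
https://www.ebi.ac.uk/ena/portal/api/search?result=wgs_set
  &query=(specimen_voucher="*" OR country="*") AND environmental_sample=true AND host!="*"
  &fields=accession,specimen_voucher,scientific_name,country,specimen_voucher
  &format=tsv
```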

Taxonomy

The adapter requests taxonomy separately: it downloads a zipped archive, unzips it and stores the contents in the database. The configuration is here.
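
A minimal sketch of that step, assuming only that the archive URL comes from the configuration (parsing and the actual insert into the database are left out):

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Minimal sketch of the download-and-unzip step; the archive URL and what is
// done with each entry are assumptions, the real values live in the configuration.
public final class TaxonomyDownload {

    public static void main(String[] args) throws Exception {
        String archiveUrl = args[0]; // taxonomy archive URL from configuration

        HttpResponse<InputStream> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(URI.create(archiveUrl)).GET().build(),
                      HttpResponse.BodyHandlers.ofInputStream());

        try (ZipInputStream zip = new ZipInputStream(response.body())) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                // In the adapter the unzipped content would be parsed and stored
                // in the taxonomy table; here we only list the entries.
                System.out.println("unzipped: " + entry.getName());
            }
        }
    }
}
```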

Database

After execution, the data is stored in a PostgreSQL database. Each dataset has its own tables with raw and processed data.

The database is created only once in the target environment, and the tables are cleaned up before every run.

Database creation scripts for data and taxonomy.

See gbif-configuration here and here for connection properties.
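
For illustration, the per-run cleanup could be as simple as the following (connection details and table names are placeholders; the real ones come from the creation scripts and connection properties linked above, and the PostgreSQL JDBC driver must be on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical per-run table cleanup; table names and credentials are placeholders.
public final class CleanTables {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/embl", "embl", "secret");
             Statement st = conn.createStatement()) {
            // One raw and one processed table per dataset (names are made up here).
            st.executeUpdate("TRUNCATE TABLE edna_dataset");
            st.executeUpdate("TRUNCATE TABLE edna_dataset_processed");
        }
    }
}
```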

Backend deduplication

We perform several deduplication steps.

First step

Run SQL (local copy here) to get rid of some duplicates and join the data with taxonomy; based on the issue here.

Second step

Get rid of records that are missing both specimen_voucher and collection_date.

Third step

Keep only one record per combination of sample_accession and scientific_name and get rid of the rest.
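
A sketch of the second and third steps as SQL, run here against a hypothetical edna_dataset_processed table (the adapter's actual statements are in the SQL scripts linked above):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of deduplication steps two and three; table name and credentials are
// placeholders, and the PostgreSQL JDBC driver must be on the classpath.
public final class Deduplicate {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/embl", "embl", "secret");
             Statement st = conn.createStatement()) {

            // Second step: drop records missing both specimen_voucher and collection_date.
            st.executeUpdate(
                "DELETE FROM edna_dataset_processed"
                + " WHERE specimen_voucher IS NULL AND collection_date IS NULL");

            // Third step: keep one record per (sample_accession, scientific_name)
            // pair and remove the rest (PostgreSQL ctid identifies physical rows).
            st.executeUpdate(
                "DELETE FROM edna_dataset_processed a"
                + " USING edna_dataset_processed b"
                + " WHERE a.ctid > b.ctid"
                + " AND a.sample_accession = b.sample_accession"
                + " AND a.scientific_name = b.scientific_name");
        }
    }
}
```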

DWC archives

The adapter stores all processed data back into the database (tables with the postfix _processed), which are then used by the IPT as SQL sources.
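
An IPT SQL source over such a table can then be as simple as (table name hypothetical):

```
SELECT * FROM edna_dataset_processed
```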

Test datasets (UAT):

and production ones (prod):

Data mapping

Configuration

Remember that all configuration files are in the private gbif-configuration project!

Configuration files in the src/main/resources directory do not affect the adapter; they can be used, for example, for testing (local runs).

Local run

Use the scripts start.sh and start-taxonomy.sh for local testing. Remember to provide valid logback and config files to the scripts (you may need to create the databases before running).
