🐚 Check lists comparisons between both OBIS and BOLD databases

Ulises-Rosas/OBc

OBIS/BOLD comparisons (OBc)

These shell commands generate checklists of currently accepted species names from the OBIS database, then compare and match them against checklists of currently accepted species names from the BOLD database. The WoRMS database is used to validate species names.

Software requirements:

  • anaconda 3
  • git

Installing OBc

git clone https://github.com/Ulises-Rosas/OBc.git && cd OBc
conda env create -f environment.yml
source activate OBc

Commands available

checklists*

There are two mock files available for testing:

head list_*
==> list_geo <==
38,Colombia
260,Chile

==> list_invert <==
Acanthocephala
Reptilia

The checklists command can then be run with:

checklists -t list_invert -g list_geo
ls *.txt
Chile_260_Acanthocephala_bold_validated.txt    Chile_260_Reptilia_bold_validated.txt          Colombia_38_Acanthocephala_bold_validated.txt  Colombia_38_Reptilia_bold_validated.txt
Chile_260_Acanthocephala_obis_validated.txt    Chile_260_Reptilia_obis_validated.txt          Colombia_38_Acanthocephala_obis_validated.txt  Colombia_38_Reptilia_obis_validated.txt

* Intermediate files generated while this command runs have the same names on every run. Therefore, if this command is run in parallel, each run must use its own directory so that intermediate files do not overwrite each other. Since the example above is a single run, the repository directory is used as the working directory.
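
One way to follow this advice is to give every run its own working directory. The sketch below is a minimal illustration in Python and is not part of OBc: `run_isolated` and its parameters are hypothetical names, and it assumes that `checklists` is on the PATH and that input files are passed as absolute paths, because the working directory changes for each run.

```python
import subprocess
import tempfile
from pathlib import Path


def run_isolated(cmd, base_dir, label, dry_run=False):
    """Run `cmd` inside a fresh directory so intermediate files cannot collide.

    Hypothetical helper for illustration; it is not an OBc command.
    """
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a uniquely named directory, even when labels repeat.
    workdir = Path(tempfile.mkdtemp(prefix=f"{label}_", dir=base))
    if not dry_run:
        # Relative input paths would resolve against workdir,
        # so the command should receive absolute paths.
        subprocess.run(cmd, cwd=workdir, check=True)
    return workdir
```

For example, calling `run_isolated(["checklists", "-t", "/abs/path/list_invert", "-g", "/abs/path/list_geo"], "runs", "test")` from several processes would leave each run's intermediate and output files in a separate subdirectory of `runs`. The `/abs/path` prefix is a placeholder.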

joinfiles.py

As its name suggests, this command merges the results of the checklists command, adding metadata that is already encoded in the filenames:

joinfiles --matching _obis_
valid_name,region,subgroup,group
Dermochelys coriacea,Chile,Reptilia,Reptilia
Lepidochelys olivacea,Chile,Reptilia,Reptilia
Caretta caretta,Colombia,Reptilia,Reptilia
Dermochelys coriacea,Colombia,Reptilia,Reptilia
Eretmochelys imbricata,Colombia,Reptilia,Reptilia
joinfiles --matching _bold_
valid_name,synonyms,availability,region,subgroup,group
Dermochelys coriacea,Dermochelys coriacea,public_outside,Chile,Reptilia,Reptilia
Lepidochelys olivacea,Lepidochelys olivacea,public_outside,Chile,Reptilia,Reptilia
Caretta caretta,Caretta caretta,public_outside,Colombia,Reptilia,Reptilia
Dermochelys coriacea,Dermochelys coriacea,public_outside,Colombia,Reptilia,Reptilia
Eretmochelys imbricata,Eretmochelys imbricata,public_inside,Colombia,Reptilia,Reptilia
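
To illustrate how metadata can be recovered from a filename, here is a minimal Python sketch. It is not the code used by joinfiles. The pattern `<region>_<id>_<taxon>_<source>_validated.txt` is an assumption inferred from the filenames listed by `ls` above, and `parse_checklist_name` and the `area_id` key are hypothetical names.

```python
from pathlib import Path

SUFFIX = "_validated.txt"


def parse_checklist_name(filename):
    """Split '<region>_<id>_<taxon>_<source>_validated.txt' into its fields.

    The naming pattern is assumed from the example filenames, not documented.
    """
    name = Path(filename).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"unexpected checklist filename: {filename}")
    core = name[: -len(SUFFIX)]
    # Split from the right so a region containing underscores stays intact.
    parts = core.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected checklist filename: {filename}")
    region, area_id, taxon, source = parts
    return {
        "region": region,
        "area_id": area_id,
        "subgroup": taxon,
        "source": source,
    }
```

With this sketch, `parse_checklist_name("Chile_260_Reptilia_bold_validated.txt")` yields the region `Chile`, the subgroup `Reptilia` and the source `bold`, which correspond to the region and subgroup columns shown above.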

The default value of the --matching option is _bold_; it is stated explicitly above only for clarity. By default, the group column takes the same value as the subgroup column, but this can be changed with the --as option. This is particularly useful when merging an entire directory (i.e. using the --from option) under a custom group:

joinfiles \
   --from data/Invertebrate\
   --as Invertebrate\
   --matching _bold_ > invertebrate_bold.txt 

head -n 5 invertebrate_bold.txt
valid_name,synonyms,availability,region,subgroup,group
Aglaophamus macroura,Aglaophamus macroura,private,Chile,Annelida,Invertebrate
Aglaophamus trissophyllus,Aglaophamus trissophyllus,public_outside,Chile,Annelida,Invertebrate
Amphitrite kerguelensis,Amphitrite kerguelensis,private,Chile,Annelida,Invertebrate
Ancistrosyllis groenlandica,Ancistrosyllis groenlandica,public_outside,Chile,Annelida,Invertebrate
joinfiles \
   --from data/Invertebrate\
   --as Invertebrate\
   --matching _obis_ > invertebrate_obis.txt
             
head -n 5 invertebrate_obis.txt
valid_name,region,subgroup,group
Abyssoninoe abyssorum,Chile,Annelida,Invertebrate
Aglaophamus foliosus,Chile,Annelida,Invertebrate
Aglaophamus macroura,Chile,Annelida,Invertebrate
Aglaophamus peruana,Chile,Annelida,Invertebrate

Likewise, this command can join files from several directories while adding the corresponding value to the group column for each directory:

joinfiles \
   --from data/Invertebrate data/Actinopterygii data/Elasmobranchii data/Reptilia data/Mammalia\
   --as Invertebrate Actinopterygii Elasmobranchii Reptilia Mammalia\
   --matching _bold_ > WholeDirectories_bold.txt
joinfiles \
   --from data/Invertebrate data/Actinopterygii data/Elasmobranchii data/Reptilia data/Mammalia\
   --as Invertebrate Actinopterygii Elasmobranchii Reptilia Mammalia\
   --matching _obis_ > WholeDirectories_obis.txt

Each file is bigger than 400 KB and these can be found here: WholeDirectories_bold.txt, WholeDirectories_obis.txt
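
The labelling rule described for the group column can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the joinfiles implementation: `label_rows` and `join_directories` are hypothetical names, rows are plain dictionaries, and directories are assumed to be paired with group labels by position, one label per directory.

```python
def label_rows(rows, group=None):
    """Return copies of `rows` with a 'group' value added.

    When no custom group is given, the row's own subgroup is reused.
    """
    labelled = []
    for row in rows:
        value = group if group is not None else row["subgroup"]
        labelled.append({**row, "group": value})
    return labelled


def join_directories(rows_by_dir, groups=None):
    """Concatenate rows from several directories, labelling each by position."""
    dirs = list(rows_by_dir)
    if groups is None:
        groups = [None] * len(dirs)
    if len(groups) != len(dirs):
        # Assumption: exactly one label per directory.
        raise ValueError("need exactly one group label per directory")
    joined = []
    for directory, group in zip(dirs, groups):
        joined.extend(label_rows(rows_by_dir[directory], group))
    return joined
```

Under these assumptions, rows read from `data/Invertebrate` with subgroup `Annelida` receive the group `Invertebrate` when that label is supplied, and `Annelida` when it is not.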

checkspps*

Now suppose we have a species list rather than a list of taxonomic ranks. This list may come from any source other than the OBIS database (e.g. FishBase, WoRMS). checkspps was created precisely to take a custom species list and compare its species directly against BOLD, skipping the OBIS data-mining steps.

If we have a species list called sl_test.txt, then:

checkspps Reptilia\
   --area-name Peru\
   --species-list sl_test.txt\
   --at Phylum

This returns two files: Peru_Reptilia_obis_validated.txt and Peru_Reptilia_bold_validated.txt. Their format is the same as the output of the joinfiles.py command. Values in the subgroup column are taken from the taxonomic rank specified with the --at option. The group and country columns are filled with the positional argument (Reptilia in the example above) and the value of the --area-name option, respectively.

* Intermediate files generated while this command runs have the same names on every run. Therefore, if this command is run in parallel, each run must use its own directory so that intermediate files do not overwrite each other. Since the example above is a single run, the repository directory is used as the working directory.
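
To make the column rules for checkspps output concrete, the following Python sketch assembles one row from its three sources. It is illustrative only: `build_row` and the `classification` dictionary are hypothetical, and the key `region` is assumed to be the column that the text calls country, following the joinfiles header.

```python
def build_row(species, classification, group, area_name, at="Phylum"):
    """Assemble one output row from the three sources described in the text.

    `classification` maps rank names to values for `species` (hypothetical input).
    """
    return {
        "valid_name": species,
        "region": area_name,              # value of --area-name
        "subgroup": classification[at],   # rank selected with --at
        "group": group,                   # positional argument
    }


row = build_row(
    "Dermochelys coriacea",
    {"Phylum": "Chordata", "Class": "Reptilia"},
    group="Reptilia",
    area_name="Peru",
)
print(",".join(row.values()))
```

With `--at Phylum`, as in the command above, the subgroup value comes from the phylum of each species, while group and region repeat the same value on every row.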

barplot

barplot -i data/invertebrate_bold.txt

upsetplot

upsetplot -i data/bold.csv

sankeyplot

sankeyplot -b data/bold.csv -o data/obis.csv

auditspps

This command adds both an auditing step (Oliveira et al. 2016) and custom taxonomic ranks (according to the WoRMS database) to each selected specimen with public records in the BOLD repository:

auditspps -i data/bold.csv --at Phylum Order Family Genus

The output name is based on the input name, and the --at option specifies which taxonomic ranks to look up.

head bold_audited.tsv
Group	Species	Classification	sppsOnBins	N	N_Institutes	taxIDs	BINs	Phylum	Order	Family	Genus
Actinopterygii	Ablennes hians	C	Ablennes hians	14	6	13055	BOLD:AAB9824,BOLD:AAC1231,BOLD:AAC1232,BOLD:AAH7716	Chordata	Beloniformes	Belonidae	Ablennes
Actinopterygii	Abudefduf concolor	D		3	1	11481		Chordata	Perciformes	Pomacentridae	Abudefduf
Actinopterygii	Abudefduf saxatilis	E**	Abudefduf taurus,Abudefduf saxatilis	58	7	34219	BOLD:AAA7276,BOLD:AAA7275	Chordata	Perciformes	Pomacentridae	Abudefduf
Actinopterygii	Abudefduf taurus	E*	Abudefduf taurus,Abudefduf saxatilis	9	3	60120	BOLD:AAA7276	Chordata	Perciformes	Pomacentridae	Abudefduf
Actinopterygii	Abudefduf troschelii	D	Abudefduf troschelii	3	1	11480	BOLD:AAC8011	Chordata	Perciformes	Pomacentridae	Abudefduf
Actinopterygii	Acanthemblemaria balanorum	D	Acanthemblemaria balanorum	1	1	374819	BOLD:ABU5784	Chordata	Perciformes	Chaenopsidae	Acanthemblemaria
Actinopterygii	Acanthemblemaria castroi	D	Acanthemblemaria castroi	1	1	175464	BOLD:AAJ3429	Chordata	Perciformes	Chaenopsidae	Acanthemblemaria
Actinopterygii	Acanthemblemaria exilispinus	D		3	1	18107		Chordata	Perciformes	Chaenopsidae	Acanthemblemaria
Actinopterygii	Acanthemblemaria hancocki	D		2	1	18109		Chordata	Perciformes	Chaenopsidae	Acanthemblemaria
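
Because the audited file is tab-separated with a header row, it can be summarised with Python's standard library. The sketch below counts rows per value of the Classification column. `count_grades` is a hypothetical helper, and the inline sample is trimmed to the first five columns of the output shown above.

```python
import csv
import io
from collections import Counter


def count_grades(handle):
    """Count rows per value of the Classification column in an audited TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Classification"] for row in reader)


# Three rows copied from the output above, trimmed to five columns.
sample = (
    "Group\tSpecies\tClassification\tsppsOnBins\tN\n"
    "Actinopterygii\tAblennes hians\tC\tAblennes hians\t14\n"
    "Actinopterygii\tAbudefduf concolor\tD\t\t3\n"
    "Actinopterygii\tAbudefduf troschelii\tD\tAbudefduf troschelii\t3\n"
)
grades = count_grades(io.StringIO(sample))
print(grades)  # Counter({'D': 2, 'C': 1})
```

For a real file, the same function can be applied to `open("bold_audited.tsv", newline="")`. Values such as `E*` and `E**` are counted as distinct grades, since they are compared as plain strings.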

radarplot

This command uses radar plots to depict the composition of the audit performed by auditspps:

radarplot -i bold_audited.tsv --at Order --n 4 -l

The example above draws the radars ordered by species count and uses the Order category to build the inner polygons. The --n option sets the maximum number of polygons per radar, and -l adds legends. Further options for adjusting the appearance of the radars are listed by radarplot -h.
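
The ordering and the limit on categories can be pictured with a short Python sketch. This is not radarplot's implementation. It assumes that each row of the audited file is one species, that groups are ordered by their total species count, and that the n most frequent categories are kept per group. `radar_layout` is a hypothetical name and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def radar_layout(handle, at="Order", n=4):
    """List groups by species count, each with its n most frequent `at` values."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(handle, delimiter="\t"):
        # One row is assumed to represent one species.
        counts[row["Group"]][row[at]] += 1
    # Groups with more species come first.
    ordered = sorted(
        counts.items(),
        key=lambda item: sum(item[1].values()),
        reverse=True,
    )
    return [(group, counter.most_common(n)) for group, counter in ordered]


# Illustrative sample, reduced to the three columns the sketch needs.
sample = (
    "Group\tSpecies\tOrder\n"
    "Reptilia\tDermochelys coriacea\tTestudines\n"
    "Actinopterygii\tAblennes hians\tBeloniformes\n"
    "Actinopterygii\tAbudefduf saxatilis\tPerciformes\n"
    "Actinopterygii\tAbudefduf taurus\tPerciformes\n"
)
layout = radar_layout(io.StringIO(sample), at="Order", n=1)
print(layout)
```

With `n=1`, only the most species-rich Order is kept for each group, and Actinopterygii is listed before Reptilia because it has more rows in the sample.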
