Check spelling when indexing #58

sujaikumar · 2021-03-19T17:01:08Z

Currently, index.py checks whether the taxonomy in the file to be indexed matches exactly. However, many user-provided files contain misspellings or extra spaces.

If a species name does not match exactly, index.py could use the lookup / suggest options already implemented. If exactly one species is returned, it should be used; otherwise the record to be imported can go into exceptions.

rjchallis · 2021-03-22T14:21:28Z

I've added a spellcheck option when importing. It deliberately avoids importing mis-spelled taxa, as there is not enough information available during import to tell the difference between a mis-spelling of a taxon in the database and an alternate taxon with a name similar to one already in the database.

Using the flags --taxon-lookup any and --taxon-spellcheck, the import functions first look for a matching scientific name; if none is found, they look for a match elsewhere in taxon names (synonyms, common names, etc.). If no match is found, the entry is written to exceptions, and the spellcheck tries to find a candidate name, which it writes to exceptions/<input_file>.spellcheck.tsv for manual inspection.
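The lookup order described above can be sketched roughly as follows. This is only an illustration, not the genomehubs implementation: the `resolve` function, the in-memory dictionaries, the placeholder IDs and the use of `difflib` with a 0.8 cutoff are all assumptions made for the example.

```python
import difflib

# Toy stand-ins for the taxonomy index (hypothetical; IDs are placeholders).
SCIENTIFIC = {
    "Accipiter nisus": "id-1",
    "Chlidonias hybridus": "id-2",
    "Phylloscopus sibilatrix": "id-3",
}
OTHER_NAMES = {"Chlidonias hybrida": "id-2"}  # synonyms, common names, etc.


def resolve(name, lookup="any", spellcheck=True):
    """Return (taxon_id, suggestion).

    taxon_id is None when the record should go to exceptions; suggestion
    is a candidate name for manual inspection and is never imported.
    """
    # 1. Exact match on scientific name.
    if name in SCIENTIFIC:
        return SCIENTIFIC[name], None
    # 2. Exact match elsewhere in taxon names.
    if lookup == "any" and name in OTHER_NAMES:
        return OTHER_NAMES[name], None
    # 3. No match: optionally look for a similar name to report.
    suggestion = None
    if spellcheck:
        candidates = list(SCIENTIFIC) + list(OTHER_NAMES)
        close = difflib.get_close_matches(name, candidates, n=1, cutoff=0.8)
        suggestion = close[0] if close else None
    return None, suggestion
```

In this sketch a close match is only ever reported, never used for import, which mirrors the reasoning above: `resolve("Accipiter nanus")` suggests "Accipiter nisus" even though the two may be distinct species.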

The spellcheck function also prevents new taxa being created with alt_taxon_id if they are very similar to existing names.

For example:

genomehubs index --taxon_dir /contains/bird/tolids --taxon-lookup any --taxon-spellcheck

creates an exception file like:

input	suggested
Accipiter nanus	accipiter nisus
Chlidonias hybrida	Chlidonias hybridus
Phylloscopus sibillatrix	phylloscopus sibilatrix

Where

Accipiter nanus is a different species to Accipiter nisus and should be imported
Chlidonias hybrida is a synonym of Chlidonias hybridus that should be picked up by --taxon-lookup any, but isn't (a bug in the current implementation)
Phylloscopus sibillatrix appears to be a mis-spelling of phylloscopus sibilatrix but both spellings are present in the input file
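As a rough aid for the manual inspection step, the sketch below reads a spellcheck file with the two columns shown above and flags rows where the suggested name also occurs in the input, as in the Phylloscopus case. The `triage` helper and the note strings are hypothetical and not part of genomehubs; the comparison ignores case because the suggestions in the example are partly lower-cased.

```python
import csv
import io


def triage(spellcheck_tsv, input_names):
    """Return (input, suggested, note) for each row of a spellcheck file."""
    known = {name.lower() for name in input_names}
    reader = csv.DictReader(io.StringIO(spellcheck_tsv), delimiter="\t")
    rows = []
    for row in reader:
        # A suggestion that is itself in the input means both spellings
        # were supplied, so one of them is probably a typo.
        if row["suggested"].lower() in known:
            note = "both spellings in input"
        else:
            note = "check manually"
        rows.append((row["input"], row["suggested"], note))
    return rows


# Example data copied from the exception file in this thread.
EXAMPLE_TSV = (
    "input\tsuggested\n"
    "Accipiter nanus\taccipiter nisus\n"
    "Chlidonias hybrida\tChlidonias hybridus\n"
    "Phylloscopus sibillatrix\tphylloscopus sibilatrix\n"
)
EXAMPLE_INPUT = [
    "Accipiter nanus",
    "Chlidonias hybrida",
    "Phylloscopus sibillatrix",
    "Phylloscopus sibilatrix",
]

for entry in triage(EXAMPLE_TSV, EXAMPLE_INPUT):
    print("\t".join(entry))
```

Rows marked "check manually" still need a human decision, since a similar name can be a genuine separate species or an unhandled synonym.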

An additional bug is that spellchecked taxa in files with alt_taxon_id are still being imported despite being added to exceptions.

sujaikumar added the enhancement label Mar 19, 2021

rjchallis self-assigned this Mar 22, 2021

rjchallis mentioned this issue Mar 23, 2021

Check all names / synonyms while indexing new data in GoaT #47

Closed

rjchallis closed this as completed in 2025061 Aug 6, 2021
