Currently, index.py only accepts a record if the taxonomy in the file being indexed matches exactly. However, many user-provided files contain misspellings or extra spaces.
If a species name does not match exactly, index.py could fall back on the lookup/suggest options that are already implemented. If exactly one species is returned, that species should be used; otherwise the record to be imported should go into exceptions.
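The proposed rule could be sketched roughly as follows. This is only an illustration, not the index.py implementation: `resolve_taxon`, `normalize`, and the `cutoff` value are hypothetical names, and Python's standard `difflib` stands in for the existing lookup/suggest options.

```python
import difflib


def normalize(name):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(name.split())


def resolve_taxon(name, known_names, cutoff=0.9):
    """Return (matched_name, None) on success or (None, reason) on failure.

    Hypothetical helper: difflib is a stand-in for the real suggest lookup.
    """
    cleaned = normalize(name).lower()
    by_lower = {known.lower(): known for known in known_names}
    # Step 1: exact match after whitespace and case normalization.
    if cleaned in by_lower:
        return by_lower[cleaned], None
    # Step 2: fuzzy suggestions; accept only if exactly one candidate comes back.
    candidates = difflib.get_close_matches(cleaned, list(by_lower), n=3, cutoff=cutoff)
    if len(candidates) == 1:
        return by_lower[candidates[0]], None
    # Step 3: zero or several candidates, so the record goes to exceptions.
    return None, "ambiguous" if candidates else "no candidate"
```

A caller would import the record when the first element is not `None` and write it to exceptions otherwise, keeping the reason alongside it.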
I've added a spellcheck option when importing. It deliberately avoids importing misspelled taxa, because there is not enough information available during import to tell a misspelling of a taxon already in the database from a different taxon with a similar name.
With the flags --taxon-lookup any and --taxon-spellcheck, the import functions first look for a matching scientific name. If none is found, they look for a match elsewhere in the taxon names (synonyms, common names, etc.). If there is still no match, the entry is written to exceptions, and the spellcheck tries to find a candidate name, which it writes to exceptions/<input_file>.spellcheck.tsv for manual inspection.
The spellcheck function also prevents new taxa from being created with alt_taxon_id if their names are very similar to existing names.
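As a rough sketch of that lookup order (not the actual import code): `lookup_taxon`, `write_spellcheck`, the two name dictionaries, the TSV column names, and the `cutoff` are all assumptions made for illustration, with `difflib` standing in for the real spellcheck.

```python
import csv
import difflib


def lookup_taxon(name, scientific_names, other_names, spellcheck=True, cutoff=0.85):
    """Staged lookup: scientific name, then any other name, then spellcheck.

    Both dictionaries are assumed to map lower-cased names to taxon IDs.
    """
    key = " ".join(name.split()).lower()
    if key in scientific_names:
        return {"status": "matched", "taxon_id": scientific_names[key], "suggestion": None}
    if key in other_names:  # synonyms, common names, etc.
        return {"status": "matched", "taxon_id": other_names[key], "suggestion": None}
    suggestion = None
    if spellcheck:
        pool = list(scientific_names) + list(other_names)
        close = difflib.get_close_matches(key, pool, n=1, cutoff=cutoff)
        suggestion = close[0] if close else None
    # No match: the entry is an exception; a suggestion is only a hint for review.
    return {"status": "exception", "taxon_id": None, "suggestion": suggestion}


def write_spellcheck(rows, handle):
    """Write (input_name, suggested_name) pairs as TSV for manual inspection."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["input_name", "suggested_name"])
    writer.writerows(rows)
```

The suggestion is never applied automatically in this sketch; it is only recorded so that a person can decide whether the name is a misspelling or a distinct taxon.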
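A guard of this kind might look like the sketch below. `may_create_taxon` and its `cutoff` are hypothetical, and `difflib` is used only as an example similarity measure, not necessarily the one the spellcheck uses.

```python
import difflib


def may_create_taxon(name, existing_names, cutoff=0.9):
    """Return (allowed, similar_names) for a proposed new taxon name.

    Creation is refused when any existing name is at least `cutoff` similar.
    """
    key = " ".join(name.split()).lower()
    pool = [" ".join(existing.split()).lower() for existing in existing_names]
    similar = difflib.get_close_matches(key, pool, n=5, cutoff=cutoff)
    return (not similar, similar)
```

The cutoff is the sensitive part of such a check: a looser value risks also blocking genuinely distinct names that differ by only a letter or two.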
For example:
genomehubs index --taxon_dir /contains/bird/tolids --taxon-lookup any --taxon-spellcheck
Accipiter nanus is a different species to Accipiter nisus and should be imported
Chlidonias hybrida is a synonym of Chlidonias hybridus that should be picked up by --taxon-lookup any, but isn't (a bug in the current implementation)
Phylloscopus sibillatrix appears to be a misspelling of Phylloscopus sibilatrix, but both spellings are present in the input file
An additional bug is that spellchecked taxa in files with alt_taxon_id are still being imported despite being added to exceptions