Check spelling when indexing #58

Closed
sujaikumar opened this issue Mar 19, 2021 · 1 comment
@sujaikumar
Contributor

Currently, index.py checks whether the taxonomy in the file to be indexed matches exactly. However, many user-provided files contain misspellings or extra spaces.

If a species name does not match exactly, index.py could use the lookup/suggest options already implemented. If exactly one species is returned, it should be used; otherwise, the record to be imported can go into exceptions.
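
A minimal sketch of the "accept only if exactly one candidate" rule proposed above. This is not the genomehubs lookup/suggest code: `resolve_taxon`, `known_names` and `cutoff` are hypothetical names, and Python's standard `difflib` stands in for the real suggester.

```python
import difflib


def resolve_taxon(name, known_names, cutoff=0.85):
    """Return (matched_name, None) on success, or (None, reason) for exceptions.

    Hypothetical helper: difflib is only a stand-in for the existing
    lookup/suggest options mentioned in the issue.
    """
    # Collapse repeated, leading and trailing whitespace before comparing.
    cleaned = " ".join(name.split())
    by_lower = {known.lower(): known for known in known_names}

    # Exact (case-insensitive) match wins outright.
    if cleaned.lower() in by_lower:
        return by_lower[cleaned.lower()], None

    # Otherwise ask for close matches and accept only a single candidate.
    candidates = difflib.get_close_matches(
        cleaned.lower(), list(by_lower), n=5, cutoff=cutoff
    )
    if len(candidates) == 1:
        return by_lower[candidates[0]], None
    reason = "ambiguous" if candidates else "no candidate"
    return None, reason
```

A record that comes back as `(None, reason)` would go to exceptions rather than being imported.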

@rjchallis rjchallis self-assigned this Mar 22, 2021
@rjchallis
Contributor

rjchallis commented Mar 22, 2021

I've added a spellcheck option when importing. It deliberately does not import misspelled taxa, because there is not enough information available during import to tell a misspelling of a taxon in the database apart from a different taxon whose name is similar to one already in the database.

With the flags --taxon-lookup any and --taxon-spellcheck, the import functions first look for a matching scientific name. If none is found, they look for a match elsewhere in the taxon names (synonyms, common names, etc.). If there is still no match, the entry is written to exceptions, and the spellcheck tries to find a candidate name and writes it to exceptions/<input_file>.spellcheck.tsv for manual inspection.

The spellcheck function also prevents new taxa from being created with alt_taxon_id if their names are very similar to existing names.
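
The lookup order described above can be sketched roughly as follows. This is an illustration, not the genomehubs implementation: `classify`, `write_spellcheck`, the two dictionaries and the `cutoff` value are all hypothetical, and `difflib` stands in for the real spellcheck.

```python
import csv
import difflib
import io


def classify(name, scientific, other_names, cutoff=0.8):
    """Return (match_type, value) for one input name.

    scientific / other_names are hypothetical dicts mapping lower-cased
    names to taxon IDs; the real import code queries the taxonomy index.
    """
    key = name.lower()
    if key in scientific:  # 1. scientific name
        return "scientific", scientific[key]
    if key in other_names:  # 2. synonyms, common names, etc.
        return "other", other_names[key]
    # 3. no match: look for a similar name to suggest, but do not import
    pool = list(scientific) + list(other_names)
    candidates = difflib.get_close_matches(key, pool, n=1, cutoff=cutoff)
    if candidates:
        return "spellcheck", candidates[0]
    return "exception", None


def write_spellcheck(rows, handle):
    """Write (input, suggested) pairs as a two-column TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["input", "suggested"])
    writer.writerows(rows)
```

Only the first two outcomes would lead to an import; a "spellcheck" result is a suggestion for manual review, since a similar name may belong to a genuinely different taxon.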

For example:

genomehubs index --taxon_dir /contains/bird/tolids --taxon-lookup any --taxon-spellcheck

creates an exception file like:

input	suggested
Accipiter nanus	accipiter nisus
Chlidonias hybrida	Chlidonias hybridus
Phylloscopus sibillatrix	phylloscopus sibilatrix

Where

  • Accipiter nanus is a different species to Accipiter nisus and should be imported
  • Chlidonias hybrida is a synonym of Chlidonias hybridus that should be picked up by --taxon-lookup any, but isn't (a bug in the current implementation)
  • Phylloscopus sibillatrix appears to be a mis-spelling of phylloscopus sibilatrix but both spellings are present in the input file

An additional bug is that spellchecked taxa in files with alt_taxon_id are still being imported despite being added to exceptions.
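
The third case in the list above (both spellings present in the input file) could be flagged during manual inspection with a small script along these lines. It assumes the two-column input/suggested layout shown in the example; `flag_suggestions_in_input` is a hypothetical helper, not part of genomehubs.

```python
import csv
import io


def flag_suggestions_in_input(spellcheck_tsv, input_names):
    """Return (input, suggested) rows whose suggestion is itself an input name.

    spellcheck_tsv is an open text handle on a TSV with "input" and
    "suggested" columns; input_names is an iterable of names taken from
    the original input file.
    """
    # Normalise whitespace and case so the comparison is forgiving.
    seen = {" ".join(name.split()).lower() for name in input_names}
    reader = csv.DictReader(spellcheck_tsv, delimiter="\t")
    return [
        (row["input"], row["suggested"])
        for row in reader
        if row["suggested"].lower() in seen
    ]
```

Rows returned by this check are the ones where the input file contains both the flagged spelling and the suggested one.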
