Empty faa file for 1008392 produces error with diamond #14

Closed
urmi-21 opened this issue May 14, 2020 · 2 comments

Contributor

urmi-21 commented May 14, 2020

I found that the function add_recommended_prokaryotes adds a species with taxon ID 1008392. The uniprot_fill_strata function downloads an empty faa file for it. This breaks the newly added strata_diamond function, because diamond throws an error when given an empty file.

I checked NCBI and found that this species only has a nucleotide sequence: https://www.ncbi.nlm.nih.gov/taxonomy/?term=1008392
UniProt also returns an empty result: https://www.uniprot.org/uniprot/?query=taxonomy%3A1008392&sort=score

I think this species should be removed from add_recommended_prokaryotes.

My current workaround is to manually remove 1008392 from the list of prokaryotes:

library(phylostratr)
library(magrittr)  # provides %>%

# Load the recommended prokaryote tree that ships with phylostratr
prokaryote_sample <- readRDS(system.file("extdata", "prokaryote_sample.rda", package = "phylostratr"))
prokaryotes_toadd <- prokaryote_sample$tip.label

# Drop taxa whose downloaded faa file is empty
toremove <- c("1008392")
prokaryotes_toadd <- prokaryotes_toadd[!prokaryotes_toadd %in% toremove]

# focal_taxid must already be defined
h_strata <-
  uniprot_strata(focal_taxid, from = 2) %>%
  strata_apply(f = diverse_subtree, n = 10, weights = uniprot_weight_by_ref()) %>%
  add_taxa(prokaryotes_toadd) %>%
  uniprot_fill_strata

# Run diamond
h_strata <- strata_diamond(h_strata, blast_args = list(nthreads = 20)) %>% strata_besthits
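
Rather than hard-coding the ID, the empty files could be detected on disk. The following is a minimal base R sketch, not part of phylostratr: the helper name find_empty_faa is hypothetical, and it assumes the proteomes are stored as <taxid>.faa in a directory such as uniprot-seqs (the layout seen in the makeblastdb error in this thread).

```r
# Hypothetical helper (not a phylostratr function): return the taxon IDs
# whose faa file exists but is empty (0 bytes).
# Assumption: files are named <taxid>.faa inside `seq_dir`.
find_empty_faa <- function(seq_dir = "uniprot-seqs") {
  files <- list.files(seq_dir, pattern = "\\.faa$", full.names = TRUE)
  empty <- files[file.size(files) == 0]
  # Strip the directory and the extension to recover the taxon ID
  sub("\\.faa$", "", basename(empty))
}
```

Because the files only exist after uniprot_fill_strata has run, the returned IDs would be used on a second pass, for example to filter prokaryotes_toadd with `prokaryotes_toadd[!prokaryotes_toadd %in% find_empty_faa()]`.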

haojingshao commented Jun 16, 2020

I am following the online supplementary document and ran into this problem too.

When I try
strata <- strata_blast(strata, blast_args=list(nthreads=16L)) %>%
  strata_besthits

I get

Error in (function (fastafile, blastdb = "blastdb", verbose = FALSE) :

Failed to make blast database blastdb/1008392.faa
In addition: Warning message:
In system2("makeblastdb", stderr = TRUE, stdout = TRUE, args = c("-dbtype", :
running command ''makeblastdb' -dbtype prot -in uniprot-seqs/1008392.faa -out blastdb/1008392.faa 2>&1' had status 1

I then removed this TaxID with strata <- prune(strata, '1008392', type='name') and it works.
Separately, the "TaxID.tab" file needs a header line; otherwise you get this error message:

The following named parsers don't match the column names: qseqid, sseqid, qstart, qend, sstart, send, evalue, score, staxid
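
As a rough illustration of the header point, the sketch below prepends a header line to a headerless tab-separated file. It is an assumption that the file should carry exactly the nine columns named in the error message, in that order; the helper name add_blast_header is hypothetical and not part of phylostratr.

```r
# Column names taken from the error message above (order is an assumption).
blast_cols <- c("qseqid", "sseqid", "qstart", "qend",
                "sstart", "send", "evalue", "score", "staxid")

# Hypothetical helper: prepend a tab-separated header line to `path`
# unless the first line is already that header.
add_blast_header <- function(path, cols = blast_cols) {
  lines <- readLines(path)
  header <- paste(cols, collapse = "\t")
  if (length(lines) == 0 || lines[1] != header) {
    writeLines(c(header, lines), path)
  }
  invisible(path)
}
```

Calling it twice on the same file leaves a single header line, so it is safe to run over a whole directory of result files.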

@arendsee
Owner

@urmi-21 @haojingshao you are exactly right. See issue #11 where we just now fixed this issue by removing the species 1008392, as suggested by @urmi-21, from the default set of representatives for bacteria. Apparently the proteome was removed this year from UniProt. Everything should work fine in the latest commit.

@haojingshao the BLAST headers in phylostratr are definitely a pain. You may submit a separate issue report for cleaning up header handling and error messages if you want.

I'll close this issue for now since the immediate problem of the missing 1008392 strain is solved in #11 and the general problem of missing sequences can be avoided by removing them as @haojingshao shows.
