Zika Virus is not in your p+h+v pre-made indices? AND Centrifuge-download does not work? #53
Comments
It seems there is currently no complete Zika genome in RefSeq. I found that very surprising, too. Look at https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi and https://www.ncbi.nlm.nih.gov/assembly/?term=txid64320[Organism:noexp] . Since we only take the latest complete genome, it didn't find its way into the database. I think it is a mistake that that assembly is flagged as 'Scaffold' level assembled: there is only one scaffold, and it replaced an assembly that was flagged as complete.
I will look into the latter issue of downloading the RefSeq data. However, that won't fix the missing Zika genome; RefSeq has to be updated for that. In the meantime, you could add the Zika virus reference genome, and add one entry (
Also, I'll work on providing a Makefile target for a database that includes viral strains from the NCBI viral genome resource.
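Fixed now. Couple of points:
- Consider installing rsync for faster downloads. The downloads failed because the script falls back to curl/wget when rsync is not installed, and those did not have the address updated from ftp to https.
- I added several more database targets to the Makefile, including one with only viruses (v) and prokaryotes (p) or the combination (p+v). Try make THREADS=10 v etc.
I'll re-build the standard database next week with all viral genomes.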
Hello,
Re-installed centrifuge, and installed rsync. When trying to make the p+v index, I got the following error...
jrussellmac:indices jrussell$ make THREADS=4 p+v DONT_DUSTMASK=1
Also tried 'make THREADS=4 v'. Error is below...
jrussellmac:indices jrussell$ make THREADS=4 v DONT_DUSTMASK=1
It seems like NCBI isn't liking the way things are named in the Makefile? I tried changing names a bit but got nowhere. Any insight much appreciated.
Thanks.
Hello,
Thank you for the updates. Do the new p+v indices include Zika?
What if I have my own custom reference FASTA, but not the other files? Is there a way to generate the other files needed (conversion table, taxonomy tree, name table) from a custom reference FASTA using NCBI software or samtools?
I'm still running into the same download error when trying to 'make p+v', i.e., NCBI doesn't like the link. I double-checked that I do have rsync installed.
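Editor's sketch for the custom-FASTA question above. As far as I understand the Centrifuge manual, the conversion table is a plain two-column, tab-separated file mapping each sequence ID to a taxonomy ID, while the taxonomy tree and name table are normally NCBI's nodes.dmp and names.dmp used as-is. Under that assumption, the conversion table is the only file that has to be derived from the FASTA. The function names and the example header below are hypothetical; the taxid 64320 is the Zika virus ID that appears in the assembly link earlier in this thread.

```python
import io


def conversion_table_rows(fasta_lines, taxid):
    """Yield (sequence_id, taxid) for every FASTA header line.

    The sequence ID is taken to be the first whitespace-delimited token
    after '>', which is an assumption about how the index builder keys
    sequences.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare ">" with no identifier
                yield fields[0], taxid


def write_conversion_table(fasta_lines, taxid, out):
    """Write one 'seqid<TAB>taxid' line per FASTA record to `out`."""
    for seq_id, tid in conversion_table_rows(fasta_lines, taxid):
        out.write(f"{seq_id}\t{tid}\n")


# Hypothetical single-record reference; a real file would be opened with open().
example_fasta = io.StringIO(">my_zika_ref hypothetical custom reference\nACGTACGT\n")
buf = io.StringIO()
write_conversion_table(example_fasta, 64320, buf)
print(buf.getvalue(), end="")
```

This assumes every record in the FASTA belongs to the same taxon. A mixed reference would need a per-record lookup instead of a single taxid argument.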
Thanks,
Joe
…-------------------------------------------
Joe Russell, Ph.D.
www.waywardscientist.com
On Sat, Feb 11, 2017 at 6:32 PM, Florian Breitwieser < ***@***.***> wrote:
Fixed now. Couple of points:
- Consider installing rsync for faster downloads. The downloads failed because the script falls back to curl/wget when rsync is not installed, and those did not have the address updated from ftp to https.
- I added several more database targets to the Makefile, including one with only viruses (v) and prokaryotes (p) or the combination (p+v). Try make THREADS=10 v etc.
I'll re-build the standard database next week with all viral genomes.
Hello,
Kind of at my wit's end with Centrifuge, as I've been trying to get it to work with my own database, and with NCBI bacteria and viruses, for a long time now. To paraphrase Roseanne Roseannadanna, "It's always something..."
I recently gave it another go with your pre-made indices just to see if I could get it to run at all. Before running a bunch of my samples through, I used centrifuge-inspect and grep to determine whether all of my target organisms were indeed in the database...
$ centrifuge-inspect --name-table p+h+v > nametable.txt
$ grep "Zika" nametable.txt
$
From what I can tell, Zika virus is not in p+h+v (the pre-made bacteria, archaea, viruses, and human index listed in the right margin of your website)? All of my other target organisms (Human papillomavirus type 132 and Variola virus, for example) are included in this index.
$ grep "Human papillomavirus type 132" nametable.txt
909331 Human papillomavirus type 132
$ grep "Variola" nametable.txt
10255 Variola virus
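Editor's sketch: the grep checks above can be batched so that every target organism is tested in one pass. This assumes the name table written by centrifuge-inspect has one "taxid, whitespace, name" record per line, as the grep output above suggests. The function names are hypothetical, and the two sample lines are copied from the output above.

```python
def load_names(lines):
    """Parse 'taxid<whitespace>name' lines into a list of (taxid, name)."""
    names = []
    for line in lines:
        parts = line.strip().split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            names.append((parts[0], parts[1]))
    return names


def missing_targets(names, targets):
    """Return the targets that match no name (case-insensitive substring)."""
    lowered = [name.lower() for _, name in names]
    return [t for t in targets if not any(t.lower() in n for n in lowered)]


# With a real file: names = load_names(open("nametable.txt"))
sample = [
    "909331\tHuman papillomavirus type 132\n",
    "10255\tVariola virus\n",
]
targets = ["Zika", "Variola", "Human papillomavirus type 132"]
print(missing_targets(load_names(sample), targets))
```

On the two sample lines, Zika is the only target reported as missing, which matches the empty grep result above.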
ALSO...
Since Zika did not seem to be included, I tried using centrifuge-download again, but I get an error. The connection to NCBI's ftp site seems to be blocked or otherwise not good. Below is the error I get...
$ centrifuge-download -o taxonomy taxonomy
Downloading NCBI taxonomy ...
rsync: failed to connect to ftp.ncbi.nih.gov (130.14.250.7): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nih.gov (2607:f220:41e:250::13): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.0]
I sent an email to NCBI describing what I was trying to do and asking whether there was an issue on their end or maybe my corporate firewall was the problem. Here is their response...
Hi,
Thanks for writing to us.
The issue is mostly in the http protocol used by the tool. With the switching to HTTPS late last year, NCBI also requires that http access to our ftp site be switched to HTTPS. You will need to contact the Centrifuge code provider for them to update their code to use HTTPS protocol instead.
A minor issue is the ftp.ncbi.nih.gov domain. Even though it may still work for historical reasons, it may not. The domain should be fully specified with .nlm included, aka ftp.ncbi.nlm.nih.gov
Regards,
Tao Tao, PhD
NCBI User Services
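Editor's sketch of the two changes NCBI describes: switching ftp/http links to HTTPS and using the fully specified ftp.ncbi.nlm.nih.gov domain. This is not the code in centrifuge-download (a shell script whose fallback download addresses would need the equivalent change); it only illustrates the rewrite. The function name is hypothetical, and the taxdump path is used as an example address.

```python
from urllib.parse import urlsplit, urlunsplit

LEGACY_HOST = "ftp.ncbi.nih.gov"       # short domain that may stop resolving
CURRENT_HOST = "ftp.ncbi.nlm.nih.gov"  # fully specified domain recommended by NCBI


def to_https_ncbi(url):
    """Rewrite an NCBI download URL to HTTPS on the fully specified host.

    Other schemes (for example rsync) keep their scheme and only have the
    host updated. Port and user info are dropped, which is acceptable for
    public download links.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host == LEGACY_HOST:
        host = CURRENT_HOST
    scheme = "https" if parts.scheme in ("ftp", "http") else parts.scheme
    return urlunsplit((scheme, host, parts.path, parts.query, parts.fragment))


print(to_https_ncbi("ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"))
```

Whether a given client can reach the rewritten address still depends on the local firewall, which this sketch cannot check.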
I dove into the centrifuge-download script to see if I could manually update the web address that the script points to. There was only one place where the address was listed without the '.nlm' in it, and that was line 194. I added the '.nlm' to the address on that line, saved, re-compiled, and re-ran, but I got the same error. I didn't see any references to http or https in the centrifuge-download source code.
Also, where does one manually retrieve the names.dmp and nodes.dmp files from NCBI? Weren't those files phased out when they updated to the new format without GI numbers?
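Editor's note with a sketch: to my knowledge, names.dmp and nodes.dmp are still distributed inside taxdump.tar.gz in the pub/taxonomy directory of NCBI's FTP site. The GI phase-out affected the GI-to-taxid mapping files, not the taxonomy dump itself. Once the archive is unpacked, names.dmp can be read as below. The field layout (fields separated by tab, pipe, tab, with a trailing tab and pipe) is the documented taxdump format; the function name and the hand-written sample lines are illustrative, with taxid 64320 taken from the assembly link in this thread.

```python
def scientific_names(lines):
    """Map taxid -> scientific name from names.dmp-formatted lines.

    Each record is: tax_id | name_txt | unique name | name class |
    with fields separated by '\t|\t' and the line ending in '\t|'.
    """
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the trailing field terminator
        fields = line.split("\t|\t")
        if len(fields) >= 4 and fields[3] == "scientific name":
            result[int(fields[0])] = fields[1]
    return result


# Hand-written lines in names.dmp layout, for illustration only.
sample_dmp = [
    "64320\t|\tZika virus\t|\t\t|\tscientific name\t|\n",
    "64320\t|\tZIKV\t|\t\t|\tacronym\t|\n",
    "10255\t|\tVariola virus\t|\t\t|\tscientific name\t|\n",
]
print(scientific_names(sample_dmp))
```

Only records whose name class is "scientific name" are kept, so synonyms and acronyms for the same taxid do not overwrite the primary name.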
Any help ironing out these problems would be much appreciated.
Thank you.