Zika Virus is not in your p+h+v pre-made indices? AND Centrifuge-download does not work? #53

Open
waywardsyintist opened this issue Feb 8, 2017 · 4 comments

@waywardsyintist

Hello,

I'm at my wits' end with Centrifuge: I've been trying for a long time to get it to work with my own database and with the NCBI bacterial and viral genomes. To paraphrase Roseanne Roseannadanna, "It's always something..."

I recently gave it another go with your pre-made indices just to see if I could get it to run at all. Before running a bunch of my samples through, I used centrifuge-inspect and grep to check whether all of my target organisms were actually in the database:

$ centrifuge-inspect --name-table p+h+v > nametable.txt
$ grep "Zika" nametable.txt
$

From what I can tell, Zika virus is not in p+h+v (the pre-made bacteria, archaea, viruses, and human index listed in the right margin of your website). All of my other target organisms (Human papillomavirus type 132 and Variola virus, for example) are included in this index.

$ grep "Human papillomavirus type 132" nametable.txt
909331 Human papillomavirus type 132

$ grep "Variola" nametable.txt
10255 Variola virus
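
The per-organism grep checks above can be wrapped in a small loop so that every target is checked in one pass. This is only a sketch: `check_targets` is a hypothetical helper name, and it assumes the name table was written to a file with `centrifuge-inspect --name-table` as shown earlier.

```shell
# Report which target organisms are missing from a Centrifuge name table.
# Usage: check_targets NAMETABLE PATTERN...
check_targets() {
    table=$1
    shift
    missing=0
    for pattern in "$@"; do
        # -F: treat the pattern as a fixed string; -i: ignore case.
        if grep -q -i -F -- "$pattern" "$table"; then
            echo "found:   $pattern"
        else
            echo "MISSING: $pattern"
            missing=$((missing + 1))
        fi
    done
    # Return non-zero when at least one target is absent.
    [ "$missing" -eq 0 ]
}

# Example:
#   centrifuge-inspect --name-table p+h+v > nametable.txt
#   check_targets nametable.txt "Zika" "Variola" "Human papillomavirus type 132"
```

The non-zero return status makes it easy to stop a pipeline before classifying samples against an index that lacks a target.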

ALSO...

Since Zika did not seem to be included, I tried using centrifuge-download again, but I get an error. The connection to NCBI's ftp site seems to be blocked or otherwise not good. Below is the error I get...

$ centrifuge-download -o taxonomy taxonomy
Downloading NCBI taxonomy ...
rsync: failed to connect to ftp.ncbi.nih.gov (130.14.250.7): Connection refused (111)
rsync: failed to connect to ftp.ncbi.nih.gov (2607:f220:41e:250::13): Network is unreachable (101)
rsync error: error in socket IO (code 10) at clientserver.c(128) [Receiver=3.1.0]

I sent an email to NCBI describing what I was trying to do and asking whether there was an issue on their end or maybe my corporate firewall was the problem. Here is their response...

Hi,

Thanks for writing to us.

The issue is mostly in the http protocol used by the tool. With the switching to HTTPS late last year, NCBI also requires that http access to our ftp site be switched to HTTPS. You will need to contact the Centrifuge code provider for them to update their code to use HTTPS protocol instead.

A minor issue is the ftp.ncbi.nih.gov domain. Even though it may still work for historical reasons, it may not. The domain should be fully specified with .nlm included, aka ftp.ncbi.nlm.nih.gov

Regards,

Tao Tao, PhD
NCBI User Services

I dove into the centrifuge-download script to see if I could manually update the web address that the script points to. Only one address in the script lacked the '.nlm', on line 194. I added the '.nlm' to the address on that line, saved, re-installed, and re-ran, but I got the same error. I didn't see any references to http or https in the centrifuge-download source code.
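
For reference, the kind of manual edit described above can be done with sed instead of by hand. The sketch below is an assumption about what such a rewrite could look like, not the fix that went into Centrifuge: `fix_ncbi_urls` is a hypothetical helper that adds the `.nlm` part to the host name and switches `http://` or `ftp://` URLs on that host to `https://`. It leaves `rsync://` schemes alone, so it would not by itself help when rsync is blocked by a firewall.

```shell
# Rewrite legacy NCBI FTP addresses in text read from stdin:
#   ftp.ncbi.nih.gov  ->  ftp.ncbi.nlm.nih.gov
# and switch http:// or ftp:// URLs on that host to https://.
fix_ncbi_urls() {
    sed -E \
        -e 's#ftp\.ncbi\.nih\.gov#ftp.ncbi.nlm.nih.gov#g' \
        -e 's#(http|ftp)://ftp\.ncbi\.nlm\.nih\.gov#https://ftp.ncbi.nlm.nih.gov#g'
}

# Example (writes a patched copy and keeps the original script untouched):
#   fix_ncbi_urls < centrifuge-download > centrifuge-download.patched
```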

Also, where does one manually retrieve the names.dmp and nodes.dmp files from NCBI? Weren't those files phased out when they updated to the new format without GI numbers?

Any help ironing out these problems would be much appreciated.

Thank you.

@fbreitwieser
Collaborator

It seems there is currently no complete Zika genome in RefSeq - I found that very surprising, too.

Look at https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi and https://www.ncbi.nlm.nih.gov/assembly/?term=txid64320[Organism:noexp] . Since we only take the latest complete genome, Zika didn't find its way into the database. I think it is a mistake that the current assembly is flagged as 'Scaffold' level - there is only one scaffold, and it replaced an assembly that was flagged as complete.

I will look into the latter issue of downloading the RefSeq data. That won't fix the missing Zika genome, though - RefSeq has to be updated for that. In the meantime you could add the Zika virus reference genome to your input sequences, and add one entry (NC_012532<tab>64320) to the map file provided to centrifuge-build via the --conversion-table argument.
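
A minimal sketch of adding that entry from the shell, in case it helps. `add_mapping` is a hypothetical helper, and the file names in the example (seqid2taxid.map, zika.fna, input-sequences.fna, my_index) are placeholders, not names the Makefile produces. It also assumes the sequence ID in the map file has to match the ID in the FASTA header, so check whether your header carries a version suffix.

```shell
# Append "SEQID<tab>TAXID" to a conversion table unless SEQID is already listed.
# Usage: add_mapping MAPFILE SEQID TAXID
add_mapping() {
    map_path=$1
    seqid=$2
    taxid=$3
    touch "$map_path"
    # awk compares the first tab-separated column exactly (no substring match).
    if awk -F '\t' -v id="$seqid" '$1 == id { found = 1 } END { exit !found }' "$map_path"; then
        echo "already present: $seqid" >&2
    else
        printf '%s\t%s\n' "$seqid" "$taxid" >> "$map_path"
    fi
}

# Example, with placeholder file names:
#   add_mapping seqid2taxid.map NC_012532 64320
#   cat zika.fna >> input-sequences.fna
#   centrifuge-build --conversion-table seqid2taxid.map \
#       --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
#       input-sequences.fna my_index
```

The duplicate check keeps the map file clean if the helper is run more than once.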

Also I'll work on providing a Makefile target for a database that includes viral strains from the NCBI viral genome resource.

@fbreitwieser
Collaborator

Fixed now. Couple of points:

  • Consider installing rsync for faster downloads. The downloads failed because the script falls back to curl/wget when rsync is not installed, and the addresses used by those fallbacks had not been updated from ftp to https.

  • I added several more database targets to the Makefile, including ones for only viruses (v), only prokaryotes (p), and the combination (p+v). Try

    make THREADS=10 v

etc.

I'll re-build the standard database next week with all viral genomes.
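
To illustrate the fallback described in the first point: the sketch below shows one way a script can prefer rsync and fall back to curl or wget. It is not the code in centrifuge-download; `pick_downloader` is a hypothetical name used only for illustration.

```shell
# Print the first available download tool, preferring rsync.
pick_downloader() {
    for tool in rsync curl wget; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    # Nothing usable was found on PATH.
    echo "none"
    return 1
}

# Example:
#   echo "Downloads would use: $(pick_downloader)"
```

Running it before a long build is a quick way to see whether rsync is actually on the PATH.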

@waywardsyintist
Author

Hello,

I re-installed Centrifuge and installed rsync.

When trying to make the p+v index, I got the following error...

jrussellmac:indices jrussell$ make THREADS=4 p+v DONT_DUSTMASK=1
Making: p+v: p+v
/Library/Developer/CommandLineTools/usr/bin/make -f Makefile IDX_NAME=p+v
[[ -d tmp_p+v ]] && rm -rf tmp_p+v; mkdir -p tmp_p+v
Downloading and dust-masking archaea
centrifuge-download -o tmp_p+v -d "archaea" -P 4 refseq > tmp_p+v/all-archaea.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt ...
rsync: failed to connect to ftp.ncbi.nlm.nih.gov: No route to host (65)
rsync error: error in socket IO (code 10) at clientserver.c(122) [Receiver=3.0.7]
rsync Download failed! Have a look at valid domains at ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq .
make[1]: *** [reference-sequences/all-archaea.fna] Error 1
make: *** [p+v] Error 2

Also tried 'make THREADS=4 v'. Error is below...

jrussellmac:indices jrussell$ make THREADS=4 v DONT_DUSTMASK=1
Making: v: v
/Library/Developer/CommandLineTools/usr/bin/make -f Makefile IDX_NAME=v
[[ -d tmp_v ]] && rm -rf tmp_v; mkdir -p tmp_v
Downloading and dust-masking viral-any_level
centrifuge-download -o tmp_v -d "viral-any_level" -P 4 refseq > tmp_v/all-viral-any_level.map
viral-any_level is not a valid domain - use one of the following:
archaea
bacteria
fungi
invertebrate
plant
protozoa
unknown
vertebrate_mammalian
vertebrate_other
viral
make[1]: *** [reference-sequences/all-viral-any_level.fna] Error 1
make: *** [v] Error 2

It seems like centrifuge-download doesn't accept the domain names used in the Makefile. I tried changing the names a bit but got nowhere.

Any insight much appreciated.

Thanks.

@waywardsyintist
Author

waywardsyintist commented Feb 24, 2017 via email
