GenBank genomes not found #235
Comments
Thank you for reporting this; this problem has come up a lot. Essentially, the sourmash databases (including GTDB) contain some accessions that have since been suppressed by GenBank, so they are no longer available for download.
One way to access the genome is to download the GTDB database and place the genome in the folder where charcoal/snakemake expects to find it (you might need to add the flag ...).
This issue documents a similar problem in a different pipeline, and the solution there was to use a picklist to remove the suppressed accession from the database: dib-lab/2022-dominating-set-differential-abundance-example#8 (comment)
It might be appropriate for sourmash to prepare a database that removes these genomes from the GTDB representation and make it the default for pipelines that have a genome download step, but I'm not sure that is the right move.
One admittedly hacky approach that comes to mind is to delete that line from the charcoal results before running subsequent steps, but that would change the results of your contamination analysis. I'm sorry I don't have a better solution at the moment!
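As a rough illustration of the picklist idea mentioned above, here is a minimal Python sketch that writes the suppressed accessions to a one-column CSV. The function name `write_picklist`, the output filename, and the `ident` column header are all hypothetical choices for this example. How the file is then passed to sourmash (and whether the accession strings need to match the database identifiers exactly) should be checked against the sourmash picklist documentation rather than taken from this sketch.

```python
import csv


def write_picklist(accessions, path, column="ident"):
    """Write unique, non-empty accessions to a one-column CSV.

    The column name "ident" is an assumption for this sketch, not
    something confirmed in this thread.
    Returns the number of accessions written.
    """
    # Strip whitespace, drop blanks, and de-duplicate in a stable order.
    unique = sorted({acc.strip() for acc in accessions if acc.strip()})
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])  # header row
        for acc in unique:
            writer.writerow([acc])
    return len(unique)


if __name__ == "__main__":
    # GCA_011046675.1 is the suppressed accession reported in this issue.
    count = write_picklist(["GCA_011046675.1"], "suppressed-picklist.csv")
    print(f"wrote {count} accession(s)")
```

The resulting CSV could then be used as an exclusion list when building or filtering the database, assuming the installed sourmash version supports excluding by picklist.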
Yep, we are investigating a general solution for this around picklists, but I need to find a few hours of mental space to do that :). Note also that the same problem is cropping up in genome-grist: dib-lab/genome-grist#277
Hello!
Thanks a lot for making this tool available. I ran the demo and analyzed the sample dataset with no issues, but when testing charcoal on my own dataset I am running into errors that seem to be caused by some genomes not downloading from GenBank correctly.
I am running this command:
python -m charcoal run zebrafish-test.conf -j 16
It fails with this error message:
This appears to be caused by a file not downloading from GenBank:
I checked the genbank_genomes folder; it contained some genome files, but this accession (GCF_002943105) was not there. I manually downloaded this file from GenBank and reran the snakemake command. It failed twice more (on GCA_000798955.1_genomic.fna.gz and GCF_000820225.1_genomic.fna.gz), so I manually downloaded those files as well and reran the snakemake command. The workflow then failed on genome GCA_011046675.1, which has been suppressed in GenBank and isn't available (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/046/675/GCA_011046675.1_ASM1104667v1/assembly_status.txt).
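For anyone who wants to check other accessions by hand, the directory layout visible in the URL above (accession prefix, then the nine digits split into groups of three) can be sketched in Python. The function name `accession_parent_url` is hypothetical, and the sketch only builds the parent directory: the final folder also includes the assembly name (for example `_ASM1104667v1`), which cannot be derived from the accession alone and has to be looked up.

```python
import re

# Root of the NCBI genomes tree, taken from the URL quoted in this issue.
NCBI_GENOMES_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def accession_parent_url(accession: str) -> str:
    """Return the parent directory URL for a GCA_/GCF_ accession.

    Accepts accessions with or without a version suffix, e.g.
    "GCA_011046675.1" or "GCF_002943105".
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})(?:\.\d+)?", accession.strip())
    if match is None:
        raise ValueError(f"not a GCA_/GCF_ accession: {accession!r}")
    prefix, digits = match.groups()
    # Split the nine digits into three groups of three: 011/046/675.
    chunks = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([NCBI_GENOMES_ROOT, prefix, *chunks]) + "/"


if __name__ == "__main__":
    print(accession_parent_url("GCA_011046675.1"))
    # prints https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/046/675/
```

Browsing the printed directory shows the versioned assembly folders, each of which should contain an assembly_status.txt like the one linked above.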
This is what the error looked like for the suppressed genome:
I tried to run charcoal on a small subset of my genomes (the ones that went through with no errors during this initial test) and that completed without errors and a report was generated successfully.