GenBank genomes not found #235

Open
tgurbich opened this issue Feb 16, 2023 · 2 comments


@tgurbich

Hello!

Thanks a lot for making this tool available. I ran the demo and analyzed the sample dataset with no issues, but when testing charcoal on my own dataset I am running into errors that seem to be caused by some genomes not downloading from GenBank correctly.

I am running this command:
python -m charcoal run zebrafish-test.conf -j 16

It fails with this error message:

Error in snakemake invocation: Command '['snakemake', '-s', 
'/users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile',  '--use-conda', 
'-j', '1', '-j', '16', '--configfile', '/users/tg/Misc/Tool_testing/charcoal/charcoal/conf/defaults.conf',
 '/users/tg/Misc/Tool_testing/charcoal/charcoal/conf/system.conf', 
'zebrafish-test.conf']' returned non-zero exit status 1.

This appears to be caused by a file not downloading from GenBank:

ERROR, skch::validateInputFile, Could not open genbank_genomes/GCF_002943105.1_genomic.fna.gz
[Thu Feb 16 11:58:47 2023]
Error in rule mashmap_compare:
    jobid: 1373
    output: output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.align, 
output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.out
    conda-env: /users/tg/Misc/Tool_testing/charcoal/.snakemake/conda/d01f2d1356a2c223e7b61208c452d8a0
    shell:
mashmap -q zebrafish-genomes/MGYG000299400.fna -r 
genbank_genomes/GCF_002943105.1_genomic.fna.gz -o 
output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.align   
--pi 95 > output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.out

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
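
Since each rerun can stop on a different missing reference genome, it may help to collect every accession named in a "Could not open" message from a saved log in one pass. The following is a minimal sketch, not part of charcoal: the function name `missing_accessions` is hypothetical, and the pattern assumes the `<accession>_genomic.fna.gz` file naming seen in the error above.

```python
import re

# Matches the reference path in messages such as
# "Could not open genbank_genomes/GCF_002943105.1_genomic.fna.gz".
# Assumption: files are named <accession>_genomic.fna.gz, as in the log above.
MISSING_RE = re.compile(r"Could not open \S*?(GC[AF]_\d+\.\d+)_genomic\.fna\.gz")


def missing_accessions(log_text):
    """Return accessions named in 'Could not open' errors, in first-seen order."""
    seen = {}
    for match in MISSING_RE.finditer(log_text):
        # A dict keeps insertion order and drops repeated accessions.
        seen.setdefault(match.group(1), None)
    return list(seen)


log = (
    "ERROR, skch::validateInputFile, Could not open "
    "genbank_genomes/GCF_002943105.1_genomic.fna.gz\n"
    "Error in rule mashmap_compare:\n"
)
print(missing_accessions(log))
```

This only reports genomes that a job actually tried to open, so accessions whose jobs never ran will not show up in the list.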

I checked the genbank_genomes folder: it did contain some genome files, but this accession (GCF_002943105) was not there.
I manually downloaded this file from GenBank and reran the snakemake command. It failed twice more (on GCA_000798955.1_genomic.fna.gz and GCF_000820225.1_genomic.fna.gz), so I manually downloaded those files as well and reran the command. The workflow then failed on genome GCA_011046675.1, which has been suppressed in GenBank and isn't available (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/046/675/GCA_011046675.1_ASM1104667v1/assembly_status.txt).
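
For anyone who wants to look up the status of other accessions by hand, the directory layout visible in the URL above can be derived from the accession itself. This is a small sketch under that assumption: `assembly_parent_url` is a hypothetical helper, and it only builds the parent directory. The final directory name also contains the assembly name (here `ASM1104667v1`), which cannot be derived from the accession alone, so the parent listing has to be browsed to find `assembly_status.txt`.

```python
import re

NCBI_BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_parent_url(accession):
    """Build the parent directory URL for an accession like GCA_011046675.1.

    Assumes the layout seen in the URL above: the nine digits of the
    accession are split into three groups of three.
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, digits = match.groups()
    triplets = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([NCBI_BASE, prefix, *triplets]) + "/"


print(assembly_parent_url("GCA_011046675.1"))
```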

This is what the error looked like for the suppressed genome:

Error in rule download_matching_genomes_one_by_one:
    jobid: 0
    output: genbank_genomes/GCA_011046675.1_genomic.fna.gz

RuleException:
HTTPError in line 465 of /users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile:
HTTP Error 404: Not Found
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2357, in run_wrapper
  File "/users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile", line 465, in __rule_download_matching_genomes_one_by_one
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 214, in urlopen
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 523, in open
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 632, in http_response
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 561, in error
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 494, in _call_chain
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 641, in http_error_default
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 574, in _callback
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 560, in cached_or_run
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2390, in run_wrapper
Exiting because a job execution failed. Look above for error message

I tried to run charcoal on a small subset of my genomes (the ones that went through with no errors during this initial test) and that completed without errors and a report was generated successfully.

@taylorreiter
Member

Thank you for reporting this; this problem has come up a lot. Essentially, the sourmash databases (including GTDB) contain some accessions that have since been suppressed by GenBank, so they are no longer available for download. One way to access the genome is to download the GTDB database and place the genome in the folder where charcoal/snakemake expects to see it (you might need to add the flag --rerun-triggers mtime so it won't be erased). This is cumbersome, though.

This issue documents a similar problem in a different pipeline, and the solution there was to use a picklist to remove the suppressed accession from the database: dib-lab/2022-dominating-set-differential-abundance-example#8 (comment)
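
As a rough illustration of the picklist idea (not the exact fix from the linked issue, which is not reproduced here), the suppressed accessions can be written to a one-column CSV. The helper name `exclude_picklist_csv` and the column name `ident` are assumptions for this sketch. As far as I recall from the sourmash documentation, such a file can be passed as `--picklist suppressed.csv:ident:ident:exclude` in recent sourmash versions; please check that against the version you have installed, and note that how to hand it to charcoal itself is not covered here.

```python
import csv
import io


def exclude_picklist_csv(accessions, column="ident"):
    """Return CSV text with one header row and one accession per row.

    Duplicates are dropped while keeping the original order.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([column])
    for acc in dict.fromkeys(accessions):
        writer.writerow([acc])
    return buf.getvalue()


# Accession taken from the report above; save the text to e.g. suppressed.csv.
print(exclude_picklist_csv(["GCA_011046675.1"]), end="")
```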

It might be appropriate for sourmash to prepare a database that removes these genomes from the GTDB representation and to make it the default for pipelines that have a genome download step. I'm not sure if that is the right move, though.

One admittedly hacky approach that comes to mind is to delete that line from the charcoal results before running subsequent steps, but that would change the results of your contamination analysis.

I'm sorry I don't have a better solution at the moment!

@ctb
Member

ctb commented Apr 20, 2023

yep, we are investigating a general solution for this around picklists, but I've got to get a few hours of mental space together to do that :).

note also same problem cropping up in genome-grist: dib-lab/genome-grist#277
