GenBank genomes not found #235

tgurbich · 2023-02-16T16:58:47Z

Hello!

Thanks a lot for making this tool available. I ran the demo and analyzed the sample dataset with no issues, but when testing charcoal on my own dataset I am running into errors that seem to be caused by some genomes not downloading from GenBank correctly.

I am running this command:
python -m charcoal run zebrafish-test.conf -j 16

It fails with this error message:

Error in snakemake invocation: Command '['snakemake', '-s', 
'/users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile',  '--use-conda', 
'-j', '1', '-j', '16', '--configfile', '/users/tg/Misc/Tool_testing/charcoal/charcoal/conf/defaults.conf',
 '/users/tg/Misc/Tool_testing/charcoal/charcoal/conf/system.conf', 
'zebrafish-test.conf']' returned non-zero exit status 1.

Which appears to be caused by a file not downloading from GenBank:

ERROR, skch::validateInputFile, Could not open genbank_genomes/GCF_002943105.1_genomic.fna.gz
[Thu Feb 16 11:58:47 2023]
Error in rule mashmap_compare:
    jobid: 1373
    output: output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.align, 
output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.out
    conda-env: /users/tg/Misc/Tool_testing/charcoal/.snakemake/conda/d01f2d1356a2c223e7b61208c452d8a0
    shell:
mashmap -q zebrafish-genomes/MGYG000299400.fna -r 
genbank_genomes/GCF_002943105.1_genomic.fna.gz -o 
output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.align   
--pi 95 > output.zebrafish-test/stage2/MGYG000299400.fna.x.GCF_002943105.1.mashmap.out

(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I checked the genbank_genomes folder: it did contain some genome files, but this accession (GCF_002943105) was not there.
I manually downloaded this file from GenBank and reran the snakemake command. It failed twice more (on GCA_000798955.1_genomic.fna.gz and GCF_000820225.1_genomic.fna.gz); I manually downloaded those files as well and reran the command. The workflow then failed on genome GCA_011046675.1, which has been suppressed in GenBank and isn't available (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/046/675/GCA_011046675.1_ASM1104667v1/assembly_status.txt).
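
For anyone checking accessions by hand: the directory part of the URL above can be derived from the accession itself. This is a minimal sketch that assumes the layout visible in that URL (prefix, then the nine digits split into three groups). The function name `assembly_dir_url` is hypothetical and is not part of charcoal. The final path component (here `GCA_011046675.1_ASM1104667v1`) includes the assembly name, which cannot be derived from the accession alone, so the sketch stops at the parent directory.

```python
import re

NCBI_GENOMES_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_dir_url(accession: str) -> str:
    """Return the NCBI parent directory for a versioned assembly accession.

    Hypothetical helper; assumes accessions look like GCA_011046675.1.
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if match is None:
        raise ValueError(f"not a versioned assembly accession: {accession!r}")
    prefix, digits = match.groups()
    # Split the nine digits into three directory levels, e.g. 011/046/675.
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([NCBI_GENOMES_ROOT, prefix, *parts]) + "/"


print(assembly_dir_url("GCA_011046675.1"))
```

Browsing the resulting directory shows the per-assembly subfolder, where a file such as assembly_status.txt may indicate whether the record has been suppressed.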

This is what the error looked like for the suppressed genome:

Error in rule download_matching_genomes_one_by_one:
    jobid: 0
    output: genbank_genomes/GCA_011046675.1_genomic.fna.gz

RuleException:
HTTPError in line 465 of /users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile:
HTTP Error 404: Not Found
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2357, in run_wrapper
  File "/users/tg/Misc/Tool_testing/charcoal/charcoal/Snakefile", line 465, in __rule_download_matching_genomes_one_by_one
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 214, in urlopen
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 523, in open
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 632, in http_response
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 561, in error
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 494, in _call_chain
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/urllib/request.py", line 641, in http_error_default
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 574, in _callback
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/concurrent/futures/thread.py", line 58, in run
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 560, in cached_or_run
  File "/software/miniconda_py39/envs/charcoal/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2390, in run_wrapper
Exiting because a job execution failed. Look above for error message
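
The traceback above shows `urllib.request.urlopen` raising an unhandled HTTP 404. As an illustration only, a download step could treat a 404 as "genome unavailable" rather than crashing. The sketch below is not charcoal's actual rule: `fetch_or_none` and its `opener` parameter are hypothetical names, and the `opener` argument exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the server answers 404.

    Hypothetical helper; any other HTTP error is re-raised unchanged.
    """
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # The file does not exist on the server (for example, a suppressed record).
            return None
        raise
```

What to do with a `None` result is a separate decision: Snakemake would still expect the rule's declared output file, so the workflow would need some way to skip or drop that accession downstream.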

I tried to run charcoal on a small subset of my genomes (the ones that went through with no errors during this initial test) and that completed without errors and a report was generated successfully.
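
A quick way to see which reference genomes are still absent before rerunning is to compare a list of accessions against the files in genbank_genomes. This sketch assumes the `<accession>_genomic.fna.gz` naming seen in the error messages above; `missing_genomes` is a hypothetical helper, and the accession list has to come from wherever you collected it.

```python
from pathlib import Path


def missing_genomes(accessions, genome_dir="genbank_genomes"):
    """Return the accessions with no '<accession>_genomic.fna.gz' file in genome_dir.

    Hypothetical helper; only checks that the file exists, not that it is complete.
    """
    genome_dir = Path(genome_dir)
    return [
        acc for acc in accessions
        if not (genome_dir / f"{acc}_genomic.fna.gz").is_file()
    ]
```

For example, `missing_genomes(["GCF_002943105.1", "GCA_011046675.1"])` returns whichever of the two has no matching file in the folder.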

taylorreiter · 2023-04-20T14:14:02Z

Thank you for reporting this; this problem has come up a lot. Essentially, the sourmash databases (including GTDB) contain some accessions that have been suppressed by GenBank, so they are no longer available for download. One way to access the genome is to download the GTDB database and place the genome in the folder where charcoal/snakemake expects to find it (you might need to add the flag --rerun-triggers mtime so it won't be erased). This is cumbersome, though.

A similar problem in a different pipeline is documented in dib-lab/2022-dominating-set-differential-abundance-example#8 (comment); the solution there was to use a picklist to remove the suppressed accession from the database.
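
As a rough illustration of the picklist idea, the sketch below writes a one-column CSV of suppressed accessions. The helper name `write_exclusion_picklist`, the file name, and the `ident` column name are assumptions made for this example; it does not reproduce the exact fix from the linked comment.

```python
import csv


def write_exclusion_picklist(accessions, path, column="ident"):
    """Write accessions to a one-column CSV usable as a picklist file.

    Hypothetical helper; the column name is an assumption, not a requirement.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])  # header row naming the column
        for acc in accessions:
            writer.writerow([acc])


write_exclusion_picklist(["GCA_011046675.1"], "suppressed.csv")
```

Assuming a sourmash release whose picklists support an exclude pickstyle, a file like this could be passed as `--picklist suppressed.csv:ident:ident:exclude`; the sourmash documentation for your version is the place to confirm the exact syntax.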

It might be appropriate for sourmash to prepare a database that removes these genomes from the GTDB representation and makes this the default for pipelines that have a genome download step...I'm not sure if that is the right move though.

One admittedly hacky approach that comes to mind is that you could delete that line from the charcoal results before running subsequent steps...but that would change the results of your contamination analysis.

I'm sorry I don't have a better solution at the moment!

ctb · 2023-04-20T14:26:11Z

yep, we are investigating a general solution for this around picklists, but I've got to get a few hours of mental space together to do that :).

note also same problem cropping up in genome-grist: dib-lab/genome-grist#277

This was referenced Apr 20, 2023

add check to ignore genome(s) that cannot be up downloaded dib-lab/genome-grist#277

Open

01_perform_dda.snakefile shuts down without error message on rule download_query_genome dib-lab/2022-dominating-set-differential-abundance-example#8

Open
