Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error with blobtools_create step #213

Open
aureliendejode opened this issue May 1, 2024 · 13 comments
Open

Error with blobtools_create step #213

aureliendejode opened this issue May 1, 2024 · 13 comments

Comments

@aureliendejode
Copy link

Hello,

I have encountered the followinf error:
more SHADDVT3_asm_bp_hap1_p_ctg/run_blobtools_create.log

Reading all TSV files in ../window_stats
Traceback (most recent call last):
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/bin/blobtools", line 8, in
sys.exit(cli())
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/blobtools.py", line 105, in cli
sys.exit(subcommand())
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 203, in cli
main(args)
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 149, in main
parsed = field["module"].parse(
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/busco.py", line 82, in parse
busco = parse_busco(file, identifiers=kwargs["dependencies"]["identifiers"])
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/busco.py", line 63, in parse_busco
raise UserWarning(
UserWarning: Contig names in the Busco file did not match dataset identifiers.

I saw people having similar issue (e.g. #2) in the past but I did not see the fix for it.
In my case it looks like everything works fine for busco metazoa and eukaryota but for archaea and bacteria busco removed "_path" from the contigs id and I think that is what is causing the error.

zcat SHADDVT3_asm_bp_hap1_p_ctg.busco.bacteria_odb10/full_table.tsv.gz | head

BUSCO version is: 5.6.1
The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124)
Busco id Status Sequence Gene Start Gene End Strand Score Length OrthoDB url Description
4421at2 Missing
9601at2 Missing
26038at2 Fragmented h1tg000560l 68481 70196 + 151.4 265 https://www.orthodb.org/v10?query=26038at2 phosphoribosylformylglycinamidine synthase
91428at2 Missing
95696at2 Missing
143460at2 Missing
182107at2 Missing

Is there a fix to this or is it something that has to be done manually ?

Thanks for your help

@rjchallis
Copy link
Contributor

Hi, It does look like a similar issue. That one should have been fixed in this commit that parsed the sequence IDs differently. Could you share the fasta header for the config that is in the BUSCO results so I can take a look at what needs to be done to get them to match this time

@aureliendejode
Copy link
Author

Hi,
Here are the headers for the fasta file (the assembly), they all have the following format:

h1tg000309l_path
h1tg000310l_path
h1tg000311l_path
h1tg000312l_path
h1tg000313l_path
h1tg000314l_path

They are contigs output from hifiasm assembly.

Then in the busco metazoa and eukaryota the contigs id are the same as in the fasta file:

270107at2759 Complete h1tg000039l_path 1923850 1928544 + 617.0 599
271586at2759 Complete h1tg000019l_path 3448379 3458776 + 297.6 389
290630at2759 Complete h1tg000004l_path 2354527 2344837 - 543.9 461
293315at2759 Complete h1tg000021l_path 1983882 1947754 - 509.3 559
296129at2759 Complete h1tg000005l_path 2926981 2938826 + 954.7 708
299766at2759 Complete h1tg000020l_path 9862506 9853025 - 495.8 436
306227at2759 Complete h1tg000025l_path 9236278 9247485 + 542.8 459

For the archaea and bacteria the "_path" has been removed:
26038at2 Fragmented h1tg000560l 68481 70196 + 151.4 265 https://www.orthodb.org/v10?query=26038at2 phosphoribosylformylglycinamidine synthase
91428at2 Missing
95696at2 Missing
143460at2 Missing
182107at2 Missing
219876at2 Missing
223233at2 Fragmented h1tg000118l 1228298 1230874 + 128.3 383 https://www.orthodb.org/v10?query=223233at2 DNA topoisomerase I

Hope that answer your question, let me know if you need more info.

Cheers
Aurélien

@aureliendejode
Copy link
Author

Hi,
Have you found a fix for this ?
Best
Aurélien

@rjchallis
Copy link
Contributor

Looking at this more closely, it not straightforward to fix this in the BlobToolKit import code - the previous issue was with extra characters being added to the sequence IDs, but here they are removed so it is not as easy to put them back.

I think the best way would be a manual workaround either changing the sequence IDs in the assembly or editing the BUSCO file before import. To get the pipeline to run happily, you could also skip the bacteria and archaea lineages from the set of BUSCOs to run as these are usually included for use when exploring the results rather than required for the pipeline to run.

@aureliendejode
Copy link
Author

Thanks for this. I'll just skip the bacteria and archaea lineages for now.
Best
Aurélien

@aureliendejode
Copy link
Author

aureliendejode commented May 13, 2024

Hi,
I removed the bacteria and archaea lineages but then I ran into an error at the diamond_blastp step.
I think it is because I used "taxrule: blastp=buscogenes" for the tax rule.
How can I solve this issue. Should change the taxrule for the diamond_blastp step ?
Here is my config file:

assembly:
  accession: SHADDVT3
  file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/SHADDVT3_asm_bp_hap1_p_ctg.fasta.gz
  level: contig
  prefix: SHADDVT3_asm_bp_hap1_p_ctg
  span: 459760080
busco:
  download_dir: /share/apps/genetic-databases/busco 
  basal_lineages:
    - eukaryota_odb10
    - bacteria_odb10
    - archaea_odb10
  lineages:
   - metazoa_odb10
   - eukaryota_odb10
reads:
  single:
   - file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/XBTUA20231220R84050PL40740011D01bc2031bc2031_hifireads.fastq.gz
     prefix: XBTUA20231220R84050PL40740011D01bc2031bc2031
     platform: PACBIO_SMRT
settings:
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_min_length: 1000
  blast_overlap: 0
  stats_chunk: 1000
  stats_windows:
  - 0.1
  - 0.01
  - 100000
  - 1000000
  taxdump: /share/apps/genetic-databases/taxdump
  tmp: /tmp
similarity:
  blastn:
    name: nt
    path: /share/apps/genetic-databases/nt
  defaults:
    evalue: 1.0e-10
    import_evalue: 1.0e-25
    max_target_seqs: 10
    taxrule: bestsumorder
  diamond_blastp:
    import_max_target_seqs: 100000
    name: reference_proteomes
    path: /share/apps/genetic-databases/uniprot
    taxrule: blastp=buscogenes
  diamond_blast:
    name: reference_proteomes
    path: /share/apps/genetic-databases/uniprot
taxon:
  name: Stichodactyla haddoni

version: 2

@rjchallis
Copy link
Contributor

I think you may just need to remove bacteria and archaea from the basal_lineages section of the config as well

@aureliendejode
Copy link
Author

Yes, it worked but I ran into another issue at the run_blobtools_create:

Reading all TSV files in ../window_stats
Loading parsed taxdump
Traceback (most recent call last):
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/bin/blobtools", line 8, in
sys.exit(cli())
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/blobtools.py", line 105, in cli
sys.exit(subcommand())
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 203, in cli
main(args)
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/add.py", line 149, in main
parsed = field["module"].parse(
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/hits.py", line 541, in parse
blast = parse_blast(
File "/share/apps/miniconda3/py311_23.5.2-0/installed/envs/snakemake7/lib/python3.9/site-packages/blobtools/lib/hits.py", line 54, in parse_blast
seq_id, start, end = re.split(r"[:-]", parts[0])
ValueError: too many values to unpack (expected 3)

I have redownloaeded the newtaxdump to sure that this was not the issue.
Here is what I have in the taxdump folder:

-rw-r--r-- 1 adejode bmtitus 4,9K 14 mai 08:31 gencode.dmp
-rw-r--r-- 1 adejode bmtitus 452 14 mai 08:31 division.dmp
-rw-r--r-- 1 adejode bmtitus 3,0K 14 mai 08:37 typeoftype.dmp
-rw-r--r-- 1 adejode bmtitus 292M 14 mai 08:37 taxidlineage.dmp
-rw-r--r-- 1 adejode bmtitus 321M 14 mai 08:37 rankedlineage.dmp
-rw-r--r-- 1 adejode bmtitus 1,4M 14 mai 08:37 merged.dmp
-rw-r--r-- 1 adejode bmtitus 3,8M 14 mai 08:37 images.dmp
-rw-r--r-- 1 adejode bmtitus 691M 14 mai 08:37 fullnamelineage.dmp
-rw-r--r-- 1 adejode bmtitus 4,5M 14 mai 08:37 delnodes.dmp
-rw-r--r-- 1 adejode bmtitus 232M 14 mai 08:39 nodes.dmp
-rw-r--r-- 1 adejode bmtitus 217M 14 mai 08:40 names.dmp
-rw-r--r-- 1 adejode bmtitus 19M 14 mai 08:40 citations.dmp
-rw-r--r-- 1 adejode bmtitus 26M 14 mai 08:40 typematerial.dmp
-rw-r--r-- 1 adejode bmtitus 5,6M 14 mai 08:40 host.dmp
-rw-r--r-- 1 adejode bmtitus 44K 14 mai 08:40 excludedfromtype.dmp
-rw-r--r-- 1 adejode bmtitus 130M 14 mai 09:20 new_taxdump.tar.gz
-rw-r--r-- 1 adejode bmtitus 53 14 mai 09:20 new_taxdump.tar.gz.md5
-rw-r--r-- 1 adejode bmtitus 614M 14 mai 09:25 taxdump.json

It looks like some parsing issue for one of the file but I could not figure out which one.

Any idea ?

Here is my config file:

assembly:
  accession: SHADDVT3
  file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/SHADDVT3_asm_bp_hap1_p_ctg.fasta.gz
  level: contig
  prefix: SHADDVT3_asm_bp_hap1_p_ctg
  span: 459760080
busco:
  download_dir: /share/apps/genetic-databases/busco 
  basal_lineages:
   - eukaryota_odb10
  lineages:
   - metazoa_odb10
   - eukaryota_odb10
reads:
  single:
   - file: /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Stichodactyla_haddonii/blobtoolkit/XBTUA20231220R84050PL40740011D01bc2031bc2031_hifireads.fastq.gz
     prefix: XBTUA20231220R84050PL40740011D01bc2031bc2031
     platform: PACBIO_SMRT
settings:
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_min_length: 1000
  blast_overlap: 0
  stats_chunk: 1000
  stats_windows:
  - 0.1
  - 0.01
  - 100000
  - 1000000
  taxdump: /grps2/bmtitus/databases/taxdump
  tmp: /tmp
similarity:
  blastn:
    name: nt
    path: /share/apps/genetic-databases/nt
  defaults:
    evalue: 1.0e-10
    import_evalue: 1.0e-25
    max_target_seqs: 10
    taxrule: bestsumorder
  diamond_blastp:
    import_max_target_seqs: 100000
    name: reference_proteomes
    path: /share/apps/genetic-databases/uniprot
    taxrule: blastp=buscogenes
  diamond_blastx:
    name: reference_proteomes
    path: /share/apps/genetic-databases/uniprot
taxon:
  name: Stichodactyla haddoni

version: 2

@aureliendejode
Copy link
Author

Hello,

Does anyone have an idea about this last issue ?

seq_id, start, end = re.split(r"[:-]", parts[0])
ValueError: too many values to unpack (expected 3)

@rjchallis
Copy link
Contributor

This step parses the BUSCO locations from the diamond blast results file so it expects the sequence IDs to look like

seq1:start-end=busco_id=status

From the error, I guess you may have : or - characters in your seq IDs. Unfortunately the import won't work if this is the case, but you could remove the :start-end=busco_id=status and import it using --taxrule buscogenes (removing the blastp=) or skip this taxrule and import only the diamond blastx results.

@aureliendejode
Copy link
Author

My diamond out put looks like this:
head SHADDVT3_asm_bp_hap1_p_ctg.diamond.reference_proteomes.out
h1tg000621l_path 317549 790 h1tg000621l_path tr|A0A7D9DP43|A0A7D9DP43_PARCT 38.1 1138 686 13 7388 3981 480 1601 3.92e-240 790
h1tg000621l_path 317549 648 h1tg000621l_path tr|A0A7D9DC04|A0A7D9DC04_PARCT 35.4 1085 668 14 7364 4140 228 1289 7.77e-195 648
h1tg000621l_path 317549 638 h1tg000621l_path tr|A0A6S7HVJ1|A0A6S7HVJ1_PARCT 35.5 1179 684 27 7388 3993 2 1151 2.10e-193 638
h1tg000621l_path 317549 633 h1tg000621l_path tr|A0A6S7I3Z1|A0A6S7I3Z1_PARCT 34.3 1126 650 18 7469 4140 4 1055 1.69e-192 633

My diamond_blastp results look like this:
head SHADDVT3_asm_bp_hap1_p_ctg.diamond.busco_genes.out
1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 6105 724 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A0A6P8II84|A0A6P8II84_ACTTE 85.9 453 55 1 1 444 15 467 1.93e-259 724
1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 2652724 696 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A0A913XFY4|A0A913XFY4_EXADI 84.0 443 61 2 1 433 15 457 8.55e-249 696
1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single 45351 615 1001705at2759_1720309_0:002e06|h1tg000030l_path:4705163-4709960=1001705at2759=single tr|A7RYD1|A7RYD1_NEMVE 80.3 401 70 1 1 392 10 410 8.34e-219 615

and my blastn results look like this:
h1tg000625l_path 6106 281 h1tg000625l_path MN307079.1 83.121 314 48 3 6387 6699 7210 6901 5.34e-69
h1tg000625l_path 6105 239 h1tg000625l_path XM_031699469.1 80.000 365 40 15 6387 6738 2210 1866 3.26e-56
h1tg000625l_path 6105 239 h1tg000625l_path XM_031699463.1 80.000 365 40 14 6387 6738 3262 3606 3.26e-56
h1tg000625l_path 1789172 228 h1tg000625l_path OU974084.1 78.512 363 67 6 6386 6741 1087799 1088157 7.05e-53
h1tg000625l_path 1789172 209 h1tg000625l_path OU974083.1 79.048 315 59 6 6385 6698 10137538 10137230 2.55e-47

So I guess the issue comes from the diamon_blastp output file.
I'll try to skip the blastp then.

I am wondering is there a way to not have this issue without having to manually edit the files ?

@rjchallis
Copy link
Contributor

I think I can update the code to split only on the suffix that the pipeline adds so seq IDs with these characters can still be imported - should be able to get a release out with this fixed by the end of next week.

@rjchallis
Copy link
Contributor

Looks like I need to trace this back a bit further, could you share the first few lines from the eukaryota busco full_table.tsv file and one of the headers from a eukaryota_odb10/busco_sequences/single_copy_busco_sequences/*.faa file as it looks like the extra characters are being added by BUSCO and included in the headers when I extract the sequences before the diamond blastp step

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants