BUSCO gene hits #194

Open
TCHeaven opened this issue Oct 5, 2023 · 1 comment


TCHeaven commented Oct 5, 2023

I have been trying to provide hits to blobtools from a blastx search of BUSCO gene regions against UniProt. However, all contigs are being assigned to no-hit, even though the input file contains taxid hit information:

`ptg016977l:1-7766 121845 223.4 ptg016977l:1-7766 tr|A0A3Q0J621|A0A3Q0J621_DIACI 93.3 119 8 0 5149 4793 25 143 4.1e-54 223.4

ptg016977l:1-7766 121845 161.4 ptg016977l:1-7766 tr|A0A3Q0J621|A0A3Q0J621_DIACI 85.4 82 12 0 1757 1512 179 260 1.9e-35 161.4

ptg004722l:33979-35044 28743 468.8 ptg004722l:33979-35044 tr|A0A3Q2GQF1|A0A3Q2GQF1_CYPVA 65.5 354 106 1 1 1062 8 345 7.7e-129 468.8

ptg008147l:570-8350 7740 451.1 ptg008147l:570-8350 tr|A0A8K0EU27|A0A8K0EU27_BRALA 42.7 536 294 5 3707 5278 579 1113 1.2e-122 451.1
`

A BUSCOgenes taxrule was mentioned in the recent workshop, but I can't find any reference to it. I am wondering whether I need to remove the text following the ':' characters in columns 1 and 4. However, the test data provided in the recent workshop also had extra text after the ':' following the contig names, and it provided hits to the plot without issue:

`ptg000043l:272955-274207=1275837at2759=single 1903189 553 ptg000043l:272955-274207=1275837at2759=single tr|A0A8H3G8J7|A0A8H3G8J7_9LECA 93.4 316 15 3 1 315 1 311 2.01e-196 553

ptg000043l:272955-274207=1275837at2759=single 560253 553 ptg000043l:272955-274207=1275837at2759=single tr|A0A8H6CAM3|A0A8H6CAM3_9LECA 93.4 316 15 3 1 315 1 311 2.85e-196 553

ptg000043l:272955-274207=1275837at2759=single 112416 553 ptg000043l:272955-274207=1275837at2759=single tr|A0A8H6G095|A0A8H6G095_9LECA 93.4 316 15 3 1 315 1 311 2.85e-196 553

ptg000043l:272955-274207=1275837at2759=single 172621 549 ptg000043l:272955-274207=1275837at2759=single tr|A0A8H3FIP3|A0A8H3FIP3_9LECA 93.0 316 16 3 1 315 1 311 1.45e-194 549`

@rjchallis
Contributor

When blobtools sees the format in the workshop example, it should automatically recognise it as the output of the diamond blastp step from the blobtools pipeline, treat it as a blastp file, and parse the details in the sequence_id columns accordingly. In this case, the tax rule provided really only changes the name of the output fields, so `buscogenes` acts as a label: setting `--taxrule buscogenes` is the same as explicitly setting the blastp tax rule and giving it an alternate name with `--taxrule blastp=buscogenes`.
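To make the difference between the two ID formats concrete, here is a minimal Python sketch that splits a sequence ID of the form `contig:start-end=busco_id=status`. It is only an illustration of the format seen in the workshop example, not the parser blobtools itself uses; the names `BuscoSeqId` and `parse_seq_id` are hypothetical.

```python
from typing import NamedTuple, Optional


class BuscoSeqId(NamedTuple):
    """Hypothetical container for the parts of a pipeline-style sequence ID."""
    contig: str
    start: int
    end: int
    busco_id: Optional[str]
    status: Optional[str]


def parse_seq_id(seq_id: str) -> BuscoSeqId:
    """Split 'contig:start-end[=busco_id[=status]]' into its parts.

    Assumes the ID always contains a ':start-end' region; the BUSCO parts
    are optional and come back as None when absent.
    """
    region, *extra = seq_id.split("=")
    # rpartition keeps any earlier ':' characters inside the contig name
    contig, _, coords = region.rpartition(":")
    start, _, end = coords.partition("-")
    busco_id = extra[0] if len(extra) > 0 else None
    status = extra[1] if len(extra) > 1 else None
    return BuscoSeqId(contig, int(start), int(end), busco_id, status)


workshop = parse_seq_id("ptg000043l:272955-274207=1275837at2759=single")
reported = parse_seq_id("ptg016977l:1-7766")
```

With the workshop ID, `busco_id` and `status` are populated; with the ID from the reported file, both are `None`, which is the missing information described in this comment.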

The sequence IDs in your file look to be missing the `=1275837at2759=single` part that adds the BUSCO gene information, so the import is treating the sequence ID as `ptg016977l:1-7766`. As you thought, removing the `:` and everything after it should fix the problem, because the sequence IDs will then match the sequence IDs from the assembly FASTA file.
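One possible way to do that clean-up is sketched below in Python. It assumes the hits file is tab-separated and that the sequence ID sits in columns 1 and 4, as in the rows quoted above; the helper names `strip_region` and `strip_region_columns` are hypothetical and not part of blobtools.

```python
def strip_region(seq_id: str) -> str:
    """Return the part of a sequence ID before the first ':'."""
    return seq_id.split(":", 1)[0]


def strip_region_columns(lines, columns=(0, 3)):
    """Yield tab-separated rows with the ':...' suffix removed.

    `columns` holds zero-based indices, so (0, 3) means columns 1 and 4.
    Assumption: fields are separated by tabs.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        for idx in columns:
            if idx < len(fields):
                fields[idx] = strip_region(fields[idx])
        yield "\t".join(fields)


example = (
    "ptg016977l:1-7766\t121845\t223.4\tptg016977l:1-7766\t"
    "tr|A0A3Q0J621|A0A3Q0J621_DIACI\t93.3"
)
cleaned = list(strip_region_columns([example]))
```

To apply it to a real file, the same generator can be fed an open file handle and its output written line by line to a new file, leaving the original untouched.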
