-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mixing of 0-based and 1-based position in the outputed vcf file of DiscoSNP++ #21
Comments
Dear Jordi, Thanks for this issue. I'm sorry I'm seeing it only now. Best, |
Dear Jordi, I tried to reproduce the problem you mention with a toy example. With no success. I created a file with two sequences identical but with 3 mutations, two closes and one isolated. These are lower case characters in the second sequence.
I also used "tibo" as the reference genome:
Mutations are located at positions 33 38 and 160.
The resulting vcf is
We retrieve positions 33, 38, and 160 with no difference between close SNPs and isolated SNPs. I hope this helps. Pierre |
Hello Pierre, Thanks for your answer. Indeed it works when using a reference sequence, but in my analysis I don't have one so I tried your data without it with: And then the problem appears again. (I used the latest github version) The unitigs found by discosnp in the discoRes_k_31_c_1_D_100_P_3_b_0_coherent.fa file are:
And the vcf discoRes_k_31_c_1_D_100_P_3_b_0_coherent.vcf is giving 3 snps:
Identically as I observed before with my dataset, the isolated SNP position is incorrect by one base (SNP_lower_path_2 121). If you look at the sequence of the SNP_lower_path_2 unitig, the nucleotide in position 121 is a C and not a G, the G being positioned at position 122 (or 121 in a 0-based system). The positions of the 2 others SNPs are correct (in a 1-based system). From what I have seen in the code of VCF_creator.py I think you may need to add a +1 to the mappingPositionCouple value somewhere (maybe in the WhichPathIsTheRef() function of the SNP class (and maybe to the INDEL one too)). I have the impression that the calculation are correct for the SNPCLOSE class because it don't use this value to determine the position and only add the corrected position and the unitig length, but I didn't had the time to look further. Tell me if you can reproduce the problem or if you want me to run some tests and maybe propose a pull request. (And for the issue I do no have the right to reopen it myself so I will leave it to you if you found it appropriate) Best regards, |
Thanks again Jordi for your detailed and precious feedback. I found and corrected the bug for close SNPs. Everything should now be zero-based. You may pull and retry (no need to recompile). I let the issue open as the bug also exists for INDELS. I have to explore this code. |
The bug is also fixed for unmapped indels |
Thanks a lot Pierre for fixing this. I see that you have made everything 0-based, but the vcf convention state that vcf are 1-based. To avoid confusion with software working on vcf files do you think you could produce 1-based vcf instead? I think this could prevent many mistakes in the future. Jordi |
Hello Pierre,
I use the bioconda discoSNP++ version (2.5.4).
I found a bug in the given positions of the SNPs in the vcf output of discoSNP++ (except if I missunderstood something).
When there are several SNPs on the same sequence, their positions are 1-based, and where there is only one SNP, it is 0-based.
Here are an extract of the discoRes_k_31_c_3_D_100_P_3_b_0_coherent.vcf:
And the corresponding discoRes_k_31_c_3_D_100_P_3_b_0_coherent.fa:
The real position are:
Each SNP where the ID is XXXXX is 0-based, and when it's XXXXX_X it's 1-based.
I confirmed this on a 1 million SNPs vcf file.
My command line was
run_discoSnp++.sh --max_threads 16 -T -r ../file.fof
Do you have an idea of where this come from?
Best regards,
Jordi
The text was updated successfully, but these errors were encountered: