Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect zero-based VCF file #202

Open
ethering opened this issue Dec 5, 2023 · 3 comments
Open

Incorrect zero-based VCF file #202

ethering opened this issue Dec 5, 2023 · 3 comments

Comments

@ethering
Copy link

ethering commented Dec 5, 2023

Hi,
I'm using SURVIVOR (v1.0.7) for the first time and trying out some test data.
I've generated a params file using
SURVIVOR simSV test_params.param
and then simulated the SVs onto a reference genome.
SURVIVOR simSV ref.fasta test_params.param 0.1 0 simulated

Upon viewing the generated simulated.vcf file, I see that the VCF file is using zero-based coordinates when VCF uses 1-based coordinates.
Here are the first 16 characters of my reference sequence:

>SJAP_Chr1
CACCAAAAACCCTAAG

Here are the first 16 characters of my simulated sequence:

>SJAP_Chr1
AACAAAAAAGCCTAAA

and here are the lines in the VCF file that relate to the variants (the VCF header states 'source=Sniffles').

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT
SJAP_Chr1       0       SNP0SURVIVOR    C       A       PRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1   GT:GL:GQ:FT:RC:DR:DV:RR:RV      1/1
SJAP_Chr1       3       SNP1SURVIVOR    C       A       PRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1   GT:GL:GQ:FT:RC:DR:DV:RR:RV      1/1
SJAP_Chr1       9       SNP2SURVIVOR    C       G       PRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1   GT:GL:GQ:FT:RC:DR:DV:RR:RV      1/1
SJAP_Chr1       15      SNP3SURVIVOR    G       A       PRECISE;SVMETHOD=SURVIVOR_sim;SVLEN=1   GT:GL:GQ:FT:RC:DR:DV:RR:RV

As you can see, the first variant is at position zero and then the subsequent co-ordinates are also zero-based. The actual 'POS' values should be 1, 4, 10, and 16.

Based on my initial use of SURVIVOR, I have two other comments:

  1. The 'simulated.bed' file is not in BED format, so perhaps it would be better to give it a different file extension
  2. I can't find any command line help for the simSV method. Typing SURVIVOR simSV --help gives the stdout of Parameter file generated and produces a file called --help
@waltergallegog
Copy link

Hi @ethering have you been able to overcome this issue?

I'm observing the same behavior, and this issue is breaking some downstream analysis (liftover).

A simple check with

bcftools norm --check-ref e --fasta-ref <fasta_file> <vcf_file> -o <vcf_norm_file>

Gives errors like the following:

Reference allele mismatch at chr7 .. REF_SEQ:'GCTCTTCTCTTCTT' vs VCF:'CTCTTCTCTTCTTA

@waltergallegog
Copy link

Besides the issue with POS I would like to add that the END information field seems 1-based to me, which is correct according to the vcf specification, but inconsistent with the 0-based POS. This brakes the following relation from the vcf specification:

For precise variants, END is POS + length of REF allele − 1, and the for imprecise variants the corresponding best estimate.

A simple example of the issue showing a diff between Reference and produced fasta, and the corresponding vcf:

Screenshot_20240710_135945

  • Top left: Reference, with the deletion highlighted in red
  • Top right: Fasta output
  • bottom: VCF output

(The lines in the fasta file are of length = 100)

The deletion should have:

POS = 307 (1 based)
LEN = 14
END = 320 (1 based)

307 + 14 -1 = 320

@ethering
Copy link
Author

Hi @ethering have you been able to overcome this issue?

I'm observing the same behavior, and this issue is breaking some downstream analysis (liftover).

A simple check with

bcftools norm --check-ref e --fasta-ref <fasta_file> <vcf_file> -o <vcf_norm_file>

Gives errors like the following:

Reference allele mismatch at chr7 .. REF_SEQ:'GCTCTTCTCTTCTT' vs VCF:'CTCTTCTCTTCTTA

No. I modified my workflow so that it didn't require the VCF file. One alternative would be to create a script that increments the POS position by one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants