VarDict-java produces malformed VCF #42

aryarm · 2022-03-02T18:22:37Z

I'm creating this issue to record a problem encountered by a user (through personal correspondence). They received the following error message:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 165135: unparsable vcf record with allele CCCCCTCCCCACTGTTCCAGTAGTCACTCCCTGGCTCCTCCCCAGGCCTCT<dup-8>AGGCCTCTGCTGCTCCTCCCCACTGTGTTCCAGTAGTCACTCCCTGGCTG

Based on the error message, it sounds like the VarDict-java tool is creating a malformed VCF allele:

CCCCCTCCCCACTGTTCCAGTAGTCACTCCCTGGCTCCTCCCCAGGCCTCT<dup-8>AGGCCTCTGCTGCTCCTCCCCACTGTGTTCCAGTAGTCACTCCCTGGCTG

The <dup-8> part of that allele is not valid in the VCF format, so GATK flags it and raises an exception.
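To gauge how many records are affected before deciding what to do with them, one could count the data lines whose ALT column contains a `<dup` token. The following is a minimal sketch on a made-up three-record VCF; the coordinates and alleles are hypothetical and only stand in for real VarDict output.

```shell
# Build a tiny, hypothetical VCF; real input would come from the VarDict pipeline.
tmp_vcf="$(mktemp)"
printf '##fileformat=VCFv4.2\n' > "$tmp_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$tmp_vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> "$tmp_vcf"
printf 'chr1\t200\t.\tC\tCCT<dup-8>AGG\t50\tPASS\t.\n' >> "$tmp_vcf"
printf 'chr1\t300\t.\tT\tTA\t50\tPASS\t.\n' >> "$tmp_vcf"

# Count non-header records whose fifth (ALT) column contains "<dup".
n_dup="$(awk -F '\t' '!/^#/ && $5 ~ /<dup/ { n++ } END { print n + 0 }' "$tmp_vcf")"
echo "records with <dup in ALT: $n_dup"
rm -f "$tmp_vcf"
```

For a real bgzipped file, the same awk program could read from `gzip -dc` on the compressed VCF instead of a temporary file.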

It appears that someone else has already reported the issue in the VarDict repo. If anyone else encounters this while we wait for it to be resolved, I would recommend discarding those alleles manually using awk, just like we did in #25. For example, you could edit line 17 of the callers/vardict file from this

teststrandbias.R | var2vcf_valid.pl | bgzip > "$output_dir/vardict.vcf.gz" && \

to this

teststrandbias.R | var2vcf_valid.pl | \
awk -F '\t' -v 'OFS=\t' '/^#/ || $5 !~ /<dup/' | \
bgzip > "$output_dir/vardict.vcf.gz" && \

This will simply remove any lines in the VCF where the fifth column (for the ALT alleles) contains <dup. Ideally, we would keep those lines in the file and fix those alleles so that they are valid, since they potentially represent real structural variants that should be reported in VarCA's output. But without further information, I can't know what the correct allele should be, so I don't know how to properly change it using awk.
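The behavior of that filter can be checked on a small example. This is a sketch using a hypothetical four-line VCF (two header lines, one ordinary record, one record with a `<dup-8>` allele), not real VarDict output.

```shell
# Hypothetical input piped through the same awk condition as in the snippet above.
filtered="$(
  printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t200\t.\tC\tCCT<dup-8>AGG\n' |
    awk -F '\t' -v 'OFS=\t' '/^#/ || $5 !~ /<dup/'
)"
printf '%s\n' "$filtered"

# Count the data (non-header) records that survived the filter.
n_kept="$(printf '%s\n' "$filtered" | grep -vc '^#')"
echo "data records kept: $n_kept"
```

Header lines pass through because of the `/^#/` condition, so the result should still be a readable VCF, while the record at the made-up position 200 is dropped.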


elahoehne · 2022-03-09T06:31:48Z

Hey Arya,

I'm Michaela, who wrote you the email. I finally registered here on GitHub.
I tried what you suggested. Unfortunately, it is not working; new errors occur:

awk: fatal: cannot open file `out_WT/callers/WT/vardict/vardict.vcf' for reading: No such file or directory

***********************************************************************

A USER ERROR has occurred: Cannot read file:///scratch/mhoehne/Gisela/CUTTag/fastq_trimmed/bam_all/bams/varCA/out_WT/callers/WT/vardict/vardict.vcf.gz because no sui$

Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
00:32:11.989 INFO SelectVariants - Shutting down engine
[March 9, 2022 at 12:32:11 AM CET] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.11 minutes.
Runtime.totalMemory()=98566144

A USER ERROR has occurred: Cannot read file:///scratch/mhoehne/Gisela/CUTTag/fastq_trimmed/bam_all/bams/varCA/out_WT/callers/WT/vardict/vardict.vcf.gz because no sui$

even though there is a vardict.vcf.file in that folder.
I'll attach the full log and the adjusted vardict script.

Thank you very much for your help!
log.txt
vardict.txt

aryarm · 2022-03-09T13:17:43Z

Oops, I meant to delete the varscan.vcf portion on the second line of that code snippet! Here's the corrected version:

teststrandbias.R | var2vcf_valid.pl | \
awk -F '\t' -v 'OFS=\t' '/^#/ || $5 !~ /<dup/' | \
bgzip > "$output_dir/vardict.vcf.gz" && \

Sorry about that! I've edited the original post to reflect this corrected code.
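The "cannot open file" error reported above is consistent with how awk treats a leftover file name: when a file operand follows the program, awk reads that file and ignores whatever is piped to it. A minimal sketch of the difference, where `no_such_file.vcf` is a made-up name chosen so that the open fails:

```shell
# With no file operand, awk reads the records piped to it on standard input.
from_stdin="$(printf 'a\tb\n' | awk -F '\t' '{ print $2 }')"

# With a file operand, awk does not read the pipe and tries to open the file instead.
# "no_such_file.vcf" is a hypothetical, nonexistent file.
if printf 'a\tb\n' | awk -F '\t' '{ print $2 }' no_such_file.vcf 2>/dev/null; then
  operand_status="ok"
else
  operand_status="failed"
fi
echo "stdin: $from_stdin, with stray operand: $operand_status"
```

In the pipeline from this thread, a failing awk would also leave bgzip with no usable input, which may explain the follow-on GATK error about vardict.vcf.gz.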

elahoehne · 2022-03-11T06:38:30Z

Thank you very much!
Now it ran successfully! Only one warning occurred:

scripts/2vcf.py:267: UserWarning: Ignored 32116410 classification sites that didn't have a variant.
"Ignored {:n} classification sites that didn't have a variant.".format(skipped)

But I guess that makes sense!?
I am now running it with the merged fastq files.

aryarm · 2022-03-13T16:44:44Z

Yes, that is a standard warning message that appears regardless of the VarDict issue. When generating a VCF, VarCA keeps track of every position in the genome, regardless of whether there's a variant there. The 2vcf.py script then discards the sites without a variant when it converts the final output to VCF.

elahoehne · 2022-03-21T09:35:20Z

Now it worked for all replicates as well as for the merged fastq files!
Thank you very much for your help!


aryarm added the bug label Mar 2, 2022

aryarm added a commit that referenced this issue Jun 14, 2022

handle vardict iupac ambiguity codes

e0ca967

resolves #42

aryarm mentioned this issue Jun 14, 2022

handle vardict iupac ambiguity codes #44

Merged

aryarm closed this as completed in #44 Jun 14, 2022
