Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VarDict-java produces malformed VCF #42

Closed
aryarm opened this issue Mar 2, 2022 · 5 comments · Fixed by #44
Closed

VarDict-java produces malformed VCF #42

aryarm opened this issue Mar 2, 2022 · 5 comments · Fixed by #44
Labels
bug Something isn't working

Comments

@aryarm
Copy link
Owner

aryarm commented Mar 2, 2022

I'm creating this issue to record a problem encountered by a user (through personal correspondence). They received the following error message:

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 165135: unparsable vcf record with allele CCCCCTCCCCACTGTTCCAGTAGTCACTCCCTGGCTCCTCCCCAGGCCTCT<dup-8>AGGCCTCTGCTGCTCCTCCCCACTGTGTTCCAGTAGTCACTCCCTGGCTG

Based on the error message, it sounds like the VarDict-java tool is creating a malformed VCF allele:

CCCCCTCCCCACTGTTCCAGTAGTCACTCCCTGGCTCCTCCCCAGGCCTCT<dup-8>AGGCCTCTGCTGCTCCTCCCCACTGTGTTCCAGTAGTCACTCCCTGGCTG

The <dup-8> part of that allele is not valid in the VCF format, so GATK flags it and raises an exception.

It appears that someone else has already reported the issue in the VarDict repo. In the meantime, if anyone else encounters this while we wait for the issue to be resolved, I would recommend just discarding those alleles manually using awk just like we did in #25 . For example, you could edit line 17 of the callers/vardict file from this

teststrandbias.R | var2vcf_valid.pl | bgzip > "$output_dir/vardict.vcf.gz" && \

to this

teststrandbias.R | var2vcf_valid.pl | \
awk -F $"\t" -v 'OFS=\t' '/^#/ || $5 !~ /<dup/' | \
bgzip > "$output_dir/vardict.vcf.gz" && \

This will simply remove any lines in the VCF where the fifth column (for the ALT alleles) contains <dup. Ideally, we would keep those lines in the file and fix those alleles so that they are valid, since they potentially represent real structural variants that should be reported in VarCA's output. But without further information, I can't know what the correct allele should be, so I don't know how to properly change it using awk.

@aryarm aryarm added the bug Something isn't working label Mar 2, 2022
@elahoehne
Copy link

Hey Arya,

I'm Michaela who wrote you the mail. I finally registered here at GitHub.
I tried what you suggested. Unfortunately, it is not working. New errors occur:

awk: fatal: cannot open file out_WT/callers/WT/vardict/vardict.vcf' for reading: No such file or directory
`

`***********************************************************************

A USER ERROR has occurred: Cannot read file:///scratch/mhoehne/Gisela/CUTTag/fastq_trimmed/bam_all/bams/varCA/out_WT/callers/WT/vardict/vardict.vcf.gz because no sui$


Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
00:32:11.989 INFO SelectVariants - Shutting down engine
[March 9, 2022 at 12:32:11 AM CET] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.11 minutes.
Runtime.totalMemory()=98566144


A USER ERROR has occurred: Cannot read file:///scratch/mhoehne/Gisela/CUTTag/fastq_trimmed/bam_all/bams/varCA/out_WT/callers/WT/vardict/vardict.vcf.gz because no sui$
`

even though there is a vardict.vcf.file in that folder.
I'll append you the full log and also the adjusted vardict script.

Thank you very much for your help!
log.txt
vardict.txt

@aryarm
Copy link
Owner Author

aryarm commented Mar 9, 2022

ooops, I meant to delete the varscan.vcf portion on the second line of that code snippet! Here's the corrected version

teststrandbias.R | var2vcf_valid.pl | \
awk -F $"\t" -v 'OFS=\t' '/^#/ || $5 !~ /<dup/' | \
bgzip > "$output_dir/vardict.vcf.gz" && \

Sorry about that! I've edited the original post to reflect this corrected code.

@elahoehne
Copy link

elahoehne commented Mar 11, 2022

Thank you very much!
Now it run successfully! There is only a warning occurring:

scripts/2vcf.py:267: UserWarning: Ignored 32116410 classification sites that didn't have a variant.
"Ignored {:n} classification sites that didn't have a variant.".format(skipped)

But I guess that makes sense!?
I am now running it with the merged fastq files.

@aryarm
Copy link
Owner Author

aryarm commented Mar 13, 2022

Yes, that is a standard warning message that will happen regardless of the Vardict issue. When generating a VCF, VarCA will keep track of every position in the genome, regardless of whether there's a variant there. The 2vcf.py script will then discard these sites when it converts the final output to VCF.

@elahoehne
Copy link

Now it worked for all replicates as well as for the merged fastq files!
Thank you very much for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants