Want an engine feature to convert XT tagged base qualities to two #3540

Open
sooheelee opened this issue Sep 1, 2017 · 0 comments

sooheelee commented Sep 1, 2017

Base qualities of two (#) are handled specially by BWA and our tools and are typically used to indicate adapter sequence. See reply to jhess in https://gatkforums.broadinstitute.org/gatk/discussion/comment/35120#Comment_35120:

That's correct, Q2 bases are considered to be special and left untouched by BQSR.

Currently, there is no easy way to convert base qualities to two. The only instances I am aware of are (i) SamToFastq, which then unaligns the reads, and (ii) MergeBamAlignment, which isn't necessarily a part of everyone's workflow. Also, MergeBamAlignment's CLIP_ADAPTERS soft-clips XT-tagged sequence, which then becomes fair game for our assembly-based callers.

MarkIlluminaAdapters uses aligned reads to mark those with 3' adapter sequence with the XT tag. The XT tag value notes the start of the 3' adapter sequence in the read. During MergeBamAlignment, one must explicitly request that this XT tag be retained in the merged output. Because our assembly-based callers throw out CIGAR strings from the aligner when reassembling reads, so as to use soft-clipped sequence that may contain true variants we wish to resolve, adapter sequence can be incorporated into the graph. This is not an issue for libraries with low levels of adapter read-through or for germline calling, as we prune nodes in the graph that have fewer than two reads supporting them.

However, for somatic cases and for libraries where there is considerable adapter read-through, the current solution is to hard-clip adapter sequences out of reads or to toss these reads altogether so as not to increase the extent of spurious calls.

The issue with hard-clipping is that our reads become malformed due to a mismatch between CIGAR string and sequence length. The GATK engine filters out such reads. So the solution is to correct the CIGAR strings, to go back and re-align the clipped reads, or again to toss the reads.
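
To illustrate the mismatch, here is a minimal Python sketch of the consistency check between a CIGAR string and the stored sequence. The function names are hypothetical and this is not GATK's read filter; it only relies on the SAM convention that the M, I, S, = and X operators consume read bases.

```python
import re

# One CIGAR element: a length followed by a SAM operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that consume bases of the stored read sequence (SAM convention).
QUERY_CONSUMING = set("MIS=X")

def cigar_query_length(cigar: str) -> int:
    """Number of read bases the CIGAR string claims the record has."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in QUERY_CONSUMING)

def is_consistent(cigar: str, seq: str) -> bool:
    """True when the CIGAR-implied length equals the stored sequence length."""
    return cigar_query_length(cigar) == len(seq)

# A 10-base read aligned as 10M is fine.
print(is_consistent("10M", "ACGTACGTAC"))
# Dropping the last 3 bases without touching the CIGAR leaves a malformed read.
print(is_consistent("10M", "ACGTACG"))
# Recording the removed bases as a hard clip restores consistency.
print(is_consistent("7M3H", "ACGTACG"))
```

In this toy case, correcting the CIGAR means recording the removed bases as a hard clip (H), which does not count toward the stored sequence length.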

It would be great not to have to throw out reads that include some adapter sequence in somatic workflows that call down to the lowest allele fraction variants. It seems this would simply be a matter of a tool or feature that sets the base qualities of adapter sequence marked with the XT tag to two, plus special handling by our callers of sequence with base quality of two.
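
As a rough sketch of the requested conversion, the Python below overwrites the qualities from the XT position to the end of the read with Q2 (`#` in Phred+33). It is not an existing GATK or Picard feature. The function name is hypothetical, and it assumes the XT value is a 1-based offset into the quality string as stored, so reads stored reverse-complemented would need extra handling.

```python
# Phred+33 character for base quality 2, i.e. '#'.
Q2_CHAR = chr(2 + 33)

def mask_adapter_qualities(quals: str, xt_start: int) -> str:
    """Set every quality from xt_start (assumed 1-based) to the read end to Q2.

    quals    -- Phred+33 quality string for one read
    xt_start -- XT tag value, assumed to be the 1-based start of the adapter
    """
    if not 1 <= xt_start <= len(quals):
        raise ValueError(f"XT value {xt_start} is outside the read (length {len(quals)})")
    keep = xt_start - 1  # number of leading bases left untouched
    return quals[:keep] + Q2_CHAR * (len(quals) - keep)

# An 8-base read whose adapter starts at position 6.
print(mask_adapter_qualities("IIIIIIII", 6))
```

Because the sequence and CIGAR stay untouched, the read keeps its length and alignment; only the qualities signal which bases the callers should treat specially.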
