Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: truvari collapse --keep option to mantain the ALT sequence #214

Closed
leone93 opened this issue May 28, 2024 · 1 comment
Closed

Comments

@leone93
Copy link

leone93 commented May 28, 2024

Version : 4.2.2

Describe the request : Trying different merging options, I noticed that sometimes you could lose the sequence information present in the ALT field because another record without a precise sequence in the field, like <DEL> or <INV> or whatever, is chosen. I would like to maintain instead the field where the sequence is present to not loose this information.

What do you think?
Thanks
Leonardo

@ACEnglish
Copy link
Owner

ACEnglish commented May 28, 2024

For those specific examples, I would recommend pre-processing the variants to turn them into resolved variants. A <DEL> could be filled in by setting the REF to reference[entry.start:entry.end] and alt the first base of that sequence. An <INV> essentially the same thing for REF but with the reverse complement in the alt. I've also resolved <DUP> to simply be an insertion of the duplicated range.

Plus, once that's done, you can turn on sequence similarity, which is an important measure for accurately comparing SVs.

An example script for resolving SVs in a VCF is below.

resolve.py
"""
Given a VCF, fill in the <DEL>, <INV>, <DUP> ALT sequence
"""

import sys
import pysam
import truvari

MAX_SV = 100_000_000 # Filter things smaller than this

RC = str.maketrans("ATCG", "TAGC")
def do_rc(s):
    """
    Reverse complement a sequence
    """
    return s.translate(RC)[::-1]

def resolve(entry, ref):
    """
    """
    if entry.start > ref.get_reference_length(entry.chrom):
        return entry
    if entry.alts[0] in ['<CNV>', '<INS>']:
        return entry

    seq = ref.fetch(entry.chrom, entry.start, entry.stop)
    if entry.alts[0] == '<DEL>':
        entry.ref = seq
        entry.alts = [seq[0]]
    elif entry.alts[0] == '<INV>':
        entry.ref = seq
        entry.alts = [do_rc(seq)]
    elif entry.alts[0] == '<DUP>':
        entry.info['SVTYPE'] = 'INS'
        entry.ref = seq[0]
        entry.alts = [seq]
        entry.stop = entry.start + 1

    return entry

if __name__ == '__main__':
    vcf = pysam.VariantFile(sys.argv[1])
    ref = pysam.FastaFile(sys.argv[2])
    n_header = vcf.header.copy()
    
    out = pysam.VariantFile("/dev/stdout", 'w', header=n_header)
    for entry in vcf:
        if truvari.entry_size(entry) >= MAX_SV:
            continue
        if entry.alts[0].startswith("<"):
            entry = resolve(entry, ref)
        try:
            out.write(entry)
        except Exception:
            sys.stderr.write(f"{entry}\n{type(entry)}\n")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants