Describe the request : While trying different merging options, I noticed that the sequence information in the ALT field can sometimes be lost, because the record that gets chosen is one without a precise sequence in that field, such as <DEL> or <INV>. I would like the record that carries the sequence to be kept instead, so that this information is not lost.
What do you think?
Thanks
Leonardo
For those specific examples, I would recommend pre-processing the variants to turn them into resolved variants. A <DEL> can be filled in by setting REF to reference[entry.start:entry.stop] and ALT to the first base of that sequence. An <INV> is essentially the same for REF, but with the reverse complement of that sequence as ALT. I've also resolved <DUP> to simply be an insertion of the duplicated range.
Plus, once that's done, you can turn on sequence similarity, which is an important measure for accurately comparing SVs.
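To illustrate why resolved sequences matter here, the sketch below compares two ALT alleles and returns no score when either one is symbolic. It is only an illustration: `sequence_similarity` is a hypothetical helper, and `difflib.SequenceMatcher` from the standard library stands in for whatever similarity measure Truvari actually computes.

```python
from difflib import SequenceMatcher


def sequence_similarity(alt_a, alt_b):
    """Return a similarity ratio in [0, 1], or None if either ALT is symbolic.

    Hypothetical helper for illustration; not part of Truvari.
    """
    # Symbolic alleles such as <DEL> or <INV> carry no bases to compare.
    if alt_a.startswith("<") or alt_b.startswith("<"):
        return None
    # autojunk=False keeps difflib from discarding frequent bases on long inputs.
    return SequenceMatcher(None, alt_a, alt_b, autojunk=False).ratio()


if __name__ == "__main__":
    print(sequence_similarity("<DEL>", "ACGTACGT"))     # symbolic: nothing to compare
    print(sequence_similarity("ACGTACGT", "ACGTACGA"))  # resolved: a ratio is returned
```

Once both records carry explicit sequence, a score can be computed for every pair, which is what makes the resolving step below worthwhile.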
An example script for resolving SVs in a VCF is below.
resolve.py

"""Given a VCF, fill in the <DEL>, <INV>, <DUP> ALT sequence"""
import sys

import pysam
import truvari

MAX_SV = 100_000_000  # Skip variants this size or larger
RC = str.maketrans("ATCG", "TAGC")


def do_rc(s):
    """ Reverse complement a sequence """
    return s.translate(RC)[::-1]


def resolve(entry, ref):
    """ Replace a symbolic ALT with sequence fetched from the reference """
    # Leave entries that start beyond the end of their chromosome untouched
    if entry.start > ref.get_reference_length(entry.chrom):
        return entry
    # These types cannot be filled in from the reference span alone
    if entry.alts[0] in ['<CNV>', '<INS>']:
        return entry
    seq = ref.fetch(entry.chrom, entry.start, entry.stop)
    if entry.alts[0] == '<DEL>':
        # REF is the deleted span, ALT is its first base
        entry.ref = seq
        entry.alts = [seq[0]]
    elif entry.alts[0] == '<INV>':
        # REF is the span, ALT is its reverse complement
        entry.ref = seq
        entry.alts = [do_rc(seq)]
    elif entry.alts[0] == '<DUP>':
        # Represent the duplication as an insertion of the duplicated range
        entry.info['SVTYPE'] = 'INS'
        entry.ref = seq[0]
        entry.alts = [seq]
        entry.stop = entry.start + 1
    return entry


if __name__ == '__main__':
    vcf = pysam.VariantFile(sys.argv[1])
    ref = pysam.FastaFile(sys.argv[2])
    n_header = vcf.header.copy()
    out = pysam.VariantFile("/dev/stdout", 'w', header=n_header)
    for entry in vcf:
        if truvari.entry_size(entry) >= MAX_SV:
            continue
        if entry.alts[0].startswith("<"):
            entry = resolve(entry, ref)
        try:
            out.write(entry)
        except Exception:
            # Report entries that could not be written and keep going
            sys.stderr.write(f"{entry}\n{type(entry)}\n")
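To see what the three cases do without needing pysam, a reference FASTA, or a VCF, here is a minimal sketch of the same REF/ALT logic on a plain string. The toy `REFERENCE` sequence and the `resolve_alleles` helper are made up for illustration; coordinates are assumed to be 0-based and half-open, as with pysam's `start` and `stop`.

```python
# Toy reference sequence (an assumption for illustration, not real data).
REFERENCE = "ACGTTGCAAGGC"
RC = str.maketrans("ATCG", "TAGC")


def reverse_complement(seq):
    """Reverse complement an uppercase DNA sequence."""
    return seq.translate(RC)[::-1]


def resolve_alleles(symbolic_alt, start, stop, reference=REFERENCE):
    """Return (ref, alt) for a symbolic allele spanning reference[start:stop]."""
    seq = reference[start:stop]
    if symbolic_alt == "<DEL>":
        # The whole span is REF; only its first base survives as ALT.
        return seq, seq[0]
    if symbolic_alt == "<INV>":
        # The span is REF; ALT is the same span reverse complemented.
        return seq, reverse_complement(seq)
    if symbolic_alt == "<DUP>":
        # Written as an insertion: first base as REF, duplicated span as ALT.
        return seq[0], seq
    raise ValueError(f"cannot resolve {symbolic_alt}")


if __name__ == "__main__":
    for alt in ("<DEL>", "<INV>", "<DUP>"):
        print(alt, resolve_alleles(alt, 2, 7))
```

Like the script above, this only handles uppercase A, C, G and T; soft-masked (lowercase) or ambiguous bases would pass through the translation table unchanged.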
Version : 4.2.2