iSNV all-sample merge step (step 2) #89

dpark01 · 2015-02-10T14:26:51Z

Each set of iSNV calls starts off separate for each sample. Each set is in its own coordinate space (having been called against that sample's own consensus sequence) and has its own relative definition of "reference" and "alternate" alleles. We'd like to merge all of these into a single master table, using consistent coordinates and alleles, and a standard file format (e.g. VCF). Make a new command in intrahost.py (perhaps called something like "merge_to_vcf").

input:
- a bunch of marked up text files per sample with iSNV calls, the output of the command described in iSNV per-sample processing pipeline (step 1) #88.
- each individual sample's consensus assembly (fasta) that describes their reference alleles and coordinate space.
- a master reference assembly (fasta) that describes our target coordinate space and reference alleles for the output file.

output: a single VCF file

Steps:

For each sample:

a. Align individual assembly to master reference with MUSCLE. Create custom code that reads that alignment and is able to generally map arbitrary coordinates (chr name, position) between the two coordinate spaces at will.
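
The coordinate mapping in step (a) could be sketched roughly as follows. This is only an illustration, not the intrahost.py implementation: the function name `map_sample_to_reference` is hypothetical, it takes the two gapped rows of an alignment that has already been computed (e.g. by MUSCLE) as plain strings, it ignores chromosome names, and it assumes that a sample base inserted relative to the reference should map to the nearest reference base on its left (0 if there is none).

```python
def map_sample_to_reference(aligned_sample, aligned_ref, gap="-"):
    """Map 1-based sample positions to 1-based reference positions.

    Both arguments are rows of the same pairwise alignment, so they must
    have equal length. Hypothetical helper, for illustration only.
    """
    if len(aligned_sample) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    sample_pos = 0
    ref_pos = 0
    for sample_base, ref_base in zip(aligned_sample, aligned_ref):
        if ref_base != gap:
            ref_pos += 1
        if sample_base != gap:
            sample_pos += 1
            # For a sample-only column (insertion relative to the reference),
            # ref_pos is still the last reference base seen, i.e. left-anchored.
            mapping[sample_pos] = ref_pos
    return mapping


if __name__ == "__main__":
    # Sample has an extra C after position 1 and lacks reference base 4.
    print(map_sample_to_reference("ACGT-A", "A-GTCA"))
```

The reverse direction (reference to sample) can be built the same way by swapping the two arguments.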

b. Map all iSNV calls to the target coordinate space. Shift the position of each deletion by one to match the VCF convention, which anchors a deletion at the base immediately preceding it.
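
As a sketch of the deletion shift in step (b): the example below assumes the caller reports a deletion at the position of the first deleted base, which is an assumption about the input rather than something stated here. The function name `deletion_to_vcf` is hypothetical. It produces the VCF-style representation, where POS is the preceding base, REF is that base plus the deleted bases, and ALT is that base alone.

```python
def deletion_to_vcf(ref_seq, first_deleted_pos, length):
    """Convert a deletion given as (first deleted 1-based position, length)
    into a VCF-style (pos, ref, alt) tuple anchored on the preceding base.

    Hypothetical helper; a deletion starting at position 1 has no
    preceding base and is not handled in this sketch.
    """
    if first_deleted_pos < 2:
        raise ValueError("deletion at position 1 needs special handling")
    if first_deleted_pos - 1 + length > len(ref_seq):
        raise ValueError("deletion runs past the end of the sequence")
    anchor_pos = first_deleted_pos - 1
    anchor = ref_seq[anchor_pos - 1]
    deleted = ref_seq[first_deleted_pos - 1:first_deleted_pos - 1 + length]
    return anchor_pos, anchor + deleted, anchor


if __name__ == "__main__":
    # Deleting "GT" (positions 3-4) from ACGTAC.
    print(deletion_to_vcf("ACGTAC", 3, 2))
```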

c. Flip alleles (ref & alt) if necessary to match the target reference genome.
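
One way to picture the allele flip in step (c), as a sketch only: suppose each call has been reduced to a dictionary of allele to frequency for one sample at one position (this data layout and the name `orient_to_reference` are assumptions for illustration). Re-expressing it against the target reference then just means treating the target reference allele as REF and everything else as ALT.

```python
def orient_to_reference(allele_freqs, ref_allele):
    """Re-express one sample's alleles relative to the target reference.

    allele_freqs: dict of allele string -> frequency for one sample/position.
    Returns (ref, alts, alt_freqs), with alts sorted by descending frequency
    (ties broken alphabetically). Hypothetical helper, for illustration only.
    """
    alts = sorted(
        (allele for allele in allele_freqs if allele != ref_allele),
        key=lambda allele: (-allele_freqs[allele], allele),
    )
    return ref_allele, alts, [allele_freqs[allele] for allele in alts]


if __name__ == "__main__":
    # Sample consensus is A (90%) with minor G (10%); target reference is G,
    # so A becomes the alternate allele at 90%.
    print(orient_to_reference({"A": 0.9, "G": 0.1}, "G"))
```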

Across all samples:

a. Sort and group all data per genomic position. The VCF format describes one row per position in the genome, so we must merge data from all files into a single output row, since our input describes each sample separately. Beyond that, V-Phaser often describes a single position multiple times for a single sample when multiple types of variants (SNPs, insertions, deletions) are seen at the same place. In this step, we must merge all variants together across all samples, and pad out their allele lengths with extra genome sequence if necessary.
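
The allele padding described above might look something like this. It is a sketch under several assumptions: the variants have already been grouped into one overlapping locus, each is a VCF-style `(pos, ref, alts)` tuple in target coordinates, and `merge_and_pad` is a hypothetical name. Every alternate allele is extended with reference sequence on either side so that all alleles share one REF span.

```python
def merge_and_pad(ref_seq, variants):
    """Merge variants at one locus into a single (pos, ref, alts) row.

    variants: list of (pos, ref, alts) with 1-based pos and alts as a list.
    Hypothetical helper, for illustration only.
    """
    start = min(pos for pos, ref, alts in variants)
    end = max(pos + len(ref) for pos, ref, alts in variants)  # exclusive
    merged_ref = ref_seq[start - 1:end - 1]
    merged_alts = []
    for pos, ref, alts in variants:
        if ref_seq[pos - 1:pos - 1 + len(ref)] != ref:
            raise ValueError("REF allele does not match reference at %d" % pos)
        # Reference sequence needed on each side to reach the common span.
        left = ref_seq[start - 1:pos - 1]
        right = ref_seq[pos - 1 + len(ref):end - 1]
        for alt in alts:
            padded = left + alt + right
            if padded not in merged_alts:
                merged_alts.append(padded)
    return start, merged_ref, merged_alts


if __name__ == "__main__":
    # A SNP (C>T) and a 2-base deletion (CGT>C) reported at the same position.
    print(merge_and_pad("ACGTAC", [(2, "C", ["T"]), (2, "CGT", ["C"])]))
```

Here the SNP allele `T` is padded to `TGT` so that it is expressed against the same `CGT` reference span as the deletion.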

b. If V-Phaser had no output at a given position for a sample, reach back into that sample's consensus assembly and record that consensus allele at 100% (V-Phaser only reports variant positions). If the consensus base at this position is an N (insufficient read support), don't do this. If the consensus assembly is gapped at this position, figure out the proper indel to put in this place (at 100%).
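
A rough sketch of the fallback in step (b), covering the three cases above. The name `consensus_call` and the return layout are hypothetical. It handles only a single-base gap, and it assumes the sample's consensus matches the reference at the anchor base just before the gap; a real implementation would need to handle longer gaps and mismatching anchors.

```python
def consensus_call(ref_seq, pos, sample_base, gap="-"):
    """Infer a 100% call from the sample consensus base aligned to
    1-based reference position pos.

    Returns None for N, otherwise (pos, ref, allele, freq).
    Hypothetical helper, for illustration only.
    """
    base = sample_base.upper()
    if base == "N":
        # Insufficient read support: make no call for this sample.
        return None
    if base == gap:
        if pos < 2:
            raise ValueError("gap at position 1 needs special handling")
        # Express the gap as a VCF-style deletion anchored on the previous base.
        anchor = ref_seq[pos - 2]
        return pos - 1, anchor + ref_seq[pos - 1], anchor, 1.0
    return pos, ref_seq[pos - 1], base, 1.0


if __name__ == "__main__":
    for sample_base in ("C", "T", "N", "-"):
        print(sample_base, consensus_call("ACGT", 2, sample_base))
```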

The existing "vphaser_to_vcf" command is a good starting point to refer to for the code, but it doesn't do all of the above, and it also does a few steps that belong in #88 instead. In particular, it doesn't do the MUSCLE coordinate remapping and doesn't handle any consensus-level indels, because none of these existed in our Summer 2014 data set, but they definitely exist in our current EBOV data. (They existed in previous Lassa data sets, but we weren't using V-Phaser back then.)

Clearly, this step will need some decent unit testing...

dpark01 · 2015-02-20T22:13:25Z

depends on #99

dpark01 · 2015-02-28T03:32:09Z

Finished an initial implementation that has bugs and will need some fixing. Started sketching out test cases and implemented some of them, but not all.

implement iSNV step 2 (merge to VCF) - closes #89

dpark01 added the 1 - Ready label Feb 10, 2015

dpark01 added this to the Implement iSNV pipeline milestone Feb 10, 2015

dpark01 mentioned this issue Feb 10, 2015

iSNV filtering based on tuneable criteria (step 3) #90

Closed

dpark01 self-assigned this Feb 26, 2015

dpark01 mentioned this issue Mar 1, 2015

implement iSNV step 2 (merge to VCF) #104

Merged

dpark01 closed this as completed in #104 Mar 3, 2015

dpark01 added a commit that referenced this issue Mar 3, 2015

Merge pull request #104 from broadinstitute/dp-89-isnv-merge

0ac1a8e

implement iSNV step 2 (merge to VCF) - closes #89
