iSNV all-sample merge step (step 2) #89

Closed
dpark01 opened this issue Feb 10, 2015 · 2 comments · Fixed by #104

@dpark01
Member

dpark01 commented Feb 10, 2015

Each sample's iSNV calls start out separate. Each set lives in its own coordinate space (having been called against that sample's own consensus sequence) and has its own relative definition of "reference" and "alternate" alleles. We'd like to merge all of these into a single master table with consistent coordinates and alleles, in a standard file format (e.g. VCF). Make a new command in intrahost.py (perhaps called something like "merge_to_vcf").

  • input:
    • a set of marked-up text files, one per sample, with iSNV calls: the output of the command described in iSNV per-sample processing pipeline (step 1) #88.
    • each individual sample's consensus assembly (fasta) that describes their reference alleles and coordinate space.
    • a master reference assembly (fasta) that describes our target coordinate space and reference alleles for the output file.
  • output: a single VCF file
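
As a rough sketch of what the command-line interface could look like: every argument name below is a hypothetical placeholder, and this uses plain argparse rather than whatever conventions intrahost.py already follows for its other commands.

```python
import argparse


def parser_merge_to_vcf():
    """Build a parser for a hypothetical merge_to_vcf command.

    All argument names are illustrative assumptions, not an agreed interface.
    """
    parser = argparse.ArgumentParser(
        prog='intrahost.py merge_to_vcf',
        description='Merge per-sample iSNV calls into a single VCF.')
    parser.add_argument('ref_fasta',
                        help='master reference assembly (target coordinate space)')
    parser.add_argument('out_vcf', help='output VCF file')
    parser.add_argument('--samples', nargs='+', required=True,
                        help='sample names, in the same order as the files below')
    parser.add_argument('--isnvs', nargs='+', required=True,
                        help='per-sample iSNV call files (output of step 1)')
    parser.add_argument('--assemblies', nargs='+', required=True,
                        help='per-sample consensus assemblies (fasta)')
    return parser


def paired_inputs(args):
    """Return (sample, isnv_file, assembly_fasta) triples after a length check."""
    if not (len(args.samples) == len(args.isnvs) == len(args.assemblies)):
        raise ValueError(
            '--samples, --isnvs and --assemblies must have the same length')
    return list(zip(args.samples, args.isnvs, args.assemblies))
```

Passing three parallel lists keeps each sample's call file tied to its own assembly, which matters because every sample is remapped through its own alignment.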

Steps:

  1. For each sample:

    a. Align individual assembly to master reference with MUSCLE. Create custom code that reads that alignment and is able to generally map arbitrary coordinates (chr name, position) between the two coordinate spaces at will.

    b. Map all iSNV calls to the target coordinate space. Shift the position of each deletion by one to match the VCF convention, which anchors a deletion at the base immediately preceding the deleted sequence.

    c. Flip alleles (ref & alt) if necessary to match the target reference genome.

  2. Across all samples:

    a. Sort and group all data per genomic position. The VCF format describes one row per position in the genome. This requires us to merge together data from all files into a single output row since our input describes each sample separately. But more than that, V-Phaser often describes a single position multiple times for a single sample, if multiple types of variants (SNPs, insertions, deletions) are seen at the same place. In this step, we must merge all variants together across all samples, and pad out their allele lengths with extra genome sequence if necessary.

    b. If V-Phaser had no output at a given position for a sample, reach back into that sample's consensus assembly and record the consensus allele at 100% (V-Phaser only reports variant positions). If the consensus base is an N (insufficient read support at this position), skip this. If the consensus assembly is gapped at this position, work out the proper indel to record here (at 100%).
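
A minimal sketch of the coordinate mapping in step 1a and the deletion shift in step 1b. It assumes the MUSCLE alignment has already been read into two gapped strings of equal length, and it handles one chromosome/segment per mapper. The names `CoordMapper` and `vcf_deletion` are hypothetical, and the deletion helper assumes the input position is that of the first deleted base, which should be checked against what V-Phaser actually reports.

```python
class CoordMapper(object):
    """Map 1-based positions between two sequences via their pairwise alignment.

    Hypothetical helper: takes two gapped strings (as in an aligned fasta),
    it does not run MUSCLE itself.
    """

    def __init__(self, aligned_a, aligned_b, gap='-'):
        if len(aligned_a) != len(aligned_b):
            raise ValueError('aligned sequences must have equal length')
        self._a2b = {}
        self._b2a = {}
        pos_a = pos_b = 0
        for base_a, base_b in zip(aligned_a, aligned_b):
            if base_a != gap:
                pos_a += 1
            if base_b != gap:
                pos_b += 1
            # A base opposite a gap maps to the nearest base on its left
            # in the other sequence (0 if there is none yet).
            if base_a != gap:
                self._a2b[pos_a] = pos_b
            if base_b != gap:
                self._b2a[pos_b] = pos_a

    def a_to_b(self, pos):
        """Position in b for 1-based position `pos` in a, or None if before b's start."""
        mapped = self._a2b[pos]
        return mapped if mapped > 0 else None

    def b_to_a(self, pos):
        """Position in a for 1-based position `pos` in b, or None if before a's start."""
        mapped = self._b2a[pos]
        return mapped if mapped > 0 else None


def vcf_deletion(ref_seq, first_deleted_pos, length):
    """Convert a deletion to VCF-style (pos, ref, alt).

    Assumes `first_deleted_pos` is the 1-based position of the first deleted
    base. VCF anchors the record one base earlier and includes that base in
    both REF and ALT.
    """
    if first_deleted_pos < 2:
        raise ValueError('deletion at sequence start needs special handling')
    anchor = first_deleted_pos - 1
    ref = ref_seq[anchor - 1:anchor + length]
    alt = ref_seq[anchor - 1]
    return anchor, ref, alt
```

Mapping a base that sits opposite a gap to its left-hand neighbour is only one possible choice. It happens to line up with how VCF anchors indels, but the real implementation may need something richer (e.g. returning an interval).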

The existing "vphaser_to_vcf" command is a good starting point for the code, but it doesn't do all of the above, and it also does a few steps that belong in #88 instead. In particular, it doesn't do the MUSCLE coordinate remapping and doesn't handle any consensus-level indels, because none of these existed in our Summer 2014 data set. They definitely exist in our current EBOV data. (They also existed in previous Lassa data sets, but we weren't using V-Phaser back then.)

Clearly, this step will need some decent unit testing...
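
One example of the kind of case worth covering is the allele padding in step 2a, where a SNP, a deletion and an insertion reported at the same position have to share one REF allele. The sketch below uses a hypothetical helper `pad_alleles` (not an existing function in intrahost.py) and assumes every variant's ref allele starts at the same position, so that each is a prefix of the longest one.

```python
import unittest


def pad_alleles(variants):
    """Merge (ref, alt) pairs that start at the same position into one row.

    Returns (ref, alts), where ref is the longest ref allele and every alt
    has been extended with the reference bases its own ref did not cover.
    Hypothetical helper for illustration only.
    """
    longest_ref = max((ref for ref, _ in variants), key=len)
    alts = []
    for ref, alt in variants:
        if not longest_ref.startswith(ref):
            raise ValueError('ref alleles disagree: %s vs %s' % (ref, longest_ref))
        padded = alt + longest_ref[len(ref):]
        # Skip duplicates and anything that collapses back to the reference.
        if padded != longest_ref and padded not in alts:
            alts.append(padded)
    return longest_ref, alts


class TestPadAlleles(unittest.TestCase):

    def test_snp_deletion_insertion_at_same_position(self):
        # SNP A>G, deletion ACG>A, insertion A>AT, all at one position
        ref, alts = pad_alleles([('A', 'G'), ('ACG', 'A'), ('A', 'AT')])
        self.assertEqual(ref, 'ACG')
        self.assertEqual(alts, ['GCG', 'A', 'ATCG'])

    def test_conflicting_refs_are_rejected(self):
        with self.assertRaises(ValueError):
            pad_alleles([('A', 'G'), ('CG', 'C')])
```

Saved to a file, this can be run with `python -m unittest <module>`. Similar small cases for coordinate remapping, allele flipping and the consensus fill-in of step 2b would follow the same pattern.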

@dpark01
Member Author

dpark01 commented Feb 20, 2015

depends on #99

@dpark01 dpark01 self-assigned this Feb 26, 2015
@dpark01
Member Author

dpark01 commented Feb 28, 2015

Finished an initial implementation that has bugs and will need some fixing. Started sketching out test cases and implemented some of them, but not all.

dpark01 added a commit that referenced this issue Mar 3, 2015
implement iSNV step 2 (merge to VCF) - closes #89