Each sample's iSNV calls start out as a separate set. Each set is in its own coordinate space (having been called against that sample's own consensus sequence) and has its own relative definition of "reference" and "alternate" alleles. We'd like to merge all of these into a single master table, using consistent coordinates and alleles, and a standard file format (e.g. VCF). Make a new command in intrahost.py (perhaps called something like "merge_to_vcf").
input:
each individual sample's consensus assembly (fasta) that describes its reference alleles and coordinate space.
a master reference assembly (fasta) that describes our target coordinate space and reference alleles for the output file.
output: a single VCF file
Steps:
For each sample:
a. Align individual assembly to master reference with MUSCLE. Create custom code that reads that alignment and is able to generally map arbitrary coordinates (chr name, position) between the two coordinate spaces at will.
b. Map all iSNV calls to the target coordinate space. Shift the position of deletions by one to match the VCF convention for representing deletions.
c. Flip alleles (ref & alt) if necessary to match the target reference genome.
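The per-sample steps above could be sketched roughly as follows. This is only an illustration, not the code in intrahost.py: the function names (`build_coord_map`, `deletion_to_vcf`, `orient_to_reference`) are hypothetical, the two alignment rows are assumed to have already been read from the MUSCLE output as equal-length gapped strings, and the deletion helper assumes the caller reports the 1-based position of the first deleted base, which may not match V-Phaser's actual convention.

```python
def build_coord_map(aligned_sample, aligned_ref):
    """Map 1-based sample positions to 1-based master reference positions.

    Both arguments are rows of the same pairwise alignment ('-' = gap).
    A sample base inside an insertion relative to the reference maps to
    the nearest reference base on its left (0 if there is none).
    """
    if len(aligned_sample) != len(aligned_ref):
        raise ValueError("alignment rows must have equal length")
    mapping = {}
    s_pos = r_pos = 0
    for s_base, r_base in zip(aligned_sample, aligned_ref):
        if r_base != '-':
            r_pos += 1
        if s_base != '-':
            s_pos += 1
            mapping[s_pos] = r_pos
    return mapping


def deletion_to_vcf(first_deleted_pos, length, ref_seq):
    """Return (pos, ref, alt) for a deletion, anchored on the preceding base.

    Assumes first_deleted_pos is 1-based and >= 2 (hypothetical input
    convention); VCF puts POS on the base just before the deleted bases.
    """
    anchor = first_deleted_pos - 1
    ref = ref_seq[anchor - 1:anchor + length]
    alt = ref_seq[anchor - 1]
    return anchor, ref, alt


def orient_to_reference(allele_freqs, ref_allele):
    """Reorder one sample's alleles so the master reference allele is first.

    allele_freqs maps allele -> frequency as observed in the sample; the
    reference allele gets frequency 0.0 if the sample never showed it.
    """
    alts = sorted((a for a in allele_freqs if a != ref_allele),
                  key=lambda a: (-allele_freqs[a], a))
    alleles = [ref_allele] + alts
    freqs = [allele_freqs.get(ref_allele, 0.0)] + [allele_freqs[a] for a in alts]
    return alleles, freqs
```

For example, `build_coord_map("AC-GTT", "ACAG-T")` maps sample position 3 to reference position 4, because the sample is missing the reference's third base. Chromosome-name mapping and multi-segment genomes are left out of this sketch.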
Across all samples:
a. Sort and group all data per genomic position. The VCF format describes one row per position in the genome. This requires us to merge together data from all files into a single output row since our input describes each sample separately. But more than that, V-Phaser often describes a single position multiple times for a single sample, if multiple types of variants (SNPs, insertions, deletions) are seen at the same place. In this step, we must merge all variants together across all samples, and pad out their allele lengths with extra genome sequence if necessary.
b. If V-Phaser had no output at a given position for a sample, reach back into that sample's consensus assembly and record that consensus allele at 100% (V-Phaser only reports variant positions). If the consensus assembly is an N (insufficient read support at this position), don't do this. If the consensus assembly is gapped at this position, figure out the proper indel to put in this place (at 100%).
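The cross-sample steps above might look something like the sketch below. It is a simplified illustration under stated assumptions: `merge_calls` and the `consensus_base` lookup callback are hypothetical names, all calls are assumed to be already mapped into master reference coordinates, and the harder cases (padding allele lengths with extra genome sequence, and turning a gapped consensus into a 100% indel) are deliberately not handled.

```python
from collections import defaultdict


def merge_calls(calls, samples, consensus_base):
    """Group per-sample calls into one row per genomic position.

    calls: iterable of (chrom, pos, sample, {allele: freq}); one sample
        may appear several times at the same position.
    samples: list of all sample names.
    consensus_base: callable (sample, chrom, pos) -> consensus base in
        master coordinates, or None if unknown (hypothetical helper).
    Returns a sorted list of (chrom, pos, {sample: {allele: freq}}).
    """
    by_site = defaultdict(lambda: defaultdict(dict))
    for chrom, pos, sample, allele_freqs in calls:
        # Merge repeated records (SNP, insertion, deletion) for one sample.
        by_site[(chrom, pos)][sample].update(allele_freqs)

    rows = []
    for chrom, pos in sorted(by_site):
        per_sample = dict(by_site[(chrom, pos)])
        for sample in samples:
            if sample in per_sample:
                continue
            base = consensus_base(sample, chrom, pos)
            # Skip positions with no usable consensus (N = low support).
            if base is None or base.upper() == 'N':
                continue
            # No variant call here: record the consensus allele at 100%.
            per_sample[sample] = {base: 1.0}
        rows.append((chrom, pos, per_sample))
    return rows
```

Each returned row would then still need its alleles oriented against the master reference before being written out as a VCF line.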
The existing "vphaser_to_vcf" command is a good starting point to refer to for the code, but it doesn't do all of the above, and it also does a few steps that belong in #88 instead. In particular, it doesn't do the MUSCLE coordinate remapping and doesn't handle any consensus-level indels, because none of these existed in our Summer 2014 data set, but they definitely exist in our current EBOV data. (They existed in previous Lassa data sets, but we weren't using V-Phaser back then.)
Clearly, this step will need some decent unit testing...
Finished an initial implementation that has bugs and will need some fixing. Started sketching out test cases and implemented some of them, but not all.