Skip to content
Baoxing Song edited this page Oct 4, 2017 · 1 revision

Welcome to the Irisas wiki!

We exemplified that a large number of INDELs could be mis-classified as multi-allelic with potentially different functional annotations, which would certainly undermine the association. We developed software Irisas (Integrated region-based variants synchronization and Annotation for association studies) to reclassify and re-annotate these variants. In addition, we proposed a robust measure that integrates the predicted functional impact of SNPs, INDELs, and structure variants for burden analysis. The association using our synchronized INDELs and integrated burden analysis explained a large proportion of missing heritability. The novel loci that previous GWA studies have failed to associate contain established candidate genes.

Algorithm

Variants synchronization

Problem of the currently available solution

A variant record is generally represented with chromosome name, coordinate position, reference allele and alternate allele. However, the same sequence variant could be represented in different ways in a variants file (sdi/vcf format) due to alternative alignment. Adrian Tan et al. designed an algorithm to synchronize variants records. They distinguished variants synchronization as two problems, decomposition/reconstruction of variants and normalization. They proposed a left alignment and parsimonious algorithm for normalization. And this algorithm is widely implemented. While there is no standard way for decomposition or reconstruction which could result into variants synchronization failure (https://genome.sph.umich.edu/wiki/Variant_Normalization, https://github.com/atks/vt/issues/16).

Overview

In theory, multiple sequence alignment is a standard way to synchronize variants records. However, the computational complexity of multiple sequence alignment increasing exponentially with the increase of sequence length. The computing time and RAM (random accessible memory) cost of eukaryotic whole chromosome multiple sequence alignment is beyond the ability of currently available mainstream hardware. Here we implemented a pipeline that 1) cuts the whole genome sequence into fragments with a sliding window, 2) perform multiple sequence alignment on each fragment, 3) assembly all the multiple sequence alignment results on the chromosome together, 4) call synchronized variants with the whole chromosome multiple sequence alignment results.

Details

With variants records and reference genome sequence, the (pseudo-)genome sequence of each whole-genome-wide re-sequenced accession/line could be inferred. And we designed an algorithm which could transform/project a certain reference genome coordinate position to the coordinate position of another genome, we defined is as liftover. So, if we cut the reference genome sequence into fragments with a sliding window, we could get the genome sequence of re-sequenced accession for each window by liftover the window-start and -end coordinates.

In Irisas, we use an overlapping window to cut the whole genome sequence. For each window, is has a start coordinate and an end coordinate. The window-end of previous window is neighboring with the window-start of next window. We define

reference start= window start-overlap size  (1)

and

reference end=window end+overlap size       (2)

We got the reference sequence with reference start and reference end. We liftover the window start coordinates and window end coordinates to each re-sequenced accession/line and define

target start= window start liftover-overlap size   (3)

and

target end= window end liftover+overlap size    (4)

And the genome sequences of each re-sequence accessions/lines for each window were got with its target start coordinate and target end coordinate. There are two reasons that there should be an overlap,1) the original, non-synchronized variants could blur the liftover for a certain region, the windows overlap could put the ambiguous regions within the same sequence fragment, 2) the same sequence variation could be represented as different variant entries. It is important to put those different entries of same variant within the same fragments for synchronization purpose. The multiple sequence alignment could be performed on each window. With a reasonable window and overlap size, the computational and RAM costing of multiple sequence alignment is acceptable.

We trim each multiple sequence alignment results basing on the window start and window end, so that for each window, the reference start and reference end just equal with the window start and window end. Then we assemble all the window fragments from the same chromosome together. After trimming, the reference sequences could be linked together directly. For a certain re-sequenced accession, if there is still overlap between two neighboring windows, the multiple sequence alignment result of the previous window would be taken for the overlap region. If there is a gap between two neighboring windows for a certain re-sequenced genome, we will create a deletion variants entry. The synchronized variants would be called by comparing the sequence alignment of each re-sequenced genome with the reference genome pair-wisely, base-by-base. The one base insertions with the same position would be merged as a single variant entry. And the one base deletions with serial position would be merged as a single variant entry.

Variant annotation

As described previously in our publication, alternative gene models often restore protein-coding in a natural accession. In addition, insertions and deletions can be called in different forms at different genomic positions, as shown in Fig.1a. Evaluating a gene’s ORFS based on each mutation independently would cause serious problem. The de novo annotation method suggested in the previous study6 is computationally intensive and requires the additional transcriptome sequence data, which could be unavailable in certain cases. As re-annotation indicated that the conservation of the protein-coding has a dominate effect6, we thus proposed here a light-weight algorithm to detect the change of ORFS in a gene. The algorithm in Irisas includes three steps: (a) obtaining the assembled haplotype of the gene; (b) aligning both the CDS and the protein sequence of reference to the haplotype sequence, and annotating the gene accordingly; (c) merging the two gene models from step b and the direct liftover result, the most conserved version was chosen for the ORFS. The details are as follows:

(a) CDS based orthologous annotation

The genome sequences of each gene from each accession were extracted from the assembled sequence including 1kbp upstream and downstream region. The CDS sequences of reference annotation was mapped to the genome sequence using exonerate (V2.32.4) with parameters “--maxintron 30000 --model est2genome -i -10 --score 10 --bestn 1 --minintron 10”.

(b) Protein based orthologous annotation

The protein sequences of TAIR10 was mapped to the haplotype sequence using exonerate with parameters: “--bestn 1 --maxintron 30000 --intronpenalty -10 --model protein2genome --percent 10 --score 10 --minintron 10”.

(c) Annotations for the ORFS

The gene annotation generated with the above two methods and direct liftover were checked with the following criterions:

  1. Are the start codon and stop codon intact?

  2. Has any of the splicing sites been interrupted? The splicing sites are same with reference sequences or follow the GT-AG rule13 or one of GC-AG, GG-AG, GT-TG, GT-CG, CT-AG, etc.?

  3. Is there any premature stop codon?

  4. Is the full length of CDS (protein coding sequences) sequence of the gene is dividable by 3?

If the ORFS was annotated as shift with all the above three different annotation, the ORFS was annotated as ORF-shift else the ORFS was annotated as ORF-conserved.