
1. How treeWAS Works

caitiecollins edited this page Feb 24, 2018 · 3 revisions

The treeWAS Approach

The approach adopted within treeWAS is described fully in our paper, available in PLOS Computational Biology.

As a GWAS approach, treeWAS performs an unbiased search for statistically significant associations between a phenotype of interest and the genotype at all loci in a genetic dataset. No prior hypotheses about potential associations at candidate loci are required. Instead, a statistical approach is used to compare the genomes of individuals and to identify systematic differences in genotype that correspond to differences in the phenotype.

In addition to measuring associations and identifying statistically significant findings, a central aim of treeWAS is to control for the confounding effects of clonal population structure, population stratification (overlap between the genetic ancestry and phenotypic states of individuals that gives rise to spurious associations), and homologous recombination. Our approach uses data simulation to disentangle genuine associations, with statistical significance and evolutionary support, from the noisy background of spurious associations arising by chance and from confounding factors. treeWAS simulates a "null" genetic dataset in such a way as to maintain several features of the empirical dataset, namely its clonal genealogy, terminal phenotype, genetic composition, and homoplasy distribution (the number of substitutions per site due to both mutation and recombination). The simulated dataset therefore captures these potentially confounding features of the dataset under analysis without recreating any of the "true" associations beyond those expected to arise by chance or as a result of these confounding factors.
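To make the idea of a tree-respecting "null" locus concrete, here is a toy Python sketch. It is not the treeWAS implementation. It assumes a binary locus, a tree given as a list of (ancestor, descendant, length) edges in root-to-tip order with node 0 as the root, and substitutions placed on branches with probability proportional to branch length. The function name `simulate_null_locus` and all of its parameters are hypothetical.

```python
import random

def simulate_null_locus(edges, n_nodes, n_subs, rng):
    """Simulate one binary locus along a fixed tree (illustrative only).

    edges   : list of (ancestor, descendant, length), ancestors listed
              before their descendants; node 0 is assumed to be the root
    n_nodes : total number of nodes (internal + terminal)
    n_subs  : number of substitutions to place on the tree for this locus
    rng     : a random.Random instance
    """
    lengths = [length for _, _, length in edges]
    # Assumption: each substitution lands on a branch with probability
    # proportional to that branch's length.
    hit = rng.choices(range(len(edges)), weights=lengths, k=n_subs)
    flips = [0] * len(edges)
    for e in hit:
        flips[e] += 1

    state = [None] * n_nodes
    state[0] = rng.randint(0, 1)  # random root state
    for i, (a, d, _) in enumerate(edges):
        # An odd number of substitutions on a branch flips the state.
        state[d] = state[a] ^ (flips[i] % 2)
    return state

# Example tree: root 0, internal nodes 1 and 2, terminal nodes 3-6.
example_edges = [(0, 1, 1.0), (0, 2, 1.0),
                 (1, 3, 0.5), (1, 4, 0.5),
                 (2, 5, 0.5), (2, 6, 0.5)]
rng = random.Random(1)
# One simulated locus per substitution count, here 0, 1 and 3 substitutions.
null_loci = [simulate_null_locus(example_edges, 7, n, rng) for n in (0, 1, 3)]
```

Because every simulated locus is drawn along the same tree and with a chosen number of substitutions, it inherits the genealogy and a homoplasy level from the inputs, but it is generated without reference to the phenotype.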

Once the "null" genetic dataset has been generated, treeWAS calculates association scores for loci in both the real and simulated datasets and compares the two. The association between each simulated locus and the empirical phenotype is measured and, collectively, these values form a null distribution of association score statistics. The degree of association between each locus in the empirical genetic dataset and the empirical phenotype is measured using the same association scores. At the upper tail of the null distribution, a threshold of significance is drawn at the quantile corresponding to: 1 - (a base p-value corrected for multiple testing). Any loci in the real dataset whose association scores lie above this threshold are deemed to be significantly associated with the phenotype.
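The thresholding step can be sketched in a few lines of Python. This is a minimal illustration, not treeWAS code: it assumes a Bonferroni correction (base p-value divided by the number of loci tested) and an empirical quantile computed by linear interpolation. The names `significance_threshold`, `significant_loci`, `base_p` and `n_tests` are hypothetical.

```python
import math

def significance_threshold(null_scores, base_p, n_tests):
    """Score at the upper-tail quantile 1 - (corrected p-value)."""
    corrected_p = base_p / n_tests  # assumed Bonferroni correction
    q = 1.0 - corrected_p
    ordered = sorted(null_scores)
    # Empirical quantile via linear interpolation between order statistics.
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

def significant_loci(real_scores, null_scores, base_p):
    """Indices of real loci whose score lies above the null threshold."""
    threshold = significance_threshold(null_scores, base_p, len(real_scores))
    return [i for i, s in enumerate(real_scores) if s > threshold]

# Toy data: 101 null scores evenly spaced on [0, 1], five real loci.
toy_null = [i / 100 for i in range(101)]
toy_real = [0.5, 0.995, 0.2, 1.0, 0.98]
hits = significant_loci(toy_real, toy_null, base_p=0.05)
```

With a base p-value of 0.05 and five loci, the corrected p-value is 0.01, so the threshold sits at the 0.99 quantile of the null scores and only the loci scoring above it are reported.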

Tests of Association

When measuring association between the phenotype of interest and the genotype, by default, three separate association scores are calculated for each locus in the genetic dataset.

Equations below use the following notation:

  • G = Genotypic state...
  • P = Phenotypic state...
  • a = ... at ancestral nodes
  • d = ... at descendant nodes
  • Nterm = Number of terminal nodes

Terminal:

The terminal test solves the following equation, for each genetic locus, at the terminal nodes of the tree only:

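Terminal = | (1/Nterm) x Σ((Pd x Gd) - (1 - Pd)Gd - Pd(1 - Gd) + (1 - Pd)(1 - Gd)) |

Here Σ denotes the sum over the Nterm terminal nodes, so the score is the absolute value of a per-node average.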

The terminal test is a sample-wide test of association that seeks to identify broad patterns of correlation between genetic loci and the phenotype, without relying on inferences drawn from reconstructions of the ancestral states.
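As a worked illustration of the terminal score, the following Python sketch evaluates it for a single locus. It assumes the bracketed term is summed over terminal nodes before dividing by Nterm, and that states are coded numerically (0/1 in the example). The function name `terminal_score` is hypothetical.

```python
def terminal_score(phen, geno):
    """Terminal score for one locus.

    phen, geno : phenotypic and genotypic states at the terminal nodes,
                 in the same order (coded 0/1 in this example)
    """
    n_term = len(phen)
    total = 0.0
    for p, g in zip(phen, geno):
        # For binary states this term is +1 when p == g and -1 otherwise.
        total += p * g - (1 - p) * g - p * (1 - g) + (1 - p) * (1 - g)
    return abs(total / n_term)

perfect = terminal_score([1, 1, 0, 0], [1, 1, 0, 0])   # states always match
inverse = terminal_score([1, 1, 0, 0], [0, 0, 1, 1])   # states never match
unrelated = terminal_score([1, 1, 0, 0], [1, 0, 1, 0]) # match half the time
```

For binary data the score is 1 when genotype and phenotype agree at every terminal node or disagree at every terminal node, and 0 when they agree at exactly half of them, which is why the absolute value captures both positive and negative association.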


Simultaneous:

The simultaneous test solves the following equation, for each genetic locus, across each branch in the tree:

Simultaneous = | (Pa - Pd)(Ga - Gd) |

This allows for the identification of simultaneous substitutions in both the genetic locus and phenotypic variable on the same branch of the phylogenetic tree (or parallel change in non-binary data). Simultaneous substitutions are an indicator of a deterministic relationship between genotype and phenotype. Moreover, because this score is not negatively impacted by the lack of association on other branches, it may be able to detect associations occurring through complementary pathways (i.e., in some clades but not others).
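A small Python sketch of the simultaneous score follows. It assumes the per-branch products are summed over all branches before the absolute value is taken, which the text implies but does not state explicitly, and that ancestral states have already been reconstructed. The edge-list representation and the name `simultaneous_score` are hypothetical.

```python
def simultaneous_score(edges, phen, geno):
    """Simultaneous score for one locus.

    edges : list of (ancestor, descendant) node-index pairs, one per branch
    phen  : phenotypic state at every node (terminal and reconstructed)
    geno  : genotypic state at every node (terminal and reconstructed)
    """
    total = 0
    for a, d in edges:
        # Non-zero only when both phenotype and genotype change on this branch.
        total += (phen[a] - phen[d]) * (geno[a] - geno[d])
    return abs(total)

# Example tree: root 0, internal nodes 1 and 2, terminal nodes 3-6.
tree_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
phen_states = [0, 0, 1, 0, 0, 1, 1]  # phenotype changes on branch 0 -> 2

same_branch = simultaneous_score(tree_edges, phen_states, [0, 0, 1, 0, 0, 1, 1])
other_branch = simultaneous_score(tree_edges, phen_states, [0, 0, 0, 1, 0, 0, 0])
no_change = simultaneous_score(tree_edges, phen_states, [0] * 7)
```

Only the first locus, whose substitution falls on the same branch as the phenotypic change, contributes a non-zero score. Branches where just one of the two variables changes contribute nothing.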


Subsequent:

The subsequent test solves the following equation, for each genetic locus, across each branch in the tree:

Subsequent = | 4/3(Pa x Ga) + 2/3(Pa x Gd) + 2/3(Pd x Ga) + 4/3(Pd x Gd) - Pa - Pd - Ga - Gd + 1 |

Calculating this metric across all branches of the tree allows us to measure in what proportion of tree branches we expect the genotype and phenotype to be in the same state. By drawing on inferences from the ancestral state reconstructions as well as the terminal states, this score may allow us to identify broad, if imperfect, patterns of association.
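The subsequent score can be illustrated in the same way. The Python sketch below assumes the per-branch expression is summed over all branches before the absolute value is taken, and that ancestral states have already been reconstructed. The edge-list representation and the name `subsequent_score` are hypothetical.

```python
def subsequent_score(edges, phen, geno):
    """Subsequent score for one locus.

    edges : list of (ancestor, descendant) node-index pairs, one per branch
    phen  : phenotypic state at every node (terminal and reconstructed)
    geno  : genotypic state at every node (terminal and reconstructed)
    """
    total = 0.0
    for a, d in edges:
        pa, pd, ga, gd = phen[a], phen[d], geno[a], geno[d]
        # For binary states: +1 when all four states agree,
        # -1 when genotype and phenotype are constant but opposite.
        total += (4 / 3 * pa * ga + 2 / 3 * pa * gd
                  + 2 / 3 * pd * ga + 4 / 3 * pd * gd
                  - pa - pd - ga - gd + 1)
    return abs(total)

# Example tree: root 0, internal nodes 1 and 2, terminal nodes 3-6.
branch_list = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
phen_all = [0, 0, 1, 0, 0, 1, 1]

matching = subsequent_score(branch_list, phen_all, phen_all)
opposite = subsequent_score(branch_list, phen_all, [1 - p for p in phen_all])
```

In this example five branches keep genotype and phenotype in the same state throughout (each contributing 1) and one branch carries a simultaneous change (contributing 1/3), so the matching locus scores 5 + 1/3. The fully opposite locus reaches the same value because of the absolute value.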