
3. treeWAS Function & Arguments

caitiecollins edited this page Jun 11, 2024 · 14 revisions

Running treeWAS

Running treeWAS requires only a single function call. The function needs two inputs: snps, a matrix containing binary genetic data, and phen, a vector containing the phenotype of each individual in your dataset. If you have already built a phylogenetic tree, you can supply it via tree. See Data for more details on inputs, and read Arguments below to tailor your analysis and outputs. Depending on the size of the dataset, treeWAS should finish running within a couple of minutes.


out <- treeWAS(snps = snps,
                phen = phen,
                tree = tree,
                seed = 1)


Arguments

The treeWAS function takes the following arguments:


## Don't run this:
out <- treeWAS(snps,
                phen,
                tree = c("BIONJ", "NJ", "parsimony", "BIONJ*", "NJ*"),
                phen.type = NULL,
                n.subs = NULL,
                n.snps.sim = ncol(snps)*10,
                chunk.size = ncol(snps),
                mem.lim = FALSE,
                test = c("terminal", "simultaneous", "subsequent"),
                correct.prop = FALSE,
                snps.reconstruction = "parsimony",
                snps.sim.reconstruction = "parsimony",
                phen.reconstruction = "parsimony",
                na.rm = TRUE,
                p.value = 0.01,
                p.value.correct = c("bonf", "fdr", FALSE),
                p.value.by = c("count", "density"),
                dist.dna.model = "JC69",
                plot.tree = TRUE,
                plot.manhattan = TRUE,
                plot.null.dist = TRUE,
                plot.dist = FALSE,
                snps.assoc = NULL, 
                filename.plot = NULL,
                seed = NULL)

snps : A matrix containing binary genetic data, with individuals in the rows and genetic loci in the columns and both rows and columns labelled.

phen : A vector containing the phenotypic state of each individual, whose length is equal to the number of rows in snps and which is named with the same set of labels. The phenotype can be either binary (character or numeric) or continuous (numeric).
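Because phen must be named with the same set of labels as the rows of snps, it can help to check this before calling treeWAS. The following is a minimal sketch on toy data; the objects and the reordering step are illustrative assumptions and not part of treeWAS itself.

```r
# Toy binary genetic data: 6 individuals (rows) x 4 loci (columns), both labelled.
set.seed(1)
snps <- matrix(rbinom(6 * 4, size = 1, prob = 0.5), nrow = 6,
               dimnames = list(paste0("ind", 1:6), paste0("locus", 1:4)))

# Toy binary phenotype, named with the same labels but in a different order.
phen <- c(ind3 = 1, ind1 = 0, ind6 = 1, ind2 = 0, ind5 = 1, ind4 = 0)

# Check that the two inputs describe the same set of individuals.
stopifnot(length(phen) == nrow(snps),
          setequal(names(phen), rownames(snps)))

# Reorder phen to follow the row order of snps (done here only for tidiness).
phen <- phen[rownames(snps)]
labels_match <- identical(names(phen), rownames(snps))
```

If the stopifnot() check fails, the labels of the two objects should be reconciled before running the analysis.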

tree : A phylo object containing the phylogenetic tree; or, a character string, one of "NJ", "BIONJ" (the default), or "parsimony"; or, if NAs are present in the distance matrix, one of: "NJ*" or "BIONJ*", specifying the method of phylogenetic reconstruction.

phen.type : An optional character string specifying whether the phenotypic variable should be treated as "categorical", "discrete" or "continuous". If phen.type is NULL (the default), ancestral state reconstructions performed via ML will treat any binary phenotype as discrete and any non-binary phenotype as continuous. If phen.type is "categorical", ML reconstructions and association tests will treat values as nominal (not ordered) levels and not as meaningful numbers. Categorical phenotypes must have >= 3 unique values (<= 5 recommended). If phen.type is "continuous", ML reconstructions will treat values as meaningful numbers and may infer intermediate values.
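The default rule for phen.type = NULL can be sketched as follows. Note that guess_phen_type() is a hypothetical helper written for this illustration, not a treeWAS function; it only mirrors the description above (binary is treated as discrete, anything else as continuous).

```r
# Hypothetical helper (not part of treeWAS): mimics the default rule described
# above for phen.type = NULL.
guess_phen_type <- function(phen) {
  n_levels <- length(unique(phen[!is.na(phen)]))
  if (n_levels == 2) "discrete" else "continuous"
}

binary_phen     <- c(a = "R", b = "S", c = "S", d = "R")
continuous_phen <- c(a = 0.2, b = 1.7, c = 3.1, d = 2.4)

guess_phen_type(binary_phen)      # "discrete"
guess_phen_type(continuous_phen)  # "continuous"

# A multi-level nominal phenotype has to be declared explicitly with
# phen.type = "categorical"; the text above requires at least 3 unique values.
categorical_phen   <- c(a = "red", b = "blue", c = "green", d = "red")
ok_for_categorical <- length(unique(categorical_phen)) >= 3
```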

n.subs : A numeric vector containing the homoplasy distribution (if known, see details), or NULL (the default).

n.snps.sim : An integer specifying the number of loci to be simulated for estimating the null distribution (by default 10*ncol(snps)). Note that 10x is the recommended minimum: where possible (i.e., for datasets that are not very large), simulating more loci (e.g., 100*ncol(snps)) may further improve results.

chunk.size : An integer indicating the number of snps loci to be analysed at one time. This provides a solution for machines with insufficient memory to analyse the dataset at hand. Note that smaller values of chunk.size will increase the computational time required (e.g., for chunk.size = ncol(snps)/2, treeWAS will take twice as long to complete).
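To make the trade-off concrete, the sketch below splits the column indices of a dataset into chunks of a given size. How treeWAS partitions loci internally is an assumption here; the code only illustrates why halving chunk.size doubles the number of passes over the data.

```r
n_loci     <- 1000          # stands in for ncol(snps)
chunk_size <- n_loci / 2    # the example value from the text above

# Number of passes needed to cover every locus.
n_chunks <- ceiling(n_loci / chunk_size)

# Column indices assigned to each chunk.
chunks <- split(seq_len(n_loci), ceiling(seq_len(n_loci) / chunk_size))

n_chunks          # 2
lengths(chunks)   # 500 loci in each of the 2 chunks
```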

mem.lim : Either a number or a logical value to establish a memory limit (in GB) that will be used to automatically update the chunk.size argument if there is not enough available memory to run treeWAS in one chunk. If FALSE (the default), no limit is estimated and chunk.size is not changed. If TRUE, the amount of memory currently available is estimated with memfree() and chunk.size is scaled back to account for the amount of memory estimated to be needed by treeWAS for this dataset. If a single numeric value, this is taken to be the amount of memory (in GB) available/designated for use by treeWAS and chunk.size is updated to reflect this.

test : A character string or vector containing one or more of the following available tests of association: "terminal", "simultaneous", "subsequent", "cor", "fisher". By default, the first three tests are run (see details).

correct.prop : A logical indicating whether the "terminal" and "subsequent" tests will be corrected for phenotypic class imbalance. Recommended if the proportion of individuals varies significantly across the levels of the phenotype (if binary) or if the phenotype is skewed (if continuous). If correct.prop is FALSE (the default), the original version of each test is run. If TRUE, an alternate association metric based on the phi correlation coefficient is calculated across the terminal and all (internal and terminal) nodes, respectively.
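For reference, the phi correlation coefficient between two binary variables can be computed from the four cells of their 2x2 contingency table, and it equals the Pearson correlation of the 0/1 vectors. The sketch below uses a hypothetical helper, phi_coef(), on toy vectors; the exact score that treeWAS computes when correct.prop = TRUE may differ in its details.

```r
# Hypothetical helper (not part of treeWAS): phi coefficient between two
# binary (0/1) vectors, computed from the 2x2 contingency table counts.
phi_coef <- function(x, y) {
  n11 <- as.numeric(sum(x == 1 & y == 1))
  n10 <- as.numeric(sum(x == 1 & y == 0))
  n01 <- as.numeric(sum(x == 0 & y == 1))
  n00 <- as.numeric(sum(x == 0 & y == 0))
  (n11 * n00 - n10 * n01) /
    sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
}

# Toy locus and toy imbalanced phenotype (3 of 10 individuals in state 1).
snp_toy  <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
phen_toy <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 1)

phi <- phi_coef(snp_toy, phen_toy)   # 11/21, about 0.524
```

Because phi is a correlation, it is scaled by the marginal frequencies of both variables, which is what makes it less sensitive to class imbalance than a raw count-based score.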

snps.reconstruction : Either a character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the genetic dataset, or a matrix containing this reconstruction if it has been performed elsewhere and you provide the tree.

snps.sim.reconstruction : A character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the simulated null genetic dataset.

phen.reconstruction : Either a character string specifying "parsimony" (the default) or "ML" (maximum likelihood) for the ancestral state reconstruction of the phenotypic variable, or a vector containing this reconstruction if it has been performed elsewhere.

na.rm : A logical indicating whether to remove snps columns if they contain more than 75% NAs (by default, TRUE).
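The 75% rule described above can be reproduced in base R as a quick pre-check of which loci would be dropped. This is a sketch on toy data, not the treeWAS implementation.

```r
# Toy data: 8 individuals x 3 loci.
set.seed(1)
snps <- matrix(rbinom(8 * 3, size = 1, prob = 0.5), nrow = 8,
               dimnames = list(paste0("ind", 1:8), paste0("locus", 1:3)))
snps[1:7, "locus2"] <- NA   # 87.5% missing: above the 75% cut-off
snps[1:2, "locus3"] <- NA   # 25% missing: retained

# Proportion of NAs per column, then keep columns at or below 75%.
na_prop   <- colMeans(is.na(snps))
snps_kept <- snps[, na_prop <= 0.75, drop = FALSE]

colnames(snps_kept)   # "locus1" "locus3"
```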

p.value : A number specifying the base p-value to be set as the threshold of significance (by default, 0.01).

p.value.correct : A character string, either "bonf" (the default) or "fdr", specifying whether correction for multiple testing should be performed by Bonferroni correction (recommended) or the False Discovery Rate.
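The difference between the two corrections can be illustrated with base R's p.adjust() on a toy vector of raw p-values. This is a generic illustration of Bonferroni versus FDR (Benjamini-Hochberg) correction; it is not how treeWAS derives its significance threshold, which is based on the simulated null distribution.

```r
p_value <- 0.01                                 # base threshold
p_raw   <- c(0.000001, 0.00002, 0.004, 0.03, 0.5)  # toy raw p-values

# Bonferroni: divide the base threshold by the number of tests ...
bonf_threshold <- p_value / length(p_raw)       # 0.002
n_sig_bonf     <- sum(p_raw < bonf_threshold)   # 2

# ... which is equivalent to comparing adjusted p-values to the base threshold.
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# FDR (Benjamini-Hochberg) is less conservative.
p_fdr     <- p.adjust(p_raw, method = "fdr")
n_sig_fdr <- sum(p_fdr < p_value)               # 3
```

On this toy vector, Bonferroni retains two findings and FDR retains three, which reflects the usual trade-off: Bonferroni controls false positives more strictly at the cost of power.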

p.value.by : A character string specifying how the upper tail of the p-value distribution is to be identified. Either "count" (the default, recommended) for a simple count-based approach or "density" for a kernel-density based approximation.
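The two approaches can be contrasted on a toy null distribution. The sketch below is an assumption about what "count" and "density" mean in general terms (an empirical quantile versus a quantile read off a kernel density estimate); treeWAS's internal computation is not documented here and may differ.

```r
set.seed(1)
null_scores <- rnorm(10000)   # toy stand-in for simulated association scores
p_value     <- 0.01

# Count-based: the score exceeded by a fraction p_value of the simulated scores.
thr_count <- quantile(null_scores, probs = 1 - p_value, names = FALSE)

# Density-based: the same upper-tail cut-off, read from a kernel density
# estimate by accumulating the estimated density over its evaluation grid.
dens        <- density(null_scores)
cdf         <- cumsum(dens$y) / sum(dens$y)
thr_density <- dens$x[which(cdf >= 1 - p_value)[1]]
```

With a large simulated null distribution the two thresholds tend to be close; the count-based version makes no smoothing assumptions, which is one reason it may be preferred as a default.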

dist.dna.model : A character string specifying the evolutionary model used to calculate the genetic distance between individual genomes when reconstructing the phylogenetic tree (by default, "JC69"). Only used if tree is a character string (see ?dist.dna).

plot.tree : A logical indicating whether to generate a plot of the phylogenetic tree (TRUE, the default) or not (FALSE).

plot.manhattan : A logical indicating whether to generate a manhattan plot for each association score (TRUE, the default) or not (FALSE).

plot.null.dist : A logical indicating whether to plot the null distribution of association score statistics (TRUE, the default) or not (FALSE).

plot.dist : A logical indicating whether to plot the true distribution of association score statistics (TRUE) or not (FALSE, the default).

snps.assoc : An optional character string or vector specifying known associated loci to be marked in results plots (e.g., from previous studies or if the data are simulated); else NULL.

filename.plot : An optional character string denoting the file location for saving any plots produced (e.g., "C:/Home/treeWAS_plots.pdf"); else NULL.

seed : An optional integer to control the pseudo-randomisation process and allow for identical repeat runs of the function; else NULL.
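Putting several of the arguments above together, a tailored call might be assembled as in the sketch below. The toy data and the particular argument values are assumptions chosen for illustration; the call itself is left commented out, as it requires the treeWAS package and real data.

```r
# Toy inputs (hypothetical data, for illustration only).
set.seed(1)
snps <- matrix(rbinom(20 * 30, size = 1, prob = 0.3), nrow = 20,
               dimnames = list(paste0("ind", 1:20), paste0("locus", 1:30)))
phen <- setNames(rbinom(20, size = 1, prob = 0.5), rownames(snps))

# Arguments that differ from the defaults listed above.
args <- list(snps            = snps,
             phen            = phen,
             tree            = "NJ",                 # build the tree by neighbour-joining
             n.snps.sim      = ncol(snps) * 100,     # simulate more loci than the 10x minimum
             test            = c("terminal", "subsequent"),
             p.value         = 0.05,
             p.value.correct = "fdr",
             plot.tree       = FALSE,
             seed            = 1)                    # identical repeat runs

## Don't run this without treeWAS installed:
# out <- do.call(treeWAS::treeWAS, args)
```

Collecting the arguments in a list makes it easy to keep a record of the settings used for a given run alongside the seed.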