Skip to content

Flexible linear mixed model framework for Genome Wide Association Studies

License

Notifications You must be signed in to change notification settings

birneylab/flexlmm

Repository files navigation

birneylab/flexlmm birneylab/flexlmm

Nextflow run with conda run with docker run with singularity

Introduction

birneylab/flexlmm is a bioinformatics pipeline that runs linear mixed models for Genome-Wide Association Studies. Its main advantage compared to similar tools is that it is very flexible in the definition of the statistical model to be used and it performs permutations to correct for multiple testing and non-normal phenotypes. p-values are evaluated with a likelyhood ratio test between a 'null model' and a 'real model'. Both models are specified by the user with the R formula interface.

Disclaimer: this pipeline uses the nf-core template but it is not part of nf-core itself.

birneylab/stitchimpute_metro_map

  1. Convert vcf genotypes to pgen format (plink2)
  2. Compute the relatedness matrix for the whole genome and each LOCO subset (plink2)
  3. Verify that the statistical model specified is nested in the null model (R language)
  4. Estimate variance components using the null model fixed effects and the relatedness matrix (gaston)
  5. Compute the Cholesky decomposition of the phenotype variance-covariance matrix (R language)
  6. Remove the covariance structure from the phenotypes and fixed effect covariates (R language)
  7. Fit the null and complete models for each SNP, and compute a p-value using a likelyhood ratio test (R language)
  8. Fit the model and compute p-values for each permutation of the genotypes (R language)
  9. Compute the significance threshold using the Westfall–Young minP approach (R language)
  10. Make the final plots (ggplot2):
    • Manhattan plot of the associations
    • Quantile-quantile plots
    • Heatmap of the relatedness matrices (ComplexHeatmap)

Capabilities

  • Permutations are reproducible since the permutation index is used as a random seed
  • Plink2 used wherever possible
  • Model fitting done by loading in memory only one SNP at a time directly from the pgen file and using low-level internal R routines to increase performance
  • Can test for dominance and interactions of the genotype with fixed covariates (for example, to test for GxE and GxG)
  • Can standardize or quantile-normalize the phenotypes
  • Can include quantitative and/or categorical covariates
  • Permutations can be done within subgroups specified by a categorical covariate
    • This is useful when distinct sub-populations are present, such as in the case of multiple F2 crosses

To be implemented

  • Quantitative phenotypes (case-control studies)

The formula interface

The central concept of this pipeline is that of testing weather a certain statistical model (the 'real model') is significantly better than another statistical model (the 'null model'). The ratio of likelyhoods of the 2 models follows a $\chi^2$ distribution, from which it is possible to extract a p-value. This approach is known as Likelihood-ratio test. A condition for this test is that the null model must be nested in the real model. That is, the real model must include all the terms present in the null model plus at least one extra term.

In this pipeline both models are supplied by the user in the form of R formulas using the parameters null_model_formula and model_formula. In the formulas is possible to refer to the genotype of a given SNP with the term x, and to the phenotype with the term y. An intercept is automatically included but can be removed by adding the term 0 or -1 to the models. Derived quantities such as a dominance term (i.e. x==1 for genotypes encoded as 0,1,2) can be used but must be enclosed in parenthesis. Names of the columns in the covariate and quantitative covariate file can also be used. Consult the parameter documentation to see how these files should be formatted.

A valid real model and null model formula pair is for example:

model_formula: y ~ x + (x==1) + x:cov1 + cov1
null_model_formula: y ~ cov1

Here, cov1 is the name of one of the columns of the file specified via --covar or --qcovar. The p-value that you will obtain in this case will indicate if any of the terms x, (x==1), and x:cov1 have an effect significantly different from 0, when an intercept and cov1 are accounted for.

The following formulas instead are NOT valid:

model_formula: y ~ x + (x==1) + x:cov1
null_model_formula: y ~ cov1

This is because the null model contains the term cov1 which is not present in the model, and thus the formulas are not nested.

Technical details

This pipeline fits a Leave-One-Chromosome-Out (LOCO) mixed model using the 'null model' terms as fixed effects and the realised relatedness matrix as correlation matrix to estimate the variance components. The variance components are then used to estimate the variance-covariance matrix of the phenotypes. The square root of the variance-covariance matrix is determined via Cholesky decomposition, and this decomposition is used to rotate the response vector and the design matrix for each 'real model' before running Ordinary Least Squares (OLS). Except for the fact that the fixed effect SNPs are not included in the variance components estimation (for performance reasons), fitting OLS to the rotated response and design matrix is mathematically equivalent to fitting Generalised Least Squares (i.e. fitting a mixed model) to the original response and design matrix. This is a fairly standard approach used for example here.

Permutations are run on the genotype vectors jointly, so that each genotype is permuted in the same way and linkage disequilibrium is maintained. Since phenotypes and covariates are not permuted, also their relationship is not altered. Genotype permutations are performed AFTER the Cholesky rotation, so that the relatedness structure is regressed out from the correct samples and only the residuals from this operation are permuted. This corresponds to a null hypothesis where the exchangeable quantities are the genotype identities when relatedness is accounted already for. See here for an example of this approach being used in practice.

Significance thresholds for a given nominal significance level are reported using Bonferroni correction and Westfall–Young permutations (see here). If $m$ permutations are performed, the significance threshold is set as the $t$ quantile of the empirical distribution given by the minimum p-values for each permutation (in total a set of $m$ p-values), where $t$ is the nominal significance desired.

Integration with birneylab/stitchimpute

In order to use a vcf file obtained from the birneylab/stitchimpute pipeline, activate the stitch profile with the flag -profile stitch. This correctly loads the dosage information and fills missing genotypes.

Birneylab-specific information

For ease of use, the ideal settings for stitch for medaka samples have been specified in a profile called medaka. This can be activated with the flag -profile medaka. Always use this profile when working with medaka samples. If the genotypes that you are using have been obtained with the birneylab/stitchimpute pipeline, you should also use the stitch profile. In this case the flag to be used is -profile medaka,stitch

Usage

Note If you are new to Nextflow and nf-core, please refer to this page on how to set-up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Since just a few files are required to run this pipeline, differently from other pipelines a samplesheet is not used. Instead, the required files are all specified with dedicated parameters.

You can run the pipeline using:

nextflow run birneylab/flexlmm \
   -profile <docker/singularity/.../institute> \
   --vcf input.vcf.gz \
   --pheno input.pheno \
   --model_formula 'y ~ x' \
   --null_model_formula 'y ~ 1' \
   --outdir <OUTDIR>

Warning: Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Warning: It is highly recommended to use the docker or singularity profile. Some processes do not have a working conda configuration.

Pipeline output

For more details about the output files and reports, please refer to the output documentation.

Credits

birneylab/flexlmm was originally written by Saul Pierotti.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

About

Flexible linear mixed model framework for Genome Wide Association Studies

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published