Incorporating regulatory interactions into gene-set analyses for GWAS data

dgroenewoud/AUG-MAGMA

AUG-MAGMA

A controlled approach for incorporating regulatory interactions into gene-set analyses for GWAS data

If you use our approach, please cite our publication:

Groenewoud D, Shye A, Elkon R. Incorporating regulatory interactions into gene-set analyses for GWAS data: a controlled analysis with the MAGMA tool. [accepted for publication in PLOS Computational Biology]

You may wish to refer to our related page as well, which stores data related to this project: https://figshare.com/projects/MAGMA_with_Regulatory_Interactions/118056

Background

We provide an Rscript (current version: AUG-MAGMA-V1.01.R) that runs gene scoring and gene-set analysis with MAGMA in order to compare a baseline SNV-to-gene mapping (that is, a minimal SNV-to-gene mapping) with an augmented SNV-to-gene mapping (that is, an SNV-to-gene mapping built on top of the baseline SNV-to-gene mapping). For each mapping, MAGMA (a popular gene-set analysis tool for GWAS data: https://doi.org/10.1371/journal.pcbi.1004219) calculates gene scores from the SNV-level associations (taken from GWAS summary statistics) of the SNVs mapped to each gene. These gene scores are then fed into a competitive gene-set analysis to identify gene sets that are enriched for phenotype association. Gene scores and gene-set scores can be compared between the two mappings to evaluate the benefit of augmentation. To limit spurious discoveries, we control for non-specific effects with matched, random augmentation (see publication for details).

Requirements

  • Access to a Linux server
  • An installation of R version 3.5.3 or higher
  • An installation of the data.table package (version 1.12.2 or higher), the stringi package (version 1.4.3 or higher), the igraph package (version 1.2.4.1 or higher), the foreach package (version 1.4.4 or higher), the parallel package (version 3.5.3 or higher; bundled with R), and the doParallel package (version 1.0.15 or higher)
  • An installation of MAGMA (refer to: https://ctg.cncr.nl/software/magma)
  • A data set of GWAS summary statistics containing: (i) an identifier column with rs identifiers; (ii) a p-value column; (iii) a sample-size column (if not provided, add one manually and set it to the study sample size for every entry)
  • A relevant data set of binary files (refer to: https://ctg.cncr.nl/software/magma): (i) choose the population appropriate to the GWAS being analyzed; (ii) it is possible to use custom binary files with MAGMA at your own risk
  • A gene locations file for building SNV-to-gene mappings (refer to: https://ctg.cncr.nl/software/magma): (i) ensure that the genome build matches the binary files; (ii) it is possible to use a custom gene locations file with MAGMA at your own risk; (iii) it is also possible to avoid this file completely and to build your own SNV-to-gene mappings directly (however, we do not currently implement this)
  • A gene-set file (for examples, refer to: http://www.gsea-msigdb.org/gsea/downloads.jsp)
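
The positions of the identifier, p-value and sample-size columns are passed to the Rscript as column indices (see the --sumstat-id, --sumstat-pval and --sumstat-nsample flags under Running AUG-MAGMA). One quick way to look them up is sketched below. It assumes a whitespace-delimited file with a single header line and 1-based indices (R counts from 1, but check this against your own run); the toy file and its column names are made up for illustration.

```shell
#!/bin/sh
set -eu

# list_columns FILE: print each header field of FILE with its 1-based index.
list_columns() {
    head -n 1 "$1" | awk '{ for (i = 1; i <= NF; i++) print i, $i }'
}

# Toy summary statistics file (hypothetical content, for illustration only).
printf 'SNP\tCHR\tBP\tP\tN\n' > toy_sumstats.txt
printf 'rs0000001\t1\t12345\t0.042\t50000\n' >> toy_sumstats.txt

list_columns toy_sumstats.txt
```

For this toy file the identifier, p-value and sample-size columns are listed at positions 1, 4 and 5.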

Prerequisites

Users should be familiar with MAGMA and we recommend that they refer to the MAGMA manual for additional details. The manual can be downloaded from the MAGMA website (https://ctg.cncr.nl/software/magma).

Installation

Simply save the Rscript (AUG-MAGMA.R) into any directory of your choice.

Create an empty working directory and, within it, an empty subdirectory called "output". For convenience, we also recommend creating a subdirectory called "input" within the working directory, with five subdirectories of its own: "annotations" (for gene locations and SNV-to-gene mappings), "binaries" (for a set of binary files), "miscellaneous" (for an optional file containing a list of genes to exclude from the analyses), "sets" (for a gene-set file), and "sumstats" (for a GWAS summary statistics file).

./workdir/
          input/
                    annotations/
                    binaries/
                    miscellaneous/
                    sets/
                    sumstats/
          output/
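
The layout above can be created in one step. This is a minimal sketch that assumes the working directory is called "workdir" and lives in the current directory; the WORKDIR variable is introduced here only for convenience.

```shell
#!/bin/sh
set -eu

# Name of the working directory; "workdir" is only an example name.
WORKDIR="${WORKDIR:-workdir}"

# Required output directory plus the five recommended input subdirectories.
mkdir -p "$WORKDIR/output"
for sub in annotations binaries miscellaneous sets sumstats; do
    mkdir -p "$WORKDIR/input/$sub"
done

# List the resulting directories for a quick visual check.
find "$WORKDIR" -type d | sort
```

For the tutorial, the same sketch can be reused by setting WORKDIR=tutorial.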

Tutorial

Stage 1. We recommend this tutorial for first-time users to confirm that everything runs as it should. It examines the effect of incorporating 10kb flanks on top of gene bodies in a gene-set analysis of coronary artery disease GWAS summary statistics. Create the directory structure described above (under the Installation heading) and name the working directory "tutorial".

Stage 2. Build two SNV-to-gene mappings. The baseline SNV-to-gene mapping assigns SNVs to genes based on overlap with gene bodies. The augmented SNV-to-gene mapping assigns SNVs to genes based on overlap with either gene bodies or 10kb flanks. Our publication describes how to build custom SNV-to-gene mappings such as mappings that incorporate regulatory interactions. However, for this tutorial we can use the following general command to first build the baseline SNV-to-gene mapping and then the augmented SNV-to-gene mapping.

</path/to/magma-executable>   # define path to magma executable
      --annotate window=X,Y   # define upstream (X) and downstream (Y) flank-size in kb
      --snp-loc </path/to/binaries/prefix.bim>   # define path to .bim file (one of the binary files)
      --gene-loc </path/to/annotations/gene.loc.file>   # define path to gene locations file
      --out </path/to/annotations/genes_uXdY>   # define path and name of output file
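
For the tutorial, the general command is run twice. The sketch below assumes that the baseline mapping (gene bodies only) corresponds to window=0,0 and the augmented mapping to window=10,10. All paths and the RUN dry-run switch are hypothetical and should be adapted. By default the commands are only printed; running the script with RUN set to an empty string executes them.

```shell
#!/bin/sh
set -eu

# All paths below are placeholders; adapt them to your own setup.
MAGMA="${MAGMA:-/path/to/magma-executable}"
BIM="${BIM:-tutorial/input/binaries/prefix.bim}"
GENE_LOC="${GENE_LOC:-tutorial/input/annotations/gene.loc.file}"
ANNOT_DIR="${ANNOT_DIR:-tutorial/input/annotations}"

# Dry run unless RUN is set: commands are printed, not executed.
# Invoke the script with RUN="" to actually call MAGMA.
RUN="${RUN-echo}"

# build_mapping UP DOWN: annotate with UP/DOWN kb flanks around gene bodies.
build_mapping() {
    $RUN "$MAGMA" \
        --annotate window="$1,$2" \
        --snp-loc "$BIM" \
        --gene-loc "$GENE_LOC" \
        --out "$ANNOT_DIR/genes_u${1}d${2}"
}

build_mapping 0 0     # baseline: gene bodies only (assumed to mean zero flanks)
build_mapping 10 10   # augmented: gene bodies plus 10kb flanks
```

MAGMA appends .genes.annot to the --out prefix, so the two runs produce genes_u0d0.genes.annot and genes_u10d10.genes.annot in the annotations directory.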

Stage 3. Run the Rscript from the Linux command line (see below).

Running AUG-MAGMA

To run AUG-MAGMA, use the following general command in the command line:

nohup
</path/to/R-interpreter>   # define path to R interpreter
</path/to/Rscript>   # define path to Rscript for execution
      --magma </path/to/magma-executable>   # define path to magma executable
      --sumstat </path/to/summary-statistics-file>   # define path to summary statistics file
      --sumstat-id Q   # Q is column index of column containing rs-identifier
      --sumstat-pval W   # W is column index of column containing p-value
      --sumstat-nsample R   # R is column index of column containing sample size
      --binaries </path/to/binaries/prefix>   # define path to the set of binary files including their common prefix
      --baseline-model </path/to/annotations/baseline-prefix.genes.annot>   # define path to baseline SNV-to-gene mapping
      --augmented-model </path/to/annotations/augmented-prefix.genes.annot>   # define path to augmented SNV-to-gene mapping
      --gene-set-file </path/to/gene-set-file>   # define path to gene-set file
      --output </path/to/output>   # define directory for storing output
&
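
As an illustration, the general command might be filled in for the tutorial as follows. Every path, file name and column index here is hypothetical (in particular the summary statistics file name, the gene-set file name and the indices 1, 4 and 5), and the RUN dry-run switch is an addition of this sketch. By default the command is printed rather than executed.

```shell
#!/bin/sh
set -eu

# Placeholders: adapt every path and column index to your own data.
R_BIN="${R_BIN:-/path/to/R-interpreter}"
SCRIPT="${SCRIPT:-/path/to/Rscript}"
MAGMA="${MAGMA:-/path/to/magma-executable}"
WD="${WD:-tutorial}"

# Dry run unless RUN is set; invoke with RUN="" to execute for real.
RUN="${RUN-echo}"

run_aug_magma() {
    $RUN "$R_BIN" "$SCRIPT" \
        --magma "$MAGMA" \
        --sumstat "$WD/input/sumstats/sumstats.txt" \
        --sumstat-id 1 \
        --sumstat-pval 4 \
        --sumstat-nsample 5 \
        --binaries "$WD/input/binaries/prefix" \
        --baseline-model "$WD/input/annotations/genes_u0d0.genes.annot" \
        --augmented-model "$WD/input/annotations/genes_u10d10.genes.annot" \
        --gene-set-file "$WD/input/sets/gene-sets.txt" \
        --output "$WD/output"
}

run_aug_magma
```

For a real run, the script itself can be started under nohup and sent to the background with &, as in the general command above, so that it survives a dropped connection.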

Optional flags include:

--cores V   # set number of cores V manually (not recommended to exceed default | default is a quarter of total available cores)
--gene-scoring-model top   # change the way MAGMA calculates gene scores (not recommended | default is SNP-Wise Mean)
--gene-set-format col=A,B   # use alt. gene-set format (see MAGMA manual | A/B are index of gene/set column | default is row-based)
--ignore-genes </path/to/gene-list-file>   # define path to list of genes to exclude from analyses (see MAGMA manual | default is none)
--permutations P   # set number of permutations P manually (default is 20)

General Description

  1. Summary statistics are read in and preprocessed for compatibility with MAGMA and EPVP.
  2. Both SNV-to-gene mappings are read in and used to generate an annotation table (that is, a table in which each gene is defined by the SNVs mapped to it either exclusively via augmentation (a) or via both mappings (b)).
  3. Gene scoring (to obtain unadjusted gene scores) and gene-set analysis (to obtain adjusted gene scores, which in turn are used to obtain gene-set scores) are performed for the baseline SNV-to-gene mapping and then for the augmented SNV-to-gene mapping using the unpermuted summary statistics (that is, the output from step 1).
  4. EPVP permutations are executed and used to obtain an unadjusted gene score for each gene in each permutation. These scores are then used in gene-set analysis to obtain adjusted gene scores and gene-set scores for each permutation.
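
The a/b labelling in step 2 can be illustrated with a small awk sketch. This is not the code used by the Rscript; it only shows the idea. It assumes that each non-comment line of a .genes.annot file holds a gene identifier, a location field, and then the mapped SNV identifiers, separated by whitespace. The toy files and their contents are made up.

```shell
#!/bin/sh
set -eu

# label_snvs BASELINE AUGMENTED: for every gene/SNV pair in the augmented
# mapping, print "gene snv b" if the baseline also maps that SNV to that
# gene, and "gene snv a" if the pair exists only through augmentation.
label_snvs() {
    awk '
        /^#/ { next }                      # skip comment lines
        NR == FNR {                        # first file: baseline mapping
            for (i = 3; i <= NF; i++) base[$1, $i] = 1
            next
        }
        {                                  # second file: augmented mapping
            for (i = 3; i <= NF; i++) {
                if (($1, $i) in base) label = "b"; else label = "a"
                print $1, $i, label
            }
        }
    ' "$1" "$2"
}

# Toy mappings (hypothetical genes and SNVs, for illustration only).
printf 'GENE1\t1:1000:2000\trs1\trs2\n' >  toy_baseline.genes.annot
printf 'GENE2\t1:5000:6000\trs5\n'      >> toy_baseline.genes.annot
printf 'GENE1\t1:0:3000\trs1\trs2\trs3\n' >  toy_augmented.genes.annot
printf 'GENE2\t1:4000:7000\trs4\trs5\n'   >> toy_augmented.genes.annot

label_snvs toy_baseline.genes.annot toy_augmented.genes.annot
```

In the toy data, rs3 and rs4 are labelled a because they reach their genes only through the augmented mapping; the remaining SNVs are labelled b.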

Output

Run time varies but will usually not exceed 12 hours. Progress can be monitored in the nohup.out file. All output is stored in the specified output directory, under (1) a subdirectory named after the summary statistics file and, within it, (2) a subdirectory named after the augmented SNV-to-gene mapping file. At the bottom of this directory structure are three subdirectories:

  1. The "sumstat" subdirectory. Contains three additional subdirectories, two of which contain intermediate output and can be ignored (namely, "permuted" and "aggregated"), and the other (namely, "original") contains the summary statistics used for the analyses (note: this file is a filtered and reformatted version of the input summary statistics, as generated in step 1 described under the general description heading).

  2. The "annotation" subdirectory. Contains a table in which each gene is defined by the SNVs mapped to it either exclusively via augmentation (a, for augmentation) or via both mappings (b, for baseline, since by definition the SNV is already mapped to the gene via the minimal SNV-to-gene mapping). This file serves as a reference for any downstream analyses that the user may wish to perform.

  3. The "scores" subdirectory contains gene scores and gene-set scores according to the baseline SNV-to-gene mapping with unpermuted summary statistics ("baseline"), the augmented SNV-to-gene mapping with unpermuted summary statistics ("augmented"), and the augmented SNV-to-gene mapping with permuted (EPVP) summary statistics ("random"). The intermediate output stored within the "batches" subdirectory (under the "random" subdirectory) can be ignored. Suffixes:
     - Unadjusted gene scores (unadjusted.genes.raw)
     - Adjusted gene scores (.adjusted.gsa.genes.out)
     - Gene-set scores from competitive gene-set analysis with adjusted gene scores (.adjusted.gsa.out)
    Note: file names differ between scores computed with unpermuted summary statistics ("original") and with permuted summary statistics ("permutation-n").
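
To collect the gene-set score files after a run, a search on the suffix listed above is usually enough. The sketch below is only a convenience and makes no assumption about the exact directory or file names beyond the .adjusted.gsa.out suffix and the "batches" subdirectory mentioned above; OUT_DIR is a hypothetical variable that defaults to the current directory.

```shell
#!/bin/sh
set -eu

# list_scores OUTPUT_DIR: list gene-set score files below OUTPUT_DIR,
# skipping the intermediate "batches" subdirectories.
list_scores() {
    find "$1" -type d -name batches -prune -o \
        -type f -name '*.adjusted.gsa.out' -print | sort
}

list_scores "${OUT_DIR:-.}"
```

Changing the pattern to '*.adjusted.gsa.genes.out' lists the adjusted gene scores instead.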
