An R package for import, QC and analysis of Illumina Infinium genotyping arrays
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
R
data
figs
inst/extdata
man
src
vignettes
.Rbuildignore
.gitignore
DESCRIPTION
NAMESPACE
README.md
argyle_0.1.tar.gz
argyle_0.1.tgz
argyle_0.1.zip

README.md

argyle logo argyle

An R package for GenotYpes from ILlumina Et al. Utilities for import, QC and (some) analysis of genotyping and hybridization-intensity data from Illumina Infinium arrays using R.

Morgan AP (2015) argyle: an R package for analysis of Illumina genotyping arrays. G3 6: 281-286. doi:10.1534/g3.115.023739.

If you suspect a bug, please consider reporting it on the Issues page. To receive (infrequent) updates about argyle and interact with other users, subscribe to the arglye-users Google Group.

Installation

A source version of the package (*.tar.gz) and binaries for Mac (*.tgz) and Windows (*.zip) are available from this repository. Before installing, all dependencies will need to be in place. Building from soure requires a reasonably modern C++ compiler.

If at all possible, consider installing the most recent version of the package directly from Github with devtools.

library(devtools)

## allow R to look for pacakges in both CRAN and Bioconductor
setRepositories(ind = 1:2)

## install from Github source
install_github("andrewparkermorgan/argyle")

Dependencies

Effort has been made to keep to a minimum the number of package dependencies, subject to the constraint that I don't want to re-implement from scratch what others have done better.

  • data.table: really fast and efficient handling of big (multi-GB scale) table-style data with low overhead
  • preprocessCore (Biodoncuctor): robust quantile normalization routine written in C
  • plyr: generalizations of base-R's apply() family
  • reshape2: easy "flattening" of matrices to dataframes
  • digest: for computing MD5 checksums to check data integrity
  • ggplot2 (and friends): required for the plotting functions
  • Rcpp: for compiled code

The following are required for some functions but one could get by without them:

  • corpcor: required for "fast"-mode PCA

Recent updates

New in version 0.2:

  • as.data.frame() for conversion to friendly-looking genotype table suitable for Excel
  • integrity checks for Illumina BeadStudio output files
  • checks for duplicate sample and marker names in cbind() and rbind()
  • bug-fixes in functions for conversion between R/qtl and argyle
  • random access (over markers or samples) to PLINK filesets
  • option to store intensitites in compact binary format (*.bii) alongside a PLINK fileset

Usage

## load example dataset
data(ex)

## high-level summary
summary(ex)

## NB: print(.) same as summary, won't flood the terminal
print(ex)

## peek at genotype matrix
head(ex)

## see marker map and sample metadata
markers(ex)
samples(ex)

## subset operations: hard brackets or subset()
ex[ 1:10,1:2 ]

## or stuff like
x <- subset(ex, chr == "chrM")
x <- subset(ex, sex == 2, by = "samples")

## run QC checks and flag samples above thresholds
ex <- run.sample.qc(ex, max.H = 5e3, max.N = 500)
# how many samples fail QC?
summarize.filters(ex)

Interface to R/DOQTL

Genotypes processed with argyle can be packaged into a set of R objects (bundled in an *.Rdata file) suitable for use as input to Dan Gatti's DOQTL software. DOQTL performs haplotype reconstruction and genetic mapping (under both linkage/composite-interval and single-marker association models) in multifounder advanced intercross populations. Its namesake is the Diversity Outbred (DO) mouse population (see do.jax.org).

## export for DOQTL
export.doqtl(ex, "./doqtl.objects.Rdata")

## convert to R/qtl
cross <- as.rqtl(ex, type = "f2")

Interface to PLINK

Computation on large SNP array genotyping datasets is not a new problem. Many common operations -- frequency statistics (sample-wise and marker-wise), differentiation statistics ($F_{st}$ et al), homozygosity checks, association testing, multivariate clustering by PCA and MDS -- are implemented efficiently in the PLINK package. The input formats popularized by PLINK are now used by other software in population genetics.

This package provides functions to read and write binary PLINK filesets. The binary fileset consists of three files:

  • *.fam: the "family file" describing samples (6 columns): family ID, sample ID, mom ID, dad ID, sex (0=unknown, 1=male, 2=female), phenotype (-9=missing)
  • *.bim: a file describing the marker map (6 columns, at least): chromosome, marker ID, genetic position (cM), physical position (bp), allele 1, allele 2
  • *.bed: compact binary representation of genotypes using 2 bits per genotype

Note that order matters: genotypes from the *.bed file are mapped to samples and markers using order of appearance in the *.fam and *.bim files.

## this command produces files 'sample.bed', 'sample.bim' and 'sample.fam' in the R sessions temporary directory
ff <- file.path(tempdir(), "sample")
ex <- recode(ex, "native")
write.plink(ex, ff)

## ... and this one reads it back in
summary( read.plink(ff) )

Also included are thin wrappers around many PLINK utilities. These of course require a working executable named plink in the user's path. See the user manual for details.