SNPHAP

A copy of the snphap-1.3.1 source, written by David Clayton, placed here because the existing website is closing.

I am not a maintainer, but a user who wants to keep a copy from which I can install in future. What follows is David’s documentation.

A program for estimating frequencies of large haplotypes of SNPs

(Version 1.3)

David Clayton

Department of Medical Genetics, Cambridge Institute for Medical Research, Wellcome Trust/MRC Building, Addenbrooke’s Hospital, Cambridge, CB2 2XY

Disclaimer

This program has now had a fair bit of use by our group and others. However, it comes with no guarantees. I’d like to know of any difficulties/bugs that people experience and will try to deal with them, but this may take some time.

Introduction

This program implements a fairly standard method for estimating haplotype frequencies using data from unrelated individuals. It uses an EM algorithm to calculate ML estimates of haplotype frequencies given genotype measurements which do not specify phase. The algorithm also allows for some genotype measurements to be missing (due, for example, to PCR failure). It also allows multiple imputation of individual haplotypes.

EM algorithm

The well-known algorithm first expands the data for each subject into the complete set of (fully phased) pairs of haplotypes consistent with the observed data (the “haplotype instances”). The EM algorithm then proceeds as follows:

  • E step:: Given current estimates of haplotype probabilities and assuming Hardy-Weinberg equilibrium, calculate the probability of each phased and complete genotype assignment for each subject. Scale these so that they sum to 1.0 within each subject. These are then the POSTERIOR PROBABILITIES of the genotype assignments.
  • M step:: Calculate the next set of estimates of haplotype probabilities by summing posterior probabilities over instances of each distinct haplotype. After scaling to sum to 1.0, these provide the new estimates of the PRIOR PROBABILITIES (a sketch of one such cycle in C follows this list).
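
As an illustration only, one such cycle might be sketched in C as follows. This is not the SNPHAP source: the Subject structure, the array bounds and the function names are all invented, and the consistent haplotype pairs for each subject are assumed to have been enumerated already.

  /* A minimal sketch of one EM cycle (not the SNPHAP source; all names
     are invented).  Assumes n_hap <= MAX_HAP. */
  #include <string.h>

  #define MAX_HAP  16   /* distinct haplotypes under consideration */
  #define MAX_PAIR 32   /* consistent haplotype pairs per subject */

  typedef struct {
      int    n_pair;
      int    h1[MAX_PAIR], h2[MAX_PAIR]; /* haplotype indices of each pair */
      double post[MAX_PAIR];             /* posterior probability of pair */
  } Subject;

  /* One E step and M step: freq[] holds the current prior haplotype
     probabilities on entry and the new estimates on exit */
  void em_cycle(Subject *sub, int n_sub, double *freq, int n_hap)
  {
      double new_freq[MAX_HAP];
      memset(new_freq, 0, sizeof new_freq);

      for (int i = 0; i < n_sub; i++) {
          double total = 0.0;
          /* E step: under HWE, P(h1,h2) = 2*f1*f2 (het) or f1*f1 (hom) */
          for (int j = 0; j < sub[i].n_pair; j++) {
              double p = freq[sub[i].h1[j]] * freq[sub[i].h2[j]];
              if (sub[i].h1[j] != sub[i].h2[j]) p *= 2.0;
              sub[i].post[j] = p;
              total += p;
          }
          for (int j = 0; j < sub[i].n_pair; j++) {
              sub[i].post[j] /= total;   /* scale to sum to 1 per subject */
              /* M step: each assignment contributes two chromosomes */
              new_freq[sub[i].h1[j]] += sub[i].post[j];
              new_freq[sub[i].h2[j]] += sub[i].post[j];
          }
      }
      for (int h = 0; h < n_hap; h++)    /* rescale to sum to 1 */
          freq[h] = new_freq[h] / (2.0 * n_sub);
  }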

This algorithm is widely used and not novel. However, for a large number of loci, the number of possible haplotype instances rapidly becomes impossibly large, even though the eventual solution may give appreciable support to only a rather limited set of haplotypes. This program avoids this difficulty by first fitting 2-locus haplotypes and then extending the solution by one locus at a time. As each new locus is added, the number of haplotype instances to be considered is first expanded by considering all possible larger haplotypes. Then, after applying the EM algorithm to estimate the “prior” haplotype probabilities and the posterior probabilities of haplotype instances, the haplotype instances are culled in two ways:

  • Posterior trimming:: Any genotype assignment whose posterior probability falls below a given threshold is deleted, and the posterior probabilities of the subject’s remaining assignments are recomputed (see the sketch after this list).
  • Prior trimming:: All instances of any haplotype whose prior probability falls below a threshold are removed. This option is not used by default since it can lead to difficulties in comparing likelihoods (see below).
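
Continuing the invented Subject structure from the sketch above, one simple reading of posterior trimming is to delete low-probability pairs and renormalize the survivors:

  /* Posterior trimming sketch (invented names): drop pairs whose posterior
     probability falls below `thresh', then renormalize what remains */
  void trim_posterior(Subject *s, double thresh)
  {
      int keep = 0;
      double total = 0.0;

      for (int j = 0; j < s->n_pair; j++) {
          if (s->post[j] < thresh)
              continue;                  /* delete this assignment */
          s->h1[keep]   = s->h1[j];
          s->h2[keep]   = s->h2[j];
          s->post[keep] = s->post[j];
          total += s->post[keep];
          keep++;
      }
      s->n_pair = keep;
      for (int j = 0; j < keep; j++)     /* recompute the posteriors */
          s->post[j] /= total;
  }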

We add one locus at a time until completion.

The process of culling haplotype assignments at early stages can lead to solutions which are not optimal. For example, haplotype 1.1.1 may have zero estimated frequency in the maximum likelihood analysis of the three-locus haplotype, while 1.1.1.x may have non-zero estimated frequency in the ML solution to the four-locus problem. It is not clear how often this will be a problem. A partial solution is to try including loci in different orders, seeing if the solution obtained varies. A further protection is not to cull haplotypes after the inclusion of every locus, but only after every k loci, although large values of k incur a penalty in both computing time and memory use. Note that you will only get the benefit of choosing k>1 if each EM algorithm is started from a random starting point (see the -rs option).

Multiple imputation

Sampling the (Bayesian) posterior distribution of individual haplotype data is conveniently carried out using a Gibbs sampler. This mimics the EM algorithm but uses stochastic steps rather than deterministic ones. It has been termed the IP (Imputation/Posterior sampling) algorithm. In our case the algorithm works as follows:

I-step (replaces the E-step): For each subject, pick a haplotype assignment from the possible instances, with probability given by the current posterior, or “full conditional” distribution.

P-step (replaces the M-step): Sample the haplotype population frequencies from their full conditional Bayesian posterior. If the prior is a Dirichlet distribution with constant df on all possible haplotypes*, the full conditional posterior distribution is also Dirichlet. To obtain the set of df parameters for this posterior Dirichlet, we simply add the constant prior df to the number of chromosomes currently assigned to each haplotype.

(* i.e. if the unknown population haplotype relative frequencies are denoted p_1, p_2, …, p_i, …, p_n, then their prior density is assumed to be proportional to

   n           d-1
Product   (p_i)
  i=1

where d is the prior degrees-of-freedom parameter)
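
For illustration, the P-step draw can be made by sampling an independent Gamma variate for each haplotype and normalizing. The following C sketch (again invented names, not the SNPHAP source) uses the Marsaglia-Tsang Gamma generator and assumes all shape parameters are positive:

  #include <math.h>
  #include <stdlib.h>

  /* Standard normal deviate via Box-Muller; 1.0 - drand48() keeps the
     argument of log() strictly positive */
  static double std_normal(void)
  {
      double u1 = 1.0 - drand48(), u2 = drand48();
      return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
  }

  /* Gamma(shape, 1) deviate, Marsaglia & Tsang (2000); shape > 0 assumed */
  static double gamma_dev(double shape)
  {
      if (shape < 1.0)  /* boost small shapes up to the shape >= 1 case */
          return gamma_dev(shape + 1.0) * pow(drand48(), 1.0 / shape);
      double d = shape - 1.0 / 3.0, c = 1.0 / sqrt(9.0 * d);
      for (;;) {
          double x, v;
          do { x = std_normal(); v = 1.0 + c * x; } while (v <= 0.0);
          v = v * v * v;
          double u = 1.0 - drand48();
          if (log(u) < 0.5 * x * x + d * (1.0 - v + log(v)))
              return d * v;
      }
  }

  /* P step: sample haplotype frequencies from the full conditional
     Dirichlet whose parameters are prior df + current chromosome counts */
  void p_step(const double *count, double prior_df, double *freq, int n_hap)
  {
      double total = 0.0;
      for (int i = 0; i < n_hap; i++) {
          freq[i] = gamma_dev(prior_df + count[i]);
          total += freq[i];
      }
      for (int i = 0; i < n_hap; i++)
          freq[i] /= total;
  }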

There can be difficulties in sampling the entire space using this algorithm if the prior Dirichlet df is taken as zero; if a haplotype is not assigned to any individual in one step, then the full conditional posterior is improper and the haplotype will be given zero probability at the next step. Thereafter it can never be sampled again. Also, when there are multiple maxima in the likelihood, the algorithm may become “stuck” under one peak. To avoid these difficulties, provision is made to start the prior df parameter at a relatively large value, thereby giving all haplotypes an appreciable probability of being sampled. Thereafter the prior df parameter is reduced at each step. This algorithm is repeated for a fixed number of steps to obtain a single imputation. The prior df parameter is then set back up to the high value, the population haplotype frequencies restored to their MLEs, and the process repeated to obtain the next imputation. And so on.

Warning: Although multiple imputation using the IP algorithm is an established technique (see Schafer J.L. “Analysis of Incomplete Multivariate Data” Chapman and Hall: London, 1997), it remains to be rigorously validated in this application.

Multiple maxima

It is well known that the likelihood surface for this problem may have multiple maxima and that the EM algorithm will only converge to a local maximum. After all loci have been added and a final trimmed list of haplotype instances has been computed, the EM algorithm may be repeated multiple times from random starting points in order to search for the global maximum. The random starting points may be chosen in one of two ways: (a) from randomly chosen values for the prior haplotype probabilities, or (b) from randomly chosen posterior probabilities for each haplotype assignment. Random starting points can also be chosen in the first set of EM iterations and, in this case, method (b) is used.
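
For example, using the options described below (file name illustrative), ten restarts of the final maximization from random prior haplotype probabilities could be requested with:

./snphap -mm 10 -rp test.dat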

Use

The program is invoked from the command line by

snphap [-co -ds # -de # -i # -k # -l # -mb # -mc # -mi # -mm # -mo -nu -nh
	-nf # -pr # -po # -q -ro -rv -rp -rs -sd # -ss -to # -th # -w]
		input-file [output-file-1] [output-file-2]

The input file should contain the data in subject order, with a subject identifier followed by pairs of alleles for each locus. The subject identifier need not be numeric, but must not include “white space” (blanks or tabs). The alleles should be coded either 1, 2 (numeric coding) or A, C, G, T (“nucleotide” coding). Missing data are indicated by 0 in numeric coding and, for nucleotide coding, by any character not hitherto mentioned. Data fields should be separated by “white space” (any number of blanks, tabs, or new-line characters).
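
For illustration, a file for three subjects typed at two loci in numeric coding, with both alleles of the second subject’s second genotype missing, might look like this (identifiers invented):

subj1  1 2  1 1
subj2  2 2  0 0
subj3  1 1  1 2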

By default, loci are added in the same order in which they appear in the input file but, optionally, they may be added in

  1. Reverse order
  2. Random order
  3. In decreasing order of data completeness
  4. In decreasing order of minor allele frequency (MAF)

I have little experience yet of the effect of changing the order of inclusion. The idea of (3) is to stop too much proliferation of possible haplotypes early on in the process, when there is little data on which to reliably pick rare haplotypes to cull. The idea of (4) is to concentrate first on older haplotypes.

The log likelihood output from this program should be used with some caution, particularly when prior trimming has been applied, since likelihoods which do not consider the same subsets of possible haplotypes may not be comparable.

Options

The optional flags allow one to set the following parameters:

-co	Add loci in decreasing order of data completeness

-i #	The maximum number of EM iterations at each step is # (default 50).

-l #	The number of loci is #. This option is no longer necessary if the 
	data are entered as one line per subject.

-k #    Kill improbable haplotype assignments (see -pr and -po) after 
	every # loci (default is 5).

-mb #	The maximum amount of dynamic storage to be allocated to the program
	is # Mbytes. This option should not be needed since, by default, the
	program should be able to determine this.

-mi #	Create # output files containing fully phased genotypes imputed at 
	random from the posterior distribution.

-mm #	Carry out the final EM maximization # times, starting from random 
	starting points. The solution with the largest likelihood is accepted.

-mo	Add loci in decreasing order of minor allele frequency

-nu	Forces numeric coding (1/2) of alleles on output, even when input data 
	are in A/T/C/G format (default is "off").

-nh	Locus names are supplied as first line of input file.

-nf #	Locus names are supplied in file #. This file should have one
	line per locus, with the locus name as  the first field.

-pr #	The threshold for prior trimming is # (default is zero).

-po #	The threshold for posterior trimming is # (default is 0.01).

-q	"Quiet" operation. This suppresses the constantly-updated progress 
	report that is written to the screen. This is needed when the 
	program is to be run in "batch" mode. 

-ro	Add new loci in random order (off by default).

-rv	Add new loci in reverse order (off by default).

-rp	When the -mm option is set, each repeated EM algorithm is restarted
	with random values for the prior haplotype probabilities. If not set,
	each EM algorithm is restarted with random posterior assignment
	probabilities for each subject (the default behaviour).

-rs	Select random starting point for each EM iteration. Otherwise we start
	by assuming linkage equilibrium between the new marker and the previous
	ones (default is "on").

-sd #	Set the pseudo-random number generator seed to a large integer, #. The
	default is to generate a seed from the date and time.

-ss	Specifies that the output file should be written as a tab-delimited 
	file with variable names on the first line (suitable for reading
	into spreadsheet or statistical programs; default is "off").

-to #	The convergence criterion for the EM iterations is # (the tolerated
	change in log-likelihood between two iterations; default 0.0001).

-th #	A number between 0 and 1, controlling the posterior threshold for
	writing the most likely haplotype assignments for subjects to
	output-file-2. Only assignments whose posterior probability exceeds
	this multiple of that of the most likely assignment will be
	written to the output file (not relevant in multiple imputation
	mode).

-w	The second output file, containing haplotype assignments to subjects,
	is written in "wide" format i.e. one line per assignment (default is
	"long" format in which two lines are written - one for each haplotype).

Multiple imputation options:

-mc #	The number of MCMC steps between imputations.

-ds #	The starting value of the df parameters for the Dirichlet prior. This
	is specified as a multiple of the number of chromosomes observed 
	(i.e. twice the number of subjects). The default value is 0.1.

-de #	The final values of the df parameters for the Dirichlet prior 
	(specified in the same way as above). The default value is zero, 
	corresponding to complete ignorance.
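
As an illustrative example (file name invented), five imputations separated by 20 MCMC steps, with the prior df starting at 0.1 and finishing at zero, might be requested with:

./snphap -mi 5 -mc 20 -ds 0.1 -de 0 test.dat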

If the command is issued without options or arguments, a brief description of available options is written to the screen.

Output

  1. Iteration progress reports (written to the screen). Note that some terminal emulators which provide “scrolling” may seriously slow down operation of the program. In this case you should either use a standard non-scrolling xterm, or invoke the -q option which suppresses this output.
  2. A file listing the haplotypes found, and their probabilities (output-file-1). The list is in descending order of probability and a cumulative probability is also listed. The cumulative probability is suppressed if the -ss option is in force.
  3. A file listing assignments of haplotypes to subjects (output-file-2). This file contains all assignments whose posterior probability exceeds a multiple of that of the most probable assignment (see -th option).
  4. A file (named “snphap-warnings”) which contains any warning messages.

Output files output-file-1 and output-file-2 are in a compact and easily readable format. Alternatively, they can be saved as tab-delimited text files suitable for reading into a spreadsheet program, or a statistical program such as “Stata” (see the -ss option). Both file names are optional and a missing argument can be indicated with a single “.” (period or full-stop) character. But since it must be assumed that you want SOME output, omission of both file names causes the program to default to “snphap.out” for output-file-1.
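
For example (file names invented), output-file-1 can be left unspecified with “.” while still naming output-file-2:

./snphap test.dat . assignments.out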

In multiple imputation mode, an additional series of files is created. Each imputation causes a fresh file (or pair of files) to be written. The file names are as specified on the command line, but the strings .001, .002, .003 … etc. are appended.

Building

A primitive Makefile is supplied. This uses the gcc compiler and will need to be edited if a different C compiler is to be used. You may also need to edit the CMP_FLAGS and LD_FLAGS options (which provide flags used by the compiler at the compile and load stages respectively).

For Microsoft Windows users, I suggest use of the “Cygwin” Unix emulation package. See

http://www.redhat.com/software/tools/cygwin

I found that setting LD_FLAGS to -lm worked for me on both Linux and Solaris (this is the default setting), but on Cygwin I had to omit this flag.

The default uniform random number generator (UNIFORM_RANDOM) is set to the standard 48-bit function `drand48’, and the corresponding seeding function (RANDOM_SEED) is `srand48’. However, for systems which do not support the 48-bit functions (this includes Cygwin), the 32-bit versions can be chosen:

UNIFORM_RANDOM = drand
RANDOM_SEED = srand

`drand()’ is defined as a macro evaluating to (0.5+rand())/(1+RAND_MAX).
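
Written out as C, that fallback might look like the sketch below. Using 1.0 rather than 1 in the divisor is a small precaution, since the integer sum 1 + RAND_MAX can overflow when RAND_MAX equals INT_MAX:

  #include <stdlib.h>

  /* Uniform deviate strictly inside (0,1): the 0.5 offset keeps the
     result away from exactly 0 and 1 */
  #define drand() ((0.5 + rand()) / (1.0 + RAND_MAX))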

A short test data file is also included. This contains typings of 100 subjects for 51 SNPs in a small region. To test the program:

./snphap test.dat

Alternatively, if you wish to incorporate locus names in the output,

./snphap -nf test.nam test.dat

Acknowledgements

Thanks to Newton Morton and Nikolas Maniatis for their helpful comments and suggestions on an earlier version. Thanks also to anyone who has pointed out bugs in earlier versions.

David Clayton