plmc infers undirected graphical models to describe coevolution and covariation in families of biological sequences. With a multiple sequence alignment as an input, plmc can quantify inferred coupling strengths between all pairs of positions (couplingsfile output) or infer a generative model of the sequences for predicting the effects of mutations or designing new sequences (paramfile output).
plmc [options] alignmentfile plmc -c couplingsfile alignmentfile plmc -o paramfile -c couplingsfile alignmentfile plmc [-h | --help] Required input: alignmentfile Multiple sequence alignment in FASTA format Options, output: -c --couplings couplingsfile Save coupling scores to file (text) -o --output paramfile Save estimated parameters to file (binary) Options, alignment processing: -s --scale <value> Sequence weights: neighborhood weight [s > 0] -t --theta <value> Sequence weights: neighborhood divergence [0 < t < 1] Options, Maximum a posteriori estimation (L-BFGS, default): -lh --lambdah <value> Set L2 lambda for fields (h_i) -le --lambdae <value> Set L2 lambda for couplings (e_ij) -lg --lambdag <value> Set group L1 lambda for couplings (e_ij) Options, general: --fast Fast weights and stochastic gradient descent -a --alphabet alphabet Alternative alphabet (default 21: -ACDEFGHIKLMNPQRSTVWY) -f --focus identifier Select only uppercase, non-gapped sites from a focus sequence -g --gapignore Exclude first alphabet character from potential calculations -m --maxiter Maximum number of iterations -n --ncores [<number>|max] Maximum number of threads to use in OpenMP -h --help Usage
plmc requires no external libraries, but can optionally be accelerated with OpenMP for multicore parallelism.
Multicore. To compile with
gcc and OpenMP:
On macOS, OpenMP requires that an external version of GCC be installed rather than
clang. Precompiled binaries of GCC are available here or can be downloaded through package managers like homebrew or macports.
Single core, Linux. To compile with
Single core, Mac OS X. To compile with
Single precision. All of the above targets compile to double precision (64 bit), but reducing the precision to single (32 bit) increases speed and decreases memory requirements by approximately a factor of two. The fastest compile settings are then:
Coupling scores. The
couplingsfile is a flat text file containing scores quantifying the inferred strength of the coupling between every pair of positions. It has 6 columns:
RES_I FOCUS_AI RES_J FOCUS_AJ 0 SCORE, where
SCORE is the coupling score between positions
FOCUS_AJ are the letters in the focus sequence (optional,
- if no focus),
0 is a placeholder. The
SCORE values are APC-corrected Frobenius norm scores, but alternative scores can be computed from the raw parameter values.
Parameter estimates. The optional
paramfile specified with
-o, will store all inferred model parameters in binary. These can get large, as for proteins the model explicitly parameterizes all possible pairs of amino acids at all possible pairs of postions, which is about 106-108 numbers for families of lengths ~70-700. The MATLAB script
scripts/read_params.m unpacks this binary file into model parameters as well as associated metadata, such as inferred sequence weights.
Protein alignments. The example directory includes an alignment of the protein dihdyrofolate reductase (DHFR). To infer a model for this family, we can type the following in the base directory:
bin/plmc -o example/protein/DHFR.params -le 16.0 -lh 0.01 -m 100 -g -f DYR_ECOLI example/protein/DHFR.a2m
The numeric options set a strong L2 regularization for the couplings, λe = 16.0, a weak L2 regularization for the sites, λh = 0.01, and the maximum number of iterations at 100. The focus
-f option tells plmc to only model columns that are upper-case in the E. coli sequence DYR_ECOLI (NOTE: for focus mode, the alignment should only contain columns that are coding in the focus sequence, otherwise the offsets for the output sequence will be incorrect.) The
-g gap-ignoring option ignores gaps by modeling only the coding portions of each sequence. To read the binary paramfile
DHFR.eij and visualize the couplings, we can type the following in MATLAB from the
params = read_params('../example/protein/DHFR.params'); plot_coupling_scores(params)
This computes and plots the same (APC-corrected) coupling strengths that would be output to couplingfile, revealing many strongly coupled pairs of positions in the long-term evolution of DHFR:
EVzoom makes it possible to navigate the large number of sequence-level parameters present in undirected models of sequence families:
RNA alignments. An example RNA alignment is included for the SAM riboswitch. To infer the couplings with an RNA alphabet (.ACGU) type the following in the base directory:
bin/plmc -c example/RNA/RF00162.EC -o example/RNA/RF00162.params -a .ACGU -le 20.0 -lh 0.01 -m 50 example/RNA/RF00162.fasta
To plot the aggregated coupling scores type the following in MATLAB from the
params = read_params('../example/RNA/RF00162.params'); plot_coupling_scores(params)
The default alphabet (-ACDEFGHIKLMNPQRSTVWY) can be overridden by the option
-a ALPHABET. All uppercase letters and non-letter ASCII characters are acceptable. Lowercase letters in the alignment file will be mapped to the corresponding uppercase letter, except for in focus mode
-f, in which columns that are lower case in the focus sequence will be discarded.
Physical systems. As an example, simulated draws from a 3-state, 1-dimensional Potts model are provided in the examples folder and encoded by the characters _, *, and ^. The following command would estimate the parameters by running to convergence with λe = 1.0, λh = 1.0 and sequence reweighting disabled:
bin/plmc -c example/potts/potts3.txt -a _*^ -t -1 -le 1.0 -lh 1.0 example/potts/potts3.a2m
A 1D Potts model will only have interactions between i -> i + 1, which should be evident in the coupling summary scores output to example/potts/potts3.txt
The use of pseudolikelihood approximations for approximate inference of spin models for sequence families was established in the compsci/bioinformatics literature by
Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S. I., & Langmead, C. J. (2011). Learning generative models for protein fold families. Proteins: Structure, Function, and Bioinformatics, 79(4), 1061-1078.
and in the biophysics literature by
Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., & Aurell, E. (2013). Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E, 87(1), 012707.
plmc implements a joint optimization version of inference described in these and subsequent works (i.e. the 'symmetric' pseudolikelihood). If you'd like to use
plmc in your own work, please cite the following paper. Also, please let us know if you have any comments or questions!
Hopf, T. A., Ingraham, J. B., Poelwijk, F. J., Schärfe, C. P., Springer, M., Sander, C., & Marks, D. S. (2017). Mutation effects predicted from sequence co-variation. Nature Biotechnology, 35(2), 128-135.