psmcr: Pairwise Sequentially Markovian Coalescent With R
psmcr is an R port of
psmc()that runs the PSMC with (almost) all options like the original program. The bootstrap can be run from this function, with the possibility to run the replications in parallel.
plotfunction to display the result of
psmcwith various options.
VCF2DNAbinthat prepares a consensus genome from a VCF file with the possibility to choose the individual if the VCF file is from a population study.
seqBinningto bin the sequences.
See below for some details on preparing data input.
There are several parameters to consider in order to obtain a sensible output from the PSMC: it is recommended to see the help page
?psmc and the original documentation and publication accompanying
The present R package runs in a reasonable time (i.e., ~1 min) with an input size of ~1 Mb and 20 iterations. An input size of 1 Mb is equivalent to a genome size of 100 Mb if the default binning size (100) is used. This makes possible to assess different parameter settings (see documentations). Once these are found, it is possible to run the program in batch (i.e., non-interactively), for instance on a remote machine, with a small script like:
x <- readRDS("data.rds") o <- psmc(x, niters = 30, B = 100, mc.cores = 8, .....) saveRDS(o, "output_run_psmc.rds")
where the file data.rds contains the input (eventually binned) sequences saved with the R function
niters is the number of iterations of the PSMC,
B is the number of bootstrap replications,
mc.cores is the number of threads (see the default R package parallel), and
..... are other parameter settings. Alternatively, if you prefer to save the input sequences as FASTA, the first line can be replaced with:
library(ape) x <- read.FASTA("data.fas")
The script can then be run in the usual way with (depending on the operating system):
R CMD BATCH script.R nohup &
Once the job is completed, the output can be analysed in R with:
o <- readRDS("output_run_psmc.rds") plot(o)
Of course, since the object
o is a list, it can be analysed with standard R operations.
Getting Input From a VCF File
The input for
psmc is the (eventually binned) consensus sequence of a diploid individual. It is possible to build this sequence from:
- a reference genome in a FASTA file, and
- a VCF file with genotypes from one or several individuals.
VCF2DNAbin takes as input the VCF file and the reference genome and outputs a consensus sequence. The option
individual selects which individual to consider in the VCF file in case there are more than one. The reference genome can be specified in different ways:
- by default,
VCF2DNAbinlooks for a reference in the VCF file and opens it (after downloading if it is a remote file);
- a "DNAbin" object;
- the name of a FASTA file.
seqBinning can be used to bin the consensus sequences before calling
psmc. The default bin size is 100, and the output is a set of sequences with either K if there is at least one ambiguous base (representing a heterozygous site) within a binning interval, or T if there are only homozygous sites.