Pipeline for running conStruct on a distributed HPC system
TO-DO:
- Add clustering to geodist for automatic population assignment
geodist.py calculates geographic distances between populations for input into the conStruct R package. It first groups individuals by the population IDs specified in a popmap file, collapsing them into distinct populations. It then calculates great-circle distances between each population and outputs two files: a distance matrix and a two-column file of lon/lat coordinates.
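Conceptually, the great-circle step reduces to the haversine formula. Below is a minimal stdlib sketch of that calculation; the function name and the 6371 km Earth radius are my own choices for illustration (the script itself relies on geopy, which uses its own ellipsoid parameters, so exact distances may differ slightly):

```python
from math import radians, sin, cos, asin, sqrt

def great_circle_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Haversine great-circle distance between two lon/lat points, in km.

    Illustrative only; geodist.py uses geopy for this step.
    """
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# One degree of latitude is roughly 111 km
print(round(great_circle_km(0.0, 0.0, 0.0, 1.0)))  # → 111
```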
python3 (tested with python 3.6)
geopandas
pandas
numpy
utm
geopy
shapely
The easiest way to install all the dependencies is with Anaconda or Miniconda. I have provided a conda environment file called geodist.yml that will allow you to create a conda environment containing all the dependencies. E.g.,
conda env create -f geodist.yml
Alternatively, you can install the dependencies yourself. E.g.,
conda create --name geodist python=3.6
conda install -n geodist -c conda-forge geopandas
conda install -n geodist -c conda-forge utm
conda install -n geodist -c conda-forge geopy
Installing geopandas using conda will install pandas, numpy, and shapely as well.
I recommend getting all packages from conda-forge only; I had compatibility issues when I pulled packages from mixed repositories.
Use the help option to see the full list of options:
geodist.py -h
Required Arguments:
-c [STRING], --coords [STRING]: Name of comma-separated file containing coordinates for each sample
-i [STRING], --id [STRING]: Specify name of column containing sample IDs in --coords file.
-p [STRING], --popmap [STRING]: Specify filename for tab-separated population map
--lat [STRING]: Specify name of latitude column in --coords file
--long [STRING]: Specify name of longitude column in --coords file
Optional Arguments:
-f [STRING], --format [STRING]:
Specify input coordinate format: [ dd || utm ];
default=dd (decimal degrees)
-z [INT], --zone [INT]:
Specify UTM zone if '--format utm' option is
specified; default=None
--zone_column [STRING]:
Specify column name for UTM zones in --coords file; either
--zone_column or --zone can be used, but not both
--hemisphere [STRING]:
Specify hemisphere (N || S) if --zone or
--zone_column is used; default=N
-e [INT], --epsg [INT]:
Specify EPSG code for spatial projection;
default=4269 (NAD83)
--polygons [STRING]:
Write population polygons to shapefile;
can be name of shapefile or name of directory (created if !exists);
default = off
--centroids [STRING]:
Write centroids of each population to shapefile;
can be name of shapefile or name of directory (created if !exists);
default = off
-g [STRING], --geodist [STRING]:
Specify output filename for geodist matrix between populations;
default = out.geodist.txt
--coord_outfile [STRING]:
Specify output file for lon/lat coordinates of populations
(tab-delimited); default = coords.out.txt
-h, --help Displays this help menu
geodist.py takes as input:
- A CSV file containing coordinates. The coordinates can be in decimal degrees or in UTM format, with an easting or longitude column and a northing or latitude column. If UTM format is specified, you must either name a column in the --coords file that contains the UTM zones, or use the --zone option to specify a single UTM zone on the command line. The script will then convert the UTM coordinates to decimal degrees.
- A two-column, tab-separated population map file with sampleID\tpopulationID as the columns. There should not be a header. E.g.:
sample1\tpop1
sample2\tpop1
sample3\tpop1
sample4\tpop2
sample5\tpop2
sample6\tpop2
The script will group by populationID and calculate great circle distances between the centroid points of each population. You can have it save shapefiles containing the polygons and centroids if you want to visualize them on a map.
Finally, the script will output the centroid point coordinates for each population as a tab-separated file (lon\tlat), and it will output a geographic distance matrix that can be input into the runConstruct.R script.
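Assuming each population centroid is taken as the mean of that population's points (which is what shapely computes for a collection of points), the grouping-and-centroid step can be sketched in plain Python. The inline popmap and coordinates below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical inline versions of the popmap and --coords inputs
popmap = {"sample1": "pop1", "sample2": "pop1", "sample3": "pop2"}
coords = {
    "sample1": (-84.1, 35.9),  # (lon, lat), decimal degrees
    "sample2": (-84.3, 36.1),
    "sample3": (-80.0, 33.0),
}

# Group coordinates by population ID
grouped = defaultdict(list)
for sample, pop in popmap.items():
    grouped[pop].append(coords[sample])

# Centroid of each population = mean of its member points
centroids = {
    pop: (sum(lon for lon, _ in pts) / len(pts),
          sum(lat for _, lat in pts) / len(pts))
    for pop, pts in grouped.items()
}
print(tuple(round(v, 1) for v in centroids["pop1"]))  # → (-84.2, 36.0)
```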
The next step in the pipeline is to run conStruct using the runConstruct.R script. This script is intended to be run in batch mode from the command line. It runs conStruct on a single K value and saves the results to an RData file. Multiple K values can be run in parallel using GNU parallel, as will be shown in a .pbs script (coming soon).
The following R packages must be installed to use the runConstruct.R and runXvalidation.R scripts:
R version >3.4.0 (tested on R 3.5.1)
optparse
conStruct
doParallel
foreach
parallel
If you are working on a distributed HPC environment without sudo (root) privileges, I highly recommend installing R using conda. You can use the Renv.yml file to create your environment with the required packages. E.g.:
conda env create -f Renv.yml
Installing the packages yourself
You will need to install r-essentials and r-base. E.g.:
conda create -n r_env r-essentials r-base
Then you can install the packages that are available through conda:
conda install -n r_env -c r r-doparallel
conda install -n r_env -c r r-foreach
conStruct is not available through conda, so you will have to install it within R. I also had better luck installing optparse with install.packages:
R
install.packages("optparse", repos="https://cran.r-project.org")
install.packages("conStruct", repos="https://cran.r-project.org")
Call the help menu with:
Rscript runConstruct.R -h
Options:
-s CHARACTER, --str=CHARACTER
Input structure filename; default = NULL
-p CHARACTER, --popmap=CHARACTER
Input popmap filename; default = NULL
-g CHARACTER, --geodist=CHARACTER
Input geoDist matrix file (tab-delimited)
-c COORDS, --coords=COORDS
Input coordinates file (tab-delimited)
-w CHARACTER, --wd=CHARACTER
Set working directory; default = ./
--prefix=CHARACTER
Specify prefix for output files in conStruct analysis
--nchains=INTEGER
Specify number of independent MCMC chains to run
-i INTEGER, --niter=INTEGER
Specify length of MCMC chain to run
-K INTEGER, --K=INTEGER
Specify K value (number of populations) to run
-a CHARACTER, --afreq=CHARACTER
Specify allele frequency file to output
-o CHARACTER, --outdir=CHARACTER
Specify directory to write output files
-r, --onerowperind
Boolean; If toggled, specifies only one row per individual in structure file; default = FALSE
--data_column=DATA_COLUMN
Specify column index (1-based) for first data column in structure file; default = 3
-m MISSINGVAL, --missingval=MISSINGVAL
Specify missing data value in structure file; default = -9
-h, --help
Show this help message and exit
Required inputs are:
- A STRUCTURE file. This file must have sample IDs as the first column, population IDs as the second column, and the SNPs in the remaining columns. There cannot be a header line.
- A tab-separated geographic distance matrix file. You can use the geodist.py script (see above) to calculate distances and output the matrix.
- A coordinates file. This is also output by geodist.py. It must be a tab-separated file with longitude as the first column and latitude as the second.
- A prefix for all the output files.
- A population map (popmap) file. This should be the same one input into geodist.py; it is used to collapse all the individuals into populations. The resulting allele frequency object created in the script will contain one line per population.
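To make the STRUCTURE-file options above concrete, here is a small sketch of how --data_column and --missingval are interpreted when reading the file. This is illustrative only, not the script's actual parser, and the inline data are made up:

```python
# A tiny two-rows-per-individual STRUCTURE fragment (tab-separated):
# sampleID, populationID, then SNP columns; -9 marks missing data.
structure_text = """\
sample1\tpop1\t0\t1\t-9
sample1\tpop1\t0\t0\t-9
sample2\tpop2\t1\t1\t0
sample2\tpop2\t1\t-9\t0
"""

data_column = 3   # 1-based index of the first SNP column (--data_column default)
missingval = -9   # missing-data code (--missingval default)

genotypes = {}
for line in structure_text.splitlines():
    fields = line.split("\t")
    sample = fields[0]
    snps = [int(x) for x in fields[data_column - 1:]]
    # Keep one list of non-missing calls per row of this individual
    genotypes.setdefault(sample, []).append([a for a in snps if a != missingval])

print(genotypes["sample1"])  # → [[0, 1], [0, 0]]
```

With --onerowperind toggled, each individual would instead occupy a single row, so the loop above would see only one line per sample ID.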
I also strongly suggest changing the --niter option to a more appropriate (much longer) chain length. You can also tell it to run additional independent chains with the --nchains option.
Finally, you can change the output directory if you want. I recommend doing so because many files will be produced.
I also strongly recommend that you assess the trace output files produced in this step before going on to the next one. This will help you to determine whether you ran the MCMC chains long enough by assessing convergence.
Run the script as follows:
Rscript runConstruct.R [options...]
The runXvalidation.R script runs cross-validation for conStruct to evaluate 1) whether the spatial or non-spatial model is better supported and 2) the "best" K value.
runXvalidation.R requires the same dependencies as runConstruct.R; no additional dependencies are needed.
Options:
-K INTEGER, --maxK=INTEGER
required; Specify maximum K value (upper case K)
-r INTEGER, --nreps=INTEGER
required; Specify number of cross-validation replicates
-t NUMERIC, --trainProp=NUMERIC
optional; Specify training proportion for cross-validation
-n INTEGER, --nodes=INTEGER
required; Specify number of CPU cores for parallelization
-f, --saveFiles
optional; Boolean; If toggled, saves Robj files from cross-validation; will save many files; default=FALSE
-F, --saveFigs
optional; Boolean; If toggled, figures from cross-validation are not saved; default=TRUE (save figures)
-o CHARACTER, --outdir=CHARACTER
optional; Specify directory for output files; will be created if it doesn't exist
-w CHARACTER, --wd=CHARACTER
required; Specify directory containing output from runConstruct.R
-h, --help
Show this help message and exit
Required Input:
- You need to have run runConstruct.R first, which writes the .RData files to the output directory specified in the runConstruct.R options.
- You need to specify the maximum K value that you ran runConstruct.R on. The minimum K must be 1.
- You need to specify the directory where the runConstruct.R files are stored using the --wd option.
- You need to indicate the number of CPU threads to use with the --nodes option.
- Specify the number of cross-validation replicates with the --nreps option.
The script will output lots of files, so it is recommended that you specify an output directory with the --outdir option. It is also highly recommended that you inspect the cross-validation replicate traces for convergence.
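For intuition, cross-validation with --trainProp and --nreps amounts to repeatedly partitioning the loci into a training set (used to fit the model) and a testing set (used to score predictive accuracy). Here is a rough sketch of that partitioning; the function name, the seed, and the 0.9 training proportion are my assumptions for illustration, not conStruct's actual code:

```python
import random

def make_replicates(n_loci, n_reps, train_prop=0.9, seed=42):
    """For each replicate, randomly split loci indices into (train, test)."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_reps):
        loci = list(range(n_loci))
        rng.shuffle(loci)
        cut = int(n_loci * train_prop)
        reps.append((sorted(loci[:cut]), sorted(loci[cut:])))
    return reps

# E.g., 8 replicates over 100 loci, 90/10 train/test splits
train, test = make_replicates(n_loci=100, n_reps=8)[0]
print(len(train), len(test))  # → 90 10
```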
Finally, run the script as follows:
Rscript runXvalidation.R [options...]