Skip to content
RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors
C Shell Makefile R
Branch: master
Clone or download
Latest commit 02b9091 May 31, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
evaluation
sources muVar and muSFS factors for scoring, -A option to generate manhattan … May 15, 2019
Makefile muVar and muSFS factors for scoring, -A option to generate manhattan … May 15, 2019
MakefileZLIB muVar and muSFS factors for scoring, -A option to generate manhattan … May 15, 2019
README.md Update README.md May 31, 2019

README.md

RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors

Authors: Nikolaos Alachiotis (n.alachiotis@gmail.com) and Pavlos Pavlidis (pavlidisp@gmail.com)

First release: 9/6/2017

Last update: 15/5/2019

Version: 2.0

About

RAiSD (Raised Accuracy in Sweep Detection) is a stand-alone software implementation of the μ statistic for selective sweep detection. Unlike existing implementations, including our previously released tools (SweeD and OmegaPlus), RAiSD scans whole-genome SNP data based on a composite evaluation scheme that captures multiple sweep signatures at once.

The main article describing RAiSD and the μ statistic is published in Communications Biology:

  1. RAiSD detects positive selection based on multiple signatures of a selective sweep and SNP vectors (PDF)
    URL: https://www.nature.com/articles/s42003-018-0085-8

Other related publications:

  1. Accelerated Inference of Positive Selection on Whole Genomes (PDF)
    URL: https://ieeexplore.ieee.org/abstract/document/8533493

Update -- May 2019 We have modified the way to calculate the partial statistics μ_VAR, μ_SFS, as follows.

μ_VAR: we have multiplied the old version of μ_VAR with the total number of SNPs in the chromosome. Now, μ_VAR is independent of the chromosome length. Also, it is distributed around the value 1 under neutrality.

μ_SFS: We have multiplied the old version of μ_SFS with the Sum_i=1^(n-1) 1/i, i.e., with the (sample size) correcting factor of Watterson's θ. Now, the μ_SFS is distributed around 1 under neutrality.

Download and Compile

The following commands can be used to download and compile the source code.

$ mkdir RAiSD
$ cd RAiSD
$ wget https://github.com/alachins/raisd/archive/master.zip
$ unzip master.zip
$ cd raisd-master
$ make

The executable is placed in the path RAiSD/raisd-master/bin/release. A link to the executable is placed in the installation folder, i.e., raisd-master.

Test Run

To verify that RAiSD is installed correctly, a test run can be done with the following commands. These commands are going to download a dataset (one of the many that we used for evaluation purposes) and execute RAiSD to process 1,000 simulated sets of SNPs (genomic region size = 100000 bp, weak bottleneck, selective sweep at the center of the region).

$ wget 139.91.162.50/raisd_data/d1.tar.gz
$ tar -xvzf d1.tar.gz
$ ./RAiSD -n test_run -I d1/msselection1.out -L 100000

Upon completion, the output files RAiSD_Info.test_run and RAiSD_Report.test_run are generated.

In-tool Help

RAiSD outputs a quick-reference help message that provides a short description for each of the supported command-line flags with the following command.

$ RAiSD -h

The quick-reference help message generated by the current RAiSD release is the following.

  This is RAiSD version 2.0, released in May 2019.

  RAiSD
      	-n STRING
      	-I STRING

	--- SNP and SAMPLE HANDLING
	[-L INTEGER]
	[-S STRING]
	[-m FLOAT]
	[-M 0|1|2|3]
	[-y INTEGER]

	--- SLIDING WINDOW and MU STATISTIC
	[-w INTEGER]
	[-c INTEGER]

	--- STANDARD OUTPUT and REPORTS
	[-f]
	[-s]
	[-t]
	[-p]
	[-O]
	[-R]
	[-P]
	[-D]
	[-A FLOAT]

	--- ACCURACY and SENSITIVITY EVALUATION
	[-T INTEGER]
	[-d INTEGER]
	[-k FLOAT]
	[-l FLOAT]

	--- ADDITIONAL EXECUTION PARAMETERS
	[-b]
	[-a INTEGER]

	--- HELP and VERSION NOTES
	[-h]
	[-v]


  	DETAILED DESCRIPTION

	-n	Provides a unique run ID that is used to name the output files, i.e., the info file and the report(s).
	-I	Provides the path to the input file, which can be either in ms or in vcf format.

	--- SNP and SAMPLE HANDLING
	-L	Provides the size of the region in basepairs for ms files. If known, it can be used for vcf as well, leading to faster processing.
	-S	Provides the path to the list of samples to be processed (supported only with VCF).
	-m	Provides the threshold value for excluding SNPs with minor allele frequency < threshold (0.0-1.0).
	-M	Indicates the missing-data handling strategy (0: discards SNP (default), 1: imputes N per SNP, 2: represents N through a mask, 3: ignores allele pairs with N).
	-y	Provides the ploidy (integer value), used to correctly represent missing data.

	--- SLIDING WINDOW and MU STATISTIC
	-w	Provides the window size (integer value). The default value is 50 (empirically determined).
	-c	Provides the slack for the SFS edges to be used for the calculation of mu_SFS. The default value is 1 (singletons and S-1 snp class, where S is the sample size).

	--- STANDARD OUTPUT and REPORTS
	-f	Overwrites existing run files under the same run ID.
	-s	Generates a separate report file per set.
	-t	Removes the set separator symbol from the report(s).
	-p	Generates the output file RAiSD_Samples.STRING, where STRING is the run ID, comprising a list of samples in the input file (supported only with VCF).
	-O	Shows progress on the display device (at snp set granularity).
	-R	Includes additional information (window start and end, and the mu-statistic factors for variation, SFS, and LD) in the report file(s).
	-P	Generates four plots (for the three mu-statistic factors and the final score) in one PDF file per set of SNPs in the input file using Rscript (activates -s, -t, and -R).
	-D	Generates a site report, e.g., total, discarded, imputed etc.
	-A	Provides a probability value to be used for the quantile function in R and generates a Manhattan plot for the final mu-statistic score using Rscript (activates -s, -t, and -R).

	--- ACCURACY and SENSITIVITY EVALUATION
	-T	Provides the selection target (in basepairs) and calculates the average distance (over all datasets in the input file) between the selection target and the reported locations.
	-d	Provides a maximum distance from the selection target (in base pairs) to calculate the success rate, i.e., reported locations in the proximity of the target of selection (provided via -T).
	-k	Provides the false positive rate (e.g., 0.05) to report the corresponding reported score after sorting the reported locations for all the datasets in the input file.
	-l	Provides the threshold score, reported by a previous run using a false positive rate (e.g., 0.05, via -k) to report the true positive rate.

	--- ADDITIONAL EXECUTION PARAMETERS
	-b	Indicates that the input file is in mbs format.
	-a	Provides a seed for the random number generator.

	--- HELP and VERSION NOTES
	-h	Prints this help message.
	-v	Prints version information.

Input File Formats

The current RAiSD release can process SNP data in Hudson's ms or VCF (Variant Call Format) file formats. The d1 test dataset, previously used for the test run, is in Hudson's ms format. For the VCF format, refer to the respective Wikipedia entry (https://en.wikipedia.org/wiki/Variant_Call_Format).

RAiSD version 1.7 (or later) can also parse and process ANGSD VCF files.
RAiSD version 1.8 (or later) can also parse compressed VCF files (including ANGSD-generated VCF files) in GZ file format.

Output Files

RAiSD generates two output files, the RAiSD_Info and the RAiSD_Report, with the run name (provided via "-n") as file extension.

RAiSD_Info.test_run

The first 20 lines of the RAiSD_Info.test_run file are shown below.

RAiSD, Raised Accuracy in Sweep Detection
Copyright (C) 2017, and GNU GPL'd, by Nikolaos Alachiotis and Pavlos Pavlidis
Contact n.alachiotis/pavlidisp at gmail.com

Command: ./RAiSD -n test_run -I d1/msselection1.out -L 100000 -f 
Samples: 20
Region:  100000 bp
Format:  ms

A pattern structure of 65536 patterns (max. capacity) and approx. 1 MB memory footprint has been created.

0: Set 0 | sites 6300 | snps 6300 | region 100000 - Var 53090 8.400e-04 | SFS 12005 1.000e+00 | LD 38385 1.333e+00 | MuStat 49905 3.908e-04
1: Set 1 | sites 6296 | snps 6296 | region 100000 - Var 49665 1.970e-03 | SFS 3745 1.000e+00 | LD 46945 2.000e+00 | MuStat 49685 8.320e-04
2: Set 2 | sites 6109 | snps 6109 | region 100000 - Var 50835 1.890e-03 | SFS 42100 1.000e+00 | LD 89390 1.500e+00 | MuStat 50770 6.825e-04
3: Set 3 | sites 6052 | snps 6052 | region 100000 - Var 49590 1.880e-03 | SFS 5545 1.000e+00 | LD 22460 1.500e+00 | MuStat 50150 1.102e-03
4: Set 4 | sites 6184 | snps 6184 | region 100000 - Var 51920 3.020e-03 | SFS 26445 1.000e+00 | LD 55080 1.333e+00 | MuStat 52400 1.180e-03
5: Set 5 | sites 5519 | snps 5519 | region 100000 - Var 49310 1.660e-03 | SFS 5475 1.000e+00 | LD 42490 1.333e+00 | MuStat 47165 8.925e-04
6: Set 6 | sites 6052 | snps 6052 | region 100000 - Var 49640 1.700e-03 | SFS 9090 1.000e+00 | LD 61435 1.250e+00 | MuStat 48080 6.200e-04
7: Set 7 | sites 6274 | snps 6274 | region 100000 - Var 50480 2.100e-03 | SFS 18910 1.000e+00 | LD 36695 1.333e+00 | MuStat 50490 8.505e-04

The RAiSD_Info file provides execution- and dataset-related information (command line, number of samples, region size, dataset format), as well as a result line per SNP set in the input file. Each per-set result line provides the following information: a) set index, b) number of sites, c) number of SNPs, d) region size, e) best-score location and the respective score for each of the factors that form the μ statistic, denoted as VAR, SFS, and LD, and f) best-score location and the respective score for the μ statistic (MuStat).

The RAiSD_Info file additionally reports total execution time, memory footprint, and statistics about the total number of SNP sets in the input file (total, processed, skipped), as shown in the last 6 lines of the file.

Sets (total):     1000
Sets (processed): 1000
Sets (skipped):   0

Total execution time 32.03070 seconds
Total memory footprint 1109 kbytes

RAiSD_Report.test_run

The first 20 lines of the RAiSD_Report.test_run file are shown below (using -R to include additional information in the report file).

// 0
430	20	840	1.640e-04	2.800e-01	2.580e+00	1.185e-04
445	20	870	1.700e-04	2.600e-01	2.500e+00	1.105e-04
450	20	880	1.720e-04	2.600e-01	2.077e+00	9.288e-05
455	20	890	1.740e-04	2.800e-01	2.210e+00	1.077e-04
460	20	900	1.760e-04	2.800e-01	2.071e+00	1.020e-04
480	30	930	1.800e-04	2.800e-01	2.097e+00	1.057e-04
490	30	950	1.840e-04	2.600e-01	1.667e+00	7.973e-05
500	40	960	1.840e-04	2.600e-01	1.113e+00	5.324e-05
515	50	980	1.860e-04	2.800e-01	8.798e-01	4.582e-05
550	110	990	1.760e-04	2.800e-01	9.087e-01	4.478e-05
565	120	1010	1.780e-04	2.800e-01	9.688e-01	4.828e-05
585	150	1020	1.740e-04	2.800e-01	1.308e+00	6.371e-05
595	170	1020	1.700e-04	2.800e-01	1.687e+00	8.031e-05
600	170	1030	1.720e-04	2.800e-01	1.595e+00	7.683e-05
605	180	1030	1.700e-04	3.000e-01	1.352e+00	6.897e-05
625	190	1060	1.740e-04	3.000e-01	1.356e+00	7.077e-05
650	240	1060	1.640e-04	3.000e-01	1.080e+00	5.315e-05
665	260	1070	1.620e-04	3.000e-01	7.723e-01	3.753e-05
670	270	1070	1.600e-04	2.800e-01	7.634e-01	3.420e-05

For each set of SNPs (separated by a line that contains the separator symbol "//" followed by the set index or chromosome name), the RAiSD_Report file contains a set of result lines, one per evaluated genomic location, with each result line providing the genomic location followed by the calculated μ statistic value (default) or the genomic location followed by the start and end positions of the window, the factors VAR, SFS, and LD, and finally the μ statistic (-R command-line flag).

Required Input Parameters

The in-tool help message generates a list of parameters, as shown above. Those in brackets are optional.

The basic execution mode does NOT require any FREE input parameters. 

In addition to the run name ("-n", used to name the output files accordingly) and the path to the input file ("-I") containing SNP data, the region length ("-L") is required only when simulated data are processed, i.e., ms files. The region length, which is required for calculating the μ statistic is extracted from the input file prior to processing when VCF files are analyzed.

The following command lines provide two examples of command lines for processing ms and VCF files.

Hudson's ms format

$ ./RAiSD -n ms_run -I input_file.ms -L 1000000

VCF format

$ ./RAiSD -n vcf_run -I input_file.vcf

Optional Input Parameters

In addition to the required set of parameters, several optional ones can be used. The in-tool help message provides a short description for each one of them.

The -f parameter allows to overwrite existing output files. The default operation mode is to prevent execution in order to avoid accidental overwritting of existing output files.

Parameters -s and -t affect the way output files are generated. The -s parameter splits reports in separate files, one per SNP set, whereas the -t parameter removes the SNP set separator symbols "//".

Processing only a subset of the samples in a VCF file is possible with the use of parameters -p and -S. The former generates a list of all samples in the input file, stored in an output file named RAiSD_Samples.run_name, while the latter allows the user to provide a similar to RAiSD_Samples input file containing only the list of samples to process.

The optional parameters -T, -d, -k, and -l are used for evaluation purposes and are discussed in detail below.

Missing-data Strategies

RAiSD provides four different strategies to handle missing data, using the -M parameter followed by the strategy number.

0: Discards all SNPs with missing data (default)
1: Imputes N per SNP   
2: Creates a mask for valid alleles and treats N as a third state 
3: Creates a mask for valid alleles and ignores allele pairs with N

Evaluating Accuracy

RAiSD provides a series of parameters that facilitate measuring performance when simulated datasets are analyzed. Given a simulated dataset, e.g., in ms format, that comprises several sets of SNPs, the -T parameter can be used in order to direct RAiSD to report accuracy, defined as the average distance between a known sweep location (provided via -T and the reported best-score locations).

The following command line example for the previously conducted test run demonstrates the use of -T and the respective output. All provided datasets, like the d1 used in the test run, exhibit a selective sweep at the center of the genomic region they simulate. Therefore, RAiSD can be launched as follows:

$ ./RAiSD -n test_run_accuracy -I d1/msselection1.out -L 100000 -T 50000

Note the additional output lines in the RAiSD_Info file, reporting accuracy. The values are in base pairs, averaged over the number of processed sets of SNPs.

AVERAGE DISTANCE (Target 50000)
mu-VAR	1022.500
mu-SFS	26137.655
mu-LD	18577.810
MuStat	1253.685

Measuring Success Rate

The -d parameter can be used along with -T to direct RAiSD to report the success rate, defined as the percentage of sets with reported best-score location withing a distance (provided via -d, in base pairs) from the known target of selection (provided via -T, in base pairs). The respective command line for a distance threshold of 1% of the total region length, i.e., 1,000 bp, is shown below.

$ ./RAiSD -n test_run_success_rate -I d1/msselection1.out -L 100000 -T 50000 -d 1000

Note the additional output lines in RAiSD_Info file, reporting success rate. The results show that 62.6% of the runs report locations within 1,000 bp from the known selection target (location 50,000).

SUCCESS RATE (Distance 1000)
mu-VAR	0.692
mu-SFS	0.022
mu-LD	0.019
MuStat	0.626

Evaluating Sensitivity (True Positive Rate)

The -k and -l parameters can be used to direct the tool to report sensitivity. This requires to conduct a run on neutral data first and sort all per-set best scores in order to define a threshold for a given False Positive Rate (FPR). RAiSD does this automatically when neutral data are processed, with the use of the -k parameter, providing an FPR. The following command line illustrates this for an FPR of 5%.

$ ./RAiSD -n test_run_fpr -I d1/msneutral1.out -L 100000 -k 0.05

Note the additional lines that report the respective threshold for the chosen FPR value.

SORTED DATA (FPR 0.050000)
Size			1000
Highest Score		0.000450000
Lowest Score		0.000089600
FPR Threshold		0.000221867
Threshold Location	50

Thereafter, the FPR Threshold value can be given to RAiSD when selection data are processed (via the -l parameter) as shown below.

$ ./RAiSD -n test_run_tpr -I d1/msselection1.out -L 100000 -l 0.000221867

This will report the respective TPR value as shown in the additional lines in RAiSD_Info file as shown below.

SCORE COUNT (Threshold 0.000222)
TPR	    0.995000

These results demonstrate a TPR of 99.5% when d1 is analyzed.

Generating μ-statistic plots

The -P parameter can be used to generate a set of four plots in a single PDF file per set of SNPs in the input dataset. This option requires R (https://www.r-project.org/) to be installed and Rscript to be in the default $PATH. By default, -P activates the generation of a separate report per set of SNPs (-s option), the removal of the set separator symbol from each report (-t option), and the inclusion of additional information in each report (-R option). The following command shows the use of -P for the test run.

$ ./RAiSD -n test_run -I d1/msselection1.out -L 100000 -P

This command will generate RAiSD_Plot PDF files with the run name (provided via "-n") followed by the SNP set index (or name in VCF format) preceding the file extension.

RAiSD_Plot.test_run.0.pdf

An example of such a file can be found here: http://139.91.162.50/raisd_plots/examples/RAiSD_Plot.test_run.1.pdf

Processing VCF (ANGSD version)

As of version 1.7, RAiSD can process ANGSD (http://www.popgen.dk/angsd/index.php/ANGSD) VCF files. Such files (see a sample here: http://www.popgen.dk/angsd/index.php/Vcf) only include the GP and GL fields, rather than the GT field that is required by RAiSD. In the absence of GT, diploidy is assumed.

Given a [GL:GP] entry [L1,L2,L3:P1,P2,P3], RAiSD constructs unphased genotype data on the fly as follows:

GT="./." for all-zero GL entries    
GT="0/0" with probability P1   
GT="0/1" with probability P2     
GT="1/1" with probability P3        

Processing VCF in GZ file format

Since version 1.8, RAiSD can process compressed VCF files in GZ file format. To activate this feature, compile the source code using the dedicated makefile MakefileZLIB as follows:

$ make clean
$ make -f MakefileZLIB 

The zlib library (https://zlib.net/) needs to be installed prior to compilation. In Ubuntu, the zlib library can be installed with the following command:

$ sudo apt-get install zlib1g-dev

Generating the RAiSD_SiteReport file

The -D option can be used to generate the RAiSD_SiteReport, i.e., a single file that includes a line per set of SNPs (ms format) or per CHROM (VCF format), providing a breakdown of the dataset in terms of total number of sites, number of SNPs used in the analysis, and total number of discarded sites. The total number of discarded sites is further broken down into site groups based on the reason they were discarded (failed check).

When the default missing-data strategy is used (-M 0), the total number of discarded sites includes:

a) sites dropped due to a failed "header" check (only for VCF), 
b) sites dropped because of the MAF check, 
c) sites with missing data, and 
d) monomorphic sites. 

When any other missing-data strategy is used (-M 1,2, or 3), the total number of discarded sites includes:

a) sites dropped due to a failed "header" check (only for VCF), 
b) sites dropped because of the MAF check, and 
c) potentially monomorphic sites with missing data, i.e., sites with missing data and no variation. 

When missing data are imputed per SNP (-M 1), the RAiSD_SiteReport additionally includes the total number of sites that have been imputed before applying the rest of the checks (minor allele frequency, monomorphic sites, etc).

Changing the window size

The window size for the sliding-window algorithm can be changed using the -w flag. The default value is set to 50 SNPs, which is empirically determined. Only even numbers can be used, due to a performance optimization employed in the computation of μ_LD. Using very large window sizes is expected to reduce the accuracy of the tool, and also leads to slower execution. Note that, RAiSD does not evaluate varying-size windows per genomic location. Therefore, μ statistic values obtained using different window sizes are not comparable. The following command shows the use of -w to set the window size to 20 SNPs.

$ ./RAiSD -n test_run -I d1/msselection1.out -L 100000 -w 20

Adapting the SFS edges for μ_SFS

The SFS is expected to assume a U shape in the presence of a selective sweep, due to the fact that the number of singletons and SNPs with S-1 derived mutations (where S is the sample size) increases, whereas intermediate SNP classes shrink in size. By default, the μ_SFS increases with the number of singletons and SNPs with S-1 derived mutations in a window. The -c flag allows to include SNPs with a larger number of derived mutations, e.g., doubletons or tripletons, in the calculation of μ_SFS values. For example, RAiSD can additionally include doubletons with the following example command.

$ ./RAiSD -n test_run -I d1/msselection1.out -L 100000 -c 2

When a number X that is larger than 1 is provided through -c for the SFS edges, all SNPs with a number of derived alleles less or equal than X and larger or equal than S-X are included in the calculation of μ_SFS values.

Generating Manhattan plots

As of version 2.0, the -A parameter followed by a probability value (to be used for the quantile function in R) can be used to generate a Manhattan plot in a single PDF file for all chromosomes (sets of SNPs) in the input dataset. This option requires R (https://www.r-project.org/) to be installed and Rscript to be in the default $PATH. By default, -A activates the generation of a separate report per set of SNPs (-s option), the removal of the set separator symbol from each report (-t option), and the inclusion of additional information in each report (-R option). The following command shows the use of -A for the test run.

$ ./RAiSD -n test_run -I d1/msselection1.out -L 100000 -A 0.995

This command will generate a single RAiSD_ManhattanPlot PDF file with the run name (provided via "-n") preceding the file extension.

RAiSD_ManhattanPlot.test_run.pdf

Note that, depending on the number of chromosomes (SNP sets) and the total number of SNPs in the input file, such a RAiSD run might require several minutes only to generate the RAiSD_ManhattanPlot PDF file.

An example of such a file, obtained by a different run, can be found here: http://139.91.162.50/raisd_plots/examples/RAiSD_ManhattanPlot.test_run_2.pdf

Changelog

v1.0 (9/6/2017): first release

v1.1 (7/3/2018): MAF threshold option

v1.2 (28/3/2018): mbs format with -b

v1.3 (18/7/2018): -i to impute N per SNP, -a for rand seed

v1.4 (3/8/2018): -M to handle missing data, -O to show progress on the display device

v1.5 (4/8/2018): -R to include additional information in the report (reduced default to location and score)

v1.6 (3/9/2018): -P to create plots per set of SNPs using Rscript

v1.7 (2/10/2018): -y for ploidy, -D for site report, fixed a bug in the plotting routine

v1.8 (31/12/2018): MakefileZLIB to parse VCF files in gzip file format (requires the zlib library)

v1.9 (27/4/2019): -w to set the window size (default 50), -c to set the SFS slack for the μ_SFS

v2.0 (15/5/2019): -A to create Manhattan plots, scale factors for muVar and muSFS to yield comparable scores among different chromosomes

Support

We provide support for RAiSD through our generic Google group for Population Genetics.

You can’t perform that action at this time.