A distributed AI-REML Best Linear Unbiased Prediction (BLUP) framework for genomic prediction including marker-by-environment interaction. This software has been described and validated in the manuscript *Needles: towards large-scale genomic prediction with marker-by-environment interaction* (De Coninck et al., 2015, submitted to GENETICS).
This software was developed by Arne De Coninck and can only be used for research purposes.
Genomic datasets used for genomic prediction are constantly growing due to the decreasing costs of genotyping and the increasing interest in improving the agronomic performance of animals and plants. To deal with those large-scale datasets, a distributed-memory framework was developed based on the Message Passing Interface (MPI), the ScaLAPACK library for the dense operations, and the PARDISO library for efficiently handling the sparse information introduced by the marker-by-environment interaction effects. The complexity of the algorithm is determined by the number of genetic markers and environments included in the genomic prediction setting; the number of individuals only has a linear effect on the read-in time. To enhance performance, it is advised to compile and execute Needles on an MPI-optimized machine.
# Installation
Needles relies heavily on the following software packages, which have to be installed prior to installation of Needles. These software packages are all open source, except for the vendor-optimized implementations and PARDISO, but an academic license of PARDISO is free of charge.
- MPI (OpenMPI, MPICH, IntelMPI)
- ScaLAPACK and all its dependencies: BLAS, BLACS, LAPACK, PBLAS (a vendor-optimized implementation is recommended)
- [PARDISO](http://www.pardiso-project.org/)
- [CMake](http://www.cmake.org/)
Currently, compilation only works out of the box when the Intel MKL libraries are installed. When the MKL libraries are not available, the MKL libraries referenced in the `CMakeLists.txt` file must be replaced by the ones that are installed.
- Unpack the zip file or clone the git repository
- Go into the directory `Needles`
- Make a new directory `build`
- Go into the directory `build`
- Type `cmake ..`
- Type `make`
Needles only needs an input file to start. A default input file is provided: `defaultinput.txt`. More information on the arguments in the input file can be found on the wiki.
To test Needles with a default example, enter the following command in the `example` directory:

`mpirun -np 4 ../build/Needles GxE_20penv_QTL_input.txt`

At least 2 MPI processes should be initialised, because all sparse operations are performed by a single MPI process, while the other MPI processes are used to handle the dense operations.
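
The two-process minimum can be guarded before launching. The helper below is a minimal sketch, not part of Needles: `needles_command` is a hypothetical name, and the binary path `../build/Needles` is taken from the example command above and assumes the build steps were followed as described.

```python
def needles_command(n_procs, input_file, binary="../build/Needles"):
    """Build the mpirun command line for a Needles run (hypothetical helper).

    One MPI process handles the sparse operations and the remaining
    processes handle the dense operations, so fewer than 2 is rejected.
    """
    if n_procs < 2:
        raise ValueError("Needles needs at least 2 MPI processes")
    return ["mpirun", "-np", str(n_procs), binary, input_file]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(" ".join(needles_command(4, "GxE_20penv_QTL_input.txt")))
```

The returned list can be passed to `subprocess.run` from the `example` directory to start the run.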
This test case is one of the many test cases described in the research article *Needles: towards large-scale genomic prediction with marker-by-environment interaction*. It analyses 800 observations, genotyped with 1575 QTL markers and evaluated in 10 different environments. The simulated QTL effects are in the file `QTL_summary_20penv_10env.txt` and the different contributions to the final phenotypic values are in the file `Observations_summary_lowvar_20penv_10env.txt`. When Needles is working correctly, the output should be exactly the same as in the files starting with `correct_`. An example of the output that is produced by Needles is in the file `Needles_out_4procs.txt`.
Needles creates 3 output files with the estimates/predictors for the different effects:
- `estimates_fixed_effects.txt`: Lists the estimates for the fixed effects. Usually these are the fixed environmental effects, but users are free to choose the included fixed effects.
- `estimates_random_genetic_effects.txt`: Lists the predictions for the random genetic effects. These are the predictions for the global genetic effects, independent of the environment.
- `estimates_random_sparse_effects.txt`: Lists the predictions for the random marker-by-environment interaction effects.
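
After running the default example, the three output files can be checked against the reference files starting with `correct_`. The script below is a minimal sketch: the function name is hypothetical, and it assumes each reference file is named `correct_` followed by the output file name, which should be verified against the contents of the `example` directory.

```python
from pathlib import Path

# Output files written by Needles, as listed in this README.
OUTPUT_FILES = [
    "estimates_fixed_effects.txt",
    "estimates_random_genetic_effects.txt",
    "estimates_random_sparse_effects.txt",
]


def check_against_reference(directory=".", prefix="correct_"):
    """Return the output files that are missing or differ from their reference.

    Assumption: the reference for <name> is <prefix><name> in the same directory.
    """
    directory = Path(directory)
    mismatches = []
    for name in OUTPUT_FILES:
        produced = directory / name
        reference = directory / (prefix + name)
        if not produced.is_file() or not reference.is_file():
            mismatches.append(name)
        elif produced.read_bytes() != reference.read_bytes():
            # Byte-for-byte comparison: the output should match exactly.
            mismatches.append(name)
    return mismatches


if __name__ == "__main__":
    bad = check_against_reference(".")
    print("all outputs match" if not bad else "mismatch: " + ", ".join(bad))
```

An empty list means every output file was found and is identical to its reference copy.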
Both random effects can be chosen by the user to model something other than genetic effects and their environmental interaction, but for now one of the random effects must result in a sparse part of the coefficient matrix and the other must result in a dense part. Both random effects can also have a different variance, but the variance is homoscedastic within each random effect, meaning that the variance of each random effect is modeled as a constant diagonal matrix.
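
The structure described above can be written out in standard mixed-model notation. This is a sketch only: the symbols ($\mathbf{X}$, $\mathbf{Z}$, $\mathbf{W}$, $\mathbf{b}$, $\mathbf{u}$, $\mathbf{v}$) are chosen here for illustration and are not taken from the Needles source or manuscript, and independent residuals with a common variance $\sigma_e^2$ are assumed.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{W}\mathbf{v} + \mathbf{e},\\
\mathbf{u} &\sim N(\mathbf{0}, \sigma_u^2 \mathbf{I}), \quad
\mathbf{v} \sim N(\mathbf{0}, \sigma_v^2 \mathbf{I}), \quad
\mathbf{e} \sim N(\mathbf{0}, \sigma_e^2 \mathbf{I}),\\[4pt]
\begin{bmatrix}
\mathbf{X}^\top\mathbf{X} & \mathbf{X}^\top\mathbf{Z} & \mathbf{X}^\top\mathbf{W}\\
\mathbf{Z}^\top\mathbf{X} & \mathbf{Z}^\top\mathbf{Z} + \frac{\sigma_e^2}{\sigma_u^2}\mathbf{I} & \mathbf{Z}^\top\mathbf{W}\\
\mathbf{W}^\top\mathbf{X} & \mathbf{W}^\top\mathbf{Z} & \mathbf{W}^\top\mathbf{W} + \frac{\sigma_e^2}{\sigma_v^2}\mathbf{I}
\end{bmatrix}
\begin{bmatrix}\hat{\mathbf{b}}\\ \hat{\mathbf{u}}\\ \hat{\mathbf{v}}\end{bmatrix}
&=
\begin{bmatrix}\mathbf{X}^\top\mathbf{y}\\ \mathbf{Z}^\top\mathbf{y}\\ \mathbf{W}^\top\mathbf{y}\end{bmatrix}
\end{aligned}
```

Here $\mathbf{b}$ holds the fixed effects, $\mathbf{u}$ the first random effect and $\mathbf{v}$ the second. In the default setting, $\mathbf{u}$ corresponds to the global genetic effects (the dense part) and $\mathbf{v}$ to the marker-by-environment interaction effects (the sparse part). The constant diagonal variance of each random effect appears as the scaled identity added to its diagonal block.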
In addition to the result files, two files are written that monitor memory usage: `root_output.txt` for the root node, which performs all operations on the sparse part of the system, and `cluster_output.txt` for the other nodes, which perform all operations on the dense part of the system.
- Version 0.1 (09/2015):
  - First public release of Needles
Please feel free to contact arne.deconinck[at]ugent.be for any questions or suggestions.