This repository serves three purposes:
- Document the simulation design process used in our work Evaluation of Existing Methods for High-Order Epistasis Detection[1].
- Publish our developed data sets, allowing other researchers to use the same data.
- Ensure that our results are reproducible by any other scientist.
The objective of our work is to compare the performance of several epistasis detection methods, in terms of both detection power and false positives. To do that, we created a collection of data sets covering a wide variety of population characteristics that resemble real human traits, on which the different methods are evaluated.
All simulations were carried out using GAMETES[2], the most widely used epistasis simulation package in recent literature. They include epistasis both with and without marginal effects. Simulation in GAMETES can be summarized in three steps:
- Calculate a penetrance table describing the epistatic interaction to be included in the data.
- Simulate population samples of the interacting SNPs following that penetrance table.
- Simulate population samples of the rest of the non-interacting SNPs.
Our criteria were to create third- and fourth-order interactions, with minor allele frequencies (MAFs) of the interacting SNPs of 0.10, 0.25 and 0.40, trait heritabilities of 0.10, 0.25, 0.50 and 0.80, and prevalences above 1E-06.
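The criteria above define a small parameter grid. As a minimal sketch (the function name `list_configurations` is ours and is not part of this repository), the grid can be enumerated in a shell, giving 24 order/MAF/heritability combinations for each epistasis model:

```shell
# Enumerate the simulation grid: one line per
# (interaction order, MAF, heritability) combination.
list_configurations() {
    for order in 3 4; do
        for maf in 0.10 0.25 0.40; do
            for h2 in 0.10 0.25 0.50 0.80; do
                printf '%s %s %s\n' "$order" "$maf" "$h2"
            done
        done
    done
}

list_configurations
```

Not every combination yields a usable penetrance table: the prevalence criterion and the convergence of GAMETES remove some of them.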
GAMETES can only create penetrance tables for interactions that follow no particular epistasis model and show no marginal effects. Therefore, to generate penetrance tables following a specific epistasis model, Toxo[3] was used. Toxo is a MATLAB library capable of calculating penetrance tables of any bivariate epistasis model under certain conditions. In this study we decided to use the additive and threshold models proposed by Marchini et al. [4] for third and fourth order epistasis.
The following MATLAB code snippet shows how the different penetrance tables with marginal effects were obtained using Toxo. The directory epistasis_models/ contains the Marchini models used, in CSV format, as required by Toxo.
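% Make the Toxo library available on the MATLAB path.
addpath('<path to Toxo>/src/');
% Load the epistasis model from its CSV description.
m = toxo.Model('epistasis_models/<model>');
% Find the penetrance table with maximum prevalence for the given MAFs and heritability.
pt = m.find_max_prevalence([<MAFs>], <heritability>);
% Write the table in GAMETES format ("..." continues the statement on the next line).
pt.write('<path to penetrance table output>', ...
    toxo.PTable.format_gametes, ...
    <MAFs>);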
This procedure produced the following penetrance tables:
Model | Order | MAF | h² | P(D) | File |
---|---|---|---|---|---|
Additive | 3 | 0.10 | 0.10 | 0.000012 | Link |
Additive | 3 | 0.10 | 0.25 | 0.000004 | Link |
Additive | 3 | 0.10 | 0.50 | 0.000002 | Link |
Additive | 3 | 0.10 | 0.80 | 0.000001 | Link |
Additive | 3 | 0.25 | 0.10 | 0.005370 | Link |
Additive | 3 | 0.25 | 0.25 | 0.001153 | Link |
Additive | 3 | 0.25 | 0.50 | 0.000504 | Link |
Additive | 3 | 0.25 | 0.80 | 0.000306 | Link |
Additive | 3 | 0.40 | 0.10 | 0.254558 | Link |
Additive | 3 | 0.40 | 0.25 | 0.022186 | Link |
Additive | 3 | 0.40 | 0.50 | 0.008545 | Link |
Additive | 3 | 0.40 | 0.80 | 0.005091 | Link |
Additive | 4 | 0.10 | 0.10 | < 1E-6 | Unavailable |
Additive | 4 | 0.10 | 0.25 | < 1E-6 | Unavailable |
Additive | 4 | 0.10 | 0.50 | < 1E-6 | Unavailable |
Additive | 4 | 0.10 | 0.80 | < 1E-6 | Unavailable |
Additive | 4 | 0.25 | 0.10 | 0.000234 | Link |
Additive | 4 | 0.25 | 0.25 | 0.000068 | Link |
Additive | 4 | 0.25 | 0.50 | 0.000031 | Link |
Additive | 4 | 0.25 | 0.80 | 0.000019 | Link |
Additive | 4 | 0.40 | 0.10 | 0.036282 | Link |
Additive | 4 | 0.40 | 0.25 | 0.003383 | Link |
Additive | 4 | 0.40 | 0.50 | 0.001374 | Link |
Additive | 4 | 0.40 | 0.80 | 0.000822 | Link |
Threshold | 3 | 0.10 | 0.10 | 0.064602 | Link |
Threshold | 3 | 0.10 | 0.25 | 0.025561 | Link |
Threshold | 3 | 0.10 | 0.50 | 0.013270 | Link |
Threshold | 3 | 0.10 | 0.80 | 0.008417 | Link |
Threshold | 3 | 0.25 | 0.10 | 0.477516 | Link |
Threshold | 3 | 0.25 | 0.25 | 0.267707 | Link |
Threshold | 3 | 0.25 | 0.50 | 0.154539 | Link |
Threshold | 3 | 0.25 | 0.80 | 0.102529 | Link |
Threshold | 3 | 0.40 | 0.10 | 0.780354 | Link |
Threshold | 3 | 0.40 | 0.25 | 0.586967 | Link |
Threshold | 3 | 0.40 | 0.50 | 0.415395 | Link |
Threshold | 3 | 0.40 | 0.80 | 0.307526 | Link |
Threshold | 4 | 0.10 | 0.10 | 0.012563 | Link |
Threshold | 4 | 0.10 | 0.25 | 0.005140 | Link |
Threshold | 4 | 0.10 | 0.50 | 0.002590 | Link |
Threshold | 4 | 0.10 | 0.80 | 0.001623 | Link |
Threshold | 4 | 0.25 | 0.10 | 0.275518 | Link |
Threshold | 4 | 0.25 | 0.25 | 0.132034 | Link |
Threshold | 4 | 0.25 | 0.50 | 0.070683 | Link |
Threshold | 4 | 0.25 | 0.80 | 0.041819 | Link |
Threshold | 4 | 0.40 | 0.10 | 0.668428 | Link |
Threshold | 4 | 0.40 | 0.25 | 0.446405 | Link |
Threshold | 4 | 0.40 | 0.50 | 0.287337 | Link |
Threshold | 4 | 0.40 | 0.80 | 0.201273 | Link |
The four penetrance tables of the fourth order additive model with a MAF of 0.10 were discarded because their prevalences fall below 1E-06, which is unrealistically low for real human populations.
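For reference, the prevalence P(D) and heritability h² of a penetrance table can be recomputed from the table and the MAFs. The sketch below assumes independent loci in Hardy-Weinberg equilibrium and uses the usual definitions P(D) = Σ f_g p_g and h² = Σ f_g (p_g − P(D))² / (P(D) (1 − P(D))), where f_g is the frequency and p_g the penetrance of genotype combination g. The function name and the input layout (a flat list of penetrances, last locus varying fastest) are assumptions made for illustration and do not correspond to the GAMETES file format.

```shell
# Compute prevalence and heritability of a penetrance table.
#   $1: space-separated MAFs, one per locus (k loci)
#   $2: space-separated penetrances, 3^k values, last locus varying fastest
# Prints "<prevalence> <heritability>". Degenerate tables with a prevalence
# of exactly 0 or 1 are not handled.
prevalence_heritability() {
    awk -v mafs="$1" -v pens="$2" 'BEGIN {
        k = split(mafs, q, " ")
        n = split(pens, p, " ")
        if (n != 3 ^ k) {
            print "expected " 3 ^ k " penetrances, got " n > "/dev/stderr"
            exit 1
        }
        for (i = 0; i < n; i++) {
            # Decode index i into one genotype (0, 1 or 2 minor alleles) per
            # locus and multiply the Hardy-Weinberg genotype frequencies.
            f = 1; rest = i
            for (l = k; l >= 1; l--) {
                g = rest % 3; rest = int(rest / 3)
                if (g == 0)      f *= (1 - q[l]) ^ 2
                else if (g == 1) f *= 2 * q[l] * (1 - q[l])
                else             f *= q[l] ^ 2
            }
            freq[i] = f
            K += f * p[i + 1]          # prevalence
        }
        for (i = 0; i < n; i++)
            v += freq[i] * (p[i + 1] - K) ^ 2
        printf "%.6f %.6f\n", K, v / (K * (1 - K))
    }'
}

# Single locus with MAF 0.5 and penetrances 0, 0.5 and 1.
prevalence_heritability "0.5" "0 0.5 1"   # prints: 0.500000 0.500000
```

In the single-locus example the genotype frequencies are 0.25, 0.5 and 0.25, so the prevalence is 0.5 and the heritability is 0.125 / 0.25 = 0.5.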
Penetrance tables with no epistatic model can be obtained directly from GAMETES, by running the following shell command:
java -jar <path to GAMETES> \
-M "-h <heritability> <MAFs> -o <output model file>" \
-q 1 \
-p 100 \
-t 100000 \
-r <seed>
Unlike Toxo, GAMETES implements a stochastic algorithm, so the resulting penetrance tables depend on the seed used by the pseudorandom number generator. The seeds were picked at random, and they are listed below so that the results can be reproduced:
Order | MAF | h² | P(D) | GAMETES seed | Link |
---|---|---|---|---|---|
3 | 0.10 | 0.10 | - | Unavailable | |
3 | 0.10 | 0.25 | - | Unavailable | |
3 | 0.10 | 0.50 | - | Unavailable | |
3 | 0.10 | 0.80 | - | Unavailable | |
3 | 0.25 | 0.10 | 0.5860 | -1643481676 | Link |
3 | 0.25 | 0.25 | 0.4923 | 1764873474 | Link |
3 | 0.25 | 0.50 | 0.4223 | 1893932570 | Link |
3 | 0.25 | 0.80 | - | Unavailable | |
3 | 0.40 | 0.10 | 0.5163 | -1568883956 | Link |
3 | 0.40 | 0.25 | 0.5644 | 2089582692 | Link |
3 | 0.40 | 0.50 | 0.5019 | 1343608856 | Link |
3 | 0.40 | 0.80 | 0.4970 | 1329415446 | Link |
4 | 0.10 | 0.10 | - | Unavailable | |
4 | 0.10 | 0.25 | - | Unavailable | |
4 | 0.10 | 0.50 | - | Unavailable | |
4 | 0.10 | 0.80 | - | Unavailable | |
4 | 0.25 | 0.10 | 0.4201 | -600481346 | Link |
4 | 0.25 | 0.25 | 0.5910 | -965873914 | Link |
4 | 0.25 | 0.50 | - | Unavailable | |
4 | 0.25 | 0.80 | - | Unavailable | |
4 | 0.40 | 0.10 | 0.4356 | 913749367 | Link |
4 | 0.40 | 0.25 | 0.4720 | 203584226 | Link |
4 | 0.40 | 0.50 | - | Unavailable | |
4 | 0.40 | 0.80 | - | Unavailable | |
Many of the penetrance tables could not be obtained. The higher the interaction order, the lower the MAFs and the higher the heritability, the less likely GAMETES is to converge on a solution.
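To regenerate one of the available tables, the seed from the table above is plugged into the command shown earlier. The helper below is only a sketch: it prints the command instead of running it, the function name and the file names `gametes.jar` and `model.txt` are hypothetical, and the MAF options are passed through verbatim in place of `<MAFs>`.

```shell
# Print (without running) the GAMETES command that generates one
# penetrance table with no epistasis model.
#   $1: path to the GAMETES jar   $2: heritability
#   $3: MAF options (verbatim)    $4: output model file   $5: seed
gametes_model_command() {
    printf 'java -jar %s -M "-h %s %s -o %s" -q 1 -p 100 -t 100000 -r %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Third order, MAF 0.25, heritability 0.10, seed taken from the table.
gametes_model_command gametes.jar 0.10 '<MAFs>' model.txt -1643481676
```

Printing the command first makes it easy to check the seed and heritability against the table before launching a run.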
Using the penetrance tables above, 100 data sets were simulated for each penetrance table. Every data set contains 500 SNPs (including the interacting SNPs) and 2000 individuals (1000 cases and 1000 controls), with the MAFs of the non-interacting loci sampled uniformly from the interval [0.05, 0.5]. This can be achieved by running the following shell command for each penetrance table:
java -jar <path to GAMETES> \
-i <path to penetrance table> \
-D "-n <lower MAF> \
-x <upper MAF> \
-a <number of SNPs> \
-s <number of cases> \
-w <number of controls> \
-r <number of datasets> \
-o <output folder>"
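With the values stated above filled in, the same command can be generated for every penetrance table in a directory. This is a sketch under assumptions: the function name and the directory names `penetrance_tables` and `datasets` are hypothetical, and the commands are printed rather than executed so they can be reviewed first. Paths containing spaces are not handled.

```shell
# Print one GAMETES data set generation command per penetrance table.
#   $1: path to the GAMETES jar
#   $2: directory containing the penetrance tables
#   $3: root directory for the generated data sets
dataset_commands() {
    for table in "$2"/*; do
        [ -f "$table" ] || continue   # skip if the directory is empty or missing
        name=$(basename "$table")
        # MAFs in [0.05, 0.5], 500 SNPs, 1000 cases, 1000 controls, 100 data sets.
        printf 'java -jar %s -i %s -D "-n 0.05 -x 0.5 -a 500 -s 1000 -w 1000 -r 100 -o %s/%s"\n' \
            "$1" "$table" "$3" "$name"
    done
}

dataset_commands gametes.jar penetrance_tables datasets
```

Once the printed commands look right, they can be executed by piping the output to `sh`.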
The following table lists all epistatic data sets used during the evaluation:
Model | Marginal Effects | Order | MAF | h² | Data set |
---|---|---|---|---|---|
Additive | Yes | 3 | 0.10 | 0.10 | Folder |
Additive | Yes | 3 | 0.10 | 0.25 | Folder |
Additive | Yes | 3 | 0.10 | 0.50 | Folder |
Additive | Yes | 3 | 0.10 | 0.80 | Folder |
Additive | Yes | 3 | 0.25 | 0.10 | Folder |
Additive | Yes | 3 | 0.25 | 0.25 | Folder |
Additive | Yes | 3 | 0.25 | 0.50 | Folder |
Additive | Yes | 3 | 0.25 | 0.80 | Folder |
Additive | Yes | 3 | 0.40 | 0.10 | Folder |
Additive | Yes | 3 | 0.40 | 0.25 | Folder |
Additive | Yes | 3 | 0.40 | 0.50 | Folder |
Additive | Yes | 3 | 0.40 | 0.80 | Folder |
Additive | Yes | 4 | 0.25 | 0.10 | Folder |
Additive | Yes | 4 | 0.25 | 0.25 | Folder |
Additive | Yes | 4 | 0.25 | 0.50 | Folder |
Additive | Yes | 4 | 0.25 | 0.80 | Folder |
Additive | Yes | 4 | 0.40 | 0.10 | Folder |
Additive | Yes | 4 | 0.40 | 0.25 | Folder |
Additive | Yes | 4 | 0.40 | 0.50 | Folder |
Additive | Yes | 4 | 0.40 | 0.80 | Folder |
Threshold | Yes | 3 | 0.10 | 0.10 | Folder |
Threshold | Yes | 3 | 0.10 | 0.25 | Folder |
Threshold | Yes | 3 | 0.10 | 0.50 | Folder |
Threshold | Yes | 3 | 0.10 | 0.80 | Folder |
Threshold | Yes | 3 | 0.25 | 0.10 | Folder |
Threshold | Yes | 3 | 0.25 | 0.25 | Folder |
Threshold | Yes | 3 | 0.25 | 0.50 | Folder |
Threshold | Yes | 3 | 0.25 | 0.80 | Folder |
Threshold | Yes | 3 | 0.40 | 0.10 | Folder |
Threshold | Yes | 3 | 0.40 | 0.25 | Folder |
Threshold | Yes | 3 | 0.40 | 0.50 | Folder |
Threshold | Yes | 3 | 0.40 | 0.80 | Folder |
Threshold | Yes | 4 | 0.10 | 0.10 | Folder |
Threshold | Yes | 4 | 0.10 | 0.25 | Folder |
Threshold | Yes | 4 | 0.10 | 0.50 | Folder |
Threshold | Yes | 4 | 0.10 | 0.80 | Folder |
Threshold | Yes | 4 | 0.25 | 0.10 | Folder |
Threshold | Yes | 4 | 0.25 | 0.25 | Folder |
Threshold | Yes | 4 | 0.25 | 0.50 | Folder |
Threshold | Yes | 4 | 0.25 | 0.80 | Folder |
Threshold | Yes | 4 | 0.40 | 0.10 | Folder |
Threshold | Yes | 4 | 0.40 | 0.25 | Folder |
Threshold | Yes | 4 | 0.40 | 0.50 | Folder |
Threshold | Yes | 4 | 0.40 | 0.80 | Folder |
None | No | 3 | 0.25 | 0.10 | Folder |
None | No | 3 | 0.25 | 0.25 | Folder |
None | No | 3 | 0.25 | 0.50 | Folder |
None | No | 3 | 0.40 | 0.10 | Folder |
None | No | 3 | 0.40 | 0.25 | Folder |
None | No | 3 | 0.40 | 0.50 | Folder |
None | No | 3 | 0.40 | 0.80 | Folder |
None | No | 4 | 0.25 | 0.10 | Folder |
None | No | 4 | 0.25 | 0.25 | Folder |
None | No | 4 | 0.40 | 0.10 | Folder |
None | No | 4 | 0.40 | 0.25 | Folder |
Performing a git clone is the preferred method of downloading this repository, although, due to its large size, it might take a while to complete. If you are interested in only downloading a specific folder, you may want to use subversion instead:
svn checkout https://github.com/UDC-GAC/epistasis-simulation-data/trunk/<folder>
For example, if you are only interested in epistasis data with no marginal effects, you can download only those data sets by running:
svn checkout https://github.com/UDC-GAC/epistasis-simulation-data/trunk/datasets/epistasis/no_model
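A single folder can also be fetched with Git alone. The sketch below swaps subversion for a sparse, blobless clone, which is a different technique from the one described above and requires Git 2.25 or newer; the function name `sparse_clone` is ours.

```shell
# Download a single folder of a Git repository with a sparse, blobless clone.
#   $1: repository URL   $2: folder inside the repository   $3: destination
sparse_clone() {
    # --sparse checks out only top-level files at first; --filter=blob:none
    # defers downloading file contents until they are needed.
    git clone --filter=blob:none --sparse "$1" "$3" &&
        git -C "$3" sparse-checkout set "$2"
}

# Example (not executed here):
# sparse_clone https://github.com/UDC-GAC/epistasis-simulation-data \
#     datasets/epistasis/no_model epistasis-simulation-data
```

Unlike the subversion checkout, the result is a regular Git working copy, so further folders can be added later with `git sparse-checkout add <folder>`.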
[1] C. Ponte-Fernandez, J. Gonzalez-Dominguez, A. Carvajal-Rodriguez and M. J. Martin. Evaluation of Existing Methods for High-Order Epistasis Detection. IEEE/ACM Transactions on Computational Biology and Bioinformatics. https://doi.org/10.1109/TCBB.2020.3030312
[2] Urbanowicz, R.J., Kiralis, J., Sinnott-Armstrong, N.A. et al. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Mining 5, 16 (2012). https://doi.org/10.1186/1756-0381-5-16
[3] Ponte-Fernández, C., González-Domínguez, J., Carvajal-Rodríguez, A. et al. Toxo: a library for calculating penetrance tables of high-order epistasis models. BMC Bioinformatics 21, 138 (2020). https://doi.org/10.1186/s12859-020-3456-3
[4] Marchini, J., Donnelly, P. & Cardon, L. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 37, 413–417 (2005). https://doi.org/10.1038/ng1537