Epistasis simulation data

This repository serves three purposes:

  1. Document the simulation design process used in our work Evaluation of Existing Methods for High-Order Epistasis Detection[1].
  2. Publish our developed data sets, allowing other researchers to use the same data.
  3. Ensure that our results are reproducible by any other scientist.

Simulation design

The objective of our work is to compare the performance of several epistasis detection methods, both in terms of detection power and false positives. To do that, a collection of data sets was created with the intention of providing a wide variety of population characteristics resembling real human traits, on which the different methods will be evaluated.

All our simulations were carried out using GAMETES[2], the most widely used epistasis simulation package in recent literature. They include epistasis both with and without marginal effects. Simulation in GAMETES can be summarized in three steps:

  1. Calculate a penetrance table describing the epistatic interaction to be included in the data.
  2. Simulate population samples of the interacting SNPs following their penetrance table.
  3. Simulate population samples of the rest of the non-interacting SNPs.
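
For reference, the two quantities that characterize a penetrance table throughout this document, prevalence P(D) and heritability, can be written as below. This is a sketch that assumes the usual definitions from the GAMETES paper [2], with genotype frequencies following Hardy-Weinberg equilibrium and independent loci; the symbols g, p_g, f_g and h^2 are notation introduced here for illustration only.

```latex
% g   : one of the 3^k genotype combinations of the k interacting SNPs
% p_g : population frequency of combination g
% f_g : penetrance of combination g, i.e. P(D | g)
P(D) = \sum_{g} p_g \, f_g
\qquad
h^2 = \frac{\sum_{g} p_g \,\bigl(f_g - P(D)\bigr)^2}{P(D)\,\bigl(1 - P(D)\bigr)}
```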

Our criteria were to create third- and fourth-order interactions, with MAFs of the interacting SNPs of 0.10, 0.25 and 0.40, resulting in traits with heritabilities of 0.10, 0.25, 0.50 and 0.80, and prevalences above 1E-06.
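
The parameter grid described above can be enumerated with a short shell loop. This is only an illustrative sketch: the function name list_configurations is hypothetical and not part of this repository. It prints one line per combination of order, MAF and heritability.

```shell
# Hypothetical helper: print every (order, MAF, heritability) combination
# named in the simulation criteria, one combination per line.
list_configurations() {
    for order in 3 4; do
        for maf in 0.10 0.25 0.40; do
            for h2 in 0.10 0.25 0.50 0.80; do
                printf '%s %s %s\n' "$order" "$maf" "$h2"
            done
        done
    done
}
```

Running list_configurations yields 2 x 3 x 4 = 24 lines, which is the number of configurations attempted for each epistasis model.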

Penetrance tables

GAMETES can only create penetrance tables describing interactions that follow no specific epistasis model and show no marginal effects. Therefore, to generate penetrance tables following a specific epistasis model, Toxo[3] was used. Toxo is a MATLAB library capable of calculating penetrance tables of any bivariate epistasis model under certain conditions. In this study we decided to use the additive and threshold models proposed by Marchini et al. in [4] for third and fourth order epistasis.

The following MATLAB code snippet shows how the different penetrance tables with marginal effects were obtained using Toxo. The directory epistasis_models/ contains the Marchini models used, in CSV format, as required by Toxo.

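% Make the Toxo classes available on the MATLAB path
addpath('<path to Toxo>/src/');
% Load the epistasis model from its CSV description
m = toxo.Model('epistasis_models/<model>');
% Find the penetrance table of maximum prevalence for the given MAFs and heritability
pt = m.find_max_prevalence([<MAFs>], <heritability>);
% Write the table in GAMETES format ("..." continues the statement on the next line)
pt.write('<path to penetrance table output>', ...
         toxo.PTable.format_gametes, ...
         <MAFs>);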

This procedure yielded the following penetrance tables:

Model Order MAF Heritability P(D) File
Additive 3 0.10 0.10 0.000012 Link
Additive 3 0.10 0.25 0.000004 Link
Additive 3 0.10 0.50 0.000002 Link
Additive 3 0.10 0.80 0.000001 Link
Additive 3 0.25 0.10 0.005370 Link
Additive 3 0.25 0.25 0.001153 Link
Additive 3 0.25 0.50 0.000504 Link
Additive 3 0.25 0.80 0.000306 Link
Additive 3 0.40 0.10 0.254558 Link
Additive 3 0.40 0.25 0.022186 Link
Additive 3 0.40 0.50 0.008545 Link
Additive 3 0.40 0.80 0.005091 Link
Additive 4 0.10 0.10 < 1E-6 Unavailable
Additive 4 0.10 0.25 < 1E-6 Unavailable
Additive 4 0.10 0.50 < 1E-6 Unavailable
Additive 4 0.10 0.80 < 1E-6 Unavailable
Additive 4 0.25 0.10 0.000234 Link
Additive 4 0.25 0.25 0.000068 Link
Additive 4 0.25 0.50 0.000031 Link
Additive 4 0.25 0.80 0.000019 Link
Additive 4 0.40 0.10 0.036282 Link
Additive 4 0.40 0.25 0.003383 Link
Additive 4 0.40 0.50 0.001374 Link
Additive 4 0.40 0.80 0.000822 Link
Threshold 3 0.10 0.10 0.064602 Link
Threshold 3 0.10 0.25 0.025561 Link
Threshold 3 0.10 0.50 0.013270 Link
Threshold 3 0.10 0.80 0.008417 Link
Threshold 3 0.25 0.10 0.477516 Link
Threshold 3 0.25 0.25 0.267707 Link
Threshold 3 0.25 0.50 0.154539 Link
Threshold 3 0.25 0.80 0.102529 Link
Threshold 3 0.40 0.10 0.780354 Link
Threshold 3 0.40 0.25 0.586967 Link
Threshold 3 0.40 0.50 0.415395 Link
Threshold 3 0.40 0.80 0.307526 Link
Threshold 4 0.10 0.10 0.012563 Link
Threshold 4 0.10 0.25 0.005140 Link
Threshold 4 0.10 0.50 0.002590 Link
Threshold 4 0.10 0.80 0.001623 Link
Threshold 4 0.25 0.10 0.275518 Link
Threshold 4 0.25 0.25 0.132034 Link
Threshold 4 0.25 0.50 0.070683 Link
Threshold 4 0.25 0.80 0.041819 Link
Threshold 4 0.40 0.10 0.668428 Link
Threshold 4 0.40 0.25 0.446405 Link
Threshold 4 0.40 0.50 0.287337 Link
Threshold 4 0.40 0.80 0.201273 Link

The four penetrance tables of the fourth-order additive model with a MAF of 0.10 were discarded because their prevalences fall below 1E-06, which is unrealistic for real human populations.

Penetrance tables that follow no epistasis model can be obtained directly from GAMETES by running the following shell command:

java -jar <path to GAMETES> \
    -M "-h <heritability> <MAFs> -o <output model file>" \
    -q 1 \
    -p 100 \
    -t 100000 \
    -r <seed>

Unlike Toxo, GAMETES implements a stochastic algorithm, so its penetrance tables vary with the seed used by the pseudorandom number generator. We therefore picked seeds at random and list them below, so that the results are reproducible:

Order MAF Heritability P(D) GAMETES seed Link
3 0.10 0.10 - Unavailable
3 0.10 0.25 - Unavailable
3 0.10 0.50 - Unavailable
3 0.10 0.80 - Unavailable
3 0.25 0.10 0.5860 -1643481676 Link
3 0.25 0.25 0.4923 1764873474 Link
3 0.25 0.50 0.4223 1893932570 Link
3 0.25 0.80 - Unavailable
3 0.40 0.10 0.5163 -1568883956 Link
3 0.40 0.25 0.5644 2089582692 Link
3 0.40 0.50 0.5019 1343608856 Link
3 0.40 0.80 0.4970 1329415446 Link
4 0.10 0.10 - Unavailable
4 0.10 0.25 - Unavailable
4 0.10 0.50 - Unavailable
4 0.10 0.80 - Unavailable
4 0.25 0.10 0.4201 -600481346 Link
4 0.25 0.25 0.5910 -965873914 Link
4 0.25 0.50 - Unavailable
4 0.25 0.80 - Unavailable
4 0.40 0.10 0.4356 913749367 Link
4 0.40 0.25 0.4720 203584226 Link
4 0.40 0.50 - Unavailable
4 0.40 0.80 - Unavailable

Many of the penetrance tables could not be obtained. The higher the interaction order, the lower the MAFs and the higher the heritability, the less likely GAMETES is to converge on a solution.

Data generation

Using the penetrance tables above, 100 data sets were simulated for each table. Each data set contains 500 SNPs (including the interacting SNPs) for 2000 individuals (1000 cases and 1000 controls), with the MAFs of the non-interacting loci uniformly sampled from the interval [0.05, 0.5]. This can be achieved by running the following shell command for each penetrance table:

java -jar <path to GAMETES> \
    -i <path to penetrance table> \
    -D "-n <lower MAF>
        -x <upper MAF>
        -a <number of SNPs>
        -s <number of cases>
        -w <number of controls>
        -r <number of datasets>
        -o <output folder>"
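
As a concrete illustration, the placeholders above can be filled in with the values stated in this section (MAF range [0.05, 0.5], 500 SNPs, 1000 cases, 1000 controls, 100 data sets). The sketch below only prints the resulting command rather than running it. The function name and the paths passed to it are hypothetical, and it assumes that -a counts all SNPs, including the interacting ones.

```shell
# Hypothetical helper: print the GAMETES data-generation command for one
# penetrance table, using the parameter values stated in this section.
# $1: path to the GAMETES jar, $2: penetrance table, $3: output folder
gametes_dataset_cmd() {
    printf 'java -jar %s -i %s -D "-n 0.05 -x 0.5 -a 500 -s 1000 -w 1000 -r 100 -o %s"\n' \
        "$1" "$2" "$3"
}
```

Calling the function once per penetrance table, for example inside a for loop over the table files, gives a list of commands that can be reviewed before execution. Paths containing spaces would need additional quoting.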

The following table lists all epistatic data sets used during the evaluation:

Model Marginal effects Order MAF Heritability Data set
Additive Yes 3 0.10 0.10 Folder
Additive Yes 3 0.10 0.25 Folder
Additive Yes 3 0.10 0.50 Folder
Additive Yes 3 0.10 0.80 Folder
Additive Yes 3 0.25 0.10 Folder
Additive Yes 3 0.25 0.25 Folder
Additive Yes 3 0.25 0.50 Folder
Additive Yes 3 0.25 0.80 Folder
Additive Yes 3 0.40 0.10 Folder
Additive Yes 3 0.40 0.25 Folder
Additive Yes 3 0.40 0.50 Folder
Additive Yes 3 0.40 0.80 Folder
Additive Yes 4 0.25 0.10 Folder
Additive Yes 4 0.25 0.25 Folder
Additive Yes 4 0.25 0.50 Folder
Additive Yes 4 0.25 0.80 Folder
Additive Yes 4 0.40 0.10 Folder
Additive Yes 4 0.40 0.25 Folder
Additive Yes 4 0.40 0.50 Folder
Additive Yes 4 0.40 0.80 Folder
Threshold Yes 3 0.10 0.10 Folder
Threshold Yes 3 0.10 0.25 Folder
Threshold Yes 3 0.10 0.50 Folder
Threshold Yes 3 0.10 0.80 Folder
Threshold Yes 3 0.25 0.10 Folder
Threshold Yes 3 0.25 0.25 Folder
Threshold Yes 3 0.25 0.50 Folder
Threshold Yes 3 0.25 0.80 Folder
Threshold Yes 3 0.40 0.10 Folder
Threshold Yes 3 0.40 0.25 Folder
Threshold Yes 3 0.40 0.50 Folder
Threshold Yes 3 0.40 0.80 Folder
Threshold Yes 4 0.10 0.10 Folder
Threshold Yes 4 0.10 0.25 Folder
Threshold Yes 4 0.10 0.50 Folder
Threshold Yes 4 0.10 0.80 Folder
Threshold Yes 4 0.25 0.10 Folder
Threshold Yes 4 0.25 0.25 Folder
Threshold Yes 4 0.25 0.50 Folder
Threshold Yes 4 0.25 0.80 Folder
Threshold Yes 4 0.40 0.10 Folder
Threshold Yes 4 0.40 0.25 Folder
Threshold Yes 4 0.40 0.50 Folder
Threshold Yes 4 0.40 0.80 Folder
None No 3 0.25 0.10 Folder
None No 3 0.25 0.25 Folder
None No 3 0.25 0.50 Folder
None No 3 0.40 0.10 Folder
None No 3 0.40 0.25 Folder
None No 3 0.40 0.50 Folder
None No 3 0.40 0.80 Folder
None No 4 0.25 0.10 Folder
None No 4 0.25 0.25 Folder
None No 4 0.40 0.10 Folder
None No 4 0.40 0.25 Folder

Downloading

Performing a git clone is the preferred method of downloading this repository although, due to its large size, it might take a while to complete. If you are only interested in downloading a specific folder, you may want to use Subversion instead:

svn checkout https://github.com/UDC-GAC/epistasis-simulation-data/trunk/<folder>

For example, if you are only interested in epistasis data with no marginal effects, you can download only those datasets by running svn checkout https://github.com/UDC-GAC/epistasis-simulation-data/trunk/datasets/epistasis/no_model.
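
An alternative that relies only on git is a sparse, partial clone. GitHub has since retired its Subversion bridge, so the svn command above may no longer work. The sketch below assumes git 2.25 or later; the function name sparse_clone and the destination directory are hypothetical.

```shell
# Hypothetical helper: clone a repository but check out a single folder.
# $1: repository URL, $2: folder inside the repository, $3: destination
sparse_clone() {
    # --sparse starts with only the top-level files checked out;
    # --filter=blob:none defers downloading file contents until needed
    git clone --quiet --filter=blob:none --sparse "$1" "$3" &&
        git -C "$3" sparse-checkout set "$2"
}
```

For instance, sparse_clone https://github.com/UDC-GAC/epistasis-simulation-data datasets/epistasis/no_model epistasis-data would fetch only the data sets with no marginal effects, plus the files at the top level of the repository.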

References

[1] C. Ponte-Fernandez, J. Gonzalez-Dominguez, A. Carvajal-Rodriguez and M. J. Martin. Evaluation of Existing Methods for High-Order Epistasis Detection. IEEE/ACM Transactions on Computational Biology and Bioinformatics. https://doi.org/10.1109/TCBB.2020.3030312

[2] Urbanowicz, R.J., Kiralis, J., Sinnott-Armstrong, N.A. et al. GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Mining 5, 16 (2012). https://doi.org/10.1186/1756-0381-5-16

[3] Ponte-Fernández, C., González-Domínguez, J., Carvajal-Rodríguez, A. et al. Toxo: a library for calculating penetrance tables of high-order epistasis models. BMC Bioinformatics 21, 138 (2020). https://doi.org/10.1186/s12859-020-3456-3

[4] Marchini, J., Donnelly, P. & Cardon, L. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 37, 413–417 (2005). https://doi.org/10.1038/ng1537
