This repository contains source code and Jupyter notebooks for data processing, simulations and analyses used in this paper.
To reproduce everything from scratch, you'll need to install all dependencies listed below.
Full disclosure: I've been very lucky to have access to amazing computational resources (60-core machines with 1 TB RAM and a cluster with hundreds of nodes) and I often used them to their full potential. Unless you have similar resources, it won't be trivial to reproduce all results from scratch. At the very least, it will take much longer to run all the simulations if you cannot parallelize them effectively.
If you don't want to re-run the whole simulation and analysis pipeline but still want to play around with the results and plots, you can use the RData files in the data/ subdirectory. The notebook is a good starting point, as it loads those processed R data files and uses them to generate the plots for the paper.
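Loading one of those processed files into a fresh R session takes only a couple of lines. A sketch (the file name here is a hypothetical placeholder; check the data/ subdirectory for the actual files):

```r
# Sketch: loading a processed RData file from the data/ subdirectory.
# "simulations.RData" is a hypothetical name -- list data/ for real files.
library(here)

load(here("data", "simulations.RData"))
ls()  # show which objects the file brought into the session
```

The here package resolves paths relative to the repository root, so this works the same from the command line and from inside a notebook.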
I used Python version 3.6.5 and the following Python modules:
pip install numpy pandas msprime pybedtools jupyter
The full list of Python modules I had installed in the project environment can be found in the
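To keep those modules isolated from your system Python, you can install them into a virtual environment. A sketch ("requirements.txt" is a placeholder name; use whatever file in the repository actually contains the pinned module list):

```shell
# Sketch: recreating the project's Python environment in a virtualenv.
# "requirements.txt" is a hypothetical file name for the pinned module list.
python3 -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Using pinned versions from a requirements file is the safest way to match the Python 3.6.5 environment the analyses were run in.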
I used R version 3.4.3.
Packages from CRAN:
install.packages(c("broom", "forcats", "future", "ggbeeswarm", "ggrepel", "here", "magrittr", "modelr", "purrr", "stringr", "tidyverse"))
Packages from Bioconductor:
install.packages("BiocManager")
BiocManager::install(c("biomaRt", "VariantAnnotation", "BSgenome.Hsapiens.UCSC.hg19", "GenomicRanges", "rtracklayer"))
Packages from GitHub:
install.packages("devtools")
devtools::install_github("bodkan/bdkn")
devtools::install_github("bodkan/slimr", ref = "v0.1")
devtools::install_github("bodkan/admixr", ref = "v0.6.2")
To be able to run the Jupyter notebooks that contain all my analyses and figures, you will also need to install IRkernel.
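Installing IRkernel and registering it with Jupyter is a two-step process from within R:

```r
# Install the R kernel for Jupyter and register it with the local
# Jupyter installation so R notebooks can be opened and executed.
install.packages("IRkernel")
IRkernel::installspec()
```

After installspec() runs, an "R" kernel should appear in the Jupyter notebook kernel menu.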
I used SLiM v2.6. Be aware that SLiM has introduced some backwards-incompatible changes since its 2.0 release, so make sure to use exactly version 2.6.
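You can verify which version is installed before running anything (assuming the slim binary is on your PATH):

```shell
# Check the installed SLiM version; the output should report 2.6.
slim -v
```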
In principle, different notebooks in the notebooks/ directory use different data generated by "pipeline scripts" in the root of the repository. However, there's no strict sequential order for executing everything. In fact, I ran those scripts mostly in pieces, adding additional commands as the project developed, and analyzed new data as they were being generated.