This repository contains data and scripts for reproducing the results accompanying the manuscript
Zhenchen Hong1, and John P. Barton2,3,#
1 Department of Physics and Astronomy, University of California, Riverside
2 Department of Physics and Astronomy, University of Pittsburgh
3 Department of Computational and Systems Biology, University of Pittsburgh School of Medicine
# correspondence to jpbarton@pitt.edu
This work is currently available on bioRxiv at this link.
Scripts for generating and analyzing simulation data can be found in the simulation.ipynb notebook. Scripts for processing and analyzing deep mutational scanning data are contained in the data_analysis.ipynb notebook. Finally, scripts for analysis and figures contained in the manuscript are located in the figures.ipynb notebook.
Because the interim analysis of deep mutational scanning data generates many large files, some data is stored in compressed archives on Zenodo. To access the full set of data, navigate to the Zenodo record, then download the archives and extract their contents into the directory epistasis_inference/.
Methods to infer epistasis are implemented in C++11 and make use of the GNU Scientific Library and Eigen.
We use version 3.4.0 of Eigen, which can be downloaded from this link. For epistasis inference, this file should be unzipped into the ./epistasis_inference/ directory.
popDMS uses codon counts in dms_tools format or sequence counts in MaveDB-HGVS format for input. For reference, this link demonstrates the format for codon counts, and this link shows an example file in MaveDB-HGVS format.
Running popDMS differs slightly depending on the format of the input data.
When using codon counts as input, we require three variables: codon_counts_files, replicates, and times. Here codon_counts_files contains a list of file paths to codon counts files. For each file, there must be a corresponding entry in the list replicates that identifies which replicate the file belongs to, and an entry in the list times that gives the time (in number of generations) at which sequencing was performed to obtain this data. For examples, see data_analysis.ipynb.
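As a minimal sketch of how the three lists line up, the inputs could be assembled and checked as shown here. The file names, replicate labels, and sampling times are hypothetical placeholders, and check_codon_inputs is an illustrative helper that is not part of popDMS.

```python
# Hypothetical inputs: two replicates, each sequenced at generations 0 and 10.
# None of these paths exist in the repository; substitute your own files.
codon_counts_files = [
    "data/rep1_pre_codoncounts.csv",
    "data/rep1_post_codoncounts.csv",
    "data/rep2_pre_codoncounts.csv",
    "data/rep2_post_codoncounts.csv",
]
replicates = [1, 1, 2, 2]  # replicate that each file belongs to
times = [0, 10, 0, 10]     # generation at which each file was sequenced


def check_codon_inputs(files, reps, gens):
    """Illustrative helper (not part of popDMS).

    Verifies that the three lists have one entry per file, then groups
    the files by replicate and sorts each group by sampling time.
    """
    if not (len(files) == len(reps) == len(gens)):
        raise ValueError(
            "codon_counts_files, replicates, and times must have equal length"
        )
    grouped = {}
    for path, rep, gen in zip(files, reps, gens):
        grouped.setdefault(rep, []).append((gen, path))
    return {rep: sorted(entries) for rep, entries in grouped.items()}


files_by_replicate = check_codon_inputs(codon_counts_files, replicates, times)
for rep, entries in files_by_replicate.items():
    print(rep, entries)
```

Grouping by replicate before running the analysis makes a mismatched entry (for example, a file assigned to the wrong replicate) easy to spot.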
When using sequence counts, we require five variables: haplotype_counts_file, reference_sequence_file, n_replicates, time_points, and time_cols. The variable haplotype_counts_file gives the path to a file containing the sequence counts. To normalize the selection coefficients relative to a reference sequence, reference_sequence_file should provide the path to a file storing the reference sequence in plain text (for an example, see here). The total number of replicates is specified by n_replicates. For each replicate, the time(s) at which data was collected are given as a list in time_points. Finally, for each replicate, the variable time_cols points to the columns in the haplotype_counts_file that store sequence counts for each time. For examples, see data_analysis.ipynb.
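The five variables might be laid out as in the sketch below. This is an illustration under stated assumptions: the paths and column names are hypothetical, and it assumes that time_points and time_cols each hold one inner list per replicate and that columns are referred to by name. The examples in data_analysis.ipynb show the layout that popDMS actually expects. The helper check_sequence_inputs is not part of popDMS.

```python
# Hypothetical paths and column names; replace with your own.
haplotype_counts_file = "data/example_haplotype_counts.csv"
reference_sequence_file = "data/example_reference.txt"
n_replicates = 2

# Assumption: one inner list per replicate.
time_points = [[0, 5], [0, 5]]                  # generations sampled
time_cols = [["rep1_t0", "rep1_t5"],            # count columns, replicate 1
             ["rep2_t0", "rep2_t5"]]            # count columns, replicate 2


def check_sequence_inputs(n_reps, points, cols):
    """Illustrative helper (not part of popDMS).

    Checks that every replicate has a list of times and a list of
    columns, and that each time has exactly one matching column.
    Returns a list of {time: column} dicts, one per replicate.
    """
    if len(points) != n_reps or len(cols) != n_reps:
        raise ValueError("need one entry per replicate in time_points and time_cols")
    mapping = []
    for rep, (rep_times, rep_cols) in enumerate(zip(points, cols)):
        if len(rep_times) != len(rep_cols):
            raise ValueError(f"replicate {rep}: times and columns differ in length")
        mapping.append(dict(zip(rep_times, rep_cols)))
    return mapping


columns_by_time = check_sequence_inputs(n_replicates, time_points, time_cols)
print(columns_by_time)
```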
For both approaches, popDMS will compute and save the variant frequencies needed to calculate selection coefficients. Using these files, the code to infer the selection coefficients can quickly be rerun using the infer_independent (for codon counts) or infer_correlated (for sequence counts) methods. Both methods will save a compressed comma-separated values (CSV) file containing the selection coefficients inferred at the optimal value of the regularization strength. The file can be unzipped and viewed as plain text or with a program such as Microsoft Excel.
The columns of the selection coefficient file are:
site: Specifies the site at which the variant is observed, following the numbering of sites in the original input file
amino_acid: Specifies the amino acid (including stops)
WT_indicator: Set to True if the amino acid matches the reference at that site, and False otherwise
rep_x: Values in these columns give the selection coefficients inferred for each replicate independently, with replicates numbered starting from 0 (rep_0)
joint: Joint selection coefficients inferred across all replicates
This CSV file can be used for downstream analysis. We also provide a built-in plotting function fig_dms that takes a path to the CSV file as input and produces a heatmap of the inferred selection coefficients.
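For downstream analysis outside the built-in plotting function, the output can also be read with the Python standard library. The sketch below assumes the file is gzip-compressed and uses the column names listed above; the compression format and the path in the usage comment are assumptions, so adjust them to match the file that popDMS writes. The function load_selection_coefficients is illustrative and not part of popDMS.

```python
import csv
import gzip


def load_selection_coefficients(path):
    """Read a gzip-compressed selection coefficient CSV into a list of dicts.

    Illustrative helper (not part of popDMS). Assumes the columns
    site, amino_acid, WT_indicator, rep_0, rep_1, ..., and joint.
    """
    rows = []
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            # Convert the text flag to a boolean.
            row["WT_indicator"] = row["WT_indicator"] == "True"
            # Convert per-replicate and joint coefficients to floats.
            for key in list(row):
                if key == "joint" or key.startswith("rep_"):
                    row[key] = float(row[key])
            rows.append(row)
    return rows


# Usage (hypothetical path):
# rows = load_selection_coefficients("output/example_selection_coefficients.csv.gz")
# non_wt = [r for r in rows if not r["WT_indicator"]]
```

The site column is left as text because site numbering follows the original input file and is not guaranteed to be a plain integer.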
The format of input data for epistasis inference is described in this file. Once data has been stored in this format, inference of epistatic interactions proceeds by running the shell script run_epistasis.sh in the epistasis_inference directory.
This repository is dual licensed as GPL-3.0 (source code) and CC0 1.0 (figures, documentation, and our presentation of the data).