Overview

This repository contains data and scripts for reproducing the results accompanying the manuscript

popDMS infers mutation effects from deep mutational scanning data

Zhenchen Hong¹ and John P. Barton²,³,#

¹ Department of Physics and Astronomy, University of California, Riverside
² Department of Physics and Astronomy, University of Pittsburgh
³ Department of Computational and Systems Biology, University of Pittsburgh School of Medicine
# correspondence to jpbarton@pitt.edu

This work is currently available on bioRxiv at this link.

Scripts for generating and analyzing simulation data can be found in the simulation.ipynb notebook. Scripts for processing and analyzing deep mutational scanning data are contained in the data_analysis.ipynb notebook. Finally, scripts for analysis and figures contained in the manuscript are located in the figures.ipynb notebook.

Because the interim analysis of deep mutational scanning data generates many large files, some data are stored as compressed archives on Zenodo. To access the full data set, navigate to the Zenodo record, then download the archives and extract their contents into the directory epistasis_inference/.

Software dependencies

Methods to infer epistasis are implemented in C++11 and make use of the GNU Scientific Library and Eigen.

We use version 3.4.0 of Eigen, which can be downloaded from this link. For epistasis inference, the downloaded archive should be unzipped into the ./epistasis_inference/ directory.

Running popDMS

popDMS uses codon counts in dms_tools format or sequence counts in MaveDB-HGVS format for input. For reference, this link demonstrates the format for codon counts, and this link shows an example file in MaveDB-HGVS format.

Running popDMS differs slightly depending on the format of the input data.

Using codon counts

When using codon counts as input, we require three variables: codon_counts_files, replicates, and times. Here codon_counts_files contains a list of file paths to codon counts files. For each file, there must be a corresponding entry in the list replicates that identifies which replicate the file belongs to, and a corresponding entry in the list times that gives the time (in number of generations) at which sequencing was performed to obtain this data. For examples, see data_analysis.ipynb.
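The relationship between the three lists can be sketched as follows. This is only an illustration of the parallel-list layout described above, not code from popDMS: the file paths, replicate labels, time values, and the helper check_codon_inputs are all hypothetical. See data_analysis.ipynb for the actual usage.

```python
# Hypothetical file paths; replace with paths to your own codon counts files.
codon_counts_files = [
    "data/rep1_pre.csv",
    "data/rep1_post.csv",
    "data/rep2_pre.csv",
    "data/rep2_post.csv",
]
# One entry per file: which replicate the file belongs to (labels are made up).
replicates = [1, 1, 2, 2]
# One entry per file: sequencing time in generations (values are made up).
times = [0, 10, 0, 10]


def check_codon_inputs(files, replicates, times):
    """Check that the three lists are parallel and group files by replicate.

    Returns a dict mapping each replicate label to a list of
    (time, path) tuples sorted by time. This helper is not part of popDMS.
    """
    if not (len(files) == len(replicates) == len(times)):
        raise ValueError(
            "codon_counts_files, replicates, and times must have equal length"
        )
    grouped = {}
    for path, rep, t in zip(files, replicates, times):
        grouped.setdefault(rep, []).append((t, path))
    for entries in grouped.values():
        entries.sort()
    return grouped


grouped = check_codon_inputs(codon_counts_files, replicates, times)
for rep, entries in grouped.items():
    print(rep, entries)
```

Running a check like this before inference makes it easier to catch a file assigned to the wrong replicate or a missing time entry.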

Using sequence counts

When using sequence counts, we require five variables: haplotype_counts_file, reference_sequence_file, n_replicates, time_points, and time_cols. The variable haplotype_counts_file gives the path to a file containing the sequence counts. To normalize the selection coefficients relative to a reference sequence, reference_sequence_file should provide the path to a file storing the reference sequence in plain text (for an example, see here). The total number of replicates is specified by n_replicates. For each replicate, the time(s) at which data was collected are given as a list in time_points. Finally, for each replicate, the variable time_cols points to the columns in the haplotype_counts_file that store sequence counts for each time. For examples, see data_analysis.ipynb.
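A minimal sketch of how these five variables might fit together is shown below. It assumes that time_points and time_cols are lists with one inner list per replicate, which is one reading of the description above and may not match the exact structure popDMS expects. The file paths, column names, time values, and the helper check_sequence_inputs are hypothetical. See data_analysis.ipynb for the actual usage.

```python
# Hypothetical paths; replace with your own files.
haplotype_counts_file = "data/example_haplotype_counts.csv"
reference_sequence_file = "data/example_reference.txt"

n_replicates = 2
# Assumed layout: one inner list per replicate (time values are made up).
time_points = [[0, 5], [0, 5]]
# Assumed layout: one inner list of column names per replicate (names are made up).
time_cols = [["rep1_t0", "rep1_t5"], ["rep2_t0", "rep2_t5"]]


def check_sequence_inputs(n_replicates, time_points, time_cols):
    """Check that each replicate has one counts column per time point.

    Returns a list with one dict per replicate mapping time -> column name.
    This helper is not part of popDMS.
    """
    if len(time_points) != n_replicates or len(time_cols) != n_replicates:
        raise ValueError("time_points and time_cols need one entry per replicate")
    mapping = []
    for rep, (t_list, c_list) in enumerate(zip(time_points, time_cols)):
        if len(t_list) != len(c_list):
            raise ValueError(f"replicate {rep}: times and columns differ in length")
        mapping.append(dict(zip(t_list, c_list)))
    return mapping


columns_by_time = check_sequence_inputs(n_replicates, time_points, time_cols)
print(columns_by_time)
```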

Interpreting the output

For both approaches, popDMS computes and saves the variant frequencies needed to calculate selection coefficients. Using these files, the code to infer the selection coefficients can be rerun quickly using the infer_independent (for codon counts) or infer_correlated (for sequence counts) methods. Both methods save a compressed comma-separated values (CSV) file containing the selection coefficients inferred at the inferred optimal value of the regularization strength. The file can be decompressed and viewed as plain text or opened with a program such as Microsoft Excel.

The columns of the selection coefficient file are:

site: Specifies the site at which the variant is observed, following the numbering of sites in the original input file
amino_acid: Specifies the amino acid (including stops)
WT_indicator: Set to True if the amino acid matches the reference at that site, and False otherwise
rep_x: Values in these columns give the selection coefficients inferred for each replicate independently, with replicates numbered starting from 0 (rep_0)
joint: Joint selection coefficients inferred across all replicates
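As an illustration of working with these columns, the sketch below ranks non-reference variants by their joint selection coefficient using only the Python standard library. The example rows and all numerical values are invented for demonstration and are not popDMS output. The helper top_variants is hypothetical, and the sketch assumes the file has already been decompressed and that WT_indicator is written as the text True or False.

```python
import csv
import io

# Made-up example in the column layout described above (two replicates).
example_csv = """site,amino_acid,WT_indicator,rep_0,rep_1,joint
1,M,True,0.0,0.0,0.0
1,A,False,-0.8,-0.6,-0.7
2,K,True,0.0,0.0,0.0
2,R,False,0.2,0.1,0.15
2,*,False,-1.5,-1.3,-1.4
"""


def top_variants(handle, n=3):
    """Return the n non-reference variants with the largest joint coefficient.

    Each result is a (site, amino_acid, joint) tuple. Sites are kept as
    strings so that any site numbering scheme is preserved.
    """
    rows = [r for r in csv.DictReader(handle) if r["WT_indicator"] != "True"]
    rows.sort(key=lambda r: float(r["joint"]), reverse=True)
    return [(r["site"], r["amino_acid"], float(r["joint"])) for r in rows[:n]]


best = top_variants(io.StringIO(example_csv), n=2)
print(best)
```

To use this on a real file, pass an open text-mode file handle for the decompressed CSV in place of the io.StringIO object.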

This CSV file can be used for downstream analysis. We also provide a built-in plotting function fig_dms that takes a path to the CSV file as input and produces a heatmap of the inferred selection coefficients.

Epistasis inference (merged into a single bash script that runs automatically)

The format of input data for epistasis inference is described in this file. Once data has been stored in this format, inference of epistatic interactions proceeds by running the shell script run_epistasis.sh in the epistasis_inference directory.

License

This repository is dual-licensed under GPL-3.0 (source code) and CC0 1.0 (figures, documentation, and our presentation of the data).
