AtacWorks is a deep learning toolkit for track denoising and peak calling from low-coverage or low-quality ATAC-Seq data.
AtacWorks trains a deep neural network to learn a mapping between noisy (low coverage/low quality) ATAC-Seq data and matching clean (high coverage/high quality) ATAC-Seq data from the same cell type. Once this mapping is learned, the trained model can be applied to improve other noisy ATAC-Seq datasets.
AtacWorks models can be trained using one or more pairs of matching ATAC-Seq datasets from the same cell type. AtacWorks requires three specific inputs for each such pair of datasets:
1. A coverage track representing the number of sequencing reads mapped to each position on the genome in the low-quality dataset.
2. A coverage track representing the number of sequencing reads mapped to each position on the genome in the high-quality dataset.
3. The genomic positions of peaks called on the high-quality dataset. These can be obtained by using MACS2 or any other peak caller.

The model learns a mapping from (1) to both (2) and (3); in other words, from the noisy coverage track, it learns to predict both the clean coverage track, and the positions of peaks in the clean dataset. We also provide pretrained models that can be applied to a noisy dataset.
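To make the three inputs concrete, the sketch below builds a per-base binary peak track from a list of peak intervals so that it lines up with the two coverage tracks. It is an illustration only: the function name `peaks_to_labels`, the toy coverage values, and the half-open `(start, end)` interval convention are assumptions, not AtacWorks code.

```python
def peaks_to_labels(peaks, length):
    """Return a list of 0/1 labels, one per base, of the given length.

    peaks: iterable of (start, end) half-open intervals (assumed convention).
    """
    labels = [0] * length
    for start, end in peaks:
        # Clip each interval to the track so out-of-range positions are ignored.
        for pos in range(max(start, 0), min(end, length)):
            labels[pos] = 1
    return labels


# Toy example: 10 bases of one region (hypothetical values).
noisy_coverage = [0, 1, 0, 2, 1, 0, 0, 1, 0, 0]      # input (1)
clean_coverage = [1, 3, 5, 9, 8, 4, 1, 1, 0, 0]      # target (2)
peak_labels = peaks_to_labels([(2, 6)], length=10)   # target (3)

# All three tracks cover the same positions, one value per base.
assert len(noisy_coverage) == len(clean_coverage) == len(peak_labels)
```

The point of the sketch is only that the model sees one track as input and two aligned tracks as targets: a regression target (clean coverage) and a classification target (peak or not).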
Much more information and examples can be found in the AtacWorks preprint: https://www.biorxiv.org/content/10.1101/829481
Training: Approximately 22 minutes per epoch to train on a single whole genome.
Inference: Approximately 28 minutes for inference and postprocessing on a whole genome.
Training and inference were performed on a single Tesla V100 GPU. Training time can be significantly reduced by using multiple GPUs.
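For planning purposes, the per-epoch figure above can be turned into a rough wall-clock estimate. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the function name, the epoch count of 25, and above all the assumption of ideal linear speedup across GPUs are illustrative, and real multi-GPU training will usually scale less well.

```python
def estimate_training_hours(epochs, minutes_per_epoch=22.0, num_gpus=1):
    """Rough wall-clock estimate in hours.

    ASSUMPTION: ideal linear speedup across GPUs, which real training
    rarely achieves; treat the multi-GPU figure as an optimistic bound.
    """
    if epochs < 0 or num_gpus < 1:
        raise ValueError("epochs must be >= 0 and num_gpus >= 1")
    return epochs * minutes_per_epoch / num_gpus / 60.0


# 25 epochs is a hypothetical training length, not a recommendation.
print(estimate_training_hours(25))               # single V100
print(estimate_training_hours(25, num_gpus=4))   # idealised 4-GPU bound
```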
We are working to improve runtime, particularly for inference. Improvements are tracked on our project board: https://github.com/clara-genomics/AtacWorks/projects
Latest released version
This will clone the repo to the `master` branch, which contains the code for the latest released version.
git clone --recursive -b master https://github.com/clara-genomics/AtacWorks.git
Latest development version
This will clone the repo to the default branch, which is set to be the latest development branch. This branch is subject to change frequently as features and bug fixes are pushed.
git clone --recursive https://github.com/clara-genomics/AtacWorks.git
- Ubuntu 16.04+
- CUDA 9.0+
- Python 3.6.7+
- GCC 5+
- (Optional) A conda or virtualenv setup
- Any NVIDIA GPU. AtacWorks training and inference do not currently run on CPU.
Download the bedGraphToBigWig and bigWigToBedGraph binaries and add them to your $PATH
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bedGraphToBigWig <custom_path>
rsync -aP rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/bigWigToBedGraph <custom_path>
export PATH="$PATH:<custom_path>"
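After updating `$PATH`, it can be worth confirming that both binaries are actually discoverable from Python. The following is a minimal sketch using only the standard library; the helper name `missing_binaries` is hypothetical and not part of AtacWorks.

```python
import shutil

REQUIRED_BINARIES = ("bedGraphToBigWig", "bigWigToBedGraph")


def missing_binaries(names=REQUIRED_BINARIES, path=None):
    """Return the names that cannot be found on PATH (or on `path` if given)."""
    return [name for name in names if shutil.which(name, path=path) is None]


missing = missing_binaries()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Both UCSC binaries are on PATH.")
```

If a binary is reported missing, check that `<custom_path>` was exported in the same shell session and that the downloaded files are executable.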
Install pip dependencies
pip install -r requirements-base.txt && pip install -r requirements-macs2.txt
Note: The above non-standard installation is necessary to ensure the requirements for macs2 are installed before macs2 itself.
```
python -m pytest tests/
```
- Convert peak calls on the clean data to bigWig format with
- Generate genomic intervals for training/validation/holdout with
- Encode the training/validation/holdout data into .h5 format with
- Train a model with
- Apply the trained model for inference on another dataset with main.py, producing output in bigWig or bedGraph format.
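The interval-generation step above can be pictured as tiling each chromosome into fixed-width windows and assigning whole chromosomes to the training, validation, or holdout set. The sketch below is not the AtacWorks script: the function names, the toy chromosome sizes, the 50,000 bp window width, and the choice of which chromosomes go to validation and holdout are all assumptions made for illustration.

```python
def tile_chromosome(chrom, size, width):
    """Yield (chrom, start, end) half-open windows covering the chromosome."""
    for start in range(0, size, width):
        # The last window is truncated at the chromosome end.
        yield (chrom, start, min(start + width, size))


def split_intervals(chrom_sizes, width, val_chroms, holdout_chroms):
    """Assign every window to 'train', 'val' or 'holdout' by chromosome."""
    splits = {"train": [], "val": [], "holdout": []}
    for chrom, size in chrom_sizes.items():
        if chrom in holdout_chroms:
            key = "holdout"
        elif chrom in val_chroms:
            key = "val"
        else:
            key = "train"
        splits[key].extend(tile_chromosome(chrom, size, width))
    return splits


# Hypothetical toy genome; real sizes would come from a chromosome sizes file.
toy_sizes = {"chr1": 120_000, "chr2": 100_000, "chr3": 60_000}
splits = split_intervals(toy_sizes, width=50_000,
                         val_chroms={"chr2"}, holdout_chroms={"chr3"})
for name, intervals in splits.items():
    print(name, len(intervals), "intervals")
```

Splitting by whole chromosome, as assumed here, keeps neighbouring windows out of different sets, so validation and holdout regions do not overlap with training regions.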
- bigWig file for clean ATAC-Seq
- bigWig file for noisy ATAC-Seq
- MACS2 output for clean ATAC-Seq (.narrowPeak or .bed file)
- bigWig file for noisy ATAC-Seq
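For readers unfamiliar with the MACS2 peak file listed above, the sketch below parses a single line in the standard 10-column narrowPeak layout (chrom, start, end, name, score, strand, signalValue, pValue, qValue, summit offset). It is an illustrative helper, not AtacWorks code: the function name is hypothetical, the example record is made up, and plain 3-column .bed files are not handled.

```python
def parse_narrowpeak_line(line):
    """Parse one narrowPeak line into a dict; coordinates are 0-based, half-open."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("narrowPeak lines have 10 tab-separated columns")
    start = int(fields[1])
    summit_offset = int(fields[9])  # -1 means no summit was called
    return {
        "chrom": fields[0],
        "start": start,
        "end": int(fields[2]),
        "signal": float(fields[6]),
        # The summit column is an offset from start, so convert it to a position.
        "summit": None if summit_offset < 0 else start + summit_offset,
    }


# Made-up record for illustration, not real data.
example = "chr1\t1000\t1500\tpeak_1\t250\t.\t8.5\t12.3\t9.8\t210\n"
peak = parse_narrowpeak_line(example)
print(peak["chrom"], peak["start"], peak["end"], peak["summit"])
```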
Run the following script to validate your setup.
3 pretrained models are provided in
These are based on bulk ATAC-Seq data from 7 blood cell types. They are trained using clean data of depth 80 million reads, subsampled to a depth of 1 million (1000000.7cell.resnet.22.214.171.124.50.0803.pth.tar), 2 million (2000000.7cell.resnet.126.96.36.199.50.0803.pth.tar), or 5 million (5000000.7cell.resnet.188.8.131.52.50.0803.pth.tar) reads.
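One simple way to choose among the three models is to pick the one whose training depth is closest to the read depth of the dataset being denoised. The sketch below encodes that heuristic. It is an assumption for illustration, not guidance from the AtacWorks authors, and the function name is hypothetical; only the three depths are taken from the text above.

```python
PRETRAINED_DEPTHS = (1_000_000, 2_000_000, 5_000_000)  # depths named in the text


def nearest_pretrained_depth(read_depth, depths=PRETRAINED_DEPTHS):
    """Return the pretrained subsampling depth closest to read_depth.

    ASSUMPTION: closeness is measured by absolute difference in read count,
    and ties go to the smaller depth. This is a heuristic, not a rule.
    """
    if read_depth <= 0:
        raise ValueError("read_depth must be positive")
    return min(sorted(depths), key=lambda depth: abs(depth - read_depth))


# Hypothetical dataset with 1.4 million reads.
print(nearest_pretrained_depth(1_400_000))
```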
- What's the preferred way to set up the environment?
A virtual environment or conda installation is preferred. You can follow the conda installation instructions on the conda website and then follow the instructions in the README.
Lal, A., Chiang, Z.D., Yakovenko, N., Duarte, F.M., Israeli, J. and Buenrostro, J.D., 2019. AtacWorks: A deep convolutional neural network toolkit for epigenomics. bioRxiv, p.829481.