A pipeline for running an ensemble of variant callers to predict variants from ATAC-seq reads.
The entire pipeline is made up of two smaller subworkflows. The prepare
subworkflow calls each variant caller and prepares the resulting data for use by the classify
subworkflow, which uses an ensemble classifier to predict the existence of variants at each site.
Using our Code Ocean compute capsule, you can execute VarCA v0.2.1 on example data without downloading or setting up the project. To interpret the output of VarCA, see the output sections of the prepare
subworkflow and the classify
subworkflow in the rules README.
Execute the following command or download the latest release manually.
git clone https://github.com/aryarm/varCA.git
Also consider downloading the example data.
cd varCA
wget -O- -q https://github.com/aryarm/varCA/releases/latest/download/data.tar.gz | tar xvzf -
The pipeline is written as a Snakefile which can be executed via Snakemake. We recommend installing version 5.18.0:
conda create -n snakemake -c bioconda -c conda-forge --no-channel-priority 'snakemake==5.18.0'
We highly recommend you install Snakemake via conda like this so that you can use the --use-conda
flag when calling snakemake
to let it automatically handle all dependencies of the pipeline. Otherwise, you must manually install the dependencies listed in the env files.
-
Activate snakemake via
conda
:conda activate snakemake
-
Execute the pipeline on the example data
Locally:
./run.bash &
or on an SGE cluster:
./run.bash --sge-cluster &
VarCA will place all of its output in a new directory (out/
, by default). Log files describing the progress of the pipeline will also be created there: the log
file contains a basic description of the progress of each step, while the qlog
file is more detailed and will contain any errors or warnings. You can read more about the pipeline's output in the rules README.
You must modify the config.yaml file to specify paths to your data. The config file is currently configured to run the pipeline on the example data provided.
The pipeline is made up of two subworkflows. These are usually executed together automatically by the master pipeline, but they can also be executed on their own for more advanced usage. See the rules README for execution instructions and a description of the outputs. You will need to execute the subworkflows separately if you ever want to create your own trained models.
We provide the example data so that you may quickly (in ~1 hr, excluding dependency installation) verify that the pipeline can be executed on your machine. This process does not reproduce our results. Those with more time can follow these steps to create all of the plots and tables in our paper.
We recommend that you run snakemake --help
to learn about Snakemake's options. For example, to check that the pipeline will be executed correctly before you run it, you can call Snakemake with the -n -p -r
flags. This is also a good way to familiarize yourself with the steps of the pipeline and their inputs and outputs (the latter of which are inputs to the first rule in each workflow -- ie the all
rule).
Note that Snakemake will not recreate output that it has already generated, unless you request it. If a job fails or is interrupted, subsequent executions of Snakemake will just pick up where it left off. This can also apply to files that you create and provide in place of the files it would have generated.
By default, the pipeline will automatically delete some files it deems unnecessary (ex: unsorted copies of a BAM). You can opt to keep these files instead by providing the --notemp
flag to Snakemake when executing the pipeline.
A Snakemake pipeline for calling variants from a set of ATAC-seq reads. This pipeline automatically executes two subworkflows:
- the
prepare
subworkflow, which prepares the reads for classification and - the
classify
subworkflow, which creates a VCF containing predicted variants
Snakemake rules for the prepare
and classify
subworkflows. You can either execute these subworkflows from the master Snakefile or individually as their own Snakefiles. See the rules README for more information.
Config files that define options and input for the pipeline and the prepare
and classify
subworkflows. If you want to predict variants from your own ATAC-seq data, you should start by filling out the config file for the pipeline.
Scripts for executing each of the variant callers which are used by the prepare
subworkflow. Small pipelines can be written for each caller by using a special naming convention. See the caller README for more information.
Scripts for calculating posterior probabilities for the existence of an insertion or deletion, which can be used as features for the classifier. These scripts are an adaptation from @Arkosen's BreakCA code.
Various scripts used by the pipeline. See the script README for more information.
An example bash script for executing the pipeline using snakemake
and conda
. Any arguments to this script are passed directly to snakemake
.
There is an option to "Cite this repository" on the right sidebar of the repository homepage.
Massarat, A. R., Sen, A., Jaureguy, J., Tyndale, S. T., Fu, Y., Erikson, G., & McVicker, G. (2021). Discovering single nucleotide variants and indels from bulk and single-cell ATAC-seq. Nucleic Acids Research, gkab621. https://doi.org/10.1093/nar/gkab621