pyflow-scATACseq

snakemake workflow for post-processing scATACseq data

cite if you use it:)

What does it do?

for single cell ATACseq experiment, one gets a merged bam file for all cells. After going through clustering, one groups similar cells into cell types (cell states). This workflow will split the bam by clusters to create a pseudo bulk bam for each cluster, create bigwig tracks for visulization, call peaks for each cluster and merge the peaks across the clusters. Finally it will count reads per peak per cell from the original bam file on the merged peaks.

Thanks Brent Pedersen for the rcbbc (Read Count By BarCode) in the scripts folder. This script can be very useful in general. It counts the number of reads per cell in regions specified in a bed format from the bam file, which contains CB tag for each cell. The code is written in nim and is super fast. For over 1 million regions for over 8000 cells in a ~64G bam, it took ~40 mins.

./rcbbc -h
rcbbc

read-count by barcode

Usage:
  rcbbc [options] bed bam

Arguments:
  bed
  bam

Options:
  -f, --fasta=FASTA          path to fasta. required only for CRAM
  -e, --exclude=EXCLUDE      exclude alignments with any of these bits set (default: 1796)
  -m, --min-mapping-quality=MIN_MAPPING_QUALITY
                             exclude alignments with a mapping-quality below this (default: 1)
  -w, --white-list=WHITE_LIST
                             line-delimited list of barcodes to include. default is to include any seen in the bam
  -t, --tag=TAG              tag on which to partition read-counts (default: CB)
  -p, --prefix=PREFIX        output prefix for files written by rcbbc (default: rcbbc)
  -h, --help                 Show this help

In the future, the peak calling software should be barcode aware, so one does not need to split the bam file by cluster. But for now, I have this work for me.

workflow of the piepline

Dependencies

snakemake >=5.4.5. Note snakemake is python3 based.
deeptools >= 3.1.3. We use bamCoverage from it to make bigwig files.
Genrich for calling peaks. It implements ATACseq peak calling with internal +4/-5bp shifting. Develped by John, our previous group member at Harvard FAS informatics.
bedtools >= 2.27.1 for manipulating bed files.
samtools >=1.9 for manipupating bam files.

How to use this pipeline

clone the repo

git clone https://github.com/crazyhottommy/pyflow-scATACseq
cd pyflow-scATACseq

Prepare a tab delimited tsv file with header: meta.tsv inside the pyflow-scATACseq folder:

sample	bam_path	cluster_id.csv	white_list
sample1	/path/to/sample1.bam	/path/to/sample1.csv	/path/to/white_list_sample1.txt
sample2	/path/to/sample2.bam	/path/to/sample2.csv	/path/to/white_list_sample2.txt

The first column is the sample name.
The second column is the path to the 10x cellranger_atac produced possorted_bam.bam
The cluster_id.csv is a two column csv file with header (does not matter what name you give). The first column is the cell barcode, the second column is the cluster id (can be numeric or strings).

make sure the cluster_id.csv is named exactly as sample.csv.

e.g. if the sample column is sample1, it should be sample1.csv for the cluster information csv file.

cat example_cluster.csv
name,value
AAACGAAAGCACCATT-1,32
AAACGAAAGTAGTCGG-1,4
AAACGAAGTAACGGCA-1,31
AAACTCGAGAACAGGA-1,17
AAACTCGAGTCCAGAG-1,1
AAACTCGCAAAGGAAG-1,41
AAACTCGCAAGGCGTA-1,5
AAACTCGCAGAAAGAG-1,4
AAACTCGCAGTAGGCA-1,24

The white_list is a one column txt file with valid barcode you want to include.

you can create it by:

cat example_cluster.csv | cut -f1 -d, | sed '1d' > example_white_list.txt

Create a samples.json file:

python sample2json.py meta.tsv

# a samples.json file will be created
cat samples.json
{
    "sample1": [
        "/path/to/sample1.bam",
        "/path/to/sample1.csv",
        "/path/to/white_list_sample1.txt"
    ],
    "sample2": [
        "/path/to/sample2.bam",
        "/path/to/sample2.csv",
        "/path/to/white_list_sample2.txt"
    ]
}

Running the workflow

#dry run
snakemake -np 

## real-run on local machine
snakemake -j 10


## if you want to use a singularity image that contains presto for differential accessible test
snakemake -j 10 --use-singularity

## submit to slurm (you can change this script for your own HPC)
./pyflow-scATACseq.sh

To-do list

check file paths exist and format.
Have a conda env set up for the Snakefile.
have a docker container.
Motif analysis.
Find differential peaks using presto.
Associating peaks with genes using Cicero
Gene set enrichment

Other Notes

more thoughts on pileup and shifting reads deeptools/deepTools#453 maxFragmentLength

deeptools/deepTools#370

https://groups.google.com/forum/#!topic/deeptools/JU9itiT5rYk

You probably want “–Offset 1”

The above is not shifting exactly, for shifting use https://deeptools.readthedocs.io/en/develop/content/tools/alignmentSieve.html

--ATACshift

filter out fragments?

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
data		data
scripts		scripts
README.md		README.md
Snakefile		Snakefile
cluster.json		cluster.json
config.yaml		config.yaml
meta.txt		meta.txt
pyflow-scATACseq.sh		pyflow-scATACseq.sh
rulegraph.png		rulegraph.png
sample2json.py		sample2json.py
samples.json		samples.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pyflow-scATACseq

What does it do?

workflow of the piepline

Dependencies

How to use this pipeline

To-do list

Other Notes

About

Releases 1

Packages

Languages

crazyhottommy/pyflow-scATACseq

Folders and files

Latest commit

History

Repository files navigation

pyflow-scATACseq

What does it do?

workflow of the piepline

Dependencies

How to use this pipeline

To-do list

Other Notes

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages