GitHub - emdann/scATAC_prep: A not-too-orderly collection of scripts for preprocessing and dimensionality reduction of scATAC-seq data

Pre-processing of scATAC-seq data

This repo contains a collection of scripts and notebooks that I use for initial pre-processing and dimensionality reduction of 10X scATAC-seq data. These handle steps that come after peak calling with the cellatac pipeline.

Input: after running cellatac you will have a results folder. From that we will need the contents of results/peak_matrix

Set up: Make a new folder to store all your results

mkdir my_scATAC_dir
outdir=./my_scATAC_dir
cellatac_outs=results/peak_matrix

Install required R packages (other packages are installed directly in the notebooks)

install.packages(c("data.table","Signac", "tidyverse", "argparse"))

if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
}
BiocManager::install(c("ensembldb", "EnsDb.Mmusculus.v79", "EnsDb.Hsapiens.v86","GenomicRanges"))

Install required python packages:

pip install scanpy anndata numpy pandas anndata2ri rpy2 scipy

1. Annotating peaks annotate_peaks.R: This script uses functionality in GenomicRanges to compute some basic stats on the peaks identified by cellatac. We will use these for filtering and in later stages of the analysis.

In terminal:

Rscript annotate_peaks.R $cellatac_outs/peaks.txt $outdir --genome hg38

2. Make anndata and filter peaks N1_cellatac2anndata.ipynb: Here we make an anndata object, we filter out a lot of peaks and start saving different layers (loading the big matrix takes some time, be patient). The notebook saves an .h5ad abject storing the raw counts and peak annotations for the dataset.

3. Run cisTopic N1_add_cistopic.ipynb: cisTopic uses Latent Dirichlet Allocation (LDA) to perform dimensionality reduction on a binary (or sort of binary) matrix of peak x cell accessibility. LDA is a robust Bayesian method used in text mining to group documents addressing similar topics and related words into topics. cisTopic treats cells as documents and peaks as words to group cells where the same "regulatory topics" are active. Read all about it here.

Here we use the LDA model from cisTopic for 3 purposes

Dimensionality reduction: the topic x cell matrix obtained with LDA can be used to construct a KNN graphs, UMAPs, clustering etc... (basically an alternative to PCA in scRNA-seq pipelines)
Data de-noising: scATAC-seq data from 10X Genomics is very sparse, de-noising based on the LDA model helps in exploratory data analysis and when examining accessibility patterns of single-peaks
Calculating accessibility scores per gene: the denoised peak x cell matrix can be used to aggregate signal over genes, useful to examine markers

References for uses of cisTopic:

4. Match genes to proximal peaks N3_peak2genes.ipynb A range of downstream analyses require to associate genes to peaks in their proximity (i.e. overlapping the gene body or less than n kbs away). Here we use functionality in the GenomicRanges R package to create a sparse matrix of peak x gene assignment, where 1s indicate that a peak is in the proximity of the gene (N.B. A peak might be proximal to multiple genes). The default proximity window is 50 kbs.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.gitignore		.gitignore
20201130_PropLDA.ipynb		20201130_PropLDA.ipynb
N1_cellatac2anndata.ipynb		N1_cellatac2anndata.ipynb
N2_add_cistopic.ipynb		N2_add_cistopic.ipynb
N3_peak2genes.ipynb		N3_peak2genes.ipynb
N4_chromVAR.ipynb		N4_chromVAR.ipynb
README.md		README.md
annotate_peaks.R		annotate_peaks.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

20201130_PropLDA.ipynb

20201130_PropLDA.ipynb

N1_cellatac2anndata.ipynb

N1_cellatac2anndata.ipynb

N2_add_cistopic.ipynb

N2_add_cistopic.ipynb

N3_peak2genes.ipynb

N3_peak2genes.ipynb

N4_chromVAR.ipynb

N4_chromVAR.ipynb

README.md

README.md

annotate_peaks.R

annotate_peaks.R

Repository files navigation

Pre-processing of scATAC-seq data

About

Releases

Packages

Contributors 2

Languages

emdann/scATAC_prep

Folders and files

Latest commit

History

Repository files navigation

Pre-processing of scATAC-seq data

About

Resources

Stars

Watchers

Forks

Languages