CEMBA_wmb_snATAC

This repository is used for the whole mouse brain (wmb) snATAC-seq data analysis of Center for Epigenomics of the Mouse Brain Atlas (CEMBA), which is now accepted by Nature 2023.

Important Note

All the analysis and the h5ad data generated are from SnapATAC2 under <= 2.4.0 There are some break changes later after SnapATAC2 >= 2.5.0

Data

Demultiplexed data can be accessed via the NEMO archive (NEMO, RRID:SCR_016152) at https://assets.nemoarchive.org/dat-bej4ymm (the raw directory under Source Data URL in this archive).
We also uploaded our demultiplexed fastq files and processed files under the GEO accession number GSE246791:
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE246791
Processed data is also available on our web portal and can be explored here: http://www.catlas.org.
- A Google drive link for bigwig files and SnapATAC2 files just for backup:
  - https://drive.google.com/drive/folders/1mule-CkxN939mxhYAqpOIxjPpq0j8qVP?usp=sharing

Pipeline

We now have 234 samples and 2.3 million cells in total. So most of the analysis are depend on Snakefile to organize the pipeline and submit them to high-performance cluster (HPC) in order to use hundreds of CPUs at the same time.
R, Shell and Python (>= 3.10) are mainly used, especially R (>= 4.2).
Under the directory package, we put lots of common functions there.
We mainly use SnapATAC2 to analyze the single-nucleus ATAC-seq data
Comparation between Scrublet and AMULET: https://github.com/yuelaiwang/CEMBA_AMULET_Scrublet
The deep learning related codes now in the repo: https://github.com/yal054/mba_dl_model
sa2 is short for SnapATAC2 in this repo.

Codes

Clustering

In total, we have implemented four-round iterative clustering. See details in 01.clustering

Integration and annotation

We use Allen’s scRNAseq data and their annotations for our data annotation. See details in 02.integration

Peak calling

We use macs2 with multiple stage filtering, especially use SPM >= 5 for filtering peaks. See details in 03.peakcalling

Comments on some scripts:

cembav2env.R: R env to store the metadata during analysis.

Enviorment	Description
cembav2env	meta data of SnapATAC and SnapATAC2
cluSumBySa2	clustering meta data, such as resolution,
	barcode to L4 Ids, L4 major regions and so on
Sa2Integration	Integration meta data, like Allen’s data descriptions
Sa2PeakCalling	Peak calling meta data

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
00.data.preprocess		00.data.preprocess
01.clustering		01.clustering
02.integration		02.integration
03.peakcalling		03.peakcalling
04.nmf		04.nmf
05.cCREgene/sa2.cicero		05.cCREgene/sa2.cicero
06.motifanalysis		06.motifanalysis
07.m3C		07.m3C
08.GRN		08.GRN
09.cCRE_conservation		09.cCRE_conservation
10.cCRE_TE		10.cCRE_TE
11.deeplearning		11.deeplearning
manuscript_figures		manuscript_figures
meta		meta
package		package
repo_figures		repo_figures
snakemake.template		snakemake.template
supple.datashare		supple.datashare
.gitignore		.gitignore
LICENSE		LICENSE
README.org		README.org

License

beyondpie/CEMBA_wmb_snATAC

Folders and files

Latest commit

History

Repository files navigation

CEMBA_wmb_snATAC

Important Note

Data

Pipeline

Codes

Clustering

Integration and annotation

Peak calling

Comments on some scripts:

About

Resources

License

Stars

Watchers

Forks

Languages