CEMBA_wmb_snATAC

This repository contains the analysis code for the whole mouse brain (wmb) snATAC-seq data of the Center for Epigenomics of the Mouse Brain Atlas (CEMBA). The associated paper was accepted by Nature in 2023.

./repo_figures/GraphAbstract.jpg

Important Note

All the analyses and the generated h5ad data were produced with SnapATAC2 <= 2.4.0. SnapATAC2 >= 2.5.0 introduced some breaking changes.
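As a minimal sketch of how the version constraint above could be checked before running any script, the following compares a version string against 2.4.0. The helper names `release_tuple` and `is_compatible` are hypothetical and not part of this repository, and the comparison only looks at the numeric release part, ignoring pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string: str) -> tuple:
    """Extract (major, minor, patch) from a version string; a missing patch counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"Cannot parse version: {version_string!r}")
    return tuple(int(group or 0) for group in match.groups())


def is_compatible(version_string: str, max_version: str = "2.4.0") -> bool:
    """Return True if the release part of version_string is <= max_version."""
    return release_tuple(version_string) <= release_tuple(max_version)


# Look up the installed SnapATAC2 version, if the package is present at all.
try:
    installed = version("snapatac2")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("snapatac2 is not installed")
elif is_compatible(installed):
    print(f"snapatac2 {installed} is within the range used for this analysis")
else:
    print(f"snapatac2 {installed} is newer than 2.4.0; expect breaking changes")
```

Running this at the top of a pipeline step would surface a version mismatch early instead of failing midway through reading an h5ad file.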

Data

Pipeline

  • We now have 234 samples and 2.3 million cells in total, so most of the analyses depend on Snakefiles to organize the pipeline and submit jobs to a high-performance computing (HPC) cluster, where hundreds of CPUs can be used at the same time.
  • R (>= 4.2), Shell and Python (>= 3.10) are mainly used, especially R.
  • Common functions are collected under the directory package.
  • We mainly use SnapATAC2 to analyze the single-nucleus ATAC-seq data.
  • Comparison between Scrublet and AMULET: https://github.com/yuelaiwang/CEMBA_AMULET_Scrublet
  • The deep learning related code is now in the repo: https://github.com/yal054/mba_dl_model
  • sa2 is short for SnapATAC2 in this repo.

./repo_figures/snATAC-seq_analysis_pipeline.jpg

Codes

Clustering

In total, we have implemented four rounds of iterative clustering. See details in 01.clustering.
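The idea of iterative clustering can be illustrated with a toy sketch, assuming that each round re-clusters every cluster found in the previous round and that the per-round ids are joined into one hierarchical label. The function `iterative_clustering`, the label format, and the one-dimensional `split_at_mean` clusterer are all hypothetical stand-ins for illustration; the actual implementation lives in 01.clustering.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def iterative_clustering(
    barcodes: List[str],
    cluster_fn: Callable[[List[str]], List[int]],
    rounds: int = 4,
) -> Dict[str, str]:
    """Re-cluster every cluster from the previous round, for a fixed number of rounds."""
    groups = {"": list(barcodes)}  # before round 1, all cells form a single group
    for _ in range(rounds):
        next_groups = defaultdict(list)
        for parent, members in groups.items():
            sub_ids = cluster_fn(members)  # one cluster id per member
            for barcode, sub_id in zip(members, sub_ids):
                child = f"{parent}-{sub_id}" if parent else str(sub_id)
                next_groups[child].append(barcode)
        groups = dict(next_groups)
    return {bc: label for label, members in groups.items() for bc in members}


# Toy data: one numeric feature per barcode (hypothetical values).
features = {"bc1": 0.0, "bc2": 1.0, "bc3": 10.0, "bc4": 11.0}


def split_at_mean(barcodes: List[str]) -> List[int]:
    """Toy clusterer: cluster 1 if the value is <= the group mean, else cluster 2."""
    values = [features[bc] for bc in barcodes]
    mean = sum(values) / len(values)
    return [1 if value <= mean else 2 for value in values]


labels = iterative_clustering(list(features), split_at_mean, rounds=4)
for barcode, label in sorted(labels.items()):
    print(barcode, label)
```

In a real run, `cluster_fn` would wrap a graph-based clustering step on the cells of one parent cluster, and a stopping rule would typically prevent further splitting of clusters that are too small.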

Integration and annotation

We use Allen’s scRNA-seq data and their annotations to annotate our data. See details in 02.integration.

Peak calling

We use macs2 with multi-stage filtering, in particular an SPM >= 5 cutoff for filtering peaks. See details in 03.peakcalling.
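The SPM cutoff can be sketched as follows, assuming SPM stands for "score per million", i.e. each peak score divided by the sum of all peak scores in the same peak set and multiplied by one million, and assuming peaks at or above the cutoff are kept. The function names and the example peaks are hypothetical; the actual filtering code is in 03.peakcalling.

```python
from typing import List, Tuple


def score_per_million(scores: List[float]) -> List[float]:
    """Normalize peak scores so that they sum to one million within a peak set."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("Sum of peak scores must be positive")
    return [score * 1_000_000 / total for score in scores]


def filter_peaks_by_spm(
    peaks: List[Tuple[str, float]], min_spm: float = 5.0
) -> List[Tuple[str, float]]:
    """Keep (name, SPM) pairs whose SPM is at least min_spm (assumed direction of the filter)."""
    spm_values = score_per_million([score for _, score in peaks])
    return [
        (name, spm)
        for (name, _), spm in zip(peaks, spm_values)
        if spm >= min_spm
    ]


# Hypothetical peak set: (peak name, raw peak score).
peaks = [("peak_a", 600000.0), ("peak_b", 399997.0), ("peak_c", 3.0)]
kept = filter_peaks_by_spm(peaks, min_spm=5.0)
print([name for name, _ in kept])
```

Normalizing within each peak set makes the cutoff comparable across cell clusters with very different numbers of cells and sequencing depths, which a fixed raw-score threshold would not be.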

Comments on some scripts:

  1. cembav2env.R: R env to store the metadata during analysis.
    Environment       Description
    cembav2env        meta data of SnapATAC and SnapATAC2
    cluSumBySa2       clustering meta data, such as resolution, barcode to L4 Ids, L4 major regions and so on
    Sa2Integration    Integration meta data, like Allen’s data descriptions
    Sa2PeakCalling    Peak calling meta data