scATAC-seq Benchmark

This repository contains all the Jupyter notebooks and scripts used to analyse data and generate figures for our paper "Systematic benchmarking of scATAC-seq protocols" (De Rop et al., 2023). With these scripts and our pipeline, PUMATAC, you should be able to reproduce everything in our manuscript.

Citing this work

Please cite De Rop, F.V., Hulselmans, G., Flerin, C. et al. Systematic benchmarking of single-cell ATAC-sequencing protocols. Nat Biotechnol (2023) if you use our data, and cite our manuscript and PUMATAC if you use our scripts.

Reproducing manuscript figures

You can find PUMATAC in its own repository. This pipeline can be used to realign data from all techniques assessed here to the reference genome.

The code for each figure in the manuscript is located as follows:

Main figures:
1b: general/fixedcells_merged_graphs.ipynb
1d: general/scatterplots_bytech_kde_v2.py
1e-h: general/fixedcells_boxplots.ipynb
2a-e: fixedcells_downsample_series/4c_qc_plots.ipynb
2f-j: general/fixedcells_general_statistics_scatterplots.ipynb
2k-n: general/fixedcells_boxplots.ipynb
3a: fixedcells_5_cell_downsampling/5b_DAR_scores.ipynb
3b-c: fixedcells_8_individual_tech_cistopic_objects/4b_dar_scores.ipynb
3d: fixedcells_7_merged_equalcells_celltypefair/4c_dar_traces.ipynb
3e-f: fixedcells_8_individual_tech_cistopic_objects/7_peak_dar_overlap_samples.ipynb
3g: fixedcells_8_individual_tech_cistopic_objects/8b_cistarget_analysis.ipynb
3h-i: general/fixedcells_general_statistics_scatterplots.ipynb
3j-k: fixedcells_9_individual_malefemale_celltypefair/5a_analyse_malefemale.ipynb
4a: fixedcells_9_individual_malefemale_celltypefair/7b_male_female_tracks.ipynb
4b: fixedcells_4_merged/5b_LISI.ipynb
4d: fixedcells_4_merged/5b_LISI.ipynb

Extended Data Figures:
ED1: general/fixedcells_general_statistics_gameshowell.ipynb
ED2: full_3_cistopic_consensus/9_plot_all_qc.ipynb
ED3: full_4_merged/8_lisi.ipynb
ED4b-c: fixedcells_8_individual_tech_cistopic_objects/7_peak_dar_overlap_samples.ipynb
ED5a-b: fixedcells_cellranger_arc/2_cell_filtering.ipynb
ED5c: fixedcells_cellranger_arc/3_venn.ipynb
ED6a: fixedcells_3_cistopic_consensus/3b_cell_type_analysis.ipynb
ED6b: full_5_cellranger/5_compare_rna_atac_seurat.ipynb
ED7a: fixedcells_downsample_series/5b_seurat_celltypes.ipynb
ED7b-c: fixedcells_downsample_series/7b_DARs_analysis.ipynb
ED8a: fixedcells_5_cell_downsampling/3_seurat_celltypes.ipynb
ED8b-c: fixedcells_5_cell_downsampling/5b_DAR_scores.ipynb
ED9: public_downsample_series/5_analyse_qc.ipynb
ED10: 1_data_repository/9_saturation_analysis.ipynb

Supplementary Figures:
S1a: full_5_cellranger/2b_validation_graphs.ipynb
S1b: fixedcells_3_cistopic_consensus/1b_count_fragments_in_blacklist.ipynb
S1c: full_1_vsn_preprocessing/3_otsu_filtering.ipynb
S2a: fixedcells_2_cistopic/2b_analyse_freemuxlet.ipynb
S2b: fixedcells_3_cistopic_consensus/0_deteremine_male_vs_female.ipynb
S2c: fixedcells_2_cistopic/2b_analyse_freemuxlet.ipynb
S3: fixedcells_7_merged_equalcells_celltypefair/4d_dar_carrot.ipynb

Supplementary files:
Supplementary table with quality control statistics: general/fixedcells_general_statistics.ipynb

Directory structure

The structure of the root directory is shown below, with a description of each subdirectory.

scATAC-seq_benchmark
├── 0_resources # all resources used (generic scripts, specific region sets, ...). Some files, such as SCREEN peak sets or reference genomes, are too large for GitHub; please contact us to request them.
├── 1_data_repository # all the raw data (FASTQ) as well as fragments files. To reproduce our analyses, source our raw data from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194028).
├── fixedcells_1_vsn_preprocessing # further processing of the BAM files from libds_1_vsn_preprocessing
├── fixedcells_2_cistopic # cisTopic pre-processing in SCREEN regions and calling consensus peaks
├── fixedcells_3_cistopic_consensus # cisTopic pre-processing in consensus peaks and calling master consensus peaks
├── fixedcells_4_merged # cisTopic pre-processing of merged dataset (169k cells), using master consensus peaks
├── fixedcells_5_cell_downsampling # random cell downsampling analyses
├── fixedcells_6_merged_equalcells # merged object containing an equal number of cells from each technique, but not an equal number of cells for each cell type within each technique (not used in manuscript)
├── fixedcells_7_merged_equalcells_celltypefair # merged object containing an equal number of cells from each technique and an equal number of cells for each cell type within each technique
├── fixedcells_8_individual_tech_cistopic_objects # individual objects for each technology, containing an equal number of cells from each technique and an equal number of cells for each cell type
├── fixedcells_9_individual_malefemale_celltypefair # individual objects for each technology, containing an equal number of cells from each technique and an equal number of cells for each cell type and each donor
├── fixedcells_cellranger_arc # aligning downsampled data to reference genome using cellranger_arc, and performing some analyses on multiome-rna component
├── fixedcells_downsample_series # downsampling further to 35k, 30k, ... 5k reads per cell and analysing the resulting cisTopic objects
├── full_1_vsn_preprocessing # aligning full FASTQs to reference genome
├── full_2_cistopic # cisTopic pre-processing in SCREEN regions and calling consensus peaks
├── full_3_cistopic_consensus # cisTopic pre-processing in consensus peaks and calling master consensus peaks
├── full_4_merged # cisTopic pre-processing of merged dataset (169k cells), using master consensus peaks
├── full_5_cellranger # aligning full data to reference genome using cellranger, and validating PUMATAC by comparing Cell Ranger and PUMATAC outputs
├── general # general, sample-wide plots and definitions
├── libds_1_vsn_preprocessing # aligning downsampled FASTQs to the reference genome
├── public_1_cistopic_qc # analysing fragments files from public repositories, not used in manuscript
├── public_2_vsn_preprocessing # aligning full public mouse brain data to mouse genome
├── public_3_cistopic_qc # cisTopic pre-processing in SCREEN regions and calling consensus peaks
├── public_4_cistopic_consensus # cisTopic pre-processing in consensus peaks and calling master consensus peaks
└── public_downsample_series # downsampling analysis on public mouse cortex data

How to interpret the directory structure

As new experiments were performed, sequencing data was deposited in 1_data_repository/original_fastq. Each sample's sequencing data was then merged into a maximum of 3 files (a barcode read and two mates) and deposited in 1_data_repository/full_fastq.
For each experiment, the full sequencing data was then aligned to the reference genome. Results are in full_1_vsn_preprocessing. Symlinks to .bam and .fragments.tsv.gz files were placed in 1_data_repository/full_bams and 1_data_repository/full_fragments. Directory names refer to VSN because our pipeline was still part of VSN at the time; it now has its own repository, PUMATAC.
For each sample, we then filtered true cell barcodes from noise barcodes in full_2_cistopic. This filtering was performed using thresholds on TSS enrichment and number of unique fragments.
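The threshold-based filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the notebooks: the names `BarcodeStats` and `filter_cells` are hypothetical, and the threshold values are placeholders that would in practice be chosen per sample.

```python
from typing import Dict, List, NamedTuple


class BarcodeStats(NamedTuple):
    """Per-barcode QC metrics (hypothetical container for this sketch)."""
    unique_fragments: int
    tss_enrichment: float


def filter_cells(
    stats: Dict[str, BarcodeStats],
    min_fragments: int,
    min_tss: float,
) -> List[str]:
    """Return the barcodes that pass both thresholds, in sorted order."""
    return sorted(
        barcode
        for barcode, s in stats.items()
        if s.unique_fragments >= min_fragments and s.tss_enrichment >= min_tss
    )


# Toy input: only the first barcode has both enough fragments and a high TSS enrichment.
example_stats = {
    "AAAC": BarcodeStats(unique_fragments=5000, tss_enrichment=12.0),
    "CCCG": BarcodeStats(unique_fragments=200, tss_enrichment=15.0),
    "GGGT": BarcodeStats(unique_fragments=8000, tss_enrichment=2.0),
}
cells = filter_cells(example_stats, min_fragments=1000, min_tss=5.0)
```

A barcode is kept only when it passes both thresholds, so high-depth barcodes with poor TSS enrichment are treated as noise as well.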
Since we then knew the number of cells present in each sample, we could downsample the full sequencing data to the same common read depth (40k reads/cell). This was performed using the notebook 1_data_repository/5_downsample_fastq.ipynb and the downsampled FASTQs were deposited in 1_data_repository/libds_fastq. libds stands for "library downsampled".
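As a rough illustration of the arithmetic behind this step, the fraction of reads to keep for a sample could be computed as below. This sketch assumes that "reads per cell" means total sequenced reads divided by the number of filtered cells; the function name `downsample_fraction` is hypothetical, and the actual notebook may compute this differently.

```python
def downsample_fraction(
    total_reads: int,
    n_cells: int,
    target_reads_per_cell: int = 40_000,
) -> float:
    """Fraction of reads to keep so that the sample reaches the target depth.

    Assumes reads per cell = total_reads / n_cells. Samples that are already
    at or below the target depth are left untouched (fraction 1.0).
    """
    if total_reads <= 0 or n_cells <= 0:
        raise ValueError("total_reads and n_cells must be positive")
    fraction = target_reads_per_cell * n_cells / total_reads
    return min(1.0, fraction)


# A sample with 400 million reads and 5,000 cells sits at 80k reads/cell,
# so half of its reads would be kept to reach 40k reads/cell.
fraction = downsample_fraction(total_reads=400_000_000, n_cells=5_000)
```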
We then re-aligned the downsampled FASTQs to GRCh38 in libds_1_vsn_preprocessing. For a long time, we then re-called cells in this downsampled data and proceeded with the analysis on that basis. For most samples, the number of cells called was very similar, but for some samples that were added later there were large discrepancies, which strongly affected the "reads per cell" depth. We were thus faced with a dilemma: either adapt our cell filtering algorithm so that, for the new samples, cell counts would match between the full and downsampled data, or simply take the list of filtered barcodes from the full data and redo all the analysis on the downsampled data using this barcode list instead. We chose the latter approach. This new sampling strategy was then referred to as fixedcells, as the number and identity of cells were now fixed after identification in the full sequencing data. Following this reasoning, the aligned sequencing data in libds_1_vsn_preprocessing is further processed in fixedcells_1_vsn_preprocessing.
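The fixedcells idea, reusing the barcode list from the full data on the downsampled data, can be sketched as a simple filter over fragments-file lines. This is an illustration only: it assumes the common tab-separated fragments layout (chromosome, start, end, barcode, count), and the function name `keep_fixed_cells` is hypothetical.

```python
from typing import Iterable, Iterator, Set


def keep_fixed_cells(fragment_lines: Iterable[str], barcodes: Set[str]) -> Iterator[str]:
    """Yield only fragment lines whose barcode is in the fixed barcode set.

    Assumes tab-separated columns: chrom, start, end, barcode, count.
    Comment lines starting with '#' are skipped.
    """
    for line in fragment_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] in barcodes:
            yield line


# Barcodes called as cells in the FULL data (toy example).
fixed_barcodes = {"AAAC", "GGGT"}

# Fragments from the DOWNSAMPLED data (toy example).
downsampled = [
    "# header comment\n",
    "chr1\t100\t250\tAAAC\t1\n",
    "chr1\t300\t480\tTTTT\t2\n",
    "chr2\t50\t190\tGGGT\t1\n",
]
kept = list(keep_fixed_cells(downsampled, fixed_barcodes))
```

Because the barcode set comes from the full data, the number and identity of cells stay the same no matter how deeply the reads are downsampled.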
We performed cisTopic clustering, Freemuxlet donor assignment, Seurat cell type annotation and consensus peak calling in fixedcells_2_cistopic.
We re-counted each sample's fragments in its own consensus peak set, redid the Seurat cell type annotation in fixedcells_3_cistopic_consensus, and performed all further downstream analysis (such as DAR calling and motif enrichment analysis) based on these count matrices. FRIP scores were also calculated using each sample's specific consensus peaks. Freemuxlet donor assignment was reused from the first pass in fixedcells_2_cistopic because it is a BAM-level analysis and independent of consensus peaks. We also calculated new consensus peak sets for each sample, and aggregated these second-pass consensus peak sets into one master peak set.
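For readers unfamiliar with the FRIP score (fraction of reads in peaks), a minimal sketch of the calculation is given below. It is not the implementation used in this repository: it assumes half-open intervals, peaks that do not overlap each other within a chromosome, and counts a fragment as "in peaks" if it overlaps any peak by at least one base. The function name `frip` is hypothetical.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def frip(fragments: Iterable[Interval], peaks: Iterable[Interval]) -> float:
    """Fraction of fragments overlapping at least one peak.

    Assumes the peaks on each chromosome do not overlap each other.
    """
    # Build per-chromosome sorted lists of peak starts and ends.
    index: Dict[str, Tuple[List[int], List[int]]] = {}
    for chrom, start, end in sorted(peaks):
        starts, ends = index.setdefault(chrom, ([], []))
        starts.append(start)
        ends.append(end)

    total = 0
    in_peaks = 0
    for chrom, start, end in fragments:
        total += 1
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        # Last peak that starts before the fragment ends.
        i = bisect_left(starts, end) - 1
        if i >= 0 and ends[i] > start:
            in_peaks += 1
    return in_peaks / total if total else 0.0


example_peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
example_fragments = [
    ("chr1", 150, 250),  # overlaps the first peak
    ("chr1", 200, 300),  # starts exactly where the first peak ends: no overlap
    ("chr1", 450, 510),  # overlaps the second peak
    ("chr2", 10, 20),    # chromosome without peaks
]
score = frip(example_fragments, example_peaks)
```

Since the score depends on the peak set used, calculating it against each sample's own consensus peaks, as described above, gives a different value than calculating it against a shared peak set.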
We recounted all fragments of all cells in all samples in the master peak set to generate a fixedcells_merged cisTopic object, and performed some analyses on the merged object in fixedcells_4_merged.
In fixedcells_5_cell_downsampling, we performed some analyses to investigate the effect of number of cells on some metrics, mostly Seurat cell type assignment and DAR calling. In order to do this, we subsampled each of the 47 individual fixedcells cisTopic objects to 2500, 2000, 1500, ... cells.
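Random cell subsampling of this kind can be sketched as follows. The sketch assumes each target size is drawn independently from the full set of cells (rather than nesting smaller subsets inside larger ones) and skips sizes larger than the sample; both choices, as well as the name `subsample_series` and the seed, are assumptions for illustration.

```python
import random
from typing import Dict, List, Sequence


def subsample_series(
    barcodes: Sequence[str],
    sizes: Sequence[int],
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Draw one random subset of cells per requested size.

    Sizes larger than the number of available cells are skipped.
    Each subset is drawn independently, without replacement.
    """
    rng = random.Random(seed)
    subsets: Dict[int, List[str]] = {}
    for n in sizes:
        if n <= len(barcodes):
            subsets[n] = rng.sample(list(barcodes), n)
    return subsets


# Toy sample with 2,200 cells: the 2,500-cell subset cannot be drawn.
all_cells = [f"cell{i}" for i in range(2200)]
series = subsample_series(all_cells, sizes=[2500, 2000, 1500, 1000, 500])
```

Fixing the seed keeps the subsets reproducible between runs, which matters when downstream metrics are compared across subset sizes.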
We attempted to do some analyses on a merged cisTopic object in which each technology had the same number of cells (fixedcells_6_merged_equalcells), equal to the number of cells of the technology with the fewest cells (s3-ATAC). However, we were doing the cell downsampling analysis at the same time and saw that the number of cells per cell type also had an impact on the analysis.
So, in fixedcells_7_merged_equalcells_celltypefair, we did the same, but now we took the same number of cells for each cell type for each technology, and in fixedcells_8_individual_tech_cistopic_objects, we simply split this merged object into 8 objects, one for each tech.
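A cell-type-fair subsampling of this kind could look roughly like the sketch below. It assumes that, for each cell type, every technology contributes as many cells as the technology with the fewest cells of that type, and that cell types missing from any technology are dropped. These rules, the function name `celltype_fair_subsample`, and the seed are assumptions; the notebooks in the repository define the strategy actually used.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[str, str, str]  # (barcode, technology, cell_type)


def celltype_fair_subsample(
    cells: Iterable[Cell], seed: int = 0
) -> Dict[Tuple[str, str], List[str]]:
    """Select equal numbers of cells per cell type across technologies.

    Returns a mapping (technology, cell_type) -> selected barcodes.
    """
    by_type: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    technologies = set()
    for barcode, tech, cell_type in cells:
        by_type[cell_type][tech].append(barcode)
        technologies.add(tech)

    rng = random.Random(seed)
    selected: Dict[Tuple[str, str], List[str]] = {}
    for cell_type in sorted(by_type):
        per_tech = by_type[cell_type]
        if set(per_tech) != technologies:
            continue  # cell type absent from at least one technology
        n = min(len(barcodes) for barcodes in per_tech.values())
        for tech in sorted(per_tech):
            selected[(tech, cell_type)] = rng.sample(per_tech[tech], n)
    return selected


# Toy input with two technologies and two cell types.
toy_cells = (
    [(f"a_b{i}", "techA", "B cell") for i in range(5)]
    + [(f"a_t{i}", "techA", "T cell") for i in range(3)]
    + [(f"b_b{i}", "techB", "B cell") for i in range(2)]
    + [(f"b_t{i}", "techB", "T cell") for i in range(4)]
)
fair = celltype_fair_subsample(toy_cells)
```

Splitting the result by technology then gives one balanced set of cells per technology, analogous to the individual objects in fixedcells_8_individual_tech_cistopic_objects.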
The same strategy was employed to create cisTopic objects for each technology that had the same number of cells for each cell type, and also from each donor within each cell type (fixedcells_9_individual_malefemale_celltypefair). Since s3-ATAC had so few cells compared to the rest, some concessions had to be made. The subsampling is done in fixedcells_4_merged/9a_subset_malefemale.ipynb, where you can see the strategy used.
In full_5_cellranger, we realigned all the 10x scATAC-seq data using Cell Ranger. We also performed our comparison with VSN there, and calibrated the Seurat scores using the multiome data. We then filtered cells, re-aligned the downsampled multiome data in fixedcells_cellranger_arc, and analysed the results (Venn diagram and correlations between ATAC and RNA counts).
In fixedcells_downsample_series, most of the analyses described above were repeated on further read-downsampled data (35k, 30k, ... 5k reads/cell).
The public_* directories contain the analysis of all public data, including a read-downsampling analysis.

Contributing authors

All of these analyses were performed at the Stein Aerts lab by Florian De Rop, but they were largely based on a strong foundation laid by Christopher Flerin, who designed the initial analysis workflow. Gert Hulselmans also played a major role, as he designed PUMATAC (then still part of VSN) together with Christopher, and wrote most of the low-level scripts that work at the fragments and FASTQ level (calling bwa-mem, detecting and correcting barcodes, writing fragments files, calculating Jaccard indices, calling and speeding up Freemuxlet, subsampling BAM files, ...). This benchmark was supervised by Holger Heyn and Stein Aerts, who coordinated all work shown here and helped shape major decisions at critical points. All work shown here was done with the highest regard for fairness and transparency. If you have any questions, suggestions or criticisms, please contact us or open a GitHub issue.
