Processing of SIMPLE-seq datasets.
SIMPLE-seq is a scalable method for joint analysis of 5mC and 5hmC in single cells. This repository provides scripts for decoding the cellular barcodes of SIMPLE-seq datasets (modified from the ligation-based combinatorial barcoding of SPLiT-seq) and for identifying 5mC and 5hmC sites in individual cells.
- bowtie2, http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
- samtools, http://www.htslib.org/ (samtools version >= 1.3.1 is required)
- Trim_galore, https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
- Optional: FastQC, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Compile the "simplecov" tool:
  cd simpleconv
  sh make.sh
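Before running the pipeline, it can help to confirm that the listed tools are on PATH and that samtools meets the version requirement. The following is a minimal sketch, not part of this repository: the function names `version_ge` and `check_tools` are hypothetical, and the version comparison assumes a `sort` that supports `-V` (for example GNU coreutils).

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with this repository.

# Succeed if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeed only if every tool named in the arguments is found on PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (uncomment to run on your system):
# check_tools bowtie2 samtools trim_galore
# ver=$(samtools --version | head -n 1 | awk '{print $2}')
# version_ge "$ver" "1.3.1" || echo "samtools >= 1.3.1 is required" >&2
```

The usage lines are commented out so the snippet can be sourced without side effects.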
Extract the cellular barcodes from Read2, map the reads to the reference cell_ID, and convert the mapped cell ID SAM files to usable fastq files.
*** Please modify the paths to the reference files according to the annotations in the script file.
Use sh shellscrips/01.pre_process_simple_seq_fastq.sh [sample_prefix].
- Sample_combined.fq.gz: This is a combined fastq file containing the Read1 sequences and qualities together with the barcode sequences extracted from Read2.
- Sample_BC.sam: This is a temporary file used to assign the extracted barcode sequences to cellular barcodes. Please delete this file once you have successfully obtained Sample_BC_cov.fq.gz.
- Sample_BC_cov.fq.gz: This is the fastq file with the Read1 sequences and qualities. The cellular barcode and UMI from Read2 are now stored in the read name field of the fastq file (and of subsequent alignment files).
As SIMPLE-seq only introduces "C-to-T" mutations at 5mC and 5hmC sites, we use bowtie2 (instead of a dedicated methylation aligner) for mapping.
Use sh shellscrips/02.proc_mapping.sh [sample_prefix].
This step splits the 5mC and 5hmC reads into separate alignment files (bam files) based on the indicator sequences.
Use perl perlscripts/02.split_modality.pl [sample_sorted.bam].
Three files will be generated: [sample_sorted.bam_5mC.bam], [sample_sorted.bam_5hmC.bam] and [sample_sorted.bam_other.bam]. Reads that cannot be perfectly assigned to 5mC or 5hmC will be written to "XXX_other.bam".
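As a quick sanity check after splitting, the three expected output names can be derived from the input file name using the naming pattern described above (input name plus a modality suffix). This is only a sketch: the helpers `split_outputs` and `check_split_outputs` are hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with this repository.

# Print the three output names described for 02.split_modality.pl,
# given the input bam file name as $1.
split_outputs() {
    for suffix in 5mC 5hmC other; do
        printf '%s\n' "${1}_${suffix}.bam"
    done
}

# Return non-zero if any expected output file is missing or empty.
check_split_outputs() {
    status=0
    for suffix in 5mC 5hmC other; do
        f="${1}_${suffix}.bam"
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage:
# split_outputs sample_sorted.bam
# check_split_outputs sample_sorted.bam && echo "all split files present"
```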
This step converts the bam files to an intermediate modification information file and then generates cell-to-modification abundance matrices.
Step 1: perl perlscripts/03.bam2srf.pl [sample_sorted.bam_5mC/5hmC.bam].
Step 2: perl perlscripts/04.srf2mtx.pl [input.srf] [binsize].
The resulting matrix can be used for downstream single-cell analysis.
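To illustrate what a bin size means for a cell-by-bin abundance matrix, the sketch below counts records per cell and genomic bin. It is a conceptual illustration only and does not reproduce 04.srf2mtx.pl: the input layout (tab-separated cell barcode, chromosome, 1-based position) is an assumption and is not the intermediate format written by 03.bam2srf.pl, and the function name `bin_counts` is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration of fixed-size genomic binning.
# Input (stdin): tab-separated cell barcode, chromosome, 1-based position.
# This layout is an assumption, NOT the repository's intermediate format.
# Output: cell, chromosome, 0-based bin start, number of records in the bin.
bin_counts() {
    awk -v b="$1" 'BEGIN { FS = OFS = "\t" }
    {
        # Bin start for a 1-based position, e.g. positions 1..b map to 0.
        start = int(($3 - 1) / b) * b
        n[$1 OFS $2 OFS start]++
    }
    END { for (k in n) print k, n[k] }' | LC_ALL=C sort
}

# Example usage with a 100 bp bin size:
# printf 'cellA\tchr1\t10\ncellA\tchr1\t150\n' | bin_counts 100
```

Each output line corresponds to one non-zero entry of a sparse cell-by-bin matrix; a larger bin size yields fewer, denser bins.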