Utilising Cogent Analysis Pipeline (CogentAP) and Discovery Software (CogentDS) to analyse data from a novel single-cell approach
This repository contains R scripts used to analyse single-cell RNA-seq data generated by a novel approach. The approach combines the CellenONE® X1 instrument from CELLENION, which uses Image Based Single Cell Isolation (IBSCIT™) to isolate and sort cells, with the ICELL8® cx Single-Cell System from Takara, which processes the cells for sequencing.
- Cogent NGS Analysis Pipeline v1.0
- cutadapt v2.5
- STAR v2.7.2b
- samtools v1.1
- featureCounts v1.6.4
- GATK v4.1.9.0
- GEOquery and RCurl - download the single-cell and bulk data from GEO
- Cogent NGS Discovery Software (CogentDS)
- ggpubr
- reshape2
- VennDiagram
- DESeq2
- GGally
- circlize
- VariantAnnotation
For both the ICELL8 and composite datasets, the analysis steps used to generate the gene matrix and metadata from FASTQ files are as follows:
- Demultiplex reads based on barcodes each cell is affiliated with (found in the `wellLists` directory):
cogent demux -i /path/to/scRNAseq.fastq.gz -p /path/to/scRNAseq.fastq.gz -t ICELL8_FLA -b /path/to/wellList.txt -o /path/to/demux_out --gz
- Build genome based on the ENSEMBL hg38 fasta file and GTF file v103:
cogent add_genome -g hg38-v103 -f /path/to/Homo_sapiens.GRCh38.dna.toplevel.fa -a /path/to/Homo_sapiens.GRCh38.103.gtf
- Analyse data with human genome reference. In brief, the following steps are utilised:
  - Trim reads using `cutadapt`, whereby N's are trimmed at ends of reads, 3' bases with quality <20 are trimmed, and reads with more than 70% of their length with N's and/or shorter than 15 bases after trimming are removed.
  - Align reads to hg38 genome using `STAR`.
  - Quantify reads in exonic, genic (including introns) and mitochondrial regions for all hg38 v103 genes using `featureCounts`, where only primary alignments are counted.
  - Summarise data and re-organise into gene matrix, metadata and gene info.
  - Generate `CogentDS` report with default parameters.
cogent analyse -i /path/to/demux_out/demux_out_demuxed_R1.fastq.gz -t ICELL8_3DE -g hg38-v103 -o /path/to/analysis_out -d /path/to/demux_out/demux_out_counts_all.csv
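The trimming and filtering rules listed above can be illustrated with a small, self-contained Python sketch. It is not the code CogentAP or `cutadapt` runs: the function names, the order of operations (quality trimming first, then N trimming) and the choice to measure the N fraction after trimming are assumptions made for illustration. The 3' quality step uses a running-sum rule in the style that `cutadapt` documents for its quality trimming.

```python
def quality_trim_3prime(quals, cutoff=20):
    """Return the read length to keep after 3' quality trimming.

    Running-sum rule: walk from the 3' end, accumulate (cutoff - quality),
    and cut where that sum is largest; stop once it drops below zero.
    """
    running, best, keep = 0, 0, len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += cutoff - quals[i]
        if running < 0:
            break
        if running > best:
            best, keep = running, i
    return keep


def trim_and_filter(seq, quals, cutoff=20, max_n_frac=0.7, min_len=15):
    """Apply the README's trimming rules to one read.

    Returns (trimmed_seq, trimmed_quals), or None if the read is discarded.
    Thresholds come from the README; everything else is an assumption.
    """
    # 1. Trim low-quality bases from the 3' end.
    keep = quality_trim_3prime(quals, cutoff)
    seq, quals = seq[:keep], quals[:keep]

    # 2. Trim N's from both ends of the read.
    start = len(seq) - len(seq.lstrip("N"))
    end = len(seq.rstrip("N"))
    seq, quals = seq[start:end], quals[start:end]

    # 3. Discard reads that are too short or mostly N.
    if not seq or len(seq) < min_len:
        return None
    if seq.count("N") / len(seq) > max_n_frac:
        return None
    return seq, quals


# Toy read: two leading N's and two low-quality bases at the 3' end.
example = trim_and_filter("NN" + "ACGT" * 5 + "TT", [30] * 22 + [5, 5])
```

In this toy call the two low-quality 3' bases and the two leading N's are removed, leaving a 20-base read that passes both filters.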
To make sure the bulk and single-cell RNA-seq datasets are comparable, the bulk RNA-seq data should be processed with the same read trimming, genome alignment and exonic read quantification steps (steps 3a-c) that were used for the single-cell RNA-seq data.
This analysis associated gene expression from individual cells/samples with specific underlying phenotypes. In brief, the following steps were utilised:
- Variants were called from alignment files for both bulk RNA-seq and scRNA-seq data using `GATK`.
- For each sample, top variants from the scRNA-seq data were selected as those that were (a) sample-specific, (b) corroborated by variants called from bulk RNA-seq data, (c) non-synonymous mutations, (d) with the highest mutation rate (i.e. maximum proportion of mutant-containing depth out of total depth at that position), and (e) with the highest overall depth.
- For each top variant, a multiple regression analysis was executed between average gene expressions per sample and the mutation percentage in that sample (i.e. proportion of cells in that sample containing the mutation).
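The variant selection criteria (a) to (e) can be read as a filter followed by a ranking. The Python sketch below shows one way to express that, assuming the three yes/no criteria are already available as flags and that ties in mutation rate are broken by total depth. The `Variant` class, its field names and the toy records are hypothetical and are not taken from the repository's scripts.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # All field names are hypothetical, chosen for this illustration.
    name: str
    alt_depth: int         # reads supporting the mutant allele
    total_depth: int       # all reads covering the position
    sample_specific: bool  # criterion (a)
    in_bulk: bool          # criterion (b): also called from bulk RNA-seq
    non_synonymous: bool   # criterion (c)

    @property
    def mutation_rate(self):
        """Criterion (d): mutant-containing depth over total depth."""
        return self.alt_depth / self.total_depth if self.total_depth else 0.0


def top_variants(variants, n=1):
    """Keep variants passing (a)-(c), rank by (d) then (e), return the top n."""
    eligible = [
        v for v in variants
        if v.sample_specific and v.in_bulk and v.non_synonymous
    ]
    eligible.sort(key=lambda v: (v.mutation_rate, v.total_depth), reverse=True)
    return eligible[:n]


# Toy records for one sample (made-up numbers).
toy = [
    Variant("v1", 9, 10, True, True, True),
    Variant("v2", 90, 100, True, True, True),
    Variant("v3", 10, 10, True, False, True),  # not corroborated by bulk
    Variant("v4", 5, 100, True, True, True),
]
selected = [v.name for v in top_variants(toy, n=2)]
```

Here `v3` is excluded despite its high mutation rate because it fails criterion (b), and `v2` ranks above `v1` because both have the same mutation rate but `v2` has the greater depth.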
The scripts used for this analysis can be found in the `figure5` directory.
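As a rough illustration of the regression step described above, the Python sketch below computes the mutation percentage of a sample (the proportion of its cells carrying the mutation) and fits a least-squares line of average gene expression against that percentage across samples. It uses a single predictor per gene, which is a simplification; the actual scripts may fit a different model. All names and numbers here are made up for the example.

```python
def mutation_fraction(cell_has_mutation):
    """Proportion of cells in a sample that carry the mutation."""
    return sum(cell_has_mutation) / len(cell_has_mutation)


def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared). Assumes x is not constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    return slope, intercept, r_squared


# Toy data: four samples, each a list of per-cell mutation flags.
samples = [
    [False, False, False, False],
    [True, False, False, False],
    [True, True, False, False],
    [True, True, True, True],
]
mutation_pct = [mutation_fraction(cells) for cells in samples]

# Hypothetical average expression of one gene in the same four samples.
avg_expression = [1.0, 2.0, 3.0, 5.0]

slope, intercept, r_squared = linear_fit(mutation_pct, avg_expression)
```

A gene whose average expression tracks the mutation percentage closely gives an `r_squared` near 1, while an unrelated gene gives a value near 0.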
As an additional assessment of the association between the sample phenotypes (mutations) and the gene expressions, the top variants of each sample were searched for within individual cells from that sample, and a differential expression analysis was performed for all genes between mutated and non-mutated cells.
The script used for this analysis can be found in the `figure6` directory.
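The per-cell comparison described above splits the cells of a sample by mutation status and contrasts expression between the two groups. The Python sketch below computes only a log2 fold change of mean expression with a pseudocount. It is a simplified effect-size calculation, not a statistical test, and it does not reproduce what the repository's script does. The function name, the pseudocount and the toy values are assumptions.

```python
import math


def log2_fold_change(expression, is_mutated, pseudocount=1.0):
    """log2 ratio of mean expression in mutated vs non-mutated cells.

    `expression` and `is_mutated` are parallel per-cell lists for one gene.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    mutated = [e for e, m in zip(expression, is_mutated) if m]
    wild_type = [e for e, m in zip(expression, is_mutated) if not m]
    if not mutated or not wild_type:
        raise ValueError("both groups need at least one cell")
    mean_mut = sum(mutated) / len(mutated)
    mean_wt = sum(wild_type) / len(wild_type)
    return math.log2((mean_mut + pseudocount) / (mean_wt + pseudocount))


# Toy example: one gene measured in four cells, two of them mutated.
lfc = log2_fold_change([7.0, 7.0, 1.0, 1.0], [True, True, False, False])
```

A positive value means higher average expression in mutated cells; a real analysis would also need a significance test and multiple-testing correction across genes.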
The scripts were used to generate the figures and results for the following paper: Shomroni, O., Sitte, M., Schmidt, J. et al. A novel single-cell RNA-sequencing approach and its applicability connecting genotype to phenotype in ageing disease. Sci Rep 12, 4091 (2022). https://doi.org/10.1038/s41598-022-07874-1