This repo contains materials for the Gene Expression Analysis in R workshop for the R/Medicine Conference 2022.
- Piru Perampalam
- Siavash Ghaffari
Before attending the workshop, please clone this repo and make sure you have a working installation of R with the following packages:
# From Cran
install.packages("ggplot2")
install.packages("RColorBrewer")
install.packages("circlize")
# From Bioconductor (requires BiocManager)
BiocManager::install("DESeq2")
BiocManager::install("apeglm")
BiocManager::install("AnnotationDbi")
BiocManager::install("org.Hs.eg.db")
# From GitHub (requires devtools)
devtools::install_github("jokergoo/ComplexHeatmap")
devtools::install_github("kevinblighe/EnhancedVolcano")
This workshop will cover standard bulk RNA-seq analysis using R. For demonstration purposes, we will be using some unpublished prostate cancer data. For time limitations, we will begin will pre-aligned read counts. That is, raw read files (FASTQ files) have already been aligned to the human reference genome to obtain a read counts table. We will load these read counts into R and use DESeq2 to perform differential expression analysis.
Our experimental data was obtained by performing a drug screen on advanced-stage metastatic prostate cancer biopsies. In brief, tumor biopsies were obtained from five patients and dissociated into cell clumps. Patient tumor cells were seeded into plates and treated with either DMSO or an LSD1 inhibitor (LSD1i or SP2509) for 24 hours. Following this, cells were collected and total RNA was extracted for sequencing on the Illumina HiSeq platform. In addition, clinical attributes from the five patients were collected at the time of biopsy.
You will find read counts in the CSV file data/ProstateCancer_DMSO_SP2509_LSD1i_readcounts.csv
. Each column pertains to one sample and each row pertains to a unique transcript. Each patient has two samples: one that is DMSO-treated, and one that is treated with the drug (LDS1i, SP2509).
We also require some sample metadata (also referred to in DESeq2 as colData
). This metadata is stored in the CSV file data/ProstateCancer_sampleInfo.csv
. This file contains condition information (DMSO or SP2509) and replicate information (replicates 1 to 5 for patients 1 through 5). It also contains some clinical features:
- mutationCount: total number of mutations detected in the tumor biopsies by whole-genome sequencing
- diagnosisAge: age of the patient at the time of initial diagnosis (note: not the age at which biopsy was taken)
- psa: PSA levels at the time of biopsy
- stage: prostate cancer stage grouping based on the AJCC (American Joint Committee on Cancer) TNM system.
- tmb: tumor mutational burden (nonsynonymous mutations only)
The goals of this workshop are as follows:
- Load input data and do some simple data preparation
- Create DESeq2 data set (dds)
- Perform differential expression
- Perform some additional statistical operations
- Filter the data
- Make a PCA plot
- Make a sample clustering plot
- Make a heatmap to show top differentially expressed genes
- Make a volcano plot to look at log fold changes and p values
- Look at some individual gene expression data in relation to clinical attributes
After completing this workshop, you should feel comfortable analyzing bulk RNA-seq data in R!
We left out FASTQ alignments in this workshop due to the time it would take to perform this step. This is usually the most time consuming and resource intensive (CPU and memory) step in bulk RNA-seq analysis. If you are interested in learning about this step, please see a tutorial I have previously written: NGS alignments using STAR