Thank you to Alex's Lemonade Stand and the Seurat team for their teaching resources, from which much of the material below is adapted.

## scRNA-seq preprocessing and normalization

It's always important to do rigorous preprocessing before analyzing our datasets, and more so for noisy scRNA-seq data. These steps are essential for producing interpretable results in the downstream analysis. In this class, we’ll perform quality control and normalization of scRNA-seq count data with Seurat. 

### Setup

Many tools have been developed for the analysis of scRNA-seq data; we'll focus on one of the most widely used and well-maintained packages. Seurat is an R package that can perform QC, clustering, exploration, and visualization of scRNA-seq data, and its developers regularly add more features. It's a good place for us to start!

If you're interested, they have many more useful vignettes on their website: https://satijalab.org/seurat/get_started.html

We installed Seurat for you already! So you can proceed. 

In [None]:
library(Seurat)
packageVersion("Seurat")
library(tidyverse)

### Load in Darmanis et al. dataset

We're going to continue using the dataset we started working on in the prelab. 

This dataset is generated from human primary glioblastoma tumor cells with the Smart-seq2 protocol (https://www.ncbi.nlm.nih.gov/pubmed/29091775). We're using a subset of this dataset (n=1854 cells). The raw count matrix provided has been quantified using Salmon. 

In [4]:
# Load in raw counts from Darmanis et al.
sc_data <- read.table("~/22_Prelab_scRNAseq-I/data/unfiltered_darmanis_counts.tsv", header=T, sep="\t", row.names=1)

Take a look at the data. How many genes and how many cells are there in the raw dataset? What is the prefix of cell names?

In [None]:
head(sc_data)
dim(sc_data)

What percentage of the data matrix is 0? Is this surprising?

In [None]:
## provide code here

### Seurat can do our QC and normalization

Recall from the prelab that we looked at two QC metrics of interest: 1. number of reads detected per cell and 2. number of genes detected per cell. Now that we have a better understanding of these QC metrics, we can make our lives easier by using Seurat to visualize these QC stats (and more!) and filter out cells and features that don't pass QC metrics.

If you're interested, you can get a lot more information about the Seurat package and its capabilities here: https://satijalab.org/seurat/get_started.html

Before we start with Seurat, we first want to convert Ensembl gene IDs to HGNC symbols for easier interpretation. To save you time, we've provided those annotations directly so that you can load them into R. (Savvy students will note that you can also get them from Ensembl using the commented-out code, as you did in a previous module.) Run the following code:

In [7]:
#library(biomaRt)
#ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
#bm <- getBM(attributes=c("ensembl_gene_id", "hgnc_symbol"), filters="ensembl_gene_id", values=rownames(sc_data), mart=ensembl)

bm <- read.table("data/scRNASeq_annotations_r2.csv", sep=",", header=T)

hgnc.symbols <- bm$hgnc_symbol[match(rownames(sc_data), bm$ensembl_gene_id)]
sc_data <- as.matrix(sc_data)
rownames(sc_data) <- hgnc.symbols

#Filter out any rows where the HGNC symbol is blank or NA
sc_data <- subset(sc_data, rownames(sc_data) != "" & !is.na(rownames(sc_data)))
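
To make the `match()` lookup above more concrete, here is a minimal sketch on a made-up annotation table. The IDs, symbols, and object names (`toy_bm`, `toy_counts`) are hypothetical placeholders, not rows from the course files.

```r
# Toy annotation table (hypothetical IDs and symbols, for illustration only)
toy_bm <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  hgnc_symbol     = c("GENE1", "", "GENE3"),
  stringsAsFactors = FALSE
)

# Toy count matrix: 4 genes x 2 cells; ENSG_D is missing from the annotation
toy_counts <- matrix(1:8, nrow = 4,
                     dimnames = list(c("ENSG_C", "ENSG_A", "ENSG_D", "ENSG_B"),
                                     c("cell1", "cell2")))

# match() gives, for each row name, its position in the annotation table
# (NA when the ID is absent), so the symbols come back in matrix row order
idx <- match(rownames(toy_counts), toy_bm$ensembl_gene_id)
toy_symbols <- toy_bm$hgnc_symbol[idx]
toy_symbols

# Keep only rows with a non-missing, non-empty symbol, then rename them
keep <- !is.na(toy_symbols) & toy_symbols != ""
toy_filtered <- toy_counts[keep, , drop = FALSE]
rownames(toy_filtered) <- toy_symbols[keep]
toy_filtered
```

The unannotated gene and the gene with a blank symbol are both dropped, which is the same outcome the `subset()` call produces on the real matrix.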

###### 1) We'll start by reading in our count data as a Seurat object. This object will hold our count matrix, as well as data for all downstream processing steps or analyses (normalization, scaling, PCA, clustering results, etc.). We can specify extra parameters to keep only the features detected in at least *min.cells* cells and the cells with at least *min.features* detected features. 

In [None]:
sc <- CreateSeuratObject(counts=sc_data, project="Darmanis", min.cells=5, min.features=200)
sc
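
As a rough picture of what `min.cells` and `min.features` do, the sketch below applies the same two thresholds by hand to a tiny made-up matrix. The order of the two filters (cells first, then genes) is an assumption about Seurat's behaviour, and the variable names are illustrative only.

```r
# Toy count matrix: 4 genes x 5 cells (made-up numbers)
toy <- matrix(c(0, 0, 0, 0, 3,   # geneA: detected in 1 cell
                1, 2, 0, 4, 5,   # geneB: detected in 4 cells
                2, 0, 0, 1, 1,   # geneC: detected in 3 cells
                1, 1, 0, 2, 2),  # geneD: detected in 4 cells
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("gene", LETTERS[1:4]),
                              paste0("cell", 1:5)))

min_features <- 2  # keep cells with at least this many detected genes
min_cells    <- 3  # keep genes detected in at least this many cells

# Step 1 (assumed order): drop cells with too few detected genes
genes_per_cell <- colSums(toy > 0)
toy <- toy[, genes_per_cell >= min_features, drop = FALSE]

# Step 2: drop genes detected in too few of the remaining cells
cells_per_gene <- rowSums(toy > 0)
toy <- toy[cells_per_gene >= min_cells, , drop = FALSE]

dim(toy)
```

Here the empty `cell3` and the rarely detected `geneA` are removed, leaving a 3 x 4 matrix.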

The original count matrix is stored in the assay called `RNA`, in a layer called `counts`.

You can access this in a very rough way: 

`sc[["RNA"]]$counts`

However, Seurat offers helper functions to access these data, e.g. the `LayerData()` function:

`LayerData(sc, assay="RNA", layer="counts")`

which also returns the equivalent matrix. Try them both out so you can see!

Print the raw counts for the first 6 genes of the first 6 cells below:

###### 2) Seurat automatically generates the number of reads (nCount_RNA) and number of genes (nFeature_RNA) detected per cell. We can access this data in the `sc@meta.data` slot.


Use ggplot to generate a density plot of nCount_RNA - it should exactly match the density plot we produced in the prelab for total read count!

In [None]:
head(sc@meta.data)
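
For intuition about where these two columns come from, the following sketch recomputes them from a tiny made-up matrix in base R. It only mirrors the definitions given above (total counts and number of detected genes per cell); the toy values are invented.

```r
# Toy count matrix: 3 genes x 3 cells (made-up numbers)
toy <- matrix(c(5, 0, 2,
                0, 0, 1,
                3, 4, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("c1", "c2", "c3")))

toy_meta <- data.frame(
  nCount_RNA   = colSums(toy),      # total counts per cell
  nFeature_RNA = colSums(toy > 0)   # number of genes with a non-zero count per cell
)
toy_meta
```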

###### 3) One final QC metric we're often interested in is the percentage of reads mapping to the mitochondrial genome. A high percentage often indicates low-quality or dying cells. Seurat allows us to search for mitochondrial gene names and calculate the percentage of reads mapping to those genes. We can then stash these stats in our Seurat object's metadata, either by assigning to `sc[[<metadata_feature_name>]]` or, as below, by passing `col.name`.

In [12]:
sc <- PercentageFeatureSet(object=sc, pattern="^MT-", col.name="percent.mito")
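
The quantity being computed is simple enough to reproduce by hand. This is a minimal base R sketch on invented counts, not Seurat's own implementation: select the genes whose names match the pattern, then divide their summed counts by each cell's total.

```r
# Toy count matrix: 2 mitochondrial genes and 1 nuclear gene, 2 cells (made-up counts)
toy <- matrix(c(10,  0,
                 5,  5,
                85, 45),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("MT-CO1", "MT-ND1", "ACTB"), c("c1", "c2")))

# Genes whose symbol starts with "MT-"
is_mito <- grepl("^MT-", rownames(toy))

# Percentage of each cell's counts that come from those genes
percent_mito <- 100 * colSums(toy[is_mito, , drop = FALSE]) / colSums(toy)
percent_mito
```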

What would you have to change in the code above if we were working with mouse data?

###### 4) How is the quality of this experiment? How do you know? We can visualize some QC metrics of interest in a violin plot. We can also check that the number of genes detected correlates with read count.

In [None]:
VlnPlot(object=sc, features=c("nCount_RNA", "nFeature_RNA", "percent.mito"), ncol=3, pt.size=0.5)
FeatureScatter(object=sc, feature1="nCount_RNA", feature2="nFeature_RNA")

## Note that the warning messages are fine.
## You could remove them by running the following before creating the Seurat object:

## options(Seurat.object.assay.calcn = TRUE)

###### 5) Remove low-quality cells (high mitochondrial content), empty droplets (low number of genes), doublets/multiplets (abnormally high number of genes). 

Seurat lets us easily apply QC filters based on our custom criteria. In this case, we want cells with more than 250 but fewer than 2500 detected genes, and <10% mitochondrial content.

In [None]:
sc <- subset(sc, subset = nFeature_RNA > 250 & nFeature_RNA < 2500 & percent.mito < 10)
sc

How many genes and cells remain after QC?

###### 6) Recover cell type annotations, stash in `sc@meta.data`

Run the following code to add cell type annotations to our Seurat object metadata. This will be useful later when we're visualizing the different cell populations.

In [15]:
sc_metadata <- read.table("~/22_Prelab_scRNAseq-I/data/unfiltered_darmanis_metadata.tsv", sep="\t", header=T)

celltypes <- sc_metadata$cell.type.ch1[match(rownames(sc@meta.data), sc_metadata$geo_accession)]
sc@meta.data$celltype <- celltypes

Print the metadata for the first 6 cells below:

###### 7) Normalization, variable feature identification, and scaling

In Seurat v5, `SCTransform()` uses the v2 procedure by default. SCTransform is a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments.

Initially described in PMID: 31870423, this method fits a generalized linear model for each gene, predicting UMI counts (response variable) from sequencing depth (predictor variable). Fitting an unconstrained negative binomial or "zero-inflated" negative binomial model (to handle sparsity) separately for every gene can overfit and reduce signal. The key idea is instead to pool information across genes with similar abundance, which constrains the per-gene parameter estimates (i.e., 'regularization') while still explaining the variance in the data well. Hence, the model is a "regularized" negative binomial regression, and the regression framework also allows the inclusion of confounding covariates, like mitochondrial percentage. The Pearson residuals of this model serve as the normalized expression values. This work describes SCTransform (v1).

PMID: 34488842 provided an updated version (v2), which includes three major modifications to the initial approach: (1) fixing the slope of the negative binomial GLM to ln(10), the analytically derived solution, so that only the overdispersion and intercept parameters are estimated per gene; (2) excluding from regularization genes that have very low expression or whose variance of molecular counts does not exceed the mean, and instead setting their parameters to correspond to a Poisson model; and (3) applying a lower bound on the variance when calculating the Pearson residuals, which prevents genes with minimal information from producing high residual variance and ensures that very low UMI counts are not assigned very high Pearson residuals.

This procedure removes the need for heuristic steps such as pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression.
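
To make the idea of a Pearson residual from a negative binomial model less abstract, here is a heavily simplified sketch for a single gene. It is not the sctransform implementation: the intercept (`beta0`) and overdispersion (`theta`) are invented values rather than fitted and regularized estimates, and no lower bound on the variance or clipping is applied.

```r
# Simplified illustration of a negative binomial Pearson residual for ONE gene.
# All parameter values below are invented for the example.
depth  <- c(1000, 2000, 4000)   # total count per cell (sequencing depth)
counts <- c(1, 2, 10)           # observed counts of the gene in those cells

beta0 <- log(0.001)             # hypothetical per-gene intercept
slope <- log(10)                # fixed slope, with log10(depth) as the predictor
theta <- 10                     # hypothetical overdispersion parameter

# Expected count under the model: log(mu) = beta0 + slope * log10(depth)
mu <- exp(beta0 + slope * log10(depth))

# Negative binomial variance: mu + mu^2 / theta
nb_var <- mu + mu^2 / theta

# Pearson residual: (observed - expected) / sqrt(variance)
pearson_resid <- (counts - mu) / sqrt(nb_var)
round(pearson_resid, 3)
```

With the slope fixed at ln(10), the expected count is simply proportional to depth, so the first two cells sit at their expected value (residual near zero), while the third cell has more counts than its depth alone would predict and gets a positive residual.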

This step will take a little bit of time, so be patient! (around 6-7 minutes!).

In [None]:
sc_sct <- SCTransform(sc, vars.to.regress = "percent.mito", verbose = FALSE)

# This is another way to normalize for your reference
#sc <- NormalizeData(object=sc, normalization.method = "LogNormalize", scale.factor=10000)
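
For reference, the `LogNormalize` alternative mentioned in the comment above amounts to a per-cell rescaling followed by a log transform. The sketch below reproduces that arithmetic in base R on a made-up matrix; it is meant only to show the calculation, not to replace `NormalizeData()`.

```r
# Toy count matrix: 3 genes x 2 cells (made-up numbers)
toy <- matrix(c(1,  0,
                3, 10,
                6, 10),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("c1", "c2")))

scale_factor <- 10000

# Divide each cell (column) by its total count and multiply by the scale factor
rel <- sweep(toy, 2, colSums(toy), "/") * scale_factor

# Natural log of (1 + x), so zeros stay at zero
lognorm <- log1p(rel)
round(lognorm, 3)
```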

Now, print the scaled counts for the first 6 genes of the first 6 cells, comparing the original count data to the rescaled data.

**TIP**: Use the `LayerData()` function described above for this. In Seurat, the results from the SCTransform procedure with re-scaling are stored in a new assay named `SCT`. 


In [None]:
## code for original count data here


In [None]:
## code for SCT scaled data here


###### 8) Compare normalized data to raw count data

Let's look at how proper data processing impacts our ability to draw interpretable conclusions about the biology. We'll generate PCA plots for a simple normalization based on raw counts data and the SCT procedure.

In [None]:
# Runs PCA on SCT-processed data
sc_sct <- RunPCA(sc_sct, verbose=FALSE)
DimPlot(sc_sct, dims = c(1, 2), reduction="pca", label=FALSE, group.by="celltype")
pca_plot_sct <- DimPlot(sc_sct, dims = c(1, 2), reduction="pca", label=FALSE, group.by="celltype")

In [None]:
# Runs PCA using a more basic normalization and data processing procedure
sc_scale <- NormalizeData(sc, normalization.method = "RC")
sc_scale <- FindVariableFeatures(sc_scale, verbose=FALSE)
sc_scale <- ScaleData(sc_scale, verbose=FALSE)
sc_scale <- RunPCA(sc_scale, verbose=FALSE)

DimPlot(sc_scale, dims = c(1, 2), reduction="pca", label=FALSE, group.by="celltype")
#pca_plot_raw <- DimPlot(sc_scale, dims = c(1, 2), reduction="pca", label=FALSE, group.by="celltype")
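
To see what the "more basic" pipeline does to the numbers before PCA, here is a base R sketch of its two main transformations on a made-up matrix: relative counts (the `"RC"` method, which rescales by cell total without taking a log) followed by per-gene centering and scaling. It is an approximation for illustration; Seurat's `ScaleData()` has additional behaviour (for example, capping extreme values) that is not reproduced here.

```r
# Toy count matrix: 3 genes x 3 cells (made-up numbers)
toy <- matrix(c(1,  2,  3,
                4,  4, 10,
                5, 14,  7),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("c1", "c2", "c3")))

# Relative counts: divide by each cell's total, multiply by a scale factor, no log
rc <- sweep(toy, 2, colSums(toy), "/") * 10000

# Per-gene z-score: each gene (row) ends up with mean 0 and standard deviation 1
z <- t(scale(t(rc)))
round(z, 3)
```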

Do you see any differences in the above plots? What are they?



###### 9) Finally, save your processed Seurat object for future downstream analysis.

In [46]:
# Save the SCT-normalized object (sc_sct), not the unnormalized sc
saveRDS(sc_sct, file="sc_Darmanis_normalized.rds")
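
If you have not used `.rds` files before, the short sketch below shows the save and reload round trip on a small stand-in object written to a temporary file. The object and path here are placeholders; a Seurat object is saved and restored the same way.

```r
# Stand-in object (a plain list, used here instead of a Seurat object)
toy_obj <- list(counts = matrix(1:4, nrow = 2), project = "toy")

# Write to a temporary .rds file, then read it back
tmp <- tempfile(fileext = ".rds")
saveRDS(toy_obj, file = tmp)
restored <- readRDS(tmp)

# The restored object is identical to the one that was saved
identical(toy_obj, restored)
```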

## Stage Two.

###### We're going to be analyzing the Zheng et al. dataset (https://www.nature.com/articles/ncomms14049). This dataset consists of 68k FACS-sorted immune cells from peripheral blood. We'll use a small downsampled subset to save time and memory. 

###### Using what you've learned above, process the Zheng et al. dataset. You DON'T need to rerun every step, only the ones essential for producing the final processed Seurat object. 

######  These steps include: 
1. Visualize QC metrics
2. Filter out low-quality cells and lowly expressed genes. Criteria: nFeature_RNA > 500, nFeature_RNA < 2500, percent.mito < 10
3. Normalize
4. Scale, regress out mitochondrial content variable
5. Save the filtered, processed Seurat object

###### Make sure to include all essential code!

###### Tip: you can set image plotting sizes via `options(repr.plot.width=..., repr.plot.height=...)` which will help the sizing of these a bit

In [None]:
# We loaded in the data for you
Zheng_data <- read.table("data/Zheng_pbmc_downsample300_C9_filt_r2.txt", sep="\t", header=T, check.names=F)

geneids <- as.character(Zheng_data[,ncol(Zheng_data)])
Zheng_data$gene_symbols <- NULL
Zheng_data <- as.matrix(Zheng_data)
rownames(Zheng_data) <- geneids

# Create Seurat object
sc <- CreateSeuratObject(counts=Zheng_data, project="Zheng2017")

# Store type information in meta data object
celltype <- sapply(strsplit(rownames(sc@meta.data), split="_"), function(x) x[[2]])
sc@meta.data$celltype <- celltype
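
The `strsplit()` and `sapply()` combination above pulls the cell type out of each cell name. Here is a small sketch of that step on invented names; the `<barcode>_<celltype>` layout is an assumption inferred from the code, and the example names are hypothetical.

```r
# Hypothetical cell names of the form "<barcode>_<celltype>"
toy_names <- c("AAAC_bcells", "AAAG_monocytes", "AAAT_bcells")

# strsplit() returns one character vector per name, split at "_"
parts <- strsplit(toy_names, split = "_")

# Take the second piece of each name as the cell type
toy_celltype <- sapply(parts, function(x) x[[2]])
table(toy_celltype)
```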

**Q1.** (a) Visualize QC metrics of interest and (b) filter out poor-quality cells.

In [None]:
# Visualize QC metrics


In [None]:
#Filter out poor-quality cells


**Q2.** Apply SCTransform and make sure to regress out mitochondrial content. Save the result to a variable named `sc` (i.e., overwrite the current `sc` variable). 

Run PCA on the SCT-processed data and visualize the output.

Use `saveRDS()` to save the Seurat object as an .rds file named `sc_Zheng_normalized.rds`.

In [None]:
# apply SCTransform


In [8]:
# Runs PCA on SCT-processed data


In [None]:
#Use saveRDS() to save the Seurat Object


**Q3** In Seurat, what is the goal of dimensional reduction? What is meant by an "embedding" and a "loading"? What are those objects and what do they represent?

**Q4** Reviewing the [list of basic commands available on Seurat's website](https://satijalab.org/seurat/articles/seurat5_essential_commands), output the embeddings and save the embeddings matrix to a file called: `Zheng_pca_embeddings_output.txt`


In [0]:
## Written from scratch and untested; may contain bugs
Zheng_embeddings <- Embeddings(sc, reduction = "pca")
# Keep the cell names as row names so each row can be traced back to its cell
write.table(Zheng_embeddings, file="Zheng_pca_embeddings_output.txt", sep="\t", quote=F, col.names=NA)

**Q5** What are the primary sources of technical variability and biases you would be worried about in this experiment? See Zheng et al. for information about the experiment and Ziegenhain et al. for an overview of scRNA-seq technical biases (both papers are in your directory).