analysis/seurat.Rmd

---
title: "Getting started with Seurat"
date: "`r Sys.Date()`"
output:
  workflowr::wflow_html:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# https://stackoverflow.com/questions/30237310/setting-work-directory-in-knitr-using-opts-chunksetroot-dir-doesnt-wor
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
```

This post follows the Peripheral Blood Mononuclear Cells (PBMCs) [tutorial](https://satijalab.org/seurat/articles/pbmc3k_tutorial) for 2,700 single cells. It contains the notes I wrote while going through the tutorial. The dataset can be downloaded from the [10X Genomics dataset page](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k), but it is also hosted on Amazon (see below). The PBMCs, which are primary cells with relatively small amounts of RNA (around 1 pg RNA/cell), come from a healthy donor. There were 2,700 cells detected and sequencing was performed on an Illumina NextSeq 500 with around 69,000 reads per cell. To get started, [install Seurat](https://satijalab.org/seurat/articles/install.html) using `install.packages()`.

```{r install_seurat, eval=FALSE}
install.packages("Seurat")
```

If you get the warning:

>‘SeuratObject’ was built under R 4.3.0 but the current version is 4.3.2; it is recomended that you
reinstall ‘SeuratObject’ as the ABI for R may have changed

re-install the `SeuratObject` package using a repository that has an updated copy. The same goes for the `htmltools` package.

```{r install_seurat_obj, eval=FALSE}
install.packages("SeuratObject", repos = "https://cran.ism.ac.jp/")
install.packages("htmltools", repos = "https://cran.ism.ac.jp/")
packageVersion("SeuratObject")
packageVersion("htmltools")
```

Load `Seurat`.

```{r load_seurat}
library("Seurat")
packageVersion("Seurat")
```

## Data

To follow the tutorial, you need the 10X data.

```{bash, eval=FALSE}
mkdir -p data/pbmc3k && cd data/pbmc3k
wget -c https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
```

The extracted files.

```{bash}
ls -1 data/pbmc3k/filtered_gene_bc_matrices/hg19
```

`matrix.mtx` is a [MatrixMarket](https://math.nist.gov/MatrixMarket/formats.html) file. It has the following properties:

* Only non-zero entries are stored in the file
* The first line is a header that starts with `%%MatrixMarket`; other comment lines start with a `%`, like LaTeX
* The first non-comment line gives the total number of rows, columns, and non-zero entries
* Each following line gives a row number, a column number, and the value at that coordinate

```{bash}
head data/pbmc3k/filtered_gene_bc_matrices/hg19/matrix.mtx
```
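To make these properties concrete, here is a small sketch that writes a toy sparse matrix to a temporary MatrixMarket file and reads it back using the `Matrix` package. The matrix values and the temporary file are made up for illustration and are unrelated to the PBMC data.

```{r}
library(Matrix)

# Toy 3 x 2 matrix with three non-zero entries (hypothetical values)
toy_mat <- sparseMatrix(
  i = c(1, 3, 2),
  j = c(1, 1, 2),
  x = c(4, 1, 7),
  dims = c(3, 2)
)

# Write to a temporary file in MatrixMarket coordinate format
toy_mtx <- tempfile(fileext = ".mtx")
invisible(writeMM(toy_mat, file = toy_mtx))

# Header line, then the dimension line, then one line per non-zero entry
readLines(toy_mtx)

# Reading the file back recovers the same matrix
toy_back <- readMM(toy_mtx)
all(as.matrix(toy_back) == as.matrix(toy_mat))
```

The zero entries never appear in the file, which is why this format suits single-cell count matrices where most values are zero.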

## Seurat object

Load 10x data into a matrix.

```{r}
pbmc.data <- Read10X(data.dir = "data/pbmc3k/filtered_gene_bc_matrices/hg19/")
```

The data are stored as a sparse matrix of class [dgCMatrix](https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgCMatrix-class.html).

```{r}
class(pbmc.data)
```

32,738 genes and 2,700 cells.

```{r}
dim(pbmc.data)
```

Check out the first six genes and cells.

```{r}
pbmc.data[1:6, 1:6]
```

Summary of total expression per single cell.

```{r}
summary(colSums(pbmc.data))
```

Check how many genes have at least one transcript in each cell.

The median number of detected genes among the single cells is 817.

```{r}
# Count detected genes per cell without converting the sparse matrix to dense
at_least_one <- Matrix::colSums(pbmc.data > 0)
hist(
  at_least_one,
  breaks = 100,
  main = "Distribution of detected genes",
  xlab = "Genes with at least one tag"
)
abline(v = median(at_least_one), col = 2, lty = 3)
```

Total expression per cell. The median sum of expression among the single cells is 2,197. This distribution is very similar to the distribution of detected genes shown above.

```{r}
hist(
  colSums(pbmc.data),
  breaks = 100,
  main = "Expression sum per cell",
  xlab = "Sum expression"
)
abline(v = median(colSums(pbmc.data)), col = 2, lty = 3)
```

We will filter out genes and single cells before we continue with the analysis. The tutorial uses the (arbitrary) thresholds of keeping genes detected in three or more cells and keeping cells with at least 200 detected genes.

Manually check the number of genes detected in three or more cells; many genes fall below this threshold.

```{r}
# Number of cells in which each gene is detected
tmp <- Matrix::rowSums(pbmc.data > 0)
table(tmp>=3)
```

After removing those genes, all cells still have at least 200 detected genes.

```{r}
keep <- tmp>=3
tmp <- pbmc.data[keep,]
at_least_one <- Matrix::colSums(tmp > 0)
summary(at_least_one)
```


```{r}
dim(tmp)
```
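The two filters can be sketched on a toy count matrix. The gene and cell names, the counts, and the thresholds (three cells, two genes) are invented for illustration; the order of the filters follows the manual check above and is not necessarily how `CreateSeuratObject()` applies them internally.

```{r}
library(Matrix)

# Toy count matrix: 5 genes x 4 cells (hypothetical values)
toy_counts <- Matrix(
  c(1, 0, 2, 3,
    0, 0, 0, 5,
    4, 1, 1, 0,
    0, 0, 0, 0,
    2, 2, 1, 0),
  nrow = 5, byrow = TRUE, sparse = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)

# Keep genes detected (count > 0) in at least 3 cells
toy_gene_keep <- Matrix::rowSums(toy_counts > 0) >= 3

# Among the retained genes, keep cells with at least 2 detected genes
toy_cell_keep <- Matrix::colSums(toy_counts[toy_gene_keep, , drop = FALSE] > 0) >= 2

toy_filtered <- toy_counts[toy_gene_keep, toy_cell_keep, drop = FALSE]
dim(toy_filtered)
```

Here `gene2` and `gene4` are dropped because they are detected in fewer than three cells, and `cell4` is dropped because only one of the remaining genes is detected in it.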

See `?SeuratObject` for more information on the class.

```{r}
pbmc <- CreateSeuratObject(
  counts = pbmc.data,
  min.cells = 3,
  min.features = 200,
  project = "pbmc3k"
)

class(pbmc)
```

The object has the same number of genes and cells as the manually filtered matrix above.

```{r}
pbmc
```

Slots in Seurat object.

> SeuratObject: Data Structures for Single Cell Data
>
> Defines S4 classes for single-cell genomic data and associated information, such as dimensionality reduction embeddings, nearest-neighbor graphs, and spatially-resolved coordinates. Provides data access methods and R-native hooks to ensure the Seurat object is familiar to other R users

Read more about the [S4 class](https://adv-r.hadley.nz/s4.html) in the Advanced R book.

```{r}
slotNames(pbmc)
```


## Basic filtering

The tutorial states that "The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat." In the object created above, these are stored in the metadata as `nFeature_RNA` and `nCount_RNA`. The number of UMIs is calculated as `num.mol <- colSums(object.raw.data)`, i.e. each transcript is a unique molecule. The number of genes is simply the tally of genes with at least one transcript: `num.genes <- colSums(object.raw.data > is.expr)`, where `is.expr` is zero.
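As a quick illustration of those two column sums, the sketch below computes them on a toy matrix. The gene and cell names and the counts are made up; the real values are already present in the object metadata.

```{r}
library(Matrix)

# Toy UMI matrix: 3 genes x 3 cells (hypothetical values)
toy_umi <- Matrix(
  c(3, 0, 1,
    0, 0, 2,
    5, 1, 0),
  nrow = 3, byrow = TRUE, sparse = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2", "cell3"))
)

# Total number of molecules per cell
toy_n_count <- Matrix::colSums(toy_umi)

# Number of genes with at least one molecule per cell
toy_n_feature <- Matrix::colSums(toy_umi > 0)

data.frame(nCount = toy_n_count, nFeature = toy_n_feature)
```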

A common quality control metric is the percentage of transcripts from the mitochondrial genome. According to the paper [Classification of low quality cells from single-cell RNA-seq data](https://pubmed.ncbi.nlm.nih.gov/26887813/), if a single cell is lysed, cytoplasmic RNA is lost whereas RNA enclosed in the mitochondria is retained and sequenced, so a high mitochondrial proportion can flag a damaged cell.

Mitochondrial genes conveniently start with `MT-`.

```{r}
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@assays$RNA), value = TRUE)
length(mito.genes)
```


```{r}
percent.mito <- Matrix::colSums(pbmc[['RNA']]$counts[mito.genes, ]) / Matrix::colSums(pbmc[['RNA']]$counts)
head(percent.mito)
```
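Note that the calculation above yields a fraction between 0 and 1 rather than a percentage, which is why the filtering threshold used later is 0.05 (that is, 5%). A minimal sketch on a toy matrix, with made-up counts for three genes, shows the same calculation:

```{r}
# Toy counts: two mitochondrial genes and one nuclear gene (hypothetical values)
toy_mt <- matrix(
  c(10,  0,
     5,  5,
    85, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "ACTB"), c("cell1", "cell2"))
)

# Select mitochondrial genes by their "MT-" prefix
toy_mito_genes <- grep("^MT-", rownames(toy_mt), value = TRUE)

# Fraction of each cell's counts that come from mitochondrial genes
toy_mito_frac <- colSums(toy_mt[toy_mito_genes, , drop = FALSE]) / colSums(toy_mt)
toy_mito_frac

# The same values expressed as percentages
toy_mito_frac * 100
```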

Check out the metadata.

```{r}
head(pbmc@meta.data)
```

Add the mitochondrial proportion to the metadata.

```{r}
pbmc <- AddMetaData(object = pbmc,
                    metadata = percent.mito,
                    col.name = "percent.mito")
head(pbmc@meta.data)
```

Visualise the QC metrics (number of genes, number of UMIs, and proportion of mitochondrial transcripts) as violin plots.

```{r}
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3)
```

A couple of cells have a high mitochondrial proportion, which may indicate loss of cytoplasmic RNA.

`FeatureScatter()` (called `GenePlot()` in older versions of Seurat) is typically used to visualise feature-feature relationships, but it can be used for anything calculated by the object, i.e. columns in the object metadata, PC scores, etc. Below we use it to spot cells that have a high proportion of mitochondrial RNA and to plot the relationship between the number of unique molecules and the number of genes captured.

```{r}
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mito")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
```

Manual check; I already know all cells have >200 genes.

```{r}
table(pbmc@meta.data$percent.mito < 0.05 & pbmc@meta.data$nFeature_RNA<2500)
```

```{r}
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mito < 0.05)
pbmc
```

## Normalisation

The next step is to normalise the data so that cells can be compared with each other. Log normalisation is the default method in Seurat (`NormalizeData()` also supports `CLR` and `RC`): gene expression measurements for each cell are divided by the cell's total expression, multiplied by 10,000, and log-transformed. The histogram below shows the total expression per cell before normalisation.

```{r}
hist(
  colSums(pbmc[['RNA']]$counts),
  breaks = 100,
  main = "Total expression before normalisation",
  xlab = "Sum of expression"
)
```

By default, Seurat employs a global-scaling normalisation method, `LogNormalize`, that normalises the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. In Seurat v5, normalised values are stored in `pbmc[["RNA"]]$data`.

```{r}
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
```

For clarity, in this previous line of code (and in future commands), we provide the default values for certain parameters in the function call. However, this isn’t required and the same behavior can be achieved with:

```{r}
pbmc <- NormalizeData(pbmc)
```
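The arithmetic behind log normalisation can be sketched in base R on a toy matrix. The counts are invented, and this is only an illustration of the formula (divide by the cell total, multiply by the scale factor, apply `log1p`), not Seurat's own code.

```{r}
# Toy counts: cell2 has ten times the depth of cell1 but the same proportions
toy_raw <- matrix(
  c(1, 10,
    3, 30,
    6, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2"))
)

toy_scale_factor <- 10000

# Divide each column by its total, rescale, then take the natural log of (x + 1)
toy_norm <- log1p(sweep(toy_raw, 2, colSums(toy_raw), "/") * toy_scale_factor)
toy_norm

# Undoing the log shows that every cell now sums to the scale factor
colSums(expm1(toy_norm))
```

Because the two toy cells differ only in sequencing depth, their normalised values are identical, which is the point of global scaling.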

While this method of normalization is standard and widely used in scRNA-seq analysis, global-scaling relies on an assumption that each cell originally contains the same number of RNA molecules. We and others have developed alternative workflows for single-cell preprocessing that do not make this assumption. For users who are interested, please check out our `SCTransform()` normalization workflow. The method is described in [this paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02584-9), with a separate [vignette using Seurat](https://satijalab.org/seurat/articles/sctransform_vignette). The use of `SCTransform()` replaces the need to run `NormalizeData()`, `FindVariableFeatures()`, or `ScaleData()` (described below).

```{r}
hist(
  colSums(pbmc[['RNA']]$data),
  breaks = 100,
  main = "Total expression after normalisation",
  xlab = "Sum of expression"
)
```

## Identification of highly variable features (feature selection)

Once the data is normalised, the next step is to find genes that vary between single cells; genes that are constant among all cells have no distinguishing power. In the dispersion-based approach, the `FindVariableFeatures()` function calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. I interpret that as: take each gene, get its average expression and variance across the 2,638 cells, categorise genes into bins (default is 20) based on their average expression, and finally normalise the dispersion within each bin. This was the same approach used in [Macosko et al.](https://www.ncbi.nlm.nih.gov/pubmed/26000488); the `vst` method used below is a newer alternative.

The settings used below identify 2,000 variable genes. The tutorial recommends that users explore the parameters themselves since each dataset is different.
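My interpretation of the binned dispersion z-score can be sketched in base R on simulated counts. Everything here is an assumption for illustration: the data are Poisson draws, dispersion is taken as the variance-to-mean ratio, and the bins are equal-width; it is not Seurat's implementation.

```{r}
set.seed(1)

# Simulated counts: 200 genes x 50 cells at four expression levels
toy_lambda <- rep(c(0.5, 2, 8, 20), each = 50)
toy_sim <- matrix(rpois(200 * 50, lambda = toy_lambda), nrow = 200)

# Average expression and dispersion (variance-to-mean ratio) per gene
toy_gene_mean <- rowMeans(toy_sim)
toy_gene_disp <- apply(toy_sim, 1, var) / toy_gene_mean

# Bin genes by average expression (20 equal-width bins)
toy_bins <- cut(toy_gene_mean, breaks = 20)

# z-score of dispersion within each bin; bins with a single gene give NA
toy_disp_z <- ave(
  toy_gene_disp, toy_bins,
  FUN = function(d) (d - mean(d)) / sd(d)
)

# Genes with the highest within-bin z-scores would be called variable
head(sort(toy_disp_z, decreasing = TRUE))
```

Scoring within bins means a gene is compared only against genes of similar average expression, so highly expressed genes are not favoured simply because their variance is larger.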

We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

Our procedure in Seurat improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and is implemented in the `FindVariableFeatures()` function (the tutorial links to a detailed description). By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.

```{r}
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
```

```{r}
length(VariableFeatures(pbmc))
```

Identify the 10 most highly variable genes

```{r}
top10 <- head(VariableFeatures(pbmc), 10)
top10
```

Plot variable features with and without labels

```{r, fig.width=10, fig.height=6}
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
```

## Scaling the data

Next, we apply a linear transformation ("scaling") that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The `ScaleData()` function:

* Shifts the expression of each gene, so that the mean expression across cells is 0
* Scales the expression of each gene, so that the variance across cells is 1
    * This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
* The results of this are stored in `pbmc[["RNA"]]$scale.data`
* By default, only variable features are scaled
* You can specify the `features` argument to scale additional features

```{r}
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
dim(pbmc[["RNA"]]$scale.data)
```
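The shift and scale steps amount to a per-gene z-score, which can be sketched in base R. The toy values are made up, and this only illustrates the two steps listed above rather than reproducing `ScaleData()` itself.

```{r}
# Toy expression: geneB is geneA multiplied by ten (hypothetical values)
toy_expr <- matrix(
  c( 1,  2,  3,  4,
    10, 20, 30, 40,
     5,  5,  6,  8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:4))
)

# scale() works on columns, so transpose to centre and scale each gene (row)
toy_scaled <- t(scale(t(toy_expr)))
toy_scaled

# Each gene now has mean 0 and variance 1 across cells
rowMeans(toy_scaled)
apply(toy_scaled, 1, var)
```

After scaling, `geneA` and `geneB` have identical values even though their raw magnitudes differ tenfold, which is what stops highly expressed genes from dominating.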

```{r}
hist(
  colSums(pbmc[['RNA']]$scale.data),
  breaks = 100,
  main = "Total expression after scaling",
  xlab = "Sum of expression"
)
```

How can I remove unwanted sources of variation?

In Seurat, we also use the `ScaleData()` function to remove unwanted sources of variation from a single-cell dataset. For example, we could "regress out" heterogeneity associated with cell cycle stage or mitochondrial contamination:

```{r, eval=FALSE}
pbmc <- ScaleData(pbmc, features = all.genes, vars.to.regress = "percent.mito")
```
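Conceptually, regressing out a variable means fitting a model of each gene's expression against that variable and keeping the residuals. The sketch below shows the idea for a single simulated gene using `lm()`; the covariate, effect size, and noise level are invented, and this is an illustration of the concept rather than Seurat's implementation.

```{r}
set.seed(1)

# Simulated covariate (for example a mitochondrial fraction) for 100 cells
toy_n <- 100
toy_covariate <- runif(toy_n, min = 0, max = 0.05)

# Simulated expression of one gene that depends on the covariate plus noise
toy_gene <- 2 + 40 * toy_covariate + rnorm(toy_n, sd = 0.2)

# Fit a linear model and keep the residuals as the corrected expression
toy_fit <- lm(toy_gene ~ toy_covariate)
toy_resid <- residuals(toy_fit)

# Correlation with the covariate before and after regression
cor(toy_gene, toy_covariate)
cor(toy_resid, toy_covariate)
```

The residuals are uncorrelated with the covariate by construction, so any variation that remains is not linearly explained by it.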

However, particularly for advanced users who would like to use this functionality, we strongly recommend the use of our new normalization workflow, `SCTransform()`. The method is described in [this paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02584-9), with a separate [vignette using Seurat](https://satijalab.org/seurat/articles/sctransform_vignette). As with `ScaleData()`, the function `SCTransform()` also includes a `vars.to.regress` parameter.