# 5. Filtering cells and clustering <a class="anchor" id="validate"></a>


### Overview

One of the key values of scRNA-Seq is that it enables an examination of heterogeneity and complexity within seemingly homogeneous cell populations. To accomplish this, individual cells of a specific type or types need to be isolated. A variety of laboratory techniques are used to achieve this (they are beyond the scope of this report), but although many are highly accurate, none is completely accurate. Thus any scRNA-Seq sample is likely to contain a small number of non-target cells, which can confound the results. In this section we examine gene expression in each cell for markers that are specific to the target cell type.

*************************************

## Contents

[5a. Initial setup](#setup)

[5b. Choose sample you want to analyse](#sample)

[5c. Import data](#import)

[5d. Identify markers in cells](#identify)

[5e. Processing expression data (dimensionality reduction)](#dim)

[5f. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data)](#dimplot)

[5g. Remove low quality or outlier cells](#outlier)

[5h. Visualise gene expression by marker](#plot)

[5i. Remove non-target cells](#remove)

[5j. Clustering by gene expression](#befaft)

[5k. Plot marker expression before and after filtration](#mfilt)

[5l. Output filtered results](#output)

**************************

## <font color="green">5a. Initial setup</font> <a class="anchor" id="setup"></a>

<font color="green">**Each section is designed to be run independently, therefore there is some repeated setup code that needs to be run first. That code is within this subsection, indicated by green text.**</font>

<font color="green">Choose which dataset you want to work on by clicking on one of the setwd() commands below. This sets the working directory for your dataset of choice.</font>

In [None]:
setwd("~/Fazeleh/Dataset1/scDATA")

In [None]:
setwd("~/Fazeleh/Dataset2/scDATA")

<font color="green">Load the R packages required for this section. If packages are already installed they can be used simply by loading them with the `library()` function.</font>

In [None]:
library(ggplot2)
library(tidyverse)
library(viridis)

<font color="green">Install R packages required for this section. Packages not installed on the server need to be installed first, then loaded with `library()`.</font>

<font color="green">Seurat (https://satijalab.org/seurat/) is the main package we will be using in this analysis workflow. Seurat installs multiple dependencies, so you may need to wait a few minutes for installation to complete.</font>

In [None]:
install.packages("Seurat")
install.packages("patchwork")
library(Seurat)
library(patchwork)

<font color="green">Define a set of colours for plotting. Some of these plots have multiple clusters and it's difficult to find enough contrasting colours to visually separate the clusters. I've developed a set of 25 colours that I've found contrast well, that we can use in the plots for this (and other) sections.</font>

In [None]:
c25 <- c(
  "dodgerblue2", "#E31A1C", # red
  "green4",
  "#6A3D9A", # purple
  "#FF7F00", # orange
  "black", "gold1",
  "skyblue2", "#FB9A99", # lt pink
  "palegreen2",
  "#CAB2D6", # lt purple
  "#FDBF6F", # lt orange
  "gray70", "khaki2",
  "maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
  "darkturquoise", "green1", "yellow4", "yellow3",
  "darkorange4", "brown"
)

<font color="green">Set the default width and height for plots output on this Notebook. You can modify this as you prefer. Note that every plot in this Notebook is followed by code to output it as a file and this code defines width/height separately from the options below.</font>

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

***********************************

## 5b. Choose sample you want to analyse <a class="anchor" id="sample"></a>

Each sample needs to be analysed separately, so the first thing you need to do is choose the sample you want to analyse.

Each subdirectory in your working directory (which you set in the initial setup section) should be a sample name. View the subdirectories using the list.dirs() function.

In [None]:
list.dirs(full.names = F, recursive = F)

Enter one of the directory names (i.e. sample name). **NOTE: R is case sensitive. The sample name entered below must exactly match the directory name.**

In [None]:
sample <- "Choroid"

If you want to analyse a different sample, simply come back to this section, change the sample name, then re-run the following sections.

**************************

## 5c. Import data <a class="anchor" id="import"></a>

First we need to import a count table of reads per gene per cell.

Cell Ranger outputs 3 main database files, that we need to combine into a single Seurat database object. Most downstream analysis is completed on this object. These database files are the cell IDs (barcodes.tsv.gz) the gene IDs (features.tsv.gz) and the table of read counts per gene, per cell (matrix.mtx.gz). These files are found in the `/<sample_name>/outs/filtered_feature_bc_matrix` Cell Ranger output directory.

In [None]:
# Import barcodes, count matrix and genomic features files
mat <- Read10X(data.dir = sample)

If you look at the top few rows and columns you should see gene IDs as rows and barcodes (i.e. cells) as columns

In [None]:
as.matrix(mat[1:10, 1:10])

Now convert this to a Seurat object. This is required to apply the various Seurat functions to the dataset

In [None]:
mat2 <- CreateSeuratObject(counts = mat, project = sample)

You can see a summary of the data by simply running the Seurat object name.

In [None]:
mat2

*********************************

## 5d. Identify markers in cells <a class="anchor" id="identify"></a>

Now we're going to identify some markers in the matrix we created in the previous section.

Create a [vector](https://www.datamentor.io/r-programming/vector/) called 'markers' that contains each of the markers you want to examine. These should be gene symbols. Replace the gene symbols below with your target markers.

In [None]:
markers <- c("P2ry12", "Tmem119", "Itgam")

**IMPORTANT: Note that the gene symbols have to exactly match the gene symbols in your dataset (including capitalisation)**. Gene symbols are more like 'common names' and can vary between databases. Your main gene identifiers are Ensembl IDs and we need to find the gene symbols that match these Ensembl IDs. For example, P2ry12 is also called ADPG-R, BDPLT8, HORK3 and various other gene symbols, depending on the database it's listed in. In the Ensembl database it's listed as P2ry12 (not P2RY12; remember, case matters) and matches Ensembl ID ENSMUSG00000036353.

For this reason it's advisable to first search the Ensembl website for your markers of interest and for your organism, to ensure you are providing gene symbols that match the Ensembl IDs. https://asia.ensembl.org/Mus_musculus/Info/Index

Searching the above Ensembl website for P2ry12 will provide the following result, confirming the gene symbol: https://asia.ensembl.org/Mus_musculus/Gene/Summary?db=core;g=ENSMUSG00000036353;r=3:59123693-59170292

Alternatively, you can search the list of gene symbols found in your dataset, which are in the '*sample*/analysis/diffexp/graphclust/differential_expression.csv' file, under the 'Feature name' column.

Now back to the analysis..

You can see if your markers are present:

In [None]:
sum(row.names(mat) %in% markers)

If you input 3 markers and the output from the above code = `3`, then all are present. If the result is `2` then 2 of the 3 markers you provided are found in your data, etc.

If you want to see if an individual marker is present, you can run the following (replace with your marker of interest). Outputs `1` if the marker is present, `0` if it isn't:

In [None]:
sum(row.names(mat) == "P2ry12")
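
If the count is lower than expected, it helps to see exactly which symbols failed to match. The sketch below is one way to do that, using a small made-up vector, `genes_in_data`, as a stand-in for `row.names(mat)` and a hypothetical `markers_demo` vector. It lists the unmatched markers and then looks for symbols that differ only in capitalisation.

```r
# Hypothetical stand-in for row.names(mat); use row.names(mat) in the notebook
genes_in_data <- c("P2ry12", "Tmem119", "Cx3cr1", "Gapdh")
markers_demo <- c("P2ry12", "TMEM119", "Itgam")

# Markers with no exact (case-sensitive) match in the data
missing <- setdiff(markers_demo, genes_in_data)
missing

# For each missing marker, look for a symbol that differs only in capitalisation
suggestions <- lapply(missing, function(m) {
  genes_in_data[toupper(genes_in_data) == toupper(m)]
})
names(suggestions) <- missing
suggestions
```

An empty result for a marker means no similarly spelled symbol was found, in which case checking the Ensembl website as described above is the safer route.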

We can pull out just the read counts for your defined markers

In [None]:
y <- mat[row.names(mat) %in% markers, ]
as.matrix(y)

Now we can count the number of cells containing zero transcripts for each of the examined markers. This enables an examination of the number of cells that have zero expression for these markers and therefore the number of cells that can be considered non-target cells.

In [None]:
# First count all cells
a <- ncol(y)
# Then loop through the markers that were found in the data.
# Rows of 'y' follow the gene order in 'mat' (not the order of the 'markers' vector),
# so we loop over the rows of 'y' and label the results with its row names
for (i in seq_len(nrow(y))) {
  a <- c(a, sum(y[i, ] == 0))
}
# Sum the counts across all markers for each cell
y2 <- colSums(y)
# See if any zeros. If so, these cells are not target cells (as determined by absence of any target cell markers)
count <- c(a, sum(y2 == 0))
# Name the vector elements
names(count) <- c("Total_cells", row.names(y), "All_zero")
# Generate the table
as.data.frame(count)
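
The same table can be built without a loop. The following is a minimal sketch on a made-up dense matrix, `y_demo`, standing in for the marker matrix `y`; the object names are illustrative only. It relies on `rowSums()` and `colSums()` applied to a logical comparison.

```r
# Toy count matrix (genes x cells); a stand-in for the 'y' marker matrix
y_demo <- matrix(
  c(0, 2, 0, 5,
    0, 0, 1, 3,
    0, 4, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P2ry12", "Tmem119", "Itgam"), paste0("cell", 1:4))
)

# Number of cells with zero counts, per marker
zero_per_marker <- rowSums(y_demo == 0)
# Number of cells with zero counts for every marker
all_zero <- sum(colSums(y_demo) == 0)

count_demo <- c(Total_cells = ncol(y_demo), zero_per_marker, All_zero = all_zero)
as.data.frame(count_demo)
```

Because the per-marker counts keep the matrix row names, each count is always labelled with the gene it belongs to.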

The above table shows the total number of cells for your sample, the number of cells which had 0 expression for **each** marker, and the number of cells that had zero expression for **all** of the markers you provided.

**********************

## 5e. Processing expression data (dimensionality reduction) <a class="anchor" id="dim"></a>

There are a variety of methods to visualise expression in single cell data. The most commonly used methods (PCA, t-SNE and UMAP) involve 'dimensionality reduction', i.e. converting the expression of thousands of genes per cell into a small number of dimensions, the first two of which can then be plotted.

Seurat can generate and store PCA, t-SNE and UMAP data in the Seurat object we created in section 5c (which we called `mat2`), but first the raw data needs to be processed in a variety of ways:

1. Normalise the data by log transformation
2. Identify genes that exhibit high cell-to-cell variation
3. Scale the data so that highly expressed genes don't dominate the visual representation of expression
4. Perform the linear dimensional reduction that converts expression to dimensions
5. Plot the x-y dimension data (i.e. first 2 dimensions)


The first 4 steps are completed in the code cell below (this may take a few minutes to run)

In [None]:
# Normalise data
mat3 <- NormalizeData(mat2)
# Identification of variable features
mat3 <- FindVariableFeatures(mat3, selection.method = "vst", nfeatures = nrow(mat3))
# Scaling the data
all.genes <- rownames(mat3)
mat3 <- ScaleData(mat3, features = all.genes)
# Perform linear dimensional reduction (PCA)
mat3 <- RunPCA(mat3, features = VariableFeatures(object = mat3))
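
To make the normalisation and scaling steps more concrete, here is a base R sketch on a toy matrix. It assumes the defaults of `NormalizeData()` (the `LogNormalize` method with a scale factor of 10,000), which divide each count by the cell's total, multiply by the scale factor and apply `log1p`. It only illustrates the arithmetic and is not a replacement for the Seurat functions; `counts_demo` and the other object names are made up.

```r
# Toy raw counts (genes x cells)
counts_demo <- matrix(
  c(10,  0,  5,
     0, 20,  5,
    90, 80, 90),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2", "cell3"))
)

# Log-normalise: counts / total counts per cell * 10000, then log(1 + x)
scale_factor <- 10000
norm_demo <- log1p(sweep(counts_demo, 2, colSums(counts_demo), "/") * scale_factor)

# Scale each gene (row) to mean 0 and standard deviation 1 across cells
scaled_demo <- t(scale(t(norm_demo)))

round(norm_demo, 2)
round(rowMeans(scaled_demo), 10)
```

After scaling, every gene has a mean of 0 across cells, which is why a highly expressed gene such as `geneC` no longer dominates simply because its counts are large.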

### Plot of highly variable genes

Using the `FindVariableFeatures` results, we can visualise the most highly variable genes, including a count of variable and non-variable genes in your dataset. The code below labels the top 10 genes, but you can adjust this number as desired (i.e. in `top_genes <- head(VariableFeatures(mat3), 10)` change `10` to another number).

**NOTE** In the below plot you can change a number of parameters to modify the plot to look how you like. This can be done for any of the plots in these notebooks. In the plot below you can change:

Dot size: `pt.size = 2`. Increase or decrease the number to increase or decrease dot size.

Dot colours: `cols = c("black", "firebrick")`. Change the colours to whatever you like. A list of R colour names is here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Theme: `theme_bw()`. There are several default plot themes you can choose from, that change a variety of plot parameters. See here: https://ggplot2.tidyverse.org/reference/ggtheme.html

Axis text size: `theme(text = element_text(size = 17))`. There are a large number of parameters that can be modified with `theme()`. Here we've just changed the axis text to size 17. See here for other parameters that can be changed with `theme()`: https://ggplot2.tidyverse.org/reference/theme.html

Seurat plots are based on the ggplot2 package. There are a multitude of other modifications you can make to a ggplot, too many to describe in this notebook, but there are plenty of online guides on how to modify ggplot plots. Here's an example: http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization

In [None]:
# Identify the 10 most highly variable genes
top_genes <- head(VariableFeatures(mat3), 10)
# plot variable features with labels
p <- VariableFeaturePlot(mat3, pt.size = 2, cols = c("black", "firebrick"))
p <- LabelPoints(plot = p, points = top_genes, repel = TRUE) +
theme_bw() +
theme(text = element_text(size = 17))
p

You can save your plot as a 300dpi (i.e. publication quality) tiff or pdf file. **These files can be found in your working directory.**

**Tip:** you can adjust the width and height of the saved images by changing `width =` and `height =` in the code below. Pdf files can be opened within Jupyter, so a good way to test a suitable width/height would be to save the image by running the pdf code below with the default 20cm width/height, then open the pdf file by clicking on it in the file browser panel (to the left of this notebook), then change the width/height and repeat this process as needed.

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_top_genes.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_top_genes.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
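
Since the same two export cells follow every plot in this notebook, you may prefer to wrap them in a small helper. The function below, `save_plot()`, is a hypothetical convenience wrapper around `ggsave()` and not part of Seurat or ggplot2. It is a sketch that assumes the same 20cm defaults used above.

```r
library(ggplot2)

# Hypothetical helper: save one plot in one or more formats (TIFF at 300dpi, PDF)
save_plot <- function(plot, stem, width = 20, height = 20, formats = c("tiff", "pdf")) {
  files <- character(0)
  for (fmt in formats) {
    out <- paste0(stem, ".", fmt)
    if (fmt == "tiff") {
      ggsave(filename = out, plot = plot, device = "tiff", dpi = 300,
             compression = "lzw", width = width, height = height, units = "cm")
    } else {
      ggsave(filename = out, plot = plot, device = fmt,
             width = width, height = height, units = "cm")
    }
    files <- c(files, out)
  }
  # Return the file names invisibly so they can be checked if needed
  invisible(files)
}

# Example with a throwaway plot, written to a temporary directory
demo_plot <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
saved <- save_plot(demo_plot, file.path(tempdir(), "demo_top_genes"), formats = "pdf")
saved
```

In this notebook the equivalent call would be something like `save_plot(p, paste0(sample, "_top_genes"))`, which writes both files to your working directory.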

*************************************

## 5f. PCA, UMAP and t-SNE plots (plotting dimensionality reduction data) <a class="anchor" id="dimplot"></a>

In the above section we ran dimensionality reduction based on [Principal Component Analysis (PCA)](https://builtin.com/data-science/step-step-explanation-principal-component-analysis). 
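
As a rough illustration of what PCA produces, the sketch below runs base R's `prcomp()` on random toy data (cells as rows, genes as columns). Seurat's `RunPCA()` uses its own implementation on the scaled variable genes, so this is only meant to show the idea of per-cell coordinates and the variance explained by each component. All object names here are made up.

```r
set.seed(1)
# Toy expression matrix: 50 cells (rows) x 20 genes (columns)
expr_demo <- matrix(rnorm(50 * 20), nrow = 50,
                    dimnames = list(paste0("cell", 1:50), paste0("gene", 1:20)))
# Give the first 25 cells higher expression of five genes so there is structure to find
expr_demo[1:25, 1:5] <- expr_demo[1:25, 1:5] + 3

# Centre and scale each gene, then compute the principal components
pca_demo <- prcomp(expr_demo, center = TRUE, scale. = TRUE)

# Coordinates of the first few cells on the first two principal components
head(pca_demo$x[, 1:2])

# Proportion of variance explained by each of the first five components
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
round(var_explained[1:5], 3)
```

The first two columns of `pca_demo$x` are the kind of x-y coordinates that a PCA plot displays, one point per cell.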

Technically, results from only one dimensionality reduction method are needed for downstream analysis, but we will also perform [Uniform Manifold Approximation and Projection (UMAP)](https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668) and [t-distributed stochastic neighbor embedding (t-SNE)](https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1) dimensionality reduction, so as to visualise expression using 3 different methods.

First we need to run UMAP and tSNE dimensionality reduction and add these results to our main Seurat object

In [None]:
mat3 <- RunUMAP(mat3, dims = 1:3, verbose = F)
mat3 <- RunTSNE(mat3, dims = 1:3, verbose = F)

Now we can visualise your expression data using all 3 dimensionality reduction methods

### Generate the PCA plot

In [None]:
p <- DimPlot(mat3, reduction = "pca", pt.size = 2, label = TRUE, label.size = 6, label.color = "white", cols = c("firebrick")) + 
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

**Note** As discussed in the previous section, you can change various plot attributes. In these plots, the point size (`pt.size =`), label (`label =` make it `TRUE` if you want a label, `FALSE` if you don't), label colour (`label.color =`), point colour (`cols =`) and various theme attributes such as axis text size.

Then you can export the plot as a publication quality TIFF or PDF file.

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_PCA_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_PCA_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Generate the UMAP plot

In [None]:
p <- DimPlot(mat3, reduction = "umap", pt.size = 2, label = TRUE, label.size = 8, label.color = "black", cols = c("firebrick")) + 
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_umap_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_umap_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Generate the tSNE plot

In [None]:
p <- DimPlot(mat3, reduction = "tsne", pt.size = 2, label = TRUE, label.size = 8, label.color = "black", cols = c("firebrick")) + 
theme_bw() +
theme(legend.position="none", axis.title=element_text(size=16), axis.text=element_text(size=14))
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_tsne_pre_filtration.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_tsne_pre_filtration.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

*********************************************

## 5g. Remove low quality or outlier cells <a class="anchor" id="outlier"></a>

From the Seurat website:

>Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include

>* Low-quality cells or empty droplets will often have very few genes
>* Cell doublets or multiplets may exhibit an aberrantly high gene count

So in this section we can filter out cells that have very low and very high gene counts.

We can first visualise the spread of genes and reads using a violin plot.

In [None]:
VlnPlot(mat2, features = c("nFeature_RNA", "nCount_RNA"), ncol = 3)

Or as a scatter plot

In [None]:
FeatureScatter(mat2, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

You can then choose to filter out the top and bottom 'outliers' based on the above violin and scatter plots, by entering a maximum and minimum `nFeature_RNA` below. This max/min number can vary greatly depending on the sequencing depth of your samples and other factors. Use the above violin plots to guide your decision.

In [None]:
mat3_filt <- subset(mat3, subset = nFeature_RNA > 200 & nFeature_RNA < 4000)
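
If you would like a data-driven starting point for these cut-offs, one common heuristic (an assumption here, not something Seurat prescribes) is to flag cells more than three median absolute deviations from the median. The sketch uses a made-up vector, `n_genes`; in the notebook the equivalent values are in `mat2$nFeature_RNA`.

```r
# Made-up number of detected genes per cell; stands in for mat2$nFeature_RNA
n_genes <- c(150, 1800, 2100, 1950, 2300, 2050, 1700, 2200, 6500, 1900)

med <- median(n_genes)
dev <- mad(n_genes)   # median absolute deviation (with R's default scaling constant)

# Candidate thresholds: 3 deviations either side of the median
lower <- med - 3 * dev
upper <- med + 3 * dev
c(lower = lower, upper = upper)

# Cells that would be kept with these thresholds
keep <- n_genes > lower & n_genes < upper
table(keep)
```

Treat the resulting numbers as a suggestion to compare against the violin and scatter plots, rather than as fixed rules.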

****************************

## 5h. Visualise gene expression by marker <a class="anchor" id="plot"></a>

In the previous section (plotting dimensionality reduction data) we visualised total expression, i.e. all genes within each cell.

In this section we will visualise expression for specific markers within each cell, using the same dimensionality reduction data (PCA, UMAP and t-SNE) that we generated in the previous section.

This has a variety of uses: to identify patterns of differential expression between cells for specific markers, identify 'non-target' cells - i.e. expression of markers that are known to be not expressed in target cells, marker-based heterogeneity of expression, etc.

In section 5d you selected a set of markers. To confirm which markers they were:

In [None]:
markers

If you wish to plot a different set of markers, you can do so by changing the set of markers in the `markers <- c(..)` code and re-running that code cell. You can choose 1 marker, or as many as you like. Be aware though that there will be a plot generated for every marker provided.

Now generate the plots. You can change the colours in the plots (`cols = c("red", "lightgrey")`) and the point size (`pt.size = 1`). Note that the default colours show the *lowest* expression in red. This is so you can more easily see which cells don't express the diagnostic markers.

### PCA plot

In [None]:
p <- FeaturePlot(mat3_filt, features = markers, reduction = "pca", cols = c("red", "lightgrey"), pt.size = 1)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_pca_markers.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_pca_markers.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### UMAP

In [None]:
p <- FeaturePlot(mat3, features = markers, reduction = "umap", cols = c("red", "lightgrey"), pt.size = 1)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_umap_markers.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_umap_markers.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### t-SNE

In [None]:
p <- FeaturePlot(mat3, features = markers, reduction = "tsne", cols = c("red", "lightgrey"), pt.size = 1)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_tsne_markers.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_tsne_markers.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Heatmap

In addition to the dimensionality reduction plots, you can visualise expression for your selected markers with a heat map (note: this requires at least 2 markers, preferably several, to be visually meaningful). You can change the colour range as you like by providing a high, centre and low colour (`scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4"))`)

In [None]:
p <- DoHeatmap(mat3, features = markers, group.bar = FALSE) + 
scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + 
theme(text = element_text(size = 16))
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_hmap_markers.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_hmap_markers.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Clustered heatmap

You can also cluster the cells and then plot a heatmap based on this cluster data, to see if expression of your chosen markers is related to clusters

First, generate the clusters for your sample

**Important: this will generate the clusters for the remainder of your analysis. Choose the `resolution =` score carefully. A lower score means fewer clusters. You can adjust this higher or lower to see how it affects your clusters.**

In [None]:
mat3 <- FindNeighbors(mat3, dims = 1:10)
mat3 <- FindClusters(mat3, resolution = 0.5)

Then generate the heatmap

In [None]:
p <- DoHeatmap(mat3, features = markers, raster = T) + 
scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) + 
theme(text = element_text(size = 16)) + labs(color = "Cluster")
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_hmap_markers_clustered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_hmap_markers_clustered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

******************************

## 5i. Remove non-target cells <a class="anchor" id="remove"></a>

In this section we will remove any 'non-target' cells from our dataset. 'Non-target' cells are defined as those with 0 reads (i.e. 0 expression) for our marker(s) of choice.

**If you don't want to remove any cells based on expression of specific markers, skip this section**

Once again, we can view our chosen markers

In [None]:
zm <- mat[row.names(mat) %in% markers, ]
as.matrix(zm)

Then we can again see a count of cells that had 0 reads (thus 0 expression) for each marker. **The 'All markers' row indicates the number of cells that have zero expression in <u>any</u> of the provided markers. <u>These are the cells that will be filtered out from your dataset</u>**

In [None]:
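a <- ncol(zm)
# Rows of 'zm' follow the gene order in 'mat', so loop over and label by its row names
for (i in seq_len(nrow(zm))) {
  
  a <- c(a, sum(zm[i,] == 0))
  
}
# Total cells minus the cells that have reads for every marker
a <- c(a, ncol(zm) - sum(apply(as.matrix(zm) == 0, 2, sum) == 0))
names(a) <- c("Total cells", row.names(zm), "All markers")
as.data.frame(a)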

Based on these results, we can choose which markers we wish to use for removing '0 expression' cells. We may decide to keep all markers, keep only some of the markers, or use different markers (in which case we should re-run the 'Identify markers in cells' section, with a different set of markers). Once you have decided on your set of diagnostic markers, enter them in the `marker_rem` object below.

In [None]:
marker_rem <- c("P2ry12", "Tmem119")

Remove cells from main Seurat object that have zero expression for **any** of these markers.

In [None]:
zm <- mat[row.names(mat) %in% marker_rem, , drop = FALSE]
# For each cell (column), count how many markers have zero reads, then keep only the cells where this count is 0 (i.e. cells with at least 1 read for every marker).
zm_1 <- as.matrix(zm)[, apply(as.matrix(zm) == 0, 2, sum) == 0, drop = FALSE]
# Then we can filter the Seurat object to contain just these cell (i.e. barcode) IDs.
# intersect() skips any barcodes that were already removed in section 5g
mat3_filt <- subset(mat3_filt, cells = intersect(colnames(zm_1), colnames(mat3_filt)))

In [None]:
mat3_filt
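
To see the filtering rule on a small scale, here is a sketch using a made-up matrix, `zm_demo`, of two markers across five cells. It shows the rule used above (keep cells that express **every** marker) and, for comparison, a more lenient rule that keeps cells expressing at least one marker. The lenient rule is shown only as an illustration and is not used in this notebook.

```r
# Toy counts for two diagnostic markers across five cells
zm_demo <- matrix(
  c(3, 0, 1, 0, 2,
    1, 4, 0, 0, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("P2ry12", "Tmem119"), paste0("cell", 1:5))
)

# Number of markers with zero counts in each cell
n_zero <- colSums(zm_demo == 0)

# Rule used above: keep only cells with no zero-count markers
keep_cells <- colnames(zm_demo)[n_zero == 0]
keep_cells

# More lenient alternative: keep cells expressing at least one marker
keep_any <- colnames(zm_demo)[n_zero < nrow(zm_demo)]
keep_any
```

Comparing the two results gives a feel for how strongly the choice of rule, and the number of markers in `marker_rem`, affects how many cells survive filtering.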

***********************************

## 5j. Clustering by gene expression <a class="anchor" id="befaft"></a>

This section examines clustering by gene expression similarity for each sample. PCA, t-SNE and UMAP plots are used to visualise the gene expression patterns and clusters.

This clustering visualisation section is included in this 'Filtering cells using markers' Notebook, because we will also examine here how filtration has affected clustering.

We can generate some 'before and after' plots, to visualise how removing the non-target cells changed the data structure. You can use this section to examine if your cell filtration had a meaningful effect on your data structure. If it didn't, you may want to choose a different set of markers or filtering parameters to filter with.

### Choosing the correct resolution

A cluster represents a unique group of cells, based on gene expression patterns. But what constitutes 'unique'? When you calculate the clustering (using Seurat's `FindClusters()` function), it's important to use a suitable `resolution` score to generate accurate, biologically meaningful clusters. A lower `resolution` score will generate fewer clusters (but you risk combining two clusters that should be distinct), while a higher score will generate more clusters (but you risk falsely splitting a biologically relevant cluster of cells). Every single cell dataset is different (cell population similarity, sequencing depth, etc.) and as such the optimal `resolution` score needs to be chosen for each dataset.

The package [clustree](https://cran.r-project.org/web/packages/clustree/vignettes/clustree.html) generates a tree based on multiple `resolution` scores, which can help you in picking the optimal score.

Read the clustree manual to understand how to interpret the generated tree: https://cran.r-project.org/web/packages/clustree/vignettes/clustree.html

Install and load the clustree package:

In [None]:
install.packages("clustree")
library(clustree)

Then generate a range of clusters, from 0 to 1, at 0.1 increments (`resolution = seq(0, 1, 0.1)`).

In [None]:
mat3_clust <- FindNeighbors(mat3_filt, dims = 1:10)
mat3_clust <- FindClusters(mat3_clust, resolution = seq(0, 1, 0.1), verbose = F)
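
`FindClusters()` stores one metadata column per resolution, named with the prefix `RNA_snn_res.` (the same prefix that is passed to `clustree()` in this section). A quick numeric companion to the tree is to count the clusters at each resolution. The sketch below uses a made-up data frame, `md_demo`, in place of the real metadata, which you would get with `mat3_clust[[]]`.

```r
# Made-up metadata; stands in for mat3_clust[[]]
md_demo <- data.frame(
  nFeature_RNA    = c(900, 1500, 2100, 1800, 1200, 2500),
  RNA_snn_res.0   = factor(c(0, 0, 0, 0, 0, 0)),
  RNA_snn_res.0.5 = factor(c(0, 0, 1, 1, 0, 1)),
  RNA_snn_res.1   = factor(c(0, 2, 1, 1, 0, 1)),
  row.names = paste0("cell", 1:6)
)

# Pick out the clustering columns and count the distinct clusters in each
res_cols <- grep("^RNA_snn_res\\.", names(md_demo), value = TRUE)
n_clusters <- sapply(md_demo[res_cols], function(x) length(unique(x)))
n_clusters
```

A range of resolutions over which the cluster count stays the same is often a reasonable place to start looking in the tree.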

Convert the results into a Seurat object, which can be used as input into `clustree()`

In [None]:
clus_seurat <- CreateSeuratObject(counts = mat3_clust@assays$RNA@counts, meta.data = mat3_clust[[]])
clus_seurat[['TSNE']] <- CreateDimReducObject(embeddings = Embeddings(object = mat3_clust, reduction = "pca"), key = "tSNE_")

Then generate the tree. Refer to the [clustree manual](https://cran.r-project.org/web/packages/clustree/vignettes/clustree.html) for tips on how to use this tree to choose the optimal `resolution`

In [None]:
clustree(clus_seurat, prefix = "RNA_snn_res.") + scale_color_manual(values=c25) + scale_edge_color_continuous(low = "blue", high = "red")

### Calculate the clusters

First you need to re-run the variable feature calculation, scaling, dimensionality reduction (PCA, t-SNE and UMAP). This may take several minutes to run.

In [None]:
mat3_filt <- FindVariableFeatures(mat3_filt, selection.method = "vst", nfeatures = nrow(mat3_filt))
all.genes <- rownames(mat3_filt)
mat3_filt <- ScaleData(mat3_filt, features = all.genes)
mat3_filt <- RunPCA(mat3_filt, features = VariableFeatures(object = mat3_filt), verbose = F)
mat3_filt <- RunUMAP(mat3_filt, dims = 1:3, verbose = F)
mat3_filt <- RunTSNE(mat3_filt, dims = 1:3, verbose = F)

Then you generate a 'nearest neighbor' graph by calculating the neighborhood overlap (Jaccard index) between every cell and identify clusters of cells based on shared nearest neighbor (SNN).

**Remember to choose the `resolution =` score in `FindClusters()` based on the above 'Choosing the correct resolution' section.**

In [None]:
mat3_filt <- FindNeighbors(mat3_filt, dims = 1:10)
mat3_filt <- FindClusters(mat3_filt, resolution = 0.6)
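
The neighbourhood overlap (Jaccard index) mentioned above is simply the number of shared neighbours divided by the number of neighbours in total. The toy sketch below uses made-up neighbour sets and a hypothetical `jaccard()` helper to show the calculation. It does not reproduce Seurat's internal implementation.

```r
# Made-up nearest-neighbour sets for three cells (each vector lists neighbouring cells)
neighbours <- list(
  cellA = c("c1", "c2", "c3", "c4"),
  cellB = c("c2", "c3", "c4", "c5"),
  cellC = c("c7", "c8", "c9", "c1")
)

# Jaccard index: size of the intersection divided by size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jaccard(neighbours$cellA, neighbours$cellB)  # 3 shared out of 5 in total
jaccard(neighbours$cellA, neighbours$cellC)  # 1 shared out of 7 in total
```

Cells A and B share most of their neighbours and so would be strongly connected in the graph, whereas A and C would be only weakly connected.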

You can see how many cells there are per cluster like so:

In [None]:
cellcount <- as.data.frame(table(Idents(mat3_filt)))
names(cellcount) <- c("Cluster", "Cell_count")
cellcount

Now we can plot the results.

### Clusters - PCA plot

**Before filtration:**

In [None]:
p <- DimPlot(mat3, reduction = "pca", pt.size = 2, cols = c25) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster")
p

**After filtration:**

In [None]:
p <- DimPlot(mat3_filt, reduction = "pca", pt.size = 2, cols = c25) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster")
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_pca_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_pca_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**Plot one individual cluster**

Here you can visualise a single cluster by colouring it red (or a colour of your choice) and colouring all the other clusters grey.

First define the colours, based on the cluster information in your seurat data. You can change the background colours (`"gray70"`) and the highlighted cluster colour (`"red"`) to whatever you like. See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for the list of R colours. 

In [None]:
newcols <- rep("gray70", length(levels(mat3_filt))-1)
newcols <- c(newcols, "red")

Now we need to order your clusters, so that your target cluster is plotted on top of all other clusters. Enter which cluster you want to visualise in the `myclust` variable below. This is based on the cluster numbers in the previous plot. E.g. if you want to plot cluster 5, change `myclust <- 3` to `myclust <- 5`

In [None]:
myclust <- 3

Now we can plot the cluster, placing this cluster on top (the `order` parameter).

In [None]:
p <- DimPlot(mat3_filt, reduction = "pca", pt.size = 2, cols = newcols, order = myclust) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster")
p

Now we can export this plot as a pdf and tiff.

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", myclust, "_pca_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", myclust, "_pca_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Clusters - UMAP

**Before filtration:**

In [None]:
p <- DimPlot(mat3, reduction = "umap", pt.size = 2, cols = c25) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=16), axis.text=element_text(size=14)) + labs(color="Cluster")
p

**After filtration:**

In [None]:
p <- DimPlot(mat3_filt, reduction = "umap", pt.size = 2, cols = c25) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster")
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_umap_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_umap_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**Plot individual clusters**

Select your colours.

In [None]:
newcols <- rep("gray70", length(levels(mat3_filt))-1)
newcols <- c(newcols, "red")

Select your cluster and then plot it.

In [None]:
myclust <- 3

In [None]:
p <- DimPlot(mat3_filt, reduction = "umap", pt.size = 2, cols = newcols, order = myclust) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster")
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", myclust, "_umap_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", myclust, "_umap_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Clusters - t-SNE

**Before filtration:**

In [None]:
p <- DimPlot(mat3, reduction = "tsne", pt.size = 2, cols = c25) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster")
p

**After filtration:**

In [None]:
p <- DimPlot(mat3_filt, reduction = "tsne", pt.size = 2, cols = c25) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster")
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_tsne_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_tsne_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**Plot individual clusters**

Select your colours.

In [None]:
newcols <- rep("gray70", length(levels(mat3_filt))-1)
newcols <- c(newcols, "red")

Select your cluster and then plot it.

In [None]:
myclust <- 3

In [None]:
p <- DimPlot(mat3_filt, reduction = "tsne", pt.size = 2, cols = newcols, order = myclust) + 
theme_bw() +
theme(legend.title=element_text(size=14), legend.text = element_text(size = 14), axis.title=element_text(size=18), axis.text=element_text(size=14)) + labs(color="Cluster")
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", myclust, "_tsne_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", myclust, "_tsne_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### Clusters - Heatmap of selected markers

Recall which markers you previously defined:

In [None]:
markers

**Before filtration:**

In [None]:
p <- DoHeatmap(mat3, features = markers, raster = TRUE) +
scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) +
theme(text = element_text(size = 16)) + labs(color = "Cluster")
p

**After filtration:**

In [None]:
p <- DoHeatmap(mat3_filt, features = markers, raster = TRUE) +
scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) +
theme(text = element_text(size = 16)) + labs(color = "Cluster")
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_hmap_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_hmap_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**********************************

## 5k. Plot marker expression before and after filtration <a class="anchor" id="mfilt"></a>

Finally, we can compare the expression of specific markers before and after filtration.

Again, select your set of markers (you could re-enter the same ones you used earlier, or use a different set):

In [None]:
markers <- c("P2ry12", "Tmem119")
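
Gene symbols are case-sensitive, so a marker typed as `TMEM119` rather than `Tmem119` would not match. The sketch below shows one way to check a marker list against the genes in a dataset. `gene_names` is a made-up vector standing in for the gene names of the real object, and `markers_demo` is a hypothetical marker list.

```r
# Made-up gene list standing in for the genes present in the dataset.
gene_names <- c("P2ry12", "Tmem119", "Cx3cr1", "Gfap", "Actb")

# The second entry is deliberately mis-capitalised to show a failed match.
markers_demo <- c("P2ry12", "TMEM119")

found     <- intersect(markers_demo, gene_names)
not_found <- setdiff(markers_demo, gene_names)

if (length(not_found) > 0) {
  message("Markers not found (check spelling and capitalisation): ",
          paste(not_found, collapse = ", "))
}
found
```

For a Seurat object, the gene names would usually be available from `rownames()` on the object.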

Select which type of plot you want to generate ("pca", "umap" or "tsne"):

In [None]:
redplot <- "pca"
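
Since `redplot` is reused in the plots and file names of this subsection, a quick check that it holds one of the three supported values can catch typos early. A minimal sketch using base R's `match.arg()`, with `redplot_demo` as a hypothetical stand-in variable:

```r
allowed <- c("pca", "umap", "tsne")
redplot_demo <- "umap"

# match.arg() returns the matching choice, or stops with an error listing
# the allowed values if there is no match.
redplot_checked <- match.arg(redplot_demo, allowed)
redplot_checked

# A misspelt value is rejected; tryCatch() captures the error for display.
result <- tryCatch(match.arg("t-sne", allowed),
                   error = function(e) "rejected")
result
```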

### Before filtration

In [None]:
p <- FeaturePlot(mat3, features = markers, reduction = redplot, cols = c("lightgrey", "red"), pt.size = 1)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_", redplot, "_markers.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_", redplot, "_markers.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

### After filtration

In [None]:
p <- FeaturePlot(mat3_filt, features = markers, reduction = redplot, cols = c("lightgrey", "red"), pt.size = 1)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_", redplot, "_markers_filtered.tiff")
ggsave(file = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_", redplot, "_markers_filtered.pdf")
ggsave(file = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

*********************************

## 5l. Output filtered results <a class="anchor" id="output"></a>

In this last section we export the filtered dataset (non-target cells removed, other filtration applied) for use in the next sections of this workflow. **You need to run this entire '5. Filtering cells and clustering' notebook once for every sample you have, because the other sections rely on the output created here for each sample.**

Output the filtered Seurat object as a file, to be imported in the other sections of this workflow (this is a large amount of data, so it may take a few minutes):

In [None]:
saveRDS(mat3_filt, file = paste0(sample, "_seurat_filtered.rds"))
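
`saveRDS()` writes a single R object to disk and `readRDS()` restores it, which is how later sections can pick up the filtered data. The round trip can be sketched with a small placeholder object and a temporary file. Both are assumptions for illustration; the real object is `mat3_filt`.

```r
# Small placeholder object; the notebook saves a Seurat object instead.
demo_obj <- list(sample = "sample1", n_cells = 250L)

# Write to a temporary file so nothing in the working directory is touched.
rds_path <- tempfile(fileext = ".rds")
saveRDS(demo_obj, file = rds_path)

# Read the object back and confirm it matches what was saved.
restored <- readRDS(rds_path)
identical(restored, demo_obj)

# Clean up the temporary file.
unlink(rds_path)
```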

[Click here to go to the next section: Aggregate clustering](./scRNASeq_6_aggregate_cluster.ipynb)