# 7. Differential expression <a class="anchor" id="de"></a>

### Overview

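This section of the report contains plots and tables of significantly differentially expressed (DE) genes for each cluster, for a selected sample. Significantly DE genes are those with an adjusted p value of < 0.05. Note that the adjusted p value reported by Seurat (`p_val_adj`) is a Bonferroni correction across all genes in the dataset, rather than a false discovery rate.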


**IMPORTANT: To run this section, you must have processed your sample(s) by completing the '5. Filtering cells and clustering' section. That section removes outliers and non-target cells and identifies clusters. This differential expression analysis uses these filtered cells and cluster information. Section 5 only has to be run once for each sample, as it outputs a datafile for each sample that is imported into this section.**

*************************************

## Contents

[7a. Initial setup](#setup)

[7b. Choose sample and import data](#sample2)

[7c. Explanation of differential expression strategies](#exp)

[7d. Differential expression: one cluster vs all other cells](#de1)

[7e. Differential expression: one cluster vs another cluster](#de2)

[7f. Differential expression: every cluster vs all other cells](#de3)

**************************

## <font color="green">7a. Initial setup</font> <a class="anchor" id="setup"></a>

<font color="green">**Each section is designed to be run independently, therefore there is some repeated setup code that needs to be run first. That code is within this subsection, indicated by green text.**</font>

<font color="green">Choose which dataset you want to work on by running one of the `setwd()` commands below (click on the cell, then run it). This sets the working directory for your dataset of choice.</font>

In [None]:
setwd("~/Fazeleh/Dataset1/scDATA")

In [None]:
setwd("~/Fazeleh/Dataset2/scDATA")

<font color="green">Load the R packages required for this section. If packages are already installed they can be used simply by loading them with the `library()` function.</font>

In [None]:
library(ggplot2)
library(tidyverse)
library(viridis)

<font color="green">Install R packages required for this section. Packages not installed on the server need to be installed first, then loaded with `library()`.</font>

<font color="green">Seurat (https://satijalab.org/seurat/) is the main package we will be using in this analysis workflow. Seurat installs multiple dependencies, so you may need to wait a few minutes for installation to complete.</font>

In [None]:
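# Install Seurat and patchwork only if they are not already available
if (!requireNamespace("Seurat", quietly = TRUE)) install.packages("Seurat")
if (!requireNamespace("patchwork", quietly = TRUE)) install.packages("patchwork")
library(Seurat)
library(patchwork)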

<font color="green">Define a set of colours for plotting. Some of these plots have multiple clusters and it's difficult to find enough contrasting colours to visually separate the clusters. I've developed a set of 25 colours that I've found contrast well, that we can use in the plots for this (and other) sections.</font>

In [None]:
c25 <- c(
  "dodgerblue2", "#E31A1C", # red
  "green4",
  "#6A3D9A", # purple
  "#FF7F00", # orange
  "black", "gold1",
  "skyblue2", "#FB9A99", # lt pink
  "palegreen2",
  "#CAB2D6", # lt purple
  "#FDBF6F", # lt orange
  "gray70", "khaki2",
  "maroon", "orchid1", "deeppink1", "blue1", "steelblue4",
  "darkturquoise", "green1", "yellow4", "yellow3",
  "darkorange4", "brown"
)

<font color="green">Set the default width and height for plots output on this Notebook. You can modify this as you prefer. Note that every plot in this Notebook is followed by code to output it as a file and this code defines width/height separately from the options below.</font>

In [None]:
options(repr.plot.width=12, repr.plot.height=8)

***********************************

## 7b. Choose sample and import data <a class="anchor" id="sample2"></a>

**IMPORTANT: As mentioned in the Overview section, you need to have run section 5 at least once for your chosen sample. Section 5 outputs a data file of your clustered, quality-filtered cells, which will now be imported below.**

Running your sample through section 5 created a datafile called 'sample_name_seurat_filtered.rds', so if your sample was called 'liver', the file would be 'liver_seurat_filtered.rds'.

These datafiles will be in your working directories. You can see which samples you've run through section 5 (and thus have generated the required output files) by running the `dir()` command below:

In [None]:
dir(pattern = "_seurat_filtered\\.rds$")

Choose the sample name you wish to work with from the list above (just the name, without the '_seurat_filtered.rds'):

In [None]:
sample <- "Cerebellum"

If you want to analyse a different sample, simply come back to this section, change the sample name, then re-run the following sections.

Now import the data file for that sample:

In [None]:
data <- readRDS(paste0(sample, "_seurat_filtered.rds"))
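If `readRDS()` fails with a "cannot open file" error, the usual cause is a mistyped sample name or the wrong working directory. The following is a minimal sketch of checking for the file before importing it. It runs in a temporary directory with a toy object so that it does not touch real data; `sample_demo`, `rds_file`, `demo_data` and `available_samples` are names introduced only for this illustration.

```r
# Toy demonstration in a temporary directory (not your real data)
tmp <- tempfile("de_demo_")
dir.create(tmp)
sample_demo <- "liver"   # hypothetical sample name
rds_file <- file.path(tmp, paste0(sample_demo, "_seurat_filtered.rds"))
saveRDS(list(note = "stand-in for a Seurat object"), rds_file)

# Check that the file exists before importing, and stop with a clear message if not
if (!file.exists(rds_file)) {
  stop("Cannot find ", rds_file, ": has section 5 been run for this sample?")
}
demo_data <- readRDS(rds_file)

# Recover the available sample names from the file names
files <- dir(tmp, pattern = "_seurat_filtered\\.rds$")
available_samples <- sub("_seurat_filtered\\.rds$", "", files)
available_samples
```

In the notebook itself the same check would use `paste0(sample, "_seurat_filtered.rds")` in the working directory.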

See a summary of your data object:

In [None]:
data

**********************************

## 7c. Explanation of differential expression strategies <a class="anchor" id="exp"></a>

Bulk RNA-Seq usually examines differentially expressed genes between two treatments or tissues. scRNA-Seq is somewhat different, in that it is based on heterogeneity of gene expression within a group of cells, so typically differential expression is examined between clusters, rather than treatments.

Given that there are usually multiple clusters, and differential expression analysis requires a comparison between two groups, there are different strategies for applying a two-group comparison to multiple clusters. This workflow includes three strategies:

1) Comparing a single cluster to all other clusters/cells (7d). This enables one to examine how gene expression in a single cluster differs from the entire dataset.

2) Comparing one cluster to another cluster (7e). This requires choosing two clusters, then examining how gene expression differs between them. Note that covering every possible pair of clusters may require running this section many times. E.g. if your sample has 8 clusters, there are 7+6+5+4+3+2+1 = 28 pairs, so you would have to run that section 28 times to compare every cluster to every other cluster.

3) Doing a batch comparison of each cluster to all other clusters/cells (7f). Seurat has a function (`FindAllMarkers()`) to accomplish this, but this can take some time to run, as multiple clusters are being compared at once, and it also produces a multilevel dataset. This is an alternative to running each cluster one at a time, as in section 7d.
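The number of pairwise comparisons mentioned in strategy 2 can be checked with base R. This is a small sketch, assuming 8 clusters with IDs 0 to 7 as in the example above; `n_clusters`, `clusters`, `n_pairs` and `pairs` are names made up for this illustration.

```r
n_clusters <- 8                   # example from the text
clusters <- 0:(n_clusters - 1)    # cluster IDs 0..7, matching the examples in this notebook

# Number of unordered pairs of clusters: n * (n - 1) / 2
n_pairs <- choose(n_clusters, 2)
n_pairs

# Every pair as a two-column table, one row per comparison
pairs <- t(combn(clusters, 2))
colnames(pairs) <- c("declust", "declust2")
head(pairs)
```

Each row of `pairs` corresponds to one run of section 7e.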

**IMPORTANT: the first cluster you choose in each analysis below (the 'baseline cluster', passed to Seurat as `ident.1`) is the first group in the comparison. Seurat reports log fold change for this first group relative to the cells it is compared against. This means that genes with a positive log fold change (+ve lfc) are more highly expressed in the first-chosen cluster, while genes with a negative log fold change (-ve lfc) are more highly expressed in the cluster(s) it is compared to.**
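Seurat documents `avg_log2FC` as positive when a gene is more highly expressed in the first group (`ident.1`). The toy calculation below illustrates only the meaning of the sign. It is a simplified sketch with made-up numbers, and it is not Seurat's exact formula; all variable names are hypothetical.

```r
# Toy normalised expression for one gene across 8 cells (made-up numbers)
expr    <- c(5, 6, 7, 1, 2, 1, 2, 1)
cluster <- c(0, 0, 0, 1, 1, 2, 2, 2)

first_group <- 0   # plays the role of ident.1
mean_first <- mean(expr[cluster == first_group])
mean_rest  <- mean(expr[cluster != first_group])

# Simplified log2 fold change: first group relative to all other cells.
# Seurat's own calculation differs in detail, but the sign has the same meaning.
log2fc <- log2(mean_first + 1) - log2(mean_rest + 1)
log2fc
```

Here the gene is more highly expressed in cluster 0, so the value is positive. Swapping the groups would flip the sign.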

************************************************

## 7d. Differential expression: one cluster vs all other cells <a class="anchor" id="de1"></a>

In this section you'll select a single cluster, then compare gene expression in this cluster to all other clusters combined. Genes with a positive log fold change are more highly expressed in the selected cluster; genes with a negative log fold change are more highly expressed in the combined remaining clusters.

First, we should have a look at the clusters we have to work with:

In [None]:
levels(x = data)

You'll need to select one of these clusters as the baseline cluster (`declust <-`). You can re-run the below code with a different baseline cluster if you want to look at other comparisons:

In [None]:
declust <- 0

Then run the Seurat differential expression function:

In [None]:
DE_genes <- FindMarkers(data, ident.1 = declust, logfc.threshold = 0.2)

We can see how many 'differentially expressed' genes this produced by:

In [None]:
nrow(DE_genes)

Note the `logfc.threshold = 0.2` parameter above. This restricts testing to genes with at least a 0.2 log fold difference in average expression between the two groups, which speeds up the analysis considerably. But it could also exclude some significant DE genes.

As a rough check, look at the bottom 6 genes (ordered by p value) by running the `tail(DE_genes)` command below. If these genes are all non-significant (i.e. p_val_adj > 0.05) then you have most likely captured all significant genes. If all these genes are significant (p_val_adj < 0.05) then re-run `FindMarkers()` with `logfc.threshold = 0.1`. This will take much longer to run, but should then capture all DE genes. If it does not, reduce the threshold further to `logfc.threshold = 0`, although this will take a very long time to run.

In [None]:
tail(DE_genes)
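The same check can be written as a test rather than read by eye. The sketch below uses a toy data frame standing in for `FindMarkers()` output (the numbers are made up, and `toy_de`, `tail_all_nonsig` and `tail_all_sig` are hypothetical names), so it runs without Seurat.

```r
# Toy stand-in for FindMarkers() output, already ordered by p value
toy_de <- data.frame(
  avg_log2FC = c(2.1, -1.4, 0.9, 0.5, 0.3, -0.25, 0.22, 0.21, -0.2, 0.2),
  p_val_adj  = c(1e-30, 1e-12, 1e-5, 0.003, 0.2, 0.6, 0.9, 1, 1, 1),
  row.names  = paste0("gene", 1:10)
)

# Look at the last six rows, as tail() does by default
last_rows <- tail(toy_de)
tail_all_nonsig <- all(last_rows$p_val_adj > 0.05)
tail_all_sig    <- all(last_rows$p_val_adj < 0.05)

if (tail_all_nonsig) {
  message("Tail is non-significant: the current logfc.threshold is probably low enough.")
} else if (tail_all_sig) {
  message("Tail is still significant: consider re-running with a lower logfc.threshold.")
}
```

With real results you would replace `toy_de` with `DE_genes`.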

You can view the top 20 DE genes by:

In [None]:
head(DE_genes, 20)

You can extract just the **significantly** differentially expressed genes (adjusted p < 0.05) like so:

In [None]:
DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ]

See how many significantly DE genes there are:

In [None]:
nrow(DE_genes_sig)
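For context on what `p_val_adj` means: Seurat documents it as a Bonferroni correction based on all genes in the dataset. The sketch below shows how that correction works on made-up p values using base R. The number of genes (`n_genes_tested`) is a hypothetical value chosen for illustration.

```r
raw_p <- c(1e-8, 2e-5, 0.001, 0.01, 0.04)   # made-up raw p values
n_genes_tested <- 2000                       # hypothetical number of genes in the dataset

# Bonferroni: multiply each p value by the number of tests, capped at 1
bonf <- pmin(raw_p * n_genes_tested, 1)

# The same correction using base R's p.adjust()
bonf_builtin <- p.adjust(raw_p, method = "bonferroni", n = n_genes_tested)

data.frame(raw_p, bonf, bonf_builtin)

# How many genes remain significant at adjusted p < 0.05
n_sig <- sum(bonf < 0.05)
n_sig
```

Only the two smallest p values survive here, which shows how conservative the correction becomes when thousands of genes are tested.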

You can also extract genes based on log fold change as well. Enter a log fold change threshold here:

In [None]:
lfc_threshold <- 0.3

Then filter your data by this log fold change threshold.

In [None]:
DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ]

Then see how many sig DE genes remain after lfc filtration:

In [None]:
nrow(DE_genes_sig_lfc)
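The two-sided filter above is equivalent to filtering on the absolute log fold change. The following sketch demonstrates this on a toy table with made-up values; `toy_sig`, `lfc_demo`, `kept_two_sided` and `kept_abs` are names introduced only for this example.

```r
# Toy stand-in for DE_genes_sig (made-up numbers)
toy_sig <- data.frame(
  avg_log2FC = c(1.8, -0.9, 0.35, 0.25, -0.1, -0.31),
  p_val_adj  = c(1e-20, 1e-10, 1e-6, 1e-4, 0.01, 0.02),
  row.names  = paste0("gene", 1:6)
)
lfc_demo <- 0.3

# The form used in this notebook: below -threshold OR above +threshold
kept_two_sided <- toy_sig[toy_sig$avg_log2FC < -lfc_demo | toy_sig$avg_log2FC > lfc_demo, ]

# Equivalent form using the absolute value
kept_abs <- toy_sig[abs(toy_sig$avg_log2FC) > lfc_demo, ]

identical(kept_two_sided, kept_abs)
rownames(kept_abs)
```

Either form keeps both up- and down-regulated genes; use `avg_log2FC > lfc_demo` alone if you only want genes in one direction.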

**NOTE** You can easily filter out all your results here by using too high a lfc threshold. You can see the maximum log fold change for all DE genes like so:

In [None]:
max(DE_genes$avg_log2FC)

And you can view the minimum as well. Use this min/max log fold change information to decide on a suitable log fold change threshold. This can vary greatly, depending on your data.

In [None]:
min(DE_genes$avg_log2FC)
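One way to choose a threshold is to count how many genes would survive several candidate values. This is a sketch on made-up log fold changes; `toy_lfc`, `candidates` and `genes_kept` are hypothetical names, and with real data you would use `DE_genes_sig$avg_log2FC` instead.

```r
# Made-up avg_log2FC values standing in for a real results table
toy_lfc <- c(-2.4, -1.1, -0.6, -0.3, 0.25, 0.4, 0.9, 1.7, 3.2)

# Minimum and maximum in one call
range(toy_lfc)

# Number of genes that would pass each candidate threshold
candidates <- c(0.3, 0.5, 1, 2)
genes_kept <- sapply(candidates, function(thr) sum(abs(toy_lfc) > thr))
data.frame(lfc_threshold = candidates, genes_kept)
```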

You can export the DE genes table to your working directory as a csv file (and then view the entire table in Jupyter by double clicking on the csv file):

In [None]:
write.csv(DE_genes_sig_lfc, paste0("DE_1vsALL_clust_", declust, "_", sample, ".csv"))

### Visualising DE results

There are a variety of ways to visualise your DE results. Below are a few examples and more can be added to this workflow as needed.

Not all of these methods have the space to plot all DE genes, so we can provide them with a list of DE genes to plot:

In [None]:
plotgenes <- head(rownames(DE_genes_sig), 10)

The above pulls out the top 10 most significantly DE genes, by adjusted p value.

You can instead enter selected genes to plot (e.g., genes of interest chosen from the table of DE genes), in which case you'd change the above code to a vector of gene IDs, e.g. `plotgenes <- c("gene1", "gene2", "gene3", "gene4")`, replacing these placeholder names with IDs written exactly as they appear in the row names of your DE table. You can include as many genes as you like.
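Another option is to plot the top few genes in each direction, rather than the top genes overall. The sketch below does this on a toy table (made-up gene names and values; `toy_sig`, `n_top`, `up`, `down` and `plotgenes_demo` are hypothetical names), assuming the table is already ordered by p value as `FindMarkers()` output is.

```r
# Toy stand-in for DE_genes_sig, ordered by p value
toy_sig <- data.frame(
  p_val_adj  = c(1e-40, 1e-30, 1e-20, 1e-10, 1e-6, 1e-3),
  avg_log2FC = c(2.0, -1.5, 1.2, -0.8, 0.6, -0.4),
  row.names  = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF")
)

n_top <- 2
up   <- toy_sig[toy_sig$avg_log2FC > 0, ]   # positive log fold change
down <- toy_sig[toy_sig$avg_log2FC < 0, ]   # negative log fold change

# Top genes from each direction, most significant first
plotgenes_demo <- c(head(rownames(up), n_top), head(rownames(down), n_top))
plotgenes_demo

# head() returns only the genes available, so there are no NA entries
# if you ask for more genes than exist
up_all <- head(rownames(up), 10)
up_all
```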

**Scatter plot of individual genes**

You can select `reduction =` to be either `"pca"` for PCA plot, or `"umap"` or `"tsne"`. You can also change the colours or point sizes.

In [None]:
p <- FeaturePlot(data, features = plotgenes, cols = c("lightgrey", "red"), reduction = "pca", pt.size = 1)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", declust, "_DE_genes.tiff")
ggsave(filename = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", declust, "_DE_genes.pdf")
ggsave(filename = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")
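The export cells in this section build each file name by hand with `paste0()`. If you find yourself editing many of them, a small helper can keep the names consistent. The function below is a hypothetical convenience, not part of Seurat or ggplot2, and the naming pattern it produces is only one possible choice.

```r
# Hypothetical helper: builds an output file name from its parts
export_name <- function(sample, clusters, plot_type, ext) {
  # One cluster gives "clust_0"; two clusters give "clust_0_vs_clust_1"
  cluster_part <- paste0("clust_", clusters, collapse = "_vs_")
  paste0(sample, "_", cluster_part, "_DE_genes_", plot_type, ".", ext)
}

name_one <- export_name("Cerebellum", 0, "dotplot", "tiff")
name_two <- export_name("Cerebellum", c(0, 1), "violin", "pdf")
name_one
name_two
```

The result can be passed to `ggsave()` in place of `tiff_exp` or `pdf_exp`.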

**Dot plot**

In [None]:
p <- DotPlot(data, features = plotgenes, dot.scale = 8, cols = c("lightgrey", "red")) + RotatedAxis() +
ylab("Cluster") + xlab("Genes") +
theme(text = element_text(size = 18))
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", declust, "_DE_genes_dotplot.tiff")
ggsave(filename = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", declust, "_DE_genes_dotplot.pdf")
ggsave(filename = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**Violin plots of individual genes**

In [None]:
p <- VlnPlot(data, features = plotgenes, cols = c25, pt.size = 0)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", declust, "_DE_genes_violin.tiff")
ggsave(filename = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", declust, "_DE_genes_violin.pdf")
ggsave(filename = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**Heat map**

In [None]:
p <- DoHeatmap(data, features = plotgenes, raster = TRUE) +
  scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) +
  theme(text = element_text(size = 16))
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", declust, "_DE_genes_heatmap.tiff")
ggsave(filename = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", declust, "_DE_genes_heatmap.pdf")
ggsave(filename = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

You can go back to the start of this section and change `declust <-` to another cluster to examine DE for that cluster. 

************************************************

## 7e. Differential expression: one cluster vs another cluster <a class="anchor" id="de2"></a>

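In this section you'll select two clusters to compare. Genes with a positive log fold change are more highly expressed in the first cluster (`declust`); genes with a negative log fold change are more highly expressed in the second cluster (`declust2`).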

First, select your baseline cluster:

In [None]:
declust <- 0

Then, the cluster you want to compare it to:

In [None]:
declust2 <- 1

Then run the Seurat DE function. 

In [None]:
DE_genes <- FindMarkers(data, ident.1 = declust, ident.2 = declust2, logfc.threshold = 0.2)

As in the previous section, you can run `tail(DE_genes)` to see if you need to reduce `logfc.threshold = 0.2` so as to capture all **significantly** DE genes.

In [None]:
tail(DE_genes)

And again you can filter by adjusted p value and log fold change (i.e. change the `p_val_adj` cutoff and `lfc_threshold` to suitable numbers), count the number of sig DE genes and view the top few.

In [None]:
# Keep genes with adjusted p < 0.05
DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ]
# Log fold change threshold (0.3 as in section 7d; adjust to suit your data)
lfc_threshold <- 0.3
DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ]
nrow(DE_genes_sig_lfc)
head(DE_genes_sig_lfc)

Export the table of DE genes as a csv file:

In [None]:
write.csv(DE_genes_sig_lfc, paste0("DE_1vs1_clust", declust, "_vs_clust", declust2, "_", sample, ".csv"))

Pull out the top 10 genes to plot (or change to the range or names of genes you want to plot, as in the previous section.)

In [None]:
plotgenes <- head(rownames(DE_genes_sig), 10)

Now you can plot these genes.

**Scatter plot**

You can select `reduction =` to be either `"pca"` for PCA plot, or `"umap"` or `"tsne"`. You can also change the colours or point sizes.

In [None]:
p <- FeaturePlot(data, features = plotgenes, cols = c("lightgrey", "red"), reduction = "pca", pt.size = 1)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", declust, "_vs_clust_", declust2, "_DE_genes.tiff")
ggsave(filename = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", declust, "_vs_clust_", declust2, "_DE_genes.pdf")
ggsave(filename = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**Dot plot**

In [None]:
p <- DotPlot(data, features = plotgenes, dot.scale = 8, cols = c("lightgrey", "red")) + RotatedAxis() +
  ylab("Cluster") + xlab("Genes") +
  theme(text = element_text(size = 18))
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", declust, "_vs_clust_", declust2, "_DE_genes_dotplot.tiff")
ggsave(filename = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", declust, "_vs_clust_", declust2, "_DE_genes_dotplot.pdf")
ggsave(filename = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**Violin plots**

In [None]:
p <- VlnPlot(data, features = plotgenes, cols = c25, pt.size = 0)
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", declust, "_vs_clust_", declust2, "_DE_genes_violin.tiff")
ggsave(filename = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", declust, "_vs_clust_", declust2, "_DE_genes_violin.pdf")
ggsave(filename = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

**Heat map**

In [None]:
p <- DoHeatmap(data, features = plotgenes, raster = TRUE) +
  scale_fill_gradientn(colors = c("darkorange", "floralwhite", "dodgerblue4")) +
  theme(text = element_text(size = 16))
p

Export as a 300dpi tiff

In [None]:
tiff_exp <- paste0(sample, "_clust_", declust, "_vs_clust_", declust2, "_DE_genes_heatmap.tiff")
ggsave(filename = tiff_exp, dpi = 300, compression = "lzw", device = "tiff", plot = p, width = 20, height = 20, units = "cm")

Export as a pdf

In [None]:
pdf_exp <- paste0(sample, "_clust_", declust, "_vs_clust_", declust2, "_DE_genes_heatmap.pdf")
ggsave(filename = pdf_exp, device = "pdf", plot = p, width = 20, height = 20, units = "cm")

You can go back to the start of this section and change `declust <-` and/or `declust2 <-` to other clusters to examine DE for a different pair of clusters.

********************************

## 7f. Differential expression: every cluster vs all other cells  <a class="anchor" id="de3"></a>

In section 7d we can compare a chosen cluster to all other clusters/cells. In this section we can automate this somewhat and do this for every cluster at the same time.

Unlike the other DE sections, you don't need to choose any clusters to compare, just run the Seurat function:

In [None]:
DE_genes <- FindAllMarkers(data, logfc.threshold = 0.2)

As with the previous sections, use `tail()` to see if you've captured all sig DE genes:

In [None]:
tail(DE_genes[order(DE_genes$p_val_adj, decreasing = FALSE), ])

Now filter by adjusted p and lfc (enter your own parameters), then re-order by cluster:

In [None]:
DE_genes_sig <- DE_genes[DE_genes$p_val_adj < 0.05, ]
DE_genes_sig <- DE_genes_sig[order(DE_genes_sig$cluster),]
lfc_threshold <- 0.3
DE_genes_sig_lfc <- DE_genes_sig[DE_genes_sig$avg_log2FC < -lfc_threshold | DE_genes_sig$avg_log2FC > lfc_threshold, ]

You can see how many DE genes per cluster were found:

In [None]:
summary(DE_genes_sig_lfc$cluster)
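Because `FindAllMarkers()` returns one long table with `cluster` and `gene` columns, it is often useful to summarise it per cluster. The sketch below counts genes per cluster and picks the most strongly up-regulated gene in each, using a toy table with made-up values; `toy_all`, `by_cluster` and `top_per_cluster` are names introduced only for this illustration.

```r
# Toy stand-in for filtered FindAllMarkers() output (made-up numbers)
toy_all <- data.frame(
  cluster    = factor(c(0, 0, 0, 1, 1, 2)),
  gene       = c("geneA", "geneB", "geneC", "geneA", "geneD", "geneE"),
  avg_log2FC = c(1.9, 0.8, -0.5, -1.2, 2.3, 0.7),
  p_val_adj  = c(1e-20, 1e-8, 1e-4, 1e-15, 1e-30, 1e-3)
)

# Number of DE genes per cluster
genes_per_cluster <- table(toy_all$cluster)
genes_per_cluster

# Most strongly up-regulated gene in each cluster
by_cluster <- split(toy_all, toy_all$cluster)
top_per_cluster <- do.call(rbind, lapply(by_cluster, function(d) {
  d <- d[order(d$avg_log2FC, decreasing = TRUE), ]
  head(d, 1)
}))
top_per_cluster[, c("cluster", "gene", "avg_log2FC")]
```

Note that the same gene can appear under more than one cluster (here `geneA`), so use the `gene` column rather than the row names when looking genes up.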

Then export this table of DE genes per cluster as a csv file:

In [None]:
write.csv(DE_genes_sig_lfc, paste0("DE_eachclustVSall_", sample, ".csv"))