-
Notifications
You must be signed in to change notification settings - Fork 1
/
seurat.Rmd
584 lines (417 loc) · 26.3 KB
/
seurat.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
---
title: "Getting started with Seurat"
date: "`r Sys.Date()`"
output:
workflowr::wflow_html:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# https://stackoverflow.com/questions/30237310/setting-work-directory-in-knitr-using-opts-chunksetroot-dir-doesnt-wor
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
```
This post follows the Peripheral Blood Mononuclear Cells (PBMCs) [tutorial](https://satijalab.org/seurat/articles/pbmc3k_tutorial) for 2,700 single cells. It was written while I was going through the tutorial and contains my notes. The dataset for this tutorial can be downloaded from the [10X Genomics dataset page](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k) but it is also hosted on Amazon (see below). The PBMCs, which are primary cells with relatively small amounts of RNA (around 1pg RNA/cell), come from a healthy donor. There were 2,700 cells detected and sequencing was performed on an Illumina NextSeq 500 with around 69,000 reads per cell. To get started [install Seurat](https://satijalab.org/seurat/articles/install.html) by using install.packages().
```{r install_seurat, eval=FALSE}
install.packages("Seurat")
```
If you get the warning:
>‘SeuratObject’ was built under R 4.3.0 but the current version is 4.3.2; it is recomended that you
reinstall ‘SeuratObject’ as the ABI for R may have changed
re-install the `SeuratObject` package using a repository that has an updated copy. The same goes for the `htmltools` package.
```{r install_seurat_obj, eval=FALSE}
install.packages("SeuratObject", repos = "https://cran.ism.ac.jp/")
install.packages("htmltools", repos = "https://cran.ism.ac.jp/")
packageVersion("SeuratObject")
packageVersion("htmltools")
```
Load `Seurat`.
```{r load_seurat}
library("Seurat")
packageVersion("Seurat")
```
## Data
To follow the tutorial, you need the 10X data.
```{bash, eval=FALSE}
mkdir -p data/pbmc3k && cd data/pbmc3k
wget -c https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
```
The extracted files.
```{bash}
ls -1 data/pbmc3k/filtered_gene_bc_matrices/hg19
```
`matrix.mtx` is a [MatrixMarket](https://math.nist.gov/MatrixMarket/formats.html) file. It has the following properties:
* Only non-zero entries are stored in the file
* Comments start with a `%`, like LaTeX
* The first line indicates the total number of rows, columns, and entries
* The following lines after the first provide a row and column number and the value at that coordinate
```{bash}
head data/pbmc3k/filtered_gene_bc_matrices/hg19/matrix.mtx
```
## Seurat object
Load 10x data into a matrix.
```{r}
pbmc.data <- Read10X(data.dir = "data/pbmc3k/filtered_gene_bc_matrices/hg19/")
```
[dgTMatrix-class](https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgTMatrix-class.html).
```{r}
class(pbmc.data)
```
32,738 genes and 2,700 cells.
```{r}
dim(pbmc.data)
```
Save as CSV file.
```{r write_csv}
system.time(
write.csv(x = pbmc.data, file = "data/pbmc3k.csv")
)
```
Gzip.
```{bash}
if [[ ! -f data/pbmc3k.csv.gz ]]; then
gzip data/pbmc3k.csv
fi
ls -lh data/pbmc3k.csv.gz
```
Check out the first six genes and cells
```{r}
pbmc.data[1:6, 1:6]
```
Summary of total expression per single cell.
```{r}
summary(colSums(pbmc.data))
```
Check how many genes have at least one transcript in each cell.
The median number of detected genes among the single cells is 817.
```{r}
at_least_one <- apply(pbmc.data, 2, function(x) sum(x>0))
hist(
at_least_one,
breaks = 100,
main = "Distribution of detected genes",
xlab = "Genes with at least one tag"
)
abline(v = median(at_least_one), col = 2, lty = 3)
```
Total expression per cell. The median sum of expression among the single cells is 2,197. This distribution is very similar to the distribution of detected genes shown above.
```{r}
hist(
colSums(pbmc.data),
breaks = 100,
main = "Expression sum per cell",
xlab = "Sum expression"
)
abline(v = median(colSums(pbmc.data)), col = 2, lty = 3)
```
We will filter out genes and single cells before we continue with the analysis. The tutorial has arbitrary values of keeping genes expressed in three or more cells and keeping cells with at least 200 detected genes.
Manually check the number of genes detected in three or more cells; a lot of genes are not detected in 3 or more cells.
```{r}
tmp <- apply(pbmc.data, 1, function(x) sum(x>0))
table(tmp>=3)
```
All cells have at least 200 detected genes
```{r}
keep <- tmp>=3
tmp <- pbmc.data[keep,]
at_least_one <- apply(tmp, 2, function(x) sum(x>0))
summary(at_least_one)
```
```{r}
dim(tmp)
```
See `?SeuratObject` for more information on the class.
```{r}
pbmc <- CreateSeuratObject(
counts = pbmc.data,
min.cells = 3,
min.features = 200,
project = "pbmc3k"
)
class(pbmc)
```
Same numbers as above
```{r}
pbmc
```
Slots in Seurat object.
> SeuratObject: Data Structures for Single Cell Data
>
> Defines S4 classes for single-cell genomic data and associated information, such as dimensionality reduction embeddings, nearest-neighbor graphs, and spatially-resolved coordinates. Provides data access methods and R-native hooks to ensure the Seurat object is familiar to other R users
Read more about the [S4 class](https://adv-r.hadley.nz/s4.html) in the Advanced R book.
```{r}
slotNames(pbmc)
```
## Basic filtering
The tutorial states that "The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat." The nUMI is calculated as `num.mol <- colSums(object.raw.data)`, i.e. each transcript is a unique molecule. The number of genes is simply the tally of genes with at least 1 transcript; `num.genes <- colSums(object.raw.data > is.expr`) where `is.expr` is zero.
A common quality control metric is the percentage of transcripts from the mitochondrial genome. According to the paper [Classification of low quality cells from single-cell RNA-seq data](https://pubmed.ncbi.nlm.nih.gov/26887813/) the reason this is a quality control metric is because if a single cell is lysed, cytoplasmic RNA will be lost apart from the RNA that is enclosed in the mitochondria, which will be retained and sequenced.
Human mitochondria genes conveniently start with MT.
```{r mito_genes}
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@assays$RNA), value = TRUE)
length(mito.genes)
```
```{r}
percent.mito <- Matrix::colSums(pbmc[['RNA']]$counts[mito.genes, ]) / Matrix::colSums(pbmc[['RNA']]$counts) * 100
head(percent.mito)
```
Check out the meta data.
```{r}
head(pbmc@meta.data)
```
Add mitochondrial percent to the meta using `AddMetaData`.
```{r}
pbmc <- AddMetaData(
object = pbmc,
metadata = percent.mito,
col.name = "percent.mito"
)
head(pbmc@meta.data)
```
Use `PercentageFeatureSet` to calculate the percentage of a set of features, which saves us from some typing. The `[[` operator can add columns to object metadata, which is a great place to stash QC stats.
```{r percentage_feature_set_mito}
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data[, c('percent.mito', 'percent.mt')])
```
Instead of setting a hard filtering threshold, one can use the 3 * [Median Absolute Deviation](https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html#filtering-low-quality-reads) to determine threshold limits for filtering out cells.
```{r}
median(pbmc@meta.data$percent.mt) - 3 * mad(pbmc@meta.data$percent.mt)
median(pbmc@meta.data$percent.mt) + 3 * mad(pbmc@meta.data$percent.mt)
```
Plot number of genes, UMIs, and % mitochondria
Visualize QC metrics as a violin plot
```{r}
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
```
A couple of cells have high mitochondrial percentage which may indicate lost of cytoplasmic RNA.
The GenePlot() function can be used to visualise gene-gene relationships as well as any columns in the seurat object. Below we use the plotting function to spot cells that have a high percentage of mitochondrial RNA and to plot the relationship between the number of unique molecules and the number of genes captured.
FeatureScatter is typically used to visualize feature-feature relationships, but can be used
for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
```{r}
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
```
Manual check; I already know all cells have >200 genes.
```{r}
table(pbmc@meta.data$percent.mito < 5 & pbmc@meta.data$nFeature_RNA<2500)
```
```{r}
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc
```
## Normalisation
The next step is to normalise the data, so that each cell can be compared against each other. At the time of writing, the only normalisation method implemented in Seurat is by log normalisation. Gene expression measurements for each cell are normalised by its total expression, scaled by 10,000, and log-transformed.
```{r}
hist(
colSums(pbmc[['RNA']]$counts),
breaks = 100,
main = "Total expression before normalisation",
xlab = "Sum of expression"
)
```
After removing unwanted cells from the dataset, the next step is to normalise the data. By default, we employ a global-scaling normalization method "LogNormalize" that normalises the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. In Seurat v5, Normalized values are stored in pbmc[["RNA"]]$data.
```{r}
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
```
For clarity, in this previous line of code (and in future commands), we provide the default values for certain parameters in the function call. However, this isn’t required and the same behavior can be achieved with:
```{r}
pbmc <- NormalizeData(pbmc)
```
While this method of normalization is standard and widely used in scRNA-seq analysis, global-scaling relies on an assumption that each cell originally contains the same number of RNA molecules. We and others have developed alternative workflows for the single cell preprocessing that do not make these assumptions. For users who are interested, please check out our SCTransform() normalization workflow. The method is described in ourpaper, with a separate vignette using Seurat here. The use of SCTransform replaces the need to run NormalizeData, FindVariableFeatures, or ScaleData (described below.)
```{r}
hist(
colSums(pbmc[['RNA']]$data),
breaks = 100,
main = "Total expression after normalisation",
xlab = "Sum of expression"
)
```
## Identification of highly variable features (feature selection)
Once the data is normalised, the next step is to find genes are vary between single cells; genes that are constant among all cells have no distinguishing power. The `FindVariableFeatures()` function calculates the average expression and dispersion for each gene, places these genes into bins, and then calculates a z-score for dispersion within each bin. I interpret that as take each gene, get the average expression and variance of the gene across the 2,638 cells, categorise genes into bins (default is 20) based on their expression and variance, and finally normalise the variance in each bin. This was the same approach in [Macosko et al.](https://www.ncbi.nlm.nih.gov/pubmed/26000488) and new methods for detecting genes with variable expression patterns will be implemented in Seurat soon (according to the tutorial). The parameters used below are typical settings for UMI data that is normalised to a total of 10,000 molecules and will identify around 2,000 variable genes. The tutorial recommends that users should explore the parameters themselves since each dataset is different.
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
Our procedure in Seurat is described in detail here, and improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and is implemented in the FindVariableFeatures() function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.
```{r}
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
```
```{r}
length(VariableFeatures(pbmc))
```
Identify the 10 most highly variable genes
```{r}
top10 <- head(VariableFeatures(pbmc), 10)
top10
```
Plot variable features with and without labels
```{r, fig.width=10, fig.height=6}
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
```
## Scaling the data
Next, we apply a linear transformation ("scaling") that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The `ScaleData()` function:
* Shifts the expression of each gene, so that the mean expression across cells is 0
* Scales the expression of each gene, so that the variance across cells is 1
* This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
* The results of this are stored in pbmc[["RNA"]]$scale.data
* By default, only variable features are scaled.
* You can specify the features argument to scale additional features
```{r}
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
dim(pbmc[["RNA"]]$scale.data)
```
```{r}
hist(
colSums(pbmc[['RNA']]$scale.data),
breaks = 100,
main = "Total expression after scaling",
xlab = "Sum of expression"
)
```
How can I remove unwanted sources of variation?
In Seurat, we also use the `ScaleData()` function to remove unwanted sources of variation from a single-cell dataset. For example, we could "regress out" heterogeneity associated with (for example) cell cycle stage, or mitochondrial contamination i.e.:
```{r, eval=FALSE}
pbmc <- ScaleData(pbmc, features = all.genes, vars.to.regress = "percent.mt")
```
However, particularly for advanced users who would like to use this functionality, we strongly recommend the use of our new normalization workflow, `SCTransform()`. The method is described in [this paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02584-9), with a separate [vignette using Seurat](https://satijalab.org/seurat/articles/sctransform_vignette). As with `ScaleData()`, the function `SCTransform()` also includes a `vars.to.regress` parameter.
## Perform linear dimensional reduction
Next we perform PCA on the scaled data. By default, only the previously determined variable features are used as input, but can be defined using features argument if you wish to choose a different subset (if you do want to use a custom subset of features, make sure you pass these to `ScaleData` first).
For the first principal components, Seurat outputs a list of genes with the most positive and negative loadings, representing modules of genes that exhibit either correlation (or anti-correlation) across single-cells in the dataset.
```{r}
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
```
Seurat provides several useful ways of visualizing both cells and features that define the PCA, including `VizDimReduction()`, `DimPlot()`, and `DimHeatmap()`.
Examine and visualize PCA results a few different ways
```{r}
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
```
```{r, fig.height=7, fig.width=8}
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
```
```{r}
DimPlot(pbmc, reduction = "pca") + NoLegend()
```
In particular `DimHeatmap()` allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting cells to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated feature sets.
```{r}
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
```
```{r fig.height=12, fig.width=8}
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
```
## Determine the dimensionality of the dataset
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should we choose to include? 10? 20? 100?
In Macosko et al, we implemented a resampling test inspired by the JackStraw procedure. While still available in Seurat (see previous vignette), this is a slow and computationally expensive procedure, and we is no longer routinely used in single cell analysis.
An alternative heuristic method generates an ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each one (ElbowPlot() function). In this example, we can observe an ‘elbow’ around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.
```{r}
ElbowPlot(pbmc)
```
Identifying the true dimensionality of a dataset – can be challenging/uncertain for the user. We therefore suggest these multiple approaches for users. The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. The second (ElbowPlot) The third is a heuristic that is commonly used, and can be calculated instantly. In this example, we might have been justified in choosing anything between PC 7-12 as a cutoff.
We chose 10 here, but encourage users to consider the following:
* Dendritic cell and NK aficionados may recognize that genes strongly associated with PCs 12 and 13 define rare immune subsets (i.e. MZB1 is a marker for plasmacytoid DCs). However, these groups are so rare, they are difficult to distinguish from background noise for a dataset of this size without prior knowledge.
* We encourage users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As you will observe, the results often do not differ dramatically.
* We advise users to err on the higher side when choosing this parameter. For example, performing downstream analyses with only 5 PCs does significantly and adversely affect results.
## Cluster the cells
Seurat applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partitioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.
As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).
To cluster the cells, we next apply modularity optimization techniques such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics], to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters can be found using the Idents() function.
```{r}
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
```
Look at cluster IDs of the first 5 cells
```{r}
head(Idents(pbmc), 5)
```
## Non-linear dimensional reduction
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn underlying structure in the dataset, in order to place similar cells together in low-dimensional space. Therefore, cells that are grouped together within graph-based clusters determined above should co-localize on these dimension reduction plots.
While we and others have routinely found 2D visualization techniques like tSNE and UMAP to be valuable tools for exploring datasets, all visualization techniques have limitations, and cannot fully represent the complexity of the underlying data. In particular, these methods aim to preserve local distances in the dataset (i.e. ensuring that cells with very similar gene expression profiles co-localize), but often do not preserve more global relationships. We encourage users to leverage techniques like UMAP for visualization, but to avoid drawing biological conclusions solely on the basis of visualization techniques.
```{r}
pbmc <- RunUMAP(pbmc, dims = 1:10)
```
Note that you can set `label = TRUE` or use the LabelClusters function to help label individual clusters
```{r}
DimPlot(pbmc, reduction = "umap")
```
## Finding differentially expressed features
Seurat can help you find markers that define clusters via differential expression (DE). By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers() automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
In Seurat v5, we use the presto package (as described here and available for installation here), to dramatically improve the speed of DE analysis, particularly for large datasets. For users who are not using presto, you can examine the documentation for this function (?FindMarkers) to explore the min.pct and logfc.threshold parameters, which can be increased in order to increase the speed of DE testing.
For a (much!) faster implementation of the Wilcoxon Rank Sum Test, (default method for FindMarkers) please install the {presto} package. After installation of {presto}, Seurat will automatically use the more
efficient implementation (no further action necessary).
```{r, eval=FALSE}
install.packages("remotes")
remotes::install_github('immunogenomics/presto')
```
Load {presto}.
```{r}
library('presto')
packageVersion('presto')
```
Find all markers of cluster 2.
```{r}
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2)
head(cluster2.markers, n = 5)
```
Find all markers distinguishing cluster 5 from clusters 0 and 3
```{r}
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3))
head(cluster5.markers, n = 5)
```
Find markers for every cluster compared to all remaining cells, report only the positive ones.
```{r}
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE)
pbmc.markers %>%
dplyr::group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1)
```
Seurat has several tests for differential expression which can be set with the test.use parameter (see our DE vignette for details). For example, the ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).
```{r}
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
```
We include several tools for visualizing marker expression. `VlnPlot()` (shows expression probability distributions across clusters), and `FeaturePlot()` (visualizes feature expression on a tSNE or PCA plot) are our most commonly used visualizations. We also suggest exploring `RidgePlot()`, `CellScatter()`, and `DotPlot()` as additional methods to view your dataset.
```{r}
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
```
You can plot raw counts as well.
```{r}
VlnPlot(pbmc, features = c("NKG7", "PF4"), layer = "counts", log = TRUE)
```
```{r, fig.height=8, fig.width=10}
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
```
`DoHeatmap()` generates an expression heatmap for given cells and features. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
```{r, fig.height=10, fig.width=12}
pbmc.markers %>%
dplyr::group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
dplyr::slice_head(n = 10) %>%
dplyr::ungroup() -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
```
## Assigning cell type identity to clusters
Fortunately in the case of this dataset, we can use canonical markers to easily match the unbiased clustering to known cell types:
Cluster ID | Markers | Cell Type
-----------|---------------|----------
0 | IL7R, CCR7 | Naive CD4+ T
1 | CD14, LYZ | CD14+ Mono
2 | IL7R, S100A4 | Memory CD4+
3 | MS4A1 | B
4 | CD8A | CD8+ T
5 | FCGR3A, MS4A7 | FCGR3A+ Mono
6 | GNLY, NKG7 | NK
7 | FCER1A, CST3 | DC
8 | PPBP | Platelet
```{r}
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
```
```{r, fig.height=7, fig.width=12}
library(ggplot2)
DimPlot(pbmc, reduction = "umap", label = TRUE, label.size = 4.5) + xlab("UMAP 1") +
ylab("UMAP 2") +
theme(axis.title = element_text(size = 18), legend.text = element_text(size = 18)) +
guides(colour = guide_legend(override.aes = list(size = 10)))
```