-
Notifications
You must be signed in to change notification settings - Fork 0
/
single_basic_tutorial.Rmd
409 lines (317 loc) · 17.6 KB
/
single_basic_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
---
title: "A basic overview of single data analysis"
output:
html_document:
theme: united
df_print: kable
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
---
***
```{r setup, include=FALSE}
all_times <- list() # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
now <- NULL
function(before, options) {
if (before) {
now <<- Sys.time()
} else {
res <- difftime(Sys.time(), now, units = "secs")
all_times[[options$label]] <<- res
}
}
}))
knitr::opts_chunk$set(
tidy = TRUE,
tidy.opts = list(width.cutoff = 95),
message = FALSE,
warning = FALSE,
time_it = TRUE
)
```
# Overview
This tutorial demonstrates how to use `SRTpipeline` (>=0.1.0) to analyze a spatially-resolved transcriptomics (SRT) data. This tutorial will cover the following tasks, which we believe will be common for many SRT data analyses:
* Normalization
* Feature selection, including detecting spatially-variable features
* Dimensional reduction
* Clustering
* Visualization
* DEG analysis
* Trajectory inference
# Prepare SRTProject
For this tutorial, we will introduce how to create a `SRTProject` object with single SRT sample using SRTpipeline that includes an introduction to common analytical workflows for a single data batch.
Here, we will be taking a spatial transcriptomics dataset (SampleID: 151672) for human dorsolateral prefrontal cortex (DLPFC) as an example.
There are 4015 spots and 33538 genes that were sequenced on the 10x Visium platform. Our preprocessed data can be downloaded [here](https://github.com/feiyoung/PRECAST/blob/main/vignettes_data/dlpfc_151672.rda?raw=true), and the raw data can be found [here](https://github.com/LieberInstitute/spatialLIBD).
First, we loaded the required packages with the command:
```{r init}
library(SRTpipeline)
```
Next, we download the data to the current working path for the followed analysis by the following command:
```{r eval = FALSE}
githubURL <- "https://github.com/feiyoung/PRECAST/blob/main/vignettes_data/dlpfc_151672.rda?raw=true"
#download.file(githubURL,"dlpfc_151672.rda",mode='wb')
```
Then load to R. The data is saved in a Seurat object named `dlpfc_151672`.
```{r eval = FALSE}
load("dlpfc_151672.rda")
```
<details>
<summary>**How to read the raw sptial transcriptomics data?**</summary>
In addition, if users' data are in the raw form from spaceranger.
We start by reading in the data. The `read10XVisium()` function reads in the output of the [spaceranger](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/what-is-space-ranger) pipeline from 10X, returning a unique molecular identified (UMI) count matrix and a spatial coordinates file. The values in count matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each spot (column). The spatial information of all spots is represented by the spatial coordinates file.
We next use the count matrix and spatial coordinates to create a `SRTProject` object. The object serves as a container that contains both data (like the count matrix, spatial coordinates) and analysis (like PCA, or clustering results) for a spatial transcriptomics dataset. For a technical discussion of the `SRTProject` object structure, check the '?SRTProject-class'.
</details>
<details>
<summary>**What does data in a count matrix look like?**</summary>
```{r eval = FALSE}
# Let's examine a few genes in the first five spots
count <- dlpfc_151672[["RNA"]]@counts
count[1:5, 1:5]
```
The `.` values in the matrix represent 0s (no molecules detected). Since most values in an count matrix are 0, a sparse-matrix representation is necessary to save memory and speed savings for data.
````{r eval = FALSE}
meta_data <- dlpfc_151672@meta.data
head(meta_data[,1:5])
```
</details>
We check the data by printing it. We can see there are 4015 spots and 33538 genes.
```{r eval = FALSE}
dlpfc_151672
```
Check the meta data: each row contains the meta information of a spot with row name such as `AAACAAGTATCTCCCA-1`, which consists of the barcode, to identify each spot in RNA sequencing, separator `-` and a number `1`; note that the columns, `row`, `col`, `imagerow` and `imagecol`, save the spatial information of each spot, in wihch `row` and `col` are the location of a spot on the slide and will be used in spatial analysis, `imagerow` and `imagecol` are the location of a spot on the H&E image; there is the manual annotations `layer_guess_reordered` based on the cytoarchitecture in the original study (Maynard et al., 2021).
```{r eval = FALSE}
head(dlpfc_151672@meta.data)
```
Check the manual annotations: this slice is annotated as five layers: Layer 3-6 and White matter.
```{r eval = FALSE}
table(dlpfc_151672@meta.data$layer_guess_reordered)
```
# Pre-processing workflow
## Create a SRTProject object
First, we show how to create a SRTProject object step by step.
Because the gene name is Ensembl format in this data, we change the genes' Ensembl name to symbols for convenience of followed analyses.
```{r eval = FALSE}
count_matrix <- dlpfc_151672[['RNA']]@counts
## Use eg.db database: this method is fast
symbol_name <- transferGeneNames(row.names(count_matrix), now_name = "ensembl", to_name="symbol", species="Human", Method='eg.db')
symbol_name[1:10]
row.names(count_matrix) <- symbol_name
```
```{r eval = FALSE}
## create count matrix list: note each component has a name, i.e., `ID151672`.
cntList <- list(ID151672=count_matrix)
## create spatial coordinate matrix
coordList <- list(cbind(dlpfc_151672$row, dlpfc_151672$col))
## create metadata list
meta.data <- dlpfc_151672@meta.data
metadataList <- list(meta.data)
## create meta data for each data batches. Here we only have one data batch.
sampleMetadata <- data.frame(species="Human", tissues=c('DLPFC'))
row.names(sampleMetadata) <- names(cntList)
## Name of this project
projectName <- "DLPFC151672"
rm(dlpfc_151672)
```
Next, we start creating `SRTProject` object. We can print the basic information of this object, including three parts. The first part has the class of this object, outputPath of data that require to output, h5filePath that save the memory-cusuming data (i.e., count, logcount, ...). The second part is about the datasets basic information, such as how many data batches(sample) and the data names, sample meta data (sampleColData) and meta data for each spot (cellMetaData). The last part is about downstream analyses information (that is empty) when this object created.
```{r eval = FALSE}
SRTProj <- CreateSRTProject(cntList, coordList, projectName =projectName, metadataList,
sampleMetadata, force=F)
SRTProj
```
<details>
<summary>**What does data in a h5 file look like?**</summary>
```{r eval = FALSE}
h5ls(SRTProj@projectMetadata$h5filePath)
```
</details>
## Normalizing the data
After removing unwanted cells and genes from the dataset, the next step is to normalize the data. To save RAM memory, normalized values are stored in disk as a `h5file`.
```{r eval = FALSE}
SRTProj <- normalizeSRT(SRTProj)
```
### Feature selection
We next select a subset of genes that exhibit high spot-to-spot variation in the dataset (i.e, they are highly expressed in some spots, and lowly expressed in others). It has been found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets [here](https://www.nature.com/articles/nmeth.2645).
Then we choose variable features. The default number of variable features is 2,000, and users can change it using argument `nfeatures`. The default type is highly variable genes (HVGs), but users can use spatially variable genes by seting `type='SVGs'`, then `SPARK-X` will be used to choose SVGs.
#### Choose highly variable features
We can also directly use `selectVariableFeatures()` function to choose the top highly variable genes (HVGs). We found the performance of using SVGs or HVGs has no significant difference for the downstream analyses, such as dimension reduction and spatial clustering.
```{r eval = FALSE}
SRTProj <- selectVariableFeatures(SRTProj)
```
#### Choose spatially variable features
Additionally, `selectVariableFeatures()` function can also choose the top spatially variable genes (SVGs) based on SPARK-X. SPARK-X is embeded in the `DR.SC` R package and is implemented in the `FindSVGs()` function. By default, we return 2,000 genes per dataset. These will be used in downstream analysis, like probabilistic embeddings and spatial clustering. To further speed up the computation, we first use `FindVariableFeatures()` to select 5,000 highly variable genes, then use SPARK-X to choose top SVGs in the implementation of `FindSVGs()` function.
```{r eval =FALSE}
SRTProj <- selectVariableFeatures(SRTProj, type='SVGs', method='spark-x')
```
Calculate the adjcence matrix
```{r eval = FALSE}
## Obtain adjacence matrix
SRTProj <- AddAdj(SRTProj, platform = "Visium", force=T)
```
Visualize
```{r, eval = FALSE}
selectFeatures <- row.names(SRTProj@geneMetaData[which(SRTProj@geneMetaData$`isVGs#combine`),])[1:2]
EachExprSpaHeatMap(SRTProj, features=selectFeatures, title_name=T, quantVec = c(0.90,0.90))
```
# Dimension reduction
PCA is the most popular dimension reduction technique in single cell RNA sequencing (scRNA-seq) data and SRT data because of its simplicity, computational efficiency, and relatively comparable performance. SRTpipeline provided three versions of PCA: standard PCA (PCA), approximated PCA (APCA), and weighted PCA (WPCA) by the function `AddPCA` with default version as APCA for fastest computation. The 15-dimensional PCs are extracted by default, and users can use their own values.
After running `AddPCA`, we would see the output of `SRTProj` includes `PCA` in the `Low-dimensional embeddings` field.
```{r, eval = FALSE}
## APCA
SRTProj <- AddPCA(SRTProj, n_comp=15)
#accurate PCA
#SRTProj <- AddPCA(SRTProj, Method='PCA')
#weighted PCA
# SRTProj <- AddPCA(SRTProj, Method='WPCA')
SRTProj
```
# Clustering using SC-MEB
Some SRT clustering methods use Markov random field to model clusters of spots, such as SC-MEB and DR-SC. These approaches work extremely well for data with spatial smoothness and are a standard practice in SRT data. For this reason, SRTpipeline uses existing state-of-the-art clustering methods from SRT packages for clustering.
We have had the most success using the Markov random field model implemented by SC-MEB. In SRTpipeline, SC-MEB clustering is performed using the ` Cluster_SCMEB()` function.
```{r, eval = FALSE}
SRTProj <- Cluster_SCMEB(SRTProj, K= 5, reduction = "PCA")
SRTProj
```
## Visualization
To run tSNE in SRTpipeline we use the AddTSNE() function:
```{r, eval = FALSE}
SRTProj <- AddTSNE(SRTProj, n_comp = 2, reduction='PCA')
```
The `reduction` tells `AddTSNE()` function uses specified reduction in the `SRTProj@reductions` slot.
To plot the two-dimensional tSNE results, we use the `EmbedPlot()` function and pass the name of the tSNE embedding we just generated (“tSNE”). We can tell SRTpipeline how to color the spots by using `item` argument which tells SRTpipeline which matrix to use to find the specified metadata column provided to name.
`SRTpipeline` package offers a variety of visualization tools. We
visualized the inferred domain types on embeddings using two
components from tSNE. The tSNE plot showed the domain clusters were well segregated.
```{r, eval = FALSE}
p_pca_tsne2 <- EmbedPlot(SRTProj, item='cluster', plotEmbeddings = 'tSNE', cols=chooseColors(n_colors = 5), legend.position='right')
p_pca_tsne2
```
To run UMAP in SRTpipeline we use the AddUMAP() function. Frist, we evaluate the two-dimensional UMAPs.
```{r, eval = FALSE}
SRTProj <- AddUMAP(SRTProj, n_comp = 2, reduction='PCA')
p_pca_umap2 <- EmbedPlot(SRTProj, item='cluster', plotEmbeddings = 'UMAP', cols=chooseColors(n_colors = 5), legend.position='right')
p_pca_umap2
```
Next, we merge all plots into one figure.
```{r, eval = FALSE}
library(patchwork)
p_all <- p_pca_tsne2 +p_pca_umap2 + plot_layout(nrow=1, ncol=2)
p_all
```
To save the plot, we can use `write_fig()` function.
```{r, eval = FALSE}
write_fig(p_all, filename = 'SCMEB_plots.png')
```
Except for the embedding plots, SRTpipeline also provides a variaty of visualization functions.
First, we visualize the spatial distribution of cluster labels that shows the layer structure.
```{r, eval = FALSE}
## choose colors to function chooseColors
cols <- chooseColors(n_colors = 5)
EachClusterSpaHeatMap(SRTProj, cols=cols, legend.position='right',base_size=16)
# remove the border
# EachClusterSpaHeatMap(SRTProj, cols=cols, legend.position='right',base_size=16, border_col='white')
```
We plotted the heatmap of Pearson’s correlation coefcients of the PCA embeddings among
the detected domains shows the good separation of the estimated embeddings across
domains and the correlations between deeper layers were high, e.g., there were high
correlations between domain 2 and 3, while correlations among the separated layers were low, i.e., domain 1 and 4.
```{r, eval = FALSE}
CCHeatMap(SRTProj, reduction = 'PCA', grp_color=cols)
```
After adding the quantities for data visualization, the `SRTProject` object will have more information in the downstream analyses information.
Now, we print this `SRTProject` object to check it. We observed two components added in the slot `plotEmbeddings` (Embeddings for plotting): `tSNE`, `tSNE3`, `UMAP` and `UMAP3`.
```{r, eval = FALSE}
SRTProj
```
# DEG analysis
To do downstream analyses, we require to get the count data from h5file. The function `getGeneSpotData()` access the gene-by-spot matrix and other data in `SRTProj` and then return a `SpatialExperiment` object, including two assays: counts, logcounts; rowData: from `SRTProj@geneMetaData`; colData: from `SRTProj@cellMetaData`, `SRTProj@clusters` and sample_id; reducedDim: from `SRTProj@reductions`.
```{r, eval = FALSE}
spe <- getGeneSpotData(SRTProj)
spe
```
After obtain the spatial cluster labels using a clustering model, we can perform differentially expression analysis. First, we detect the DE genes for cluster 3.
```{r, eval = FALSE}
dat_deg <- FindDEGs(spe, cluster1='3')
subset(dat_deg, avg_log2FC >0.25)
```
We perform differential expression analysis for all clusters by using `FindAllDEGs()` function, then the DE genes' information is saved in a data.frame object `dat_degs`.
```{r, eval = FALSE}
dat_degs <- FindAllDEGs(spe)
dat_degs
```
We identify the significant DE genes by two criteria: (a) adjustd p-value less than 0.01 and (b) average log fold change greater than 0.25.
```{r, eval = FALSE}
degs_sig <- subset(dat_degs, p_val_adj < 0.01 & avg_log2FC>0.25)
degs_sig
```
In the following, we perform gene set enrichment analysis for the DE genes of each Domain identified by SC-MEB model using R package `gprofiler2`.
```{r, eval = FALSE}
library(gprofiler2)
termList <- list()
for(k in 1: 5){
# k <- 1
cat("k = ", k, '\n')
dat_degs_sub <- subset(degs_sig, cluster==k)
que1 <- dat_degs_sub$gene
gostres <- gost(query = que1, organism = "hsapiens", correction_method='fdr')
termList[[k]] <- gostres
}
head(termList[[1]]$result)
```
To understand the functions of the identified spatial domains by SC-MEB model, we compare the top significant biological process (BP) pathways in GO database for the DE genes from Domain 1 and 2. Here, we only show to visualize the significant BP pathways and users can explore the other databases such as KEGG and HPA.
```{r, eval = FALSE}
## Most commonly used databases
source_set <- c("GO:BP","GO:CC", "GO:MF", 'KEGG', "HPA")
cols <- c("steelblue3", "goldenrod", "brown3", "#f98866", "#CE6DBD")
## Here, we show GO:BP
source1 <- "GO:BP"
ss <- which(source_set==source1)
ntop = 5
names(cols) <- source_set
pList_enrich <- list()
for(ii in 1: 5){
## ii <- 3
message("ii=", ii)
gostres2 <- termList[[ii]]
dat1 <- subset(gostres2$result, term_size< 500)
dat1 <- get_top_pathway(dat1, ntop=ntop, source_set = source1)
dat1 <- dat1[complete.cases(dat1),]
dat1$nlog10P <- -log10(dat1$p_value)
pList_enrich[[ii]] <- barPlot_enrich(dat1[order(dat1$nlog10P),], source='source', 'term_name',
'nlog10P', cols=cols[source_set[ss]],base_size=14) + ylab("-log10(p-adj)")+
xlab("Biological terms") + ggtitle(paste0("Domain", ii))
}
drawFigs(pList_enrich[c(1,2)], layout.dim = c(2,1), common.legend = T, align='hv')
```
We take out the top DE genes for each cluster for visualization.
```{r, eval = FALSE}
library(dplyr)
n <- 3
dat_degs %>% as.data.frame %>%
group_by(cluster) %>%
top_n(n = n, wt = avg_log2FC) -> topGene
topGene
```
We visualize the DE genes for each cluster group by gene-by-cell heatmap using the `GCHeatMap()` function.
```{r, eval = FALSE}
cols_cluster <- chooseColors(n_colors = 5)
p1 <- EachGCHeatMap(spe, features=topGene$gene, grp_color = cols_cluster ,y_text_size=12)
p1
```
# Trajectory inference
Next, we performed trajectory inference using the PCA embeddings and domain labels estimated by SC-MEB model. The `EmbedPlot()` function can be used to visualize the inferred pseudotime on a specified embedding. If there is only one data batch, the function `EachEmbedPlot()` plays the same role as function `EmbedPlot()`.
```{r, eval = FALSE}
spe <- AddTrajectory(spe, reduction = 'PCA')
p1 <- EmbedPlot(spe, reduction = 'PCA' ,colour_by='PT')
p2 <- EmbedPlot(spe, reduction = 'Coord' ,colour_by='PT')
drawFigs(list(p1, p2),layout.dim = c(1,2),common.legend = TRUE,
legend.position = 'right')
# EachEmbedPlot(spe, reduction = 'PCA', colour_by='TrajPT')
# save(spe, SRTProj, file=paste0(SRTProj@projectMetadata$outputPath,"/SRTProj.rds"))
# load("F:/Research paper/IntegrateDRcluster/AnalysisCode/SRTpipeline/vignettes/DLPFC151672/SRTProj.rds")
```
# Other downstream analyses
<details>
<summary>**Session Info**</summary>
```{r}
sessionInfo()
```
</details>