-
Notifications
You must be signed in to change notification settings - Fork 0
/
LeukemiaProjection_Tutorial.Rmd
380 lines (274 loc) · 16 KB
/
LeukemiaProjection_Tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
---
title: "Leukemia Projections"
output: html_notebook
---
Here we will map an example dataset from three leukemias sequenced in our study.
This tutorial will take approximately 10 minutes to run
### Setup
```{r}
library(Seurat)
library(tidyverse)
library(symphony)
library(ggpubr)
library(patchwork)
library(RColorBrewer)
```
Install package from github
```{r}
## install dependencies that are not on CRAN
if(!require(BiocManager, quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("AUCell", "doMC"))
if(!require(devtools, quietly = TRUE)) install.packages("devtools")
devtools::install_github("jaredhuling/jcolors")
```
```{r}
## install BoneMarrowMap package
devtools::install_github('andygxzeng/BoneMarrowMap', force = TRUE)
library(BoneMarrowMap)
```
#### Download reference object and UMAP model
Set projection folder, Download reference object and UMAP model
```{r}
# Set directory to store projection reference files
projection_path = './'
# Download Bone Marrow Reference - 344 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrow_RefMap_SymphonyRef.rds',
destfile = paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
# Download uwot model file - 221 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrow_RefMap_uwot_model.uwot',
destfile = paste0(projection_path, 'BoneMarrow_RefMap_uwot_model.uwot'))
# Load Symphony reference
ref <- readRDS(paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
# Set uwot path for UMAP projection
ref$save_uwot_path <- paste0(projection_path, 'BoneMarrow_RefMap_uwot_model.uwot')
```
#### Visualize Bone Marrow Reference
If we want to visualize celltype labels or metadata from the BM Reference, we can create a Seurat Object from the symphony reference
This will be memory efficient as it will not include gene expression counts, only the UMAP coordinates and the metadata including cell labels and sorting information
```{r, fig.height=5, fig.width=11}
ReferenceSeuratObj <- create_ReferenceObject(ref)
DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CellType_Annotation_formatted',
raster=FALSE, label=TRUE, label.size = 4)
```
We can visualize other annotations too, including cell cycle phase and lineage pseudotime estimates.
```{r, fig.height=3.5, fig.width=11}
p1 <- DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CyclePhase', raster=FALSE)
p2 <- FeaturePlot(ReferenceSeuratObj, reduction = 'umap', features = 'Pseudotime', raster=FALSE)
p1 + p2
```
#### Load Leukemia scRNA-seq data.
We have confidently mapped leukemias spanning AML (incl. AMKL and AEL), B-ALL, MPAL, MDS, CML, MPN, and BDPCN, across sequencing technologies.
However, we cannot be confident in mapping T-ALL due to a lack of thymus-specific reference data on T cell precursor stages. Further, our tool does not discriminate between normal vs malignant cells, which is an important consideration particularly within low-blast count chronic diseases.
As an example here, we are going to project scRNA-seq data from three diverse AML patients sequenced in our study.
* pt_17844: MLL-AF9 translocation with a purely mature leukemia cell hierarchy
* pt_17746: NPM1c + DNMT3A + TET2 with a GMP-dominant leukemia cell hierarchy
* pt_30886: Complex Karyotype with a primitive leukemia cell hierarchy + extensive erythroid involvement
Each patient sample was downsampled to 2500 cells to decrease runtime for this example.
```{r}
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/ExampleQuery_AML_scRNAseq.rds',
destfile = paste0(projection_path, 'ExampleQuery_AML_scRNAseq.rds'))
query <- readRDS('ExampleQuery_AML_scRNAseq.rds')
query
```
### Map the Query Data
Provide raw counts, metadata, and donor key. This should take <1 min
Calculate mapping error and perform QC to remove low quality cells with high mapping error
```{r}
# batch variable to correct in the query data, set as NULL if no batches in query
batchvar <- 'Patient'
# Map query dataset using Symphony (Kang et al 2021)
query <- map_Query_custom(
exp_query = query@assays$RNA@counts,
metadata_query = query@meta.data,
ref_obj = ref,
vars = batchvar
)
```
In leukemia samples, the distribution of mapping error scores can vary broadly from sample to sample. In this context, we will want to threshold outliers with high mapping error on a per-sample basis. Typically, a threshold of 2, 2.5, or 3 MADs works well.
In some cases where sequencing depth is very low (e.g. older datasets from first-generation scRNA-seq protocols), a more stringent threshold of even 1.5 may be warranted to eliminate cells with low mapping quality
```{r, fig.height=3, fig.width=10}
# Run QC based on mapping error score, flag cells with mapping error >= 2.5 MADs above median
query <- query %>% calculate_MappingError(., reference = ref, MAD_threshold = 2.5,
threshold_by_donor = TRUE, donor_key = batchvar) # threshold mapping error on a per-sample basis.
# Plot distribution by patient to ensure you are catching the tail
query@meta.data %>%
ggplot(aes(x = mapping_error_score, fill = mapping_error_QC)) +
geom_histogram(bins = 200) + facet_wrap(.~get(batchvar))
```
```{r, fig.height=3, fig.width=10}
# Get QC Plots
QC_plots <- plot_MappingErrorQC(query)
# Plot together - If this is too crowded, can also just call "QC_plots" aloneto display one by one
patchwork::wrap_plots(QC_plots, ncol = 4, widths = c(0.8, 0.3, 0.8, 0.3))
```
This important step identifies a subset of cells with high mapping error from the query dataset that are either:
* not present within the reference, or
* have poor QC metrics (low RNA counts and low transcriptional diversity)
Sometimes, low quality cells may erroneously map to the orthochromatic erythroblast region as this cell type has very low transcriptional diversity.
These low quality query cells do not have hemoglobin expression and are in fact mis-mapped; they will be flagged by the QC filter and excluded from cell type assignments.
**Please adjust the MAD_threshold (typically between 1 and 3) based on the distribution of your dataset to identify the outliers with low quality and high mapping error scores. This will improve your classifications and any downstream composition analysis**
```{r}
# # Optional step - remove outliers with high mapping error
# query <- subset(query, mapping_error_QC == 'Pass')
```
Optionally, outlier cells with high mapping error can also be removed at this stage.
For ease of integrating these mapped annotations with the rest of your analysis, we can choose to skip this step. If so, Final CellType and Pseudotime predictions will be assigned as NA for cells failing the mapping error QC threshold.
### Cell Type Assignments
We will next use a KNN classifier to assign cell identity based on the 30 K-Nearest Neighbours from the reference map.
This label transfer step will take longer, potentially around 10 minutes for ~10,000 cells
```{r, fig.height=5, fig.width=10}
# Predict Hematopoietic Cell Types by KNN classification
query <- predict_CellTypes(
query_obj = query,
ref_obj = ref,
initial_label = 'initial_CellType', # celltype assignments before filtering on mapping QC
final_label = 'predicted_CellType' # celltype assignments with map QC failing cells assigned as NA
)
DimPlot(subset(query, mapping_error_QC == 'Pass'), reduction = 'umap', group.by = c('predicted_CellType'),
raster=FALSE, label=TRUE, label.size = 4)
```
#### Pseudotime Annotations
We can also annotate each query cell based on their position along hematopoietic pseudotime.
Query cells will be assigned a pseudotime score based on the 30 K-Nearest Neighbours from the reference map.
Since our Pseudotime KNN assignments are performed in UMAP space (more accurate than KNN on harmony components), this step is very fast (< 10s)
```{r, fig.height=3, fig.width=12}
# Predict Pseudotime values by KNN
query <- predict_Pseudotime(
query_obj = query,
ref_obj = ref,
initial_label = 'initial_Pseudotime', # pseudotime assignments before filtering on mapping QC
final_label = 'predicted_Pseudotime' # pseudotime assignments with map QC failing cells assigned as NA
)
# Visualize Hematopoietic Pseudotime in query data
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), features = c('predicted_Pseudotime'), split.by = 'Patient') &
scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
```
### Visualize Projection Density
Now let's visualize the density distribution of query cells across the hematopoietic hierarchy
```{r, fig.height=3, fig.width=12}
# Set batch/condition to be visualized individually
batch_key <- 'Patient'
# returns a list of plots for each donor from a pre-specified batch variable
projection_plots <- plot_Projection_byDonor(
query_obj = query,
batch_key = batch_key,
ref_obj = ref,
Hierarchy_only = FALSE, # Whether to exclude T/NK/Plasma/Stromal cells
downsample_reference = TRUE,
downsample_frac = 0.25, # down-sample reference cells to 25%; reduces figure file size
query_point_size = 0.2, # adjust size of query cells based on # of cells
saveplot = TRUE,
save_folder = 'projectionFigures/'
)
# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots, ncol = 3)
```
We can also set Hierarchy_only = TRUE to remove T/NK/Plasma/Stromal cells and focus solely on the hematopoietic hierarchy.
```{r, fig.height=3, fig.width=10}
# Set batch/condition to be visualized individually
batch_key <- 'Patient'
# returns a list of plots for each donor from a pre-specified batch variable
projection_plots <- plot_Projection_byDonor(
query_obj = query,
batch_key = batch_key,
ref_obj = ref,
Hierarchy_only = TRUE, # Whether to exclude T/NK/Plasma/Stromal cells
downsample_reference = TRUE,
downsample_frac = 0.25, # down-sample reference cells to 25%; reduces figure file size
query_point_size = 0.2, # adjust size of query cells based on # of cells
saveplot = TRUE,
save_folder = 'projectionFigures/'
)
# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots, ncol = 3)
```
### Get Composition data for each donor
Here, to study the abundance of each cell type within each donor, I focus on cells that were classified with a KNN prob > 0.5 (that is, >50% of nearest neighbours from the reference map agree on the assigned cell type).
We can present this as a long table
```{r}
query_composition <- get_Composition(
query_obj = query,
donor_key = 'Patient',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'long')
query_composition
```
Or as a wide table with counts of # of cells
```{r}
query_composition <- get_Composition(
query_obj = query,
donor_key = 'Patient',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'count')
query_composition
```
Or as a wide table with proportion of each cell type within each donor
```{r}
query_composition <- get_Composition(
query_obj = query,
donor_key = 'Patient',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = 0.5,
return_type = 'proportion')
query_composition
```
```{r}
# Simple heatmap to visualize composition of projected samples
p <- query_composition %>%
# show celltypes present in >1% of total cells
select(Patient, colnames(query_composition)[-1][colSums(query_composition[-1]) > 0.01]) %>%
# convert to matrix and display heatmap
column_to_rownames('Patient') %>% data.matrix() %>% ComplexHeatmap::Heatmap()
p
```
### Extra: Score AML cells for enrichment of LSC signatures
Here we will load genesets derived from functional studies of LSC+ and LSC- fractions in diverse AML patients (Ng Nature 2017), and LSC+ vs LSC- fractions in patients with MLL translocations (Somervaille Cell Stem Cell 2009). The former tends to correspond to more primitive LSCs while the latter corresponds to a distinct signature of mature promonocytic LSCs.
Further, we will load additional genesets derived from marker genes of leukemia cell populations defined in van Galen et al Cell 2019 and Zeng et al Nature Medicine 2022. Notably, the LSPC-Quiescent population is highly associated with functional LSCs and relapse by deconvolution (Zeng 2022).
```{r}
curl::curl_download('https://raw.githubusercontent.com/andygxzeng/BoneMarrowMap_Extras/main/AML_CellType_Genesets.gmt',
destfile = paste0(projection_path, 'AML_CellType_Genesets.gmt'))
AMLgenesets <- load_Genesets_gmt('AML_CellType_Genesets.gmt')
AMLgenesets %>% summary()
```
```{r}
# score genesets by AUCell and add to metadata
# typically I split the data into batches of 5k-10k cells to conserve memory
query <- score_Genesets_AUCell(query, genesets = AMLgenesets, nbatches = 1, ncores = 10, output = 'metadata')
query@meta.data
```
Using genes associated with functional LSC+ engraftment capacity (LSC104_Ng2016_UP) together with the LSPC-Quiescent signature help us to identify candidate LSCs within the data.
```{r, fig.height=6, fig.width=12}
FeaturePlot(subset(query, mapping_error_QC == 'Pass'),
features = c('LSPC_Quiescent_AUC', 'LSC104_Ng2016_UP_AUC'),
split.by = 'Patient', max.cutoff = 'q99', min.cutoff = 'q5') &
scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
```
It is no surprise that these are the most primitive cells earliest in Pseudotime. Note that this approach does not work for the MLL-AF9 samples wherein all leukemic cells are mature myeloid; in this context the MLL_LSC_Somervaille_2009_UP signature may help us identify functional "LSCs" capable of engraftment within the mature myeloid populations (typically late GMPs or Early ProMonocytes).
```{r, fig.height=3, fig.width=12}
FeaturePlot(subset(query, mapping_error_QC == 'Pass'),
features = c('MLL_LSC_Somervaille2009_UP_AUC'),
split.by = 'Patient', max.cutoff = 'q99', min.cutoff = 'q5') &
scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
```
As expected, ProMonocytes within the MLL-AF9 sample exhibit enrichment for the MLL_LSC_Somervaille_2009_UP, in line with the mature myeloid origin of this patient's disease.
### Save projection results
This will save a csv file with the mapped annotations for each cell (mapping error scores, umap coordinates, predicted Cell Type, and predicted Pseudotime). We also have the option to save AUCell scores, provided that they are in the metadata.
```{r}
# Save CellType Annotations and Projected UMAP coordinates
save_ProjectionResults(
query_obj = query,
celltype_label = 'predicted_CellType',
celltype_KNNprob_label = 'predicted_CellType_prob',
pseudotime_label = 'predicted_Pseudotime',
save_AUCell_scores = TRUE,
file_name = 'querydata_projected_labeled.csv')
```
## Note on Downstream Analysis:
For downstream analysis, you can use the projected celltype labels to help annotate any leukemia cell clusters generated through unsupervised dimensionality reduction and clustering from individual patients. Sometimes, unsupervised analysis will yield clusters corresponding to different developmental states and other times it may yield clusters corresponding to distinct subclones within the patient. Along with tools like inferCNV for patients with known cytogenetic abnormalities, this can be integrated to visualize cellular hierarchies at the level of individual subclones.
Additional information from pseudotime projection and scoring of LSC-specific signatures can help identify candidate LSCs within the data.
Finally, for cohorts with many patients composition analysis can be performed to understand how leukemia cell hierarchies vary with patient characteristics and therapy response / relapse. I hope this tutorial was useful and feel free to comment with any questions you may have around projection of leukemia cells or downstream analysis.