LeukemiaProjection_Tutorial.Rmd

---
title: "Leukemia Projections"
output: html_notebook
---

Here we will map an example dataset from three leukemias sequenced in our study.
This tutorial will take approximately 10 minutes to run 

### Setup
 
```{r}
library(Seurat)
library(tidyverse)
library(symphony)
library(ggpubr)
library(patchwork)
library(RColorBrewer)
```

Install package from github

```{r}
## install dependencies that are not on CRAN
if(!require(BiocManager, quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("AUCell", "doMC"))
if(!require(devtools, quietly = TRUE)) install.packages("devtools")
devtools::install_github("jaredhuling/jcolors")
```
 
```{r}
## install BoneMarrowMap package
devtools::install_github('andygxzeng/BoneMarrowMap', force = TRUE)
library(BoneMarrowMap)
```

#### Download reference object and UMAP model

Set projection folder, Download reference object and UMAP model

```{r}
# Set directory to store projection reference files
projection_path = './'

# Download Bone Marrow Reference - 344 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrow_RefMap_SymphonyRef.rds', 
                    destfile = paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
# Download uwot model file - 221 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrow_RefMap_uwot_model.uwot', 
                    destfile = paste0(projection_path, 'BoneMarrow_RefMap_uwot_model.uwot'))

# Load Symphony reference
ref <- readRDS(paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
# Set uwot path for UMAP projection
ref$save_uwot_path <- paste0(projection_path, 'BoneMarrow_RefMap_uwot_model.uwot')
```


#### Visualize Bone Marrow Reference

If we want to visualize celltype labels or metadata from the BM Reference, we can create a Seurat Object from the symphony reference 
This will be memory efficient as it will not include gene expression counts, only the UMAP coordinates and the metadata including cell labels and sorting information

```{r, fig.height=5, fig.width=11}
ReferenceSeuratObj <- create_ReferenceObject(ref)

DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CellType_Annotation_formatted', 
        raster=FALSE, label=TRUE, label.size = 4)
```

We can visualize other annotations too, including cell cycle phase and lineage pseudotime estimates.

```{r, fig.height=3.5, fig.width=11}
p1 <- DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CyclePhase', raster=FALSE)
p2 <- FeaturePlot(ReferenceSeuratObj, reduction = 'umap', features = 'Pseudotime', raster=FALSE) 

p1 + p2
```


#### Load Leukemia scRNA-seq data.

We have confidently mapped leukemias spanning AML (incl. AMKL and AEL), B-ALL, MPAL, MDS, CML, MPN, and BDPCN, across sequencing technologies.
However, we cannot be confident in mapping T-ALL due to a lack of thymus-specific reference data on T cell precursor stages. Further, our tool does not discriminate between normal vs malignant cells, which is an important consideration particularly within low-blast count chronic diseases. 


As an example here, we are going to project scRNA-seq data from three diverse AML patients sequenced in our study. 

* pt_17844: MLL-AF9 translocation with a purely mature leukemia cell hierarchy
* pt_17746: NPM1c + DNMT3A + TET2 with a GMP-dominant leukemia cell hierarchy
* pt_30886: Complex Karyotype with a primitive leukemia cell hierarchy + extensive erythroid involvement 

Each patient sample was downsampled to 2500 cells to decrease runtime for this example. 

```{r}
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/ExampleQuery_AML_scRNAseq.rds', 
                    destfile = paste0(projection_path, 'ExampleQuery_AML_scRNAseq.rds'))

query <- readRDS('ExampleQuery_AML_scRNAseq.rds')
query
```


### Map the Query Data
Provide raw counts, metadata, and donor key. This should take <1 min
Calculate mapping error and perform QC to remove low quality cells with high mapping error

```{r}
# batch variable to correct in the query data, set as NULL if no batches in query
batchvar <- 'Patient'

# Map query dataset using Symphony (Kang et al 2021)
query <- map_Query_custom(
    exp_query = query@assays$RNA@counts, 
    metadata_query = query@meta.data,
    ref_obj = ref,
    vars = batchvar
)
```
In leukemia samples, the distribution of mapping error scores can vary broadly from sample to sample. In this context, we will want to threshold outliers with high mapping error on a per-sample basis. Typically, a threshold of 2, 2.5, or 3 MADs works well. 

In some cases where sequencing depth is very low (e.g. older datasets from first-generation scRNA-seq protocols), a more stringent threshold of even 1.5 may be warranted to eliminate cells with low mapping quality 

```{r, fig.height=3, fig.width=10}
# Run QC based on mapping error score, flag cells with mapping error >= 2.5 MADs above median
query <- query %>% calculate_MappingError(., reference = ref, MAD_threshold = 2.5, 
                                          threshold_by_donor = TRUE, donor_key = batchvar) # threshold mapping error on a per-sample basis.

# Plot distribution by patient to ensure you are catching the tail
query@meta.data %>% 
  ggplot(aes(x = mapping_error_score, fill = mapping_error_QC)) + 
  geom_histogram(bins = 200) + facet_wrap(.~get(batchvar))
```


```{r, fig.height=3, fig.width=10}
# Get QC Plots
QC_plots <- plot_MappingErrorQC(query)

# Plot together - If this is too crowded, can also just call "QC_plots" aloneto display one by one
patchwork::wrap_plots(QC_plots, ncol = 4, widths = c(0.8, 0.3, 0.8, 0.3))
```


This important step identifies a subset of cells with high mapping error from the query dataset that are either:

* not present within the reference, or
* have poor QC metrics (low RNA counts and low transcriptional diversity)

Sometimes, low quality cells may erroneously map to the orthochromatic erythroblast region as this cell type has very low transcriptional diversity. 
These low quality query cells do not have hemoglobin expression and are in fact mis-mapped; they will be flagged by the QC filter and excluded from cell type assignments.

**Please adjust the MAD_threshold (typically between 1 and 3) based on the distribution of your dataset to identify the outliers with low quality and high mapping error scores. This will improve your classifications and any downstream composition analysis**


```{r}
# # Optional step - remove outliers with high mapping error
# query <- subset(query, mapping_error_QC == 'Pass')
```

Optionally, outlier cells with high mapping error can also be removed at this stage.
For ease of integrating these mapped annotations with the rest of your analysis, we can choose to skip this step. If so, Final CellType and Pseudotime predictions will be assigned as NA for cells failing the mapping error QC threshold. 


### Cell Type Assignments
We will next use a KNN classifier to assign cell identity based on the 30 K-Nearest Neighbours from the reference map.
This label transfer step will take longer, potentially around 10 minutes for ~10,000 cells 

```{r, fig.height=5, fig.width=10}
# Predict Hematopoietic Cell Types by KNN classification
query <- predict_CellTypes(
  query_obj = query, 
  ref_obj = ref, 
  initial_label = 'initial_CellType', # celltype assignments before filtering on mapping QC
  final_label = 'predicted_CellType'  # celltype assignments with map QC failing cells assigned as NA
) 

DimPlot(subset(query, mapping_error_QC == 'Pass'), reduction = 'umap', group.by = c('predicted_CellType'), 
        raster=FALSE, label=TRUE, label.size = 4)
```


#### Pseudotime Annotations
We can also annotate each query cell based on their position along hematopoietic pseudotime. 
Query cells will be assigned a pseudotime score based on the 30 K-Nearest Neighbours from the reference map.
Since our Pseudotime KNN assignments are performed in UMAP space (more accurate than KNN on harmony components), this step is very fast (< 10s)

```{r, fig.height=3, fig.width=12}
# Predict Pseudotime values by KNN
query <- predict_Pseudotime(
  query_obj = query, 
  ref_obj = ref, 
  initial_label = 'initial_Pseudotime',  # pseudotime assignments before filtering on mapping QC
  final_label = 'predicted_Pseudotime'   # pseudotime assignments with map QC failing cells assigned as NA
)

# Visualize Hematopoietic Pseudotime in query data
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), features = c('predicted_Pseudotime'), split.by = 'Patient') & 
  scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
```

### Visualize Projection Density

Now let's visualize the density distribution of query cells across the hematopoietic hierarchy

```{r, fig.height=3, fig.width=12}
# Set batch/condition to be visualized individually
batch_key <- 'Patient'

# returns a list of plots for each donor from a pre-specified batch variable
projection_plots <- plot_Projection_byDonor(
  query_obj = query, 
  batch_key = batch_key, 
  ref_obj = ref, 
  Hierarchy_only = FALSE, # Whether to exclude T/NK/Plasma/Stromal cells 
  downsample_reference = TRUE, 
  downsample_frac = 0.25,   # down-sample reference cells to 25%; reduces figure file size
  query_point_size = 0.2,   # adjust size of query cells based on # of cells
  saveplot = TRUE, 
  save_folder = 'projectionFigures/'
)

# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots, ncol = 3)
```

We can also set Hierarchy_only = TRUE to remove T/NK/Plasma/Stromal cells and focus solely on the hematopoietic hierarchy.

```{r, fig.height=3, fig.width=10}
# Set batch/condition to be visualized individually
batch_key <- 'Patient'

# returns a list of plots for each donor from a pre-specified batch variable
projection_plots <- plot_Projection_byDonor(
  query_obj = query, 
  batch_key = batch_key, 
  ref_obj = ref, 
  Hierarchy_only = TRUE, # Whether to exclude T/NK/Plasma/Stromal cells 
  downsample_reference = TRUE, 
  downsample_frac = 0.25,   # down-sample reference cells to 25%; reduces figure file size
  query_point_size = 0.2,   # adjust size of query cells based on # of cells
  saveplot = TRUE, 
  save_folder = 'projectionFigures/'
)

# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots, ncol = 3)
```


### Get Composition data for each donor

Here, to study the abundance of each cell type within each donor, I focus on cells that were classified with a KNN prob > 0.5 (that is, >50% of nearest neighbours from the reference map agree on the assigned cell type). 

We can present this as a long table

```{r}
query_composition <- get_Composition(
  query_obj = query, 
  donor_key = 'Patient', 
  celltype_label = 'predicted_CellType', 
  mapQC_col = 'mapping_error_QC', 
  knn_prob_cutoff = 0.5, 
  return_type = 'long')

query_composition 
```

Or as a wide table with counts of # of cells

```{r}
query_composition <- get_Composition(
  query_obj = query, 
  donor_key = 'Patient', 
  celltype_label = 'predicted_CellType', 
  mapQC_col = 'mapping_error_QC', 
  knn_prob_cutoff = 0.5, 
  return_type = 'count')

query_composition 
```

Or as a wide table with proportion of each cell type within each donor

```{r}
query_composition <- get_Composition(
  query_obj = query, 
  donor_key = 'Patient', 
  celltype_label = 'predicted_CellType', 
  mapQC_col = 'mapping_error_QC', 
  knn_prob_cutoff = 0.5, 
  return_type = 'proportion')

query_composition 
```


```{r}
# Simple heatmap to visualize composition of projected samples
p <- query_composition %>% 
  # show celltypes present in >1% of total cells
  select(Patient, colnames(query_composition)[-1][colSums(query_composition[-1]) > 0.01]) %>% 
  # convert to matrix and display heatmap
  column_to_rownames('Patient') %>% data.matrix() %>% ComplexHeatmap::Heatmap()
p
```


### Extra: Score AML cells for enrichment of LSC signatures

Here we will load genesets derived from functional studies of LSC+ and LSC- fractions in diverse AML patients (Ng Nature 2017), and LSC+ vs LSC- fractions in patients with MLL translocations (Somervaille Cell Stem Cell 2009). The former tends to correspond to more primitive LSCs while the latter corresponds to a distinct signature of mature promonocytic LSCs.

Further, we will load additional genesets derived from marker genes of leukemia cell populations defined in van Galen et al Cell 2019 and Zeng et al Nature Medicine 2022. Notably, the LSPC-Quiescent population is highly associated with functional LSCs and relapse by deconvolution (Zeng 2022).

```{r}
curl::curl_download('https://raw.githubusercontent.com/andygxzeng/BoneMarrowMap_Extras/main/AML_CellType_Genesets.gmt', 
                    destfile = paste0(projection_path, 'AML_CellType_Genesets.gmt'))
AMLgenesets <- load_Genesets_gmt('AML_CellType_Genesets.gmt')
AMLgenesets %>% summary()
```

```{r}
# score genesets by AUCell and add to metadata
# typically I split the data into batches of 5k-10k cells to conserve memory
query <- score_Genesets_AUCell(query, genesets = AMLgenesets, nbatches = 1, ncores = 10, output = 'metadata')
query@meta.data
```

Using genes associated with functional LSC+ engraftment capacity (LSC104_Ng2016_UP) together with the LSPC-Quiescent signature help us to identify candidate LSCs within the data. 

```{r, fig.height=6, fig.width=12}
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), 
            features = c('LSPC_Quiescent_AUC', 'LSC104_Ng2016_UP_AUC'), 
            split.by = 'Patient', max.cutoff = 'q99', min.cutoff = 'q5') & 
  scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
```

It is no surprise that these are the most primitive cells earliest in Pseudotime. Note that this approach does not work for the MLL-AF9 samples wherein all leukemic cells are mature myeloid; in this context the MLL_LSC_Somervaille_2009_UP signature may help us identify functional "LSCs" capable of engraftment within the mature myeloid populations (typically late GMPs or Early ProMonocytes). 

```{r, fig.height=3, fig.width=12}
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), 
            features = c('MLL_LSC_Somervaille2009_UP_AUC'), 
            split.by = 'Patient', max.cutoff = 'q99', min.cutoff = 'q5') & 
  scale_color_gradientn(colors = rev(brewer.pal(11, 'RdBu')))
```

As expected, ProMonocytes within the MLL-AF9 sample exhibit enrichment for the MLL_LSC_Somervaille_2009_UP, in line with the mature myeloid origin of this patient's disease.


### Save projection results

This will save a csv file with the mapped annotations for each cell (mapping error scores, umap coordinates, predicted Cell Type, and predicted Pseudotime). We also have the option to save AUCell scores, provided that they are in the metadata. 

```{r}
# Save CellType Annotations and Projected UMAP coordinates
save_ProjectionResults(
  query_obj = query, 
  celltype_label = 'predicted_CellType', 
  celltype_KNNprob_label = 'predicted_CellType_prob', 
  pseudotime_label = 'predicted_Pseudotime', 
  save_AUCell_scores = TRUE,
  file_name = 'querydata_projected_labeled.csv')
```


## Note on Downstream Analysis: 

For downstream analysis, you can use the projected celltype labels to help annotate any leukemia cell clusters generated through unsupervised dimensionality reduction and clustering from individual patients. Sometimes, unsupervised analysis will yield clusters corresponding to different developmental states and other times it may yield clusters corresponding to distinct subclones within the patient. Along with tools like inferCNV for patients with known cytogenetic abnormalities, this can be integrated to visualize cellular hierarchies at the level of individual subclones. 

Additional information from pseudotime projection and scoring of LSC-specific signatures can help identify candidate LSCs within the data. 

Finally, for cohorts with many patients composition analysis can be performed to understand how leukemia cell hierarchies vary with patient characteristics and therapy response / relapse. I hope this tutorial was useful and feel free to comment with any questions you may have around projection of leukemia cells or downstream analysis.