
# Protocol 1: assessment of cell type replicability with unsupervised MetaNeighbor

Protocol 1 demonstrates how to compute and visualize cluster replicability across 4 human pancreas datasets. We show how to install MetaNeighbor, how to download and reformat the datasets with the SingleCellExperiment package, and how to compute and interpret MetaNeighbor AUROCs. All code blocks can be run in the R command line, RStudio, RMarkdown notebooks or a Jupyter notebook with an R kernel.

## Step 0: Installation of MetaNeighbor and packages used in the protocol

1. We start by installing the latest MetaNeighbor package from the Gillis lab GitHub page.

In [None]:
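# Install devtools if it is not already available.
if (!requireNamespace('devtools', quietly=TRUE)) {
  install.packages('devtools', quiet=TRUE)
}
# Default branch (alternative):
#devtools::install_github("gillislab/MetaNeighbor")
# Branch installed here:
devtools::install_github("gillislab/MetaNeighbor", ref="utility_dev")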

2. We also install the following packages, which are not necessary to run MetaNeighbor itself, but are needed to run the protocol.

In [None]:
to_install = c("scRNAseq", "tidyverse", "org.Hs.eg.db")
# Check which packages are already installed.
installed = sapply(to_install, requireNamespace, quietly = TRUE)
if (sum(!installed) > 0) {
    # Install BiocManager first if needed, then the missing packages.
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
        BiocManager::install()
    }
    BiocManager::install(to_install[!installed])
}

## Step 1: creation of a merged SingleCellExperiment dataset

3. We consider 4 pancreatic datasets along with their independent annotation (from the original publication). MetaNeighbor expects a gene x cell matrix encapsulated in a SummarizedExperiment object. We recommend the SingleCellExperiment (SCE) package, because it is able to handle sparse matrix formats. We load the pancreas datasets using the scRNAseq package, which provides annotated datasets that are already in the SingleCellExperiment format:

In [None]:
library(scRNAseq)
my_data <- list(
    baron = BaronPancreasData(),
    lawlor = LawlorPancreasData(),
    seger = SegerstolpePancreasData(),
    muraro = MuraroPancreasData()
)

4. MetaNeighbor's "mergeSCE" function can be used to merge multiple SingleCellExperiment objects. Importantly, the output object will be restricted to genes, metadata columns and assays that are common to every dataset. Before we use "mergeSCE", we need to make sure that gene and metadata information align across datasets.

We start by checking if gene information aligns (stored in the "rownames" slot of the SCE object).

In [None]:
lapply(my_data, function(x) head(rownames(x), 3))

Two datasets (Baron, Segerstolpe) use gene symbols, one dataset (Muraro) combines symbols with chromosome information (to avoid duplicate gene names) and the last dataset (Lawlor) uses Ensembl identifiers. We convert all gene names to unique gene symbols. We start with the Muraro dataset, where gene symbols are stored in the "rowData" slot of the SCE object:

In [None]:
rownames(my_data$muraro) <- rowData(my_data$muraro)$symbol
my_data$muraro <- my_data$muraro[!duplicated(rownames(my_data$muraro)),]

Because symbols alone can be duplicated (the reason chromosome information was appended in the first place), we also removed all duplicated symbols. Next, we convert Ensembl IDs to symbols in the Lawlor dataset, removing all IDs with no match and all duplicated symbols:

In [None]:
library(org.Hs.eg.db)
symbols <- mapIds(org.Hs.eg.db, keys=rownames(my_data$lawlor), keytype="ENSEMBL", column="SYMBOL")
keep <- !is.na(symbols) & !duplicated(symbols)
my_data$lawlor <- my_data$lawlor[keep,]
rownames(my_data$lawlor) <- symbols[keep]
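
The filtering logic used above can be checked on a toy lookup table before applying it to a full dataset. The sketch below uses made-up identifiers and symbols (they are not real Ensembl IDs) and only illustrates how `is.na` and `duplicated` combine: IDs without a symbol are dropped, and when a symbol appears several times only its first occurrence is kept.

```r
# Toy ID-to-symbol lookup (hypothetical IDs and mappings, for illustration only).
symbols <- c(ENSG01 = "INS", ENSG02 = NA, ENSG03 = "GCG", ENSG04 = "INS")

# Keep IDs that have a symbol and whose symbol has not been seen before.
keep <- !is.na(symbols) & !duplicated(symbols)
print(keep)

# Row names that would survive the conversion.
unique_symbols <- unname(symbols[keep])
print(unique_symbols)
```

Note that `duplicated` keeps the first occurrence of each symbol, so the retained row depends on the original row order.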

5. We now turn our attention to metadata, which is stored in the "colData" slot of the SCE objects. Here we need to make sure that the column that contains cell type information is labeled identically in all datasets.

In [None]:
lapply(my_data, function(x) colnames(colData(x)))

Two datasets have the cell type information in the "cell type" column, the other two in the "label" column. For clarity, we add a "cell type" column in the latter two datasets.

In [None]:
my_data$baron$"cell type" <- my_data$baron$label
my_data$muraro$"cell type" <- my_data$muraro$label

6. Last, we check that count matrices, stored in the "assays" slot, have identical names.

In [None]:
lapply(my_data, function(x) names(assays(x)))
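
Since the merged object is restricted to what is common to every dataset, it can help to compute that intersection explicitly before merging. The sketch below does this on toy column names (hypothetical, only loosely modelled on the datasets above) using base R.

```r
# Toy metadata column names per dataset (made up for illustration).
col_names <- list(
    baron  = c("donor", "label", "cell type"),
    lawlor = c("age", "cell type", "sex"),
    seger  = c("individual", "cell type"),
    muraro = c("label", "donor", "cell type")
)

# Columns present in every dataset: only these would survive a merge
# that keeps the intersection of metadata columns.
common_cols <- Reduce(intersect, col_names)
print(common_cols)
```

On the real objects, the same idea applies to the output of the "lapply" calls shown above, for metadata columns as well as assay names.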

7. Now that gene, cell type and count matrix information is aligned across datasets, we can create a merged dataset. "mergeSCE" takes a list of SCE objects as an input and outputs a single SCE object.

In [None]:
library(MetaNeighbor)
fused_data = mergeSCE(my_data)
dim(fused_data)
head(colData(fused_data))

The new dataset contains 15,295 common genes, 15,793 cells and two metadata columns: a concatenated "cell type" column, and "study_id", a column created by "mergeSCE" containing the name of the original study (corresponding to the names provided in the "my_data" list).

8. To avoid having to recreate the merged object, we recommend saving it as an RDS file.

In [None]:
saveRDS(fused_data, "merged_pancreas.rds")

## Step 2: Hierarchical cluster replicability analysis

9. We load the MetaNeighbor (analysis) and the SingleCellExperiment (data handling) libraries, as well as the previously created pancreas dataset.

In [None]:
library(MetaNeighbor)
library(SingleCellExperiment)

pancreas_data = readRDS("merged_pancreas.rds")

10. To perform neighbor voting, MetaNeighbor builds a cell-cell similarity network, defined as the Spearman correlation over a user-defined set of genes. In our experience, the best results are obtained with genes that are highly variable across datasets, which can be selected with the "variableGenes" function.

In [None]:
system.time({
global_hvgs = variableGenes(dat = pancreas_data, exp_labels = pancreas_data$study_id)
})
length(global_hvgs)

The function returns a list of 600 genes that were detected as highly variable in each of the 4 datasets.
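
To make the notion of a cell-cell similarity network concrete, here is a minimal base R sketch on a made-up expression matrix. It only illustrates Spearman correlation between cells over a gene set; it is not MetaNeighbor's implementation, which may differ in its details.

```r
# Toy expression matrix: 5 genes (rows) x 4 cells (columns), values are made up.
expr <- matrix(
    c(1, 2, 3, 4, 5,    # cell_a
      2, 4, 6, 8, 10,   # cell_b: same gene ranking as cell_a
      5, 4, 3, 2, 1,    # cell_c: reversed gene ranking
      1, 3, 2, 5, 4),   # cell_d: partially shuffled ranking
    nrow = 5,
    dimnames = list(paste0("gene", 1:5), c("cell_a", "cell_b", "cell_c", "cell_d"))
)

# cor() correlates columns, so this yields a cell x cell similarity matrix.
network <- cor(expr, method = "spearman")
print(round(network, 2))
```

Because Spearman correlation depends only on ranks, cell_a and cell_b are perfectly similar even though their expression values differ by a factor of two.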

11. The data and a set of biologically meaningful genes are all we need to run MetaNeighbor and obtain cluster similarities.

In [None]:
system.time({
aurocs = MetaNeighborUS(var_genes = global_hvgs,
                        dat = pancreas_data,
                        study_id = pancreas_data$study_id,
                        cell_type = pancreas_data$"cell type",
                        fast_version = TRUE)
})

Cluster similarities are defined as an Area Under the ROC curve (AUROC), which ranges between 0 and 1. The cross-dataset voting framework makes the score batch-effect free, which sets it apart from an average correlation between clusters.
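
As a reminder of what an AUROC measures, the sketch below computes it from ranks (the Mann-Whitney formulation) on made-up voting scores. The helper name `auroc` and the scores are assumptions introduced for illustration; this is not the code MetaNeighbor runs internally.

```r
# AUROC from ranks (Mann-Whitney formulation); ties receive average ranks.
auroc <- function(scores, is_positive) {
    n_pos <- sum(is_positive)
    n_neg <- sum(!is_positive)
    ranks <- rank(scores)
    (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy voting scores for 6 test cells; the first 3 belong to the cluster of interest.
votes <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)
is_target <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# 8 of the 9 (target, non-target) pairs are ordered correctly, so the AUROC is 8/9.
toy_auroc <- auroc(votes, is_target)
print(toy_auroc)
```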

12. For ease of interpretation the results can be visualized as a symmetric heatmap, where rows and columns are clusters from all datasets.

In [None]:
plotHeatmap(aurocs, cex = 0.5)

In the heatmap, the color of each square indicates the proximity of a pair of clusters, ranging from blue (low similarity) to red (high similarity). For example, "baron|gamma" (2nd row) is highly similar to "seger|gamma" (3rd column from the right) but very dissimilar to "muraro|duct" (middle column). To group similar clusters together, "plotHeatmap" applies hierarchical clustering on the AUROC matrix. On the heatmap, we see two red blocks that indicate clear hierarchical structure in the data, with endocrine cell types clustering together on one side (e.g., alpha, beta, gamma) and mesenchymal cells on the other side (e.g., acinar, ductal, endothelial). Note that each red block is composed of smaller red blocks, indicating that clusters can be matched at an even higher resolution. We also see some off-diagonal patterns (e.g., lawlor|Gamma/PP, lawlor|Delta), which generally indicate the presence of doublets or contamination (presence of cells from other cell types), but what matters here is the clear presence of red blocks, which is a strong indicator of replicability.

13. To identify pairs of replicable clusters, we rely on a simple heuristic: a pair of clusters is replicable if the two clusters are reciprocal top hits (they preferentially vote for each other) and their AUROC exceeds a given threshold (in our experience, 0.95 is a good heuristic value).

In [None]:
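# The threshold is passed explicitly so that it matches the value discussed above.
topHits(aurocs, dat = pancreas_data, study_id = pancreas_data$study_id,
        cell_type = pancreas_data$"cell type", threshold = 0.95)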

We find a long list of replicable clusters within endocrine and mesenchymal cell types. This list provides strong evidence that these cell types are robust, as they are identified across all datasets with high AUROC.
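
The reciprocal top hit heuristic can be illustrated on a small made-up AUROC matrix. The sketch below is a simplified base R version written for this example (cluster names, values and the masking of within-study entries with NA are assumptions); it is not the implementation of "topHits".

```r
# Toy AUROC matrix (values made up); within-study entries are set to NA so that
# the best partner of a cluster always comes from the other study.
clusters <- c("study1|alpha", "study1|beta", "study2|alpha", "study2|beta")
auroc_mat <- matrix(
    c(NA,   0.30, 0.98, 0.20,
      0.30, NA,   0.25, 0.90,
      0.98, 0.25, NA,   0.35,
      0.20, 0.90, 0.35, NA),
    nrow = 4, byrow = TRUE, dimnames = list(clusters, clusters)
)

# Best-scoring partner of each cluster (which.max ignores NA entries).
best_partner <- clusters[apply(auroc_mat, 1, which.max)]
names(best_partner) <- clusters
best_score <- apply(auroc_mat, 1, max, na.rm = TRUE)

# A pair is reported when the best-hit relation is mutual
# and the AUROC reaches the threshold.
threshold <- 0.95
is_reciprocal <- best_partner[best_partner] == clusters
hits <- data.frame(
    cluster = clusters,
    partner = unname(best_partner),
    auroc = unname(best_score)
)[is_reciprocal & best_score >= threshold, ]
print(hits)
```

In this toy example both the alpha and the beta pairs are reciprocal, but only the alpha pair passes the 0.95 threshold.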

14. In the case where there is a clear structure in the data (endocrine vs mesenchymal here), we can refine AUROCs by splitting the data. AUROCs have a simple interpretation: an AUROC of 0.6 indicates that cells from a given cell type are ranked in front of 60% of other test cells. However, this interpretation is out-group dependent: because endocrine cells represent ~65% of cells, even unrelated mesenchymal cell types will have an AUROC > 0.65, just because they will always be ranked in front of endocrine cells.
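
The outgroup dependence described above can be reproduced with a small numerical example. The sketch below uses made-up scores and a hypothetical `auroc` helper: a cluster that is ranked below every other mesenchymal cell still obtains an AUROC above 0.65 on the full test set, simply because it is ranked above all endocrine cells.

```r
# AUROC from ranks (Mann-Whitney formulation); ties receive average ranks.
auroc <- function(scores, is_positive) {
    n_pos <- sum(is_positive)
    n_neg <- sum(!is_positive)
    ranks <- rank(scores)
    (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy test set of 100 cells (made up): the 65 endocrine cells receive the lowest
# votes, the 35 mesenchymal cells receive the highest votes.
scores <- 1:100
group <- rep(c("endocrine", "mesenchymal"), times = c(65, 35))

# Cluster of interest: the 10 lowest-scoring mesenchymal cells.
is_target <- scores %in% 66:75

# Full outgroup: the cluster beats all 65 endocrine cells, so AUROC = 65/90.
full_auroc <- auroc(scores, is_target)

# Outgroup restricted to mesenchymal cells: the cluster is ranked last, so AUROC = 0.
keep <- group == "mesenchymal"
restricted_auroc <- auroc(scores[keep], is_target[keep])

print(c(full = full_auroc, restricted = restricted_auroc))
```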

By starting with the full datasets, we uncovered the global structure in the data. However, to evaluate replicability of endocrine cell types and reduce dataset composition effects, we can make the assessment more stringent by restricting the outgroup to close cell types, i.e. by keeping only endocrine subtypes. We split cell types in two by using the "splitClusters" function and retain only endocrine cell types:

In [None]:
level1_split = splitClusters(aurocs, k = 2)
level1_split
first_split = level1_split[[2]]

By outputting "level1_split" (not shown here), we find that the clusters are cleanly split between mesenchymal and endocrine, and that endocrine clusters are in the second element of the list.
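
Splitting clusters amounts to cutting a hierarchical clustering of the AUROC matrix into k groups. The sketch below shows the generic idea with base R on a made-up matrix; the distance (1 - AUROC) and the average linkage are assumptions chosen for this example, and "splitClusters" may use different settings internally.

```r
# Toy AUROC matrix (values made up) with two blocks of similar clusters.
clusters <- c("s1|alpha", "s2|alpha", "s1|beta", "s1|ductal", "s2|ductal")
endocrine <- c("s1|alpha", "s2|alpha", "s1|beta")
other <- c("s1|ductal", "s2|ductal")
auroc_mat <- matrix(0.2, nrow = 5, ncol = 5, dimnames = list(clusters, clusters))
auroc_mat[endocrine, endocrine] <- 0.9
auroc_mat[other, other] <- 0.9
diag(auroc_mat) <- 1

# Turn similarities into distances, cluster, then cut the tree into k = 2 groups.
tree <- hclust(as.dist(1 - auroc_mat), method = "average")
membership <- cutree(tree, k = 2)
groups <- split(names(membership), membership)
print(groups)
```

As in the protocol, the order of the groups is arbitrary, so the output should be inspected before picking the element that contains the clusters of interest.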

15. We repeat the MetaNeighbor analysis on endocrine cells only. First, we subset the data to the endocrine cell types (stored in "first_split").

In [None]:
to_keep = makeClusterName(pancreas_data$study_id, pancreas_data$"cell type") %in% first_split
subdata = pancreas_data[, to_keep]
dim(subdata)

The new dataset contains the 9,341 putative endocrine cells.
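
The subsetting step relies on matching per-cell cluster names against the selected clusters. The sketch below mimics it on toy annotations, assuming that cluster names take the "study|cell type" form seen in the heatmap labels (e.g., "baron|gamma"); the annotations and the selection are made up.

```r
# Toy per-cell annotations (made up).
study_id  <- c("baron", "baron", "lawlor", "lawlor", "muraro")
cell_type <- c("alpha", "ductal", "Alpha", "Acinar", "duct")

# Assumption: cluster names follow the "study|cell type" pattern.
cluster_name <- paste(study_id, cell_type, sep = "|")

# Clusters retained after the split (hypothetical selection).
selected <- c("baron|alpha", "lawlor|Alpha")

# Logical vector with one entry per cell, usable to subset the columns of an SCE object.
to_keep <- cluster_name %in% selected
print(cluster_name[to_keep])
```

Matching is exact and case-sensitive, so "lawlor|Alpha" and "lawlor|alpha" would be treated as different clusters.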

16. To focus on variability that is specific to endocrine cells, we re-pick highly variable genes:

In [None]:
var_genes = variableGenes(dat = subdata, exp_labels = subdata$study_id)

17. Finally we recompute cluster similarities and visualize AUROCs.

In [None]:
system.time({
aurocs = MetaNeighborUS(var_genes = var_genes,
                        dat = subdata, fast_version = TRUE,
                        study_id = subdata$study_id,
                        cell_type = subdata$"cell type")
})
plotHeatmap(aurocs, cex = 0.7)

The resulting heatmap illustrates an example of a strong set of replicating clusters: when the assessment becomes more stringent (restriction to closely related cell types), the similarity of replicating clusters remains strong (AUROC ~ 1 for alpha, beta, gamma, delta and epsilon cells) while the cross-cluster similarity decreases (shift from red to blue, e.g., the similarity of alpha and beta clusters has shifted from orange/red to dark blue) by virtue of zooming in on a subpart of the dataset.

18. We can continue to zoom in as long as there are at least two cell types per dataset:

In [None]:
level2_split = splitClusters(aurocs, k = 3)
my_split = level2_split[[3]]
keep_cell = makeClusterName(pancreas_data$study_id, pancreas_data$"cell type") %in% my_split
subdata = pancreas_data[, keep_cell]
var_genes = variableGenes(dat = subdata, exp_labels = subdata$study_id)
length(var_genes)
aurocs = MetaNeighborUS(var_genes = var_genes,
                        dat = subdata, fast_version = TRUE,
                        study_id = subdata$study_id,
                        cell_type = subdata$"cell type")
plotHeatmap(aurocs, cex = 1)

Here we removed the alpha and beta cells (representing close to 85% of endocrine cells) and validated that, even when restricting to neighboring cell types, there is still a clear distinction between delta, gamma and epsilon cells (AUROC ~ 1).

## Step 3: stringent assessment of replicability with one-vs-best AUROCs

In the previous section, we made the replicability assessment progressively more stringent by selecting increasingly specific subsets of related cell types. As an alternative, we provide the "one_vs_best" parameter, which offers similar results without having to restrict the dataset by hand. In this scoring mode, MetaNeighbor automatically identifies the two closest matching clusters in each test dataset and computes an AUROC based on the voting results for cells from the closest match against cells from the second closest match. Essentially, we are asking how easily a cluster can be distinguished from its closest neighbor.
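
The difference between one-vs-all and one-vs-best scoring can be sketched on made-up votes from a single training cluster. The code below is a simplified illustration with a hypothetical `auroc` helper and toy cluster labels; MetaNeighbor's actual scoring may differ in its details.

```r
# AUROC from ranks (Mann-Whitney formulation); ties receive average ranks.
auroc <- function(scores, is_positive) {
    n_pos <- sum(is_positive)
    n_neg <- sum(!is_positive)
    ranks <- rank(scores)
    (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy votes cast by one training cluster on the cells of one test dataset (made up).
votes <- c(0.95, 0.90, 0.85,   # test cluster "gamma"
           0.80, 0.88, 0.60,   # test cluster "delta"
           0.10, 0.20, 0.15)   # test cluster "ductal"
test_cluster <- rep(c("gamma", "delta", "ductal"), each = 3)

# One-vs-all AUROC of each test cluster, used to find the two closest matches.
one_vs_all <- sapply(unique(test_cluster), function(cl) auroc(votes, test_cluster == cl))
ranked <- names(sort(one_vs_all, decreasing = TRUE))
best <- ranked[1]
second <- ranked[2]

# One-vs-best: keep only cells from the two closest clusters and score the best match.
keep <- test_cluster %in% c(best, second)
one_vs_best <- auroc(votes[keep], test_cluster[keep] == best)

print(round(c(one_vs_all[best], one_vs_best = one_vs_best), 3))
```

In this toy example the best match scores 17/18 against all cells but only 8/9 against its closest neighbor, because the easy outgroup ("ductal") no longer contributes.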

19. To obtain one-vs-best AUROCs, we run the same command as before with two additional parameters: "one_vs_best = TRUE" and "symmetric_output = FALSE".

In [None]:
system.time({
best_hits = MetaNeighborUS(var_genes = global_hvgs,
                           dat = pancreas_data,
                           study_id = pancreas_data$study_id,
                           cell_type = pancreas_data$"cell type",
                           fast_version = TRUE,
                           one_vs_best = TRUE, symmetric_output = FALSE)
})
plotHeatmap(best_hits, cex = 0.5)

The interpretation of the heatmap is slightly different compared to one-vs-all AUROCs. First, since we only compare the two closest clusters, most cluster combinations are not tested (NAs, shown in gray on the heatmap). Second, by setting "symmetric_output=FALSE", we broke the symmetry of the heatmap: train clusters are shown as columns and test clusters are shown as rows. Since each cluster is only tested against two clusters in each test dataset (closest and second closest match), we have 8 values per column (2 per dataset).

This representation helps to rapidly identify a cluster's closest hits as well as their closest outgroup. For example, ductal cells (2nd red square from the top right) strongly match with each other (one-vs-best AUROC>0.8) and acinar cells are their closest outgroup (blue segments in the same column). The non-symmetric view also makes it clear when best hits are not reciprocal. For example, mast cells (first two columns) heavily vote for "lawlor|Stellate" and "muraro|mesenchymal", but this vote is not reciprocal. This pattern indicates that the mast cell type is missing in the Lawlor and Muraro datasets (or that there are only a few mast cells that have been wrongly assigned to another cell type).

20. When using one-vs-best AUROCs, we recommend extracting replicating clusters as meta-clusters. Clusters are part of the same meta-cluster if they are reciprocal best hits. Note that if cluster 1 is the reciprocal best hit of 2 and 3, all three clusters are part of the same meta-cluster, even if 2 and 3 are not reciprocal best hits. To further filter for strongly replicating clusters, we specify an AUROC threshold (in our experience, 0.7 is a strong one-vs-best AUROC threshold).

In [None]:
mclusters = extractMetaClusters(best_hits, threshold = 0.7)
scoreMetaClusters(mclusters, best_hits)
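
The meta-cluster rule described above (clusters linked by reciprocal best hits end up together, even if the link is indirect) corresponds to finding connected components in a graph. The sketch below illustrates that rule in base R on hypothetical cluster names and pairs; it is not the implementation of "extractMetaClusters".

```r
# Reciprocal best-hit pairs (hypothetical): c1 pairs with both c2 and c3.
pairs <- rbind(c("c1", "c2"), c("c1", "c3"), c("c4", "c5"))
clusters <- c("c1", "c2", "c3", "c4", "c5", "c6")

# Start with every cluster in its own group, then merge groups linked by a pair.
group <- setNames(seq_along(clusters), clusters)
for (i in seq_len(nrow(pairs))) {
    a <- group[[pairs[i, 1]]]
    b <- group[[pairs[i, 2]]]
    group[group == b] <- a
}

# Each remaining group label defines one meta-cluster.
meta_clusters <- split(names(group), group)
print(unname(meta_clusters))
```

Here c2 and c3 land in the same meta-cluster through c1, while c6 has no reciprocal hit and stays on its own.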

The "scoreMetaClusters" function provides a good summary of meta-clusters, ordering cell types by the number of datasets in which they replicate, then by average AUROC. We find 12 cell types that have strong support across at least 2 datasets, with 7 cell types replicating across all 4 datasets. 8 cell types are tagged as "outlier", as they had no strong match. These cell types usually contain doublets, low quality cells or contaminated cell types. The replicability structure described here can be summarized as an UpSet plot.

In [None]:
plotUpset(mclusters)

Meta-clusters can also be visualized as heatmaps (called "cell-type badges") with the "plotMetaClusters" function (full output not shown here). Each badge shows an AUROC heatmap restricted to each specific meta-cluster. These badges help diagnose cases where AUROCs are lower in a specific train or test dataset. For example, the "muraro|duct" cell type has systematically lower AUROCs, likely indicating the presence of contaminating cells in another cell type (probably "muraro|unclear", referring to the original heatmap).

In [None]:
pdf("meta_clusters.pdf")
plotMetaClusters(mclusters, best_hits)
dev.off()

21. The last visualization is an alternative representation of the AUROC heatmap as a graph, which is particularly useful for large datasets. In this graph, top votes (AUROC > 0.5) are shown in black, while outgroup votes (AUROC < 0.5) are shown in orange. To highlight close calls, we recommend keeping only strong outgroup votes, here with AUROC >= 0.3.

In [None]:
cluster_graph = makeClusterGraph(best_hits, low_threshold = 0.3)
plotClusterGraph(cluster_graph, pancreas_data$study_id, pancreas_data$"cell type", size_factor=3, legend_cex=1.5)

We note that there are several orange edges, indicating that some cell types had two close matches. To investigate the origin of these close calls, we take "baron|epsilon" as our cluster of interest (coi), query its closest neighbors with "extendClusterSet", then zoom in on its subgraph with "subsetClusterGraph".

In [None]:
coi = "baron|epsilon"
coi = extendClusterSet(cluster_graph, initial_set = coi, max_neighbor_distance = 2)
subgraph = subsetClusterGraph(cluster_graph, coi)
plotClusterGraph(subgraph, pancreas_data$study_id, pancreas_data$"cell type", size_factor=5, legend_cex=2)
best_hits[coi, coi]

Here the explanation for the orange edges is relatively straightforward: the epsilon cell type seems to be missing in the Lawlor dataset, so votes from "baron|epsilon" were equally split between "lawlor|Gamma/PP" and "lawlor|Alpha".

In general, the cluster graph can be used to understand how meta-clusters are extracted and why some clusters are tagged as outliers, and to diagnose problems where the resolution of cell types differs across datasets.