vignettes/articles/clustering.Rmd

---
title: "Clustering"
author: "Yichen Wang"
output:
  html_document:
    toc: true
    toc_depth: 5
bibliography: references.bib
csl: ieee.csl
---

```{r develSetting, include=FALSE}
knitr::opts_chunk$set(warning = FALSE)
```

````{=html}
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
````

## Introduction

SCTK allows users to do clustering on their data with various methods. The clustering methods that SCTK adopts are mainly graph based algorithms, such as Louvain and Leiden, which partitions a Shared Nearest-Neighbor (SNN) graph. The graph can be constructed by Scran [@Aaron2016] or Seurat [@Stuart2019], and the graph clustering is done by [igraph](https://igraph.org/r/) or Seurat, respectively. Meanwhile, we also support traditional method such as K-Means [@Forgy65].  

SCTK allows different types data matrix as input. When using Scran's SNN graph based clustering algorithms, users can specify either an expression matrix or a low-dimension representation as the input, while for Seurat's methods and K-Means, users can only use the reduced dimensions. For any of the options, we recommend using a low-dimension representation as the input. 

To view detailed instructions on how to use these methods, please select 'Interactive Analysis' for using clustering methods in Shiny application or 'Console Analysis' for using these methods in R console from the tabs below: 

## Workflow Guide

````{=html}
<div class="tab">
  <button class="tablinks" onclick="openTab(event, 'interactive')" id="ia-button">Interactive Analysis</button>
  <button class="tablinks" onclick="openTab(event, 'console')" id="console-button">Console Analysis</button>
</div>

<div id="interactive" class="tabcontent">
````

**Entry of The Panel**

From anywhere of the UI, the panel for clustering can be accessed from the top navigation panel at the circled tab shown below.  

![Entry](ui_screenshots/clustering/clst_ui_entry.png)\

The UI consists of the parameter setting panel on the left and the visualization panel on the right. 

![UI](ui_screenshots/clustering/clst_ui.png)\

**Run Clustering**

![Algo](ui_screenshots/clustering/clst_ui_algo.png)\

User will choose an algorithm to run the clustering at the very first step. The slide-down option list is constructed in a grouped style. Each group lists all the algorithm that a dependency (shown with grey text) supports. By selecting an algorithm that belongs to different groups, the parameter settings will change.

Method specific parameter settings are shown and explained below:

![scran](ui_screenshots/clustering/clst_ui_param.png)\

````{=html}
 <style>
div.offset { background-color:inherit ; border-radius: 5px; margin-left: 15px;}
</style>
<div class = "offset">
<details>
  <summary><b>Scran SNNs</b></summary>
````  

For the choices in the algorithm list, please refer to `cluster_louvain()`, `cluster_leiden()`, `cluster_walktrap()`, `cluster_infomap()`, `cluster_fast_greedy()`, `cluster_label_prop()`, and `cluster_leading_eigen()` in the `igraph` package documentation.  

- Data matrix selection - selection input **"Select Input Matrix"**. Scran SNN method allows various types of data matrices. Either a full-sized or a subsetted expression matrix, technically called `assay` or `altExp`, respectively, or a reduced dimension, technically `reducedDim`, is allowed. 
- Resolution related parameters:
    - Numeric input **"K value"**. `K` is an integer scalar specifying the number of nearest neighbors to consider during graph construction. Considering more neighbors results in larger groups. (See parameter `k` of `scran::buildSNNGraph()`)
    - Numeric input **"Resolution"**, only seen when choosing "leiden" algorithm. Higher resolutions lead to more smaller communities, while lower resolutions lead to fewer larger communities. (See parameter `resolution_parameter` of `igraph::cluster_leiden()`)
    - Numeric input **"Steps"**, only seen when choosing "walktrap" algorithm. The length of the random walks to perform. Less steps lead to fewer larger communities, and take more time for computation. (See parameter `steps` of `igraph::cluster_walktrap()`)
- Algorithm detail setting 
    - selection input **"Edge Weight Type"**. Users can specify the type of weighting scheme to use for shared neighbors. (See parameter `type` of `scran::buildSNNGraph()`)
    - Selection input **"Objective Function"**, only seen when choosing "leiden" algorithm. The objective function, "CPM" [@Ronhovde2010] or "modularity" [@Newman2004], evaluates the partitioning of the graph, and the algorithm attempts to maximize the score. 
- Component number setting - numeric input **"Number of Components"**. If users specify a low-dimensional matrix as input, the method will be applied on the top components. When an `assay` or `altExp` is chosen, the algorithm will perform a PCA on the matrix and obtain the top PCs for downstream use. This number could be determined with the help of an Elbow plot in the ["Dimension Reduction"](dimensionality_reduction.html) section. For better visualization, we suggest generating [2D embedding](2d_embedding.html) with the same number of components from the selected reduced dimensions. 

````{=html}
</details>
<details>
  <summary><b>K-Means</b></summary>
````  

When the selected algorithm belongs to "K-Means" group, the parameter settings will look like the figure above.  

- Data matrix selection - selection input **"Select a ReducedDim"**. Here the data matrix allowed can only be a reduced dimension, technically called `reducedDim` and should be obtained in advance from [Dimensionality Reduction](dimensionality_reduction.html) tab. 
- Cluster number determining - numeric input **"Number of Centers (Clusters)"**. User will decide the exact number of clusters here. In term of K-means algorithm, the number of cluster centroids. 
- Iteration limit setting - numeric input **"Max Number of Iterations"**. User will here decide the maximum number of iterations to run. 
- Algorithm detail setting - numeric input **"Number of Random Sets"**. Kmeans attempts multiple initial configurations and reports on the best one. Here users will set the number of initial configurations to use. 

````{=html}
</details>
<details>
  <summary><b>Seurat</b></summary>
````  

When the selected algorithm belongs to "Seurat" group, the parameter settings will look like the figure above.  

- Data matrix selection - selection input **"Select a ReducedDim"**. Here the data matrix allowed can only be a reduced dimension, technically called `reducedDim` and should be obtained in advance from ["Feature Selection & Dimension Reduction" tab](dimensionality_reduction.html). 
- Component number setting - numeric input **"How Many Dimensions to Use"**.
- Algorithm detail setting - checkbox input **"Group Singletons"**.
- Resolution setting - numeric input **"Resolution"**.

````{=html}
</details>
</div>
<br>
````  

**Visualization**

![](ui_screenshots/clustering/clst_ui_vis.png)\

The visualization is implemented with a scatter plot of a chosen low-dimension embedding, colored with a chosen cluster assignment (the newly generated result by default). SCTK by default uses the matrix used for clustering calculation for the visualization. However, usually, users use a PCA for clustering while needing a UMAP/t-SNE embedding for visualization. To change the embedding used, please click on the blue settings (cog) button on the left-top corner.  

````{=html}
<div class = "offset">
<details>
  <summary><b>Settings</b></summary>
````

- Cluster annotation selection - radio button selection **"Select from Current Results"** and **"Select from All Present Annotation"**. This selection updates the dropdown menu with the available annotations. Only a successful run of clustering in the current SCTK session adds an option for the new result to "Select from Current Results", while all the cell annotation stored in background (i.e. `colData(sce)`) are accessible if "Select from All Present Annotation" is chosen. 
- Dimension Reduction selection - selection input **"Use Reduction"**. User have to choose a reduction here for plotting. If there is not yet any option, user can obtain one from [Dimension Reduction](dimensionality_reduction.html) tab. 

````{=html}
</details>
</div>
</div>

<div id="console" class="tabcontent">
````

The clustering methods mentioned in the Introduction can be easily applied to any preprocessed SCE object. Meanwhile, there are other types of clustering methods supported by SCTK, but usually dependent to a curated workflow, such as the [Celda Curated Workflow](celda_curated_workflow.html). Here we present the usage of the three functions only.  

**Basic Parameters**

For each of the three functions, the common things are:

- The input SCE object - `inSCE`.
- The `colData` column name to save the cluster labels - `clusterName`
- The specific algorithm type supported by the dependency - `algorithm`

As for another essential parameter, the data matrix used for running the algorithm is a must. Here, users should notice that:

- For `runScranSNN()`, either a feature expression matrix or a dimension reduction is acceptable. Users need to use: 
    - `useAssay` for a full-sized expression data (`assay`)
    - `useAltExp` and `altExpAssay` for a subsetted expression data (`altExp`)
    - `useReducedDim` for a dimension reduction. (`reducedDim`)
    - `useAltExp` and `altExpRedDim` for a reduced dimension stored in the `altExp` object. 
- For `runKMeans()`, only a dimension reduction is acceptable. Users need to use `useReducedDim` to pass the argument.
- For `runSeuratFindClusters()`, it is included within the Seurat Curated Workflow, yet usable as an independent function. However, it will be complicated to use this way. It is recommended to invoke the Seurat clustering functionality in the UI since it is automated there. We will still present the complicated workflow later. 

When using `useReducedDim` as data input, users should be aware of the number of top components being passed to the underlying method. This is usually controlled by argument `nComp`. To determine this parameter, users can refer to the explanation in [Dimensionality Reduction Documentation](dimensionality_reduction.html). Meanwhile, for better visualization, we also recommend users to generate [2D Embedding](2d_embedding.html) using the same number of top components from the same dimension reduction result. 

Other parameters are method specific, please refer to the function manual pages for the detail. The typical command call for for each method is shown below:

```{R cluster_cmd, eval = FALSE}
# Scran method
sce <- runScranSNN(sce, useReducedDim = "PCA", clusterName = "scranSNN")
# K-means method
sce <- runKMeans(sce, nCenters = 9, useReducedDim = "PCA", clusterName = "KMeans")
# Seurat method
sce <- runSeuratFindClusters(sce, useAssay = "seuratScaledData")
```

**Example**

To demonstrate simple and clear examples, here we use the "PBMC-3k" dataset from "10X" which can be easily imported with SCTK functions. The preprocessing only includes necessary steps before getting cluster labels (i.e. QC and filtering are excluded). 

````{=html}
<div class = "offset">
<details>
  <summary><b>Preprocessing</b></summary>
````

```{R clst_prep, message=FALSE, warning=FALSE, fig.align="center", results='hide'}
library(singleCellTK)
sce <- importExampleData("pbmc3k")
sce <- runNormalization(sce, outAssayName = "logcounts", normalizationMethod = "logNormCounts")
# Default HVG method is "vst" from Seurat
sce <- runFeatureSelection(sce, useAssay = "counts")
sce <- setTopHVG(sce, featureSubsetName = "hvf")
sce <- runDimReduce(sce, useAssay = "logcounts", useFeatureSubset = "hvf", scale = TRUE, reducedDimName = "PCA")
# Optional visualization
sce <- runDimReduce(sce, method = "scaterUMAP", useReducedDim = "PCA", reducedDimName = "UMAP", nComponents = 10)
plotDimRed(sce, "UMAP")
```

````{=html}
</details>
````

````{=html}
<details>
  <summary><b>Example with Scran SNN</b></summary>
  <details>
    <summary><b>Method specific parameters</b></summary>
````
- `k`, the number of nearest neighbors used to construct the graph. Smaller value indicates higher resolution and larger number of clusters. 
- `nComp`, the number of components to use when `useAssay` or `useAltExp` is specified. WON'T work with `useReducedDim`. 
- For `weightType` and `algorithm`, users should choose from a given list of options, which can be found with `?runScranSNN`. For the introduction of those options, please refer to `scran::buildSNNGraph()` and [igraph](https://rdrr.io/cran/igraph/). 
- Different algorithms, supported by [igraph](https://rdrr.io/cran/igraph/), might allows different parameters, such as `objective_function` and `resolution_parameter` for Leiden [@Traag2019], `steps` for Walktrap [@Pons2005]. Please also refer to function reference for detail. 
````{=html}
</details>
````

```{R clst_ScranSNN, eval=TRUE, message=FALSE, warning=FALSE, fig.align="center", cache=TRUE, results='hide'}

# Most of the time, you will want to use a dimensionality reduction for clustering:
sce <- runScranSNN(inSCE = sce, useReducedDim = "PCA", nComp = 10, clusterName = "scranSNN_PCA")
plotSCEDimReduceColData(inSCE = sce, colorBy = "scranSNN_PCA", reducedDimName = "UMAP")

# Alternatively, clustering can be run directly from an assay by setting "useAssay" and "useReducedDim" to 'NULL':
sce <- runScranSNN(sce, useAssay = "logcounts", useReducedDim = NULL, clusterName = "scranSNN_logcounts")
plotSCEDimReduceColData(sce, colorBy = "scranSNN_logcounts", reducedDimName = "UMAP")

```

````{=html}
</details>
````

````{=html}
<details>
  <summary><b>Example with K-Means</b></summary>
  <details>
    <summary><b>Method specific parameters</b></summary>
````

- `nCenters`, the number of final clusters. This is required.
- `nIter`, the maximum number of iterations allowed.
- `nStart`, the number of random sets to choose. Since K-Means is an algorithm with reasonable randomness, the function allows attempting multiple initial configurations and reports on the best one.
- `seed`, The seed for the random number generator. 
- For `algorithm`, users should choose from a given list of options, which can be found with `?runKMeans`. For the introduction of those options, please refer to `?stats::kmeans`. 

````{=html}
</details>
````

```{R clst_kmeans, eval=TRUE, message=FALSE, warning=FALSE, fig.align="center", cache=TRUE, results='hide'}
sce <- runKMeans(inSCE = sce, useReducedDim = "PCA", nComp = 10, nCenters = 10, clusterName = "kmeans")
plotSCEDimReduceColData(inSCE = sce, colorBy = "kmeans", reducedDimName = "UMAP")
```

````{=html}
</details>
<details>
  <summary><b>Example with Seurat</b></summary>
````

Seurat clustering method is recommended to be used in the [Seurat Curated Workflow](seurat_curated_workflow.html) in console analysis. When users preprocessed the dataset with other approaches, the clustering with Seurat function can be complicated, since some sorts of Seurat specific metadata is always needed.  

```{R clst_seurat2, eval=TRUE, message=FALSE, warning=FALSE, fig.align="center", cache=TRUE, results='hide'}
# Prepare what seurat function needs
library(Seurat)
pca <- reducedDim(sce, "PCA")[,1:10]
rownames(pca) <- gsub("_", "-", rownames(pca))
stdev <- as.numeric(attr(pca, "percentVar"))
new_pca <- CreateDimReducObject(embeddings = pca, assay = "RNA", stdev = stdev, key = "PC_")
# Then we can take the Seurat Curated Workflow independent dimension 
# reduction as an input. 
sce <- runSeuratFindClusters(sce, externalReduction = new_pca)
plotSCEDimReduceColData(inSCE = sce, colorBy = "Seurat_louvain_Resolution0.8", reducedDimName = "UMAP")
```

````{=html}
</details>
</div>
</div>
<script>
document.getElementById("ia-button").click();
</script>
</body>
````

## References