analysis/harmony.Rmd

---
title: "Getting started with harmony"
date: "`r Sys.Date()`"
output:
  workflowr::wflow_html:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# https://stackoverflow.com/questions/30237310/setting-work-directory-in-knitr-using-opts-chunksetroot-dir-doesnt-wor
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
library(tidyverse)
library(patchwork)
library(harmony)
library(Seurat)
```

## Quickstart

Follow the [quickstart tutorial](https://htmlpreview.github.io/?https://github.com/immunogenomics/harmony/blob/master/doc/quickstart.html)

```{r install_seurat, eval=FALSE}
install.packages("harmony")
```

Load {harmony}.

```{r load_seurat}
library("harmony")
packageVersion("harmony")
```

### Data

We library normalized the cells, log transformed the counts, and scaled the genes. Then we performed PCA and kept the top 20 PCs. The PCA embeddings and meta data are available as part of this package.

```{r cell_lines}
data(cell_lines)
V <- cell_lines$scaled_pcs
meta_data <- cell_lines$meta_data

str(cell_lines)
```

Table of cell types.

```{r cell_type_table}
table(cell_lines$meta_data$cell_type)
```

Table of datasets.

```{r dataset_table}
table(cell_lines$meta_data$dataset)
```

### Analysis

Initially, the cells cluster by both dataset (left) and cell type (right). The quickstart guide uses the `do_scatter()` function, which is [missing](https://github.com/immunogenomics/harmony/issues/239). We can simply plot the first two PCs using {ggplot2}.

Plot PC1 versus PC2.

```{r plot_pc1_pc2, fig.width=10, fig.height=7}
my_df <- data.frame(PC1 = V$X1, PC2 = V$X2, dataset = meta_data$dataset, cell_type = meta_data$cell_type)

ggplot(my_df, aes(PC1, PC2, colour = dataset)) +
  geom_point() +
  theme_minimal() +
  ggtitle("Before harmony") -> p1

ggplot(my_df, aes(PC1, PC2, colour = cell_type)) +
  geom_point() +
  theme_minimal() -> p2

p1 + p2
```

Let's run Harmony to remove the influence of dataset-of-origin from the cell embeddings.

```{r run_harmony_quickstart, fig.width=10, fig.height=7}
harmony_embeddings <- harmony::RunHarmony(
    V, meta_data, 'dataset', verbose=FALSE
)

my_df2 <- data.frame(PC1 = harmony_embeddings[, 1], PC2 = harmony_embeddings[, 2], dataset = meta_data$dataset, cell_type = meta_data$cell_type)

ggplot(my_df2, aes(PC1, PC2, colour = dataset)) +
  geom_point() +
  theme_minimal() +
  ggtitle("After harmony") -> p1

ggplot(my_df2, aes(PC1, PC2, colour = cell_type)) +
  geom_point() +
  theme_minimal() -> p2

p1 + p2
```

## Using harmony with Seurat

Following the [Using harmony with Seurat](https://github.com/immunogenomics/harmony/blob/master/doc/Seurat.Rmd) tutorial, which describes how to use harmony in Seurat v5 single-cell analysis workflows. Also, it will provide some basic downstream analyses demonstrating the properties of harmonised cell embeddings and a brief explanation of the exposed algorithm parameters.

### Data

For this demo, we will be aligning two groups of PBMCs [Kang et al., 2017](https://doi.org/10.1038/nbt.4042):

* Control PBMCs
* Stimulated PBMCs treated with interferon beta.

```{r pbmc_stim}
data("pbmc_stim")
str(pbmc.ctrl)
str(pbmc.stim)
```

The full dataset used for this vignette have been upload to [Zenodo](https://zenodo.org/record/8164711) but currently [does not work](https://github.com/immunogenomics/harmony/issues/248#issuecomment-2051755519) with newer versions of R.

### Create Seurat object

Create a Seurat object with treatment conditions in the metadata.

```{r create_seurat_object}
pbmc <- CreateSeuratObject(
  counts = cbind(pbmc.stim, pbmc.ctrl),
  project = "Kang",
  min.cells = 5
)

pbmc@meta.data$stim <- c(rep("STIM", ncol(pbmc.stim)), rep("CTRL", ncol(pbmc.ctrl)))
pbmc
```

### Seurat workflow

Harmony works on an existing matrix with cell embeddings and outputs its transformed version with the datasets aligned according to some user-defined experimental conditions. By default, harmony will look up the `pca` cell embeddings and use these to run harmony. Therefore, it assumes that the Seurat object has these embeddings already precomputed.

We will run the Seurat workflow to generate the embeddings.

Here, using `Seurat::NormalizeData()`, we will be generating a union of highly variable genes using each condition (the control and stimulated cells). These features are going to be subsequently used to generate the 20 PCs with `Seurat::RunPCA()`.

Note that the defaults for `NormalizeData` are:

* `normalization.method` = "LogNormalize"
* `scale.factor` = 10000

```{r seurat_norm_hvg}
pbmc <- NormalizeData(pbmc, verbose = FALSE)

pbmc <- FindVariableFeatures(object = pbmc, selection.method = "vst", nfeatures = 2000)

cell_by_cond <- split(row.names(pbmc@meta.data), pbmc@meta.data$stim)

vfeatures <- lapply(cell_by_cond, function(cells){
  FindVariableFeatures(object = pbmc[, cells], selection.method = "vst", nfeatures = 2000) |>
    VariableFeatures()
})

VariableFeatures(pbmc) <- unique(unlist(vfeatures))
length(VariableFeatures(pbmc))
```

Scale and perform PCA.

```{r seurat_scale_pca}
pbmc <- ScaleData(pbmc, verbose = FALSE) |>
  RunPCA(features = VariableFeatures(pbmc), npcs = 20, verbose = FALSE)
```

`RunHarmony()` is a generic function designed to interact with Seurat objects. To run harmony on a Seurat object after it has been normalised, only one argument needs to be specified which contains the batch covariate located in the metadata. For this vignette, further parameters are specified to align the dataset but the minimum parameters are shown in the snippet below and is **not run**.

```{r run_harmony_minimal, eval=FALSE}
pbmc <- RunHarmony(pbmc, "stim")
```

Here, we will be running harmony with some indicative parameters and plotting the convergence plot to illustrate some of the under the hood functionality. By setting `plot_converge=TRUE`, harmony will generate a plot with its objective showing the flow of the integration. Each point represents the cost measured after a clustering round. Different colors represent different Harmony iterations which is controlled by `max_iter` (assuming that `early_stop=FALSE`). Here `max_iter=10` and up to 10 correction steps are expected. However, `early_stop=TRUE` so harmony will stop after the cost plateaus.

```{r run_harmony_pbmc}
pbmc <- RunHarmony(pbmc, "stim", plot_convergence = TRUE, nclust = 50, max_iter = 10, early_stop = TRUE)
```

`RunHarmony` has several parameters accessible to users which are outlined below.

* `object` (required) - The Seurat object. This vignette assumes Seurat objects are version 5.
* `group.by.vars` (required) - A character vector that specifies all the experimental covariates to be corrected/harmonized by the algorithm.

When using `RunHarmony()` with Seurat, harmony will look up the `group.by.vars` metadata fields in the Seurat Object metadata. For example, given the `pbmc[["stim"]]` exists as the stim condition, setting `group.by.vars="stim"` will perform integration of these samples accordingly. If you want to integrate on another variable, it needs to be present in Seurat object's meta.data. To correct for several covariates, specify them in a vector: `group.by.vars = c("stim", "new_covariate")`.

* `reduction.use` - The cell embeddings to be used for the batch alignment. This parameter assumes that a reduced dimension already exists in the reduction slot of the Seurat object.  By default, the `pca` reduction is used.
* `dims.use` - Optional parameter which can use a name vector to select specific dimensions to be harmonised.
* `nclust` - is a positive integer. Under the hood, harmony applies k-means soft-clustering. For this task, `k` needs to be determined. `nclust` corresponds to `k`. The harmonisation results and performance are not particularly sensitive for a reasonable range of this parameter value. If this parameter is not set, harmony will autodetermine this based on the dataset size with a maximum cap of 200. For dataset with a vast amount of different cell types and batches this pamameter may need to be determined manually.
* `sigma` - a positive scalar that controls the soft clustering probability assignment of single-cells to different clusters. Larger values will assign a larger probability to distant clusters of cells resulting in a different correction profile. Single-cells are assigned to clusters by their euclidean distance $d$ to some cluster center $Y$ after cosine normalisation which is defined in the range [0,4]. The clustering probability of each cell is calculated as $e^{-\frac{d}{\sigma}}$ where $\sigma$ is controlled by the `sigma` parameter. Default value of `sigma` is 0.1 and it generally works well since it defines probability assignment of a cell in the range $[e^{-40}, e^0]$. Larger values of `sigma`  restrict the dynamic range of probabilities that can be assigned to cells. For example, `sigma=1` will yield a probabilities in the range of $[e^{-4}, e^0]$.
* `theta` - `theta` is a positive scalar vector that determines the coefficient of harmony's diversity penalty for each corrected experimental covariate. In challenging experimental conditions, increasing theta may result in better integration results. Theta is an expontential parameter of the diversity penalty, thus setting `theta=0` disables this penalty while increasing it to greater values than 1 will perform more aggressive corrections in an expontential manner. By default, it will set `theta=2` for each experimental covariate.
* `max_iter` - The number of correction steps harmony will perform before completing the data set integration. In general, more iterations than necessary increases computational runtime especially which becomes evident in bigger datasets. Setting `early_stop=TRUE` may reduce the actual number of correction steps which will be smaller than `max_iter`.
* `early_stop` - Under the hood, harmony minimizes its objective function through a series of clustering and integration tests. By setting `early_stop=TRUE`, when the objective function is less than `1e-4` after a correction step harmony exits before reaching the `max_iter` correction steps. This parameter can drastically reduce run-time in bigger datasets. 
* `.options` - A set of internal algorithm parameters that can be overriden. For advanced users only.

These parameters are Seurat-specific and do not affect the flow of the algorithm.

* `project_dim` - Toggle-like parameter, by default `project_dim=TRUE`. When enabled, `RunHarmony()` calculates genomic feature loadings using Seurat's `ProjectDim()` that correspond to the harmonized cell embeddings.
* `reduction.save` - The new Reduced Dimension slot identifier. By default, `reduction.save=TRUE`. This option allows several independent runs of harmony to be retained in the appropriate slots in the SeuratObjects. It is useful if you want to try Harmony with multiple parameters and save them as e.g. 'harmony_theta0', 'harmony_theta1', 'harmony_theta2'.

Miscellaneous parameters

These parameters help users troubleshoot harmony. 

* `plot_convergence` - Option that plots the convergence plot after the execution of the algorithm. By default `FALSE`. Setting it to `TRUE` will collect harmony's objective value and plot it allowing the user to troubleshoot the flow of the algorithm and fine-tune the parameters of the dataset integration procedure.

### Results

`RunHarmony()` returns the Seurat object which contains the harmonised cell embeddings in a slot named `harmony`. This entry can be accessed via `pbmc@reductions$harmony`. To access the values of the cell embeddings we can also use `Embeddings`.

```{r harmony_embeddings}
harmony.embeddings <- Embeddings(pbmc, reduction = "harmony")
head(harmony.embeddings)
```

After Harmony integration, we should inspect the quality of the harmonisation and contrast it with the unharmonised algorithm input. Ideally, cells from different conditions will align along the Harmonized PCs. If they are not, you could increase the *theta* value above to force a more aggressive fit of the dataset and rerun the workflow.

```{r contrast_harmony}
p1 <- DimPlot(object = pbmc, reduction = "harmony", pt.size = .1, group.by = "stim")
p2 <- VlnPlot(object = pbmc, features = "harmony_1", group.by = "stim",  pt.size = .1)
p1 + p2
```

Plot Genes correlated with the Harmonized PCs

```{r heatmap_harmony}
DimHeatmap(object = pbmc, reduction = "harmony", cells = 500, dims = 1:3)
```

The harmonised cell embeddings generated by harmony can be used for further integrated analyses. In this workflow, the Seurat object contains the harmony `reduction` modality name in the method that requires it.

Perform clustering using the harmonized vectors of cells

```{r find_clusters}
pbmc <- FindNeighbors(pbmc, reduction = "harmony") |>
  FindClusters(resolution = 0.5) 
```

TSNE visualisation of harmony embeddings.

```{r tsne}
pbmc <- RunTSNE(pbmc, reduction = "harmony")

p1 <- DimPlot(pbmc, reduction = "tsne", group.by = "stim", pt.size = .1)
p2 <- DimPlot(pbmc, reduction = "tsne", label = TRUE, pt.size = .1)
p1 + p2
```

One important observation is to assess that the harmonised data contain biological states of the cells. Therefore by checking the following genes we can see that biological cell states are preserved after harmonisation.

Expression of gene panel heatmap in the harmonized PBMC dataset.

```{r feature_plot, fig.width = 7, fig.height = 7}
FeaturePlot(
  object = pbmc,
  features= c("CD3D", "SELL", "CREM", "CD8A", "GNLY", "CD79A", "FCGR3A", "CCL2", "PPBP"), 
  min.cutoff = "q9",
  cols = c("lightgrey", "blue"),
  pt.size = 0.5
)
```

Similar to TSNE, we can run UMAP by passing the harmony reduction in the function.

```{r umap_harmony}
pbmc <- RunUMAP(pbmc, reduction = "harmony",  dims = 1:20)

p1 <- DimPlot(pbmc, reduction = "umap", group.by = "stim", pt.size = .1)
p2 <- DimPlot(pbmc, reduction = "umap", label = TRUE,  pt.size = .1)
p1 + p2
```