analysis/seriation.Rmd

---
title: "Ordering Objects using Seriation in R"
date: "`r Sys.Date()`"
output:
  workflowr::wflow_html:
    toc: true
---

```{r setup, include=FALSE}
library(tidyverse)
library(patchwork)
library(dendextend)
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

From the [seriation R package](https://github.com/mhahsler/seriation).

>Seriation arranges a set of objects into a linear order given available data with the goal of revealing structural information. This package provides the infrastructure for ordering objects with an implementation of many seriation/sequencing/ordination techniques to reorder data matrices, dissimilarity matrices, correlation matrices, and dendrograms (see below for a complete list). The package provides several visualizations (grid and ggplot2) to reveal structural information, including permuted image plots, reordered heatmaps, Bertin plots, clustering visualizations like dissimilarity plots, and visual assessment of cluster tendency plots (VAT and iVAT).

## Installation

Install stable CRAN version.

```{r install}
if(! "seriation" %in% installed.packages()[, 1]){
  install.packages("seriation", repos = c("https://mhahsler.r-universe.dev", "https://cloud.r-project.org/"))
}

library(seriation)
packageVersion("seriation")
```

## Getting started

Use the [example dataset](https://github.com/mhahsler/seriation#usage) `SupremeCourt`, which:

>Contains a (a subset of the) decisions for the stable 8-yr period 1995-2002 of the second Rehnquist Supreme Court. Decisions are aggregated to the joint probability for disagreement between judges.

```{r supreme_court}
data("SupremeCourt")
SupremeCourt
```

Convert to distance matrix.

```{r distance_matrix}
d <- as.dist(SupremeCourt)
d
```

Perform the default seriation method to reorder the objects.

```{r default_seriation}
my_order <- seriate(d)
get_order(my_order)
```

Plot heatmap.

```{r heatmap, message=FALSE, warning=FALSE}
p1 <- ggpimage(d, upper_tri = TRUE) +
  ggtitle("Judges (original order)")

p2 <- ggpimage(d, my_order, upper_tri = TRUE) +
  ggtitle("Judges (seriation order)")

p1 + p2 & scale_fill_gradientn(colours = c("darkgrey", "skyblue"))
```

Return linear configuration where more similar objects are located closer to each other.

```{r get_config}
sort(get_config(my_order))
```

Plot linear configuration.

```{r plot_config}
plot_config(my_order)
```

Hierarchical cluster with average linkage.

```{r hclust}
plot(hclust(d, method = "average"))
```

## Heatmaps with seriation

The `Wood` dataset consists of:

>A data matrix containing a sample of the normalized gene expression data for 6 locations in the stem of Popla trees published in the study by Herzberg et al (2001). The sample of 136 genes selected by Caraux and Pinloche (2005).

```{r wood}
data("Wood")
dim(Wood)
```

Check out `Wood`.

```{r wood_glimpse}
head(Wood)
```

Methods of interest for heatmaps are dendrogram leaf order-based methods applied to rows and columns. This is done using `method = "heatmap"`. The actual seriation method can be passed on as parameter seriation_method, but it has a suitable default if it is omitted.

```{r wood_order}
wood_hc_complete <- seriate(Wood, method = "Heatmap", seriation_method = "HC_complete")
wood_olo_complete <- seriate(Wood, method = "Heatmap", seriation_method = "OLO_complete")
get_order(wood_olo_complete)
```

Ignore the numbers of the order above; they indicate the index of the gene in `Wood`. `AI165492` and `AI166057` are the most similar to each other. If I use those values in `Wood`, I get their expression data.

```{r index_wood}
Wood[c(106, 128), ]
```

We can clearly see the similar expression patterns for these subset of genes.

```{r plot_eg}
Wood[get_order(wood_olo_complete)[1:6], ] |>
  as.data.frame() |>
  tibble::rownames_to_column('gene') |>
  tidyr::pivot_longer(-gene) |>
  ggplot(data = _, aes(name, value, group = gene, colour = gene)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = "Genes with similar expression patterns", y = "Normalised expression", x = "location")
```

`wood_olo_complete` is also a `hclust`.

```{r wood_olo_class}
class(wood_olo_complete[[1]])
```

Plot hierarchical clustering result.

```{r fig.height=20, fig.width=14}
as.dendrogram(wood_olo_complete[[1]]) %>%
  plot(horiz = TRUE)
```

Gene order of the dendrogram matches the order produced by seriation using OLO complete.

```{r gene_order}
o <- wood_olo_complete[[1]]$order
identical(wood_olo_complete[[1]]$labels[o], names(get_order(wood_olo_complete)))
```

Location ordering.

```{r comp_order}
get_order(wood_hc_complete, 2)
get_order(wood_olo_complete, 2)
```

Heatmap.

```{r wood_heatmap, message=FALSE, warning=FALSE}
p1 <- ggpimage(Wood) +
  ggtitle("Wood (no order)") +
  theme(legend.position = "none")
  
p2 <- ggpimage(Wood, wood_hc_complete) +
  ggtitle("Wood (HC complete)") +
  theme(legend.position = "none")

p3 <- ggpimage(Wood, wood_olo_complete) +
  ggtitle("Wood (OLO complete)")

p1 + p2 + p3 & scale_fill_gradientn(colours = c("skyblue", "red"))
```

## Evaluate clusters using dissimilarity plots

>[Dissimilarity plots](https://mhahsler.github.io/seriation/seriation_cluster_evaluation.html) can be used to visually inspect the quality of a cluster solution. The plot uses image plots of the reordered dissimilarity matrix organised by the clusters to display the clustered data. This display allows the user to visually assess clustering quality.

>The Ruspini dataset from package `cluster` is a popular dataset for illustrating clustering techniques. It consists of 75 points in two-dimensional space with four clearly distinguishable groups and thus is easy to cluster.

```{r ruspini}
library(cluster)
data(ruspini)
set.seed(1234)
ruspini |> sample_frac() -> ruspini
head(ruspini)
```

Plot.

```{r plot_ruspini}
plot(ruspini, pch = 16)
```

Cluster with _k_-means and produce a dissimilarity plot.

```{r four_clusters, fig.show="hold", out.width="50%", fig.asp=1, fig.align="default"}
set.seed(1234)
cl_ruspini <- kmeans(ruspini, centers=4, nstart=5)
d_ruspini  <- dist(ruspini)

ggdissplot(d_ruspini, cl_ruspini$cluster) + ggtitle("Dissimilarity Plot")

# labels= 4 = only the ellipses are labelled in the plot
clusplot(ruspini, cl_ruspini$cluster, labels = 4)
```

Dissimilarity plots visualise the distances between points in a distance matrix. A distance matrix for $n$ objects is a $n \times n$ matrix with pairwise distances as values. The diagonal contains the distances between each object and itself and therefore is always zero. In the dissimilarity plot above, low distance values are shown using a darker colour. The result of a "good" clustering should be a matrix with low dissimilarity values forming blocks around the main diagonal corresponding to the clusters.

Let's manually recreate the lower triangle of the dissimilarity plot using base R.

```{r dissimilarity_plot}
my_mat <- as.matrix(d_ruspini)
my_clus <- as.integer(names(sort(cl_ruspini$cluster)))
my_order <- match(my_clus, colnames(my_mat))
my_mat <- my_mat[my_order, my_order]

image(my_mat, col = rev(RColorBrewer::brewer.pal(n = 9, name = "Blues")))
```

The dissimilarity plot shows a good clustering structure with the clusters forming four dark squares. In the `ggdissplot` plot the lower triangle shows the pairwise distances and the upper triangle shows cluster averages. The clusters are ordered by similarity indicating that closer cluster are more similar and clusters further away from each other are the most dissimilar.

Deciding on the number of clusters is a difficult problem. Lets specify three clusters this time.

```{r three_clusters, fig.show="hold", out.width="50%", fig.asp=1, fig.align="default"}
set.seed(1234)
cl_ruspini3 <- kmeans(ruspini, centers=3, nstart=5)

ggdissplot(d_ruspini, cl_ruspini3$cluster) + ggtitle("Dissimilarity Plot")
clusplot(ruspini, cl_ruspini3$cluster, labels = 4)
```

We can also use dissimilarity plots for exploring data without clustering.

```{r dissimilarity_plot_no_clusters}
ggdissplot(d_ruspini) + ggtitle("Dissimilarity plot without clustering")
```

Dissimilarity plots scale well with the dimensionality of the data and by reordering clusters and objects within clusters, we can get a very concise structural representation of the clustering. Dissimiarlity plots are also helpful in spotting the mis-specification of the number of clusters used for partitioning.