---
title: "Ordering Objects using Seriation in R"
date: "`r Sys.Date()`"
output:
  workflowr::wflow_html:
    toc: true
---

```{r setup, include=FALSE}
library(tidyverse)
library(patchwork)
library(dendextend)
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

The following description is from the [seriation R package](https://github.com/mhahsler/seriation):

>Seriation arranges a set of objects into a linear order given available data with the goal of revealing structural information. This package provides the infrastructure for ordering objects with an implementation of many seriation/sequencing/ordination techniques to reorder data matrices, dissimilarity matrices, correlation matrices, and dendrograms (see below for a complete list). The package provides several visualizations (grid and ggplot2) to reveal structural information, including permuted image plots, reordered heatmaps, Bertin plots, clustering visualizations like dissimilarity plots, and visual assessment of cluster tendency plots (VAT and iVAT).

## Installation

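Install the package if it is not already available. The r-universe repository is listed alongside CRAN, so this can install the current development build rather than the stable CRAN release.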

```{r install}
if (!requireNamespace("seriation", quietly = TRUE)) {
  install.packages("seriation", repos = c("https://mhahsler.r-universe.dev", "https://cloud.r-project.org/"))
}

library(seriation)
packageVersion("seriation")
```

## Getting started

Use the [example dataset](https://github.com/mhahsler/seriation#usage) `SupremeCourt`, which:

>Contains a (a subset of the) decisions for the stable 8-yr period 1995-2002 of the second Rehnquist Supreme Court. Decisions are aggregated to the joint probability for disagreement between judges.

```{r supreme_court}
data("SupremeCourt")
SupremeCourt
```

Convert the disagreement matrix to a `dist` object.

```{r distance_matrix}
d <- as.dist(SupremeCourt)
d
```
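
As a small illustration of what `as.dist()` keeps, here is a sketch on a made-up 3 x 3 matrix (the values and the labels `a`, `b`, `c` are hypothetical and have nothing to do with `SupremeCourt`). Only the lower triangle is stored, column by column.

```{r as_dist_toy}
# Hypothetical symmetric dissimilarity matrix, for illustration only.
m <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# as.dist() drops the diagonal and the upper triangle.
toy_d <- as.dist(m)
toy_d

# The stored values are the lower triangle read column-wise: (b,a), (c,a), (c,b).
toy_values <- as.vector(toy_d)
toy_values
```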

Run `seriate()` with its default method to reorder the objects.

```{r default_seriation}
my_order <- seriate(d)
get_order(my_order)
```
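
One way to check whether the new order is an improvement is to compare seriation criteria before and after. The sketch below uses `criterion()` with two loss measures, `Path_length` and `AR_events`, for which lower values are better. The object names `d_sc`, `o_sc` and `crit` are introduced here only for illustration.

```{r criterion_compare}
library(seriation)
data("SupremeCourt")

d_sc <- as.dist(SupremeCourt)
o_sc <- seriate(d_sc)

# Without an order argument, criterion() evaluates the order in which the
# objects are stored; with one, it evaluates the seriated order.
loss_measures <- c("Path_length", "AR_events")
crit <- rbind(
  original = criterion(d_sc, method = loss_measures),
  seriated = criterion(d_sc, o_sc, method = loss_measures)
)
crit
```

Which criterion matters depends on the structure you expect in the data, so it is worth looking at more than one.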

Plot heatmaps of the dissimilarities in the original and the seriated order.

```{r heatmap, message=FALSE, warning=FALSE}
p1 <- ggpimage(d, upper_tri = TRUE) +
  ggtitle("Judges (original order)")

p2 <- ggpimage(d, my_order, upper_tri = TRUE) +
  ggtitle("Judges (seriation order)")

p1 + p2 & scale_fill_gradientn(colours = c("darkgrey", "skyblue"))
```

Retrieve the linear configuration, in which more similar objects are located closer to each other.

```{r get_config}
sort(get_config(my_order))
```

Plot linear configuration.

```{r plot_config}
plot_config(my_order)
```

For comparison, cluster the judges hierarchically with average linkage.

```{r hclust}
plot(hclust(d, method = "average"))
```
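
A dendrogram fixes which objects are merged, but not the left-to-right order of its leaves. As a rough sketch, the leaf order from plain `hclust()` can be compared with the order from the `OLO_average` seriation method, which, as far as I understand, applies optimal leaf ordering to an average-linkage tree. The object names below are introduced for illustration.

```{r hclust_olo}
library(seriation)
data("SupremeCourt")

d_sc <- as.dist(SupremeCourt)

# Leaf order as returned by hclust().
hc_avg <- hclust(d_sc, method = "average")
hc_leaves <- hc_avg$labels[hc_avg$order]

# Leaf order after seriation with optimal leaf ordering.
olo_avg <- seriate(d_sc, method = "OLO_average")
olo_leaves <- labels(d_sc)[get_order(olo_avg)]

# Same judges in both rows, possibly arranged differently.
rbind(hclust = hc_leaves, OLO = olo_leaves)
```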

## Heatmaps with seriation

The `Wood` dataset consists of:

>A data matrix containing a sample of the normalized gene expression data for 6 locations in the stem of Popla trees published in the study by Herzberg et al (2001). The sample of 136 genes selected by Caraux and Pinloche (2005).

```{r wood}
data("Wood")
dim(Wood)
```

Preview the first rows of `Wood`.

```{r wood_glimpse}
head(Wood)
```

For heatmaps, the methods of interest are those based on dendrogram leaf order, applied separately to rows and columns. Select them with `method = "Heatmap"`. The underlying seriation method is passed via the `seriation_method` parameter; a suitable default is used if it is omitted.

```{r wood_order}
wood_hc_complete <- seriate(Wood, method = "Heatmap", seriation_method = "HC_complete")
wood_olo_complete <- seriate(Wood, method = "Heatmap", seriation_method = "OLO_complete")
get_order(wood_olo_complete)
```
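
To make the row-and-column idea concrete, here is a sketch of what a heatmap seriation amounts to when done by hand: seriate the distances between rows and the distances between columns separately, then combine the two orders. It assumes Euclidean distances and no scaling, which may differ from what `method = "Heatmap"` does internally, so the result is not guaranteed to match `wood_olo_complete` exactly. The names `row_order`, `col_order` and `manual_order` are introduced for illustration.

```{r heatmap_by_hand}
library(seriation)
data("Wood")

# Assumption: Euclidean distances between genes (rows) and between
# locations (columns), computed on the unscaled matrix.
row_order <- seriate(dist(Wood), method = "OLO_complete")
col_order <- seriate(dist(t(Wood)), method = "OLO_complete")

# Combine the two one-dimensional orders into a two-mode permutation.
manual_order <- ser_permutation(row_order[[1]], col_order[[1]])
manual_order

# Column order found by the manual approach.
get_order(manual_order, 2)
```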

The numbers in the order above are the row indices of the genes in `Wood`; the names are the gene identifiers. `AI165492` and `AI166057` are the most similar to each other. Indexing `Wood` with their row numbers returns their expression data.

```{r index_wood}
Wood[c(106, 128), ]
```

The first six genes in the seriated order show similar expression patterns.

```{r plot_eg}
Wood[get_order(wood_olo_complete)[1:6], ] |>
  as.data.frame() |>
  tibble::rownames_to_column('gene') |>
  tidyr::pivot_longer(-gene) |>
  ggplot(data = _, aes(name, value, group = gene, colour = gene)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = "Genes with similar expression patterns", y = "Normalised expression", x = "location")
```

The row order stored in `wood_olo_complete[[1]]` is also an `hclust` object.

```{r wood_olo_class}
class(wood_olo_complete[[1]])
```

Plot the hierarchical clustering result.

```{r wood_dendrogram, fig.height=20, fig.width=14}
as.dendrogram(wood_olo_complete[[1]]) |>
  plot(horiz = TRUE)
```

The gene order of the dendrogram matches the order produced by seriation using OLO complete.

```{r gene_order}
o <- wood_olo_complete[[1]]$order
identical(wood_olo_complete[[1]]$labels[o], names(get_order(wood_olo_complete)))
```

Compare the location (column) order found by the two methods.

```{r comp_order}
get_order(wood_hc_complete, 2)
get_order(wood_olo_complete, 2)
```
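
The row and column orders can also be applied to the matrix itself rather than only to a plot. The sketch below uses `permute()` from seriation for this; `wood_o` and `wood_reordered` are names introduced for illustration.

```{r wood_permute}
library(seriation)
data("Wood")

wood_o <- seriate(Wood, method = "Heatmap", seriation_method = "OLO_complete")

# Reorder rows and columns of the matrix according to the seriation result.
wood_reordered <- seriation::permute(Wood, wood_o)

# The reordered matrix has the same shape; only the arrangement changes.
dim(wood_reordered)
head(rownames(wood_reordered))
colnames(wood_reordered)
```

Working with the permuted matrix is convenient when the reordered data needs to go into a plotting function that does not accept a seriation object.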

Plot heatmaps of `Wood` in its original order and in the two seriated orders.

```{r wood_heatmap, message=FALSE, warning=FALSE}
p1 <- ggpimage(Wood) +
  ggtitle("Wood (no order)") +
  theme(legend.position = "none")
  
p2 <- ggpimage(Wood, wood_hc_complete) +
  ggtitle("Wood (HC complete)") +
  theme(legend.position = "none")

p3 <- ggpimage(Wood, wood_olo_complete) +
  ggtitle("Wood (OLO complete)")

p1 + p2 + p3 & scale_fill_gradientn(colours = c("skyblue", "red"))
```