-
Notifications
You must be signed in to change notification settings - Fork 1
/
seriation.Rmd
243 lines (171 loc) · 8.24 KB
/
seriation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
---
title: "Ordering Objects using Seriation in R"
date: "`r Sys.Date()`"
output:
workflowr::wflow_html:
toc: true
---
```{r setup, include=FALSE}
library(tidyverse)
library(patchwork)
library(dendextend)
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
From the [seriation R package](https://github.com/mhahsler/seriation).
>Seriation arranges a set of objects into a linear order given available data with the goal of revealing structural information. This package provides the infrastructure for ordering objects with an implementation of many seriation/sequencing/ordination techniques to reorder data matrices, dissimilarity matrices, correlation matrices, and dendrograms (see below for a complete list). The package provides several visualizations (grid and ggplot2) to reveal structural information, including permuted image plots, reordered heatmaps, Bertin plots, clustering visualizations like dissimilarity plots, and visual assessment of cluster tendency plots (VAT and iVAT).
## Installation
Install stable CRAN version.
```{r install}
if(! "seriation" %in% installed.packages()[, 1]){
install.packages("seriation", repos = c("https://mhahsler.r-universe.dev", "https://cloud.r-project.org/"))
}
library(seriation)
packageVersion("seriation")
```
## Getting started
Use the [example dataset](https://github.com/mhahsler/seriation#usage) `SupremeCourt`, which:
>Contains a (a subset of the) decisions for the stable 8-yr period 1995-2002 of the second Rehnquist Supreme Court. Decisions are aggregated to the joint probability for disagreement between judges.
```{r supreme_court}
data("SupremeCourt")
SupremeCourt
```
Convert to distance matrix.
```{r distance_matrix}
d <- as.dist(SupremeCourt)
d
```
Perform the default seriation method to reorder the objects.
```{r default_seriation}
my_order <- seriate(d)
get_order(my_order)
```
Plot heatmap.
```{r heatmap, message=FALSE, warning=FALSE}
p1 <- ggpimage(d, upper_tri = TRUE) +
ggtitle("Judges (original order)")
p2 <- ggpimage(d, my_order, upper_tri = TRUE) +
ggtitle("Judges (seriation order)")
p1 + p2 & scale_fill_gradientn(colours = c("darkgrey", "skyblue"))
```
Return linear configuration where more similar objects are located closer to each other.
```{r get_config}
sort(get_config(my_order))
```
Plot linear configuration.
```{r plot_config}
plot_config(my_order)
```
Hierarchical cluster with average linkage.
```{r hclust}
plot(hclust(d, method = "average"))
```
## Heatmaps with seriation
The `Wood` dataset consists of:
>A data matrix containing a sample of the normalized gene expression data for 6 locations in the stem of Popla trees published in the study by Herzberg et al (2001). The sample of 136 genes selected by Caraux and Pinloche (2005).
```{r wood}
data("Wood")
dim(Wood)
```
Check out `Wood`.
```{r wood_glimpse}
head(Wood)
```
Methods of interest for heatmaps are dendrogram leaf order-based methods applied to rows and columns. This is done using `method = "heatmap"`. The actual seriation method can be passed on as parameter seriation_method, but it has a suitable default if it is omitted.
```{r wood_order}
wood_hc_complete <- seriate(Wood, method = "Heatmap", seriation_method = "HC_complete")
wood_olo_complete <- seriate(Wood, method = "Heatmap", seriation_method = "OLO_complete")
get_order(wood_olo_complete)
```
Ignore the numbers of the order above; they indicate the index of the gene in `Wood`. `AI165492` and `AI166057` are the most similar to each other. If I use those values in `Wood`, I get their expression data.
```{r index_wood}
Wood[c(106, 128), ]
```
We can clearly see the similar expression patterns for these subset of genes.
```{r plot_eg}
Wood[get_order(wood_olo_complete)[1:6], ] |>
as.data.frame() |>
tibble::rownames_to_column('gene') |>
tidyr::pivot_longer(-gene) |>
ggplot(data = _, aes(name, value, group = gene, colour = gene)) +
geom_line() +
geom_point() +
theme_minimal() +
labs(title = "Genes with similar expression patterns", y = "Normalised expression", x = "location")
```
`wood_olo_complete` is also a `hclust`.
```{r wood_olo_class}
class(wood_olo_complete[[1]])
```
Plot hierarchical clustering result.
```{r fig.height=20, fig.width=14}
as.dendrogram(wood_olo_complete[[1]]) %>%
plot(horiz = TRUE)
```
Gene order of the dendrogram matches the order produced by seriation using OLO complete.
```{r gene_order}
o <- wood_olo_complete[[1]]$order
identical(wood_olo_complete[[1]]$labels[o], names(get_order(wood_olo_complete)))
```
Location ordering.
```{r comp_order}
get_order(wood_hc_complete, 2)
get_order(wood_olo_complete, 2)
```
Heatmap.
```{r wood_heatmap, message=FALSE, warning=FALSE}
p1 <- ggpimage(Wood) +
ggtitle("Wood (no order)") +
theme(legend.position = "none")
p2 <- ggpimage(Wood, wood_hc_complete) +
ggtitle("Wood (HC complete)") +
theme(legend.position = "none")
p3 <- ggpimage(Wood, wood_olo_complete) +
ggtitle("Wood (OLO complete)")
p1 + p2 + p3 & scale_fill_gradientn(colours = c("skyblue", "red"))
```
## Evaluate clusters using dissimilarity plots
>[Dissimilarity plots](https://mhahsler.github.io/seriation/seriation_cluster_evaluation.html) can be used to visually inspect the quality of a cluster solution. The plot uses image plots of the reordered dissimilarity matrix organised by the clusters to display the clustered data. This display allows the user to visually assess clustering quality.
>The Ruspini dataset from package `cluster` is a popular dataset for illustrating clustering techniques. It consists of 75 points in two-dimensional space with four clearly distinguishable groups and thus is easy to cluster.
```{r ruspini}
library(cluster)
data(ruspini)
set.seed(1234)
ruspini |> sample_frac() -> ruspini
head(ruspini)
```
Plot.
```{r plot_ruspini}
plot(ruspini, pch = 16)
```
Cluster with _k_-means and produce a dissimilarity plot.
```{r four_clusters, fig.show="hold", out.width="50%", fig.asp=1, fig.align="default"}
set.seed(1234)
cl_ruspini <- kmeans(ruspini, centers=4, nstart=5)
d_ruspini <- dist(ruspini)
ggdissplot(d_ruspini, cl_ruspini$cluster) + ggtitle("Dissimilarity Plot")
# labels= 4 = only the ellipses are labelled in the plot
clusplot(ruspini, cl_ruspini$cluster, labels = 4)
```
Dissimilarity plots visualise the distances between points in a distance matrix. A distance matrix for $n$ objects is a $n \times n$ matrix with pairwise distances as values. The diagonal contains the distances between each object and itself and therefore is always zero. In the dissimilarity plot above, low distance values are shown using a darker colour. The result of a "good" clustering should be a matrix with low dissimilarity values forming blocks around the main diagonal corresponding to the clusters.
Let's manually recreate the lower triangle of the dissimilarity plot using base R.
```{r dissimilarity_plot}
my_mat <- as.matrix(d_ruspini)
my_clus <- as.integer(names(sort(cl_ruspini$cluster)))
my_order <- match(my_clus, colnames(my_mat))
my_mat <- my_mat[my_order, my_order]
image(my_mat, col = rev(RColorBrewer::brewer.pal(n = 9, name = "Blues")))
```
The dissimilarity plot shows a good clustering structure with the clusters forming four dark squares. In the `ggdissplot` plot the lower triangle shows the pairwise distances and the upper triangle shows cluster averages. The clusters are ordered by similarity indicating that closer cluster are more similar and clusters further away from each other are the most dissimilar.
Deciding on the number of clusters is a difficult problem. Lets specify three clusters this time.
```{r three_clusters, fig.show="hold", out.width="50%", fig.asp=1, fig.align="default"}
set.seed(1234)
cl_ruspini3 <- kmeans(ruspini, centers=3, nstart=5)
ggdissplot(d_ruspini, cl_ruspini3$cluster) + ggtitle("Dissimilarity Plot")
clusplot(ruspini, cl_ruspini3$cluster, labels = 4)
```
We can also use dissimilarity plots for exploring data without clustering.
```{r dissimilarity_plot_no_clusters}
ggdissplot(d_ruspini) + ggtitle("Dissimilarity plot without clustering")
```
Dissimilarity plots scale well with the dimensionality of the data and by reordering clusters and objects within clusters, we can get a very concise structural representation of the clustering. Dissimiarlity plots are also helpful in spotting the mis-specification of the number of clusters used for partitioning.