**Author:** Prof. AJ Smit

Department of Biodiversity and Conservation Biology

University of the Western Cape

# Topic 13: Cluster analysis

In this example we will apply two types of cluster analysis, viz. **K-means clustering** and **hierarchical clustering**. Whereas ordination attempts to display the presence and influence of gradients, clustering tries to place our samples into a certain number of discrete units or clusters. We have seen that the WHO/SDG data seem to form neat groupings of countries within their respective parent locations. Let's explore this dataset with cluster analysis.
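
To make the distinction between the two approaches concrete before turning to the SDG data, here is a minimal sketch on simulated data (the `toy` matrix, the choice of two groups, and the average linkage method are assumptions made purely for illustration; they are not part of the SDG analysis). It uses only `kmeans()`, `dist()`, `hclust()` and `cutree()` from base R's stats package.

```r
# Toy data: two well-separated groups in two dimensions (simulated, not SDG data)
set.seed(13)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)
colnames(toy) <- c("x", "y")

# K-means: the number of clusters (centers) must be chosen in advance
km <- kmeans(toy, centers = 2, nstart = 25)

# Hierarchical (agglomerative) clustering: build a tree from the pairwise
# Euclidean distances, then cut the tree into the desired number of groups
hc <- hclust(dist(toy), method = "average")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two sets of cluster assignments
table(kmeans = km$cluster, hclust = hc_groups)
```

With groups this far apart, both methods should agree on the membership, although the cluster labels themselves (1 or 2) are arbitrary and need not match between methods.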

Additional examples of clustering to study are:

1. Numerical Ecology with R, pp. 53-62. Later pages in the Cluster chapter go deeper into clustering, and you should read over them for a broad overview. For the purpose of this module, we will focus on 4.3 Hierarchical Clustering and 4.4 Agglomerative Clustering.
2. A [Kaggle challenge](https://www.kaggle.com/rohan0301/unsupervised-learning-on-country-data) with examples of both Hierarchical Clustering and K-means Clustering.

## Set up the analysis environment

In [1]:
library(tidyverse) 
library(GGally)
library(cluster)
library(dendextend)
library(ggcorrplot)
library(factoextra)
library(gridExtra)
library(vegan)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2


---------------------
Welcome to dendextend version 1.15.1
Type citation('dendextend') for how to cite the package.

Type browse

## Load the SDG data

I load the combined dataset, which has already had its missing values imputed (as per the [PCA](https://github.com/ajsmit/Quantitative_Ecology/blob/main/jupyter_lab/Topic_8-PCA-SDG-example.ipynb) example).

In [5]:
SDGs <- read_csv("/Users/ajsmit/Dropbox/R/workshops/Quantitative_Ecology/exercises/WHO/SDG_complete.csv")
head(SDGs)


── Column specification ────────────────────────────────────────────────────────
cols(
  .default = col_double(),
  ParentLocation = col_character(),
  Location = col_character()
)
ℹ Use `spec()` for the full column specifications.




| ParentLocation | Location | other_1 | other_2 | SDG1.a | SDG16.1 | SDG3.1_1 | SDG3.2_1 | SDG3.2_2 | SDG3.2_3 | ⋯ | SDG3.b_4 | SDG3.c_1 | SDG3.c_2 | SDG3.c_3 | SDG3.c_4 | SDG3.d_1 | SDG3.7 | SDG3.a | SDG3.1_2 | SDG3.b_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Eastern Mediterranean | Afghanistan | 61.65 | 15.59 | 2.14 | 9.02 | 673 | 134.802054 | 230.59499 | 175.34049 | ⋯ | 62.0 | 2.78 | 1.48 | 0.034 | 0.47 | 42.0 | 97.20786 | 14.35881 | 66.57461 | 68.86648 |
| Europe | Albania | 77.84 | 21.13 | 9.62 | 3.78 | 16 | 7.552973 | 11.31237 | 10.04785 | ⋯ | 98.0 | 12.16 | 36.5 | 5.163544 | 6.4885512 | 82.02704 | 17.3 | 29.7 | 94.66754 | 56.50277 |
| Africa | Algeria | 76.54 | 21.81 | 10.73 | 1.66 | 113 | 37.99931 | 61.62882 | 53.23844 | ⋯ | 61.0 | 18.33 | 22.43 | 4.262742 | 5.6060842 | 73.0 | 43.97764 | 19.1 | 87.91192 | 65.47076 |
| Africa | Angola | 61.72 | 16.71 | 5.43 | 9.82 | 246 | 124.853365 | 341.33898 | 228.84115 | ⋯ | 55.0 | -5.914274 | -9.936513 | -1.082242 | -0.7075023 | 55.48743 | 111.43513 | 13.59252 | 61.05841 | 68.24041 |
| Americas | Antigua and Barbuda | 76.14 | 20.43 | 11.61 | 2.42 | 43 | 5.940594 | 10.89109 | 9.90099 | ⋯ | 85.61069 | 27.54121 | 44.87 | 5.042694 | 6.333692 | 81.0 | 31.1 | 23.60171 | 94.20909 | 55.99846 |
| Americas | Argentina | 76.17 | 20.98 | 13.47 | 6.23 | 40 | 11.006454 | 18.74016 | 16.69441 | ⋯ | 82.0 | 40.01 | 25.82 | 5.358117 | 6.6479615 | 76.0 | 25.9356 | 23.6 | 98.4 | 56.0 |


The parent locations:

In [7]:
unique(SDGs$ParentLocation)
length(unique(SDGs$ParentLocation))

The number of countries:

In [8]:
length(SDGs$Location)
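
It can also help to know how many countries fall within each parent location, since these are the groupings we hope the clusters will recover. The sketch below is illustrative only: so that it runs on its own, it rebuilds the six rows displayed by `head(SDGs)` in a hypothetical data frame called `SDGs_head` (an assumption for this example), then tallies them with `dplyr::count()`. The same call could be applied to the full `SDGs` object.

```r
library(dplyr)

# Hypothetical stand-in for SDGs: only the six rows displayed by head(SDGs);
# the full dataset contains many more countries
SDGs_head <- data.frame(
  ParentLocation = c("Eastern Mediterranean", "Europe", "Africa",
                     "Africa", "Americas", "Americas"),
  Location = c("Afghanistan", "Albania", "Algeria",
               "Angola", "Antigua and Barbuda", "Argentina")
)

# Number of countries in each parent location, largest first
per_region <- SDGs_head %>%
  count(ParentLocation, name = "n_countries", sort = TRUE)

per_region
```

Replacing `SDGs_head` with `SDGs` gives the tally for the whole dataset, which is a useful reference when judging the cluster sizes later on.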