# Unsupervised Learning

In [None]:
library(tidyverse)

In [None]:
head(iris)

## Review of tidyverse

**1**. How many rows and columns are there in the DataFrame?

**2**. Show just the Species column and any columns with information about `Petal`.

**3**. Show a DataFrame with just two columns giving the Length:Width ratio for Sepal and Petal.

**4**. Count the number of each species where `Sepal.Length` is less than 6.

**5**. Summarize the mean and standard deviation for each measurement grouped by Species.

**6**. Convert `iris` to "tall" form and assign the resulting DataFrame to `iris_t` with three columns `Species`, `Measurement` and `Value`.

## Pairwise scatter plot

In [None]:
pairs(iris)

In [None]:
pairs(iris, col=iris$Species)

## Calculating distances

A distance function must satisfy three mathematical conditions: positivity (distances are non-negative, and zero only between identical points), symmetry, and the triangle inequality.
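
As a rough illustration, these conditions can be checked numerically for the Euclidean distance that `dist` computes by default. This is a minimal sketch, not a proof: the seed, the sample of 6 rows and the small tolerance `1e-12` are arbitrary choices made here for illustration.

```r
# Illustrative check of the distance conditions on a few iris rows.
set.seed(1)                                    # arbitrary seed, for reproducibility only
x <- as.matrix(iris[sample(nrow(iris), 6), 1:4])
d <- as.matrix(dist(x))                        # Euclidean distance by default

# Positivity: no negative distances, and zero distance from a point to itself
positivity <- all(d >= 0) && all(diag(d) == 0)

# Symmetry: d(i, j) equals d(j, i)
symmetry <- isSymmetric(unname(d))

# Triangle inequality: d(i, k) <= d(i, j) + d(j, k) for every triple
n <- nrow(d)
triangle <- TRUE
for (i in 1:n) for (j in 1:n) for (k in 1:n) {
  # small tolerance (an assumption) to absorb floating-point rounding
  if (d[i, k] > d[i, j] + d[j, k] + 1e-12) triangle <- FALSE
}

c(positivity = positivity, symmetry = symmetry, triangle = triangle)
```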

In [None]:
small <- iris %>% select(-Species) %>% sample_n(6) 
small

**7**. Find the Euclidean distance between row 1 and row 2 of `small`.

### Distance matrix

In [None]:
dist(small)

In [None]:
dist(small, upper = T, diag=T)

### Scaling before distance

In [None]:
dist(scale(small), upper=T, diag=T)

### What is the distance matrix showing?

In [None]:
dist(small, method = "maximum", diag=T, upper=T)

## Agglomerative hierarchical clustering

In [None]:
iris %>% select(-Species) -> df

In [None]:
c1 <- hclust(dist(df))

In [None]:
plot(c1)

In [None]:
z1 <- cutree(c1, 3)

In [None]:
z1

Note that the label values are arbitrary: all we know is that the `1`s belong to the same cluster, the `2`s belong to another cluster, and the `3`s belong to the final cluster. We have no idea which Species the cluster labels represent, or even whether the assignment is "correct" compared to the ground truth.
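
Since `iris` happens to include the true Species, one way to see how the arbitrary labels line up with it is a cross-tabulation. The sketch below recomputes the clustering so that it runs on its own; the name `tab` is introduced here for illustration only. In a real unsupervised setting no such ground truth would be available.

```r
# Recompute the clustering so this block is self-contained
c1 <- hclust(dist(iris[, 1:4]))    # complete linkage is the hclust default
z1 <- cutree(c1, k = 3)

# Rows are cluster labels, columns are the true species.
# A row spread over several columns means that cluster mixes species.
tab <- table(cluster = z1, species = iris$Species)
tab
```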

In [None]:
par(mfrow=c(1,2))
plot(iris[, 1], iris[,2], col=iris$Species)
plot(iris[, 1], iris[,2], col=z1)

### Different linkage methods can give different cluster assignments

In [None]:
c2 <- hclust(dist(df), method="average")

In [None]:
plot(c2)

In [None]:
z2 <- cutree(c2, k=3)

In [None]:
par(mfrow=c(1,2))
plot(iris[, 1], iris[,2], col=iris$Species)
plot(iris[, 1], iris[,2], col=z2)
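
The scatter plots can be hard to compare by eye. A cross-tabulation of the two label vectors may make the effect of the linkage method easier to see. This is a minimal sketch that rebuilds both clusterings from scratch; the names `z_complete`, `z_average` and `cmp` are introduced here for illustration.

```r
d <- dist(iris[, 1:4])

# Same distances, two linkage methods, both cut into 3 clusters
z_complete <- cutree(hclust(d, method = "complete"), k = 3)
z_average  <- cutree(hclust(d, method = "average"),  k = 3)

# If the two methods agreed up to relabeling, each row would have
# exactly one non-zero entry
cmp <- table(complete = z_complete, average = z_average)
cmp
```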

### But in reality, we would not know the true number of clusters!

In [None]:
plot(c2)
abline(h=1.5, col='red')

### Slightly more informative plot

In [None]:
plot(c2)
rect.hclust(c2, h=1.5, border="red")

In [None]:
z3 <- cutree(c2, h=1.5)

In [None]:
par(mfrow=c(1,2))
plot(iris[, 1], iris[,2], col=iris$Species)
plot(iris[, 1], iris[,2], col=z3)
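
To get a feel for how much the answer depends on where the tree is cut, the sketch below counts the clusters obtained at a few cut heights. The heights are arbitrary values chosen for illustration, not recommendations.

```r
c2 <- hclust(dist(iris[, 1:4]), method = "average")

heights <- c(0.5, 1, 1.5, 2, 3)    # hypothetical cut heights
# Number of distinct cluster labels produced by cutting at each height
n_clusters <- sapply(heights, function(h) length(unique(cutree(c2, h = h))))

data.frame(height = heights, clusters = n_clusters)
```

Cutting higher up the dendrogram merges more branches, so the cluster count can only stay the same or fall as the height increases.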

## K-means clustering

In [None]:
# k-means takes the observations themselves, not a distance matrix
k1 <- kmeans(df, centers=3)

In [None]:
str(k1)

In [None]:
as.vector(k1$cluster)

### Finding cluster means

#### Base R

These incantations are hard to remember. I suggest you stick to `tidyverse` methods.

In [None]:
aggregate(df,by=list(k1$cluster),FUN=mean)

**8**. Find the centers of the 3 clusters using `dplyr` and save them as `centroids`.

In [None]:
par(mfrow=c(1,2))
plot(iris[,"Sepal.Length"], iris[,"Sepal.Width"])
points(centroids[["Sepal.Length"]], centroids[["Sepal.Width"]], col="red",  pch="x")
plot(iris[,"Sepal.Length"], iris[,"Sepal.Width"], col=k1$cluster)

## Dimension reduction and ordination

### PCA

In [None]:
pc <- prcomp(df, center=T, scale.=T)

In [None]:
str(pc)

In [None]:
summary(pc)

In [None]:
plot(pc$x[,1], pc$x[,2])

### MDS

In [None]:
mds <- cmdscale(dist(df), k = 2)

In [None]:
str(mds)

In [None]:
summary(mds)

In [None]:
plot(mds[,1], mds[,2])

## Plotting heatmaps

In [None]:
library(pheatmap)

In [None]:
pheatmap(scale(df))

In [None]:
pheatmap(df, kmeans_k = 3)

In [None]:
pheatmap(dist(df))