# Unsupervised Learning

## Golub data set

In [None]:
suppressPackageStartupMessages(library(multtest))
suppressPackageStartupMessages(library(golubEsets))
suppressPackageStartupMessages(library(tidyverse))

In [None]:
data(Golub_Merge)
dim(Golub_Merge)

### Extract gene expression values

In [None]:
golub <- exprs(Golub_Merge)

There are 72 patients and 7129 probe sets.

In [None]:
dim(golub)

In [None]:
head(golub)

For this exercise, we consider the probe values to be variables and the patients to be observations, so it is convenient to work with the matrix transpose. 

In [None]:
golub <- t(golub)

In [None]:
dim(golub)

In [None]:
golub[1:3, ]

## Distances

### Pairwise distance between first 3 patients

In [None]:
dist(golub[1:3,])

In [None]:
dist(golub[1:3,], diag = TRUE)

In [None]:
dist(golub[1:3,], diag = TRUE, upper=TRUE)

### Manual calculation

Euclidean distance is the $n$-dimensional generalization of the Pythagorean theorem.

If we have points $x = (0,0)$ and $y = (3, 4)$, then the distance between $x$ and $y$ is 

$$
\sqrt{(3-0)^2 + (4-0)^2} = 5
$$

We write a vectorized calculation of the above.

In [None]:
distance <- function(x, y) { sqrt(sum((x - y)^2))}

In [None]:
x <- golub[1,]
y <- golub[2,]
round(distance(x, y), 2)
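
As a quick sanity check, the hand-written function can be compared with the built-in `dist`. The sketch below uses the toy points from the worked example rather than the Golub data, so it runs in base R alone; the names `pts`, `manual` and `builtin` are illustrative only.

```r
# Euclidean distance, as defined above
distance <- function(x, y) sqrt(sum((x - y)^2))

# Toy points from the worked example: x = (0, 0), y = (3, 4)
pts <- rbind(x = c(0, 0), y = c(3, 4))

manual  <- distance(pts["x", ], pts["y", ])
builtin <- as.numeric(dist(pts))   # dist() defaults to method = "euclidean"

manual                      # 5
all.equal(manual, builtin)  # TRUE
```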

## Ordination

### MDS

In [None]:
mds <- as.data.frame(cmdscale(dist(golub), k = 2))

In [None]:
dim(mds)

In [None]:
phenotype <- pData(Golub_Merge)$ALL.AML  # accessor instead of direct slot access

In [None]:
mds <- mds %>% mutate(phenotype=phenotype)

In [None]:
head(mds)

In [None]:
plot(mds$V1, mds$V2, type="n")
text(mds$V1, mds$V2, labels = mds$phenotype, col=as.integer(mds$phenotype))

### PCA

In [None]:
pca <- as.data.frame(prcomp(golub, center = TRUE, scale. = TRUE, rank. = 2)$x)

In [None]:
dim(pca)

In [None]:
pca <- pca %>% mutate(phenotype=phenotype)

In [None]:
head(pca)

In [None]:
plot(pca$PC1, pca$PC2, type="n")
text(pca$PC1, pca$PC2, labels = pca$phenotype, col=as.integer(pca$phenotype))
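
The two ordinations are closely related: classical MDS on Euclidean distances should reproduce the PCA scores of the centered, unscaled data, up to the sign of each axis. The PCA call in this section also scales the variables while the MDS call does not, so the two plots here need not look identical. The following is a minimal sketch on a small synthetic matrix (`X`, `mds_xy`, `pca_xy` and `same` are illustrative names, not part of the analysis).

```r
set.seed(1)                             # arbitrary seed, for reproducibility only
X <- matrix(rnorm(10 * 5), nrow = 10)   # 10 observations, 5 variables (synthetic)

mds_xy <- cmdscale(dist(X), k = 2)      # classical MDS on Euclidean distances
pca_xy <- prcomp(X, center = TRUE, scale. = FALSE, rank. = 2)$x

# Compare absolute values because each axis is defined only up to sign
same <- all.equal(abs(unname(mds_xy)), abs(unname(pca_xy)))
same
```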

## Preserving the distances

### Scale to have zero mean and unit standard deviation

In [None]:
scexpdat <- scale(golub)

In [None]:
dim(scexpdat)

### Check 

In [None]:
apply(scexpdat[, 1:4], 2, mean)

In [None]:
apply(scexpdat[, 1:4], 2, sd)
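
To make explicit what `scale` does with its defaults, the sketch below standardizes a small synthetic matrix by hand (subtract each column mean, divide by each column standard deviation) and compares the result. The names `X`, `manual` and `scaled` are illustrative.

```r
set.seed(1)                                   # arbitrary seed
X <- matrix(rnorm(6 * 3, mean = 10, sd = 2), nrow = 6)

# Subtract each column's mean, then divide by each column's sd
centered <- sweep(X, 2, colMeans(X), "-")
manual   <- sweep(centered, 2, apply(X, 2, sd), "/")

scaled <- scale(X)   # also stores "scaled:center" and "scaled:scale" attributes

# Ignore those extra attributes when comparing the values
all.equal(manual, scaled, check.attributes = FALSE)
```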

### Using `dplyr`

In [None]:
as.data.frame(scexpdat) %>% 
select(1:4) %>%
summarise_all(mean) %>%
round

In [None]:
as.data.frame(scexpdat) %>% 
select(1:4) %>%
summarise_all(sd) %>%
round

## Clustering

### Agglomerative hierarchical clustering (AHC)

In [None]:
# Avoid shadowing base::names(); spell out FALSE rather than F
airport_names <- c("ATL", "BOS", "ORD", "DCA")
airports <- c(0, 934, 585, 542, 934, 0, 853, 392, 
              585, 853, 0, 598, 542, 392, 598, 0)
airports <- matrix(airports, ncol = 4, byrow = FALSE,
                   dimnames = list(airport_names, airport_names))

In [None]:
airports

In [None]:
as.dist(airports)

In [None]:
tree <- hclust(as.dist(airports), method="single")
plot(tree)

In [None]:
tree <- hclust(as.dist(airports), method="complete")
plot(tree)
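
The difference between the two linkage methods shows up in the merge heights. Single linkage merges clusters at the smallest distance between any two members, complete linkage at the largest. The sketch below rebuilds the airport matrix so that it is self-contained (`city`, `d`, `single` and `complete` are illustrative names) and prints the heights at which merges happen.

```r
city <- c("ATL", "BOS", "ORD", "DCA")
d <- matrix(c(0, 934, 585, 542,
              934, 0, 853, 392,
              585, 853, 0, 598,
              542, 392, 598, 0),
            ncol = 4, dimnames = list(city, city))

single   <- hclust(as.dist(d), method = "single")
complete <- hclust(as.dist(d), method = "complete")

# Both start by joining the closest pair (BOS and DCA, 392),
# then diverge in how they measure cluster-to-cluster distance.
single$height    # 392 542 585
complete$height  # 392 585 934
```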

### Road-trip USA

In [None]:
plot(hclust(UScitiesD, method="complete"))

### A trip to Europe

In [None]:
plot(hclust(eurodist, method="complete"))

In [None]:
eurotree <- hclust(eurodist, method="complete")

#### Find clusters by height

In [None]:
groups <- cutree(tree = eurotree, h = 1500)
data.frame(groups) %>% 
rownames_to_column("city") %>% 
arrange(groups)

In [None]:
plot(eurotree)
rect.hclust(eurotree, h=1500, border = "red")

#### Find clusters by number

In [None]:
groups <- cutree(tree = eurotree, k = 8)
data.frame(groups) %>% 
rownames_to_column("city") %>% 
arrange(groups)

In [None]:
plot(eurotree)
rect.hclust(eurotree, k=8, border = "red")
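
The two ways of cutting the tree are linked: cutting at height `h` leaves one cluster for every merge that happens above `h`, plus one. A minimal sketch using the built-in `eurodist` data (the names `by_k` and `by_h` are illustrative):

```r
eurotree <- hclust(eurodist, method = "complete")

by_k <- cutree(eurotree, k = 8)      # ask for a fixed number of clusters
by_h <- cutree(eurotree, h = 1500)   # cut at a fixed height

length(unique(by_k))   # 8

# Number of clusters from a height cut = merges above the cut + 1
n_by_h <- length(unique(by_h))
n_by_h == sum(eurotree$height > 1500) + 1   # TRUE
```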

### k-means clustering

In [None]:
kmeans.golub <- kmeans(golub, centers=4)

In [None]:
plot(mds$V1, mds$V2, type="n")
text(mds$V1, mds$V2, labels = mds$phenotype, col=as.integer(kmeans.golub$cluster))

#### Grouped by data source

In [None]:
plot(mds$V1, mds$V2, type="n")
text(mds$V1, mds$V2, labels = mds$phenotype, 
     col = as.integer(pData(Golub_Merge)$Source))

## Semi-supervised learning (Noise discovery)

In [None]:
suppressPackageStartupMessages(library(genefilter))
suppressPackageStartupMessages(library(pheatmap))

### Simulate noise data set

Note that EVERY expression value is drawn from a standard normal distribution. Hence there should not be any meaningful distinction between the "groups".

In [None]:
m <- 20000 # number of genes
n <- 20 # number of subjects per group
alpha <- 0.005

grp <- factor(rep(c('N', 'Y'), c(n, n)))
genes <- paste("Gene", 1:m, sep="")
subjects <- paste("PID", 1:(2*n), sep="")
expr <- matrix(rnorm(2 * n * m), m, 2 * n)
rownames(expr) <- genes
colnames(expr) <- subjects

#### Find genes that differ between groups at the specified significance level

In [None]:
pvals <- rowttests(expr, grp)$p.value

In [None]:
df <- data.frame(expr, pvals)

In [None]:
top.genes <- df %>% 
filter(pvals < alpha) %>%
select(-pvals) 

In [None]:
dim(top.genes)
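
Because every value is pure noise, the p-values are approximately uniform on $[0, 1]$, so about $m \times \alpha = 20000 \times 0.005 = 100$ genes are expected to pass the filter by chance alone. The sketch below checks this with a hand-rolled equal-variance two-sample t-test in base R, used here as a stand-in for `rowttests` so that no extra packages are needed. The helper `row_var` and the seed are assumptions made for illustration.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
m <- 20000; n <- 20; alpha <- 0.005
expr <- matrix(rnorm(2 * n * m), m, 2 * n)
g1 <- expr[, 1:n]                  # group "N"
g2 <- expr[, (n + 1):(2 * n)]      # group "Y"

# Row-wise sample variance (hypothetical helper)
row_var <- function(x) rowSums((x - rowMeans(x))^2) / (ncol(x) - 1)

# Pooled-variance t statistic for every gene at once (equal group sizes)
sp2   <- (row_var(g1) + row_var(g2)) / 2
tstat <- (rowMeans(g1) - rowMeans(g2)) / sqrt(sp2 * 2 / n)
pvals <- 2 * pt(-abs(tstat), df = 2 * n - 2)

expected <- m * alpha              # 100
observed <- sum(pvals < alpha)     # varies with the seed, should be near 100
c(expected = expected, observed = observed)
```

The selected genes are therefore false positives by construction, which is worth keeping in mind when reading the heatmaps that follow.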

#### Show heatmap and AHC clustering for top genes

In [None]:
annot <- data.frame(grp=grp, row.names=colnames(top.genes))

In [None]:
head(annot)

#### Simple version of heatmap

In [None]:
pheatmap(top.genes)

#### Fancy version of heatmap

In [None]:
pheatmap(top.genes,
         annotation_col = annot,
         color = colorRampPalette(c("red3", "black", "green3"))(50),
         annotation_colors = list(grp = c(Y = "blue", N = "yellow")),
         show_rownames = FALSE, show_colnames = FALSE)

### MDS of top genes

In [None]:
mds <- cmdscale(dist(t(top.genes)))
plot(mds, col=as.integer(grp))

**Exercise 1**

Load the `iris` data set. Each row has 4 features and a Species label. 

- Reduce the dimensionality of the features to 2 using each of the methods described above (PCA, MDS). 
- Plot scatter plots for each method, coloring by Species. 
- Are the Species separate in these dimensionality-reduced plots?

**Exercise 2**

Load the `iris` data set. Each row has 4 features and a Species label. 

- Scale the data to have zero mean and unit standard deviation
- Calculate a pairwise distance matrix (explore different distance measures)
- Perform hierarchical clustering (explore different linkage measures)
- Plot a dendrogram for the hierarchical clustering, showing 3 clusters (see the `rect.hclust` function)
- Create a scatter plot of the first two features colored by the cluster label (see the `cutree` function)

**Exercise 3**

Load the `iris` data set. Each row has 4 features and a Species label. 

- Scale the data to have zero mean and unit standard deviation
- Perform k-means clustering using 2, 3, 4, and 10 clusters
- Create a scatter plot of the first two features colored by the cluster label for each cluster number
- How could you assess how many clusters are appropriate?