Switch branches/tags
Nothing to show
Find file Copy path
fb5d489 Mar 30, 2016
3 contributors

Users who have contributed to this file

@mikelove @rafalab @alexnones
131 lines (88 sloc) 5.28 KB
layout title
Distance lecture
opts_chunk$set(fig.path=paste0("figure/", sub("(.*).Rmd","\\1",basename(knitr:::knit_concord$get('infile'))), "-"))

Distance and Dimension Reduction


The concept of distance is quite intuitive. For example, when we cluster animals into subgroups, we are implicitly defining a distance that permits us to say what animals are "close" to each other.

Clustering of animals.

Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. Many clustering and machine learning techniques rely on being able to define distance, using features or predictors. For example, to create heatmaps, which are widely used in genomics and other high-throughput fields, a distance is computed explicitly.

Example of heatmap. Image Source: Heatmap, Gaeddal, 01.28.2007 Wikimedia Commons

In these plots the measurements, which are stored in a matrix, are represented with colors after the columns and rows have been clustered. (A side note: red and green, a common color theme for heatmaps, are two of the most difficult colors for many color-blind people to discern.) Here we will learn the necessary mathematics and computing skills to understand and create heatmaps. We start by reviewing the mathematical definition of distance.

Euclidean Distance

As a review, let's define the distance between two points, $A$ and $B$, on a Cartesian plane.


The euclidean distance between $A$ and $B$ is simply:

$$\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}$$

Distance in High Dimensions

We introduce a dataset with gene expression measurements for 22,215 genes from 189 samples. The R objects can be downloaded like this:


The data represent RNA expression levels for eight tissues, each with several individuals.

dim(e) ##e contains the expression data
table(tissue) ##tissue[i] tells us what tissue is represented by e[,i]

We are interested in describing distance between samples in the context of this dataset. We might also be interested in finding genes that behave similarly across samples.

To define distance, we need to know what the points are since mathematical distance is computed between points. With high dimensional data, points are no longer on the Cartesian plane. Instead they are in higher dimensions. For example, sample $i$ is defined by a point in 22,215 dimensional space: $(Y_{1,i},\dots,Y_{22215,i})^\top$. Feature $g$ is defined by a point in 189 dimensions $(Y_{g,1},\dots,Y_{g,189})^\top$

Once we define points, the Euclidean distance is defined in a very similar way as it is defined for two dimensions. For instance, the distance between two samples $i$ and $j$ is:

$$ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } $$

and the distance between two features $h$ and $g$ is:

$$ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } $$

Note: In practice, distances between features are typically applied after standardizing the data for each feature. This is equivalent to computing one minus the correlation. This is done because the differences in overall levels between features are often not due to biological effects but technical ones. More details on this topic can be found in this presentation.

Distance with matrix algebra

The distance between samples $i$ and $j$ can be written as

$$ \mbox{dist}(i,j) = (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j)$$

with $\mathbf{Y}_i$ and $\mathbf{Y}_j$ columns $i$ and $j$. This result can be very convenient in practice as computations can be made much faster using matrix multiplication.


We can now use the formulas above to compute distance. Let's compute distance between samples 1 and 2, both kidneys, and then to sample 87, a colon.

x <- e[,1]
y <- e[,2]
z <- e[,87]

As expected, the kidneys are closer to each other. A faster way to compute this is using matrix algebra:

sqrt( crossprod(x-y) )
sqrt( crossprod(x-z) )

Now to compute all the distances at once, we have the function dist. Because it computes the distance between each row, and here we are interested in the distance between samples, we transpose the matrix

d <- dist(t(e))

Note that this produces an object of class dist and, to access the entries using row and column indices, we need to coerce it into a matrix:


It is important to remember that if we run dist on e, it will compute all pairwise distances between genes. This will try to create a $22215 \times 22215$ matrix that may crash your R sessions.