# Distance function basics

* Metrics measure the distance between two items
* Norms measure the size of a single item

## Euclidean distance

The most basic distance function in a vector space, based on the Pythagorean theorem.

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

In [4]:
A <- c(1,2)
B <- c(5,9)
dims <- c("x", "y")

m <- rbind(A, B)
colnames(m) <- dims
print(m)

d <- sqrt((5-1)^2 + (9-2)^2)
print(d)

  x y
A 1 2
B 5 9
[1] 8.062258


When applied to 3-dimensional space, the distance can be calculated as such:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$

Thus, the generalized formula for Euclidean distance in N-dimensional space can be defined as follows:

$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$

When implemented in R, the code could look something like this:

In [5]:
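myEucledian <- function(A, B) {
  # Accumulate the squared difference for each dimension
  total <- 0
  for(i in seq_along(A)) {
    total <- total + (A[i] - B[i])^2
  }
  # The distance is the square root of the accumulated sum
  return(sqrt(total))
}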

>Note that all examples in this file, such as the one above, spell out the formula with explicit for loops. Element-by-element loops are slow and unidiomatic in R, where vectorized operations are preferred (and the author should feel bad).
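To illustrate the point about loops, here is a minimal vectorized sketch of the same formula. The name `euclidVec` is hypothetical and not part of the original notebook; the cross-check uses base R's `dist()`, whose default method is Euclidean.

```r
# Hypothetical vectorized variant of the loop-based function above.
euclidVec <- function(a, b) {
  stopifnot(length(a) == length(b))  # both vectors need the same dimension
  sqrt(sum((a - b)^2))               # arithmetic is applied element-wise
}

A <- c(1, 2)
B <- c(5, 9)

d_vec  <- euclidVec(A, B)
d_base <- as.numeric(dist(rbind(A, B)))  # base R, Euclidean by default

print(d_vec)   # [1] 8.062258
print(d_base)  # [1] 8.062258
```

Subtraction, squaring, and `sum()` all operate on whole vectors, so no explicit index is needed.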

In [6]:
d <- myEucledian(A, B)
print(d)

[1] 8.062258


In [16]:
A <- c(2,7,4)
B <- c(3,4,5)
dims <- c("x", "y", "z")

m <- rbind(A, B)
colnames(m) <- dims
print(m)

d <- myEucledian(A, B)
print(d)

  x y z
A 2 7 4
B 3 4 5
[1] 3.316625


## Manhattan distance

Sometimes the most direct path from point A to point B is not a straight line. Think of a taxicab that has to drive around buildings.

$$ d(a, b) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert $$

In [17]:
myManhattan <- function(A, B) {
  # Accumulate the absolute difference for each dimension
  total <- 0
  for(i in seq_along(A)) {
    total <- total + abs(A[i] - B[i])
  }
  return(total)
}

In [18]:
d <- myManhattan(A, B)
print(d)

[1] 5
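As a sanity check, the same value can be obtained without a loop. The sketch below assumes the 3-dimensional vectors A and B from the example above; `manhattanVec` is a hypothetical name, and `dist(..., method = "manhattan")` is the base R equivalent.

```r
# Hypothetical vectorized variant of the Manhattan distance.
manhattanVec <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(abs(a - b))  # sum of absolute per-dimension differences
}

A <- c(2, 7, 4)
B <- c(3, 4, 5)

d_vec  <- manhattanVec(A, B)
d_base <- as.numeric(dist(rbind(A, B), method = "manhattan"))

print(d_vec)   # [1] 5
print(d_base)  # [1] 5
```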


## Chebyshev distance

Also known as "chessboard distance", where the distance between two points is the greatest difference along any single coordinate. Think of a chessboard: the Chebyshev distance between two squares is the minimum number of moves a king needs to get from one to the other.

$$ d(a, b) = \lim_{k \to \infty} \left( \sum_{i=1}^{n} \lvert a_i - b_i \rvert^k \right)^{1/k} $$

$$ d(a, b) = \max_i(\lvert a_i - b_i \rvert) $$
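The two formulas describe the same quantity: as the exponent k grows, the largest per-coordinate difference dominates the sum. A small numeric sketch can make this visible. The helper `minkowski` is a hypothetical name introduced only for this illustration, and the vectors are the same 3-dimensional A and B used elsewhere in this file.

```r
A <- c(2, 7, 4)
B <- c(3, 4, 5)

# Hypothetical helper: the expression inside the limit, for a finite k.
minkowski <- function(a, b, k) {
  sum(abs(a - b)^k)^(1 / k)
}

# Evaluate the expression for increasing k and compare with the maximum.
ks <- c(1, 2, 4, 16, 64)
values <- sapply(ks, function(k) minkowski(A, B, k))
print(data.frame(k = ks, distance = values))
print(max(abs(A - B)))
```

For k = 1 this is the Manhattan distance and for k = 2 the Euclidean distance; the values approach the largest absolute difference as k increases.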

Given two 3-dimensional vectors, the distance can be calculated as follows:

$$ d = \max(\lvert x_2 - x_1 \rvert, \lvert y_2 - y_1 \rvert, \lvert z_2 - z_1 \rvert) $$

In [36]:
print(m)
d <- max( abs(3 - 2), abs(4 - 7), abs(5 - 4) )
print(d)

  x y z
A 2 7 4
B 3 4 5
[1] 3


In [37]:
# implement your own R function here
myCheb <- function(A, B) {
    dist <- 0
    return(dist)
}

d <- myCheb(A, B)
print(d)

[1] 0
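One possible way to fill in the stub above is sketched below. It is only a suggestion, not the intended solution: the name `chebSketch` is hypothetical, and the result is compared with base R's `dist(..., method = "maximum")`.

```r
# Hypothetical solution sketch for the Chebyshev exercise.
chebSketch <- function(a, b) {
  stopifnot(length(a) == length(b))
  max(abs(a - b))  # the largest absolute per-dimension difference
}

A <- c(2, 7, 4)
B <- c(3, 4, 5)

d_cheb <- chebSketch(A, B)
d_base <- as.numeric(dist(rbind(A, B), method = "maximum"))

print(d_cheb)  # [1] 3
print(d_base)  # [1] 3
```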


## Canberra distance

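Canberra distance is a weighted version of Manhattan distance, often used for comparing ranked lists. Each coordinate difference is divided by the sum of the absolute values of the two coordinates, so the same absolute difference counts for less when the values involved are large. The distance between Canberra and Sydney might seem significant to an Estonian, but not to locals who are used to vast distances between cities.

$$ d(a, b) = \sum_{i=1}^{n} \frac{\lvert a_i - b_i \rvert}{\lvert a_i \rvert + \lvert b_i \rvert} $$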

In [41]:
d <- sum( ( abs(3 - 2) / ( abs(3) + abs(2) ) ), ( abs(4 - 7) / ( abs(4) + abs(7) ) ), ( abs(5 - 4) / ( abs(5) + abs(4)) ) )
print(d)

[1] 0.5838384


In [44]:
# implement your own R function here
myCanberra <- function(A, B) {
    dist <- 0
    return(dist)
}

d <- myCanberra(A, B)
print(d)

[1] 0
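One possible way to fill in the stub above is sketched below. The name `canberraSketch` is hypothetical. The formula is undefined when both coordinates are zero (0/0), so the sketch makes an explicit assumption and lets such coordinates contribute 0; other conventions exist.

```r
# Hypothetical solution sketch for the Canberra exercise.
canberraSketch <- function(a, b) {
  stopifnot(length(a) == length(b))
  num <- abs(a - b)
  den <- abs(a) + abs(b)
  # Assumption: a coordinate where both values are 0 contributes 0, not NaN.
  terms <- ifelse(den == 0, 0, num / den)
  sum(terms)
}

A <- c(2, 7, 4)
B <- c(3, 4, 5)

d_canb <- canberraSketch(A, B)
print(d_canb)  # [1] 0.5838384
```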


## Mahalanobis distance

Mahalanobis distance measures the distance between a point P and a distribution D, essentially showing how many standard deviations the point lies from the mean in multidimensional space. The same idea can be used to measure dissimilarity between two vectors within the same distribution.

$$ d(a, b) = \sqrt{(a - b)^T COV^{-1} (a - b)} $$

Note that $COV^{-1}$ stands for the inverse covariance matrix of all points within the background distribution. For example, let's assume that points A and B belong to a random 3-dimensional normal distribution D that consists of 100 data points.

In [55]:
N <- 3    # number of dimensions
M <- 100  # number of data points
D <- matrix( rnorm(M*N,mean=0,sd=5), M, N)
print(head(D))

           [,1]       [,2]      [,3]
[1,] -4.4343376  0.7124866 -5.609467
[2,]  1.1815045  1.9585748  1.780719
[3,] -5.6168014  3.8287208 -1.235574
[4,]  6.3211163 -1.5551644  3.084430
[5,] -0.5977857  3.4636118 -2.650562
[6,] -3.4372185 -9.2523566 -0.297560


We can then calculate the inverse covariance matrix.

In [62]:
COV <- cov( D )
invCOV <- solve( COV )
print(COV)
print(invCOV)

           [,1]      [,2]       [,3]
[1,] 19.9259073 -1.021957  0.5742338
[2,] -1.0219571 25.201937  2.7458520
[3,]  0.5742338  2.745852 18.0060311
             [,1]         [,2]         [,3]
[1,]  0.050357737  0.002254478 -0.001949768
[2,]  0.002254478  0.040450837 -0.006240499
[3,] -0.001949768 -0.006240499  0.056550780


In [64]:
# myMahal(c(1,2), c(2,3), solve(cov(matr)))
myMahal <- function(A, B, invCOV) {
  diff = A - B
  dist = sqrt( t(diff) %*% invCOV %*% diff )
  return(dist)
}

In [63]:
d <- myMahal(A, B, invCOV)
print(d)

          [,1]
[1,] 0.7007015


Note that the formula multiplies by the transpose of the difference vector as well as by the difference vector itself. Luckily, R makes the transpose operation very simple to use.

In [74]:
print(m)

  x y z
A 2 7 4
B 3 4 5


In [75]:
print(t(m))

  A B
x 2 3
y 7 4
z 4 5
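As a cross-check on the Mahalanobis formula, the manual matrix product can be compared with `mahalanobis()` from base R's stats package, which returns the squared distance. The sketch below is self-contained and makes two assumptions that are not in the original notebook: a fixed seed, and freshly generated background data of the same shape as D, so the printed value will differ from the output shown earlier.

```r
# Assumption: fixed seed for reproducibility (the notebook does not set one).
set.seed(42)
D <- matrix(rnorm(100 * 3, mean = 0, sd = 5), 100, 3)

A <- c(2, 7, 4)
B <- c(3, 4, 5)

COV <- cov(D)
diff <- A - B

# Manual formula: sqrt( diff^T COV^-1 diff ), reduced from a 1x1 matrix to a number.
manual <- sqrt(as.numeric(t(diff) %*% solve(COV) %*% diff))

# stats::mahalanobis() returns the SQUARED distance, so take the square root.
builtin <- sqrt(mahalanobis(A, center = B, cov = COV))

print(manual)
print(all.equal(manual, builtin))
```

Passing B as `center` treats it as the reference point, which matches the two-vector form of the formula above.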


## Cosine distance

Prior distance measures are primarily designed for working in numerical vector spaces. However, this does not translate well into text data mining applications where 