# Distance function basics

* Metrics measure the distance between two items
* Norms measure the size of a single item

## Euclidean distance

Euclidean distance is the most basic distance function in a vector space and is based on the Pythagorean theorem. For two points in a plane:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

In [88]:
A <- c(1,2)
B <- c(5,9)
dims <- c("x", "y")

m <- rbind(A, B)
colnames(m) <- dims
print(m)

d <- sqrt((5-1)^2 + (9-2)^2)
print(d)

  x y
A 1 2
B 5 9
[1] 8.062258


When applied to 3-dimensional space, the distance can be calculated as such:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$

Thus, the generalized formula for Euclidean distance in N-dimensional space can be defined as follows:

$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$

When implemented in R, the code should look something like this:

In [89]:
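# Euclidean distance between two numeric vectors of equal length
myEucledian <- function(A, B) {
  total <- 0  # running sum of squared differences (avoids shadowing sum())
  for (i in seq_along(A)) {
    total <- total + (A[i] - B[i])^2
  }
  return(sqrt(total))
}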

> Note that all loop-based examples in this file, such as the one above, are horribly bad because explicit for loops are slow and unidiomatic in R (and the author should feel bad).
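
As a contrast to the loop above, here is a minimal vectorized sketch. The name `euclideanVec` is made up for this illustration; it relies only on base R arithmetic, which operates on whole vectors at once.

```r
# Vectorized Euclidean distance: subtraction, squaring and sum()
# all work element-wise on whole vectors, so no explicit loop is needed.
# euclideanVec is a hypothetical name used only for this sketch.
euclideanVec <- function(A, B) {
  stopifnot(length(A) == length(B))  # both points need the same dimensions
  sqrt(sum((A - B)^2))
}

print(euclideanVec(c(1, 2), c(5, 9)))  # same points as the first example
```

The result should match the loop-based function for the same inputs.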

In [90]:
d <- myEucledian(A, B)
print(d)

[1] 8.062258


In [91]:
A <- c(2,7,4)
B <- c(3,4,5)
dims <- c("x", "y", "z")

m <- rbind(A, B)
colnames(m) <- dims
print(m)

d <- myEucledian(A, B)
print(d)

  x y z
A 2 7 4
B 3 4 5
[1] 3.316625


## Manhattan distance

Sometimes the most direct path from point A to point B is not a straight line. Think of a taxicab that has to drive around buildings.

$$ d(a, b) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert $$

In [92]:
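# Manhattan distance between two numeric vectors of equal length
myManhattan <- function(A, B) {
  total <- 0  # running sum of absolute differences (avoids shadowing sum())
  for (i in seq_along(A)) {
    total <- total + abs(A[i] - B[i])
  }
  return(total)
}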

In [93]:
d <- myManhattan(A, B)
print(d)

[1] 5
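
For the two 3-dimensional points defined earlier, A = (2, 7, 4) and B = (3, 4, 5), the sum can be written out term by term, which agrees with the printed result:

$$ d(A, B) = \lvert 2 - 3 \rvert + \lvert 7 - 4 \rvert + \lvert 4 - 5 \rvert = 1 + 3 + 1 = 5 $$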


## Chebyshev distance

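Also known as "chessboard distance": the distance between two points is the largest difference along any single coordinate. On a chessboard, this equals the minimum number of moves a king needs to travel between two squares, since a king can step one square in any direction, including diagonally.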

$$ d(a, b) = \lim_{k \to \infty} \left( \sum_{i=1}^{n} \lvert a_i - b_i \rvert^k \right)^{1/k} $$

$$ d(a, b) = \max_i \lvert a_i - b_i \rvert $$

Given two 3-dimensional vectors, the distance can be calculated as such:

$$ d = \max(\lvert x_2 - x_1 \rvert, \lvert y_2 - y_1 \rvert, \lvert z_2 - z_1 \rvert) $$

In [94]:
print(m)
d <- max( abs(3 - 2), abs(4 - 7), abs(5 - 4) )
print(d)

  x y z
A 2 7 4
B 3 4 5
[1] 3


In [95]:
# implement your own R function here
myCheb <- function(A, B) {
    dist <- 0
    return(dist)
}

d <- myCheb(A, B)
print(d)

[1] 0
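
One possible way to complete the exercise above, offered as a sketch rather than the intended solution. The names `chebyshevSketch` and `minkowskiSketch` are made up for this illustration. The second helper evaluates the sum inside the limit formula for a fixed k, to show numerically that it approaches the maximum as k grows.

```r
# Chebyshev distance: the largest absolute coordinate difference.
# chebyshevSketch is a hypothetical name used only for this sketch.
chebyshevSketch <- function(A, B) {
  stopifnot(length(A) == length(B))
  max(abs(A - B))
}

# Minkowski distance of order k (hypothetical helper). For large k it
# approaches the Chebyshev distance, mirroring the limit formula.
minkowskiSketch <- function(A, B, k) {
  sum(abs(A - B)^k)^(1 / k)
}

A <- c(2, 7, 4)
B <- c(3, 4, 5)

print(chebyshevSketch(A, B))  # 3, same as the manual calculation

# Orders 1 and 2 are the Manhattan and Euclidean distances;
# higher orders move towards the Chebyshev value.
print(sapply(c(1, 2, 10, 100), function(k) minkowskiSketch(A, B, k)))
```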


## Canberra distance

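Canberra distance is a weighted version of Manhattan distance, often used for comparing ranked lists. Each coordinate difference is divided by the sum of the absolute values of the two coordinates, so differences are measured relative to the size of the values involved. By analogy, the distance between Canberra and Sydney might seem significant to an Estonian but not to locals who are used to vast distances between cities.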

$$ d(a, b) = \sum_{i=1}^{n} \frac{\lvert a_i - b_i \rvert}{\lvert a_i \rvert + \lvert b_i \rvert} $$

In [96]:
d <- sum( ( abs(3 - 2) / ( abs(3) + abs(2) ) ), ( abs(4 - 7) / ( abs(4) + abs(7) ) ), ( abs(5 - 4) / ( abs(5) + abs(4)) ) )
print(d)

[1] 0.5838384


In [97]:
# implement your own R function here
myCanberra <- function(A, B) {
    dist <- 0
    return(dist)
}

d <- myCanberra(A, B)
print(d)

[1] 0
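
One possible way to complete the exercise above, again as a sketch under stated assumptions. The name `canberraSketch` is made up, and skipping coordinates where both values are zero is a choice made here to avoid dividing zero by zero; the formula itself does not say how to treat that case.

```r
# Canberra distance: absolute differences scaled by the sum of magnitudes.
# canberraSketch is a hypothetical name used only for this sketch.
canberraSketch <- function(A, B) {
  stopifnot(length(A) == length(B))
  num <- abs(A - B)
  den <- abs(A) + abs(B)
  # Assumption: coordinates where both values are zero contribute nothing,
  # which avoids 0/0 producing NaN.
  keep <- den > 0
  sum(num[keep] / den[keep])
}

print(canberraSketch(c(2, 7, 4), c(3, 4, 5)))  # 1/5 + 3/11 + 1/9
```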


## Mahalanobis distance

Mahalanobis distance measures the distance between a point P and a distribution D, essentially showing how many standard deviations the point lies from the mean in multidimensional space. The same idea can be used to measure dissimilarity between two vectors drawn from the same distribution.

$$ d(a, b) = \sqrt{(a - b)^T COV^{-1} (a - b)} $$

Note that $COV^{-1}$ stands for the inverse covariance matrix of all points within the background distribution. For example, let's assume that points A and B belong to a random 3-dimensional normal distribution D that consists of 100 data points.

In [98]:
N = 3
M = 100
D <- matrix( rnorm(M*N,mean=0,sd=5), M, N)
print(head(D))

           [,1]       [,2]       [,3]
[1,]  -1.743575  2.6698704 -8.7842476
[2,]   3.056308 -0.4322666 -5.3165275
[3,]   1.919312 -1.8424484  0.4574133
[4,]  12.747409  5.1947253  4.6359105
[5,] -10.145219  2.4702066  5.9390592
[6,]  -9.347715  0.2057458  3.5709220


We can then calculate the inverse covariance matrix.

In [99]:
COV <- cov( D )
invCOV <- solve( COV )
print(COV)
print(invCOV)

           [,1]       [,2]       [,3]
[1,] 25.0184163  1.5949081  0.9307629
[2,]  1.5949081 23.6353062  0.9626425
[3,]  0.9307629  0.9626425 27.8636132
             [,1]         [,2]         [,3]
[1,]  0.040186706 -0.002660864 -0.001250478
[2,] -0.002660864  0.042545386 -0.001380990
[3,] -0.001250478 -0.001380990  0.035978582


In [100]:
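# Example call: myMahal(c(1,2), c(2,3), solve(cov(matr)))
# invCOV is the inverse covariance matrix of the background distribution
myMahal <- function(A, B, invCOV) {
  delta <- A - B  # difference vector (named to avoid shadowing diff())
  d <- sqrt( t(delta) %*% invCOV %*% delta )  # result is a 1x1 matrix
  return(d)
}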

In [101]:
d <- myMahal(A, B, invCOV)
print(d)

          [,1]
[1,] 0.6934147


Note that the formula uses the transpose of the difference vector in addition to the difference vector itself. Luckily, R makes the transpose operation very simple to use.

In [102]:
print(m)

  x y z
A 2 7 4
B 3 4 5


In [103]:
print(t(m))

  A B
x 2 3
y 7 4
z 4 5
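
The manual formula can be cross-checked against base R's `mahalanobis()` function from the stats package, which returns the squared distance. The sketch below regenerates its own background data with an arbitrary seed, so its numbers will not match the output printed earlier; only the agreement between the two calculations matters.

```r
set.seed(42)  # arbitrary seed, chosen only to make this sketch reproducible
D <- matrix(rnorm(100 * 3, mean = 0, sd = 5), 100, 3)
COV <- cov(D)

A <- c(2, 7, 4)
B <- c(3, 4, 5)

# Manual formula; as.numeric() drops the 1x1 matrix wrapper.
delta <- A - B
manual <- as.numeric(sqrt(t(delta) %*% solve(COV) %*% delta))

# stats::mahalanobis() returns the SQUARED distance, hence the sqrt().
builtin <- sqrt(mahalanobis(A, center = B, cov = COV))

print(all.equal(manual, builtin))
```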


## Cosine distance

The previous distance measures are designed for numerical vector spaces, so they do not apply directly to text, which has no numeric coordinates to begin with. A common trick in text mining is to create a "bag of words" and apply cosine distance to it. Imagine that our example 3-dimensional vectors represent two distinct documents. The dimensions would in this case represent word counts per document.

In [104]:
words <- c("yes", "no", "maybe")
documents <- c("doc1", "doc2")

colnames(m) <- words
rownames(m) <- documents

print(m)

     yes no maybe
doc1   2  7     4
doc2   3  4     5


Cosine distance, or rather cosine similarity, is based on the idea that $\cos(90^\circ) = 0$ while $\cos(0^\circ) = 1$. In other words, vectors pointing in the same direction are treated as identical, while orthogonal vectors are treated as completely distinct. Thus, with a bag of words we can convert textual documents into a high-dimensional vector space and apply standard data mining methods that would otherwise be unsuitable for string input.

Cosine similarity is defined as the ratio of the dot product of two vectors to the product of their magnitudes.

$$ s(a, b) = \frac{a \cdot b}{\lVert a \rVert \lVert b \rVert} = \frac{ \sum_{i=1}^{n} a_i b_i }{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2 } } $$

Cosine similarity does not conform to the metric requirements on its own: in general it ranges from -1 to 1, and it grows as vectors become more alike. Word counts, however, can only be $\ge 0$, so for bag-of-words vectors the similarity lies between 0 and 1. Thus, an easy way to measure distance between two documents is to take the complement of cosine similarity.

$$ d(a, b) = 1 - s(a, b) $$

In [144]:
# implement cosine distance here
myCosine <- function(A, B) {
    dist <- 0
    return(dist)
}
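
One possible way to complete the exercise above, as a sketch rather than the intended solution. The name `cosineSketch` is made up for this illustration. Note that the formula is undefined when either vector is all zeros (an empty document), and this sketch does not handle that case.

```r
# Cosine distance: 1 minus the cosine similarity of two vectors.
# cosineSketch is a hypothetical name used only for this sketch.
cosineSketch <- function(A, B) {
  stopifnot(length(A) == length(B))
  similarity <- sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))
  1 - similarity
}

# Word counts of the two example documents.
doc1 <- c(yes = 2, no = 7, maybe = 4)
doc2 <- c(yes = 3, no = 4, maybe = 5)

print(cosineSketch(doc1, doc2))
```

Documents with proportional word counts get a distance of (approximately) 0, and documents with no words in common get a distance of 1.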

## Jaccard distance

Jaccard similarity and distance are used in a similar way to the cosine method. However, Jaccard similarity is the ratio between the intersection and the union of two documents. In other words, it is the number of words that the two documents have in common divided by the number of distinct words that appear in either of them.

$$ d(a, b) = 1 - \frac{ \lvert a \cap b \rvert }{ \lvert a \cup b \rvert } $$

Let us add another word to the example dataset.

In [145]:
ok <- c(0, 3)
m2 <- cbind(m, ok)
print(m2)

     yes no maybe ok
doc1   2  7     4  0
doc2   3  4     5  3


We are only concerned with whether a word occurs in a document or not, so the counts can be reduced to 0 and 1.

In [146]:
exists <- ifelse(m2 > 0, 1, 0)
print(exists)

     yes no maybe ok
doc1   1  1     1  0
doc2   1  1     1  1


Now we need the total number of unique words and the number of words common to both documents. R's built-in set functions such as union and intersect return the elements themselves rather than their counts, so I wrote my own simple function that checks whether the variance of a vector is 0 (all elements identical). Note that this check would also treat a word missing from both documents as common; that cannot happen here, because every word occurs in at least one document.

In [147]:
uniq <- length(unique(colnames(exists)))

isEqual <- function(x){
    if ( var(x) == 0 ) {
        return(TRUE)
    } else {
        return(FALSE)
    }
}

common <- apply(exists, 2, isEqual)
print(uniq)
print(common)

d <- 1 - ( length(which(common)) / uniq )
print(d)

[1] 4
  yes    no maybe    ok 
 TRUE  TRUE  TRUE FALSE 
[1] 0.25


> Note that the apply function was used over matrix columns instead of iterating with a for loop, because this is the more concise and idiomatic approach in R.
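
The same calculation can also be sketched with column sums over the 0/1 matrix. The variable names `inBoth` and `inEither` are made up for this illustration. Unlike the variance check, this version counts a word as common only when it occurs in every document, and it counts only words that occur in at least one document towards the union.

```r
# Presence/absence matrix from the example above, rebuilt here so the
# sketch is self-contained.
exists <- rbind(doc1 = c(yes = 1, no = 1, maybe = 1, ok = 0),
                doc2 = c(yes = 1, no = 1, maybe = 1, ok = 1))

# A word is in the intersection if every document contains it,
# and in the union if at least one document contains it.
inBoth   <- colSums(exists) == nrow(exists)
inEither <- colSums(exists) > 0

d <- 1 - sum(inBoth) / sum(inEither)
print(d)  # 0.25, matching the result above
```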

## Distance matrix

In practice, you will often need to calculate pairwise distances for all data points. Unless you are implementing the previously discussed algorithms from scratch in a low-level language like C, Go or Rust, it is advisable to generate this distance matrix with an efficient built-in method, especially in R.

Let's generate a 2-dimensional data set with 5 points as an example.

In [22]:
points <- 5
data <- rnorm(points*2, mean=c(0,0), 1)
data <- matrix(data, ncol=2, nrow=points)
print(head(data))

           [,1]       [,2]
[1,] -0.1171901  0.1005121
[2,] -0.8883891  1.4588151
[3,] -0.7064374 -0.3573795
[4,]  1.8945399 -1.1514493
[5,]  0.1936501 -0.2047024


In [23]:
eucDistMatrix <- dist(data, method = "euclidean", diag = TRUE)
print(eucDistMatrix)

          1         2         3         4         5
1 0.0000000                                        
2 1.5619651 0.0000000                              
3 0.7462420 1.8252861 0.0000000                    
4 2.3694862 3.8155175 2.7194907 0.0000000          
5 0.4356346 1.9844645 0.9129446 1.9466268 0.0000000


Notice that the distance from a point to itself is always zero. Furthermore, only the lower or upper triangle needs to be populated, as the distance between two points is the same when measured from either direction, provided that the chosen distance function satisfies symmetry, one of the four metric requirements:

Non-negativity
$$ d(a, b) \ge 0 $$

Identity of indiscernibles
$$ d(a, b) = 0 \iff a = b $$

Symmetry
$$ d(a, b) = d(b, a) $$

Triangle inequality
$$ d(a, c) \le d(a, b) + d(b, c) $$
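
These requirements can be checked numerically on a small distance matrix. The sketch below is illustrative only: point C is made up, and the Manhattan method is an arbitrary choice. It also shows that the dist function covers several of the measures discussed earlier through its method argument.

```r
# A and B are the example points used earlier; C is a made-up third point.
pts <- rbind(A = c(2, 7, 4), B = c(3, 4, 5), C = c(0, 1, 6))

# dist() supports several of the measures covered above.
methods <- c("euclidean", "manhattan", "maximum", "canberra")
abDistances <- sapply(methods, function(mth) {
  as.matrix(dist(pts, method = mth))["A", "B"]
})
print(abDistances)

# as.matrix() expands the compact lower triangle into a full square matrix.
full <- as.matrix(dist(pts, method = "manhattan"))

nonNegative  <- all(full >= 0)          # d(a, b) >= 0
zeroDiagonal <- all(diag(full) == 0)    # d(a, a) = 0
symmetric    <- isSymmetric(full)       # d(a, b) = d(b, a)

# Triangle inequality, checked for every ordered triple of points.
n <- nrow(full)
triples <- expand.grid(a = 1:n, b = 1:n, c = 1:n)
direct <- full[cbind(triples$a, triples$c)]
detour <- full[cbind(triples$a, triples$b)] + full[cbind(triples$b, triples$c)]
triangle <- all(direct <= detour + 1e-12)  # small tolerance for rounding

print(c(nonNegative, zeroDiagonal, symmetric, triangle))
```

A check like this can only reveal violations on the given points; it does not prove that a distance function is a metric in general.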