---
title: "Distances"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Distances}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---
## How to use `distance()`

The `distance()` function implemented in `philentropy` is able to compute 46 different distances/similarities between probability density functions (see `?philentropy::distance` for details).

### Simple Example
The `distance()` function is implemented using the same _logic_ as R's base function `stats::dist()` and takes a `matrix` or `data.frame` as input. The corresponding `matrix` or `data.frame` should store probability density functions (as rows) for which distance computations should be performed.
```r
# define a probability density function P
P <- 1:10/sum(1:10)
# define a probability density function Q
Q <- 20:29/sum(20:29)
# combine P and Q as matrix object
x <- rbind(P, Q)
```
Please note that when defining a `matrix` from vectors, probability vectors should be combined as rows (`rbind()`).
```r
library(philentropy)
# compute the Euclidean Distance with default parameters
distance(x, method = "euclidean")
```

```
euclidean 
0.1280713
```
For this simple case, you can compare the result with R's base function for computing the Euclidean distance, `stats::dist()`.
```r
# compute the Euclidean Distance using R's base function
stats::dist(x, method = "euclidean")
```

```
          P
Q 0.1280713
```
However, the R base function `stats::dist()` only computes the following distance measures: `"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"`, whereas `distance()` allows you to choose from 46 distance/similarity measures.
To find out which `method`s are implemented in `distance()` you can consult the `getDistMethods()` function.
```r
# names of implemented distance/similarity functions
getDistMethods()
```

```
 [1] "euclidean"         "manhattan"         "minkowski"         "chebyshev"        
 [5] "sorensen"          "gower"             "soergel"           "kulczynski_d"     
 [9] "canberra"          "lorentzian"        "intersection"      "non-intersection" 
[13] "wavehedges"        "czekanowski"       "motyka"            "kulczynski_s"     
[17] "tanimoto"          "ruzicka"           "inner_product"     "harmonic_mean"    
[21] "cosine"            "hassebrook"        "jaccard"           "dice"             
[25] "fidelity"          "bhattacharyya"     "hellinger"         "matusita"         
[29] "squared_chord"     "squared_euclidean" "pearson"           "neyman"           
[33] "squared_chi"       "prob_symm"         "divergence"        "clark"            
[37] "additive_symm"     "kullback-leibler"  "jeffreys"          "k_divergence"     
[41] "topsoe"            "jensen-shannon"    "jensen_difference" "taneja"           
[45] "kumar-johnson"     "avg"
```
Now you can choose any distance/similarity `method` that suits your application.
```r
# compute the Jaccard Distance with default parameters
distance(x, method = "jaccard")
```

```
 jaccard 
0.133869
```
Analogously, when a probability matrix is specified, the following output is generated.
```r
# combine three probability vectors to a probability matrix
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("Example", 1:3)
# compute the euclidean distance between all
# pairwise comparisons of probability vectors
distance(ProbMatrix, method = "euclidean")
```
```
#> Metric: 'euclidean'; comparing: 3 vectors.
          v1         v2         v3
v1 0.0000000 0.12807130 0.13881717
v2 0.1280713 0.00000000 0.01074588
v3 0.1388172 0.01074588 0.00000000
```
Alternatively, users can specify the argument `use.row.names = TRUE` to maintain the rownames of the input matrix and pass them as rownames and colnames to the output distance matrix.
```r
# compute the euclidean distance between all
# pairwise comparisons of probability vectors
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE)
```

```
#> Metric: 'euclidean'; comparing: 3 vectors.
          Example1   Example2   Example3
Example1 0.0000000 0.12807130 0.13881717
Example2 0.1280713 0.00000000 0.01074588
Example3 0.1388172 0.01074588 0.00000000
```
This output differs from the output of `stats::dist()`.
```r
# compute the euclidean distance between all
# pairwise comparisons of probability vectors
# using stats::dist()
stats::dist(ProbMatrix, method = "euclidean")
```

```
           1          2
2 0.12807130           
3 0.13881717 0.01074588
```
Whereas `distance()` returns a full symmetric distance matrix, `stats::dist()` returns only the lower triangle of that matrix.
However, users can also specify the argument `as.dist.obj = TRUE` in `philentropy::distance()`
to retrieve the output as a `dist` object, i.e. the same object type that `stats::dist()` returns.
```r
ProbMatrix <- rbind(1:10/sum(1:10), 20:29/sum(20:29), 30:39/sum(30:39))
rownames(ProbMatrix) <- paste0("test", 1:3)
distance(ProbMatrix, method = "euclidean", use.row.names = TRUE, as.dist.obj = TRUE)
```

```
Metric: 'euclidean'; comparing: 3 vectors.
           test1      test2
test2 0.12807130           
test3 0.13881717 0.01074588
```
Now let's compare the run times of base R and `philentropy`. For this purpose you need to install the `microbenchmark` package.
```r
# install.packages("microbenchmark")
library(microbenchmark)
microbenchmark(
  distance(x, method = "euclidean", test.na = FALSE),
  dist(x, method = "euclidean"),
  euclidean(x[1, ], x[2, ], FALSE)
)
```

```
Unit: microseconds
                                               expr    min      lq     mean  median      uq    max neval
 distance(x, method = "euclidean", test.na = FALSE) 26.518 28.3495 29.73174 29.2210 30.1025 62.096   100
                      dist(x, method = "euclidean") 11.073 12.9375 14.65223 14.3340 15.1710 65.130   100
                   euclidean(x[1, ], x[2, ], FALSE)  4.329  4.9605  5.72378  5.4815  6.1240 22.510   100
```
As you can see, although the `distance()` function is quite fast, its internal checks make it about 2x slower than the base `dist()` function (for this `euclidean` example). Nevertheless, in case you need a faster version of the corresponding distance measure, you can type `philentropy::` followed by `TAB` to select one of the low-level distance computation functions (written in C++), e.g. `philentropy::euclidean()`, which is almost 3x faster than the base `dist()` function.
The advantage of `distance()` is that it implements 46 distance measures based on these C++ functions, each of which can also be accessed individually by typing `philentropy::` and then `TAB`. In future versions of `philentropy` I will optimize the `distance()` function so that the internal checks for data type correctness and correct input data will take less run time than the base `dist()` function.
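For example, the low-level function can be called directly on two probability vectors, skipping the input checks that `distance()` performs. A minimal sketch, assuming (as in the benchmark above) that the third argument of `philentropy::euclidean()` toggles the check for missing values:

```r
library(philentropy)

# two probability vectors, as defined above
P <- 1:10/sum(1:10)
Q <- 20:29/sum(20:29)

# call the C++ backed distance function directly;
# FALSE disables the NA check (matching the benchmark call above)
euclidean(P, Q, FALSE)
```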
## Detailed assessment of individual similarity and distance metrics
The vast number of available similarity metrics raises the immediate question of which metric should be used for which application. Here, I will review the origin of each individual metric and discuss the most recent literature that aims to compare these measures. I hope that users will find valuable insights and might be stimulated to conduct their own comparative research, since this is a field of ongoing research.
### $L_p$ Minkowski Family
#### Euclidean distance

The [euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) (named after [Euclid](https://en.wikipedia.org/wiki/Euclid)) is the straight-line distance between two points. Euclid argued that the __shortest__ distance between two points is always a straight line.

> $d = \sqrt{\sum_{i = 1}^N | P_i - Q_i |^2}$
#### Manhattan distance

> $d = \sum_{i = 1}^N | P_i - Q_i |$
#### Minkowski distance

> $d = \left( \sum_{i = 1}^N | P_i - Q_i |^p \right)^{1/p}$
#### Chebyshev distance

> $d = \max_{i} | P_i - Q_i |$
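Note that the Minkowski distance with `p = 1` reduces to the Manhattan distance and with `p = 2` to the Euclidean distance, while the Chebyshev distance is the limit as `p` approaches infinity. A minimal sketch computing all four family members with `distance()`, assuming that the `p` argument sets the Minkowski power:

```r
library(philentropy)

# two probability vectors combined as rows, as in the examples above
P <- 1:10/sum(1:10)
Q <- 20:29/sum(20:29)
x <- rbind(P, Q)

# the four L_p Minkowski family members
distance(x, method = "euclidean")         # p = 2
distance(x, method = "manhattan")         # p = 1
distance(x, method = "minkowski", p = 4)  # general p (here p = 4; assumed argument)
distance(x, method = "chebyshev")         # limit p -> infinity
```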