Merge pull request #17 from elbamos/0.1.7
0.1.7 - See News.md for details
elbamos committed Aug 11, 2016
2 parents 654da27 + 98457ae commit ee71208
Showing 37 changed files with 1,050 additions and 1,132 deletions.
16 changes: 11 additions & 5 deletions .Rbuildignore
@@ -11,10 +11,10 @@
^wikiwWordVis\.Rda
^wij
^visscratc\.R
^wikiwordcoords\.Rda
^ng20wij\.Rda
^time\.Rda
^wordcoords\.Rda
^wikiwordcoords\.Rda$
^ng20wij\.Rda$
^time\.Rda$
^wordcoords\.Rda$
^vignettedata/$
^vignettedata/.$
^stm\.Rda
@@ -58,4 +58,10 @@
^poliblog/.Rda
^log4j/.spark/.log
^mnist$
^derby/.log
^derby/.log
^Examples/.Rmd$
^test/.R$
^vignettedatamnistcoords/.Rda$
^vignettedatangcoords/.Rda$
^wordcoordsweightingbyp/.Rda$
^words/.png$
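The first hunk above appends `$` to several patterns. Entries in `.Rbuildignore` are regular expressions matched against file paths, so the trailing anchor narrows what is excluded. The following is a minimal sketch using base R's `grepl()` to show the difference; the file names are made up for illustration, and the `perl = TRUE, ignore.case = TRUE` settings are an approximation of how the build tooling applies the patterns.

```r
# Hypothetical file names, used only to illustrate the two pattern styles.
files <- c("time.Rda", "time.Rda.bak", "timeXRda")

# Without the trailing anchor, anything that merely starts with "time.Rda" matches.
unanchored <- grepl("^time\\.Rda", files, perl = TRUE, ignore.case = TRUE)

# With "$", only the exact file name matches.
anchored <- grepl("^time\\.Rda$", files, perl = TRUE, ignore.case = TRUE)

# An unescaped "." matches any character, so it also accepts "timeXRda".
unescaped <- grepl("^time.Rda$", files, perl = TRUE, ignore.case = TRUE)

print(data.frame(files, unanchored, anchored, unescaped))
```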
2 changes: 1 addition & 1 deletion .travis.yml
@@ -53,4 +53,4 @@ before_install: |
after_success:
- Rscript -e 'covr::codecov(branch="reference")'
- Rscript -e 'covr::codecov()'
5 changes: 3 additions & 2 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: largeVis
Type: Package
Title: High-Quality Visualizations of Large, High-Dimensional Datasets
Version: 0.1.6
Version: 0.1.7
Author: Amos B. Elberg
Maintainer: Amos Elberg <amos.elberg@gmail.com>
Description: Implements the largeVis algorithm for visualizing very large high-dimensional datasets. Also very fast search for approximate nearest neighbors.
@@ -18,7 +18,8 @@ Imports:
ggplot2 (>= 0.9.2.1),
dbscan
LinkingTo: Rcpp,RcppProgress (>= 0.2.1),RcppArmadillo (>= 0.7.100.3.0),testthat(>= 1.0.2)
Suggests: testthat,
Suggests:
testthat,
covr,
knitr,
rmarkdown,
1 change: 1 addition & 0 deletions NAMESPACE
@@ -13,6 +13,7 @@ export(buildWijMatrix)
export(distance)
export(ggManifoldMap)
export(largeVis)
export(lof)
export(manifoldMap)
export(manifoldMapStretch)
export(neighborsToVectors)
85 changes: 53 additions & 32 deletions NEWS.md
@@ -1,47 +1,68 @@
### largeVis 0.1.7
* Bug fixes
+ Largely reduced the "fuzzies"
* API Improvements
+ Allow the seed to be set for projectKNNs and randomProjectionTreeSearch
+ If a seed is given, multi-threading is disabled during sgd and the annoy phase of the neighbor search. These
phases of the algorithm would otherwise be non-deterministic. Note that the performance impact is substantial.
+ Verbosity now defaults to the R system option
+ The neighbor matrix returned by randomProjectionTreeSearch is now sorted by distance
* Testing
+ Improved testing for cosine similarity
+ Many tests are improved by ability to set seed
* Clustering
+ LOF search now tested and exported.
* Refactorings & Improvements
+ Refactored neighbor search to unify code for sparse and dense neighbors, substantially improving sparse performance
+ Now using managed pointers in many places
### largeVis 0.1.6

* Revisions for CRAN release, including verifying correctness by reproducing paper examples, and timing tests/benchmarks
+ Tested against the paper authors' wiki-doc and wiki-word datasets
+ Tested with up to 2.5m rows, 100m edges (processed in 12 hours).
+ Tested against the paper authors' wiki-doc and wiki-word datasets
+ Tested with up to 2.5m rows, 100m edges (processed in 12 hours).
* Neighbor search:
+ Dense search is much, much faster and more efficient
+ Tree search for cosine distances uses normalized vectors
+ Dense search is much, much faster and more efficient
+ Tree search for cosine distances uses normalized vectors
* projectKNNs
+ Should be 10x faster for small datasets
+ Replaced binary search ( O(n log n) ) with the alias algorithm for weighted sampling ( O(1) )
+ Clips and smooths gradients, per discussion with paper authors
+ Optimized implementation for alpha == 1
+ Removed option for mixing weights into loss function - doesn't make sense if gradients are being clipped.
+ Fixed OpenMP-related bug which caused visualizations to be "fuzzy"
+ Should be 10x faster for small datasets
+ Replaced binary search ( O(n log n) ) with the alias algorithm for weighted sampling ( O(1) )
+ Clips and smooths gradients, per discussion with paper authors
+ Optimized implementation for alpha == 1
+ Removed option for mixing weights into loss function - doesn't make sense if gradients are being clipped.
+ Fixed OpenMP-related bug which caused visualizations to be "fuzzy"
+ Switched to the STL random number generator, allowing the user to set a seed for reproducible results.
* Vignettes:
+ Reuse initialization matrices and neighbors, to make it easier to see the effect of hyperparameters
+ Benchmarks now a separate vignette, more detailed
+ Examples removed from vignettes and moved to readme
+ Added examples of manifold map with color faces using OpenFace vectors
+ Reuse initialization matrices and neighbors, to make it easier to see the effect of hyperparameters
+ Benchmarks now a separate vignette, more detailed
+ Examples removed from vignettes and moved to readme
+ Added examples of manifold map with color faces using OpenFace vectors
* Sigmas, P_ij matrix, w_ij matrix
+ Replaced C++ code entirely with new code based on reference implementation
+ Refactored R code into `buildEdgeMatrix()` and `buildWijMatrix()`, which are simpler.
+ Replaced C++ code entirely with new code based on reference implementation
+ Refactored R code into `buildEdgeMatrix()` and `buildWijMatrix()`, which are simpler.
* Visualization
+ Color manifold maps work
+ Ported Karpathy's function for non-overlapping embeddings (experimental)
+ Removed transparency parameter
+ Added ggManifoldMap function for adding a manifold map to a ggplot2 plot
* vis
+ Whether to return neighbors and sigmas now adjustable parameters, for memory reasons
+ Runs gc() periodically
+ Color manifold maps work
+ Ported Karpathy's function for non-overlapping embeddings (experimental)
+ Removed transparency parameter
+ Added ggManifoldMap function for adding a manifold map to a ggplot2 plot
* largeVis
+ vis function renamed largeVis
+ Whether to return neighbors now an adjustable parameter, for memory reasons
+ No longer return sigmas under any circumstance
+ Runs gc() periodically
* Data
+ Removed most data and extdata that had been included before; this is to reduce size for CRAN submission
+ Removed most data and extdata that had been included before; this is to reduce size for CRAN submission
* Dependencies & Build
+ Many misc changes to simplify dependencies for CRAN
+ Re-added ARMA_64BIT_WORD; otherwise, could exceed the limitation on size of an arma sparse matrix with moderately sized datasets (~ 1 M rows, K = 100)
+ Now depends on R >= 3.0.2, so RcppProgress and RcppArmadillo could be moved from the Depends section of the DESCRIPTION file
+ Will now compile on systems that lack OpenMP (e.g., OS X systems with old versions of xcode).
+ Many misc changes to simplify dependencies for CRAN
+ Re-added ARMA_64BIT_WORD; otherwise, could exceed the limitation on size of an arma sparse matrix with moderately sized datasets (~ 1 M rows, K = 100)
+ Now depends on R >= 3.0.2, so RcppProgress and RcppArmadillo could be moved from the Depends section of the DESCRIPTION file
+ Will now compile on systems that lack OpenMP (e.g., OS X systems with old versions of xcode).
* Correctness and Testing
+ Tests are separated by subject
+ Additional, more extensive tests with greater code coverage
+ Added travis testing against OSX
+ Tests are separated by subject
+ Additional, more extensive tests with greater code coverage
+ Added travis testing against OSX
* Clustering
+ Very preliminary support for dbscan and optics added
+ Very preliminary support for dbscan and optics added, however these functions have not been exported.

### largeVis 0.1.5

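The 0.1.6 notes above mention replacing binary search with the alias algorithm for weighted sampling. As a rough illustration of that technique (not the package's C++ implementation), the sketch below builds alias tables in R using Vose's variant: the tables take O(n) work to build, after which each draw needs one uniform column pick and one coin flip. The function names `build_alias` and `sample_alias` are hypothetical.

```r
# Build alias tables for a vector of non-negative weights (Vose's method).
build_alias <- function(weights) {
  n <- length(weights)
  scaled <- weights / sum(weights) * n  # rescale so the mean weight is 1
  prob <- numeric(n)                    # chance of keeping the column itself
  alias <- seq_len(n)                   # fallback item for each column
  small <- which(scaled < 1)
  large <- which(scaled >= 1)
  while (length(small) > 0 && length(large) > 0) {
    s <- small[1]; small <- small[-1]
    l <- large[1]; large <- large[-1]
    prob[s] <- scaled[s]
    alias[s] <- l
    # The large item donates mass to fill the small item's column.
    scaled[l] <- scaled[l] + scaled[s] - 1
    if (scaled[l] < 1) small <- c(small, l) else large <- c(large, l)
  }
  prob[c(small, large)] <- 1            # leftovers are full columns
  list(prob = prob, alias = alias)
}

# Draw `size` indices: pick a column uniformly, then keep it or take its alias.
sample_alias <- function(tables, size) {
  n <- length(tables$prob)
  column <- sample.int(n, size, replace = TRUE)
  keep <- runif(size) < tables$prob[column]
  ifelse(keep, column, tables$alias[column])
}

weights <- c(1, 2, 3, 4)
tables <- build_alias(weights)

# Probability of each item implied by the tables; should equal weights / sum(weights).
implied <- tables$prob
for (j in seq_along(tables$prob)) {
  a <- tables$alias[j]
  implied[a] <- implied[a] + (1 - tables$prob[j])
}
implied <- implied / length(weights)

set.seed(1)
draws <- sample_alias(tables, 10000)
```

Setting the seed before sampling makes the draws repeatable, which is the same reason a user-settable seed helps testing.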
16 changes: 8 additions & 8 deletions R/RcppExports.R
@@ -29,8 +29,8 @@ silhouetteDbscan <- function(edges, sil) {
invisible(.Call('largeVis_silhouetteDbscan', PACKAGE = 'largeVis', edges, sil))
}

searchTrees <- function(threshold, n_trees, K, maxIter, data, distMethod, verbose) {
.Call('largeVis_searchTrees', PACKAGE = 'largeVis', threshold, n_trees, K, maxIter, data, distMethod, verbose)
searchTrees <- function(threshold, n_trees, K, maxIter, data, distMethod, seed, verbose) {
.Call('largeVis_searchTrees', PACKAGE = 'largeVis', threshold, n_trees, K, maxIter, data, distMethod, seed, verbose)
}

fastDistance <- function(is, js, data, distMethod, verbose) {
@@ -49,15 +49,15 @@ referenceWij <- function(i, j, d, perplexity) {
.Call('largeVis_referenceWij', PACKAGE = 'largeVis', i, j, d, perplexity)
}

sgd <- function(coords, targets_i, sources_j, ps, weights, gamma, rho, n_samples, M, alpha, verbose) {
.Call('largeVis_sgd', PACKAGE = 'largeVis', coords, targets_i, sources_j, ps, weights, gamma, rho, n_samples, M, alpha, verbose)
sgd <- function(coords, targets_i, sources_j, ps, weights, gamma, rho, n_samples, M, alpha, seed, verbose) {
.Call('largeVis_sgd', PACKAGE = 'largeVis', coords, targets_i, sources_j, ps, weights, gamma, rho, n_samples, M, alpha, seed, verbose)
}

searchTreesCSparse <- function(threshold, n_trees, K, maxIter, i, p, x, distMethod, verbose) {
.Call('largeVis_searchTreesCSparse', PACKAGE = 'largeVis', threshold, n_trees, K, maxIter, i, p, x, distMethod, verbose)
searchTreesCSparse <- function(threshold, n_trees, K, maxIter, i, p, x, distMethod, seed, verbose) {
.Call('largeVis_searchTreesCSparse', PACKAGE = 'largeVis', threshold, n_trees, K, maxIter, i, p, x, distMethod, seed, verbose)
}

searchTreesTSparse <- function(threshold, n_trees, K, maxIter, i, j, x, distMethod, verbose) {
.Call('largeVis_searchTreesTSparse', PACKAGE = 'largeVis', threshold, n_trees, K, maxIter, i, j, x, distMethod, verbose)
searchTreesTSparse <- function(threshold, n_trees, K, maxIter, i, j, x, distMethod, seed, verbose) {
.Call('largeVis_searchTreesTSparse', PACKAGE = 'largeVis', threshold, n_trees, K, maxIter, i, j, x, distMethod, seed, verbose)
}

2 changes: 1 addition & 1 deletion R/buildEdgeMatrix.R
@@ -12,7 +12,7 @@
buildEdgeMatrix <- function(data,
neighbors,
distance_method = "Euclidean",
verbose = TRUE) {
verbose = options("verbose")) {
indices <- neighborsToVectors(neighbors)
distances <- distance(indices$i, indices$j, x = data, distance_method, verbose)
mat <- sparseMatrix(
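This hunk sets the default with `options("verbose")`, while the other files in this commit use `getOption("verbose", TRUE)`. The two base R calls are not interchangeable, as the short sketch below shows: `options()` returns a named list, whereas `getOption()` returns the bare value. The option name `largeVis.sketch.unset` is made up for illustration.

```r
old <- options(verbose = TRUE)          # remember the previous setting

as_list  <- options("verbose")          # named list of length one
as_value <- getOption("verbose", TRUE)  # bare logical value

# The second argument of getOption() is used only when the option is unset.
fallback <- getOption("largeVis.sketch.unset", TRUE)

str(as_list)
str(as_value)

options(old)                            # restore the previous setting
```

In a default R session the `verbose` option is `FALSE`, so the `TRUE` fallback applies only if the option has been removed.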
47 changes: 36 additions & 11 deletions R/dbscan.R
@@ -35,7 +35,7 @@ optics <- function(data = NULL,
minPts = nrow(data) + 1,
eps_cl,
xi,
verbose = TRUE) {
verbose = getOption("verbose", TRUE)) {
if (! is.null(edges) && is.null(data))
ret <- optics_e(edges = edges,
eps = as.double(eps), minPts = as.integer(minPts),
@@ -96,7 +96,7 @@ dbscan <- function(data = NULL,
eps,
minPts = nrow(data) + 1,
partition = !missing(edges),
verbose = TRUE) {
verbose = getOption("verbose", TRUE)) {

if (! is.null(edges) && is.null(data))
ret <- dbscan_e(edges = edges,
@@ -150,8 +150,28 @@ edgeMatrixToKNNS <- function(edges) {
list(dist = t(dist), id = t(id), k = k)
}

# The source code for function lof is based on code that bore this license:
#######################################################################
# dbscan - Density Based Clustering of Applications with Noise
# and Related Algorithms
# Copyright (C) 2015 Michael Hahsler

#' Local Outlier Factor Score
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.


#' @title Local Outlier Factor Score
#'
#' @description Calculate the Local Outlier Factor (LOF) score for each data point given knowledge
#' of k-Nearest Neighbors.
@@ -161,21 +181,26 @@ edgeMatrixToKNNS <- function(edges) {
#' @references Based on code in the \code{\link[dbscan]{dbscan}} package.
#'
#' @return A vector of LOF values for each data point.
#' @export
lof <- function(edges) {
kNNlist <- edgeMatrixToKNNS(edges)
N <- nrow(kNNlist$id)
K <- kNNlist$k

# lrd <- rep(0, N)
lrd <- rep(0, N)
for(i in 1:N) {
input <- kNNlist$dist[c(i, kNNlist$id[i, ]) ,]
lrd[i] <- 1 / (sum(apply(input, MARGIN = 1, max)) / K)
}
# for(i in 1:N) {
# input <- kNNlist$dist[c(i, kNNlist$id[i, ]) ,]
# lrd[i] <- 1 / (sum(apply(input, MARGIN = 1, max)) / K)
# }
for(i in 1:N) lrd[i] <- 1/(sum(apply(
cbind(kNNlist$dist[kNNlist$id[i,], K], kNNlist$dist[i,]),
1, max)) / K)

lof <- rep(0, N)
for (i in 1:N) lof[i] <- sum(lrd[kNNlist$id[i,]])/K / lrd[i]
ret <- rep(0, N)
for (i in 1:N) ret[i] <- sum(lrd[kNNlist$id[i,]])/K / lrd[i]

lof[is.nan(lof)] <- NA
ret[is.nan(ret)] <- NA

lof
ret
}
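To see what the rewritten loop computes, the sketch below applies the same formulas to a small brute-force k-nearest-neighbor structure. It is an illustration under assumptions, not the package's code: `knn_from_points` and `lof_sketch` are hypothetical helpers, and the input is assumed to be N x K matrices `id` and `dist` with each row sorted by increasing distance, so that column K holds the k-distance.

```r
# Brute-force kNN for a small point set (requires K >= 2 for this sketch).
knn_from_points <- function(x, K) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf                        # a point is not its own neighbor
  id <- t(apply(d, 1, function(row) order(row)[seq_len(K)]))
  dst <- t(apply(d, 1, function(row) sort(row)[seq_len(K)]))
  list(id = id, dist = dst, k = K)
}

lof_sketch <- function(knn) {
  N <- nrow(knn$id)
  K <- knn$k
  lrd <- numeric(N)
  for (i in seq_len(N)) {
    neighbors <- knn$id[i, ]
    # Reachability distance: the larger of the neighbor's k-distance
    # and the distance from i to that neighbor.
    reach <- pmax(knn$dist[neighbors, K], knn$dist[i, ])
    lrd[i] <- 1 / (sum(reach) / K)      # local reachability density
  }
  ret <- numeric(N)
  for (i in seq_len(N)) ret[i] <- sum(lrd[knn$id[i, ]]) / K / lrd[i]
  ret[is.nan(ret)] <- NA
  ret
}

# A 5 x 5 grid of points plus one point far from the grid.
grid <- as.matrix(expand.grid(x = 1:5, y = 1:5))
pts <- rbind(grid, c(10, 10))
scores <- lof_sketch(knn_from_points(pts, K = 3))
```

Points inside the grid have a density similar to their neighbors', so their scores stay near 1, while the isolated point at row 26 receives the largest score.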
6 changes: 3 additions & 3 deletions R/distance.R
@@ -25,7 +25,7 @@ distance.matrix <- function(x,
i,
j,
distance_method = "Euclidean",
verbose = TRUE) {
verbose = getOption("verbose", TRUE)) {
return (fastDistance(i,
j,
x,
@@ -39,7 +39,7 @@ distance.CsparseMatrix <- function(x,
i,
j,
distance_method = "Euclidean",
verbose = TRUE) {
verbose = getOption("verbose", TRUE)) {
return(fastCDistance(i,
j,
x@i,
@@ -56,7 +56,7 @@ distance.TsparseMatrix <- function(
i,
j,
distance_method="Euclidean",
verbose=TRUE) {
verbose = getOption("verbose", TRUE)) {
return(fastSDistance(i,
j,
x@i,
26 changes: 3 additions & 23 deletions R/largeVis.R
@@ -10,13 +10,7 @@
#' @param max_iter See \code{\link{randomProjectionTreeSearch}}.
#' @param distance_method One of "Euclidean" or "Cosine." See \code{\link{randomProjectionTreeSearch}}.
#' @param perplexity See \code{\link{buildWijMatrix}}.
#' @param sgd_batches See \code{\link{projectKNNs}}.
#' @param M See \code{\link{projectKNNs}}.
#' @param alpha See \code{\link{projectKNNs}}.
#' @param gamma See \code{\link{projectKNNs}}.
#' @param rho See \code{\link{projectKNNs}}.
#' @param save_neighbors Whether to include in the output the adjacency matrix of nearest neighbors.
#' @param coords A [N,K] matrix of coordinates to use as a starting point -- useful for refining an embedding in stages.
#' @param verbose Verbosity
#' @param ... Additional arguments passed to \code{\link{projectKNNs}}.
#'
@@ -53,26 +47,18 @@
#'
largeVis <- function(x,
dim = 2,
K = 40,
K = 50,

n_trees = 50,
tree_threshold = max(10, ncol(x)),
max_iter = 1,
distance_method = "Euclidean",

perplexity = 50,

sgd_batches = NULL,
M = 5,
alpha = 1,
gamma = 7,
rho = 1,

coords = NULL,
perplexity = max(50, K / 3),

save_neighbors = TRUE,

verbose = TRUE,
verbose = getOption("verbose", TRUE),
...) {

#############################################
@@ -109,13 +95,7 @@ largeVis <- function(x,
#######################################################
coords <- projectKNNs(wij = wij,
dim = dim,
sgd_batches = sgd_batches,
M = M,
gamma = gamma,
verbose = verbose,
alpha = alpha,
coords = coords,
rho = rho,
...)

#######################################################

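The R/largeVis.R hunks make two changes that interact: the `perplexity` default now depends on `K`, and the SGD tuning arguments are no longer named in the signature but travel through `...`. The sketch below imitates that pattern with stand-in functions. `outer_sketch` and `inner_sketch` are hypothetical, and the inner defaults are assumptions chosen only for the example.

```r
# Hypothetical stand-in for an inner routine that receives the forwarded arguments.
inner_sketch <- function(wij, dim = 2, M = 5, gamma = 7) {
  list(dim = dim, M = M, gamma = gamma)
}

outer_sketch <- function(x, dim = 2, K = 50, perplexity = max(50, K / 3), ...) {
  # Default arguments are evaluated lazily, so `perplexity` sees the K
  # supplied by the caller rather than the default K.
  list(perplexity = perplexity,
       inner = inner_sketch(wij = x, dim = dim, ...))
}

default_call <- outer_sketch(x = NULL)
wide_call <- outer_sketch(x = NULL, K = 300, gamma = 3)
```

With this design, arguments such as `gamma` keep whatever defaults the inner function defines unless the caller overrides them, so the outer signature does not need to duplicate them.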