lsh-search-go

Here is my take on implementing the locality-sensitive hashing algorithm from scratch.
The simplest possible search algorithm would be an exact NN search with linear space and time complexity, but in practice we want to search in much less than O(N) time.
So there exists a set of algorithms called "approximate nearest neighbors" (aka "ANN"; there is a cool presentation by one of the key researchers in that field), which can be divided into two subsets.

I've decided to go with the LSH algorithm first, since:

  1. It seems pretty simple to understand and implement.
  2. Generated hashes can easily be stored in any database.
    The largest downside I see here is that LSH needs a lot of memory to store its index.

So this repo contains a library that can build an LSH index and run searches for a given query vector.
It's worth mentioning that I heavily relied on annoy and ann-benchmarks.

Locality-sensitive hashing: a short reference

LSH implies partitioning the space with random hyperplanes and searching across the "buckets" formed by the intersections of those planes. We can expect that nearby vectors have a higher probability of ending up in the same "bucket".
My implementation is closer to the earlier versions of Annoy than to "classic" LSH: during index construction I pick two random points, calculate the plane that lies halfway between them, and then repeat this process recursively for the points that lie on each side of the newly generated plane.
To maximize the number of detected nearest neighbors during the search, it's usually enough to run ~10-100 plane generations (the NTrees parameter).
So during the "training" stage we end up with many trees holding the plane coefficients. As a final step, we generate a hash for each vector in the training set by traversing the built trees, and keep those hashes in some storage.
For each query vector we also generate a set of hashes (one per "tree"), based on which side of each plane the query point lies on; all the candidates from the matching buckets then go into a min-heap, from which we finally take the k closest points to the query.
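
A minimal sketch of the idea described above, in the spirit of the implementation (this is not the library's actual code; the types and function names are illustrative only):

// Illustrative sketch: pick two random points, place a plane halfway between
// them, split the points by plane side, recurse; the path through one tree
// becomes a vector's hash. Degenerate cases (duplicate points, very deep
// trees) are not handled here.
package lshsketch

import "math/rand"

type node struct {
    normal []float64 // plane normal: the difference between the two sampled points
    point  []float64 // a point on the plane: the midpoint of the two sampled points
    left   *node     // subtree for points on the negative side of the plane
    right  *node     // subtree for points on the positive side of the plane
}

// side reports whether v lies on the positive side of the node's plane.
func side(n *node, v []float64) bool {
    dot := 0.0
    for i := range n.normal {
        dot += n.normal[i] * (v[i] - n.point[i])
    }
    return dot >= 0
}

// build recursively partitions vecs until a "bucket" holds fewer than minVecs
// points (the role KMinVecs plays in the config).
func build(vecs [][]float64, minVecs int) *node {
    if len(vecs) < minVecs {
        return nil // leaf: these points form one bucket
    }
    a := vecs[rand.Intn(len(vecs))]
    b := vecs[rand.Intn(len(vecs))]
    n := &node{normal: make([]float64, len(a)), point: make([]float64, len(a))}
    for i := range a {
        n.normal[i] = b[i] - a[i]      // plane is orthogonal to the segment a-b...
        n.point[i] = (a[i] + b[i]) / 2 // ...and passes through its midpoint
    }
    var left, right [][]float64
    for _, v := range vecs {
        if side(n, v) {
            right = append(right, v)
        } else {
            left = append(left, v)
        }
    }
    n.left = build(left, minVecs)
    n.right = build(right, minVecs)
    return n
}

// hash traverses one tree, recording the side taken at each level as a bit;
// the resulting bit string identifies the "bucket" for that tree.
func hash(root *node, v []float64) uint64 {
    var h uint64
    for n := root; n != nil; {
        h <<= 1
        if side(n, v) {
            h |= 1
            n = n.right
        } else {
            n = n.left
        }
    }
    return h
}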

Here is a super-simple visual example of the random space partitioning:

Here are some simple "rules" for tuning the algorithm that I used (see the config sketch right after this list):

  • the more "trees" you create --> the more space you use and the longer index construction takes, but the more accurate the model can become (search time also grows, though only slightly);
  • decreasing the minimum number of points in a "bucket" can make search faster, but also less accurate (potentially more false-negative errors);
  • the larger the distance threshold --> the more "candidate" points you get during the search phase, so you can satisfy the "max. nearest neighbors" condition faster, but potentially at the cost of accuracy.
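
As a rough illustration of these tradeoffs, here are two hypothetical configurations built from the HasherConfig fields shown in the API section below (the numbers are made up for illustration, not recommendations):

// Purely hypothetical starting points, only meant to illustrate the "rules" above.
accurate := lsh.HasherConfig{
    NTrees:   50,   // more trees: larger index and longer training, better recall
    KMinVecs: 1000, // larger buckets: slower search, fewer missed neighbors
    Dims:     784,
}
fast := lsh.HasherConfig{
    NTrees:   10,  // fewer trees: smaller index, faster training
    KMinVecs: 200, // smaller buckets: faster search, more false negatives possible
    Dims:     784,
}
_, _ = accurate, fast // plug one of them into lsh.Config{HasherConfig: ...}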

API

The storage and hashing parts are decoupled from each other.
You need to implement only two interfaces to make everything work:

  1. store, in order to use any storage you prefer.
  2. metric, to use your custom distance metric (see the sketch right after this list).
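
The exact method sets of those interfaces live in the package sources and are not reproduced in this README. The snippet below is only a hypothetical sketch of what a custom cosine-distance metric could look like; the GetDist method name is an assumption, so check the lsh package for the real interface:

import "math"

// CosineDistance is a hypothetical custom metric; the real metric interface
// (including its method name) is defined in the lsh package.
type CosineDistance struct{}

func (CosineDistance) GetDist(a, b []float64) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    if na == 0 || nb == 0 {
        return 1.0 // treat zero vectors as maximally distant
    }
    return 1.0 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}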

LSH index object has a simple interface:

  • NewLsh(config lsh.Config) (*LSHIndex, error) creates a new index instance from the given config (the store and metric implementations are passed in alongside it, as in the usage example below);
  • Train(records [][]float64, ids []string) error fills the search index with vectors and their ids;
  • Search(query []float64, maxNN int, distanceThrsh float64) ([]lsh.Record, error) finds up to maxNN nearest neighbors of the query vector within the given distance threshold.

Here is a usage example:

...
import (
    "log"
    lsh "github.com/gasparian/lsh-search-go/lsh"
    "github.com/gasparian/lsh-search-go/store/kv"
)

// Create train dataset as a pair of unique id and vector
var trainVecs [][]float64 = ...
var trainIds []string = ...
sampleSize := 10000
var queryPoint []float64 = ...

const (
    distanceThrsh = 2200 // Distance threshold in the non-normalized space
    maxNN         = 10   // Maximum number of nearest neighbors to find
)

// Define search parameters
lshConfig := lsh.Config{
    IndexConfig: lsh.IndexConfig{
        BatchSize:     250,  // How many points to process in a single goroutine
                             // during the training phase
        MaxCandidates: 5000, // Maximum number of points that will be stored
                             // in a min heap, where we then get MaxNN vectors
    },
    HasherConfig: lsh.HasherConfig{
        NTrees:   10,        // Number of plane trees (plane permutations) to generate
        KMinVecs: 500,       // Minimum number of points to stop growing planes tree
        Dims:     784,       // Space dimensionality
    },
}
// Store implementation, you can use yours
s := kv.NewKVStore()
// Metric implementation, L2 is good for the current dataset
metric := lsh.NewL2()
lshIndex, err := lsh.NewLsh(lshConfig, s, metric)
if err != nil {
    log.Fatal(err)
}

// Create search index; it will take a significant amount of time
if err := lshIndex.Train(trainVecs, trainIds); err != nil {
    log.Fatal(err)
}

// Perform search
closest, err := lshIndex.Search(queryPoint, maxNN, distanceThrsh)
if err != nil {
    log.Fatal(err)
}
// Example of closest neighbors for 2D:
/*
[
    {096389f9-8d59-4799-a479-d8ec6d9de435 [0.07666666666666666 -0.003333333333333327]}
    {703eed19-cacc-47cf-8cf3-797b2576441f [0.06666666666666667 0.006666666666666682]}
    {1a447147-6576-41ef-8c2e-20fab35a9fc6 [0.05666666666666666 0.016666666666666677]}
    {b5c64ce0-0e32-4fa6-9180-1d04fdc553d1 [0.06666666666666667 -0.013333333333333322]}
]
*/

Testing

To run the regular unit tests, first install the Go deps:

make install-go-deps

And then run the tests for the lsh and storage packages:

make test

If you want to run the benchmarks, where LSH is compared to regular NN search, first install HDF5 to be able to open the benchmark datasets:

make install-hdf5 && make download-annbench-data

And then run go test, passing the needed test name:

make annbench test=TestEuclideanFashionMnist

The search parameters that you can find here have been selected empirically, based on the precision and recall metrics measured on the validation datasets.

Results

I used a machine with 16 cores and 60 GB of RAM for the tests, together with the in-memory store implementation (kv.KVStore).
The following datasets have been used:

Dataset         N dimensions   Train examples   Test examples   Metric
Fashion MNIST   784            60000            10000           Euclidean
SIFT            128            1000000          10000           Euclidean
NY times        256            290000           10000           Cosine
GloVe           200            1183514          10000           Cosine

I ended up using the 10 closest nearest neighbors to calculate the metrics.
Both precision and recall have been calculated using the distance-based definition of these metrics, as in the ANN-Benchmarks paper: a returned point counts as a true neighbor if its distance to the query is within (1+ε) times the distance to the exact k-th nearest neighbor.

In all experiments I set ε=0.05.
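
A minimal sketch of how that distance-based recall can be computed, assuming the exact k-th nearest neighbor distance is already known for each query (this mirrors the ANN-Benchmarks definition and is not code from this repo):

// approxRecall counts a returned neighbor as correct when its distance to the
// query is within (1+eps) times the exact k-th nearest neighbor distance.
func approxRecall(returnedDists []float64, exactKthDist, eps float64, k int) float64 {
    hits := 0
    for _, d := range returnedDists {
        if d <= (1.0+eps)*exactKthDist {
            hits++
        }
    }
    return float64(hits) / float64(k)
}
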
I picked the parameters manually, aiming for the best tradeoff between speed and accuracy.
They could be fine-tuned further, but playing with the parameters takes a lot of time.

Fashion MNIST:

Approach                  Training time, s   Avg. search time, ms   Precision   Recall
Exact nearest neighbors   0.36               517                    1.0         1.0
LSH                       8.87               15                     0.95        0.95

SIFT:

Approach                  Training time, s   Avg. search time, ms   Precision   Recall
Exact nearest neighbors   6.48               5015                   1.0         1.0
LSH                       480                69                     0.940       0.935

So it seems to work well on both datasets, giving a 30-70x speedup with only slightly lower metric values.

Known problems

For the cosine-distance datasets the results are worse: only a slight speedup with a large drop in metrics. Also, for both of these datasets I generated far more trees (>100) than for the previous two datasets.
Unfortunately, I can't say yet exactly why this happens; still working on that...
NY times:

Approach                  Training time, s   Avg. search time, ms   Precision   Recall
Exact nearest neighbors   1.8                1053                   0.985       0.985
LSH                       700                268                    0.868       0.868

GloVe:

Approach                  Training time, s   Avg. search time, ms   Precision   Recall
Exact nearest neighbors   7.44               3901                   1.0         1.0
LSH                       ????               ????                   ????        ????