in-memory data structure for approximate nearest neighbor search in high-dimensional spaces
The HNSW algorithm is an efficient and scalable method for approximate nearest neighbor search in high-dimensional spaces. This Golang library provides a generic implementation of the HNSW algorithm, allowing users to build, search, and manage HNSW graphs for their data. See the original publication: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.
- Fast and scalable nearest neighbor search.
- Efficient indexing and querying of high-dimensional data.
- Thread-safe and concurrent I/O.
- Customizable parameters for tuning performance.
- Easy-to-use API for building and querying HNSW graphs.
- Generics for building and querying domain objects.
- Command-line utility for index analysis.
The latest version of the module is available at main
branch. All development, including new features and bug fixes, take place on the main
branch using forking and pull requests as described in contribution guidelines. The stable version is available via Golang modules.
Use go get
to retrieve the library and add it as dependency to your application.
go get -u github.com/fogfish/hnsw
Here is a simple example to get you started with this HNSW library:
package main
import (
"fmt"
"github.com/fogfish/hnsw"
"github.com/fogfish/hnsw/vector"
surface "github.com/kshard/vector"
)
func main() {
// Create new HNSW index
index := hnsw.New(vector.SurfaceVF32(surface.Cosine()))
// Add some data points to the index
index.Insert(vector.VF32{Key: 1, Vec: []float32{0.1, 0.2, /* ... */ 0.128}})
index.Insert(vector.VF32{Key: 2, Vec: []float32{0.4, 0.5, /* ... */ 0.128}})
// Perform a nearest neighbor search
query := vector.VF32{Vec: []float32{0.15, 0.25, /* ... */ 0.128}
neighbors := index.Search(query, 10, 100)
// Print the results
fmt.Println("Nearest neighbors:", neighbors)
}
The full documentation is available on GoDoc.
Data types definition is an essential part of development with the library. Golang structs declares domain of your application.
The HNSW data structure utilizes Golang generics, allowing library clients to define their own types. This design decision, while potentially complicating initial adoption, makes the structure highly versatile and usable in any context. The library assumes nothing about the type it stores; instead, the client is only required to define the distance function against own types as
type Surface[Vector any] interface {
Distance(Vector, Vector) float32
}
To assist with implementation, the library provides a reference implementation at github.com/fogfish/hnsw/vector
for float32 vector traits annotated with unique IDs (see vector.VF32
and vector.KF32
). Additionally, the companion library vector helps implement efficient vector algebra.
Here is a example to implement HNSW index for data type with custom attributes:
import (
"github.com/kshard/vector"
)
// Custom type
type Embedding struct {
Vec surface.F32
Text string
/* ... add more attributes ... */
}
// Declare surface distance function for Embedding.
var (
surface = vector.ContraMap[vector.F32, Embedding]{
Surface: vector.Cosine(),
ContraMap: func(e Embedding) vector.F32 { return e.Vec },
}
)
// Create index for Embedding type
index := hnsw.New(surface)
To create a new HNSW index, you need to specify the configuration parameters such as the distance function and the construction parameter:
hnsw.New(vector.SurfaceVF32(surface.Cosine()),
// Construction Efficiency Factor (efConstruction)
hnsw.WithEfConstruction(200),
// Maximum number of connections per node (M)
hnsw.WithM(16),
// Maximum number of connections per node at level 0 (M0)
hnsw.WithM0(32),
)
Optimization of given algorithm parameters is key to achieving the best performance for your specific use case. The primary parameters to focus on includes M
, M
and efConstruction
. M
and M0
controls the maximum number of connections per node, balancing between memory usage and search efficiency. efConstruction
determines the number of candidate nodes evaluated during graph construction, influencing both the construction time and the accuracy of the graph. Additionally, the mL
parameter, or maximum level, dictates the hierarchical structure's depth, affecting both connectivity and search efficiency. You can optimize the balance between accuracy, speed, and resource usage, tailoring the HNSW algorithm to the demands of your dataset and application. The command line utility supports the hyper optimization process.
Be aware that managing disconnected nodes is crucial in Hierarchical Navigable Small World (HNSW) Graphs to maintain the algorithm's efficiency and robustness. Disconnected nodes can occur when connections are pruned during the insertion of new nodes, leading to isolated nodes that degrade the performance of nearest neighbor searches. To mitigate this issue tune M0
. This approach minimizes the risk of disconnections, ensuring reliable and efficient graph traversal during search operations.
To insert vector to the index, use the Insert
method passing appropriate data type:
index.Insert(vector.VF32{Key: 1, Vec: []float32{0.1, 0.2, /* ... */ 0.128}})
The HNSW library supports batch insert operations, making it efficient to add large datasets. It leverages Golang channels to handle parallel writes, ensuring that multiple data points can be inserted concurrently. This design maximizes performance and minimizes the time required to build the HNSW graph, making the library suitable for handling high-throughput data insertion scenarios.
// create channel for parallel writes
ch := index.Pipe(runtime.NumCPU())
// insert vector to the index
ch <- vector.VF32{Key: 1, Vec: []float32{0.1, 0.2, /* ... */ 0.128}}
Searching for nearest neighbors in the HNSW library is performed using the Search
function. This method requires a query vector parameter, which represents the point in the high-dimensional space for which you want to find the nearest neighbors. You have to wrap the vector to same data type as index support. The efSearch
parameter controls the number of candidate nodes to evaluate during the search process, directly affecting the trade-off between search speed and accuracy. A higher efSearch
value typically results in more accurate results at the expense of increased computation. The k
parameter specifies the number of nearest neighbors to return. By tuning efSearch
and k
, you can balance performance and precision according to your specific needs.
query := vector.VF32{Vec: []float32{0.15, 0.25, /* ... */ 0.128}}
neighbors := index.Search(query, 10, 100)
Use Distance
function to fine-tune neighbors results, filtering non relevant values.
for _, vector := range neighbors {
if index.Distance(query, vector) < 0.2 {
// do something
}
}
The HNSW library includes a breadth-first search functionality through the ForAll
method. This method performs a full scan, iterating over all nodes linked at a specific level of the graph. It takes a visitor function as an argument, defined as func(rank int, vector Vector, vertex []Vector) error
, where rank is the level of the node, vector is the node's vector, and vertex represents all outgoing edges. By performing a full scan, the ForAll
method ensures comprehensive exploration of the graph's nodes, making it useful for applications that require a complete overview of the graph structure at a given level.
// Scan all nodes at level 0
index.ForAll(0,
func(rank int, vector Vector, vertex []Vector) error {
//
return nil
},
)
The following visualization illustrates a Hierarchical Navigable Small World (HNSW) graph constructed from 4,000 vectors representing the top English words. This graph showcases the hierarchical structure and navigability of the small-world network built using these word vectors.
HNSW Layer 2 | |
HNSW Layer 1 | |
HNSW Layer 0 | |
Vector point cloud |
The HNSW library includes a command-line utility designed for hyper-optimizing the parameters of your HNSW graphs. This tool allows you to efficiently explore various configurations, such as M
, M0
, efConstruction
, and others, to find the optimal settings for your specific dataset and use case.
You can install the utility from source code.
go install github.com/fogfish/hnsw/cmd/hnsw@latest
The command line utility is tailored for working with GLoVe, SIFT or your own dataset in the textual (GLoVe format).
The command-line utility support the creation and optimization of HNSW indices, using GloVe-like datasets. You can create
an HNSW index tailored to your dataset, evaluate the index quality by query
the same indexed words, and draw
the graph layers to gain insights into the structure and connectivity of your data. The dataset assumes usage of cosine distance between words.
Obtain GLoVe dataset copy for evaluation from https://nlp.stanford.edu/projects/glove/, or use any textual representation following the format:
word -0.37604 0.24116 ... -0.26098 -0.0079604
hnsw help glove
hnsw glove create \
-i path/to/hnsw/index \
-g path/to/glove/dataset.txt \
-e 200 -m 16 -0 32
hnsw glove query \
-i path/to/hnsw/index \
-g path/to/glove/dataset.txt
hnsw glove draw \
-i path/to/hnsw/index \
--html path/to/graph.html
-l 2
The command-line utility support the creation and validation of HNSW indices, using SIFT datasets that provides vectors dataset together with ground truth matches. You can create
an HNSW index tailored to your dataset, evaluate the index quality by query
the ground truth data , and draw
the graph layers to gain insights into the structure and connectivity of your data. The dataset assumes usage of euclidean distance between indexed vectors.
Obtain GLoVe dataset copy for evaluation from http://corpus-texmex.irisa.fr
hnsw help sift
hnsw sift create \
-i path/to/hnsw/index \
-g path/to/sift/dataset.txt \
-e 200 -m 16 -0 32
hnsw sift query \
-i path/to/hnsw/index \
-g path/to/sift/dataset.txt
hnsw sift draw \
-i path/to/hnsw/index \
--html path/to/graph.html
-l 2
The library is MIT licensed and accepts contributions via GitHub pull requests:
- Fork it
- Create your feature branch (
git checkout -b my-new-feature
) - Commit your changes (
git commit -am 'Added some feature'
) - Push to the branch (
git push origin my-new-feature
) - Create new Pull Request
The build and testing process requires Go version 1.21 or later.
The commit message helps us to write a good release note, speed-up review process. The message should address two question what changed and why. The project follows the template defined by chapter Contributing to a Project of Git book.
If you experience any issues with the library, please let us know via GitHub issues. We appreciate detailed and accurate reports that help us to identity and replicate the issue.
- Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
- GloVe: Global Vectors for Word Representation
- Datasets for approximate nearest neighbor search
- t-SNE: t-distributed stochastic neighbor embedding
- How to Use t-SNE Effectively
- https://github.com/danaugrs/go-tsne
- https://github.com/dwyl/english-words
- https://echarts.apache.org/examples/en/index.html#chart-type-graph