ANN search performance with multiple segments

To support ANN search, Elasticsearch indexes the vector values of each segment as an HNSW graph. Usually an index has multiple segments, which means ANN must check several HNSW graphs as it searches one segment after another. With fewer segments, the ANN search is more efficient, since we check a smaller number of graphs. Another way to describe it: searching an HNSW graph is sub-linear in the number of vectors, so it's better to search larger graphs vs. several smaller ones.

It's certainly possible to force merge the index to one segment, but force merge can take substantial time and has some downsides (explained [here](https://www.elastic.co/guide/en/elasticsearch/reference/8.4/indices-forcemerge.html)).

This meta issue explores ideas to improve performance that don't require force merge. These are early ideas and may change.
* Improvements to Elasticsearch's flush logic
  * Fix the surprising behavior in Elasticsearch flush where we sometimes create tiny segments ([#90776](#90776))
  * Enhance `IndexMemoryController` to no longer flush all pending segments when hitting the `indices.memory.index_buffer_size` threshold, and instead only flush the largest pending segment ([#34553](#34553))
  * When uploading a large set of vectors in bulk, the segment size is usually limited by `index.translog.flush_threshold_size` (512MB by default). Are there ways to avoid bumping up against this limit?
* Enable merge-on-refresh to avoid publishing tiny segments in point-in-time views of shards.
* Consider a dedicated merge policy optimized for vector search that encourages larger segments.
  * Decrease `index.merge.policy.segments_per_tier` to encourage the merge policy to merge segments more aggressively.
  * Increase `index.merge.policy.max_merged_segment` to not stop merging segments at 5GB.
  * Increase `index.merge.policy.floor_segment` to further bias the merge policy against small segments.
* Improve search speed in presence of multiple segments.
  * Maybe we don't need to gather the full `num_candidates` from each segment?
* Perhaps we could have a special "bulk vector upload" operation that does not allow concurrent searches, and makes sure to create a single graph by tweaking settings under-the-hood.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ANN search performance with multiple segments #90700

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ANN search performance with multiple segments #90700

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions