Try index sorting to reduce number of shards/segments accessed #628

alexklibisz · 2023-12-07T16:43:29Z

Discussed in #375

^{Originally posted by alexklibisz July 17, 2022}
Copying (and slightly modifying) this idea from a comment for better visibility:

In September 2021 I attended Adrien Grand's talk "Speed up your Lucene queries by avoiding searching" at the 2021 Virtual ApacheCon.

Index Sorting was one of the methods that Adrien covered. In short, Index Sorting allows the user to customize how a Lucene index sorts documents stored in its segments. For example, an application might sort documents by date if they are always returned in chronological order. Elasticsearch also has an API for index sorting.

How does Index Sorting apply to vector search? My current hypothesis is that Index Sorting can be used to assign vectors to shards and segments more intelligently than a random assignment. At index time, we use index sorting to influence the placement of vectors into shards in a way that preserves spatial proximity. In other words, each shard is a cluster of spatially proximate vectors. At query time, the vector is still hashed via LSH, but now all or most of the hashes are stored in very few shards. So the query needs to access fewer shards in order to access and re-rank the candidates. Fewer shards means less IO, means faster query.

The initial experiment for Index Sorting involved sorting the vectors by two values:

Distance to from the vector to the origin (vector of all zeros).
Cosine of the angle between the vector and the origin vector.

Sorting the index by these two values led to a ~60% query-time speedup on one of the ten million vector datasets used for the 2021 "big ANN benchmarks" challenge. More testing and optimization is needed, but it looks like a fruitful direction.

The code was on the elastiknn-278-lucene-benchmarks branch, in the files LuceneAlgorithm.scala and LuceneStore.scala. I'll attach a zip of that directory just in case that branch goes missing at some point. elastiknn-elastiknn-278-lucene-benchmarks.tar.gz.

I also had a Gist with directions on running that mess of a branch:
d3e47e30d3c7468f3ee56844e5ee6f7a-ee25c17a896845c0c4fc63f64b44fc896fdd668c.zip

alexklibisz · 2023-12-07T16:45:40Z

I took a pass at implementing this in #614 . It seems to be working based on curling the index/_mapping, index/_settings, and index/_search routes. But it did not make a difference on the Fashion Mnist benchmarks. I think it's possible it would still help on larger benchmarks. I'll re-visit if I decide to work on those at some point.

alexklibisz · 2024-03-28T01:29:16Z

Reopening as I don't think I gave this a fair shake in #614. I updated the indexing code, but I didn't update any search code. It seems like to get the effect of sorting, we need to actually use one of the sorted fields at query-time: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-index-sorting.html#early-terminate

alexklibisz added the performance label Dec 7, 2023

alexklibisz closed this as completed Dec 7, 2023

alexklibisz mentioned this issue Dec 7, 2023

Performance: index sorting #614

Closed

alexklibisz reopened this Mar 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Try index sorting to reduce number of shards/segments accessed #628

Try index sorting to reduce number of shards/segments accessed #628

alexklibisz commented Dec 7, 2023

alexklibisz commented Dec 7, 2023

alexklibisz commented Mar 28, 2024 •

edited

Try index sorting to reduce number of shards/segments accessed #628

Try index sorting to reduce number of shards/segments accessed #628

Comments

alexklibisz commented Dec 7, 2023

Discussed in #375

alexklibisz commented Dec 7, 2023

alexklibisz commented Mar 28, 2024 • edited

alexklibisz commented Mar 28, 2024 •

edited