LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter #2256

mayya-sharipova · 2021-01-27T21:17:58Z

Add an option to supply a custom leaf sorter for IndexWriter.
A DirectoryReader opened from this IndexWriter will have its leaf
readers sorted with the provided leaf sorter. This is useful for
indices on which it is expected to run many queries with particular
sort criteria (e.g. for time-based indices this is usually a
descending sort on timestamp). Providing leafSorter allows
to speed up early termination for this particular type of
sort queries.
Add an option to supply a custom leaf sorter for
StandardDirectoryReader. The leaf readers of this
StandardDirectoryReader will be sorted according the the provided leaf
sorter.

1. Add an option to supply a custom leaf sorter for IndexWriter. A DirectoryReader opened from this IndexWriter will have its leaf readers sorted with the provided leaf sorter. This is useful for indices on which it is expected to run many queries with particular sort criteria (e.g. for time-based indices this is usually a descending sort on timestamp). Providing leafSorter allows to speed up early termination for this particular type of sort queries. 2. Add an option to supply a custom leaf sorter for StandardDirectoryReader. The leaf readers of this StandardDirectoryReader will be sorted according the the provided leaf sorter.

sonatype-lift · 2021-01-27T23:34:52Z

lucene/core/src/java/org/apache/lucene/index/IndexWriter.java

@@ -933,6 +936,31 @@ protected final void ensureOpen() throws AlreadyClosedException {
   *     low-level IO error
   */
  public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
+    this(d, conf, null);


THREAD_SAFETY_VIOLATION: Unprotected write. Non-private method IndexWriter(...) indirectly writes to field this.mergeScheduler.infoStream outside of synchronization.
Reporting because another access to the same memory occurs on a background thread, although this access may not.

mikemccand

This is an exciting change, especially for non-concurrent early termination cases.

I left a couple questions.

Thanks @mayya-sharipova !

mikemccand · 2021-01-29T17:54:04Z

lucene/core/src/java/org/apache/lucene/index/IndexWriter.java

@@ -941,6 +969,11 @@ public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
    // obtain the write.lock. If the user configured a timeout,
    // we wrap with a sleeper and this might take some time.
    writeLock = d.obtainLock(WRITE_LOCK_NAME);
+    if (config.getIndexSort() != null && leafSorter != null) {
+      throw new IllegalArgumentException(
+          "[IndexWriter] can't use index sort and leaf sorter at the same time!");


Hmm, why is that? I thought this change sorts whole segments, and index sorting sorts documents within one segment? Why do they conflict?

@mikemccand Thank you for your review! From the discussion on the Jira ticket, we also wanted to use writer's leafSorter during merging for arranging docs in a merged segment (by putting first docs from a segment with highest sort values according to leafSorter).

This will be in conflict with indexSorter, as if provided it will arrange merged docs according to its sort.

I think it's OK - that MultiSorter is sorting (by index sort) documents in several segments being merged; it will control the order of the documents within the new segment, but shouldn't have any influence on or conflict with the order of the segments being merged

@msokolov Thank you for your feedback and explanation. Sorry, I am still not super clear about this point. It seems to me as indexSorter maps each leaf's documents into the merged segment according to its sort, the same way leafSorter will map each leaf's documents into the merged segment according to its sort (given several merging segments, in the merged segment the first docs should be docs from a segment with highest sort values..)

But I am ok to remove this check in this PR, as this PR is not concerned with merging, and follow up on this point in the following PR.

Addressed in 7ddff67

OK, maybe I misunderstood the intent. Perhaps an example would clarify. Say that we have three segments, A, B , C containing documents A={0, 3, 6}; B={1, 4, 7}; C={2, 5, 8}, where the documents are understood to have a single field with the value shown, and the index sort is ordered in the natural way. Without this change, if we merged A and B, we'd get a new segment A+B={0, 1, 3, 4, 6, 7}. Now suppose there is no index sort (and the documents just "happen" to be in the index in the order given above, for the sake of the example), and we apply a leafSorter that sorts by the minimum value of any document in the segment (I guess it could be any sort of aggregate over the segment?), then we would get A+B={0, 3, 6, 1, 4, 7}. Now if we apply both sorts, we would get the same result as in the first case, right? I'm still unclear how the conflict arises.

Hmm I do see where you said we want to use leafSorter to sort documents. I guess I might challenge that design in favor of building on the indexSort we already have? But perhaps this is better sorted out in the context of a later PR, as you suggested.

@msokolov Thank you for the clarification. Indeed, it is much clear with an example you provided. Looks like we need to think and discuss more about merging scenario, may be in the next PR or Jira ticket.

lucene/core/src/java/org/apache/lucene/index/IndexWriter.java

- Move leafSorter from IndexWriter to IndexWriterConfig - Remove exception on setting both indexSorter and leafSorter

mikemccand

Thanks @mayya-sharipova -- I left a few comments. This is a neat new capability ...

mikemccand · 2021-02-03T18:21:48Z

lucene/core/src/java/org/apache/lucene/index/StandardDirectoryReader.java

@@ -143,12 +158,17 @@ static StandardDirectoryReader open(

      writer.incRefDeleter(segmentInfos);

+      if (writer.getConfig().getLeafSorter() != null) {
+        readers.sort(writer.getConfig().getLeafSorter());


Is IWC.leafSorter allowed to be live-updated? If so, maybe we should pull out local variable here to hold onto the leafSorter and use that variable here and below, to ensure it is the same?

@mikemccand Thank you for your feedback.
The intent was to keep leafSorter constant and non-updatable, as indexWriter.getConfig returns LiveIndexWriterConfig that doesn't allow setting of a new leafSorter. I understand that leafSorter in config can still be changed, and your comment is very valid.

But In 8155f2e I've moved sorting of leaves to the constructor, so hopefully this should resolve this concern.

mikemccand · 2021-02-03T18:24:53Z

lucene/core/src/java/org/apache/lucene/index/StandardDirectoryReader.java

@@ -169,7 +189,10 @@ static StandardDirectoryReader open(
   * @lucene.internal
   */
  public static DirectoryReader open(
-      Directory directory, SegmentInfos infos, List<? extends LeafReader> oldReaders)
+      Directory directory,


Are we handling the near-real-time reader (IndexWriter.getReader(...)) path here too?

It looks like IndexWriter.getReader(...) path is covered by a previous open method on line 123.

mikemccand · 2021-02-03T18:26:31Z

lucene/core/src/java/org/apache/lucene/index/StandardDirectoryReader.java

-  /** called only from static open() methods */
+  /**
+   * package private constructor, called only from static open() methods. If parameter {@code
+   * leafSorter} is not {@code null}, the provided {@code readers} are expected to be already sorted


Could we add an assertLeavesSorted method to confirm they really were properly sorted by the caller? There are so many paths for creating StandardDirectoryReader through are code base that it would not surprise me if one of the misses to sort properly.

Great suggestion! I moved sorting of leaves to the constructor.
Addressed in 8155f2e.

Move sorting of readers to the constructor of StandardDirectoryReader, to ensure that they are always sorted.

jimczi

The change looks good. I left some additional comments and questions.

lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java

lucene/core/src/java/org/apache/lucene/index/StandardDirectoryReader.java

lucene/core/src/test/org/apache/lucene/index/TestIndexWriterReader.java

jimczi · 2021-02-09T15:49:36Z

lucene/core/src/java/org/apache/lucene/index/IndexWriterConfig.java

+   * @param leafSorter – a comparator for sorting leaf readers
+   * @return IndexWriterConfig with leafSorter set.
+   */
+  public IndexWriterConfig setLeafSorter(Comparator<LeafReader> leafSorter) {


You added a specific unit test for this feature but we could also set a random value in LuceneTestCase#newIndexWriterConfig to improve the coverage.

How will leafSorter sort leaves in this case? By what criteria? If a sorting criteria is anything different than the existing order of leaves, this will break some of our tests that rely on specific docIds. For example tests in the class TestBasics that check for specific docIDs.

mayya-sharipova · 2021-03-22T15:57:59Z

I've moved this PR to a new PR in the new Lucene github repo. Closing this PR.

mayya-sharipova force-pushed the indexreader-writer-with-sort branch from 953de6f to d9fcc94 Compare January 27, 2021 21:51

sonatype-lift bot reviewed Jan 27, 2021

View reviewed changes

mayya-sharipova requested a review from jimczi January 29, 2021 13:56

mikemccand reviewed Jan 29, 2021

View reviewed changes

Address feedback

7ddff67

- Move leafSorter from IndexWriter to IndexWriterConfig - Remove exception on setting both indexSorter and leafSorter

mikemccand reviewed Feb 3, 2021

View reviewed changes

mayya-sharipova added 2 commits February 4, 2021 13:28

Address feedback

8155f2e

Move sorting of readers to the constructor of StandardDirectoryReader, to ensure that they are always sorted.

Fix formatting

0a4c982

jimczi reviewed Feb 9, 2021

View reviewed changes

mayya-sharipova closed this Mar 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter #2256

LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter #2256

mayya-sharipova commented Jan 27, 2021

sonatype-lift bot Jan 27, 2021

mikemccand left a comment

mikemccand Jan 29, 2021

mayya-sharipova Feb 1, 2021

msokolov Feb 1, 2021

mayya-sharipova Feb 1, 2021

msokolov Feb 1, 2021

msokolov Feb 1, 2021

mayya-sharipova Feb 2, 2021

mikemccand left a comment

mikemccand Feb 3, 2021

mayya-sharipova Feb 4, 2021

mikemccand Feb 3, 2021

mayya-sharipova Feb 4, 2021

mikemccand Feb 3, 2021

mayya-sharipova Feb 4, 2021

jimczi left a comment

jimczi Feb 9, 2021

mayya-sharipova Mar 22, 2021

mayya-sharipova commented Mar 22, 2021

LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter #2256

LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter #2256

Conversation

mayya-sharipova commented Jan 27, 2021

sonatype-lift bot Jan 27, 2021

Choose a reason for hiding this comment

mikemccand left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikemccand left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jimczi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mayya-sharipova commented Mar 22, 2021