New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter #2256
LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter #2256
Conversation
1. Add an option to supply a custom leaf sorter for IndexWriter. A DirectoryReader opened from this IndexWriter will have its leaf readers sorted with the provided leaf sorter. This is useful for indices on which it is expected to run many queries with particular sort criteria (e.g. for time-based indices this is usually a descending sort on timestamp). Providing leafSorter allows to speed up early termination for this particular type of sort queries. 2. Add an option to supply a custom leaf sorter for StandardDirectoryReader. The leaf readers of this StandardDirectoryReader will be sorted according the the provided leaf sorter.
953de6f
to
d9fcc94
Compare
@@ -933,6 +936,31 @@ protected final void ensureOpen() throws AlreadyClosedException { | |||
* low-level IO error | |||
*/ | |||
public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException { | |||
this(d, conf, null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THREAD_SAFETY_VIOLATION: Unprotected write. Non-private method IndexWriter(...)
indirectly writes to field this.mergeScheduler.infoStream
outside of synchronization.
Reporting because another access to the same memory occurs on a background thread, although this access may not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an exciting change, especially for non-concurrent early termination cases.
I left a couple questions.
Thanks @mayya-sharipova !
@@ -941,6 +969,11 @@ public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException { | |||
// obtain the write.lock. If the user configured a timeout, | |||
// we wrap with a sleeper and this might take some time. | |||
writeLock = d.obtainLock(WRITE_LOCK_NAME); | |||
if (config.getIndexSort() != null && leafSorter != null) { | |||
throw new IllegalArgumentException( | |||
"[IndexWriter] can't use index sort and leaf sorter at the same time!"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, why is that? I thought this change sorts whole segments, and index sorting sorts documents within one segment? Why do they conflict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikemccand Thank you for your review! From the discussion on the Jira ticket, we also wanted to use writer's leafSorter
during merging for arranging docs in a merged segment (by putting first docs from a segment with highest sort values according to leafSorter
).
This will be in conflict with indexSorter
, as if provided it will arrange merged docs according to its sort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK - that MultiSorter is sorting (by index sort) documents in several segments being merged; it will control the order of the documents within the new segment, but shouldn't have any influence on or conflict with the order of the segments being merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msokolov Thank you for your feedback and explanation. Sorry, I am still not super clear about this point. It seems to me as indexSorter
maps each leaf's documents into the merged segment according to its sort, the same way leafSorter
will map each leaf's documents into the merged segment according to its sort (given several merging segments, in the merged segment the first docs should be docs from a segment with highest sort values..)
But I am ok to remove this check in this PR, as this PR is not concerned with merging, and follow up on this point in the following PR.
Addressed in 7ddff67
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, maybe I misunderstood the intent. Perhaps an example would clarify. Say that we have three segments, A, B , C containing documents A={0, 3, 6}; B={1, 4, 7}; C={2, 5, 8}
, where the documents are understood to have a single field with the value shown, and the index sort is ordered in the natural way. Without this change, if we merged A and B, we'd get a new segment A+B={0, 1, 3, 4, 6, 7}
. Now suppose there is no index sort (and the documents just "happen" to be in the index in the order given above, for the sake of the example), and we apply a leafSorter
that sorts by the minimum value of any document in the segment (I guess it could be any sort of aggregate over the segment?), then we would get A+B={0, 3, 6, 1, 4, 7}
. Now if we apply both sorts, we would get the same result as in the first case, right? I'm still unclear how the conflict arises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I do see where you said we want to use leafSorter
to sort documents. I guess I might challenge that design in favor of building on the indexSort
we already have? But perhaps this is better sorted out in the context of a later PR, as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msokolov Thank you for the clarification. Indeed, it is much clear with an example you provided. Looks like we need to think and discuss more about merging scenario, may be in the next PR or Jira ticket.
- Move leafSorter from IndexWriter to IndexWriterConfig - Remove exception on setting both indexSorter and leafSorter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mayya-sharipova -- I left a few comments. This is a neat new capability ...
@@ -143,12 +158,17 @@ static StandardDirectoryReader open( | |||
|
|||
writer.incRefDeleter(segmentInfos); | |||
|
|||
if (writer.getConfig().getLeafSorter() != null) { | |||
readers.sort(writer.getConfig().getLeafSorter()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is IWC.leafSorter allowed to be live-updated? If so, maybe we should pull out local variable here to hold onto the leafSorter
and use that variable here and below, to ensure it is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikemccand Thank you for your feedback.
The intent was to keep leafSorter
constant and non-updatable, as indexWriter.getConfig
returns LiveIndexWriterConfig
that doesn't allow setting of a new leafSorter
. I understand that leafSorter
in config can still be changed, and your comment is very valid.
But In 8155f2e I've moved sorting of leaves to the constructor, so hopefully this should resolve this concern.
@@ -169,7 +189,10 @@ static StandardDirectoryReader open( | |||
* @lucene.internal | |||
*/ | |||
public static DirectoryReader open( | |||
Directory directory, SegmentInfos infos, List<? extends LeafReader> oldReaders) | |||
Directory directory, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we handling the near-real-time reader (IndexWriter.getReader(...)
) path here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like IndexWriter.getReader(...)
path is covered by a previous open
method on line 123.
/** called only from static open() methods */ | ||
/** | ||
* package private constructor, called only from static open() methods. If parameter {@code | ||
* leafSorter} is not {@code null}, the provided {@code readers} are expected to be already sorted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add an assertLeavesSorted
method to confirm they really were properly sorted by the caller? There are so many paths for creating StandardDirectoryReader
through are code base that it would not surprise me if one of the misses to sort properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion! I moved sorting of leaves to the constructor.
Addressed in 8155f2e.
Move sorting of readers to the constructor of StandardDirectoryReader, to ensure that they are always sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good. I left some additional comments and questions.
lucene/core/src/java/org/apache/lucene/index/StandardDirectoryReader.java
Show resolved
Hide resolved
* @param leafSorter – a comparator for sorting leaf readers | ||
* @return IndexWriterConfig with leafSorter set. | ||
*/ | ||
public IndexWriterConfig setLeafSorter(Comparator<LeafReader> leafSorter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You added a specific unit test for this feature but we could also set a random value in LuceneTestCase#newIndexWriterConfig
to improve the coverage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will leafSorter sort leaves in this case? By what criteria? If a sorting criteria is anything different than the existing order of leaves, this will break some of our tests that rely on specific docIds. For example tests in the class TestBasics
that check for specific docIDs.
I've moved this PR to a new PR in the new Lucene github repo. Closing this PR. |
Add an option to supply a custom leaf sorter for IndexWriter.
A DirectoryReader opened from this IndexWriter will have its leaf
readers sorted with the provided leaf sorter. This is useful for
indices on which it is expected to run many queries with particular
sort criteria (e.g. for time-based indices this is usually a
descending sort on timestamp). Providing leafSorter allows
to speed up early termination for this particular type of
sort queries.
Add an option to supply a custom leaf sorter for
StandardDirectoryReader. The leaf readers of this
StandardDirectoryReader will be sorted according the the provided leaf
sorter.