Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter #2256

Conversation

mayya-sharipova
Copy link
Contributor

  1. Add an option to supply a custom leaf sorter for IndexWriter.
    A DirectoryReader opened from this IndexWriter will have its leaf
    readers sorted with the provided leaf sorter. This is useful for
    indices on which it is expected to run many queries with particular
    sort criteria (e.g. for time-based indices this is usually a
    descending sort on timestamp). Providing leafSorter allows
    to speed up early termination for this particular type of
    sort queries.

  2. Add an option to supply a custom leaf sorter for
    StandardDirectoryReader. The leaf readers of this
    StandardDirectoryReader will be sorted according the the provided leaf
    sorter.

1. Add an option to supply a custom leaf sorter for IndexWriter.
A DirectoryReader opened from this IndexWriter will have its leaf
readers sorted with the provided leaf sorter. This is useful for
indices on which it is expected to run many queries with particular
sort criteria (e.g. for time-based indices this is usually a
descending sort on timestamp). Providing leafSorter allows
to speed up early termination for this particular type of
sort queries.

2. Add an option to supply a custom leaf sorter for
StandardDirectoryReader. The leaf readers of this
StandardDirectoryReader will be sorted according the the provided leaf
sorter.
@@ -933,6 +936,31 @@ protected final void ensureOpen() throws AlreadyClosedException {
* low-level IO error
*/
public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
this(d, conf, null);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THREAD_SAFETY_VIOLATION: Unprotected write. Non-private method IndexWriter(...) indirectly writes to field this.mergeScheduler.infoStream outside of synchronization.
Reporting because another access to the same memory occurs on a background thread, although this access may not.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an exciting change, especially for non-concurrent early termination cases.

I left a couple questions.

Thanks @mayya-sharipova !

@@ -941,6 +969,11 @@ public IndexWriter(Directory d, IndexWriterConfig conf) throws IOException {
// obtain the write.lock. If the user configured a timeout,
// we wrap with a sleeper and this might take some time.
writeLock = d.obtainLock(WRITE_LOCK_NAME);
if (config.getIndexSort() != null && leafSorter != null) {
throw new IllegalArgumentException(
"[IndexWriter] can't use index sort and leaf sorter at the same time!");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, why is that? I thought this change sorts whole segments, and index sorting sorts documents within one segment? Why do they conflict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikemccand Thank you for your review! From the discussion on the Jira ticket, we also wanted to use writer's leafSorter during merging for arranging docs in a merged segment (by putting first docs from a segment with highest sort values according to leafSorter).

This will be in conflict with indexSorter, as if provided it will arrange merged docs according to its sort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK - that MultiSorter is sorting (by index sort) documents in several segments being merged; it will control the order of the documents within the new segment, but shouldn't have any influence on or conflict with the order of the segments being merged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msokolov Thank you for your feedback and explanation. Sorry, I am still not super clear about this point. It seems to me as indexSorter maps each leaf's documents into the merged segment according to its sort, the same way leafSorter will map each leaf's documents into the merged segment according to its sort (given several merging segments, in the merged segment the first docs should be docs from a segment with highest sort values..)

But I am ok to remove this check in this PR, as this PR is not concerned with merging, and follow up on this point in the following PR.

Addressed in 7ddff67

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, maybe I misunderstood the intent. Perhaps an example would clarify. Say that we have three segments, A, B , C containing documents A={0, 3, 6}; B={1, 4, 7}; C={2, 5, 8}, where the documents are understood to have a single field with the value shown, and the index sort is ordered in the natural way. Without this change, if we merged A and B, we'd get a new segment A+B={0, 1, 3, 4, 6, 7}. Now suppose there is no index sort (and the documents just "happen" to be in the index in the order given above, for the sake of the example), and we apply a leafSorter that sorts by the minimum value of any document in the segment (I guess it could be any sort of aggregate over the segment?), then we would get A+B={0, 3, 6, 1, 4, 7}. Now if we apply both sorts, we would get the same result as in the first case, right? I'm still unclear how the conflict arises.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I do see where you said we want to use leafSorter to sort documents. I guess I might challenge that design in favor of building on the indexSort we already have? But perhaps this is better sorted out in the context of a later PR, as you suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msokolov Thank you for the clarification. Indeed, it is much clear with an example you provided. Looks like we need to think and discuss more about merging scenario, may be in the next PR or Jira ticket.

- Move leafSorter from IndexWriter to IndexWriterConfig
- Remove exception on setting both indexSorter and leafSorter
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mayya-sharipova -- I left a few comments. This is a neat new capability ...

@@ -143,12 +158,17 @@ static StandardDirectoryReader open(

writer.incRefDeleter(segmentInfos);

if (writer.getConfig().getLeafSorter() != null) {
readers.sort(writer.getConfig().getLeafSorter());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is IWC.leafSorter allowed to be live-updated? If so, maybe we should pull out local variable here to hold onto the leafSorter and use that variable here and below, to ensure it is the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikemccand Thank you for your feedback.
The intent was to keep leafSorter constant and non-updatable, as indexWriter.getConfig returns LiveIndexWriterConfig that doesn't allow setting of a new leafSorter. I understand that leafSorter in config can still be changed, and your comment is very valid.

But In 8155f2e I've moved sorting of leaves to the constructor, so hopefully this should resolve this concern.

@@ -169,7 +189,10 @@ static StandardDirectoryReader open(
* @lucene.internal
*/
public static DirectoryReader open(
Directory directory, SegmentInfos infos, List<? extends LeafReader> oldReaders)
Directory directory,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we handling the near-real-time reader (IndexWriter.getReader(...)) path here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like IndexWriter.getReader(...) path is covered by a previous open method on line 123.

/** called only from static open() methods */
/**
* package private constructor, called only from static open() methods. If parameter {@code
* leafSorter} is not {@code null}, the provided {@code readers} are expected to be already sorted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add an assertLeavesSorted method to confirm they really were properly sorted by the caller? There are so many paths for creating StandardDirectoryReader through are code base that it would not surprise me if one of the misses to sort properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! I moved sorting of leaves to the constructor.
Addressed in 8155f2e.

Move sorting of readers to the constructor of StandardDirectoryReader,
to ensure that they are always sorted.
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good. I left some additional comments and questions.

* @param leafSorter – a comparator for sorting leaf readers
* @return IndexWriterConfig with leafSorter set.
*/
public IndexWriterConfig setLeafSorter(Comparator<LeafReader> leafSorter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added a specific unit test for this feature but we could also set a random value in LuceneTestCase#newIndexWriterConfig to improve the coverage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will leafSorter sort leaves in this case? By what criteria? If a sorting criteria is anything different than the existing order of leaves, this will break some of our tests that rely on specific docIds. For example tests in the class TestBasics that check for specific docIDs.

@mayya-sharipova
Copy link
Contributor Author

I've moved this PR to a new PR in the new Lucene github repo. Closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants