Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for index sorting with document blocks #12829

Merged
merged 47 commits into from Jan 11, 2024

Conversation

s1monw
Copy link
Member

@s1monw s1monw commented Nov 21, 2023

Today index sorting with likely break document blocks added with IndexWriter#addDocuments(...) and friends since the index sorter has no indication of what documents are part of a block. This change proposes a marker field as a requirement for parent documents if the block API is used in conjunction with index sorting.
At this point this change requires a NumericDV field only present on the last document of a block or on every document indexed as an individual document iff a parent field is configured on the Sort. This can potentially be extended to a Term which is not as straight forward since validations might be more difficult and today postings are not available when we sort the flushed segment.

Relates to #12711

Copy link
Contributor

@msokolov msokolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few potshots ...

writer.addDocuments(Arrays.asList(doc, new Document()));
});
assertEquals(
"only the last document in the block must contain a numeric doc values field named: parent",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will prevent us from handling multiple levels of nested child documents -- with a sort. I guess that would require multiple bitsets so you can do different kinds of joins. Seems kind of involved, but I think it is possible today?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what the semantics of this would be. I think we should not violate the semantics of the addDocuments API that guarantees the insert order being preserved across merges etc. from that perspective I think it's actually good that we don't do that?
Also if you rely on a certain order you should sort it ahead of time or at retrieval time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msokolov I think multiple joins (nested levels) would work, except the user cannot use this parent DV field to record the sub-structure? Another field must be added by the user to record the child -> grandchild blocks. I agree it'd be nice if this parent field would accommodate more than one level of nesting, but maybe that can be a followon improvement.

This field would only handle a single level (the entire block of child and grandchild docs).

@jpountz
Copy link
Contributor

jpountz commented Nov 22, 2023

In general, I like the idea of making block joins more of a first-class citizen. I have been thinking for a long time about changing how blocks are identified from using bitsets to using a doc-value field where only parents documents have a value for the field, and the value must be the number of child documents that the parent has. This PR would like to identify blocks with a doc-values field as well, which is interesting. This makes me wonder if Lucene could be adding this doc value field for root documents automatically, similarly to what it is doing for soft deletes? (Possibly in a follow-up PR)

@msokolov
Copy link
Contributor

I like @jpountz's idea to make the value of this field be the number of children. It is simple and makes sense, and is pretty close to having the degree of flexibility that the current setup has.

I'm a little worried about giving up functionality, but I think if we had a list of parent-fields rather than a single parent-field that would cover what we can do today? Maybe one is enough - you can still do hierarchies with one field, but today's API actually let's you define multiple overlapping relations.

@s1monw
Copy link
Member Author

s1monw commented Nov 23, 2023

I'm a little worried about giving up functionality, but I think if we had a list of parent-fields rather than a single parent-field that would cover what we can do today? Maybe one is enough - you can still do hierarchies with one field, but today's API actually let's you define multiple overlapping relations.

I don't think we give up any functionality. can you elaborate what functionality you are referring to? I don't think we should have a list of parent fields that IW requires, what would that list be used for? I also don't understand how the APU lets you define overlapping relations that this change would prevent you from?

@s1monw s1monw requested a review from msokolov November 23, 2023 15:25
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @s1monw -- I made a first pass review!

lucene/core/src/java/org/apache/lucene/search/Sort.java Outdated Show resolved Hide resolved
lucene/core/src/java/org/apache/lucene/search/Sort.java Outdated Show resolved Hide resolved
writer.addDocuments(Arrays.asList(doc, new Document()));
});
assertEquals(
"only the last document in the block must contain a numeric doc values field named: parent",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msokolov I think multiple joins (nested levels) would work, except the user cannot use this parent DV field to record the sub-structure? Another field must be added by the user to record the child -> grandchild blocks. I agree it'd be nice if this parent field would accommodate more than one level of nesting, but maybe that can be a followon improvement.

This field would only handle a single level (the entire block of child and grandchild docs).

* it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there
* is still a tie after all SortFields are checked, the internal Lucene docid is used to break it.
*
* @param parentField the name of a numeric doc values field that marks the last document of a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it clearer that you only use this for index-time sorting? It has no meaning at search time (and should not be set)?

Or ... does it have search-time meaning? Can I use this to retrieve whole blocks in my search hits?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well I guess we could do something with that field at search time. Yet, at this point it's indexing time only

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not change Sort if possible. It's weird that it needs to know about a parentField, and IW should take care during static sort time (flush, merge) to validate that what it (IWC) knows as the parent field is consistent with the index-time sort?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikemccand do you mean we should rather make this a property of IWC rather than sort? That would be no issue. I used sort here since it's the only thing that is using this?

* document blocks indexed with {@link
* org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's update relatives. This
* is required for indices that use index sorting in combination with document blocks in order
* to maintain the document order of the blocks documents. Index sorting will effectively
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the Sort is otherwise free to break up the block? I.e. this parentField will override sort fields that might attempt to sort child docs away from one another?

@mikemccand
Copy link
Member

using a doc-value field where only parents documents have a value for the field, and the value must be the number of child documents that the parent has

This is a neat idea too -- would this maybe make completing the join at search time more efficient? Today we scan for the next set bit, but I guess with this approach, we'd still scan for the next NumericDocValues that is set, and then see the child count, but that would be sort of redundant (we could still subtract docids to get the number of children)? Also, I'm not sure how the nested case would be supported (same as the current PR's approach too).

This makes me wonder if Lucene could be adding this doc value field for root documents automatically, similarly to what it is doing for soft deletes? (Possibly in a follow-up PR)

This is a nice idea too, but I think we'd still need user input to give us a free field name to use for this purpose.

@msokolov
Copy link
Contributor

I don't think we give up any functionality. can you elaborate what functionality you are referring to? I don't think we should have a list of parent fields that IW requires, what would that list be used for? I also don't understand how the APU lets you define overlapping relations that this change would prevent you from?

I think we have always been able to model a multi-level hierarchy. Suppose I index a book broken into documents representing paragraphs grouped into sections, chapters, and parts; so it's a five-level hierarchy. There can be documents at every level and we want to be able to do roll-ups to any level. If I index a book as a block, we support this today.

I might be confused, but I didn't think that would work with this arrangement. However maybe it does? I guess one describes every component document as having the book as a parent, and then later can still use a different parent bitset to do the roll-ups to other levels?

@s1monw
Copy link
Member Author

s1monw commented Nov 28, 2023

@mikemccand @jpountz thanks for your ideas. I'd love to flash this out more before we add anything we write to the index. Today we'd only use this for sorting but if that field can be useful to speed up or improve searching I am happy to extend this further.
My suggestion here is to move the parentField away from sort to IWC and if you configure it we automatically add this field in DWPT to every parent document. We can then also check if it's set and a sort is used and fail if a block is indexed and it's unset. That would allow us to gradually add this in 9.x and require it for sorting in 10? That way I can also go ahead and set the number of children as the value for that field. WDYT?

@msokolov there is no reduction of features here. you can do all the things you have done before except of re-sorting the do cuments within a block. This is a API guarantee that we try to fix here. You can still add another field to each doc specifying which hierarchy level it belongs to and then do books -> chapters -> paragraphs whatever relations to that. All we try to prevent is that the API guarantee is broken up due to a sort etc. Hope that helps?

@msokolov
Copy link
Contributor

@s1monw that makes sense. I think I was confusing index-time changes and query-time changes. This whole piece of functionality is a little confusing given how loosely coupled these things are!

@s1monw
Copy link
Member Author

s1monw commented Dec 3, 2023

@mikemccand @jpountz I went down a hybrid way. This change now only requires the user to specify the field name and IW adds it to the last doc in the block if it's configured. I will for sure add some more tests to this once we settled on the approach. The only think I am torn on is, if we set the num of children as a value for the DV field then I guess we should have a good usecase for that?

@mikemccand
Copy link
Member

Thanks @s1monw -- I'll try to review soon.

The only think I am torn on is, if we set the num of children as a value for the DV field then I guess we should have a good usecase for that?

One thing this would be helpful for is stronger consistency check in CheckIndex? Without doing this we can just validate that the parent docs are sorted correctly, whereas today we verify that all docs are sorted correctly (since user must provide a sort congruent with their doc-blocks). So we are weakening CheckIndex a bit by skipping checking the sort of the child docs ... if we stored the number of children in the new DV field, we could gain that back?

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, I made a first pass review! Brain hurts now ...

Net/net I think I like this approach, but I'm worried about performance impact during indexing (I can help test this), and about changing Sort at all to know about parent fields. I left lots of other small comments ... thanks @s1monw!

* it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there
* is still a tie after all SortFields are checked, the internal Lucene docid is used to break it.
*
* @param parentField the name of a numeric doc values field that marks the last document of a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not change Sort if possible. It's weird that it needs to know about a parentField, and IW should take care during static sort time (flush, merge) to validate that what it (IWC) knows as the parent field is consistent with the index-time sort?

.setIndexSort(sort)
.setMergePolicy(newLogMergePolicy()))) {
// add 10 docs
for (int i = 0; i < 10; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is testing indexing doc blocks into an old (pre-10.0) index right? In this case does IW still add the DV field to the parent doc? The test here seems not to be configuring a parent field, so I guess no? Only indices created 10.0+ will do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, unless we 1. backport this which is possible and 2. if you set a parent field in the old index. yet I can't fully test that just yet since it's not backported. makes sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this makes sense -- if you have a pre-10.0 index, then none of the new code is run. You must create a new index to get the stronger checking.

(Or, 9.10.x index if we backport).

Function<IndexSorter.DocComparator, IndexSorter.DocComparator> comparatorWrapper = in -> in;

if (state.segmentInfo.getHasBlocks() && indexSort.getParentField() != null) {
final DocIdSetIterator readerValues =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check that this is non-null and throw CorruptIndexException or so if it is null?

lucene/core/src/java/org/apache/lucene/index/Sorter.java Outdated Show resolved Hide resolved

/**
* Wraps the given field in a reserved field and registers it as reserved. Only DWPT should do
* this to mark fields as private / reserved to prevent this fieldname to be used from the outside
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that we are recording the name so it cannot be created with a different type, and indeed prevent it from being added to a document except by IndexingChain -- are there any other restrictions on the use of the field? Could we clear up the javadoc to be more explicit? I think eg I can still get a DV iterator over this field and read its values from "outside". Maybe we could say "prevent this fieldname from being used when writing documents ..."?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to tighten up the language.

@s1monw
Copy link
Member Author

s1monw commented Dec 12, 2023

@mikemccand I agree we should not add this to sort but rather tread it the same way we treat the softDeletes field. it's essentially the same thing from an IW perspective. I will go ahead an make that change.

@mikemccand
Copy link
Member

One small observation here: one can use the add/updateDocuments API today with no intention of using those as doc blocks at search time, purely as an optimization over calling separate addDocument every time.

I'm not sure what performance difference this makes, but it should only help speed up indexing throughput.

E.g. luceneserver (the search engine behind jirasearch and soon now githubsearch) does this in its bulk indexing API, I think.

This saves the IW/DWPT overhead, but comes with some risk if your blocks are too big since IW cannot flush until the block is done.

With this change, I think such usage would still be fine if you have no index sort? But if you have an index sort, then IW will always do the parent block validation / tracking?

I think net/net this is fine, I just wanted to call it out, for crazy people trying to eek out every last bit of indexing throughput :)

@mikemccand
Copy link
Member

I opened mikemccand/luceneutil#252 to try to measure the performance change of addDocument N times vs addDocuments once.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great -- I left a bunch of polishy comments, and still one head-scratcher comment. Thanks @s1monw.

lucene/CHANGES.txt Outdated Show resolved Hide resolved
lucene/CHANGES.txt Outdated Show resolved Hide resolved
lucene/CHANGES.txt Outdated Show resolved Hide resolved
throw new IllegalArgumentException(
"cannot configure ["
+ parentFieldName
+ "] as parent document field ; this index uses ["
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think I'm confused about how this code is working/catching something.

parentFieldName is what was configured in IWC, and the incoming fieldName is coming from ... a document being newly indexed where this in-memory segment is seeing this fieldName for the first time and creating a FieldInfo for it? Hmm but then how is that new FieldInfo setting its .isParentField? Sorry I gotta stare at this / think about this a bit more :)


/**
* Wraps the given field in a reserved field and registers it as reserved. Only DWPT should do
* this to mark fields as private / reserved to prevent this fieldname to be used from the outside
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to tighten up the language.

@mikemccand
Copy link
Member

"cannot configure ["
+ parentFieldName
+ "] as parent document field ; this index uses ["
+ fieldName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I am still confused here -- this looks backwards to me (fieldName and parentFieldName should be swapped)?

I.e. parentFieldName was derived from the index as the already established parent field, and now a newly indexed document / current IWC is attempting to set fieldName as the parent field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is confusing from it's name. The parentFieldName comes from IWC and fieldname is the incoming field. I think it's right the way it is, maybe we need to do some rewording altogether but when you look at the tests I think the error message makes sense?
I would like to find a way that users don't have to specify these fieldnames altogether but that should be a follow up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to find a way that users don't have to specify these fieldnames altogether but that should be a follow up

Yeah +1 to defer.

Phew, I fired up a GitHub workspace to spelunk the code/PR better! I think (maybe?) I understand what's going on and my confusion! I need more caffeine now ...

I think this code is supposed to only catch the addIndexes mismatch case? Because, by design, if user is indexing documents, they cannot tickle this code -- the parentFieldName (from IWC) and the newly born FieldInfo (in-memory DWPT segment) will also use IWC's configured parent field, so these ifs should never trigger from simply indexing documents?

Maybe we can rename parentFieldName to iwcParentFieldName making its provenance clear? Then, by contrast, it's more clear that fieldName is the new foreign entity that must be validated against IWC's config.

And then I find this index super confusing -- in the test case it seems to refer to the incoming index (via addIndexes) and not to the currently open index that your IndexWriter is pointing to?

How about changing the wording for all of these exceptions to something like the current index is configured with parentName=X; cannot add external index with parentName=Y or so?

throw new IllegalArgumentException(
"this index has ["
+ fieldName
+ "] as parent document field already but parent document field is not configured in IWC");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here? parentFieldName was derived from the segments in the current index (not set), and now a document (current IWC) is trying to set fieldName as the parent field?

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew! I'm tired :)

})
.getMessage();
assertEquals(
"cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See I think this exception message is backwards :) Shouldn't it be:

cannot configure [foobar] as parent document field ; this index uses [foo] as parent document field already

I.e. "this index" is referring to dir2/w2 (the index into which we are adding an external index), and it is configured to use foo as the parent field, while the incoming index via addIndexes (dir1) is attempting (illegally) to change that to foobar?

Edit: OK I think this all comes down to this index being super confusing to me :) If it's referring to the incoming (newly added) index via addIndexes, the message makes sense. But I would think this index means the one you opened with IW...

if (info.isParentField()) {
if (parentField != null && parentField.equals(info.name) == false) {
throw new IllegalArgumentException(
"multiple parent fields [" + info.name + ", " + parentField + "]");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we improve this wording @s1monw ^^? It is confusing because it does not actually say specifically what is wrong (that you must have only one parent field).

@@ -78,6 +80,7 @@ public FieldInfos(FieldInfo[] infos) {
boolean hasPointValues = false;
boolean hasVectorValues = false;
String softDeletesField = null;
String parentField = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe fix these silly pointless initializers ^^?

This FieldInfos.java is some of the oldest, most "pristine"/"untouched" code in Lucene ;) You can almost count the tree rings ...

})
.getMessage();
assertEquals(
"cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here?

"cannot configure ["
+ parentFieldName
+ "] as parent document field ; this index uses ["
+ fieldName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to find a way that users don't have to specify these fieldnames altogether but that should be a follow up

Yeah +1 to defer.

Phew, I fired up a GitHub workspace to spelunk the code/PR better! I think (maybe?) I understand what's going on and my confusion! I need more caffeine now ...

I think this code is supposed to only catch the addIndexes mismatch case? Because, by design, if user is indexing documents, they cannot tickle this code -- the parentFieldName (from IWC) and the newly born FieldInfo (in-memory DWPT segment) will also use IWC's configured parent field, so these ifs should never trigger from simply indexing documents?

Maybe we can rename parentFieldName to iwcParentFieldName making its provenance clear? Then, by contrast, it's more clear that fieldName is the new foreign entity that must be validated against IWC's config.

And then I find this index super confusing -- in the test case it seems to refer to the incoming index (via addIndexes) and not to the currently open index that your IndexWriter is pointing to?

How about changing the wording for all of these exceptions to something like the current index is configured with parentName=X; cannot add external index with parentName=Y or so?

@s1monw
Copy link
Member Author

s1monw commented Jan 9, 2024

@mikemccand I did another pass on it and change the wording. I think I know why you are confused and I tired to adress it. We use the exact same wording in the soft deletes case. I will work on this in a sep. issue

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, thanks @s1monw -- looks great!!

@s1monw s1monw merged commit df6bd25 into apache:main Jan 11, 2024
4 checks passed
s1monw added a commit to s1monw/lucene that referenced this pull request Jan 15, 2024
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to  parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents.

Relates to apache#12711
s1monw added a commit to s1monw/lucene that referenced this pull request Jan 15, 2024
A parent field can now safely be added to indices that are created on
or after 9.9.x that have no doc blocks as well as newly created indices.
Adding a parent field to older indices is not permitted.

Relates to apache#12829
@s1monw s1monw mentioned this pull request Jan 15, 2024
s1monw added a commit to s1monw/lucene that referenced this pull request Jan 16, 2024
This change prevents users from adding a parent field to an existing index.
Parent field must be added before any documents are added to the index to
prevent documents without the parent field from being indexed and later
to be treated as child documents upon merge.

Relates to apache#12829
s1monw added a commit that referenced this pull request Jan 16, 2024
This change prevents users from adding a parent field to an existing index.
Parent field must be added before any documents are added to the index to
prevent documents without the parent field from being indexed and later
to be treated as child documents upon merge.

Relates to #12829
slow-J pushed a commit to slow-J/lucene that referenced this pull request Jan 16, 2024
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to  parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents. 

Relates to apache#12711
slow-J pushed a commit to slow-J/lucene that referenced this pull request Jan 16, 2024
This change prevents users from adding a parent field to an existing index.
Parent field must be added before any documents are added to the index to
prevent documents without the parent field from being indexed and later
to be treated as child documents upon merge.

Relates to apache#12829
s1monw added a commit that referenced this pull request Jan 17, 2024
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to  parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents.

Relates to #12711
s1monw added a commit that referenced this pull request Jan 17, 2024
This change prevents users from adding a parent field to an existing index.
Parent field must be added before any documents are added to the index to
prevent documents without the parent field from being indexed and later
to be treated as child documents upon merge.

Relates to #12829
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants