-
Notifications
You must be signed in to change notification settings - Fork 982
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve document API for stored fields. #12116
Improve document API for stored fields. #12116
Conversation
Currently stored fields have to look at binaryValue(), stringValue() and numericValue() to guess the type of the value and then store it. This has a few issues: - If there is a problem, e.g. all of these 3 methods return null, it's currently discovered late, when we already passed the responsibility of writing data from IndexingChain to the codec. - numericValue() is used both for numeric doc values and storage. This makes it impossible to implement a `double` field that is stored and doc-valued, as numericValue() needs to return simultaneously a number that consists of the double for storage, and the long bits of the double for doc values. - binaryValue() is used both for sorted(_set) doc values and storage. This makes it impossible to implement `keyword` fields that is stored and doc-valued, as the field returns a non-null value for both binaryValue() and stringValue() and stored fields no longer know which field to store. This commit introduces `IndexableField#storedValue()`, which is used only for stored fields. This addresses the above issues. IndexingChain passes the storedValue() directly to the codec, so it's impossible for a stored fields format to mistakenly use binaryValue()/stringValue()/numericValue() instead of storedValue().
This is great, thanks for looking into it. It moves "type-guessing" into the one place that should be doing it, which is the generic Field.java i didn't really think too deeply about implementation but I suppose there might be a few options:
edit: try to organize thoughts better here |
@rmuir I tried your idea of using a different method per type of stored value, and I liked the result: |
lucene/core/src/java/org/apache/lucene/document/StoredValue.java
Outdated
Show resolved
Hide resolved
I went one step further:
This looks good to me overall, so I marked the PR as ready for review, what do you think? |
Thanks for removing the type-guessing from the "structured" fields and supporting store! Can we fix the class-level javadocs for They still say this: "If you also need to store the value, you should add a separate StoredField instance." |
Thanks for catching this, I just pushed a fix. |
@@ -43,7 +38,7 @@ | |||
* <ol> | |||
* <li>For every document, {@link #startDocument()} is called, informing the Codec that a new | |||
* document has started. | |||
* <li>{@link #writeField(FieldInfo, IndexableField)} is called for each field in the document. | |||
* <li>{@link #writeField} is called for each field in the document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would maybe just change this to {@code writeField}
. Not sure which one of the overloaded methods it links too now?
* Called before writing the stored fields of the document. {@link #writeField(FieldInfo, | ||
* IndexableField)} will be called for each stored field. Note that this is called even if the | ||
* document has no stored fields. | ||
* Called before writing the stored fields of the document. {@link #writeField} will be called for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar here, i think {@code
would be better
* #writeField(FieldInfo, IndexableField)}, and {@link #finish(int)}, returning the number of | ||
* documents that were written. Implementations can override this method for more sophisticated | ||
* merging (bulk-byte copying, etc). | ||
* #writeField}, and {@link #finish(int)}, returning the number of documents that were written. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one last ambiguous @link
that might better be @code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks great, not only does it open up new options for Field classes to be simpler, but it cleans up a lot of type-safety and IMO makes a lot of these pieces easier to read.
Currently stored fields have to look at binaryValue(), stringValue() and numericValue() to guess the type of the value and then store it. This has a few issues: - If there is a problem, e.g. all of these 3 methods return null, it's currently discovered late, when we already passed the responsibility of writing data from IndexingChain to the codec. - numericValue() is used both for numeric doc values and storage. This makes it impossible to implement a `double` field that is stored and doc-valued, as numericValue() needs to return simultaneously a number that consists of the double for storage, and the long bits of the double for doc values. - binaryValue() is used both for sorted(_set) doc values and storage. This makes it impossible to implement `keyword` fields that is stored and doc-valued, as the field returns a non-null value for both binaryValue() and stringValue() and stored fields no longer know which field to store. This commit introduces `IndexableField#storedValue()`, which is used only for stored fields. This addresses the above issues. IndexingChain passes the storedValue() directly to the codec, so it's impossible for a stored fields format to mistakenly use binaryValue()/stringValue()/numericValue() instead of storedValue().
Currently stored fields have to look at binaryValue(), stringValue() and numericValue() to guess the type of the value and then store it. This has a few issues:
double
field that is stored and doc-valued, as numericValue() needs to return simultaneously a number that consists of the double for storage, and the long bits of the double for doc values.keyword
fields that is stored and doc-valued, as the field returns a non-null value for both binaryValue() and stringValue() and stored fields no longer know which field to store.This commit introduces
IndexableField#storedValue()
, which is used only for stored fields. This addresses the above issues. IndexingChain passes the storedValue() directly to the codec, so it's impossible for a stored fields format to mistakenly use binaryValue()/stringValue()/numericValue() instead of storedValue().