Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve document API for stored fields. #12116

Merged
merged 9 commits into from
Feb 6, 2023

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Jan 27, 2023

Currently stored fields have to look at binaryValue(), stringValue() and numericValue() to guess the type of the value and then store it. This has a few issues:

  • If there is a problem, e.g. all of these 3 methods return null, it's currently discovered late, when we already passed the responsibility of writing data from IndexingChain to the codec.
  • numericValue() is used both for numeric doc values and storage. This makes it impossible to implement a double field that is stored and doc-valued, as numericValue() needs to return simultaneously a number that consists of the double for storage, and the long bits of the double for doc values.
  • binaryValue() is used both for sorted(_set) doc values and storage. This makes it impossible to implement keyword fields that is stored and doc-valued, as the field returns a non-null value for both binaryValue() and stringValue() and stored fields no longer know which field to store.

This commit introduces IndexableField#storedValue(), which is used only for stored fields. This addresses the above issues. IndexingChain passes the storedValue() directly to the codec, so it's impossible for a stored fields format to mistakenly use binaryValue()/stringValue()/numericValue() instead of storedValue().

Currently stored fields have to look at binaryValue(), stringValue() and
numericValue() to guess the type of the value and then store it. This has a few
issues:
 - If there is a problem, e.g. all of these 3 methods return null, it's
   currently discovered late, when we already passed the responsibility of
   writing data from IndexingChain to the codec.
 - numericValue() is used both for numeric doc values and storage. This makes
   it impossible to implement a `double` field that is stored and doc-valued,
   as numericValue() needs to return simultaneously a number that consists of
   the double for storage, and the long bits of the double for doc values.
 - binaryValue() is used both for sorted(_set) doc values and storage. This
   makes it impossible to implement `keyword` fields that is stored and
   doc-valued, as the field returns a non-null value for both binaryValue() and
   stringValue() and stored fields no longer know which field to store.

This commit introduces `IndexableField#storedValue()`, which is used only for
stored fields. This addresses the above issues. IndexingChain passes the
storedValue() directly to the codec, so it's impossible for a stored fields
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of
storedValue().
@rmuir
Copy link
Member

rmuir commented Jan 27, 2023

This is great, thanks for looking into it. It moves "type-guessing" into the one place that should be doing it, which is the generic Field.java

i didn't really think too deeply about implementation but I suppose there might be a few options:

  • "Box" the value along with its type, like what you've done here (are there perf implications?)
  • change api of the StoredFieldsWriter (codec api), e.g. to provide field's type (like the enum inside your current "box") as parameter or separate method (e.g. writeIntField, writeStringField, ...)

edit: try to organize thoughts better here

@jpountz
Copy link
Contributor Author

jpountz commented Jan 30, 2023

@rmuir I tried your idea of using a different method per type of stored value, and I liked the result: StoredValue no longer escapes the document API and the default impl for merging became more straightforward.

@jpountz jpountz marked this pull request as ready for review February 2, 2023 15:53
@jpountz
Copy link
Contributor Author

jpountz commented Feb 2, 2023

I went one step further:

  • StringField now implements storedValue() instead of relying on the guessing of Field#storedValue()
  • IntField, LongField, FloatField and DoubleField now support storing the value

This looks good to me overall, so I marked the PR as ready for review, what do you think?

@rmuir
Copy link
Member

rmuir commented Feb 2, 2023

Thanks for removing the type-guessing from the "structured" fields and supporting store! Can we fix the class-level javadocs for IntField & co?

They still say this: "If you also need to store the value, you should add a separate StoredField instance."

@jpountz
Copy link
Contributor Author

jpountz commented Feb 2, 2023

Thanks for catching this, I just pushed a fix.

@@ -43,7 +38,7 @@
* <ol>
* <li>For every document, {@link #startDocument()} is called, informing the Codec that a new
* document has started.
* <li>{@link #writeField(FieldInfo, IndexableField)} is called for each field in the document.
* <li>{@link #writeField} is called for each field in the document.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would maybe just change this to {@code writeField}. Not sure which one of the overloaded methods it links too now?

* Called before writing the stored fields of the document. {@link #writeField(FieldInfo,
* IndexableField)} will be called for each stored field. Note that this is called even if the
* document has no stored fields.
* Called before writing the stored fields of the document. {@link #writeField} will be called for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar here, i think {@code would be better

* #writeField(FieldInfo, IndexableField)}, and {@link #finish(int)}, returning the number of
* documents that were written. Implementations can override this method for more sophisticated
* merging (bulk-byte copying, etc).
* #writeField}, and {@link #finish(int)}, returning the number of documents that were written.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last ambiguous @link that might better be @code

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks great, not only does it open up new options for Field classes to be simpler, but it cleans up a lot of type-safety and IMO makes a lot of these pieces easier to read.

@jpountz jpountz merged commit 96136b4 into apache:main Feb 6, 2023
@jpountz jpountz deleted the improve_indexable_field_stored_value branch February 6, 2023 16:08
@jpountz jpountz added this to the 9.6.0 milestone Feb 6, 2023
jpountz added a commit that referenced this pull request Feb 6, 2023
jpountz added a commit that referenced this pull request Feb 6, 2023
Currently stored fields have to look at binaryValue(), stringValue() and
numericValue() to guess the type of the value and then store it. This has a few
issues:
 - If there is a problem, e.g. all of these 3 methods return null, it's
   currently discovered late, when we already passed the responsibility of
   writing data from IndexingChain to the codec.
 - numericValue() is used both for numeric doc values and storage. This makes
   it impossible to implement a `double` field that is stored and doc-valued,
   as numericValue() needs to return simultaneously a number that consists of
   the double for storage, and the long bits of the double for doc values.
 - binaryValue() is used both for sorted(_set) doc values and storage. This
   makes it impossible to implement `keyword` fields that is stored and
   doc-valued, as the field returns a non-null value for both binaryValue() and
   stringValue() and stored fields no longer know which field to store.

This commit introduces `IndexableField#storedValue()`, which is used only for
stored fields. This addresses the above issues. IndexingChain passes the
storedValue() directly to the codec, so it's impossible for a stored fields
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of
storedValue().
jpountz added a commit that referenced this pull request Feb 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants