Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate index/store document APIs take 2? #12142

Open
jpountz opened this issue Feb 10, 2023 · 6 comments
Open

Separate index/store document APIs take 2? #12142

jpountz opened this issue Feb 10, 2023 · 6 comments

Comments

@jpountz
Copy link
Contributor

jpountz commented Feb 10, 2023

Description

As @rmuir managed to make me look into reducing the amount of guessing we're doing in our document API, I think that a requirement for doing it right will be to split our index and store document APIs. Currently, Document and IndexableField are trying to cover both, which creates a confusing API.

For the store API, I wonder if we still need a Document abstraction. Maybe we could return a simple List<StoredValue> and push the work on users to convert it into a map if they want to be able to access fields by name. This is something that is easy to do with Java streams nowadays.

For the index API, I'd like to model IndexableField according to how it's getting consumed by IndexingChain. Something like:

class IndexableField {

  // Populates the inverted index, IndexableValue wraps one of a BytesRef or a TokenStream
  IndexableValue invertedValue(Analyzer analyzer, TokenStream reuse);

  // Populates points
  byte[] pointValue();

  // Populates NUMERIC or SORTED_NUMERIC doc values
  long numericDocValue(); 

  // Populates BINARY, SORTED or SORTED_SET doc values
  BytesRef docValue();

  // Populates stored fields
  StoredValue storedValue();

  // Populates byte[] vectors
  byte[] binaryVectorValue();

  // Populates float[] vectors
  float[] floatVectorValue();

}

I haven't thought much about it but I'm curious to hear thoughts.

@rmuir
Copy link
Member

rmuir commented Feb 10, 2023

I want the additional type-safety of things like Field classes that users use, but I think at a high-level, Document/Field api is good and intuitive model for end users.

So I'm not sure about the value of introducing different apis or more separation.
You can already avoid using Document if you don't want to use that.

I don't agree with all the methods on Document today that pretend to offer map-like access but are really linear time searches through a list though, I would support deprecating all those.

@rmuir
Copy link
Member

rmuir commented Feb 10, 2023

As far as what gets constructed by the "default" / "easy" stored fields visitor, i do think that one should be a Map and not conflated with Document used for indexing.

@rmuir
Copy link
Member

rmuir commented Feb 10, 2023

Honestly i think a big challenge is naming. Last time we tried this, the name was StoredDocument but I think that name is confusing. Too bad we created a class named StoredFields in #11998 as I almost think that would be perfect, for what you should "get back from the index".

@jpountz
Copy link
Contributor Author

jpountz commented Feb 10, 2023

I'm not sure about the value of introducing different apis or more separation.

The issue in my mind is that IndexableField has something called storedValue for indexing which is never populated when retrieving stored fields, and it also has a numericValue() for retrieval that can be any number type, but than IndexingChain always treats as a long.

I don't agree with all the methods on Document today that pretend to offer map-like access but are really linear time searches through a list though, I would support deprecating all those.

+1

@rmuir
Copy link
Member

rmuir commented Feb 10, 2023

yeah and just to clarify at high-level: Indexing Time:

  • I think getting Document/Field api should be geared at this. Remove ability for document to be a "map" backed by a slow list. Try to add type-safety to the Field classes like we have been doing. I do think storing should be easy here (e.g. Field.Store.YES like people are accustomed to).

retrieval time:

  • should be something different, probably map-like by "default/easy", but
  • Let's put some serious thought into naming first, before we try this again.

@ArtemLukanin-TomTom
Copy link

I feel, that this discussion is close to my comment #10374 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants