Skip to content

Conversation

@imotov
Copy link
Contributor

@imotov imotov commented Mar 27, 2012

Here is another version to compare. I moved configuration completely to the index level and kept compression and include/exclude functionality in the SourceFieldMapper. I think it looks much simpler and doesn't loose any functionality.

@kimchy
Copy link
Member

kimchy commented Mar 29, 2012

Heya, looks much better. Some notes:

  • Why not have the SourceFilter work with BytesHolder when serializing and return BytesHolder? In many cases, the extra parsing now required is not needed, for example, when storing the full source in an external system. Also, in this case, if the filter returns null, we don't really need to store the _source at all. This also means that the part that deserializes the source should handle BytesHolder, and it might be null.
  • With the above note, SourceFilter name might not make sense, maybe ExternalSourceProvider?
  • Still on the fence on the per type source thingy, and not index wide. Teh main reason currently is that we can optimize things if its on the index leve, for example, a system that stores the full source externally does not even need to load the _suorce from the document, so it can denote it somehow using the ExternalSourceProvider, so we can optimize the load process of fields from the doc. On the other hand, in Lucene 4.0, _source will probably move to be stored as a doc value, so it won't be relevant...
  • FetchPhase call copyBytes to pass the byte[] to the InternalHIt, why not just pass the bytes as is, and not do the extra copy?

@imotov
Copy link
Contributor Author

imotov commented Mar 29, 2012

I just pushed the following changes

  • Serialization now works with BytesHolder and returns BytesHolder. I have also added the abstract class ParsingExternalSourceProvider that does parsing to simplify writing providers that need it.
  • SourceFilter is renamed into ExternalSourceProvider.
  • Regarding type-level implementation, it would be difficult to optimize things on the index level because source extraction logic is encapsulated inside SourceFieldMapper, which operates on the type level. One idea that I had is to group results by type on the FetchPhase level and then retrieve all sources for a given type in a single request. That might improve performance for external source providers with high request latency (SQL-based, for example). So, it might make more sense to optimize it on this level in the future. In the current implementation, however, providers should return null to indicate that a record source is not available, and no source processing should be done on retrieval; an empty array to indicate that a record source is available, but they don't need any additional information to retrieve source; or any additional field that are required to obtain source during retrieval.
  • Unnecessary FetchPhase call to copyBytes is removed.

@kimchy
Copy link
Member

kimchy commented Apr 3, 2012

More comments :): Also, can you squash your comments into a single one?

@imotov
Copy link
Contributor Author

imotov commented Apr 3, 2012

I have updated SourceFieldMapper as you requested, squashed everything into a single commit and rebased it against the master.

Regarding the test, I intentionally implemented it in such a way that if the dehydrate method returns null, the rehydrate method is not called on retrieval. I thought it would be a nice way for the provider to indicate if source is "enabled" or "disabled" for the given record. Should I change it so rehydrate is always called?

By the way, here is an alternative version with a provider implemented completely on the index level: imotov@58931d3 If you still think that implementation with one provider per index is better, I can close this pull request and open another one based on that version.

@kimchy
Copy link
Member

kimchy commented Apr 4, 2012

I like the index level source provider much better :), lets go with that one for now, can you create another pull request?

Regarding returning null, I think that it makes sense for a case where the source provider does not persist anything the index index, but still can provide the source based on the type/id.

@imotov
Copy link
Contributor Author

imotov commented Apr 5, 2012

Moved to #1847

@imotov imotov closed this Apr 5, 2012
@imotov imotov deleted the external-source-provider-index-level branch May 1, 2020 22:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants