Allow binary sort values. #17959
 * Returns a string representation of an {@link InetAddress} that is
 * compatible with sorting and {@link InetAddress#getByName(String)}.
 */
public static String formatToSortableString(InetAddress ip) {
Is this really the best place for this? Seems it has nothing to do with actual networking, but just with the ip mapping, so it should probably go there? And does it really need to be public or can it be an impl detail of the mapper?
I'm a little concerned that this change just propagates craziness through the internal api that was a hack before (turning BytesRef into Text). We have ipfield mapper using sorted doc values, so why can't we sort on byte[]?
We do sort on byte[]. This change is only about how we expose the sort values to the user, since I don't think BytesRef should be exposed in the client API. I chose to go with String, but byte[] would work too, although it is not straightforward since JSON does not support binary values. I made the change this way because I think it is the option that plays best with backward compatibility. But we have other options:
Why isn't how doc values are formatted for the response separate from how they are transferred for distributed sorting?
Just had a quick discussion with Ryan about it: one source of confusion here is that mappings might not be available on the coordinating node, so shards have to get information about how to render sort values and then serialize it back to the coordinating node, where the rendering will happen.
I opened #17965 to discuss the general issue. However, I don't think it should block this PR.
@jpountz Instead of needing to pass along how to format to the coordinating node, could we pass along the formatted value (formatted on each shard using the mapper)? To the coordinating node it would just be a black-box string that is read and inserted in the results for the docs that are chosen in the top N.
And that would mean also passing along the binary encoded values and sorting based on the binary value. So it is a variation of the first option you proposed above.
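To make the proposal above concrete, here is a minimal sketch in plain Java. It assumes each shard returns both the binary sort key and an already formatted string, and that the coordinating node orders hits by the binary key only. `ShardHit` and `CoordinatorMerge` are hypothetical names invented for illustration, and the merge is a plain sort rather than Lucene's `TopDocs.merge`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-hit payload sent from a shard to the coordinating node.
final class ShardHit {
    final byte[] sortKey;  // binary value, used only for ordering
    final String display;  // string formatted on the shard, treated as opaque

    ShardHit(byte[] sortKey, String display) {
        this.sortKey = sortKey;
        this.display = display;
    }
}

final class CoordinatorMerge {
    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Merge all shard results, order by binary key, and return the
    // pre-formatted strings of the first n hits without interpreting them.
    static List<String> topN(List<List<ShardHit>> perShard, int n) {
        List<ShardHit> all = new ArrayList<>();
        for (List<ShardHit> shard : perShard) {
            all.addAll(shard);
        }
        all.sort((x, y) -> compareUnsigned(x.sortKey, y.sortKey));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, all.size()); i++) {
            out.add(all.get(i).display);
        }
        return out;
    }
}
```

In this sketch the coordinating node never needs the mapping: it only compares bytes and copies strings.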
Note that it would also mean we would not need to render strings as sortable. e.g. we could keep
@jpountz don't forget that sort values should be reusable in their rendered form by the search_after feature
Will do!
Actually it is not needed in the current PR either. I first thought that we should aim at returning strings that sort in the same order as the underlying binary values but I'm starting to think this is not worth the trouble. I'll remove it.
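As a small illustration of why a human-readable string does not automatically sort like the underlying binary value, the sketch below compares two IPv4 addresses both ways. `IpOrderDemo` is a hypothetical class written for this example. It uses 4-byte arrays for brevity, which is an assumption and not necessarily the encoding the `ip` field uses.

```java
final class IpOrderDemo {
    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBinary(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] nine = {9, 0, 0, 1};
        byte[] ten = {10, 0, 0, 2};

        // Binary order: 9.0.0.1 comes before 10.0.0.2 (negative result).
        System.out.println(compareBinary(nine, ten));

        // String order: "9.0.0.1" comes after "10.0.0.2" because '9' > '1'
        // (positive result), so the two orders disagree.
        System.out.println("9.0.0.1".compareTo("10.0.0.2"));
    }
}
```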
Right, this is why I had to add the
d88b7ba to ca42c15
I pushed a new commit that removes the ability to render sort values as sortable.
I tried to do this, but it ended up making merging top docs on the coordinating node more complicated: each shard would have its TopDocs and an Object[][] for sort values (one Object[] per ScoreDoc), then we would call Lucene's TopDocs.merge to compute the top hits, and then we would have to associate each ScoreDoc object back to the right Object[]. Since it was making things more complicated, I gave up on this idea. What do you think?
@rjernst Any opinions about the comment above?
This looks good, thanks for the change to printable sort values. I think the inet addresses look much better with the minimized format. We can revisit whether/how to simplify (i.e. removing SortAndFormats) depending on what happens with #17965.
The `ip` field uses a binary representation internally. This breaks when rendering sort values in search responses, since Elasticsearch tries to write a binary byte[] as a UTF-8 JSON string. This commit extends the `DocValueFormat` API to give fields a chance to choose how to render values. Closes elastic#6077
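A rough sketch of the idea in the commit message, assuming a per-field formatter that turns the raw binary sort value into something JSON can carry. `SortValueFormat` and `IpSortValueFormat` are hypothetical stand-ins written for this example. They are not the actual `DocValueFormat` API, whose signatures are not shown in this conversation.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical stand-in for a per-field formatter (not the real DocValueFormat).
interface SortValueFormat {
    Object format(byte[] rawValue);
}

// Renders a binary IP sort value as a printable address string.
final class IpSortValueFormat implements SortValueFormat {
    @Override
    public Object format(byte[] rawValue) {
        try {
            // getByAddress accepts 4-byte (IPv4) or 16-byte (IPv6) arrays
            // and performs no DNS lookup.
            return InetAddress.getByAddress(rawValue).getHostAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("expected a 4- or 16-byte address", e);
        }
    }
}
```

Note that the JDK's `getHostAddress` does not compress runs of zeros in IPv6 addresses, so a minimized rendering would need additional formatting on top of this.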
ca42c15 to de8354d
Relates to #17971