Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow binary sort values. #17959

Merged
merged 1 commit into from
May 6, 2016
Merged

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Apr 25, 2016

The ip field uses a binary representation internally. This breaks when
rendering sort values in search responses since elasticsearch tries to write a
binary byte[] as an utf8 json string. This commit extends the DocValueFormat
API in order to give fields a chance to choose how to render values.

Closes #6077

Relates to #17971

@jpountz jpountz added >enhancement :Search/Search Search-related issues that do not fall into other categories v5.0.0-alpha2 labels Apr 25, 2016
* Returns a string representation of an {@link InetAddress} that is
* compatible with sorting and {@link InetAddress#getByName(String)}.
*/
public static String formatToSortableString(InetAddress ip) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really the best place for this? Seems it has nothing to do with actual networking, but just with the ip mapping, so it should probably go there? And does it really need to be public or can it be an impl detail of the mapper?

@rjernst
Copy link
Member

rjernst commented Apr 25, 2016

I'm a little concerned that this change just propagates craziness through the internal api that was a hack before (turning BytesRef into Text). We have ipfield mapper using sorted doc values, so why can't we sort on byte[]?

@jpountz
Copy link
Contributor Author

jpountz commented Apr 25, 2016

We do sort on byte[]. This change is only about how we expose the sort values to the user since I don't think BytesRef should be exposed in the client API (I chose to go with String but byte[] would work too, even though it is not straightforward since json does not support binary values).

I did it the change this way because I think this is the option that would play best in terms of backward compatibility. But we have other options:

  • returning sort values for all fields that use binary/sorted doc values using eg. hexadecimal. This would be a break for string fields since sort values used to be the string value that was used at index time and it would now be an hexadecimal string.
  • not exposing sort values in SearchResponse anymore and only using them internally to merge results coming from different shards.

@rjernst
Copy link
Member

rjernst commented Apr 25, 2016

Why isn't how doc values are formatted for response separate from how they are transferred for distributed sorting?

@jpountz
Copy link
Contributor Author

jpountz commented Apr 25, 2016

Just had a quick discussion with Ryan about it: one source of confusion here is due to the fact that mappings might not be available on the coordinating node, so shards have to get information about how to render sort values and then serialize it back to the coordinating node where the rendering will happen.

@jpountz
Copy link
Contributor Author

jpountz commented Apr 25, 2016

I opened #17965 to discuss the general issue. However I don't think it should block this PR.

@rjernst
Copy link
Member

rjernst commented Apr 26, 2016

@jpountz Instead of needing to pass along how to format to the coordinating node, could we pass along the formatted value (formatted on each shard using the mapper), but to the coordinating node that is just a black box string that is read, and inserted in the results for the docs which are chosen in top N?

@rjernst
Copy link
Member

rjernst commented Apr 26, 2016

And that would mean also passing along the binary encoded values and sorting based on the binary value. So kind of a variation of your first option you proposed above.

@rjernst
Copy link
Member

rjernst commented Apr 26, 2016

Note that it would also mean we would not need to render strings as sortable. eg we could keep ::1 because it would only be used for display, not for actual sorting on the coordinating node.

@clintongormley
Copy link

@jpountz don't forget that sort values should be reusable in their rendered form by the seach_after feature

@jpountz
Copy link
Contributor Author

jpountz commented Apr 26, 2016

could we pass along the formatted value (formatted on each shard using the mapper), but to the coordinating node that is just a black box string that is read, and inserted in the results for the docs which are chosen in top N?

Will do!

Note that it would also mean we would not need to render strings as sortable. eg we could keep ::1 because it would only be used for display, not for actual sorting on the coordinating node.

Actually it is not needed in the current PR either. I first thought that we should aim at returning strings that sort in the same order as the underlying binary values but I'm starting to think this is not worth the trouble. I'll remove it.

@jpountz
Copy link
Contributor Author

jpountz commented Apr 26, 2016

@jpountz don't forget that sort values should be reusable in their rendered form by the seach_after feature

Right, this is why I had to add the DocValuesFormat.parseBytesRef(String) method so that we can get back to the binary representation from the string value. This will likely also be needed for parsing include/exclude lists of terms aggregations, see #17705.

@jpountz
Copy link
Contributor Author

jpountz commented Apr 27, 2016

I pushed a new commit that removes the ability to render sort values as sortable.

Instead of needing to pass along how to format to the coordinating node, could we pass along the formatted value (formatted on each shard using the mapper), but to the coordinating node that is just a black box string that is read

I tried to do this but this ended up making merging top docs on the coordinating node more complicated, since each shard would have its TopDocs and Object[][] for sort values (one Object[] per ScoreDoc), then we would call Lucene's TopDocs.merge to compute the top hits, and then we would have to associate each ScoreDoc object back to the rigth Object[]. Since it was making things more complicated, I gave up on this idea, what do you think?

@jpountz
Copy link
Contributor Author

jpountz commented May 2, 2016

@rjernst Any opinions about the comment above?

@rjernst
Copy link
Member

rjernst commented May 4, 2016

This looks good, thanks for the the change to printable sort values, I think the inet addresses look much better with the minimized format. We can revisit whether/how to simplify (ie removing SortAndFormats) depending on what happens with #17965.

The `ip` field uses a binary representation internally. This breaks when
rendering sort values in search responses since elasticsearch tries to write a
binary byte[] as an utf8 json string. This commit extends the `DocValueFormat`
API in order to give fields a chance to choose how to render values.

Closes elastic#6077
@jpountz jpountz merged commit de8354d into elastic:master May 6, 2016
@jpountz jpountz deleted the enhancement/sort_values branch May 27, 2016 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v5.0.0-alpha3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants