Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index ids in binary form. #25352

Merged
merged 2 commits into from Jul 7, 2017

Conversation

Projects
None yet
6 participants
@jpountz
Copy link
Contributor

commented Jun 22, 2017

This is a first iteration that mostly aims at triggering some discussion.
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content that is stored in Elasticsearch comes
from another database that uses eg. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.

This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like #24615.

Many tests do not pass since they expect to find a string representation of the
id in the index.

Closes #18154

@nik9000

This comment has been minimized.

Copy link
Contributor

commented Jun 22, 2017

Neat!

@jpountz jpountz force-pushed the jpountz:feature/binary_ids branch Jul 6, 2017

Index ids in binary form.
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content that is stored in Elasticsearch comes
from another database that uses eg. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.

This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like #24615.

Closes #18154

@jpountz jpountz force-pushed the jpountz:feature/binary_ids branch to 9d6d81d Jul 6, 2017

@jpountz jpountz added v6.0.0 and removed WIP labels Jul 6, 2017

@jpountz

This comment has been minimized.

Copy link
Contributor Author

commented Jul 6, 2017

I removed the WIP label, now that we removed the ability to have multiple types in 6.x indices, I was able to fix the backward compatibility layer and tests.

@s1monw I'd be interested to have your opinion about the approach.
@rjernst I'd like to have your review on the encoding of ids. Can you look at the approach that is taken in Uid.encodeId?

@jpountz jpountz requested review from s1monw and rjernst Jul 6, 2017

@s1monw

This comment has been minimized.

Copy link
Contributor

commented Jul 7, 2017

@jpountz can you list the downsides (if there are any) of this change?

@jpountz

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2017

I don't think there are significant downsides. I mostly want reviews because those are things for which handling bw compat is a PITA so I'd like to get it right as much as possible and reduce chances that we think there is a better way in the next months. This is also why I'd like to fold it into 6.0 so that we do not need to support old ids anymore as soon as 7.0.

If your ID is not recognized as a numeric id or base64 id then we will prepend a byte to it which will make it one byte longer. However for numerics and base64 ids it should make things better: it should make numeric ids a bit less than 50% shorter and base64 ids about 33% shorter.

I don't think it would add parse-time overhead since the cost of base64 decoding should be the same as the cost of UTF-8 encoding. However the shorter keys might help Lucene since fewer bytes need to be compared upon sorting, which might help both with flushing when we radix sort the ids, and merging when we need to sort ids coming from multiple segments on the fly using a heap. It might also make indices slightly smaller, especially those that index few fields.

At the moment, there is no optimization for autogenerated ids to keep things simple, but actually we first generate a binary id then encode it as a string and then decode it again. We might be able to skip the string representation entirely hopefully at some point in the future.

One downside that I do not care too much about is that the encoded representation does not preserve order, so I switched fielddata on the _id field to expose the binary doc values API rather than sorted. Everything will keep working, it might just be a bit slower. I'd like to remove fielddata support on _id and _uid eventually anyway (#25240) so I don't think this is much of an issue.

@s1monw

s1monw approved these changes Jul 7, 2017

Copy link
Contributor

left a comment

left some minor nits

BytesRef[] bytesRefs = new BytesRef[values.size()];
for (int i = 0; i < bytesRefs.length; i++) {
BytesRef id;
if (context.indexVersionCreated().onOrAfter(Version.V_6_0_0_alpha3)) {

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 7, 2017

Contributor

hmm maybe it's worth to move the if check out of the loop and assign it a boolean? I also wonder why the context is nullable? this looks like asking for trouble here?

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 7, 2017

Author Contributor

yeah this is old leniency that we removed over time. I believe we reached a state that allows us to remove the nullable annotation everywhere now

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 7, 2017

Contributor

++

}
break;
default:
throw new AssertionError();

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 7, 2017

Contributor

please give it some text here it's always better to have a message :)

for (int i = 0; i < length; ++i) {
final char c = id.charAt(i);
final boolean allowed =
(c >= '0' && c <= '9') ||

This comment has been minimized.

Copy link
@s1monw

s1monw Jul 7, 2017

Contributor

0-9A-Za-z being consecutive in ASCII would have made things much easier... ;)

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 7, 2017

Author Contributor

ohhhh yes!!!

@jpountz jpountz merged commit 40bb166 into elastic:master Jul 7, 2017

1 of 2 checks passed

elasticsearch-ci Build started sha1 is merged.
Details
CLA Commit author is a member of Elasticsearch
Details

@jpountz jpountz deleted the jpountz:feature/binary_ids branch Jul 7, 2017

@jpountz

This comment has been minimized.

Copy link
Contributor Author

commented Jul 7, 2017

@rjernst I merged in order to have time to watch for failures before leaving, but I'm still interested in having your feedback on the encoding of ids!

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jul 7, 2017

Merge branch 'master' into restore-local-checkpoint-tracker
* master: (42 commits)
  Harden global checkpoint tracker
  Remove deprecated created and found from index, delete and bulk (elastic#25516)
  fix testEnsureVersionCompatibility for 5.5.0 release
  fix Version.v6_0_0 min compatibility version to 5.5.0
  Add bwc indices for 5.5.0
  Add v5_5_1 constant
  [DOCS] revise high level client Search Scroll API docs (elastic#25599)
  Improve REST error handling when endpoint does not support HTTP verb, add OPTIONS support (elastic#24437)
  Avoid SecurityException in repository-S3 on DefaultS3OutputStream.flush() (elastic#25254)
  [Tests] Add tests for CompletionSuggestionBuilder#build() (elastic#25575)
  Enable cross-setting validation
  [Docs] Fix typo in bootstrap-checks.asciidoc (elastic#25597)
  Index ids in binary form. (elastic#25352)
  bwc checkout should fetch from all remotes
  IndexingIT should check for global checkpoints regardless of master version
  [Tests] Add tests for PhraseSuggestionBuilder#build() (elastic#25571)
  Remove unused class MinimalMap (elastic#25590)
  [Docs] Document Scroll API for Java High Level REST Client (elastic#25554)
  Disable date field mapping changing (elastic#25285)
  Allow BWC Testing against a specific branch (elastic#25510)
  ...
@rjernst
Copy link
Member

left a comment

This is nice! Sorry it took so long to look at. I left some ideas for optimization.

// The last 3 symbols (18 bits) are encoding 2 bytes (16 bits)
// so the last symbol only actually uses 16-12=4 bits and can only take 16 values
last = id.charAt(length - 1);
if (last != 'A' && last != 'E' && last != 'I' && last != 'M' && last != 'Q'&& last != 'U'&& last != 'Y'

This comment has been minimized.

Copy link
@rjernst

rjernst Jul 12, 2017

Member

These conditionals (in case 2 above, and the allowed case below) could be optimized with boolean lookup tables.

for (int i = 0; i < id.length(); i += 2) {
int b1 = id.charAt(i) - '0';
int b2;
if (i + 1 == id.length()) {

This comment has been minimized.

Copy link
@rjernst

rjernst Jul 12, 2017

Member

You could avoid this conditional by checking for an odd length before the loop, and only iterating through the loop on an even number, then have a single condition after the loop to add the last value in the odd case, followed by the end marker.

final int b1 = (b >>> 4);
final int b2 = b & 0x0f;
chars[(i - 1) * 2] = (char) (b1 + '0');
if (i == idBytes.length - 1 && b2 == 0x0f) {

This comment has been minimized.

Copy link
@rjernst

rjernst Jul 12, 2017

Member

You could avoid this conditional by checking the last byte for the end marker before the loop, so that you only iterate over pairs of real numbers, then decode the last value in the odd case after the loop.

private static String decodeBase64Id(byte[] idBytes) {
assert Byte.toUnsignedInt(idBytes[0]) <= BASE64_ESCAPE;
if (Byte.toUnsignedInt(idBytes[0]) == BASE64_ESCAPE) {
idBytes = Arrays.copyOfRange(idBytes, 1, idBytes.length);

This comment has been minimized.

Copy link
@rjernst

rjernst Jul 12, 2017

Member

It is unfortunate Base64.Encoder.encodeToString only has a byte[] variant and not a ByteBuffer version that would allow avoiding the copy...

final int b = Byte.toUnsignedInt(idBytes[i]);
final int b1 = (b >>> 4);
final int b2 = b & 0x0f;
chars[(i - 1) * 2] = (char) (b1 + '0');

This comment has been minimized.

Copy link
@rjernst

rjernst Jul 12, 2017

Member

Maybe try using two 256 entry lookup tables for the upper and lower values. That would save two additions, the right shift, and cast from byte to int.

@ddorian

This comment has been minimized.

Copy link

commented May 5, 2018

Can we do a similar approach when indexing number ids fields, on cases when we only need equality operator and we're storing as keyword instead of numeric ?

Either by an option in the mapping (should I open a feature request?).
Or in the client (should I just base64?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.