Index ids in binary form. #25352
This is a first iteration that mostly aims at triggering some discussion.
Since we can only expect base64 ids in the auto-generated case, this PR tries
Another option could be to require base64 ids all the time. It would make things
This PR should bring some benefits, but I expect it to be mostly useful when
Many tests do not pass since they expect to find a string representation of the
I removed the
I don't think there are significant downsides. I mostly want reviews because handling backward compatibility for this kind of change is a PITA, so I'd like to get it as right as possible and reduce the chances that we come up with a better way in the coming months. This is also why I'd like to fold it into 6.0, so that we no longer need to support old ids as of 7.0.
If your id is not recognized as a numeric id or a base64 id, we prepend a byte to it, which makes it one byte longer. For numeric and base64 ids, however, it should make things better: numeric ids should get a bit less than 50% shorter, and base64 ids about 25% shorter, since base64 spends 4 characters on every 3 bytes.
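To make the size trade-off concrete, here is a minimal sketch of one way such an encoding could work. It is not the code from this PR: the marker byte values, the function names, and the rule of only treating unpadded URL-safe strings whose length is a multiple of 4 as base64 are all assumptions made for illustration.

```python
import base64
import re

# Hypothetical marker bytes: these values are assumptions made for this sketch.
UTF8_MARKER = 0xFF     # fallback: the raw UTF-8 bytes of the id follow
NUMERIC_MARKER = 0xFE  # packed decimal digits follow, two digits per byte
BASE64_ESCAPE = 0xFD   # decoded base64 follows; its first byte clashed with a marker

_B64_URLSAFE = re.compile(r"[A-Za-z0-9_-]+")


def _is_numeric(doc_id: str) -> bool:
    # ASCII digits only; leading zeros are fine because every digit is kept.
    return doc_id.isascii() and doc_id.isdigit()


def _is_base64(doc_id: str) -> bool:
    # Assumption: only unpadded URL-safe base64 with a length that is a
    # multiple of 4 qualifies, so decoding and re-encoding is lossless.
    return len(doc_id) % 4 == 0 and _B64_URLSAFE.fullmatch(doc_id) is not None


def encode_id(doc_id: str) -> bytes:
    """Encode an id as bytes, using the most compact form that applies."""
    if not doc_id:
        raise ValueError("ids must not be empty")
    if _is_numeric(doc_id):
        out = bytearray([NUMERIC_MARKER])
        for i in range(0, len(doc_id), 2):
            high = int(doc_id[i])
            # 0xF pads the last byte when the number of digits is odd.
            low = int(doc_id[i + 1]) if i + 1 < len(doc_id) else 0xF
            out.append((high << 4) | low)
        return bytes(out)
    if _is_base64(doc_id):
        raw = base64.urlsafe_b64decode(doc_id)
        # Escape only when the first decoded byte could be mistaken for a marker.
        if raw[0] >= BASE64_ESCAPE:
            return bytes([BASE64_ESCAPE]) + raw
        return raw
    return bytes([UTF8_MARKER]) + doc_id.encode("utf-8")


def decode_id(encoded: bytes) -> str:
    """Invert encode_id."""
    marker = encoded[0]
    if marker == UTF8_MARKER:
        return encoded[1:].decode("utf-8")
    if marker == NUMERIC_MARKER:
        digits = []
        for byte in encoded[1:]:
            digits.append(str(byte >> 4))
            if byte & 0xF != 0xF:  # skip the padding nibble
                digits.append(str(byte & 0xF))
        return "".join(digits)
    raw = encoded[1:] if marker == BASE64_ESCAPE else encoded
    return base64.urlsafe_b64encode(raw).decode("ascii")
```

Under these assumptions a 16-digit numeric id takes 9 bytes instead of 16, a 20-character base64 id usually takes 15 bytes, and any other id costs its UTF-8 length plus one byte.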
I don't think it would add parse-time overhead since the cost of base64 decoding should be the same as the cost of UTF-8 encoding. However the shorter keys might help Lucene since fewer bytes need to be compared upon sorting, which might help both with flushing when we radix sort the ids, and merging when we need to sort ids coming from multiple segments on the fly using a heap. It might also make indices slightly smaller, especially those that index few fields.
At the moment, there is no optimization for auto-generated ids, to keep things simple: we first generate a binary id, then encode it as a string, and then decode it again. Hopefully we can skip the string representation entirely at some point in the future.
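The round trip described above can be sketched as follows. This is only an illustration: the 15-byte length, the use of random bytes, and the function names are assumptions, not what the actual id generator does.

```python
import base64
import os


def generate_binary_id(num_bytes: int = 15) -> bytes:
    # Random bytes stand in for whatever the real generator produces (assumption).
    return os.urandom(num_bytes)


def to_string_id(raw: bytes) -> str:
    # Step 2: the binary id is turned into a URL-safe base64 string without padding.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def to_indexed_bytes(doc_id: str) -> bytes:
    # Step 3: at index time the string is decoded back to the same bytes.
    padding = "=" * (-len(doc_id) % 4)
    return base64.urlsafe_b64decode(doc_id + padding)


raw = generate_binary_id()
doc_id = to_string_id(raw)
# The decode step recovers exactly what the generator produced, which is why
# the intermediate string could in principle be skipped.
assert to_indexed_bytes(doc_id) == raw
```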
One downside that I do not care too much about is that the encoded representation does not preserve order, so I switched fielddata on the