New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OAK-10384: Fix stripping of large indexed ordered properties #1071
Conversation
amit-jain
commented
Aug 16, 2023
- Convert to BytesRef and then recheck and trim by the amount exceeding
- Convert to BytesRef and then recheck and trim by the amount exceeding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to have a better truncation algorithm, and better test cases.
...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java
Outdated
Show resolved
Hide resolved
I would make the truncation method public and then write a good unit test for it.
|
...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java
Outdated
Show resolved
Hide resolved
|
||
private static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) { | ||
String truncated = value; | ||
if (value.length() > maxLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could lead to a null pointer exception. From what I can see, we aren't guaranteed that it is not null.
- Handling surrogates correctly using byte level manipulations (Code from Thomas Mueller)
- Add valid string check after truncation
@@ -315,6 +313,38 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr | |||
} | |||
return fieldAdded; | |||
} | |||
|
|||
protected static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change the name of the function to truncateToMaxLength
, getTruncatedBytesRef
something like that? IMO checkTruncateLength
suggests that the function might just be verifying something rather than actually performing a truncation operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkTruncateLength
is meant to be checkAndTruncate
and I removd And
for brevity. So, maybe checkAndTruncateLength
if its more clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it would then be even clearer to split the function in two: one that truncates and the other checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that splitting will make it clearer though. The method has the check interwined also to check the higher code points are not split.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the function does quite a lot of things - and I generally like our code to follow the SRP. Some minor refactoring can lead to this, which IMO is a bit clearer. It is more code, but I think it's clearer - especially when we fix the TODOs for new people to read it.
// TODO: explain why we use 10
private static final int EMERGENCY_TRUNCATION_LENGTH = 10;
...
// assuming we rename checkTruncateLength() -> truncateToMaxLength()
protected static BytesRef truncateToMaxLength(String prop, String value, String path, int maxLength) {
log.trace("Property {} at path:[{}] has value {}", prop, path, value);
BytesRef ref = new BytesRef(value);
if (ref.length <= maxLength) {
return ref;
}
ref = truncateBytesRef(ref, maxLength);
if (ref.length > maxLength) {
ref = emergencyTruncate(ref, maxLength);
}
return ref;
}
private static BytesRef truncateBytesRef(BytesRef ref, int maxLength) {
log.info("Truncating as length after encoding {} is > {} ", ref.length, maxLength);
int end = calculateEndPosition(ref, maxLength);
byte[] truncatedBytes = Arrays.copyOf(ref.bytes, end + 1);
BytesRef truncatedRef = new BytesRef(new String(truncatedBytes, StandardCharsets.UTF_8));
log.trace("Truncated to {}", truncatedRef.utf8ToString());
return truncatedRef;
}
// TODO: explain this more in detail, with examples and why it's necessary.
private static int calculateEndPosition(BytesRef ref, int maxLength) {
int end = maxLength - 1;
while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
end--;
}
if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
end--;
}
return end;
}
// TODO: explain why this is necessary and with examples.
private static BytesRef emergencyTruncate(BytesRef ref, int maxLength) {
log.error("Truncation did not work: still {} bytes", ref.length);
String truncated = ref.utf8ToString();
while (ref.length > maxLength) {
truncated = truncated.substring(0, truncated.length() - EMERGENCY_TRUNCATION_LENGTH);
ref = new BytesRef(truncated);
}
return ref;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...I think this takes the refactoring too far and in fact makes readability harder. Generally if there was a need for reuse for some of the sub methods it would have made sense. Explanation of the code can happen through documentation as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know, I see where you're coming from about the code length. But I believe that while it might appear longer, it doesn't reduce readability. In fact, breaking the code down helps us describe and understand what each segment is doing, and more crucially, makes it easier to explain why it's doing that.
Imagine we come across a pesky bug half a year from now and then stumble upon this piece of code:
while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
end--;
}
if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
end--;
}
Yeah I see that it skips over tails of utf-8 multi-byte sequences up to 3 bytes. But why?
Thus, wouldn't it be so much easier if we immediately knew the reason behind this logic, instead of scratching our heads trying to decode it? That's why I believe refactoring it is so crucial - it makes it so much easier to explain why we are doing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if you don't like the refactoring that's okay. I think what's most crucial is that we just document why we are doing those magical things - with some examples as well.
String truncated = new String(bytes2, StandardCharsets.UTF_8); | ||
ref = new BytesRef(truncated); | ||
log.trace("Truncated property {} at path:[{}] to {}", prop, path, ref.utf8ToString()); | ||
while (ref.length > maxLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important to document this "emergency" even more. When can it happen, what do we do to fix it, and why is it okay for us to fix it the way we are? From a quick glance at the code, it is not immediately clear why this works.
Specifically, one or more clear examples would make it very clear imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the original comment noted it should never happen but Thomas added for unforseen cases rather than just throeing an error.
prop, path, ref.length, maxLength); | ||
int end = maxLength - 1; | ||
// skip over tails of utf-8 multi-byte sequences (up to 3 bytes) | ||
while ((ref.bytes[end] & 0b11000000) == 0b10000000) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add examples of these skipping and removals of multi-byte sequences? And also add explanations for why they are necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are tests available in LuceneLargeStringPropertyTest.java. Is tht not enough, maybe I can add a bit of documentation.
if ((ref.bytes[end] & 0b11000000) == 0b11000000) { | ||
end--; | ||
} | ||
byte[] bytes2 = Arrays.copyOf(ref.bytes, end + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for clarity, I would use something like truncatedBytes
instead of bytes2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes sense.
Address review suggestions - Change name of the method to getTruncatedBytesRef - Add javadocs for the new method and the tests - Rename variable
@@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr | |||
return fieldAdded; | |||
} | |||
|
|||
/** | |||
* Returns a {@code BytesRef} object constructed from the given {@code String} value and also truncates the length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could improve the comment here a little, as it essentially it says: "truncates the length to ensure that the multi-byte sequences are properly truncated". I understand that it does some truncation. But I'm still left with the question of why this is necessary, and in which scenarios? What are some examples of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok makes sense to add a clarifying comment that lucene limits the length and the length that is checked is the BytesRef.length
* | ||
* <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}. | ||
* The method first truncates continuation bytes, which start with {@code 10} in binary. It then truncates the head byte, which | ||
* starts with {@code 11}. Both truncation operations use a binary mask of {@code 11100000}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do both truncation operations really use a binary mask of 11100000
?
From what I can see, the first while loop:
while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
end--;
}
is checking if the byte at ref.bytes[end]
is of the form 10xxxxxx
. It's doing that by masking the byte with 11000000
and checking if the result is 10000000
. If it is, then the byte has the form 10xxxxxx
.
And the following if
statement:
if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
end--;
}
checks in a similar manner if the byte-sequence is of the pattern 110xxxxx
. But I don't see that any of the truncation operations use a binary mask of 11100000
.
In any case, I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly. That way, writing a proper example wouldn't clutter the getTruncatedBytesRef
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right the comment essentially mentions the wrong mask.
I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly.
I am not in favor of refactoring it as the operation is atomic and really does one thing which is to get a bytesRef. But then if you feel strongly let's take the refactoring as a separate issue.
Address review suggestions - Add comment on BytesRef truncation - Fix typo
- Clarify test comment
- Truncate BytesRef value and handle surrogates correctly (Code from Thomas Mueller)