Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OAK-10384: Fix stripping of large indexed ordered properties #1071

Merged
merged 6 commits into from Sep 11, 2023

Conversation

amit-jain
Copy link
Contributor

  • Convert to BytesRef and then recheck and trim by the amount exceeding

- Convert to BytesRef and then recheck and trim by the amount exceeding
Copy link
Member

@thomasmueller thomasmueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have a better truncation algorithm, and better test cases.

@thomasmueller
Copy link
Member

I would make the truncation method public and then write a good unit test for it.

private static final Logger log = LoggerFactory.getLogger(LuceneDocumentMaker.class);

@Test
public void test() {
    Random r = new Random(1);
    for (int i = 0; i < 1000; i++) {
        String x = randomUnicodeString(r, 5);
        BytesRef ref = checkTruncateLength("x", x, "/x", 5);
        assertTrue(ref.length > 0 && ref.length <= 5);
    }
}

private String randomUnicodeString(Random r, int len) {
    StringBuilder buff = new StringBuilder();
    for(int i=0; i<len; i++) {
        // see https://en.wikipedia.org/wiki/UTF-8
        switch (r.nextInt(6)) {
        case 2:
            // 2 UTF-8 bytes
            buff.append('£');
            break;
        case 3:
            // 3 UTF-8 bytes
            buff.append('€');
            break;
        case 4:
            // 4 UTF-8 bytes
            buff.append("\uD800\uDF48");
            break;
        default:
            // most cases:
            // 1 UTF-8 byte (ASCII)
            buff.append('$');
        }
    }
    return buff.toString();
}

public static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) {
    BytesRef ref = new BytesRef(value);
    if (ref.length <= maxLength) {
        return ref;
    }
    log.info("Truncating property {} at path:[{}] as length after encoding {} is > {} ", 
            prop, path, ref.length, maxLength);
    int end = maxLength - 1;
    // skip over tails of utf-8 multi-byte sequences (up to 3 bytes)
    while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
        end--;
    }
    // remove one head of a utf-8 multi-byte sequence (at most 1)
    if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
        end--;
    }
    byte[] bytes2 = Arrays.copyOf(ref.bytes, end + 1);
    String truncated = new String(bytes2, StandardCharsets.UTF_8);
    ref = new BytesRef(truncated);
    while (ref.length > maxLength) {
        log.error("Truncation did not work: still {} bytes", ref.length);
        // this may not properly work with unicode surrogates:
        // it is an "emergency" procedure and should never happen
        truncated = truncated.substring(0, truncated.length() - 10);
        ref = new BytesRef(truncated);
    }
    return ref;
}


private static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) {
String truncated = value;
if (value.length() > maxLength) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could lead to a null pointer exception. From what I can see, we aren't guaranteed that it is not null.

amjain added 2 commits September 1, 2023 12:36
- Handling surrogates correctly using byte level manipulations (Code from Thomas Mueller)
- Add valid string check after truncation
@@ -315,6 +313,38 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr
}
return fieldAdded;
}

protected static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change the name of the function to truncateToMaxLength, getTruncatedBytesRef something like that? IMO checkTruncateLength suggests that the function might just be verifying something rather than actually performing a truncation operation.

Copy link
Contributor Author

@amit-jain amit-jain Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkTruncateLength is meant to be checkAndTruncate and I removd And for brevity. So, maybe checkAndTruncateLength if its more clearer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would then be even clearer to split the function in two: one that truncates and the other checks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that splitting will make it clearer though. The method has the check interwined also to check the higher code points are not split.

Copy link
Contributor

@steffenvan steffenvan Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the function does quite a lot of things - and I generally like our code to follow the SRP. Some minor refactoring can lead to this, which IMO is a bit clearer. It is more code, but I think it's clearer - especially when we fix the TODOs for new people to read it.

// TODO: explain why we use 10
private static final int EMERGENCY_TRUNCATION_LENGTH = 10;
...
// assuming we rename checkTruncateLength() -> truncateToMaxLength()
protected static BytesRef truncateToMaxLength(String prop, String value, String path, int maxLength) {
    log.trace("Property {} at path:[{}] has value {}", prop, path, value);

    BytesRef ref = new BytesRef(value);
    if (ref.length <= maxLength) {
        return ref;
    }

    ref = truncateBytesRef(ref, maxLength);
    
    if (ref.length > maxLength) {
        ref = emergencyTruncate(ref, maxLength);
    }

    return ref;
}

private static BytesRef truncateBytesRef(BytesRef ref, int maxLength) {
    log.info("Truncating as length after encoding {} is > {} ", ref.length, maxLength);
    
    int end = calculateEndPosition(ref, maxLength);
    byte[] truncatedBytes = Arrays.copyOf(ref.bytes, end + 1);
    
    BytesRef truncatedRef = new BytesRef(new String(truncatedBytes, StandardCharsets.UTF_8));
    log.trace("Truncated to {}", truncatedRef.utf8ToString());
    
    return truncatedRef;
}

// TODO: explain this more in detail, with examples and why it's necessary. 
private static int calculateEndPosition(BytesRef ref, int maxLength) {
    int end = maxLength - 1;

    while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
        end--;
    }
    if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
        end--;
    }

    return end;
}

// TODO: explain why this is necessary and with examples. 
private static BytesRef emergencyTruncate(BytesRef ref, int maxLength) {
    log.error("Truncation did not work: still {} bytes", ref.length);
    String truncated = ref.utf8ToString();

    while (ref.length > maxLength) {
        truncated = truncated.substring(0, truncated.length() - EMERGENCY_TRUNCATION_LENGTH);
        ref = new BytesRef(truncated);
    }

    return ref;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...I think this takes the refactoring too far and in fact makes readability harder. Generally if there was a need for reuse for some of the sub methods it would have made sense. Explanation of the code can happen through documentation as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, I see where you're coming from about the code length. But I believe that while it might appear longer, it doesn't reduce readability. In fact, breaking the code down helps us describe and understand what each segment is doing, and more crucially, makes it easier to explain why it's doing that.

Imagine we come across a pesky bug half a year from now and then stumble upon this piece of code:

while ((ref.bytes[end] & 0b11000000) == 0b10000000) { 
    end--; 
} 

if ((ref.bytes[end] & 0b11000000) == 0b11000000) { 
    end--; 
}

Yeah I see that it skips over tails of utf-8 multi-byte sequences up to 3 bytes. But why?

Thus, wouldn't it be so much easier if we immediately knew the reason behind this logic, instead of scratching our heads trying to decode it? That's why I believe refactoring it is so crucial - it makes it so much easier to explain why we are doing it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you don't like the refactoring that's okay. I think what's most crucial is that we just document why we are doing those magical things - with some examples as well.

String truncated = new String(bytes2, StandardCharsets.UTF_8);
ref = new BytesRef(truncated);
log.trace("Truncated property {} at path:[{}] to {}", prop, path, ref.utf8ToString());
while (ref.length > maxLength) {
Copy link
Contributor

@steffenvan steffenvan Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to document this "emergency" even more. When can it happen, what do we do to fix it, and why is it okay for us to fix it the way we are? From a quick glance at the code, it is not immediately clear why this works.

Specifically, one or more clear examples would make it very clear imo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the original comment noted it should never happen but Thomas added for unforseen cases rather than just throeing an error.

prop, path, ref.length, maxLength);
int end = maxLength - 1;
// skip over tails of utf-8 multi-byte sequences (up to 3 bytes)
while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add examples of these skipping and removals of multi-byte sequences? And also add explanations for why they are necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tests available in LuceneLargeStringPropertyTest.java. Is tht not enough, maybe I can add a bit of documentation.

if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
end--;
}
byte[] bytes2 = Arrays.copyOf(ref.bytes, end + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for clarity, I would use something like truncatedBytes instead of bytes2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes sense.

Address review suggestions
- Change name of the method to getTruncatedBytesRef
- Add javadocs for the new method and the tests
- Rename variable
@@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr
return fieldAdded;
}

/**
* Returns a {@code BytesRef} object constructed from the given {@code String} value and also truncates the length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could improve the comment here a little, as it essentially it says: "truncates the length to ensure that the multi-byte sequences are properly truncated". I understand that it does some truncation. But I'm still left with the question of why this is necessary, and in which scenarios? What are some examples of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok makes sense to add a clarifying comment that lucene limits the length and the length that is checked is the BytesRef.length

*
* <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}.
* The method first truncates continuation bytes, which start with {@code 10} in binary. It then truncates the head byte, which
* starts with {@code 11}. Both truncation operations use a binary mask of {@code 11100000}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do both truncation operations really use a binary mask of 11100000?

From what I can see, the first while loop:

while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
    end--;
}

is checking if the byte at ref.bytes[end] is of the form 10xxxxxx. It's doing that by masking the byte with 11000000 and checking if the result is 10000000. If it is, then the byte has the form 10xxxxxx.

And the following if statement:

if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
    end--;
}

checks in a similar manner if the byte-sequence is of the pattern 110xxxxx. But I don't see that any of the truncation operations use a binary mask of 11100000.

In any case, I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly. That way, writing a proper example wouldn't clutter the getTruncatedBytesRef function.

Copy link
Contributor Author

@amit-jain amit-jain Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right the comment essentially mentions the wrong mask.

I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly.

I am not in favor of refactoring it as the operation is atomic and really does one thing which is to get a bytesRef. But then if you feel strongly let's take the refactoring as a separate issue.

amjain added 2 commits September 8, 2023 14:51
Address review suggestions
- Add comment on BytesRef truncation
- Fix typo
@amit-jain amit-jain merged commit 1e55c01 into apache:trunk Sep 11, 2023
1 of 2 checks passed
@amit-jain amit-jain deleted the OAK-10384 branch September 11, 2023 04:47
mbaedke pushed a commit that referenced this pull request Sep 19, 2023
- Truncate BytesRef value and handle surrogates correctly (Code from Thomas Mueller)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants