OAK-10384: Fix stripping of large indexed ordered properties #1071

amit-jain · 2023-08-16T05:30:54Z

Convert to BytesRef, then recheck the length and trim by the amount that exceeds the limit

thomasmueller

We need to have a better truncation algorithm, and better test cases.

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

thomasmueller · 2023-08-16T07:44:12Z

I would make the truncation method public and then write a good unit test for it.

private static final Logger log = LoggerFactory.getLogger(LuceneDocumentMaker.class);

@Test
public void test() {
    Random r = new Random(1);
    for (int i = 0; i < 1000; i++) {
        String x = randomUnicodeString(r, 5);
        BytesRef ref = checkTruncateLength("x", x, "/x", 5);
        assertTrue(ref.length > 0 && ref.length <= 5);
    }
}

private String randomUnicodeString(Random r, int len) {
    StringBuilder buff = new StringBuilder();
    for(int i=0; i<len; i++) {
        // see https://en.wikipedia.org/wiki/UTF-8
        switch (r.nextInt(6)) {
        case 2:
            // 2 UTF-8 bytes
            buff.append('£');
            break;
        case 3:
            // 3 UTF-8 bytes
            buff.append('€');
            break;
        case 4:
            // 4 UTF-8 bytes
            buff.append("\uD800\uDF48");
            break;
        default:
            // most cases:
            // 1 UTF-8 byte (ASCII)
            buff.append('$');
        }
    }
    return buff.toString();
}

public static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) {
    BytesRef ref = new BytesRef(value);
    if (ref.length <= maxLength) {
        return ref;
    }
    log.info("Truncating property {} at path:[{}] as length after encoding {} is > {} ", 
            prop, path, ref.length, maxLength);
    int end = maxLength - 1;
    // skip over tails of utf-8 multi-byte sequences (up to 3 bytes)
    while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
        end--;
    }
    // remove one head of a utf-8 multi-byte sequence (at most 1)
    if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
        end--;
    }
    byte[] bytes2 = Arrays.copyOf(ref.bytes, end + 1);
    String truncated = new String(bytes2, StandardCharsets.UTF_8);
    ref = new BytesRef(truncated);
    while (ref.length > maxLength) {
        log.error("Truncation did not work: still {} bytes", ref.length);
        // this may not properly work with unicode surrogates:
        // it is an "emergency" procedure and should never happen
        truncated = truncated.substring(0, truncated.length() - 10);
        ref = new BytesRef(truncated);
    }
    return ref;
}

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

steffenvan · 2023-08-16T08:09:05Z

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

+
+    private static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) {
+        String truncated = value;
+        if (value.length() > maxLength) {


I think this could lead to a null pointer exception. From what I can see, we aren't guaranteed that it is not null.
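A minimal sketch of the null concern, using only the JDK rather than Lucene's BytesRef. The class and method names (`NullSafeLengthSketch`, `utf8Length`) are hypothetical, and treating a null value as zero bytes is an assumption made for illustration; the actual change may prefer to skip the property or reject null earlier.

```java
import java.nio.charset.StandardCharsets;

final class NullSafeLengthSketch {

    // Hypothetical helper: returns the UTF-8 encoded length of value,
    // treating null as an empty value instead of throwing.
    static int utf8Length(String value) {
        if (value == null) {
            return 0;
        }
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(null));
        System.out.println(utf8Length("abc"));
    }
}
```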

- Handling surrogates correctly using byte level manipulations (Code from Thomas Mueller)

- Add valid string check after truncation

steffenvan · 2023-09-07T08:33:12Z

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

@@ -315,6 +313,38 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr
        }
        return fieldAdded;
    }
+
+    protected static BytesRef checkTruncateLength(String prop, String value, String path, int maxLength) {


Could we change the name of the function to truncateToMaxLength, getTruncatedBytesRef, or something like that? IMO checkTruncateLength suggests that the function might just be verifying something rather than actually performing a truncation.

checkTruncateLength is meant to be checkAndTruncate; I removed the And for brevity. So maybe checkAndTruncateLength, if that is clearer.

Perhaps it would then be even clearer to split the function in two: one that truncates and the other checks?

I don't think that splitting will make it clearer, though. The check is intertwined with the truncation, because the method also has to check that the higher code points are not split.

I think the function does quite a lot of things - and I generally like our code to follow the SRP. Some minor refactoring can lead to this, which IMO is a bit clearer. It is more code, but I think it's clearer - especially when we fix the TODOs for new people to read it.

// TODO: explain why we use 10
private static final int EMERGENCY_TRUNCATION_LENGTH = 10;

...

// assuming we rename checkTruncateLength() -> truncateToMaxLength()
protected static BytesRef truncateToMaxLength(String prop, String value, String path, int maxLength) {
    log.trace("Property {} at path:[{}] has value {}", prop, path, value);
    BytesRef ref = new BytesRef(value);
    if (ref.length <= maxLength) {
        return ref;
    }
    ref = truncateBytesRef(ref, maxLength);
    if (ref.length > maxLength) {
        ref = emergencyTruncate(ref, maxLength);
    }
    return ref;
}

private static BytesRef truncateBytesRef(BytesRef ref, int maxLength) {
    log.info("Truncating as length after encoding {} is > {} ", ref.length, maxLength);
    int end = calculateEndPosition(ref, maxLength);
    byte[] truncatedBytes = Arrays.copyOf(ref.bytes, end + 1);
    BytesRef truncatedRef = new BytesRef(new String(truncatedBytes, StandardCharsets.UTF_8));
    log.trace("Truncated to {}", truncatedRef.utf8ToString());
    return truncatedRef;
}

// TODO: explain this more in detail, with examples and why it's necessary.
private static int calculateEndPosition(BytesRef ref, int maxLength) {
    int end = maxLength - 1;
    while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
        end--;
    }
    if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
        end--;
    }
    return end;
}

// TODO: explain why this is necessary and with examples.
private static BytesRef emergencyTruncate(BytesRef ref, int maxLength) {
    log.error("Truncation did not work: still {} bytes", ref.length);
    String truncated = ref.utf8ToString();
    while (ref.length > maxLength) {
        truncated = truncated.substring(0, truncated.length() - EMERGENCY_TRUNCATION_LENGTH);
        ref = new BytesRef(truncated);
    }
    return ref;
}

Hmm...I think this takes the refactoring too far and in fact makes readability harder. Generally if there was a need for reuse for some of the sub methods it would have made sense. Explanation of the code can happen through documentation as well.

You know, I see where you're coming from about the code length. But I believe that while it might appear longer, it doesn't reduce readability. In fact, breaking the code down helps us describe and understand what each segment is doing, and more crucially, makes it easier to explain why it's doing that.

Imagine we come across a pesky bug half a year from now and then stumble upon this piece of code:

while ((ref.bytes[end] & 0b11000000) == 0b10000000) {
    end--;
}
if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
    end--;
}

Yeah I see that it skips over tails of utf-8 multi-byte sequences up to 3 bytes. But why?

Thus, wouldn't it be so much easier if we immediately knew the reason behind this logic, instead of scratching our heads trying to decode it? That's why I believe refactoring it is so crucial - it makes it so much easier to explain why we are doing it.

But if you don't like the refactoring that's okay. I think what's most crucial is that we just document why we are doing those magical things - with some examples as well.

steffenvan · 2023-09-07T09:37:49Z

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

+        String truncated = new String(bytes2, StandardCharsets.UTF_8);
+        ref = new BytesRef(truncated);
+        log.trace("Truncated property {} at path:[{}] to {}", prop, path, ref.utf8ToString());
+        while (ref.length > maxLength) {


I think it's important to document this "emergency" even more. When can it happen, what do we do to fix it, and why is it okay for us to fix it the way we are? From a quick glance at the code, it is not immediately clear why this works.

Specifically, one or more clear examples would make it very clear imo.

As the original comment noted, it should never happen, but Thomas added it for unforeseen cases rather than just throwing an error.
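To illustrate one way a truncated value could still exceed the limit after the round trip through String, here is a small JDK-only sketch. It deliberately cuts the bytes without looking at character boundaries, which the code under review does not do; whether this is the scenario the emergency loop was written for is an assumption. The names `NaiveCutSketch` and `naiveCut` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NaiveCutSketch {

    // Cuts the UTF-8 bytes at maxLength without checking character
    // boundaries, then decodes and re-encodes the result.
    static byte[] naiveCut(String value, int maxLength) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxLength) {
            return bytes;
        }
        byte[] cut = Arrays.copyOf(bytes, maxLength);
        // A dangling lead byte is malformed input; the JDK decoder
        // substitutes U+FFFD, which needs more than one byte in UTF-8.
        String decoded = new String(cut, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" followed by the euro sign: 1 + 3 bytes in UTF-8.
        byte[] result = naiveCut("a\u20AC", 2);
        System.out.println("requested at most 2 bytes, got " + result.length);
    }
}
```

In this sketch the re-encoded value is longer than the requested limit, which is why a final length check after re-encoding is a reasonable safety net even if the boundary logic is expected to prevent it.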

steffenvan · 2023-09-07T09:44:03Z

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

+            prop, path, ref.length, maxLength);
+        int end = maxLength - 1;
+        // skip over tails of utf-8 multi-byte sequences (up to 3 bytes)
+        while ((ref.bytes[end] & 0b11000000) == 0b10000000) {


Could we add examples of these skipping and removals of multi-byte sequences? And also add explanations for why they are necessary?

There are tests available in LuceneLargeStringPropertyTest.java. Is that not enough? Maybe I can add a bit of documentation.
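As a possible starting point for such examples, the sketch below applies the same two masking steps to a plain byte[] instead of a BytesRef, so it runs with the JDK alone. The class and method names are hypothetical, and it assumes maxLength is at least 1 and the input is valid UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8BoundarySketch {

    // Same boundary logic as the loop under review, applied to a byte[].
    static byte[] truncate(String value, int maxLength) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxLength) {
            return bytes;
        }
        int end = maxLength - 1;
        // Step back over continuation bytes (10xxxxxx).
        while ((bytes[end] & 0b11000000) == 0b10000000) {
            end--;
        }
        // Drop the lead byte (11xxxxxx) of the sequence we landed in.
        if ((bytes[end] & 0b11000000) == 0b11000000) {
            end--;
        }
        return Arrays.copyOf(bytes, end + 1);
    }

    static String truncateToString(String value, int maxLength) {
        return new String(truncate(value, maxLength), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '$' is 1 byte, pound sign (U+00A3) 2 bytes, euro sign (U+20AC) 3 bytes.
        System.out.println(truncateToString("$\u20AC", 2));
        System.out.println(truncateToString("\u00A3\u20AC", 4));
        System.out.println(truncateToString("\u00A3\u00A3\u00A3", 4));
    }
}
```

Note that the last call keeps only one pound sign although two would fit in 4 bytes: when the limit falls exactly on the last byte of a multi-byte character, this logic drops that character as well. The result never exceeds the limit, so the behaviour is conservative rather than wrong.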

fabriziofortino · 2023-09-07T10:05:33Z

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

+        if ((ref.bytes[end] & 0b11000000) == 0b11000000) {
+            end--;
+        }
+        byte[] bytes2 = Arrays.copyOf(ref.bytes, end + 1);


for clarity, I would use something like truncatedBytes instead of bytes2

Yes that makes sense.

Address review suggestions - Change name of the method to getTruncatedBytesRef - Add javadocs for the new method and the tests - Rename variable

steffenvan · 2023-09-08T08:39:02Z

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

@@ -316,6 +314,58 @@ protected boolean indexTypeOrderedFields(Document doc, String pname, int tag, Pr
        return fieldAdded;
    }

+    /**
+     * Returns a {@code BytesRef} object constructed from the given {@code String} value and also truncates the length


I think we could improve the comment here a little, as it essentially says: "truncates the length to ensure that the multi-byte sequences are properly truncated". I understand that it does some truncation. But I'm still left with the question of why this is necessary, and in which scenarios? What are some examples of this?

OK, it makes sense to add a clarifying comment that Lucene limits the length, and that the length being checked is BytesRef.length.
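The difference between the two lengths can be shown with the JDK alone. The sketch below compares String.length(), which counts UTF-16 code units, with the number of UTF-8 bytes, which is the quantity the review compares against the limit via BytesRef.length. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CharVsByteLengthSketch {

    // Number of bytes the value occupies once encoded as UTF-8.
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // '$', pound sign, euro sign, and U+10348 (a surrogate pair in Java).
        String[] samples = {"$", "\u00A3", "\u20AC", "\uD800\uDF48"};
        for (String s : samples) {
            System.out.println(s.length() + " char(s), " + utf8Length(s) + " UTF-8 byte(s)");
        }
    }
}
```

A check such as value.length() > maxLength can therefore pass for a string whose encoded form is up to three times longer than its char count.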

steffenvan · 2023-09-08T08:51:06Z

...lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneDocumentMaker.java

+     *
+     * <p>Multi-byte sequences will be of the form {@code 11xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}.
+     * The method first truncates continuation bytes, which start with {@code 10} in binary. It then truncates the head byte, which
+     * starts with {@code 11}. Both truncation operations use a binary mask of {@code 11100000}.


Do both truncation operations really use a binary mask of 11100000?

From what I can see, the first while loop:

while ((ref.bytes[end] & 0b11000000) == 0b10000000) { end--; }

is checking if the byte at ref.bytes[end] is of the form 10xxxxxx. It's doing that by masking the byte with 11000000 and checking if the result is 10000000. If it is, then the byte has the form 10xxxxxx.

And the following if statement:

if ((ref.bytes[end] & 0b11000000) == 0b11000000) { end--; }

checks in a similar manner whether the byte has the form 11xxxxxx, i.e. whether it is the head of a multi-byte sequence. But I don't see that any of the truncation operations use a binary mask of 11100000.

In any case, I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly. That way, writing a proper example wouldn't clutter the getTruncatedBytesRef function.

You are right, the comment mentions the wrong mask.
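For reference, a small sketch of what the two masks distinguish. Masking with 11000000 separates ASCII, continuation, and lead bytes; a mask of 11100000 would only be needed to single out the lead byte of a 2-byte sequence (110xxxxx). The class and method names are hypothetical and not part of the patch.

```java
final class Utf8MaskSketch {

    // Classifies a single UTF-8 byte using the 11000000 mask from the patch.
    static String classify(byte b) {
        if ((b & 0b10000000) == 0) {
            return "ascii";            // 0xxxxxxx
        }
        if ((b & 0b11000000) == 0b10000000) {
            return "continuation";     // 10xxxxxx
        }
        return "lead";                 // 11xxxxxx
    }

    // Stricter test with the 11100000 mask: true only for 110xxxxx.
    static boolean isTwoByteLead(byte b) {
        return (b & 0b11100000) == 0b11000000;
    }

    public static void main(String[] args) {
        byte[] samples = {(byte) 0x24, (byte) 0xC2, (byte) 0xA3, (byte) 0xE2, (byte) 0xF0};
        for (byte b : samples) {
            System.out.println(String.format("%02X", b & 0xFF) + " -> " + classify(b)
                    + ", two-byte lead: " + isTwoByteLead(b));
        }
    }
}
```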

I think this operation is complex enough that I'd be in strong favour of refactoring it into its own function and documented accordingly.

I am not in favor of refactoring it, as the operation is atomic and really does one thing, which is to get a BytesRef. But if you feel strongly, let's take the refactoring as a separate issue.

Address review suggestions - Add comment on BytesRef truncation - Fix typo

- Clarify test comment

- Truncate BytesRef value and handle surrogates correctly (Code from Thomas Mueller)

OAK-10384: Fix stripping of large indexed ordered properties

56271c3

- Convert to BytesRef and then recheck and trim by the amount exceeding

amit-jain requested review from tihom88 and thomasmueller August 16, 2023 05:30

thomasmueller requested changes Aug 16, 2023

thomasmueller requested changes Aug 16, 2023

steffenvan reviewed Aug 16, 2023

amjain added 2 commits September 1, 2023 12:36

OAK-10384: Fix stripping of large indexed ordered properties

677cc2b

- Handling surrogates correctly using byte level manipulations (Code from Thomas Mueller)

OAK-10384: Fix stripping of large indexed ordered properties

7323fd2

- Add valid string check after truncation

amit-jain requested review from thomasmueller and steffenvan September 6, 2023 07:50

thomasmueller approved these changes Sep 7, 2023

steffenvan reviewed Sep 7, 2023

fabriziofortino approved these changes Sep 7, 2023

OAK-10384: Fix stripping of large indexed ordered properties

8896550

Address review suggestions - Change name of the method to getTruncatedBytesRef - Add javadocs for the new method and the tests - Rename variable

steffenvan reviewed Sep 8, 2023

amjain added 2 commits September 8, 2023 14:51

OAK-10384: Fix stripping of large indexed ordered properties

cd215ab

Address review suggestions - Add comment on BytesRef truncation - Fix typo

OAK-10384: Fix stripping of large indexed ordered properties

c6d58a8

- Clarify test comment

amit-jain merged commit 1e55c01 into apache:trunk Sep 11, 2023
1 of 2 checks passed

amit-jain deleted the OAK-10384 branch September 11, 2023 04:47

mbaedke pushed a commit that referenced this pull request Sep 19, 2023

OAK-10384: Fix stripping of large indexed ordered properties (#1071)

159c99d

- Truncate BytesRef value and handle surrogates correctly (Code from Thomas Mueller)
