Adds lazy reader support for reading clobs #638

zslayton · 2023-09-01T20:58:50Z

Adds LazyRawTextReader support for matching and reading clobs.

Fixes #634.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov · 2023-09-01T21:00:30Z

Codecov Report

Patch coverage is 93.61% of modified lines.

Files Changed	Coverage
src/lazy/raw_value_ref.rs	`ø`
src/lazy/str_ref.rs	`0.00%`
src/lazy/text/encoded_value.rs	`0.00%`
src/lazy/text/value.rs	`0.00%`
src/lazy/text/matched.rs	`93.62%`
src/lazy/text/buffer.rs	`99.00%`
src/lazy/binary/raw/value.rs	`100.00%`
src/lazy/text/raw/reader.rs	`100.00%`
src/lazy/value_ref.rs	`100.00%`

zslayton

🗺️ PR tour

zslayton · 2023-09-03T15:20:40Z

src/lazy/binary/raw/value.rs

@@ -413,7 +413,7 @@ impl<'data> LazyRawBinaryValue<'data> {
    fn read_clob(&self) -> ValueParseResult<'data, BinaryEncoding> {
        debug_assert!(self.encoded_value.ion_type() == IonType::Clob);
        let bytes = self.value_body()?;
-        Ok(RawValueRef::Clob(bytes))
+        Ok(RawValueRef::Clob(bytes.into()))


🗺️ Reading a clob now returns a BytesRef<'_> instead of a &[u8] to accommodate the escape decoding process that happens in text clobs. This change mirrors the one made for blobs in #629.
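A minimal sketch of why a borrowed-or-owned wrapper fits here, assuming a `Cow`-backed type. `BytesRefSketch` and its methods are hypothetical stand-ins, not the crate's actual `BytesRef` definition: binary clobs can borrow straight from the input buffer, while text clobs containing escapes need an owned, decoded buffer.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for the crate's `BytesRef`; the real type may differ.
#[derive(Debug, PartialEq)]
struct BytesRefSketch<'data>(Cow<'data, [u8]>);

// Binary clobs: the bytes in the input buffer are already the value.
impl<'data> From<&'data [u8]> for BytesRefSketch<'data> {
    fn from(bytes: &'data [u8]) -> Self {
        BytesRefSketch(Cow::Borrowed(bytes))
    }
}

// Text clobs with escapes: decoding produces a freshly allocated buffer.
impl<'data> From<Vec<u8>> for BytesRefSketch<'data> {
    fn from(bytes: Vec<u8>) -> Self {
        BytesRefSketch(Cow::Owned(bytes))
    }
}

impl<'data> BytesRefSketch<'data> {
    // Uniform read access regardless of where the bytes live.
    fn data(&self) -> &[u8] {
        &self.0
    }

    fn is_borrowed(&self) -> bool {
        matches!(self.0, Cow::Borrowed(_))
    }
}
```

With conversions like these, a call such as `bytes.into()` in the diff above stays allocation-free for the binary reader.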

zslayton · 2023-09-03T15:24:44Z

src/lazy/str_ref.rs

+            Cow::Owned(text) => Vec::from(text).into(),
+        }
+    }
+}


🗺️ This impl converts a String into its underlying Vec<u8> or a &str to its underlying &[u8].
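The conversion described above can be sketched with plain `Cow` values. The function name `text_into_bytes` is hypothetical and the crate's real impl is on its own types, so treat this as an illustration only. Neither arm copies the text: the borrowed arm reinterprets the `&str`, and the owned arm reuses the `String`'s allocation.

```rust
use std::borrow::Cow;

// Hypothetical helper, not the crate's actual impl.
fn text_into_bytes(text: Cow<'_, str>) -> Cow<'_, [u8]> {
    match text {
        // A &str is already a view over UTF-8 bytes.
        Cow::Borrowed(s) => Cow::Borrowed(s.as_bytes()),
        // `Vec::from(String)` hands over the String's buffer without copying.
        Cow::Owned(s) => Cow::Owned(Vec::from(s)),
    }
}
```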

zslayton · 2023-09-03T15:25:54Z

src/lazy/text/buffer.rs

@@ -1002,13 +1008,13 @@ impl<'data> TextBufferView<'data> {

    /// Returns a matched buffer and a boolean indicating whether any escaped characters were
    /// found in the short string.
-    fn match_short_string_body(self) -> IonParseResult<'data, (Self, bool)> {
+    pub(crate) fn match_short_string_body(self) -> IonParseResult<'data, (Self, bool)> {


🗺️ The clob reading logic re-uses the short- and long-form string matchers to isolate the content within the larger match.

zslayton · 2023-09-03T15:30:48Z

src/lazy/text/matched.rs

        let text = String::from_utf8(sanitized).unwrap();
        Ok(StrRef::from(text.to_string()))
    }
 }

-fn escape_text(matched_input: TextBufferView, sanitized: &mut Vec<u8>) -> IonResult<()> {
+fn decode_text_containing_escapes(


🗺️ I renamed this method to make it clearer which "direction" we were going. It accepts text with escapes and decodes them into bytes.

zslayton · 2023-09-03T15:34:09Z

src/lazy/text/matched.rs

    let mut remaining = matched_input;
+
+    // For ways to optimize this in the future, look at the `memchr` crate.
+    let match_byte = |byte: &u8| *byte == b'\\' || *byte == b'\r';


🗺️ The logic needed to normalize an unescaped \r differs from that needed to replace an escaped \r (or any other escape). We're looking for a raw byte value 0x0D that is not prefixed with a \.

zslayton · 2023-09-03T16:34:18Z

src/lazy/text/buffer.rs

+                // being allocated when it isn't strictly necessary.
+                contains_escaped_chars = true;
+                continue;
+            }


🗺️ In long-form clobs and long-form strings, we need to normalize unescaped \r and \r\n to \n. This throws the naming off a bit; contains_escapes should really be something like requires_substitutions. However, I think escapes is a more obvious/suggestive name. Open to input here; I left it as-is because a consistent rename across usages/modules would touch a lot of lines and I'd rather do it in another PR.
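A rough sketch of the normalization step, assuming a standalone helper. `normalize_newlines` is a hypothetical name and this is not the PR's code: it scans for the next `\` or `\r`, copies the ordinary bytes before it in bulk, and rewrites only unescaped `\r` and `\r\n`. Escape sequences are copied verbatim here, whereas a real decoder would interpret them.

```rust
// Hypothetical helper: replaces each unescaped `\r` or `\r\n` with `\n`.
// A backslash and the byte after it are copied through unchanged.
fn normalize_newlines(input: &[u8]) -> Vec<u8> {
    let match_byte = |byte: &u8| *byte == b'\\' || *byte == b'\r';
    let mut output = Vec::with_capacity(input.len());
    let mut remaining = input;
    while let Some(index) = remaining.iter().position(match_byte) {
        // Copy the run of ordinary bytes in one step.
        output.extend_from_slice(&remaining[..index]);
        let special = remaining[index];
        let after = &remaining[index + 1..];
        remaining = if special == b'\\' {
            // Escape: keep the backslash and the escaped byte as they are.
            output.push(b'\\');
            match after.split_first() {
                Some((&escaped, rest)) => {
                    output.push(escaped);
                    rest
                }
                None => after,
            }
        } else {
            // Unescaped `\r`: emit `\n` and swallow a directly following `\n`.
            output.push(b'\n');
            after.strip_prefix(b"\n").unwrap_or(after)
        };
    }
    // Whatever is left contains no special bytes.
    output.extend_from_slice(remaining);
    output
}
```

Because a substitution changes the output even when no backslash is present, the caller has to take the allocating path in that case, which is why the flag in the diff is set for unescaped `\r` as well.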

zslayton · 2023-09-03T16:36:48Z

src/lazy/text/matched.rs

+            // Normalize newlines
+            true,
+            // Support unicode escapes
+            true,


🗺️ I considered enums for these two bools to make them self-documenting, but as they're not part of the public API I decided to just comment the handful of places where this method is called.
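For comparison, the enum alternative mentioned above might look like the sketch below. All names here (`Newlines`, `UnicodeEscapes`, `decode_config`) are hypothetical, and the function only reports the flags it was given rather than decoding anything.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Newlines {
    Normalize,
    Preserve,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum UnicodeEscapes {
    Allow,
    Reject,
}

// Hypothetical stand-in for the private decoding fn. It maps the enums back to
// the two bools so the call sites can be compared side by side.
fn decode_config(newlines: Newlines, unicode: UnicodeEscapes) -> (bool, bool) {
    (
        newlines == Newlines::Normalize,
        unicode == UnicodeEscapes::Allow,
    )
}
```

A call then reads `decode_config(Newlines::Normalize, UnicodeEscapes::Allow)` with no comments needed, at the cost of two extra types for a private function.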

popematt · 2023-09-05T16:47:20Z

src/lazy/text/buffer.rs

+                // being allocated when it isn't strictly necessary.
+                contains_escaped_chars = true;
+                continue;
+            }


requires_substitutions_of_escaped_characters? (What a mouthful... maybe too long.)

popematt · 2023-09-05T17:25:18Z

src/lazy/text/matched.rs

    List,
    SExp,
    Struct,
-    // TODO: ...the other types


popematt · 2023-09-05T17:30:23Z

src/lazy/text/matched.rs

        let text = String::from_utf8(sanitized).unwrap();
        Ok(StrRef::from(text.to_string()))
    }
 }

-fn escape_text(matched_input: TextBufferView, sanitized: &mut Vec<u8>) -> IonResult<()> {
+fn decode_text_containing_escapes(


This name is still confusing for me because it's possible to "decode" text to bytes (e.g. base64) and to "decode" bytes to text (e.g. UTF-8). What about something like convert_escaped_text_to_bytes or decode_escaped_text_into_bytes?

popematt · 2023-09-05T17:47:37Z

src/lazy/text/matched.rs

+    Short,
+    Long,


Please add even a brief doc comment for these.

Also, is it worth having separate cases for with and without escapes? Or long with single vs multiple segments? (Did we already talk about this? I think we might have.)

There's a narrow case that benefits: single-segment clobs that only contain ASCII. Every other case requires a sanitization/decoding buffer anyway. I concluded that I'd wait to see if anyone actually uses clobs outside of ion-tests before worrying about optimizing it further.

popematt

Sorry, I meant to approve this because none of my latest comments are things that would block the PR.

zslayton · 2023-09-07T11:59:06Z

> Sorry, I meant to approve this because none of my latest comments are things that would block the PR.

Thanks, I've got another PR out that depends on this one (#639), so I'll go ahead and merge this and address the comments as part of #635.

Zack Slayton added 30 commits July 24, 2023 16:54

Top-level nulls, bools, ints

e0a83d8

Consolidate impls of AsUtf8 w/helper fn

89f79aa

Improved TextBufferView docs, removed DataSource

840be4d

Adds lazy text floats

5db1ff0

Adds LazyRawTextReader support for comments

07d4a70

Adds LazyRawTextReader support for reading strings

181e0a5

clippy fixes

357ca8f

Fix a couple of unit tests

716ff34

Less ambitious float eq comparison

e29fec5

Adds LazyRawTextReader support for reading symbols

8f79a36

Adds more doc comments

4cb9b2b

More doc comments

54470d2

Adds LazyRawTextReader support for reading lists

78014e7

Adds LazyRawTextReader support for structs

a6a3aa8

More doc comments

4fc9078

Adds LazyRawTextReader support for reading IVMs

11174ac

Initial impl of a LazyRawAnyReader

719dbaa

Improved comments.

f603872

Adds LazyRawTextReader support for annotations

4696ca5

Adds lazy reader support for timestamps

c7129ac

Lazy reader support for s-expressions

44435ea

Fixed doc comments

d50e05b

Fix internal doc link

8283422

Adds lazy reader support for decimals

0f01099

Fixed bad unit test example case

b60f1fe

clippy fixes

915c83a

Adds lazy reader support for blobs

fe922ff

Adds lazy reader support for long strings

066ddd8

Merged long string matcher tests into overall string tests

c58e5f0

wip

6b5ce1c

Zack Slayton added 2 commits September 1, 2023 16:55

Merge main, complete support for clobs

e45ec35

clippy suggestion

a3f8a21

Zack Slayton added 4 commits September 3, 2023 11:09

Adds lazy reader support for clobs

62be7c9

clippy suggestion

0eacd3a

Fix newline normalization, add unit tests

175009d

comment cleanup

3421393

zslayton commented Sep 3, 2023

zslayton marked this pull request as ready for review September 3, 2023 16:49

zslayton requested review from popematt and desaikd September 3, 2023 16:49

popematt reviewed Sep 5, 2023

popematt approved these changes Sep 7, 2023

Merge branch 'main' into lazy-clobs

45cbf40

zslayton merged commit 7583129 into main Sep 7, 2023
18 checks passed

zslayton deleted the lazy-clobs branch September 7, 2023 12:07

zslayton pushed a commit that referenced this pull request Sep 7, 2023

Feedback from PR #638

9852f2f

zslayton mentioned this pull request Sep 7, 2023

Incorporates pending feedback from lazy reader PRs #642

Merged

zslayton added a commit that referenced this pull request Sep 7, 2023

Incorporates pending feedback from lazy reader PRs (#642)

ec91888

Feedback from PRs: * #609 * #614 * #616 * #619 * #620 * #627 * #628 * #638 * #639
