Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds lazy reader support for reading clobs #638

Merged
merged 37 commits into from
Sep 7, 2023
Merged

Adds lazy reader support for reading clobs #638

merged 37 commits into from
Sep 7, 2023

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented Sep 1, 2023

Adds LazyRawTextReader support for matching and reading clobs.

Fixes #634.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov
Copy link

codecov bot commented Sep 1, 2023

Codecov Report

Patch coverage is 93.61% of modified lines.

Files Changed Coverage
src/lazy/raw_value_ref.rs ø
src/lazy/str_ref.rs 0.00%
src/lazy/text/encoded_value.rs 0.00%
src/lazy/text/value.rs 0.00%
src/lazy/text/matched.rs 93.62%
src/lazy/text/buffer.rs 99.00%
src/lazy/binary/raw/value.rs 100.00%
src/lazy/text/raw/reader.rs 100.00%
src/lazy/value_ref.rs 100.00%

📢 Thoughts on this report? Let us know!.

Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour

@@ -413,7 +413,7 @@ impl<'data> LazyRawBinaryValue<'data> {
fn read_clob(&self) -> ValueParseResult<'data, BinaryEncoding> {
debug_assert!(self.encoded_value.ion_type() == IonType::Clob);
let bytes = self.value_body()?;
Ok(RawValueRef::Clob(bytes))
Ok(RawValueRef::Clob(bytes.into()))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Reading a clob now returns a BytesRef<'_> instead of a &[u8] to accommodate the escape decoding process that happens in text clobs. This change mirrors the one made for blobs in #629.

Cow::Owned(text) => Vec::from(text).into(),
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This impl converts a String into its underlying Vec or a &str to its underlying &[u8].

@@ -1002,13 +1008,13 @@ impl<'data> TextBufferView<'data> {

/// Returns a matched buffer and a boolean indicating whether any escaped characters were
/// found in the short string.
fn match_short_string_body(self) -> IonParseResult<'data, (Self, bool)> {
pub(crate) fn match_short_string_body(self) -> IonParseResult<'data, (Self, bool)> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The clob reading logic re-uses the short- and long-form string matchers to isolate the content within the larger match.

let text = String::from_utf8(sanitized).unwrap();
Ok(StrRef::from(text.to_string()))
}
}

fn escape_text(matched_input: TextBufferView, sanitized: &mut Vec<u8>) -> IonResult<()> {
fn decode_text_containing_escapes(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I renamed this method to make it clearer which "direction" we were going. It accepts text with escapes and decodes them into bytes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is still confusing for me because it's possible to "decode" text to bytes (e.g. base64) and to "decode" bytes to text (e.g. UTF-8). What about something like convert_escaped_text_to_bytes or decode_escaped_text_into_bytes?

let mut remaining = matched_input;

// For ways to optimize this in the future, look at the `memchr` crate.
let match_byte = |byte: &u8| *byte == b'\\' || *byte == b'\r';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The logic needed to normalize an unescaped \r differs from that needed to replace an escaped \r (or any other escape). We're looking for a raw byte value 0x0A that is not prefixed with a \.

// being allocated when it isn't strictly necessary.
contains_escaped_chars = true;
continue;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ In long-form clobs and long-form strings, we need to normalize unescaped \r and \r\n to \n. This throws the naming off a bit; contains_escapes should really be something like requires_substitutions. However, I think escapes is a more obvious/suggestive name. Open to input here; I left it as-is because a consistent rename across usages/modules would touch a lot of lines and I'd rather do it in another PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires_substitutions_of_escaped_characters? (What a mouthful... maybe too long.)

// Normalize newlines
true,
// Support unicode escapes
true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I considered enums for these two bools to make them self-documenting, but as they're not part of the public API I decided to just comment the handful of places where this method is called.

@zslayton zslayton marked this pull request as ready for review September 3, 2023 16:49
// being allocated when it isn't strictly necessary.
contains_escaped_chars = true;
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires_substitutions_of_escaped_characters? (What a mouthful... maybe too long.)

List,
SExp,
Struct,
// TODO: ...the other types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

let text = String::from_utf8(sanitized).unwrap();
Ok(StrRef::from(text.to_string()))
}
}

fn escape_text(matched_input: TextBufferView, sanitized: &mut Vec<u8>) -> IonResult<()> {
fn decode_text_containing_escapes(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is still confusing for me because it's possible to "decode" text to bytes (e.g. base64) and to "decode" bytes to text (e.g. UTF-8). What about something like convert_escaped_text_to_bytes or decode_escaped_text_into_bytes?

Comment on lines +970 to +971
Short,
Long,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add even a brief doc comment for these.

Also, is it worth having separate cases for with and without escapes? Or long with single vs multiple segments? (Did we already talk about this? I think we might have.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a narrow case that benefits: single-segment clobs that only contain ASCII. Every other case requires a sanitization/decoding buffer anyway. I concluded that I'd wait to see if anyone actually uses clobs outside of ion-tests before worrying about optimizing it further.

Copy link
Contributor

@popematt popematt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant to approve this because none of my latest comments are things that would block the PR.

@zslayton
Copy link
Contributor Author

zslayton commented Sep 7, 2023

Sorry, I meant to approve this because none of my latest comments are things that would block the PR.

Thanks, I've got another PR out that depends on this one (#639), so I'll go ahead and merge this and address the comments as part of #635.

@zslayton zslayton merged commit 7583129 into main Sep 7, 2023
18 checks passed
@zslayton zslayton deleted the lazy-clobs branch September 7, 2023 12:07
zslayton pushed a commit that referenced this pull request Sep 7, 2023
zslayton added a commit that referenced this pull request Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement lazy reader support for clobs
2 participants