Adds LazyRawTextReader support for reading strings #614

zslayton · 2023-07-28T21:39:40Z

Builds on #609, #612, and #613.

Adds support for reading short-form strings using the LazyRawTextReader, including
all escape sequences. Support for long-form strings will be added later.

Because text strings with escapes need to be modified before being returned to the user,
this PR also introduces a StrRef type that wraps a Cow<'a, str>, allowing it to provide
either a slice directly from input (when no escapes are present) or a new String (when
some bytes had to be replaced).
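To illustrate the borrowed-versus-owned idea described above, here is a minimal sketch. It is not the PR's implementation: `decode_escapes` is a hypothetical helper, and it handles only four escapes, whereas the PR covers all of them.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decodes a small subset of escapes (`\n`, `\t`, `\\`, `\"`)
/// from the body of a short-form string.
fn decode_escapes(body: &str) -> Cow<'_, str> {
    // Fast path: without a backslash, the input bytes are already the final text,
    // so we can hand back a slice of the input and allocate nothing.
    if !body.contains('\\') {
        return Cow::Borrowed(body);
    }
    let mut decoded = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            decoded.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => decoded.push('\n'),
            Some('t') => decoded.push('\t'),
            Some('\\') => decoded.push('\\'),
            Some('"') => decoded.push('"'),
            // Assumption for this sketch only: unknown escapes pass through unchanged.
            // A real reader would report an error here.
            Some(other) => {
                decoded.push('\\');
                decoded.push(other);
            }
            None => decoded.push('\\'),
        }
    }
    Cow::Owned(decoded)
}
```

Calling `decode_escapes("hello")` yields a `Cow::Borrowed`, while an input containing a backslash followed by `n` yields a `Cow::Owned` holding a real newline.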

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov · 2023-07-29T00:23:32Z

Codecov Report

Patch coverage: 78.74% and project coverage change: -0.03% ⚠️

Comparison is base (b00fb2f) 81.66% compared to head (3de64e4) 81.64%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #614      +/-   ##
==========================================
- Coverage   81.66%   81.64%   -0.03%     
==========================================
  Files         118      119       +1     
  Lines       21202    21547     +345     
  Branches    21202    21547     +345     
==========================================
+ Hits        17315    17591     +276     
- Misses       2231     2312      +81     
+ Partials     1656     1644      -12

| Files Changed | Coverage Δ | |
| --- | --- | --- |
| src/lazy/struct.rs | `66.66% <ø> (ø)` | |
| src/lazy/text/as_utf8.rs | `31.25% <0.00%> (-7.22%)` | ⬇️ |
| src/lazy/text/encoded_value.rs | `66.01% <0.00%> (-7.51%)` | ⬇️ |
| src/lazy/text/value.rs | `33.33% <0.00%> (-8.05%)` | ⬇️ |
| src/lazy/text/parse_result.rs | `26.56% <9.09%> (-3.60%)` | ⬇️ |
| src/lazy/system_reader.rs | `77.15% <50.00%> (ø)` | |
| src/lazy/raw_value_ref.rs | `68.78% <53.84%> (-3.09%)` | ⬇️ |
| src/lazy/str_ref.rs | `66.66% <66.66%> (ø)` | |
| src/lazy/text/matched.rs | `69.33% <73.50%> (+10.32%)` | ⬆️ |
| src/lazy/value_ref.rs | `74.07% <75.00%> (+0.27%)` | ⬆️ |

... and 3 more

popematt · 2023-08-21T22:26:51Z

src/lazy/raw_value_ref.rs

+            // We cannot compare lazy containers as we cannot guarantee that their complete contents
+            // are available in the buffer. Is `{foo: bar}` equal to `{foo: b`?


What is the motivation for implementing PartialEq?

If there was some way of having 3-value logic here (true, false, unknown) that would make me a little more comfortable. I don't like how this would say that {foo: bar} and {foo: bar} are not equal.

> What is the motivation for implementing PartialEq?

It allows us to test for equality using assert_eq! in unit tests.

> If there was some way of having 3-value logic here (true, false, unknown) that would make me a little more comfortable. I don't like how this would say that {foo: bar} and {foo: bar} are not equal.

This implementation is leaning on the Partial part of PartialEq. Just as there are values in f64 that cannot be compared and always return false (i.e. NaN), there are values in RawValueRef that cannot be compared and always return false (i.e. container types).

Let me know if that makes it feel less yucky.

Sounds good.
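The behaviour discussed in this thread can be sketched as follows. `SketchValueRef` is a hypothetical, pared-down stand-in for `RawValueRef`; its variants and field are assumptions made for illustration only.

```rust
/// Hypothetical stand-in for `RawValueRef`, reduced to three variants.
#[derive(Debug)]
enum SketchValueRef<'a> {
    Int(i64),
    String(&'a str),
    // Assumption: a lazy container only knows where it starts in the buffer,
    // so its complete contents may not be available for comparison.
    List { start_offset: usize },
}

impl<'a> PartialEq for SketchValueRef<'a> {
    fn eq(&self, other: &Self) -> bool {
        use SketchValueRef::*;
        match (self, other) {
            (Int(a), Int(b)) => a == b,
            (String(a), String(b)) => a == b,
            // Containers never compare equal, not even to themselves,
            // in the same way that f64::NAN != f64::NAN.
            _ => false,
        }
    }
}
```

Because the type implements only `PartialEq` and not `Eq`, it makes no promise that every value equals itself, which is what lets `assert_eq!` work on scalars in unit tests while containers always compare as unequal.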

popematt · 2023-08-21T22:34:50Z

src/lazy/str_ref.rs

+    }
+
+    pub fn text(&self) -> &str {
+        self.as_ref()


I don't understand how this is correct. I don't see any impl AsRef for StrRef, so I don't know where self.as_ref() is coming from.

Shouldn't it be this instead?

Suggested change

-        self.as_ref()
+        self.text.as_ref()

StrRef implements Deref to &str the same way that Cow<'_, str> does, so it's transparently doing what you wrote.

Oh, right. I forgot about that... and that's why it's easy to misuse Deref. I think impl Deref for StrRef is a good use of it though.

That being said, I would gently recommend that the functions in the inherent impl not depend on the Deref implementation. It's a little easier to see what's going on, IMO. (My mental model, at least, is that you have the struct, then the inherent impls build on the struct, and the trait impls build on the struct and its inherent impls. YMMV.)

Sure, I'm on board with that.
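A sketch of the layering agreed on here, assuming a simplified `StrRef` with a single `text` field (the field name comes from the suggestion above; the `From` impl is added only so the example can be constructed):

```rust
use std::borrow::Cow;
use std::ops::Deref;

/// Simplified stand-in for the PR's `StrRef`.
#[derive(Clone, PartialEq, Debug)]
pub struct StrRef<'data> {
    text: Cow<'data, str>,
}

impl<'data> StrRef<'data> {
    pub fn text(&self) -> &str {
        // Reads the field directly, so this method keeps working
        // even if the `Deref` impl below were removed.
        self.text.as_ref()
    }
}

impl<'data> Deref for StrRef<'data> {
    type Target = str;

    fn deref(&self) -> &Self::Target {
        // The trait impl builds on the inherent method, not the other way around.
        self.text()
    }
}

// Assumption: a convenience constructor for this example only.
impl<'data> From<&'data str> for StrRef<'data> {
    fn from(text: &'data str) -> Self {
        StrRef { text: Cow::Borrowed(text) }
    }
}
```

With this in place, `StrRef::from("hello").len()` resolves through `Deref` to `str::len`, while `text()` depends only on the struct itself.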

popematt · 2023-08-21T22:37:01Z

src/lazy/text/buffer.rs

+            // Missing a trailing quote
+            r#"
+            "hello
+            "#,


Does the test expect a complete match, or just something at the beginning to match? Does it make sense to add a test like this?

Suggested change

-            "#,
+            "#,
+            // Unescaped quote
+            r#"
+            "hello"world"
+            "#,

The mismatch cases only look for a failure to match the entire string. We can add that test and it will fail (as it should) because it won't match the whole thing.
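The whole-input rule described in this reply can be sketched like this. Both function names are hypothetical, and the matcher is a deliberately simple stand-in for the PR's parser.

```rust
/// Hypothetical matcher: returns the byte length of a short-form string at the
/// start of `input`, or `None` if the input does not begin with a complete string.
fn match_short_string(input: &str) -> Option<usize> {
    let mut chars = input.char_indices();
    if chars.next()?.1 != '"' {
        return None;
    }
    while let Some((index, c)) = chars.next() {
        match c {
            // Skip whatever follows a backslash so that `\"` does not end the string.
            '\\' => {
                chars.next()?;
            }
            // The closing quote is one byte long, so the match ends at index + 1.
            '"' => return Some(index + 1),
            // Assumption for this sketch: an unescaped newline is a mismatch.
            '\n' => return None,
            _ => {}
        }
    }
    // Ran out of input before finding a closing quote.
    None
}

/// A test input counts as a match only if the matcher consumes all of it.
fn matches_entire_input(input: &str) -> bool {
    match_short_string(input) == Some(input.len())
}
```

For `"hello"world"`, the matcher stops after the second quote, so it consumes only part of the input and the case is treated as a mismatch, as it should be.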

popematt · 2023-08-21T22:40:50Z

src/lazy/str_ref.rs

+use std::ops::Deref;
+
+#[derive(Clone, PartialEq, Debug)]
+pub struct StrRef<'data> {


Can you add a doc comment here?

popematt · 2023-08-21T22:42:42Z

src/lazy/text/matched.rs

+
+#[derive(Clone, Copy, Debug, PartialEq)]
+pub(crate) struct MatchedShortString {
+    contains_escaped_chars: bool,


I think I mentioned it already (offline, somewhere else?) that we should just have two separate enum variants. I don't know if you're planning on introducing that into this PR though.

I did end up making these into enum variants in PR #619.
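As a rough illustration of replacing the boolean field with variants, here is a sketch. The enum and variant names are hypothetical and are not taken from PR #619.

```rust
/// Hypothetical replacement for a struct carrying `contains_escaped_chars: bool`.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MatchedString {
    /// The body can be handed back as a slice of the input.
    ShortWithoutEscapes,
    /// The body must be decoded into a new `String` first.
    ShortWithEscapes,
}

impl MatchedString {
    /// Classifies the body of an already-matched short string.
    fn classify(body: &str) -> Self {
        if body.contains('\\') {
            MatchedString::ShortWithEscapes
        } else {
            MatchedString::ShortWithoutEscapes
        }
    }

    /// An exhaustive `match` forces every reader of the value to handle both cases.
    fn needs_decoding(self) -> bool {
        match self {
            MatchedString::ShortWithoutEscapes => false,
            MatchedString::ShortWithEscapes => true,
        }
    }
}
```

One advantage of the variant form is that adding a third kind of string later causes a compile error at every `match` that has not been updated, which a boolean flag cannot do.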

popematt · 2023-08-21T22:45:15Z

src/lazy/text/matched.rs

+/// This helper function detects high surrogates (which are only used in utf-16) so the parser
+/// can know to require a second one immediately following.


I didn't think we needed to support UTF-16. Why is this necessary?

The specification mentions surrogate pair handling in a "MAY"-style statement.

> Ion does not specify the behavior of specifying invalid Unicode code points or surrogate code points (used only for UTF-16) using the escape sequences. It is highly recommended that Ion implementations reject such escape sequences as they are not proper Unicode as specified by the standard. To this point, consider the Ion string sequence, "\uD800\uDC00". A compliant parser may throw an exception because surrogate characters are specified outside of the context of UTF-16, accept the string as a technically invalid sequence of two Unicode code points (i.e. U+D800 and U+DC00), or interpret it as the single Unicode code point U+00010000. In this regard, the Ion string data type does not conform to the Unicode specification. A strict Unicode implementation of the Ion text should not accept such sequences.

So you're right that they're not required (and even actively discouraged). However, ion-java supports them and ion-tests has tests for them. The existing ion-rust text reader also supports them. I'm not in a rush to implement this and could be convinced to formalize it as an error case instead.
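For reference, the third option in the quoted passage (interpreting `\uD800\uDC00` as U+10000) follows the standard UTF-16 surrogate arithmetic. The sketch below uses hypothetical function names and is not the helper from `matched.rs`.

```rust
/// True for code units in the high (leading) surrogate range.
fn is_high_surrogate(code_unit: u32) -> bool {
    (0xD800..=0xDBFF).contains(&code_unit)
}

/// True for code units in the low (trailing) surrogate range.
fn is_low_surrogate(code_unit: u32) -> bool {
    (0xDC00..=0xDFFF).contains(&code_unit)
}

/// Combines a surrogate pair into one scalar value, or returns `None` if the two
/// code units are not a high surrogate followed by a low surrogate.
fn combine_surrogates(high: u32, low: u32) -> Option<char> {
    if !is_high_surrogate(high) || !is_low_surrogate(low) {
        return None;
    }
    // Each surrogate contributes 10 bits; the result is offset by 0x10000.
    let code_point = 0x1_0000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    char::from_u32(code_point)
}
```

Note that `char::from_u32(0xD800)` returns `None`, since a lone surrogate is not a Unicode scalar value. That is why a parser that sees a high surrogate escape has to look for a second escape immediately after it before it can produce a `char`.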

popematt · 2023-08-21T23:00:01Z

src/lazy/text/parse_result.rs

+        message.push_str("; buffer: ");
+        let input = invalid_input_error.input;
+        let buffer_text = if let Ok(text) = invalid_input_error.input.as_text() {
+            // TODO: This really should be graphemes instead of chars()


If we're being pedantic... it should be grapheme clusters, right? 😉

That being said, I don't think this really needs to be a "todo" item. It's unclear how valuable this is given that this message is already not going to contain the full text because of the truncation. From what I can tell, the ability to manipulate graphemes was removed from the standard library because of the size of the Unicode tables and because it was not good for the standard library to be coupled to a specific Unicode version.

> If we're being pedantic... it should be grapheme clusters, right? 😉

Yep! 😄

> That being said, I don't think this really needs to be a "todo" item [...]

I was concerned the code might do something weird if the 32-byte limit fell on a boundary between Unicode scalars. However, on closer inspection of the char documentation, I see that:

> USVs are also the exact set of values that may be encoded in UTF-8. Because char values are USVs and str values are valid UTF-8, it is safe to store any char in a str or read any character from a str as a char.

so it should always produce a legal (if misleading) string if it's truncated.
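A small sketch of why truncation stays legal. Both helper names are hypothetical and this is not the code in `parse_result.rs`; it only shows two ways to cut a `&str` without producing invalid UTF-8.

```rust
/// Returns at most `max_chars` characters from the start of `text`. Iterating with
/// `chars()` can only split between scalar values, so the result is valid UTF-8.
fn truncate_for_display(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// A byte-oriented alternative: backs up from `max_bytes` until the cut lands on
/// a char boundary, then slices. Index 0 is always a boundary, so the loop ends.
fn truncate_to_byte_limit(text: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(text.len());
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}
```

Either approach can still separate a base character from a following combining mark: truncating `"e\u{301}x"` to one `char` leaves a bare `e`. The output is legal but can be misleading, which is the grapheme-cluster caveat raised above.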

Zack Slayton added 8 commits July 24, 2023 16:54

Top-level nulls, bools, ints

e0a83d8

Consolidate impls of AsUtf8 w/helper fn

89f79aa

Improved TextBufferView docs, removed DataSource

840be4d

Adds lazy text floats

5db1ff0

Adds LazyRawTextReader support for comments

07d4a70

Adds LazyRawTextReader support for reading strings

181e0a5

clippy fixes

357ca8f

Fix a couple of unit tests

716ff34

Less ambitious float eq comparison

e29fec5

This was referenced Aug 1, 2023

Adds LazyRawTextReader support for reading symbols #616

Merged

Adds LazyRawTextReader support for reading lists #617

Merged

zslayton marked this pull request as ready for review August 3, 2023 16:55

zslayton requested review from desaikd and jobarr-amzn August 3, 2023 16:56

This was referenced Aug 18, 2023

Lazy reader support for s-expressions #627

Merged

Adds lazy reader support for decimals #628

Merged

Adds lazy reader support for blobs #629

Merged

Adds lazy reader support for long strings #630

Merged

popematt reviewed Aug 21, 2023

Base automatically changed from lazy-comments to main August 22, 2023 08:29

popematt approved these changes Aug 22, 2023

Merge remote-tracking branch 'origin/main' into lazy-strings

3de64e4

zslayton merged commit 6d22b6f into main Aug 23, 2023
17 of 18 checks passed

zslayton deleted the lazy-strings branch August 23, 2023 00:01

zslayton self-assigned this Aug 29, 2023

zslayton mentioned this pull request Sep 7, 2023

Incorporates pending feedback from lazy reader PRs #642

Merged

zslayton added a commit that referenced this pull request Sep 7, 2023

Incorporates pending feedback from lazy reader PRs (#642)

ec91888

Feedback from PRs: * #609 * #614 * #616 * #619 * #620 * #627 * #628 * #638 * #639
