Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds LazyRawTextReader support for reading strings #614

Merged
merged 10 commits into from
Aug 23, 2023
Merged

Conversation

zslayton
Copy link
Contributor

Builds on #609, #612, and #613.

Adds support for reading short-form strings using the LazyRawTextReader, including
all escape sequences. Support for long-form strings will be added later.

Because text strings with escapes need to be modified before being returned to the user,
this PR also introduces a StrRef type that wraps a Cow<'a, str>, allowing it to provide
either a slice directly from input (when no escapes are present) or a new String (when
some bytes had to be replaced).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov
Copy link

codecov bot commented Jul 29, 2023

Codecov Report

Patch coverage: 78.74% and project coverage change: -0.03% ⚠️

Comparison is base (b00fb2f) 81.66% compared to head (3de64e4) 81.64%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #614      +/-   ##
==========================================
- Coverage   81.66%   81.64%   -0.03%     
==========================================
  Files         118      119       +1     
  Lines       21202    21547     +345     
  Branches    21202    21547     +345     
==========================================
+ Hits        17315    17591     +276     
- Misses       2231     2312      +81     
+ Partials     1656     1644      -12     
Files Changed Coverage Δ
src/lazy/struct.rs 66.66% <ø> (ø)
src/lazy/text/as_utf8.rs 31.25% <0.00%> (-7.22%) ⬇️
src/lazy/text/encoded_value.rs 66.01% <0.00%> (-7.51%) ⬇️
src/lazy/text/value.rs 33.33% <0.00%> (-8.05%) ⬇️
src/lazy/text/parse_result.rs 26.56% <9.09%> (-3.60%) ⬇️
src/lazy/system_reader.rs 77.15% <50.00%> (ø)
src/lazy/raw_value_ref.rs 68.78% <53.84%> (-3.09%) ⬇️
src/lazy/str_ref.rs 66.66% <66.66%> (ø)
src/lazy/text/matched.rs 69.33% <73.50%> (+10.32%) ⬆️
src/lazy/value_ref.rs 74.07% <75.00%> (+0.27%) ⬆️
... and 3 more

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +43 to +44
// We cannot compare lazy containers as we cannot guarantee that their complete contents
// are available in the buffer. Is `{foo: bar}` equal to `{foo: b`?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for implementing PartialEq?

If there was some way of having 3-value logic here (true, false, unknown) that would make me a little more comfortable. I don't like how this would say that {foo: bar} and {foo: bar} are not equal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for implementing PartialEq?

It allows us to test for equality using assert_eq! in unit tests.

If there was some way of having 3-value logic here (true, false, unknown) that would make me a little more comfortable. I don't like how this would say that {foo: bar} and {foo: bar} are not equal.

This implementation is leaning on the Partial part of PartialEq. Just as there are values in f64 that cannot be compared and always return false (i.e. NaN), there are values in RawValueRef that cannot be compared and always return false (i.e. container types).

Let me know if that makes it feel less yucky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

}

pub fn text(&self) -> &str {
self.as_ref()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this is correct. I don't see any impl AsRef for StrRef, so I don't know where self.as_ref() is coming from.

Shouldn't it be this instead?

Suggested change
self.as_ref()
self.text.as_ref()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StrRef implements Deref to &str the same way that Cow<'_, str> does, so it's transparently doing what you wrote.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. I forgot about that... and that's why it's easy to misuse Deref. I think impl Deref for StrRef is a good use of it though.

That being said, I would gently recommend that the functions in the inherent impl not depend on the Deref implementation. It's a little easier to see what's going on, IMO. (My mental model, at least, is that you have the struct, then the inherent impls build on the struct, and the trait impls build on the struct and its inherent impls. YMMV.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'm on board with that.

// Missing a trailing quote
r#"
"hello
"#,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the test expect a complete match, or just something at the beginning to match? Does it make sense to add a test like this?

Suggested change
"#,
"#,
// Unescaped quote
r#"
"hello"world"
"#,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mismatch cases only look for a failure to match the entire string. We can add that test and it will fail (as it should) because it won't match the whole thing.

use std::ops::Deref;

#[derive(Clone, PartialEq, Debug)]
pub struct StrRef<'data> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add doc comment here?


#[derive(Clone, Copy, Debug, PartialEq)]
pub(crate) struct MatchedShortString {
contains_escaped_chars: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I mentioned it already (offline, somewhere else?) that we should just have two separate enum variants. I don't know if you're planning on introducing that into this PR though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +375 to +376
/// This helper function detects high surrogates (which are only used in utf-16) so the parser
/// can know to require a second one immediately following.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think we needed to support UTF-16. Why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specification mentions surrogate pair handling in a "MAY"-style statement.

Ion does not specify the behavior of specifying invalid Unicode code points or surrogate code points (used only for UTF-16) using the escape sequences. It is highly recommended that Ion implementations reject such escape sequences as they are not proper Unicode as specified by the standard. To this point, consider the Ion string sequence, "\uD800\uDC00". A compliant parser may throw an exception because surrogate characters are specified outside of the context of UTF-16, accept the string as a technically invalid sequence of two Unicode code points (i.e. U+D800 and U+DC00), or interpret it as the single Unicode code point U+00010000. In this regard, the Ion string data type does not conform to the Unicode specification. A strict Unicode implementation of the Ion text should not accept such sequences.

So you're right that they're not required (and even actively discouraged). However,ion-java supports them and ion-tests has tests for them. The existing ion-rust text reader also supports them. I'm not in a rush to implement this and could be convinced to formalize it as an error case instead.

message.push_str("; buffer: ");
let input = invalid_input_error.input;
let buffer_text = if let Ok(text) = invalid_input_error.input.as_text() {
// TODO: This really should be graphemes instead of chars()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're being pedantic... it should be grapheme clusters, right? 😉

That being said, I don't think this really needs to be a "todo" item. It's unclear how valuable this is given that this message is already not going to contain the full text because of the truncation. From what I can tell, the ability to manipulate graphemes was removed from the standard library because of the size of the unicode tables and because it was not good for the std library to be coupled to a specific Unicode version.

Copy link
Contributor Author

@zslayton zslayton Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're being pedantic... it should be grapheme clusters, right? 😉

Yep! 😄

That being said, I don't think this really needs to be a "todo" item [...]

I was concerned the code might do something weird if the 32 byte limit fell in a boundary between unicode scalars. However, on closer inspection of the char documentation, I see that:

USVs are also the exact set of values that may be encoded in UTF-8. Because char values are USVs and str values are valid UTF-8, it is safe to store any char in a str or read any character from a str as a char.

so it should always produce a legal (if misleading) string if it's truncated.

Base automatically changed from lazy-comments to main August 22, 2023 08:29
@zslayton zslayton merged commit 6d22b6f into main Aug 23, 2023
17 of 18 checks passed
@zslayton zslayton deleted the lazy-strings branch August 23, 2023 00:01
@zslayton zslayton self-assigned this Aug 29, 2023
zslayton added a commit that referenced this pull request Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants