-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds LazyRawTextReader support for reading strings #614
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #614 +/- ##
==========================================
- Coverage 81.66% 81.64% -0.03%
==========================================
Files 118 119 +1
Lines 21202 21547 +345
Branches 21202 21547 +345
==========================================
+ Hits 17315 17591 +276
- Misses 2231 2312 +81
+ Partials 1656 1644 -12
☔ View full report in Codecov by Sentry. |
// We cannot compare lazy containers as we cannot guarantee that their complete contents | ||
// are available in the buffer. Is `{foo: bar}` equal to `{foo: b`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for implementing PartialEq?
If there was some way of having 3-value logic here (true
, false
, unknown
) that would make me a little more comfortable. I don't like how this would say that {foo: bar}
and {foo: bar}
are not equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for implementing
PartialEq
?
It allows us to test for equality using assert_eq!
in unit tests.
If there was some way of having 3-value logic here (true, false, unknown) that would make me a little more comfortable. I don't like how this would say that {foo: bar} and {foo: bar} are not equal.
This implementation is leaning on the Partial
part of PartialEq
. Just as there are values in f64
that cannot be compared and always return false
(i.e. NaN
), there are values in RawValueRef
that cannot be compared and always return false
(i.e. container types).
Let me know if that makes it feel less yucky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
} | ||
|
||
pub fn text(&self) -> &str { | ||
self.as_ref() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this is correct. I don't see any impl AsRef for StrRef
, so I don't know where self.as_ref()
is coming from.
Shouldn't it be this instead?
self.as_ref() | |
self.text.as_ref() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StrRef
implements Deref
to &str
the same way that Cow<'_, str>
does, so it's transparently doing what you wrote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right. I forgot about that... and that's why it's easy to misuse Deref
. I think impl Deref for StrRef
is a good use of it though.
That being said, I would gently recommend that the functions in the inherent impl not depend on the Deref
implementation. It's a little easier to see what's going on, IMO. (My mental model, at least, is that you have the struct, then the inherent impls build on the struct, and the trait impls build on the struct and its inherent impls. YMMV.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm on board with that.
// Missing a trailing quote | ||
r#" | ||
"hello | ||
"#, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the test expect a complete match, or just something at the beginning to match? Does it make sense to add a test like this?
"#, | |
"#, | |
// Unescaped quote | |
r#" | |
"hello"world" | |
"#, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mismatch
cases only look for a failure to match the entire string. We can add that test and it will fail (as it should) because it won't match the whole thing.
use std::ops::Deref; | ||
|
||
#[derive(Clone, PartialEq, Debug)] | ||
pub struct StrRef<'data> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add doc comment here?
|
||
#[derive(Clone, Copy, Debug, PartialEq)] | ||
pub(crate) struct MatchedShortString { | ||
contains_escaped_chars: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I mentioned it already (offline, somewhere else?) that we should just have two separate enum variants. I don't know if you're planning on introducing that into this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did end up making these into enum variants in PR #619.
/// This helper function detects high surrogates (which are only used in utf-16) so the parser | ||
/// can know to require a second one immediately following. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think we needed to support UTF-16. Why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The specification mentions surrogate pair handling in a "MAY"-style statement.
Ion does not specify the behavior of specifying invalid Unicode code points or surrogate code points (used only for UTF-16) using the escape sequences. It is highly recommended that Ion implementations reject such escape sequences as they are not proper Unicode as specified by the standard. To this point, consider the Ion string sequence, "\uD800\uDC00". A compliant parser may throw an exception because surrogate characters are specified outside of the context of UTF-16, accept the string as a technically invalid sequence of two Unicode code points (i.e. U+D800 and U+DC00), or interpret it as the single Unicode code point U+00010000. In this regard, the Ion string data type does not conform to the Unicode specification. A strict Unicode implementation of the Ion text should not accept such sequences.
So you're right that they're not required (and even actively discouraged). However,ion-java
supports them and ion-tests
has tests for them. The existing ion-rust
text reader also supports them. I'm not in a rush to implement this and could be convinced to formalize it as an error case instead.
src/lazy/text/parse_result.rs
Outdated
message.push_str("; buffer: "); | ||
let input = invalid_input_error.input; | ||
let buffer_text = if let Ok(text) = invalid_input_error.input.as_text() { | ||
// TODO: This really should be graphemes instead of chars() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're being pedantic... it should be grapheme clusters, right? 😉
That being said, I don't think this really needs to be a "todo" item. It's unclear how valuable this is given that this message is already not going to contain the full text because of the truncation. From what I can tell, the ability to manipulate graphemes was removed from the standard library because of the size of the unicode tables and because it was not good for the std library to be coupled to a specific Unicode version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're being pedantic... it should be grapheme clusters, right? 😉
Yep! 😄
That being said, I don't think this really needs to be a "todo" item [...]
I was concerned the code might do something weird if the 32
byte limit fell in a boundary between unicode scalars. However, on closer inspection of the char
documentation, I see that:
USVs are also the exact set of values that may be encoded in UTF-8. Because char values are USVs and str values are valid UTF-8, it is safe to store any char in a str or read any character from a str as a char.
so it should always produce a legal (if misleading) string if it's truncated.
Builds on #609, #612, and #613.
Adds support for reading short-form
string
s using theLazyRawTextReader
, includingall escape sequences. Support for long-form strings will be added later.
Because text
string
s with escapes need to be modified before being returned to the user,this PR also introduces a
StrRef
type that wraps aCow<'a, str>
, allowing it to provideeither a slice directly from input (when no escapes are present) or a new
String
(whensome bytes had to be replaced).
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.