Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial raw lazy text reader (top-level nulls, bools, ints) #609

Merged
merged 3 commits into from
Aug 10, 2023

Conversation

zslayton
Copy link
Contributor

@zslayton zslayton commented Jul 24, 2023

This adds an initial implementation of a LazyRawTextReader that can yield LazyRawTextValues. At present it is limited to reading top-level nulls, bools, and ints of any radix. Support for other types, annotations, and containers will be added in follow-on PRs.

This implementation splits the parsing process for each value into two phases:

  1. Matching finds a slice at the head of the input that corresponds to an element from the Ion grammar. It does not perform full semantic validation of the value.
  2. Reading fully validates the previously matched input and turns it into a value.

This split makes it possible for the reader to avoid the cost of validation and materialization for values that the user simply skips over. To minimize rework done in the two phases, the matching phase records information it discovers along the way (subfield offsets, radix, etc) so it is available if and when the reading phase begins.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov
Copy link

codecov bot commented Jul 24, 2023

Codecov Report

Patch coverage: 67.52% and project coverage change: -0.67% ⚠️

Comparison is base (42adf28) 82.39% compared to head (840be4d) 81.72%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #609      +/-   ##
==========================================
- Coverage   82.39%   81.72%   -0.67%     
==========================================
  Files         111      118       +7     
  Lines       20048    20969     +921     
  Branches    20048    20969     +921     
==========================================
+ Hits        16518    17137     +619     
- Misses       1912     2193     +281     
- Partials     1618     1639      +21     
Files Changed Coverage Δ
src/lazy/binary/raw/reader.rs 76.31% <ø> (ø)
src/lazy/binary/raw/sequence.rs 60.93% <ø> (ø)
src/lazy/binary/raw/struct.rs 62.22% <ø> (ø)
src/lazy/encoding.rs 0.00% <0.00%> (ø)
src/lazy/reader.rs 66.66% <ø> (ø)
src/lazy/sequence.rs 73.07% <ø> (ø)
src/lazy/struct.rs 66.66% <ø> (ø)
src/lazy/system_reader.rs 77.15% <ø> (ø)
src/lazy/value.rs 74.61% <ø> (ø)
src/result/decoding_error.rs 42.85% <12.50%> (-40.48%) ⬇️
... and 12 more

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ PR tour

@@ -66,7 +66,7 @@ num-bigint = "0.4.3"
num-integer = "0.1.44"
num-traits = "0.2"
arrayvec = "0.7"
smallvec = "1.9.0"
smallvec = {version ="1.9.0", features = ["const_generics"]}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ The const_generics feature of the smallvec crate provides trait implementations for all sizes of backing array ([u8; N] in our case) rather than just 0-32 and several powers of two beyond. It's a feature because smallvec predates const generics and didn't want to force a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This file was moved to the parent directory, not deleted. It appears again later in the diff.

@@ -35,7 +35,7 @@ impl<'a> Debug for LazyRawBinaryValue<'a> {
fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
write!(
f,
"LazyRawValue {{\n val={:?},\n buf={:?}\n}}\n",
"LazyRawBinaryValue {{\n val={:?},\n buf={:?}\n}}\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ LazyRawBinaryValue used to be the only kind of lazy value and was called LazyRawValue. Now LazyRawValue is the trait and this type is LazyRawBinaryValue.

@@ -54,6 +54,10 @@ impl<'data> LazyRawValue<'data, BinaryEncoding> for LazyRawBinaryValue<'data> {
self.ion_type()
}

fn is_null(&self) -> bool {
self.is_null()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️is_null is a method on the LazyRawValue trait. This implementation delegates the call to the is_null method that lives on the LazyRawBinaryValue concrete type.

This arrangement produces a small amount of extra code (there are two fn is_nulls), but means that consumer code that works explicitly with LazyRawBinaryValue doesn't have to import a trait just to use its methods.

Comment on lines +25 to +31
impl<'data> LazyDecoder<'data> for BinaryEncoding {
type Reader = LazyRawBinaryReader<'data>;
type Value = LazyRawBinaryValue<'data>;
type Sequence = LazyRawBinarySequence<'data>;
type Struct = LazyRawBinaryStruct<'data>;
type AnnotationsIterator = RawBinaryAnnotationsIterator<'data>;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ As mentioned earlier in the diff, encoding.rs was moved here from the binary directory. The BinaryEncoding struct and this impl existed before this PR, but appear as additions here because of the move. The TextEncoding struct and the content below this impl (line 33+) are new.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This PR deals with two types that are effectively wrappers around a &[u8]: TextBufferView and SmallVec. Most of the time we don't need to view them as validated UTF-8 &strs--we can rely on the grammar to reject anything invalid--but sometimes we do. This trait defines an extension method that streamlines the "view these bytes as a &str or raise an IonError" task.

self.data.is_empty()
}

pub fn match_whitespace(self) -> IonMatchResult<'data> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Note that only a handful of the match_ parsing methods are pub. Most are helper-parsers.

Comment on lines +126 to +138
// impl<'data> From<InvalidInputError<'data>> for IonError {
// fn from(value: InvalidInputError) -> Self {
// dbg!(&value.backtrace);
// let mut message = String::from(value.description().unwrap_or("invalid text Ion syntax"));
// if let Some(label) = value.label {
// message.push_str(" while ");
// message.push_str(label.as_ref());
// }
// let position = Position::with_offset(value.input.offset()).with_length(value.input.len());
// let decoding_error = DecodingError::new(message).with_position(position);
// IonError::Decoding(decoding_error)
// }
// }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I'll remove this obsolete code after the PR is reviewed. Pushing code as I write the PR tour comments causes some number of the comments to be discarded 😐.

use crate::IonType;

#[test]
fn test_top_level() -> IonResult<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ This test demonstrates the total capability introduced in this PR. Namely, a LazyRawTextReader that can read top-level, unannotated nulls, bools, and ints in a text stream.

Comment on lines +14 to +15
/// If only part of the value is in the input buffer, calls to [`LazyRawTextValue::read`] (which examines
/// bytes beyond the value's header) may return [`IonError::Incomplete`](crate::result::IonError::Incomplete).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I have a lingering reference to "the value's header" that was a copy/paste error from this type's binary counterpart. I'll fix that before merging.

@zslayton zslayton requested review from jobarr-amzn, popematt and desaikd and removed request for jobarr-amzn July 25, 2023 14:45
@zslayton zslayton marked this pull request as ready for review July 25, 2023 18:31
Copy link
Contributor Author

@zslayton zslayton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ couple more PR tour comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ Much of this was copied over from the old text reader implementation's error handling code and then updated.

Comment on lines +195 to +201
pub(crate) trait AddContext<'data, T> {
fn with_context(
self,
label: impl Into<Cow<'static, str>>,
input: TextBufferView<'data>,
) -> IonResult<(TextBufferView<'data>, T)>;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗺️ I plan to add a doc comment to this trait before merging. I intentionally avoided providing an Into<IonError> impl for IonParseError because some of the nom error kinds that bubble up as IonParseError don't retain context. This trait is an easy way for methods that could surface without context to require more information before being passed on to the user.

For an example of its usage, see LazyRawTextReader::next():
image

@@ -6,12 +7,23 @@ use thiserror::Error;
#[error("{description}")]
pub struct DecodingError {
description: Cow<'static, str>,
position: Option<Position>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Suggestion) Add comment regarding why position is optional here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add this in a follow-on PR to avoid needing to modify the queue.

}

/// Matches a single scalar value or the beginning of a container.
pub fn match_value(self) -> IonParseResult<'data, LazyRawTextValue<'data>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Question) Should this be called read_value as it returns a parsed value? Same for match_top_level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was reserving the read_* names for methods that produce one of the RawValueRef::* variants. The LazyRawTextValue that this returns has its own read() method that serves that purpose.

@zslayton zslayton merged commit 31ec961 into main Aug 10, 2023
17 of 18 checks passed
@zslayton zslayton deleted the lazy-text-reader branch August 10, 2023 18:18
@zslayton zslayton self-assigned this Aug 29, 2023
zslayton pushed a commit that referenced this pull request Sep 7, 2023
zslayton added a commit that referenced this pull request Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants