Initial implementation of the Ion 1.1 text reader #645

Merged
merged 4 commits into from
Sep 28, 2023

zslayton

@zslayton zslayton commented Sep 25, 2023

Syntax changes

Ion 1.1 introduces only one new syntactic element to the text format: encoding expressions. Encoding expressions (e-expressions for short) represent macro invocations in the data stream and look like this:

(:macro_id arg1 arg2 arg3 ... argN)

Apart from the leading smiley ((:) and macro_id, they follow s-expression syntax rules.

E-expressions can appear anywhere that an Ion value can appear, including:

  • In a sequence:
    • Top level: $ion_1_1 (:foo)
    • S-expression: (1 2 (:foo) 4)
    • List: [1, 2, (:foo), 4]
  • In a struct:
    • In field value position: { a: 1, foo: (:bar), c: 3 }
    • In field name position with no associated value: { a: 1, (:bar), c: 3 }
  • As an argument to another e-expression: (:foo 1 2 (:bar) 4)

To support this change to the grammar, the LazyDecoder trait has been modified to accommodate the possibility of an e-expression in these locations. In addition to the associated type Value, each LazyDecoder implementation D (for example: TextEncoding_1_1, BinaryEncoding_1_0, etc.) must now declare an associated type MacroInvocation that represents that encoding's syntactic representation of an e-expression. D::Reader::next() no longer returns a D::Value; it now returns a LazyRawValueExpr<D>, an enum that is either a value or an e-expression:

pub enum LazyRawValueExpr<'data, D: LazyDecoder<'data>> {
    /// A text Ion value literal. For example: `5`, `foo`, or `"hello"`
    ValueLiteral(D::Value),
    /// A text Ion 1.1+ macro invocation. For example: `(:employee 12345 "Sarah" "Gonzalez")`
    MacroInvocation(D::MacroInvocation),
}
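As a rough illustration of what consuming this enum looks like, here is a minimal sketch. The `RawValueExpr` type and the `describe` function are hypothetical stand-ins that hold plain strings; they are not the PR's lifetime-parameterized `LazyRawValueExpr<'data, D>`.

```rust
/// Simplified, hypothetical stand-in for the PR's `LazyRawValueExpr`.
/// The payloads are plain strings rather than lazily parsed input slices.
#[derive(Debug, PartialEq)]
enum RawValueExpr {
    /// A value literal such as `5` or `"hello"`.
    ValueLiteral(String),
    /// An e-expression such as `(:employee 12345 "Sarah")`.
    MacroInvocation { macro_id: String, args: Vec<String> },
}

/// Callers of `next()` now have to handle both cases explicitly.
fn describe(expr: &RawValueExpr) -> String {
    match expr {
        RawValueExpr::ValueLiteral(text) => format!("value {text}"),
        RawValueExpr::MacroInvocation { macro_id, args } => {
            format!("e-expression :{macro_id} with {} argument(s)", args.len())
        }
    }
}
```

The point of the sketch is only that every consumer of the raw reader now branches on "value or e-expression" before doing anything else.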

Similarly, rather than returning a D::Field, a D::Struct's iterator will return a LazyRawFieldExpr<D>:

/// An item found in field position within a struct.
/// This item may be:
///   * a name/value pair (as it is in Ion 1.0)
///   * a name/e-expression pair
///   * an e-expression
#[derive(Debug)]
pub enum LazyRawFieldExpr<'data, D: LazyDecoder<'data>> {
    NameValuePair(RawSymbolTokenRef<'data>, LazyRawValueExpr<'data, D>),
    MacroInvocation(D::MacroInvocation),
}
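The three cases listed in the doc comment can be sketched as follows. `FieldExprSketch`, `ValueExprSketch`, and this version of `expect_name_value` are simplified assumptions for illustration; the PR's real types are generic over the decoder and borrow from the input.

```rust
/// Hypothetical, simplified value expression: a literal int or a macro id.
#[derive(Debug, PartialEq)]
enum ValueExprSketch {
    ValueLiteral(i64),
    MacroInvocation(String),
}

/// Hypothetical, simplified version of `LazyRawFieldExpr`.
#[derive(Debug, PartialEq)]
enum FieldExprSketch {
    NameValuePair(String, ValueExprSketch),
    MacroInvocation(String),
}

impl FieldExprSketch {
    /// Returns the name and literal value, or an error describing which
    /// e-expression was found instead.
    fn expect_name_value(&self) -> Result<(&str, i64), String> {
        match self {
            // name/value pair, the only case possible in Ion 1.0
            FieldExprSketch::NameValuePair(name, ValueExprSketch::ValueLiteral(v)) => {
                Ok((name.as_str(), *v))
            }
            // name/e-expression pair
            FieldExprSketch::NameValuePair(name, ValueExprSketch::MacroInvocation(id)) => {
                Err(format!("field {name} holds e-expression :{id}"))
            }
            // bare e-expression in field name position
            FieldExprSketch::MacroInvocation(id) => {
                Err(format!("found e-expression :{id} in field name position"))
            }
        }
    }
}
```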

Because Ion 1.0 will never use these new enum variants, a special type called Never has been introduced that allows the compiler to recognize code paths related to macros in Ion 1.0 as dead branches that can be optimized out:

/// An uninhabited type that signals to the compiler that related code paths are not reachable.
#[derive(Debug, Copy, Clone)]
pub enum Never {
    // Has no variants, cannot be instantiated.
}

This should result in Ion 1.0 maintaining the same performance even though ion-rust's data model is now more complex.
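To show why an uninhabited type makes the macro branch provably dead, here is a small sketch. The `Decoder` trait, `Ion10` marker, and `ValueExpr` enum are hypothetical reductions of the PR's `LazyDecoder`, `BinaryEncoding_1_0`/`TextEncoding_1_0`, and `LazyRawValueExpr`; only `Never` mirrors the definition above.

```rust
/// An uninhabited type: it has no variants, so no value of it can exist.
#[derive(Debug, Copy, Clone)]
enum Never {}

/// Hypothetical reduction of `LazyDecoder` to the one associated type we need.
trait Decoder {
    type MacroInvocation;
}

/// Hypothetical marker for an Ion 1.0 encoding, which has no macros.
struct Ion10;

impl Decoder for Ion10 {
    type MacroInvocation = Never;
}

/// Hypothetical reduction of `LazyRawValueExpr`.
enum ValueExpr<D: Decoder> {
    ValueLiteral(i64),
    MacroInvocation(D::MacroInvocation),
}

fn read_1_0(expr: ValueExpr<Ion10>) -> i64 {
    match expr {
        ValueExpr::ValueLiteral(v) => v,
        // An empty match on an uninhabited type type-checks as any result
        // type; this arm can never execute.
        ValueExpr::MacroInvocation(never) => match never {},
    }
}
```

Because `Never` cannot be constructed, the second arm needs no runtime handling at all, which is what lets the Ion 1.0 code paths stay as they were.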


Expansion behavior

Upon evaluation, an e-expression expands to zero or more Ion values; it cannot produce version markers, NOPs, or other e-expressions. This PR implements a limited set of macros:

  • (:values arg1 arg2 ... argN): always expands to its arguments.
  • (:void): always expands to 0 values. (This is equivalent to (:values /* no args */)).
  • (:make_string arg1 arg2 ... argN): builds a string by evaluating each of its arguments to produce a stream of text values which are then concatenated.

When an e-expression appears in a sequence, the values in its expansion are considered part of that sequence.

// Top level
$ion_1_1 1 2 (:values 3 4) 5 6    // 1 2 3 4 5 6

// List
[1, 2, (:values 3 4), 5, 6]       // [1, 2, 3, 4, 5, 6]

// S-expression
(1 2 (:values 3 4) 5 6)           // (1 2 3 4 5 6)
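The splicing behavior can be modeled with a small recursive function. This is an eager sketch for illustration only (the reader in this PR expands lazily), and the `SeqExpr` type and `expand_sequence` function are hypothetical names.

```rust
/// Hypothetical model of items that may appear in a sequence.
#[derive(Debug, Clone, PartialEq)]
enum SeqExpr {
    Int(i64),
    /// `(:values arg1 ... argN)`
    Values(Vec<SeqExpr>),
    /// `(:void)`
    Void,
}

/// Appends the expansion of `expr` to `out`.
fn expand_into(expr: &SeqExpr, out: &mut Vec<i64>) {
    match expr {
        SeqExpr::Int(i) => out.push(*i),
        SeqExpr::Values(args) => {
            for arg in args {
                expand_into(arg, out);
            }
        }
        SeqExpr::Void => {}
    }
}

/// Expands every item, splicing each e-expression's values into the
/// enclosing sequence.
fn expand_sequence(items: &[SeqExpr]) -> Vec<i64> {
    let mut out = Vec::new();
    for item in items {
        expand_into(item, &mut out);
    }
    out
}
```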

When an e-expression appears in a struct, its expansion behavior depends on whether it was in field value or field name position.

In field value position, each value in its expansion is treated as a field value with the field name that preceded the e-expression.

// Field value position
{
    foo: (:values 1 2 3),
    bar: (:values 4),
    baz: (:void)
}                       
// Expands to:
{
    foo: 1,
    foo: 2,
    foo: 3,
    bar: 4
}
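A sketch of the field value rule, under the assumption that each macro has already been reduced to the list of ints it expands to. `FieldValue` and `expand_fields` are hypothetical names used only for this example.

```rust
/// Hypothetical model of what can follow a field name.
enum FieldValue {
    Int(i64),
    /// `(:values ...)`, already reduced to the ints it expands to.
    Values(Vec<i64>),
    /// `(:void)`
    Void,
}

/// Repeats the field name once per value in the expansion; a field whose
/// e-expression expands to nothing disappears entirely.
fn expand_fields(fields: &[(&str, FieldValue)]) -> Vec<(String, i64)> {
    let mut out = Vec::new();
    for (name, value) in fields {
        match value {
            FieldValue::Int(i) => out.push((name.to_string(), *i)),
            FieldValue::Values(values) => {
                out.extend(values.iter().map(|v| (name.to_string(), *v)));
            }
            FieldValue::Void => {}
        }
    }
    out
}
```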

In field name position, the macro must evaluate to a single struct. That struct's fields will then be merged into the host struct.

// Field name position
{
    foo: 1,
    (:values {bar: 2, baz: 3}),
    quux: 4,
}                       
// Expands to:
{
    foo: 1,
    bar: 2,
    baz: 3,
    quux: 4
}
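The field name rule, including its "single struct" requirement, might be sketched like this. `Expanded`, `HostField`, and `merge_fields` are hypothetical; the macro's expansion is modeled as an already computed list of values.

```rust
type Fields = Vec<(String, i64)>;

/// Hypothetical model of a value produced by a macro's expansion.
#[derive(Debug, Clone, PartialEq)]
enum Expanded {
    Int(i64),
    Struct(Fields),
}

/// Hypothetical model of an item in field position of the host struct.
enum HostField {
    NameValuePair(String, i64),
    /// An e-expression in field name position, reduced to its expansion.
    MacroInvocation(Vec<Expanded>),
}

/// Merges the fields of each macro-produced struct into the host struct.
/// Any expansion other than exactly one struct is reported as an error.
fn merge_fields(host: Vec<HostField>) -> Result<Fields, String> {
    let mut merged = Fields::new();
    for field in host {
        match field {
            HostField::NameValuePair(name, value) => merged.push((name, value)),
            HostField::MacroInvocation(expansion) => match expansion.as_slice() {
                [Expanded::Struct(inner)] => merged.extend(inner.iter().cloned()),
                other => return Err(format!("expected a single struct, found {other:?}")),
            },
        }
    }
    Ok(merged)
}
```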

To enable this behavior, this patch introduces a fourth layer of abstraction into the lazy reader model:

  1. Raw: where bits on the wire are turned into syntax tokens (IVMs, values, symbol ID tokens, e-expressions)
  2. --> new <-- Expanded: e-expressions encountered in the raw level are fully expanded. The reader API hands out LazyExpandedValues, each of which may be backed either by an as-yet-unread value literal from the input stream or by the result of a macro's evaluation.
  3. System: which filters out and processes encoding directives (symbol tables, module declarations) from the values surfaced by lower levels of abstraction.
  4. Application: only data relevant to the application's data model is visible.

Macro evaluation

This PR introduces a new type, MacroEvaluator, which can evaluate a macro incrementally; it only calculates the next value in the expansion upon request. This enables e-expressions to be evaluated lazily, making skip-scanning even more powerful and preventing the unexpected spikes in memory usage that might come from eagerly evaluating expressions that may or may not be needed.

Incremental argument evaluation

The MacroEvaluator achieves this by maintaining a stack of macros that are in the process of being evaluated. Each time the next value is requested, the macro at the top of the stack is given the opportunity to either:

  • yield another value
  • push another macro onto the stack
  • indicate that its evaluation is complete

Some macros like (:values ...) need to partially evaluate their arguments until they find another value. Consider:

(:values
  1
  (:values 2 3)
  4)

When evaluating this expression, the evaluator will follow these steps:

  1. Push the entire expression onto the stack. Yield 1.
  2. Push (:values 2 3) onto the stack, yield 2.
  3. Yield 3.
  4. Pop (:values 2 3) off of the stack, yield 4.
  5. Pop the entire expression off of the stack. Yield None.
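The steps above can be sketched with a small stack-based iterator. This is a simplified model, not the PR's MacroEvaluator: `Expr`, `Frame`, and `SketchEvaluator` are hypothetical names, values are plain ints, and `(:values ...)` is the only macro.

```rust
/// Hypothetical argument model: an int literal or a nested `(:values ...)`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
    Values(Vec<Expr>),
}

/// One in-progress macro invocation: its arguments and a cursor into them.
struct Frame<'a> {
    args: &'a [Expr],
    next: usize,
}

/// Evaluates lazily: each call to `next()` does only enough work to
/// produce one more value.
struct SketchEvaluator<'a> {
    stack: Vec<Frame<'a>>,
}

impl<'a> SketchEvaluator<'a> {
    fn new() -> Self {
        SketchEvaluator { stack: Vec::new() }
    }

    /// Pushes a `(:values ...)` invocation with the given arguments.
    fn push(&mut self, args: &'a [Expr]) {
        self.stack.push(Frame { args, next: 0 });
    }
}

impl<'a> Iterator for SketchEvaluator<'a> {
    type Item = i64;

    fn next(&mut self) -> Option<i64> {
        loop {
            // An empty stack means evaluation is complete.
            let frame = self.stack.last_mut()?;
            let args: &'a [Expr] = frame.args;
            if frame.next == args.len() {
                // The macro on top has no more arguments: pop it.
                self.stack.pop();
                continue;
            }
            let arg = &args[frame.next];
            frame.next += 1;
            match arg {
                // Yield another value.
                Expr::Int(i) => return Some(*i),
                // Push another macro onto the stack and keep going.
                Expr::Values(nested) => self.stack.push(Frame {
                    args: nested.as_slice(),
                    next: 0,
                }),
            }
        }
    }
}
```

Note that a finished frame is popped on the request after its last value, which matches step 4: `(:values 2 3)` is popped as part of the call that yields 4.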

Eager argument evaluation

Other macros like (:make_string) need access to the expanded form of all of their arguments to yield their next value. In this case, the macro can construct a transient (short-lived) evaluator of its own.

(:make_string
  foo_
  (:values bar_ baz_)
  quux_)

The evaluation steps for this are:

  1. Push the entire expression onto the primary stack; construct an empty string buffer and a transient evaluator.
    a. Push the text content foo_ onto the buffer.
    b. We cannot push (:values bar_ baz_) onto the primary evaluator without yielding flow control. Instead, we push (:values bar_ baz_) onto the secondary, transient evaluator's stack.
    c. Ask the secondary evaluator for the next value, get bar_, push its text onto the buffer.
    d. Ask the secondary evaluator for the next value, get baz_, push its text onto the buffer.
    e. Pop (:values bar_ baz_) off the secondary evaluator's stack. Get quux_, push its text onto the buffer.
    f. Pop the entire expression off of the stack, yield "foo_bar_baz_quux_".

From the caller's perspective, they called next() and received "foo_bar_baz_quux_". However, internally the same incremental evaluation logic was being used in a smaller scope. Note that this means each substring was evaluated one at a time; we did not need to collect them or hold them all in memory at the same time (modulo the buffer's contents).
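A sketch of the transient-evaluator idea, under simplifying assumptions: `TextExpr`, `TextFrame`, and `TextEvaluator` are hypothetical, text values are owned Strings, and the `(:make_string ...)` invocation is handled inline rather than being given a frame of its own on the primary stack.

```rust
/// Hypothetical argument model for text-producing expressions.
#[derive(Debug, Clone, PartialEq)]
enum TextExpr {
    Text(String),
    Values(Vec<TextExpr>),
    MakeString(Vec<TextExpr>),
}

struct TextFrame<'a> {
    args: &'a [TextExpr],
    next: usize,
}

struct TextEvaluator<'a> {
    stack: Vec<TextFrame<'a>>,
}

impl<'a> TextEvaluator<'a> {
    /// Creates an evaluator whose stack holds one `(:values ...)`-style frame.
    fn new(args: &'a [TextExpr]) -> Self {
        TextEvaluator { stack: vec![TextFrame { args, next: 0 }] }
    }
}

impl<'a> Iterator for TextEvaluator<'a> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        loop {
            let frame = self.stack.last_mut()?;
            let args: &'a [TextExpr] = frame.args;
            if frame.next == args.len() {
                self.stack.pop();
                continue;
            }
            let arg = &args[frame.next];
            frame.next += 1;
            match arg {
                TextExpr::Text(text) => return Some(text.clone()),
                TextExpr::Values(nested) => self.stack.push(TextFrame {
                    args: nested.as_slice(),
                    next: 0,
                }),
                TextExpr::MakeString(nested) => {
                    // Eager case: drain a short-lived secondary evaluator,
                    // appending one piece at a time to the buffer.
                    let mut buffer = String::new();
                    for piece in TextEvaluator::new(nested) {
                        buffer.push_str(&piece);
                    }
                    return Some(buffer);
                }
            }
        }
    }
}
```

The secondary evaluator reuses the same incremental logic, so nested `(:values ...)` arguments are flattened one piece at a time and only the buffer accumulates.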

Bump allocation

In the description of (:make_string) above, perceptive readers may have noticed some possible sources of allocations; in particular, constructing a string buffer and constructing a transient evaluator.

This PR leverages bumpalo to be able to trivially allocate and deallocate structures using a preallocated block of heap memory. Values created in this way (including string buffers, and the Vecs backing our transient evaluators) cannot live beyond the current top level value, but happily that's exactly how long we need them! As a result, we get to allocate and deallocate by simply bumping an offset within our preallocated memory.
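To make "bumping an offset" concrete, here is a toy arena written against the standard library only. It is not bumpalo and does not reflect bumpalo's API; `ToyBump` and its methods are hypothetical, and it hands out index ranges instead of references so that it stays in safe Rust.

```rust
use std::ops::Range;

/// A toy bump arena over one preallocated buffer (hypothetical, not bumpalo).
struct ToyBump {
    buffer: Vec<u8>,
    offset: usize,
}

impl ToyBump {
    fn with_capacity(capacity: usize) -> Self {
        ToyBump { buffer: vec![0; capacity], offset: 0 }
    }

    /// Copies `text` into the arena and returns the range it occupies,
    /// or `None` if the remaining space is too small.
    /// Allocation is just a bounds check, a copy, and an offset bump.
    fn alloc_str(&mut self, text: &str) -> Option<Range<usize>> {
        let end = self.offset.checked_add(text.len())?;
        if end > self.buffer.len() {
            return None;
        }
        self.buffer[self.offset..end].copy_from_slice(text.as_bytes());
        let range = self.offset..end;
        self.offset = end;
        Some(range)
    }

    /// Reads back a string previously stored with `alloc_str`.
    fn get(&self, range: Range<usize>) -> &str {
        std::str::from_utf8(&self.buffer[range]).expect("arena stores only UTF-8 text")
    }

    /// "Frees" everything at once by rewinding the offset. Ranges handed
    /// out before the reset must not be used afterwards.
    fn reset(&mut self) {
        self.offset = 0;
    }
}
```

In this model, `reset()` would be called once per top-level value. A real bump allocator ties allocations to the arena's lifetime in the type system, which this range-based toy does not attempt.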


TODO

  • Benchmarking (particularly of Ion 1.0 code)
  • Wiring up the LazyReader to switch between underlying readers when a different IVM is encountered.
  • Template macros (including variable expansion)
  • Encoding directives (module definition)
  • Other built-in macros (make_timestamp, if_void, etc.)
  • Binary encoding

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@zslayton zslayton marked this pull request as ready for review September 25, 2023 17:24

@zslayton zslayton left a comment

🗺️ PR tour

fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
<Element as Display>::fmt(self, f)
}
}

🗺️ The default Debug implementation for Element is so verbose that your eyes glaze over just trying to see that this contains an int. It's used transitively by some other types' Debug implementations too. I've switched it over to forwarding the call to its Display impl, yielding text Ion. It's usually much more helpful.
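For readers unfamiliar with the pattern, a minimal sketch of forwarding Debug to Display follows. `ElementSketch` is a hypothetical one-field stand-in for Element, whose real Display impl writes text Ion.

```rust
use std::fmt;

/// Hypothetical stand-in for `Element` that wraps a single int.
struct ElementSketch(i64);

impl fmt::Display for ElementSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Debug output is forwarded to Display, so `{:?}` prints the compact form.
impl fmt::Debug for ElementSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        <ElementSketch as fmt::Display>::fmt(self, f)
    }
}
```

Because containers such as Vec build their own Debug output from their elements' Debug impls, the compact form carries through to any type that embeds the element.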


🗺️ There are a lot of changes in this file, but they're pretty uninteresting. Each of the Any types wraps an enum-dispatched concrete implementation of one of the encodings. We added a new encoding (TextEncoding_1_1), so lots of this boilerplate was updated. There isn't much logic to consider.

@@ -67,7 +67,7 @@ pub trait ElementReader {
/// returning the complete sequence as a `Vec<Element>`.
///
/// If an error occurs while reading, returns `Err(IonError)`.
-fn read_all_elements(&mut self) -> IonResult<Vec<Element>> {
+fn read_all_elements(&mut self) -> IonResult<Sequence> {

🗺️ A previous PR changed Element::read_all to return a Sequence (which is just an opaque wrapper around Vec<Element>). However, we neglected to make the same change to the read_all_elements method in the ElementReader trait. I've made that change here for interop/consistency.

@@ -78,12 +137,12 @@ pub struct LazyRawAnyReader<'data> {
}

pub enum RawReaderKind<'data> {
-Text_1_0(LazyRawTextReader<'data>),
+Text_1_0(LazyRawTextReader_1_0<'data>),

🗺️ LazyRawTextReader has been split into LazyRawTextReader_1_0 and LazyRawTextReader_1_1.

@@ -171,8 +171,8 @@ mod tests {
let value = reader.next()?.expect_value()?;
let lazy_struct = value.read()?.expect_struct()?;
let mut fields = lazy_struct.iter();
-let field1 = fields.next().expect("field 1")?;
-assert_eq!(field1.name(), 4.as_raw_symbol_token_ref()); // 'name'
+let (name, _value) = fields.next().expect("field 1")?.expect_name_value()?;

🗺️ When we encounter a struct field at the raw level, we now need to check whether it's a (name, value), a (name, macro), or a (macro). In Ion 1.0, it's always (name, value).

Comment on lines -48 to -62
fn find(&self, name: &str) -> IonResult<Option<LazyRawBinaryValue<'data>>> {
let name: RawSymbolTokenRef = name.as_raw_symbol_token_ref();
for field in self {
let field = field?;
if field.name() == name {
let value = field.value;
return Ok(Some(value));
}
}
Ok(None)
}

fn get(&self, name: &str) -> IonResult<Option<RawValueRef<'data, BinaryEncoding_1_0>>> {
self.find(name)?.map(|f| f.read()).transpose()
}

🗺️ It no longer makes sense for the raw level struct to have find and get methods because some number of the fields may be encoded as macro invocations that are opaque to the raw reader. These methods have been moved to the Expanded layer.

Comment on lines +41 to +54
impl<'data> TextEncoding<'data> for TextEncoding_1_0 {
fn value_from_matched(
matched: MatchedRawTextValue<'data>,
) -> <Self as LazyDecoder<'data>>::Value {
LazyRawTextValue_1_0::from(matched)
}
}
impl<'data> TextEncoding<'data> for TextEncoding_1_1 {
fn value_from_matched(
matched: MatchedRawTextValue<'data>,
) -> <Self as LazyDecoder<'data>>::Value {
LazyRawTextValue_1_1::from(matched)
}
}

🗺️ Text Ion 1.0 and Ion 1.1 have the same parsing rules for scalar value literals. This trait implementation allows us to reuse scalar matching/reading logic the same way in both version implementations.

MatchedRawTextValue doesn't have any methods, and only contains offset/length information about matched containers. You cannot read a MatchedRawTextValue's data until you've converted it to a LazyRawTextValue_1_0 or LazyRawTextValue_1_1, each of which has its own approach to parsing the container bytes. In particular, the _1_1 flavor knows to look for and handle nested macro invocations.
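The reuse described here can be sketched as a generic matcher that defers only the final conversion to the encoding. Everything below is a hypothetical reduction: `Matched`, `Value10`, `Value11`, `Text10`, `Text11`, and `match_first_token` are illustrative names, and the "matching logic" is a trivial whitespace tokenizer rather than the PR's parser.

```rust
/// Hypothetical reduction of `MatchedRawTextValue`: position info only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Matched {
    offset: usize,
    length: usize,
}

/// Hypothetical version-specific value types wrapping the same match.
#[derive(Debug, PartialEq)]
struct Value10(Matched);
#[derive(Debug, PartialEq)]
struct Value11(Matched);

/// Hypothetical reduction of the `TextEncoding` trait.
trait TextEncodingSketch {
    type Value;
    fn value_from_matched(matched: Matched) -> Self::Value;
}

struct Text10;
struct Text11;

impl TextEncodingSketch for Text10 {
    type Value = Value10;
    fn value_from_matched(matched: Matched) -> Value10 {
        Value10(matched)
    }
}

impl TextEncodingSketch for Text11 {
    type Value = Value11;
    fn value_from_matched(matched: Matched) -> Value11 {
        Value11(matched)
    }
}

/// Shared logic written once: find the first whitespace-delimited token,
/// then let the encoding decide which value type wraps the match.
fn match_first_token<E: TextEncodingSketch>(input: &str) -> Option<E::Value> {
    let offset = input.find(|c: char| !c.is_whitespace())?;
    let rest = &input[offset..];
    let length = rest.find(char::is_whitespace).unwrap_or(rest.len());
    Some(E::value_from_matched(Matched { offset, length }))
}
```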

Comment on lines +252 to +268
// XXX: This `unsafe` is a workaround for https://github.com/rust-lang/rust/issues/70255
// There is a rustc fix for this limitation on the horizon. See:
// https://smallcultfollowing.com/babysteps/blog/2023/09/22/polonius-part-1/
// Indeed, using the experimental `-Zpolonius` flag on the nightly compiler allows the
// version of this code without this `unsafe` hack to work. The alternative to the
// hack is wrapping the SymbolTable in something like `RefCell`, which adds a small
// amount of overhead to each access. Given that the `SymbolTable` is on the hot
// path and that a fix is inbound, I think this use of `unsafe` is warranted.
// SAFETY: At this point, the only thing that's potentially holding references to
// the symbol table is the lazy value that represented an LST directive. We've
// already read through that value in full to populate the `PendingLst`. Updating
// the symbol table will invalidate data in that lazy value, so we just have to take
// care not to read from it after updating the symbol table.
let symbol_table = unsafe {
let mut_ptr = ptr as *mut SymbolTable;
&mut *mut_ptr
};

🗺️ I want to draw additional attention to this as it's a short-term hack. The linked blog post happened to be published after I'd been banging my head against the wall on this for a few hours.


🗺️ The LazySystemReader now wraps a LazyExpandedReader<D> instead of a LazyRawReader<D>.


🗺️ There are likely a lot of opportunities to DRY this up, sharing logic between some of the 1.0 and 1.1 container parsing methods. I'd like to leave that for a future PR.


@popematt popematt left a comment


Overall, I think it looks pretty good. There are a few things that look odd to me, and I've commented on them. Any of the comments related to DRY-ing or other refactoring are things that can be punted for later, though I would appreciate it if that one function with 4 levels of control-flow nesting could be broken up as part of this PR.

@codecov
codecov bot commented Sep 28, 2023

Codecov Report

Attention: 776 lines in your changes are missing coverage. Please review.

Files Coverage Δ
src/element/mod.rs 81.70% <100.00%> (+1.33%) ⬆️
src/element/reader.rs 91.34% <100.00%> (ø)
src/lazy/expanded/e_expression.rs 100.00% <100.00%> (ø)
src/lazy/reader.rs 75.63% <100.00%> (+0.84%) ⬆️
src/lazy/text/encoded_value.rs 90.97% <ø> (ø)
src/lazy/bytes_ref.rs 50.72% <0.00%> (-0.75%) ⬇️
src/lazy/expanded/template.rs 96.96% <96.96%> (ø)
src/symbol_table.rs 90.41% <0.00%> (-1.26%) ⬇️
src/lazy/binary/raw/sequence.rs 47.22% <60.00%> (+1.87%) ⬆️
src/lazy/sequence.rs 73.27% <75.00%> (+0.61%) ⬆️
... and 25 more

... and 2 files with indirect coverage changes

@zslayton zslayton merged commit 1d2d46b into main Sep 28, 2023
21 of 22 checks passed
@zslayton zslayton deleted the 1_1-text-reader branch September 28, 2023 19:54
zslayton added a commit that referenced this pull request Sep 30, 2023
* Initial implementation of 1.1 text reader

* Fixed private doc links

* Recursive expansion of TDL containers

* Relocate `From<LazyExpandedValue> for LazyValue` impls

* Incorporates feedback from PR #645

* Expanded doc comments

---------

Co-authored-by: Zack Slayton <zslayton@amazon.com>