Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow for complex schema to refer to previously named schema #53

Closed
wants to merge 12 commits into from

Conversation

stew
Copy link

@stew stew commented Sep 14, 2018

No description provided.

Kyle Phelps and others added 6 commits June 26, 2018 21:17
Instead of trying to resolve the references at parse time, we store references
in the AST, and then use the SchemaParseContext not only for parsing but also
for decoding. This allows us to support recursive type definitions.

I believe this patch should also include some improvements in how we handle
namespaces and aliases. Tests are still needed to prove aliases are working
correctly.

Tests are still needed for testing encoding.
@stew stew changed the title allow for complex schemata to refer to allow for complex schemata to refer to previously named schema Sep 14, 2018
@stew stew changed the title allow for complex schemata to refer to previously named schema allow for complex schema to refer to previously named schema Sep 14, 2018
@stew
Copy link
Author

stew commented Sep 14, 2018

I believe this allow for recursive schemata which admits thing such as linked lists and trees.

@flavray flavray self-requested a review September 15, 2018 08:08
Copy link
Owner

@flavray flavray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thank you for taking a stab at this, this is massive work!

I was not able to go through all the changes yet, but I left a few comments already, on the overall design of the change that I feel are worth discussing.

The fact that avro allows recursive types personally feels like madness to me, but heh, it's in the spec so we have to consider adding support for it, so I am truly grateful that you are doing this. 🙂

Value::Union(ref inner) if inner.as_ref() == &Value::Null => visitor.visit_none(),
Value::Union(ref inner) => visitor.visit_some(&mut Deserializer::new(inner)),
_ => Err(Error::custom("not a union")),
ref inner =>visitor.visit_some(&mut Deserializer::new(inner)),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing space after the =>

@@ -24,8 +25,8 @@ fn decode_len<R: Read>(reader: &mut R) -> Result<usize, Error> {
}

/// Decode a `Value` from avro format given its `Schema`.
pub fn decode<R: Read>(schema: &Schema, reader: &mut R) -> Result<Value, Error> {
match *schema {
pub fn decode<R: Read>(schema: &Arc<Schema>, reader: &mut R, context: &mut SchemaParseContext) -> Result<Value, Error> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to keep backwards compatibility, would it make sense to have a new function (e.g: decode_with_context) with this signature?

That way, we can keep a decode function with the previous signature, that would just call decode_with_context(schema, reader, <empty context>)

@@ -172,9 +172,10 @@ impl<'a, 'de> de::Deserializer<'de> for &'a mut Deserializer<'de> {
V: Visitor<'de>,
{
match *self.input {
Value::Null => visitor.visit_none(),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand why this is needed now. Would you mind giving some details? 🙂
The deserializer code is responsible for decoding data that exactly matches the Rust data type (all sorts of transformations can then be done via Value::resolve)

(Same for the ref inner 3 lines below)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure either, I tossed it in because it fixed a problem I was having deserializing an Option. If I can reproduce this, it deserves a unit test and to go into its own PR

@@ -24,8 +25,8 @@ fn decode_len<R: Read>(reader: &mut R) -> Result<usize, Error> {
}

/// Decode a `Value` from avro format given its `Schema`.
pub fn decode<R: Read>(schema: &Schema, reader: &mut R) -> Result<Value, Error> {
match *schema {
pub fn decode<R: Read>(schema: &Arc<Schema>, reader: &mut R, context: &mut SchemaParseContext) -> Result<Value, Error> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unclear to me why the &Schema had to be changed into Arc<Schema>, would you mind elaborating on this as well? 🙂

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I have done was to take extend the previous PR trying to accomplish this which made the transition from &Schema to Rc when we started temporarily additionally storing the schemata in the SchemaParseContext, and then I changed Rc to Arc because of the new requirement for Send to be implemented here:

https://gh/stew/avro-rs/blob/feature%2Frefernce-types/src/schema.rs#L1234

This is an area where my lack of experience with Rust is going to be obvious, so I welcome any advice, but it does seem to me that we should be able to go back to using references if we track the lifetimes all they way through so that we are proving that the additional references we are keeping only last for the lifetime of the decode operation. That was a battle I wasn't ready to wage with the typechecker when I was trying to get all this proved out, but I can take another stab at it.

@@ -152,5 +154,8 @@ pub fn decode<R: Read>(schema: &Schema, reader: &mut R) -> Result<Value, Error>
Err(DecodeError::new("enum symbol not found").into())
}
},
Schema::TypeReference(ref name) => context.lookup_type(name, &context)
.map_or_else(|| Err(DecodeError::new("enum symbol not found").into()),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe change enum into something else in the error message?

@@ -10,7 +11,11 @@ use util::{zig_i32, zig_i64};
/// be valid with regards to the schema. Schema are needed only to guide the
/// encoding for complex type values.
pub fn encode(value: &Value, schema: &Schema, buffer: &mut Vec<u8>) {
encode_ref(&value, schema, buffer)
encode_ref_inner(&value, &Arc::new(schema.clone()), buffer, &mut SchemaParseContext::new())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't cloning a schema every time a value is encoding end up being expensive?

We would have to run the benchmarks to see if there is a noticeable difference, but it seems like extra work is being done in this function

@@ -32,7 +37,7 @@ fn encode_int(i: i32, buffer: &mut Vec<u8>) {
/// **NOTE** This will not perform schema validation. The value is assumed to
/// be valid with regards to the schema. Schema are needed only to guide the
/// encoding for complex type values.
pub fn encode_ref(value: &Value, schema: &Schema, buffer: &mut Vec<u8>) {
pub(crate) fn encode_ref_inner(value: &Value, schema: &Arc<Schema>, buffer: &mut Vec<u8>, context: &mut SchemaParseContext) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for decode, would it make sense to keep encode_ref available somewhere, to avoid backwards-incompatible changes?

return Ok(())
},
Err(e) => if let ErrorKind::UnexpectedEof = e.downcast::<::std::io::Error>()?.kind() {
Err(e) =>{
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing space

/// A reference to a type defined in this schema
TypeReference(NameRef),
/// A `array` Avro schema.
/// `Array` holds a counted reference (`Rc`) to the `Schema` of its items.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rc -> Arc

/// A `map` Avro schema.
/// `Map` holds a pointer to the `Schema` of its values, which must all be the same schema.
/// `Map` keys are assumed to be `string`.
Map(Box<Schema>),
Map(Arc<Schema>),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also curious as to why Array and Map now need an Arc, instead of a simple Box. Would you mind elaborating a bit? 🙂

@stew
Copy link
Author

stew commented Oct 1, 2018

Wow, thanks for all your feedback 😄

I'm going to try to address all the items you brought up, might take me a day or two. I'm pretty new to rust, so I'm likely making some simple mistakes, so thanks for bearing with me. I also am expecting that I've negatively impacted performance and I haven't yet taken a look at how bad, I wanted to get some feedback on the approach before getting too hung up on any details, and I've probably taken some shortcuts with some negative performance impacts (like using Arc), which I'm happy to try my best to figure out.

Of course, the very first set of types I'm needing to encode includes a Tree, which is naturally a recursive data structure, which is where my motivation is coming from. In trying to further my own project while waiting for feedback from you, I've tossed some bugfixes into this branch which don't necessarily fit with this PR, I can make some effort to separate those and rebase those against master to simplify this PR if that's wanted.

@HarkonenBade
Copy link

I'd be very interested in this as well. As while the schema i'm using does not define recursive types, it does define some custom types (like a type for UUID values) which are defined at first use and then afterwards only referenced by name in the schemata (which was generated using avro tools from a idl spec).

@jackmead515
Copy link

Been a while. How's this pull going? Need some help? How far to completion?

@stew
Copy link
Author

stew commented Apr 18, 2020

closing in favor of #99 ; I'm thrilled that someone else is making an effort here, (thanks @GregBowyer :) )

I apologize that the conditions that got me motivated to get this working didn't exist for long :(

@stew stew closed this Apr 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants