Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3683: [Rust] Read/Write with multiple schemas #2014

Merged
merged 14 commits into from
Mar 13, 2023

Conversation

martin-g
Copy link
Member

@martin-g martin-g commented Dec 15, 2022

AVRO-3683

What is the purpose of the change

Make it possible to read/write Avro messages by using several schemata which depend on each other.
The new APIs provide methods which are similar to the single schema ones, but use a main schema and secondary ones.
The secondary ones are used to resolve any dependencies which are not in the main one.

Verifying this change

New tests are added.

Documentation

  • Does this pull request introduce a new feature? yes
  • Rustdoc is added/updated

@github-actions github-actions bot added the Rust label Dec 15, 2022
@martin-g
Copy link
Member Author

@markfarnan The impl is almost ready but as you can see the test in lang/rust/avro/tests/to_from_avro_datum_schemata.rs fails because both schemata validate against the read bytes.
This is a blocker :-/

@markfarnan
Copy link

Thanks,

It's also panicing with my test case that uses slightly more complicated structs (the ones in the) PR and a struct rather than manually constructed value.

I'll see if I can work out why/if its me, or I'll update the PR in the morning with test case that uses your new functions.

Here is the snippet that panics inside "to_avro_datum_schemata" in case usefull.

let record = MultiSchemaTestTypeA {
        b: Some(MultiSchemaTestTypeB {
            d: String::from("tom"),
            e: 451,
        }),
        c: default_multischematesttypea_c(),
    };

    let schemata: Vec<Schema> = Schema::parse_list(&[schema_TypeA, schema_TypeB]).unwrap();
    let schemata: Vec<&Schema> = schemata.iter().collect();

    let record_value = to_value(&record).unwrap();
    let actual = to_avro_datum_schemata(&schemata.as_slice(), record_value).unwrap();

@martin-g martin-g changed the title Avro 3683 multiple schemas AVRO-3683: [Rust] Read/Write with multiple schemas Dec 16, 2022
@martin-g
Copy link
Member Author

The only way I see to support this is to have a main schema, e.g.:

 let actual = to_avro_datum_schemata(&main_schema, &schemata.as_slice(), record_value).unwrap();

where main_schema is the schema that should be used for read/write and is a member of schemata.
I.e. #parse_list() returns a Vec<Schema>, the user picks the main one from them and passes all as a second argument for schema resolution.

@markfarnan
Copy link

That fits with how I was thinking it would need to be done, both for read and write.

The only alternative I can think of, would be to pass the Name of the record to be used, and let the resolver find it in the Schemata.

Either way, ideally there is an easy way to find the relevant schema in the Schemata slice.

@martin-g
Copy link
Member Author

@markfarnan I've re-worked it. Now the new roundtrip test passes.
Please give it a try!
I need to update the Reader/Writer APIs to support this too.

@markfarnan
Copy link

Awesome, Will do !

@markfarnan
Copy link

markfarnan commented Dec 16, 2022

Found a problem.

The order the Schema's given to parse_list seems to matter. If they are passed in descending order of reference, everything works.

If they are passed out of order for references to resolve forward, then avro_to_datum_schemata panics. (Parse List seems to manage fine with any order)

i.e. if you modify your test thus:, it will panic.

let schemata: Vec<Schema> = Schema::parse_list(&[SCHEMA_B_STR, SCHEMA_A_STR]).unwrap();
 let schemata: Vec<&Schema> = schemata.iter().collect();

 // this is the Schema we want to use for write/read
 let schema_b = schemata[0];

This will be a problem with large schema's. For my use case some Records have references that use up to 20+ schema's, guarenteeing they are provided in the right order could be a nightmare.

@markfarnan
Copy link

On the flip side, I tested with one slightly more complex schema, passed in the right order, and it seems to round trip correctly !
As that seems to work, I'll do some more complex tests and let you know.

@untereiner
Copy link

untereiner commented Dec 18, 2022

Hi! I take the conversation on the fly and I am wondering about a few things. If I understand it correctly, this implementation means that for a complex message (I.e with deep schemas recursion) the user of the API has to build a vector with all the schemas and track the root one. Right ?
What If the user gives an expanded schema ? I mean a schema where all named types are expanded to their real schema ?

@martin-g
Copy link
Member Author

If I understand it correctly, this implementation means that for a complex message (I.e with deep schemas recursion) the user of the API has to build a vector with all the schemas and track the root one. Right ?

This PR is about the use case when there are more than one schemata and they refer to each other.
When reading/writing the user has to provide the root/main schema as first parameter and all schemata as second parameter (to be able to resolve any references in the root/main one and all others).

@martin-g
Copy link
Member Author

What If the user gives an expanded schema ? I mean a schema where all named types are expanded to their real schema ?

Then the user could just use the current APIs with just one argument with the root/main schema.

@markfarnan
Copy link

Update: I've been testing this PR for our protocol schema's, and so far it works fine.

  • I went through and ordered the schemas by dependancy, so it no longer panics. I think being able to handle schema-resolution out of order would be a nice-to-have in the "ResolvedSchema::try_from(schemata)?;" section

  • All 208 schema's are now parsing and resolving correctly in a single pass. This covers all messages of the protocl.

  • For finding the 'root' schema's, I'm post procssing the output of Schema::parse_list to create a qualifiedname/schema map to make it easy to aquire a 'root' schema when sending messages. Works so far.

@untereiner
Copy link

Then the user could just use the current APIs with just one argument with the root/main schema.

ha ok! Do you think there is a performance difference between the two methods ? Considering the same schema, one time completely expanded and another time with this API

@martin-g
Copy link
Member Author

performance difference between the two methods

I am pretty sure the new (multiple schemata) methods will be slower than the old (single schema)!
How much ? - Only benchmarks can tell you!

@untereiner
Copy link

thanks @martin-g ! I will give it a try.

…te/write

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Use a main schema and pass all other schemata for resolution.

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Add support for multiple schemata in Reader/Writer APIs

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
It is much easier to deal with.

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
…o file

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
@martin-g martin-g marked this pull request as ready for review January 9, 2023 21:44
@martin-g
Copy link
Member Author

martin-g commented Jan 9, 2023

@markfarnan I think I am ready with this PR. Please review it and test it before I merge!

@martin-g
Copy link
Member Author

@markfarnan I see you thumped up my comment above. Could you please explicitly comment whether you have tested the changes ? Thanks!

@martin-g
Copy link
Member Author

@markfarnan Ping!

@markfarnan
Copy link

Checking this week. I'm still somewhat blocked by the missing upstream PR's, though I've got a temp workaround for now.

@chupaty
Copy link
Contributor

chupaty commented Feb 14, 2023

Just wanted to add my support for this PR from the sidelines - I hope it will unblock the use of rust in my org - we have hundreds of interdependent schemas and ran across this problem during evaluation.

Thanks for all your good work!

@martin-g
Copy link
Member Author

@chupaty Do you say that you have reviewed and tested the PR with your application(s) ?

@chupaty
Copy link
Contributor

chupaty commented Feb 17, 2023

I'd love to be able to commit more time to this. But my very brief comments are:

I initially tried running it against my dataset (~200 schemas), but ran into problems with ambiguous schema defs (note that my previous workflow of using to_avro_datum(...) still works as well as it did in 0.14.0).

Tried to reproduce the above with a minimal dataset (hacked test_avro_3683_schemata_writer_reader), but ran into problems when I changed the order of schemas loaded by Schema::parse_list, ie (switch schema 'a' and schema 'b'):

let schemata: Vec = Schema::parse_list(&[SCHEMA_B_STR, SCHEMA_A_STR]).unwrap();

I do have concerns about the structure of the schemata in my use case (ie lots of schemas). It seems like a fairly big value of N that the schemata O(N) search uses, plus potentially some big-ish Vecs being passed around.

I probably can't investigate much more in the short term, but will update when I can.

@martin-g
Copy link
Member Author

martin-g commented Mar 10, 2023

There is a discussion about releasing Avro 1.11.2 in the dev@ mailing list.
I still have no confirmation that the proposed changes in this PR work for the requesters ...

Also CC @woile @WaterKnight1998

@woile
Copy link
Contributor

woile commented Mar 10, 2023

LGTM, this doesn't really affect me, as the avdl parser needs to solve all the references before even dealing with values. But it's a good addition 👍🏻

@markfarnan
Copy link

Checking this over the weekend.

@markfarnan
Copy link

markfarnan commented Mar 11, 2023

@martin-g I confirm this works for me. Thanks !. - I think ready for merge and include in release

@martin-g martin-g merged commit b8b83b7 into master Mar 13, 2023
martin-g added a commit that referenced this pull request Mar 13, 2023
* AVRO-3683: Add support for using multiple schemata for resolve/validate/write

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: WIP

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: WIP compiles

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: WIP compiles and all tests pass

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: WIP Add support for reading

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: WIP

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: WIP

Use a main schema and pass all other schemata for resolution.

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: Formatting

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: WIP

Add support for multiple schemata in Reader/Writer APIs

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3646: Formatting

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: Use Vec instead of slice reference for schemata

It is much easier to deal with.

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: Fix the resolving of the writer schema when reading an Avro file

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

* AVRO-3683: Cleaup

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

---------

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>
(cherry picked from commit b8b83b7)
@martin-g martin-g deleted the avro-3683-multiple-schemas branch March 13, 2023 07:10
@martin-g
Copy link
Member Author

Thank you, @markfarnan !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants