AVRO-3683: [Rust] Read/Write with multiple schemas #2014

martin-g · 2022-12-15T14:03:35Z

What is the purpose of the change

Make it possible to read/write Avro messages by using several schemata which depend on each other.
The new APIs provide methods which are similar to the single schema ones, but use a main schema and secondary ones.
The secondary ones are used to resolve any dependencies which are not in the main one.

Verifying this change

New tests are added.

Documentation

Does this pull request introduce a new feature? yes
Rustdoc is added/updated

martin-g · 2022-12-15T14:06:05Z

@markfarnan The impl is almost ready but as you can see the test in lang/rust/avro/tests/to_from_avro_datum_schemata.rs fails because both schemata validate against the read bytes.
This is a blocker :-/

markfarnan · 2022-12-15T22:45:34Z

Thanks,

It's also panicing with my test case that uses slightly more complicated structs (the ones in the) PR and a struct rather than manually constructed value.

I'll see if I can work out why/if its me, or I'll update the PR in the morning with test case that uses your new functions.

Here is the snippet that panics inside "to_avro_datum_schemata" in case usefull.

let record = MultiSchemaTestTypeA {
        b: Some(MultiSchemaTestTypeB {
            d: String::from("tom"),
            e: 451,
        }),
        c: default_multischematesttypea_c(),
    };

    let schemata: Vec<Schema> = Schema::parse_list(&[schema_TypeA, schema_TypeB]).unwrap();
    let schemata: Vec<&Schema> = schemata.iter().collect();

    let record_value = to_value(&record).unwrap();
    let actual = to_avro_datum_schemata(&schemata.as_slice(), record_value).unwrap();

martin-g · 2022-12-16T08:06:44Z

The only way I see to support this is to have a main schema, e.g.:

 let actual = to_avro_datum_schemata(&main_schema, &schemata.as_slice(), record_value).unwrap();

where main_schema is the schema that should be used for read/write and is a member of schemata.
I.e. #parse_list() returns a Vec<Schema>, the user picks the main one from them and passes all as a second argument for schema resolution.

markfarnan · 2022-12-16T09:17:45Z

That fits with how I was thinking it would need to be done, both for read and write.

The only alternative I can think of, would be to pass the Name of the record to be used, and let the resolver find it in the Schemata.

Either way, ideally there is an easy way to find the relevant schema in the Schemata slice.

martin-g · 2022-12-16T11:20:14Z

@markfarnan I've re-worked it. Now the new roundtrip test passes.
Please give it a try!
I need to update the Reader/Writer APIs to support this too.

markfarnan · 2022-12-16T14:55:44Z

Awesome, Will do !

markfarnan · 2022-12-16T22:28:58Z

Found a problem.

The order the Schema's given to parse_list seems to matter. If they are passed in descending order of reference, everything works.

If they are passed out of order for references to resolve forward, then avro_to_datum_schemata panics. (Parse List seems to manage fine with any order)

i.e. if you modify your test thus:, it will panic.

let schemata: Vec<Schema> = Schema::parse_list(&[SCHEMA_B_STR, SCHEMA_A_STR]).unwrap();
 let schemata: Vec<&Schema> = schemata.iter().collect();

 // this is the Schema we want to use for write/read
 let schema_b = schemata[0];

This will be a problem with large schema's. For my use case some Records have references that use up to 20+ schema's, guarenteeing they are provided in the right order could be a nightmare.

markfarnan · 2022-12-16T22:32:33Z

On the flip side, I tested with one slightly more complex schema, passed in the right order, and it seems to round trip correctly !
As that seems to work, I'll do some more complex tests and let you know.

untereiner · 2022-12-18T14:13:46Z

Hi! I take the conversation on the fly and I am wondering about a few things. If I understand it correctly, this implementation means that for a complex message (I.e with deep schemas recursion) the user of the API has to build a vector with all the schemas and track the root one. Right ?
What If the user gives an expanded schema ? I mean a schema where all named types are expanded to their real schema ?

martin-g · 2022-12-18T19:28:27Z

If I understand it correctly, this implementation means that for a complex message (I.e with deep schemas recursion) the user of the API has to build a vector with all the schemas and track the root one. Right ?

This PR is about the use case when there are more than one schemata and they refer to each other.
When reading/writing the user has to provide the root/main schema as first parameter and all schemata as second parameter (to be able to resolve any references in the root/main one and all others).

martin-g · 2022-12-18T19:29:18Z

What If the user gives an expanded schema ? I mean a schema where all named types are expanded to their real schema ?

Then the user could just use the current APIs with just one argument with the root/main schema.

markfarnan · 2022-12-18T20:14:33Z

Update: I've been testing this PR for our protocol schema's, and so far it works fine.

I went through and ordered the schemas by dependancy, so it no longer panics. I think being able to handle schema-resolution out of order would be a nice-to-have in the "ResolvedSchema::try_from(schemata)?;" section
All 208 schema's are now parsing and resolving correctly in a single pass. This covers all messages of the protocl.
For finding the 'root' schema's, I'm post procssing the output of Schema::parse_list to create a qualifiedname/schema map to make it easy to aquire a 'root' schema when sending messages. Works so far.

untereiner · 2022-12-19T16:03:09Z

Then the user could just use the current APIs with just one argument with the root/main schema.

ha ok! Do you think there is a performance difference between the two methods ? Considering the same schema, one time completely expanded and another time with this API

martin-g · 2022-12-19T21:23:43Z

performance difference between the two methods

I am pretty sure the new (multiple schemata) methods will be slower than the old (single schema)!
How much ? - Only benchmarks can tell you!

untereiner · 2022-12-20T08:28:46Z

thanks @martin-g ! I will give it a try.

…te/write Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

Use a main schema and pass all other schemata for resolution. Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

Add support for multiple schemata in Reader/Writer APIs Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

It is much easier to deal with. Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

…o file Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

martin-g · 2023-01-09T21:45:13Z

@markfarnan I think I am ready with this PR. Please review it and test it before I merge!

martin-g · 2023-01-20T10:23:30Z

@markfarnan I see you thumped up my comment above. Could you please explicitly comment whether you have tested the changes ? Thanks!

martin-g · 2023-01-30T14:17:58Z

@markfarnan Ping!

markfarnan · 2023-02-01T09:32:42Z

Checking this week. I'm still somewhat blocked by the missing upstream PR's, though I've got a temp workaround for now.

chupaty · 2023-02-14T23:47:50Z

Just wanted to add my support for this PR from the sidelines - I hope it will unblock the use of rust in my org - we have hundreds of interdependent schemas and ran across this problem during evaluation.

Thanks for all your good work!

martin-g · 2023-02-15T06:36:15Z

@chupaty Do you say that you have reviewed and tested the PR with your application(s) ?

chupaty · 2023-02-17T00:49:58Z

I'd love to be able to commit more time to this. But my very brief comments are:

I initially tried running it against my dataset (~200 schemas), but ran into problems with ambiguous schema defs (note that my previous workflow of using to_avro_datum(...) still works as well as it did in 0.14.0).

Tried to reproduce the above with a minimal dataset (hacked test_avro_3683_schemata_writer_reader), but ran into problems when I changed the order of schemas loaded by Schema::parse_list, ie (switch schema 'a' and schema 'b'):

let schemata: Vec = Schema::parse_list(&[SCHEMA_B_STR, SCHEMA_A_STR]).unwrap();

I do have concerns about the structure of the schemata in my use case (ie lots of schemas). It seems like a fairly big value of N that the schemata O(N) search uses, plus potentially some big-ish Vecs being passed around.

I probably can't investigate much more in the short term, but will update when I can.

martin-g · 2023-03-10T08:13:18Z

There is a discussion about releasing Avro 1.11.2 in the dev@ mailing list.
I still have no confirmation that the proposed changes in this PR work for the requesters ...

Also CC @woile @WaterKnight1998

lang/rust/avro/src/reader.rs

woile · 2023-03-10T09:27:29Z

LGTM, this doesn't really affect me, as the avdl parser needs to solve all the references before even dealing with values. But it's a good addition 👍🏻

markfarnan · 2023-03-10T12:31:00Z

Checking this over the weekend.

markfarnan · 2023-03-11T12:41:59Z

@martin-g I confirm this works for me. Thanks !. - I think ready for merge and include in release

* AVRO-3683: Add support for using multiple schemata for resolve/validate/write Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: WIP Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: WIP compiles Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: WIP compiles and all tests pass Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: WIP Add support for reading Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: WIP Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: WIP Use a main schema and pass all other schemata for resolution. Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: Formatting Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: WIP Add support for multiple schemata in Reader/Writer APIs Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3646: Formatting Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: Use Vec instead of slice reference for schemata It is much easier to deal with. Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: Fix the resolving of the writer schema when reading an Avro file Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> * AVRO-3683: Cleaup Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> --------- Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org> (cherry picked from commit b8b83b7)

martin-g · 2023-03-13T07:10:32Z

Thank you, @markfarnan !

github-actions bot added the Rust label Dec 15, 2022

martin-g force-pushed the avro-3683-multiple-schemas branch from cfe8512 to 2274338 Compare December 16, 2022 07:56

martin-g changed the title ~~Avro 3683 multiple schemas~~ AVRO-3683: [Rust] Read/Write with multiple schemas Dec 16, 2022

martin-g force-pushed the avro-3683-multiple-schemas branch from 992cf06 to 9e1f6ec Compare January 4, 2023 12:55

martin-g added 10 commits January 9, 2023 14:42

AVRO-3683: Add support for using multiple schemata for resolve/valida…

c894e28

…te/write Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: WIP

3ffb96a

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: WIP compiles

dd79b47

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: WIP compiles and all tests pass

6bec2cf

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: WIP Add support for reading

1dcad92

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: WIP

4872040

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: WIP

6cf4263

Use a main schema and pass all other schemata for resolution. Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: Formatting

51065a4

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: WIP

2f54f23

Add support for multiple schemata in Reader/Writer APIs Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3646: Formatting

777cf97

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

martin-g force-pushed the avro-3683-multiple-schemas branch from 22fe1e1 to 777cf97 Compare January 9, 2023 12:42

martin-g added 3 commits January 9, 2023 16:24

AVRO-3683: Use Vec instead of slice reference for schemata

8f331c3

It is much easier to deal with. Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: Fix the resolving of the writer schema when reading an Avr…

e0f0516

…o file Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

AVRO-3683: Cleaup

de6655d

Signed-off-by: Martin Tzvetanov Grigorov <mgrigorov@apache.org>

martin-g marked this pull request as ready for review January 9, 2023 21:44

Merge branch 'master' into avro-3683-multiple-schemas

001c7f4

woile reviewed Mar 10, 2023

View reviewed changes

lang/rust/avro/src/reader.rs Show resolved Hide resolved

martin-g merged commit b8b83b7 into master Mar 13, 2023

martin-g deleted the avro-3683-multiple-schemas branch March 13, 2023 07:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AVRO-3683: [Rust] Read/Write with multiple schemas #2014

AVRO-3683: [Rust] Read/Write with multiple schemas #2014

martin-g commented Dec 15, 2022 •

edited

Loading

martin-g commented Dec 15, 2022

markfarnan commented Dec 15, 2022

martin-g commented Dec 16, 2022

markfarnan commented Dec 16, 2022

martin-g commented Dec 16, 2022

markfarnan commented Dec 16, 2022

markfarnan commented Dec 16, 2022 •

edited

Loading

markfarnan commented Dec 16, 2022

untereiner commented Dec 18, 2022 •

edited

Loading

martin-g commented Dec 18, 2022

martin-g commented Dec 18, 2022

markfarnan commented Dec 18, 2022

untereiner commented Dec 19, 2022

martin-g commented Dec 19, 2022

untereiner commented Dec 20, 2022

martin-g commented Jan 9, 2023

martin-g commented Jan 20, 2023

martin-g commented Jan 30, 2023

markfarnan commented Feb 1, 2023

chupaty commented Feb 14, 2023

martin-g commented Feb 15, 2023

chupaty commented Feb 17, 2023 •

edited

Loading

martin-g commented Mar 10, 2023 •

edited

Loading

woile commented Mar 10, 2023 •

edited

Loading

markfarnan commented Mar 10, 2023

markfarnan commented Mar 11, 2023 •

edited

Loading

martin-g commented Mar 13, 2023

AVRO-3683: [Rust] Read/Write with multiple schemas #2014

AVRO-3683: [Rust] Read/Write with multiple schemas #2014

Conversation

martin-g commented Dec 15, 2022 • edited Loading

What is the purpose of the change

Verifying this change

Documentation

martin-g commented Dec 15, 2022

markfarnan commented Dec 15, 2022

martin-g commented Dec 16, 2022

markfarnan commented Dec 16, 2022

martin-g commented Dec 16, 2022

markfarnan commented Dec 16, 2022

markfarnan commented Dec 16, 2022 • edited Loading

markfarnan commented Dec 16, 2022

untereiner commented Dec 18, 2022 • edited Loading

martin-g commented Dec 18, 2022

martin-g commented Dec 18, 2022

markfarnan commented Dec 18, 2022

untereiner commented Dec 19, 2022

martin-g commented Dec 19, 2022

untereiner commented Dec 20, 2022

martin-g commented Jan 9, 2023

martin-g commented Jan 20, 2023

martin-g commented Jan 30, 2023

markfarnan commented Feb 1, 2023

chupaty commented Feb 14, 2023

martin-g commented Feb 15, 2023

chupaty commented Feb 17, 2023 • edited Loading

martin-g commented Mar 10, 2023 • edited Loading

woile commented Mar 10, 2023 • edited Loading

markfarnan commented Mar 10, 2023

markfarnan commented Mar 11, 2023 • edited Loading

martin-g commented Mar 13, 2023

martin-g commented Dec 15, 2022 •

edited

Loading

markfarnan commented Dec 16, 2022 •

edited

Loading

untereiner commented Dec 18, 2022 •

edited

Loading

chupaty commented Feb 17, 2023 •

edited

Loading

martin-g commented Mar 10, 2023 •

edited

Loading

woile commented Mar 10, 2023 •

edited

Loading

markfarnan commented Mar 11, 2023 •

edited

Loading