
AVRO-3240: fix deserializer schema backward compatibility #1379

Merged — 3 commits merged into apache:master on Jan 7, 2022

Conversation

@ultrabug (Contributor) commented Oct 21, 2021

When providing your own schema to from_avro_datum,
deserialization is not backward compatible with messages
written with a previous schema version, even when the schemas
are designed to be backward compatible.

This is due to the decode_variable function in utils returning
Error::ReadVariableIntegerBytes when the reader object is
smaller than expected and thus cannot fill the read_exact buffer.

When reading a message generated with an older schema version than
the one we are statically using, the payload buffer is inherently
smaller than the expected payload of a message serialized with
the newer schema (backward compatible schemas are essentially append
only).

The proposed fix takes that into account and just returns an empty
Ok(), which is interpreted as null. This is in line with the
fact that a backward compatible schema has to allow null
values for the newly added fields.
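
For illustration only, here is a minimal, self-contained sketch of the idea (the helper name is hypothetical and this is not the actual library patch): when the reader is already exhausted while decoding a field that exists only in the newer schema, EOF is treated as "field absent" rather than an error, so the value can fall back to null.

use std::io::Read;

// Hypothetical helper: read the union discriminant byte if any bytes remain.
fn read_union_discriminant<R: Read>(reader: &mut R) -> std::io::Result<Option<u8>> {
    let mut buf = [0u8; 1];
    match reader.read(&mut buf)? {
        0 => Ok(None),         // EOF: the older writer never encoded this field
        _ => Ok(Some(buf[0])), // discriminant byte is present
    }
}

fn main() -> std::io::Result<()> {
    let mut empty: &[u8] = &[];
    assert_eq!(read_union_discriminant(&mut empty)?, None); // falls back to "absent"/null
    let mut one: &[u8] = &[0x02];
    assert_eq!(read_union_discriminant(&mut one)?, Some(0x02));
    Ok(())
}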

Make sure you have checked all steps below.

Jira

Tests

  • I have implemented a test to demonstrate the problem

@github-actions bot added the Rust label on Oct 21, 2021
@martin-g (Member)

* I'd like comments before working on tests

I cannot comment without seeing what exactly does not work :-/

@ultrabug (Contributor, Author)

I cannot comment without seeing what exactly does not work :-/

Ok I'm sorry if my description is not clear enough. Let's see if my Rust is good enough to write a great test now 👍

@ultrabug (Contributor, Author) commented Oct 25, 2021

Ok @martin-g, in my quest to write a test for this (I have failed so far) I found something that I clearly do not understand but which triggers the same error as my initial problem here.

Disclaimer: I'm not sure if they are related, but since I don't understand how to fix the failing test below, I dunno...

Basically, you can't call from_avro_datum multiple times on the same reader:

#[test]
fn test_from_avro_datum_multiple() {
    let schema = Schema::parse_str(SCHEMA).unwrap();
    let mut encoded: &'static [u8] = &[54, 6, 102, 111, 111];

    let mut record = Record::new(&schema).unwrap();
    record.put("a", 27i64);
    record.put("b", "foo");
    let expected = record.into();

    // first call consumes the &[u8] reader
    from_avro_datum(&schema, &mut encoded, None).unwrap();

    // second call on the same, now-exhausted reader
    assert_eq!(
        from_avro_datum(&schema, &mut encoded, None).unwrap(),
        expected
    );
}
Gives

thread 'reader::tests::test_from_avro_datum_multiple' panicked at 'called `Result::unwrap()` on an `Err` value: ReadVariableIntegerBytes(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', src/reader.rs:408:54

@ultrabug (Contributor, Author)

@martin-g I finally tracked down the problem to the decode function and found a way to fix my problem properly (that is, without breaking other tests), and I am now providing an example test to demonstrate the problem.

The root cause is that decode does not handle its reader's EOF properly for all Schema variants.

I fixed the boolean and union variants to showcase the fix and discuss it.

FYI @gklijs @PauloAOliveira

@martin-g (Member)

thread 'reader::tests::test_from_avro_datum_multiple' panicked at 'called Result::unwrap() on an Err value: ReadVariableIntegerBytes(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', src/reader.rs:408:54

This is expected. You try to unwrap a None the second time.

@martin-g (Member)

I finally tracked down the problem to the decode function and found the way to fix my problem properly (that is, without breaking other tests) and I am now providing an example test to demonstrate the problem.

Cool! I will take a look soon!

@ultrabug (Contributor, Author)

This is expected. You try to unwrap a None the second time.

I'm sorry, but I'm not sure I'm following you here. Is calling the same function twice with the same parameters expected to fail?

My understanding is that Rust's &[u8] slice does implement the Read trait, but there is no way to seek its position back to 0, which means that the second time we use it we are already at the end position of the array reader.

For instance, using a Cursor and resetting its position with set_position(0) between the two from_avro_datum calls makes the test above work just fine.

@martin-g (Member)

You understood it correctly! With Read you consumed the bytes on the first read. On the second attempt it returns EOF and thus the None.

https://doc.rust-lang.org/std/io/trait.Read.html#impl-Read-2 says

Read is implemented for &[u8] by copying from the slice.

Note that reading updates the slice to point to the yet unread part. The slice will be empty when EOF is reached.

If you want to read the same data twice you need to reset the source. This is what the Seek trait gives you.
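
As a std-only illustration of the difference (names here are illustrative, nothing is from the avro crate): a Cursor implements Seek, so its position can be rewound before a second read, whereas a plain &[u8] reader is simply consumed.

use std::io::{Cursor, Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let bytes = [54u8, 6, 102, 111, 111];
    let mut reader = Cursor::new(&bytes[..]);

    let mut first = Vec::new();
    reader.read_to_end(&mut first)?; // consumes all bytes, position is now at EOF

    reader.seek(SeekFrom::Start(0))?; // rewind; equivalent to set_position(0)

    let mut second = Vec::new();
    reader.read_to_end(&mut second)?; // reads the same data again
    assert_eq!(first, second);
    Ok(())
}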

Review comments (now resolved) were left on lang/rust/src/decode.rs and lang/rust/src/reader.rs.
@ultrabug (Contributor, Author) commented Nov 9, 2021

Sorry for the late reply @martin-g I took some time off, now back to business!

I pushed the corrections you asked for.

@martin-g self-assigned this on Jan 7, 2022
@martin-g (Member) commented Jan 7, 2022

I've checked out your PR locally, but I am not able to push to your branch.
I am not sure whether it is because I'm not doing something right or it is ASF policy...

Anyway, I've fixed the formatting and now I will have to squash all commits for this PR into one and then push it to master.
The problem is that this way you won't appear in the commit log. All your work will be credited to me... :-/
Any concerns?

@martin-g (Member) commented Jan 7, 2022

Actually, I could merge the PR and then fix the formatting in a follow up commit!

@martin-g merged commit 7584320 into apache:master on Jan 7, 2022
@martin-g (Member) commented Jan 7, 2022

Done!
Thank you for the contribution, @ultrabug !

@ultrabug (Contributor, Author) commented Jan 7, 2022

Thank you @martin-g :) with #1368 after this one I'll be close to being able to switch to mainstream Rust avro 👍

@martin-g (Member) commented Jan 7, 2022

Done!

martin-g pushed a commit that referenced this pull request Feb 1, 2022
* AVRO-3240: fix decoding schema when reaching the end of the reader

* AVRO-3240: add from_avro_datum test for reader EOF

* AVRO-3240: add reader eof handling to Schema::String

(cherry picked from commit 7584320)
@Ten0 (Contributor) commented Feb 1, 2023

I don't understand why this is required: according to the spec, the exact writer schema should be made available at deserialization, while this feature seems to be about deserializing while specifying a different "writer schema" than the data was actually written with.

@martin-g (Member) commented Feb 7, 2023

@Ten0 a schema that differs from the writer schema is actually called a reader schema. It is used for schema evolution.

@Ten0 (Contributor) commented Feb 7, 2023

Yes, and when using that we normally still need to specify the actual writer schema; the reader schema is provided as a separate argument.
But this feature seems to be about deserializing with a schema other than the actual writer schema passed as the writer_schema, cf. this test:

let mut encoded: &'static [u8] = &[54, 6, 102, 111, 111];
let expected_record: TestRecord3240 = TestRecord3240 {
    a: 27i64,
    b: String::from("foo"),
    a_nullable_array: None,
    a_nullable_string: None,
};
let avro_datum = from_avro_datum(&schema, &mut encoded, None).unwrap();
let parsed_record: TestRecord3240 = match &avro_datum {
    Value::Record(_) => from_value::<TestRecord3240>(&avro_datum).unwrap(),
    unexpected => {
        panic!("could not map avro data to struct, found unexpected: {unexpected:?}")
    }
};

I can't find any mention in the spec of schema evolution that is supposed to work this way.

There is something similar called "schema compatibility", but it essentially says "if your data happens to be wire compatible that's fine":

it is possible, though not advisable, to read Avro data with a schema that does not have the same Parsing Canonical Form as the schema with which the data was written. In order for this to work, the serialized primitive values must be compatible, in order value by value, with the items in the deserialization schema.

and that's not the case here, because the spec also says null fields should normally still be serialized with the union discriminant, even at the end of the record:

default: [...] only used [...] for schema evolution purposes. The presence of a default value does not make the field optional at encoding time

where schema evolution refers, as you mentioned, to the reader schema process.
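
For illustration, assuming a pair of made-up schemas in the spirit of the test above: even when the newly added field is null, the Avro binary encoding normally still writes the union discriminant (a zigzag-encoded branch index), so data written with the new schema is not byte-identical to data written with the old one.

// Schemas below are invented for this example, not taken from the PR.
const OLD_SCHEMA: &str = r#"{"type":"record","name":"TestRecord3240","fields":[
    {"name":"a","type":"long"},
    {"name":"b","type":"string"}]}"#;

const NEW_SCHEMA: &str = r#"{"type":"record","name":"TestRecord3240","fields":[
    {"name":"a","type":"long"},
    {"name":"b","type":"string"},
    {"name":"a_nullable_string","type":["null","string"],"default":null}]}"#;

fn main() {
    // Old writer encodes a=27, b="foo" as:   [54, 6, 102, 111, 111]
    // New writer encodes the same values as: [54, 6, 102, 111, 111, 0]
    //   the trailing 0 is the union branch index selecting "null".
    // The PR makes the reader tolerate the missing trailing byte(s); the spec
    // quote above is about whether a writer may omit them in the first place.
    println!("old:\n{OLD_SCHEMA}\nnew:\n{NEW_SCHEMA}");
}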
