
AVRO-3240: fix deserializer schema backward compatibility #1379

Merged — 3 commits merged into apache:master on Jan 7, 2022

Conversation

@ultrabug (Contributor) commented Oct 21, 2021

When providing your own schema to from_avro_datum,
deserialization is not backward compatible with messages
written with a previous schema version, even when the schemas
are designed to be backward compatible.

This is due to the decode_variable function in utils returning
Error::ReadVariableIntegerBytes when the reader object is
smaller than expected and thus cannot fill the read_exact buffer.

When reading a message generated with an older schema version than
the one we are statically using, the payload buffer is inherently
smaller than the expected payload of a message serialized with
the newer schema (backward compatible schemas are essentially append
only).

The proposed fix takes that into account and just returns an empty
Ok(), which is interpreted as null. This is in line with the
fact that a backward compatible schema has to allow null
values for the newly added fields.
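
For illustration only, here is a minimal, self-contained sketch of the idea (the helper name is hypothetical and this is not the actual library patch): when the reader is already exhausted while decoding a field that exists only in the newer schema, EOF is treated as "field absent" rather than an error, so the value can fall back to null.

use std::io::Read;

// Hypothetical helper: read the union discriminant byte if any bytes remain.
fn read_union_discriminant<R: Read>(reader: &mut R) -> std::io::Result<Option<u8>> {
    let mut buf = [0u8; 1];
    match reader.read(&mut buf)? {
        0 => Ok(None),         // EOF: the older writer never encoded this field
        _ => Ok(Some(buf[0])), // discriminant byte is present
    }
}

fn main() -> std::io::Result<()> {
    let mut empty: &[u8] = &[];
    assert_eq!(read_union_discriminant(&mut empty)?, None); // falls back to "absent"/null
    let mut one: &[u8] = &[0x02];
    assert_eq!(read_union_discriminant(&mut one)?, Some(0x02));
    Ok(())
}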

Make sure you have checked all steps below.

Jira

Tests

  • I have implemented a test to demonstrate the problem

@github-actions bot added the Rust label on Oct 21, 2021
@martin-g (Member)

* I'd like comments before working on tests

I cannot comment without seeing what exactly does not work :-/

@ultrabug (Contributor, Author)

I cannot comment without seeing what exactly does not work :-/

Ok I'm sorry if my description is not clear enough. Let's see if my Rust is good enough to write a great test now 👍

@ultrabug (Contributor, Author) commented Oct 25, 2021

Ok @martin-g, in my quest to write a test for this (I have failed so far) I found something that I clearly do not understand but which triggers the same error as my initial problem here.

Disclaimer: I'm not sure if they are related, but since I don't understand how to fix the failing test below, I dunno...

Basically, you can't call from_avro_datum multiple times on the same reader:

#[test]
fn test_from_avro_datum_multiple() {
    let schema = Schema::parse_str(SCHEMA).unwrap();
    let mut encoded: &'static [u8] = &[54, 6, 102, 111, 111];

    let mut record = Record::new(&schema).unwrap();
    record.put("a", 27i64);
    record.put("b", "foo");
    let expected = record.into();

    // first call consumes the &[u8] reader
    from_avro_datum(&schema, &mut encoded, None).unwrap();

    // second call on the same, now-exhausted reader
    assert_eq!(
        from_avro_datum(&schema, &mut encoded, None).unwrap(),
        expected
    );
}
Gives

thread 'reader::tests::test_from_avro_datum_multiple' panicked at 'called `Result::unwrap()` on an `Err` value: ReadVariableIntegerBytes(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', src/reader.rs:408:54

@ultrabug (Contributor, Author)

@martin-g I finally tracked down the problem to the decode function and found a way to fix my problem properly (that is, without breaking other tests), and I am now providing an example test to demonstrate the problem.

The root cause is that decode does not handle its reader's EOF properly for all Schema variants.

I fixed the boolean and union variants to showcase the fix and discuss it.

FYI @gklijs @PauloAOliveira

@martin-g (Member)

thread 'reader::tests::test_from_avro_datum_multiple' panicked at 'called Result::unwrap() on an Err value: ReadVariableIntegerBytes(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })', src/reader.rs:408:54

This is expected. You try to unwrap a None the second time.

@martin-g (Member)

I finally tracked down the problem to the decode function and found the way to fix my problem properly (that is, without breaking other tests) and I am now providing an example test to demonstrate the problem.

Cool! I will take a look soon!

@ultrabug (Contributor, Author)

This is expected. You try to unwrap a None the second time.

I'm sorry, but I'm not sure I'm following you here. Is calling the same function twice with the same parameters expected to fail?

My understanding is that Rust's &[u8] slice does implement the Read trait, but there is no way to seek its position back to 0, which means that the second time we use it we are already at the end position of the array reader.

For instance, using a Cursor and resetting its position with set_position(0) between the two from_avro_datum calls makes the test above work just fine.

@martin-g (Member)

You understood it correctly! With Read you consumed the bytes on the first read. On the second attempt it returns EOF and thus the None.

https://doc.rust-lang.org/std/io/trait.Read.html#impl-Read-2 says

Read is implemented for &[u8] by copying from the slice.

Note that reading updates the slice to point to the yet unread part. The slice will be empty when EOF is reached.

If you want to read the same data twice you need to reset the source. This is what the Seek trait gives you.
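
As a std-only illustration of the difference (names here are illustrative, nothing is from the avro crate): a Cursor implements Seek, so its position can be rewound before a second read, whereas a plain &[u8] reader is simply consumed.

use std::io::{Cursor, Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let bytes = [54u8, 6, 102, 111, 111];
    let mut reader = Cursor::new(&bytes[..]);

    let mut first = Vec::new();
    reader.read_to_end(&mut first)?; // consumes all bytes, position is now at EOF

    reader.seek(SeekFrom::Start(0))?; // rewind; equivalent to set_position(0)

    let mut second = Vec::new();
    reader.read_to_end(&mut second)?; // reads the same data again
    assert_eq!(first, second);
    Ok(())
}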

Review comments (now resolved) were left on lang/rust/src/decode.rs and lang/rust/src/reader.rs.
@ultrabug (Contributor, Author) commented Nov 9, 2021

Sorry for the late reply @martin-g I took some time off, now back to business!

I pushed the corrections you asked for.

@martin-g self-assigned this on Jan 7, 2022
@martin-g (Member) commented Jan 7, 2022

I've checked out your PR locally, but I am not able to push to your branch.
I am not sure whether it is because I'm not doing something right or it is ASF policy...

Anyway, I've fixed the formatting and now I will have to squash all commits for this PR into one and then push it to master.
The problem is that this way you won't appear in the commit log. All your work will be credited to me... :-/
Any concerns?

@martin-g (Member) commented Jan 7, 2022

Actually, I could merge the PR and then fix the formatting in a follow up commit!

@martin-g merged commit 7584320 into apache:master on Jan 7, 2022
@martin-g (Member) commented Jan 7, 2022

Done!
Thank you for the contribution, @ultrabug !

@ultrabug (Contributor, Author) commented Jan 7, 2022

Thank you @martin-g :) with #1368 after this one I'll be close to being able to switch to mainstream Rust avro 👍

@martin-g (Member) commented Jan 7, 2022

Done!

martin-g pushed a commit that referenced this pull request Feb 1, 2022
* AVRO-3240: fix decoding schema when reaching the end of the reader

* AVRO-3240: add from_avro_datum test for reader EOF

* AVRO-3240: add reader eof handling to Schema::String

(cherry picked from commit 7584320)
@Ten0 (Contributor) commented Feb 1, 2023

I don't understand why this is required: according to the spec, the exact writer schema should be made available at deserialization, while this feature seems to be about deserializing while specifying a different "writer schema" than the data was actually written with.

@martin-g (Member) commented Feb 7, 2023

@Ten0 a schema that differs from the writer schema is actually called a reader schema. It is used for schema evolution.

@Ten0 (Contributor) commented Feb 7, 2023

Yes, and when using that we normally still need to specify the actual writer schema; the reader schema is provided as a separate argument.
But this feature seems to be about deserializing with a schema other than the actual writer schema passed as the writer_schema, cf. this test:

let mut encoded: &'static [u8] = &[54, 6, 102, 111, 111];
let expected_record: TestRecord3240 = TestRecord3240 {
    a: 27i64,
    b: String::from("foo"),
    a_nullable_array: None,
    a_nullable_string: None,
};
let avro_datum = from_avro_datum(&schema, &mut encoded, None).unwrap();
let parsed_record: TestRecord3240 = match &avro_datum {
    Value::Record(_) => from_value::<TestRecord3240>(&avro_datum).unwrap(),
    unexpected => {
        panic!("could not map avro data to struct, found unexpected: {unexpected:?}")
    }
};

I can't find any mention in the spec of schema evolution that is supposed to work this way.

There is something similar called "schema compatibility", but it essentially says "if your data happens to be wire compatible that's fine":

it is possible, though not advisable, to read Avro data with a schema that does not have the same Parsing Canonical Form as the schema with which the data was written. In order for this to work, the serialized primitive values must be compatible, in order value by value, with the items in the deserialization schema.

and that's not the case here, because the spec also says null fields should normally still be serialized with the union discriminant, even at the end of the record:

default: [...] only used [...] for schema evolution purposes. The presence of a default value does not make the field optional at encoding time

where schema evolution refers, as you mentioned, to the reader schema process.
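
For illustration, assuming a pair of made-up schemas in the spirit of the test above: even when the newly added field is null, the Avro binary encoding normally still writes the union discriminant (a zigzag-encoded branch index), so data written with the new schema is not byte-identical to data written with the old one.

// Schemas below are invented for this example, not taken from the PR.
const OLD_SCHEMA: &str = r#"{"type":"record","name":"TestRecord3240","fields":[
    {"name":"a","type":"long"},
    {"name":"b","type":"string"}]}"#;

const NEW_SCHEMA: &str = r#"{"type":"record","name":"TestRecord3240","fields":[
    {"name":"a","type":"long"},
    {"name":"b","type":"string"},
    {"name":"a_nullable_string","type":["null","string"],"default":null}]}"#;

fn main() {
    // Old writer encodes a=27, b="foo" as:   [54, 6, 102, 111, 111]
    // New writer encodes the same values as: [54, 6, 102, 111, 111, 0]
    //   the trailing 0 is the union branch index selecting "null".
    // The PR makes the reader tolerate the missing trailing byte(s); the spec
    // quote above is about whether a writer may omit them in the first place.
    println!("old:\n{OLD_SCHEMA}\nnew:\n{NEW_SCHEMA}");
}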
