New AvroDeserializer class requires a reader schema #834

Closed
1 task done
mparry opened this issue Apr 15, 2020 · 13 comments
Labels
SERDE_API:Enhancement Track demand for changes/enhancements to SERDE experimental API.

Comments

@mparry

mparry commented Apr 15, 2020

Description

The new AvroDeserializer class added in 1.4.0 requires a reader schema, but this seems unnecessary and means that these new classes cannot be used in the same way as the old AvroConsumer, where the default values for reader_key_schema and reader_value_schema are None.

Specifically, we have use cases where we are only interested in consuming (and, say, displaying) whatever messages arrive on a topic, and it is sufficient, and indeed preferable, to just use the corresponding writer schema from the registry.

I would be happy to provide a PR addressing this, unless the reader schema requirement is deliberate for some reason.

(A couple of other minor issues I noticed while looking at schema_registry/avro.py: i) the error messages about to_dict not being callable have the arguments the wrong way around; ii) the attempt to raise a ValueError about unrecognized properties will fail, because the second call to format() should be join() instead.)
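To illustrate the second bug: str.format() cannot merge a list of names into a comma-separated string on its own, so the names need to be join()ed first. A minimal sketch (the property names and message text are illustrative, not the actual library source):

```python
# Hypothetical list of unrecognized config keys, as mparry describes.
unknown_props = ['bogus.option', 'another.option']

# Broken pattern: passing the list straight to format() renders it as
# "['bogus.option', 'another.option']", and chaining a second format()
# call on the result does not substitute anything further.

# Corrected pattern: join the property names first, then format once.
message = "Unrecognized properties: {}".format(", ".join(unknown_props))
print(message)  # -> Unrecognized properties: bogus.option, another.option
```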

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): 1.4.0
@rnpridgeon
Contributor

Thanks for the feedback @mparry.

Reader schema requirement:

This was actually an item of debate during the design process. The question was whether the Deserializer should be polymorphic (since we can access the writer schema from the Schema Registry). Ultimately the decision was made that the deserializer should only handle a single type: the one defined in the reader schema.

That said, we marked these features as experimental for exactly this reason: to see how our users intend to use the API. I'll link any subsequent issues related to the same thing here to document interest in handling "Generic" Avro records.

Added a fix for the error message/exception raising to the v1.4.1rc branch. Thanks for catching that!

@rnpridgeon rnpridgeon added the SERDE_API:Enhancement Track demand for changes/enhancements to SERDE experimental API. label Apr 17, 2020
@mparry
Author

mparry commented Apr 18, 2020

The benefit that I'm interested in isn't so much polymorphism (though sure, I can see that results too) but rather that the Python client needs no advance knowledge of the schema. In a dynamic language, this is quite a nice feature!

Clearly if you want to use the data in depth then you probably need to know something about the schema but I think there are various use cases where this isn't true. To expand on my original example, we have tools that simply monitor and display the data being published on a topic, and it's quite undesirable to have to manually propagate the schema downstream to there, when the registry serves this purpose rather well.

I guess my other argument would be that this seems like something of a regression from the existing functionality - i.e. as it stands, you can't necessarily simply replace AvroConsumer with DeserializingConsumer and AvroDeserializer in existing code. I think that the fact that various people have reported and/or fixed the current bug where fastavro isn't used if reader_key_schema isn't supplied (#601, #694, #773, #832, maybe more!) suggests that we're not the only ones interested in this approach :)

@sshcherbakov

+1
Hard to believe that the reader schema is mandatory in the AvroDeserializer

@mpyatishev

In my case I would like to retrieve schemas from the Schema Registry and use the writer_schema instead of a reader_schema. So my suggestion is to make schema_str optional in AvroDeserializer.

@mpyatishev

I could prepare a patch and PR if you wish.

@n8sty

n8sty commented Oct 12, 2020

I'm experiencing similar difficulty when trying to replace the legacy AvroConsumer with a DeserializingConsumer and would be willing to take a crack at preparing a fix. I see two ways of doing this: either a new class, e.g. SchemaRegistryDeserializer, that never takes a schema string as input and always uses the writer schema (which definitely has some drawbacks), or adding some brief additional logic inside AvroDeserializer that makes schema_str optional.
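The second option (an optional schema_str) can be sketched in a few lines. Everything here is hypothetical scaffolding, not the confluent-kafka-python source: SchemaRegistryStub stands in for the real SchemaRegistryClient, and resolve_schema is an invented helper that shows the fallback logic.

```python
class SchemaRegistryStub:
    """Hypothetical stand-in for a schema registry client."""

    def __init__(self, schemas):
        self._schemas = schemas  # maps schema_id -> schema string

    def get_schema(self, schema_id):
        return self._schemas[schema_id]


class AvroDeserializerSketch:
    """Deserializer sketch where the reader schema is optional."""

    def __init__(self, registry, schema_str=None):
        self._registry = registry
        self._reader_schema = schema_str  # None -> use the writer schema

    def resolve_schema(self, schema_id):
        # Use the explicit reader schema when one was supplied; otherwise
        # look up the writer schema that the message's schema ID names.
        if self._reader_schema is not None:
            return self._reader_schema
        return self._registry.get_schema(schema_id)


registry = SchemaRegistryStub({1: '{"type": "string"}'})

generic = AvroDeserializerSketch(registry)  # no reader schema supplied
assert generic.resolve_schema(1) == '{"type": "string"}'

pinned = AvroDeserializerSketch(registry, '{"type": "long"}')
assert pinned.resolve_schema(1) == '{"type": "long"}'
```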

@alphapseudo

I definitely agree with this issue. Looking at the example for the AvroConsumer:

avro_deserializer = AvroDeserializer(schema_str,
                                     schema_registry_client,
                                     dict_to_user)
string_deserializer = StringDeserializer('utf_8')
consumer_conf = {'bootstrap.servers': args.bootstrap_servers,
                 'key.deserializer': string_deserializer,
                 'value.deserializer': avro_deserializer,
                 'group.id': args.group,
                 'auto.offset.reset': "earliest"}
consumer = DeserializingConsumer(consumer_conf)

In its current state, it's necessary to provide a schema up front to deserialize messages from a topic.

You can definitely get the writer schema from the registry — so that piece is ok.

However, why should we have to know in advance which schema version/ID to retrieve, when it's encoded inside the message itself?

# Write the magic byte and schema ID in network byte order (big endian)
fo.write(pack('>bI', _MAGIC_BYTE, self._schema_id))

With the current design, encoding the schema ID becomes pointless; the cart is before the horse.
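The five-byte header quoted above can be read back with the same struct format. A minimal stdlib-only sketch of parsing the Confluent wire format (magic byte, then the 4-byte schema ID in big-endian order); parse_header is an illustrative helper, not a library function:

```python
import struct

MAGIC_BYTE = 0  # magic byte used by the Confluent wire format


def parse_header(payload):
    """Return the schema ID carried in a Confluent-framed Avro message."""
    if len(payload) < 5:
        raise ValueError("message too short to carry the wire-format header")
    magic, schema_id = struct.unpack('>bI', payload[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte: {}".format(magic))
    return schema_id  # the Avro body follows at payload[5:]


# Frame a message the same way the serializer snippet above does.
framed = struct.pack('>bI', MAGIC_BYTE, 42) + b'<avro body>'
print(parse_header(framed))  # -> 42
```

This is exactly why the deserializer could discover the writer schema on its own: the ID needed for the registry lookup is already in every message.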

@97nitt
Contributor

97nitt commented Nov 28, 2020

+1. One of the most important features of Avro (imo) is the ability to read data without advance knowledge of the schema. The schema (or a reference to it) is included with the data. Now, I also like the ability to create projections of Avro-encoded data by supplying a compatible reader schema when I want to (primarily to ignore fields I don't care about). Both use cases are supported by Avro, and I think both should be supported by the deserializer.

I took a look at the code, and I believe this would mean a breaking change to the AvroDeserializer constructor. By making the reader schema optional, I think the order of the arguments would need to change.
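The projection use case 97nitt mentions can be sketched without Avro itself. This is not the Avro schema-resolution algorithm, just the field-dropping aspect of a compatible reader schema, shown over plain dicts with illustrative record and field names:

```python
# A decoded record as the writer schema produced it (names illustrative).
writer_record = {'id': 7, 'name': 'alice', 'debug_blob': 'noise'}

# A compatible reader schema that declares only a subset of the fields.
reader_fields = ['id', 'name']


def project(record, fields):
    """Keep only the fields the reader schema declares."""
    return {k: record[k] for k in fields}


assert project(writer_record, reader_fields) == {'id': 7, 'name': 'alice'}
```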

@mhowlett
Contributor

mhowlett commented Mar 9, 2021

the root cause of this decision, i think, is that it mimics the specific API in strongly typed languages, where the schema is always present in the generated class.

since the new serde interface is still marked as experimental, i think we can err on the side of making the breaking parameter-order change to make the API a bit nicer.

thanks for reporting, @mparry, and thanks to everyone else for the endorsements.

@jayknair

I just wanted to cast my vote for this change as well

@markns

markns commented Oct 29, 2021

Is there any workaround for this issue? I'd like to be able to consume a topic without specifying the reader schema in advance.

@OneCricketeer

This should be closed, as of #1000

@pranavrth
Member

Closing this issue as the required PR is already merged and released.
