New AvroDeserializer class requires a reader schema #834

Closed
1 task done
mparry opened this issue Apr 15, 2020 · 13 comments
Labels
SERDE_API:Enhancement Track demand for changes/enhancements to SERDE experimental API.

Comments

@mparry

mparry commented Apr 15, 2020

Description

The new AvroDeserializer class added in 1.4.0 requires a reader schema, but this seems unnecessary and means that these new classes cannot be used in the same way as the old AvroConsumer, where the default values for reader_key_schema and reader_value_schema are None.

Specifically, we have use cases where we are only interested in consuming (and, say, displaying) whatever messages arrive on a topic, and it is sufficient, and indeed preferable, to just use the corresponding writer schema from the registry.

I would be happy to provide a PR addressing this, unless the reader schema requirement is deliberate for some reason.

(A couple of other minor issues I noticed while looking at schema_registry/avro.py: i) the error messages about to_dict not being callable have the arguments the wrong way around; ii) the attempt to raise a ValueError about unrecognized properties will fail, because the second call to format() should be join() instead.)
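To illustrate the second bug: str.format() cannot merge a list of names into a comma-separated string on its own, so the names need to be join()ed first. A minimal sketch (the property names and message text are illustrative, not the actual library source):

```python
# Hypothetical list of unrecognized config keys, as mparry describes.
unknown_props = ['bogus.option', 'another.option']

# Broken pattern: passing the list straight to format() renders it as
# "['bogus.option', 'another.option']", and chaining a second format()
# call on the result does not substitute anything further.

# Corrected pattern: join the property names first, then format once.
message = "Unrecognized properties: {}".format(", ".join(unknown_props))
print(message)  # -> Unrecognized properties: bogus.option, another.option
```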

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): 1.4.0
@rnpridgeon
Contributor

Thanks for the feedback @mparry.

Reader schema requirement:

This was actually an item of debate during the design process. The question was whether the Deserializer should be polymorphic (since we can access the writer schema from the Schema Registry). Ultimately the decision was made that the deserializer should only handle a single type: the one defined in the reader schema.

That said, we marked these features as experimental for exactly this reason: to see how our users intend to use the API. I'll link any subsequent issues related to the same thing here to document interest in handling "Generic" Avro records.

Added a fix for the error message/exception raising to the v1.4.1rc branch. Thanks for catching that!

@rnpridgeon rnpridgeon added the SERDE_API:Enhancement Track demand for changes/enhancements to SERDE experimental API. label Apr 17, 2020
@mparry
Author

mparry commented Apr 18, 2020

The benefit that I'm interested in isn't so much polymorphism (though sure, I can see that results too) but rather that the Python client needs no advance knowledge of the schema. In a dynamic language, this is quite a nice feature!

Clearly if you want to use the data in depth then you probably need to know something about the schema but I think there are various use cases where this isn't true. To expand on my original example, we have tools that simply monitor and display the data being published on a topic, and it's quite undesirable to have to manually propagate the schema downstream to there, when the registry serves this purpose rather well.

I guess my other argument would be that this seems like something of a regression from the existing functionality - i.e. as it stands, you can't necessarily simply replace AvroConsumer with DeserializingConsumer and AvroDeserializer in existing code. I think that the fact that various people have reported and/or fixed the current bug where fastavro isn't used if reader_key_schema isn't supplied (#601, #694, #773, #832, maybe more!) suggests that we're not the only ones interested in this approach :)

@sshcherbakov

+1
Hard to believe that the reader schema is mandatory in the AvroDeserializer

@mpyatishev

In my case I would like to retrieve schemas from the Schema Registry and use the writer_schema instead of a reader_schema. So my suggestion is to make schema_str optional in AvroDeserializer.

@mpyatishev

I could prepare a patch and PR if you wish.

@n8sty

n8sty commented Oct 12, 2020

I'm experiencing similar difficulty when trying to replace the legacy AvroConsumer with a DeserializingConsumer and would be willing to take a crack at preparing a fix. I see two ways of doing this: either a new class, e.g. SchemaRegistryDeserializer, that never takes a schema string as input and always uses the writer schema (which definitely has some drawbacks), or adding some brief additional logic inside AvroDeserializer that makes schema_str optional.
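The second option (an optional schema_str) can be sketched in a few lines. Everything here is hypothetical scaffolding, not the confluent-kafka-python source: SchemaRegistryStub stands in for the real SchemaRegistryClient, and resolve_schema is an invented helper that shows the fallback logic.

```python
class SchemaRegistryStub:
    """Hypothetical stand-in for a schema registry client."""

    def __init__(self, schemas):
        self._schemas = schemas  # maps schema_id -> schema string

    def get_schema(self, schema_id):
        return self._schemas[schema_id]


class AvroDeserializerSketch:
    """Deserializer sketch where the reader schema is optional."""

    def __init__(self, registry, schema_str=None):
        self._registry = registry
        self._reader_schema = schema_str  # None -> use the writer schema

    def resolve_schema(self, schema_id):
        # Use the explicit reader schema when one was supplied; otherwise
        # look up the writer schema that the message's schema ID names.
        if self._reader_schema is not None:
            return self._reader_schema
        return self._registry.get_schema(schema_id)


registry = SchemaRegistryStub({1: '{"type": "string"}'})

generic = AvroDeserializerSketch(registry)  # no reader schema supplied
assert generic.resolve_schema(1) == '{"type": "string"}'

pinned = AvroDeserializerSketch(registry, '{"type": "long"}')
assert pinned.resolve_schema(1) == '{"type": "long"}'
```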

@alphapseudo

I definitely agree with this issue. Looking at the example for the AvroConsumer:

avro_deserializer = AvroDeserializer(schema_str,
                                     schema_registry_client,
                                     dict_to_user)
string_deserializer = StringDeserializer('utf_8')
consumer_conf = {'bootstrap.servers': args.bootstrap_servers,
                 'key.deserializer': string_deserializer,
                 'value.deserializer': avro_deserializer,
                 'group.id': args.group,
                 'auto.offset.reset': "earliest"}
consumer = DeserializingConsumer(consumer_conf)

In its current state, it's necessary to provide a schema up front to deserialize messages from a topic.

You can definitely get the writer schema from the registry — so that piece is ok.

However, why should we have to know in advance which schema version/ID to retrieve, when it's encoded inside the message itself?

# Write the magic byte and schema ID in network byte order (big endian)
fo.write(pack('>bI', _MAGIC_BYTE, self._schema_id))

With the current design, encoding the schema ID becomes pointless; the cart is before the horse.
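The five-byte header quoted above can be read back with the same struct format. A minimal stdlib-only sketch of parsing the Confluent wire format (magic byte, then the 4-byte schema ID in big-endian order); parse_header is an illustrative helper, not a library function:

```python
import struct

MAGIC_BYTE = 0  # magic byte used by the Confluent wire format


def parse_header(payload):
    """Return the schema ID carried in a Confluent-framed Avro message."""
    if len(payload) < 5:
        raise ValueError("message too short to carry the wire-format header")
    magic, schema_id = struct.unpack('>bI', payload[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte: {}".format(magic))
    return schema_id  # the Avro body follows at payload[5:]


# Frame a message the same way the serializer snippet above does.
framed = struct.pack('>bI', MAGIC_BYTE, 42) + b'<avro body>'
print(parse_header(framed))  # -> 42
```

This is exactly why the deserializer could discover the writer schema on its own: the ID needed for the registry lookup is already in every message.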

@97nitt
Contributor

97nitt commented Nov 28, 2020

+1. One of the most important features of Avro (imo) is the ability to read data without advance knowledge of the schema. The schema (or a reference to it) is included with the data. Now, I also like the ability to create projections of Avro-encoded data by supplying a compatible reader schema when I want to (primarily to ignore fields I don't care about). Both use cases are supported by Avro, and I think both should be supported by the deserializer.

I took a look at the code, and I believe this would mean a breaking change to the AvroDeserializer constructor. By making the reader schema optional, I think the order of the arguments would need to change.
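The projection use case 97nitt mentions can be sketched without Avro itself. This is not the Avro schema-resolution algorithm, just the field-dropping aspect of a compatible reader schema, shown over plain dicts with illustrative record and field names:

```python
# A decoded record as the writer schema produced it (names illustrative).
writer_record = {'id': 7, 'name': 'alice', 'debug_blob': 'noise'}

# A compatible reader schema that declares only a subset of the fields.
reader_fields = ['id', 'name']


def project(record, fields):
    """Keep only the fields the reader schema declares."""
    return {k: record[k] for k in fields}


assert project(writer_record, reader_fields) == {'id': 7, 'name': 'alice'}
```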

@mhowlett
Contributor

mhowlett commented Mar 9, 2021

the root cause of this decision, i think, is that it mimics the specific API in strongly typed languages, where the schema is always present in the generated class.

since the new serde interface is still marked as experimental, i think we can err on the side of making the breaking parameter-order change to make the API a bit nicer.

thanks for reporting, @mparry, and thanks to everyone else for the endorsements.

@jayknair

I just wanted to cast my vote for this change as well

@markns

markns commented Oct 29, 2021

Is there any workaround for this issue? I'd like to be able to consume a topic without specifying the reader schema in advance.

@OneCricketeer

This should be closed, as of #1000

@pranavrth
Member

Closing this issue as the required PR is already merged and released.
