Inefficient data encoding when using avro schema #945

Open
hyperevo opened this issue Jan 17, 2023 · 1 comment

Comments

@hyperevo

Expected behavior

Situation:

  1. Using an avro schema definition where at least one field is of type bytes.
  2. Providing data in the form of a native golang struct with a corresponding property of type []byte.

It is expected that the []byte property should be encoded as-is into the binary payload using the avro codec.

Actual behavior

In this situation, the native golang struct is encoded into a pulsar payload by the following function:

func (as *AvroSchema) Encode(data interface{}) ([]byte, error) {

It first encodes the []byte field into a base64 string during the json.Marshal. That result is then converted into binary using the avro codec. The final binary data is significantly larger than it would have been if the raw bytes had been transmitted directly.

Here is an example:
Original golang byte array: [172 12 53 97 9 70 89 247 94 3 56 242 127 146 9 209]
Base64 encoded text from byte array: rAw1YQlGWfdeAzjyf5IJ0Q==
The byte array of the final encoded binary payload (which is just the byte array representation of the base64 encoded string): [114 65 119 49 89 81 108 71 87 102 100 101 65 122 106 121 102 53 73 74 48 81 61 61]
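
The base64 step can be seen with the standard library alone. The following is a minimal sketch, not the client's code: the `record` struct, its `payload` field name, and the `marshalRecord` helper are hypothetical stand-ins for a user's data type. It only shows what `json.Marshal` does to a `[]byte` field before any avro conversion happens.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record is a hypothetical struct standing in for the user's data type.
type record struct {
	Payload []byte `json:"payload"`
}

// marshalRecord returns the JSON text that json.Marshal produces for a
// record holding the given raw bytes.
func marshalRecord(raw []byte) (string, error) {
	out, err := json.Marshal(record{Payload: raw})
	return string(out), err
}

func main() {
	// The byte array from the example above.
	raw := []byte{172, 12, 53, 97, 9, 70, 89, 247, 94, 3, 56, 242, 127, 146, 9, 209}

	text, err := marshalRecord(raw)
	if err != nil {
		panic(err)
	}
	// encoding/json represents []byte values as base64 strings.
	fmt.Println(text) // {"payload":"rAw1YQlGWfdeAzjyf5IJ0Q=="}
}
```

Any later step that reads this JSON sees a string of 24 ASCII characters, not the 16 original bytes.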

In this example, the encoded payload that pulsar-client-go transmits to pulsar is 50% larger in terms of bytes (24 bytes instead of 16). This can lead to a dramatic loss of performance when throughput is the bottleneck.
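
The size increase follows directly from how base64 works: every 3 input bytes become 4 output characters, with padding up to a multiple of 4. A small sketch of the arithmetic, where `base64Overhead` is a hypothetical helper name and the avro length prefix is ignored:

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// base64Overhead returns the base64-encoded length of n raw bytes and the
// relative size increase in percent. n must be greater than zero.
func base64Overhead(n int) (encoded int, percent float64) {
	encoded = base64.StdEncoding.EncodedLen(n)
	percent = float64(encoded-n) / float64(n) * 100
	return encoded, percent
}

func main() {
	for _, n := range []int{16, 3000} {
		encoded, percent := base64Overhead(n)
		fmt.Printf("%d raw bytes -> %d base64 bytes (+%.1f%%)\n", n, encoded, percent)
	}
}
```

For the 16-byte example the overhead is 50% because of padding; for larger fields it settles at roughly 33%.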

Steps to reproduce

Encode a struct containing a []byte property using the avro schema Encode function:

func (as *AvroSchema) Encode(data interface{}) ([]byte, error) {

and look at the output.

System configuration

commit 504e589

@hyperevo hyperevo changed the title Ineffecient byte encoding when using avro schema Inefficient byte encoding when using avro schema Jan 17, 2023
@hyperevo hyperevo changed the title Inefficient byte encoding when using avro schema Inefficient data encoding when using avro schema Jan 17, 2023
@hyperevo
Author

Note that this base64 encoding behavior is also inconsistent with the python client when using the same avro schema. The python consumer expects the bytes field of the avro schema to contain the raw bytes of the data, but it actually contains the bytes of a base64 encoded string, so the data is decoded incorrectly.
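
To illustrate the mismatch, here is a sketch of what a consumer would have to do to get the original data back, assuming it knows the field was produced this way. `recoverRawBytes` is a hypothetical helper, not part of any pulsar client.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// recoverRawBytes is a hypothetical helper: it treats the received bytes
// field as base64 text and decodes it back to the original raw bytes.
func recoverRawBytes(field []byte) ([]byte, error) {
	return base64.StdEncoding.DecodeString(string(field))
}

func main() {
	// What a consumer finds in the bytes field: the ASCII bytes of the
	// base64 string, not the 16 raw bytes the producer started with.
	field := []byte("rAw1YQlGWfdeAzjyf5IJ0Q==")

	raw, err := recoverRawBytes(field)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(field), len(raw)) // 24 16
	fmt.Println(raw)
}
```

A consumer that skips this extra step and uses the field as-is ends up with the 24 base64 characters instead of the intended 16 bytes.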
