Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use schemaless reader to handle complex schema #251

Merged
merged 2 commits into from
Dec 18, 2017

Conversation

fpietka
Copy link
Contributor

@fpietka fpietka commented Sep 25, 2017

I bumped into an issue lately trying to decode payloads coming from Debezium (MySQL binlogs). Fastavro wasn't used because of a KeyError exception in the library.

I discussed the issue here and found out read_data() wasn't able to decode avro messages with named types in their schema.

Here I replaced it with schemaless_reader() which acquire the schema tp be able to decode it:
https://github.com/tebeka/fastavro/blob/master/fastavro/reader.py#L596-L607

@ghost
Copy link

ghost commented Sep 25, 2017

It looks like @fpietka hasn't signed our Contributor License Agreement, yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence.
Wikipedia

You can read and sign our full Contributor License Agreement here.

Once you've signed reply with [clabot:check] to prove it.

Appreciation of efforts,

clabot

@fpietka
Copy link
Contributor Author

fpietka commented Sep 25, 2017

[clabot:check]

@ghost
Copy link

ghost commented Sep 25, 2017

@confluentinc It looks like @fpietka just signed our Contributor License Agreement. 👍

Always at your service,

clabot

Copy link
Contributor

@ewencp ewencp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fpietka Can you explain why this would be correct? schemaless_reader doesn't look like it is intended to be public API (in fact, it looks like we should be using from fastavro import load). And read_data invokes schemaless_reader, so why would omitting the other stuff read_data does be correct?

@fpietka
Copy link
Contributor Author

fpietka commented Sep 25, 2017

I'll try to explain as I understood how it works.

Avro messages in Kafka have special headers to keep schema ID (https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html#wire-format) Which means read_data() has been called to read the message starting at the 5th byte (as we can see here)

I also think that neither read_data() nor schemaless_reader() was intended to be called directly. Calling fastavro.reader() as in the README would call iter_avro() which will read header then call acquaint_schema().

In my case, I was getting data from Debezium. Binlogs have a before and an after key which are identical. In the schema, the after key was a named reference to the part defined in the before key. When we call directly read_data(), we only get types defined in a READERS constant, and not the named one, until we call acquaint_schema() with the schema.

Since here we bypass the regular header, calling schemaless_reader() will call acquaint_schema() before read_data().

I hope it clarifies my intention in this PR.
Tell me if you think there is a better approach.

@scottbelden
Copy link

I thought I might try to provide a little context as I added the schemaless_reader and schemaless_writer capabilities to fastavro. I did intend for them to be part of the public API as they can be useful, for example, if you wanted to send data over the network and both sides know the schema (so you don't have to include it and can reduce the message size).

As for acquaint_schema, it is also in the __init__.py which is kind of the pseudo notion of the public API, but personally I feel like it probably shouldn't be there. It feels more to me like an implementation detail and it probably isn't clear to most users what it does and why it is necessary.

@fpietka
Copy link
Contributor Author

fpietka commented Sep 28, 2017

I changed the import because it wasn't working with Cython. Now this works for both.

@ewencp
Copy link
Contributor

ewencp commented Dec 18, 2017

Ok, finally had a chance to look at this again. I think I don't totally understand fastavro's APIs, but I think I understand what the schemaless part of schemaless_reader is now. LGTM, thanks for digging in and finding the solution to this!

@ewencp ewencp merged commit e226538 into confluentinc:master Dec 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants