add avro stream input format #11040
Conversation
@clintropolis Hi, I implemented AvroStreamInputFormat as I mentioned last weekend. Can you review it and help me refine it?
code LGTM 👍
This PR needs docs added to https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md I think before it is ready to go.

I think it should also be relatively easy to add an integration test for this, since we already have an integration test for the Parser implementation of Avro + Schema Registry. All that needs to be done is to create a new input_format directory in this location https://github.com/apache/druid/tree/master/integration-tests/src/test/resources/stream/data/avro_schema_registry with a new input_format.json template (using the InputFormat instead of the Parser). See JSON for an example: https://github.com/apache/druid/tree/master/integration-tests/src/test/resources/stream/data/json. If this template is added, then I think it should be automatically picked up and run as part of the Kafka data format integration tests.
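For reference, a minimal sketch of what such an input_format.json template might look like. The `avro_stream` type and `avroBytesDecoder`/`binaryAsString` fields come from this PR; the `%%SCHEMA_REGISTRY_HOST%%` template variable is a hypothetical placeholder, since the exact substitution convention used by the integration-test harness isn't shown here:

```json
{
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_registry",
    "url": "%%SCHEMA_REGISTRY_HOST%%"
  },
  "binaryAsString": false
}
```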
```java
    final AvroStreamInputFormat that = (AvroStreamInputFormat) o;
    return Objects.equals(getFlattenSpec(), that.getFlattenSpec()) &&
           Objects.equals(avroBytesDecoder, that.avroBytesDecoder);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(getFlattenSpec(), avroBytesDecoder);
  }
```
equals/hashCode should probably also consider `binaryAsString` in their computations
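A minimal standalone sketch of the suggested fix, with `binaryAsString` participating in both methods. The class here is an illustrative stand-in (field types simplified to `Object`), not the actual patch:

```java
import java.util.Objects;

// Illustrative stand-in for AvroStreamInputFormat's identity fields.
class AvroStreamFormatKey
{
  private final Object flattenSpec;
  private final Object avroBytesDecoder;
  private final boolean binaryAsString;

  AvroStreamFormatKey(Object flattenSpec, Object avroBytesDecoder, boolean binaryAsString)
  {
    this.flattenSpec = flattenSpec;
    this.avroBytesDecoder = avroBytesDecoder;
    this.binaryAsString = binaryAsString;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    final AvroStreamFormatKey that = (AvroStreamFormatKey) o;
    // binaryAsString is included alongside the existing fields.
    return binaryAsString == that.binaryAsString &&
           Objects.equals(flattenSpec, that.flattenSpec) &&
           Objects.equals(avroBytesDecoder, that.avroBytesDecoder);
  }

  @Override
  public int hashCode()
  {
    // binaryAsString is also folded into the hash.
    return Objects.hash(flattenSpec, avroBytesDecoder, binaryAsString);
  }
}
```

With this change, two formats that differ only in `binaryAsString` no longer compare equal, which matters for task-spec comparisons.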
```java
    return CloseableIterators.withEmptyBaggage(
        Iterators.singletonIterator(avroBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open())
        ))));
```
nit: strange formatting (occasionally the style bot doesn't pick stuff up)
Suggested change:

```diff
-    return CloseableIterators.withEmptyBaggage(
-        Iterators.singletonIterator(avroBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open())
-        ))));
+    return CloseableIterators.withEmptyBaggage(
+        Iterators.singletonIterator(avroBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open()))))
+    );
```
Sorry, I don't understand what you said. Do you mean an integration test is being developed for the Avro stream, and when it is finished I can add a new JSON file for this test? Or do I need to create this integration test myself?
@clintropolis Documentation is done. Do I need to create an integration test? Are there some examples of it? I'm interested in it.
Ah sorry, let me try to explain a bit more. So in the case of avro inline schema and avro schema registry, you should be able to just add the JSON files with the
docs/ingestion/data-formats.md (Outdated)
```json
  "type" : "schema_repo",
  "subjectAndIdConverter" : {
    "type" : "avro_1124",
    "topic" : "${YOUR_TOPIC}"
  },
  "schemaRepository" : {
    "type" : "avro_1124_rest_client",
    "url" : "${YOUR_SCHEMA_REPO_END_POINT}"
  }
```
I suggest we switch to using 'inline' or 'schema-registry' as the example instead of 'schema_repo', which isn't used as frequently in practice as far as I know.
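For context, the schema-registry flavour of the decoder would look something like the sketch below. The `schema_registry` type exists in the Avro extension; the URL value is a placeholder:

```json
"avroBytesDecoder" : {
  "type" : "schema_registry",
  "url" : "${YOUR_SCHEMA_REGISTRY_END_POINT}"
}
```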
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| `flattenSpec` | JSON Object | Define a [`flattenSpec`](#flattenspec) to extract nested values from an Avro record. Note that only 'path' expressions are supported ('jq' is unavailable). | no (default will auto-discover 'root' level properties) |
| `avroBytesDecoder` | JSON Object | Specifies how to decode bytes to an Avro record. | yes |
| `binaryAsString` | Boolean | Specifies whether an Avro bytes column that is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
I think we should move https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md#avro-bytes-decoder (which currently lives with the 'parsers' documentation) up to this 'input formats' section, and have the parsers section link to the bytes decoder docs here.
I haven't quite determined what is going on yet, but it seems like there is some sort of serialization error that is causing the newly added schema-registry input format integration test to fail: https://travis-ci.com/github/apache/druid/jobs/496046398#L9197. The inline schema test is passing 👍
It seems like https://github.com/apache/druid/blob/master/extensions-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/SchemaRegistryBasedAvroBytesDecoder.java#L50 is missing getter methods annotated with `@JsonProperty`. Could you add serialization round trip tests for more coverage?
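A serialization round-trip check of the kind requested can be sketched as follows, using a hypothetical minimal config bean (not the real Druid decoder class) with Jackson. The point is that without a readable property on the serialization side, the field is lost when writing JSON, and the round trip no longer reproduces the original object:

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import java.util.Objects;

// Hypothetical stand-in for a decoder-style config bean; not the real Druid class.
class RegistryDecoderConfig
{
  private final String url;

  @JsonCreator
  RegistryDecoderConfig(@JsonProperty("url") String url)
  {
    this.url = url;
  }

  // Without a @JsonProperty-annotated getter like this, Jackson has no
  // property to write out for the private field, so serialization drops it
  // and the deserialized copy no longer equals the original.
  @JsonProperty
  public String getUrl()
  {
    return url;
  }

  @Override
  public boolean equals(Object o)
  {
    return o instanceof RegistryDecoderConfig
           && Objects.equals(url, ((RegistryDecoderConfig) o).url);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(url);
  }
}
```

A round-trip test then just asserts that `mapper.readValue(mapper.writeValueAsString(config), RegistryDecoderConfig.class)` equals the original instance.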
I tested it last week and I know something is wrong with the schema registry decoder. I will fix it this weekend. Thanks for your advice, and I will change the code for this exception.
@clintropolis Hi, I fixed this bug and the integration tests pass. I also added a unit test for this.
thanks for fixing this up 👍
```java
  public AvroStreamInputFormat(
      @JsonProperty("flattenSpec") @Nullable JSONPathSpec flattenSpec,
      @JsonProperty("avroBytesDecoder") AvroBytesDecoder avroBytesDecoder,
      @JsonProperty("binaryAsString") @Nullable Boolean binaryAsString
```
missing a `binaryAsString` getter annotated with `@JsonProperty`, I think
Fixed.
lgtm, thanks @bananaaggle 👍
Because parseSpec is deprecated, I developed AvroStreamInputFormat for the new interface, which supports stream ingestion of data encoded with Avro.
This PR has: