[schema] expose the native record for struct schema #9614
@sijie @shiv4289 @aahmed-se please take a look.
I will spend time adding tests only once we reach consensus on the best API to provide to our users.
@sijie I have added tests, please take a look, the patch is ready.
/pulsarbot run-failure-checks
@sijie the patch passed CI, please take a look. @codelipenghui @rdhabalia PTAL as well.
@congbobo184 Could you please help review this PR?
When the consumer uses the AUTO_CONSUME schema, how does it distinguish the unwrapped type? The implementation seems to assume that users already know whether the schema is Avro or Protobuf. Users can use the ALWAYS_COMPATIBLE policy, so the schemas of a topic might include both Avro and Protobuf schemas. So do we need to get the type, such as Avro, from the GenericRecord first?
Also, GenericAvroRecord, GenericProtobufNativeRecord, etc. already expose the native schema, so I think it is easy to get the field type from the native schema, since each field exposes its index and name.
I have the same confusion as @codelipenghui.
When you are inside a Sink you have this situation and you can access all of the schema info from the
Unfortunately GenericRecord is used both to receive and to send values from/to Pulsar. We cannot easily add Do you think we should add something like
I will also like it, as it will help users that are not inside a Sink and they are receiving messages with the In that case I would make |
@eolivelli Currently, the PulsarClient maintains a schema cache for all topics, you can see |
@codelipenghui I cannot get your point. |
@eolivelli I shared the same question with @codelipenghui. Why can't you use a code snippet like the below?
|
@sijie @codelipenghui I believe that we need a strong, feature-complete public API that we can maintain in the future and that users can rely on.
@sijie @codelipenghui ping |
@eolivelli @sijie Is it better to add method |
I think what @codelipenghui suggested can be a good solution. |
Good idea. I will update the patch soon |
/pulsarbot run-failure-tests |
pulsar-client-api/src/main/java/org/apache/pulsar/client/api/schema/GenericRecord.java
 * @return the internal representation of the record, or null if the requested information is not available.
 */
default <T> T getNativeRecord(Class<T> clazz) {
    return null;
I am confused about why you throw UnsupportedOperationException for getSchemaType and return null for this method.
Because we do not have a good "default value" for getSchemaType. For this method, the expected behaviour is that you get a null value if you ask for an unsupported interface, so returning null here is the expected behaviour.
Why not throw UnsupportedOperationException for getNativeRecord? It is a much clearer behavior than returning null.
I believe that UnsupportedOperationException does not work well, because code that calls that method will have to catch it, and catching UnsupportedOperationException smells.
I am thinking about generic code that handles multiple different types of records.
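To make the trade-off in this thread concrete, here is a minimal sketch of the two default behaviours and what generic calling code looks like with each. The interfaces `NullDefaultRecord` and `ThrowingDefaultRecord` and the helper methods are hypothetical stand-ins invented for illustration; they are not the Pulsar `GenericRecord` API.

```java
// Hypothetical stand-ins for illustration only; not the Pulsar GenericRecord API.
final class DefaultBehaviourSketch {

    /** Variant A: the default returns null when no native record is available. */
    interface NullDefaultRecord {
        default Object getNativeRecord() {
            return null;
        }
    }

    /** Variant B: the default throws when no native record is available. */
    interface ThrowingDefaultRecord {
        default Object getNativeRecord() {
            throw new UnsupportedOperationException("native record not available");
        }
    }

    /** Generic caller for variant A: a null check is enough. */
    static String describeNullable(NullDefaultRecord record) {
        Object nativeRecord = record.getNativeRecord();
        return nativeRecord == null ? "no native record" : "native: " + nativeRecord;
    }

    /** Generic caller for variant B: it has to catch the exception to keep going. */
    static String describeThrowing(ThrowingDefaultRecord record) {
        try {
            return "native: " + record.getNativeRecord();
        } catch (UnsupportedOperationException e) {
            return "no native record";
        }
    }

    public static void main(String[] args) {
        // Both records rely on the interface default, so neither has a native record.
        System.out.println(describeNullable(new NullDefaultRecord() { }));
        System.out.println(describeThrowing(new ThrowingDefaultRecord() { }));
    }
}
```

Variant A keeps generic callers short but lets a forgotten null check surface later as a NullPointerException; variant B fails at the call site but pushes a try/catch onto every generic caller.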
 *
 * @return the internal representation of the record, or null if the requested information is not available.
 */
default <T> T getNativeRecord(Class<T> clazz) {
- default <T> T getNativeRecord(Class<T> clazz) {
+ default <T> T getNativeRecord() {
Why do we need Class<T> here?
A couple of reasons:
- I am adding this parameter because this way the user won't have to deal with ClassCastExceptions or use 'instanceof'.
- Also, the implementation of GenericRecord will be allowed to implement some "compatibility" feature.
For instance, if in the future we move away from JsonNode but the client code still expects to receive a JsonNode, we will be able to return a properly crafted JsonNode instance and not break clients.
I am not saying we will do this soon, but if we do not add such support now we won't be able to add it in the future.
I am not sure about that. Having Class<T> clazz means that users need to know exactly which class each schema type uses. This is exactly the same as what I commented on in #9614 (comment). There is no difference between these two approaches.
Because people need to know exactly which class to use anyway, I would instead suggest keeping the interface as simple as Object getNativeRecord(). Returning Object is very common in data processing engines that deal with different object types. I don't see it as a big problem to do the same thing here.
lgtm
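The two signatures debated in this thread can be compared side by side. The following is only a sketch: `TypedRecord`, `ObjectRecord`, and the factory helpers are hypothetical names, and a plain `String` stands in for a real native type such as an Avro record or a JsonNode.

```java
// Hypothetical sketch; the names below are illustrative, not the Pulsar API.
final class TypedVersusObjectSketch {

    /** Shape proposed first: the caller names the class it expects. */
    interface TypedRecord {
        <T> T getNativeRecord(Class<T> clazz);
    }

    /** Shape suggested in review: the caller gets an Object and inspects it. */
    interface ObjectRecord {
        Object getNativeRecord();
    }

    /** A typed record backed by an arbitrary value; returns null on a type mismatch. */
    static TypedRecord typedRecordOf(Object value) {
        return new TypedRecord() {
            @Override
            public <T> T getNativeRecord(Class<T> clazz) {
                return clazz.isInstance(value) ? clazz.cast(value) : null;
            }
        };
    }

    /** An object record backed by an arbitrary value. */
    static ObjectRecord objectRecordOf(Object value) {
        return () -> value;
    }

    public static void main(String[] args) {
        // Typed shape: no cast at the call site, null when the class does not match.
        TypedRecord typed = typedRecordOf("payload");
        String asString = typed.getNativeRecord(String.class);
        Integer asInteger = typed.getNativeRecord(Integer.class);
        System.out.println(asString + " / " + asInteger);

        // Object shape: the caller checks the runtime type itself.
        Object raw = objectRecordOf("payload").getNativeRecord();
        if (raw instanceof String) {
            System.out.println(((String) raw).length());
        }
    }
}
```

In both shapes the caller still has to know which class to expect, which is the point made in the review; the typed shape only moves the check from the caller into the record implementation.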
@sijie I have updated getNativeRecord to return Object, with a "throw new UnsupportedOperationException" as the default implementation. PTAL. @codelipenghui @dlg99 you already approved this patch in its old form, please take a look again.
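A minimal sketch of the shape described in this comment, assuming both methods throw UnsupportedOperationException by default, and of how a consumer might check the schema type before casting the native record. `SketchRecord`, `SketchSchemaType`, `jsonRecord`, and `handle` are hypothetical names for illustration, and a `Map` stands in for a real JSON node type.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch; not the actual Pulsar interfaces.
final class FinalShapeSketch {

    /** Illustrative subset of schema types. */
    enum SketchSchemaType { AVRO, JSON, PROTOBUF_NATIVE }

    /** Both defaults throw, so implementations must opt in explicitly. */
    interface SketchRecord {
        default SketchSchemaType getSchemaType() {
            throw new UnsupportedOperationException("schema type not available");
        }

        default Object getNativeRecord() {
            throw new UnsupportedOperationException("native record not available");
        }
    }

    /** A JSON-like record whose native form is a Map (stand-in for a JSON node). */
    static SketchRecord jsonRecord(Map<String, Object> fields) {
        return new SketchRecord() {
            @Override
            public SketchSchemaType getSchemaType() {
                return SketchSchemaType.JSON;
            }

            @Override
            public Object getNativeRecord() {
                return fields;
            }
        };
    }

    /** Check the schema type first, then cast the native record accordingly. */
    static String handle(SketchRecord record) {
        SketchSchemaType type = record.getSchemaType();
        if (type == SketchSchemaType.JSON) {
            Map<?, ?> fields = (Map<?, ?>) record.getNativeRecord();
            return "json with " + fields.size() + " field(s)";
        }
        return "unhandled schema type " + type;
    }

    public static void main(String[] args) {
        Map<String, Object> fields = Collections.<String, Object>singletonMap("id", 1);
        System.out.println(handle(jsonRecord(fields)));
    }
}
```

Checking the schema type before the cast is one way to address the earlier question about topics that mix schema types: the consumer never has to guess which native class it received.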
/pulsarbot rerun-failure-checks |
Thank you @sijie, @codelipenghui and @dlg99 for your reviews and suggestions.
Motivation
Allow GenericRecord consumers to access the underlying implementation, such as Avro GenericRecord, Protobuf DynamicMessage or JSON JsonNode.
Modifications
This patch introduces support for retrieving such information, with two methods on GenericRecord, getSchemaType and getNativeRecord, both discussed in the review above.
Verifying this change
New test cases are added.
Does this pull request potentially affect one of the following parts:
This change introduces a new API
Documentation