New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-7210] Fix non-deterministic row access #8474
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just wondering whether we should throw an error in case of non-deterministic schema extraction?
@@ -388,7 +388,7 @@ public void testInferredSchemaPipeline() { | |||
new DoFn<Inferred, String>() { | |||
@ProcessElement | |||
public void process(@Element Row row, OutputReceiver<String> r) { | |||
r.output(row.getString(0) + ":" + row.getInt32(1)); | |||
r.output(row.getString("stringField") + ":" + row.getInt32("integerField")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically it is not safe to use AutoValue with Schemas. Should we convert the example to use a Pojo instead (like it previously did)? Should we throw an error that schemas do not work deterministically in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is perfectly safe. What's not safe is assuming that the order of the schema fields will match the order of the methods in your class. In fact it's technically not safe for POJOs either, as AFAICT Java makes no guarantees there either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wondered whether we should throw an error when users try to access fields via a non-deterministic position? It seems like this could cause subtle bugs when two fields have the same type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @reuvenlax! I still wonder whether we should prevent non-deterministic positional row access but this can be handled independently of this PR.
So this is deterministic as long as you resolve the ids beforehand. E.g.
the following is deterministic:
@DefaultFieldSchema(AutoValueSchema.class)
@autovalue
class Foo {...}
PCollection<Foo> foos = readFoos();
int userIdField = foos.getSchema().indexOf("userId");
foos.apply(ParDo.of(....
void process(@element Row row) {
row.get(userIdField);
}
In fact all of the schema transforms (Group, Join, etc.) do just this for
efficiency. They resolve the field names into field positions at expansion
time, and then always access by field id inside the actual transform.
Reuven
…On Mon, May 6, 2019 at 2:01 AM Maximilian Michels ***@***.***> wrote:
***@***.**** approved this pull request.
Thanks for the fix @reuvenlax <https://github.com/reuvenlax>! I still
wonder whether we should prevent non-deterministic positional row access
but this can be handled independently of this PR.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8474 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AFAYJVM6OH5AXZMVPCHWFQLPT7XW7ANCNFSM4HKCZFNQ>
.
|
I see. So on Rows users need to ensure that they either know the structure of the Row or go through the Schema which returns the correct index for a field name. Thanks! |
Yes. I think most users will access the Row by name though, not by index.
If users have an object (POJO, Avro, AutoValue, etc.) with a matching
schema, they can instead just access via that object and not bother with
the Row object at all.
…On Wed, May 8, 2019 at 3:37 AM Maximilian Michels ***@***.***> wrote:
I see. So on Rows users need to ensure that they either know the structure
of the Row or go through the Schema which returns the correct index for a
field name. Thanks!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8474 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AFAYJVKQ6HT7ZS4ZRDIJY2DPUKUM7ANCNFSM4HKCZFNQ>
.
|
No description provided.