Skip to content

[BEAM-9041, BEAM-9042] SchemaCoder equals should not rely on from/toRowFunction equality#10492

Merged
iemejia merged 4 commits intoapache:masterfrom
iemejia:BEAM-9042-schemacoder-not-serializable
Jan 8, 2020
Merged

[BEAM-9041, BEAM-9042] SchemaCoder equals should not rely on from/toRowFunction equality#10492
iemejia merged 4 commits intoapache:masterfrom
iemejia:BEAM-9042-schemacoder-not-serializable

Conversation

@iemejia
Copy link
Member

@iemejia iemejia commented Jan 3, 2020

This PR fixes both issues because (1) one fix depends on the other, and (2) to make it easier to validate/cherry pick into 2.18.0's branch.

BEAM-9041: Don't rely on equality for the from/to functions in SchemaCoder because nobody implements equals on SerializableFunction. I tried to do this with byte equality, maybe there is a better way to do it but I could not think of another.
BEAM-9042: Since Avro's Schema class is not Serializable I made it transient. Another approach could have been to transform the Schema into a String (not sure if this is needed but I can change it if you think it is worth).

R: @TheNeuralBit

@iemejia iemejia requested a review from TheNeuralBit January 3, 2020 00:11
Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made sure we had a good equals for most of the toRow/fromRow functions we generate when I changed SchemaCoder#equals in #9493. It looks like I missed some since these lambdas are still here, but I would prefer adding the equals rather than just reverting to byte equality if possible.

You have a good point that new implementations could easily neglect to implement equals though. Maybe we could mitigate that by defining our own interface that requires equals to be implemented?

cc: @kennknowles

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you just implement equals here instead of changing to comparing byte equality of the serialized function in SchemaCoder?

Copy link
Member Author

@iemejia iemejia Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a second different issue about capture of Avro schema on serialization (the key change is the transient part) so not really related to equals. As explained above I put both together because I use equality to validate the roundtrip of serialization/deserialization.

@iemejia
Copy link
Member Author

iemejia commented Jan 6, 2020

This is not a revert. Previous version did not compare from/toRow functions for equality. Do you have any suggestion on how to compare both functions? It is not really clear to me how to do so in particular for functions with no state.

@TheNeuralBit
Copy link
Member

TheNeuralBit commented Jan 7, 2020 via email

@kennknowles
Copy link
Member

Hmm. I am torn on this. I agree with @TheNeuralBit that mostly these conversions should just own their own equals. But I see how SerializableFunction, especially lambdas, will usually not implement this.

In the case of the SchemaCoder it is often automated, isn't it? And when it isn't, it is fair that it is a one-time up front effort to write a slightly larger class that has a good equals. We could provide a convenient base class, I guess, or even now add a default implementation in the functional interface maybe?

@iemejia
Copy link
Member Author

iemejia commented Jan 7, 2020

I see the things more clearly now, thanks for the explanation @TheNeuralBit. I was not aware of the goals of the other PR. I have one question related to that PR. Why did you use the strict conversion (no widening/narrowing) from/to row only on the schemaCoder that generates GenericRecords AvroUtils#schemaCoder(java.lang.Class<T>) and not in the others. Or maybe the real question is why we need to do this strictly?

@iemejia
Copy link
Member Author

iemejia commented Jan 7, 2020

After thinking more on the issue probably we don't have a nice solution for it. We cannot define equals (and hashCode) on SerializableFunction or its parent ProcessFunction because Java does not allow to override a member of java.lang.Object.

So the alternative would be to change the signature of SchemaCoder to use a new class that implements SerialiableFunction and equals, but this will require many changes for few returns, maybe the easiest thing to do here is the right one, just to document that the functions to convert from/to rows in SchemaCoder must implement equals/hashCode.

WDYT? I will update the PR with the fix for the Avro case in the meantime.

@iemejia iemejia force-pushed the BEAM-9042-schemacoder-not-serializable branch from 5d5b015 to dfa7d69 Compare January 7, 2020 17:05
@iemejia
Copy link
Member Author

iemejia commented Jan 7, 2020

PR updated I used your equals suggestion to tackle the first part instead of the byte equality change and improved the serialization issue of the internal Avro schema. Now we should be good to go. PTAL @TheNeuralBit

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM except that it seems to have broken some integration tests? I'm not sure what went wrong.

Why did you use the strict conversion (no widening/narrowing) from/to row only on the schemaCoder that generates GenericRecords AvroUtils#schemaCoder(java.lang.Class) and not in the others. Or maybe the real question is why we need to do this strictly?

I think this is a question for @reuvenlax - in my PR I tried to just preserve the behavior that he had implemented previously. (Primarily in #7290)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure but it looks like the failures are occurring when avroSchema is null. Either way I think you need to check if avroSchemaAsString is null here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right I forgot to check the nullability of the string before the parse, I will fix that and add a method for this. Hopefully everything will be green at that moment. Thanks for the hint.

@iemejia iemejia force-pushed the BEAM-9042-schemacoder-not-serializable branch from cf76be7 to fe17671 Compare January 7, 2020 21:55
@iemejia iemejia merged commit 8aba30f into apache:master Jan 8, 2020
@iemejia
Copy link
Member Author

iemejia commented Jan 8, 2020

Merging now that the tests are green.

@iemejia
Copy link
Member Author

iemejia commented Jan 8, 2020

Sorry if I rushed a bit the merge @TheNeuralBit I just wanted to cherry pick it to unblock 2.18.0 release.

@TheNeuralBit
Copy link
Member

Not at all, sorry I didn't approve sooner.

@iemejia iemejia deleted the BEAM-9042-schemacoder-not-serializable branch January 8, 2020 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants