[BEAM-7116] Remove use of KV in Schema transforms #10151
Conversation
run sql postcommit
friendly ping
Run Java PreCommit
Sorry for the delay. LGTM, just some minor comments.
public static class RowIterableFieldMatcher extends BaseMatcher<Row> {
Is this dead? I don't see it used anywhere.
case ITERABLE:
  Row[] expectedIterable = Iterables.toArray((Iterable<Row>) expected, Row.class);
  List<Row> actualIterable = Lists.newArrayList(row.getIterable(fieldIndex));
  return containsInAnyOrder(expectedIterable).matches(actualIterable);
It's probably worth documenting that RowFieldMatcher doesn't care about order for array and iterable types. Maybe that should even be indicated in the name somehow? It seems like that's the primary difference between this and just using equalTo(expectedRow).
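To illustrate the distinction the reviewer is drawing, here is a minimal, self-contained sketch of an order-insensitive comparison (a hypothetical helper mirroring what `containsInAnyOrder` does in the ITERABLE branch above, not Beam code) next to plain `equals`, which is order-sensitive the way `equalTo(expectedRow)` would be:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AnyOrderMatch {
  // Order-insensitive comparison: sort copies of both sides, then compare.
  // (Illustrative stand-in for Hamcrest's containsInAnyOrder.)
  static <T extends Comparable<T>> boolean matchesInAnyOrder(List<T> expected, List<T> actual) {
    List<T> e = new ArrayList<>(expected);
    List<T> a = new ArrayList<>(actual);
    Collections.sort(e);
    Collections.sort(a);
    return e.equals(a);
  }

  public static void main(String[] args) {
    List<String> expected = Arrays.asList("a", "b", "c");
    List<String> reordered = Arrays.asList("c", "a", "b");

    // Same elements, different order: the any-order check passes...
    System.out.println(matchesInAnyOrder(expected, reordered)); // true
    // ...while plain equality (analogous to equalTo) fails.
    System.out.println(expected.equals(reordered));             // false
  }
}
```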
815529d to e16606e
@reuvenlax @TheNeuralBit https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/6002/
Yeah, this PR must be the cause. It's not clear to me why this only seems to be a problem with the Spark runner. I'll file a JIRA and exclude the test from Spark for now.
@TheNeuralBit Thank you for the confirmation.
#10358 should fix Spark ValidatesRunner |
The short answer is that the failure is caused by a limitation of the Spark runner: the Beam model allows the Iterable passed into a GroupByKey to be reiterated, but Spark is unable to support that.

The longer answer is that this is exposing a performance bug in the Group transform. We were accidentally walking over the iterator, which we should not be doing here. I'll send a PR to fix it.
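The failure mode described above can be reproduced without any Beam or Spark dependency: an Iterable that wraps a single underlying Iterator (as a single-pass runner effectively provides) yields all its elements on the first traversal and nothing on the second. This is an illustrative sketch, not actual Spark runner code:

```java
import java.util.Arrays;
import java.util.Iterator;

public class OneShotIterable {
  // Sum every element the iterable yields in one traversal.
  static int sum(Iterable<Integer> it) {
    int total = 0;
    for (int x : it) {
      total += x;
    }
    return total;
  }

  public static void main(String[] args) {
    // Simulate a single-pass Iterable: every call to iterator() returns the
    // SAME underlying iterator, so only the first traversal sees the data.
    Iterator<Integer> source = Arrays.asList(1, 2, 3).iterator();
    Iterable<Integer> oneShot = () -> source;

    System.out.println(sum(oneShot)); // first pass: 6
    System.out.println(sum(oneShot)); // second pass: 0, the iterator is spent
  }
}
```

A runner that materializes grouped values (or supports re-reading them) would return a fresh iterator on each `iterator()` call instead, which is why the test passes elsewhere but fails on Spark.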
Beam's KV type has no schema, and due to the special-casing of KvCoder in Beam it is difficult to give it one. Here we modify the Beam schema transforms that return PCollection<KV<K, V>> to instead return PCollection<Row>, where the Row contains key and value fields. This is possible now that we support large iterables in schemas.
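The shape change the description refers to can be sketched in plain Java, with a field-name-to-value map standing in for Beam's schema'd Row (illustrative stand-in only, not the real Row class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KvToRowShape {
  public static void main(String[] args) {
    // Before: grouping transforms produced a KV pair, which has no schema.
    Map.Entry<String, Long> kv = Map.entry("user1", 42L);

    // After: the same data carried as named fields, the way the schema'd
    // Row in this PR exposes "key" and "value" fields.
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("key", kv.getKey());
    row.put("value", kv.getValue());

    System.out.println(row.get("key"));   // user1
    System.out.println(row.get("value")); // 42
  }
}
```

Downstream schema-aware transforms can then address the grouped data by field name rather than through KV-specific code paths.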