[BEAM-7116] Remove use of KV in Schema transforms #10151
Conversation
run sql postcommit
friendly ping
Run Java PreCommit
Sorry for the delay. LGTM, just some minor comments.
public static class RowIterableFieldMatcher extends BaseMatcher<Row> {
Is this dead? I don't see it used anywhere.
case ITERABLE:
  Row[] expectedIterable = Iterables.toArray((Iterable<Row>) expected, Row.class);
  List<Row> actualIterable = Lists.newArrayList(row.getIterable(fieldIndex));
  return containsInAnyOrder(expectedIterable).matches(actualIterable);
It's probably worth documenting that RowFieldMatcher doesn't care about order for array and iterable types. Maybe that should even be indicated in the name somehow? It seems like that's the primary difference between this and just using equalTo(expectedRow).
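To illustrate the distinction the reviewer is drawing, here is a minimal, self-contained sketch of an order-insensitive comparison (a hypothetical helper mirroring what `containsInAnyOrder` does in the ITERABLE branch above, not Beam code) next to plain `equals`, which is order-sensitive the way `equalTo(expectedRow)` would be:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class AnyOrderMatch {
  // Order-insensitive comparison: sort copies of both sides, then compare.
  // (Illustrative stand-in for Hamcrest's containsInAnyOrder.)
  static <T extends Comparable<T>> boolean matchesInAnyOrder(List<T> expected, List<T> actual) {
    List<T> e = new ArrayList<>(expected);
    List<T> a = new ArrayList<>(actual);
    Collections.sort(e);
    Collections.sort(a);
    return e.equals(a);
  }

  public static void main(String[] args) {
    List<String> expected = Arrays.asList("a", "b", "c");
    List<String> reordered = Arrays.asList("c", "a", "b");

    // Same elements, different order: the any-order check passes...
    System.out.println(matchesInAnyOrder(expected, reordered)); // true
    // ...while plain equality (analogous to equalTo) fails.
    System.out.println(expected.equals(reordered));             // false
  }
}
```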
815529d to e16606e
@reuvenlax @TheNeuralBit https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/6002/
Yeah, this PR must be the cause. It's not clear to me why this only seems to be a problem with the Spark runner. I'll file a JIRA and exclude the test from Spark for now.
@TheNeuralBit Thank you for the confirmation.
#10358 should fix Spark ValidatesRunner |
The short answer is that the failure is caused by a limitation of the Spark runner: the Beam model allows the Iterable passed into a GroupByKey to be reiterated, but Spark is unable to support that.

The longer answer is that this is exposing a performance bug in the Group transform. We were accidentally walking over the iterator, which we should not be doing here. I'll send a PR to fix it.
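The failure mode described above can be reproduced without any Beam or Spark dependency: an Iterable that wraps a single underlying Iterator (as a single-pass runner effectively provides) yields all its elements on the first traversal and nothing on the second. This is an illustrative sketch, not actual Spark runner code:

```java
import java.util.Arrays;
import java.util.Iterator;

public class OneShotIterable {
  // Sum every element the iterable yields in one traversal.
  static int sum(Iterable<Integer> it) {
    int total = 0;
    for (int x : it) {
      total += x;
    }
    return total;
  }

  public static void main(String[] args) {
    // Simulate a single-pass Iterable: every call to iterator() returns the
    // SAME underlying iterator, so only the first traversal sees the data.
    Iterator<Integer> source = Arrays.asList(1, 2, 3).iterator();
    Iterable<Integer> oneShot = () -> source;

    System.out.println(sum(oneShot)); // first pass: 6
    System.out.println(sum(oneShot)); // second pass: 0, the iterator is spent
  }
}
```

A runner that materializes grouped values (or supports re-reading them) would return a fresh iterator on each `iterator()` call instead, which is why the test passes elsewhere but fails on Spark.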
Beam's KV type has no schema, and due to the special-casing of KvCoder in Beam it is difficult to give it one. Here we modify the Beam schema transforms that return PCollection<KV<K, V>> to instead return PCollection<Row>, where the Row contains key and value fields. This is possible now that we support large iterables in schemas.
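The shape change the description refers to can be sketched in plain Java, with a field-name-to-value map standing in for Beam's schema'd Row (illustrative stand-in only, not the real Row class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KvToRowShape {
  public static void main(String[] args) {
    // Before: grouping transforms produced a KV pair, which has no schema.
    Map.Entry<String, Long> kv = Map.entry("user1", 42L);

    // After: the same data carried as named fields, the way the schema'd
    // Row in this PR exposes "key" and "value" fields.
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("key", kv.getKey());
    row.put("value", kv.getValue());

    System.out.println(row.get("key"));   // user1
    System.out.println(row.get("value")); // 42
  }
}
```

Downstream schema-aware transforms can then address the grouped data by field name rather than through KV-specific code paths.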