[BEAM-3419] Flesh out iterable side inputs and key enumeration for multimaps in shared libraries #10147

lukecwik · 2019-11-18T22:26:56Z

This removes the byte[] that was used as the key and exposes the SDK's coder, specifically using the structural value for comparison.

Update portable Python to use the iterable state key. Note that this doesn't affect Dataflow, since dataflow_runner.py currently converts all iterable side inputs into multimaps and no SDK performs key enumeration yet.

Update both Flink and Spark to support the iterable API and key enumeration for multimaps. To limit the extent of this change, I made only the minimal modification for Dataflow. A follow-up PR will do the same for Dataflow and then enable multimap side input key enumeration and iterable lookup within the various SDKs.
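
As a rough illustration of why comparing keys by structural value matters, here is a minimal Java sketch. It assumes the concern is that a raw byte[] compares by identity, so two equal encodings of the same key would not match in a hash-based lookup. The `structuralValue` helper and the class name are hypothetical and stand in for whatever the coder provides; this is not the code from this PR.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class StructuralKeySketch {
  // Hypothetical helper: wraps encoded bytes in a value whose equals/hashCode
  // depend on content (ByteBuffer compares its remaining bytes).
  static Object structuralValue(byte[] encoded) {
    return ByteBuffer.wrap(encoded.clone());
  }

  // A raw byte[] key uses identity equality, so an equal-content array misses.
  static boolean rawLookupHits() {
    Map<byte[], String> map = new HashMap<>();
    map.put(new byte[] {1, 2}, "value");
    return map.containsKey(new byte[] {1, 2});
  }

  // The structural wrapper compares by content, so the lookup succeeds.
  static boolean structuralLookupHits() {
    Map<Object, String> map = new HashMap<>();
    map.put(structuralValue(new byte[] {1, 2}), "value");
    return map.containsKey(structuralValue(new byte[] {1, 2}));
  }

  public static void main(String[] args) {
    System.out.println("raw byte[] lookup hits: " + rawLookupHits());
    System.out.println("structural lookup hits: " + structuralLookupHits());
  }
}
```

The same reasoning applies to enumerating keys: deduplicating keys needs content-based equality, which a byte[] does not give by default.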

lukecwik · 2019-11-19T17:05:57Z

R: @tweise @mxm

lukecwik · 2019-11-19T19:12:32Z

Run Portable_Python PreCommit

lukecwik · 2019-11-19T19:13:04Z

Run Python PreCommit

lukecwik · 2019-11-19T19:27:32Z

Run Java PreCommit

lukecwik · 2019-11-19T19:44:34Z

CC: @robertwb @lostluck

mxm

Great to see the side input access patterns more accurately represented in the Proto, as well as in the shared libraries. I've made a pass. It looks good to me.

mxm · 2019-11-20T12:33:31Z

Run Flink ValidatesRunner

tvalentyn · 2019-11-20T22:41:11Z

I think this breaks :runners:spark:compileJava on master. @lukecwik can you please take a look?

tvalentyn · 2019-11-20T22:42:10Z

beam/runners/spark/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/functions/SparkSideInputReader.java:49: error: incompatible types: MultimapView is not a functional interface
      o -> Collections.EMPTY_LIST;
      ^
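
For context on the error above: a Java lambda can only target an interface with exactly one abstract method. The sketch below assumes that, after this change, the multimap view exposes both a key-enumeration method and a per-key lookup, which would make it non-functional. The `MultimapView` declared here is a simplified, hypothetical stand-in and not the Beam interface itself; the anonymous class shows one way an empty view could be written instead of the lambda. It is not necessarily the fix that was applied.

```java
import java.util.Collections;

public class EmptyMultimapViewSketch {
  // Hypothetical stand-in: two abstract methods, so no lambda can implement it.
  interface MultimapView<K, V> {
    Iterable<K> get();      // enumerate all keys
    Iterable<V> get(K key); // look up the values for one key
  }

  // An empty view written as an anonymous class that implements both methods.
  static <K, V> MultimapView<K, V> emptyView() {
    return new MultimapView<K, V>() {
      @Override
      public Iterable<K> get() {
        return Collections.emptyList();
      }

      @Override
      public Iterable<V> get(K key) {
        return Collections.emptyList();
      }
    };
  }

  public static void main(String[] args) {
    MultimapView<String, Integer> view = emptyView();
    System.out.println("has keys: " + view.get().iterator().hasNext());
    System.out.println("has values for 'a': " + view.get("a").iterator().hasNext());
  }
}
```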

tvalentyn · 2019-11-20T22:45:12Z

@ibzib has a fix in flight for this.

ibzib · 2019-11-20T22:47:53Z

#10182

lukecwik force-pushed the side_input branch 2 times, most recently from d319089 to 38cc4bc Compare November 19, 2019 00:45

lukecwik force-pushed the side_input branch from 38cc4bc to b1bc3d2 Compare November 19, 2019 17:04

lukecwik changed the title ~~[WIP] Flesh out iterable side input and key access for multimaps~~ [BEAM-3419] Flesh out iterable side inputs and key enumeration for multimaps in shared libraries Nov 19, 2019

mxm reviewed Nov 20, 2019

lukecwik merged commit 9407578 into apache:master Nov 20, 2019
