[BEAM-890] Update compatibility matrix for Spark. #65

Closed
amitsela wants to merge 2 commits into apache:asf-site from amitsela:BEAM-890

amitsela (Member) commented Nov 4, 2016

No description provided.

amitsela (Member, Author) commented Nov 4, 2016

R: @jbonofre

asfbot commented Nov 4, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Test/16/

asfbot commented Nov 4, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Stage/62/

Jenkins built the site at commit id 97c147f with Jekyll and staged it here. Happy reviewing.

Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.

davorbonaci (Member) commented

R: @kennknowles

kennknowles (Member) left a comment

So happy to be changing cells to "yes" :-)

Some suggestions about how to focus the explanations.

l2: group by window in batch only
l3: "Uses Spark's groupByKey for grouping. Grouping by window is currently only supported in batch."
l2: support for grouping by panes (streaming) is a work in progress.
l3: Using groupByKey for grouping, but only if the pipeline explicitly calls for GroupByKey or the model forces it. For efficient group-compute see Combine.
Member

Say something positive about batch first, like:

"Fully supported in batch mode. GroupByKey with multiple trigger firings in streaming mode is a work in progress."

And then I don't think you actually need to teach users about how to program against it here. The statement is likely true for most runners. If you just say "Using Spark's groupByKey" that tells users what they need to know, if they are familiar with Spark. (do use code font, I'd say)

Member Author

Maybe we should be clearer about this in general. I know that people seeing GroupByKey associated with Spark in the same context caused a bit of a "riot" on Twitter a while ago ;-)
Maybe clarifying the optimizations via Combine in general is a good idea.
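
To illustrate the Combine point, here is a minimal sketch in plain Python. It is not the Spark or Beam API; the function names and the toy partitions are hypothetical. It assumes the usual motivation for a per-key combine: values are pre-aggregated inside each partition, so fewer records cross the shuffle than when every pair is grouped first.

```python
from collections import defaultdict


def group_by_key_then_sum(partitions):
    """Shuffle every (key, value) pair, then sum the values per key."""
    # Every record crosses the (simulated) shuffle boundary.
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}, len(shuffled)


def combine_per_key_sum(partitions):
    """Pre-aggregate inside each partition, then merge the partial sums."""
    partials = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one partial sum per key per partition is shuffled.
        partials.extend(local.items())
    merged = defaultdict(int)
    for key, partial in partials:
        merged[key] += partial
    return dict(merged), len(partials)


# Hypothetical input: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]
grouped_result, grouped_shuffled = group_by_key_then_sum(partitions)
combined_result, combined_shuffled = combine_per_key_sum(partitions)
```

Both functions produce the same sums; the difference in this toy model is only the number of records that would cross the shuffle.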

Member

Agree with Amit.

Member Author

@kennknowles, do you mean using the tt tag for groupByKey, combineByKey, and such?

Member

Whatever Jekyll requires. I guess it is YAML and I was thinking Markdown.

Member Author

@jbonofre, any input here? I'm the worst web developer in the project, for sure.

Member

I think ' should do the trick.

l1: 'Yes'
l2: fully supported
l3: Supports GroupedValues, Globally and PerKey.
l3: Using combineByKey and aggregate functions.
Member

"Using Spark's combineByKey and aggregate functions."

l3: "Side input is actually a broadcast variable in Spark so it can't be updated during the life of a job. Spark-runner implementation of side input is more of an immutable, static, side input."
l1: 'Yes'
l2: fully supported
l3: A side input is actually a broadcast variable in Spark. In streaming mode, a side input could be updated between micro-batches. The distribution of side inputs to workers is not partitioned, but only a worker assigned with a relevant task will get a copy of the side input.
Member

"Using Spark's broadcast variables."

I actually think the rest of it is just a description of what a side input is. They are always global views of a PCollection, and are generally always only read by workers who are working on a task that needs them.

I'm not saying you can't have some explanation, but maybe more focused. Maybe the second sentence is good, like "In streaming mode, side input values only update between micro-batches."
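
The "only update between micro-batches" behavior can be pictured with a small sketch in plain Python. This is a toy model, not Spark or Beam code; `run_micro_batches` and its arguments are hypothetical names. It assumes each micro-batch reads a fixed snapshot of the side input and that pending updates are applied only at batch boundaries.

```python
def run_micro_batches(batches, side_input_updates):
    """Pair each element with the side-input value visible in its micro-batch.

    batches: list of lists of elements, one inner list per micro-batch.
    side_input_updates: dict mapping a batch index to the updates that
        become visible starting with that batch (hypothetical model).
    """
    side_input = {}
    outputs = []
    for index, batch in enumerate(batches):
        # Updates take effect only at the boundary before a micro-batch.
        side_input = {**side_input, **side_input_updates.get(index, {})}
        # The snapshot stays fixed for every element in this batch.
        snapshot = dict(side_input)
        outputs.append([(element, snapshot.get(element)) for element in batch])
    return outputs


# Hypothetical data: the value for "x" changes between the two batches.
results = run_micro_batches(
    batches=[["x", "y"], ["x"]],
    side_input_updates={0: {"x": 1}, 1: {"x": 2}},
)
```

Within a batch every element sees the same snapshot; the changed value for "x" becomes visible only in the second batch.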

amitsela (Member, Author) commented Nov 4, 2016

I don't understand the comments for Flink and Dataflow. Clearly side inputs have size restrictions by design, so why is streaming different than batch? And why write it here?

amitsela (Member, Author) commented Nov 4, 2016

@kennknowles @jbonofre trying a 2nd iteration.

asfbot commented Nov 4, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Stage/63/

Jenkins built the site at commit id 6fa3dda with Jekyll and staged it here. Happy reviewing.

Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.

asfbot commented Nov 4, 2016

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Test/17/

davorbonaci (Member) commented

LGTM

asfgit closed this in e96b07f Nov 4, 2016
amitsela deleted the BEAM-890 branch November 5, 2016 08:36
jbonofre (Member) commented Nov 5, 2016

LGTM

robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Jun 5, 2018
melap pushed a commit to apache/beam that referenced this pull request Jun 20, 2018