[BEAM-890] Update compatibility matrix for Spark. #65
amitsela wants to merge 2 commits into apache:asf-site from
R: @jbonofre
Refer to this link for build results (access rights to CI server needed):
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 97c147f with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.
R: @kennknowles
kennknowles left a comment
So happy to be changing cells to "yes" :-)
Some suggestions about how to focus the explanations.
_data/capability-matrix.yml
Outdated
l2: group by window in batch only
l3: "Uses Spark's groupByKey for grouping. Grouping by window is currently only supported in batch."
l2: support for grouping by panes (streaming) is a work in progress.
l3: Using groupByKey for grouping, but only if the pipeline explicitly calls for GroupByKey or the model forces it. For efficient group-compute see Combine.
Say something positive about batch first, like:
"Full support in batch mode. GroupByKey with multiple trigger firings in streaming mode is a work in progress."
And then I don't think you actually need to teach users about how to program against it here. The statement is likely true for most runners. If you just say "Using Spark's groupByKey" that tells users what they need to know, if they are familiar with Spark. (do use code font, I'd say)
Maybe we should be clearer about this in general. I know that people seeing GroupByKey associated with Spark in the same context caused a bit of a "riot" on Twitter a while ago ;-)
Maybe clarifying the optimizations via Combine in general is a good idea.
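To make the Combine point concrete, here is a minimal sketch in plain Python (no Spark involved). It assumes two hypothetical helpers, `group_by_key` and `combine_by_key`, that only mimic the general idea: grouping moves every record across the shuffle, while combining per key first moves one partial result per key and partition.

```python
from collections import defaultdict


def group_by_key(partitions):
    """Shuffle every (key, value) record, then group the values per key."""
    shuffled = [record for part in partitions for record in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)


def combine_by_key(partitions, combine):
    """Pre-combine inside each partition, then merge the partial results."""
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        partials.extend(local.items())
    merged = {}
    for key, value in partials:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged, len(partials)


# Two hypothetical partitions of (key, value) records.
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

grouped, grouped_shuffled = group_by_key(partitions)
sums_from_groups = {key: sum(values) for key, values in grouped.items()}
combined, combined_shuffled = combine_by_key(partitions, lambda x, y: x + y)
```

Both paths produce the same per-key sums; the difference is only in how many records cross the simulated shuffle boundary, which is the kind of optimization a short note on Combine could point to.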
@kennknowles you mean use the tt tag for groupByKey, combineByKey and such?
Whatever Jekyll requires. I guess it is YAML and I was thinking Markdown.
@jbonofre any input here? I'm the worst web developer in the project. For sure.
_data/capability-matrix.yml
Outdated
l1: 'Yes'
l2: fully supported
l3: Supports GroupedValues, Globally and PerKey.
l3: Using combineByKey and aggregate functions.
"Using Spark's combineByKey and aggregate functions."
_data/capability-matrix.yml
Outdated
l3: "Side input is actually a broadcast variable in Spark so it can't be updated during the life of a job. Spark-runner implementation of side input is more of an immutable, static, side input."
l1: 'Yes'
l2: fully supported
l3: A side input is actually a broadcast variable in Spark. In streaming mode, a side input could be updated between micro-batches. The distribution of side inputs to workers is not partitioned, but only a worker assigned with a relevant task will get a copy of the side input.
"Using Spark's broadcast variables."
I actually think the rest of it is just a description of what a side input is. They are always global views of a PCollection, and are generally always only read by workers who are working on a task that needs them.
I'm not saying you can't have some explanation, but maybe more focused. Maybe the second sentence is good, like "In streaming mode, side input values only update between micro-batches."
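As a rough illustration of "only update between micro-batches", here is a plain Python sketch (not the Spark runner's code). The function `run_micro_batches` and the sample data are hypothetical; a read-only mapping stands in for a broadcast value that is fixed while a batch runs and replaced before the next one.

```python
from types import MappingProxyType


def run_micro_batches(batches, side_input_versions):
    """Process each micro-batch against a read-only snapshot of the side input."""
    results = []
    for batch, version in zip(batches, side_input_versions):
        # Snapshot taken before the batch starts; it cannot change mid-batch.
        snapshot = MappingProxyType(dict(version))
        results.append([snapshot.get(key, 0) * value for key, value in batch])
    return results


# The same elements arrive in two consecutive micro-batches.
batches = [[("a", 1), ("b", 2)], [("a", 1), ("b", 2)]]
# The side input for key "a" changes between the first and second batch.
side_input_versions = [{"a": 10, "b": 20}, {"a": 100, "b": 20}]

results = run_micro_batches(batches, side_input_versions)
```

Within a batch every element sees the same values; the new value for "a" only becomes visible in the second batch.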
I don't understand the comments for Flink and Dataflow. Clearly side inputs have size restrictions by design, so why is streaming different than batch? And why write it here?
@kennknowles @jbonofre trying a 2nd iteration.
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 6fa3dda with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.
Refer to this link for build results (access rights to CI server needed):
LGTM
LGTM