
[BEAM-3648] Support Splittable DoFn in Flink Batch Runner#4640

Closed
aljoscha wants to merge 2 commits into apache:master from aljoscha:flink-batch-sdf

Conversation

@aljoscha (Contributor) commented Feb 8, 2018

This is built on top of #4639 so only the last commit is relevant here.

@aljoscha requested a review from jkff on February 8, 2018 at 15:20
@aljoscha (Contributor, Author) commented:

Run Flink ValidatesRunner

@aljoscha (Contributor, Author) commented:

@jkff Next step, now that #4639 is merged. 😉

@aljoscha (Contributor, Author) commented:

Run Flink ValidatesRunner

@jkff (Contributor) left a review:

Awesome, thanks! Sorry for the delay - was preparing for the Strata talk...


}

private static class GBKIntoKeyedWorkItemsTranslator<K, InputT>
@jkff (Contributor):

Does this duplicate code with the streaming codepath? If so, can it be deduplicated into a shared location?

@aljoscha (Contributor, Author):

There is no code duplication since the streaming codepath uses the Flink DataStream API while the batch codepath uses the DataSet API.

Flink is interesting in that the underlying execution engine is a streaming execution engine but there are two different APIs on top of that. Seems somewhat the opposite of what Dataflow was/is like.
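To illustrate the point being made here, the following is a schematic sketch, not Flink's actual classes: `FakeDataSet` and `FakeDataStream` are stand-ins for Flink's real `DataSet` and `DataStream` types. Because a translator is written against one of these API surfaces, the batch and streaming translations cannot share an implementation even when the logical transform is identical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Stand-ins for Flink's two API surfaces. The real classes are
// org.apache.flink.api.java.DataSet (batch) and
// org.apache.flink.streaming.api.datastream.DataStream (streaming);
// these fakes only illustrate that they are unrelated types.
class FakeDataSet<T> {
  final List<T> elements;
  FakeDataSet(List<T> elements) { this.elements = elements; }
}

class FakeDataStream<T> {
  final List<T> elements;
  FakeDataStream(List<T> elements) { this.elements = elements; }
}

// Each runner needs its own translator because the translators are
// written against different input/output types, even when the logical
// transform (here: int -> "v" + int) is the same.
class BatchTranslator {
  FakeDataSet<String> translate(FakeDataSet<Integer> input) {
    return new FakeDataSet<>(
        input.elements.stream().map(i -> "v" + i).collect(Collectors.toList()));
  }
}

class StreamingTranslator {
  FakeDataStream<String> translate(FakeDataStream<Integer> input) {
    return new FakeDataStream<>(
        input.elements.stream().map(i -> "v" + i).collect(Collectors.toList()));
  }
}
```

So the apparent "duplication" is two type-incompatible translations of the same logical transform, which is why there is no shared location to deduplicate them into.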

@jkff (Contributor):

Oh okay - then no objections :)

}
}

private static class SplittableParDoTranslatorBatch<
@jkff (Contributor):

Same question

@aljoscha (Contributor, Author):

see above

import org.joda.time.Instant;

/**
* A {@link RichGroupReduceFunction} for splittable {@link DoFn} in Flink Batch Runner.
@jkff (Contributor):

Clarify how it differs from the streaming one?

@aljoscha (Contributor, Author):

I'll add a comment.

ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor(
Executors.defaultThreadFactory());

((SplittableParDoViaKeyedWorkItems.ProcessFn) dofn)
@jkff (Contributor):

Extract into a variable?

@aljoscha (Contributor, Author):

Yes! I realised we also don't have the generic type info. I'll try and fix that.
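The refactor being suggested can be sketched as follows. The `DoFn`/`ProcessFn` classes below are hypothetical stand-ins (the real type under discussion is `SplittableParDoViaKeyedWorkItems.ProcessFn`); the point is extracting the inline cast into a typed local, and why the cast is unavoidably unchecked when the generic type info is missing.

```java
// Hypothetical stand-ins for the generic DoFn hierarchy touched by this PR.
class DoFn<InputT, OutputT> {}

class ProcessFn<InputT, OutputT> extends DoFn<InputT, OutputT> {
  String describe() {
    return "process-fn";
  }
}

class CastExample {
  static String invoke(DoFn<String, String> doFn) {
    // Extract the cast into a typed local variable instead of repeating
    // it inline. The cast is unchecked because generic type arguments
    // are erased at runtime, which is the missing generic type info
    // mentioned above.
    @SuppressWarnings("unchecked")
    ProcessFn<String, String> processFn = (ProcessFn<String, String>) doFn;
    return processFn.describe();
  }
}
```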

},
sideInputReader,
executorService,
10000,
@jkff (Contributor):

Is there any reason to limit the output and duration in batch? Maybe we can just run the whole SDF.

@aljoscha (Contributor, Author):

No reason, if there's work left it will just call again when the processing-time timer fires. I'll set this to Integer.MAX_VALUE, Duration.millis(Long.MAX_VALUE), though.

@jkff (Contributor):

Yeah that should be better in batch because we're optimizing for throughput rather than latency so bigger bundles are better. Depending on how Flink streaming runner works, it may be reasonable to choose different parameter values for streaming runner too, but the default ones may also be good.
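The mechanism under discussion can be sketched in a self-contained way. This is not Beam's actual API; the names `processBounded`, `maxOutputs`, and `maxDurationMillis` are illustrative. An invocation emits elements until the restriction is exhausted or a count/duration cap is hit, in which case it checkpoints and resumes later when a timer fires. Passing `Integer.MAX_VALUE` and an effectively infinite duration, as proposed above for batch, makes a single call process everything, favoring throughput over latency.

```java
import java.util.List;

class BoundedInvoker {
  // Returns the index to resume from, or -1 when the restriction
  // has been fully processed and no continuation is needed.
  static int processBounded(
      List<Integer> restriction,
      int start,
      int maxOutputs,
      long maxDurationMillis,
      List<Integer> output) {
    // Guard against overflow when the caller passes Long.MAX_VALUE.
    long now = System.currentTimeMillis();
    long deadline =
        maxDurationMillis > Long.MAX_VALUE - now ? Long.MAX_VALUE : now + maxDurationMillis;
    int emitted = 0;
    for (int i = start; i < restriction.size(); i++) {
      output.add(restriction.get(i) * 2); // stand-in for emitting an element
      emitted++;
      boolean done = (i == restriction.size() - 1);
      if (!done && (emitted >= maxOutputs || System.currentTimeMillis() >= deadline)) {
        return i + 1; // checkpoint: continue from here when the timer fires
      }
    }
    return -1;
  }
}
```

In a streaming runner, smaller caps bound latency by letting downstream stages see output sooner; in batch there is no such pressure, so the loop may as well run to completion.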

@jkff (Contributor) commented Mar 13, 2018

Please let me know when ready for another round.

stale bot commented Jun 7, 2018

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

stale bot added the wontfix label Jun 7, 2018
stale bot commented Jun 14, 2018

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

stale bot closed this Jun 14, 2018
@kennknowles added the stale label and removed the wontfix label Jun 25, 2018