Add BoundedReader APIs for expressing remaining and consumed parallelism by dhalperi · Pull Request #353 · apache/beam

dhalperi · 2016-05-19T02:52:53Z

These are useful for dynamic work rebalancing and autoscaling.

And implement it for common sources
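As a rough illustration of what "consumed" and "remaining" could mean for a simple source, here is a minimal sketch of a reader over an integer range in which every element is its own split point. The class name `CountingReaderSketch` and its method names are hypothetical and are not the API added by this PR. The sketch also assumes that "remaining" includes the element currently being processed.

```java
/** Hypothetical reader over the range [start, end); every element is a split point. */
final class CountingReaderSketch {
  private final long start;
  private final long end;
  private long current; // element being processed; start - 1 until the first advance()
  private boolean done;

  CountingReaderSketch(long start, long end) {
    if (start > end) {
      throw new IllegalArgumentException("start must not exceed end");
    }
    this.start = start;
    this.end = end;
    this.current = start - 1;
  }

  /** Moves to the next element; returns false once the range is exhausted. */
  boolean advance() {
    if (done) {
      return false;
    }
    if (current + 1 >= end) {
      done = true;
      return false;
    }
    current++;
    return true;
  }

  /** Split points fully processed; the element in progress is not counted. */
  long splitPointsConsumed() {
    if (done) {
      return end - start;
    }
    if (current < start) {
      return 0; // not started yet
    }
    return current - start;
  }

  /** Split points not yet fully processed, including the one in progress (assumption). */
  long splitPointsRemaining() {
    if (done) {
      return 0;
    }
    if (current < start) {
      return end - start; // not started yet
    }
    return end - current;
  }
}
```

Under these assumptions, a remaining count of 1 tells a runner that the reader is on its last split point, so further splitting cannot yield more parallelism.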

*) Make the start of a block match Avro's definition: the first byte after the previous sync marker. This enables detecting the last block in the file.
*) This change enables us to unify currentOffset and currentBlockOffset, as all records are emitted at the start of the block that contains them.
*) Simplify block header reading to have fewer object allocations and buffers using a direct reader and a (allocated once only) CountingInputStream to measure the size of that header.
*) Add tests for consumed and remaining parallelism
*) Let BlockBasedSource detect the end of the file in remaining parallelism.
*) Verify in more places that the correct number of bytes is read from the input Avro file.
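The block-boundary definition above can be sketched as plain offset arithmetic. This is only an illustration, not the AvroSource code: the class `AvroBlockSketch`, its fields, and the `UNKNOWN` sentinel are hypothetical. It relies on the Avro container layout, in which a block is a record count, a data size, the serialized records, and a 16-byte sync marker.

```java
/** Hypothetical description of one Avro data block, using the block-start definition above. */
final class AvroBlockSketch {
  static final int SYNC_MARKER_BYTES = 16; // Avro sync markers are 16 bytes long
  static final long UNKNOWN = -1L;         // assumed sentinel for "cannot tell"

  final long startOffset; // first byte after the previous sync marker
  final long headerBytes; // bytes taken by the encoded record count and data size
  final long dataBytes;   // bytes of serialized (possibly compressed) records

  AvroBlockSketch(long startOffset, long headerBytes, long dataBytes) {
    this.startOffset = startOffset;
    this.headerBytes = headerBytes;
    this.dataBytes = dataBytes;
  }

  /** Offset just past this block's sync marker, which is also the start of the next block. */
  long endOffset() {
    return startOffset + headerBytes + dataBytes + SYNC_MARKER_BYTES;
  }

  /** With this definition, the last block is the one that ends exactly at the end of the file. */
  boolean isLastBlock(long fileSize) {
    return endOffset() == fileSize;
  }

  /** One split point remains on the last block; otherwise the count is not known (assumption). */
  long splitPointsRemaining(long fileSize) {
    return isLastBlock(fileSize) ? 1 : UNKNOWN;
  }
}
```

Because a block's end offset equals the next block's start offset, a single offset per block is enough, which is consistent with unifying currentOffset and currentBlockOffset.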

*) empty file
*) non-empty compressed file
*) non-empty not-compressed file

*) empty file
*) non-empty file

This is not a very good offset because it is an upper bound, but it is likely better than not reporting any progress at all.
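To illustrate why an offset derived from bytes decompressed tends to be an upper bound, here is a minimal sketch that counts bytes leaving a GZIP stream while a buffered line reader sits on top of it. It is not the CompressedSource implementation. `DecompressedOffsetSketch` and `CountingStream` are hypothetical names, and the sketch assumes that read-ahead buffering is the reason the count overshoots.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class DecompressedOffsetSketch {
  /** Counts the bytes handed out by the wrapped (already decompressing) stream. */
  static final class CountingStream extends FilterInputStream {
    private long count;

    CountingStream(InputStream in) {
      super(in);
    }

    @Override
    public int read() throws IOException {
      int b = super.read();
      if (b >= 0) {
        count++;
      }
      return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
      int n = super.read(buf, off, len);
      if (n > 0) {
        count += n;
      }
      return n;
    }

    long getCount() {
      return count;
    }
  }

  /** Compresses text in memory so the example needs no files. */
  static byte[] gzip(String text) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  /** For each '\n'-terminated line, returns {true end offset, offset reported by the counter}. */
  static List<long[]> readLines(byte[] gzipped) {
    List<long[]> result = new ArrayList<>();
    try (CountingStream counting =
            new CountingStream(new GZIPInputStream(new ByteArrayInputStream(gzipped)));
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(counting, StandardCharsets.UTF_8))) {
      long trueOffset = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        trueOffset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        result.add(new long[] {trueOffset, counting.getCount()});
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result;
  }
}
```

The reported count can never be smaller than the true offset of the record just returned, because the reader cannot return bytes it has not yet pulled from the stream. It can be larger whenever the reader buffers ahead.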

dhalperi · 2016-05-19T16:54:13Z

R: @bjchambers

bjchambers · 2016-05-19T17:28:02Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java

This is helpful, but some of the parentheticals seem confusing.

"current (previous)" I think should be "current (about to be previous)" or just say "current" since you've already stated that the current block in the precondition is about to be previous.

Similarly, in Postcondition: "current (formerly next)" could be clarified to the "new current (formerly next)"

thanks. fixed.

bjchambers · 2016-05-19T18:01:22Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java

If we know we're at a sync marker, it's unclear why we need advancePastNextSyncMarker -- it looks like that attempts to read up to the next sync marker. Further, that seems to be why we need push back. If we now know the next byte is the start of the sync marker, couldn't we just read it and continue? Wouldn't that also mean that we don't need the push back here?

bjchambers · 2016-05-19T18:21:31Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java

Alternatively, for methods that are "here is the splittable implementation and here is the non-splittable implementation" use an "if/else" block to make that clear.

bjchambers · 2016-05-19T18:33:33Z

Is there any way to update SourceTestUtils with some tests for this? It seems like correct behavior can be determined for any source by reading, remembering the split points, and asserting that the source is telling you information that is consistent with what you actually observed.
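The kind of check suggested here could look roughly like the sketch below: read the whole source, remember where split points were observed, and then compare every report against what was actually seen. This is not SourceTestUtils code. The `Reader` interface, the `UNKNOWN` sentinel, and `ArrayReader` (a toy reader with a `bias` knob that simulates misreporting) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPointConsistencySketch {
  static final long UNKNOWN = -1L; // assumed sentinel for "cannot estimate"

  /** Hypothetical minimal reader surface used by the check. */
  interface Reader {
    boolean advance();           // false once the input is exhausted
    boolean isAtSplitPoint();    // whether the current record starts a new split point
    long splitPointsConsumed();
    long splitPointsRemaining(); // may return UNKNOWN
  }

  /** Reads everything, then verifies each report; returns the number of split points observed. */
  static long assertConsistent(Reader reader) {
    List<long[]> reports = new ArrayList<>(); // {observed so far, consumed, remaining}
    long observed = 0;
    while (reader.advance()) {
      if (reader.isAtSplitPoint()) {
        observed++;
      }
      reports.add(
          new long[] {observed, reader.splitPointsConsumed(), reader.splitPointsRemaining()});
    }
    long total = observed;
    for (long[] r : reports) {
      long expectedConsumed = Math.max(0, r[0] - 1); // the current split point is in progress
      check(r[1] == expectedConsumed, "consumed does not match observation");
      check(r[2] == UNKNOWN || r[2] == total - expectedConsumed,
          "remaining does not match observation");
    }
    check(reader.splitPointsConsumed() == total, "consumed at end should equal total");
    check(reader.splitPointsRemaining() == 0, "remaining at end should be 0");
    return total;
  }

  private static void check(boolean ok, String message) {
    if (!ok) {
      throw new AssertionError(message);
    }
  }

  /** Toy reader over flags that mark which records start a split point. */
  static final class ArrayReader implements Reader {
    private final boolean[] startsSplitPoint;
    private final long bias; // added to the consumed count to simulate a misreporting source
    private final long total;
    private int index = -1;
    private long seen;
    private boolean done;

    ArrayReader(boolean[] startsSplitPoint, long bias) {
      this.startsSplitPoint = startsSplitPoint;
      this.bias = bias;
      long count = 0;
      for (boolean b : startsSplitPoint) {
        if (b) {
          count++;
        }
      }
      this.total = count;
    }

    private long consumed() {
      return done ? total : Math.max(0, seen - 1);
    }

    @Override
    public boolean advance() {
      if (index + 1 >= startsSplitPoint.length) {
        done = true;
        return false;
      }
      index++;
      if (startsSplitPoint[index]) {
        seen++;
      }
      return true;
    }

    @Override
    public boolean isAtSplitPoint() {
      return startsSplitPoint[index];
    }

    @Override
    public long splitPointsConsumed() {
      return consumed() + bias;
    }

    @Override
    public long splitPointsRemaining() {
      return total - consumed();
    }
  }
}
```

The check needs nothing source-specific beyond the reader surface, so in principle it could be run against any source that exposes these signals.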

bjchambers · 2016-05-20T23:25:20Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java

@@ -46,8 +46,8 @@
 *     <ul>
 *       <li>Progress estimation ({@link BoundedReader#getFractionConsumed})
 *       <li>Tracking of parallelism, to determine with the current source can be split


"to determine with the current source can be split" reads oddly

bjchambers · 2016-05-20T23:26:32Z

LGTM

dhalperi added 7 commits May 18, 2016 19:51

BoundedReader: add getParallelism{Consumed,Remaining}

dfeecdb

And implement it for common sources

OffsetBasedReader: test limited parallelism signals

837a42d

CompressedSource: add tests of parallelism and progress

32894f8

*) empty file
*) non-empty compressed file
*) non-empty not-compressed file

TextIO: implement and test parallelism

ca29728

*) empty file
*) non-empty file

CountingSource: test limited parallelism

b866541

CompressedSource: implement currentOffset based on bytes decompressed

4c775be

This is not a very good offset because it is an upper bound, but it is likely better than not reporting any progress at all.

bjchambers reviewed May 19, 2016

dhalperi added 2 commits May 19, 2016 10:54

fixup! AvroSource: rewrite to support remaining parallelism

a50a74c

fixup! AvroSource: rewrite to support remaining parallelism

332319a

bjchambers reviewed May 19, 2016

dhalperi added 6 commits May 19, 2016 11:42

fixup! AvroSource: rewrite to support remaining parallelism

3972a7c

fixup! BoundedReader: add getParallelism{Consumed,Remaining}

8d15c70

fixup! BoundedReader: add getParallelism{Consumed,Remaining}

bddd995

fixup! BoundedReader: add getParallelism{Consumed,Remaining}

6b8be43

fixup! BoundedReader: add getParallelism{Consumed,Remaining}

aa6d856

rename parallelism to split points

d7ccae2

bjchambers reviewed May 20, 2016

asfgit closed this in 4755c5a May 20, 2016

dhalperi deleted the limited-parallelism branch May 20, 2016 23:59

dhalperi wants to merge 15 commits into apache:master from dhalperi:limited-parallelism
