Skip to content

Add BoundedReader APIs for expressing remaining and consumed parallelism#353

Closed
dhalperi wants to merge 15 commits intoapache:masterfrom
dhalperi:limited-parallelism
Closed

Add BoundedReader APIs for expressing remaining and consumed parallelism#353
dhalperi wants to merge 15 commits intoapache:masterfrom
dhalperi:limited-parallelism

Conversation

@dhalperi
Copy link
Contributor

These are useful for dynamic work rebalancing and autoscaling.

dhalperi added 7 commits May 18, 2016 19:51
And implement it for common sources
*) Make the start of a block match Avro's definition: the first byte after the previous sync marker.
   This enables detecting the last block in the file.
*) This change enables us to unify currentOffset and currentBlockOffset, as all records are emitted
   at the start of the block that contains them.
*) Simplify block header reading to have fewer object allocations and buffers using a direct
   reader and a (allocated once only) CountingInputStream to measure the size of that header.
*) Add tests for consumed and remaining parallelism
*) Let BlockBasedSource detect the end of the file in remaining parallelism.
*) Verify in more places that the correct number of bytes is read from
   the input Avro file.
*) empty file
*) non-empty compressed file
*) non-empty not-compressed file
*) empty file
*) non-empty file
This is not a very good offset because it is an upper bound, but it is
likely better than not reporting any progress at all.
@dhalperi
Copy link
Contributor Author

R: @bjchambers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is helpful, but some of the parantheticals seem confusing.

"current (previous)" I think should be "current (about to be previous)" or just say "current" since you've already stated that the current block in the precondition is about to be previous.

Similarly, in Postcondition: "current (formerly next)" could be clarified to the "new current (formerly next)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we know we're at a sync marker, it's unclear why we need advancePastNextSyncMarker -- it looks like that attempts to read up to the next sync marker. Further, that seems to be why we need push back. If we now know the next byte is the start of the sync marker, couldn't we just read it and continue? Wouldn't that also mean that we don't need the push back here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, for methods that are "here is the splittable implementation and here is the non-splittable implementation" use an "if/else" block to make that clear.

@bjchambers
Copy link
Contributor

Is there any way to update SourceTestUtils with some tests for this? It seems like correct behavior can be determined for any source by reading, remembering the split points, and asserting that the source is telling you information that is consistent with what you actually observed.

@@ -46,8 +46,8 @@
* <ul>
* <li>Progress estimation ({@link BoundedReader#getFractionConsumed})
* <li>Tracking of parallelism, to determine with the current source can be split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to determine with the current source can be split reads oddly

@bjchambers
Copy link
Contributor

LGTM

@asfgit asfgit closed this in 4755c5a May 20, 2016
@dhalperi dhalperi deleted the limited-parallelism branch May 20, 2016 23:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments