
[BEAM-5650]: Modify BoundedToUnboundedSourceAdapter to configure its reader to read more than 1 bounded source #6952

Closed
jhalaria wants to merge 1 commit into apache:master from jhalaria:UnboundedReadFromBoundedSourceEnhanced

Conversation

jhalaria (Author) commented Nov 5, 2018

@iemejia - Please review. Thank you.

  • Noticed an issue while reading around 1000 files from S3.
  • Started getting connection timeouts as the max number of open connections is set to 50.
  • Inside https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L250 we open all the readers simultaneously. That is the intended behavior for an unbounded source, but when reading a bounded source (e.g. reads from S3), opening all connections at once leads to connections being unavailable whenever the number of files we try to open exceeds the maximum number of HTTP connections, which defaults to 50 (ClientConfiguration.DEFAULT_MAX_CONNECTIONS), and there isn't a way to override it. [We should have the ability to override this anyway; I will create a separate PR for that.]
  • The change gives BoundedToUnboundedSourceAdapter an ArrayDeque so that one reader can read more than one BoundedSource. Each reader finishes reading from a BoundedSource and then moves to the next one.
  • For checkpointing, if a reader has elements that aren't read yet from a BoundedSource, we create a checkpoint containing the remaining elements of that BoundedSource and the sources remaining in the ArrayDeque.
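The queue-draining idea described above can be sketched in plain Java. This is a hypothetical simplification: QueueDrainingReader and its Iterable "sources" stand in for the real BoundedToUnboundedSourceAdapter and BoundedSource readers; the point is only that one reader fully finishes a source (and its connection) before opening the next.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Sketch: one reader consumes a queue of "sources", finishing each one
 *  completely before opening the next, so only one is live at a time. */
public class QueueDrainingReader<T> {
  private final Deque<Iterable<T>> sources;
  private Iterator<T> current;

  public QueueDrainingReader(Deque<Iterable<T>> sources) {
    this.sources = sources;
  }

  /** Returns the next element, advancing to the next queued source when
   *  the current one is exhausted; returns null when everything is drained. */
  public T readNext() {
    while (current == null || !current.hasNext()) {
      Iterable<T> next = sources.poll(); // "open" the next source lazily
      if (next == null) {
        return null; // all sources drained
      }
      current = next.iterator();
    }
    return current.next();
  }

  public static void main(String[] args) {
    Deque<Iterable<Integer>> queue = new ArrayDeque<>();
    queue.add(List.of(1, 2));
    queue.add(List.of(3));
    QueueDrainingReader<Integer> reader = new QueueDrainingReader<>(queue);
    List<Integer> out = new ArrayList<>();
    Integer v;
    while ((v = reader.readNext()) != null) {
      out.add(v);
    }
    System.out.println(out); // elements in source order: [1, 2, 3]
  }
}
```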


private BoundedSource<T> boundedSource;
//TODO: There must be a better way to figure out the maxOpenConnections available. Use that number here
//Mention this as part of PR to get feedback.
private static final int READER_QUEUE_SIZE = 10;
jhalaria (Author):

@iemejia - Whats a good way to set the value here?

mxm (Contributor):

Shouldn't this be part of the splitting behavior of BoundedSource?

jhalaria force-pushed the UnboundedReadFromBoundedSourceEnhanced branch 9 times, most recently from b38b23b to b0fd857 on November 6, 2018 at 22:59
jhalaria force-pushed the UnboundedReadFromBoundedSourceEnhanced branch from b0fd857 to 83e321a on November 6, 2018 at 23:29
jhalaria (Author) commented Nov 8, 2018

@iemejia - Whenever you get a chance, please review. Thanks.

iemejia (Member) commented Nov 8, 2018

Ok some general comments:

  • I updated the JIRA ticket to make it clearer, now that we know the issue is related to how the Flink translation works (by using runners-core UnboundedReadFromBoundedSource).
  • For the comment on tuning ClientConfiguration on S3 you can already do that via S3Options.setClientConfiguration.

In general I think the solution makes sense; however, I have some doubt whether the point where you are fixing it is the correct one. I would like to preserve the current 1-1 relation of unbounded-bounded if possible. (Note that if we can't, we will need to update the current documentation, because with these changes the semantics have clearly changed.)

Thinking about it, this is similar to the classic connection/thread pool problem. I don't know if we should create something similar where the limited 'tasks' are these readers, or if we can just delegate this to native Flink and it will take care of it. Since I am not a Flink expert I asked @mxm to help us take a look; maybe he knows a better way to proceed.
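The connection/thread pool framing above could look like this minimal sketch: a Semaphore caps how many readers may be open at once, regardless of how many sources the runner schedules. ReaderPool and maxOpenReaders are illustrative names, not Beam or Flink API.

```java
import java.util.concurrent.Semaphore;

/** Sketch: limit concurrently open readers with a counting semaphore. */
public class ReaderPool {
  private final Semaphore permits;

  public ReaderPool(int maxOpenReaders) {
    this.permits = new Semaphore(maxOpenReaders);
  }

  /** Runs the given read task while holding a permit, blocking when the
   *  pool is exhausted, so at most maxOpenReaders tasks run concurrently. */
  public void read(Runnable openAndDrainSource) throws InterruptedException {
    permits.acquire();
    try {
      openAndDrainSource.run();
    } finally {
      permits.release(); // always return the permit, even on failure
    }
  }

  public static void main(String[] args) throws InterruptedException {
    ReaderPool pool = new ReaderPool(2);
    int[] drained = {0};
    // Sequential demo: each "source" acquires and releases a permit.
    for (int i = 0; i < 3; i++) {
      pool.read(() -> drained[0]++);
    }
    System.out.println("sources drained: " + drained[0]);
  }
}
```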

iemejia requested a review from mxm on November 8, 2018 at 14:55
mxm (Contributor) left a comment:

Thank you for working on this problem @jhalaria. We definitely need to address this issue. Some observations from reading the code:

  • After the BoundedSource is split, it will be processed in bundles of READER_QUEUE_SIZE. This breaks the current guarantee that all readers are processed at the same time; some applications might actually rely on this behavior.
  • There is an unnecessary change to how the class is instantiated, i.e. with an ArrayDeque.
  • The constructor assumes an implicit size of 1; otherwise the splitting / size estimation won't work correctly.

Rather than encapsulating this throttling behavior in UnboundedReadFromBoundedSource, wouldn't it make sense to delegate this to a BoundedSource? We could create a ThrottledBoundedSource wrapper which would handle the throttling you implemented. This would be completely transparent to UnboundedReadFromBoundedSource.

The question remains how we decide during translation when to implement the throttling. I'd say, this could be a PipelineOption which you can set, as we have with HadoopFileSystemOptions.
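A minimal sketch of the grouping such a wrapper might do, under stated assumptions: SplitThrottler is a hypothetical name (the real ThrottledBoundedSource would wrap BoundedSource#split), and maxReaders would come from a PipelineOption as suggested above. The idea is to hand the runner at most maxReaders groups, each read sequentially by one reader.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: cap parallelism by partitioning splits into reader groups. */
public class SplitThrottler {
  /** Distributes splits round-robin into at most maxReaders groups, so each
   *  group is drained sequentially by a single reader. */
  public static <S> List<List<S>> throttle(List<S> splits, int maxReaders) {
    int groups = Math.min(maxReaders, Math.max(1, splits.size()));
    List<List<S>> out = new ArrayList<>();
    for (int i = 0; i < groups; i++) {
      out.add(new ArrayList<>());
    }
    for (int i = 0; i < splits.size(); i++) {
      out.get(i % groups).add(splits.get(i));
    }
    return out;
  }

  public static void main(String[] args) {
    // Five splits, at most two concurrent readers.
    System.out.println(throttle(List.of("a", "b", "c", "d", "e"), 2));
    // -> [[a, c, e], [b, d]]
  }
}
```

Round-robin (rather than contiguous chunks) keeps the groups close to the same size, so readers finish at roughly the same time.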

ArrayDeque<BoundedSource<T>> unboundedSources = new ArrayDeque<>();
unboundedSources.add(transform.getSource());
UnboundedReadFromBoundedSource.BoundedToUnboundedSourceAdapter unboundedSource =
new UnboundedReadFromBoundedSource.BoundedToUnboundedSourceAdapter(unboundedSources);
mxm (Contributor):

I would prefer if there was no change here. It seems arbitrary to add a one-item ArrayDeque here.

  public PCollection<T> expand(PBegin input) {
-   return input.getPipeline().apply(Read.from(new BoundedToUnboundedSourceAdapter<>(source)));
+   final ArrayDeque<BoundedSource<T>> dequeue = new ArrayDeque<>(Arrays.asList(source));
+   return input.getPipeline().apply(Read.from(new BoundedToUnboundedSourceAdapter<>(dequeue)));
mxm (Contributor):

Please revert; I wouldn't make any change here. The normal use case of wrapping a single source should still be supported.


public BoundedToUnboundedSourceAdapter(BoundedSource<T> boundedSource) {
this.boundedSource = boundedSource;
}
mxm (Contributor):

I think we should keep this constructor. See above.


  List<? extends BoundedSource<T>> splits = boundedSource.split(desiredBundleSize, options);
- return splits
+ final List<? extends List<? extends BoundedSource<T>>> partition =
+     Lists.partition(splits, READER_QUEUE_SIZE);
mxm (Contributor):

This re-partitioning seems arbitrary / optimized for your case only.

*/
final BoundedSource<T> boundedSource = boundedSources.peek();
long estimatedSize = boundedSource.getEstimatedSizeBytes(options);
long desiredBundleSize = estimatedSize / desiredNumSplits;
mxm (Contributor):

This will only estimate the bundle size based on the first BoundedSource, which doesn't seem right.
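A sketch of the fix implied by the comment above: sum the estimates over all queued sources rather than using only the first. EstimableSource is a stand-in for BoundedSource and estimatedSizeBytes for getEstimatedSizeBytes(options); the names are illustrative, not Beam API.

```java
import java.util.List;

/** Sketch: derive the desired bundle size from the total estimated size
 *  of all queued sources, not just the head of the queue. */
public class SizeEstimation {
  public interface EstimableSource { // stand-in for BoundedSource#getEstimatedSizeBytes
    long estimatedSizeBytes();
  }

  public static long desiredBundleSize(List<EstimableSource> sources, int desiredNumSplits) {
    long total = 0;
    for (EstimableSource s : sources) {
      total += s.estimatedSizeBytes();
    }
    // Guard against a zero bundle size when the total estimate is tiny.
    return Math.max(1, total / desiredNumSplits);
  }

  public static void main(String[] args) {
    // Two queued sources of 100 and 300 bytes, split four ways: (100+300)/4.
    long size = desiredBundleSize(
        List.<EstimableSource>of(() -> 100L, () -> 300L), 4);
    System.out.println(size); // 100
  }
}
```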

public BoundedToUnboundedSourceAdapter(ArrayDeque<BoundedSource<T>> boundedSources) {
this.boundedSources = boundedSources;
final BoundedSource<T> source = boundedSources.peek();
this.checkpointCoder = new CheckpointCoder<>(source.getDefaultOutputCoder());
mxm (Contributor):

This will only work if the coders are the same for all sources.
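One way to guard the constraint noted above, sketched generically: fail fast unless every queued source reports the same coder. CoderCheck and requireSameCoder are hypothetical names, and coderOf stands in for BoundedSource#getDefaultOutputCoder.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Sketch: the adapter's single CheckpointCoder is only valid when all
 *  queued sources share one output coder; verify that up front. */
public class CoderCheck {
  public static <S, C> C requireSameCoder(Collection<S> sources, Function<S, C> coderOf) {
    Set<C> coders = new HashSet<>();
    for (S s : sources) {
      coders.add(coderOf.apply(s));
    }
    if (coders.size() != 1) {
      throw new IllegalArgumentException(
          "All queued sources must share one coder, found: " + coders);
    }
    return coders.iterator().next();
  }

  public static void main(String[] args) {
    // Both "sources" report the same coder name, so the check passes.
    String coder = requireSameCoder(
        java.util.List.of("sourceA", "sourceB"), s -> "VarIntCoder");
    System.out.println(coder); // VarIntCoder
  }
}
```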

return helper(onlyFinishReadingCurrentSource);
}

private boolean helper(boolean onlyFinishReadingCurrentSource) throws IOException {
mxm (Contributor):

Change this to a more descriptive name?

jhalaria (Author) commented Nov 9, 2018

Thank you @mxm and @iemejia for looking into this. I will incorporate your feedback.

robertwb (Contributor) commented:

Any updates on this PR?

iemejia (Member) commented Jan 8, 2019

Oops, seems this needs a rebase too now @jhalaria

jhalaria (Author) commented:
@robertwb - I was out on vacation. Haven't had a chance to resume work on it yet. Will let you know once I have something.

stale bot commented Mar 12, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Mar 12, 2019
stale bot commented Mar 19, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Mar 19, 2019