[BEAM-2052] Allow dynamic sharding in windowed file sinks#3023

Closed
jkff wants to merge 4 commits into apache:master from jkff:finish-pr-2647-2

@jkff jkff commented May 9, 2017

This is a slightly modified and rearranged version of @reuvenlax's #2647.

My concerns about it are:

  1. In the direct runner, the integration tests of dynamic sharding are vacuous, because the direct runner replaces unspecified sharding with fixed sharding in https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/WriteWithShardingFactory.java (applied at https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java#L217). This is a testing-only concern: other runners don't have this override, so overall the testing is non-vacuous. It is just hard to test against the direct runner, and I suspect we want these tests to be non-vacuous there too.

  2. When I removed that override for testing purposes, I noticed that a very large number of files gets written, primarily, I guess, because the bundles are very small. The number is so large that the test time for batch with dynamic sharding grows from 21 seconds to 5 minutes. In particular, we write many files for each window/pane, presumably because in streaming runners and in the direct runner there is at least 1 bundle per key, and we create at least 1 file per bundle in WriteFiles.Write(Windowed,Unwindowed)Bundles.
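To illustrate the file-count concern in item 2, here is a rough sketch, not Beam code: `FileCountSketch` and `countFiles` are hypothetical names, and the model simply assumes each bundle opens one file per window it touches. Under that assumption, shrinking the bundle size multiplies the number of output files.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of "at least one file per bundle"; not part of Beam.
final class FileCountSketch {
    // windowOf[i] is the (made-up) window index of element i. Elements are split
    // into consecutive bundles of bundleSize, and each bundle is assumed to open
    // one file for every distinct window it sees.
    static int countFiles(int[] windowOf, int bundleSize) {
        int files = 0;
        for (int start = 0; start < windowOf.length; start += bundleSize) {
            int end = Math.min(start + bundleSize, windowOf.length);
            Set<Integer> windowsInBundle = new HashSet<>();
            for (int i = start; i < end; i++) {
                windowsInBundle.add(windowOf[i]);
            }
            files += windowsInBundle.size();
        }
        return files;
    }
}
```

With eight elements spread over two windows, a single large bundle yields two files in this model, while one-element bundles yield eight.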

Reuven, can you please comment on whether this "at least 1 file per key" is expected behavior in a streaming runner? I suspect that it's not, but then I'm not sure how to fix the PR semantically.

CC: @reuvenlax @davorbonaci @dhalperi

FileBasedWriteOperation -> WriteOperation, FileBasedWriter -> Writer

jkff commented May 10, 2017

retest this please


@FinishBundle
public void finishBundle(FinishBundleContext c) throws Exception {
FileResult result = writer.close();

if (writer != null) {
}

(empty bundles are legal I believe)
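A minimal sketch of the suggested guard, using hypothetical stand-in classes (`SketchWriter`, `SketchBundleWriter`) rather than the PR's actual `Writer` and DoFn. It assumes the writer is opened lazily on the first element, so a bundle that saw no elements has nothing to close.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the PR's Writer<T>; not the Beam class.
final class SketchWriter {
    private final List<String> records = new ArrayList<>();

    void write(String record) {
        records.add(record);
    }

    // Returns a description of the "file" that was produced.
    String close() {
        return "file with " + records.size() + " records";
    }
}

// Hypothetical stand-in for the bundle-writing DoFn.
final class SketchBundleWriter {
    private SketchWriter writer = null; // opened lazily on the first element
    final List<String> results = new ArrayList<>();

    void processElement(String element) {
        if (writer == null) {
            writer = new SketchWriter();
        }
        writer.write(element);
    }

    void finishBundle() {
        // An empty bundle never opened a writer, so there is nothing to close.
        if (writer != null) {
            results.add(writer.close());
            writer = null; // reset so the next bundle starts clean
        }
    }
}
```

Without the null check, calling `finishBundle()` on an empty bundle would throw a `NullPointerException` in this sketch.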

private FileBasedWriter<T> writer = null;
private BoundedWindow window = null;
private class WriteWindowedBundles extends DoFn<T, FileResult> {
private Map<KV<BoundedWindow, PaneInfo>, Writer<T>> windowedWriters;
@reuvenlax reuvenlax May 10, 2017

Incorrect, I think: BoundedWindow isn't guaranteed to implement a proper hashCode. That's why I used Coder.structuralValue in the original PR.

In particular, while this will work for our built-in windows, users can write their own window functions, and we do not insist that they provide hashCode; all we insist on is that they provide a Coder. Hence I used Coder.structuralValue to encode the windows.

BoundedWindow documents that it must implement equals and hashCode (https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/BoundedWindow.java#L30), and there's a lot of HashMaps keyed with BoundedWindow in the repo.
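To make the two positions in this thread concrete, here is a small self-contained sketch. Everything in it is hypothetical: `SketchWindow` plays the role of a user-defined window that does not override `equals`/`hashCode`, and `structuralKey` stands in for keying on an encoded form (the role `Coder.structuralValue` plays in the original PR); none of it is Beam code.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical user-defined window that does NOT override equals/hashCode.
final class SketchWindow {
    final long startMillis;
    final long endMillis;

    SketchWindow(long startMillis, long endMillis) {
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }
}

final class WindowKeySketch {
    // Stand-in for a structural value: a deterministic byte encoding of the
    // window. ByteBuffer compares and hashes by content.
    static ByteBuffer structuralKey(SketchWindow w) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(w.startMillis);
        buf.putLong(w.endMillis);
        buf.flip();
        return buf;
    }

    // Keys the map by the window object itself (identity semantics here).
    static int distinctKeysByObject(List<SketchWindow> windows) {
        Map<SketchWindow, Integer> counts = new HashMap<>();
        for (SketchWindow w : windows) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts.size();
    }

    // Keys the map by the encoded form of the window.
    static int distinctKeysByEncoding(List<SketchWindow> windows) {
        Map<ByteBuffer, Integer> counts = new HashMap<>();
        for (SketchWindow w : windows) {
            counts.merge(structuralKey(w), 1, Integer::sum);
        }
        return counts.size();
    }
}
```

Two separately constructed windows with the same bounds end up under two keys when the map is keyed by the object, and under one key when it is keyed by the encoding. If the window class implements `equals` and `hashCode` as BoundedWindow documents, keying by the object gives the same result as keying by the encoding.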

reuvenlax commented May 10, 2017 via email

testWindowedWordCountPipeline(options);
}

@Test

I think we should remove this test until we fix the streaming runner to not generate a file per bundle. Until then this is not a suggested use case for streaming, so I think it's ok to remove the test.

@jkff jkff force-pushed the finish-pr-2647-2 branch from b775df1 to b91e36b Compare May 10, 2017 17:48

jkff commented May 10, 2017

Run Spark ValidatesRunner

jkff commented May 10, 2017

Run Flink ValidatesRunner

jkff commented May 10, 2017

Run Dataflow ValidatesRunner

@asfgit asfgit closed this in d0085e6 May 10, 2017

jkff commented May 10, 2017

I merged this after manually running WindowedWordCountIT with Flink runner.

Looking at the dependency error - it looks pretty clearly unrelated to my PR though, the diff doesn't even mention "findbugs"...

jkff commented May 10, 2017

Sorry, I'm an idiot, I actually introduced the error. Will send a fix PR right away.

@jkff jkff deleted the finish-pr-2647-2 branch May 10, 2017 19:27

jkff commented May 10, 2017

DO NOT cherrypick without #3059

jkff added a commit to jkff/incubator-beam that referenced this pull request May 10, 2017
asfgit pushed a commit that referenced this pull request May 10, 2017