[BEAM-6735] Add noSpilling option to WriteFiles. #7929
Conversation
R: @lukecwik
Force-pushed 6cd50c1 to 954af30
The build is failing because FileIO.Write#defaultNaming has no javadoc. I did not change that method and am unsure what its javadoc should say. There is also another overloaded defaultNaming method and a relativeFileNaming method that may need javadocs as well.
Force-pushed 954af30 to d0f6db0
Kyle, could you rebase this? And would you mind adding the Javadoc? <3 Luke is out on leave, but he'll be back soon and can review...
Ismael may also be able to review if Luke can't.
Force-pushed d0f6db0 to e2b899a
Done!
Force-pushed e2b899a to edb478d
For continuity it makes sense for @chamikaramj to take a look at this. R: @chamikaramj
Force-pushed 7735c4d to 8c00020
I added a NeedsRunner test, testWriteNoSpilling.
Sorry to be impatient, but can I get a review @lukecwik or @chamikaramj?
I took a look. I worked on similar code recently (for py BQ), so I understand the change, and it seems to do what it intends. Would you consider passing an argument that explicitly tells WriteFiles to allow unlimited files per bundle? Also, just so I can understand where you're coming from in proposing this change, what use case does it cover, @kyle-winkelman? Thanks.
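For context on the spilling under discussion: WriteFiles caps how many destination writers a single bundle may hold open, and records that arrive after the cap is reached are spilled to a second output, which is later grouped by shard and written separately. The following is a toy model of that cap-and-spill decision, with a noSpilling escape hatch; the names and the cap value are illustrative, not Beam's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Beam code) of a per-bundle writer cap with a spill path. */
public class SpillModel {
  private final int maxWritersPerBundle;
  private final boolean noSpilling;
  private final Map<String, List<String>> openWriters = new HashMap<>();
  private final List<String> spilled = new ArrayList<>();

  public SpillModel(int maxWritersPerBundle, boolean noSpilling) {
    this.maxWritersPerBundle = maxWritersPerBundle;
    this.noSpilling = noSpilling;
  }

  /** Route a record to an in-bundle writer, or spill it once the cap is hit. */
  public void process(String destination, String record) {
    List<String> writer = openWriters.get(destination);
    if (writer == null) {
      if (!noSpilling && openWriters.size() >= maxWritersPerBundle) {
        // In real WriteFiles this record would go to a second output,
        // then be grouped by shard and written in a later stage.
        spilled.add(record);
        return;
      }
      writer = new ArrayList<>();
      openWriters.put(destination, writer);
    }
    writer.add(record);
  }

  public int spilledCount() { return spilled.size(); }
  public int writerCount() { return openWriters.size(); }
}
```

With a cap of 2 and three destinations, the third destination's records spill unless noSpilling is set, in which case a third writer is simply opened. Turning spilling off is what lets the write stay a single-output ParDo.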
pabloem left a comment:
Left comment : )
I mainly use the spark-runner in batch mode. The pipeline, as it is now, has WriteUnshardedTempFilesFn as a ParDo.MultiOutput. The way ParDo.MultiOutputs are implemented in the spark-runner is to immediately cache the output and use it twice. Currently this is the only place in my pipeline with caching. I use spark.dynamicAllocation so that I can release idle executors, but because of this caching those executors are not eligible to be released. This means that at the end of my job I am holding onto hundreds of executors while only one of them does the copy from the temp location to the final location.
I added a commit addressing your comment. I kept it separate so it was easy to review, but let me know if I should squash the two commits into one before this gets merged.
Thank you Kyle. That makes sense to me.
And yes, could you squash the changes?
Force-pushed ef2cc0f to 41c9110
Run Java PreCommit
Squashed. Thanks for the review @pabloem.
Construct a simplified pipeline in the event that the number of writers isn't prohibitive. The user must opt into this by using FileIO, TextIO, or WriteFiles withNoSpilling.
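A minimal usage sketch of the opt-in described above. This is illustrative, not from the PR itself: the bucket paths are made up, and it assumes a Beam `Pipeline p` has already been constructed with the usual TextIO imports on the classpath.

```java
// Illustrative sketch: opting out of spilling so the unsharded write stays a
// single-output ParDo (avoiding the spark-runner caching described above).
p.apply(TextIO.read().from("gs://my-bucket/input/*"))
 .apply(TextIO.write()
     .to("gs://my-bucket/output/part")
     .withNoSpilling());
```

Without the withNoSpilling() call, WriteFiles keeps its default cap-and-spill behavior.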