Skip to content

Conversation

@kyle-winkelman
Copy link
Contributor

@kyle-winkelman kyle-winkelman commented Feb 22, 2019

Construct a simplified pipeline in the event that the number of writers isn't prohibitive. User must opt into this by using FileIO, TextIO, or WriteFiles withNoSpilling.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Choose reviewer(s) and mention them in a comment (R: @username).
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@kyle-winkelman
Copy link
Contributor Author

R: @lukecwik

@kyle-winkelman
Copy link
Contributor Author

Build is failing due to FileIO.Write#defaultNaming not having a javadoc. I did not make a change to this method and I am unsure of what to write for its javadoc. Also there is another overloaded defaultNaming method and a relativeFileNaming method that may also require javadocs.

@pabloem
Copy link
Member

pabloem commented Mar 11, 2019

Kyle could you rebase this? And would you mind adding the Javadoc? <3 Luke is out on leave, but he'll be back soon and he can review...

@pabloem pabloem requested a review from iemejia March 11, 2019 18:36
@pabloem
Copy link
Member

pabloem commented Mar 11, 2019

Ismael may also be able to review if Luke can't

@kyle-winkelman
Copy link
Contributor Author

Done!

@lukecwik
Copy link
Member

For continuity it makes sense for @chamikaramj to take a look at this.

R: @chamikaramj

@kyle-winkelman kyle-winkelman force-pushed the writefiles branch 2 times, most recently from 7735c4d to 8c00020 Compare March 13, 2019 23:22
@kyle-winkelman
Copy link
Contributor Author

kyle-winkelman commented Mar 13, 2019

I added a NeedsRunner test testWriteNoSpilling.
I also cleaned up testWriteSpilling which contrary to its naming didn't actually test spilling because spilling only occurs withRunnerDeterminedSharding.

@kyle-winkelman
Copy link
Contributor Author

Sorry to be impatient but can I get a review @lukecwik or @chamikaramj?

@pabloem
Copy link
Member

pabloem commented Mar 16, 2019

I took a look. I worked on similar code recently (for py BQ). I understand the change; and it seems to do what it intends. Would you consider passing an argument that explicitly tells WriteFiles to have unlimited files per bundle? e.g. if (maxFilesPerBundle < 0 || writers.size() <= getMaxNumWritersPerBundle()) , instead of making the spilled records tag null. That'd be clearer.

Just so I can understand where you're coming from to propose this change, what use case does this cover @kyle-winkelman ?

Thanks.

@pabloem pabloem self-requested a review March 16, 2019 00:15
Copy link
Member

@pabloem pabloem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comment : )

@kyle-winkelman
Copy link
Contributor Author

I mainly use the spark-runner in batch mode. The pipeline, as it is now, has WriteUnshardedTempFilesFn be a ParDo.MultiOutput. The way ParDo.MultiOutputs are implemented in the spark-runner are to immediately cache the output and use it twice. Currently this is the only location in my pipeline with caching.

I use spark.dynamicAllocation so that I can release idle executors, but because of this caching those executors are not eligible to be released. This means at the end of my job I am holding onto hundreds of executors while only one of them does the copy from temp location to final location.

@kyle-winkelman
Copy link
Contributor Author

I added a commit addressing your comment. I kept it separate so it was easy to review but let me know if I should squash the two commits into one before this gets merged.

@pabloem
Copy link
Member

pabloem commented Mar 16, 2019

Thank you Kyle. That makes sense to me. Thanks.

@pabloem
Copy link
Member

pabloem commented Mar 16, 2019

And yes, could you squash the changes?
Thanks

@kyle-winkelman
Copy link
Contributor Author

Run Java PreCommit

1 similar comment
@kyle-winkelman
Copy link
Contributor Author

Run Java PreCommit

@kyle-winkelman
Copy link
Contributor Author

Squashed. Thanks for the review @pabloem.

@pabloem pabloem merged commit 66040ee into apache:master Mar 16, 2019
@kyle-winkelman kyle-winkelman deleted the writefiles branch March 16, 2019 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants