[BEAM-9325] Override proper write method in UnownedOutputStream #11263
I am curious how you found that the lack of this method is the reason for the performance issue you mention, and whether you have seen a performance improvement. (The implementation does not check for boundaries, so that's an issue to fix.)
sdks/java/core/src/test/java/org/apache/beam/sdk/util/UnownedOutputStreamTest.java
CC: @lukecwik you may be interested in taking a quick look since it seems you authored
Thanks for the review! I was looking to improve my 10+TB GBK steps and happened to find this. I just decided to fix it as it should be an effortless fix. I'm not sure what you mean by In case you meant if I ran benchmark, I just ran a short benchmark using
The numbers are all over the place because I ran it on my laptop, but you can roughly see the difference.
What a terrible choice for the FilterOutputStream implementation. Reading the javadoc, they clearly state that everyone who subclasses it needs to provide the optimal I was unaware of the FilterOutputStream problem when writing this.
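The behavior described in the comment above can be observed with a small experiment. This is a minimal sketch, not code from the PR: `CountingSink` and `FilterOutputStreamDemo` are hypothetical names introduced here. It relies only on the documented behavior that `java.io.FilterOutputStream#write(byte[], int, int)` calls the single-byte `write(int)` once per byte unless a subclass overrides it.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class FilterOutputStreamDemo {
  /** Writes {@code size} bytes through a plain FilterOutputStream and returns the sink. */
  static CountingSink writeThroughPlainFilter(int size) {
    CountingSink sink = new CountingSink();
    try (OutputStream out = new FilterOutputStream(sink)) {
      // A single bulk write from the caller's point of view.
      out.write(new byte[size], 0, size);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink;
  }

  public static void main(String[] args) {
    CountingSink sink = writeThroughPlainFilter(1024);
    System.out.println("single-byte calls seen by the sink: " + sink.singleByteCalls);
    System.out.println("bulk calls seen by the sink: " + sink.bulkCalls);
  }
}

/** Hypothetical sink that discards data and only records how it was called. */
class CountingSink extends OutputStream {
  int singleByteCalls = 0;
  int bulkCalls = 0;

  @Override
  public void write(int b) {
    singleByteCalls++;
  }

  @Override
  public void write(byte[] b, int off, int len) {
    bulkCalls++;
  }
}
```

With an unmodified `FilterOutputStream`, the sink should see one `write(int)` call per byte and no bulk calls, which is where the per-byte overhead comes from.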
Please also fix JAXBCoder.java, as it too uses a FilterOutputStream:
expected.write(data1, 0, data1.length);
osActual.write(data1, 0, data1.length);

assertArrayEquals(expected.toByteArray(), actual.toByteArray());
This won't actually test that the bulk write was forwarded as a single call, since if FilterOutputStream wrote one byte at a time you would still get the expected result. You'll need to use a mock and validate that #write(byte[], int, int) was called the correct number of times.
I just added CallCountOutputStream to verify the number of calls.
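The PR's CallCountOutputStream is not shown in this thread, so the following is only a rough sketch of the idea under stated assumptions. `CallCountingStream`, `BulkForwardingStream`, and `CallCountCheck` are hypothetical stand-ins, not the classes from the PR. The check asserts both that the bytes arrive intact and that one bulk write reaches the underlying stream as exactly one call.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class CallCountCheck {
  /** Returns how many write calls reach the sink for one bulk write of {@code size} bytes. */
  static int callsForOneBulkWrite(int size) {
    CallCountingStream sink = new CallCountingStream();
    byte[] data = new byte[size];
    Arrays.fill(data, (byte) 7);
    try (OutputStream out = new BulkForwardingStream(sink)) {
      out.write(data, 0, data.length);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // Content equality alone would also pass for a byte-at-a-time implementation.
    if (!Arrays.equals(data, sink.toByteArray())) {
      throw new AssertionError("content mismatch");
    }
    return sink.callCount;
  }

  public static void main(String[] args) {
    System.out.println("calls reaching the sink: " + callsForOneBulkWrite(4096));
  }
}

/** Hypothetical byte-array sink that counts every write call it receives. */
class CallCountingStream extends ByteArrayOutputStream {
  int callCount = 0;

  @Override
  public synchronized void write(int b) {
    callCount++;
    super.write(b);
  }

  @Override
  public synchronized void write(byte[] b, int off, int len) {
    callCount++;
    super.write(b, off, len);
  }
}

/** Hypothetical wrapper that forwards bulk writes to the delegate in one call. */
class BulkForwardingStream extends FilterOutputStream {
  BulkForwardingStream(OutputStream delegate) {
    super(delegate);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
  }
}
```

If `BulkForwardingStream` did not override the bulk method, the content check would still pass but the call count would equal the number of bytes, which is the distinction the reviewer asked the test to capture.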
sdks/java/core/src/main/java/org/apache/beam/sdk/util/UnownedOutputStream.java
Yes, this was a surprise to me too, kind of unexpected, but I'm glad that @lukemin89 found the issue. Did you find it via some static analysis or just by performance 'luck'?
I wish I had found it in a fancier way, but I just found it by luck.
retest this please
LGTM. Thanks @lukemin89, I really like this type of PR: simple, but it fixes a hidden problem. Great find!
Mmm, it seems tests are not running on this one, weird.
Now they are! Time to wait and then merge.
Curious, this looks like something that can be matched 'easily' by static analyzers.
org.apache.beam.sdk.util.UnownedOutputStream does not override the method
public void write(byte b[], int off, int len) throws IOException
resulting in extremely slow writing speed.
This is because java.io.FilterOutputStream does not provide an efficient implementation of that method: by default it writes the bytes one at a time.
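As an illustration of the kind of override the description calls for, including the bounds check requested in review, here is a minimal sketch. It is not the code merged in the PR: `BoundsCheckedForwardingStream` and `BoundsCheckDemo` are hypothetical names, and the exact validation used in UnownedOutputStream may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class BoundsCheckDemo {
  /** Copies {@code len} bytes of {@code src} starting at {@code off} through the wrapper. */
  static byte[] copyRange(byte[] src, int off, int len) {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    try (OutputStream out = new BoundsCheckedForwardingStream(sink)) {
      out.write(src, off, len);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println("copied bytes: " + copyRange(new byte[] {1, 2, 3, 4}, 1, 2).length);
    try {
      copyRange(new byte[4], 3, 5);
    } catch (IndexOutOfBoundsException e) {
      System.out.println("rejected out-of-range write: " + e.getMessage());
    }
  }
}

/** Hypothetical wrapper: validates the range, then forwards the whole range in one call. */
class BoundsCheckedForwardingStream extends FilterOutputStream {
  BoundsCheckedForwardingStream(OutputStream delegate) {
    super(delegate);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    // Reject negative values and ranges that run past the end of the array.
    if (off < 0 || len < 0 || len > b.length - off) {
      throw new IndexOutOfBoundsException(
          "off=" + off + ", len=" + len + ", array length=" + b.length);
    }
    // One bulk call instead of FilterOutputStream's default per-byte loop.
    out.write(b, off, len);
  }
}
```

The comparison is written as `len > b.length - off` rather than `off + len > b.length` so that very large arguments cannot overflow the integer addition.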