-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace String.format with faster Long.toHexString #37
Conversation
This looks good in principle, but I've learned that it's important to not rely too much on single benchmark runs, they can be affected by lots of different environmental factors. Also, it seems to me that this should really only affect the NUMERIC output from the benchmark. (Other people… is that correct?). Looks like the NUMERIC value went from 3242.6 to 3231.7, actually slower by a small amount? |
Let me try to run the benchmarks for this commit and share the results here. |
I've tried on multiple different hardware and found performance to be hit or miss. It's kinda surprising but this was the only run that showed better results for NUMERIC (but other matchers performed worse). Before
After the change
|
I don't think we have enough evidence to convince us that this particular change will meaningfully improve things. But @rudygt, you're looking in exactly the right area. The round-tripping between JSON numbers and |
I'd agree. This is worth digging more into. I ran a scrappy benchmark outside of the current test setup and found the hex string change proposed here to perform better. Might be helpful to see if Quamina does better or worse. It might show if there's an issue within ruler's benchmarks, or if there's an edge case that show up really well with citylots dataset. Test @Test
public void benchmarkStrFormatVsHex() {
Random rand = new Random();
final double[] seeds = rand
.doubles(10_000_000, -Constants.FIVE_BILLION, Constants.FIVE_BILLION)
.toArray();
// check that there's no bug when converting (also maybe triggers any memory optimizations)
for(Double seed : seeds) {
final String left = usesHexString(seed);
final String right = usesStrFormat(seed);
assertEquals(left, right);
}
// check for time in millis when converting all seed numbers to hex via proposed change
final Instant instant1 = Instant.now();
for(Double seed : seeds) {
usesHexString(seed);
}
final Instant instant2 = Instant.now();
System.out.println("usesHexString exec time (ms) " + Duration.between(instant1, instant2).toMillis());
// check for time in millis when converting all seed numbers based on String.format
final Instant instant3 = Instant.now();
for(Double seed : seeds) {
usesStrFormat(seed);
}
final Instant instant4 = Instant.now();
System.out.println("usesStrFormat exec time (ms) " + Duration.between(instant3, instant4).toMillis());
}
private static final double TEN_E_SIX = 1E6;
public static String usesStrFormat(double f) {
return String.format("%014x", (long) (TEN_E_SIX * (Constants.FIVE_BILLION + f))).toUpperCase(Locale.ROOT);
}
public static String usesHexString(double f) {
return leftPad(Long.toHexString((long) (TEN_E_SIX * (Constants.FIVE_BILLION + f))).toUpperCase(Locale.ROOT));
}
private static String leftPad(String input){
return input.length() < 14 ? "00000000000000".substring(input.length()) + input : input;
} Output
|
handy test you got there @baldawar I checked again today and found that the upperCase was also burning some cicles, and we know we are uppercasing hex digits so only "a-f", I found that if we go directly to hex with upper case chars we can get a clearer perf improvement. as we also know we want 14 chars we can just skip the first byte and get the fixed size string avoiding the left pad thing altogheter. I ran this test multiple times and the improvement factor range from 19-27 so at least it is consistently above pd: thanks to you both for taking the time to look into this Test
Output
|
df5d3a3
to
da0e92c
Compare
Um, are we sure that the lexical ordering of I don't think there's any reason we need the hex to be upper case, a-f is fine, no? |
@timbray Long.toHexString() is gone in the current version of the pull about the ordering as far as I remember java is big endian and this is platform independent so we should be good (also in the test we are bouncing around we are comparing values just to make sure the candidate is producing exactly the same ouput as the original version). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This basically LGTM. I'm assuming all the unit tests pass, which is kind of amazing after having switched from decimal to hex digits, especially considering things like the logic around https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/ByteMachine.java#L1021
I'm going to leave approval to Rishi or Shawn, because I don't have time to validate all the tests and so on right now.
Oh, and I didn't say, thanks so much @rudygt! |
I think this explains why there is uppercase in the method we are improving https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Constants.java#L40 so looks like the comment dont reflect the fact that it works with hex digits now |
LGTM as well. @rudygt can you share the benchmark results (before and after) after the new change? Just want to make sure we there's nothing odd when running against the test dataset. |
int pos = (i - 1) * 2; | ||
chars[pos] = HEXES.charAt((b & 0xF0) >> 4); | ||
chars[pos + 1] = HEXES.charAt((b & 0x0F)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a blocker but would prefer to have documentation around these two lines. Not all engineers get what's happening.
return toHexStringSkippingFirstByte((long) (TEN_E_SIX * (Constants.FIVE_BILLION + f))); | ||
} | ||
|
||
private static final String HEXES = "0123456789ABCDEF"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a big deal but would love to have this tied to Constants.HEX_DIGITS to avoid one going out to sync with the other (in case we do some refactoring in future)
really neat trick, btw! looking forward to merging it soon |
Thanks Rishi, I have incorporated your feedback to make it easier to read and tie the hex chars to constants. Before
After
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you making this better. ❤️
Description of changes:
I was running the benchmarks with a profiler attached and noticed that String.format was showing up, traced it to a simple decimal to hexadecimal conversion. I replaced that with the faster alternative Long.toHexString and a hand crafted left pad with fixed size as we know the desired lenght at 14.
Benchmark / Performance (for source code changes):
full output from benchmark test
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.