some additional performance improvements #39

Merged
merged 7 commits into from
Sep 13, 2022

Conversation

rudygt
Contributor

@rudygt rudygt commented Sep 10, 2022

Issue #25 :

Description of changes:

Following up on the clues from the slow numeric benchmark, I tested Tim's comment about using recursion in ACTask instead of the stepQueue approach (which is a big issue when we have millions of steps and the ArrayDeque needs to grow). Indeed, we got a nice speed bump by getting rid of the queue and descending recursively instead.

There are 3 individual items addressed here:

  • Remove stepQueue from ACTask and make it recursive instead.
  • Replace Stream.of().collect(Collectors.toSet()) with Collections.singleton() when creating one-element sets; the stream version is very slow by comparison (Collections.singleton() is around 6 times faster).
  • Another small optimization to how slices of hex digits are obtained in the Range class (around 4 times faster).

Together these changes made the numeric benchmark performance go up by around 60%.
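
To make the one-element-set item concrete, here is a minimal sketch of the two ways of building such a set. The class and method names are made up for illustration and are not from the codebase, and the speed ratio quoted above is the author's own measurement, which this sketch does not try to reproduce.

```java
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration; not code from this PR.
final class SingletonSetSketch {

    // Stream-based version: sets up a stream pipeline and a general-purpose
    // set just to hold one element.
    static <T> Set<T> viaStream(T element) {
        return Stream.of(element).collect(Collectors.toSet());
    }

    // Collections.singleton returns a small immutable set wrapping the element.
    static <T> Set<T> viaSingleton(T element) {
        return Collections.singleton(element);
    }

    public static void main(String[] args) {
        Set<String> a = viaStream("rule-1");
        Set<String> b = viaSingleton("rule-1");
        // Set equality only compares contents, so the two are interchangeable for readers.
        System.out.println(a.equals(b));
    }
}
```

One thing to keep in mind: the set returned by Collections.singleton() is immutable, so the swap is only safe where callers never add to or remove from the result.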

Benchmark / Performance:

BEFORE:

Reading citylots2
Read 213068 events
EXACT events/sec: 256708.4
WILDCARD events/sec: 200063.8
PREFIX events/sec: 297581.0
SUFFIX events/sec: 296339.4
EQUALS_IGNORE_CASE events/sec: 256399.5
NUMERIC events/sec: 3476.8
ANYTHING-BUT events/sec: 160806.0
COMBO events/sec: 3466.1
Reading citylots2
Read 213068 events
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 10579
Events/sec: 20140.7
 Rules/sec: 140984.6
Before: 212.3
After: 3029.5
Per rule: -7042
Turning JSON into field-lists...
Finding Rules...
Lines: 213068, Msec: 2055
Events/sec: 103682.7
Before: 216.3
After: 1749.4
Per rule: -3832
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 1064
Events/sec: 200251.9
 Rules/sec: 729317345.9
Reading citylots2
Read 213068 events
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Matched: 52527
Lines: 213068, Msec: 14695
Events/sec: 14499.4
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 9711
Events/sec: 21940.9
 Rules/sec: 153586.2


AFTER:

Reading citylots2
Read 213068 events
EXACT events/sec: 273164.1
WILDCARD events/sec: 201007.5
PREFIX events/sec: 307901.7
SUFFIX events/sec: 307901.7
EQUALS_IGNORE_CASE events/sec: 271078.9
NUMERIC events/sec: 5424.3
ANYTHING-BUT events/sec: 172944.8
COMBO events/sec: 5280.0
Reading citylots2
Read 213068 events
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 9846
Events/sec: 21640.1
 Rules/sec: 151480.4
Before: 182.9
After: 3029.5
Per rule: -7116
Turning JSON into field-lists...
Finding Rules...
Lines: 213068, Msec: 2211
Events/sec: 96367.3
Before: 216.3
After: 1749.4
Per rule: -3832
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 1066
Events/sec: 199876.2
 Rules/sec: 727949020.6
Reading citylots2
Read 213068 events
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Matched: 52527
Lines: 213068, Msec: 10880
Events/sec: 19583.5
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 9087
Events/sec: 23447.6
 Rules/sec: 164132.9



By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@baldawar
Collaborator

Will look at this on Monday Pacific time. Just dropped in to make some minor callouts.

  1. You don't need to use git force push unless you like using it. When merging a pull-request, we squash everything into a single commit.
  2. We should start updating the README for performance gains and do a minor version bump in pom.xml
  3. This made my weekend even better. Thank you. ❤️

@timbray
Collaborator

timbray commented Sep 11, 2022

Before looking at the code:

  1. When I originally built the StepQueue implementation, my idea was that this was leaving the door open so that the step/field matching tasks could be dealt out to multiple threads. Which isn't going to be possible with the recursive approach. But then I decided that going concurrent probably wouldn't be a win because each field/state match is a tiny amount of work, the inevitable cost of synchronization would probably kill the benefit. But, I never tried it.
  2. Could you do a quickie measure of the recursion depth? The JVM stack size is typically about 1M/thread. There's a reason that recursive code is strongly disapproved of in kernel code; I'm wondering if something like CityLots, only more so, could blow up the recursive approach.
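
A minimal sketch of the trade-off described in the two points above, assuming a toy tree of nodes in place of the real ACTask/ACStep classes (every name here is hypothetical). The queue-driven walk keeps pending work on the heap, while the recursive walk spends one stack frame per level, and the small depth counter is one way to take the requested measurement.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the matching steps; not the real ACTask/ACStep classes.
final class TraversalSketch {

    static final class Node {
        final List<Node> children = new ArrayList<>();
    }

    // Queue-driven traversal: pending work lives on the heap, so nesting depth
    // is bounded by memory rather than by the thread's stack.
    static int countWithQueue(Node root) {
        Deque<Node> pending = new ArrayDeque<>();
        pending.add(root);
        int visited = 0;
        while (!pending.isEmpty()) {
            Node node = pending.remove();
            visited++;
            pending.addAll(node.children);
        }
        return visited;
    }

    // Recursive traversal: no queue to grow, but every level costs a stack frame.
    // maxDepth[0] records the deepest level reached (the root counts as depth 1).
    static int countRecursively(Node node, int depth, int[] maxDepth) {
        maxDepth[0] = Math.max(maxDepth[0], depth);
        int visited = 1;
        for (Node child : node.children) {
            visited += countRecursively(child, depth + 1, maxDepth);
        }
        return visited;
    }

    // Builds a chain `levels` deep, with one extra leaf hanging off each inner level.
    static Node chain(int levels) {
        Node root = new Node();
        Node current = root;
        for (int i = 1; i < levels; i++) {
            Node next = new Node();
            current.children.add(next);
            current.children.add(new Node()); // sibling leaf
            current = next;
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = chain(5);
        int[] maxDepth = new int[1];
        System.out.println("queue visited: " + countWithQueue(root));
        System.out.println("recursive visited: " + countRecursively(root, 1, maxDepth));
        System.out.println("max recursion depth: " + maxDepth[0]);
    }
}
```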

Collaborator

@timbray timbray left a comment

will need to check out your branch to really understand the new ACFinder implementation. In the back of my mind I'm wondering if, after removing the stepQueue, we still need the ACTask and ACStep classes. Will pull branch and have a look.

@@ -193,7 +191,9 @@ ByteTransition getTransitionForAllBytes() {
      * @return All transitions contained in this map.
      */
     Set<ByteTransition> getTransitions() {
-        return map.values().stream().filter(Objects::nonNull).collect(Collectors.toSet());
+        Set<ByteTransition> result = new HashSet<>(map.values());
+        result.remove(null);
Collaborator

Huh? Don't know this idiom, why remove(null)?

Contributor Author

Basically, stream().filter().collect() creates a set from the values, removing all nulls in the process. But there is some overhead going that route (streams are slow in hot paths), so I tried to reproduce the same behavior without them. That is why I remove null after creating the set: as far as I remember, the set can hold one copy of null. I will double-check to make sure this is needed.
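
A minimal sketch of why remove(null) reproduces the stream filter, assuming a plain Map with possibly-null values in place of the real ByteMap internals (the class and method names are hypothetical):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration; the real ByteMap field types may differ.
final class NonNullValuesSketch {

    // Original style: filter nulls out while streaming the values.
    static <K, V> Set<V> viaStream(Map<K, V> map) {
        return map.values().stream().filter(Objects::nonNull).collect(Collectors.toSet());
    }

    // Stream-free style: HashSet permits a single null element, so every null
    // value in the map collapses into that one entry, which remove(null) drops.
    static <K, V> Set<V> viaCopyAndRemove(Map<K, V> map) {
        Set<V> result = new HashSet<>(map.values());
        result.remove(null);
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> map = new HashMap<>();
        map.put(1, "a");
        map.put(2, null);
        map.put(3, "a");
        map.put(4, null);
        System.out.println(viaStream(map).equals(viaCopyAndRemove(map)));
    }
}
```

If the map holds no null values, remove(null) simply returns false and leaves the set untouched, so the call is harmless either way.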

Collaborator

If you end up keeping this, would recommend a short comment or test on why. I can see someone missing this in a refactor fairly easily.

Contributor Author

I went for having a consistent style on those two, I believe it is easy to understand now.

for (int i = 0; i < wanted.length; i++) {
assertEquals(wanted[i], (byte) l.get(i));
Collaborator

Since I work mostly in Go these days, I really like absolutely unified formatting. What tool are you using, and maybe we should put in a GitHub action or something to iron out the formatting. In the IntelliJ Go IDE it now runs gofmt on every save :)

Contributor Author

IntelliJ IDEA with the default formatter, but yes, I have noticed very small differences versus the existing formatting, and some unintentional reformats are still slipping through, haha. I agree it would be nice to have a format enforced by the project so we don't have to think about it.

Collaborator

Added an issue for this #43

@rudygt
Contributor Author

rudygt commented Sep 11, 2022

> Before looking at the code:
>
>   1. When I originally built the StepQueue implementation, my idea was that this was leaving the door open so that the step/field matching tasks could be dealt out to multiple threads. Which isn't going to be possible with the recursive approach. But then I decided that going concurrent probably wouldn't be a win because each field/state match is a tiny amount of work, the inevitable cost of synchronization would probably kill the benefit. But, I never tried it.
>   2. Could you do a quickie measure of the recursion depth? The JVM stack size is typically about 1M/thread. There's a reason that recursive code is strongly disapproved of in kernel code; I'm wondering if something like CityLots, only more so, could blow up the recursive approach.

I checked the recursion depth: it stays at 4 in the benchmarks, 8 in ACMachineTest, and 9 across the whole test suite. We should probably add a specific test that stresses deeper nesting than that.

@baldawar
Collaborator

Here's a dummy JSON I've used in the past: deep_json.txt. We can probably build some programmatically and add a benchmark test around it.
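
As a rough idea of what building one programmatically could look like, here is a minimal sketch that emits a JSON object nested to an arbitrary depth. The field name "level" and the class name are made up, and the output is not meant to match the contents of deep_json.txt.

```java
// Hypothetical generator for deeply nested test events.
final class DeepJsonSketch {

    // Produces {"level":{"level":{ ... "level":1 ... }}} with `depth` nested objects.
    static String nested(int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be >= 1");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("{\"level\":");
        }
        sb.append('1'); // innermost value
        for (int i = 0; i < depth; i++) {
            sb.append('}');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(nested(3));
    }
}
```

A stress test could feed nested(n) for increasing n into the matcher and check that matching still completes without a StackOverflowError.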

Comment on lines +813 to +818
for (String value : pattern.getValues()) {
NameState matchPattern = findMatchPattern(getParser().parse(pattern.type(), value), pattern);
if (matchPattern != null) {
nextNameStates.add(matchPattern);
}
}
Collaborator

Super-duper nitpicking, but I think you can have a consistent style here and in ByteMap.getTransitions() if you go for x.forEach(elem -> addWhenNotNull(...)). I'm not sure about the performance implications yet, but it may be slightly more readable. Maybe...

Contributor Author

I did update this one to the new style that I believe is easier to understand.

Collaborator

@baldawar baldawar left a comment

Gentle reminder to update the readme and pom.xml in the next iteration.

@rudygt
Contributor Author

rudygt commented Sep 12, 2022

> Gentle reminder to update the readme and pom.xml in the next iteration.

I have bumped the minor version in the pom.

I'm not sure what to do about the README performance section: my machine is a desktop Ryzen 9 5900X, so my numbers look very different from the reference MacBook's. Rishi, maybe you can take care of updating the numbers after your benchmark split lands on master?

I also added a new test to make sure deeply nested events and rules don't break anything.

Thanks!

Collaborator

@timbray timbray left a comment

Aside from one fairly trivial complaint, I'm good on this PR.

Set<NameState> nextNameStates = new HashSet<>(pattern.getValues().size());
for (String value : pattern.getValues()) {
NameState matchPattern = findMatchPattern(getParser().parse(pattern.type(), value), pattern);
if (Objects.nonNull(matchPattern)) {
Collaborator

Um, the javadocs say that Objects.nonNull() is designed to be used while filtering. Obviously it does no harm, but when I saw this I had to go read a bunch of stuff to figure out whether this was better for some cosmic reason than just matchPattern != null, and it's not. So maybe let's not make other people get the same puzzled look on their face and end up at StackOverflow…

Contributor Author

Hahaha, I got biased by the streams (where it is very common). I was trying to keep it close to the other side, but I think you are right: != null is crystal clear.

@@ -193,7 +191,13 @@ ByteTransition getTransitionForAllBytes() {
      * @return All transitions contained in this map.
      */
     Set<ByteTransition> getTransitions() {
-        return map.values().stream().filter(Objects::nonNull).collect(Collectors.toSet());
+        Set<ByteTransition> result = new HashSet<>(map.values().size());
Contributor

Just a note that the Set may end up containing fewer elements than map.values().size(), because 1) nulls are removed, and 2) the values in the map are not unique. I suppose it's probably more efficient to oversize the Set initially, though, than to undersize it by going with the default constructor.

Contributor Author

Yes, good observation. I was aware of this; I just took the size of the values as the upper bound of the set size. Performance-wise, it's cheaper to over-allocate than to let the set grow.
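
One related detail, offered as a sketch rather than a description of the merged code: HashSet's default load factor is 0.75, so a set constructed with initial capacity n can still resize before it holds n elements. Dividing the expected size by the load factor avoids that. The helper names below are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; not code from this PR.
final class PresizedSetSketch {

    // Initial capacity large enough that expectedSize elements fit under the
    // default 0.75 load factor without triggering a resize.
    static int capacityFor(int expectedSize) {
        return (int) (expectedSize / 0.75f) + 1;
    }

    // Copies the non-null values into a presized set, using values.size()
    // as an upper bound on the final size.
    static <T> Set<T> nonNullCopy(Collection<T> values) {
        Set<T> result = new HashSet<>(capacityFor(values.size()));
        for (T value : values) {
            if (value != null) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nonNullCopy(Arrays.asList("a", null, "a", "b")));
    }
}
```

Whether the extra headroom is measurable on this hot path would need a benchmark; the thread does not report one.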

return value - 48;
}
// ['A'-'F'] maps to [10-15] indexes
return (value - 65) + 10;
Contributor

Any reason not to just go with -55 here? If you're trying to lay out both steps here for understandability/readability, perhaps it would be better to go further and turn these numbers into private static final class variables?

Contributor Author

Mostly to keep it easy to understand; if we trust the compiler, this is optimized away.

I think the variable may be the better option then.

Contributor

@jonessha jonessha left a comment

LGTM, thanks for the changes!

@@ -108,11 +109,11 @@ static byte[] digitSequence(byte first, byte last, boolean includeFirst, boolean

private static int getHexByteIndex(byte value) {
Contributor Author

@jonessha what do you think? I feel that using the chars helps with clarity, and then only one obscure constant is needed.

Contributor

This looks good! Thank you!
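
For readers without the diff at hand, the idea discussed in this thread can be sketched roughly as follows. This is a reconstruction under assumptions (upper-case ASCII hex digits only, hypothetical names), not the code that was merged.

```java
// Hypothetical reconstruction of the hex-digit index mapping discussed above.
final class HexIndexSketch {

    // Index of 'A' among the 16 hex digits; the one remaining named constant.
    private static final int LETTER_OFFSET = 10;

    // Assumes value is an ASCII byte in '0'..'9' or 'A'..'F'.
    static int hexByteIndex(byte value) {
        if (value >= '0' && value <= '9') {
            // ['0'-'9'] maps to [0-9]; '0' is 48
            return value - '0';
        }
        // ['A'-'F'] maps to [10-15]; 'A' is 65, so this equals value - 55
        return (value - 'A') + LETTER_OFFSET;
    }

    public static void main(String[] args) {
        System.out.println(hexByteIndex((byte) '7'));
        System.out.println(hexByteIndex((byte) 'C'));
    }
}
```

Using char literals keeps the arithmetic identical to the 48/65 version while letting the source say what the numbers mean.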

@baldawar baldawar merged commit 84120a9 into aws:main Sep 13, 2022
@baldawar
Collaborator

Thanks again @rudygt
