some additional performance improvements #39

rudygt · 2022-09-10T16:07:52Z

Issue #25 :

Description of changes:

Following up on the clues from the slow numeric benchmark, I tested Tim's comment about using recursion in ACTask instead of the stepQueue approach (which is a big issue when we have millions of steps and the ArrayDeque needs to grow). Indeed, we got a nice speed bump by getting rid of the queue and descending recursively instead.

There are 3 individual items addressed here:

Remove stepQueue from ACTask and make it recursive instead.
Replace Stream.of().collect(Collectors.toSet()) for creating one-element sets; it is very slow compared to Collections.singleton() (around 6 times faster).
Another small optimization for getting slices of hex digits in the Range class (around 4 times faster).

Together, these changes improved the numeric benchmark performance by around 60%.
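
To illustrate the second item, here is a minimal sketch comparing the two ways of building a one-element set. The class and method names are hypothetical and not taken from the Ruler code base; the speed-up quoted above comes from the author's measurements, not from this snippet.

```java
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, for illustration only.
final class SingleElementSets {

    // Stream-based construction: sets up a stream pipeline and a collector
    // just to hold a single element.
    static <T> Set<T> viaStream(T value) {
        return Stream.of(value).collect(Collectors.toSet());
    }

    // Collections.singleton returns a small immutable set wrapping the element.
    static <T> Set<T> viaSingleton(T value) {
        return Collections.singleton(value);
    }

    public static void main(String[] args) {
        Set<String> a = viaStream("x");
        Set<String> b = viaSingleton("x");
        // Both sets contain exactly the same element.
        System.out.println(a.equals(b));
    }
}
```

One thing to keep in mind: Collections.singleton() returns an immutable set, so the swap is only safe where callers never try to add to or remove from the result.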

Benchmark / Performance:

BEFORE:

Reading citylots2
Read 213068 events
EXACT events/sec: 256708.4
WILDCARD events/sec: 200063.8
PREFIX events/sec: 297581.0
SUFFIX events/sec: 296339.4
EQUALS_IGNORE_CASE events/sec: 256399.5
NUMERIC events/sec: 3476.8
ANYTHING-BUT events/sec: 160806.0
COMBO events/sec: 3466.1
Reading citylots2
Read 213068 events
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 10579
Events/sec: 20140.7
 Rules/sec: 140984.6
Before: 212.3
After: 3029.5
Per rule: -7042
Turning JSON into field-lists...
Finding Rules...
Lines: 213068, Msec: 2055
Events/sec: 103682.7
Before: 216.3
After: 1749.4
Per rule: -3832
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 1064
Events/sec: 200251.9
 Rules/sec: 729317345.9
Reading citylots2
Read 213068 events
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Matched: 52527
Lines: 213068, Msec: 14695
Events/sec: 14499.4
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 9711
Events/sec: 21940.9
 Rules/sec: 153586.2


AFTER:

Reading citylots2
Read 213068 events
EXACT events/sec: 273164.1
WILDCARD events/sec: 201007.5
PREFIX events/sec: 307901.7
SUFFIX events/sec: 307901.7
EQUALS_IGNORE_CASE events/sec: 271078.9
NUMERIC events/sec: 5424.3
ANYTHING-BUT events/sec: 172944.8
COMBO events/sec: 5280.0
Reading citylots2
Read 213068 events
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 9846
Events/sec: 21640.1
 Rules/sec: 151480.4
Before: 182.9
After: 3029.5
Per rule: -7116
Turning JSON into field-lists...
Finding Rules...
Lines: 213068, Msec: 2211
Events/sec: 96367.3
Before: 216.3
After: 1749.4
Per rule: -3832
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 1066
Events/sec: 199876.2
 Rules/sec: 727949020.6
Reading citylots2
Read 213068 events
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Matched: 52527
Lines: 213068, Msec: 10880
Events/sec: 19583.5
Reading lines...
Finding Rules...
Lots: 10000
Lots: 20000
Lots: 30000
Lots: 40000
Lots: 50000
Lots: 60000
Lots: 70000
Lots: 80000
Lots: 90000
Lots: 100000
Lots: 110000
Lots: 120000
Lots: 130000
Lots: 140000
Lots: 150000
Lots: 160000
Lots: 170000
Lots: 180000
Lots: 190000
Lots: 200000
Lots: 210000
Lines: 213068, Msec: 9087
Events/sec: 23447.6
 Rules/sec: 164132.9

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

baldawar · 2022-09-10T18:01:50Z

Will look at this on Monday Pacific time. Just dropped in to make some minor callouts.

You don't need to use git force push unless you like using it. When merging a pull-request, we squash everything into a single commit.
We should start updating the README for performance gains and do a minor version bump in pom.xml
This made my weekend even better. Thank you. ❤️

timbray · 2022-09-11T20:42:28Z

Before looking at the code:

When I originally built the StepQueue implementation, my idea was that this was leaving the door open so that the step/field matching tasks could be dealt out to multiple threads. Which isn't going to be possible with the recursive approach. But then I decided that going concurrent probably wouldn't be a win because each field/state match is a tiny amount of work, the inevitable cost of synchronization would probably kill the benefit. But, I never tried it.
Could you do a quickie measure of the recursion depth? The JVM stack size is typically about 1M/thread. There's a reason that recursive code is strongly disapproved of in kernel code; I'm wondering if something like CityLots, only more so, could blow up the recursive approach.

timbray

I will need to check out your branch to really understand the new ACFinder implementation. In the back of my mind I'm wondering if, after removing the stepQueue, we still need the ACTask and ACStep classes. Will pull the branch and have a look.

timbray · 2022-09-11T20:46:51Z

src/main/software/amazon/event/ruler/ByteMap.java

@@ -193,7 +191,9 @@ ByteTransition getTransitionForAllBytes() {
     * @return All transitions contained in this map.
     */
    Set<ByteTransition> getTransitions() {
-        return map.values().stream().filter(Objects::nonNull).collect(Collectors.toSet());
+        Set<ByteTransition> result = new HashSet<>(map.values());
+        result.remove(null);


Huh? Don't know this idiom, why remove(null)?

Basically, stream().filter().collect() creates a set from the values, removing all nulls in the process. But there is some overhead going that route (streams are slow in hot paths), so I tried to reproduce the same behavior; that is why I remove null after creating the set. As far as I remember, the set can hold one copy of null. I will double-check to make sure this is needed.

If you end up keeping this, would recommend a short comment or test on why. I can see someone missing this in a refactor fairly easily.

I went for a consistent style on those two; I believe it is easy to understand now.
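
For readers unfamiliar with the idiom discussed in this thread, the following sketch shows that copying the values into a HashSet and then calling remove(null) yields the same set as the stream-based filter. The class and method names are hypothetical, and a plain Map stands in for the internal map of ByteMap.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
final class NonNullValues {

    // Original approach: filter nulls while streaming the values.
    static <K, V> Set<V> viaStream(Map<K, V> map) {
        return map.values().stream()
                .filter(Objects::nonNull)
                .collect(Collectors.toSet());
    }

    // Approach from the diff: a HashSet keeps at most one null,
    // so a single remove(null) drops every null value at once.
    static <K, V> Set<V> viaCopyAndRemove(Map<K, V> map) {
        Set<V> result = new HashSet<>(map.values());
        result.remove(null);
        return result;
    }
}
```

Duplicate values collapse in both variants, because the result is a set in either case.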

timbray · 2022-09-11T20:51:23Z

src/test/software/amazon/event/ruler/CIDRTest.java

        for (int i = 0; i < wanted.length; i++) {
-            assertEquals(wanted[i], (byte) l.get(i));


Since I work mostly in Go these days, I really like absolutely unified formatting. What tool are you using, and maybe we should put in a GitHub action or something to iron out the formatting. In the IntelliJ Go IDE it now runs gofmt on every save :)

IntelliJ IDEA with the default formatter, but yes, I have noticed very small differences versus the existing format, and some unintentional reformats are still slipping through, haha. I agree it would be nice to have a format enforced by the project so we don't have to care about them.

Added an issue for this #43

rudygt · 2022-09-11T23:02:27Z

Before looking at the code:

When I originally built the StepQueue implementation, my idea was that this was leaving the door open so that the step/field matching tasks could be dealt out to multiple threads. Which isn't going to be possible with the recursive approach. But then I decided that going concurrent probably wouldn't be a win because each field/state match is a tiny amount of work, the inevitable cost of synchronization would probably kill the benefit. But, I never tried it.

Could you do a quickie measure of the recursion depth? The JVM stack size is typically about 1M/thread. There's a reason that recursive code is strongly disapproved of in kernel code; I'm wondering if something like CityLots, only more so, could blow up the recursive approach.

I did check the depth level: it stays at 4 on the benchmarks, 8 in ACMachineTest, and 9 across the whole test suite. We should probably add one specific test to stress deeper than that.
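
As a rough illustration of the trade-off being discussed, the sketch below contrasts a queue-driven traversal with a recursive one that records the deepest level it reaches. Step and TraversalSketch are hypothetical stand-ins and do not reproduce the actual ACTask, ACStep or ACFinder code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a matching step; not a class from Ruler.
final class Step {
    final List<Step> next = new ArrayList<>();
}

final class TraversalSketch {
    int maxDepth; // deepest recursion level seen by visitRecursive

    // Queue-based traversal: pending steps are stored in an ArrayDeque,
    // which has to grow when many steps are outstanding.
    static int visitWithQueue(Step root) {
        Deque<Step> queue = new ArrayDeque<>();
        queue.add(root);
        int visited = 0;
        while (!queue.isEmpty()) {
            Step step = queue.remove();
            visited++;
            queue.addAll(step.next);
        }
        return visited;
    }

    // Recursive traversal: pending work lives on the call stack instead,
    // and the current depth is recorded for inspection.
    int visitRecursive(Step step, int depth) {
        maxDepth = Math.max(maxDepth, depth);
        int visited = 1;
        for (Step child : step.next) {
            visited += visitRecursive(child, depth + 1);
        }
        return visited;
    }
}
```

Calling visitRecursive(root, 1) and then reading maxDepth gives the kind of number reported above; how deep a real event can drive the recursion depends on the event and rule structure.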

baldawar · 2022-09-12T17:10:11Z

Here's a dummy JSON I've used in the past: deep_json.txt. We can probably build some programmatically and add a benchmark test around it.
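
A minimal sketch of building such a document programmatically might look like this. The class name, method name and field names ("level0", "level1", ...) are made up for illustration and are unrelated to the contents of deep_json.txt.

```java
// Hypothetical generator for a deeply nested JSON object.
final class DeepJson {

    // Returns a JSON document with `depth` nested objects around a string leaf.
    // With depth 0 the result is just the JSON string "leaf".
    static String nested(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("{\"level").append(i).append("\":");
        }
        sb.append("\"leaf\"");
        for (int i = 0; i < depth; i++) {
            sb.append('}');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(nested(2)); // {"level0":{"level1":"leaf"}}
    }
}
```

Using a loop and a StringBuilder rather than recursion means the generator itself cannot overflow the stack while producing very deep inputs.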

baldawar · 2022-09-12T17:21:59Z

src/main/software/amazon/event/ruler/ByteMachine.java

+        for (String value : pattern.getValues()) {
+            NameState matchPattern = findMatchPattern(getParser().parse(pattern.type(), value), pattern);
+            if (matchPattern != null) {
+                nextNameStates.add(matchPattern);
+            }
+        }


Super-duper nitpicking, but I think you can have a consistent style here and in ByteMap.getTransitions() if you go for x.forEach(elem -> addWhenNotNull(...)). I'm not sure about the performance implications yet, but it may be slightly more readable. Maybe...

I did update this one to the new style, which I believe is easier to understand.

baldawar

Gentle reminder to update the readme and pom.xml in the next iteration.

baldawar · 2022-09-12T17:23:19Z

src/main/software/amazon/event/ruler/ByteMap.java

@@ -193,7 +191,9 @@ ByteTransition getTransitionForAllBytes() {
     * @return All transitions contained in this map.
     */
    Set<ByteTransition> getTransitions() {
-        return map.values().stream().filter(Objects::nonNull).collect(Collectors.toSet());
+        Set<ByteTransition> result = new HashSet<>(map.values());
+        result.remove(null);


If you end up keeping this, would recommend a short comment or test on why. I can see someone missing this in a refactor fairly easily.

baldawar · 2022-09-12T17:26:45Z

src/test/software/amazon/event/ruler/CIDRTest.java

        for (int i = 0; i < wanted.length; i++) {
-            assertEquals(wanted[i], (byte) l.get(i));


Added an issue for this #43

rudygt · 2022-09-12T19:51:28Z

Gentle reminder to update the readme and pom.xml in the next iteration.

I have increased the minor version on the pom.

Not sure what to do about the README performance section: my machine is a desktop Ryzen 9 5900X, so my numbers look very different from the reference MacBook. Rishi, maybe you can take care of updating the numbers after your benchmark split lands on master?

I also added a new test to make sure deeply nested events and rules don't break anything.

Thanks!

timbray

Aside from one fairly trivial complaint, I'm good on this PR.

timbray · 2022-09-12T19:58:48Z

src/main/software/amazon/event/ruler/ByteMachine.java

+        Set<NameState> nextNameStates = new HashSet<>(pattern.getValues().size());
+        for (String value : pattern.getValues()) {
+            NameState matchPattern = findMatchPattern(getParser().parse(pattern.type(), value), pattern);
+            if (Objects.nonNull(matchPattern)) {


Um, the javadocs say that Objects.nonNull() is designed to be used while filtering. Obviously it does no harm, but when I saw this I had to go read a bunch of stuff to figure out whether this was better for some cosmic reason than just matchPattern != null, and it's not. So maybe let's not make other people get the same puzzled look on their face and end up at StackOverflow…

Hahaha, I got biased by the streams (very common in that context). I did try to make it closer to the other side, but I think you are right, != null is crystal clear.

jonessha · 2022-09-12T22:41:50Z

src/main/software/amazon/event/ruler/ByteMap.java

@@ -193,7 +191,13 @@ ByteTransition getTransitionForAllBytes() {
     * @return All transitions contained in this map.
     */
    Set<ByteTransition> getTransitions() {
-        return map.values().stream().filter(Objects::nonNull).collect(Collectors.toSet());
+        Set<ByteTransition> result = new HashSet<>(map.values().size());


Just a note that the Set may end up containing fewer elements than map.values().size(), because 1) nulls are removed, and 2) the values in the map are not unique. I suppose it's probably more efficient to oversize the Set initially than to undersize it by going with the default constructor.

Yes, good observation. I was aware of this; I just took the size of the values as the upper bound of the set size, but perf-wise it's cheaper to over-allocate than to let the set grow.
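
As a side note on presizing: the int argument of the HashSet constructor is an initial capacity, not an expected element count, and with the default load factor of 0.75 a set created with capacity n can still resize before it holds n elements. The helper below is a hypothetical sketch of a common way to account for that; it is not part of this PR.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class PresizedSets {

    // Capacity large enough that expectedSize elements stay under the
    // default 0.75 load factor threshold.
    static int capacityFor(int expectedSize) {
        return (int) (expectedSize / 0.75f) + 1;
    }

    // Creates an empty HashSet that should not need to grow while
    // up to expectedSize elements are added.
    static <T> Set<T> withExpectedSize(int expectedSize) {
        return new HashSet<>(capacityFor(expectedSize));
    }
}
```

Whether the extra headroom matters in practice would need measuring; for the small transition sets discussed here, the upper-bound sizing in the diff may already be good enough.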

jonessha · 2022-09-12T22:49:48Z

src/main/software/amazon/event/ruler/Range.java

+            return value - 48;
+        }
+        // ['A'-'F'] maps to [10-15] indexes
+        return (value - 65) + 10;


Any reason not to just go with -55 here? If you're trying to lay out both steps here for understandability/readability, perhaps it would be better to go further and turn these numbers into private static final class variables?

Mostly to try to keep it easy to understand; if we trust the compiler, this gets optimized away.

I think the variable may be the better option then

jonessha

LGTM, thanks for the changes!

rudygt · 2022-09-12T23:17:15Z

src/main/software/amazon/event/ruler/Range.java

@@ -108,11 +109,11 @@ static byte[] digitSequence(byte first, byte last, boolean includeFirst, boolean

    private static int getHexByteIndex(byte value) {


@jonessha what do you think? I feel like using the chars helps with the clarity , and then only one obscure constant is needed

This looks good! Thank you!
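
The idea agreed on in this thread, using character literals plus a single named constant, could be sketched roughly as follows. This is an illustration under the assumption that inputs are digits or upper-case 'A' to 'F', as in the diff comment; the class and constant names are hypothetical and this is not the merged code.

```java
// Hypothetical sketch of a hex-digit-to-index mapping.
final class HexIndex {

    // Index of 'A' within the sequence 0-9, A-F.
    private static final int FIRST_LETTER_INDEX = 10;

    // Maps '0'..'9' to 0..9 and 'A'..'F' to 10..15.
    // Assumes the input is a valid upper-case hex digit.
    static int getHexByteIndex(byte value) {
        if (value >= '0' && value <= '9') {
            return value - '0';
        }
        return (value - 'A') + FIRST_LETTER_INDEX;
    }
}
```

Here '0' and 'A' replace the literals 48 and 65 from the earlier revision, so the only remaining number is the offset of the first letter.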

baldawar · 2022-09-13T05:16:00Z

Thanks again @rudygt

rudygt added 3 commits September 10, 2022 17:09

speed up hex digitSequence creation

4b080f1

avoid Stream.of().collect(Collectors.toSet()) to create sets

35f3ef4

get away from stepQueue make ACFinder recursive instead

efff79a

rudygt force-pushed the main branch from 407e3ba to efff79a Compare September 10, 2022 16:11

timbray reviewed Sep 11, 2022

baldawar reviewed Sep 12, 2022

baldawar requested changes Sep 12, 2022

rudygt added 2 commits September 12, 2022 20:27

use a consistent style to create a set from a list

5c5a35b

add benchmark with deep nested events, bump version minor

73311cc

rudygt requested a review from baldawar September 12, 2022 19:56

timbray reviewed Sep 12, 2022

go back to != null instead of Objects.nonNull

9d34ea8

timbray approved these changes Sep 12, 2022

jonessha reviewed Sep 12, 2022

jonessha approved these changes Sep 12, 2022

reduce magic constant usage on hex digit sequence generation

ab3cb90

rudygt commented Sep 12, 2022

baldawar mentioned this pull request Sep 13, 2022

Make Benchmark Tests Run as part of CI for Pull-Requests #44

Open

baldawar approved these changes Sep 13, 2022

baldawar merged commit 84120a9 into aws:main Sep 13, 2022
