
[DefaultRouter] fix unnecessary system clock reads due to races accessing router state #694

Conversation


@dferstay dferstay commented Dec 27, 2021

Motivation

Previously, we used atomic operations to read and update parts of the
default router state. Unfortunately, the reads and updates could race
under concurrent calls, leading to unnecessary clock reads and an
associated slowdown in performance.

Modifications

Now, we use atomic addition to increment the message count and batch size.
This removes the race condition by ensuring that each go-routine will have
a unique messageCount, and hence only one will perform the clock read.
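As a rough sketch of this idea (illustrative names only, not the actual router code): atomic addition returns a unique running count to each go-routine, so exactly one go-routine per batch observes the boundary and pays for the clock read.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// routeN simulates n concurrent sends through the router and returns
// how many goroutines took the (expensive) clock-read branch.
func routeN(n int, maxBatchingMessages uint32) uint32 {
	var msgCount, clockReads uint32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// AddUint32 returns the post-increment value, so every
			// goroutine observes a unique running total; only the one
			// that lands exactly on a batch boundary takes the branch.
			c := atomic.AddUint32(&msgCount, 1)
			if c%maxBatchingMessages == 0 {
				atomic.AddUint32(&clockReads, 1)
			}
		}()
	}
	wg.Wait()
	return clockReads
}

func main() {
	// 8 messages with a batch size of 4 cross the boundary twice
	// (at counts 4 and 8), regardless of goroutine interleaving.
	fmt.Println(routeN(8, 4)) // 2
}
```

Because the boundary check depends only on the unique value returned by the atomic add, the result is deterministic even under contention.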

Verifying this change

  • Make sure that the change passes the CI checks.

Run the default router unit tests to verify correctness.

Run the default router bench parallel bench tests (#693) and verify the performance speedup; results before and after below:

name                      old time/op  new time/op  delta
DefaultRouterParallel     14.7ns ± 1%  14.8ns ± 2%     ~     (p=0.459 n=9+8)
DefaultRouterParallel-2   55.0ns ±13%  41.9ns ± 0%  -23.86%  (p=0.000 n=10+7)
DefaultRouterParallel-4   53.5ns ± 9%  44.1ns ± 8%  -17.68%  (p=0.000 n=10+9)
DefaultRouterParallel-8   54.2ns ± 8%  53.2ns ± 3%     ~     (p=1.000 n=10+8)
DefaultRouterParallel-16  56.4ns ±21%  51.3ns ± 0%     ~     (p=0.165 n=10+8)

The large variance in the old DefaultRouterParallel-2 and DefaultRouterParallel-16 results is due to the nature of the race described above: some test runs read the system clock more often than others.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): No
  • The public API: No
  • The schema: No
  • The default values of configurations: No
  • The wire protocol: No

Documentation

  • Does this pull request introduce a new feature? No

@wolfstudy (Member)

@dferstay #684 It looks like this PR is dealing with the same thing

@dferstay dferstay changed the title Remove races reading and updating the default router state [DefaultRouter] fix unnecessary system clock reads due to races accessing router state Dec 27, 2021
@dferstay (Contributor Author)

@wolfstudy ,

@dferstay #684 It looks like this PR is dealing with the same thing

Indeed; however, #684 introduces a mutex on the publish path, which has a negative impact on performance, especially under load from multiple concurrent go-routines. A parallel DefaultRouter bench test was added in #693 to quantify the performance impact.

For details, see my comment on #684 here: #684 (comment)


Old:
messageCountReached := previousMessageCount >= uint32(maxBatchingMessages-1)
sizeReached := (size >= uint32(maxBatchingSize)-previousBatchingMaxSize)

New:
messageCountReached := messageCount%uint32(maxBatchingMessages) == 0
@cckellogg (Contributor)
Should this be messageCount >= uint32(maxBatchingMessages)?

@dferstay (Contributor Author)

@cckellogg ,
The initial approach taken in this commit was to modify only the messageCount in the router state, using an atomic increment to avoid races between the point where the counter is modified (on a new message) and the point where it is reset (when the batch is completed). As a result, the count is a running total of all messages that have passed through the router, so we use modulo arithmetic to determine when maxBatchingMessages has been reached.

@dferstay (Contributor Author) Dec 30, 2021

Upon further reflection, I think the use of a running total may have introduced a regression whereby we switch partitions prematurely. The behaviour we want is documented in the comment on L74:

// ...sticking with a given partition for a certain amount of messages or volume buffered or the max delay to
// batch is reached so that we ensure having a decent amount of batching of the messages.

If the partition switch is triggered by either size or delay, the messages routed to the previous partition will be included in the count of messages being routed to the new partition, because the message count is not reset on switch.

I'm thinking that we can safely reset the message count on partition switch. The use of atomic increment on the message count will still effectively latch the if expression that protects the clock read on L88.

I'll make this change, gather bench results, and report back.
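A minimal sketch of the proposed fix, with assumed names (the reset shown here is simplified to a single-threaded driver; coordinating the reset among racing go-routines needs the additional care discussed in this thread):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type routerState struct {
	msgCount uint32
}

// onMessage returns true for the goroutine whose increment reaches the
// batching threshold; that goroutine resets the counter as part of
// switching partitions, so the count reflects only messages routed to
// the current partition rather than a running total.
func (s *routerState) onMessage(maxBatchingMessages uint32) bool {
	if atomic.AddUint32(&s.msgCount, 1) >= maxBatchingMessages {
		// Reset so the next partition starts counting from zero;
		// messages sent to the previous partition no longer leak
		// into the new partition's count.
		atomic.StoreUint32(&s.msgCount, 0)
		return true
	}
	return false
}

func main() {
	s := &routerState{}
	switches := 0
	for i := 0; i < 10; i++ {
		if s.onMessage(5) {
			switches++
		}
	}
	fmt.Println(switches) // 2: after message 5 and after message 10
}
```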

@dferstay (Contributor Author)

Updated parallel bench results before and after this commit:

name                      old time/op  new time/op  delta
DefaultRouterParallel     14.7ns ± 1%  14.9ns ± 6%     ~     (p=0.248 n=9+9)
DefaultRouterParallel-2   55.0ns ±13%  33.7ns ± 8%  -38.71%  (p=0.000 n=10+9)
DefaultRouterParallel-4   53.5ns ± 9%  28.8ns ± 4%  -46.13%  (p=0.000 n=10+9)
DefaultRouterParallel-8   54.2ns ± 8%  36.3ns ± 1%  -33.04%  (p=0.000 n=10+8)
DefaultRouterParallel-16  56.4ns ±21%  39.5ns ±21%  -29.98%  (p=0.000 n=10+10)

@dferstay (Contributor Author)

I've made a refinement that uses compare-and-swap to handle multiple go-routines racing to increment the partition cursor, which was causing partitions to be skipped. This has the side-effect that only one of the racing go-routines will read the system clock to record the time the last batch was completed. Results before and after this commit are below:

name                      old time/op  new time/op  delta
DefaultRouterParallel     14.7ns ± 1%  18.5ns ± 4%  +25.88%  (p=0.000 n=9+9)
DefaultRouterParallel-2   55.0ns ±13%  43.5ns ± 2%  -20.93%  (p=0.000 n=10+8)
DefaultRouterParallel-4   53.5ns ± 9%  40.0ns ±11%  -25.29%  (p=0.000 n=10+10)
DefaultRouterParallel-8   54.2ns ± 8%  41.9ns ± 5%  -22.68%  (p=0.000 n=10+10)
DefaultRouterParallel-16  56.4ns ±21%  41.7ns ±22%  -25.98%  (p=0.000 n=10+10)
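The compare-and-swap refinement can be sketched roughly like this (assumed names, not the actual pulsar-client-go code): each go-routine attempts to advance the cursor from the value it just read; only one CAS succeeds per step, and the losers back off instead of double-incrementing, so no partition index is skipped, and only the winner would go on to read the system clock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// advanceCursor tries to move the partition cursor from its current
// value to current+1. Among any set of racing goroutines, exactly one
// CAS succeeds; the others observe that the cursor already advanced
// and back off, so the cursor moves by exactly one per batch boundary.
func advanceCursor(cursor *uint32) bool {
	current := atomic.LoadUint32(cursor)
	return atomic.CompareAndSwapUint32(cursor, current, current+1)
}

func main() {
	var cursor, wins uint32
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if advanceCursor(&cursor) {
				// Only the CAS winner would also record time.Now()
				// as the completion time of the last batch.
				atomic.AddUint32(&wins, 1)
			}
		}()
	}
	wg.Wait()
	// Invariant: the cursor advanced exactly once per winning CAS,
	// regardless of how many goroutines lost the race.
	fmt.Println(atomic.LoadUint32(&cursor) == atomic.LoadUint32(&wins)) // true
}
```

The exact final cursor value depends on scheduling, but the invariant that one successful CAS corresponds to exactly one cursor step always holds.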

[DefaultRouter] fix unnecessary system clock reads due to races accessing router state

Previously, we used atomic operations to read and update parts of the
default router state.  Unfortunately, the reads and updates could race
under concurrent calls leading to unnecessary clock reads and an
associated slowdown in performance.

Now, we use atomic addition to increment the message count and batch size.
This removes the race condition by ensuring that each go-routine will have
a unique messageCount, and hence only one will perform the clock read.

Furthermore, we use atomic compare-and-swap to ensure that partitions are
not skipped if multiple go-routines attempt to increment the partition
cursor.

Signed-off-by: Daniel Ferstay <dferstay@splunk.com>
@dferstay dferstay force-pushed the default-router-performance-and-correctness-improvements branch from a659bfb to 1d8a0d3 on January 2, 2022 19:50
@wolfstudy (Member) left a comment

LGTM +1

@wolfstudy wolfstudy merged commit a119bab into apache:master Jan 17, 2022