Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARTEMIS-1999 Broker uses 100% core's CPU time if msg grouping is used #2203

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
3 participants
@franz1981
Copy link
Contributor

franz1981 commented Jul 31, 2018

The deliver loop won't give up trying to deliver messages when
back-pressure kicks in (credits and/or TCP) if msg grouping is used and
there are many consumers registered: this change will allow the loop
to exit by instructing the logic that the group consumer is the only
consumer to check.

@franz1981

This comment has been minimized.

Copy link
Contributor Author

franz1981 commented Jul 31, 2018

@clebertsuconic Please take a look if the change seems to break any other logic and @michaelandrepearce if exclusive consumers should be affected by a similar issue (IMO it shouldn't be the case)

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Jul 31, 2018

Would be possible a test? without a test I don't know how to validate the change TBH.

At least were you able to run the whole testsuite.

@franz1981

This comment has been minimized.

Copy link
Contributor Author

franz1981 commented Jul 31, 2018

I have already run the entire test suite that is already filled of AMQP and CORE JMS message group tests AFAIK.
About the performance implications (not anymore 100% CPU) I don't know what could I write to validate it with a test...

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Jul 31, 2018

@franz1981 make a test that will exercise the loop. Try to get a synchronize on Queue...

if the test hangs.. it's a bug... use a timeout tag on the test.

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Jul 31, 2018

@franz1981 i exclusive yes i think would have similar issue, after all it followed the same logic of message groups in part. It would be easy to fix, as in the same place / if statement just check for the exclusive flag. (best way to confirm this would be to extend the test case @clebertsuconic is suggesting to do the same with exclusive)

Btw your fix assumes a message group has been assigned already, what if all consumers are busy and a message group isnt assigned? it would still spin i assume.

I would probably change the check you are doing to see if groupID is set or not.

@@ -2370,10 +2370,10 @@ private void deliver() {
}
}

if (pos == endPos) {
// Round robin'd all
if (pos == endPos || groupConsumer != null) {

This comment has been minimized.

Copy link
@michaelandrepearce

michaelandrepearce Jul 31, 2018

Contributor

groupConsumer is only set if a msg group is already assigned, or has succesfully handled. i would change this to check if groupID is not null.

@franz1981

This comment has been minimized.

Copy link
Contributor Author

franz1981 commented Jul 31, 2018

exclusive yes i think would have similar issue, after all it followed the same logic of message groups in part. It would be easy to fix, as in the same place / if statement just check for the exclusive flag.

Nice, I suppose that would be better to raise a different issue/PR for that even if I'm tempted to do it fro this one: it is indeed fixing a similar but different issue.
I'm anyway opened to do it differently too 👍

what if all consumers are busy and a message group isnt assigned? it would still spin i assume.

if the message group isn't assigned it will round robin between the consumers until noDeliver == size: in that case it will stop spinning without burning any CPU, because deliverAsync won't be called anymore.
The reason why it won't stop spinning if groupConsumer is set is that the consumer round robin isn't happening in that case and it won't hit the logic that will stop the loop.

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Jul 31, 2018

@franz1981 once you do fix for this with test and this is merged, ill fix exclusive quickly, as then i can just rip your hard work :P :P :P

btw how you get a clean PR build ? without the damn MultiThreadAsynchronousFileTest failing...like it has been for most PR's recently?

@franz1981

This comment has been minimized.

Copy link
Contributor Author

franz1981 commented Jul 31, 2018

@michaelandrepearce ahah fine!

btw how you get a clean PR build ? without the damn MultiThreadAsynchronousFileTest failing...like it has been for most PR's recently?

That's a nice question indeed...it isn't an intermittent failure? Probably I've been lucky...or unlucky depends on the point of views :P

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Jul 31, 2018

@michaelandrepearce would be too much of a hack to add a -Ptravis profile and add a property to ignore those tests?

Those tests run at least daily on my CI and they never fail.

It's probably a question of limited resources or the use of hypervisors on a public cloud. We don't actually need those to validate PR buidls.

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Jul 31, 2018

@franz1981 so are you adding a test?

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Jul 31, 2018

@clebertsuconic before i went on holiday it didnt seem to error as much as it is now, i wonder if some recent merge has destabilized the build and in particular this test case? I would worry about ignoring it, as its a concurrency test so it maybe actually highlighting an issue thats been introduced by some recent change / merge. Has there been any changes around the journals?

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Jul 31, 2018

@michaelandrepearce there are no changes around the journal...

We already have a profile that will ignore a lot of tests.. and anyone running the full testsuite would still be able to capture regressions.

I will take a look on the failures, but if this is indeed environmental we should move them to the ignore-list on the -Pfast-tests

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Aug 1, 2018

@franz1981 please watch the examples. On my private CI an example failed on your branch.

@franz1981

This comment has been minimized.

Copy link
Contributor Author

franz1981 commented Aug 1, 2018

@clebertsuconic I have tried both message-group and message-group2 examples and they are working: which examples are not working for you?
I have rebased and run all the examples and now it seems to pass: i suppose that the failures wasn't related to the change I did.
I have added a test (that's quite strict I would say) to test it too: let me know if it seems ok 👍

@franz1981 franz1981 force-pushed the franz1981:ARTEMIS-1999 branch 2 times, most recently from 149dd7f to 226d12f Aug 1, 2018

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Aug 3, 2018

@franz1981 can you rebase here? what to do with this?

@franz1981 franz1981 force-pushed the franz1981:ARTEMIS-1999 branch from 226d12f to 3105b09 Aug 3, 2018

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Aug 3, 2018

just one nit-pick.. this test should be on QueueImplTest, away from timing... (integration-tests)

can you move it?

ARTEMIS-1999 Broker uses 100% core's CPU time if msg grouping is used
The deliver loop won't give up trying to deliver messages when
back-pressure kicks in (credits and/or TCP) if msg grouping is used and
there are many consumers registered: this change will allow the loop
to exit by instructing the logic that the group consumer is the only
consumer to check.

@franz1981 franz1981 force-pushed the franz1981:ARTEMIS-1999 branch from 3105b09 to 755c6e2 Aug 4, 2018

@asfgit asfgit closed this in 3d2eadf Aug 8, 2018

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Aug 8, 2018

@franz1981 I have moved this into 2.6.x but I had to manually make the changes.. can you check 2.6.x please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.