Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARTEMIS-2214 Cache durable&deliveryTime in PagedReference #2482

Closed
wants to merge 1 commit into from

Conversation

Projects
None yet
5 participants
@qihongxu
Copy link
Contributor

qihongxu commented Dec 26, 2018

We recently performed a test on artemis broker and found a severe performance issue.

When paged messages are being consumed, decrementMetrics in QueuePendingMessageMetrics will try to ‘getMessage’ to check whether they are durable or not. In this way queue will be locked for a long time because page may be GCed and need to be reload entirely. Other operations rely on queue will be blocked at this time, which cause a significant TPS drop. Detailed stacks are attached below.

This also happens when consumer is closed and messages are pushed back to the queue, artemis will check priority on return if these messages are paged.

To solve the issue, durable and priority need to be cached in PagedReference just like messageID, transactionID and so on. I have applied a patch to fix the issue. Any review is appreciated.

jira issue:https://issues.apache.org/jira/projects/ARTEMIS/issues/ARTEMIS-2214

@qihongxu

This comment has been minimized.

Copy link
Contributor Author

qihongxu commented Dec 26, 2018

@franz1981

This comment has been minimized.

Copy link
Contributor

franz1981 commented Dec 26, 2018

Nice catch!!!
If you care about checking perf of paging please try to profile the broker with https://github.com/jvm-profiling-tools/async-profiler using lock/cpu events too: it will allows to get profile data without affecting performance, helping to find issues like this one :)

@@ -74,6 +74,10 @@

private long messageSize = -1;

private byte priority;

This comment has been minimized.

Copy link
@michaelandrepearce

michaelandrepearce Dec 27, 2018

Contributor

default this to -1


@Override
public boolean isDurable() {
if (messageID < 0) {

This comment has been minimized.

Copy link
@michaelandrepearce

michaelandrepearce Dec 27, 2018

Contributor

should use durable field or additional flag to see if its been cached, messageID maybe cached already but not durable. possibly make the held field a byte with 0 = false, 1 = true and -1 not set.

@@ -74,6 +74,10 @@

private long messageSize = -1;

private byte priority;

private boolean durable;

This comment has been minimized.

Copy link
@michaelandrepearce

michaelandrepearce Dec 27, 2018

Contributor

need to default this as "not set" somehow, possible use a byte to represent the boolean with 0 = false, 1 = true, -1 not set

@qihongxu qihongxu force-pushed the qihongxu:modify_pagedReference branch from a49ad88 to 7327ac8 Dec 28, 2018

@qihongxu

This comment has been minimized.

Copy link
Contributor Author

qihongxu commented Dec 28, 2018

@michaelandrepearce Thanks for the review. Now I have changed the code at your suggestion.

@qihongxu

This comment has been minimized.

Copy link
Contributor Author

qihongxu commented Dec 29, 2018

Nice catch!!!
If you care about checking perf of paging please try to profile the broker with https://github.com/jvm-profiling-tools/async-profiler using lock/cpu events too: it will allows to get profile data without affecting performance, helping to find issues like this one :)

Thanks for your advice! We used to record performance from producer/consumer side. Later on we will try to dig into broker side by using tools as recommended :)

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Dec 29, 2018

Just looking over this a bit more. As the page ref needs to be a small as possible.

The whole point of paging is to remove it from memory to be able to scale, there is a little we need to keep but we need to be super frugal. This is more important that we keep paging lean over perf, an ideal world paging would mean nothing was left in memory tbh, and thus mean broker could scale limitless to memory and then only disk. as performance wise already consumer are not keeping up and therefor e2e latencies are going to be massively out. If you care for that you shouldnt be paging in first place.

I get the durable point as metrics are done on every consumed message and would be on a frequent / hot path, But im less convinced in the priority one.

With regards to priority i dont see this being accessed or affecting a hot path its only on messages being added to message references. On intial adding during address to queue routing the message will be in memory still at this point. In regards to a consumer closing and having unhandled messages. This isnt on hot path imo, nor should be treated like a frequent event and shouldnt be tuning for that, eg you expect consumers to be long lived in a throughput or latency world and only go away rarely.

Based on this are you ok to remove the priority addition?

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Jan 3, 2019

@qihongxu nudge

@@ -120,6 +128,8 @@ public PagedReferenceImpl(final PagePosition position,
this.largeMessage = message.getMessage().isLargeMessage() ? IS_LARGE_MESSAGE : IS_NOT_LARGE_MESSAGE;
this.transactionID = message.getTransactionID();
this.messageID = message.getMessage().getMessageID();
this.priority = message.getMessage().getPriority();

This comment has been minimized.

Copy link
@wy96f

wy96f Jan 4, 2019

Contributor

deliveryTime can be set in the constructor like transactionID , messageID , etc :)

@qihongxu

This comment has been minimized.

Copy link
Contributor Author

qihongxu commented Jan 4, 2019

@michaelandrepearce
There are some cases will perform lots of rollbacks in a short period of time. For example if we would like to upgrade our server while thousands of consumers are receiving message, the close of server causes massive rollbacks to original address. Thus queue might blocked on reading GC'ed pages and during this period all ActiveMQ-server threads are blocked too, no operations could be done. Under this circumstance it will take more than 5-10 minutes(e.g 2000 consumers) to completely remove all consumers, which is too long time for downstream systems.

deliveryTime can be set in the constructor like transactionID , messageID , etc :)

@wy96f
Yes we do find similar blocks in PagedReferenceImpl::getScheduledDeliveryTime() since if deliveryTime is not set it will call getMessage() during rollback.
getPriority.txt
getScheduledDeliveryTime.txt

To these two situations, detailed stacks are shown in attachment.

Considering priority only occupy one byte, it might be worthwhile to add it in PageRef to improve stability:) As for deliveryTime, since it is already in PageRef, we can simply add this.deliveryTime = message.getMessage().getScheduledDeliveryTime(); in constructor to avoid block on rollback.

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Jan 4, 2019

Im very cautious of optimising for one case
Its very unusual for consumers to go away as it is good design in most mom a consumer is long lived. Like wise upgrade is a non normal event.

So without priority removed im -1 this. As i stated in an ideal world everything in paging would be off heap, we shouldnt just end up with every value in page ref, else we lose the reason for paging originals intent which is to remove the message from heap at penalty that it will be slower. If you want faster add more ram so more can be kept on heap.

@qihongxu

This comment has been minimized.

Copy link
Contributor Author

qihongxu commented Jan 4, 2019

Im very cautious of optimising for one case
Its very unusual for consumers to go away as it is good design in most mom a consumer is long lived. Like wise upgrade is a non normal event.

So without priority removed im -1 this. As i stated in an ideal world everything in paging would be off heap, we shouldnt just end up with every value in page ref, else we lose the reason for paging originals intent which is to remove the message from heap at penalty that it will be slower. If you want faster add more ram so more can be kept on heap.

Not adding priority is acceptable since it's rare. I will later remove the priority part.
And how about deliveryTime? It's already in PageRef and might need to be set in the constructor.

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Jan 4, 2019

Yes you could add this to set in the section within the constructor where message != null, seems sensible.

@qihongxu qihongxu force-pushed the qihongxu:modify_pagedReference branch from 7327ac8 to 50fc6c3 Jan 4, 2019

@qihongxu qihongxu changed the title ARTEMIS-2214 Cache durable&priority in PagedReference ARTEMIS-2214 Cache durable&deliveryTime in PagedReference Jan 4, 2019

//pre-cache the message size so we don't have to reload the message later if it is GC'd
getPersistentSize();
} else {
this.largeMessage = UNDEFINED_IS_LARGE_MESSAGE;
this.transactionID = -2;
this.messageID = -1;
this.messageSize = -1;
this.durable = UNDEFINED_IS_DURABLE;

This comment has been minimized.

Copy link
@michaelandrepearce

michaelandrepearce Jan 4, 2019

Contributor

for completeness (its a nit) set deliveryTime to its undefined value here.

@qihongxu qihongxu force-pushed the qihongxu:modify_pagedReference branch from 50fc6c3 to 5b2e366 Jan 4, 2019

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Jan 4, 2019

@qihongxu looks good to me now, thanks 👍

@michaelandrepearce

This comment has been minimized.

Copy link
Contributor

michaelandrepearce commented Jan 4, 2019

@franz1981 i was going to merge this, but im away in another country without my main computer with my apache git ssh cert key. Could you merge this for me.

@franz1981

This comment has been minimized.

Copy link
Contributor

franz1981 commented Jan 4, 2019

I will probably do it on the weekend or on Monday ;)

@franz1981

This comment has been minimized.

Copy link
Contributor

franz1981 commented Jan 10, 2019

Please due to a change can you rebase this one?

@qihongxu qihongxu force-pushed the qihongxu:modify_pagedReference branch from 5b2e366 to 310220a Jan 11, 2019

@qihongxu qihongxu force-pushed the qihongxu:modify_pagedReference branch from 310220a to 89e0084 Jan 11, 2019

@clebertsuconic

This comment has been minimized.

Copy link
Contributor

clebertsuconic commented Jan 16, 2019

...merging

@asfgit asfgit closed this in 9cbe451 Jan 16, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.