
Azure Service Bus QueueBinding & PubSub Couldn't renew active message lock - 1.10.0-rc.8 #2532

Closed
fabistb opened this issue Feb 14, 2023 · 5 comments

@fabistb
Contributor

fabistb commented Feb 14, 2023

Description

The Azure Service Bus queue binding and pub/sub components log an unusual amount of "Couldn't renew active message lock for message" warnings with Dapr 1.10.0-rc.8.

Expected Behavior

There are no, or only a small number of, lock renewal warnings in the logs.

Actual Behavior

While processing messages, the Azure Service Bus queue binding and pub/sub components log an unusual amount of "Couldn't renew active message lock for message" warnings.

Queue Binding:

{"app_id":"<appId>","component":"<bindingName> (bindings.azure.servicebusqueues/v1)","instance":"<instance>","level":"warning","msg":"Error renewing message locks for queue <queue name>: couldn't renew active message lock for message 6ccf1ebf01a54d21a8d1c5a6ce53ce20, (locklost): rpc: failed, status code 410 and description: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue. For more information please see https://aka.ms/ServiceBusExceptions . Reference:c54bd3a8-baf5-4ba4-a054-4b7a98ad8dd9, TrackingId:2f4eaf6a-cb68-4b87-8dbc-63dff74946cf_B23, SystemTracker:<cluster>:Queue:<queue name>, Timestamp:2023-02-14T09:09:30","scope":"dapr.contrib","time":"2023-02-14T09:09:30.299800956Z","type":"log","ver":"1.10.0-rc.8"}

Pub/Sub:
{"app_id":"<appId>","component":"<bindingName> (pubsub.azure.servicebus/v1)","instance":"<instance>","level":"warning","msg":"Error renewing message locks for topic <topic name>: couldn't renew active message lock for message fdf963d1a3da497eae76618354288215, (locklost): rpc: failed, status code 410 and description: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue. For more information please see https://aka.ms/ServiceBusExceptions . Reference:82f7e64a-addb-4a76-95c3-cf6537b8d9df, TrackingId:820a1d25-6d66-404b-8b68-eed94b17e59c_B23, SystemTracker:<clusterName>:Topic:<topicName>, Timestamp:2023-02-14T13:44:13","scope":"dapr.contrib","time":"2023-02-14T13:44:13.657630114Z","type":"log","ver":"1.10.0-rc.8"}

Steps to Reproduce the Problem

No specific way to reproduce has been found.
Run standard tests that produce a medium to high volume of messages on queues/topics.

@fabistb fabistb added the kind/bug Something isn't working label Feb 14, 2023
@ItalyPaleAle ItalyPaleAle self-assigned this Feb 14, 2023
@berndverst berndverst added this to the v1.10 milestone Feb 14, 2023
@ItalyPaleAle
Contributor

Hi @fabistb, I have looked into this and it seems that the components behave correctly.

A few questions:

  • How many messages are "medium to high amounts"? (I have tested with 1,000 messages delivered in 2 seconds)
  • How long does processing each message take? (I am simulating this with 11s delay)
  • What's the value for lockRenewalInSec?
  • Do you have any maxActiveMessages or maxConcurrentHandlers configured?

As you know, these warnings can appear if a message's lock is being renewed after the message has already been ACK'd (completed). This is not an issue; it just means that your app already completed the work, so the lock renewal is no longer necessary (you can see that in the error message: "Either the lock expired, or the message has already been removed from the queue").

This happens because when it's time to renew the locks, at the configured interval, we "snapshot" the active messages and then renew each one's lock in sequence. If you have a high value for maxActiveMessages, it takes time for the component to renew the lock for each one, and if the app has ACK'd a message in the meantime, the lock renewal for that message fails.
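To make the race more concrete, here is a minimal, self-contained Go sketch of that "snapshot, then renew sequentially" pattern. This is illustrative only and not the actual components-contrib code: the receiver type, renewLock, complete, and renewAllLocks are stand-ins for the real Service Bus SDK calls.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type receiver struct {
	mu     sync.Mutex
	active map[string]bool // message ID -> message is still locked / being processed
}

// complete simulates the app ACK'ing (completing) a message: the broker
// removes it from the queue and the lock becomes invalid.
func (r *receiver) complete(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.active, id)
}

// renewLock stands in for the Service Bus lock renewal call: it fails if the
// message is no longer active, which is the (locklost) / status code 410 case.
func (r *receiver) renewLock(id string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.active[id] {
		return fmt.Errorf("couldn't renew active message lock for message %s", id)
	}
	return nil
}

// renewAllLocks is the pattern described above: snapshot the active
// messages, then renew each lock in sequence.
func (r *receiver) renewAllLocks() {
	r.mu.Lock()
	snapshot := make([]string, 0, len(r.active))
	for id := range r.active {
		snapshot = append(snapshot, id)
	}
	r.mu.Unlock()

	for _, id := range snapshot {
		time.Sleep(10 * time.Millisecond) // each renewal is a network round trip
		if err := r.renewLock(id); err != nil {
			fmt.Println("WARN:", err) // same situation as the warning in the logs
		}
	}
}

func main() {
	r := &receiver{active: map[string]bool{"msg-1": true, "msg-2": true, "msg-3": true}}

	// The app completes msg-3 right after the snapshot is taken.
	go func() {
		time.Sleep(5 * time.Millisecond)
		r.complete("msg-3")
	}()

	r.renewAllLocks() // the renewal for msg-3 fails even though nothing is wrong
}
```

Running this prints a warning for msg-3 even though nothing went wrong: the message was simply completed after the snapshot was taken but before its turn in the renewal loop.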

For example, this is from my tests with debug-level logging (the logs are in sequence, but I've omitted all irrelevant lines). You can see that Dapr reports that the message was completed, and the lock renewal error follows:

DEBU[0100] Completing message 22de340330524da39a28dd86bdf36fd1 on topic topic2  app_id=checkout component="order-pub-sub (pubsub.azure.servicebus/v1)" instance=AleTestVM scope=dapr.contrib type=log ver=dev
DEBU[0100] Removing message 22de340330524da39a28dd86bdf36fd1 with sequence number 531 from active messages on topic topic2  app_id=checkout component="order-pub-sub (pubsub.azure.servicebus/v1)" instance=AleTestVM scope=dapr.contrib type=log ver=dev
WARN[0100] Error renewing message locks for topic topic2: couldn't renew active message lock for message 22de340330524da39a28dd86bdf36fd1, (locklost): rpc: failed, status code 410 and description: The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue. For more information please see https://aka.ms/ServiceBusExceptions . Reference:556bb224-c652-4ac2-aba8-1d8b65f649e6, TrackingId:70cab73a-cadf-4d11-b93f-cd4d2c6d2829_B25, SystemTracker:aledevdapr:Topic:topic2|checkout, Timestamp:2023-02-14T17:31:10  app_id=checkout component="order-pub-sub (pubsub.azure.servicebus/v1)" instance=AleTestVM scope=dapr.contrib type=log ver=dev
DEBU[0105] Renewed message locks for topic topic2        app_id=checkout component="order-pub-sub (pubsub.azure.servicebus/v1)" instance=AleTestVM scope=dapr.contrib type=log ver=dev

This is fine: it just means that the "snapshot" was taken before the message was ACK'd, but Dapr then spent some time renewing the locks of the messages ahead of this one, so by the time it got to this message's lock, the message had already been completed.

I am fairly confident that if you enable debug logs and look at the preceding lines, you'll also see a "Completing message [ID]" line for the messages whose locks could not be renewed.

That said…

Although this is "working fine", I agree that these logs are confusing. There are 3 things I'm going to do to improve this (see the sketch after this list):

  1. Renewing a lot of locks can take a lot of time, so after we take the "snapshot" but before we renew each individual lock, we should check again whether the message is still active, and skip the renewal if it is not.
  2. If there are lots of messages whose locks we need to renew, we should parallelize the operation a bit so it can be faster (e.g. I think we can do at least 3 in parallel).
  3. Lastly, I'll improve the error message to say that in many cases this warning can be safely ignored.
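As an illustration (not the actual change that was merged for this issue), a hypothetical Go sketch of points 1 and 2 could look like the following. Here stillActive and renew are placeholders for the real receiver calls, and maxParallel is an assumed concurrency limit.

```go
package main

import (
	"fmt"
	"sync"
)

const maxParallel = 3 // assumed limit: up to 3 renewals in flight at once

// renewAllLocks re-checks each snapshotted message before renewing (point 1)
// and renews up to maxParallel locks concurrently (point 2).
func renewAllLocks(snapshot []string, stillActive func(string) bool, renew func(string) error) {
	sem := make(chan struct{}, maxParallel) // semaphore bounding concurrency
	var wg sync.WaitGroup

	for _, id := range snapshot {
		// Point 1: skip messages completed (or abandoned) after the snapshot,
		// instead of attempting a renewal that is guaranteed to fail.
		if !stillActive(id) {
			continue
		}

		wg.Add(1)
		sem <- struct{}{}
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := renew(id); err != nil {
				// Point 3: note that the warning is often benign.
				fmt.Printf("WARN: couldn't renew lock for %s (often safe to ignore if the message was already completed): %v\n", id, err)
			}
		}(id)
	}
	wg.Wait()
}

func main() {
	snapshot := []string{"msg-1", "msg-2", "msg-3", "msg-4"}
	active := map[string]bool{"msg-1": true, "msg-2": true, "msg-4": true} // msg-3 completed after the snapshot

	renewAllLocks(snapshot,
		func(id string) bool { return active[id] },
		func(id string) error { fmt.Println("renewed lock for", id); return nil },
	)
}
```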

@fabistb
Contributor Author

fabistb commented Feb 14, 2023

Hi @ItalyPaleAle ,

First of all, I'm sorry if I created an issue where there isn't one.
I just checked the other systems, and this error message appears to be new.

Queue:
Regarding the number of messages: the binding is constantly picking messages from a backlog of > 100k messages.
However, looking at it in more detail, the mainly affected queue has a high dead-letter rate with MaxDeliveryCount = 10, in case that has an impact on the lock renewal.

processingTime: ~2 ms
lockRenewalInSec: default value (20)
maxActiveMessages: default value (1)
maxConcurrentHandlers: default value (1)

Topic:
A much lower volume of messages, a couple of hundred in 10 minutes.
processingTime: ~1 s
lockRenewalInSec: default value (20)
maxActiveMessages: default value (10000)
maxConcurrentHandlers: default value (10)

@ItalyPaleAle
Contributor

@fabistb no need to apologize. There were still a number of improvements I was able to make, since at the very least we should reduce the amount of "false positives" in the logs, and I found some other areas for improvement in the process (see #2533). Some changes introduced in 1.10 could have caused this warning to show up more frequently anyway.

Is the fact that messages are sent to DLQ expected given the nature of your app, or a bug in Dapr?

@fabistb
Contributor Author

fabistb commented Feb 14, 2023

@ItalyPaleAle, awesome, thanks!

The DLQ messages are caused by the nature of the app.

yaron2 pushed a commit that referenced this issue Feb 14, 2023
…#2532) (#2533)

* Improved lock renewal

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Better memory management

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Complete and abandon messages in parallel too

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* 10->20

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@yaron2
Member

yaron2 commented Feb 14, 2023

Closed via #2532.

@fabistb we'll reopen if needed.

@yaron2 yaron2 closed this as completed Feb 14, 2023