
Disabling a service bus subscription (and potentially other errors?) will stop message processing in the Dapr sidecar, and a pod restart is needed to resume #1612

Closed
IvanILanekassen opened this issue Mar 24, 2022 · 7 comments · Fixed by #1783

Comments

IvanILanekassen commented Mar 24, 2022

In what area(s)?

/area runtime

What version of Dapr?

1.6.0

Expected Behavior

Intermittent service bus errors should not stop Dapr message processing

Actual Behavior

We discovered this issue when disabling the Service Bus subscription to stop message processing. We immediately get errors in the Dapr sidecar of the form:

time="2022-03-24T18:24:01.222738335Z" level=debug msg="Closing subscription to topic <topic>" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:01.223704688Z" level=error msg="azure service bus error: error receiving message on topic <topic>, link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G3:2832318:b7gVHOmwF2ISmLz21XWy-R0wQPd4AWj7JbcEyHzxEVVqJiw_nyg6HA' is force detached. Code: consumer(link2688). Details: InnerMessageReceiver was closed. TrackingId:c1d38a450005056a00000a80623c6ae1_G3_B2, SystemTracker:<service bus namespace>:Topic:<topic>|<subscription>, Timestamp:2022-03-24T18:24:01, Info: map[]}" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:01.223755891Z" level=debug msg="Lock renewal context for topic <topic> done" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:01.223667186Z" level=warning msg="azure service bus error: closing subscription entity for topic <topic>: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G3:2832318:b7gVHOmwF2ISmLz21XWy-R0wQPd4AWj7JbcEyHzxEVVqJiw_nyg6HA' is force detached. Code: consumer(link2688). Details: InnerMessageReceiver was closed. TrackingId:c1d38a450005056a00000a80623c6ae1_G3_B2, SystemTracker:<service bus namespace>:Topic:<topic>|<subscription>, Timestamp:2022-03-24T18:24:01, Info: map[]}" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:01.22373459Z" level=warning msg="Subscription to topic <topic> lost connection, attempting to reconnect... [0/30]" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:03.224120298Z" level=debug msg="Subscription to topic <topic> is limited to 4 message handler(s)" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:03.224174001Z" level=debug msg="Waiting to receive message on topic <topic>" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:03.984272457Z" level=error msg="azure service bus error: error receiving message on topic <topic>, *Error{Condition: com.microsoft:entity-disabled, Description: Messaging entity '<service bus namespace>:Topic:<topic>|<subscription>' is currently disabled. For more information please see https://aka.ms/ServiceBusExceptions . TrackingId:f3befc0e-acbf-42f3-8e7d-73316da0f357_B2, SystemTracker:<service bus namespace>:Topic:<topic>|<subscription>, Timestamp:2022-03-24T18:24:03 TrackingId:9d8eea867acd4c45b77edf5128205c1e_G4, SystemTracker:gateway5, Timestamp:2022-03-24T18:24:03, Info: map[]}" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0

And after only 1 minute and 20 seconds we get this error:

time="2022-03-24T18:25:20.086919138Z" level=debug msg="Waiting to receive message on topic <topic>" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.086849334Z" level=debug msg="Subscription to topic <topic> is limited to 4 message handler(s)" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.676660156Z" level=debug msg="Closing subscription to topic <topic>" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.676746161Z" level=debug msg="Reconnect context for topic <topic> is done" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.676706759Z" level=error msg="azure service bus error: error receiving message on topic <topic>, *Error{Condition: com.microsoft:entity-disabled, Description: Messaging entity '<service bus namespace>:Topic:<topic>|<subscription>' is currently disabled. For more information please see https://aka.ms/ServiceBusExceptions . TrackingId:f3befc0e-acbf-42f3-8e7d-73316da0f357_B2, SystemTracker:<service bus namespace>:Topic:<topic>|<subscription>, Timestamp:2022-03-24T18:25:20 TrackingId:f597010729174588bb5d37083523fffa_G6, SystemTracker:gateway5, Timestamp:2022-03-24T18:25:20, Info: map[]}" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.676824965Z" level=debug msg="Lock renewal context for topic <topic> done" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.67672946Z" level=error msg="Subscription to topic <topic> lost connection, unable to recover after 30 attempts" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0

After this point, no more messages are processed by the pod, and we need to restart the pod to get processing going again.

Steps to Reproduce the Problem

  • Create a service that subscribes to a Service Bus subscription
  • Deploy the service in Kubernetes
  • In the Azure portal, disable the subscription
  • Observe similar logs in the Dapr sidecar.
  • It is unclear what other Service Bus errors will cause this. In any case, 30 retries in 1 minute and 20 seconds seems to be giving up too fast, without the possibility to "self-heal".

Release Note

RELEASE NOTE: FIX Self healing from intermittent issues in Azure Service Bus

@IvanILanekassen IvanILanekassen added the kind/bug Something isn't working label Mar 24, 2022
@yaron2 yaron2 transferred this issue from dapr/dapr Mar 24, 2022

yaron2 commented Mar 24, 2022

Transferring to correct repository.

cc @berndverst @halspang.

IvanILanekassen (Author) commented:

Hi again,

I have recently been running the nightly-2022-05-24 version of Dapr in a test environment, and the behaviour is now different. I get some error messages about the disabled subscription, but not as many.

However, the net effect is the same: re-enabling the subscription does not make the sidecar start processing messages again, and I still need to restart the deployment to get the sidecar processing messages.


yaron2 commented Jun 10, 2022

> Hi again,
>
> I have recently been running the nightly-2022-05-24 version of Dapr in a test environment, and the behaviour is now different. I get some error messages about the disabled subscription, but not as many.
>
> However, the net effect is the same: re-enabling the subscription does not make the sidecar start processing messages again, and I still need to restart the deployment to get the sidecar processing messages.

cc @ItalyPaleAle

ItalyPaleAle (Contributor) commented:

As it is implemented today, this is the "correct" behavior and not a "bug" (as in, that's what the person who programmed this originally wanted to achieve - and I haven't modified that behavior). The reconnection that you see happening there is primarily meant to recover from things such as transient network failures.

However, you're right that if the sidecar can't reconnect after 30 attempts, it currently just "gives up". Perhaps we should remove the maximum limit and just use an exponential back-off that retries forever? At that point, a retry could happen even after 10 minutes, but at least it won't leave daprd in a weird state where nothing is happening.

What do you think @yaron2 ?


yaron2 commented Jun 10, 2022

> As it is implemented today, this is the "correct" behavior and not a "bug" (as in, that's what the person who programmed this originally wanted to achieve - and I haven't modified that behavior). The reconnection that you see happening there is primarily meant to recover from things such as transient network failures.
>
> However, you're right that if the sidecar can't reconnect after 30 attempts, it currently just "gives up". Perhaps we should remove the maximum limit and just use an exponential back-off that retries forever? At that point, a retry could happen even after 10 minutes, but at least it won't leave daprd in a weird state where nothing is happening.
>
> What do you think @yaron2 ?

I think that's a good idea. It'd create a connectivity reconciliation loop between desired (connected) and current state.

ItalyPaleAle added a commit to ItalyPaleAle/dapr-components-contrib that referenced this issue Jun 10, 2022
With exponential back-off configurable between min and max time. Fixes dapr#1612

Also includes fixes:

- Binding: make sure it actually retries to connect forever
- Binding: add delay (exponential backoff) before reconnecting
- PubSub: better handling of failures such as topics disabled or other non-connection issues

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
ItalyPaleAle (Contributor) commented:

I've opened #1783 which should fix this issue.

berndverst pushed a commit that referenced this issue Jun 10, 2022
* Make Service Bus attempt to reconnect forever in case of issues

With exponential back-off configurable between min and max time. Fixes #1612

Also includes fixes:

- Binding: make sure it actually retries to connect forever
- Binding: add delay (exponential backoff) before reconnecting
- PubSub: better handling of failures such as topics disabled or other non-connection issues

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* 💄

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Added warning for deprecated metadata options

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* 💄

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
ItalyPaleAle (Contributor) commented:

@IvanILanekassen the fix should now be available in the nightly builds, should you want to try them out! Docker images with tag nightly-2022-06-13 or newer should have this fix.

(We are also planning to have a release candidate hopefully early next week)
