
Disabling a service bus subscription (and potentially other errors?) will stop message processing in the Dapr sidecar, and a pod restart is needed to resume #1612

Closed
IvanILanekassen opened this issue Mar 24, 2022 · 7 comments · Fixed by #1783

Comments

IvanILanekassen commented Mar 24, 2022

In what area(s)?

/area runtime

What version of Dapr?

1.6.0

Expected Behavior

Intermittent service bus errors should not stop Dapr message processing

Actual Behavior

We discovered this issue when disabling the Service Bus subscription to stop message processing. We immediately get errors in the Dapr sidecar of the form:

time="2022-03-24T18:24:01.222738335Z" level=debug msg="Closing subscription to topic <topic>" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:01.223704688Z" level=error msg="azure service bus error: error receiving message on topic <topic>, link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G3:2832318:b7gVHOmwF2ISmLz21XWy-R0wQPd4AWj7JbcEyHzxEVVqJiw_nyg6HA' is force detached. Code: consumer(link2688). Details: InnerMessageReceiver was closed. TrackingId:c1d38a450005056a00000a80623c6ae1_G3_B2, SystemTracker:<service bus namespace>:Topic:<topic>|<subscription>, Timestamp:2022-03-24T18:24:01, Info: map[]}" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:01.223755891Z" level=debug msg="Lock renewal context for topic <topic> done" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:01.223667186Z" level=warning msg="azure service bus error: closing subscription entity for topic <topic>: link detached, reason: *Error{Condition: amqp:link:detach-forced, Description: The link 'G3:2832318:b7gVHOmwF2ISmLz21XWy-R0wQPd4AWj7JbcEyHzxEVVqJiw_nyg6HA' is force detached. Code: consumer(link2688). Details: InnerMessageReceiver was closed. TrackingId:c1d38a450005056a00000a80623c6ae1_G3_B2, SystemTracker:<service bus namespace>:Topic:<topic>|<subscription>, Timestamp:2022-03-24T18:24:01, Info: map[]}" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:01.22373459Z" level=warning msg="Subscription to topic <topic> lost connection, attempting to reconnect... [0/30]" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:03.224120298Z" level=debug msg="Subscription to topic <topic> is limited to 4 message handler(s)" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:03.224174001Z" level=debug msg="Waiting to receive message on topic <topic>" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:24:03.984272457Z" level=error msg="azure service bus error: error receiving message on topic <topic>, *Error{Condition: com.microsoft:entity-disabled, Description: Messaging entity '<service bus namespace>:Topic:<topic>|<subscription>' is currently disabled. For more information please see https://aka.ms/ServiceBusExceptions . TrackingId:f3befc0e-acbf-42f3-8e7d-73316da0f357_B2, SystemTracker:<service bus namespace>:Topic:<topic>|<subscription>, Timestamp:2022-03-24T18:24:03 TrackingId:9d8eea867acd4c45b77edf5128205c1e_G4, SystemTracker:gateway5, Timestamp:2022-03-24T18:24:03, Info: map[]}" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0

And after only 1 minute and 20 seconds we get this error:

time="2022-03-24T18:25:20.086919138Z" level=debug msg="Waiting to receive message on topic <topic>" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.086849334Z" level=debug msg="Subscription to topic <topic> is limited to 4 message handler(s)" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.676660156Z" level=debug msg="Closing subscription to topic <topic>" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.676746161Z" level=debug msg="Reconnect context for topic <topic> is done" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.676706759Z" level=error msg="azure service bus error: error receiving message on topic <topic>, *Error{Condition: com.microsoft:entity-disabled, Description: Messaging entity '<service bus namespace>:Topic:<topic>|<subscription>' is currently disabled. For more information please see https://aka.ms/ServiceBusExceptions . TrackingId:f3befc0e-acbf-42f3-8e7d-73316da0f357_B2, SystemTracker:<service bus namespace>:Topic:<topic>|<subscription>, Timestamp:2022-03-24T18:25:20 TrackingId:f597010729174588bb5d37083523fffa_G6, SystemTracker:gateway5, Timestamp:2022-03-24T18:25:20, Info: map[]}" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.676824965Z" level=debug msg="Lock renewal context for topic <topic> done" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0
time="2022-03-24T18:25:20.67672946Z" level=error msg="Subscription to topic <topic> lost connection, unable to recover after 30 attempts" app_id=<deployment> instance=<pod name> scope=dapr.contrib type=log ver=1.6.0

After this point, no more messages are processed by the pod, and we need to restart the pod to get processing going again.

Steps to Reproduce the Problem

  • Create a service that subscribes to a Service Bus subscription
  • Deploy the service in Kubernetes
  • In the Azure portal, disable the subscription
  • Observe similar logs in the Dapr sidecar.
  • It is unclear what other Service Bus errors will cause this. In any case, 30 retries in 1 minute and 20 seconds seems to be giving up too fast, without the possibility to "self-heal".

Release Note

RELEASE NOTE: FIX Self healing from intermittent issues in Azure Service Bus

@IvanILanekassen IvanILanekassen added the kind/bug Something isn't working label Mar 24, 2022
@yaron2 yaron2 transferred this issue from dapr/dapr Mar 24, 2022

yaron2 commented Mar 24, 2022

Transferring to correct repository.

cc @berndverst @halspang.

IvanILanekassen (Author) commented:

Hi again,

I have recently been running the nightly-2022-05-24 version of Dapr in a test environment, and the behaviour is now different. I get some error messages about the disabled subscription, but not as many.

However, the net effect is the same: re-enabling the subscription does not make the sidecar start processing messages again, and I still need to restart the deployment to get the sidecar processing messages.


yaron2 commented Jun 10, 2022

> Hi again,
>
> I have recently been running the nightly-2022-05-24 version of Dapr in a test environment, and the behaviour is now different. I get some error messages about the disabled subscription, but not as many.
>
> However, the net effect is the same: re-enabling the subscription does not make the sidecar start processing messages again, and I still need to restart the deployment to get the sidecar processing messages.

cc @ItalyPaleAle

ItalyPaleAle (Contributor) commented:

As it is implemented today, this is the "correct" behavior and not a "bug" (as in, that's what the person who programmed this originally wanted to achieve - and I haven't modified that behavior). The reconnection that you see happening there is primarily meant to recover from things such as transient network failures.

However, you're right that if the sidecar can't reconnect after 30 attempts, it currently just "gives up". Perhaps we should remove the maximum limit and just use an exponential back-off that retries forever? At that point, a retry could happen even after 10 minutes, but at least it won't leave daprd in a weird state where nothing is happening.

What do you think @yaron2 ?


yaron2 commented Jun 10, 2022

> As it is implemented today, this is the "correct" behavior and not a "bug" (as in, that's what the person who programmed this originally wanted to achieve - and I haven't modified that behavior). The reconnection that you see happening there is primarily meant to recover from things such as transient network failures.
>
> However, you're right that if the sidecar can't reconnect after 30 attempts, it currently just "gives up". Perhaps we should remove the maximum limit and just use an exponential back-off that retries forever? At that point, a retry could happen even after 10 minutes, but at least it won't leave daprd in a weird state where nothing is happening.
>
> What do you think @yaron2 ?

I think that's a good idea. It'd create a connectivity reconciliation loop between desired (connected) and current state.

ItalyPaleAle added a commit to ItalyPaleAle/dapr-components-contrib that referenced this issue Jun 10, 2022
With exponential back-off configurable between min and max time. Fixes dapr#1612

Also includes fixes:

- Binding: make sure it actually retries to connect forever
- Binding: add delay (exponential backoff) before reconnecting
- PubSub: better handling of failures such as topics disabled or other non-connection issues

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
ItalyPaleAle (Contributor) commented:

I've opened #1783 which should fix this issue.

berndverst pushed a commit that referenced this issue Jun 10, 2022
* Make Service Bus attempt to reconnect forever in case of issues

With exponential back-off configurable between min and max time. Fixes #1612

Also includes fixes:

- Binding: make sure it actually retries to connect forever
- Binding: add delay (exponential backoff) before reconnecting
- PubSub: better handling of failures such as topics disabled or other non-connection issues

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* 💄

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Added warning for deprecated metadata options

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* 💄

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
ItalyPaleAle (Contributor) commented:

@IvanILanekassen the fix should now be available in the nightly builds, should you want to try them out! Docker images with tag nightly-2022-06-13 or newer should have this fix.

(We are also planning to have a release candidate hopefully early next week)
