Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry: Context lost on consumer retry and Baggage Propagation #1452

Closed
bschwehn opened this issue Dec 7, 2023 · 5 comments
Closed

Comments

@bschwehn
Copy link
Contributor

bschwehn commented Dec 7, 2023

Hi,

We had an issue that OpenTelemetry context propagation to the consumer is not working correctly on retry.

The issue seems to be, that the context is extracted in BeforeConsume:

var parentContext = Propagator.Extract(default, eventData.TransportMessage, (msg, key) =>

and then saved:

_contexts[eventData.TransportMessage.GetId() + eventData.TransportMessage.GetGroup()] =

Then the first BeforeSubscribeInvoke removes the saved context:

_contexts.TryRemove(eventData.Message.GetId() + eventData.Message.GetGroup(), out var context);

On retries, BeforeConsume is not called again and the saved context is no longer saved, thus causing the issue.

I have tried changing the code to not save the context and instead just extract it every time in BeforeSubscribeInvoke, and that solves our issue.
But I don't know why the context is saved in the first place -- is it just a performance improvement?

I can create a pull request if you think this way of fixing the issue is correct.

I have also changed the Propagator from TraceContextPropagator

private static readonly TextMapPropagator Propagator = new TraceContextPropagator();

to Propagators.DefaultTextMapPropagator and added a

Baggage.Current = propagatedContext.Baggage;

which then automatically propagates Baggage.Current, which may also be useful for other people?

Instead or Propagators.DefaultTextMapPropagator, it can explicitly be set to

private static readonly TextMapPropagator Propagator = new CompositeTextMapPropagator(
    new TextMapPropagator[]
    {
        new TraceContextPropagator(),
        new BaggagePropagator()
    });

for BaggagePropagation, but as I understand, Propagators.DefaultTextMapPropagator will allow users of CAP to enable or disable baggage propagation by calling Sdk.SetDefaultTextMapPropagator (https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Api/README.md?plain=1#L463)

@yang-xiaodong
Copy link
Member

Hi @bschwehn ,

We did not consider the retry use case in our implementation of OpenTelemetry. I am unsure if adding a retry use case would make the entire tracing timeline look reasonable, especially since retries have to account for multiple attempts or eventual failure.

The reason for using _contexts is that we chose not to inject the latest context into the Header in BeforeConsume, which is not only a performance improvement but also reduces complexity. On the other hand, we overlooked the contextual continuity between sending messages again in the consumer during our implementation. If you are interested, please refer to #1100.

We would be very happy to accept a PR, and it would be great if you could help resolve #1100.

Thanks very much.

@bschwehn
Copy link
Contributor Author

bschwehn commented Dec 8, 2023

Thanks for the quick reply.
I think #1100 is already fixed by #1407

I have tested the scenario:

  • aspnetcore API call publishes message
  • aspnetcore consumer listens to message and then publishes

The Baggage and traceid is forwarded for all the calls.
#1100 mentioned this only working when aspnetcore instrumentation is enabled (so not for non aspnetcore worker). While I have not tested this exact scenario, the check for aspnetcore was removed in #1407, so it should not make a difference.

But I will try to set up this scenario today to make sure.

I am unsure if adding a retry use case would make the entire tracing timeline look reasonable, especially since retries have to account for multiple attempts or eventual failure.

What I am trying to accomplish is this:

  • we have logs that include the TraceId
  • when messages are sent and received, we want the TraceId to remain constant across sender and consumer. It does not matter to us if a message was handled with or without a retry, it is still the same data flow. The fact that it was retried or not does not change that and when looking at the logs, people want to be able to just search for one traceid and see the logs related to the dataflow across services. Changing the context / TraceId on retry breaks this.
  • senders should be able to propagate Baggage values.

I'll try to check the behaviour with a non aspnet consumer and will create a PR then.

@bschwehn
Copy link
Contributor Author

bschwehn commented Dec 8, 2023

I believe we can then entirely remove the _context Dictionary. It is already not used during publish:

It is set here:

_contexts.TryAdd(eventData.Message.GetId(), parentContext = Activity.Current.Context);

And removed here:
_contexts.TryRemove(eventData.TransportMessage.GetId(), out var context);

but the value is never used anymore.
Instead the Context is propagated by Propagator.Inject in BeforePublishMessageStore and Propagator.Extract in BeforePublish.

I will put this change into the PR also and it will then be easier to talk about the change. But if you already see something wrong with this reasoning above, let me know.

@yang-xiaodong
Copy link
Member

Hi @bschwehn

I think you have a deeper understanding here than I do, just do it and look forward to your PR, thank you very much.

bschwehn pushed a commit to bschwehn/CAP that referenced this issue Dec 8, 2023
1. Fixes an issue that caused context propagation to fail on message
retry (dotnetcore#1452)
Before, BeforeSubscribeInvoke removed the context so it was no longer
available on retry.

It also removes the _contexts dictionary, as during Subscribe this
handling didn't work during retry and during Publish, the dictionary
value was no longer used.

I have tested this with both aspnetcore services and console subscriber
that publishes during the subscribe and context propagation was working
fine. Thus dotnetcore#1100 is also closed, probably already in dotnetcore#1407.
bschwehn pushed a commit to bschwehn/CAP that referenced this issue Dec 8, 2023
1. Fixes an issue that caused context propagation to fail on message
retry (dotnetcore#1452)
Before, BeforeSubscribeInvoke removed the context so it was no longer
available on retry.

It also removes the _contexts dictionary, as during Subscribe this
handling didn't work during retry and during Publish, the dictionary
value was no longer used.

I have tested this with both aspnetcore services and console subscriber
that publishes during the subscribe and context propagation was working
fine. Thus dotnetcore#1100 is also closed, probably already in dotnetcore#1407.
bschwehn pushed a commit to bschwehn/CAP that referenced this issue Dec 8, 2023
1. Fixes an issue that caused context propagation to fail on message
retry (dotnetcore#1452)
Before, BeforeSubscribeInvoke removed the context so it was no longer
available on retry.

It also removes the _contexts dictionary, as during Subscribe this
handling didn't work during retry and during Publish, the dictionary
value was no longer used.

I have tested this with both aspnetcore services and console subscriber
that publishes during the subscribe and context propagation was working
fine. Thus dotnetcore#1100 is also closed, probably already in dotnetcore#1407.
yang-xiaodong pushed a commit that referenced this issue Dec 9, 2023
1. Fixes an issue that caused context propagation to fail on message
retry (#1452)
Before, BeforeSubscribeInvoke removed the context so it was no longer
available on retry.

It also removes the _contexts dictionary, as during Subscribe this
handling didn't work during retry and during Publish, the dictionary
value was no longer used.

I have tested this with both aspnetcore services and console subscriber
that publishes during the subscribe and context propagation was working
fine. Thus #1100 is also closed, probably already in #1407.

Co-authored-by: Benjamin Schwehn <benjamin.schwehn.contractor@ert.com>
@bschwehn
Copy link
Contributor Author

Thanks! Feel free to @ me if this changes seems to cause issues down the road!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants