Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Client x-request-id Mangled by RequestIdExtension::setTraceStatus #11532

Closed
cbrisket opened this issue Jun 10, 2020 · 24 comments · Fixed by #15248
Closed

[bug] Client x-request-id Mangled by RequestIdExtension::setTraceStatus #11532

cbrisket opened this issue Jun 10, 2020 · 24 comments · Fixed by #15248
Assignees
Labels

Comments

@cbrisket
Copy link

Description

When providing an x-request-id header in the request (rather than have it injected by Envoy), a UUID has its 14th byte modified. This results in a different UUID than provided by the caller, breaking the x-request-id passed along a call chain.

At a cursory glance, I believe the code is here:
request_id_extension_uuid_impl.cc

void UUIDRequestIDExtension::setTraceStatus(RequestHeaderMap& request_headers, TraceStatus status) {
  if (request_headers.RequestId() == nullptr) {
    return;
  }
  absl::string_view uuid_view = request_headers.getRequestIdValue();
  if (uuid_view.length() != Runtime::RandomGeneratorImpl::UUID_LENGTH) {
    return;
  }
  std::string uuid(uuid_view);

  switch (status) {
  case TraceStatus::Forced:
    uuid[TRACE_BYTE_POSITION] = TRACE_FORCED;
    break;
  case TraceStatus::Client:
    uuid[TRACE_BYTE_POSITION] = TRACE_CLIENT;
    break;
  case TraceStatus::Sampled:
    uuid[TRACE_BYTE_POSITION] = TRACE_SAMPLED;
    break;
  case TraceStatus::NoTrace:
    uuid[TRACE_BYTE_POSITION] = NO_TRACE;
    break;
  }
  request_headers.setRequestId(uuid);
}

This code doesn't distinguish either:

  • UUID vs non-UUID string (of length 36)
  • UUID generated internally vs provided in original request

Illustrative flow:

  • Request created by client, with 'x-request-id' header containing value 'aaaaaaaa-bbbb-4ccc-dddd-eeeeeeeeeeee'
  • Request is received by Envoy, which sets Tracing Status on the 14th byte
  • Request now has 'x-request-id' header containing value 'aaaaaaaa-bbbb-9ccc-dddd-eeeeeeeeeeee' (or 'bccc', 'eccc')

Ask

Ideally, setTraceStatus() should only update trace status on valid UUIDs generated by Envoy itself, and should not update values provided/retained from caller.

Config

Seems to occur either naturally (when request is deemed to be x-request-internal) or when using the 'preserve_external_request_id' connection manager flag.

References

I believe it caused the behaviour seen ##6050

Examples/Reproduce

Used with lds.yaml:

          - name: envoy.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
              server_name: 'example'
              server_header_transformation: PASS_THROUGH
              generate_request_id: true
              preserve_external_request_id: true

Example - Request populated 'x-request-id', len != 36

curl -v -H "x-request-id: chris-request-id" https://echotest
> GET / HTTP/1.1
> Host: echotest
> User-Agent: curl/7.64.1
> Accept: */*
> x-request-id: chris-request-id
> 
< HTTP/1.1 200 OK
< server: openresty/1.11.2.5
< date: Wed, 10 Jun 2020 00:19:18 GMT
< content-type: text/plain
< x-envoy-upstream-service-time: 0
< transfer-encoding: chunked
< 
GET / HTTP/1.1
host: echotest
user-agent: curl/7.64.1
accept: */*
x-request-id: chris-request-id
x-forwarded-proto: https
x-envoy-expected-rq-timeout-ms: 30000
x-b3-traceid: 2157a3a019156213
x-b3-spanid: 89a206d07d499cda
x-b3-parentspanid: 2157a3a019156213
x-b3-sampled: 0
content-length: 0

Example - Request populated 'x-request-id, UUID with 14th byte == 4 (no trace)

curl -v -H "x-request-id: b6aa0744-49e1-4f5d-b5a8-750aaaf6c205" https://echotest
> GET / HTTP/1.1
> Host: echotest
> User-Agent: curl/7.64.1
> Accept: */*
> x-request-id: b6aa0744-49e1-4f5d-b5a8-750aaaf6c205
> 
< HTTP/1.1 200 OK
< server: openresty/1.11.2.5
< date: Wed, 10 Jun 2020 00:12:49 GMT
< content-type: text/plain
< x-envoy-upstream-service-time: 0
< transfer-encoding: chunked
< 
GET / HTTP/1.1
host: echotest
user-agent: curl/7.64.1
accept: */*
x-request-id: b6aa0744-49e1-9f5d-b5a8-750aaaf6c205
x-forwarded-proto: https
x-envoy-expected-rq-timeout-ms: 30000
x-b3-traceid: 9a837832ae4fac75
x-b3-spanid: f4468cd7b2b7db97
x-b3-parentspanid: 9a837832ae4fac75
x-b3-sampled: 1
content-length: 0

(extraneous -v fields removed)

@mattklein123
Copy link
Member

Yes agreed this seems like a bug. cc @euroelessar who has worked on this code recently and has more context.

@euroelessar
Copy link
Contributor

What would be an expected behavior?

Envoy uses 14th byte to store a tracing decision, so alternatives are:

  1. Never modify request id of inbound requests with pre-existing request-id header (as result - it will be impossible to force tracing for requests with request-id header/change decision at some later layer)
  2. Never modify non-UUID request ids only (technically same drawbacks, but also less surprising behavior for request ids generated by other applications)
  3. Provide alternative request-id propagation mechanism, which keeps "tracing decision" state in a separate header. Maybe change the default at some future point

Thoughts?

@cbrisket
Copy link
Author

Thanks for looking at this so quickly!

If the goal of x-request-id is correlation across two or more calls (for either unified logging or tracing) then I would argue stability of that ID is paramount. To that end, 1) would be my preference.

As a compromise, you could allow users to opt into that with a config-flag:

  • "Never modify request id of inbound requests with pre-existing request-id header (as result - it will be impossible to force tracing for requests with request-id header/change decision at some later layer)" if pack_tracing_status = false

This would at least keep default consistent with current and require a user to make an explicit decision on 'id stability' vs 'tracing status propagation'

If the underlying issue is packing the trace-status in an unrelated field, then the cleanest/most consistent solution would be 3) to pack tracing decision in a dedicated x-header.

Something like: (one of)

  1. Always store it in x-tracing-decision
  2. Store it in x-tracing decision if x-request-id is not generated locally (ie: was provided in original request), use x-request-id packing otherwise
  3. Control behaviour using a connection_manager flag (ie: request_id_tracing_field = (blank) use as is, request_id_tracing_field = x-tracing-decision packs in separate field), which allows packing in a separate field

Despite it being cleaner, I don't love the idea of adding a new x-tracing-decision header on all requests just to satisfy the 0.01% case. I imagine the getTracingStatus() logic during would end up being pretty complicated too. I don't understand the implications of non/propagation of that status and would value your recommendation on that.

@mattklein123
Copy link
Member

+1 to allow the behavior to be configurable, -1 to a new header. Context propagation is a disaster and I would rather not make it more complicated for the common case as @cbrisket mentions.

@sriyer
Copy link

sriyer commented Nov 6, 2020

@mattklein123 @euroelessar +1 on the opt out config option.

Would it be possible to include this on the up coming release for envoy ? We are kind of in a situation where we are moving away from our existing edge router to Envoy and apparently there are business tools that are breaking due to the value being altered.

preserve_external_request_id even when set true does not seem to prevent the modification of the header value. The 14th byte on the UUID changes no mater what ..

@mattklein123
Copy link
Member

I would be very happy to have a different built-in implementation which does something more basic, since this comes up quite often, but someone will have to implement it. @euroelessar if your internal impl is this basic thing perhaps you can just upstream it?

@euroelessar
Copy link
Contributor

Sounds good, let me have a look at it

@Brunomachadob
Copy link

I have created this issue a while ago pointing the same problem: #13774
I guess I can close that one if you're moving forward with this one here now.
It will be much better for me (and probably more people) than creating a C++ extension for this.

@karanvasnani
Copy link

Hey @euroelessar, wanted to know if you have been able to make progress on a fix for this?

@Anonymous-Coward
Copy link

I would also argue for not modifying x-b3-traceid, when it is provided by clients. When traces are stored in the same place across multiple systems, this breaks trace analysis tools - they group spans by x-b3-traceid, and that header's value changes for requests which should belong to the same trace.

@sriyer
Copy link

sriyer commented Feb 3, 2021

Hey @euroelessar @mattklein123 is this something that we can expect in the upcoming release ? we are trying to plan our course of action, would appreciate any update.

@mattklein123 mattklein123 self-assigned this Feb 3, 2021
@mattklein123
Copy link
Member

I don't think anyone is working on this currently. I will try to get this done since it's a frequent request.

@ikonst
Copy link
Contributor

ikonst commented Feb 14, 2021

Just wanted to note that at least some tracing implementations do their own sampling decision propagation, e.g. in Lightstep.

@mattklein123
Copy link
Member

For everyone watching this issue I'm going to fix this, but I'm trying to sort through some details. First, the request ID header mangling is behind this guard AFAICT:

if (!config.tracingConfig()) {
return;
}

So AFAICT no mangling will occur unless the user configures the tracing field in the HTTP connection manager configuration.

So I assume everyone hitting this issue wants to use tracing orthogonally to request ID generation. Is that right? Or am I misunderstanding how the mangling is occurring?

Assuming the real issue is people want to use tracing separate from request ID generation my suggestion is to do the following:

  1. Have an out of box request ID extension that generates UUIDs but never stores/retrieves any trace status in them. So basically the same as UUIDRequestIDExtension but getTraceStatus() and setTraceStatus() are NOP.
  2. If tracing is configured, store the trace decision outside of the request ID header so it can be passed to the external tracing providers that support them. As @ikonst says some providers manage their own sampling once the initial span is created, so sampling would still work in those cases. For those currently relying on sampling decisions being propagated inside the request ID header this would no longer work and only unstable sampling would be possible.

Let me know what folks think about this as I want to make sure we are solving the right problem here.

@Brunomachadob
Copy link

Brunomachadob commented Feb 26, 2021

Hey @mattklein123, thanks for you help on this. Congrats on the baby btw, hope you're all healthy and safe.

Yes, we are using tracing, I have this old issue about the same problem, but I closed in favour of this one, you can check a simplified version of our use case there: #13774

What we were considering was to have something along the lines of your option 2, we were just initially thinking if this option shouldn't be provided by Envoy (maybe shipping this Extension embedded and let us choose to use it).
Or we could rework this UUIDRequestIDExtension to work with any configured header, by default it would be x-request-id and we could override it via config.

If you decide not to ship this within Envoy, we would actually start looking into the possibility of creating our own Extension with the same idea stated above.

In a few cases the x-request-id is provided externally and we must return it as it came, untouched.

Today what we do is to revert the byte back from 9 to 4 using a envoy_on_response Lua filter, this solves most of the problems, but if anyone calls with other version than 4 (not so common), we would return the byte version as 4 😅 .

I hope this provides some more clarity

@sfudeus
Copy link

sfudeus commented Feb 26, 2021

Hi @mattklein123, thanks for working on this one.

What you describe would solve our problem with the current things.
We are using x-request-id completely unrelated to tracing and we rely on it being unchanged (since we are relaying it back in the response chain and expect no change to happen).
Orthogonally, we'd like to use tracing, but using b3-headers for that.

@mattklein123
Copy link
Member

Or we could rework this UUIDRequestIDExtension to work with any configured header, by default it would be x-request-id and we could override it via config.

Yes, this is my plan. I'm going to allow configuration to be passed to the built in UUID request ID extension. Initially I will just add configuration to disable trace status packing, but eventually we could also allow the header name itself to be configured. I think this will satisfy all of the problems that are being reported here.

mattklein123 added a commit that referenced this issue Feb 27, 2021
1) Promote the default UUID request_id implementation to
   an actual extension that is required in the build and wire
   up all documentation.
2) Add a configuration option to the extension that allows trace
   reason packing to be disabled (the default continues to be for it
   to be enabled to match existing behavior).
3) Update all documentation for the new behavior.
4) Substantial cleanup of these code paths for clarity and robustness.

Fixes #11532

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 added a commit that referenced this issue Feb 28, 2021
1) Promote the default UUID request_id implementation to
   an actual extension that is required in the build and wire
   up all documentation.
2) Add a configuration option to the extension that allows trace
   reason packing to be disabled (the default continues to be for it
   to be enabled to match existing behavior).
3) Update all documentation for the new behavior.
4) Substantial cleanup of these code paths for clarity and robustness.

Fixes #11532

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 added a commit that referenced this issue Mar 1, 2021
1) Promote the default UUID request_id implementation to
   an actual extension that is required in the build and wire
   up all documentation.
2) Add a configuration option to the extension that allows trace
   reason packing to be disabled (the default continues to be for it
   to be enabled to match existing behavior).
3) Update all documentation for the new behavior.
4) Substantial cleanup of these code paths for clarity and robustness.

Fixes #11532

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Member

I posted #15248 which should fix this if anyone wants to take a look while it winds its way through code review.

mattklein123 added a commit that referenced this issue Mar 5, 2021
1) Promote the default UUID request_id implementation to
   an actual extension that is required in the build and wire
   up all documentation.
2) Add a configuration option to the extension that allows trace
   reason packing to be disabled (the default continues to be for it
   to be enabled to match existing behavior).
3) Update all documentation for the new behavior.
4) Substantial cleanup of these code paths for clarity and robustness.

Fixes #11532

Signed-off-by: Matt Klein <mklein@lyft.com>
lizan pushed a commit to envoyproxy/data-plane-api that referenced this issue Mar 5, 2021
1) Promote the default UUID request_id implementation to
   an actual extension that is required in the build and wire
   up all documentation.
2) Add a configuration option to the extension that allows trace
   reason packing to be disabled (the default continues to be for it
   to be enabled to match existing behavior).
3) Update all documentation for the new behavior.
4) Substantial cleanup of these code paths for clarity and robustness.

Fixes envoyproxy/envoy#11532

Signed-off-by: Matt Klein <mklein@lyft.com>

Mirrored from https://github.com/envoyproxy/envoy @ 07c4c17be61c77d87d2c108b0775f2e606a7ae12
@Brunomachadob
Copy link

Brunomachadob commented Mar 5, 2021

Hi @mattklein123, just to clarify something, based on the newly created doc: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/request_id/uuid/v3/uuid.proto#envoy-v3-api-msg-extensions-request-id-uuid-v3-uuidrequestidconfig

So I would be able to just use:

typed_config:
  "@type": type.googleapis.com/envoy.extensions.request_id.uuid.v3.UuidRequestIdConfig
  pack_trace_reason: false

And I will still get 100% sampling (but without header mutation) because of the random_sampling property (which is 100% by default)?

@mattklein123
Copy link
Member

@Brunomachadob yes that's correct. Let me know if it doesn't work for you.

@Brunomachadob
Copy link

@Brunomachadob yes that's correct. Let me know if it doesn't work for you.

Yes, this is will work for me.
Thanks for the help!

@jhughesjha
Copy link

@mattklein123,

Do you have a full example EnvoyFilter that would set pack_trace_reason: false? We think this is set but still seeing the 14th nibble being changed.

Thanks!

rexengineering pushed a commit to rexengineering/istio-envoy that referenced this issue Oct 15, 2021
1) Promote the default UUID request_id implementation to
   an actual extension that is required in the build and wire
   up all documentation.
2) Add a configuration option to the extension that allows trace
   reason packing to be disabled (the default continues to be for it
   to be enabled to match existing behavior).
3) Update all documentation for the new behavior.
4) Substantial cleanup of these code paths for clarity and robustness.

Fixes envoyproxy/envoy#11532

Signed-off-by: Matt Klein <mklein@lyft.com>
@ssubramanian123
Copy link

ssubramanian123 commented Oct 26, 2021

@Brunomachadob @mattklein123 Do we have a sample EnvoyFilter? I tried with multiple options but not working.

  - applyTo: NETWORK_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.extensions.request_id.uuid
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.request_id.uuid.v3.UuidRequestIdConfig
          pack_trace_reason: false
          

@cobb-tx
Copy link

cobb-tx commented Aug 24, 2022

@ssubramanian123

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: golab-pack-trace-reason
  namespace: istio-system
spec:
  configPatches:
    - applyTo: NETWORK_FILTER
      match:
        context: ANY
        listener:
          filterChain:
            filter:
              name: "envoy.filters.network.http_connection_manager"
      patch:
        operation: MERGE
        value:
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
            request_id_extension:
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.request_id.uuid.v3.UuidRequestIdConfig
                pack_trace_reason: false

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.