filter events if the trace is not sampled #2999

Merged
merged 25 commits into from
Sep 5, 2023

Geal
Contributor

@Geal Geal commented Apr 26, 2023

there's no need to record the events in opentelemetry if the trace will not be registered

Checklist

Complete the checklist (and note appropriate exceptions) before a final PR is raised.

  • Changes are compatible[^1]
  • Documentation[^2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing[^3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

[^1]. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or ask for it to be labeled) as manual test

there's no need to record the events in opentelemetry if the trace will
not be registered
@Geal Geal changed the base branch from staging-perf3 to dev May 15, 2023 12:01
@BrynCooke
Contributor

Let's make sure we have tests which use a 0.5 sample rate to ensure that we never get half traces.

if we reach the point where we need to make a sampling decision on a
child span, that means the parent span was not enabled, so we should not
enable that span either, otherwise we would get incomplete traces

the test needs to run multiple queries to make sure one will be sampled.
The limit is set to 100; in most tests a trace will be sampled in less
than 10 requests. If this test fails, either the feature is broken, or
we unleashed some statistical monster somewhere

the new() function is where the telemetry plugin is created, but the new
config should only start to affect traffic once activate() is called
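
To put a number on the "statistical monster" in the test commit above, here is a minimal sketch of the chance that a run of requests produces no sampled trace at all. It assumes each request is sampled independently and uses the 0.5 ratio suggested in the review; the function name is hypothetical and is not taken from the router code.

```rust
/// Probability that none of `n` independent requests is sampled
/// when each one is sampled with probability `ratio`.
/// (Independence between requests is an assumption of this sketch.)
pub fn prob_no_trace_sampled(ratio: f64, n: i32) -> f64 {
    (1.0 - ratio).powi(n)
}
```

With a ratio of 0.5, ten requests already leave less than a 0.1% chance of seeing no sampled trace, and the limit of 100 requests brings that chance below 1e-30, so a failure almost certainly points at the feature rather than at bad luck.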
@Geal
Contributor Author

Geal commented Sep 4, 2023

Let's make sure we have tests which use a 0.5 sample rate to ensure that we never get half traces.

84c6558

@Geal Geal enabled auto-merge (squash) September 5, 2023 12:14
@Geal Geal merged commit b58a67c into dev Sep 5, 2023
10 of 12 checks passed
@Geal Geal deleted the geal/filter-events-too branch September 5, 2023 12:30
garypen pushed a commit that referenced this pull request Sep 12, 2023
This introduces a `SamplingFilter` that wraps `OpenTelemetryLayer`. The layer has an overhead on every request, because it records data for each span, even if no exporters are set up. The filter handles sampling ahead of the layer, only sending a trace to the layer when it is actually needed, i.e. when it is sampled and an exporter is configured.
This also reduces the overhead of sampling, by managing it outside of the `OpenTelemetryLayer`.

It is configured through a sampling ratio stored in an atomic u64 that is modified when the telemetry configuration is activated.
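
As a rough illustration of the idea (not the actual `SamplingFilter` code), the sketch below keeps a ratio in an `AtomicU64` and makes the decision once per trace. Storing the `f64` bit pattern in the atomic is an assumption, as are the `SamplingRatio` and `is_sampled` names and the `draw` parameter, which stands for a random number in [0, 1) supplied by the caller.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical holder for the sampling ratio. std has no atomic f64,
/// so this sketch stores the f64 bit pattern in an AtomicU64 (assumption).
pub struct SamplingRatio(AtomicU64);

impl SamplingRatio {
    pub fn new(ratio: f64) -> Self {
        SamplingRatio(AtomicU64::new(ratio.to_bits()))
    }

    /// Would be called when a new telemetry configuration is activated.
    pub fn set(&self, ratio: f64) {
        self.0.store(ratio.to_bits(), Ordering::Relaxed);
    }

    pub fn get(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Relaxed))
    }
}

/// Decide whether a span should be forwarded to the tracing layer.
/// `parent_sampled` is `None` for a root span. A child span always follows
/// its parent, so a trace is either recorded completely or not at all.
pub fn is_sampled(
    ratio: &SamplingRatio,
    parent_sampled: Option<bool>,
    exporter_configured: bool,
    draw: f64,
) -> bool {
    if !exporter_configured {
        // Nothing would consume the data, so skip recording entirely.
        return false;
    }
    match parent_sampled {
        Some(decision) => decision,
        None => draw < ratio.get(),
    }
}
```

Making the decision only at the root span is what prevents half traces: a child never rolls its own dice.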
@abernix abernix mentioned this pull request Sep 14, 2023
@Geal Geal mentioned this pull request Jan 16, 2024
6 tasks
Geal added a commit that referenced this pull request Jan 18, 2024
Fix #4321 
Fix #3872 (the regression appears in 1.30, when #2999 was merged)

related: #2402

This fixes a regression introduced in #2999, where events were no longer
sent with traces
This was referenced Feb 1, 2024
Geal added a commit that referenced this pull request Feb 5, 2024
## 🚀 Features

### Specify Trace ID Formatting ([PR
#4530](#4530))

This adds the ability to specify the format of the trace ID in the
response headers of the supergraph service.

An example configuration making use of this feature is shown below:
```yaml
telemetry:
  apollo:
    client_name_header: name_header
    client_version_header: version_header
  exporters:
    tracing:
      experimental_response_trace_id:
        enabled: true
        header_name: trace_id
        format: decimal # Optional, defaults to hexadecimal
```

If the format is not specified, then the trace ID will continue to be in
hexadecimal format.
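
To make the difference between the two formats concrete, here is a small sketch that renders a 128-bit trace ID both ways. The `TraceIdFormat` enum and `format_trace_id` function are hypothetical names for illustration, and the zero-padding of the hexadecimal form to 32 characters is an assumption based on the usual OpenTelemetry representation.

```rust
/// Hypothetical mirror of the `format` configuration option.
pub enum TraceIdFormat {
    Hexadecimal,
    Decimal,
}

/// Render a 128-bit trace ID as a header value.
pub fn format_trace_id(id: u128, format: TraceIdFormat) -> String {
    match format {
        // 32 lowercase hex digits, zero-padded (assumed padding).
        TraceIdFormat::Hexadecimal => format!("{:032x}", id),
        // Plain base-10 rendering of the same number.
        TraceIdFormat::Decimal => format!("{}", id),
    }
}
```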

By [@nicholascioli](https://github.com/nicholascioli) in
#4530

### Introduce support for progressive `@override` ([PR
#4521](#4521))

The change brings support for progressive `@override`, which allows
dynamically overriding root fields and entity fields in the schema. This
feature is enterprise only and requires a license key to be used.

A new `label` argument is added to the `@override` directive in order to
indicate the field is dynamically overridden. Labels can come in two
forms:
1) String matching the form `percent(x)`: The router resolves these
labels based on the `x` value. For example, `percent(50)` will route 50%
of requests to the overridden field and 50% of requests to the original
field.
2) Arbitrary string matching the regex `^[a-zA-Z][a-zA-Z0-9_-:./]*$`:
These labels are expected to be resolved externally via coprocessor. A
coprocessor with a supergraph request hook can inspect and modify the
context of a request in order to inform the router which labels to use
during query planning.
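
The two label forms can be told apart with a few lines of parsing. The sketch below is only an illustration: `OverrideLabel` and `parse_label` are hypothetical names, accepting any number between 0 and 100 for `x` is an assumption, and the character class from the regex above is read as the literal set `_ - : . /`.

```rust
/// The two label forms described above (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum OverrideLabel {
    /// `percent(x)`, resolved by the router itself.
    Percent(f64),
    /// Any other valid label, resolved externally (e.g. by a coprocessor).
    Custom(String),
}

/// Classify a label, returning `None` when it matches neither form.
pub fn parse_label(label: &str) -> Option<OverrideLabel> {
    let inner = label
        .strip_prefix("percent(")
        .and_then(|rest| rest.strip_suffix(')'));
    if let Some(inner) = inner {
        let value: f64 = inner.trim().parse().ok()?;
        // Assumption: only values in 0..=100 are meaningful percentages.
        return if (0.0..=100.0).contains(&value) {
            Some(OverrideLabel::Percent(value))
        } else {
            None
        };
    }
    // Arbitrary form: a letter followed by letters, digits or `_-:./`.
    let mut chars = label.chars();
    let first = chars.next()?;
    if !first.is_ascii_alphabetic() {
        return None;
    }
    if chars.all(|c| c.is_ascii_alphanumeric() || "_-:./".contains(c)) {
        Some(OverrideLabel::Custom(label.to_string()))
    } else {
        None
    }
}
```

Because parentheses are not part of the arbitrary-label character set, a `percent(x)` label can never be mistaken for a custom one.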

Please consult the docs for more information on how to use this feature
and how to implement a coprocessor for label resolution.

By [@trevorscheer](https://github.com/TrevorScheer) in
#4521

### Add a new selector to get all baggage key values in span attributes
([Issue #4425](#4425))

If you have several baggage items and would like to add all of them
directly as span attributes, you can now do so with a single selector.
For example, `baggage: my_item=test, my_second_item=bar` will add 2 span
attributes, `my_item=test` and `my_second_item=bar`.

Example of configuration:

```yaml
telemetry:
  instrumentation:
    spans:
      router:
        attributes:
          baggage: true
```
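
For readers unfamiliar with baggage, the following sketch shows how a `baggage` header value maps to key/value pairs that could then become span attributes. It is a simplified illustration rather than the router's implementation: `baggage_to_attributes` is a hypothetical name, and percent-decoding of values is deliberately left out.

```rust
/// Split a `baggage` header value into (key, value) pairs.
/// Malformed members (no `=` or an empty key) are skipped.
pub fn baggage_to_attributes(header: &str) -> Vec<(String, String)> {
    header
        .split(',')
        .filter_map(|member| {
            // Drop optional properties after ';' (e.g. `key=value;prop=1`).
            let entry = member.split(';').next()?;
            let (key, value) = entry.split_once('=')?;
            let key = key.trim();
            if key.is_empty() {
                return None;
            }
            Some((key.to_string(), value.trim().to_string()))
        })
        .collect()
}
```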

By [@bnjjj](https://github.com/bnjjj) in
#4537

### Create a trace during router creation and plugin initialization
([Issue #4472](#4472))

When the router starts or reloads, it will now generate a trace with
spans for query planner creation, schema parsing, plugin initialization
and request pipeline creation. This will help in debugging any issue
during startup, especially in plugin creation.

By [@Geal](https://github.com/Geal) in
#4480

### Add static attribute on specific span in telemetry settings ([Issue
#4561](#4561))

Example of configuration:

```yaml
telemetry:
  instrumentation:
    spans:
      router:
        attributes:
          "my_attribute": "constant_value"
      supergraph:
        attributes:
          "my_attribute": "constant_value"
      subgraph:
        attributes:
          "my_attribute": "constant_value"
```

By [@bnjjj](https://github.com/bnjjj) in
#4566

## 🐛 Fixes

### Order HPA target so that kubernetes does not rewrite ([Issue
#4435](#4435))

This update addresses an OutOfSync issue in ArgoCD applications when
Horizontal Pod Autoscaler (HPA) is configured with both memory and CPU
limits.
Previously, the live and desired manifests within Kubernetes were not
consistently sorted, leading to persistent OutOfSync states in ArgoCD.
This change implements a sorting mechanism for HPA targets within the
Helm chart, ensuring alignment with Kubernetes' expected order.
This fix proactively resolves the sync discrepancies while using HPA,
circumventing the need to wait for Kubernetes' issue resolution
(kubernetes/kubernetes#74099).

By [@cyberhck](https://github.com/cyberhck) in
#4436

### Reactivate log events in traces ([PR
#4486](#4486))

This fixes a regression introduced in #2999, where events were no longer
sent with traces because sampling was too aggressive.

By [@Geal](https://github.com/Geal) in
#4486

### Fix inconsistency in environment variable parsing for telemetry
([Issue
#3203](https://github.com/apollographql/router/issues/ISSUE_NUMBER))

Previously, the router would complain when using the rover
recommendation of `APOLLO_TELEMETRY_DISABLED=1` environment
variable. Now any non-falsey value can be used, such as `1`, `yes`, or
`on`.
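
A lenient check of this kind could look like the sketch below. The exact set of values the router treats as falsey is not stated here, so the list in the code (`""`, `0`, `false`, `no`, `off`) is an assumption, and `is_truthy` is a hypothetical helper name.

```rust
/// Interpret an environment variable value as a boolean flag.
/// Anything outside the (assumed) falsey set counts as true.
pub fn is_truthy(value: &str) -> bool {
    let normalized = value.trim().to_ascii_lowercase();
    !matches!(normalized.as_str(), "" | "0" | "false" | "no" | "off")
}
```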

By [@nicholascioli](https://github.com/nicholascioli) in
#4549

### Store static pages in `Bytes` structure to avoid expensive
allocation per request ([PR
#4528](#4528))

The `CheckpointService` created by the `StaticPageLayer` caused a
significant amount of memory to be allocated on every request. The
service stack gets cloned on every request, and so does the rendered
template.

The template is now stored in a `Bytes` struct instead which is cheap to
clone.
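
The effect can be shown with the standard library alone. The sketch below uses `Arc<[u8]>` as a stand-in for the `Bytes` type (an assumption made to keep the example dependency-free); `StaticPage` is a hypothetical type, not the router's. Cloning it copies a pointer and bumps a reference count instead of copying the rendered page.

```rust
use std::sync::Arc;

/// Hypothetical holder for a pre-rendered page.
/// `Arc<[u8]>` plays the role of `Bytes` here: clones share one buffer.
#[derive(Clone)]
pub struct StaticPage {
    body: Arc<[u8]>,
}

impl StaticPage {
    /// Render once, then keep the bytes behind a reference-counted pointer.
    pub fn new(rendered: String) -> Self {
        StaticPage {
            body: Arc::from(rendered.into_bytes()),
        }
    }

    pub fn body(&self) -> &[u8] {
        &self.body
    }

    /// True when both values point at the same underlying allocation.
    pub fn shares_buffer_with(&self, other: &StaticPage) -> bool {
        Arc::ptr_eq(&self.body, &other.body)
    }
}
```

By contrast, cloning a `String` or `Vec<u8>` field would allocate and copy the whole template each time the service stack is cloned.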

By [@xuorig](https://github.com/xuorig) in
#4528

### Fix header propagation issues ([Issue
#4312](#4312)), ([Issue
#4398](#4398))

This fixes two header propagation issues:
* if a client request header has already been added to a subgraph
request due to another header propagation rule, then it is only added
once
* `Accept`, `Accept-Encoding` and `Content-Encoding` were not in the
list of reserved headers that cannot be propagated. They are now in that
list because those headers are set explicitly by the Router in its
subgraph requests
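
Both fixes can be pictured with a small sketch. This is an illustration under simplifying assumptions, not the router's propagation code: `propagate_headers` is a hypothetical function, rules are reduced to plain header names, and the reserved list contains only the three headers named above (the real list is longer).

```rust
/// Headers the router sets itself; only the three named above are listed.
const RESERVED_HEADERS: [&str; 3] = ["accept", "accept-encoding", "content-encoding"];

/// Apply propagation rules (plain header names in this sketch) to the
/// client headers and return the headers to add to the subgraph request.
pub fn propagate_headers(
    client_headers: &[(&str, &str)],
    rules: &[&str],
) -> Vec<(String, String)> {
    let mut propagated: Vec<(String, String)> = Vec::new();
    for rule in rules {
        let name = rule.to_ascii_lowercase();
        // Reserved headers are never propagated.
        if RESERVED_HEADERS.contains(&name.as_str()) {
            continue;
        }
        // A header already added by an earlier rule is not added again.
        if propagated.iter().any(|(existing, _)| *existing == name) {
            continue;
        }
        let found = client_headers
            .iter()
            .find(|(header, _)| header.eq_ignore_ascii_case(&name))
            .map(|(_, value)| value.to_string());
        if let Some(value) = found {
            propagated.push((name, value));
        }
    }
    propagated
}
```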

There is a potential regression: if a router deployment was accidentally
relying on header propagation to compress subgraph requests, then it
will not work anymore because `Content-Encoding` is not propagated
anymore. Instead it should be set up from the `traffic_shaping` section
of the Router configuration:

```yaml
traffic_shaping:
  all:
    compression: gzip
  subgraphs: # Rules applied to requests from the router to individual subgraphs
    products:
      compression: identity
```

By [@Geal](https://github.com/Geal) in
#4535

## 🧪 Experimental

### Move cacheability metrics to the entity cache plugin ([Issue
#4253](#4253))

The metric was generated in the telemetry plugin before, but it was not
very convenient to keep it there. This adds more configuration:
- enable or disable the metrics
- set the metrics storage TTL (default is 60s)
- the metric's typename attribute is disabled by default. Activating it
can greatly increase the cardinality

This also includes some cleanup and performance improvements

By [@Geal](https://github.com/Geal) in
#4469

---------

Co-authored-by: Edward Huang <edward.huang@apollographql.com>
Co-authored-by: Jeremy Lempereur <jeremy.lempereur@iomentum.com>
Co-authored-by: Jesse Rosenberger <git@jro.cc>
4 participants