Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gateway OpenTelemetry improvements #2700

Merged

Conversation

unflxw
Copy link
Contributor

@unflxw unflxw commented Jul 28, 2023

This PR adds the GraphQL document as a span attribute on the gateway's request span, and makes it so that errors are recorded as exception events in the spans in which they occur.

Commits

Record errors as OTel exception events

When an error is received or thrown, in addition to setting the
span status to error, report the error as an OpenTelemetry
exception event.

If more than one error is received in a GraphQL error array, report
only the first one.

Report the GraphQL document as a span attribute

On the request OpenTelemetry span, add the GraphQL document as an
OpenTelemetry span attribute, using the name for this
property recommended by the OpenTelemetry Semantic Conventions.

Notes

  • I made it so that only the first error in a GraphQL error array is added, erring on the side of caution, as I'm unsure just how many errors could be reported. If you'd rather it reports all errors in the array, let me know.
  • The exception events are not tested through the snapshots, but separately, checking only for their presence in the desired spans. This is convenient, as the details of the error (backtrace, timing) are not interesting to check against.
  • I could not figure out how to add the sub-query that is sent on fetch operations as a span attribute on the fetch spans. I believe this would also be useful. I am quite unfamiliar with the codebase, any pointers would be appreciated.
  • I did not rename the existing operationName attribute to the recommended graphql.operation.name in the OpenTelemetry Semantic Conventions. Let me know if you want me to do that.

@unflxw unflxw requested a review from a team as a code owner July 28, 2023 17:41
@apollo-cla
Copy link

@unflxw: Thank you for submitting a pull request! Before we can merge it, you'll need to sign the Apollo Contributor License Agreement here: https://contribute.apollographql.com/

@netlify
Copy link

netlify bot commented Jul 28, 2023

👷 Deploy request for apollo-federation-docs pending review.

Visit the deploys page to approve it

Name Link
🔨 Latest commit f92f200ddd48ced4b6250ced0a8a8faa86d6c569

@changeset-bot
Copy link

changeset-bot bot commented Jul 28, 2023

🦋 Changeset detected

Latest commit: 263bb6d

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 7 packages
Name Type
@apollo/gateway Minor
@apollo/federation-internals Minor
@apollo/composition Minor
@apollo/query-planner Minor
@apollo/query-graphs Minor
@apollo/subgraph Minor
apollo-federation-integration-testsuite Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@codesandbox-ci
Copy link

codesandbox-ci bot commented Jul 28, 2023

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

Copy link
Member

@trevor-scheer trevor-scheer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a changeset by running npx changeset? I'd consider this a minor. Please be descriptive in the changeset you write, it becomes the changelog entry. It should tell users what change they'll observe, why it might be useful, etc.

Looks pretty good, thanks for the addition! Just a couple comments.

gateway-js/src/executeQueryPlan.ts Outdated Show resolved Hide resolved
gateway-js/src/utilities/opentelemetry.ts Outdated Show resolved Hide resolved
@trevor-scheer trevor-scheer changed the base branch from main to next July 28, 2023 20:54
@trevor-scheer trevor-scheer requested a review from a team as a code owner July 28, 2023 20:54
@trevor-scheer
Copy link
Member

@unflxw FYI I've changed the base branch to next in case you decide to rebase or anything locally.

@unflxw
Copy link
Contributor Author

unflxw commented Jul 29, 2023

Thank you for reviewing this @trevor-scheer 🙌 will address the feedback (and rebase on next) shortly.

I also tried adding the operation type (graphql.operation.type) but I could not get it to work. It seemed to me that requestContext.operation?.operation should be it, but when running the tests it came up empty. 🤷

Any thoughts on how it could be obtained? This would be nice to have, as the combined operation type and name make for a good human-readable identifier of the query, with low cardinality, for the purpose of grouping traces.

When an error is received or thrown, in addition to setting the
span status to error, report the error as an OpenTelemetry
exception event.

If more than one error is received in a GraphQL error array, report
only the first one.
On the request OpenTelemetry span, add the GraphQL document as an
OpenTelemetry span attribute, using the name for this
property recommended by the OpenTelemetry Semantic Conventions.
When an array of errors is returned and the current span's status
is set to error, record all errors in the array as exception events
in the span, not just the first one.
Implement an intermediate `SpanAttributes` type that documents the
specific attributes that are expected in an OpenTelemetry span.

Mark the existing `operationName` attribute as deprecated in favor
of `graphql.operation.name`, which follows OpenTelemetry's GraphQL
Semantic Conventions. Report the attribute with both names.
Obtain the GraphQL operation type from the request context and set
it as a span attribute.
@unflxw unflxw force-pushed the gateway-opentelemetry-improvements branch from f92f200 to 4d2a157 Compare July 31, 2023 11:19
@unflxw
Copy link
Contributor Author

unflxw commented Jul 31, 2023

Done as described @trevor-scheer, please re-review 👍 I also figured out how to add the graphql.operation.type attribute from a different source.

Copy link
Member

@trevor-scheer trevor-scheer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unflxw changes look good!

@dariuszkuc raised a concern to me that capturing the documents could really blow up the size of these spans for some. I think it's best if we make the new behavior configurable and opt-in. I figure we might as well do the same for the new exceptions capturing while we're in there.

Thoughts? Would you mind implementing some configurability into this? I'm open to the naming here, wouldn't mind @dariuszkuc's input as well.

new ApolloGateway({
  // ...
  telemetry: {
    includeDocument: true, // default false
    recordExceptions: true, // default false
  },
});

@unflxw
Copy link
Contributor Author

unflxw commented Jul 31, 2023

@trevor-scheer @dariuszkuc I'm not opposed to adding a configuration option for including the document, I will look into adding it. That said, I think both should default to true. Maybe recordExceptions can also be a number, representing the maximum number of exceptions to report in a span, in which case it can default to some reasonably low number? I'll think about it and reach out with a proposal.

That said, I think it's worth pointing out that the @opentelemetry/instrumentation-graphql package, which Apollo-not-Gateway is compatible with, does not have a configuration option for omitting the document. Presumably that means that this isn't a problem for those servers, which in many setups would be the ones implementing the sub-graphs that Gateway federates with.

@trevor-scheer
Copy link
Member

@unflxw In principle I don't disagree with what you're saying, and that might be the ideal future state of this configuration. My primary concern is making that change in a minor version. If this change in default behavior drove up my operating costs in a non-negligible way I would be pretty frustrated and expect that to be part of a major bump that's explicitly called out. I have no idea what to expect as far as span size and cost goes, so I'd rather make no assumptions there. With what I've proposed, it's visible in the changelog and a new feature that users can opt in to.

@unflxw
Copy link
Contributor Author

unflxw commented Jul 31, 2023

Yeah, that makes sense to me. I'll implement it in that way.

Add a `telemetry` section to the Gateway configuration that allows
the user to determine whether the GraphQL document should be sent
in the request span (`includeDocument`) and whether exception
events should be recorded (`recordException`, which also accepts a
maximum number of exceptions to record)
@unflxw
Copy link
Contributor Author

unflxw commented Aug 2, 2023

@trevor-scheer @dariuszkuc Take a look! I went with the naming you suggested, feel free to change it if you want.

I think that a default value of 1 for recordExceptions would be conservative enough and not increase costs substantially, while providing quite a bit of value (it would at least give some context as to what went wrong) -- but for the time being, I've made both options default to false.

Copy link
Member

@trevor-scheer trevor-scheer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for incorporating those changes @unflxw, this looks good to me. I'm looking for @dariuszkuc to take a final pass on this next week just to make sure this aligns with what he and I spoke about.

@unflxw
Copy link
Contributor Author

unflxw commented Aug 4, 2023

Sounds good! Ping me if there's anything I can help with.

Copy link
Member

@dariuszkuc dariuszkuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

While opt-in flag is currently the best we can do, ideally open telemetry spec should define a mechanism that would allow you to dynamically opt-in to this additional information on a per trace basis (similar to zipkin B3 debug flag). As Trevor mentioned, those documents could potentially be rather large (as in hundreds of KB big). Sadly TraceFlags currently only define single sampled flag so there is no built-in mechanism yet :(.

@trevor-scheer trevor-scheer merged commit 4db499b into apollographql:next Aug 18, 2023
@trevor-scheer
Copy link
Member

Thanks for the contribution @unflxw!

pull bot pushed a commit to NOUIY/federation that referenced this pull request Nov 28, 2023
This PR was opened by the [Changesets
release](https://github.com/changesets/action) GitHub action. When
you're ready to do a release, you can merge this and the packages will
be published to npm automatically. If you're not ready to do a release
yet, that's fine, whenever you add more changesets to next, this PR will
be updated.


# Releases
## @apollo/composition@2.6.0

### Minor Changes

- Update `license` field in `package.json` to use `Elastic-2.0` SPDX
identifier
([apollographql#2741](apollographql#2741))


- Introduce the new `@policy` scope for composition
([apollographql#2818](apollographql#2818))

> Note that this directive will only be _fully_ supported by the Apollo
Router as a GraphOS Enterprise feature at runtime. Also note that
_composition_ of valid `@policy` directive applications will succeed,
but the resulting supergraph will not be _executable_ by the Gateway or
an Apollo Router which doesn't have the GraphOS Enterprise entitlement.

Users may now compose `@policy` applications from their subgraphs into a
supergraph.

      The directive is defined as follows:

    ```graphql
    scalar federation__Policy

    directive @Policy(policies: [[federation__Policy!]!]!) on
      | FIELD_DEFINITION
      | OBJECT
      | INTERFACE
      | SCALAR
      | ENUM
    ```

The `Policy` scalar is effectively a `String`, similar to the `FieldSet`
type.

In order to compose your `@policy` usages, you must update your
subgraph's federation spec version to v2.6 and add the `@policy` import
to your existing imports like so:

    ```graphql
@link(url: "https://specs.apollo.dev/federation/v2.6", import: [...,
"@Policy"])
    ```

### Patch Changes

- Updated dependencies
\[[`b18841be`](apollographql@b18841b),
[`e325b499`](apollographql@e325b49)]:
    -   @apollo/query-graphs@2.6.0
    -   @apollo/federation-internals@2.6.0

## @apollo/gateway@2.6.0

### Minor Changes

- Add more information to OpenTelemetry spans.
([apollographql#2700](apollographql#2700))

    Rename `operationName` to `graphql.operation.name` and add a
`graphql.operation.type` attribute, in conformance with the
OpenTelemetry
Semantic Conventions for GraphQL. The `operationName` attribute is now
deprecated, but it is still emitted alongside `graphql.operation.name`.

Add a `graphql.document` span attribute to the `gateway.request` span,
containing the entire GraphQL source sent in the request. This feature
    is disable by default.

When one or more GraphQL or internal errors occur, report them in the
OpenTelemetry span in which they took place, as an exception event. This
    feature is disabled by default.

To enable the `graphql.document` span attribute and the exception event
reporting, add the following entries to your `ApolloGateway` instance
    configuration:

    ```ts
    const gateway = new ApolloGateway({
      // ...
      telemetry: {
        // Set to `true` to include the `graphql.document` attribute
        includeDocument: true,
// Set to `true` to report all exception events, or set to a number
        // to report at most that number of exception events per span
        reportExceptions: true,
        // or: reportExceptions: 1
      },
    });
    ```

- Update `license` field in `package.json` to use `Elastic-2.0` SPDX
identifier
([apollographql#2741](apollographql#2741))


- Introduce the new `@policy` scope for composition
([apollographql#2818](apollographql#2818))

> Note that this directive will only be _fully_ supported by the Apollo
Router as a GraphOS Enterprise feature at runtime. Also note that
_composition_ of valid `@policy` directive applications will succeed,
but the resulting supergraph will not be _executable_ by the Gateway or
an Apollo Router which doesn't have the GraphOS Enterprise entitlement.

Users may now compose `@policy` applications from their subgraphs into a
supergraph.

      The directive is defined as follows:

    ```graphql
    scalar federation__Policy

    directive @Policy(policies: [[federation__Policy!]!]!) on
      | FIELD_DEFINITION
      | OBJECT
      | INTERFACE
      | SCALAR
      | ENUM
    ```

The `Policy` scalar is effectively a `String`, similar to the `FieldSet`
type.

In order to compose your `@policy` usages, you must update your
subgraph's federation spec version to v2.6 and add the `@policy` import
to your existing imports like so:

    ```graphql
@link(url: "https://specs.apollo.dev/federation/v2.6", import: [...,
"@Policy"])
    ```

- Add graphql.operation.name attribute on gateway.plan span
([apollographql#2807](apollographql#2807))

### Patch Changes

- Updated dependencies
\[[`b18841be`](apollographql@b18841b),
[`e325b499`](apollographql@e325b49)]:
    -   @apollo/query-planner@2.6.0
    -   @apollo/composition@2.6.0
    -   @apollo/federation-internals@2.6.0

## @apollo/federation-internals@2.6.0

### Minor Changes

- Update `license` field in `package.json` to use `Elastic-2.0` SPDX
identifier
([apollographql#2741](apollographql#2741))


- Introduce the new `@policy` scope for composition
([apollographql#2818](apollographql#2818))

> Note that this directive will only be _fully_ supported by the Apollo
Router as a GraphOS Enterprise feature at runtime. Also note that
_composition_ of valid `@policy` directive applications will succeed,
but the resulting supergraph will not be _executable_ by the Gateway or
an Apollo Router which doesn't have the GraphOS Enterprise entitlement.

Users may now compose `@policy` applications from their subgraphs into a
supergraph.

      The directive is defined as follows:

    ```graphql
    scalar federation__Policy

    directive @Policy(policies: [[federation__Policy!]!]!) on
      | FIELD_DEFINITION
      | OBJECT
      | INTERFACE
      | SCALAR
      | ENUM
    ```

The `Policy` scalar is effectively a `String`, similar to the `FieldSet`
type.

In order to compose your `@policy` usages, you must update your
subgraph's federation spec version to v2.6 and add the `@policy` import
to your existing imports like so:

    ```graphql
@link(url: "https://specs.apollo.dev/federation/v2.6", import: [...,
"@Policy"])
    ```

## @apollo/query-graphs@2.6.0

### Minor Changes

- Update `license` field in `package.json` to use `Elastic-2.0` SPDX
identifier
([apollographql#2741](apollographql#2741))


- Introduce the new `@policy` scope for composition
([apollographql#2818](apollographql#2818))

> Note that this directive will only be _fully_ supported by the Apollo
Router as a GraphOS Enterprise feature at runtime. Also note that
_composition_ of valid `@policy` directive applications will succeed,
but the resulting supergraph will not be _executable_ by the Gateway or
an Apollo Router which doesn't have the GraphOS Enterprise entitlement.

Users may now compose `@policy` applications from their subgraphs into a
supergraph.

      The directive is defined as follows:

    ```graphql
    scalar federation__Policy

    directive @Policy(policies: [[federation__Policy!]!]!) on
      | FIELD_DEFINITION
      | OBJECT
      | INTERFACE
      | SCALAR
      | ENUM
    ```

The `Policy` scalar is effectively a `String`, similar to the `FieldSet`
type.

In order to compose your `@policy` usages, you must update your
subgraph's federation spec version to v2.6 and add the `@policy` import
to your existing imports like so:

    ```graphql
@link(url: "https://specs.apollo.dev/federation/v2.6", import: [...,
"@Policy"])
    ```

### Patch Changes

- Updated dependencies
\[[`b18841be`](apollographql@b18841b),
[`e325b499`](apollographql@e325b49)]:
    -   @apollo/federation-internals@2.6.0

## @apollo/query-planner@2.6.0

### Minor Changes

- Update `license` field in `package.json` to use `Elastic-2.0` SPDX
identifier
([apollographql#2741](apollographql#2741))


- Introduce the new `@policy` scope for composition
([apollographql#2818](apollographql#2818))

> Note that this directive will only be _fully_ supported by the Apollo
Router as a GraphOS Enterprise feature at runtime. Also note that
_composition_ of valid `@policy` directive applications will succeed,
but the resulting supergraph will not be _executable_ by the Gateway or
an Apollo Router which doesn't have the GraphOS Enterprise entitlement.

Users may now compose `@policy` applications from their subgraphs into a
supergraph.

      The directive is defined as follows:

    ```graphql
    scalar federation__Policy

    directive @Policy(policies: [[federation__Policy!]!]!) on
      | FIELD_DEFINITION
      | OBJECT
      | INTERFACE
      | SCALAR
      | ENUM
    ```

The `Policy` scalar is effectively a `String`, similar to the `FieldSet`
type.

In order to compose your `@policy` usages, you must update your
subgraph's federation spec version to v2.6 and add the `@policy` import
to your existing imports like so:

    ```graphql
@link(url: "https://specs.apollo.dev/federation/v2.6", import: [...,
"@Policy"])
    ```

### Patch Changes

- Updated dependencies
\[[`b18841be`](apollographql@b18841b),
[`e325b499`](apollographql@e325b49)]:
    -   @apollo/query-graphs@2.6.0
    -   @apollo/federation-internals@2.6.0

## @apollo/subgraph@2.6.0

### Patch Changes

- Updated dependencies
\[[`b18841be`](apollographql@b18841b),
[`e325b499`](apollographql@e325b49)]:
    -   @apollo/federation-internals@2.6.0

## apollo-federation-integration-testsuite@2.6.0

### Minor Changes

- Update `license` field in `package.json` to use `Elastic-2.0` SPDX
identifier
([apollographql#2741](apollographql#2741))


- Introduce the new `@policy` scope for composition
([apollographql#2818](apollographql#2818))

> Note that this directive will only be _fully_ supported by the Apollo
Router as a GraphOS Enterprise feature at runtime. Also note that
_composition_ of valid `@policy` directive applications will succeed,
but the resulting supergraph will not be _executable_ by the Gateway or
an Apollo Router which doesn't have the GraphOS Enterprise entitlement.

Users may now compose `@policy` applications from their subgraphs into a
supergraph.

      The directive is defined as follows:

    ```graphql
    scalar federation__Policy

    directive @Policy(policies: [[federation__Policy!]!]!) on
      | FIELD_DEFINITION
      | OBJECT
      | INTERFACE
      | SCALAR
      | ENUM
    ```

The `Policy` scalar is effectively a `String`, similar to the `FieldSet`
type.

In order to compose your `@policy` usages, you must update your
subgraph's federation spec version to v2.6 and add the `@policy` import
to your existing imports like so:

    ```graphql
@link(url: "https://specs.apollo.dev/federation/v2.6", import: [...,
"@Policy"])
    ```

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants