
[BE-308] Federated metrics support #2900

Merged
zionts merged 19 commits into release-2.7.0 from glasser/tree-builder on Jul 15, 2019
Conversation

@glasser glasser (Member) commented Jun 21, 2019

Augments apollo-engine-reporting and @apollo/gateway to provide federated
metrics to Engine from the gateway.

Extend trace reports to contain a QueryPlan tree when traces are received from
a gateway, mimicking the query plan that the gateway makes internally.
FetchNodes on the tree correspond to GraphQL operations executed against
federated services; each FetchNode contains its own tree of Trace.Nodes.

  • apollo-engine-reporting

    • Factor out core tree building logic to EngineReportingTreeBuilder class
    • New EngineFederatedTracingExtension adds a base64-encoded protobuf of a
      Trace (with just the node tree and top-level timestamps) to an extension
      on federated service responses, if requested to do so via an HTTP header
      from the gateway (see the sketch after the TODO list below).
    • Implement new originalFieldName trace field (similar goal to #2401, "Send the original field name instead of its alias", BE-158).
    • If @apollo/gateway built us a query plan trace, add it to the trace.
  • @apollo/gateway

    • Builds up a protobuf representation of the query plan, including sub-traces
      from federated services on FetchNodes
  • apollo-engine-reporting-protobuf

    • Copy reports.proto from internal repo
    • Set editorconfig settings to match internal repo
    • Always use normal numbers, not long.js objects
  • apollo-server-core

    • Thread some data through between reporting and gateway modules.
  • apollo-server-integration-testsuite

    • Clean up some ts warnings

TODO record __resolveReference as multiple nodes?
TODO is it actually OK to include the operation text in fetch trace nodes?
TODO more tests, especially of error cases
TODO internal docs (eg FIXME comment in reports.proto)
TODO external docs
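
To make the ftv1 handshake above concrete, here is a minimal sketch of the federated-service side, assuming the protobufjs-generated Trace message exported by apollo-engine-reporting-protobuf; the helper name and its exact placement are illustrative, not the PR's actual implementation:

```typescript
import { Trace } from 'apollo-engine-reporting-protobuf';

// Hypothetical helper: when the gateway asked for a federated trace via the
// `apollo-federation-include-trace: ftv1` request header, encode this
// request's Trace (node tree plus top-level timestamps only) and return it
// as the base64 `ftv1` response extension; otherwise return nothing.
function buildFederatedTraceExtension(
  requestHeaders: { get(name: string): string | null },
  trace: Trace,
): { ftv1: string } | undefined {
  if (requestHeaders.get('apollo-federation-include-trace') !== 'ftv1') {
    return undefined; // the gateway did not ask for a trace
  }
  const bytes = Trace.encode(trace).finish(); // protobuf wire bytes
  return { ftv1: Buffer.from(bytes).toString('base64') };
}
```

A gateway-side decoding counterpart is sketched later, next to the reports.proto discussion.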

@glasser glasser requested a review from zionts June 21, 2019 23:37
@glasser glasser force-pushed the glasser/tree-builder branch 4 times, most recently from 13bcd5c to 6096b14 on June 27, 2019 00:50
@abernix abernix (Member) left a comment

With a couple comments, the second commit looks good to me!


this.stopped = true;
return {
durationNs: durationHrTimeToNanos(process.hrtime(this.startHrTime)),
Member:

We've recently started taking great care not to use Node.js-specific APIs, and process.hrtime certainly falls into that category. That (among other things, I suspect, like its dependence on protobuf) makes apollo-engine-reporting one of those packages that has to be conditionally require'd depending on whether the environment supports it.

I don't think we should do anything right now, but just pointing out that it might be worth looking into the performance-now package, which falls back (it's CoffeeScript!) to performance.now (in browsers) and Date.now().

glasser (Member Author):

This is a good point and I'll add it as a comment. It's not a regression — we've always used process.hrtime for every JS implementation of timing, and I think it would be bad for people to use Date.now() for timing since it lacks appropriate precision (and clock skew control).

glasser (Member Author):

I am underwhelmed by both performance.now() and performance-now for this purpose.

I think we definitely don't need to care about supporting engine reporting from browsers. But we may need to care about alternate non-Node server implementations. I can't recall anyone ever complaining that engine didn't work on their server due to lack of process.hrtime, though?

performance.now uses a double instead of a pair of integers and so it has less precision, and even the docs basically say to only expect microsecond precision. (And browser implementations explicitly fudge the numbers to work around Spectre issues.)

performance-now has https://github.com/myrne/performance-now/blob/614099d729aef5d7bdbb85c0d1060166d63d05c7/src/performance-now.coffee#L6-L8, which, well... the entire point of process.hrtime returning a silly pair of integers instead of a single number is that you can't precisely represent nanoseconds in a JS integer! See my comment on durationHrTimeToNanos: after a process uptime of 104 days, that function will start losing precision.
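
For concreteness, a quick sketch of the arithmetic behind that 104-day figure; the body of durationHrTimeToNanos here is a plausible one given how the diff above calls it, and the constants are just back-of-the-envelope math:

```typescript
// process.hrtime() returns [seconds, nanoseconds] precisely because a single
// JS number (an IEEE-754 double) cannot represent arbitrary nanosecond counts
// exactly.
function durationHrTimeToNanos(hrtime: [number, number]): number {
  return hrtime[0] * 1e9 + hrtime[1];
}

// Integers are exact in a double only up to Number.MAX_SAFE_INTEGER (2^53 - 1).
const maxExactNanos = Number.MAX_SAFE_INTEGER; // 9_007_199_254_740_991 ns
const daysUntilPrecisionLoss = maxExactNanos / 1e9 / 86_400; // ≈ 104.25 days
// Past roughly 104 days of process uptime, the sum above starts rounding and
// can no longer distinguish individual nanoseconds.
```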

glasser (Member Author):

In other words, I'd be way more interested in a function with the API of process.hrtime that tries to produce integer nanoseconds from performance.now than the other way around.
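
A minimal sketch of what such a wrapper could look like (not part of this PR; the fallback branch is illustrative and would still inherit performance.now()'s roughly microsecond precision):

```typescript
type HrTime = [number, number]; // [seconds, nanoseconds], same shape as process.hrtime()

// Hypothetical hrtime-flavored shim: use real integer-nanosecond timing when
// process.hrtime is available, otherwise approximate it from performance.now()
// (fractional milliseconds) so callers keep the same [seconds, nanos] API.
function nowHrTime(): HrTime {
  if (typeof process !== 'undefined' && typeof process.hrtime === 'function') {
    return process.hrtime();
  }
  const ms = performance.now();
  const seconds = Math.floor(ms / 1000);
  const nanos = Math.round((ms - seconds * 1000) * 1e6);
  return [seconds, nanos];
}
```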

// of GraphQLError are not enumerable and won't show up in the trace
// (even in the json field) anyway.)
// XXX I really don't understand why we don't allow rewriteError
// to rewrite extensions, which do show up in JSON.
Member:

extensions seems like an oversight!

glasser (Member Author):

Fixed in #2932

Two resolved review threads on packages/apollo-engine-reporting/src/treeBuilder.ts (now outdated)
@glasser glasser changed the base branch from master to release-2.7.0 June 27, 2019 18:55
@glasser glasser changed the title from "apollo-engine-reporting: Factor out node creation from extension" to "[BE-308] Federated metrics support" Jun 29, 2019
@glasser glasser (Member Author) commented Jun 29, 2019

Surprise! This is now the federated metrics PR.

uint64 sent_time_offset = 4;

// Wallclock times measured in the gateway for when this operation was
// sent and recieved.
glasser (Member Author):

oops typo

}
message FetchNode {
string serviceName = 1;
string operation = 2;
glasser (Member Author):

Let's get rid of this until we build UI that needs it. In general we probably shouldn't collect any string data that we don't immediately present to users, to help them form a better model of what data we collect.

glasser (Member Author):

Add a comment noting that we might want to augment it with more information about the operation at a later date, when we build a feature that needs it.

// FIXME: do we need requires?

// This Trace only contains start_time, end_time, duration_ns, and root;
// all timings were calculated on the federated service.
glasser (Member Author):

Be explicit somewhere that all fields on FetchNode are measured on the gateway (as an overall statement).
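
For reference, a gateway-side counterpart to the service-side sketch near the top of this PR description: decoding the ftv1 extension into the Trace that gets attached to the FetchNode (a sketch, assuming the protobufjs-generated Trace class; the surrounding FetchNode plumbing is omitted):

```typescript
import { Trace } from 'apollo-engine-reporting-protobuf';

// The embedded Trace carries only start_time, end_time, duration_ns, and root,
// all measured on the federated service; everything else on the FetchNode
// (e.g. the wallclock sent/received times) is measured on the gateway itself.
function decodeFederatedTrace(ftv1: string): Trace {
  return Trace.decode(Buffer.from(ftv1, 'base64'));
}
```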

@@ -51,11 +52,14 @@ export interface GraphQLResponse {
}

export interface GraphQLRequestMetrics {
glasser (Member Author):

This could use comments.
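
For illustration only (these field names are hypothetical, not the interface as it actually landed), the kind of commenting being asked for might look like:

```typescript
import { Trace } from 'apollo-engine-reporting-protobuf';

export interface GraphQLRequestMetrics {
  // Set by the reporting agent when it wants a full trace for this request.
  captureTraces?: boolean;
  // Query-plan trace built by @apollo/gateway, threaded back through
  // apollo-server-core so apollo-engine-reporting can attach it to the
  // outgoing Trace report.
  queryPlanTrace?: Trace.QueryPlanNode;
}
```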


public didEncounterErrors(errors: GraphQLError[]) {
errors.forEach(err => {
// This is an error from a federated service. We will already be reporting
glasser (Member Author):

I suppose if we supported nested gateway levels (which won't work for reporting right now anyway), then maybe this is wrong. Maybe it should be moved to EngineReportingAgent instead of the tree builder, before the call to the tree builder.
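
A rough standalone sketch of the check being discussed, assuming (per the later "Rely on serviceName instead of error code" commit) that errors from a downstream service always carry a serviceName extension; the function name is illustrative, and the real tree builder attaches errors to trace nodes rather than returning a list:

```typescript
import { GraphQLError } from 'graphql';

// Keep only the errors the gateway itself should report; anything carrying a
// serviceName extension came from a federated service, whose own ftv1 trace
// already reports it.
function errorsToReportFromGateway(errors: readonly GraphQLError[]): GraphQLError[] {
  return errors.filter(err => !(err.extensions && err.extensions.serviceName));
}
```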

@glasser glasser force-pushed the glasser/tree-builder branch 3 times, most recently from a9d4ece to 281ad6e on July 3, 2019 22:04
// in its response.
if (traceNode) {
http = {
headers: new Headers({ 'apollo-federation-include-trace': 'ftv1' }),
Contributor:

It seems like we're only sending these to the underlying service, but I was pretty sure there was a way to hook into the request context from the gateway to persist incoming headers and pass them along 🤔

Contributor:

@abernix or @jbaxleyiii any idea if I'm missing something here? Maybe the path to inject headers into the service is happening elsewhere.

Contributor:

I think the hooking in happens at the GraphQLTransport level, so we might have some more work to do here. But I think this means that if users do that, they'll lose metrics tracking rather than losing their transport changes, which we should definitely resolve, but it's terrifying.
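
For context, the usual way users override outgoing service requests is by supplying their own data source, roughly like the sketch below (standard @apollo/gateway usage, not code from this PR); the open question in this thread is whether such an override also preserves the apollo-federation-include-trace header the gateway sets above:

```typescript
import { ApolloGateway, RemoteGraphQLDataSource } from '@apollo/gateway';

class AuthenticatedDataSource extends RemoteGraphQLDataSource {
  willSendRequest({ request, context }: { request: any; context: any }) {
    // Forward a value from the gateway's request context to the service.
    // (`authHeader` is an assumed name set by the gateway's context function.)
    if (context.authHeader) {
      request.http.headers.set('authorization', context.authHeader);
    }
  }
}

const gateway = new ApolloGateway({
  serviceList: [{ name: 'accounts', url: 'http://localhost:4001/graphql' }],
  buildService({ url }) {
    return new AuthenticatedDataSource({ url });
  },
});
```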

@zionts zionts (Contributor) left a comment

:shipit:

glasser and others added 2 commits July 14, 2019 23:57
Adam Zionts and others added 12 commits July 14, 2019 23:57
Previously, the query plan execution logic would throw an error and fail the
entire operation if it could not parse a trace. Instead, it now records the
failure to parse the trace on the overall fetch node itself. This commit also
adds some comments.
Build still breaking :(
We used to rely on the error code matching "DOWNSTREAM_SERVICE_ERROR" in
order to determine that an error was from an underlying service rather
than from the gateway itself. In the past few weeks, error handling in
the gateway has been modified to allow for overriding this error code
with whatever code has come from the gateway, but we are guaranteed to
always have a serviceName if the error comes from a downstream service.

Note, however, that this means if users pass "serviceName" in an error
extension, it will NOT be reported for metrics, even if it occurred at
the Gateway level. We may want to find something less flimsy than this.
Adam Zionts added 5 commits July 14, 2019 23:58
 - apollo-cache-control@0.7.6-alpha.10
 - apollo-datasource-rest@0.5.2-alpha.1
 - apollo-engine-reporting-protobuf@0.3.2-alpha.0
 - apollo-engine-reporting@1.4.0-alpha.10
 - @apollo/federation@0.6.11-alpha.10
 - @apollo/gateway@0.7.0-alpha.10
 - apollo-server-azure-functions@2.7.0-alpha.10
 - apollo-server-cloud-functions@2.7.0-alpha.10
 - apollo-server-cloudflare@2.7.0-alpha.10
 - apollo-server-core@2.7.0-alpha.10
 - apollo-server-express@2.7.0-alpha.10
 - apollo-server-fastify@2.7.0-alpha.10
 - apollo-server-hapi@2.7.0-alpha.10
 - apollo-server-integration-testsuite@2.7.0-alpha.10
 - apollo-server-koa@2.7.0-alpha.10
 - apollo-server-lambda@2.7.0-alpha.10
 - apollo-server-micro@2.7.0-alpha.10
 - apollo-server-plugin-base@0.6.0-alpha.10
 - apollo-server-plugin-response-cache@0.2.7-alpha.10
 - apollo-server-testing@2.7.0-alpha.10
 - apollo-server-types@0.1.1-alpha.1
 - apollo-server@2.7.0-alpha.10
 - apollo-tracing@0.7.5-alpha.10
 - graphql-extensions@0.8.0-alpha.10
 - apollo-server-azure-functions@2.7.0-alpha.11
 - apollo-server-cloud-functions@2.7.0-alpha.11
 - apollo-server-cloudflare@2.7.0-alpha.11
 - apollo-server-express@2.7.0-alpha.11
 - apollo-server-fastify@2.7.0-alpha.11
 - apollo-server-hapi@2.7.0-alpha.11
 - apollo-server-integration-testsuite@2.7.0-alpha.11
 - apollo-server-koa@2.7.0-alpha.11
 - apollo-server-lambda@2.7.0-alpha.11
 - apollo-server-micro@2.7.0-alpha.11
 - apollo-server-testing@2.7.0-alpha.11
 - apollo-server@2.7.0-alpha.11
 - apollo-server-azure-functions@2.7.0-alpha.12
 - apollo-server-cloud-functions@2.7.0-alpha.12
 - apollo-server-cloudflare@2.7.0-alpha.12
 - apollo-server-express@2.7.0-alpha.12
 - apollo-server-fastify@2.7.0-alpha.12
 - apollo-server-hapi@2.7.0-alpha.12
 - apollo-server-integration-testsuite@2.7.0-alpha.12
 - apollo-server-koa@2.7.0-alpha.12
 - apollo-server-lambda@2.7.0-alpha.12
 - apollo-server-micro@2.7.0-alpha.12
 - apollo-server-testing@2.7.0-alpha.12
 - apollo-server@2.7.0-alpha.12
@zionts zionts merged commit d8ade4d into release-2.7.0 Jul 15, 2019
@zionts zionts deleted the glasser/tree-builder branch July 15, 2019 07:59
zionts pushed a commit that referenced this pull request Jul 18, 2019
A previous commit
#2900 migrated the
extension for reporting tracing information to engine into an
abstraction called the treeBuilder. The treeBuilder was then utilized by
both a federatedExtension and the normal extension, but the two
extensions have different end-timing semantics. In the normal extension,
the end-timing is captured at the end of the *request*, whereas in the
federated extension, the end-timing is captured at the end of the
*execution*. The reason for this is that the standard extension does not
actually append information into the extensions field, whereas the
federated extension does, so serialization and transmission of the
response cannot be included in the end of the extension.

Due to the limitations of the current requestPipeline, executionDidEnd
is what needs to be hooked into to halt overall timing, which is called
directly prior to didEncounterErrors. It is likely possible to refactor
the request pipeline to support capturing the timing data of serializing
errors into the proto state, but since it would be more boolean
control-flow to add into the overall request context, this commit simply
discards that timing information for reporting in lieu of throwing an
error.

A subsequent commit will follow up with replacing error throws with
logging statements.
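
To restate that timing difference as code, here is a condensed sketch with hypothetical class and hook names (the real extensions hang off graphql-extensions and the request pipeline; TreeBuilderLike stands in for EngineReportingTreeBuilder):

```typescript
class TreeBuilderLike {
  stopTiming(): void {
    // freeze duration_ns / end_time for the trace being built
  }
}

class ReportingExtensionSketch {
  constructor(private treeBuilder: TreeBuilderLike = new TreeBuilderLike()) {}
  // Non-federated reporting sends the trace to Engine out of band, so it can
  // keep timing until the whole request (including serialization) has ended.
  requestDidEnd(): void {
    this.treeBuilder.stopTiming();
  }
}

class FederatedTracingExtensionSketch {
  constructor(private treeBuilder: TreeBuilderLike = new TreeBuilderLike()) {}
  // The federated trace is serialized into the response's extensions.ftv1
  // field, so timing must stop when execution ends; serializing and sending
  // the response cannot be part of the trace that response is carrying.
  executionDidEnd(): void {
    this.treeBuilder.stopTiming();
  }
}
```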
@hotgazpacho hotgazpacho mentioned this pull request Aug 12, 2019
@glasser glasser restored the glasser/tree-builder branch March 13, 2020 21:27
abernix pushed a commit to apollographql/federation that referenced this pull request Sep 4, 2020
* [BE-308] Federated metrics support


* Update protobuf to match cloud repo

* No longer set nonexistent operation field

* Update proto to fix typo

* Convert trace failure into metric


* add snapshot-serializer test

* Add to comment justifying console.warn

* update snapshot

* prettier

* Fixup some imports

Build still breaking :(

* Rely on serviceName instead of error code


* lint fix

* Remove invalid null arguments

* Take suggested log changes

* Publish


* empty commit

* Publish


* anotha one

* Publish


Apollo-Orig-Commit-AS: apollographql/apollo-server@d8ade4d
@abernix abernix deleted the glasser/tree-builder branch February 5, 2021 07:05
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 21, 2023