
initial studio agent, client awareness and operation reporting #309

Merged
merged 105 commits into from
Feb 8, 2022

Conversation

garypen
Contributor

@garypen garypen commented Jan 7, 2022

Ready for first set of reviews.

Two different implementations:
 - layer based
 - opentelemetry layer based

Also: implement some of the changes for telemetry ADR.
Adapt to recent changes and fix custom implementation.
Automatically extract data from traces and submit to the Apollo
collector. Lots of the data is bogus and missing and probably being
collected from the wrong trace point, but at least it's real...
Clean up the OpenTelemetry implementation to be purely Reporter focussed
and clean up some more of the ideas lingering in the code.

Still a long way from "finished", but enough to at least discuss.
Set a "base layer" which is Apollo Agent tracing and integrate more
completely into the existing OT configuration code.
I've created a usage_agent repo under garypen until I decide where this
should live.
Point at the latest and greatest
Just so I stop getting complaints from CircleCI...
This means the router has both server and client embedded within it.
This is convenient for testing and CircleCI.
Discussed with Jesse and concluded that it was simpler to keep the agent
in the single repo for now. We can always split it out later if it turns
out to be the wrong decision.
And remember to remove my fake error from the windows uname code.
This is enough to get xtask all to pass. I'm not really sure why
encoding_rs wants to be on 0.8.30 on my branch, but ehhh...
- Add gzip compression to submitted traces
- Do some minimal investigation of normalization
- Clean up the interfaces for submitting records
- "Batch" submissions and send them every 5 seconds
- use take instead of replace
- add the Zlib license
- Just make tests pass
Probably still not the right place, but I'll figure that out before
long.

Clean up one of the expect errors.
I'm focussing more on the opentelemetry approach right now, so throw
this away for now.
@garypen garypen changed the title Garypen/66 usage agent garypen/66 usage agent Jan 7, 2022
- levels are better
- also, divide busy_ns by 1_000_000 for lower statistics

Still wrong statistics, but lower values
Because that is already used elsewhere in the router and it does seem to
have a cleaner interface.
That could be the cause of the intermittent errors during circleCI
testing on Linux and Windows...
CircleCI linux tests keep failing because this test doesn't complete...

It's annoying, so maybe adding these sleeps will improve this situation.
Maybe a bug in insta or maybe bad interactions with other macros...
Add some text to the configuration docs about the new configuration
options.
Improved README.md
Add a function to wrap error reporting
I'm not convinced this is a good idea, but I've added a comment to the
code and left it in there for now.

Post-merge, if it stays, it needs to be more sophisticated than just a
HashMap.
Address requirement to allow BSP environment variables to influence the
export batch and queue sizes of apollo telemetry. The implementation
enforces default lower bounds and warns users that settings aren't correct
if problems are detected.
remove expensive debug tracing
to reflect the information we now know about the problem
@garypen garypen merged commit 4f60872 into main Feb 8, 2022
@garypen garypen deleted the garypen/66-usage-agent branch February 8, 2022 13:31
@garypen garypen linked an issue Feb 8, 2022 that may be closed by this pull request
@garypen garypen removed a link to an issue Feb 8, 2022
@garypen
Contributor Author

garypen commented Feb 8, 2022

closes: #413

Member

@glasser glasser left a comment

This is all looking great! I'm glad this got merged.

Usage reporting is a big project so I hope you take most of what I've written below in the spirit of "this needs to be right eventually". Most of the immediate actions to take for these comments are just "make sure you're tracking the need to do X somewhere and/or add a comment about not doing X yet".

My biggest suggestion is that for the medium term (ie, until router has moved away from the graphql-js embedding), you solve some of the issues around generating the report key (and referenced field list) by extending the main router-bridge interface from "returns a query plan" to "returns all the strongly cacheable information that the router needs about the (operation, schema) tuple". Which includes the stats report key, referenced fields list, whether or not validation passed (though that might not be needed if stats report key calculation is in JS), etc. This will mean you can get these calculations consistent with Gateway immediately without having to wait on apollo-rs to give you more tools.

At the very least you'll need to get the "does the validation algorithm pass" boolean from JS because I don't think apollo-rs is anywhere near implementing the 28 validation algorithms defined in the GraphQL spec.

let dh = dh_map
    .entry((client_name, client_version, key.clone()))
    .or_insert_with(|| DurationHistogram::new(None));
dh.increment_duration(span.end_time.duration_since(span.start_time).unwrap(), 1);
Member

I don't have much insight into how you generated these spans in the first place, so I'm curious — are the timestamps monotonic or walltime? ie are they affected by NTP tweaking the current time? In AS I think we do our best to use diffs of monotonic clocks whenever calculating durations (which means that we can end up with timestamps and durations that don't necessarily match up.)

Contributor Author

The spans are generated "auto-magically" by the tracing and opentelemetry framework. They are using SystemTime to record these end and start times and I note that:
https://doc.rust-lang.org/nightly/std/time/struct.SystemTime.html
is clear about the potential for drift. I'll raise an issue to address handling problematic situations (end time < start time basically). I definitely shouldn't just unwrap() the value.
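The defensive handling described above could look like this minimal sketch; `safe_duration` is a hypothetical helper illustrating the idea, not code from this PR:

```rust
use std::time::{Duration, SystemTime};

// SystemTime is not monotonic: an NTP adjustment can make `end` earlier
// than `start`, in which case duration_since returns Err. Fall back to a
// zero duration instead of unwrap()-ing and panicking in the hot path.
fn safe_duration(start: SystemTime, end: SystemTime) -> Duration {
    end.duration_since(start).unwrap_or_default()
}
```

A monotonic clock (`std::time::Instant`) avoids the problem entirely, but the spans here come from the tracing/opentelemetry framework, which records `SystemTime`.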

static GRAPHQL_VALIDATION_FAILURE: &str = "## GraphQLValidationFailure\n";
static GRAPHQL_UNKNOWN_OPERATION_NAME: &str = "## GraphQLUnknownOperationName\n";

fn normalize(op: Option<&opentelemetry::Value>, query: &str) -> String {
Member

I'd highly encourage you to name this function something like "statsReportKey" as that's the name we use for this in our protocol. "normalize" is a very ambiguous name as there are many different operation normalization algorithms in our platform (and basically every time we use it it includes stripping comments, so none of them would yield a stats report key with the comment-ish first line).

Contributor Author

I'll rename it to "stats_report_key". Rust prefers snake case for function names:
https://doc.rust-lang.org/1.0.0/style/style/naming/README.html

fn normalize(op: Option<&opentelemetry::Value>, query: &str) -> String {
    // If we don't have an operation name, we can't do anything useful
    // with this query. Just return the appropriate error.
    let op_name: String = match op {
Member

Can you help me understand where this operation name comes from? It looks like it comes from an otel attribute which I think comes from

operation_name = %request.operation_name.clone().unwrap_or_else(|| "-".to_string()),

My naive interpretation of that code is that this is specifically the out of band operationName explicitly specified in a GraphQL request, ie like {"query": "query Foo { __typename }", "operationName": "Foo"}.

But in the very common case where a named query is provided without an explicit out of band operationName, ie a request like {"query": "query Foo { __typename }"}, my guess is that you're going to hit the "unknown operation name" case here. That's not correct for our reporting protocol: this should be treated as an operation named Foo (ie, the first line of the key should be # Foo).

I think maybe this was something that you were waiting on more apollo-rs support for before you can implement it properly? In that case this should probably be clearly called out via a comment as not currently implementing the right logic.

(If my cursory read of the code is wrong and this will properly get Foo for {"query": "query Foo { __typename }"}, my apologies!)
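The resolution rule described above could be sketched as follows; `resolve_op_name` and `doc_op_names` are hypothetical stand-ins for what a parser such as apollo-rs would eventually provide, not code from this PR:

```rust
// Resolve the operation name for reporting. An explicit out-of-band
// operationName wins; otherwise, a document containing exactly one named
// operation supplies its own name (e.g. {"query": "query Foo { __typename }"}
// resolves to "Foo"). Anything else falls back to the error key.
fn resolve_op_name(explicit: Option<&str>, doc_op_names: &[&str]) -> Option<String> {
    match explicit {
        Some(name) => Some(name.to_string()),
        None if doc_op_names.len() == 1 => Some(doc_op_names[0].to_string()),
        None => None, // ambiguous or anonymous: report as unknown operation name
    }
}
```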

Contributor Author

You are correct. I'll raise an issue to add a comment for this. Is this a problem that could be solved by enhancing the router_bridge interface as you previously described? (I think not, but I'm asking anyway.)

Member

I think it could — IMO it doesn't really make sense to "resolve" the operation name for a request until you've validated the request (because then you can rely on uniqueness of operation names, etc).

    tracing::warn!("Could not parse query: {}", query);
    return GRAPHQL_PARSE_FAILURE.to_string();
}
let doc = ast.document();
Member

Is there a step here that applies the standard GraphQL Validation algorithm? ie, the thing that checks that the query actually works with the graph, and that it has all sorts of necessary internal consistencies? Does this happen elsewhere in the router, or is it something that's waiting on apollo-rs support? ie, that means implementing all the logic in http://spec.graphql.org/draft/#sec-Validation

Anyway, this algorithm requires us to know whether the operation passes validation before checking for operation names. (Otherwise you can't assume things like "there's at most one operation for a given name" or "if there's an anonymous operation then it's the only one".) So that should go here. Though presumably this is less a matter of "we should run the validation operation in the middle of normalize" and more "whatever code in the router parses and validates operations should put the status of those checks somewhere that the normalize code can read".

...

...

ok, I'm now realizing that the way that validation works in router today (confirmed with @abernix) is that we run the graphql validation in graphql-js over the bridge. (or actually... we might not actually run validation at all right now? but if so, that's a bug and the plan I believe is to do it in graphql-js for now. validate runs on version-0.x but not main.)

So I think what you need to do is extend the router bridge API so that the status of GraphQL validation comes back explicitly from the JS query planner, and that that boolean is made available to the telemetry code.

That said... if you're already making this normalize function dependent on something returned from JS... maybe the best short term plan is to just do the entire stats report key calculation in JS. I'd be happy to collaborate on making the stats report key code in apollo-server a bit more accessible to router-bridge code, or honestly copy-pasting it there would be fine too. This would mean you'd be able to have gateway-compatible stats report keys right now before apollo-rs gives you all the tools you need; implementing stats report key generation in Rust can be part of the broader project of porting JS stuff to Rust rather than a release blocker.

Contributor Author

There is no logic here to validate the GraphQL. I think your suggestion to extend the router bridge makes the most sense for this body of problems.
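A hedged sketch of what the extended router-bridge response might carry; the struct and field names here (`PlanningResult` and friends) are illustrative assumptions, not the actual router-bridge API:

```rust
// Hypothetical shape for an extended bridge response: everything strongly
// cacheable about an (operation, schema) pair comes back from the JS
// planner in one place, so telemetry can reuse it.
struct PlanningResult {
    query_plan: String,             // stand-in for the real plan type
    validation_passed: bool,        // result of graphql-js validation
    stats_report_key: String,       // e.g. "# Foo\nquery Foo{__typename}"
    referenced_fields: Vec<String>, // referenced field list for reporting
}

// Telemetry would then pick the report key without re-validating anything.
fn report_key(result: &PlanningResult) -> String {
    if result.validation_passed {
        result.stats_report_key.clone()
    } else {
        "## GraphQLValidationFailure\n".to_string()
    }
}
```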

tracing::debug!("required definitions: {:?}", required_definitions);
if required_definitions.len() != 1 {
    tracing::warn!("Could not find required single definition: {}", query);
    return GRAPHQL_VALIDATION_FAILURE.to_string();
Member

Whether this should be a validation failure or an unknown operation name failure depends on exactly what's going on here, though.

Contributor Author

Right. If we extend the router-bridge as described I presume we'll have the information we need and won't need to do this checking.

apollo-spaceport/src/server.rs (resolved)
    }
    Err(e) => {
        tracing::warn!("attempt: {}, could not transfer: {}", i + 1, e);
        if i == 4 {
Member

probably share this with the loop var

}
}
}
// The compiler can't figure out the exit paths are covered,
Member

maybe can fix this with making the loop a while true and incrementing the retry yourself? Though also maybe there's just a crate that does retry for you. https://docs.rs/retry/latest/retry/ Looks like it has its own exponential backoff plus jitter too.
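The suggested restructuring might look like this sketch; `retry_transfer` is a hypothetical helper (real code would sleep between attempts, and the `retry` crate adds backoff and jitter for you):

```rust
use std::time::Duration;

// One loop variable drives both the retry count and the backoff, and every
// exit path is a `return` inside the loop, so the compiler no longer needs
// an unreachable trailing branch after the loop.
fn retry_transfer<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt: u32 = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(e);
                }
                // Exponential backoff; in real code, sleep for `_delay` here.
                let _delay = Duration::from_millis(100 * 2u64.pow(attempt));
            }
        }
    }
}
```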

// TRIGGER_BATCH_LIMIT is a fairly arbitrary number which indicates
// that we are adding a lot of data for transfer. It is derived
// empirically from load testing.
let total = self.total.fetch_add(1, Ordering::SeqCst);
Member

I'm not sure I follow what this is counting. Overall operations (requests) processed? Given that for now the only thing you're tracking per operation is a duration histogram (no traces yet, right?), I'd suggest that this should be counting the cardinality of unique operations. Having 50 (or whatever) different operations could trigger the size limit but 50 requests on the same operation shouldn't take much more space than 1.

Contributor Author

This number is trying to deal with the more complex scenario of multiple routers all sending data to a single, shared spaceport. It's certainly not an exact science right now (hence the comment about empirical derivation). I think this is something that we can improve over time and with experience from usage. I'll file an issue which tries to capture some of this.

Member

sure, my point is that the concept of "send reports earlier than the regularly scheduled interval" should probably be based on the estimated size of the report, and "number of unique operations" is going to dominate a lot more than "number of operations", since we're aggregating stats!
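The cardinality-based trigger suggested above could be sketched as follows; the names and the threshold value are illustrative, not the router's actual implementation:

```rust
use std::collections::HashMap;

// Illustrative threshold; the real TRIGGER_BATCH_LIMIT is derived empirically.
const UNIQUE_OP_LIMIT: usize = 3;

// Record one request and report whether an early flush is warranted.
// Because stats are aggregated per operation key, 50 requests for the same
// operation take roughly the space of 1, so the trigger counts unique
// operation keys rather than total requests.
fn record_request(stats: &mut HashMap<String, u64>, key: &str) -> bool {
    *stats.entry(key.to_string()).or_insert(0) += 1;
    stats.len() >= UNIQUE_OP_LIMIT
}
```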

        client_version,
    }),
    query_latency_stats: Some(QueryLatencyStats {
        latency_count: dh.buckets,
Member

This doesn't use the run-length encoding that replaces bunches of 0s with negative numbers. That should at least be called out via comment or something.

Contributor Author

I'll file an issue for considering future implementation. Spending too much time in the export() fn exerts back-pressure into the telemetry framework and leads to dropped data spans. Given that the data is ultimately gzipped for transfer, I wasn't entirely sure that this extra processing (in the hot path) was worth performing.

I'll raise an issue to capture this thinking in a comment.

Member

That's an interesting point — I'm not sure we've done any real benchmarking of whether the run-length encoding adds much when using gzip. (Certainly more memory will be used temporarily in building the protobuf but maybe that doesn't matter.)
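For reference, the run-length encoding under discussion could be sketched like this; it follows my reading of apollo-server's DurationHistogram (a run of two or more zero buckets collapses to one negative entry whose absolute value is the run length, and a lone zero stays 0 — that last detail is an assumption):

```rust
// Run-length encode histogram buckets for the reporting protocol.
fn rle_buckets(buckets: &[i64]) -> Vec<i64> {
    let mut out = Vec::new();
    let mut zeros: i64 = 0;
    for &b in buckets {
        if b == 0 {
            zeros += 1;
        } else {
            flush_zeros(&mut out, &mut zeros);
            out.push(b);
        }
    }
    flush_zeros(&mut out, &mut zeros); // trailing zero run
    out
}

// Emit any pending zero run: lone zeros as 0, longer runs as a negative count.
fn flush_zeros(out: &mut Vec<i64>, zeros: &mut i64) {
    match *zeros {
        0 => {}
        1 => out.push(0),
        n => out.push(-n),
    }
    *zeros = 0;
}
```

Whether this pays off after gzip is exactly the open benchmarking question raised above; it does bound the size of the uncompressed protobuf.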

@abernix abernix changed the title garypen/66 usage agent feature: initial studio agent, client awareness and operation reporting Feb 17, 2022
@abernix abernix changed the title feature: initial studio agent, client awareness and operation reporting initial studio agent, client awareness and operation reporting Feb 17, 2022
abernix added a commit that referenced this pull request Feb 17, 2022
Follows up: #445
Follows up: #309

Relates to: #66
Relates to: #444
abernix added a commit that referenced this pull request Feb 17, 2022
Follows up: #445
Follows up: #309

Relates to: #66
Relates to: #444
@abernix abernix added this to the v0.1.0-alpha.5 milestone Feb 18, 2022
tinnou pushed a commit to Netflix-Skunkworks/router that referenced this pull request Oct 16, 2023
goto-bus-stop pushed a commit to goto-bus-stop/router that referenced this pull request Jul 10, 2024
fix: make parent field suffix less ambiguous