
initial studio agent, client awareness and operation reporting #309

Merged
merged 105 commits into from
Feb 8, 2022

Conversation

garypen
Contributor

@garypen garypen commented Jan 7, 2022

Ready for first set of reviews.

Two different implementations:
 - layer based
 - opentelemetry layer based

Also: implement some of the changes for telemetry ADR.
Adapt to recent changes and fix custom implementation.
Automatically extract data from traces and submit to the Apollo
collector. Lots of the data is bogus and missing and probably being
collected from the wrong trace point, but at least it's real...
Clean up the OpenTelemetry implementation to be purely Reporter focussed
and clean up some more of the ideas lingering in the code.

Still a long way from "finished", but enough to at least discuss.
Set a "base layer" which is Apollo Agent tracing and integrate more
completely into the existing OT configuration code.
I've created a usage_agent repo under garypen until I decide where this
should live.
Point at the latest and greatest
Just so I stop getting complaints from CircleCI...
This means the router has both server and client embedded within it.
This is convenient for testing and CircleCI.
Discussed with Jesse and concluded that it was simpler to keep the agent
in the single repo for now. We can always split it out later if it turns
out to be the wrong decision.
And remember to remove my fake error from the windows uname code.
This is enough to get xtask all to pass. I'm not really sure why
encoding_rs wants to be on 0.8.30 on my branch, but ehhh...
- Add gzip compression to submitted traces
- Do some minimal investigation of normalization
- Clean up the interfaces for submitting records
- "Batch" submissions and send them every 5 seconds
- use take instead of replace
- add the Zlib license
- Just make tests pass
Probably still not the right place, but I'll figure that out before
long.

Clean up one of the expect errors.
I'm focussing more on the opentelemetry approach right now, so throw
this away for now.
@garypen garypen changed the title Garypen/66 usage agent garypen/66 usage agent Jan 7, 2022
- levels are better
- also, divide busy_ns by 1_000_000 for lower statistics

Still wrong statistics, but lower values
Because that is already used elsewhere in the router and it does seem to
have a cleaner interface.
That could be the cause of the intermittent errors during circleCI
testing on Linux and Windows...
CircleCI linux tests keep failing because this test doesn't complete...

It's annoying, so maybe adding these sleeps will improve this situation.
Maybe a bug in insta or maybe bad interactions with other macros...
Add some text to the configuration docs about the new configuration
options.
Improved README.md
Add a function to wrap error reporting
I'm not convinced this is a good idea, but I've added a comment to the
code and left it in there for now.

Post-merge, if it stays, it needs to be more sophisticated than just a
HashMap.
Address requirement to allow BSP environment variables to influence the
export batch and queue sizes of apollo telemetry. The implementation
enforces default lower bounds and warns users that settings aren't correct
if problems are detected.
remove expensive debug tracing
to reflect the information we now know about the problem
@garypen garypen merged commit 4f60872 into main Feb 8, 2022
@garypen garypen deleted the garypen/66-usage-agent branch February 8, 2022 13:31
@garypen garypen linked an issue Feb 8, 2022 that may be closed by this pull request
@garypen garypen removed a link to an issue Feb 8, 2022
@garypen
Contributor Author

garypen commented Feb 8, 2022

closes: #413

Member

@glasser glasser left a comment

This is all looking great! I'm glad this got merged.

Usage reporting is a big project so I hope you take most of what I've written below in the spirit of "this needs to be right eventually". Most of the immediate actions to take for these comments are just "make sure you're tracking the need to do X somewhere and/or add a comment about not doing X yet".

My biggest suggestion is that for the medium term (ie, until router has moved away from the graphql-js embedding), you solve some of the issues around generating the report key (and referenced field list) by extending the main router-bridge interface from "returns a query plan" to "returns all the strongly cacheable information that the router needs about the (operation, schema) tuple". Which includes the stats report key, referenced fields list, whether or not validation passed (though that might not be needed if stats report key calculation is in JS), etc. This will mean you can get these calculations consistent with Gateway immediately without having to wait on apollo-rs to give you more tools.

At the very least you'll need to get the "does the validation algorithm pass" boolean from JS because I don't think apollo-rs is anywhere near implementing the 28 validation algorithms defined in the GraphQL spec.

let dh = dh_map
    .entry((client_name, client_version, key.clone()))
    .or_insert_with(|| DurationHistogram::new(None));
dh.increment_duration(span.end_time.duration_since(span.start_time).unwrap(), 1);
Member

I don't have much insight into how you generated these spans in the first place, so I'm curious — are the timestamps monotonic or walltime? ie are they affected by NTP tweaking the current time? In AS I think we do our best to use diffs of monotonic clocks whenever calculating durations (which means that we can end up with timestamps and durations that don't necessarily match up.)

Contributor Author

The spans are generated "auto-magically" by the tracing and opentelemetry framework. They are using SystemTime to record these end and start times and I note that:
https://doc.rust-lang.org/nightly/std/time/struct.SystemTime.html
is clear about the potential for drift. I'll raise an issue to address handling problematic situations (end time < start time basically). I definitely shouldn't just unwrap() the value.
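The defensive handling described above could look like this minimal sketch; `safe_duration` is a hypothetical helper illustrating the idea, not code from this PR:

```rust
use std::time::{Duration, SystemTime};

// SystemTime is not monotonic: an NTP adjustment can make `end` earlier
// than `start`, in which case duration_since returns Err. Fall back to a
// zero duration instead of unwrap()-ing and panicking in the hot path.
fn safe_duration(start: SystemTime, end: SystemTime) -> Duration {
    end.duration_since(start).unwrap_or_default()
}
```

A monotonic clock (`std::time::Instant`) avoids the problem entirely, but the spans here come from the tracing/opentelemetry framework, which records `SystemTime`.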

static GRAPHQL_VALIDATION_FAILURE: &str = "## GraphQLValidationFailure\n";
static GRAPHQL_UNKNOWN_OPERATION_NAME: &str = "## GraphQLUnknownOperationName\n";

fn normalize(op: Option<&opentelemetry::Value>, query: &str) -> String {
Member

I'd highly encourage you to name this function something like "statsReportKey" as that's the name we use for this in our protocol. "normalize" is a very ambiguous name as there are many different operation normalization algorithms in our platform (and basically every time we use it it includes stripping comments, so none of them would yield a stats report key with the comment-ish first line).

Contributor Author

I'll rename it to "stats_report_key". Rust prefers snake case for function names:
https://doc.rust-lang.org/1.0.0/style/style/naming/README.html

fn normalize(op: Option<&opentelemetry::Value>, query: &str) -> String {
    // If we don't have an operation name, we can't do anything useful
    // with this query. Just return the appropriate error.
    let op_name: String = match op {
Member

Can you help me understand where this operation name comes from? It looks like it comes from an otel attribute which I think comes from

operation_name = %request.operation_name.clone().unwrap_or_else(|| "-".to_string()),

My naive interpretation of that code is that this is specifically the out of band operationName explicitly specified in a GraphQL request, ie like {"query": "query Foo { __typename }", "operationName": "Foo"}.

But in the very common case where a named query is provided without an explicit out of band operationName, ie a request like {"query": "query Foo { __typename }"}, my guess is that you're going to hit the "unknown operation name" case here. That's not correct for our reporting protocol: this should be treated as an operation named Foo (ie, the first line of the key should be # Foo).

I think maybe this was something that you were waiting on more apollo-rs support for before you can implement it properly? In that case this should probably be clearly called out via a comment as not currently implementing the right logic.

(If my cursory read of the code is wrong and this will properly get Foo for {"query": "query Foo { __typename }"}, my apologies!)
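The resolution rule described above could be sketched as follows; `resolve_op_name` and `doc_op_names` are hypothetical stand-ins for what a parser such as apollo-rs would eventually provide, not code from this PR:

```rust
// Resolve the operation name for reporting. An explicit out-of-band
// operationName wins; otherwise, a document containing exactly one named
// operation supplies its own name (e.g. {"query": "query Foo { __typename }"}
// resolves to "Foo"). Anything else falls back to the error key.
fn resolve_op_name(explicit: Option<&str>, doc_op_names: &[&str]) -> Option<String> {
    match explicit {
        Some(name) => Some(name.to_string()),
        None if doc_op_names.len() == 1 => Some(doc_op_names[0].to_string()),
        None => None, // ambiguous or anonymous: report as unknown operation name
    }
}
```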

Contributor Author

You are correct. I'll raise an issue to add a comment for this. Is this a problem that could be solved by enhancing the router_bridge interface as you previously described? (I think not, but I'm asking anyway.)

Member

I think it could — IMO it doesn't really make sense to "resolve" the operation name for a request until you've validated the request (because then you can rely on uniqueness of operation names, etc).

    tracing::warn!("Could not parse query: {}", query);
    return GRAPHQL_PARSE_FAILURE.to_string();
}
let doc = ast.document();
Member

Is there a step here that applies the standard GraphQL Validation algorithm? ie, the thing that checks that the query actually works with the graph, and that it has all sorts of necessary internal consistencies? Does this happen elsewhere in the router, or is it something that's waiting on apollo-rs support? ie, that means implementing all the logic in http://spec.graphql.org/draft/#sec-Validation

Anyway, this algorithm requires us to know whether the operation passes validation before checking for operation names. (Otherwise you can't assume things like "there's at most one operation for a given name" or "if there's an anonymous operation then it's the only one".) So that should go here. Though presumably this is less a matter of "we should run the validation operation in the middle of normalize" and more "whatever code in the router parses and validates operations should put the status of those checks somewhere that the normalize code can read".

...

...

ok, I'm now realizing that the way that validation works in router today (confirmed with @abernix) is that we run the graphql validation in graphql-js over the bridge. (or actually... we might not actually run validation at all right now? but if so, that's a bug and the plan I believe is to do it in graphql-js for now. validate runs on version-0.x but not main.)

So I think what you need to do is extend the router bridge API so that the status of GraphQL validation comes back explicitly from the JS query planner, and that that boolean is made available to the telemetry code.

That said... if you're already making this normalize function dependent on something returned from JS... maybe the best short term plan is to just do the entire stats report key calculation in JS. I'd be happy to collaborate on making the stats report key code in apollo-server a bit more accessible to router-bridge code, or honestly copy-pasting it there would be fine too. This would mean you'd be able to have gateway-compatible stats report keys right now before apollo-rs gives you all the tools you need; implementing stats report key generation in Rust can be part of the broader project of porting JS stuff to Rust rather than a release blocker.

Contributor Author

There is no logic here to validate the GraphQL. I think your suggestion to extend the router bridge makes the most sense for this body of problems.
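A hedged sketch of what the extended router-bridge response might carry; the struct and field names here (`PlanningResult` and friends) are illustrative assumptions, not the actual router-bridge API:

```rust
// Hypothetical shape for an extended bridge response: everything strongly
// cacheable about an (operation, schema) pair comes back from the JS
// planner in one place, so telemetry can reuse it.
struct PlanningResult {
    query_plan: String,             // stand-in for the real plan type
    validation_passed: bool,        // result of graphql-js validation
    stats_report_key: String,       // e.g. "# Foo\nquery Foo{__typename}"
    referenced_fields: Vec<String>, // referenced field list for reporting
}

// Telemetry would then pick the report key without re-validating anything.
fn report_key(result: &PlanningResult) -> String {
    if result.validation_passed {
        result.stats_report_key.clone()
    } else {
        "## GraphQLValidationFailure\n".to_string()
    }
}
```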

tracing::debug!("required definitions: {:?}", required_definitions);
if required_definitions.len() != 1 {
    tracing::warn!("Could not find required single definition: {}", query);
    return GRAPHQL_VALIDATION_FAILURE.to_string();
Member

Whether this should be a validation failure or an unknown operation name failure depends on exactly what's going on here, though.

Contributor Author

Right. If we extend the router-bridge as described I presume we'll have the information we need and won't need to do this checking.

apollo-spaceport/src/server.rs (resolved)
    }
    Err(e) => {
        tracing::warn!("attempt: {}, could not transfer: {}", i + 1, e);
        if i == 4 {
Member

probably share this with the loop var

}
}
}
// The compiler can't figure out the exit paths are covered,
Member

maybe can fix this with making the loop a while true and incrementing the retry yourself? Though also maybe there's just a crate that does retry for you. https://docs.rs/retry/latest/retry/ Looks like it has its own exponential backoff plus jitter too.
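The suggested restructuring might look like this sketch; `retry_transfer` is a hypothetical helper (real code would sleep between attempts, and the `retry` crate adds backoff and jitter for you):

```rust
use std::time::Duration;

// One loop variable drives both the retry count and the backoff, and every
// exit path is a `return` inside the loop, so the compiler no longer needs
// an unreachable trailing branch after the loop.
fn retry_transfer<T, E>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt: u32 = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(e);
                }
                // Exponential backoff; in real code, sleep for `_delay` here.
                let _delay = Duration::from_millis(100 * 2u64.pow(attempt));
            }
        }
    }
}
```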

// TRIGGER_BATCH_LIMIT is a fairly arbitrary number which indicates
// that we are adding a lot of data for transfer. It is derived
// empirically from load testing.
let total = self.total.fetch_add(1, Ordering::SeqCst);
Member

I'm not sure I follow what this is counting. Overall operations (requests) processed? Given that for now the only thing you're tracking per operation is a duration histogram (no traces yet, right?), I'd suggest that this should be counting the cardinality of unique operations. Having 50 (or whatever) different operations could trigger the size limit but 50 requests on the same operation shouldn't take much more space than 1.

Contributor Author

This number is trying to deal with the more complex scenario of multiple routers all sending data to a single, shared spaceport. It's certainly not an exact science right now (hence the comment about empirical derivation). I think this is something that we can improve over time and with experience from usage. I'll file an issue which tries to capture some of this.

Member

sure, my point is that the concept of "send reports earlier than the regularly scheduled interval" should probably be based on the estimated size of the report, and "number of unique operations" is going to dominate a lot more than "number of operations", since we're aggregating stats!
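The cardinality-based trigger suggested above could be sketched as follows; the names and the threshold value are illustrative, not the router's actual implementation:

```rust
use std::collections::HashMap;

// Illustrative threshold; the real TRIGGER_BATCH_LIMIT is derived empirically.
const UNIQUE_OP_LIMIT: usize = 3;

// Record one request and report whether an early flush is warranted.
// Because stats are aggregated per operation key, 50 requests for the same
// operation take roughly the space of 1, so the trigger counts unique
// operation keys rather than total requests.
fn record_request(stats: &mut HashMap<String, u64>, key: &str) -> bool {
    *stats.entry(key.to_string()).or_insert(0) += 1;
    stats.len() >= UNIQUE_OP_LIMIT
}
```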

        client_version,
    }),
    query_latency_stats: Some(QueryLatencyStats {
        latency_count: dh.buckets,
Member

This doesn't use the run-length encoding that replaces bunches of 0s with negative numbers. That should at least be called out via comment or something.

Contributor Author

I'll file an issue for considering future implementation. Spending too much time in the export() fn exerts back-pressure into the telemetry framework and leads to dropped data spans. Given that the data is ultimately gzipped for transfer, I wasn't entirely sure that this extra processing (in the hot path) was worth performing.

I'll raise an issue to capture this thinking in a comment.

Member

That's an interesting point — I'm not sure we've done any real benchmarking of whether the run-length encoding adds much when using gzip. (Certainly more memory will be used temporarily in building the protobuf but maybe that doesn't matter.)
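For reference, the run-length encoding under discussion could be sketched like this; it follows my reading of apollo-server's DurationHistogram (a run of two or more zero buckets collapses to one negative entry whose absolute value is the run length, and a lone zero stays 0 — that last detail is an assumption):

```rust
// Run-length encode histogram buckets for the reporting protocol.
fn rle_buckets(buckets: &[i64]) -> Vec<i64> {
    let mut out = Vec::new();
    let mut zeros: i64 = 0;
    for &b in buckets {
        if b == 0 {
            zeros += 1;
        } else {
            flush_zeros(&mut out, &mut zeros);
            out.push(b);
        }
    }
    flush_zeros(&mut out, &mut zeros); // trailing zero run
    out
}

// Emit any pending zero run: lone zeros as 0, longer runs as a negative count.
fn flush_zeros(out: &mut Vec<i64>, zeros: &mut i64) {
    match *zeros {
        0 => {}
        1 => out.push(0),
        n => out.push(-n),
    }
    *zeros = 0;
}
```

Whether this pays off after gzip is exactly the open benchmarking question raised above; it does bound the size of the uncompressed protobuf.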

@abernix abernix changed the title garypen/66 usage agent feature: initial studio agent, client awareness and operation reporting Feb 17, 2022
@abernix abernix changed the title feature: initial studio agent, client awareness and operation reporting initial studio agent, client awareness and operation reporting Feb 17, 2022
abernix added a commit that referenced this pull request Feb 17, 2022
Follows up: #445
Follows up: #309

Relates to: #66
Relates to: #444
abernix added a commit that referenced this pull request Feb 17, 2022
Follows up: #445
Follows up: #309

Relates to: #66
Relates to: #444
@abernix abernix added this to the v0.1.0-alpha.5 milestone Feb 18, 2022
tinnou pushed a commit to Netflix-Skunkworks/router that referenced this pull request Oct 16, 2023
goto-bus-stop pushed a commit to goto-bus-stop/router that referenced this pull request Jul 10, 2024
fix: make parent field suffix less ambiguous