feat: Usage Reporting. #898

Merged: 70 commits, May 9, 2022

Conversation

@o0Ignition0o (Contributor) commented Apr 22, 2022:

feat: Usage Reporting.

Fixes #67 and #795.

It is part of a fix for #373.

This pull request adds advanced Usage Reporting capabilities, such as:

  • more reliable statsReportKey generation (now identical to the Apollo Gateway's)
  • an accurate operation_count (no sampling required)
  • first steps towards the ability to exclude operations from usage reporting (only operation_count will be sent to Uplink)
  • referenced fields, visible in Apollo Studio's Fields section:

[screenshot: referenced fields in Apollo Studio's Fields section]


apollo-router-core/src/json_ext.rs (review thread)
apollo-router/src/plugins/telemetry/mod.rs (review thread)
@garypen (Contributor) left a comment:

In general, this looks great!

I have some reservations about how dashmap will perform, but they should be easy to dispel with performance testing.

apollo-router/Cargo.toml (review thread)
apollo-router-core/src/query_planner/mod.rs (review thread)
apollo-router-core/src/query_planner/mod.rs (review thread)
apollo-router/src/plugins/telemetry/mod.rs (review thread)

// Planner errors return a stats report key that starts with `## `,
// while successful planning returns one that starts with `# `.
fn operation_count(stats_report_key: &str) -> u64 {
Member commented:

Good catch with this function, which is definitely correct.
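
For reference, a minimal sketch of what this function could look like, inferred purely from the comment in the excerpt above (only the signature and comment appear in the diff; the body here is an assumption, not the PR's code):

    // Assumed sketch: a request whose planning failed (key starts with
    // "## ") contributes no operation; everything else counts as one.
    fn operation_count(stats_report_key: &str) -> u64 {
        if stats_report_key.starts_with("## ") {
            0
        } else {
            1
        }
    }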

@BrynCooke requested review from Geal and garypen on May 9, 2022.
@garypen (Contributor) left a comment:

NIT around docs. Feel free to ignore.

NEXT_CHANGELOG.md (review thread)
Co-authored-by: Gary Pennington <gary@apollographql.com>
@glasser (Member) left a comment:

This is really awesome work. It's so cool to see this coming together!

pub(crate) latency_count: Duration,
pub(crate) request_count: u64,
pub(crate) cache_hits: u64,
pub(crate) persisted_query_hits: u64,
Member commented:

I believe that Router does implement APQs, but it doesn't currently set this field or persisted_query_misses, right? Is that being tracked?

@abernix (Member) commented May 11, 2022:

Now tracked in #1014.

pub(crate) struct QueryLatencyStats {
pub(crate) latency_count: Duration,
pub(crate) request_count: u64,
pub(crate) cache_hits: u64,
Member commented:

Probably worth calling out (in comments somewhere?) that some (most) of these fields are always zero for now. Specifically I think these features aren't hooked up:

  • caching (cache_hits, cache_latency_count, public/private_cache_ttl_count): not (yet) a Router feature
  • op registry (registered_operation_count, forbidden_operation_count): not (yet) a Router feature
  • error stats (root_error_stats, requests_with_errors_count): Router probably does have the information to calculate these but isn't doing so yet

(Potentially they could just be left out of the code for now? Protobuf messages don't need to contain all fields.)

@BrynCooke (Contributor) commented May 11, 2022:

The long-term intention is that we add this stuff, so let's create tickets for it. I would not want to delete the code only to have to recreate it when we get round to it. Tickets added to track the rest of the stats:

#1009 #1010 #1011

Member commented:

Sure, though just to be clear: the first two are not just "we need to add these stats" but "we would need to first implement the features that they track" (if we decide they are worth implementing; e.g., the current implementation of operation-registry-based safelisting is not particularly ideal).

pub(crate) private_cache_ttl_count: Duration,
pub(crate) registered_operation_count: u64,
pub(crate) forbidden_operation_count: u64,
pub(crate) requests_without_field_instrumentation: u64,
Member commented:

I would suggest that you set this equal to request_count for now! In a router-y context this counts requests where the router doesn't send apollo-federation-include-trace:ftv1 to its subgraphs... which for now is every request. (Now, I think this field isn't really used for anything real by Studio at the moment, but being accurate should be pretty easy.)
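
A tiny illustration of that suggestion, using an assumed stand-in type rather than the PR's real struct:

    // Stand-in type for illustration only.
    struct QueryLatencyStats {
        request_count: u64,
        requests_without_field_instrumentation: u64,
    }

    // Until the router forwards apollo-federation-include-trace: ftv1 to
    // its subgraphs, no request carries field instrumentation, so the two
    // counters coincide.
    fn finish(stats: &mut QueryLatencyStats) {
        stats.requests_without_field_instrumentation = stats.request_count;
    }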


}

#[derive(Default, Debug, Serialize)]
pub(crate) struct QueryLatencyStats {
Member commented:

I find it hard to understand the semantics of this struct. The duration-y fields have only one Duration (not a histogram) so it must represent a single request. So it seems like all of the u64 fields should really just be bools? (And the name latency_count doesn't really make sense either.)

And naming-wise, it's a bit confusing that the struct AggregatedQueryLatencyStats corresponds to the proto message QueryLatencyStats. Would it make more sense to rename AggregatedQueryLatencyStats to QueryLatencyStats and this one to like SingleQueryLatencyStats or something?

Contributor commented:

latency_count mirrors the protobuf format; can you elaborate on why this should be converted to bools?

Member commented:

What's confusing is that you've changed the type of latency_count from DurationHistogram (a thing that can represent multiple requests) to Duration (a thing that can only represent a single request) but you've left most of the other fields which count requests as numbers. I don't think there's any case you would ever set those numbers to something greater than 1 because any such case would also require you to have more than one "duration".
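
To make the suggestion concrete, here is a sketch of the shape it points towards (field set abridged; all names are assumed from the discussion, not taken from the PR):

    use std::time::Duration;

    // Stand-in histogram type for illustration.
    struct DurationHistogram {
        buckets: Vec<i64>,
    }

    // Per-request stats: a single Duration, and bools instead of counters,
    // since one request either hit the cache or didn't.
    struct SingleQueryLatencyStats {
        latency: Duration,
        cache_hit: bool,
        persisted_query_hit: bool,
    }

    // Aggregated stats mirror the QueryLatencyStats proto message:
    // a histogram of latencies plus request counters.
    struct QueryLatencyStats {
        latency_count: DurationHistogram,
        request_count: u64,
        cache_hits: u64,
        persisted_query_hits: u64,
    }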

apollo-spaceport/src/server.rs (review thread)
}
}

// The TS implementation of DurationHistogram does Run Length Encoding (RLE)
Member commented:

So, funny story. I thought it would be nice to at least strip trailing zeroes even if we don't batch up leading and internal batches of 0s. I thought this might even be a nice opportunity for me to learn how to run Rust tests and write my own first tiny patch.

And actually it's very easy. The initialization of buckets below can just be

            buckets: Vec::with_capacity(init_size.unwrap_or(DurationHistogram::DEFAULT_SIZE)),

After all, JS doesn't differentiate between length and capacity so its "pre-allocation" needs to actually put 0s in the array, and Rust can do better.

But then I noticed the tests still passed and realized why. In practice, 74 zeroed buckets only gets you up to the 1.15 ms bucket, so my change doesn't really do anything unless you're making lots of DurationHistograms representing entirely sub-ms durations. Since you're currently only creating DurationHistograms for whole operations (not individual fields), this is unlikely to be common.
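
In context, the patch being described amounts to something like this (constructor shape assumed; the with_capacity line and the 74-bucket default are taken from the comment above):

    struct DurationHistogram {
        buckets: Vec<i64>,
    }

    impl DurationHistogram {
        const DEFAULT_SIZE: usize = 74;

        fn new(init_size: Option<usize>) -> Self {
            DurationHistogram {
                // Reserve capacity without pushing zeroes: unlike a JS
                // array, a Rust Vec distinguishes length from capacity,
                // so trailing zero buckets are never materialized.
                buckets: Vec::with_capacity(init_size.unwrap_or(DurationHistogram::DEFAULT_SIZE)),
            }
        }
    }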

*self.per_type_stat.entry(k).or_default() += v;
}
for (k, v) in metrics.referenced_fields_by_type {
// Merging is not required because metrics are always grouped by schema and query.
Member commented:

I'm not sure I understand what's going on here. I understand that each AggregatedMetrics object goes with a specific operation, so it makes sense that you don't need to merge the referenced fields object. But it seems like you are doing some sort of merging anyway: you are iterating over the types and doing a merge at that level. So you might still end up with some sort of merged thing if you have two different calls that give you different sets of types. (k is the type name.) Also just doesn't seem like you should need to do O(n) work here — either you have referenced_fields_by_type already and you don't need to do anything, or you don't and you need to put it in.

Maybe referenced_fields_by_type has to be some sort of optional object. Or, instead of starting with an "empty" AggregatedMetrics and adding one Metrics at a time, you could start with an AggregatedMetrics created from the first Metrics for a given operation, and then skip over referenced_fields_by_type when you later add_assign to it. (This suggestion does run into an issue related to how your STUDIO_EXCLUDE feature works, but I have a separate comment about that.)
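
A hedged sketch of that second option, with assumed stand-in types (not the PR's actual definitions):

    use std::collections::HashMap;
    use std::ops::AddAssign;

    struct Metrics {
        request_count: u64,
        referenced_fields_by_type: HashMap<String, Vec<String>>,
    }

    struct AggregatedMetrics {
        request_count: u64,
        referenced_fields_by_type: HashMap<String, Vec<String>>,
    }

    impl From<Metrics> for AggregatedMetrics {
        fn from(first: Metrics) -> Self {
            AggregatedMetrics {
                request_count: first.request_count,
                // Taken once from the first sample; it is identical for
                // every sample of a given (schema, query) pair.
                referenced_fields_by_type: first.referenced_fields_by_type,
            }
        }
    }

    impl AddAssign<Metrics> for AggregatedMetrics {
        fn add_assign(&mut self, other: Metrics) {
            self.request_count += other.request_count;
            // referenced_fields_by_type is deliberately skipped here:
            // no per-type iteration, no merging.
        }
    }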


.get::<_, UsageReporting>(USAGE_REPORTING)
.unwrap_or_default()
{
let exclude: Option<bool> = context.get(STUDIO_EXCLUDE).unwrap_or_default();
Member commented:

So, I have a concern about the way you've implemented the STUDIO_EXCLUDE feature, which I think might actually require a little bit of refactoring to deal with. That said, it looks to me like this feature is not fully finished (ie I don't see any code that sets this context variable, though maybe I'm confused about how it works). I do see that #373 is closed though (cc @o0Ignition0o @BrynCooke who talked there recently).

The initial goal of the equivalent feature in Apollo Server is primarily performance — removing the impact on the server of certain kinds of requests. For example, some users use this to just reduce the number of operations that the usage reporting mechanism has to pay attention to (which was far worse in a previous iteration of the usage reporting protocol that only sent us traces without any aggregated stats). Others use it to filter out bad/spammy operations that would lead to a huge explosion in cardinality.

So for that purpose, it's kinda important that if you decide a given operation shouldn't be reported, you don't end up hanging on to its report key, using it as a key in internal hashes, etc. Unfortunately, the below looks like you still do that. I'd encourage you to find a way to send information between the various components of your system that is not entirely sorted by stats_report_key: specifically, a way to (just like in the top-level usage report message itself!) send just an operation count without associating it with particular data about the operation (especially the stats_report_key, which could be enormous).

So right now your structure is that metrics::apollo::Metrics always has stats_report_key, client_name, and client_version, and sometimes has referenced_fields_by_type and query_latency_stats. I would argue that the first three need to be pushed into the "sometimes" as well, and that operation_count should be aggregated as its own thing outside the (per-operation) AggregatedMetrics structure. (On the bright side, this resolves a concern I raised in another comment about merging referenced_fields_by_type: if every time you have a stats_report_key you also have referenced_fields_by_type, then there's less of a worry around "merging".) This "ability to send general information that isn't attached to a specific stats report key" probably has to extend to other parts of the system, like your submit_stats/add_stats/etc functions. The whole point of the operation_count field is that it's a single number we can easily track without dividing it up by client info and specific operation; I think you should refactor until it's basically not found anywhere divided up, only at top levels.

Again this doesn't really matter until you make STUDIO_EXCLUDE into an accessible feature... but one of the main goals of that feature is not achieved if you let "large numbers of large and distinct operations" take up memory.

Actually, looking at the one test related to this shows my point — lemme comment over on the snapshot too.
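
A rough sketch of the separation being argued for (all names assumed): the top-level operation count is tracked on its own, and per-operation data exists only for non-excluded requests.

    use std::collections::HashMap;

    struct PerOperationStats {
        request_count: u64,
    }

    struct AggregatedReport {
        // Counts every operation, including excluded ones.
        operation_count: u64,
        // Keyed by stats_report_key; excluded operations never appear
        // here, so their (possibly enormous) keys are never retained.
        per_operation: HashMap<String, PerOperationStats>,
    }

    impl AggregatedReport {
        fn record(&mut self, stats_report_key: &str, excluded: bool) {
            self.operation_count += 1;
            if !excluded {
                self.per_operation
                    .entry(stats_report_key.to_string())
                    .or_insert(PerOperationStats { request_count: 0 })
                    .request_count += 1;
            }
        }
    }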

"secs": 0,
"nanos": 0
},
"request_count": 0,
Member commented:

So I have a long comment about the STUDIO_EXCLUDE feature elsewhere, but looking here: yeah, in addition to the overall concept of minimizing the in-memory impact of excluded operations, this snapshot shows that the feature isn't actually achieving the goal. The main point of STUDIO_EXCLUDE is not to decrease the numbers shown for a given operation, but to leave the operation out of reporting entirely. Maybe it's some weird spam thing, maybe it has secrets in the text somehow that aren't stripped, who knows. But a definite requirement for such a feature is that if all requests with a given stats report key are excluded, we shouldn't include anything about that stats_report_key in the report at all (other than the number of operations contributing to the single top-level operation_count).

I realize this snapshot is just of an intermediate data structure, not of the actual report sent to Studio. But my read of this code in apollo-spaceport/src/server.rs

        let entry = query_usage
            .traces_and_stats_for_key
            .entry(record.key())
            .or_insert_with(|| record.get_traces_and_stats());

suggests that we don't (when constructing the final report) filter out keys whose values have no requests. So this doesn't really implement exclusion, unless I'm missing something (I didn't actually try to run it with exclusion on).
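
If that reading is right, the missing step is roughly the following (type and field names are assumed, not the PR's actual definitions):

    use std::collections::HashMap;

    // Stand-in for the per-key value in traces_and_stats_for_key.
    struct TracesAndStats {
        request_count: u64,
    }

    // Before building the final report, drop report keys for which every
    // request was excluded, so nothing about them leaves the router.
    fn prune_excluded(per_key: &mut HashMap<String, TracesAndStats>) {
        per_key.retain(|_key, stats| stats.request_count > 0);
    }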


tracing::debug!("containing: {:?}", graph_usage);
for (graph, query_usage) in drained {
tracing::debug!("submitting: {} operations", query_usage.operation_count);
tracing::debug!("containing: {:?}", query_usage);
match crate::Report::try_new(&graph.reference) {
Member commented:

This appears to create a Report whose header only has the graph_ref field set.

Notably, it doesn't set the executable_schema_id field. This field should be set to the hex sha256 of the text of the supergraph schema, I believe. I have to admit I'm not 100% sure what aspects of our system actually depend on this being sent but it sure seems like a good idea.

This raises the broader point — it doesn't look like in general you're tying "the current schema" to any of this reporting stuff. I know that you implement Uplink with polling and there's a reload_server thing that happens when the schema changes, but I'm not sure how much of the system is "reloaded" when that happens. Also my understanding is that the model is that the Spaceport thing can run sorta separately from the main server. So it seems like a Spaceport could potentially get different messages about operations run against different schema versions?

In addition to just making sure that header.executable_schema_id is set correctly, the referenced_fields_by_type data structure depends on the schema. I know I've said (and you've implemented) "this structure is super cacheable, it only depends on the operation and the schema" but ... it does depend on the schema :) It doesn't look like the messages in agents.proto talk about the schema version.

So you could have the schema type A { y: ID } type Query { x: A } and the operation {x{y}}. This operation has referenced fields Query.x and A.y. Now you update to the schema type B { y: ID } type Query {x: B}. The same operation now has referenced fields Query.x and B.y. So it's not kosher to combine these two into the same message, as the referenced_fields data will be wrong for one of them!
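
For concreteness, computing that identifier would look something like this (a sketch assuming the sha2 and hex crates; the function name is made up):

    use sha2::{Digest, Sha256};

    // Hex-encoded SHA-256 of the supergraph schema text, as suggested
    // above for the report header's executable_schema_id field.
    fn executable_schema_id(supergraph_sdl: &str) -> String {
        let mut hasher = Sha256::new();
        hasher.update(supergraph_sdl.as_bytes());
        hex::encode(hasher.finalize())
    }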


Contributor commented:

Just adding a little context here. We need to add executable_schema_id to our payloads, but we don't need to worry about cross-contamination on schema reload. The entire plugin pipeline is recreated when that happens, so we can guarantee that a particular instance will only ever be dealing with one schema.

This would not be the case in any sort of multi-tenancy scenario, but that is something to think about later.

Member commented:

That's great. And that includes the "spaceport"?
