feat: Usage Reporting. #898

Merged: 70 commits, May 9, 2022

Conversation

@o0Ignition0o (Contributor) commented Apr 22, 2022:

feat: Usage Reporting.

Fixes #67 and #795.

It is part of a fix for #373.

This pull request adds advanced Usage Reporting capabilities, such as:

  • more reliable statsReportKey generation (now identical to the Apollo Gateway's)
  • an accurate operation_count (no sampling required)
  • first steps towards the ability to exclude operations from usage reporting (only operation_count will be sent to Uplink)
  • referenced fields, visible in Apollo Studio's Fields section:

[screenshot: referenced fields in Apollo Studio's Fields section]


apollo-router-core/src/json_ext.rs (review thread)
apollo-router/src/plugins/telemetry/mod.rs (review thread)
@garypen (Contributor) left a comment:

In general, this looks great!

I have some reservations about how dashmap will perform, but they should be easy to dispel with performance testing.

apollo-router/Cargo.toml (review thread)
apollo-router-core/src/query_planner/mod.rs (review thread)
apollo-router-core/src/query_planner/mod.rs (review thread)
apollo-router/src/plugins/telemetry/mod.rs (review thread)

// Planner errors return a stats report key that starts with `## `,
// while successful planning returns one that starts with `# `.
fn operation_count(stats_report_key: &str) -> u64 {
Member commented:

Good catch with this function, which is definitely correct.
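
For reference, a minimal sketch of what this function could look like, inferred purely from the comment in the excerpt above (only the signature and comment appear in the diff; the body here is an assumption, not the PR's code):

    // Assumed sketch: a request whose planning failed (key starts with
    // "## ") contributes no operation; everything else counts as one.
    fn operation_count(stats_report_key: &str) -> u64 {
        if stats_report_key.starts_with("## ") {
            0
        } else {
            1
        }
    }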

@BrynCooke requested review from Geal and garypen on May 9, 2022.
@garypen (Contributor) left a comment:

NIT around docs. Feel free to ignore.

NEXT_CHANGELOG.md (review thread)
Co-authored-by: Gary Pennington <gary@apollographql.com>
@glasser (Member) left a comment:

This is really awesome work. It's so cool to see this coming together!

pub(crate) latency_count: Duration,
pub(crate) request_count: u64,
pub(crate) cache_hits: u64,
pub(crate) persisted_query_hits: u64,
Member commented:

I believe that Router does implement APQs, but it doesn't currently set this field or persisted_query_misses, right? Is that being tracked?

@abernix (Member) commented May 11, 2022:

Now tracked in #1014.

pub(crate) struct QueryLatencyStats {
pub(crate) latency_count: Duration,
pub(crate) request_count: u64,
pub(crate) cache_hits: u64,
Member commented:

Probably worth calling out (in comments somewhere?) that some (most) of these fields are always zero for now. Specifically I think these features aren't hooked up:

  • caching (cache_hits, cache_latency_count, public/private_cache_ttl_count): not (yet) a Router feature
  • op registry (registered_operation_count, forbidden_operation_count): not (yet) a Router feature
  • error stats (root_error_stats, requests_with_errors_count): Router probably does have the information to calculate these but isn't doing so yet

(Potentially they could just be left out of the code for now? Protobuf messages don't need to contain all fields.)

@BrynCooke (Contributor) commented May 11, 2022:

The long-term intention is that we add this stuff, so let's create tickets for it. I would not want to delete the code only to have to recreate it when we get round to it. Tickets added to track the rest of the stats:

#1009 #1010 #1011

Member commented:

Sure, though just to be clear: the first two are not just "we need to add these stats" but "we would need to first implement the features that they track" (if we decide they are worth implementing; e.g., the current implementation of operation-registry-based safelisting is not particularly ideal).

pub(crate) private_cache_ttl_count: Duration,
pub(crate) registered_operation_count: u64,
pub(crate) forbidden_operation_count: u64,
pub(crate) requests_without_field_instrumentation: u64,
Member commented:

I would suggest that you set this equal to request_count for now! In a router-y context this counts requests where the router doesn't send apollo-federation-include-trace:ftv1 to its subgraphs... which for now is every request. (Now, I think this field isn't really used for anything real by Studio at the moment, but being accurate should be pretty easy.)
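
A tiny illustration of that suggestion, using an assumed stand-in type rather than the PR's real struct:

    // Stand-in type for illustration only.
    struct QueryLatencyStats {
        request_count: u64,
        requests_without_field_instrumentation: u64,
    }

    // Until the router forwards apollo-federation-include-trace: ftv1 to
    // its subgraphs, no request carries field instrumentation, so the two
    // counters coincide.
    fn finish(stats: &mut QueryLatencyStats) {
        stats.requests_without_field_instrumentation = stats.request_count;
    }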


}

#[derive(Default, Debug, Serialize)]
pub(crate) struct QueryLatencyStats {
Member commented:

I find it hard to understand the semantics of this struct. The duration-y fields have only one Duration (not a histogram) so it must represent a single request. So it seems like all of the u64 fields should really just be bools? (And the name latency_count doesn't really make sense either.)

And naming-wise, it's a bit confusing that the struct AggregatedQueryLatencyStats corresponds to the proto message QueryLatencyStats. Would it make more sense to rename AggregatedQueryLatencyStats to QueryLatencyStats and this one to like SingleQueryLatencyStats or something?

Contributor commented:

latency_count mirrors the protobuf format; can you elaborate on why this should be converted to bools?

Member commented:

What's confusing is that you've changed the type of latency_count from DurationHistogram (a thing that can represent multiple requests) to Duration (a thing that can only represent a single request) but you've left most of the other fields which count requests as numbers. I don't think there's any case you would ever set those numbers to something greater than 1 because any such case would also require you to have more than one "duration".
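
To make the suggestion concrete, here is a sketch of the shape it points towards (field set abridged; all names are assumed from the discussion, not taken from the PR):

    use std::time::Duration;

    // Stand-in histogram type for illustration.
    struct DurationHistogram {
        buckets: Vec<i64>,
    }

    // Per-request stats: a single Duration, and bools instead of counters,
    // since one request either hit the cache or didn't.
    struct SingleQueryLatencyStats {
        latency: Duration,
        cache_hit: bool,
        persisted_query_hit: bool,
    }

    // Aggregated stats mirror the QueryLatencyStats proto message:
    // a histogram of latencies plus request counters.
    struct QueryLatencyStats {
        latency_count: DurationHistogram,
        request_count: u64,
        cache_hits: u64,
        persisted_query_hits: u64,
    }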

apollo-spaceport/src/server.rs (review thread)
}
}

// The TS implementation of DurationHistogram does Run Length Encoding (RLE)
Member commented:

So, funny story. I thought it would be nice to at least strip trailing zeroes even if we don't batch up leading and internal batches of 0s. I thought this might even be a nice opportunity for me to learn how to run Rust tests and write my own first tiny patch.

And actually it's very easy. The initialization of buckets below can just be

            buckets: Vec::with_capacity(init_size.unwrap_or(DurationHistogram::DEFAULT_SIZE)),

After all, JS doesn't differentiate between length and capacity so its "pre-allocation" needs to actually put 0s in the array, and Rust can do better.

But then I noticed the tests still passed and realized why. In practice, 74 zeroed buckets only gets you up to the 1.15 ms bucket, so my change doesn't really do anything unless you're making lots of DurationHistograms representing entirely sub-ms durations. Since you're currently only creating DurationHistograms for whole operations (not individual fields), this is unlikely to be common.
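
In context, the patch being described amounts to something like this (constructor shape assumed; the with_capacity line and the 74-bucket default are taken from the comment above):

    struct DurationHistogram {
        buckets: Vec<i64>,
    }

    impl DurationHistogram {
        const DEFAULT_SIZE: usize = 74;

        fn new(init_size: Option<usize>) -> Self {
            DurationHistogram {
                // Reserve capacity without pushing zeroes: unlike a JS
                // array, a Rust Vec distinguishes length from capacity,
                // so trailing zero buckets are never materialized.
                buckets: Vec::with_capacity(init_size.unwrap_or(DurationHistogram::DEFAULT_SIZE)),
            }
        }
    }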

*self.per_type_stat.entry(k).or_default() += v;
}
for (k, v) in metrics.referenced_fields_by_type {
// Merging is not required because metrics are always grouped by schema and query.
Member commented:

I'm not sure I understand what's going on here. I understand that each AggregatedMetrics object goes with a specific operation, so it makes sense that you don't need to merge the referenced fields object. But it seems like you are doing some sort of merging anyway: you are iterating over the types and doing a merge at that level. So you might still end up with some sort of merged thing if you have two different calls that give you different sets of types. (k is the type name.) Also just doesn't seem like you should need to do O(n) work here — either you have referenced_fields_by_type already and you don't need to do anything, or you don't and you need to put it in.

Maybe referenced_fields_by_type has to be some sort of optional object. Or, instead of starting with an "empty" AggregatedMetrics and adding one Metrics at a time, you could start with an AggregatedMetrics created from the first Metrics for a given operation, and then skip over referenced_fields_by_type when you later add_assign to it. (This suggestion does run into an issue related to how your STUDIO_EXCLUDE feature works, but I have a separate comment about that.)
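
A hedged sketch of that second option, with assumed stand-in types (not the PR's actual definitions):

    use std::collections::HashMap;
    use std::ops::AddAssign;

    struct Metrics {
        request_count: u64,
        referenced_fields_by_type: HashMap<String, Vec<String>>,
    }

    struct AggregatedMetrics {
        request_count: u64,
        referenced_fields_by_type: HashMap<String, Vec<String>>,
    }

    impl From<Metrics> for AggregatedMetrics {
        fn from(first: Metrics) -> Self {
            AggregatedMetrics {
                request_count: first.request_count,
                // Taken once from the first sample; it is identical for
                // every sample of a given (schema, query) pair.
                referenced_fields_by_type: first.referenced_fields_by_type,
            }
        }
    }

    impl AddAssign<Metrics> for AggregatedMetrics {
        fn add_assign(&mut self, other: Metrics) {
            self.request_count += other.request_count;
            // referenced_fields_by_type is deliberately skipped here:
            // no per-type iteration, no merging.
        }
    }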


.get::<_, UsageReporting>(USAGE_REPORTING)
.unwrap_or_default()
{
let exclude: Option<bool> = context.get(STUDIO_EXCLUDE).unwrap_or_default();
Member commented:

So, I have a concern about the way you've implemented the STUDIO_EXCLUDE feature, which I think might actually require a little bit of refactoring to deal with. That said, it looks to me like this feature is not fully finished (ie I don't see any code that sets this context variable, though maybe I'm confused about how it works). I do see that #373 is closed though (cc @o0Ignition0o @BrynCooke who talked there recently).

The initial goal of the equivalent feature in Apollo Server is primarily performance — removing the impact on the server of certain kinds of requests. For example, some users use this to just reduce the number of operations that the usage reporting mechanism has to pay attention to (which was far worse in a previous iteration of the usage reporting protocol that only sent us traces without any aggregated stats). Others use it to filter out bad/spammy operations that would lead to a huge explosion in cardinality.

So for that purpose, it's kinda important that if you decide a given operation shouldn't be reported, you don't end up hanging on to its report key, using it as a key in internal hashes, etc. Unfortunately, the below looks like you still do that. I'd encourage you to find a way to send information between the various components of your system that is not entirely sorted by stats_report_key: specifically, a way to (just like in the top-level usage report message itself!) send just an operation count without associating it with particular data about the operation (especially the stats_report_key, which could be enormous).

So right now your structure is that metrics::apollo::Metrics always has stats_report_key, client_name, and client_version, and sometimes has referenced_fields_by_type and query_latency_stats. I would argue that the first three need to be pushed into the "sometimes" as well, and that operation_count should be aggregated as its own thing outside the (per-operation) AggregatedMetrics structure. (On the bright side, this resolves a concern I raised in another comment about merging referenced_fields_by_type: if every time you have a stats_report_key you also have referenced_fields_by_type, then there's less of a worry around "merging".) This "ability to send general information that isn't attached to a specific stats report key" probably has to extend to other parts of the system, like your submit_stats/add_stats/etc functions. The whole point of the operation_count field is that it's a single number we can easily track without dividing it up by client info and specific operation; I think you should refactor until it's basically not found anywhere divided up, only at top levels.

Again this doesn't really matter until you make STUDIO_EXCLUDE into an accessible feature... but one of the main goals of that feature is not achieved if you let "large numbers of large and distinct operations" take up memory.

Actually, looking at the one test related to this shows my point — lemme comment over on the snapshot too.
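
A rough sketch of the separation being argued for (all names assumed): the top-level operation count is tracked on its own, and per-operation data exists only for non-excluded requests.

    use std::collections::HashMap;

    struct PerOperationStats {
        request_count: u64,
    }

    struct AggregatedReport {
        // Counts every operation, including excluded ones.
        operation_count: u64,
        // Keyed by stats_report_key; excluded operations never appear
        // here, so their (possibly enormous) keys are never retained.
        per_operation: HashMap<String, PerOperationStats>,
    }

    impl AggregatedReport {
        fn record(&mut self, stats_report_key: &str, excluded: bool) {
            self.operation_count += 1;
            if !excluded {
                self.per_operation
                    .entry(stats_report_key.to_string())
                    .or_insert(PerOperationStats { request_count: 0 })
                    .request_count += 1;
            }
        }
    }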

"secs": 0,
"nanos": 0
},
"request_count": 0,
Member commented:

So I have a long comment about the STUDIO_EXCLUDE feature elsewhere, but looking here: yeah, in addition to the overall concept of minimizing the in-memory impact of excluded operations, this snapshot shows that the feature isn't actually achieving the goal. The main point of STUDIO_EXCLUDE is not to decrease the numbers shown for a given operation, but to leave the operation out of reporting entirely. Maybe it's some weird spam thing, maybe it has secrets in the text somehow that aren't stripped, who knows. But a definite requirement for such a feature is that if all requests with a given stats report key are excluded, we shouldn't include anything about that stats_report_key in the report at all (other than the number of operations contributing to the single top-level operation_count).

I realize this snapshot is just of an intermediate data structure, not of the actual report sent to Studio. But my read of this code in apollo-spaceport/src/server.rs

        let entry = query_usage
            .traces_and_stats_for_key
            .entry(record.key())
            .or_insert_with(|| record.get_traces_and_stats());

suggests that we don't (when constructing the final report) filter out keys whose values have no requests. So this doesn't really implement exclusion, unless I'm missing something (I didn't actually try to run it with exclusion on).
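
If that reading is right, the missing step is roughly the following (type and field names are assumed, not the PR's actual definitions):

    use std::collections::HashMap;

    // Stand-in for the per-key value in traces_and_stats_for_key.
    struct TracesAndStats {
        request_count: u64,
    }

    // Before building the final report, drop report keys for which every
    // request was excluded, so nothing about them leaves the router.
    fn prune_excluded(per_key: &mut HashMap<String, TracesAndStats>) {
        per_key.retain(|_key, stats| stats.request_count > 0);
    }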


tracing::debug!("containing: {:?}", graph_usage);
for (graph, query_usage) in drained {
tracing::debug!("submitting: {} operations", query_usage.operation_count);
tracing::debug!("containing: {:?}", query_usage);
match crate::Report::try_new(&graph.reference) {
Member commented:

This appears to create a Report whose header only has the graph_ref field set.

Notably, it doesn't set the executable_schema_id field. This field should be set to the hex sha256 of the text of the supergraph schema, I believe. I have to admit I'm not 100% sure what aspects of our system actually depend on this being sent but it sure seems like a good idea.

This raises the broader point — it doesn't look like in general you're tying "the current schema" to any of this reporting stuff. I know that you implement Uplink with polling and there's a reload_server thing that happens when the schema changes, but I'm not sure how much of the system is "reloaded" when that happens. Also my understanding is that the model is that the Spaceport thing can run sorta separately from the main server. So it seems like a Spaceport could potentially get different messages about operations run against different schema versions?

In addition to just making sure that header.executable_schema_id is set correctly, the referenced_fields_by_type data structure depends on the schema. I know I've said (and you've implemented) "this structure is super cacheable, it only depends on the operation and the schema" but ... it does depend on the schema :) It doesn't look like the messages in agents.proto talk about the schema version.

So you could have the schema type A { y: ID } type Query { x: A } and the operation {x{y}}. This operation has referenced fields Query.x and A.y. Now you update to the schema type B { y: ID } type Query {x: B}. The same operation now has referenced fields Query.x and B.y. So it's not kosher to combine these two into the same message, as the referenced_fields data will be wrong for one of them!
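
For concreteness, computing that identifier would look something like this (a sketch assuming the sha2 and hex crates; the function name is made up):

    use sha2::{Digest, Sha256};

    // Hex-encoded SHA-256 of the supergraph schema text, as suggested
    // above for the report header's executable_schema_id field.
    fn executable_schema_id(supergraph_sdl: &str) -> String {
        let mut hasher = Sha256::new();
        hasher.update(supergraph_sdl.as_bytes());
        hex::encode(hasher.finalize())
    }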


Contributor commented:

Just adding a little context here. We need to add executable_schema_id to our payloads, but we don't need to worry about cross-contamination on schema reload. The entire plugin pipeline is recreated when that happens, so we can guarantee that a particular instance will only ever be dealing with one schema.

This would not be the case in any sort of multi-tenancy scenario, but that is something to think about later.

Member commented:

That's great. And that includes the "spaceport"?
