prep release: v1.13.0 #2841

abernix · 2023-03-23T13:18:46Z

Note

When approved, this PR will merge into the 1.13.0 branch which will — upon being approved itself — merge into main.

Things to review in this PR:

Changelog correctness (There is a preview below, but it is not necessarily up to date. See Files Changed for the authoritative version.)

Version bumps

That it targets the right release branch (1.13.0 in this case!).

🚀 Features

Uplink metrics and improved logging (Issue #2769, Issue #2815, Issue #2816)

For monitoring, observability and debugging requirements around Uplink-related behaviors (those which occur as part of Managed Federation), the router now emits better log messages and new metrics for these facilities. The new metrics are:

apollo_router_uplink_fetch_duration_seconds_bucket: A histogram of durations with the following attributes:
- url: The URL that was polled
- query: SupergraphSdl or Entitlement
- type: new, unchanged, http_error, uplink_error, or ignored
- code: The error code, depending on type
- error: The error message
apollo_router_uplink_fetch_count_total: A gauge that counts the overall successes (status="success") and failures (status="failure") that occur when communicating with Uplink, without taking fallback into account.

⚠️ The very first poll to Uplink is unable to capture metrics since it's so early in the router's lifecycle that telemetry hasn't yet been set up. We consider this a suitable trade-off and don't want to allow perfect to be the enemy of good.

Here's an example of what these new metrics look like from the Prometheus scraping endpoint:

# HELP apollo_router_uplink_fetch_count_total apollo_router_uplink_fetch_count_total
# TYPE apollo_router_uplink_fetch_count_total gauge
apollo_router_uplink_fetch_count_total{query="SupergraphSdl",service_name="apollo-router",status="success"} 1
# HELP apollo_router_uplink_fetch_duration_seconds apollo_router_uplink_fetch_duration_seconds
# TYPE apollo_router_uplink_fetch_duration_seconds histogram
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.001"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.005"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.015"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.05"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.1"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.2"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.3"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.4"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.5"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="1"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="5"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="10"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="+Inf"} 1
apollo_router_uplink_fetch_duration_seconds_sum{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/"} 0.465257131
apollo_router_uplink_fetch_duration_seconds_count{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/"} 1
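As a rough illustration of how the scraped output above could be consumed, here is a minimal Python sketch that parses a few of these lines and derives the mean fetch duration and the smallest non-empty bucket. The helper names (`parse_samples`, `mean_fetch_duration`, `first_nonempty_bucket`) and the shortened `SAMPLE` text are assumptions made for this example; they are not part of the router, and the parser only handles the simple line shape shown here.

```python
import re

# Shortened copy of the scrape output above (labels trimmed for brevity).
SAMPLE = """\
# TYPE apollo_router_uplink_fetch_duration_seconds histogram
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",le="0.4"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",le="0.5"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",le="+Inf"} 1
apollo_router_uplink_fetch_duration_seconds_sum{kind="unchanged",query="SupergraphSdl"} 0.465257131
apollo_router_uplink_fetch_duration_seconds_count{kind="unchanged",query="SupergraphSdl"} 1
"""

PREFIX = "apollo_router_uplink_fetch_duration_seconds"
LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (metric_name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue  # line shape not handled by this simplified parser
        labels = dict(LABEL.findall(match["labels"]))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


def mean_fetch_duration(samples, prefix=PREFIX):
    """Mean duration in seconds: histogram sum divided by histogram count."""
    total = sum(v for name, _, v in samples if name == prefix + "_sum")
    count = sum(v for name, _, v in samples if name == prefix + "_count")
    return total / count if count else None


def first_nonempty_bucket(samples, prefix=PREFIX):
    """Smallest `le` bound whose cumulative count is above zero."""
    bounds = sorted(
        float(labels["le"])
        for name, labels, v in samples
        if name == prefix + "_bucket" and v > 0
    )
    return bounds[0] if bounds else None
```

With the sample above, the single observed poll took roughly 0.47 seconds and therefore first appears in the `le="0.5"` bucket.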

By @BrynCooke in #2779, #2817, #2819, #2826

🐛 Fixes

Only process Uplink messages that are deemed to be newer (Issue #2794)

Uplink is backed by multiple cloud providers to ensure high availability. However, this means that there will be periods of time where Uplink endpoints do not agree on what the latest data is. They are eventually consistent.

This has not been a problem for most users, as the default mode of operation for the router is to fall back to the secondary Uplink endpoint if the first fails.

The other mode of operation is round-robin, which is triggered only when setting the APOLLO_UPLINK_ENDPOINTS environment variable. In this mode there is a much higher chance that the router will end up flapping due to disagreement between the Apollo Uplink servers or any user-provided proxies set in this variable.

This change introduces two fixes:

The Router checks each Uplink message against the last known message to see if it is newer. Older messages are discarded.
The Router will only use the fallback strategy. Uplink endpoints are only eventually consistent, and therefore it is better to always poll a primary source of information if available.

We will be improving the robustness of the solution over the coming weeks, including via other fixes in this release, so this can be seen as an incremental improvement.
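The two fixes above can be sketched roughly as follows. This is an illustrative Python model only, not the router's Rust implementation: `UplinkMessage`, its `ordering_id` field, `poll_with_fallback` and `accept_if_newer` are hypothetical names, and the assumption that "newer" can be decided by comparing a single increasing value is ours.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class UplinkMessage:
    ordering_id: int  # hypothetical marker that grows with each newer message
    payload: str


Fetcher = Callable[[], UplinkMessage]


def poll_with_fallback(fetchers: Sequence[Fetcher]) -> Optional[UplinkMessage]:
    """Always try the primary endpoint first; use later ones only on failure."""
    for fetch in fetchers:
        try:
            return fetch()
        except Exception:
            continue  # this endpoint failed, fall back to the next one
    return None  # every endpoint failed


def accept_if_newer(
    last: Optional[UplinkMessage], candidate: Optional[UplinkMessage]
) -> Optional[UplinkMessage]:
    """Keep the last known message unless the candidate is strictly newer."""
    if candidate is None:
        return last
    if last is not None and candidate.ordering_id <= last.ordering_id:
        return last  # stale or duplicate message: discard it
    return candidate
```

In contrast to round-robin, the order of `fetchers` matters here: the secondary endpoint is consulted only when the primary raises, which reduces the chance of flapping between endpoints that briefly disagree.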

Note: We advise against using APOLLO_UPLINK_ENDPOINTS to try to cache uplink responses for HA purposes. Each request to Uplink currently sends state which limits the usefulness of such a cache.

By @BrynCooke in #2803, #2826

Distributed caching: Don't send Redis' `CLIENT SETNAME`

We no longer send the CLIENT SETNAME command to connected Redis servers. This resolves an incompatibility with some Redis-compatible servers, since not all "Redis-compatible" offerings (like Google Memorystore) actually support every Redis command. We did not actually require this feature; it was just an option that could be enabled on our Redis client. No Router functionality is impacted.

By @Geal in #2825

Support bare top-level `__typename` when aliased (Issue #2792)

PR #1762 implemented support for the query { __typename } but it did not work properly if the top-level standalone __typename field was aliased. This now works properly.
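To make the expected behavior concrete: in GraphQL, a field's response key is its alias when one is given, so `{ t: __typename }` should produce `{"t": "Query"}`. The Python sketch below models only that rule for a bare top-level selection; `resolve_bare_typename` and its `(alias, field_name)` input shape are hypothetical and are not how the router represents queries.

```python
from typing import Dict, List, Optional, Tuple


def resolve_bare_typename(
    selections: List[Tuple[Optional[str], str]], root_type: str = "Query"
) -> Dict[str, str]:
    """Answer a top-level selection made up only of `__typename` fields.

    Each selection is (alias, field_name); alias is None when not aliased.
    """
    data = {}
    for alias, field_name in selections:
        if field_name != "__typename":
            raise ValueError("this sketch only handles bare __typename fields")
        response_key = alias if alias is not None else field_name
        data[response_key] = root_type
    return data
```

For example, `resolve_bare_typename([("t", "__typename")])` models the aliased query, while `resolve_bare_typename([(None, "__typename")])` models the unaliased one that PR #1762 already handled.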

By @glasser in #2791

Maintain errors set on `_entities` (Issue #2731)

In their responses, some subgraph implementations do not return errors per entity but instead on the entire path. We now transmit those errors regardless.

By @Geal in #2756

📃 Configuration

Custom OpenTelemetry Datadog exporter mapping (Issue #2228)

This PR fixes the issue with the Datadog exporter not providing meaningful contextual data in the Datadog traces.
There is a known issue where OpenTelemetry is not fully compatible with Datadog.

To fix this, the opentelemetry-datadog crate added custom mapping functions.

Now, when enable_span_mapping is set to true, the Apollo Router will perform the following mapping:

Use the OpenTelemetry span name to set the Datadog span operation name.
Use the OpenTelemetry span attributes to set the Datadog span resource name.

For example:

Let's say we send a query MyQuery to the Apollo Router. The Router, following the operation's query plan, then sends a query to my-subgraph-name, producing the following trace:

    | apollo_router request                                                                 |
        | apollo_router router                                                              |
            | apollo_router supergraph                                                      |
            | apollo_router query_planning  | apollo_router execution                       |
                                                | apollo_router fetch                       |
                                                    | apollo_router subgraph                |
                                                        | apollo_router subgraph_request    |

As you can see, there is no clear information about the name of the query, the name of the subgraph, or the name of the query sent to the subgraph.

Instead, with this new enable_span_mapping setting set to true, the following trace will be created:

    | request /graphql                                                                                   |
        | router                                                                                         |
            | supergraph MyQuery                                                                         |
                | query_planning MyQuery  | execution                                                    |
                                              | fetch fetch                                              |
                                                  | subgraph my-subgraph-name                            |
                                                      | subgraph_request MyQuery__my-subgraph-name__0    |

All this logic is gated behind the configuration enable_span_mapping which, if set to true, will take the values from the span attributes.
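The mapping rule described above can be modelled with a short Python sketch. It is an illustration only: the attribute keys in `RESOURCE_ATTRIBUTE` are assumptions chosen for this example rather than the exact keys the router reads, and `map_span` is a hypothetical helper, not an API of the opentelemetry-datadog crate.

```python
from typing import Dict, Tuple

# Hypothetical: which span attribute supplies the Datadog resource name.
RESOURCE_ATTRIBUTE = {
    "request": "http.route",
    "supergraph": "graphql.operation.name",
    "query_planning": "graphql.operation.name",
    "subgraph": "apollo.subgraph.name",
    "subgraph_request": "graphql.operation.name",
}


def map_span(
    span_name: str, attributes: Dict[str, str], enable_span_mapping: bool
) -> Tuple[str, str]:
    """Return (datadog_operation_name, datadog_resource_name) for one span."""
    if not enable_span_mapping:
        # Without the mapping, every span shares the same operation name and
        # only the span name distinguishes them, as in the first trace.
        return ("apollo_router", span_name)
    key = RESOURCE_ATTRIBUTE.get(span_name)
    # Fall back to the span name when no mapped attribute is present.
    resource = attributes.get(key, span_name) if key else span_name
    return (span_name, resource)
```

For instance, `map_span("subgraph", {"apollo.subgraph.name": "my-subgraph-name"}, True)` yields an operation name of `subgraph` with `my-subgraph-name` as the resource, in the spirit of the second trace.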

By @samuelAndalon in #2790

🛠 Maintenance

Migrate `xtask` CLI parsing from `StructOpt` to `Clap` (Issue #2807)

As an internal improvement to our tooling, we've migrated our xtask toolset from StructOpt to Clap, since StructOpt is in maintenance mode.

By @BrynCooke in #2808

Subgraph configuration override (Issue #2426)

We've introduced a new generic wrapper type for subgraph-level configuration, with the following behaviour:

If there's a config in all, it applies to all subgraphs. If it is not there, the default values apply.
If there's a config in subgraphs for a specific named subgraph:
- the fields it does specify override the fields specified in all
- the fields it does not specify use the values provided by all, or default values, if applicable
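The override rules above amount to a layered merge: defaults first, then `all`, then the named subgraph. Here is a minimal Python sketch of that precedence, assuming a flat, field-by-field merge; `effective_config` and the `timeout`/`compression` fields are hypothetical and only serve the example, and the router's actual wrapper type is written in Rust.

```python
from typing import Any, Dict, Optional


def effective_config(
    defaults: Dict[str, Any],
    all_config: Optional[Dict[str, Any]],
    subgraphs: Dict[str, Dict[str, Any]],
    name: str,
) -> Dict[str, Any]:
    """Resolve the configuration that applies to the subgraph called `name`."""
    merged = dict(defaults)                  # lowest precedence: default values
    merged.update(all_config or {})          # `all` applies to every subgraph
    merged.update(subgraphs.get(name, {}))   # named subgraph fields win
    return merged


# Hypothetical fields used only to show the precedence.
DEFAULTS = {"timeout": 30, "compression": None}
ALL = {"timeout": 10}
SUBGRAPHS = {"accounts": {"compression": "gzip"}}
```

Here `effective_config(DEFAULTS, ALL, SUBGRAPHS, "accounts")` takes `timeout` from `all` and `compression` from the `accounts` entry, while a subgraph with no entry of its own simply gets `all` layered over the defaults.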

By @Geal in #2453

Add integration tests for Uplink URLs (Issue #2827)

We've added integration tests to ensure that all Uplink URLs can be contacted and data can be retrieved in an expected format.

We've also changed our URLs to align exactly with Gateway, to simplify our own documentation. Existing Router users do not need to take any action as we support both on our infrastructure.

By @BrynCooke in #2830, #2834

Improve integration test harness (Issue #2809)

Our internal integration test harness has been simplified.

By @BrynCooke in #2810

Use `kubeconform` to validate the Router's Helm manifest (Issue #1914)

We've had a couple of cases where errors have been inadvertently introduced to our Helm charts. These have required fixes such as this fix. So far, we've been relying on manual testing and inspection, but we've reached the point where automation is desired. This change uses kubeconform to ensure that the YAML generated by our Helm manifest is indeed valid. Errors may still be possible, but this should at least prevent basic errors from occurring. This information will be surfaced in our CI checks.

By @garypen in #2835

📚 Documentation

Re-point links going via redirect to their true sources

Some of our documentation links were pointing to pages which have been renamed during routine documentation updates. While the links were not broken (the former links redirected to the new URLs), we've updated them to avoid the extra hop.

By @o0Ignition0o in #2780

Fix coprocessor docs about subgraph URI mutability

The subgraph URI is (and always has been) mutable when responding to the SubgraphRequest stage in a coprocessor.

By @lennyburdette in #2801

CHANGELOG.md

Co-authored-by: Chandrika Srinivasan <chandrikas@users.noreply.github.com> Co-authored-by: Geoffroy Couprie <apollo@geoffroycouprie.com>

The rename of a newly introduced metric in Apollo Router 1.13.0 was logged in the CHANGELOG using the _wrong_ metric name. The metric was renamed from `apollo_router_uplink_duration_seconds_bucket` to `apollo_router_uplink_fetch_duration_seconds_bucket` in #2826, but we failed to catch this discrepancy in the changelog for the v1.13.0 release (#2841). Ref: #2826

prep release: v1.13.0

ce85c90

apollo-bot2 assigned abernix Mar 23, 2023

abernix commented Mar 23, 2023

Update CHANGELOG.md

dcda2f8

abernix requested review from garypen, SimonSapin, BrynCooke, chandrikas and Geal March 23, 2023 13:21

abernix marked this pull request as ready for review March 23, 2023 13:21

abernix requested a review from StephenBarlow as a code owner March 23, 2023 13:22

Geal reviewed Mar 23, 2023

garypen approved these changes Mar 23, 2023

BrynCooke approved these changes Mar 23, 2023

chandrikas reviewed Mar 23, 2023

Apply suggestions from code review

048a28a

Co-authored-by: Chandrika Srinivasan <chandrikas@users.noreply.github.com> Co-authored-by: Geoffroy Couprie <apollo@geoffroycouprie.com>

abernix enabled auto-merge (squash) March 23, 2023 13:45

abernix disabled auto-merge March 23, 2023 13:45

chandrikas approved these changes Mar 23, 2023

abernix enabled auto-merge (squash) March 23, 2023 13:49

abernix disabled auto-merge March 23, 2023 17:47

abernix merged commit ef34c53 into 1.13.0 Mar 23, 2023
10 checks passed

abernix deleted the prep-1.13.0 branch March 23, 2023 17:47

abernix mentioned this pull request Mar 23, 2023

release: v1.13.0 #2849

Merged

abernix mentioned this pull request Apr 13, 2023

chore: Update CHANGELOG.md to reference correct metric name #2942

Merged
