prep release: v1.13.0 #2841

Merged
merged 3 commits into from Mar 23, 2023

@abernix abernix commented Mar 23, 2023

Note

When approved, this PR will merge into the 1.13.0 branch which will — upon being approved itself — merge into main.

Things to review in this PR:

  • Changelog correctness (There is a preview below, but it is not necessarily the most up to date. See Files Changed for the authoritative version.)
  • Version bumps
  • That it targets the right release branch (1.13.0 in this case!).

🚀 Features

Uplink metrics and improved logging (Issue #2769, Issue #2815, Issue #2816)

For monitoring, observability and debugging requirements around Uplink-related behaviors (those which occur as part of Managed Federation), the router now emits better log messages and new metrics. The new metrics are:

  • apollo_router_uplink_fetch_duration_seconds_bucket: A histogram of durations with the following attributes:

    • url: The URL that was polled
    • query: SupergraphSdl or Entitlement
    • type: new, unchanged, http_error, uplink_error, or ignored
    • code: The error code, depending on type
    • error: The error message
  • apollo_router_uplink_fetch_count_total: A gauge that counts the overall successes (status="success") or failures (status="failure") that occur when communicating with Uplink, without taking fallback into account.

⚠️ The very first poll to Uplink is unable to capture metrics since it's so early in the router's lifecycle that telemetry hasn't yet been set up. We consider this a suitable trade-off and don't want to allow perfect to be the enemy of good.

Here's an example of what these new metrics look like from the Prometheus scraping endpoint:

    # HELP apollo_router_uplink_fetch_count_total apollo_router_uplink_fetch_count_total
    # TYPE apollo_router_uplink_fetch_count_total gauge
    apollo_router_uplink_fetch_count_total{query="SupergraphSdl",service_name="apollo-router",status="success"} 1
    # HELP apollo_router_uplink_fetch_duration_seconds apollo_router_uplink_fetch_duration_seconds
    # TYPE apollo_router_uplink_fetch_duration_seconds histogram
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.001"} 0
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.005"} 0
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.015"} 0
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.05"} 0
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.1"} 0
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.2"} 0
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.3"} 0
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.4"} 0
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="0.5"} 1
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="1"} 1
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="5"} 1
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="10"} 1
    apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/",le="+Inf"} 1
    apollo_router_uplink_fetch_duration_seconds_sum{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/"} 0.465257131
    apollo_router_uplink_fetch_duration_seconds_count{kind="unchanged",query="SupergraphSdl",service_name="apollo-router",url="https://uplink.api.apollographql.com/"} 1
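
To sanity-check a scrape like the one above, a small script can parse the exposition text and read off which histogram bucket a poll landed in. The following is only a sketch in Python: the helper names `parse_samples` and `smallest_bucket` are made up for illustration, the parser handles only the simple `name{labels} value` form shown here, and the embedded sample is a shortened copy of the output above.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One label pair inside the braces, e.g. kind="unchanged".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def smallest_bucket(samples, metric):
    """Return the smallest `le` bound whose cumulative count is non-zero."""
    bounds = [
        float(labels["le"])
        for name, labels, value in samples
        if name == metric + "_bucket" and value > 0
    ]
    return min(bounds) if bounds else None


# Shortened copy of the scrape output (fewer labels and buckets).
SCRAPE = """\
# TYPE apollo_router_uplink_fetch_count_total gauge
apollo_router_uplink_fetch_count_total{query="SupergraphSdl",service_name="apollo-router",status="success"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",le="0.4"} 0
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",le="0.5"} 1
apollo_router_uplink_fetch_duration_seconds_bucket{kind="unchanged",query="SupergraphSdl",le="+Inf"} 1
"""

samples = parse_samples(SCRAPE)
# The single poll took about 0.47s, so it first shows up in the le="0.5" bucket.
print(smallest_bucket(samples, "apollo_router_uplink_fetch_duration_seconds"))
```

In practice a Prometheus query would do this job; the script is only meant to show how the cumulative buckets relate to the `_sum` and `_count` lines.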

By @BrynCooke in #2779, #2817, #2819, #2826

🐛 Fixes

Only process Uplink messages that are deemed to be newer (Issue #2794)

Uplink is backed by multiple cloud providers to ensure high availability. However, this means that there will be periods of time where Uplink endpoints do not agree on what the latest data is. They are eventually consistent.

This has not been a problem for most users, as the default mode of operation for the router is to fall back to the secondary Uplink endpoint if the first fails.

The other mode of operation is round-robin, which is triggered only when setting the APOLLO_UPLINK_ENDPOINTS environment variable. In this mode there is a much higher chance that the router will end up flapping due to disagreement between the Apollo Uplink servers or any user-provided proxies set in this variable.

This change introduces two fixes:

  1. The Router checks uplink messages against the last known message to see if it is newer. Messages that are old are discarded.
  2. The Router will only use the fallback strategy. Uplink endpoints are only eventually consistent, and therefore it is better to always poll a primary source of information if available.
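
The two fixes can be illustrated together with a minimal Python sketch. This is not the Router's implementation: `UplinkMessage`, `ordering_id`, `UplinkPoller`, the example URLs and the `fake_fetch` helper are all hypothetical, and the sketch simply assumes that each message carries some comparable marker of recency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UplinkMessage:
    ordering_id: int  # hypothetical marker: a larger value means a newer message
    payload: str


class UplinkPoller:
    """Polls endpoints in a fixed order and keeps only newer messages."""

    def __init__(self, endpoints, fetch):
        self.endpoints = list(endpoints)  # the primary endpoint comes first
        self.fetch = fetch                # callable: url -> UplinkMessage
        self.last = None                  # last accepted message, if any

    def poll(self):
        for url in self.endpoints:
            try:
                message = self.fetch(url)
            except ConnectionError:
                continue  # fallback: try the next endpoint only on failure
            if self.last is not None and message.ordering_id <= self.last.ordering_id:
                return None  # not newer than the last known message: discard
            self.last = message
            return message
        return None  # every endpoint failed


PRIMARY = "https://primary.example/"
SECONDARY = "https://secondary.example/"
responses = {
    PRIMARY: UplinkMessage(2, "schema-v2"),
    SECONDARY: UplinkMessage(1, "schema-v1"),  # lagging behind the primary
}
down = set()


def fake_fetch(url):
    if url in down:
        raise ConnectionError(url)
    return responses[url]


poller = UplinkPoller([PRIMARY, SECONDARY], fake_fetch)
first = poller.poll()    # accepted: nothing was known before
second = poller.poll()   # discarded: same message as last time
down.add(PRIMARY)
third = poller.poll()    # falls back to the secondary, whose older message is discarded
print(first, second, third)
```

With round-robin polling, the third call could have replaced `schema-v2` with the older `schema-v1`; the combination of a fixed endpoint order and the recency check is what prevents that flapping in this sketch.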

We will be improving the robustness of the solution over the coming weeks, including via other fixes in this release, so this can be seen as an incremental improvement.

Note: We advise against using APOLLO_UPLINK_ENDPOINTS to try to cache uplink responses for HA purposes. Each request to Uplink currently sends state which limits the usefulness of such a cache.

By @BrynCooke in #2803, #2826

Distributed caching: Don't send Redis' CLIENT SETNAME

We no longer send the CLIENT SETNAME command to connected Redis servers. This resolves an incompatibility with some Redis-compatible servers, since not all "Redis-compatible" offerings (like Google Memorystore) support every Redis command. We did not actually need this feature; it was simply an option that could be enabled on our Redis client. No Router functionality is impacted.

By @Geal in #2825

Support bare top-level __typename when aliased (Issue #2792)

PR #1762 implemented support for the query { __typename } but it did not work properly if the top-level standalone __typename field was aliased. This now works properly.

By @glasser in #2791

Maintain errors set on _entities (Issue #2731)

In their responses, some subgraph implementations do not return errors per entity but instead on the entire path. We now transmit those regardless.

By @Geal in #2756

📃 Configuration

Custom OpenTelemetry Datadog exporter mapping (Issue #2228)

This PR fixes the issue with the Datadog exporter not providing meaningful contextual data in the Datadog traces.
There is a known issue where OpenTelemetry is not fully compatible with Datadog.

To fix this, the opentelemetry-datadog crate added custom mapping functions.

Now, when enable_span_mapping is set to true, the Apollo Router will perform the following mapping:

  1. Use the OpenTelemetry span name to set the Datadog span operation name.
  2. Use the OpenTelemetry span attributes to set the Datadog span resource name.

For example:

Let's say we send a query MyQuery to the Apollo Router. Using the operation's query plan, the Router then sends a query to my-subgraph-name, producing the following trace:

    | apollo_router request                                                                 |
        | apollo_router router                                                              |
            | apollo_router supergraph                                                      |
            | apollo_router query_planning  | apollo_router execution                       |
                                                | apollo_router fetch                       |
                                                    | apollo_router subgraph                |
                                                        | apollo_router subgraph_request    |

As you can see, there is no clear information about the name of the query, the name of the subgraph, or the name of the query sent to the subgraph.

Instead, with this new enable_span_mapping setting set to true, the following trace will be created:

    | request /graphql                                                                                   |
        | router                                                                                         |
            | supergraph MyQuery                                                                         |
                | query_planning MyQuery  | execution                                                    |
                                              | fetch fetch                                              |
                                                  | subgraph my-subgraph-name                            |
                                                      | subgraph_request MyQuery__my-subgraph-name__0    |

All this logic is gated behind the configuration enable_span_mapping which, if set to true, will take the values from the span attributes.
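
As a rough illustration of the mapping, here is a Python sketch (the Router itself does this through the opentelemetry-datadog crate, not with this code). The function name `map_span`, the attribute keys in `RESOURCE_ATTRIBUTE`, the fallback to the span name, and the unmapped `apollo_router` operation name are all assumptions chosen to reproduce the traces shown above.

```python
# Assumed lookup: which span attribute supplies the Datadog resource name
# for a given OpenTelemetry span name. The keys are illustrative only.
RESOURCE_ATTRIBUTE = {
    "request": "http.route",
    "supergraph": "graphql.operation.name",
    "query_planning": "graphql.operation.name",
    "subgraph": "apollo.subgraph.name",
    "subgraph_request": "graphql.operation.name",
}


def map_span(span_name, attributes, enable_span_mapping):
    """Return (datadog_operation_name, datadog_resource_name) for one span."""
    if not enable_span_mapping:
        # Assumed unmapped behaviour, matching the first trace above.
        return ("apollo_router", span_name)
    key = RESOURCE_ATTRIBUTE.get(span_name)
    # Assumption: fall back to the span name when no attribute is available.
    resource = attributes.get(key, span_name) if key else span_name
    return (span_name, resource)


spans = [
    ("request", {"http.route": "/graphql"}),
    ("supergraph", {"graphql.operation.name": "MyQuery"}),
    ("subgraph", {"apollo.subgraph.name": "my-subgraph-name"}),
    ("subgraph_request", {"graphql.operation.name": "MyQuery__my-subgraph-name__0"}),
]
for name, attrs in spans:
    print(map_span(name, attrs, enable_span_mapping=True))
```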

By @samuelAndalon in #2790

🛠 Maintenance

Migrate xtask CLI parsing from StructOpt to Clap (Issue #2807)

As an internal improvement to our tooling, we've migrated our xtask toolset from StructOpt to Clap, since StructOpt is in maintenance mode.

By @BrynCooke in #2808

Subgraph configuration override (Issue #2426)

We've introduced a new generic wrapper type for subgraph-level configuration, with the following behaviour:

  • If there's a config in all, it applies to all subgraphs. If it is not there, the default values apply.
  • If there's a config in subgraphs for a specific named subgraph:
    • the fields it does specify override the fields specified in all
    • the fields it does not specify use the values provided by all, or default values, if applicable
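
The precedence described above amounts to a three-layer merge: defaults, then all, then the named subgraph. The Python sketch below models it with plain dictionaries; the function name `effective_config` and the field names `timeout_seconds` and `compression` are hypothetical and not taken from the Router's configuration schema.

```python
# Hypothetical default values for a subgraph-level setting.
DEFAULTS = {"timeout_seconds": 30, "compression": None}


def effective_config(defaults, all_config, subgraphs, name):
    """Resolve the settings that apply to the subgraph called `name`."""
    merged = dict(defaults)                  # lowest priority: defaults
    merged.update(all_config)                # `all` overrides defaults
    merged.update(subgraphs.get(name, {}))   # the named subgraph overrides `all`
    return merged


all_config = {"timeout_seconds": 10}
subgraphs = {"products": {"compression": "gzip"}}

products = effective_config(DEFAULTS, all_config, subgraphs, "products")
reviews = effective_config(DEFAULTS, all_config, subgraphs, "reviews")
print(products)  # timeout from `all`, compression from the subgraph entry
print(reviews)   # timeout from `all`, compression from the defaults
```

The merge here is shallow and per field; nested settings would need a recursive merge, which this sketch leaves out.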

By @Geal in #2453

Add integration tests for Uplink URLs (Issue #2827)

We've added integration tests to ensure that all Uplink URLs can be contacted and data can be retrieved in an expected format.

We've also changed our URLs to align exactly with Gateway, to simplify our own documentation. Existing Router users do not need to take any action as we support both on our infrastructure.

By @BrynCooke in #2830, #2834

Improve integration test harness (Issue #2809)

Our internal integration test harness has been simplified.

By @BrynCooke in #2810

Use kubeconform to validate the Router's Helm manifest (Issue #1914)

We've had a couple of cases where errors were inadvertently introduced into our Helm charts. These have required fixes such as this fix. So far, we've been relying on manual testing and inspection, but we've reached the point where automation is desired. This change uses kubeconform to ensure that the YAML generated by our Helm manifest is indeed valid. Errors may still be possible, but this should at least prevent basic errors from occurring. This information will be surfaced in our CI checks.

By @garypen in #2835

📚 Documentation

Re-point links going via redirect to their true sources

Some of our documentation links pointed to pages that were renamed during routine documentation updates. While the links were not broken (the former links redirected to the new URLs), we've updated them to avoid the extra hop.

By @o0Ignition0o in #2780

Fix coprocessor docs about subgraph URI mutability

The subgraph URI is (and always has been) mutable when responding to the SubgraphRequest stage in a coprocessor.

By @lennyburdette in #2801

@abernix abernix marked this pull request as ready for review March 23, 2023 13:21
Co-authored-by: Chandrika Srinivasan <chandrikas@users.noreply.github.com>
Co-authored-by: Geoffroy Couprie <apollo@geoffroycouprie.com>
@abernix abernix enabled auto-merge (squash) March 23, 2023 13:45
@abernix abernix disabled auto-merge March 23, 2023 13:45
@abernix abernix enabled auto-merge (squash) March 23, 2023 13:49
@abernix abernix disabled auto-merge March 23, 2023 17:47
@abernix abernix merged commit ef34c53 into 1.13.0 Mar 23, 2023
10 checks passed
@abernix abernix deleted the prep-1.13.0 branch March 23, 2023 17:47
@abernix abernix mentioned this pull request Mar 23, 2023
abernix added a commit that referenced this pull request Apr 13, 2023
The rename of a newly introduced metric in Apollo Router 1.13.0 was logged
in the CHANGELOG using the _wrong_ metric name.  The metric was renamed from
`apollo_router_uplink_duration_seconds_bucket` to
`apollo_router_uplink_fetch_duration_seconds_bucket` in
#2826, but we failed to catch
this discrepancy in the changelog for the [v1.13.0 release].

Ref: #2826
[v1.13.0 release]: #2841
abernix added a commit that referenced this pull request Apr 14, 2023
The rename of a newly introduced metric in Apollo Router 1.13.0 was
logged in the CHANGELOG using the _wrong_ metric name. The metric was
renamed from `apollo_router_uplink_duration_seconds_bucket` to
`apollo_router_uplink_fetch_duration_seconds_bucket` in
#2826, but we failed to
catch this discrepancy in the changelog for the v1.13.0 [release].

Ref: #2826
[release]: #2841