
persisted queries: improve Uplink failover and error handling #3863

Merged
merged 2 commits into dev from glasser/fetch-with-callback on Sep 22, 2023

Conversation

@glasser (Member) commented Sep 20, 2023

If an Uplink request fails, the stream_from_uplink function automatically tries the next configured Uplink endpoint. However, in the case of the persisted queries feature, the Uplink response contains a URL (used to fetch the full PQ list chunk) located in the same cloud provider as the Uplink endpoint. Before this PR, the PQ code fetched this second URL "outside" of stream_from_uplink, so an error downloading this file would not result in retrying the Uplink request itself on the next endpoint.

This PR factors out the body of stream_from_uplink into stream_from_uplink_transforming_new_response, which takes an async function that is called on any UplinkResponse::New to transform the response. (stream_from_uplink passes the identity function here, so its API is not affected.) If this function returns Err, the uplink code moves on to the next endpoint URL, just like if the initial Uplink fetch had failed. The PQ layer now fetches the PQ manifest chunk bodies inside this async function instead of by mapping the uplink stream.
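
As a rough sketch of the shape this gives (simplified on purpose: the String error type, the bare endpoint list, and fetch_from_endpoint below are illustrative assumptions, not the router's actual signatures):

    // Minimal sketch only: `String` errors, the bare endpoint list, and
    // `fetch_from_endpoint` are illustrative assumptions, not the router's
    // actual types.
    async fn fetch_transforming_new_response<T, F, Fut>(
        endpoints: &[String],
        mut transform_new_response: F,
    ) -> Result<T, String>
    where
        F: FnMut(String) -> Fut,
        Fut: std::future::Future<Output = Result<T, String>>,
    {
        let mut last_error = String::from("no Uplink endpoints configured");
        for endpoint in endpoints {
            match fetch_from_endpoint(endpoint).await {
                // The transform runs inside the per-endpoint loop, so an error
                // here (e.g. a failed manifest chunk download) moves on to the
                // next endpoint exactly like a failed Uplink fetch would.
                Ok(new_response) => match transform_new_response(new_response).await {
                    Ok(transformed) => return Ok(transformed),
                    Err(e) => last_error = e,
                },
                Err(e) => last_error = e,
            }
        }
        Err(last_error)
    }

    // Placeholder for the real Uplink GraphQL request.
    async fn fetch_from_endpoint(endpoint: &str) -> Result<String, String> {
        Ok(format!("response from {endpoint}"))
    }

In this sketch, stream_from_uplink's unchanged behavior corresponds to passing the identity transform, for example |response| async move { Ok(response) }.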

This means that if our GCP Uplink server is up but the GCS servers that serve manifests from GCP are down, we will fail over to the AWS Uplink server instead of just failing to fetch PQs.

Additionally, the response from Uplink can specify multiple valid URLs for each chunk. (Uplink does not currently do this, but the protocol allows for it.) Before this PR, Router looked only at the first URL listed; after this PR, it tries each URL in order until it finds one that succeeds.
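
For concreteness, a chunk entry might look something like this (field names are illustrative; this is not the actual Uplink schema):

    // Hypothetical shape; the real Uplink schema may use different names.
    struct PersistedQueriesManifestChunk {
        id: String,
        // The protocol allows several equivalent URLs per chunk; the router
        // now walks this list in order and uses the first URL that succeeds,
        // instead of reading only urls[0].
        urls: Vec<String>,
    }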

Additionally, before this PR, any errors fetching PQs from Uplink or GCS after a successful startup were ignored. (They were sent on the ready_sender channel whose receiver had already been dropped; that could produce a debug-level log message about the closed channel, but the errors themselves were never directly logged.) After this PR, once the channel has been used, we no longer try to send these errors on it; instead, we log them at the error level (similar to how schema and license Uplink errors are handled). Note that this only happens when Router fails to fetch PQs from all configured Uplink endpoints; a failure to fetch from a single endpoint is still logged only at the debug level.
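
A sketch of the resulting "send once, then log" pattern, assuming a tokio oneshot channel for startup readiness (the actual channel and struct layout in the router may differ):

    use tokio::sync::oneshot;

    struct PersistedQueriesPoller {
        // Present only until the first result has been delivered to the
        // startup code awaiting readiness.
        ready_sender: Option<oneshot::Sender<Result<(), String>>>,
    }

    impl PersistedQueriesPoller {
        fn report(&mut self, result: Result<(), String>) {
            if let Some(sender) = self.ready_sender.take() {
                // First result: hand it to whoever is awaiting startup.
                let _ = sender.send(result);
            } else if let Err(e) = result {
                // After startup the receiver is gone, so log instead of
                // sending into a closed channel. This fires only when all
                // configured Uplink endpoints failed; a single-endpoint
                // failure stays at the debug level.
                tracing::error!("failed to fetch persisted queries: {}", e);
            }
        }
    }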

Co-authored-by: Jeremy Lempereur jeremy.lempereur@iomentum.com


router-perf bot commented Sep 20, 2023

CI performance tests

  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • large-request - Stress test with a 1 MB request payload
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • reload - Reload test over a long period of time at a constant rate of users
  • no-graphos - Basic stress test, no GraphOS.
  • xxlarge-request - Stress test with 100 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • const - Basic stress test that runs with a constant number of users

                return Ok(());
            }
            Err(e) => {
                if it.peek().is_some() {
@glasser (Member, Author) commented:

This is a different pattern from how stream_from_uplink (or specifically, fetch) handles errors.

In stream_from_uplink, individual failures are logged at the debug level with a message saying "Other endpoints will be tried" (even if it's the last one!), and if they all fail, a constant Err without the specific error message is returned. My understanding is that we don't want to spam stderr with single-endpoint errors, but this means the eventual error carries no specific details on the problem that occurred (unless you reproduce it with the log level set to debug).

Here, we do the debug-with-"Other endpoints will be tried" thing only when there actually are more URLs to try; errors from the last URL are returned directly.

I think this is better than the existing approach because it's nice to actually include information about some failure in the eventual error, but I'm happy to rewrite this to work more like fetch if you'd like.
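
To make that concrete, here is a sketch of the pattern this comment describes (try_fetch is a hypothetical stand-in for the chunk download, not the router's actual helper):

    // Sketch of the error handling discussed above; `try_fetch` is a
    // hypothetical stand-in for the chunk download.
    async fn fetch_first_available(urls: &[String]) -> Result<String, String> {
        let mut it = urls.iter().peekable();
        while let Some(url) = it.next() {
            match try_fetch(url).await {
                Ok(body) => return Ok(body),
                Err(e) => {
                    if it.peek().is_some() {
                        // More URLs remain: log quietly and keep going.
                        tracing::debug!(
                            "failed to fetch {}: {}. Other endpoints will be tried",
                            url,
                            e
                        );
                    } else {
                        // Last URL: return the concrete error so the eventual
                        // error message carries the real details.
                        return Err(e);
                    }
                }
            }
        }
        Err("no URLs to fetch from".to_string())
    }

    async fn try_fetch(url: &str) -> Result<String, String> {
        // Placeholder failure so the fallback path is exercised.
        Err(format!("placeholder failure for {url}"))
    }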

@o0Ignition0o (Contributor) left a comment:

LGTM

Comment on lines 387 to 416
# As of 2023-09-20, no current kube versions are supported by kubeconform,
# due to https://github.com/yannh/kubeconform/issues/233. So we skip
# this check for now.
#
# # Install Helm
# curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# # Install kubeconform
# KUBECONFORM_INSTALL=$(mktemp -d)
# curl -L https://github.com/yannh/kubeconform/releases/latest/download/kubeconform-linux-amd64.tar.gz | tar xz -C "${KUBECONFORM_INSTALL}"

# # Create list of kube versions
# CURRENT_KUBE_VERSIONS=$(curl -L https://raw.githubusercontent.com/kubernetes/website/main/data/releases/schedule.yaml \
#   | yq -o json '.' \
#   | jq --raw-output '.schedules[] | select((now | strftime("%Y-%m-%dT00:00:00Z")) as $date | .releaseDate < $date and .endOfLifeDate > $date) | .previousPatches[0].release')

# # Use helm to template our chart against all kube versions
# TEMPLATE_DIR=$(mktemp -d)
# for kube_version in ${CURRENT_KUBE_VERSIONS}; do
#   # Use helm to template our chart against kube_version
#   helm template --kube-version "${kube_version}" router helm/chart/router --set autoscaling.enabled=true > "${TEMPLATE_DIR}/router-${kube_version}.yaml"

#   # Execute kubeconform on our templated charts to ensure they are good
#   "${KUBECONFORM_INSTALL}/kubeconform" \
#     --kubernetes-version "${kube_version}" \
#     --strict \
#     --schema-location default \
#     --verbose \
#     "${TEMPLATE_DIR}/router-${kube_version}.yaml"
# done
A contributor commented:
#3867 (review) addresses this (workaround)

o0Ignition0o and others added 2 commits September 21, 2023 07:56
@abernix merged commit 70cb943 into dev on Sep 22, 2023
12 checks passed
@abernix deleted the glasser/fetch-with-callback branch on September 22, 2023 15:17
@lrlna mentioned this pull request on Sep 27, 2023