
persisted queries: improve Uplink failover and error handling #3863

Merged
merged 2 commits into dev from glasser/fetch-with-callback on Sep 22, 2023

Conversation

@glasser (Member) commented Sep 20, 2023

If an Uplink request fails, the stream_from_uplink function automatically tries the next configured Uplink endpoint. However, in the case of the persisted queries feature, the Uplink response contains a URL (used to fetch the full PQ list chunk) located in the same cloud provider as the Uplink endpoint. Before this PR, the PQ code fetched this second URL "outside" of stream_from_uplink, so an error downloading this file would not result in retrying the Uplink request itself on the next endpoint.

This PR factors out the body of stream_from_uplink into stream_from_uplink_transforming_new_response, which takes an async function that is called on any UplinkResponse::New to transform the response. (stream_from_uplink passes the identity function here, so its API is not affected.) If this function returns Err, the uplink code moves on to the next endpoint URL, just like if the initial Uplink fetch had failed. The PQ layer now fetches the PQ manifest chunk bodies inside this async function instead of by mapping the uplink stream.
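
As a rough sketch of the shape this gives (simplified on purpose: the String error type, the bare endpoint list, and fetch_from_endpoint below are illustrative assumptions, not the router's actual signatures):

    // Minimal sketch only: `String` errors, the bare endpoint list, and
    // `fetch_from_endpoint` are illustrative assumptions, not the router's
    // actual types.
    async fn fetch_transforming_new_response<T, F, Fut>(
        endpoints: &[String],
        mut transform_new_response: F,
    ) -> Result<T, String>
    where
        F: FnMut(String) -> Fut,
        Fut: std::future::Future<Output = Result<T, String>>,
    {
        let mut last_error = String::from("no Uplink endpoints configured");
        for endpoint in endpoints {
            match fetch_from_endpoint(endpoint).await {
                // The transform runs inside the per-endpoint loop, so an error
                // here (e.g. a failed manifest chunk download) moves on to the
                // next endpoint exactly like a failed Uplink fetch would.
                Ok(new_response) => match transform_new_response(new_response).await {
                    Ok(transformed) => return Ok(transformed),
                    Err(e) => last_error = e,
                },
                Err(e) => last_error = e,
            }
        }
        Err(last_error)
    }

    // Placeholder for the real Uplink GraphQL request.
    async fn fetch_from_endpoint(endpoint: &str) -> Result<String, String> {
        Ok(format!("response from {endpoint}"))
    }

In this sketch, stream_from_uplink's unchanged behavior corresponds to passing the identity transform, for example |response| async move { Ok(response) }.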

This means that if our GCP Uplink server is up but the GCS servers that serve manifests from GCP are down, we will fail over to the AWS Uplink server instead of just failing to fetch PQs.

Additionally, the response from Uplink can specify multiple valid URLs for each chunk. (Uplink does not currently do this, but the protocol allows for it.) Before this PR, Router looked only at the first URL listed; after this PR, it tries each URL in order until it finds one that succeeds.
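
For concreteness, a chunk entry might look something like this (field names are illustrative; this is not the actual Uplink schema):

    // Hypothetical shape; the real Uplink schema may use different names.
    struct PersistedQueriesManifestChunk {
        id: String,
        // The protocol allows several equivalent URLs per chunk; the router
        // now walks this list in order and uses the first URL that succeeds,
        // instead of reading only urls[0].
        urls: Vec<String>,
    }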

Additionally, before this PR, any errors fetching PQs from Uplink or GCS after a successful startup were ignored. (They were sent on the ready_sender channel whose receiver had already been dropped; that could produce a debug-level log message about the closed channel, but the errors themselves were never directly logged.) After this PR, once the channel has been used, we no longer try to send these errors on it; instead, we log them at the error level (similar to how schema and license Uplink errors are handled). Note that this only happens when Router fails to fetch PQs from all configured Uplink endpoints; a failure to fetch from a single endpoint is still logged only at the debug level.
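
A sketch of the resulting "send once, then log" pattern, assuming a tokio oneshot channel for startup readiness (the actual channel and struct layout in the router may differ):

    use tokio::sync::oneshot;

    struct PersistedQueriesPoller {
        // Present only until the first result has been delivered to the
        // startup code awaiting readiness.
        ready_sender: Option<oneshot::Sender<Result<(), String>>>,
    }

    impl PersistedQueriesPoller {
        fn report(&mut self, result: Result<(), String>) {
            if let Some(sender) = self.ready_sender.take() {
                // First result: hand it to whoever is awaiting startup.
                let _ = sender.send(result);
            } else if let Err(e) = result {
                // After startup the receiver is gone, so log instead of
                // sending into a closed channel. This fires only when all
                // configured Uplink endpoints failed; a single-endpoint
                // failure stays at the debug level.
                tracing::error!("failed to fetch persisted queries: {}", e);
            }
        }
    }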

Co-authored-by: Jeremy Lempereur jeremy.lempereur@iomentum.com


router-perf bot commented Sep 20, 2023

CI performance tests

  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • large-request - Stress test with a 1 MB request payload
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • reload - Reload test over a long period of time at a constant rate of users
  • no-graphos - Basic stress test, no GraphOS.
  • xxlarge-request - Stress test with 100 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • const - Basic stress test that runs with a constant number of users

                return Ok(());
            }
            Err(e) => {
                if it.peek().is_some() {
@glasser (Member, Author) commented:

This is a different pattern from how stream_from_uplink (or specifically, fetch) handles errors.

In stream_from_uplink, individual failures are logged at the debug level with a message saying "Other endpoints will be tried" (even if it's the last one!), and if they all fail, a constant Err without the specific error message is returned. My understanding is that we don't want to spam stderr with single-endpoint errors, but this means the eventual error carries no specific details on the problem that occurred (unless you reproduce it with the log level set to debug).

Here, we do the debug-with-"Other endpoints will be tried" thing only when there actually are more URLs to try; errors from the last URL are returned directly.

I think this is better than the existing approach because it's nice to actually include information about some failure in the eventual error, but I'm happy to rewrite this to work more like fetch if you'd like.
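
To make that concrete, here is a sketch of the pattern this comment describes (try_fetch is a hypothetical stand-in for the chunk download, not the router's actual helper):

    // Sketch of the error handling discussed above; `try_fetch` is a
    // hypothetical stand-in for the chunk download.
    async fn fetch_first_available(urls: &[String]) -> Result<String, String> {
        let mut it = urls.iter().peekable();
        while let Some(url) = it.next() {
            match try_fetch(url).await {
                Ok(body) => return Ok(body),
                Err(e) => {
                    if it.peek().is_some() {
                        // More URLs remain: log quietly and keep going.
                        tracing::debug!(
                            "failed to fetch {}: {}. Other endpoints will be tried",
                            url,
                            e
                        );
                    } else {
                        // Last URL: return the concrete error so the eventual
                        // error message carries the real details.
                        return Err(e);
                    }
                }
            }
        }
        Err("no URLs to fetch from".to_string())
    }

    async fn try_fetch(url: &str) -> Result<String, String> {
        // Placeholder failure so the fallback path is exercised.
        Err(format!("placeholder failure for {url}"))
    }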

@o0Ignition0o (Contributor) left a comment:

LGTM

Comment on lines 387 to 416
# As of 2023-09-20, no current kube versions are supported by kubeconform,
# due to https://github.com/yannh/kubeconform/issues/233. So we skip
# this check for now.
#
# # Install Helm
# curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# # Install kubeconform
# KUBECONFORM_INSTALL=$(mktemp -d)
# curl -L https://github.com/yannh/kubeconform/releases/latest/download/kubeconform-linux-amd64.tar.gz | tar xz -C "${KUBECONFORM_INSTALL}"

# # Create list of kube versions
# CURRENT_KUBE_VERSIONS=$(curl -L https://raw.githubusercontent.com/kubernetes/website/main/data/releases/schedule.yaml \
#   | yq -o json '.' \
#   | jq --raw-output '.schedules[] | select((now | strftime("%Y-%m-%dT00:00:00Z")) as $date | .releaseDate < $date and .endOfLifeDate > $date) | .previousPatches[0].release')

# # Use helm to template our chart against all kube versions
# TEMPLATE_DIR=$(mktemp -d)
# for kube_version in ${CURRENT_KUBE_VERSIONS}; do
#   # Use helm to template our chart against kube_version
#   helm template --kube-version "${kube_version}" router helm/chart/router --set autoscaling.enabled=true > "${TEMPLATE_DIR}/router-${kube_version}.yaml"

#   # Execute kubeconform on our templated charts to ensure they are good
#   "${KUBECONFORM_INSTALL}/kubeconform" \
#     --kubernetes-version "${kube_version}" \
#     --strict \
#     --schema-location default \
#     --verbose \
#     "${TEMPLATE_DIR}/router-${kube_version}.yaml"
# done
A contributor commented:
#3867 (review) addresses this (workaround)

o0Ignition0o and others added 2 commits September 21, 2023 07:56
@abernix merged commit 70cb943 into dev on Sep 22, 2023
12 checks passed
@abernix deleted the glasser/fetch-with-callback branch on September 22, 2023 15:17
@lrlna mentioned this pull request on Sep 27, 2023