persisted queries: improve Uplink failover and error handling #3863
Conversation
```rust
                return Ok(());
            }
            Err(e) => {
                if it.peek().is_some() {
```
This is a different pattern from how `stream_from_uplink` (or specifically, `fetch`) handles errors.

In `stream_from_uplink`, individual failures are logged at `debug` with a message specifying "Other endpoints will be tried" (even if it's the last one!), and if they all fail, a constant `Err` without the specific error message in it is returned. My understanding is that we don't want to spam stderr with single-endpoint errors, but this does mean that the eventual error doesn't have specific details on the problem that occurred (unless you go and reproduce with the log level set to debug).

Here, we do the `debug`-with-"Other endpoints will be tried" thing only when there actually are more URLs to try; errors from the last URL are returned directly. I think this is better than the existing approach because it's nice to actually include information about some failure in the eventual error, but I'm happy to rewrite this to work more like `fetch` if you'd like.
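The pattern under discussion can be sketched as below. This is a hedged, synchronous simplification, not the router's actual code: `try_fetch` and the endpoint names are illustrative stand-ins, and a plain `eprintln!` stands in for the real debug-level logging.

```rust
// Sketch of the reviewed pattern: failures are logged (at what would be debug
// level) only while more endpoints remain; the error from the last endpoint
// is returned directly, so the caller sees the specific failure.
fn fetch_with_failover(endpoints: &[&str]) -> Result<String, String> {
    let mut it = endpoints.iter().peekable();
    while let Some(endpoint) = it.next() {
        match try_fetch(endpoint) {
            Ok(body) => return Ok(body),
            Err(e) => {
                if it.peek().is_some() {
                    // Not the last endpoint: log quietly and move on.
                    eprintln!("debug: fetch from {endpoint} failed ({e}); other endpoints will be tried");
                } else {
                    // Last endpoint: surface the specific error to the caller.
                    return Err(e);
                }
            }
        }
    }
    Err("no Uplink endpoints configured".to_string())
}

// Stand-in for a real network fetch: only "good" endpoints succeed.
fn try_fetch(endpoint: &str) -> Result<String, String> {
    if endpoint.ends_with("good") {
        Ok("manifest".to_string())
    } else {
        Err(format!("connection refused: {endpoint}"))
    }
}

fn main() {
    // First endpoint fails and is logged; second succeeds.
    assert_eq!(
        fetch_with_failover(&["https://a.bad", "https://b.good"]).unwrap(),
        "manifest"
    );
    // All endpoints fail: the last endpoint's specific error is returned.
    assert!(fetch_with_failover(&["https://a.bad", "https://b.bad"]).is_err());
}
```

In contrast, the existing `stream_from_uplink` approach would log every failure (including the last) and return a constant error with no details.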
LGTM
.circleci/config.yml
Outdated
```yaml
          # As of 2023-09-20, no current kube versions are supported by kubeconform,
          # due to https://github.com/yannh/kubeconform/issues/233. So we skip
          # this check for now.
          #
          # # Install Helm
          # curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
          #
          # # Install kubeconform
          # KUBECONFORM_INSTALL=$(mktemp -d)
          # curl -L https://github.com/yannh/kubeconform/releases/latest/download/kubeconform-linux-amd64.tar.gz | tar xz -C "${KUBECONFORM_INSTALL}"
          #
          # # Create list of kube versions
          # CURRENT_KUBE_VERSIONS=$(curl -L https://raw.githubusercontent.com/kubernetes/website/main/data/releases/schedule.yaml \
          #   | yq -o json '.' \
          #   | jq --raw-output '.schedules[] | select((now | strftime("%Y-%m-%dT00:00:00Z")) as $date | .releaseDate < $date and .endOfLifeDate > $date) | .previousPatches[0].release')
          #
          # # Use helm to template our chart against all kube versions
          # TEMPLATE_DIR=$(mktemp -d)
          # for kube_version in ${CURRENT_KUBE_VERSIONS}; do
          #   # Use helm to template our chart against kube_version
          #   helm template --kube-version "${kube_version}" router helm/chart/router --set autoscaling.enabled=true > "${TEMPLATE_DIR}/router-${kube_version}.yaml"
          #
          #   # Execute kubeconform on our templated charts to ensure they are good
          #   "${KUBECONFORM_INSTALL}/kubeconform" \
          #     --kubernetes-version "${kube_version}" \
          #     --strict \
          #     --schema-location default \
          #     --verbose \
          #     "${TEMPLATE_DIR}/router-${kube_version}.yaml"
          # done
```
#3867 (review) addresses this (as a workaround).
If an Uplink request fails, the `stream_from_uplink` function automatically tries the next configured Uplink endpoint. However, in the case of the persisted queries feature, the Uplink response contains a URL located in the same cloud provider as the Uplink endpoint (which is used to fetch the full PQ list chunk). Before this PR, the PQ code fetched this second URL "outside" of `stream_from_uplink`, so an error downloading this file would *not* result in re-trying the Uplink request itself on the next endpoint.

This PR factors out the body of `stream_from_uplink` into `stream_from_uplink_transforming_new_response`, which takes an async function that is called on any `UplinkResponse::New` to transform the response. (`stream_from_uplink` passes the identity function here, so its API is not affected.) If this function returns `Err`, the uplink code moves on to the next endpoint URL, just as if the initial Uplink fetch had failed. The PQ layer now fetches the PQ manifest chunk bodies inside this async function instead of by mapping the uplink stream. This means that if our GCP Uplink server is up but the GCS servers that serve manifests from GCP are down, we will fail over to the AWS Uplink server instead of just failing to fetch PQs.

Additionally, the response from Uplink can specify multiple valid URLs for each chunk. (Uplink does not currently do this, but the protocol allows for it.) Before this PR, Router would only look at the first URL listed; after this PR, it tries each URL in order until finding one that does not fail.

Additionally, before this PR, any errors fetching PQs from Uplink or GCS after a successful startup were ignored. (They would be sent on the `ready_sender` channel whose receiver is dropped, which could lead to a `debug`-level logged message about how the channel wasn't open, but they were not *directly* logged.) After this PR, we don't try to send these errors on the channel after the channel was used once; instead, we log them at the `error` level (similarly to how schema and license uplink errors work). Note that this only occurs if Router fails to fetch PQs from all configured Uplink endpoints; a failure to fetch from a single endpoint is still only logged at the `debug` level.

Co-authored-by: Jeremy Lempereur <jeremy.lempereur@iomentum.com>
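The failover shape this PR introduces can be sketched as follows. This is a hedged, synchronous simplification under stated assumptions: all names (`fetch_transforming`, the endpoint strings) are illustrative, and the real router code is async and stream-based rather than a plain function call.

```rust
// Sketch of the transform-aware failover: each endpoint's response is passed
// through a transform, and a transform error counts the same as a fetch
// error, so the next endpoint is tried. Passing the identity transform
// recovers the original fetch-only behavior.
fn fetch_transforming(
    endpoints: &[&str],
    fetch: impl Fn(&str) -> Result<String, String>,
    transform: impl Fn(String) -> Result<String, String>,
) -> Result<String, String> {
    let mut last_err = "no Uplink endpoints configured".to_string();
    for endpoint in endpoints {
        // Failure of either the initial fetch or the follow-up transform
        // (e.g. downloading a PQ manifest chunk hosted by the same cloud
        // provider) causes failover to the next endpoint.
        match fetch(endpoint).and_then(&transform) {
            Ok(value) => return Ok(value),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

fn main() {
    // Both uplink endpoints respond, but the first provider's manifest
    // download (the transform step) fails, so we fail over to the second.
    let fetch = |endpoint: &str| -> Result<String, String> { Ok(endpoint.to_string()) };
    let transform = |resp: String| -> Result<String, String> {
        if resp.contains("gcp") {
            Err("manifest chunk download failed".to_string())
        } else {
            Ok(format!("manifest via {resp}"))
        }
    };
    let result = fetch_transforming(&["uplink.gcp", "uplink.aws"], fetch, transform);
    assert_eq!(result.unwrap(), "manifest via uplink.aws");
}
```

The key design point described above is that the transform runs *inside* the retry loop, so a chunk-download failure triggers the same endpoint failover as an Uplink request failure, instead of being handled after the stream has already committed to one endpoint.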