
Image pull errors: failed to do request: Head "http://localhost:8578/v2/..." when streaming not required #739

Open
awalford16 opened this issue Mar 4, 2024 · 15 comments

@awalford16

awalford16 commented Mar 4, 2024

Describe the bug

We have been experimenting with the ACR artifact streaming feature. As part of this, we set up a new ACR, copied some images across, and enabled streaming on it.

We then enabled ACR streaming on a nodepool, and since then we have seen some pod deployments fail with:

failed to do request: Head "http://localhost:8578/v2/IMAGE?ns=NON_STREAMING_ACR"

The pod deployments that are failing are trying to pull from a different ACR (without streaming enabled), but they are using the same nodepool we were using to test streaming. I have tried scaling down the nodes and deploying streaming pods onto a different nodepool, but we are still seeing this error on odd nodes that never ran a streaming pod.

When this happens the pod never successfully pulls; it continuously fails and we have to delete the node from the cluster.
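For anyone trying to confirm the same failure, errors like the one above typically surface in the pod's events; a minimal way to capture the full message with standard kubectl (pod and namespace names are placeholders) is:

# Show the failing pod's events, including the full image pull error
kubectl describe pod POD_NAME -n NAMESPACE

# Or list only the warning events recorded for that pod
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD_NAME,type=Warning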

To Reproduce
Steps to reproduce the behavior:

  1. Set up one ACR where the image has artifact streaming enabled
  2. Set up an ACR with no artifact streaming enabled and push the same image there
  3. Enable streaming on one nodepool
  4. Deploy 2 deployments onto that nodepool (1 using streaming ACR and 1 not)
  5. Scale up and down onto new nodes a couple of times, as the issue is intermittent (a rough CLI sketch of steps 1-4 follows this list)
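A rough CLI sketch of steps 1-4, using placeholder resource names; the az acr artifact-streaming commands come from the preview CLI extension, so verify the exact syntax against current docs:

# 1. Registry with artifact streaming enabled for the wordpress repository (Premium SKU assumed)
az acr create -g my-rg -n streamingacr --sku Premium
az acr import -n streamingacr --source docker.io/library/wordpress:latest --image wordpress:latest
az acr artifact-streaming update -n streamingacr --repository wordpress --enable-streaming true

# 2. Second registry with the same image and streaming left disabled
az acr create -g my-rg -n nonstreamingacr --sku Premium
az acr import -n nonstreamingacr --source docker.io/library/wordpress:latest --image wordpress:latest

# 3. Enable artifact streaming on a single nodepool
az aks nodepool update -g my-rg --cluster-name my-aks -n streampool --enable-artifact-streaming

# 4. Deploy two workloads to that nodepool, one per registry, e.g. via a nodeSelector on the pool name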

Expected behavior

All pod deployments, whether coming from Artifact Streaming or standard image pull should work as normal


Any relevant environment information

  • OS: Ubuntu
  • AKS version: 1.28.3
  • Datetime (UTC) when the issue occurred: 4 March 2024 @ 10:00


awalford16 added the bug label Mar 4, 2024
@juliusl
Member

juliusl commented Mar 4, 2024

Thanks for reporting this. Does the nodepool have access to both registries?

@awalford16
Author

Hi, yes we have a service principal tied to the AKS cluster that has pull permissions for both ACRs

@juliusl
Member

juliusl commented Mar 6, 2024

Got it. Can you confirm a couple of things to help narrow down the root cause?

  • If artifact streaming is disabled in the cluster, are there any issues?

  • Are there any issues when only a single registry with streaming enabled is used?

Meanwhile, I will try to reproduce the issue on my end as well.

@awalford16
Author

I can confirm there are no issues if artifact streaming is disabled in the cluster. We have only started seeing this since we enabled it, and we only see it on the one nodepool that we enabled it on.

I was able to get the image to pull from the same ACR when it was not using streaming. However, the issue appears to be temperamental and hard to reproduce, as it affects random nodes (even though they have not interacted with our streaming-enabled ACR at any point).

Could you please share the command to disable artifact streaming on a nodepool? I could disable it on the pool where we are seeing issues and validate that the issue goes away.

@juliusl
Member

juliusl commented Mar 18, 2024

@awalford16 I'm able to reproduce this on my side, so I'll be working on a fix. ETA for a fix is some time in the next month.

cc: @northtyphoon @ganeshkumarashok

@juliusl
Member

juliusl commented Mar 19, 2024

@awalford16 so I did a bit more digging and I have a workaround you could try. It appears to happen when you have the same image reference with different registries in the same pod.

For example,

Fails

apiVersion: v1
kind: Pod
metadata:
  name: mix-wordpress
spec:
  containers:
  - name: wordpress-streaming
    image: streaming.azurecr.io/wordpress:latest
  - name: wordpress-nonstreaming
    image: non-streaming.azurecr.io/wordpress:latest

Works

apiVersion: v1
kind: Pod
metadata:
  name: non-wordpress
spec:
  containers:
  - name: wordpress-nonstreaming
    image: non-streaming.azurecr.io/wordpress:latest

Works

apiVersion: v1
kind: Pod
metadata:
  name: wordpress
spec:
  containers:
  - name: wordpress-streaming
    image: streaming.azurecr.io/wordpress:latest

(I tested these all running on the same node pool w/ node-selectors)

I am working on figuring out the root cause and a fix, but just wanted to share a possible workaround you could try for your own evaluation.

@juliusl
Member

juliusl commented Mar 20, 2024

So it looks like this can affect any pod spec that has multiple containers. There seems to have been a regression in AKS/containerd, but I'm still trying to narrow it down.

@juliusl
Member

juliusl commented Mar 20, 2024

@awalford16 Could you provide the value of this label from your nodepool that has this issue:

`kubernetes.azure.com/node-image-version`

For example, it should be a value that looks like this: AKSUbuntu-2204gen2containerd-202403.13.0
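If it helps, one generic way to read that label across all nodes with standard kubectl (the -L flag just adds the label as an extra column):

kubectl get nodes -L kubernetes.azure.com/node-image-version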

@awalford16
Author

@juliusl thanks for looking into this. The label value is AKSUbuntu-2204gen2containerd-202401.17.1

@juliusl
Member

juliusl commented Mar 23, 2024

@awalford16 good news: I figured out the issue and I have a fix. I'm working on the release, so it should be about a week or two for it to make its way upstream.

@juliusl
Member

juliusl commented May 13, 2024

@awalford16 Hey there, just to close the loop: the fix has been rolled out to all AKS regions for about a week or two now. Are you able to update your node images and give it a try?
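For reference, a node-image-only upgrade can usually be triggered per nodepool with something like the following; the resource group, cluster, and nodepool names here are placeholders:

az aks nodepool upgrade -g my-rg --cluster-name my-aks -n streampool --node-image-only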

@awalford16
Author

Thanks @juliusl! Looks like it is working on my end now. For confirmation, these are the versions on my nodes: 5.15.0-1061-azure and containerd://1.7.15-1.

@tolga-hmcts

Has the fix (#739) been distributed to UK South? And how can I roll back "az aks nodepool update --enable-artifact-streaming"?

@juliusl
Member

juliusl commented Jul 19, 2024

@awalford16 thanks for following up!

@juliusl
Member

juliusl commented Jul 19, 2024

And how can I rollback "az aks nodepool update --enable-artifact-streaming"?

@tolga-hmcts I'm getting someone from the AKS side to chime in on that

Has the fix (#739) been distributed to UK South?

Yes, it should have been rolled out at the time of your comment
