
Image pull errors: failed to do request: Head "http://localhost:8578/v2/..." when streaming not required #739

Open
awalford16 opened this issue Mar 4, 2024 · 15 comments

@awalford16

awalford16 commented Mar 4, 2024

Describe the bug

We have been experimenting with the ACR artifact streaming feature. As part of this, we set up a new ACR, copied some images across, and enabled streaming on it.

We then enabled ACR streaming on a nodepool, and since then we have seen some pod deployments fail with:

failed to do request: Head "http://localhost:8578/v2/IMAGE?ns=NON_STREAMING_ACR"

The pod deployments that are failing are trying to pull from a different ACR (without streaming enabled), but they are using the same nodepool we were using to test streaming. I have tried scaling down the nodes and deploying streaming pods onto a different nodepool, but we are still seeing this error on odd nodes that never ran a streaming pod.

When this happens the pod never successfully pulls; it continuously fails and we have to delete the node from the cluster.
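For anyone trying to confirm the same failure, errors like the one above typically surface in the pod's events; a minimal way to capture the full message with standard kubectl (pod and namespace names are placeholders) is:

# Show the failing pod's events, including the full image pull error
kubectl describe pod POD_NAME -n NAMESPACE

# Or list only the warning events recorded for that pod
kubectl get events -n NAMESPACE --field-selector involvedObject.name=POD_NAME,type=Warning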

To Reproduce
Steps to reproduce the behavior:

  1. Set up one ACR where the image has artifact streaming enabled
  2. Set up an ACR with no artifact streaming enabled and push the same image there
  3. Enable streaming on one nodepool
  4. Deploy 2 deployments onto that nodepool (1 using streaming ACR and 1 not)
  5. Scale up and down onto new nodes a couple of times, as the issue is intermittent (a rough CLI sketch of steps 1-4 follows this list)
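A rough CLI sketch of steps 1-4, using placeholder resource names; the az acr artifact-streaming commands come from the preview CLI extension, so verify the exact syntax against current docs:

# 1. Registry with artifact streaming enabled for the wordpress repository (Premium SKU assumed)
az acr create -g my-rg -n streamingacr --sku Premium
az acr import -n streamingacr --source docker.io/library/wordpress:latest --image wordpress:latest
az acr artifact-streaming update -n streamingacr --repository wordpress --enable-streaming true

# 2. Second registry with the same image and streaming left disabled
az acr create -g my-rg -n nonstreamingacr --sku Premium
az acr import -n nonstreamingacr --source docker.io/library/wordpress:latest --image wordpress:latest

# 3. Enable artifact streaming on a single nodepool
az aks nodepool update -g my-rg --cluster-name my-aks -n streampool --enable-artifact-streaming

# 4. Deploy two workloads to that nodepool, one per registry, e.g. via a nodeSelector on the pool name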

Expected behavior

All pod deployments, whether coming from Artifact Streaming or standard image pull should work as normal


Any relevant environment information

  • OS: Ubuntu
  • AKS version: 1.28.3
  • Datetime (UTC) when the issue occurred: 4 March 2024 @ 10:00


awalford16 added the bug label Mar 4, 2024
@juliusl
Member

juliusl commented Mar 4, 2024

Thanks for reporting this. Does the nodepool have access to both registries?

@awalford16
Author

Hi, yes we have a service principal tied to the AKS cluster that has pull permissions for both ACRs

@juliusl
Member

juliusl commented Mar 6, 2024

Got it. Can you confirm a couple of things to help narrow down the root cause?

  • If artifact streaming is disabled in the cluster, are there any issues?

  • Are there any issues when only a single registry with streaming enabled is used?

Meanwhile, I will try to reproduce the issue on my end as well.

@awalford16
Author

I can confirm there are no issues if artifact streaming is disabled in the cluster. We have only started seeing this since we enabled it, and we only see it on the one nodepool that we enabled it on.

I was able to get the image to pull from the same ACR when it was not using streaming. However, the issue appears to be temperamental and hard to reproduce, as it affects random nodes (even though they have not interacted with our streaming-enabled ACR at any point).

Could you please share the command to disable artifact streaming on a nodepool? I could disable it on the pool where we are seeing issues and validate that the issue goes away.

@juliusl
Member

juliusl commented Mar 18, 2024

@awalford16 I'm able to reproduce this on my side, so I'll be working on a fix. ETA for a fix is some time in the next month.

cc: @northtyphoon @ganeshkumarashok

@juliusl
Member

juliusl commented Mar 19, 2024

@awalford16 so I did a bit more digging and I have a workaround you could try. It appears to happen when you have the same image reference with different registries in the same pod.

For example,

Fails

apiVersion: v1
kind: Pod
metadata:
  name: mix-wordpress
spec:
  containers:
  - name: wordpress-streaming
    image: streaming.azurecr.io/wordpress:latest
  - name: wordpress-nonstreaming
    image: non-streaming.azurecr.io/wordpress:latest

Works

apiVersion: v1
kind: Pod
metadata:
  name: non-wordpress
spec:
  containers:
  - name: wordpress-nonstreaming
    image: non-streaming.azurecr.io/wordpress:latest

Works

apiVersion: v1
kind: Pod
metadata:
  name: wordpress
spec:
  containers:
  - name: wordpress-streaming
    image: streaming.azurecr.io/wordpress:latest

(I tested these all running on the same node pool w/ node-selectors)

I am working on figuring out the root cause and a fix, but just wanted to share a possible workaround you could try for your own evaluation.

@juliusl
Member

juliusl commented Mar 20, 2024

So it looks like this can affect any pod spec that has multiple containers. There seems to have been a regression in AKS/containerd, but I'm still trying to narrow it down.

@juliusl
Member

juliusl commented Mar 20, 2024

@awalford16 Could you provide the value of this label from your nodepool that has this issue:

`kubernetes.azure.com/node-image-version`

For example, it should be a value that looks like this: AKSUbuntu-2204gen2containerd-202403.13.0
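If it helps, one generic way to read that label across all nodes with standard kubectl (the -L flag just adds the label as an extra column):

kubectl get nodes -L kubernetes.azure.com/node-image-version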

@awalford16
Author

@juliusl thanks for looking into this. The label value is AKSUbuntu-2204gen2containerd-202401.17.1

@juliusl
Member

juliusl commented Mar 23, 2024

@awalford16 good news: I figured out the issue and I have a fix. I'm working on the release, so it should be about a week or two for it to make its way upstream.

@juliusl
Member

juliusl commented May 13, 2024

@awalford16 Hey there, just to close the loop: the fix has been rolled out to all AKS regions for about a week or two now. Are you able to update your node images and give it a try?
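For reference, a node-image-only upgrade can usually be triggered per nodepool with something like the following; the resource group, cluster, and nodepool names here are placeholders:

az aks nodepool upgrade -g my-rg --cluster-name my-aks -n streampool --node-image-only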

@awalford16
Author

Thanks @juliusl! Looks like it is working on my end now. For confirmation, these are the versions on my nodes: 5.15.0-1061-azure and containerd://1.7.15-1.

@tolga-hmcts

Has the fix (#739) been distributed to UK South? And how can I roll back "az aks nodepool update --enable-artifact-streaming"?

@juliusl
Member

juliusl commented Jul 19, 2024

@awalford16 thanks for following up!

@juliusl
Member

juliusl commented Jul 19, 2024

And how can I rollback "az aks nodepool update --enable-artifact-streaming"?

@tolga-hmcts I'm getting someone from the AKS side to chime in on that

Has the fix (#739) been distributed to UK South?

Yes, it should have been rolled out at the time of your comment
