This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

Synchronization timeouts (Dev Spaces) #407

Closed
Eneuman opened this issue Sep 24, 2020 · 149 comments

Comments

@Eneuman

Eneuman commented Sep 24, 2020

About 4 hours ago we started receiving synchronization timeouts.
"azds up" shows a timeout error after trying for 4 minutes.

Our Kubernetes cluster looks healthy, and we have tried from two different networks with the same result.
Is there a problem with synchronization at the moment?

@philon-msft
Contributor

No, we don't know of any issues that would cause synchronization to fail for you. Have any changes in your project caused more or bigger files to be synced?

@Eneuman
Author

Eneuman commented Sep 25, 2020

Hi and thanks for the quick reply

Nothing has changed in the project, but the project is fairly large (around 20,000 files).
I just tried again and now it synced fine and finished the build in just 1.4 minutes.

We keep seeing this performance degradation, and it always occurs around 16:30 UTC.
Our large project stops synchronizing, and the smaller ones take a lot longer, but the next morning it works fine again.

We have tried 3 different networks from 3 different carriers with the same result.

We have run performance checks on our Kubernetes cluster, but the node has plenty of resources left, and the application on the cluster runs smoothly and responsively.

Does the synchronization happen just between our clients and our cluster, or are there other servers involved as well?

@philon-msft
Contributor

Hi @Eneuman, it sounds like there may be a pattern of high network traffic around 16:30 in nearby resources in the data center where your cluster and Controller are hosted. Are you in the West Europe region? There are some other Dev Spaces resources involved in the sync operation, but we don't have anything scheduled to run around that time.

@Eneuman
Author

Eneuman commented Oct 1, 2020

@philon-msft Yes, I am in the West Europe region.

And like clockwork, the synchronization just stopped working again :(
"Timed out after 42s trying to list azds pods"

Running "kubectl get pods" returns the pods in under a second.

At the moment, dev spaces is unusable for us after 16:30.

@Eneuman
Author

Eneuman commented Oct 1, 2020

And now it's back up and working without any changes on our side. This has got to be a resource problem. Do you have any Dev Spaces telemetry from the West Europe region you can take a look at? :)

@Eneuman
Author

Eneuman commented Nov 7, 2020

Synchronization stopped working again on 6/11 at 14:45 UTC.
Today, 7/11 at 11:00 UTC, it is still not working. All our projects (even the small ones) receive synchronization timeouts.
The cluster looks fine; no changes have been made.

This is really frustrating... Please help us figure out what is going on.

@Eneuman
Author

Eneuman commented Nov 8, 2020

Still not working :(

Using dev space dev with controller aks-we-eclinic-dev
Synchronizing files...4m 0s
Timed out waiting for file synchronization to complete

And the command azds down gives the following error:
Timed out after 1m trying to list services.

but azds list-up works fine.

@Eneuman
Author

Eneuman commented Nov 9, 2020

Still not working :(

azds down
Timed out after 1m trying to list services.

@daniv-msft
Collaborator

@Eneuman Sorry you're encountering this issue. :( We're having a look into this, and will keep you updated.

@daniv-msft
Collaborator

Adding @shpathak-msft, who is looking into this.

@Eneuman
Author

Eneuman commented Nov 9, 2020

Thank you!
Let me know if there is anything I can do to help!

Feel free to access our cluster if you need to troubleshoot.

//Per

@shpathak-msft
Contributor

@Eneuman Could you please reach out to me at shpathak@microsoft.com? I have some logs from our backend service that I would like to share with you.

@Eneuman
Author

Eneuman commented Nov 9, 2020 via email

@shpathak-msft
Contributor

Thank you for reaching out and glad to know code synchronization is now working for you. I suppose you already know this, but I would suggest reaching out to AKS support if you continue to see the "could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1" error.
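For reference, a quick way to check on the cluster side whether that aggregated metrics API is healthy (a minimal sketch; the label selector assumes the standard metrics-server deployment in kube-system):

# Check the apiservice named in the error message
kubectl get apiservice v1beta1.metrics.k8s.io

# If it is not reporting Available, look at the metrics-server pods
kubectl get pods -n kube-system -l k8s-app=metrics-server
kubectl logs -n kube-system -l k8s-app=metrics-server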

@Eneuman
Author

Eneuman commented Nov 12, 2020

Thank you for getting synchronization working again.
This has happened to us before, and it blocks us from working when it does. Are there any actions being taken to stop this issue from occurring again?

@daniv-msft
Collaborator

@Eneuman Happy that we unblocked you! The issue you encountered is tricky to investigate because it is either transient, or it relies on us restarting the component on our side associated with your Azure Dev Spaces controller. This component didn't show specific errors itself that would help us find the root cause. When this issue happened, it also didn't line up with a deployment on our side or with code changes.
We know that Azure Dev Spaces relies on a complex codebase and can be unreliable, and this is part of the reason why we decided to move to another approach.

If I remember correctly, you investigated transitioning to Bridge to Kubernetes in the past but cannot do it at present because you rely on pod identity for your services. Is this correct? We're currently investigating adding support for pod identity and, if it's the only blocker for you, we would be happy to provide you with early bits as soon as we have something to share.

@Eneuman
Author

Eneuman commented Nov 14, 2020

@daniv-msft Yes we are interested in transitioning to Bridge to Kubernetes but two features are missing.

Pod Identity (as you said) and the ability to stop all other instances of a particular pod when we start debugging it.

We rely heavily on Azure Service Bus and pub/sub, so when debugging a service we don't want any other running instances of the pod we are debugging to receive the events.

@daniv-msft
Collaborator

@Eneuman Thanks for your quick reply.
All right, as of today we indeed expect the targeted service to be backed by only one pod, and we fail if that's not the case. We should be able to automatically decrease the number of pods to 1 when debugging, and increase it back afterwards. I'm bumping the priority of this in our backlog so that we align it with our pod identity story.
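In the meantime, a manual workaround along those lines is possible with plain kubectl (a minimal sketch; the deployment name, namespace, and replica counts are placeholders, not taken from this thread):

# Scale the service you are about to debug down to a single replica,
# so only the debugged instance receives Service Bus messages
kubectl scale deployment/<your-service> -n dev --replicas=1

# ...debug with azds / Bridge to Kubernetes...

# Restore the original replica count afterwards
kubectl scale deployment/<your-service> -n dev --replicas=<original-count>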

Regarding pod identity, could you please confirm that your team members are using Windows machines to develop? From previous discussions, I believe that you're using Windows yourself.
In this case, the target OS is important, as we might need pod identity to have a specific implementation for each OS (networking is done very differently on Windows compared to Linux, for example).

@Eneuman
Author

Eneuman commented Nov 14, 2020

@daniv-msft Yes, we are only using Windows machines and Visual Studio 2019 Pro.

Really looking forward to trying the new bits out :)

@daniv-msft
Collaborator

@Eneuman All right, thanks for confirming!
Sounds good, we'll keep you in the loop as soon as we have a beta for pod identity.

@Eneuman
Author

Eneuman commented Feb 15, 2021

@daniv-msft I think we just got this issue again :(
Can you please take a look and see if you can get it running again?

Enabling debugging...
Timed out after 1m trying to list azds pods.

Regards
Per

@amsoedal
Collaborator

Hi @Eneuman, I've restarted some of the components on our side. Can you let me know if the issue keeps reproducing? Thanks!

@Eneuman
Author

Eneuman commented Feb 15, 2021

Hi @amsoedal
Yes, now it's running smoothly again. Thank you!

@Eneuman
Author

Eneuman commented Feb 15, 2021

@amsoedal
Hi again, maybe I was too quick there.
I'm still having problems with at least two of my pods.

One is giving me this error. Is this on my side?

pods/klinikportalservice-757c5cc75b-cd6b7: BackOff: Back-off restarting failed container

@amsoedal
Collaborator

Hi @Eneuman, sorry you're still having issues! Could you please run kubectl logs pods/klinikportalservice-757c5cc75b-cd6b7 -p (you might need to specify a container name with -c, but it should prompt you if necessary) to see what happened?

@Eneuman
Author

Eneuman commented Feb 15, 2021

@amsoedal

Hmm, for some reason it seems like the Azure ADO cert is no longer valid. Maybe a service issue?

kubectl logs -n dev pods/klinikportalservice-757c5cc75b-cd6b7 devspaces-build

2021-02-15T17:53:35.9003579Z | BuildService | TRACE | Starting build process.\nOperation context: <json>{"clientRequestId":"5a8260c7-6a35-4651-ba04-b960705bc226","correlationRequestId":"113b9760-35ec-4773-b952-a3733a6be493","requestId":null,"userSubscriptionId":null,"startTime":"2021-02-15T17:53:34.9226682Z","userAgent":"BuildSvc/1.0.20210206.1","requestHttpMethod":null,"requestUri":null,"apiVersion":null,"version":"1.0.20210206.1","requestHeaders":{},"loggingProperties":{"applicationName":"BuildService","deviceOperatingSystem":"Linux 5.4.0-1031-azure #32~18.04.1-Ubuntu SMP Tue Oct 6 10:03:22 UTC 2020","framework":".NET Core 4.6.28325.01"}}</json>
2021-02-15T17:53:35.9076895Z | BuildService | TRACE | Download workspace
2021-02-15T17:53:35.9646040Z | BuildService | TRACE | Log Handler started
2021-02-15T17:53:36.1176640Z | BuildService | WARNG | DownloadWorkspaceAsync failed with {"Message":"The SSL connection could not be established, see inner exception.","Data":{},"InnerException":{"ClassName":"System.Security.Authentication.AuthenticationException","Message":"The remote certificate is invalid according to the validation procedure.","Data":null,"InnerException":null,"HelpURL":null,"StackTraceString":"   at System.Net.Security.SslState.StartSendAuthResetSignal(ProtocolToken message, AsyncProtocolRequest asyncRequest, ExceptionDispatchInfo exception)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequestasyncRequest)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)\n--- End of stack trace from previous location where exception was thrown ---\n   at System.Net.Security.SslState.ThrowIfExceptional()\n   at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)\n   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)\n   at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)\n   at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)\n   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)\n--- End of stack trace from previous location where exception was thrown ---\n   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":null,"HResult":-2146233087,"Source":"System.Private.CoreLib","WatsonBuckets":null},"StackTrace":"   at Microsoft.Rest.RetryDelegatingHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n   at 
Microsoft.Azure.DevSpaces.Common.Auth.Handlers.AuthErrorHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in /app/common.auth/Handlers/AuthErrorHttpHandler.cs:line 47\n   at Microsoft.Azure.DevSpaces.Common.Auth.Handlers.ServiceClientCredentialsHttpHandler`1.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in /app/common.auth/Handlers/ServiceClientCredentialsHttpHandler.cs:line 34\nat System.Net.Http.HttpClient.FinishSendAsyncUnbuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)\n   at Microsoft.Azure.DevSpaces.ExecService.Client.WorkspaceOperations.DownloadSourceWithHttpMessagesAsync(String spaceName, String serviceName, String name, Dictionary`2 customHeaders, CancellationToken cancellationToken) in /app/execsvc.client.autogen/WorkspaceOperations.cs:line 161\n   at Microsoft.Azure.DevSpaces.ExecService.Client.WorkspaceOperationsExtensions.DownloadSourceAsync(IWorkspaceOperations operations, String spaceName, String serviceName, String name, CancellationToken cancellationToken) in /app/execsvc.client.autogen/WorkspaceOperationsExtensions.cs:line 49\n   at Microsoft.Azure.DevSpaces.Build.BuildClient.DownloadWorkspaceAsync() in /app/build/BuildClient.cs:line 154","HelpLink":null,"Source":"Microsoft.Rest.ClientRuntime","HResult":-2146233087}, retry.
2021-02-15T17:53:40.2424473Z | BuildService | ERROR | Logging handled exception: System.Net.Http.HttpRequestException: {"Message":"The SSL connection could not be established, see inner exception.","Data":{},"InnerException":{"ClassName":"System.Security.Authentication.AuthenticationException","Message":"The remote certificate is invalid according to the validation procedure.","Data":null,"InnerException":null,"HelpURL":null,"StackTraceString":"   at System.Net.Security.SslState.StartSendAuthResetSignal(ProtocolToken message, AsyncProtocolRequest asyncRequest, ExceptionDispatchInfo exception)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming,Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\n   at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)\n--- End of stack trace from previous location where exception was thrown ---\n   at System.Net.Security.SslState.ThrowIfExceptional()\n   at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)\n   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)\n   at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)\n   at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)\n   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)\n--- End of stack trace from previous location where exception was thrown ---\n   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":null,"HResult":-2146233087,"Source":"System.Private.CoreLib","WatsonBuckets":null},"StackTrace":"   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken 
cancellationToken)\n   at System.Threading.Tasks.ValueTask`1.get_Result()\n   at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n   at System.Threading.Tasks.ValueTask`1.get_Result()\n   at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)\n   at System.Threading.Tasks.ValueTask`1.get_Result()\n   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)\n   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\n   at Microsoft.Azure.DevSpaces.Common.Auth.Handlers.ServiceClientCredentialsHttpHandler`1.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in /app/common.auth/Handlers/ServiceClientCredentialsHttpHandler.cs:line 34\n  at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)\n   at Microsoft.Azure.DevSpaces.Common.Logging.AzdsHttpTelemetryLogger.<>c__DisplayClass12_0.<<-ctor>b__0>d.MoveNext() in /app/common/Logging/Loggers/AzdsHttpTelemetryLogger.cs:line 44\n--- End of stack trace from previous location where exception was thrown ---\n   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)\n--- End of stack trace from previous location where exception was thrown ---\n   at System.Threading.ThreadPoolWorkQueue.Dispatch()","HelpLink":null,"Source":"System.Net.Http","HResult":-2146233087}
2021-02-15T17:53:40.2727271Z | BuildService | ERROR | Unhandled AppDomain exception! System.Net.Http.HttpRequestException: The SSL connection could not be established, seeinner exception.
2021-02-15T17:53:40.2731763Z | BuildService | ERROR | AppDomain base exception details: System.Security.Authentication.AuthenticationException: The remote certificate is invalid according to the validation procedure.
2021-02-15T17:53:40.2736504Z | BuildService | ERROR | Is AppDomain terminating? 'True'

Unhandled Exception: System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception. ---> System.Security.Authentication.AuthenticationException: The remote certificate is invalid according to the validation procedure.
   at System.Net.Security.SslState.StartSendAuthResetSignal(ProtocolToken message, AsyncProtocolRequest asyncRequest, ExceptionDispatchInfo exception)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Security.SslState.ThrowIfExceptional()
   at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)
   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)
   at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)
   at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.CreateConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.WaitForCreatedConnectionAsync(ValueTask`1 creationTask)
   at System.Threading.Tasks.ValueTask`1.get_Result()
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Azure.DevSpaces.Common.Auth.Handlers.ServiceClientCredentialsHttpHandler`1.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in /app/common.auth/Handlers/ServiceClientCredentialsHttpHandler.cs:line 34
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Microsoft.Azure.DevSpaces.Common.Logging.AzdsHttpTelemetryLogger.<>c__DisplayClass12_0.<<-ctor>b__0>d.MoveNext() in /app/common/Logging/Loggers/AzdsHttpTelemetryLogger.cs:line 44
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

@amsoedal
Collaborator

amsoedal commented Feb 15, 2021

@Eneuman Thanks for attaching this! Can you try again now?

@Eneuman
Author

Eneuman commented Feb 15, 2021

@amsoedal Yes, now it's working again :) Thanks!

@amsoedal
Collaborator

Great! Happy to hear it :)

@Eneuman
Author

Eneuman commented Feb 21, 2021

@amsoedal Hi again. It looks like this problem has started to occur again :(
The logs show the same problem as above.

Can you please apply your fix again?

azds up also gives me this error:

Oops... An unexpected error has occurred. A report of the error will be sent to Microsoft.
For diagnostic information, see Azure Dev Spaces logs at 'C:\Users\PerBornsjö\AppData\Local\Temp\Azure Dev Spaces'.
Please include the following Client Request ID when contacting support: 01779312-4e11-4199-a2b5-b8ed5de999cb

And the log contains this, so it looks like that SSL error again.

2021-02-21T16:26:34.2473765Z | CLI | WARNG | Logging handled exception: Microsoft.Azure.DevSpaces.Client.Exceptions.OperationIdException: {"RequestId":null,"ClientRequestId":"01779312-4e11-4199-a2b5-b8ed5de999cb","CorrelationRequestId":null,"Request":null,"Response":null,"Format":"The SSL connection could not be established, see inner exception.","Args":[],"Message":"The SSL connection could not be established, see inner exception.","Data":{},"InnerException":{"Message":"The SSL connection could not be established, see inner exception.","Data":{},"InnerException":{"ClassName":"System.Security.Authentication.AuthenticationException","Message":"The remote certificate is invalid according to the validation procedure.","Data":null,"InnerException":null,"HelpURL":null,"StackTraceString":"   at System.Net.Security.SslState.StartSendAuthResetSignal(ProtocolToken message, AsyncProtocolRequest asyncRequest, ExceptionDispatchInfo exception)\r\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.StartReceiveBlob(Byte[] buffer, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.CheckCompletionBeforeNextReceive(ProtocolToken message, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.StartSendBlob(Byte[] incoming, Int32 count, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.ProcessReceivedBlob(Byte[] buffer, Int32 count, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.StartReadFrame(Byte[] buffer, Int32 readBytes, AsyncProtocolRequest asyncRequest)\r\n   at System.Net.Security.SslState.PartialFrameCallback(AsyncProtocolRequest asyncRequest)\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Net.Security.SslState.InternalEndProcessAuthentication(LazyAsyncResult lazyResult)\r\n   at System.Net.Security.SslState.EndProcessAuthentication(IAsyncResult result)\r\n   at System.Net.Security.SslStream.EndAuthenticateAsClient(IAsyncResult asyncResult)\r\n   at System.Net.Security.SslStream.<>c.<AuthenticateAsClientAsync>b__47_1(IAsyncResult iar)\r\n   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Net.Http.ConnectHelper.EstablishSslConnectionAsyncCore(Stream stream, SslClientAuthenticationOptions sslOptions, 
CancellationToken cancellationToken)","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":null,"HResult":-2146233087,"Source":"System.Private.CoreLib","WatsonBuckets":null},"StackTrace":"   at Microsoft.Rest.RetryDelegatingHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.DevSpaces.Common.Auth.Handlers.AuthErrorHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\common.auth\\Handlers\\AuthErrorHttpHandler.cs:line 47\r\n   at Microsoft.Azure.DevSpaces.Common.Logging.OperationContextRequestIdsHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\common\\Logging\\OperationContextRequestIdsHttpHandler.cs:line 31\r\n   at Microsoft.Azure.DevSpaces.Client.ServiceClients.Handlers.ClientOperationContextRequestIdsHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\client\\Client\\ServiceClients\\Handlers\\ClientOperationContextRequestIdsHttpHandler.cs:line 31\r\n   at Microsoft.Azure.DevSpaces.Client.ServiceClients.Handlers.UnsupportedRegionErrorHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\client\\Client\\ServiceClients\\Handlers\\UnsupportedRegionErrorHttpHandler.cs:line 22\r\n   at Microsoft.Azure.DevSpaces.Client.ServiceClients.Handlers.ApiVersionCheckerHttpHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\client\\Client\\ServiceClients\\Handlers\\ApiVersionCheckerHttpHandler.cs:line 26\r\n   at Microsoft.Azure.DevSpaces.Common.Auth.Handlers.ServiceClientCredentialsHttpHandler`1.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\common.auth\\Handlers\\ServiceClientCredentialsHttpHandler.cs:line 34\r\n   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)\r\n   at Microsoft.Azure.DevSpaces.ExecService.Client.SpacesOperations.ListSpacesWithHttpMessagesAsync(Dictionary`2 customHeaders, CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\execsvc.client.autogen\\SpacesOperations.cs:line 590\r\n   at Microsoft.Azure.DevSpaces.Client.ServiceClients.ExecExceptionStrategy.RunWithHandlingAsync[T](Func`1 func, FailureConfig failureConfig) in C:\\A\\1\\55\\s\\src\\client\\Client\\ServiceClients\\ExceptionStrategies\\ExecExceptionStrategy.cs:line 59\r\n   at Microsoft.Azure.DevSpaces.Client.ManagementClients.SpaceManagementClientImplementation.ListSpacesAsync(CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\client\\Client\\ManagementClients\\SpaceManagementClientImplementation.cs:line 299\r\n   at Microsoft.Azure.DevSpaces.Client.Extensions.OwnedExtensions.<>c__DisplayClass2_0`2.<<TryRunOwnedOperationThenDisposeAsync>b__0>d.MoveNext() in C:\\A\\1\\55\\s\\src\\client\\Extensions\\OwnedExtensions.cs:line 62\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at Microsoft.Azure.DevSpaces.Client.Utilities.AutofacUtilities.TryRunWithErrorPropagationAsync[T](Func`1 func, ILog log, IOperationContext operationContext) in C:\\A\\1\\55\\s\\src\\client\\Utilities\\AutofacUtilities.cs:line 81","HelpLink":null,"Source":"Microsoft.Rest.ClientRuntime","HResult":-2146233087},"StackTrace":"   at 
Microsoft.Azure.DevSpaces.Client.Utilities.AutofacUtilities._Handle[T](Exception e, ILog log, IOperationContext operationContext, Boolean isRecurse) in C:\\A\\1\\55\\s\\src\\client\\Utilities\\AutofacUtilities.cs:line 163\r\n   at Microsoft.Azure.DevSpaces.Client.Utilities.AutofacUtilities.TryRunWithErrorPropagationAsync[T](Func`1 func, ILog log, IOperationContext operationContext) in C:\\A\\1\\55\\s\\src\\client\\Utilities\\AutofacUtilities.cs:line 87\r\n   at Microsoft.Azure.DevSpaces.Cli.Settings.DevSpacesSettingsManager.GetLocalSettingsAsync(CancellationToken cancellationToken) in C:\\A\\1\\55\\s\\src\\cli\\Settings\\DevSpacesSettingsManager.cs:line 124\r\n   at Microsoft.Azure.DevSpaces.Cli.AppContainerConfig.<>c.<BuildContainer>b__0_5(IComponentContext c) in C:\\A\\1\\55\\s\\src\\cli\\AppContainerConfig.cs:line 268\r\n   at Autofac.RegistrationExtensions.<>c__DisplayClass5_0`1.<Register>b__0(IComponentContext c, IEnumerable`1 p)\r\n   at Autofac.Builder.RegistrationBuilder.<>c__DisplayClass0_0`1.<ForDelegate>b__0(IComponentContext c, IEnumerable`1 p)\r\n   at Autofac.Core.Activators.Delegate.DelegateActivator.ActivateInstance(IComponentContext context, IEnumerable`1 parameters)\r\n   at Autofac.Core.Resolving.InstanceLookup.Activate(IEnumerable`1 parameters)","HelpLink":null,"Source":null,"HResult":-2146233088}
2021-02-21T16:26:34.2501562Z | CLI | TRACE | Event: Command.End <json>{"properties":{"arguments":"up","result":"Failed"},"metrics":{"duration":4066.0}}</json>

@Eneuman
Author

Eneuman commented Apr 8, 2021

Did you just change something? It started working!
Or does something get automatically restarted at this time?

@daniv-msft
Collaborator

@Eneuman No, we were still digging into the logs we have on our side to try to understand what was happening. We didn't restart anything manually. The only automatic restart script we have (restarting Traefik pods every couple of hours), which we added a couple of weeks ago, didn't run in the last few minutes.

We confirm that we can now see on our side that the build did happen: Pushing build logs for space 'dev' and service 'testapplication' and requestId x0178b3bb0f35249d

@Eneuman
Author

Eneuman commented Apr 8, 2021

Could it be some type of cache that got evicted?
I'm going to recreate the cluster again just to see whether the Azure network plugin was the problem or not.

@Eneuman
Author

Eneuman commented Apr 8, 2021

I'm going to have to call it a night. I've been at it for 20 hours straight, and I can deploy services to my cluster now, so I'm unblocked. Even though we don't know why it started working, I call it a win :)

@daniv-msft
Collaborator

@Eneuman Makes sense, it's super late in Sweden right now! Yes, we share your frustration. :( Happy that you're unblocked at least!

@Eneuman
Author

Eneuman commented Apr 9, 2021

@daniv-msft Hi
I have done a lot of research today, and this is what I have found so far.
My old cluster worked with Dev Spaces and Pod Identity; I never upgraded Pod Identity, so we don't know what version was actually running.

From what I can tell, today's version of Dev Spaces is not compatible with any existing version of Pod Identity, so it looks like I can't set up a new cluster and we are stuck.

My findings so far:
1) If I install only the Dev Spaces controller, I can do azds up and it works
2) If I install the Dev Spaces controller first and then AAD Pod Identity: when I do an azds up, the build pod crashes with the error below

An error has occurred. Please contact Azure Dev spaces team for troubleshooting.
2021-04-09T11:41:15.7706931Z | BuildService | TRACE | AssemblyLoadContext unloading\nOperation context: <json>{"clientRequestId":"2ccf90f3-f2e4-43a3-ae94-6a6d81615450","correlationRequestId":"1f4d7edc-4e9a-4cc0-9138-918ccd629f63","requestId":null,"userSubscriptionId":null,"startTime":"2021-04-09T11:39:53.2813578Z","userAgent":"BuildSvc/1.0.20210206.1","requestHttpMethod":null,"requestUri":null,"apiVersion":null,"version":"1.0.20210206.1","requestHeaders":{},"loggingProperties":{"applicationName":"BuildService","deviceOperatingSystem":"Linux 5.4.0-1043-azure #45~18.04.1-Ubuntu SMP Sat Mar 20 16:16:05 UTC 2021","framework":".NET Core 4.6.28325.01"}}</json>
2021-04-09T11:42:15.7698385Z | BuildService | TRACE | Process exiting...
2021-04-09T11:42:15.7704239Z | BuildService | TRACE | Process exited

3) If I install AAD Pod Identity first and then the Dev Spaces controller: the init pods do not get initialized and the build process does not start.

I have tried all the existing AAD Pod Identity versions available, with the same problem :(
We need AAD Pod Identity to work for our software to run.

You can reproduce this behavior by creating an AKS cluster like this:
az aks create --subscription $SubscriptionName --resource-group $ResourceGroupName --location $DefaultLocation --max-pods 100 --name $AksClusterName --node-count 1 --node-vm-size $AksNodeSize --kubernetes-version $KubernetesVersion --output none

add a dev namespace: kubectl create namespace dev

add Dev Spaces: az aks use-dev-spaces -n <name> -g <resource group> and pick the dev namespace when prompted

and then add AAD Pod Identity:

kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/v1.6.3/deploy/infra/deployment-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/v1.6.3/deploy/infra/mic-exception.yaml

and then create a simple .NET Core service and do an azds up.
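For convenience, here is the same repro consolidated into one script (a sketch only; the $-variables and <placeholders> are values you would substitute yourself, and az aks get-credentials is added as an assumed step to point kubectl at the new cluster):

# 1. Create a clean AKS cluster
az aks create --subscription $SubscriptionName --resource-group $ResourceGroupName --location $DefaultLocation --max-pods 100 --name $AksClusterName --node-count 1 --node-vm-size $AksNodeSize --kubernetes-version $KubernetesVersion --output none
az aks get-credentials --resource-group $ResourceGroupName --name $AksClusterName

# 2. Create the dev namespace and enable Dev Spaces on it (pick 'dev' when prompted)
kubectl create namespace dev
az aks use-dev-spaces -n $AksClusterName -g $ResourceGroupName

# 3. Install AAD Pod Identity
kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/v1.6.3/deploy/infra/deployment-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/Azure/aad-pod-identity/v1.6.3/deploy/infra/mic-exception.yaml

# 4. Scaffold a simple .NET Core service and run azds up from its folder
azds up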

@amsoedal
Collaborator

amsoedal commented Apr 9, 2021

@Eneuman thanks for these details. We're still working to get to the bottom of this, but it helps to know that pod identity is the determining difference.

@Eneuman
Author

Eneuman commented Apr 9, 2021

@amsoedal Sounds great :)

I think one of the problems is that AAD Pod Identity is intercepting the Managed Identity call that the build init container makes, and that breaks the build.

I noticed that if I add an AAD exception like this:

apiVersion: "aadpodidentity.k8s.io/v1"
kind: AzurePodIdentityException
metadata:
  name: azure-dev-spaces-exception
  namespace: azds
spec:
  podLabels:
    app: azds-injector-webhook

then the webhook containers start to work.
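(For completeness, a sketch of how the exception above would be applied, assuming it is saved as azds-exception.yaml; as noted further down in this thread, the MIC/NMI pods may need to be recreated for it to take effect:)

kubectl apply -f azds-exception.yaml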

From NMI logs:

I0409 13:10:26.424302       1 main.go:86] starting nmi process. Version: v1.7.5. Build date: 2021-04-02-21:14.
I0409 13:10:26.526716       1 crd.go:447] CRD lite informers started
I0409 13:10:26.526742       1 main.go:113] running NMI in namespaced mode: false
I0409 13:10:26.526751       1 nmi.go:53] initializing in standard mode
I0409 13:10:26.526756       1 probes.go:41] initialized health probe on port 8085
I0409 13:10:26.526760       1 probes.go:44] started health probe
I0409 13:10:26.526819       1 metrics.go:341] registered views for metric
I0409 13:10:26.526867       1 prometheus_exporter.go:21] starting Prometheus exporter
I0409 13:10:26.526879       1 metrics.go:347] registered and exported metrics on port 9090
I0409 13:10:26.526981       1 server.go:102] listening on port 2579
W0409 13:10:26.736008       1 iptables.go:123] flushing iptables to add aad-metadata custom chains
I0409 13:12:17.988699       1 server.go:362] exception pod azds/azds-webhook-deployment-78d598464d-xh749 token handling
I0409 13:12:17.988720       1 server.go:305] fetching token for user assigned MSI for resource: https://azds.io/
I0409 13:12:17.998412       1 server.go:196] status (200) took 9791388 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.63

The build does not work, though.

@amsoedal
Collaborator

amsoedal commented Apr 9, 2021

@Eneuman are you able to debug your app after adding this exception?
Edit: I just saw you said the build still doesn't work. Can you share the exception from the build container?

@Eneuman
Author

Eneuman commented Apr 9, 2021

No, but adding the exception makes the webhook pods work, by the look of it. Building the images does not work: the build init container never finishes, and the proxy init containers stay in Pending state.

@amsoedal
Collaborator

amsoedal commented Apr 9, 2021

@Eneuman are you still using user assigned managed identity?

@Eneuman
Author

Eneuman commented Apr 9, 2021

Yes

@Eneuman
Author

Eneuman commented Apr 9, 2021

Have you been able to reproduce this behaviour in your cluster?

@daniv-msft
Collaborator

@Eneuman We're trying this, and will let you know how it goes.

@daniv-msft
Collaborator

@Eneuman I confirm we do reproduce the issue on multiple clusters. We're still looking into what goes wrong. This issue is challenging.

@daniv-msft
Collaborator

@Eneuman Something strange for us is that we don't see the webhook issue you mentioned above. On our side, the webhook seems to work fine (without adding the AAD exception), and we can even see the request to start building the image showing up in its logs:
[screenshot: webhook logs showing the build request]

@daniv-msft
Collaborator

Also, we can see the Build service exiting because of an error:

λ kubectl logs -n dev webfrontend-76d68d944c-frtpm -c devspaces-build
An error has occurred. Please contact Azure Dev spaces team for troubleshooting.
2021-04-09T15:42:14.7684021Z | BuildService | TRACE | AssemblyLoadContext unloading\nOperation context: <json>{"clientRequestId":"a3e1d09a-a1ef-446a-868b-3b769e2126e7","correlationRequestId":"9423774f-a445-463e-839f-a683ec44dc0e","requestId":null,"userSubscriptionId":null,"startTime":"2021-04-09T15:33:09.3853715Z","userAgent":"BuildSvc/1.0.20210206.1","requestHttpMethod":null,"requestUri":null,"apiVersion":null,"version":"1.0.20210206.1","requestHeaders":{},"loggingProperties":{"applicationName":"BuildService","deviceOperatingSystem":"Linux 5.4.0-1043-azure #45~18.04.1-Ubuntu SMP Sat Mar 20 16:16:05 UTC 2021","framework":".NET Core 4.6.28325.01"}}</json>
2021-04-09T15:43:14.7748874Z | BuildService | TRACE | Process exiting...
2021-04-09T15:43:14.7754218Z | BuildService | TRACE | Process exited

Unfortunately, the logs don't show much as of today, but we're working to rebuild this image with more logging and see what new info we can get. Do you see something similar on your side?

@Eneuman
Author

Eneuman commented Apr 9, 2021 via email

@amsoedal
Collaborator

amsoedal commented Apr 9, 2021

@Eneuman, Can you check the nmi and mic pod logs in the default namespace on your cluster? What do they say?

@amsoedal
Collaborator

amsoedal commented Apr 9, 2021

Also, I opened this issue on the pod-identity page. Hopefully they'll be able to shed some light. Azure/aad-pod-identity#1039

@Eneuman
Author

Eneuman commented Apr 9, 2021

Working on getting you the logs, @amsoedal.
Just wanted to let you know that I have tried using both Azure CNI and Kubenet as the network plugin, and both show the same errors.

@Eneuman
Author

Eneuman commented Apr 9, 2021

@amsoedal & @daniv-msft

My setup:
A new, clean AKS cluster using Kubenet as the network plugin.

Installed AAD Pod Identity like this:
helm install aad-pod-identity aad-pod-identity/aad-pod-identity --set nmi.allowNetworkPluginKubenet=true
Installed Dev Spaces after AAD Pod Identity.
Tried to azds up a service.
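(For reference, a sketch of the full Helm sequence implied here; the chart repo URL is the one documented by the aad-pod-identity project and is added as an assumption, since only the install command was pasted above:)

helm repo add aad-pod-identity https://raw.githubusercontent.com/Azure/aad-pod-identity/master/charts
helm repo update
helm install aad-pod-identity aad-pod-identity/aad-pod-identity --set nmi.allowNetworkPluginKubenet=true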

C:\Users\PerBornsjö>kubectl logs aad-pod-identity-mic-5b5f866bfc-mpn6w -n default
I0409 19:07:04.272950       1 main.go:114] starting mic process. Version: v1.7.5. Build date: 2021-04-02-21:14
W0409 19:07:04.273008       1 main.go:119] --kubeconfig not passed will use InClusterConfig
I0409 19:07:04.273017       1 main.go:136] kubeconfig () cloudconfig (/etc/kubernetes/azure.json)
I0409 19:07:04.273186       1 main.go:144] running MIC in namespaced mode: false
I0409 19:07:04.273200       1 main.go:148] client QPS set to: 5. Burst to: 5
I0409 19:07:04.273219       1 mic.go:139] starting to create the pod identity client. Version: v1.7.5. Build date: 2021-04-02-21:14
I0409 19:07:04.297870       1 mic.go:145] Kubernetes server version: v1.18.14
I0409 19:07:04.298183       1 cloudprovider.go:122] MIC using user assigned identity: a8e2##### REDACTED #####b15f for authentication.
I0409 19:07:04.364540       1 probes.go:41] initialized health probe on port 8080
I0409 19:07:04.364563       1 probes.go:44] started health probe
I0409 19:07:04.364677       1 metrics.go:341] registered views for metric
I0409 19:07:04.364756       1 prometheus_exporter.go:21] starting Prometheus exporter
I0409 19:07:04.364764       1 metrics.go:347] registered and exported metrics on port 8888
I0409 19:07:04.364772       1 mic.go:244] initiating MIC Leader election
I0409 19:07:04.364788       1 leaderelection.go:243] attempting to acquire leader lease  default/aad-pod-identity-mic...
I0409 19:07:04.378490       1 leaderelection.go:253] successfully acquired lease default/aad-pod-identity-mic
I0409 19:07:04.580434       1 mic.go:317] upgrading the types to work with case sensitive go-client
I0409 19:07:04.785861       1 mic.go:321] type upgrade completed !!
I0409 19:07:05.271023       1 pod.go:72] pod cache synchronized. Took 300.611305ms
I0409 19:07:05.271054       1 pod.go:79] pod watcher started !!
I0409 19:07:05.471439       1 crd.go:456] CRD informers started
I0409 19:07:05.471482       1 mic.go:402] sync thread started.
C:\Users\PerBornsjö>kubectl logs aad-pod-identity-mic-5b5f866bfc-xrlwn -n default
I0409 19:07:04.365311       1 main.go:114] starting mic process. Version: v1.7.5. Build date: 2021-04-02-21:14
W0409 19:07:04.365358       1 main.go:119] --kubeconfig not passed will use InClusterConfig
I0409 19:07:04.365364       1 main.go:136] kubeconfig () cloudconfig (/etc/kubernetes/azure.json)
I0409 19:07:04.365543       1 main.go:144] running MIC in namespaced mode: false
I0409 19:07:04.365560       1 main.go:148] client QPS set to: 5. Burst to: 5
I0409 19:07:04.365580       1 mic.go:139] starting to create the pod identity client. Version: v1.7.5. Build date: 2021-04-02-21:14
I0409 19:07:04.390443       1 mic.go:145] Kubernetes server version: v1.18.14
I0409 19:07:04.390701       1 cloudprovider.go:122] MIC using user assigned identity: a8e2##### REDACTED #####b15f for authentication.
I0409 19:07:04.394529       1 probes.go:41] initialized health probe on port 8080
I0409 19:07:04.394549       1 probes.go:44] started health probe
I0409 19:07:04.394628       1 metrics.go:341] registered views for metric
I0409 19:07:04.394712       1 prometheus_exporter.go:21] starting Prometheus exporter
I0409 19:07:04.394722       1 metrics.go:347] registered and exported metrics on port 8888
I0409 19:07:04.394728       1 mic.go:244] initiating MIC Leader election
I0409 19:07:04.394744       1 leaderelection.go:243] attempting to acquire leader lease  default/aad-pod-identity-mic...
C:\Users\PerBornsjö>kubectl logs aad-pod-identity-nmi-brvk4 -n default
I0409 19:07:02.821400       1 main.go:86] starting nmi process. Version: v1.7.5. Build date: 2021-04-02-21:14.
I0409 19:07:03.023889       1 crd.go:447] CRD lite informers started
I0409 19:07:03.023917       1 main.go:113] running NMI in namespaced mode: false
I0409 19:07:03.023931       1 nmi.go:53] initializing in standard mode
I0409 19:07:03.023937       1 probes.go:41] initialized health probe on port 8085
I0409 19:07:03.023942       1 probes.go:44] started health probe
I0409 19:07:03.024023       1 metrics.go:341] registered views for metric
I0409 19:07:03.024092       1 prometheus_exporter.go:21] starting Prometheus exporter
I0409 19:07:03.024102       1 metrics.go:347] registered and exported metrics on port 9090
I0409 19:07:03.024195       1 server.go:102] listening on port 2579
W0409 19:07:04.268479       1 iptables.go:123] flushing iptables to add aad-metadata custom chains
W0409 19:08:26.542949       1 client.go:208] list pod error: pod list is empty. Retrying, attempt number: 1
W0409 19:08:26.595143       1 client.go:208] list pod error: pod list is empty. Retrying, attempt number: 1
W0409 19:08:26.843583       1 client.go:208] list pod error: pod list is empty. Retrying, attempt number: 2
W0409 19:08:26.895699       1 client.go:208] list pod error: pod list is empty. Retrying, attempt number: 2
E0409 19:09:47.159672       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-g6gx8 in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:09:47.159736       1 server.go:196] status (404) took 80616850012 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15
E0409 19:09:47.204851       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-tbbpm, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-tbbpm in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:09:47.204897       1 server.go:196] status (404) took 80609814981 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:09:58.070757       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-tbbpm, error: context canceled
I0409 19:09:58.070797       1 server.go:196] status (404) took 8864937498 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:10:13.202240       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: context canceled
I0409 19:10:13.202283       1 server.go:196] status (404) took 24041424510 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15
E0409 19:11:18.382253       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-tbbpm, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-tbbpm in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:11:18.382288       1 server.go:196] status (404) took 80007974470 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:11:28.061983       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-tbbpm, error: context canceled
I0409 19:11:28.062023       1 server.go:196] status (404) took 7678830463 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:11:33.466788       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-g6gx8 in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:11:33.466829       1 server.go:196] status (404) took 80006828758 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15
E0409 19:11:43.244913       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: context canceled
I0409 19:11:43.244948       1 server.go:196] status (404) took 7777117147 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15
E0409 19:12:48.337972       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-tbbpm, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-tbbpm in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:12:48.338009       1 server.go:196] status (404) took 80008820452 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:12:58.074291       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-tbbpm, error: context canceled
I0409 19:12:58.074347       1 server.go:196] status (404) took 7735453855 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:13:03.524754       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-g6gx8 in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:13:03.524794       1 server.go:196] status (404) took 80008753409 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15
E0409 19:13:13.211651       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: context canceled
I0409 19:13:13.211694       1 server.go:196] status (404) took 7685847265 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15
E0409 19:14:18.418035       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-tbbpm, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-tbbpm in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:14:18.418073       1 server.go:196] status (404) took 80008906645 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:14:28.056407       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-tbbpm, error: context canceled
I0409 19:14:28.056446       1 server.go:196] status (404) took 7637079441 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:14:33.485128       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-g6gx8 in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:14:33.485168       1 server.go:196] status (404) took 80009732397 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15
E0409 19:14:43.236099       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: context canceled
I0409 19:14:43.236163       1 server.go:196] status (404) took 7748416317 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15

As you can see, the webhook gets intercepted by NMI and fails, so no init pods get created.
But if I add my exceptions from above I can get around this (after recreating the MIC and NMI pods).

Now the NMI logs look like this:

I0409 19:30:00.804890       1 server.go:362] exception pod azds/azds-webhook-deployment-7567578c7f-tbbpm token handling
I0409 19:30:00.804908       1 server.go:305] fetching token for user assigned MSI for resource: https://azds.io/
I0409 19:30:01.255704       1 server.go:196] status (200) took 450852054 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.14
E0409 19:30:14.749637       1 server.go:375] failed to get matching identities for pod: azds/azds-webhook-deployment-7567578c7f-g6gx8, error: getting assigned identities for pod azds/azds-webhook-deployment-7567578c7f-g6gx8 in CREATED state failed after 16 attempts, retry duration [5]s, error: <nil>. Check MIC pod logs for identity assignment errors
I0409 19:30:14.749703       1 server.go:196] status (404) took 80015113793 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15
I0409 19:30:16.750779       1 server.go:362] exception pod azds/azds-webhook-deployment-7567578c7f-g6gx8 token handling
I0409 19:30:16.750799       1 server.go:305] fetching token for user assigned MSI for resource: https://azds.io/
I0409 19:30:16.761309       1 server.go:196] status (200) took 10586120 ns for req.method=GET reg.path=/metadata/identity/oauth2/token req.remote=10.244.0.15

And now the init pods start, but the build init container does not complete. Its state is always Running.

I think the problem is that the build pod gets intercepted because of this PR: Azure/aad-pod-identity#191. It would be very interesting to see what would happen if it were possible to exclude init containers from AAD interception...
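One way to explore that would be to check which labels the injected build/proxy pods actually carry, to see whether an AzurePodIdentityException could match them at all (a sketch; the label names are unknown here and would come from the output):

# List the Dev Spaces-injected pods together with their labels
kubectl get pods -n dev --show-labels
kubectl get pods -n azds --show-labels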

@amsoedal
Collaborator

amsoedal commented Apr 9, 2021

Hey again,
@pragyamehta and I have been following up with the pod identity team. A couple things:

  • We followed this doc, and part of it is setting up an exception for azds infrastructure. So that might help.
  • The other thing is, after we followed this, when we ran kubectl describe AzureIdentityBinding my-cluster-mi-binding -n dev, we saw the following error:
Events:
  Type     Reason               Age                  From                   Message
  ----     ------               ----                 ----                   -------
  Warning  binding apply error  113s (x2 over 113s)  aadpodidentity.k8s.io  failed to apply binding dev/my-cluster-mi-binding node aks-nodepool1-87199437-vmss000000 for pod dev/bikes-58b465dd5d-ts6md, error: failed to update identities for aks-nodepool1-87199437-vmss in MC_redacted_westeurope, error: compute.VirtualMachineScaleSetsClient#Update: Failure sending request: StatusCode=403 -- Original Error: Code="AuthorizationFailed" Message="The client '<redacted>' with object id '<redacted>' does not have authorization to perform action 'Microsoft.Compute/virtualMachineScaleSets/write' over scope '/subscriptions/<redacted>/resourceGroups/MC_redacted_westeurope/providers/Microsoft.Compute/virtualMachineScaleSets/aks-nodepool1-87199437-vmss' or the scope is invalid. If access was recently granted, please refresh your credentials."

Could you please run this command on your own AzureIdentityBinding?
If you see this error, the Pod Identity team told us to follow https://azure.github.io/aad-pod-identity/docs/getting-started/role-assignment/.
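For reference, the fix described in that role-assignment doc amounts to granting the cluster's kubelet identity the required roles on the node resource group. A hedged sketch (role names and the exact identity to use are spelled out in the linked doc; the placeholders are not taken from this thread):

# Identity the kubelets run as, and the node resource group of the cluster
CLIENT_ID=$(az aks show -g <resource-group> -n <cluster-name> --query identityProfile.kubeletidentity.clientId -o tsv)
NODE_RG=$(az aks show -g <resource-group> -n <cluster-name> --query nodeResourceGroup -o tsv)
NODE_RG_ID=$(az group show -n "$NODE_RG" --query id -o tsv)

# Grant the roles AAD Pod Identity needs to assign identities to the VMSS nodes
az role assignment create --role "Virtual Machine Contributor" --assignee "$CLIENT_ID" --scope "$NODE_RG_ID"
az role assignment create --role "Managed Identity Operator" --assignee "$CLIENT_ID" --scope "$NODE_RG_ID"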

@amsoedal
Collaborator

amsoedal commented Apr 9, 2021

@Eneuman Our contact also told us to look out for https://azure.github.io/aad-pod-identity/docs/#v16x-breaking-change. If you follow the AZDS doc I linked above, you have to change the fields to camelcase.

@Eneuman
Author

Eneuman commented Apr 9, 2021

@amsoedal Yes, the casing in my bindings should be correct; I have double-checked them.
The documentation has it wrong and probably should be updated, PodLabels should be podLabels now ;)

Running kubectl describe AzureIdentityBinding eclinic-pod-identity-binding -n dev says this:

Name:         eclinic-pod-identity-binding
Namespace:    dev
Labels:       <none>
Annotations:  <none>
API Version:  aadpodidentity.k8s.io/v1
Kind:         AzureIdentityBinding
Metadata:
  Creation Timestamp:  2021-04-09T20:19:38Z
  Generation:          1
  Managed Fields:
    API Version:  aadpodidentity.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:azureIdentity:
        f:selector:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2021-04-09T20:19:38Z
  Resource Version:  14237
  Self Link:         /apis/aadpodidentity.k8s.io/v1/namespaces/dev/azureidentitybindings/eclinic-pod-identity-binding
  UID:               4d4bfbe6-3b3b-4636-8882-XXXXXXXX
Spec:
  Azure Identity:  eclinic-pod-identity
  Selector:        eclinic_pod
Events:            <none>

but I think that's okay, since my test pod does not use managed identities yet (I was going to add this after getting it running).

@daniv-msft
Collaborator

Hi @Eneuman,
We're still investigating, and there is one thing that we would like to validate: could you please confirm that pod identity (independently of Azure Dev Spaces) works in a cluster as configured above? Would you have an easy way to test that?

@Eneuman
Author

Eneuman commented Apr 12, 2021

@daniv-msft I have not been able to test this yet, but I'm working on it.

I did, however, find one problem on my side.
When I created the webhook exceptions for AAD, I followed this troubleshooting guide:
https://docs.microsoft.com/en-us/azure/dev-spaces/troubleshooting#error-no-azureassignedidentity-found-for-podazdsazds-webhook-deployment-id-in-assigned-state

However, PodLabels should be podLabels, otherwise it won't work. It's probably best to update that page to reflect the latest breaking changes in Pod Identity 1.7.5. After fixing this, the webhooks got excluded fine and the build process worked.

Now I'm only stuck with Pod Identity not working because of an Identity Not Found issue. Tracking it here: Azure/aad-pod-identity#1040

I'm not sure if this is related to the Dev Spaces and Pod Identity interception. So we are making progress :)

@amsoedal
Collaborator

@Eneuman thanks for the update! Glad you're able to get past the build issue now at least!

@Eneuman
Author

Eneuman commented May 24, 2021

We have migrated to Bridge to Kubernetes, so we will close this issue.

@Eneuman Eneuman closed this as completed May 24, 2021