Workflows with istio (or other service mesh) #1282
@dicarlo2 Were you ever able to figure out a way to get an istio-injected workflow working with artifact support? |
Maybe configuring istio to avoid intercepting s3 traffic with |
As you suggested, delaying killing sidecars in waitContainer() works. Also had to work around istio/istio#9454, but that didn't require changes in argo or istio code. Unsure if this kind of change is something argo maintainers want as a PR, or if there are downsides to doing this.

func waitContainer() error {
wfExecutor := initExecutor()
defer wfExecutor.HandleError()
defer stats.LogStats()
stats.StartStatsTicker(5 * time.Minute)
// Wait for main container to complete
err := wfExecutor.Wait()
if err != nil {
wfExecutor.AddError(err)
// do not return here so we can still try to kill sidecars & save outputs
}
// Saving logs
logArt, err := wfExecutor.SaveLogs()
if err != nil {
wfExecutor.AddError(err)
// do not return here so we can still try to kill sidecars
}
// Saving output parameters
err = wfExecutor.SaveParameters()
if err != nil {
wfExecutor.AddError(err)
// do not return here so we can still try to kill sidecars
}
// Saving output artifacts
err = wfExecutor.SaveArtifacts()
if err != nil {
wfExecutor.AddError(err)
// do not return here so we can still try to kill sidecars
}
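	// NOTE: in this variant the sidecars (including the istio proxy) are killed only after
	// logs, parameters, and artifacts have been saved, so those uploads can still reach
	// storage through the proxy.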
// Killing sidecar containers
err = wfExecutor.KillSidecars()
if err != nil {
wfExecutor.AddError(err)
return err
}
// Capture output script result
err = wfExecutor.CaptureScriptResult()
if err != nil {
wfExecutor.AddError(err)
return err
}
err = wfExecutor.AnnotateOutputs(logArt)
if err != nil {
wfExecutor.AddError(err)
return err
}
return nil
} |
Please go ahead and create a PR with the changes. We will review and merge. Thanks for contributing |
@sarabala1979 Is there any way to get an Argo Workflow to recognize an automatically injected istio sidecar? |
@nfickas is there an issue you are running into with istio sidecars, or what do you mean by "recognize"? We have been using 2.4.2 with istio sidecars automatically injected without any issues. |
For us this issue was resolved here: #1645 |
@ddseapy our workflows start up with the automatic sidecars but then the sidecars are stuck in a running state after the workflow completes. |
I am seeing some odd behavior on the wait container, looks like it Errored out. |
That's the istio issue I linked above. The underlying issue is: istio/istio#11130. Really it's because the k8s sidecar support PR (kubernetes/enhancements#753) hasn't gotten through, so there is no startup ordering, which causes a race condition. You can exclude the k8s api server from the IP range that envoy intercepts, or do things like change the docker command. It could also be fixed in argo by retrying the request on failure, with code changes in argo. |
I would rather not make any changes to argo code. Of the other two options, which one would be easier, and do you have any examples of either being done? Just curious, did you run into this issue when getting argo working with injected sidecars? |
The easiest thing to do is to just annotate your Argo workflows so that the auto-injector doesn't add a sidecar to them. Better support is coming in k8s (maybe in 1.18) but it's just not there yet. Jobs in k8s are the same way... there isn't really anything special about Argo there. Edit: This would be |
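For illustration, a minimal sketch of opting a workflow's pods out of injection with istio's standard annotation (the workflow itself is hypothetical; template-level metadata is one place the annotation can go):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: no-istio-
spec:
  entrypoint: main
  templates:
  - name: main
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"   # skip istio auto-injection for this pod
    container:
      image: alpine:3
      command: [sh, -c]
      args: ["echo hello"]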
if you don't need the sidecar, that is indeed easiest. We need to control traffic for our applications, so that is not an option for us. For the workflows we run, it was sufficient to exclude the k8s master IP range via |
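A minimal sketch of that exclusion on a template, using the same annotation that appears in the longer write-up further down this thread (the CIDR is a placeholder for your own API-server/master range, which the comment above does not spell out):

metadata:
  annotations:
    sidecar.istio.io/inject: "true"
    traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.0.0.1/32   # placeholder k8s master IP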
We need sidecars as well, I will try that out. Thanks! |
That's a good point, too. But just to be clear, I do believe that is going to make the pod stay up when the workflow is finished - correct? You may need a separate system to clean those up, I believe. I'm going off memory and could be mistaken. |
To state the obvious, requests to k8s api server will then bypass istio. This was not a security concern for our use-cases, but it could be for someone else. |
@jaygorrell that's why I put up the pr. we have been using 2.4.2 for quite a while now without issues, and the pod goes to completed state. |
For visibility - Prasanna Kumar Krishnamurthy in argo workflow slack suggested patching the pod spec (https://github.com/argoproj/argo/blob/master/examples/pod-spec-patch.yaml) with something like "containers":[{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 20; argoexec wait"]}] |
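A minimal sketch of that suggestion expressed through the workflow's podSpecPatch field (the field is the one the linked example uses; the 20-second sleep is the value from the suggestion, not a tuned number):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: delayed-wait-
spec:
  entrypoint: main
  podSpecPatch: |
    containers:
    - name: wait
      command: ["/bin/bash", "-c"]
      args: ["sleep 20; argoexec wait"]
  templates:
  - name: main
    container:
      image: alpine:3
      command: [sh, -c]
      args: ["echo hello"]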
@ddseapy So the above suggestion will just make it clean up the istio-proxy container after 20 seconds. I think the real issue is that the wait and main containers are launching before the proxy starts up. |
My understanding is the suggestion should delay the wait container running by editing its command. I would think that it would delay cleanup (as you said), but also prevent the startup race condition. If the main container has the same race condition (depends on your main container), you would need to wait there too. I haven't tried the approach myself yet though, so I may be misunderstanding things. |
hm yeah I am not seeing that happen, I am still seeing the main and wait containers start up before istio-proxy. For reference this is what I did: |
I'd still want this open. This is valuable for kubeflow pipelines multi-tenant support, because the api server needs to be protected by istio. Having istio sidecars in workflows allows in-cluster api calls from workflows. |
The startup problem should be resolved in Istio 1.7.0 as a workaround, by setting the following in your Istio configuration:

spec:
components:
proxy:
holdApplicationUntilProxyStarts: true |
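If mesh-wide configuration isn't an option, istio also exposes the same setting per pod via the proxy.istio.io/config annotation; a minimal sketch (the exact placement on an Argo template is an assumption to verify against your versions):

metadata:
  annotations:
    proxy.istio.io/config: |
      holdApplicationUntilProxyStarts: true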
Has anyone found a workaround for Linkerd2? |
Recently we ran into not 1, but 3 separate race conditions while working with argo in a test environment with istio enabled. I am posting this in hopes that it helps others, as certain aspects of this are covered in this issue but through replies that can be hard to follow. In addition, I believe case 3 may be new to this issue, and case 1 has a potential change to previous solutions. Relevant details of our istio environment: we defined a simple workflow that simply curled a host that had been whitelisted by istio. The workflow always contained this annotation:

metadata:
  annotations:
    sidecar.istio.io/inject: "true"

We were seeing race conditions in 3 areas:

Workflow job: This one was the easiest. somesite.com had been whitelisted for egress and our initial workflow had a command and args like this:

command: [sh, -c]
args: ["date; curl -v somesite.com"]

This resulted in an error in the main logs. Next I grabbed the proxy logs for the container:

kubectl logs <container> istio-proxy

From those, I could see that the envoy proxy was not up at the timestamp of the curl call. A log entry is made that states "Envoy proxy is ready" once it is up, so you can just grep for that.

Workflow job fix: The fix for the workflow job was to curl to check if the proxy was up, but with a modification to what has previously been posted. Other posts curl localhost:15000. What we found when doing that was that that is the admin port, and a response from it did not mean the proxy was up. Instead we check the health status port:

args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl https://somesite.com"]

We added an extra sleep just in case - more testing needed there.

Argo wait container: This is similar to what nfickas mentioned with ddseapy's solution. After we changed the workflow command we still faced issues. Specifically, in the workflow ui we'd see a log about "Failed to establish pod watch: timed out waiting for the condition". The important log for this was the wait log:

kubectl logs <container> wait

This showed the corresponding error entry. The issue here was also a race condition. Because all of our traffic was going through istio, the wait container needed istio to be up. Again, an inspection of the istio log to compare timestamps was the right next step:

kubectl logs <container> istio-proxy | grep "Envoy proxy is ready"

This grep for the ready state of the proxy showed a timestamp that was seconds after the argo wait container tried to make its call.

Argo wait container fix: The fix for this was to exclude the k8s api ip range from istio interception by adding to our annotations:

metadata:
annotations:
sidecar.istio.io/inject: "true"
traffic.sidecar.istio.io/excludeOutboundIPRanges: k8apiip/32
This ensured that the communication did not need istio to be up yet.

Minio issue: The last race condition we faced was on the write of logs to Minio (this is a test environment). This would occasionally fail, and kubectl logs <container> wait showed 503 errors when trying to connect to minio. For this we needed to increase the logging level by modifying the command:

args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl -s -XPOST http://localhost:15000/logging?level=debug; curl https://somesite.com"]

A further inspection of the istio log (kubectl logs <container> istio-proxy) showed the following.
This issue is tied to istio not updating certificates. This appears to be an istio/envoy issue related to certificate handling. Some example issues are here and here.

Minio fix: A temporary fix for this is to disable mtls for minio with a PeerAuthentication and DestinationRule, per the istio docs.

Minio tls exclusion:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: minio-mtls-disable
namespace: test
spec:
selector:
matchLabels:
app: minio
mtls:
mode: UNSET
portLevelMtls:
9000:
mode: DISABLE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: minio
namespace: test
spec:
host: minio
trafficPolicy:
portLevelSettings:
- port:
number: 9000
tls:
        mode: DISABLE

Full example workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
name: race-workflow
spec:
entrypoint: test
templates:
- name: test
steps:
- - name: run-it
template: curler
- name: curler
activeDeadlineSeconds: 60
retryStrategy:
limit: 2
retryPolicy: "Always" #OnFailure or OnError also are options
script:
image: python3.8-slim:latest
command: [sh, -c]
args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl -s -XPOST http://localhost:15000/logging?level=debug; curl https://somesite.com"]
imagePullPolicy: IfNotPresent
metadata:
annotations:
sidecar.istio.io/inject: "true"
traffic.sidecar.istio.io/excludeOutboundIPRanges: k8apiip/32
While istio 1.8 does include the holdApplicationUntilProxyStarts option, I do wonder whether it would solve case 2 with the wait container or not. It does seem like something to force argo to wait until the proxy is up would be generally useful. Hope this helps someone. |
Workflows v3 should just work with ISTIO. Please let me know if that is not the case. |
May I confirm whether the above statement is still up to date, or is it reflected in the documentation? |
We’ve been speaking to ISTIO, because they have similar issues with other tools. Previously, our recommendation was to disable ISTIO. We have since heard from users that the PNS executor works with ISTIO. |
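For anyone trying that route, a rough sketch of selecting the PNS executor in the workflow-controller configmap (key name as used in the v2.x/early-v3 configuration; namespace and configmap name may differ in your install):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  containerRuntimeExecutor: pns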
We should fix #7057. |
Aside: it is very hard to cater for injected sidecars when container ordering is important to both you and the injected sidecar. I suspect that we'll always have problems working with injected sidecars like ISTIO and Vault until we have a general way to kill them via API. #7057 is just a patch for the problem, and we may need more than one of these. |
I'll add that when my last company started using the emissary executor, everything worked very well with istio. The only thing was having to find a way to kill the sidecars after the workflow ran, but I think we solved that by reordering the pod's startup somehow. |
@nfickas can you share more? Do you have details we could share with other users? |
I can't remember the entire implementation details and no longer have access to those repos. But the gist of it comes down to using istio's holdApplicationUntilProxyStarts configuration. This will cause the workflows execution to wait until the sidecars are able to serve traffic. If I'm remembering this correctly just adding that annotation magically fixed our issues. We had originally followed something similar to what is described in this article with limited success: https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74 |
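A minimal sketch of applying that behaviour to every pod of a workflow via spec.podMetadata, using the per-pod proxy config annotation shown earlier in this thread (treat the exact placement as an assumption to verify; the workflow content is hypothetical):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hold-proxy-
spec:
  entrypoint: main
  podMetadata:
    annotations:
      proxy.istio.io/config: |
        holdApplicationUntilProxyStarts: true
  templates:
  - name: main
    container:
      image: curlimages/curl:latest
      command: [sh, -c]
      args: ["curl -sS https://example.com"]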
Hello. First, thanks for creating argo. I think it's a great tool. I'm also struggling to use argo and istio. Using istio for me is also a company policy. IMO killing the sidecar is easy, but it's almost impossible to get init containers working if they need network access. I would like to be able to use artifacts. My best thought with the current functionality is to have steps upload artifacts as normal, but when input artifacts are required, bypass argo completely and get the files straight from S3 in the main container. Moderate PITA, but might work. Can you recommend anything better? Longer term I think it would be great if the artifacts could be downloaded by the wait container itself. It could expose some http server and reply 200 when that is complete. The main container could wait on that signal. Just a thought. |
I just wanted to say that the new Istio Ambient Mesh (GA in Istio 1.18) seems to solve the issues with Istio and Argo Workflows since there is no longer a sidecar. |
Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST
What happened: Argo attempted to store the docker logs to s3 after killing the istio sidecar. This means that it was impossible to connect to s3, since all traffic is iptables-redirected to the (now killed) istio proxy.
What you expected to happen: The "wait" container leaves the sidecars (or perhaps designated ones) alive until artifact (logs, etc.) storage is complete.
How to reproduce it (as minimally and precisely as possible):
1/ Configure a cluster with istio
2/ Configure a workflow with a template that has an istio sidecar. We did this manually rather than injecting since argo doesn't recognize the injected container as a sidecar that needs killing.
3/ Specify s3 configuration and archiveLogs in the argo configuration (see the sketch after this list)
4/ Run the workflow
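For context, a rough sketch of the s3 + archiveLogs configuration referenced in step 3 (bucket, endpoint, and secret names are placeholders, not the reporter's actual values):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  config: |
    artifactRepository:
      archiveLogs: true
      s3:
        bucket: my-artifacts            # placeholder bucket
        endpoint: s3.amazonaws.com
        accessKeySecret:
          name: s3-credentials          # placeholder secret
          key: accessKey
        secretKeySecret:
          name: s3-credentials
          key: secretKey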