Workflows with istio (or other service mesh) #1282

Closed
dicarlo2 opened this issue Mar 25, 2019 · 36 comments
Labels: area/sidecar-injection (Executor & Controller handling of injected sidecars like Istio), type/feature (Feature request)

Comments

@dicarlo2

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

What happened: Argo attempted to store the Docker logs to S3 after killing the istio sidecar. This meant it was impossible to connect to S3, since all outbound traffic is iptables-redirected to the (now killed) istio proxy.

What you expected to happen: The "wait" container leaves the sidecars (or perhaps designated ones) alive until artifact (logs, etc.) storage is complete.

How to reproduce it (as minimally and precisely as possible):

1/ Configure a cluster with istio
2/ Configure a workflow with a template that has an istio sidecar. We did this manually rather than injecting since argo doesn't recognize the injected container as a sidecar that needs killing.
3/ Specify s3 configuration and archiveLogs in the argo configuration (see the sketch after these steps)
4/ Run the workflow
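
For step 3, a minimal sketch of such a controller configuration (the bucket, endpoint, and secret names are placeholders, assuming the workflow-controller-configmap format):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  config: |
    artifactRepository:
      archiveLogs: true                # upload main container logs as an artifact
      s3:
        bucket: my-bucket              # placeholder bucket name
        endpoint: s3.amazonaws.com
        accessKeySecret:
          name: my-s3-credentials      # placeholder secret name
          key: accessKey
        secretKeySecret:
          name: my-s3-credentials      # placeholder secret name
          key: secretKey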

ddseapy (Contributor) commented Oct 1, 2019

@dicarlo2 Were you ever able to figure out a way to get an istio-injected workflow working with artifact support?

ddseapy (Contributor) commented Oct 1, 2019

Maybe configuring istio to avoid intercepting s3 traffic with traffic.sidecar.istio.io/excludeOutboundIPRanges?
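
For example (a sketch only; the CIDR below is a placeholder for the S3 endpoint range), the annotation would go on the workflow template's pod metadata:

metadata:
  annotations:
    # placeholder range covering the S3 endpoint; use your region's actual ranges
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "52.216.0.0/15"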

ddseapy (Contributor) commented Oct 3, 2019

As you suggested, delaying killing sidecars in waitContainer() works. We also had to work around istio/istio#9454, but that didn't require changes to argo or istio code.

Unsure if this kind of change is something argo maintainers want as a PR, or if there are downsides to doing this.

// Modified waitContainer(): sidecars are killed only after logs, output
// parameters, and artifacts have been saved, so the istio proxy is still
// available for the S3 upload.
func waitContainer() error {
	wfExecutor := initExecutor()
	defer wfExecutor.HandleError()
	defer stats.LogStats()
	stats.StartStatsTicker(5 * time.Minute)

	// Wait for main container to complete
	err := wfExecutor.Wait()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars & save outputs
	}
	// Saving logs
	logArt, err := wfExecutor.SaveLogs()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Saving output parameters
	err = wfExecutor.SaveParameters()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Saving output artifacts
	err = wfExecutor.SaveArtifacts()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Killing sidecar containers
	err = wfExecutor.KillSidecars()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	// Capture output script result
	err = wfExecutor.CaptureScriptResult()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	err = wfExecutor.AnnotateOutputs(logArt)
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	return nil
}

@sarabala1979 (Member)

Please go ahead and create a PR with the changes; we will review and merge. Thanks for contributing!

nfickas commented Jan 8, 2020

@sarabala1979 Is there any way to get an Argo Workflow to recognize an automatically injected istio sidecar?

ddseapy (Contributor) commented Jan 8, 2020

@nfickas is there an issue you are running into with istio sidecars, or what do you mean by "recognize"? We have been using 2.4.2 with istio sidecars automatically injected without any issues.

ddseapy (Contributor) commented Jan 8, 2020

For us this issue was resolved here: #1645

nfickas commented Jan 9, 2020

@ddseapy our workflows start up with the automatic sidecars but then the sidecars are stuck in a running state after the workflow completes.

nfickas commented Jan 9, 2020

I am seeing some odd behavior on the wait container; it looks like it errored out: Failed to establish pod watch: Get https://10.100.200.1:443/api/v1/namespaces/default/pods?fieldSelector=metadata.name%3Dhello-world-82bm7&watch=true: dial tcp 10.100.200.1:443: connect: connection refused

ddseapy (Contributor) commented Jan 9, 2020

That's the istio issue I linked above. The underlying issue is: istio/istio#11130

Really it's because the k8s sidecar support PR (kubernetes/enhancements#753) hasn't gotten through, so there is no container startup ordering, which causes a race condition.

You can exclude the k8s api server from the IP range that envoy intercepts, or do things like change the docker command. It could also be fixed with code changes in argo, by retrying the request on failure.

nfickas commented Jan 9, 2020

I would rather not make any changes to argo code. Of the other two options, which one would be easier, and do you have any examples of either being done?

Just curious, did you run into this issue when getting argo working with injected sidecars?

jaygorrell commented Jan 9, 2020

The easiest thing to do is to just annotate your Argo workflows so that the auto-injector doesn't add a sidecar to them. Better support is coming in k8s (maybe in 1.18), but it's just not there yet.

Jobs in k8s are the same way... there isn't really anything special about Argo there.

Edit: This would be spec.templates[].metadata.annotations: sidecar.istio.io/inject: "false"
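
For example, a sketch of where that lands in a Workflow (the template name is arbitrary):

spec:
  templates:
  - name: main                             # arbitrary template name
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"   # skip istio injection for this template's pod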

ddseapy (Contributor) commented Jan 9, 2020

If you don't need the sidecar, that is indeed easiest. We need to control traffic for our applications, so that is not an option for us. For the workflows we run, it was sufficient to exclude the k8s master IP range via traffic.sidecar.istio.io/excludeOutboundIPRanges.

nfickas commented Jan 9, 2020

We need sidecars as well, I will try that out. Thanks!

@jaygorrell

That's a good point, too. But just to be clear, I believe that is going to make the pod stay up when the workflow is finished - correct? You may need a separate system to clean those up. I'm going off memory and could be mistaken.

ddseapy (Contributor) commented Jan 9, 2020

To state the obvious, requests to the k8s api server will then bypass istio. This was not a security concern for our use-cases, but it could be for someone else.

ddseapy (Contributor) commented Jan 9, 2020

@jaygorrell that's why I put up the PR. We have been using 2.4.2 for quite a while now without issues, and the pod goes to the Completed state.

ddseapy (Contributor) commented Feb 16, 2020

For visibility - Prasanna Kumar Krishnamurthy in the argo workflows slack suggested patching the pod spec (https://github.com/argoproj/argo/blob/master/examples/pod-spec-patch.yaml) with something like

"containers":[{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 20; argoexec wait"]}]

nfickas commented Feb 28, 2020

@ddseapy So the above suggestion will just make it clean up the istio-proxy container after 20 seconds. I think the real issue is that the wait and main containers are launching before the proxy starts up.

ddseapy (Contributor) commented Feb 28, 2020

My understanding is the suggestion should delay the wait container running by editing its command. I would think that it would delay cleanup (as you said), but also prevent the startup race condition.

If the main container has the same race condition (depends on your main container), you would need to wait there too.

I haven't tried the approach myself yet though, so I may be misunderstanding things.

nfickas commented Feb 28, 2020

Hm, yeah, I am not seeing that happen; I am still seeing the main and wait containers start up before istio-proxy. For reference, this is what I did: podSpecPatch: '{"containers": [{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec wait"]}, {"name":"main", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec main"]}]}'

@stale

This comment was marked as resolved.

@stale stale bot added the wontfix label Jul 1, 2020
Bobgy (Contributor) commented Jul 1, 2020

I'd still want this open. This is valuable for kubeflow pipelines multi-tenant support, because the api server needs to be protected by istio. Having istio sidecars in workflows allows in-cluster api calls from workflows.

alexec added the type/feature and help wanted labels and removed the wontfix label Jul 1, 2020
shadiramadan commented Aug 25, 2020

The startup problem should be resolved by a workaround introduced in Istio 1.7.0:
https://istio.io/latest/news/releases/1.7.x/announcing-1.7/#production-operability-improvements
istio/istio#24737

If you are using the IstioOperator:

spec:
  components:
    proxy:
      holdApplicationUntilProxyStarts: true
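
If you cannot change the mesh-wide configuration, a per-pod sketch (assuming Istio 1.7+ and the proxy.istio.io/config annotation) on the workflow template would look like:

metadata:
  annotations:
    # block application containers until the envoy proxy reports ready
    proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'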

@BlackRider97

Has anyone found a workaround for Linkerd2?

arnoldrw commented Jan 21, 2021

Recently we ran into not 1, but 3 separate race conditions while working with argo in a test environment with istio enabled.

I am posting this in hopes that it helps others, as certain aspects of this are covered in this issue, but only through replies that can be hard to follow. In addition, I believe case 3 may be new to this issue, and case 1 has a potential change to previously posted solutions.

Relevant details of our istio environment:
- we have all traffic in/out through istio to control what can be hit
- we have mtls enabled
- not on 1.8 yet, so we can't use the per-pod holdApplicationUntilProxyStarts istio annotation

We defined a simple workflow that curled a host that had been whitelisted by istio. The workflow always contained this annotation:

metadata:
  annotations:
    sidecar.istio.io/inject: "true"

We were seeing race conditions in 3 areas

  1. The actual workflow job. If the istio proxy was not yet up, the curl would fail
  2. The argo wait container. This caused a cascading failure
  3. Writes to minio. These were seemingly randomly failing

Workflow job:

This one was the easiest. somesite.com had been whitelisted for egress, and our initial workflow had a command and args like this:

command: [sh, -c]
args: ["date; curl -v somesite.com"]

This resulted in an error like this in the main logs

<timestamp>
curl: (7) Failed to connect to somesite.com port 443: Connection refused

Next I grabbed the proxy logs for the container

kubectl logs <container> istio-proxy

From those, I could see that the envoy proxy was not up at the timestamp of the curl call. A log entry is made that states "Envoy proxy is ready" once it is up, so you can just grep for that.

Workflow job fix

The fix for the workflow job was to curl to check if the proxy was up, but with a modification to what has previously been posted. Other posts curl localhost:15000. What we found when doing that was that 15000 is the admin port, and a response from it did not mean the proxy was up. Instead we check the health status port:

args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl https://somesite.com"]

We added an extra sleep just in case - more testing needed there.

Argo wait container

This is similar to what nfickas mentioned with ddseapy's solution.

After we changed the workflow command we still faced issues. Specifically in the workflow ui we'd see a log about "Failed to establish pod watch: timed out waiting for the condition".

The important log for this was the wait log

kubectl logs <container> wait

This showed a log entry like this

time="2021-01-18T19:38:17.0651" level=error msg="executor error: Failed to establish pod watch: timed out waiting for the condition\ngithub.com/argoproj/argo/errors..........

The issue in this was also a race condition. Because all of our traffic was going through istio, the wait container needed istio to be up. Again, an inspection of the istio log to compare timestamps was the right next step

kubectl logs <container> istio-proxy | grep "Envoy proxy is ready"

This grep for the ready state of the proxy showed a timestamp that was seconds after the argo wait container tried to make its call.

Argo wait container fix

The fix for this was to exclude the k8s api IP range from being intercepted by istio by adding to our annotations:

metadata:
  annotations:
    sidecar.istio.io/inject: "true"
    traffic.sidecar.istio.io/excludeOutboundIPRanges: k8apiip/32

This ensured that the communication did not need istio to be up yet.

Minio issue

The last race condition we faced was on the write of logs to Minio (this is a test environment). This would occasionally fail and

kubectl logs <container> wait 

showed 503 errors when trying to connect to minio.

For this we needed to increase the logging level by modifying the command:

args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl -s -XPOST http://localhost:15000/logging?level=debug; curl https://somesite.com"]

A further inspection of the istio log

kubectl logs <container> istio-proxy

Showed the following

2021-01-18T19:51:50.461773Z	debug	envoy connection	[external/envoy/source/common/network/connection_impl.cc:727] [C17] connecting to <ip>:9000
2021-01-18T19:51:50.461858Z	debug	envoy connection	[external/envoy/source/common/network/connection_impl.cc:736] [C17] connection in progress
2021-01-18T19:51:50.462871Z	debug	envoy connection	[external/envoy/source/common/network/connection_impl.cc:592] [C17] connected
2021-01-18T19:51:50.462975Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.466813Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.466837Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.467951Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:198] [C17] handshake error: 1
2021-01-18T19:51:50.467977Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:226] [C17] TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED

This issue is tied to istio not updating certificates; it appears to be an istio/envoy issue related to certificate handling. Some example issues are here and here.

Minio fix

A temporary fix for this is to disable mtls for minio with PeerAuthentication and DestinationRule per the istio docs

Minio tls exclusion:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: minio-mtls-disable
  namespace: test
spec:
  selector:
    matchLabels:
      app: minio
  mtls:
    mode: UNSET
  portLevelMtls:
    9000:
      mode: DISABLE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: minio
  namespace: test
spec:
  host: minio
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 9000
      tls:
        mode: DISABLE

Full example workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: race-workflow
spec:
  entrypoint: test
  templates:
  - name: test
    steps:
    - - name: run-it
        template: curler
  - name: curler
    activeDeadlineSeconds: 60
    retryStrategy: 
      limit: 2
      retryPolicy: "Always" #OnFailure or OnError also are options
    script:
      image: python3.8-slim:latest
      command: [sh, -c]
      args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl -s -XPOST http://localhost:15000/logging?level=debug; curl https://somesite.com"]
      imagePullPolicy: IfNotPresent
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/excludeOutboundIPRanges: k8apiip/32

While istio 1.8 does include the holdApplicationUntilProxyStarts annotation, I do wonder if this would solve case 2 with the wait container or not. It does seem like something to force argo to wait until the proxy is up would be generally useful.

Hope this helps someone.

alexec (Contributor) commented Apr 16, 2021

Workflows v3 should just work with ISTIO. Please let me know if that is not the case.

@alexec alexec closed this as completed Apr 16, 2021
Bobgy (Contributor) commented Jul 23, 2021

May I confirm whether the above statement or the documentation is up to date?
https://argoproj.github.io/argo-workflows/sidecar-injection/

alexec (Contributor) commented Nov 7, 2021

It should just work, but it doesn’t.

We’ve been speaking to ISTIO, because they have similar issues with other tools. Previously, our recommendation was to disable ISTIO.

We have since heard from users that the PNS executor works with ISTIO.
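
For reference, a sketch of selecting the PNS executor (assuming Argo Workflows v3.0-v3.2, where the executor is chosen in the controller ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  containerRuntimeExecutor: pns   # process namespace sharing executor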

alexec (Contributor) commented Nov 7, 2021

We should fix #7057.

alexec (Contributor) commented Nov 7, 2021

Aside: it is very hard to cater for injected sidecars when container ordering is important to both you and the injected sidecar. I suspect that we'll always have problems working with injected sidecars like ISTIO and Vault until we have a general way to kill them via an API. #7057 is just a patch to the problem, and we may need more than one of these.

nfickas commented Nov 7, 2021

I'll add that when my last company started using the emissary executor, everything worked very well with istio. The only thing was having to find a way to kill the sidecars after the workflow ran, but I think we solved that by reordering the pod's startup somehow.

alexec (Contributor) commented Nov 7, 2021

@nfickas can you share more? Do you have details we could share with other users?

nfickas commented Nov 7, 2021

I can't remember all the implementation details and no longer have access to those repos. But the gist of it comes down to using istio's holdApplicationUntilProxyStarts configuration.

This will cause the workflows execution to wait until the sidecars are able to serve traffic.

If I'm remembering this correctly, just adding that annotation magically fixed our issues.

We had originally followed something similar to what is described in this article with limited success: https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74

@mattharrigan

Hello. First, thanks for creating argo. I think it's a great tool. I'm also struggling to use argo and istio. Using istio for me is also a company policy. IMO killing the sidecar is easy, but it's almost impossible to get init containers working if they need network access. I would like to be able to use artifacts. My best thought with the current functionality is to have steps upload artifacts as normal, but when input artifacts are required, bypass argo completely and get the files straight from S3 in the main container. Moderate PITA, but might work. Can you recommend anything better?

Longer term I think it would be great if the artifacts could be downloaded by the wait container itself. It could expose some http server and reply 200 when that is complete. The main container could wait on that signal. Just a thought.

@RobCannon

I just wanted to say that the new Istio Ambient Mesh (GA in Istio 1.18) seems to solve the issues with Istio and Argo Workflows since there is no longer a sidecar.

agilgur5 added the area/sidecar-injection label Sep 10, 2024