Workflows with istio (or other service mesh) #1282

Closed
dicarlo2 opened this issue Mar 25, 2019 · 36 comments
Labels: area/sidecar-injection (Executor & Controller handling of injected sidecars like Istio), type/feature (Feature request)

Comments

@dicarlo2

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

What happened: Argo attempted to store the Docker logs to S3 after killing the istio sidecar. This meant it was impossible to connect to S3, since all outbound traffic is iptables-redirected to the (now killed) istio proxy.

What you expected to happen: The "wait" container leaves the sidecars (or perhaps designated ones) alive until artifact (logs, etc.) storage is complete.

How to reproduce it (as minimally and precisely as possible):

1/ Configure a cluster with istio
2/ Configure a workflow with a template that has an istio sidecar. We did this manually rather than injecting since argo doesn't recognize the injected container as a sidecar that needs killing.
3/ Specify s3 configuration and archiveLogs in the argo configuration (see the sketch after these steps)
4/ Run the workflow
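
For step 3, a minimal sketch of such a controller configuration (the bucket, endpoint, and secret names are placeholders, assuming the workflow-controller-configmap format):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  config: |
    artifactRepository:
      archiveLogs: true                # upload main container logs as an artifact
      s3:
        bucket: my-bucket              # placeholder bucket name
        endpoint: s3.amazonaws.com
        accessKeySecret:
          name: my-s3-credentials      # placeholder secret name
          key: accessKey
        secretKeySecret:
          name: my-s3-credentials      # placeholder secret name
          key: secretKey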

ddseapy (Contributor) commented Oct 1, 2019

@dicarlo2 Were you ever able to figure out a way to get an istio-injected workflow working with artifact support?

ddseapy (Contributor) commented Oct 1, 2019

Maybe configuring istio to avoid intercepting s3 traffic with traffic.sidecar.istio.io/excludeOutboundIPRanges?
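
For example (a sketch only; the CIDR below is a placeholder for the S3 endpoint range), the annotation would go on the workflow template's pod metadata:

metadata:
  annotations:
    # placeholder range covering the S3 endpoint; use your region's actual ranges
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "52.216.0.0/15"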

ddseapy (Contributor) commented Oct 3, 2019

As you suggested, delaying killing sidecars in waitContainer() works. We also had to work around istio/istio#9454, but that didn't require changes to argo or istio code.

Unsure if this kind of change is something argo maintainers want as a PR, or if there are downsides to doing this.

// Modified waitContainer(): sidecars are killed only after logs, output
// parameters, and artifacts have been saved, so the istio proxy is still
// available for the S3 upload.
func waitContainer() error {
	wfExecutor := initExecutor()
	defer wfExecutor.HandleError()
	defer stats.LogStats()
	stats.StartStatsTicker(5 * time.Minute)

	// Wait for main container to complete
	err := wfExecutor.Wait()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars & save outputs
	}
	// Saving logs
	logArt, err := wfExecutor.SaveLogs()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Saving output parameters
	err = wfExecutor.SaveParameters()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Saving output artifacts
	err = wfExecutor.SaveArtifacts()
	if err != nil {
		wfExecutor.AddError(err)
		// do not return here so we can still try to kill sidecars
	}
	// Killing sidecar containers
	err = wfExecutor.KillSidecars()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	// Capture output script result
	err = wfExecutor.CaptureScriptResult()
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	err = wfExecutor.AnnotateOutputs(logArt)
	if err != nil {
		wfExecutor.AddError(err)
		return err
	}
	return nil
}

@sarabala1979 (Member)

Please go ahead and create a PR with the changes; we will review and merge. Thanks for contributing!

nfickas commented Jan 8, 2020

@sarabala1979 Is there any way to get an Argo Workflow to recognize an automatically injected istio sidecar?

ddseapy (Contributor) commented Jan 8, 2020

@nfickas is there an issue you are running into with istio sidecars, or what do you mean by "recognize"? We have been using 2.4.2 with istio sidecars automatically injected without any issues.

ddseapy (Contributor) commented Jan 8, 2020

For us this issue was resolved here: #1645

nfickas commented Jan 9, 2020

@ddseapy our workflows start up with the automatic sidecars but then the sidecars are stuck in a running state after the workflow completes.

nfickas commented Jan 9, 2020

I am seeing some odd behavior on the wait container; it looks like it errored out: Failed to establish pod watch: Get https://10.100.200.1:443/api/v1/namespaces/default/pods?fieldSelector=metadata.name%3Dhello-world-82bm7&watch=true: dial tcp 10.100.200.1:443: connect: connection refused

ddseapy (Contributor) commented Jan 9, 2020

That's the istio issue I linked above. The underlying issue is: istio/istio#11130

Really it's because the k8s sidecar support PR (kubernetes/enhancements#753) hasn't gotten through, so there is no container startup ordering, which causes a race condition.

You can exclude the k8s api server from the IP range that envoy intercepts, or do things like change the docker command. It could also be fixed with code changes in argo, by retrying the request on failure.

nfickas commented Jan 9, 2020

I would rather not make any changes to argo code. Of the other two options, which one would be easier, and do you have any examples of either being done?

Just curious, did you run into this issue when getting argo working with injected sidecars?

jaygorrell commented Jan 9, 2020

The easiest thing to do is to just annotate your Argo workflows so that the auto-injector doesn't add a sidecar to them. Better support is coming in k8s (maybe in 1.18), but it's just not there yet.

Jobs in k8s are the same way... there isn't really anything special about Argo there.

Edit: This would be spec.templates[].metadata.annotations: sidecar.istio.io/inject: "false"
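
For example, a sketch of where that lands in a Workflow (the template name is arbitrary):

spec:
  templates:
  - name: main                             # arbitrary template name
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"   # skip istio injection for this template's pod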

ddseapy (Contributor) commented Jan 9, 2020

If you don't need the sidecar, that is indeed easiest. We need to control traffic for our applications, so that is not an option for us. For the workflows we run, it was sufficient to exclude the k8s master IP range via traffic.sidecar.istio.io/excludeOutboundIPRanges.

nfickas commented Jan 9, 2020

We need sidecars as well, I will try that out. Thanks!

@jaygorrell

That's a good point, too. But just to be clear, I believe that is going to make the pod stay up when the workflow is finished - correct? You may need a separate system to clean those up. I'm going off memory and could be mistaken.

ddseapy (Contributor) commented Jan 9, 2020

To state the obvious, requests to the k8s api server will then bypass istio. This was not a security concern for our use-cases, but it could be for someone else.

ddseapy (Contributor) commented Jan 9, 2020

@jaygorrell that's why I put up the PR. We have been using 2.4.2 for quite a while now without issues, and the pod goes to the Completed state.

ddseapy (Contributor) commented Feb 16, 2020

For visibility - Prasanna Kumar Krishnamurthy in the argo workflows slack suggested patching the pod spec (https://github.com/argoproj/argo/blob/master/examples/pod-spec-patch.yaml) with something like

"containers":[{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 20; argoexec wait"]}]

nfickas commented Feb 28, 2020

@ddseapy So the above suggestion will just make it clean up the istio-proxy container after 20 seconds. I think the real issue is that the wait and main containers are launching before the proxy starts up.

ddseapy (Contributor) commented Feb 28, 2020

My understanding is the suggestion should delay the wait container running by editing its command. I would think that it would delay cleanup (as you said), but also prevent the startup race condition.

If the main container has the same race condition (depends on your main container), you would need to wait there too.

I haven't tried the approach myself yet though, so I may be misunderstanding things.

nfickas commented Feb 28, 2020

Hm, yeah, I am not seeing that happen; I am still seeing the main and wait containers start up before istio-proxy. For reference, this is what I did: podSpecPatch: '{"containers": [{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec wait"]}, {"name":"main", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec main"]}]}'

@stale

This comment was marked as resolved.

@stale stale bot added the wontfix label Jul 1, 2020
Bobgy (Contributor) commented Jul 1, 2020

I'd still want this open. This is valuable for kubeflow pipelines multi-tenant support, because the api server needs to be protected by istio. Having istio sidecars in workflows allows in-cluster api calls from workflows.

alexec added the type/feature and help wanted labels and removed the wontfix label Jul 1, 2020
shadiramadan commented Aug 25, 2020

The startup problem should be resolved by a workaround introduced in Istio 1.7.0:
https://istio.io/latest/news/releases/1.7.x/announcing-1.7/#production-operability-improvements
istio/istio#24737

If you are using the IstioOperator:

spec:
  components:
    proxy:
      holdApplicationUntilProxyStarts: true
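
If you cannot change the mesh-wide configuration, a per-pod sketch (assuming Istio 1.7+ and the proxy.istio.io/config annotation) on the workflow template would look like:

metadata:
  annotations:
    # block application containers until the envoy proxy reports ready
    proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'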

@BlackRider97

Has anyone found a workaround for Linkerd2?

arnoldrw commented Jan 21, 2021

Recently we ran into not 1, but 3 separate race conditions while working with argo in a test environment with istio enabled.

I am posting this in hopes that it helps others, as certain aspects of this are covered in this issue, but only through replies that can be hard to follow. In addition, I believe case 3 may be new to this issue, and case 1 has a potential change to previously posted solutions.

Relevant details of our istio environment:
- we have all traffic in/out through istio to control what can be hit
- we have mtls enabled
- not on 1.8 yet, so we can't use the per-pod holdApplicationUntilProxyStarts istio annotation

We defined a simple workflow that curled a host that had been whitelisted by istio. The workflow always contained this annotation:

metadata:
  annotations:
    sidecar.istio.io/inject: "true"

We were seeing race conditions in 3 areas

  1. The actual workflow job. If the istio proxy was not yet up, the curl would fail
  2. The argo wait container. This caused a cascading failure
  3. Writes to minio. These were seemingly randomly failing

Workflow job:

This one was the easiest. somesite.com had been whitelisted for egress, and our initial workflow had a command and args like this:

command: [sh, -c]
args: ["date; curl -v somesite.com"]

This resulted in an error like this in the main logs

<timestamp>
curl: (7) Failed to connect to somesite.com port 443: Connection refused

Next I grabbed the proxy logs for the container

kubectl logs <container> istio-proxy

From those, I could see that the envoy proxy was not up at the timestamp of the curl call. A log entry is made that states "Envoy proxy is ready" once it is up, so you can just grep for that.

Workflow job fix

The fix for the workflow job was to curl to check if the proxy was up, but with a modification to what has previously been posted. Other posts curl localhost:15000. What we found when doing that was that 15000 is the admin port, and a response from it did not mean the proxy was up. Instead we check the health status port:

args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl https://somesite.com"]

We added an extra sleep just in case - more testing needed there.

Argo wait container

This is similar to what nfickas mentioned with ddseapy's solution.

After we changed the workflow command we still faced issues. Specifically in the workflow ui we'd see a log about "Failed to establish pod watch: timed out waiting for the condition".

The important log for this was the wait log

kubectl logs <container> wait

This showed a log entry like this

time="2021-01-18T19:38:17.0651" level=error msg="executor error: Failed to establish pod watch: timed out waiting for the condition\ngithub.com/argoproj/argo/errors..........

The issue in this was also a race condition. Because all of our traffic was going through istio, the wait container needed istio to be up. Again, an inspection of the istio log to compare timestamps was the right next step

kubectl logs <container> istio-proxy | grep "Envoy proxy is ready"

This grep for the ready state of the proxy showed a timestamp that was seconds after the argo wait container tried to make its call.

Argo wait container fix

The fix for this was to exclude the k8s api IP range from being intercepted by istio by adding to our annotations:

metadata:
  annotations:
    sidecar.istio.io/inject: "true"
    traffic.sidecar.istio.io/excludeOutboundIPRanges: k8apiip/32

This ensured that the communication did not need istio to be up yet.

Minio issue

The last race condition we faced was on the write of logs to Minio (this is a test environment). This would occasionally fail and

kubectl logs <container> wait 

showed 503 errors when trying to connect to minio.

For this we needed to increase the logging level by modifying the command:

args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl -s -XPOST http://localhost:15000/logging?level=debug; curl https://somesite.com"]

A further inspection of the istio log

kubectl logs <container> istio-proxy

Showed the following

2021-01-18T19:51:50.461773Z	debug	envoy connection	[external/envoy/source/common/network/connection_impl.cc:727] [C17] connecting to <ip>:9000
2021-01-18T19:51:50.461858Z	debug	envoy connection	[external/envoy/source/common/network/connection_impl.cc:736] [C17] connection in progress
2021-01-18T19:51:50.462871Z	debug	envoy connection	[external/envoy/source/common/network/connection_impl.cc:592] [C17] connected
2021-01-18T19:51:50.462975Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.466813Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.466837Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.467951Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:198] [C17] handshake error: 1
2021-01-18T19:51:50.467977Z	debug	envoy connection	[external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:226] [C17] TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED

This issue is tied to istio not updating certificates; it appears to be an istio/envoy issue related to certificate handling. Some example issues are here and here.

Minio fix

A temporary fix for this is to disable mtls for minio with PeerAuthentication and DestinationRule per the istio docs

Minio tls exclusion:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: minio-mtls-disable
  namespace: test
spec:
  selector:
    matchLabels:
      app: minio
  mtls:
    mode: UNSET
  portLevelMtls:
    9000:
      mode: DISABLE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: minio
  namespace: test
spec:
  host: minio
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 9000
      tls:
        mode: DISABLE

Full example workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: race-workflow
spec:
  entrypoint: test
  templates:
  - name: test
    steps:
    - - name: run-it
        template: curler
  - name: curler
    activeDeadlineSeconds: 60
    retryStrategy: 
      limit: 2
      retryPolicy: "Always" #OnFailure or OnError also are options
    script:
      image: python3.8-slim:latest
      command: [sh, -c]
      args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl -s -XPOST http://localhost:15000/logging?level=debug; curl https://somesite.com"]
      imagePullPolicy: IfNotPresent
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/excludeOutboundIPRanges: k8apiip/32

While istio 1.8 does include the holdApplicationUntilProxyStarts annotation, I do wonder if this would solve case 2 with the wait container or not. It does seem like something to force argo to wait until the proxy is up would be generally useful.

Hope this helps someone.

alexec (Contributor) commented Apr 16, 2021

Workflows v3 should just work with ISTIO. Please let me know if that is not the case.

@alexec alexec closed this as completed Apr 16, 2021
Bobgy (Contributor) commented Jul 23, 2021

May I confirm whether the above statement or the documentation is up to date?
https://argoproj.github.io/argo-workflows/sidecar-injection/

alexec (Contributor) commented Nov 7, 2021

It should just work, but it doesn’t.

We’ve been speaking to ISTIO, because they have similar issues with other tools. Previously, our recommendation was to disable ISTIO.

We have since heard from users that the PNS executor works with ISTIO.
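
For reference, a sketch of selecting the PNS executor (assuming Argo Workflows v3.0-v3.2, where the executor is chosen in the controller ConfigMap):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  containerRuntimeExecutor: pns   # process namespace sharing executor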

alexec (Contributor) commented Nov 7, 2021

We should fix #7057.

alexec (Contributor) commented Nov 7, 2021

Aside: it is very hard to cater for injected sidecars when container ordering is important to both you and the injected sidecar. I suspect that we'll always have problems working with injected sidecars like ISTIO and Vault until we have a general way to kill them via an API. #7057 is just a patch to the problem, and we may need more than one of these.

nfickas commented Nov 7, 2021

I'll add that when my last company started using the emissary executor, everything worked very well with istio. The only thing was having to find a way to kill the sidecars after the workflow ran, but I think we solved that by reordering the pod's startup somehow.

alexec (Contributor) commented Nov 7, 2021

@nfickas can you share more? Do you have details we could share with other users?

nfickas commented Nov 7, 2021

I can't remember all the implementation details and no longer have access to those repos. But the gist of it comes down to using istio's holdApplicationUntilProxyStarts configuration.

This will cause the workflows execution to wait until the sidecars are able to serve traffic.

If I'm remembering this correctly, just adding that annotation magically fixed our issues.

We had originally followed something similar to what is described in this article with limited success: https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74

@mattharrigan

Hello. First, thanks for creating argo. I think it's a great tool. I'm also struggling to use argo and istio. Using istio for me is also a company policy. IMO killing the sidecar is easy, but it's almost impossible to get init containers working if they need network access. I would like to be able to use artifacts. My best thought with the current functionality is to have steps upload artifacts as normal, but when input artifacts are required, bypass argo completely and get the files straight from S3 in the main container. Moderate PITA, but might work. Can you recommend anything better?

Longer term I think it would be great if the artifacts could be downloaded by the wait container itself. It could expose some http server and reply 200 when that is complete. The main container could wait on that signal. Just a thought.

@RobCannon

I just wanted to say that the new Istio Ambient Mesh (GA in Istio 1.18) seems to solve the issues with Istio and Argo Workflows since there is no longer a sidecar.

agilgur5 added the area/sidecar-injection label Sep 10, 2024