fix #4910 / #4923 addressing inconsistent behavior with exec streams #4959

shawkins · 2023-03-09T15:50:36Z

Description

Only 1 real issue was fully tracked down with #4910 / #4923: the vertx closeHandler is notified immediately. That needs to be changed to the endHandler, which is notified in order. Reviewing the vertx logic also highlights that we'll have a problem if the server ever sends a ping, but that should not currently be the case. The only "ping" sent currently is the 0 byte message.
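To illustrate the ordering difference, here is a toy model of a demand-driven stream. It is not the Vert.x API: `ToyStream` and all of its methods are hypothetical names invented for this sketch. It only assumes the behavior described above, namely that a close notification fires as soon as the close arrives while an end notification is delivered after the frames that are still buffered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class HandlerOrderSketch {

    /** Toy stand-in for a websocket read stream; NOT the Vert.x API. */
    static final class ToyStream {
        private final Queue<String> buffered = new ArrayDeque<>();
        private final List<String> events = new ArrayList<>();
        private boolean closed;
        private boolean endNotified;

        /** A frame arrives and is buffered until the consumer asks for it. */
        void receive(String frame) {
            buffered.add(frame);
        }

        /** The remote side closes: "close" is recorded right away. */
        void remoteClose() {
            closed = true;
            events.add("close");
            notifyEndIfDrained();
        }

        /** The consumer requests one more frame. */
        void fetch() {
            String frame = buffered.poll();
            if (frame != null) {
                events.add("frame:" + frame);
            }
            notifyEndIfDrained();
        }

        /** "end" is only recorded once every buffered frame was delivered. */
        private void notifyEndIfDrained() {
            if (closed && buffered.isEmpty() && !endNotified) {
                endNotified = true;
                events.add("end");
            }
        }

        List<String> events() {
            return events;
        }
    }

    /** Two frames are buffered, the remote closes, then the consumer drains. */
    static List<String> run() {
        ToyStream stream = new ToyStream();
        stream.receive("a");
        stream.receive("b");
        stream.remoteClose();
        stream.fetch();
        stream.fetch();
        return stream.events();
    }
}
```

In this model `HandlerOrderSketch.run()` records `close` before either frame and `end` after both. Completing the exec stream from the close notification can therefore drop data, while completing it from the end notification cannot.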

For the common failures some additional debugging has been added to the failure message to narrow in on where the problem lies.

I had thought there was a concurrency issue with PodOperationsImpl.readTo, but that shouldn't be the case - the stream was still being closed in the calling method's try.

Type of change

Bug fix (non-breaking change which fixes an issue)
Feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Chore (non-breaking change which doesn't affect codebase; test, version modification, documentation, etc.)

Checklist

Code contributed by me aligns with current project license: Apache 2.0
I Added CHANGELOG entry regarding this change
I have implemented unit tests to cover my changes
I have added/updated the javadocs and other documentation accordingly
No new bugs, code smells, etc. in SonarCloud report
I tested my code in Kubernetes
I tested my code in OpenShift

shawkins · 2023-03-09T17:02:16Z

https://github.com/fabric8io/kubernetes-client/actions/runs/4376515845/jobs/7659072908#step:6:450 confirms a fresh instance where some of the stdIn upload is dropped. Options here include:

adding a time delay between when the last data is sent and the close
attempting to verify the upload inline with the operation or afterwards. File upload and directory upload have divergent logic that makes this difficult. We may want to consolidate to just tar and a two step process: send the tar.gz, then in a separate exec check the length or checksum, then extract.

https://github.com/fabric8io/kubernetes-client/actions/runs/4268450566/jobs/7430955144#step:5:448 involves no upload, so it could be an issue with the stdOut download, but I could never quite confirm that locally. Initially it seemed like an issue with the stream close, but that turned out not to be the case. I'd like to get another reproduction of that locally or via the github runs before drawing more conclusions about how to proceed.

shawkins · 2023-03-09T20:38:41Z

@manusa the first commit addresses the vertx close problem and some misc. cleanups. The second commit was a speculative change: basically throwing a few more messages at the server, plus a little wait, to help ensure that what was already sent is processed. This was done as a first attempt because I couldn't think of a good way to do an inline or post check that wouldn't require even more code. However it also does not appear to be resilient, at least for github execution, so it was removed.

manusa · 2023-03-10T06:19:21Z

So from the final code changes I see we're addressing 3 things here:

Vert.x ping-pong handling: request more frames in case a pong is received (added a handler instead of disabling client-sent pings). The change / comment are just being defensive; it does not appear that we'll currently send a client side ping, nor will the server send one.
Vert.x replacing closeHandler with endHandler which waits for the previous frames to be processed as opposed to closing the stream immediately
Pod (Download) Operations: moved the exitCode blocking retrieval to the try-with-resources block. OutputStream is clearly closed after the server process has completed. This is just a minor cleanup and should not address the problems seen with copyFile.
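The third point can be sketched with plain JDK types. This is not the actual PodOperationsImpl code: `TrackingStream`, `download` and the `exitCode` future are hypothetical stand-ins, and the only thing the sketch shows is that blocking on the exit code inside the try-with-resources block makes the stream close after the remote process has completed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

class ExitCodeInsideTrySketch {

    /** Hypothetical stream that records when close() is called. */
    static final class TrackingStream extends ByteArrayOutputStream {
        private final StringBuilder log;

        TrackingStream(StringBuilder log) {
            this.log = log;
        }

        @Override
        public void close() throws IOException {
            log.append("closed;");
            super.close();
        }
    }

    /**
     * Writes some data, waits for the (hypothetical) remote exit code while
     * the stream is still open, and returns the order of events as a string.
     */
    static String download(CompletableFuture<Integer> exitCode) {
        StringBuilder log = new StringBuilder();
        try (OutputStream out = new TrackingStream(log)) {
            out.write("data".getBytes(StandardCharsets.UTF_8));
            // Blocking here keeps the stream open until the process is done.
            // Real code would bound this wait with a timeout.
            int code = exitCode.join();
            log.append("exit=").append(code).append(';');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log.toString();
    }
}
```

With an already completed future, `download(CompletableFuture.completedFuture(0))` logs the exit code first and the close second. If the `join()` were moved below the try block, the order would be reversed.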

There still remain a few issues to solve:

Delaying the close operation server-side for tar extract operation (might complete before the buffers are flushed) -Upload-
Upload/Download checksums

I'll keep #4910 and #4923 open until we create follow-up issues.

please feel free to edit this comment to make it more accurate

manusa · 2023-03-10T06:20:50Z

I also think that we might want to add a little complexity to the upload operation (maybe also consolidate into a single implementation regardless of file or directory -like you proposed-). Perform 2 commands, one for the upload, one for the checksum. The upload command is only completed (local close processing is delayed) after the second command is successful. The checksum command is executed after the upload completes (websocket remains open) and is retried n times until success. I understand that this should cover the issue, but maybe I'm misinterpreting something.
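A minimal sketch of the verification half of that idea, using only the JDK. Everything here is an assumption made for illustration: SHA-256 as the checksum, the `remoteChecksum` supplier standing in for the second exec command (for example something like `sha256sum` run in the pod), and the class and method names. A real implementation would also wait between attempts.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Supplier;

class UploadChecksumSketch {

    /** Hex encoded SHA-256 of the bytes that were uploaded. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Asks the (hypothetical) remote command for its checksum up to
     * maxAttempts times and reports whether it ever matched the local one.
     * The upload would only be treated as complete when this returns true.
     */
    static boolean verifyWithRetry(byte[] uploaded, Supplier<String> remoteChecksum, int maxAttempts) {
        String expected = sha256Hex(uploaded);
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (expected.equals(remoteChecksum.get())) {
                return true;
            }
        }
        return false;
    }
}
```

The retry loop covers the case where the first check runs before the server has flushed everything it received. A mismatch that persists for all attempts indicates that part of the upload really was dropped.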