Issues with upload / failures with PodIT #4910

Closed
shawkins opened this issue Feb 23, 2023 · 3 comments · Fixed by #4968

@shawkins
Contributor

With Kubernetes 1.26 we are still seeing errors with pod upload:

Error:  Failures: 
Error:    PodIT.uploadBinaryFile:318 array lengths differ, expected: <16385> but was: <8192>
Error:    PodIT.uploadFile:298 expected: <I'm uploaded> but was: <>

https://github.com/fabric8io/kubernetes-client/actions/runs/4252695615/jobs/7396899761#step:6:579

With additional trace logging, failed test runs showed the same data being sent to the server either way, as measured down at the level at which okhttp writes to the SSL socket.

There doesn't seem to be a good explanation for this, other than that we are more prone to seeing it on the containerd minikube runtime.

The issue does seem to match one reported upstream: kubernetes/kubernetes#112834

We may want to add a formal warning about this behavior so that users are aware of potential data loss.

@andreaTP
Member

andreaTP commented Mar 2, 2023

I keep hitting this issue on an unrelated PR:
https://github.com/fabric8io/kubernetes-client/actions/runs/4313568880/jobs/7526046821#step:6:500

Should we consider skipping those tests until we have a good solution?

@shawkins
Contributor Author

shawkins commented Mar 2, 2023

The upload issue is prevalent against 1.26 / containerd with all client types. It typically looks like an entire packet is being lost, but here's one where just a single byte is off: https://github.com/fabric8io/kubernetes-client/actions/runs/4310764866/jobs/7519523723#step:5:1160

If we disable those tests, we should definitely add a warning in the release notes that pod upload seems broken. The guess is that the api server is introducing message ordering issues in each direction, sending or receiving, which would also explain the download issue.

The download issue seems to be much rarer; I tried reproducing it again yesterday without success. I'd probably leave that test running for now.

@shawkins
Contributor Author

shawkins commented Mar 6, 2023

The copyDir failures with the e2e tests and vertx are in the same vein as the download issue, but seem more reproducible.

I modified the code to fully read the input stream before passing it off to the tar utility, so it could report both the length and an MD5 sum of the contents. Each successful run shows:
12800 Hwu1HVU3N2gLBk1b8cUA+g==
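
The instrumentation was roughly along these lines (the helper name is illustrative; this is not the actual PodOperationsImpl code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class StreamDebug {
  // Buffer the whole stream, print its length and base64 MD5 (matching the
  // "12800 Hwu1..." output above), then hand back a fresh stream for the
  // tar extraction to consume as before.
  static InputStream logLengthAndMd5(InputStream in) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int read;
    while ((read = in.read(chunk)) != -1) {
      buffer.write(chunk, 0, read);
    }
    byte[] data = buffer.toByteArray();
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      System.out.println(data.length + " " + Base64.getEncoder().encodeToString(md5.digest(data)));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
    return new ByteArrayInputStream(data);
  }
}
```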

An unsuccessful run looks like:

[main] DEBUG io.fabric8.kubernetes.client.dsl.internal.core.v1.PodOperationsImpl - using first container busybox in pod pod-standard
[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - exec message received 512 bytes on channel stdOut
[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - exec message received on channel stdErr: tar: removing leading '/' from member names

[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - exec message received 4096 bytes on channel stdOut
[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - Exec Web Socket: On Close with code:[1000], due to: [null]
[-1206678562-pool-1-thread-2] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - Exec Web Socket: completed with a null exit code - no status was received prior to onClose
4608 U/9K/Wr7WsvMAIEcua8Hqw==
[vert.x-eventloop-thread-1] DEBUG io.fabric8.kubernetes.client.dsl.internal.ExecWebSocketListener - exec message received 8192 bytes on channel stdOut

Note that the length / checksum is printed before the message that contains the remaining bytes, all of which occurs after receiving the close from the server.

The workaround for the download handling is that we should be able to expect an errorChannel / exit code message. Rather than immediately terminating in onClose, we can wait some amount of time for that message to appear. Local testing seemed to confirm that this works.
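
In sketch form the idea is something like the following (class and method names are illustrative, not the actual ExecWebSocketListener API, and the timeout value is arbitrary):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class DownloadCloseHandling {
  private final CompletableFuture<Integer> exitCode = new CompletableFuture<>();

  // Called when the errorChannel / exit code message arrives.
  void onExitCodeMessage(int code) {
    exitCode.complete(code);
  }

  // Instead of terminating the download immediately on close, give the
  // status message a bounded grace period to show up first.
  void onClose() {
    exitCode.orTimeout(5, TimeUnit.SECONDS) // timeout value is illustrative
        .whenComplete((code, error) -> finishDownload());
  }

  private void finishDownload() {
    // flush and close the local output once queued stdOut messages are drained
  }
}
```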

The upload is thornier. We really don't have any good way to know whether the server has received our data. There is no expected exit code message, and even if one did arrive, a misordering of messages by the api server would at best show up as a tar error. The best I can come up with is that we could compute / request a checksum afterwards and clearly error and/or retry some number of times until there is a match.
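
A rough sketch of that idea against the existing file().upload() and exec() entry points (the helper, retry policy, and timeout are hypothetical, not the actual fix):

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.dsl.ExecWatch;

import java.io.ByteArrayOutputStream;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

class VerifiedUpload {
  // Hypothetical retry loop: upload, then run md5sum in the container and
  // compare against the locally computed digest, retrying on a mismatch.
  static boolean uploadWithChecksum(KubernetesClient client, String ns, String pod,
      Path local, String remotePath, String localMd5, int maxAttempts) throws Exception {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      client.pods().inNamespace(ns).withName(pod)
          .file(remotePath)
          .upload(local);

      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try (ExecWatch watch = client.pods().inNamespace(ns).withName(pod)
          .writingOutput(out)
          .exec("md5sum", remotePath)) {
        watch.exitCode().get(30, TimeUnit.SECONDS);
      }
      if (out.toString().startsWith(localMd5)) {
        return true; // remote digest matches the local one
      }
    }
    return false; // caller should surface a clear error after maxAttempts
  }
}
```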

Added a comment to the upstream issue and linked to #4923 for good measure.

shawkins added six commits to shawkins/kubernetes-client referencing this issue on Mar 9, 2023; manusa pushed a commit referencing this issue on Mar 10, 2023
@manusa manusa closed this as completed in 11daf25 Mar 10, 2023
@manusa manusa reopened this Mar 10, 2023
shawkins added further commits to shawkins/kubernetes-client referencing this issue between Mar 13 and Mar 28, 2023; manusa pushed commits referencing this issue on Mar 31 and Apr 3, 2023
@manusa manusa added this to the 6.6.0 milestone Apr 4, 2023