
Unstable OpenShift builds after bumping Kubernetes Client to 6.4 #2024

Closed
2 tasks
manusa opened this issue Jan 26, 2023 · 12 comments · Fixed by fabric8io/kubernetes-client#4899
Labels
bug Something isn't working


@manusa
Member

manusa commented Jan 26, 2023

Describe the bug

After the recent upgrade to Kubernetes Client 6.4.0 (#1988) we're getting random errors when using the OpenShift Build goal/task:

Failed to execute the build: Unable to build the image using the OpenShift build service: Can't instantiate binary build, due to error reading/writing stream. Can be caused if the output stream was closed by the server.

Eclipse JKube version

1.11-SNAPSHOT

Component

OpenShift Maven Plugin

Steps to reproduce

Perform OpenShift builds with the SNAPSHOT version, especially for big images.

Expected behavior

Builds should complete successfully.

Tasks

@manusa manusa added the bug Something isn't working label Jan 26, 2023
@manusa manusa self-assigned this Jan 26, 2023
@manusa
Member Author

manusa commented Jan 26, 2023

/cc @shawkins for awareness

@shawkins

Is this the debug issue?

@manusa
Member Author

manusa commented Jan 26, 2023

I'm out of context now, can't recall what the debug issue is. If you mean the one about the logs, no it's not the same.

In this one we're getting some sort of timeouts when sending the binary file to OpenShift. I'm preparing a demo for tomorrow that requires other things from our SNAPSHOT and couldn't further investigate the issue 😓

I'll probably spend some time on this on Monday.

@manusa
Member Author

manusa commented Feb 21, 2023

The TimeoutException stacktrace:

io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:542)
	at io.fabric8.openshift.client.dsl.internal.build.BuildConfigOperationsImpl.submitToApiServer(BuildConfigOperationsImpl.java:267)
	at io.fabric8.openshift.client.dsl.internal.build.BuildConfigOperationsImpl.fromFile(BuildConfigOperationsImpl.java:165)
	at io.fabric8.openshift.client.dsl.internal.build.BuildConfigOperationsImpl.fromFile(BuildConfigOperationsImpl.java:63)
	at io.fabric8.openshift.BuildIT.buildImage(BuildIT.java:118)
...
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:519)
	... 125 more

@manusa
Member Author

manusa commented Feb 21, 2023

It seems that the OperationSupport#waitForResult call is using the global config instead of the one provided by the context at BuildConfigOperationsImpl#submitToApiServer.

Probably introduced by fabric8io/kubernetes-client#4678
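A minimal JDK-only sketch of the suspected behavior (the names `waitForResult` and the timeout values are illustrative, not the actual fabric8 internals): a central blocking helper that applies a timeout taken from the global config reproduces the `TimeoutException` from the stack trace above whenever the operation (here, a slow binary-build upload) outlives that timeout, even though the operation itself asked to wait indefinitely.

```java
import java.util.concurrent.*;

public class WaitForResultSketch {

    // Sketch of the suspected bug: 0 conventionally means "wait indefinitely",
    // but the caller may pass a bounded timeout from the *global* config instead.
    static <T> T waitForResult(CompletableFuture<T> future, long timeoutMillis)
            throws Exception {
        if (timeoutMillis <= 0) {
            return future.get();                                      // indefinite wait
        }
        return future.get(timeoutMillis, TimeUnit.MILLISECONDS);      // bounded wait
    }

    static CompletableFuture<String> slowUpload(ExecutorService pool) {
        // Simulates a long binary-build upload (300 ms here).
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            return "build-complete";
        }, pool);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Using the per-operation timeout of 0 succeeds...
        System.out.println(waitForResult(slowUpload(pool), 0));

        // ...but blocking with a short "global" timeout instead reproduces
        // the TimeoutException seen in the stack trace above.
        try {
            waitForResult(slowUpload(pool), 100);
        } catch (TimeoutException e) {
            System.out.println("TimeoutException, as in the report");
        }
        pool.shutdown();
    }
}
```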

@shawkins

@manusa there are too many paths for timeout handling... There was a change to OperationSupport to enforce the readTimeout client side, as OkHttp in particular was prone to enqueuing a request and not enforcing the timeout while enqueued. Rather than updating the HttpClient to expose its config, given the other changes we've made, it would make more sense to move this enforcement out of OperationSupport and into the standard HttpClient logic handling sendAsync.
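A JDK-only sketch of what moving the enforcement into the shared sendAsync path could look like (the `sendAsync` name and structure here are illustrative assumptions, not the actual fabric8 API): the future returned by the HTTP layer is capped with the request's own readTimeout via `CompletableFuture.orTimeout`, which fires even while the request is still enqueued, so callers like `waitForResult` no longer need to consult any config themselves.

```java
import java.util.concurrent.*;

public class SendAsyncTimeoutSketch {

    // Hypothetical sendAsync wrapper: applies the per-request readTimeout
    // at the HTTP-client layer. 0 means no client-side timeout.
    static <T> CompletableFuture<T> sendAsync(CompletableFuture<T> rawResponse,
                                              long readTimeoutMillis) {
        if (readTimeoutMillis <= 0) {
            return rawResponse;
        }
        // orTimeout completes the future exceptionally with TimeoutException
        // once the deadline passes, regardless of the request's state.
        return rawResponse.orTimeout(readTimeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        // A response future that never completes, simulating a stalled request.
        CompletableFuture<String> slow = new CompletableFuture<>();
        sendAsync(slow, 50)
            .exceptionally(t -> "failed: " + t.getClass().getSimpleName())
            .thenAccept(System.out::println)
            .join();
        // prints: failed: TimeoutException
    }
}
```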

@manusa
Member Author

manusa commented Feb 21, 2023

I'm working on the fix; it really doesn't make sense that submitToApiServer calls the waitForResult method, since it should wait indefinitely.
I'll be sending a PR tomorrow morning (CET).

@shawkins

shawkins commented Feb 21, 2023

I'm working on the fix; it really doesn't make sense that submitToApiServer calls the waitForResult method, since it should wait indefinitely.

It's no different than the other paths that call sendAsync - it's a blocking operation that is expected to complete within some timeout. The issue is that waitForResult is referencing the default timeout value - and that's a problem we could have with any of the other code paths (none of which are currently affected) if we don't move (or remove) the OperationSupport enforcement. You could make the case for removal if we believe that the changes to the okhttp threading defaults are sufficient for addressing this and that it's unlikely to be a problem with the other httpclients.

@manusa
Member Author

manusa commented Feb 21, 2023

it's a blocking operation that is expected to complete within some timeout

I'm not sure there is a specific timeout value in this case. What should we default this value to? This is an OpenShift S2i build, I'm not aware of any restrictions in terms of timeout for these scenarios. And I would rather not put myself in the position of imposing one arbitrary value. IMO the timeout should happen server-side.

@shawkins

I'm not sure there is a specific timeout value in this case.

It defaults to 0, but is settable by the user here https://github.com/fabric8io/kubernetes-client/blob/1d3e263cc6216f4a3adad6b677539186179f41c2/openshift-client/src/main/java/io/fabric8/openshift/client/dsl/internal/build/BuildConfigOperationsImpl.java#L244

It's then set on the httpclient https://github.com/fabric8io/kubernetes-client/blob/1d3e263cc6216f4a3adad6b677539186179f41c2/openshift-client/src/main/java/io/fabric8/openshift/client/dsl/internal/build/BuildConfigOperationsImpl.java#L260 and in the following line.

Setting the readTimeout overlaps with the behavior of RequestConfig.getRequestTimeout.

IMO the timeout should happen server-side.

Just to make sure we're on the same page: in all the places where we manipulate the HttpClient read/write timeouts, the expectation is that they will be enforced by the HttpClient. In particular, the write timeout is not enforced by the JDK client; it is enforced as the sum of the read and write timeouts by Jetty (which may confuse the meaning of either being a 0/indefinite value), and as an idle write timeout on Vert.x (which seems incorrect at first glance). Given the state of things, I'd definitely be in favor of getting rid of this write timeout from the HttpClient. I'm not sure how consistently the readTimeout is treated - we know it won't be enforced by OkHttp while the request is enqueued.

@manusa
Member Author

manusa commented Feb 22, 2023

It defaults to 0, but is settable by the user here https://github.com/fabric8io/kubernetes-client/blob/1d3e263cc6216f4a3adad6b677539186179f41c2/openshift-client/src/main/java/io/fabric8/openshift/client/dsl/internal/build/BuildConfigOperationsImpl.java#L244

OK, I thought that these values were not applicable to this scenario. Then waitForResult needs some refactoring and should probably be moved elsewhere. I'll update my changes.

IMO the timeout should happen server-side.

This statement refers exclusively to the S2I binary build scenario, more specifically to the case where the user doesn't provide a timeout. Meaning that our client-side read/write timeouts should default to 0 (run indefinitely).

Setting the readTimeout overlaps with the behavior of RequestConfig.getRequestTimeout.

I'm not even sure what requestTimeout stands for. I understand connection timeout and read timeout, and I can try to understand write timeout. So possible interpretations are: a) the sum of the values (IMO wrong), b) the minimum of connect+read and connect+write, c) something else.

I think that, to make things consistent, RequestConfig should admit customized values for connect and read (and maybe write, if we keep this setting), instead of a totally new field.
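A small sketch of that suggestion (hypothetical types, not the real fabric8 RequestConfig): per-request overrides for the existing connect and read timeouts, rather than a separate requestTimeout field, let a binary S2I build opt out of the client-side read timeout while keeping the connect default.

```java
public class RequestConfigSketch {

    // Hypothetical per-request timeout holder; 0 means "wait indefinitely".
    record RequestTimeouts(long connectMillis, long readMillis) {
        RequestTimeouts withReadMillis(long readMillis) {
            return new RequestTimeouts(connectMillis, readMillis);
        }
    }

    public static void main(String[] args) {
        RequestTimeouts defaults = new RequestTimeouts(10_000, 30_000);
        // A binary S2I build opts out of the client-side read timeout.
        RequestTimeouts binaryBuild = defaults.withReadMillis(0);
        System.out.println(binaryBuild.readMillis());    // read timeout overridden to 0
        System.out.println(binaryBuild.connectMillis()); // connect default kept
    }
}
```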

Given the state of things, I'd definitely be in favor of getting rid of this write timeout from the HttpClient.

+1

@shawkins

OK, I thought that these values were not applicable to this scenario. Then waitForResult needs some refactoring and should probably be moved elsewhere. I'll update my changes.

Prior to the optional enforcement of the timeout, waitForResult was simply the central method for making calls blocking, so that the same exception handling did not end up in every blocking location.

I'm not even sure what requestTimeout stands for.

Prior to adding the waitForResult check, which more closely matches the full request cycle, it was treated as the default read timeout.
