SharedInformer does not survive to an API server restart #2992

Closed
akram opened this issue Apr 14, 2021 · 4 comments · Fixed by #3018

@akram
Contributor

akram commented Apr 14, 2021

A SharedInformer created to watch ImageStreams or BuildConfigs does not survive a k8s API server restart.

Please note that this is probably related to a bug in the k8s API, as I was able to reproduce the behaviour using the oc command.

As a user of oc, if I restart the API server while watching imagestreams, I get the following error:

STATUS                REASON            MESSAGE
Failure               InternalError     an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 3; INTERNAL_ERROR") has prevented the request from succeeding

The same operation using oc get secrets -w does not fail.

In the kubernetes-client, this manifests as an EOFException caught in io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener, which does not restart the WebSocket but instead discards it from the manager.

2021-04-14 12:07:02 WARNING io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener onFailure Exec Failure
java.io.EOFException
	at okio.RealBufferedSource.require(RealBufferedSource.java:61)
	at okio.RealBufferedSource.readByte(RealBufferedSource.java:74)
	at okhttp3.internal.ws.WebSocketReader.readHeader(WebSocketReader.java:117)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:101)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

From the user's point of view, this failure is silent.

As a possible fix, we could consider adding an else branch here: https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatcherWebSocketListener.java#L105

For a websocket that already exists and has been started, it is still possible to get a null response, which may mean that the websocket was started previously but has since become unavailable.
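
The else branch suggested above could look roughly like the following. This is a minimal, self-contained sketch and not the actual WatcherWebSocketListener code: the class WatchFailureSketch, the Action enum, and the parameters started, httpStatus and canReconnect are hypothetical names chosen for illustration. It only shows the decision being proposed, namely that a failure on an already started websocket with no HTTP response should lead to a reconnect instead of the watch being dropped.

```java
public class WatchFailureSketch {

    // Hypothetical outcomes of handling a websocket failure.
    enum Action { REPORT_ERROR, RECONNECT, CLOSE }

    // httpStatus is null when the failure carried no HTTP response,
    // for example an EOFException on an established connection.
    static Action onFailure(boolean started, Integer httpStatus, boolean canReconnect) {
        if (httpStatus != null) {
            // The server answered with an HTTP status: surface it to the caller.
            return Action.REPORT_ERROR;
        }
        if (!started) {
            // The websocket never opened, so there is nothing to resume.
            return Action.REPORT_ERROR;
        }
        // Proposed else branch: the websocket was started and then lost
        // without a response, so try to reconnect if the limit allows it.
        return canReconnect ? Action.RECONNECT : Action.CLOSE;
    }

    public static void main(String[] args) {
        // Started websocket, no response, reconnects allowed.
        System.out.println(onFailure(true, null, true)); // prints RECONNECT
    }
}
```

Returning an action rather than performing it keeps the sketch independent of any real manager or listener type.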

edit:
After discussing with the apiserver team, it seems that this also impacts core objects, not only OpenShift-specific ones. In my test, I was deleting only the OpenShift apiserver part, but the same error is raised for any other objects.

@akram changed the title from "SharedInformer on build.openshift.io or image.openshift.io apiGroups resources does not survive to an API server restart" to "SharedInformer does not survive to an API server restart" Apr 14, 2021
@akram closed this as completed Apr 15, 2021
@akram reopened this Apr 15, 2021
@shawkins
Contributor

@akram you should see reconnects -

I think it defaults to unlimited reconnect attempts -

@shawkins
Contributor

I believe we have hit a similar situation. If the relist operation fails because the API server is unavailable, it looks like no further reconnects will be attempted.

@shawkins
Contributor

Specifically, we see this in the logs afterwards:

ERROR [io.fab.kub.cli.dsl.int.WatchConnectionManager] (OkHttp https://172.30.0.1/...) Unhandled exception encountered in watcher event handler: java.util.concurrent.RejectedExecutionException: Error while doing ReflectorRunnable list

where the root exception is a timeout.
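
One way to keep a failed relist from ending all further attempts is to retry the list call with a backoff. The sketch below is only an illustration under stated assumptions: RelistRetrySketch and listWithRetry are hypothetical names, the list call is modelled as a plain Supplier, and the 30 second cap on the delay is an arbitrary choice. It is not taken from the client and does not describe how the issue was eventually fixed.

```java
import java.util.function.Supplier;

public class RelistRetrySketch {

    // Calls listCall until it succeeds or maxAttempts is reached,
    // doubling the delay between attempts up to a 30 second cap.
    static <T> T listWithRetry(Supplier<T> listCall, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return listCall.get();
            } catch (RuntimeException e) {
                last = e; // e.g. a timeout while the API server is restarting
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting to retry", ie);
                    }
                    delay = Math.min(delay * 2, 30_000L);
                }
            }
        }
        throw last; // all attempts failed: rethrow the last error
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated list call that fails twice before succeeding.
        String result = listWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("API server unavailable");
            }
            return "listed";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls"); // prints "listed after 3 calls"
    }
}
```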

@manusa
Member

manusa commented May 5, 2021

Relates to: #2010
