Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky integration tests with gRPC errors on close #228

Closed
pawelprazak opened this issue Oct 9, 2023 · 7 comments
Closed

Flaky integration tests with gRPC errors on close #228

pawelprazak opened this issue Oct 9, 2023 · 7 comments
Labels
area/core The SDK's core code impact/flaky-test A test that is unreliable
Milestone

Comments

@pawelprazak
Copy link
Collaborator

We are experiencing failing gRPC shutdown form our core SDK in integration tests, e.g.:

    Exception in thread "main" io.grpc.StatusRuntimeException: UNAVAILABLE: Channel shutdown invoked
    	at io.grpc.Status.asRuntimeException(Status.java:535)
    	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533)
    	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
    	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    	at java.base/java.lang.Thread.run(Thread.java:829)

source: https://github.com/VirtusLab/besom/actions/runs/6432539960/job/17467662339?pr=227

Looks similar to this old issue: grpc/grpc-java#7752

In short, an exception is being thrown while closing gRPC internal executor when its status in "not OK" at the point in time when the code expects it to be "OK" - I don't have enough context to understand it fully at this point.

@pawelprazak pawelprazak added area/core The SDK's core code impact/flaky-test A test that is unreliable labels Oct 9, 2023
@lbialy
Copy link
Collaborator

lbialy commented Oct 9, 2023

Does the test in question hang?

@lbialy
Copy link
Collaborator

lbialy commented Oct 9, 2023

It shouldn't be possible by contract of our resource handling protocol that something would try to still use a grpc connection that is already closed from what I recall. If this is use-after-close then it's definitely a bug.

@pawelprazak
Copy link
Collaborator Author

pawelprazak commented Oct 9, 2023

It'd failed after 32s, so unlikely. I have a strong suspicion that there might be a problem where the main thread finishes before the gRPC thread due to SIGTERM or SIGKILL but I'm not sure what the GHA runner does actually.

In this test we are starting scala-cli that starts the test, that starts a subprocess that starts the gRPC thread, so there's a lot of places where thing could get done in the wrong order on shutdown. Unfortunately, I'm only guessing at this point.

From reading the code, it looks like the problem is in the closing logic itself, so not the typical use-after-close, but more like, use-during-close

We might want to just upgrade the gRPC to latest and see what happens.

@lbialy
Copy link
Collaborator

lbialy commented Oct 9, 2023

In case of main thread finishing due to sigterm/sigkill you wouldn't see that error because a) we don't handle signals to trigger resource closing logic, b) this exception is thrown on the main thread. It's actually quite possible that it is a bug in resource handling logic. Do we have any way to replicate this?

@pawelprazak
Copy link
Collaborator Author

pawelprazak commented Oct 9, 2023

We might be missing awaitTermination, I'll test that hypothesis in a PR.

I was unsuccessful in replicating it locally, or replicating reliably in CI.

pawelprazak added a commit that referenced this issue Oct 9, 2023
pawelprazak added a commit that referenced this issue Oct 9, 2023
pawelprazak added a commit that referenced this issue Oct 9, 2023
@lbialy
Copy link
Collaborator

lbialy commented Oct 9, 2023

Oh, yeah, that might do it.

pawelprazak added a commit that referenced this issue Oct 9, 2023
@pawelprazak
Copy link
Collaborator Author

This might help, esp. the SIGINT handling: #245

lbialy added a commit that referenced this issue Nov 14, 2023
…therefore breaking the ordering guarantees by Result.resource
@lbialy lbialy closed this as completed in 1644b27 Nov 14, 2023
@pawelprazak pawelprazak added this to the 0.2.0 milestone Nov 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/core The SDK's core code impact/flaky-test A test that is unreliable
Projects
None yet
Development

No branches or pull requests

2 participants