Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky Test: GrpcServerTest#testGrpcExecutorPool #388

Closed
2 of 3 tasks
zuston opened this issue Dec 6, 2022 · 9 comments · Fixed by #404
Closed
2 of 3 tasks

Flaky Test: GrpcServerTest#testGrpcExecutorPool #388

zuston opened this issue Dec 6, 2022 · 9 comments · Fixed by #404

Comments

@zuston
Copy link
Member

zuston commented Dec 6, 2022

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the flaky test

Error:  Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.137 s <<< FAILURE! - in org.apache.uniffle.common.GrpcServerTest
Error:  testGrpcExecutorPool  Time elapsed: 4.054 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <0.0> but was: <1.0>
	at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
	at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:70)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:65)
	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:885)
	at org.apache.uniffle.common.GrpcServerTest.testGrpcExecutorPool(GrpcServerTest.java:83)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
	at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
	at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
	at org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
	at org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)

Actions URL

https://github.com/apache/incubator-uniffle/actions/runs/3625931180/jobs/6114463345

Parent issue

#1733

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@zuston zuston added the good first issue Good for newcomers label Dec 6, 2022
@jerqi jerqi removed the good first issue Good for newcomers label Dec 6, 2022
@jerqi
Copy link
Contributor

jerqi commented Dec 6, 2022

Is Flaky test a good first issue?

@zuston
Copy link
Member Author

zuston commented Dec 6, 2022

Emm... Good question.

@jerqi
Copy link
Contributor

jerqi commented Dec 6, 2022

Emm... Good question.

I feel it's difficult to fix flaky test. WDYT?

@jerqi
Copy link
Contributor

jerqi commented Dec 6, 2022

If it's easy to fix, we should fix the flaky test immediately.

@zuston
Copy link
Member Author

zuston commented Dec 6, 2022

Emm... Good question.

I feel it's difficult to fix flaky test. WDYT?

Yes. Maybe if we have the clear fix solution or small easy features, it could be mark it as good first issue.

jerqi pushed a commit that referenced this issue Dec 13, 2022
…uge` (#404)

### What changes were proposed in this pull request?
Fix incorrect usage of `GRPCMetrics#setGauge`

### Why are the changes needed?
It is a bug #244 #388 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need
@zhengchenyu
Copy link
Collaborator

This problem still exists (https://github.com/apache/incubator-uniffle/actions/runs/5518411958/jobs/10062325314?pr=982).

The testGrpcExecutorPool has two state, then verify the metrics.
First state : assert the metrics when two thread has already run, but the third thread is not running.
Second state: assert the metrics when all thread is done.

I think the problem is that we can't make sure to enter first state. Because the three thread only sleep 200ms, in this patch sleep 120ms to enter the first state, the mistake is only 80ms, it is too short.
In my laptop, I change 120ms to 190ms, the probability of reproduce this problem will be increased.

So I think two way to solve this problem:
(a) increased thread sleep time from 200ms to 1000ms
(b) do not verify the first state.

@jerqi @zuston how about you?

@jerqi
Copy link
Contributor

jerqi commented Jul 11, 2023

This problem still exists (https://github.com/apache/incubator-uniffle/actions/runs/5518411958/jobs/10062325314?pr=982).

The testGrpcExecutorPool has two state, then verify the metrics. First state : assert the metrics when two thread has already run, but the third thread is not running. Second state: assert the metrics when all thread is done.

I think the problem is that we can't make sure to enter first state. Because the three thread only sleep 200ms, in this patch sleep 120ms to enter the first state, the mistake is only 80ms, it is too short. In my laptop, I change 120ms to 190ms, the probability of reproduce this problem will be increased.

So I think two way to solve this problem: (a) increased thread sleep time from 200ms to 1000ms (b) do not verify the first state.

@jerqi @zuston how about you?

I prefer solution a.

@jerqi
Copy link
Contributor

jerqi commented Jul 19, 2023

This problem still exists (https://github.com/apache/incubator-uniffle/actions/runs/5518411958/jobs/10062325314?pr=982).

The testGrpcExecutorPool has two state, then verify the metrics. First state : assert the metrics when two thread has already run, but the third thread is not running. Second state: assert the metrics when all thread is done.

I think the problem is that we can't make sure to enter first state. Because the three thread only sleep 200ms, in this patch sleep 120ms to enter the first state, the mistake is only 80ms, it is too short. In my laptop, I change 120ms to 190ms, the probability of reproduce this problem will be increased.

So I think two way to solve this problem: (a) increased thread sleep time from 200ms to 1000ms (b) do not verify the first state.

@jerqi @zuston how about you?

@zhengchenyu There is a fix. #1023

@zhengchenyu
Copy link
Collaborator

@jerqi OK, #1023 is a great work indeed. Wait the condition is more reliable than reset the sleep time.

jerqi pushed a commit that referenced this issue Jul 19, 2023
### What changes were proposed in this pull request?
I observed a previous fix injected some delay to fix the flaky test, but I believe that fix might be unstable in a CI environment or when run on different machines, given the dependency on some constant wait time. I suggest a new way to fix the test by adding some synchronization for the test execution only. I at first identify the source code location whose slow execution leads to the flaky test failure, where if common/src/main/java/org/apache/uniffle/common/rpc/GrpcServer.java#157 slows the grpcMetrics.incGauge() and grpcMetrics.setGauge(), then the test fails (https://github.com/apache/incubator-uniffle/blob/master/common/src/main/java/org/apache/uniffle/common/rpc/GrpcServer.java#L157 and https://github.com/apache/incubator-uniffle/blob/master/common/src/main/java/org/apache/uniffle/common/rpc/GrpcServer.java#L158). Hence, I introduce one variable in this class [GrpcServer.java](https://github.com/apache/incubator-uniffle/blob/master/common/src/main/java/org/apache/uniffle/common/rpc/GrpcServer.java#L158) that is only there to provide some synchronization. Basically, until these two statements are executed, I force the thread that the test runs on to wait before it accesses the value of grpcMetrics.getGaugeMap(). The waiting location is at https://github.com/apache/incubator-uniffle/blob/master/common/src/test/java/org/apache/uniffle/common/rpc/GrpcServerTest.java#L72

### Why are the changes needed?
This test is flakily fails. I run this test many times and it makes assertion fails. The failure message is as follows.
Failure:
[INFO] Running org.apache.uniffle.common.rpc.GrpcServerTest
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.185 s <<< FAILURE! - in org.apache.uniffle.common.rpc.GrpcServerTest
[ERROR] testGrpcExecutorPool Time elapsed: 0.142 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <2.0> but was: <0.0>
at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:70)
at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:65)
at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:885)
at org.apache.uniffle.common.rpc.GrpcServerTest.testGrpcExecutorPool(GrpcServerTest.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)

[ERROR] Failures:
[ERROR] GrpcServerTest.testGrpcExecutorPool:76 expected: <2.0> but was: <0.0>
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0

### How was this patch tested?
I run this test more than 1000 times and the test always passes.

Co-authored-by: other <other@ECE-A55006.austin.utexas.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants