Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ISSUE-507] Fix Flaky Test: ShuffleBufferManagerTest#cacheShuffleDataTest #511

Merged
merged 5 commits into from
Jan 29, 2023

Conversation

xianjingfeng
Copy link
Member

@xianjingfeng xianjingfeng commented Jan 28, 2023

What changes were proposed in this pull request?

Fix Flaky Test: ShuffleBufferManagerTest#cacheShuffleDataTest.

Why are the changes needed?

It is flaky Test. Fix #507

I found logs as follows in org.apache.uniffle.server.buffer.ShuffleBufferManagerTest-output.txt.

2023-01-28 09:24:40,791 INFO  [expiredAppCleaner] server.ShuffleTaskManager (ShuffleTaskManager.java:checkResourceStatus(511)) - Detect expired appId[removeShuffleDataTest1] according to rss.server.app.expired.withoutHeartbeat

So i think the reason is that some threads have not exit after running ShuffleTaskManagerTest#removeShuffleDataWithHdfsTest.

Does this PR introduce any user-facing change?

No

How was this patch tested?

No need

kaijchen
kaijchen previously approved these changes Jan 28, 2023
Copy link
Contributor

@kaijchen kaijchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @xianjingfeng for the fix.

Copy link
Member

@zuston zuston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codecov-commenter
Copy link

codecov-commenter commented Jan 28, 2023

Codecov Report

Merging #511 (51f248b) into master (0a6605b) will increase coverage by 0.01%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master     #511      +/-   ##
============================================
+ Coverage     59.73%   59.75%   +0.01%     
- Complexity     1764     1765       +1     
============================================
  Files           205      205              
  Lines         11527    11527              
  Branches       1033     1033              
============================================
+ Hits           6886     6888       +2     
+ Misses         4234     4233       -1     
+ Partials        407      406       -1     
Impacted Files Coverage Δ
...pache/hadoop/mapreduce/task/reduce/RssFetcher.java 92.30% <0.00%> (+1.53%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@xianjingfeng
Copy link
Member Author

It seems this bug is not fixed yet. https://github.com/apache/incubator-uniffle/actions/runs/4031027397/jobs/6930075442

Updated

Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@advancedxy
Copy link
Contributor

So i think the reason is that some threads have not exit after running ShuffleTaskManagerTest#removeShuffleDataWithHdfsTest.

This pr should fix the flaky test. However, I think this statement indicates that there might be other resource leaking problem in the test code as many test create a shuffle server but never stop the server. We should create a new issue to track this problem: close/stop shuffle server after use in each test case. @xianjingfeng or @zuston(since you wrote most of the related test code) could you help track and resolve this issue?

@zuston
Copy link
Member

zuston commented Jan 29, 2023

So i think the reason is that some threads have not exit after running ShuffleTaskManagerTest#removeShuffleDataWithHdfsTest.

This pr should fix the flaky test. However, I think this statement indicates that there might be other resource leaking problem in the test code as many test create a shuffle server but never stop the server. We should create a new issue to track this problem: close/stop shuffle server after use in each test case. @xianjingfeng or @zuston(since you wrote most of the related test code) could you help track and resolve this issue?

Make sense. By the way, let's raise a test improvement issue? Except above problems, the speed of CI could be improved, including removing the unused mini HDFS cluster setup for some test basic class and sharing the mini HDFS for all class to avoid being created multiple times.

@advancedxy
Copy link
Contributor

So i think the reason is that some threads have not exit after running ShuffleTaskManagerTest#removeShuffleDataWithHdfsTest.

This pr should fix the flaky test. However, I think this statement indicates that there might be other resource leaking problem in the test code as many test create a shuffle server but never stop the server. We should create a new issue to track this problem: close/stop shuffle server after use in each test case. @xianjingfeng or @zuston(since you wrote most of the related test code) could you help track and resolve this issue?

Make sense. By the way, let's raise a test improvement issue? Except above problems, the speed of CI could be improved, including removing the unused mini HDFS cluster setup for some test basic class and sharing the mini HDFS for all class to avoid being created multiple times.

Sounds good to me. The average UT time is around 20mins, this is quite some time. Let's speed that up.

@advancedxy advancedxy merged commit cd2662f into apache:master Jan 29, 2023
@advancedxy
Copy link
Contributor

Merged this, thanks @xianjingfeng, @kaijchen and @zuston

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky Test: ShuffleBufferManagerTest#cacheShuffleDataTest
5 participants