[Bug] [tez] shuffle server may leak if not register remote storage. #1070

Closed
3 tasks done
zhengchenyu opened this issue Aug 2, 2023 · 0 comments · Fixed by #1076

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

In our cluster, we saw require_buffer_failed increase rapidly. We then found that used_buffer_size stayed very high, above the high watermark, even though no application had been running for a long time.
The log showed the following error:

[ERROR] 2023-08-02 12:09:00,329 ProcessEventThread DefaultFlushEventHandler processNextEvent - Exception happened when process event.
org.apache.uniffle.common.exception.RssException: Unexpected storage type!
	at org.apache.uniffle.server.DefaultFlushEventHandler.processNextEvent(DefaultFlushEventHandler.java:122)
	at org.apache.uniffle.server.DefaultFlushEventHandler.eventLoop(DefaultFlushEventHandler.java:109)
	at java.base/java.lang.Thread.run(Thread.java:829)

After some debugging, we found that when large shuffle data arrives, the flush falls back to Hadoop storage, but the Hadoop storage is null here, so many flush events fail.
Why is the Hadoop storage null? Because registerRemoteStorage is never called.
Why is registerRemoteStorage never called? Because the conf in TezRemoteShuffleManager does not merge 'rss.storage.type' from the dynamic configuration, and I did not set 'rss.storage.type' on the client side.
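
The root cause above can be illustrated with a minimal sketch. It is not Uniffle's code: the class `DynamicConfMergeSketch`, its methods, and the use of plain `Map`s are hypothetical, the storage type value is only an example, and both the rule that client-side values override dynamic ones and the rule that decides when remote storage is needed are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Uniffle: merges dynamic (coordinator-supplied) settings into client settings. */
final class DynamicConfMergeSketch {

    /**
     * Returns a new map holding every client setting plus any dynamic setting
     * the client did not set itself. Assumption: client-side values win.
     */
    static Map<String, String> merge(Map<String, String> clientConf,
                                     Map<String, String> dynamicConf) {
        Map<String, String> merged = new HashMap<>(dynamicConf);
        merged.putAll(clientConf); // explicit client settings override dynamic ones
        return merged;
    }

    /** Assumed rule for this sketch: remote storage is needed only when the storage type mentions HDFS. */
    static boolean needsRemoteStorage(Map<String, String> conf) {
        String storageType = conf.get("rss.storage.type");
        return storageType != null && storageType.contains("HDFS");
    }

    public static void main(String[] args) {
        Map<String, String> clientConf = new HashMap<>(); // client did not set rss.storage.type
        Map<String, String> dynamicConf = new HashMap<>();
        dynamicConf.put("rss.storage.type", "MEMORY_LOCALFILE_HDFS"); // example value

        // Without merging, the client never learns that remote storage is required.
        System.out.println(needsRemoteStorage(clientConf));
        // After merging, the storage type is visible and registration can happen.
        System.out.println(needsRemoteStorage(merge(clientConf, dynamicConf)));
    }
}
```

In this sketch the unmerged conf has no storage type, so the check that would lead to remote storage registration is skipped, which mirrors the failure described in this report.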

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
zhengchenyu added a commit to zhengchenyu/incubator-uniffle that referenced this issue Aug 3, 2023
jerqi pushed a commit that referenced this issue Aug 4, 2023
…age. (#1076)

### What changes were proposed in this pull request?

* Merge dynamicClientConf for TezRemoteShuffleManager. This solves the problem that remote storage cannot be registered when tez.rss.storage.type is not set on the client side.
* Call doCleanup when the selected storage is null. This solves the problem that usedMemory and inFlushSize are leaked.
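
The second change can be sketched roughly as follows. This is a simplified stand-in, not the actual patch: `FlushCleanupSketch` and its methods are hypothetical, the counters only borrow the names usedMemory and inFlushSize from this description, and the try/finally structure is one possible way to guarantee cleanup rather than necessarily the one the patch uses.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a flush handler; not Uniffle's implementation. */
final class FlushCleanupSketch {
    final AtomicLong usedMemory = new AtomicLong();
    final AtomicLong inFlushSize = new AtomicLong();

    /** Accounts for a buffer accepted from a client. */
    void accept(long size) {
        usedMemory.addAndGet(size);
    }

    /** Marks a buffer as queued for flushing. */
    void startFlush(long size) {
        inFlushSize.addAndGet(size);
    }

    /**
     * Processes one flush event. A null storage models the case where no
     * remote storage was registered. The counters are released on every path,
     * including the null-storage path that previously leaked them.
     */
    boolean processEvent(long size, Object storage) {
        try {
            if (storage == null) {
                return false; // nothing to write to; cleanup still runs in finally
            }
            // ... write the data to the selected storage here ...
            return true;
        } finally {
            doCleanup(size);
        }
    }

    /** Releases the memory accounting held by a flush event. */
    private void doCleanup(long size) {
        inFlushSize.addAndGet(-size);
        usedMemory.addAndGet(-size);
    }

    public static void main(String[] args) {
        FlushCleanupSketch handler = new FlushCleanupSketch();
        handler.accept(100);
        handler.startFlush(100);
        boolean flushed = handler.processEvent(100, null); // no storage available
        System.out.println(flushed + " " + handler.usedMemory.get() + " " + handler.inFlushSize.get());
    }
}
```

The point of the sketch is that a failed flush must still return its share of the counters. Otherwise the server keeps reporting high buffer usage after all applications have finished.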

### Why are the changes needed?

When this bug is triggered, usedMemory and inFlushSize stay high even though no applications have been running for a long time.

Fix: #1070

### How was this patch tested?

Unit tests and testing on a cluster.