[Bug] [tez] shuffle server may leak if not register remote storage. #1070

zhengchenyu · 2023-08-02T11:42:04Z

Code of Conduct

I agree to follow this project's Code of Conduct

Search before asking

I have searched in the issues and found no similar issues.

Describe the bug

In our cluster, we found that require_buffer_failed was increasing rapidly. We then found that used_buffer_size stayed very high, above the high watermark, even though no application had been running for a long time.
We then found the following error in the log:

[ERROR] 2023-08-02 12:09:00,329 ProcessEventThread DefaultFlushEventHandler processNextEvent - Exception happened when process event.
org.apache.uniffle.common.exception.RssException: Unexpected storage type!
	at org.apache.uniffle.server.DefaultFlushEventHandler.processNextEvent(DefaultFlushEventHandler.java:122)
	at org.apache.uniffle.server.DefaultFlushEventHandler.eventLoop(DefaultFlushEventHandler.java:109)
	at java.base/java.lang.Thread.run(Thread.java:829)

After some debugging, we found that when large shuffle data arrives, the flush falls back to Hadoop storage. Here the Hadoop storage is null, so many flush events fail.
Why is the Hadoop storage null? Because registerRemoteStorage is never called.
Why is registerRemoteStorage never called? Because the conf in TezRemoteShuffleManager does not merge 'rss.storage.type' from the dynamic configuration, and I did not set 'rss.storage.type' on the client side.
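
The leak described above can be illustrated with a minimal sketch. It is not the Uniffle server code: `BufferAccounting`, `FlushEvent`, `Storage`, `processLeaky`, and `processWithCleanup` are hypothetical names chosen for illustration, and the two counters only stand in for the server's buffer metrics. The sketch assumes that memory is reserved before a flush event is processed and must be released once the event finishes, whether it succeeds or fails.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FlushCleanupSketch {

    /** Hypothetical stand-in for the server's buffer counters. */
    public static final class BufferAccounting {
        public final AtomicLong usedMemory = new AtomicLong();
        public final AtomicLong inFlushSize = new AtomicLong();

        public void reserve(long size) {
            usedMemory.addAndGet(size);
        }

        public void startFlush(long size) {
            inFlushSize.addAndGet(size);
        }

        public void release(long size) {
            usedMemory.addAndGet(-size);
            inFlushSize.addAndGet(-size);
        }
    }

    /** Hypothetical flush event that only carries its size in bytes. */
    public static final class FlushEvent {
        public final long size;

        public FlushEvent(long size) {
            this.size = size;
        }
    }

    /** Hypothetical storage abstraction; null models "remote storage never registered". */
    public interface Storage {
        void write(FlushEvent event);
    }

    /** Leaky variant: throws before the counters are released. */
    public static void processLeaky(FlushEvent event, Storage storage, BufferAccounting acc) {
        if (storage == null) {
            throw new IllegalStateException("Unexpected storage type!");
        }
        storage.write(event);
        acc.release(event.size);
    }

    /** Variant with cleanup: the counters are released even when no storage is selected. */
    public static boolean processWithCleanup(FlushEvent event, Storage storage, BufferAccounting acc) {
        if (storage == null) {
            acc.release(event.size); // give the memory back instead of leaking it
            return false;
        }
        try {
            storage.write(event);
            return true;
        } finally {
            acc.release(event.size);
        }
    }
}
```

In the leaky variant, every failed event leaves its bytes counted in both counters, which matches the symptom of a buffer that stays above the high watermark with no application running.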

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!



…age. (#1076)

### What changes were proposed in this pull request?

* Merge dynamicClientConf for TezRemoteShuffleManager. This solves the problem that remote storage cannot be registered when tez.rss.storage.type is not set on the client side.
* Call doCleanup when the selected storage is null. This solves the problem that usedMemory and inFlushSize are leaked.

### Why are the changes needed?

When this bug is triggered, usedMemory and inFlushSize stay high even though no applications have been running for a long time.

Fix: #1070

### How was this patch tested?

Unit tests and tests on a cluster.
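
The configuration merge can be sketched as follows. This is an illustration under stated assumptions, not the code from the pull request: plain `Map<String, String>` objects stand in for the real configuration classes, `merge` and `needsRemoteStorage` are hypothetical helpers, and the sketch assumes that a value set explicitly on the client side takes precedence over the dynamic one. The value `MEMORY_LOCALFILE_HDFS` is used only as an example storage type.

```java
import java.util.HashMap;
import java.util.Map;

public class DynamicConfMergeSketch {

    /**
     * Returns a copy of clientConf with every missing key filled in from dynamicConf.
     * Assumption: keys already set on the client side are not overwritten.
     */
    public static Map<String, String> merge(
            Map<String, String> clientConf, Map<String, String> dynamicConf) {
        Map<String, String> merged = new HashMap<>(clientConf);
        for (Map.Entry<String, String> entry : dynamicConf.entrySet()) {
            merged.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return merged;
    }

    /**
     * Hypothetical check: remote storage has to be registered only when the
     * storage type under the given key mentions HDFS.
     */
    public static boolean needsRemoteStorage(Map<String, String> conf, String storageTypeKey) {
        String type = conf.get(storageTypeKey);
        return type != null && type.contains("HDFS");
    }
}
```

Without the merge, a client that never sets the storage type sees no value for that key, so the check fails and remote storage is never registered. After the merge, the dynamically provided value is visible and registration can proceed.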

zhengchenyu added a commit to zhengchenyu/incubator-uniffle that referenced this issue Aug 3, 2023

[apache#1070] fix(tez): shuffle server may leak if not register remot…

887562b

…e storage.

zhengchenyu mentioned this issue Aug 3, 2023

[#1070] fix(tez): shuffle server may leak if not register remote stor… #1076

Merged

jerqi closed this as completed in #1076 Aug 4, 2023
