I have searched in the issues and found no similar issues.
Describe the bug
In our cluster, we found that require_buffer_failed was increasing rapidly. We then found that used_buffer_size stayed very high, above the high watermark, even though no application had been running for a long time.
We then found the following error in the log:
[ERROR] 2023-08-02 12:09:00,329 ProcessEventThread DefaultFlushEventHandler processNextEvent - Exception happened when process event.
org.apache.uniffle.common.exception.RssException: Unexpected storage type!
at org.apache.uniffle.server.DefaultFlushEventHandler.processNextEvent(DefaultFlushEventHandler.java:122)
at org.apache.uniffle.server.DefaultFlushEventHandler.eventLoop(DefaultFlushEventHandler.java:109)
at java.base/java.lang.Thread.run(Thread.java:829)
After some debugging, we found that when large shuffle data arrives, the flush falls back to Hadoop storage, but Hadoop storage is null here, so many flush events fail.
Why is Hadoop storage null? Because registerRemoteStorage is never called.
Why is registerRemoteStorage never called? Because the conf in TezRemoteShuffleManager does not merge 'rss.storage.type' from the dynamic configuration, and I did not set 'rss.storage.type' on the client side.
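To illustrate the missing merge step, here is a minimal sketch using plain maps instead of the real configuration classes. The class and method names are hypothetical, the "tez." key prefix is an assumption based on the 'tez.rss.storage.type' key mentioned in the fix, and letting client-side values take precedence is also an assumption. It is not the actual Uniffle code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual Uniffle implementation.
final class DynamicConfMergeSketch {
  // Assumed prefix for Tez-side keys (inferred from "tez.rss.storage.type").
  static final String TEZ_PREFIX = "tez.";

  // Returns a copy of clientConf with dynamic entries filled in where the client set nothing.
  static Map<String, String> mergeDynamicConf(
      Map<String, String> clientConf, Map<String, String> dynamicConf) {
    Map<String, String> merged = new HashMap<>(clientConf);
    for (Map.Entry<String, String> entry : dynamicConf.entrySet()) {
      String key = entry.getKey().startsWith(TEZ_PREFIX)
          ? entry.getKey()
          : TEZ_PREFIX + entry.getKey();
      // Assumption: explicit client-side settings win; dynamic values only fill gaps.
      merged.putIfAbsent(key, entry.getValue());
    }
    return merged;
  }
}
```

With a merge like this, a storage type supplied only by the dynamic configuration becomes visible to the code that decides whether to register remote storage.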
Affects Version(s)
master
Uniffle Server Log Output
No response
Uniffle Engine Log Output
No response
Uniffle Server Configurations
No response
Uniffle Engine Configurations
No response
Additional context
No response
Are you willing to submit PR?
Yes I am willing to submit a PR!
zhengchenyu added a commit to zhengchenyu/incubator-uniffle that referenced this issue on Aug 3, 2023
…age. (#1076)
### What changes were proposed in this pull request?
* Merge dynamicClientConf for TezRemoteShuffleManager. This solves the problem that remote storage cannot be registered when tez.rss.storage.type is not set on the client side.
* Call doCleanup when the selected storage is null. This solves the problem that usedMemory and inFlushSize are leaked.
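
The second change can be pictured with a simplified accounting model. This is only a sketch under stated assumptions: the class, the `reserve` and `processEvent` methods, and the single-size bookkeeping are hypothetical simplifications, and only the names usedMemory, inFlushSize, and doCleanup come from the description above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the buffer accounting, not the actual Uniffle implementation.
final class FlushCleanupSketch {
  final AtomicLong usedMemory = new AtomicLong();
  final AtomicLong inFlushSize = new AtomicLong();

  // Simplified stand-in for buffering incoming shuffle data.
  void reserve(long size) {
    usedMemory.addAndGet(size);
  }

  // Returns true if the event was written, false if it was dropped because no storage was selected.
  boolean processEvent(long size, Object selectedStorage) {
    inFlushSize.addAndGet(size);
    try {
      if (selectedStorage == null) {
        // Without the cleanup in the finally block, this early exit would leak both counters.
        return false;
      }
      // A real handler would write the data to selectedStorage here.
      return true;
    } finally {
      doCleanup(size);
    }
  }

  // Releases the accounting for one event, whether or not the write happened.
  void doCleanup(long size) {
    inFlushSize.addAndGet(-size);
    usedMemory.addAndGet(-size);
  }
}
```

Putting the release in a `finally` block is one way to make sure the counters return to their previous values on every exit path, including the null-storage case.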
### Why are the changes needed?
When this bug is triggered, usedMemory and inFlushSize stay high even though no applications have been running for a long time.
Fix: #1070
### How was this patch tested?
Unit tests and testing on a cluster.