[Bug] Fix skip() API which may skip unexpected bytes and cause inconsistent data #40
Conversation
Could you add a UT for this case?
Codecov Report

```
@@            Coverage Diff             @@
##           master      #40      +/-  ##
============================================
+ Coverage   56.81%   56.83%   +0.01%
- Complexity   1203     1207       +4
============================================
  Files         152      152
  Lines        8401     8437      +36
  Branches      813      819       +6
============================================
+ Hits         4773     4795      +22
- Misses       3369     3380      +11
- Partials      259      262       +3
```

Continue to review the full report at Codecov.
The fix targets a potential problem with this API; the issue itself can't be reproduced.
Actually, I'm not quite sure whether skip() causes the problem, but with this fix we can check it via the new log.
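For background on why skip() is suspect: the `InputStream.skip(n)` contract only promises to skip *up to* n bytes, so a single call may advance less than requested and the next read then starts at the wrong offset. A minimal sketch of a loop that skips fully (the helper name `skipFully` is illustrative, not the project's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SkipFully {
    // Keep calling skip() until the requested count is reached,
    // falling back to read() when skip() makes no progress.
    static long skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped == 0) {
                if (in.read() == -1) {
                    break; // EOF reached before skipping n bytes
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
        return n - remaining;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        // Simulate a stream whose skip() advances at most 3 bytes per call,
        // which is legal under the InputStream.skip() contract.
        InputStream shortSkipper = new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public long skip(long n) throws IOException {
                return super.skip(Math.min(n, 3));
            }
        };
        long skipped = skipFully(shortSkipper, 40);
        System.out.println(skipped);              // 40
        System.out.println(shortSkipper.read());  // 40 (first byte after the skip)
    }
}
```

A bare `in.skip(40)` on the same stream would have advanced only 3 bytes.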
Do the Hdfs
```diff
@@ -41,7 +41,22 @@ public LocalFileReader(String path) throws Exception {

   public byte[] read(long offset, int length) {
     try {
```
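For context on the diff hunk above, here is a hedged sketch of the kind of guard such a change introduces in a `read(offset, length)` method (the class name `GuardedReader`, the in-memory backing array, and the messages are illustrative, not the actual patch):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class GuardedReader {
    private final byte[] backing; // stands in for the local shuffle file

    public GuardedReader(byte[] backing) {
        this.backing = backing;
    }

    public byte[] read(long offset, int length) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(backing))) {
            // Verify skip() actually reached the target offset instead of
            // trusting a single call's return value.
            long skipped = 0;
            while (skipped < offset) {
                long s = in.skip(offset - skipped);
                if (s <= 0) {
                    throw new IOException(
                        "Can't skip to offset " + offset + ", stopped at " + skipped);
                }
                skipped += s;
            }
            byte[] buf = new byte[length];
            in.readFully(buf); // unlike read(), never returns a short buffer
            return buf;
        } catch (Exception e) {
            return new byte[0]; // a real reader would log a warning here
        }
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7};
        byte[] out = new GuardedReader(data).read(3, 2);
        System.out.println(out[0] + "," + out[1]); // 3,4
    }
}
```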
`at` -> `targetSkip` or `maxSkip`?
ok
HDFS has its own API.
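For reference, HDFS input streams implement `Seekable`, so an HDFS reader can position to an absolute offset instead of relying on `skip()`. A local-file reader can get the same guarantee with `FileChannel` (a sketch of the alternative, not the project's actual reader):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SeekRead {
    // Absolute positioning sidesteps the skip() contract entirely:
    // position(offset) either succeeds or throws, it never lands short.
    static byte[] readAt(Path file, long offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ch.position(offset);
            ByteBuffer buf = ByteBuffer.allocate(length);
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep reading until the buffer is full or EOF
            }
            return buf.array();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("seek", ".bin");
        Files.write(tmp, new byte[]{10, 20, 30, 40, 50});
        byte[] out = readAt(tmp, 2, 2);
        System.out.println(out[0] + "," + out[1]); // 30,40
        Files.delete(tmp);
    }
}
```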
LGTM. I tried to reproduce this problem, but failed.
@colinmjj Do we need to backport this patch to branch 0.5? |
sure, I'll create a new PR for the backport |
…tent data (apache#40)

### What changes were proposed in this pull request?
Fix a bug when calling `inputstream.skip()`, which may return an unexpected result.

### Why are the changes needed?
We get exception messages like the following, possibly caused by unexpected data from `Local` storage:
```
com.tencent.rss.common.exception.RssException: Unexpected crc value for blockId[9992363390829154], expected:2562548848, actual:2244862586
	at com.tencent.rss.client.impl.ShuffleReadClientImpl.readShuffleBlockData(ShuffleReadClientImpl.java:184)
	at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:99)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With current UTs
…tent data (#40) (#52): backport of the same fix, with an identical description, to branch 0.5.
…1632)

### What changes were proposed in this pull request?
In our cluster, pod deletion was denied by a webhook even though all applications had been deleted long before. When I curl http://host:ip/metrics/server, I found app_num_with_node is 1. The problem is that some applications are leaked. I see many duplicated logs: `[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[appattempt_xxx_xx_xx] according to rss.server.app.expired.withoutHeartbeat`. When I jstack the server repeatedly, clearResourceThread is stuck forever; here is the call stack:
```
"clearResourceThread" #40 daemon prio=5 os_prio=0 cpu=3767.63ms elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition [0x00007f24f7b33000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.22/Native Method)
	- parking to wait for <0x00007f28d5e29f20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.22/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.22/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.22/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.22/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.22/ReentrantReadWriteLock.java:959)
	at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
	at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
	at org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown Source)
	at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
```
Apparently a lock is not being released. Looking at the code, it's easy to see that the read lock in flushBuffer is not released correctly. The log `ShuffleBufferManager.flushBuffer - Shuffle[3066071] for app[appattempt_xxx] has already been removed, no need to flush the buffer` proves it. Fix: #1631

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No test; obvious mistake.
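The shape of that bug, in a hedged sketch (the class and method names are illustrative, not the literal Uniffle code): a read lock is acquired, an early-return path skips the unlock, and a later writer blocks forever. The standard fix is to release the lock in `finally`:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FlushBufferLock {
    final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private final boolean shuffleRemoved = true; // forces the early-return path

    // Buggy shape: the early return leaks the read lock, so a later
    // writeLock().lock() (as in removeResources()) parks forever.
    void flushBufferLeaky() {
        rwLock.readLock().lock();
        if (shuffleRemoved) {
            return; // read lock never released
        }
        // ... flush the buffer ...
        rwLock.readLock().unlock();
    }

    // Fixed shape: unlock in finally so every exit path releases the lock.
    void flushBufferFixed() {
        rwLock.readLock().lock();
        try {
            if (shuffleRemoved) {
                return;
            }
            // ... flush the buffer ...
        } finally {
            rwLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        FlushBufferLock f = new FlushBufferLock();
        f.flushBufferFixed();
        // The write lock is obtainable because the read lock was released.
        System.out.println(f.rwLock.writeLock().tryLock()); // true
    }
}
```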