
[Bug] Fix skip() api maybe skip unexpected bytes which makes inconsistent data #40

Merged: 3 commits into apache:master on Jul 11, 2022

Conversation

colinmjj commented Jul 8, 2022

What changes were proposed in this pull request?

Fix a bug when calling `inputstream.skip()`, which may return an unexpected result (fewer bytes skipped than requested).
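
For context, the contract of `java.io.InputStream.skip(n)` allows it to skip fewer than `n` bytes (even zero), so a caller has to loop and check the return value. Below is a minimal sketch of that defensive pattern; the helper name `skipFully` and the single-byte fallback read are illustrative, not the exact code in this PR.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public final class SkipUtil {

  /**
   * Skip exactly toSkip bytes. InputStream.skip(n) may legally skip fewer
   * bytes than requested, so a single call is not enough.
   */
  public static void skipFully(InputStream in, long toSkip) throws IOException {
    long remaining = toSkip;
    while (remaining > 0) {
      long skipped = in.skip(remaining);
      if (skipped <= 0) {
        // Fall back to a single-byte read to tell EOF apart from a lazy skip().
        if (in.read() == -1) {
          throw new EOFException("Stream ended with " + remaining + " bytes left to skip");
        }
        remaining -= 1;
      } else {
        remaining -= skipped;
      }
    }
  }
}
```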

Why are the changes needed?

We got the following exception messages, which may be caused by unexpected data from `Local` storage:

com.tencent.rss.common.exception.RssException: Unexpected crc value for blockId[9992363390829154], expected:2562548848, actual:2244862586
        at com.tencent.rss.client.impl.ShuffleReadClientImpl.readShuffleBlockData(ShuffleReadClientImpl.java:184)
        at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:99)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)

Does this PR introduce any user-facing change?

No

How was this patch tested?

With current UTs

jerqi (Contributor) commented Jul 8, 2022

Could you add a UT for this case?

codecov-commenter commented Jul 8, 2022

Codecov Report

Merging #40 (dcf752e) into master (f7c65d4) will increase coverage by 0.01%.
The diff coverage is 55.55%.

@@             Coverage Diff              @@
##             master      #40      +/-   ##
============================================
+ Coverage     56.81%   56.83%   +0.01%     
- Complexity     1203     1207       +4     
============================================
  Files           152      152              
  Lines          8401     8437      +36     
  Branches        813      819       +6     
============================================
+ Hits           4773     4795      +22     
- Misses         3369     3380      +11     
- Partials        259      262       +3     
| Impacted Files | Coverage Δ |
| --- | --- |
| .../uniffle/storage/handler/impl/LocalFileReader.java | 53.33% <55.55%> (-1.22%) ⬇️ |
| ...org/apache/uniffle/server/ShuffleFlushManager.java | 76.70% <0.00%> (-1.71%) ⬇️ |
| .../java/org/apache/uniffle/server/ShuffleServer.java | 72.26% <0.00%> (-0.30%) ⬇️ |
| ...rg/apache/uniffle/client/util/RssClientConfig.java | 0.00% <0.00%> (ø) |
| ...apache/hadoop/mapreduce/v2/app/RssMRAppMaster.java | 0.00% <0.00%> (ø) |
| ...a/org/apache/uniffle/server/ShuffleServerConf.java | 99.30% <0.00%> (+0.01%) ⬆️ |
| .../java/org/apache/hadoop/mapreduce/RssMRConfig.java | 91.66% <0.00%> (+0.75%) ⬆️ |
| .../java/org/apache/spark/shuffle/RssSparkConfig.java | 88.88% <0.00%> (+1.38%) ⬆️ |
| ...pache/hadoop/mapreduce/task/reduce/RssFetcher.java | 90.62% <0.00%> (+1.56%) ⬆️ |
| ...org/apache/spark/shuffle/RssSparkShuffleUtils.java | 49.20% <0.00%> (+6.34%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

colinmjj requested review from frankliee and jerqi July 8, 2022 09:07
colinmjj (Author) commented Jul 8, 2022

> Could you add a UT for this case?

The fix targets a potential problem with this API, and it can't be reproduced.
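
If a UT were still wanted, one way to simulate the failure mode is a wrapper stream whose `skip()` deliberately under-skips; a hypothetical sketch, not part of this PR (the class name is made up):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Test double whose skip() never advances more than one byte per call. */
class ShortSkipInputStream extends FilterInputStream {

  ShortSkipInputStream(InputStream in) {
    super(in);
  }

  @Override
  public long skip(long n) throws IOException {
    // Deliberately under-skip so the caller's retry/check logic is exercised.
    return super.skip(Math.min(n, 1L));
  }
}
```

A test could then wrap the reader's underlying stream with this class and assert that reads still land on the expected offset.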

colinmjj (Author) commented Jul 8, 2022

Actually, I'm not quite sure whether `skip()` causes the problem, but with the fix we can check it via the new log.

jerqi (Contributor) commented Jul 8, 2022

Does the HDFS `skip()` API have similar problems?

@@ -41,7 +41,22 @@ public LocalFileReader(String path) throws Exception {

public byte[] read(long offset, int length) {
try {

Contributor (review comment on the diff above): `at` -> `targetSkip` or `maxSkip`?

Author (reply): ok

colinmjj (Author) commented Jul 8, 2022

> Does the HDFS `skip()` API have similar problems?

HDFS has its own `seek()` API which wraps `skip()`, e.g. `DFSInputStream`, and it also checks the result from `skip()`.
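
For comparison, a caller on HDFS typically positions the stream with `seek()` and then uses `readFully()`, which loops internally. A minimal sketch using the standard Hadoop FileSystem API (the path and offsets are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSeekExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("hdfs:///tmp/shuffle.data"); // illustrative path
    try (FileSystem fs = FileSystem.get(path.toUri(), conf);
         FSDataInputStream in = fs.open(path)) {
      in.seek(1024L);          // position at an absolute offset directly
      byte[] buffer = new byte[256];
      in.readFully(buffer);    // loops internally until the buffer is filled
    }
  }
}
```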

jerqi (Contributor) commented Jul 8, 2022

LGTM. I tried to reproduce this problem, but I failed.

frankliee merged commit cbe39c1 into apache:master Jul 11, 2022
jerqi (Contributor) commented Jul 12, 2022

@colinmjj Do we need to backport this patch to branch 0.5?

colinmjj (Author) commented:

> @colinmjj Do we need to backport this patch to branch 0.5?

Sure, I'll create a new PR for the backport.

colinmjj pushed a commit to colinmjj/incubator-uniffle that referenced this pull request Jul 12, 2022: [Bug] Fix skip() api maybe skip unexpected bytes which makes inconsistent data (apache#40)
jerqi pushed a commit that referenced this pull request Jul 12, 2022: [Bug] Fix skip() api maybe skip unexpected bytes which makes inconsistent data (#40) (#52)
zuston (Member) commented Aug 10, 2023

Yes, this problem still exists after this patch, and there is no exception log on the server side. @jerqi @colinmjj

zuston pushed a commit that referenced this pull request Apr 10, 2024
…1632)

### What changes were proposed in this pull request?

In our cluster, deleting the pod is denied by a webhook, even though all applications have been deleted for a long time.
When I curl http://host:ip/metrics/server, I find that app_num_with_node is 1.
The problem is that some applications are leaked. I see many duplicated logs `[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[appattempt_xxx_xx_xx] according to rss.server.app.expired.withoutHeartbeat`.
When I jstack the server many times, clearResourceThread is stuck forever; here is the call stack:
```
"clearResourceThread" #40 daemon prio=5 os_prio=0 cpu=3767.63ms elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition  [0x00007f24f7b33000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.22/Native Method)
	- parking to wait for  <0x00007f28d5e29f20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.22/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.22/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.22/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.22/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.22/ReentrantReadWriteLock.java:959)
	at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
	at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
	at org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown Source)
	at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
```

Apparently there is a lock that is not being released. Looking at the code, it is easy to see that the read lock in `flushBuffer` is not released correctly. The log `ShuffleBufferManager.flushBuffer - Shuffle[3066071] for app[appattempt_xxx] has already been removed, no need to flush the buffer` proves it.

Fix: #1631 
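
The usual way to avoid this kind of leak is to release the read lock in a `finally` block, so that every early return still unlocks. A simplified sketch, not the actual `ShuffleBufferManager` code (class and method names are illustrative):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class BufferFlushSketch {

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void flushBuffer(String appId) {
    lock.readLock().lock();
    try {
      if (!bufferExists(appId)) {
        // Early return is safe: the finally block below still unlocks.
        return;
      }
      // ... flush the buffer while holding the read lock ...
    } finally {
      lock.readLock().unlock();
    }
  }

  private boolean bufferExists(String appId) {
    return false; // placeholder for the real lookup
  }
}
```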

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No test; the mistake is obvious.