
[Bug] NullPointerException of WriterBuffer.getData due to race condition #808

Closed · 3 tasks done
zuston opened this issue Apr 11, 2023 · 5 comments · Fixed by #848

Comments

zuston (Member) commented Apr 11, 2023

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

This bug is caused by #706. After that change, the buffers in WriterBuffer are accessed by multiple threads.

Stacktrace:

23/04/10 07:38:03 ERROR Executor: Exception in task 3025.0 in stage 0.0 (TID 676)
java.lang.NullPointerException
	at org.apache.spark.shuffle.writer.WriterBuffer.getData(WriterBuffer.java:77)
	at org.apache.spark.shuffle.writer.WriteBufferManager.createShuffleBlock(WriteBufferManager.java:224)
	at org.apache.spark.shuffle.writer.WriteBufferManager.clear(WriteBufferManager.java:213)
	at org.apache.spark.shuffle.writer.WriteBufferManager.addRecord(WriteBufferManager.java:198)
	at org.apache.spark.shuffle.writer.RssShuffleWriter.doWrite(RssShuffleWriter.java:213)
	at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:167)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
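
To make the failure mode concrete, here is a minimal, hypothetical sketch (not Uniffle code) of how an unsynchronized buffer list accessed by a writer thread and a spill thread can fail inside a `getData`-style method:

```java
// Hypothetical reproduction of the race (not Uniffle code): one thread keeps
// appending to a plain ArrayList while another thread reads every element.
// ArrayList gives no happens-before guarantees between threads, so the reader
// can observe a grown size with a not-yet-published (null) slot and fail with
// a NullPointerException, much like the WriterBuffer.getData failure above.
import java.util.ArrayList;
import java.util.List;

public class UnsafeBufferRace {
    private final List<byte[]> buffers = new ArrayList<>(); // not thread-safe

    void addRecord(byte[] chunk) {
        buffers.add(chunk);                                  // called on the task thread
    }

    int getDataSize() {
        int size = 0;
        for (byte[] b : buffers) {
            size += b.length;                                // NPE if a slot is not visible yet
        }
        return size;
    }

    public static void main(String[] args) throws InterruptedException {
        UnsafeBufferRace wb = new UnsafeBufferRace();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                wb.addRecord(new byte[8]);
            }
        });
        Thread spiller = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                try {
                    wb.getDataSize();                        // racy read from another thread
                } catch (RuntimeException e) {               // NPE or ConcurrentModificationException
                    System.out.println("race observed: " + e);
                    return;
                }
            }
        });
        writer.start();
        spiller.start();
        writer.join();
        spiller.join();
    }
}
```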

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
zuston (Member, Author) commented Apr 11, 2023

I want to use a thread-safe list to solve this problem. WDYT? @jerqi

jerqi (Contributor) commented Apr 11, 2023

> I want to use a thread-safe list to solve this problem. WDYT? @jerqi

Ok for me.
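
For reference, a rough sketch of the thread-safe-list idea discussed here (this is only an illustration of the suggestion, not the change that was eventually merged; #848 took a different approach):

```java
// Illustration of the thread-safe-list suggestion, not the merged fix.
// Collections.synchronizedList makes individual operations atomic, but iteration
// (as in a getData-style method) still has to hold the list's monitor.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SynchronizedBuffers {
    private final List<byte[]> buffers =
            Collections.synchronizedList(new ArrayList<byte[]>());

    void addRecord(byte[] chunk) {
        buffers.add(chunk);                      // atomic add, safe from any thread
    }

    byte[] getData() {
        synchronized (buffers) {                 // must lock while iterating a synchronizedList
            int size = 0;
            for (byte[] b : buffers) {
                size += b.length;
            }
            byte[] out = new byte[size];
            int pos = 0;
            for (byte[] b : buffers) {
                System.arraycopy(b, 0, out, pos, b.length);
                pos += b.length;
            }
            return out;
        }
    }
}
```

Note that locking the hot `addRecord` path is exactly the overhead the eventual fix in #848 set out to avoid.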

zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 11, 2023
zuston (Member, Author) commented Apr 11, 2023

Oh. This bug may cause data loss.

jerqi (Contributor) commented Apr 11, 2023

> Oh. This bug may cause data loss.

I would like to revert #706. I took a look at the PR you proposed as a fix, and the approach is not clear to me.

jerqi (Contributor) commented Apr 11, 2023

Why would this issue cause data loss? We have checked the block infos in the reader.

zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 28, 2023
zuston added a commit to zuston/incubator-uniffle that referenced this issue Jul 20, 2023
zuston added a commit that referenced this issue Jul 22, 2023 (#848)

### What changes were proposed in this pull request?

1. Guarantees thread safety by only allowing spills to be triggered by the current thread
2. Uses the same block-processing logic in `RssShuffleWriter` and `WriteBufferManager` to ensure data consistency

### Why are the changes needed?

Fix: #808 

In this PR, we use two approaches to solve the concurrency problem between the `addRecord` and `spill` functions (see the sketch below):
1. When spill is invoked on the task's own thread, i.e. while adding records and memory is insufficient, thread safety is guaranteed, so the spill is done synchronously.
2. When spill is invoked by other consumers, nothing is done on that thread; only a signal is set so that the owner releases memory on its next `addRecord`.

With this, we avoid a lock (which could cause a performance regression, as #811 did) while keeping thread safety.
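
A minimal sketch of that owner-thread/signal approach (the class and method names are illustrative, not the actual Uniffle API):

```java
// Sketch only: the owning task thread spills synchronously, while any other
// caller just raises a flag that the owner honours on its next addRecord().
import java.util.concurrent.atomic.AtomicBoolean;

class OwnerOnlySpillBuffer {
    private final long ownerThreadId = Thread.currentThread().getId();
    private final AtomicBoolean spillRequested = new AtomicBoolean(false);

    long spill() {
        if (Thread.currentThread().getId() != ownerThreadId) {
            // Another consumer (e.g. the engine's memory manager) asked us to free
            // memory: do not touch the buffers from this thread, just leave a signal.
            spillRequested.set(true);
            return 0L;                       // nothing released right now
        }
        return doSpillSync();                // owner thread: safe to spill immediately
    }

    void addRecord(byte[] record) {
        if (spillRequested.compareAndSet(true, false)) {
            doSpillSync();                   // honour the pending request on the owner thread
        }
        // ... buffer the record, spilling again if memory is insufficient ...
    }

    private long doSpillSync() {
        // flush buffered blocks and return the number of bytes released
        return 0L;
    }
}
```

Because only the owning task thread ever mutates the buffers, no lock is needed on the `addRecord` path.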

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. UTs
rickyma added a commit to rickyma/incubator-uniffle that referenced this issue Mar 6, 2024
zuston pushed a commit that referenced this issue Mar 7, 2024: …sure data correctness (#1558)

### What changes were proposed in this pull request?

Verify the number of written records to improve data accuracy: make sure all data records are sent by the clients, and make sure bugs like #714 are never reintroduced.
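
A hypothetical sketch of such a record-count check (illustrative names, not Uniffle's API):

```java
// Sketch of the verification described above: compare the number of records the
// writer consumed with the number the buffer manager actually sent, and fail the
// task if they diverge instead of silently losing data.
class RecordCountCheck {
    static void verify(long recordsWrittenByTask, long recordsSentToServers) {
        if (recordsWrittenByTask != recordsSentToServers) {
            throw new IllegalStateException(
                "Potential data loss: wrote " + recordsWrittenByTask
                    + " records but only sent " + recordsSentToServers);
        }
    }
}
```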

### Why are the changes needed?

A follow-up PR for #848.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.