
[Bug] NullPointerException of WriterBuffer.getData due to race condition #808

Closed · 3 tasks done
zuston opened this issue Apr 11, 2023 · 5 comments · Fixed by #848

Comments

zuston (Member) commented Apr 11, 2023

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

This bug is caused by #706. After that change, the buffers in WriterBuffer are accessed by multiple threads.

Stacktrace:

23/04/10 07:38:03 ERROR Executor: Exception in task 3025.0 in stage 0.0 (TID 676)
java.lang.NullPointerException
	at org.apache.spark.shuffle.writer.WriterBuffer.getData(WriterBuffer.java:77)
	at org.apache.spark.shuffle.writer.WriteBufferManager.createShuffleBlock(WriteBufferManager.java:224)
	at org.apache.spark.shuffle.writer.WriteBufferManager.clear(WriteBufferManager.java:213)
	at org.apache.spark.shuffle.writer.WriteBufferManager.addRecord(WriteBufferManager.java:198)
	at org.apache.spark.shuffle.writer.RssShuffleWriter.doWrite(RssShuffleWriter.java:213)
	at org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:167)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
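
To make the failure mode concrete, here is a minimal, hypothetical sketch (not Uniffle code) of how an unsynchronized buffer list accessed by a writer thread and a spill thread can fail inside a `getData`-style method:

```java
// Hypothetical reproduction of the race (not Uniffle code): one thread keeps
// appending to a plain ArrayList while another thread reads every element.
// ArrayList gives no happens-before guarantees between threads, so the reader
// can observe a grown size with a not-yet-published (null) slot and fail with
// a NullPointerException, much like the WriterBuffer.getData failure above.
import java.util.ArrayList;
import java.util.List;

public class UnsafeBufferRace {
    private final List<byte[]> buffers = new ArrayList<>(); // not thread-safe

    void addRecord(byte[] chunk) {
        buffers.add(chunk);                                  // called on the task thread
    }

    int getDataSize() {
        int size = 0;
        for (byte[] b : buffers) {
            size += b.length;                                // NPE if a slot is not visible yet
        }
        return size;
    }

    public static void main(String[] args) throws InterruptedException {
        UnsafeBufferRace wb = new UnsafeBufferRace();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1_000_000; i++) {
                wb.addRecord(new byte[8]);
            }
        });
        Thread spiller = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                try {
                    wb.getDataSize();                        // racy read from another thread
                } catch (RuntimeException e) {               // NPE or ConcurrentModificationException
                    System.out.println("race observed: " + e);
                    return;
                }
            }
        });
        writer.start();
        spiller.start();
        writer.join();
        spiller.join();
    }
}
```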

Affects Version(s)

master

Uniffle Server Log Output

No response

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
zuston (Member, Author) commented Apr 11, 2023

I want to use a thread-safe list to solve this problem. WDYT? @jerqi

jerqi (Contributor) commented Apr 11, 2023

> I want to use a thread-safe list to solve this problem. WDYT? @jerqi

Ok for me.
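
For reference, a rough sketch of the thread-safe-list idea discussed here (this is only an illustration of the suggestion, not the change that was eventually merged; #848 took a different approach):

```java
// Illustration of the thread-safe-list suggestion, not the merged fix.
// Collections.synchronizedList makes individual operations atomic, but iteration
// (as in a getData-style method) still has to hold the list's monitor.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SynchronizedBuffers {
    private final List<byte[]> buffers =
            Collections.synchronizedList(new ArrayList<byte[]>());

    void addRecord(byte[] chunk) {
        buffers.add(chunk);                      // atomic add, safe from any thread
    }

    byte[] getData() {
        synchronized (buffers) {                 // must lock while iterating a synchronizedList
            int size = 0;
            for (byte[] b : buffers) {
                size += b.length;
            }
            byte[] out = new byte[size];
            int pos = 0;
            for (byte[] b : buffers) {
                System.arraycopy(b, 0, out, pos, b.length);
                pos += b.length;
            }
            return out;
        }
    }
}
```

Note that locking the hot `addRecord` path is exactly the overhead the eventual fix in #848 set out to avoid.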

zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 11, 2023
zuston (Member, Author) commented Apr 11, 2023

Oh. This bug may cause data loss.

jerqi (Contributor) commented Apr 11, 2023

> Oh. This bug may cause data loss.

I would like to revert #706. I took a look at the PR you proposed as a fix, and the approach is not clear to me.

jerqi (Contributor) commented Apr 11, 2023

Why would this issue cause data loss? We have checked the block infos in the reader.

zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 28, 2023
zuston added a commit to zuston/incubator-uniffle that referenced this issue Jul 20, 2023
zuston added a commit that referenced this issue Jul 22, 2023 (#848)

### What changes were proposed in this pull request?

1. Guarantees thread safety by only allowing spills to be triggered by the current thread
2. Uses the same block-processing logic in `RssShuffleWriter` and `WriteBufferManager` to ensure data consistency

### Why are the changes needed?

Fix: #808 

In this PR, we use two approaches to solve the concurrency problem between the `addRecord` and `spill` functions (see the sketch below):
1. When spill is invoked on the task's own thread, i.e. while adding records and memory is insufficient, thread safety is guaranteed, so the spill is done synchronously.
2. When spill is invoked by other consumers, nothing is done on that thread; only a signal is set so that the owner releases memory on its next `addRecord`.

With this, we avoid a lock (which could cause a performance regression, as #811 did) while keeping thread safety.
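
A minimal sketch of that owner-thread/signal approach (the class and method names are illustrative, not the actual Uniffle API):

```java
// Sketch only: the owning task thread spills synchronously, while any other
// caller just raises a flag that the owner honours on its next addRecord().
import java.util.concurrent.atomic.AtomicBoolean;

class OwnerOnlySpillBuffer {
    private final long ownerThreadId = Thread.currentThread().getId();
    private final AtomicBoolean spillRequested = new AtomicBoolean(false);

    long spill() {
        if (Thread.currentThread().getId() != ownerThreadId) {
            // Another consumer (e.g. the engine's memory manager) asked us to free
            // memory: do not touch the buffers from this thread, just leave a signal.
            spillRequested.set(true);
            return 0L;                       // nothing released right now
        }
        return doSpillSync();                // owner thread: safe to spill immediately
    }

    void addRecord(byte[] record) {
        if (spillRequested.compareAndSet(true, false)) {
            doSpillSync();                   // honour the pending request on the owner thread
        }
        // ... buffer the record, spilling again if memory is insufficient ...
    }

    private long doSpillSync() {
        // flush buffered blocks and return the number of bytes released
        return 0L;
    }
}
```

Because only the owning task thread ever mutates the buffers, no lock is needed on the `addRecord` path.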

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. UTs
rickyma added a commit to rickyma/incubator-uniffle that referenced this issue Mar 6, 2024
zuston pushed a commit that referenced this issue Mar 7, 2024: …sure data correctness (#1558)

### What changes were proposed in this pull request?

Verify the number of written records to improve data accuracy: make sure all data records are sent by the clients, and make sure bugs like #714 are never reintroduced.
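
A hypothetical sketch of such a record-count check (illustrative names, not Uniffle's API):

```java
// Sketch of the verification described above: compare the number of records the
// writer consumed with the number the buffer manager actually sent, and fail the
// task if they diverge instead of silently losing data.
class RecordCountCheck {
    static void verify(long recordsWrittenByTask, long recordsSentToServers) {
        if (recordsWrittenByTask != recordsSentToServers) {
            throw new IllegalStateException(
                "Potential data loss: wrote " + recordsWrittenByTask
                    + " records but only sent " + recordsSentToServers);
        }
    }
}
```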

### Why are the changes needed?

A follow-up PR for #848.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.