[SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI #6614
Conversation
@@ -139,6 +139,7 @@ private[streaming] class ReceiverSupervisorImpl(
    val blockId = blockIdOption.getOrElse(nextBlockId)
    val numRecords = receivedBlock match {
      case ArrayBufferBlock(arrayBuffer) => arrayBuffer.size
      case IteratorBlock(iterator) => iterator.length
Er, but this consumes the iterator. You can't do this, right?
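The objection above can be shown in a few lines (an illustrative snippet, not code from the PR): calling `length` drains a Scala `Iterator`, so nothing would be left for the BlockManager to store.

```scala
// Scala iterators are single-pass: length walks the whole iterator to count it.
val iterator = Iterator(1, 2, 3)
val numRecords = iterator.length // counts 3 elements, consuming them all
val anythingLeft = iterator.hasNext // false: the block data is gone
```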
I am sorry, my bad. Do let me know if the latest change looks fine.
    val blockId = blockIdOption.getOrElse(nextBlockId)
    val numRecords = receivedBlock match {
      case ArrayBufferBlock(arrayBuffer) => arrayBuffer.size
      case IteratorBlock(iterator) =>
        var arrayBuffer = ArrayBuffer(iterator.toArray : _*)
I don't know that it's safe to copy the iterator contents into memory here. I don't think it's worth it for the UI.
Agree, but a custom Receiver storing blocks using an Iterator cannot utilize the UI stats. Not sure how widely Iterator is used, but in my Kafka consumer (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) I earlier used Iterator and have now changed it to ArrayBuffer. But people who use Iterator will have this issue.
… at Spark UI. Fixed some formatting issue
ok to test
@dibbhatt could you add
Test build #34131 has finished for PR 6614 at commit
added [STREAMING] to the title of the PR
    val blockId = blockIdOption.getOrElse(nextBlockId)
    val numRecords = receivedBlock match {
      case ArrayBufferBlock(arrayBuffer) => arrayBuffer.size
      case IteratorBlock(iterator) =>
        var arrayBuffer = ArrayBuffer(iterator.toArray : _*)
This is quite inefficient, as the data will first be copied into the ArrayBuffer, and then the BlockManager will again use the ArrayBuffer's iterator to store the data. This is precisely why this wasn't done till now. :(
I think we need a different design. There is a way to count the elements in the iterator without putting it into some intermediate buffer. Rather, the iterator is going to be consumed anyway (assuming the StorageLevel is serialized) by the block manager. The counting can be done while that is happening. To do this, you have to construct a special CountingIterator that wraps the original iterator.
After the original iterator has been consumed via this counting iterator, you can get the count of the records it iterated through. This number can then be returned through the BlockStoreResult object. How does that sound?
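The CountingIterator described above could look roughly like this (a minimal sketch of the idea; the name comes from the comment, but the exact fields and visibility are assumptions, not the code that was eventually merged):

```scala
// Wraps an iterator and counts elements as the consumer (the BlockManager)
// pulls them, so no intermediate buffer is needed.
class CountingIterator[T](iterator: Iterator[T]) extends Iterator[T] {
  private var _count = 0

  def hasNext: Boolean = iterator.hasNext

  def next(): T = {
    _count += 1
    iterator.next()
  }

  // Meaningful once the wrapped iterator has been fully consumed.
  def count: Int = _count
}

// Usage: hand the wrapper to whatever consumes the block, then read the count.
val counting = new CountingIterator(Iterator("a", "b", "c"))
while (counting.hasNext) counting.next() // stands in for the BlockManager's pass
val recorded = counting.count
```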
It sounds good, Tathagata. I will update the PR with the changes.
…rrect count at Spark UI
Hi @tdas. Let me know how this looks...
Test build #34192 has finished for PR 6614 at commit
…rrect count at Spark UI Fixed Scala Style check issue
Test build #34197 has finished for PR 6614 at commit
@@ -79,7 +87,10 @@ private[streaming] class BlockManagerBasedBlockHandler(
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
    BlockManagerBasedStoreResult(blockId)
    if(countIterator !=null) {
missing spaces.
I'm surprised the Scala style checks didn't catch this. @rxin, know what's wrong?
This is getting a little complicated to reason about how this will work for all combinations of ReceivedBlock types, StorageLevels, and ReceivedBlockHandler types. This must be done with unit tests. Mind adding unit tests for these A x B x C combinations in the ReceivedBlockHandlerSuite? These tests will let us clearly verify for which combinations we can do this correctly and which combinations we cannot do at all.
Sure Tathagata, will add unit tests.
I am going to merge #6659 to prevent negative numbers in the UI. Could you please merge with master after that PR is merged to verify that there are no negative numbers? Also, if possible, your unit test should verify this as well.
Sure
Just to understand when the message count may go wrong, here is my understanding. Please let me know if I am wrong. Will this be a very common behavior? In my environment, for every storage level I get the correct count of the number of Kafka messages I pumped. Probably I need to hit the memory limit of MemoryStore's unrollSafely to run out of memory, which seems to be a rare scenario. Let me know what you think.
…rrect count at Spark UI Test Cases
Test build #34293 timed out for PR 6614 at commit
Good point. I think the solution is to make the CountingIterator verify whether the end of the internal iterator was reached or not (that is, whether hasNext has returned false or not). Accordingly we will know whether the count is correct or not. In cases where the count is incomplete (e.g. MEMORY_ONLY with not enough memory), the count can be ignored, because in those cases the block was not successfully put into memory, and so the count can be considered to be zero. In fact, ideally we should be throwing exceptions saying that the block could not be inserted into the block manager. Does that make sense?
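The completeness check suggested above could be sketched as follows (an illustrative extension of the counting-wrapper idea, with assumed names like `isFullyConsumed` taken from the discussion, not necessarily the merged code):

```scala
// Counts elements as they are pulled, and exposes whether the wrapped
// iterator was driven to its end, so an incomplete count can be discarded.
class CountingIterator[T](iterator: Iterator[T]) extends Iterator[T] {
  private var _count = 0L

  def hasNext: Boolean = iterator.hasNext

  def next(): T = {
    _count += 1
    iterator.next()
  }

  // True only once the wrapped iterator is exhausted, i.e. the count
  // covers every record in the block.
  def isFullyConsumed: Boolean = !iterator.hasNext

  // Report a count only when it is known to be complete; an incomplete
  // count (e.g. MEMORY_ONLY ran out of memory mid-way) is treated as absent.
  def count: Option[Long] = if (isFullyConsumed) Some(_count) else None
}

val ci = new CountingIterator(Iterator(1, 2, 3))
while (ci.hasNext) ci.next() // full consumption by the store path
val finalCount = ci.count
```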
  }

  test("BlockManagerBasedBlockHandler - count messages MEMORY_AND_DISK") {
    handler = makeBlockManagerBasedBlockHandler(blockManager, StorageLevel.MEMORY_AND_DISK)
This is repeated code! Can't these be made into a function which takes the storage level as a parameter? Each of these unit tests then just calls that function with appropriate parameters. In fact, the ReceivedBlockHandler type can also be a parameter, leading to further code reduction.
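The deduplication suggested above might look like this (a sketch only: `testCountMessages` is a hypothetical helper name, and `makeBlockManagerBasedBlockHandler` and `blockManager` are assumed to be the suite's existing fixtures from the quoted test):

```scala
// One parameterized helper replaces the near-identical test bodies.
def testCountMessages(storageLevel: StorageLevel): Unit = {
  handler = makeBlockManagerBasedBlockHandler(blockManager, storageLevel)
  // ... store blocks through the handler and assert on the reported counts ...
}

test("BlockManagerBasedBlockHandler - count messages MEMORY_ONLY") {
  testCountMessages(StorageLevel.MEMORY_ONLY)
}

test("BlockManagerBasedBlockHandler - count messages MEMORY_AND_DISK") {
  testCountMessages(StorageLevel.MEMORY_AND_DISK)
}
```

Taking the handler constructor as a further parameter would collapse the WAL-based variants into the same helper as well.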
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala
	streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockTrackerSuite.scala
…rrect count at Spark UI
What I observed is, when a block is not able to unrollSafely to memory because enough space is not there, the BlockManager won't try to put the block, and the ReceivedBlockHandler will throw a SparkException as it could not find the block id in PutResult. Thus the block count won't go wrong if a block is not able to unroll safely with MEMORY_ONLY settings. So I was wrong earlier. For MEMORY_AND_DISK settings, if the BlockManager is not able to unroll the block to memory, the block will still get deserialized to disk. The same holds for the WAL-based store. So for those cases (storage level = memory + disk) the count will also come out fine if the block is not able to unroll to memory. Thus I added isFullyConsumed to the CountingIterator but have not used it, as the case will never happen that a block is not fully consumed and the ReceivedBlockHandler still gets the block ID. I have added a few test cases to cover those block-unrolling scenarios as well.
Test build #34391 timed out for PR 6614 at commit
Hi @tdas. I can see the build passed all test cases, but it timed out.
@dibbhatt could you resolve the conflicts again? Thank you.
Hi @zsxwing, I resolved the conflicts. This PR modified ReceivedBlockStoreResult, which now takes the number of records, and that required modifying ReceiverSupervisorImpl and ReceivedBlockTrackerSuite from your PR to take the numRecords parameter. Is there any issue you see in the merge? Even though the build failed due to the timeout issue, I can see all test cases passed. Do let me know if I need to do anything.
@dibbhatt did you modify any commit by accident, or is it a GitHub bug? There are 219 files changed and it still contains conflicts. |
Test build #34403 has finished for PR 6614 at commit
Hi @zsxwing, no, I have committed only my changes. Not sure why it says 219 files changed. If you look at the commit details, you can see only files related to this PR have changed. Just now I modified comments in a file to trigger the build once again. The 219 files changed appeared after I merged my repo from upstream/master to take your PR changes. And I merged only your changes with mine and committed those. Below are my changes since the merge, and I have not committed all these 219 files :(
I've no idea why it's screwed up. Feel free to open a new one.
Closing this PR; will open a new one. #6614 got an issue with merging from upstream/master, and a lot of unwanted files have come into this PR.
In a custom receiver, if I call store with an Iterator type (store(dataIterator: Iterator[T]): Unit), the Spark UI does not show the correct count of records in the block, which leads to wrong values for Input Rate, Scheduling Delay, and Input Size. This pull request fixes that issue.