[SPARK-12374][SPARK-12150][SQL] Adding logical/physical operators for Range #10335
Conversation
}
bufferHolder.reset()
unsafeRow.pointTo(bufferHolder.buffer, 1, bufferHolder.totalSize())
Why point to the same buffer after every iteration? We could do this during the construction of the iterator. BufferHolder might be overkill here; pointing to an array of 16 bytes should also do the trick.
@hvanhovell Thank you! You are right; let me move it to the construction of the iterator.
When writing a prototype, I used a 16-byte array. I was afraid the Spark community would prefer using the existing library functions here, so I changed it to BufferHolder.
I am fine with both, as long as we move it out of the next call.
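For readers following along, a minimal sketch of the restructure under discussion: the UnsafeRow is pointed at a fixed 16-byte array once, when the iterator is built, and next() only overwrites the long value. The iterator shape and the step variable are assumptions, not the PR's exact code:

    // Hedged sketch: point the row at its backing array once, then reuse it.
    val unsafeRow = new UnsafeRow
    // 16 bytes = 8 for the null-tracking bitset word + 8 for one long column.
    unsafeRow.pointTo(new Array[Byte](16), 1, 16)

    new Iterator[UnsafeRow] {
      private[this] var current = safePartitionStart
      override def hasNext: Boolean = current < safePartitionEnd
      override def next(): UnsafeRow = {
        unsafeRow.setLong(0, current) // overwrite in place; no per-row allocation
        current += step
        unsafeRow
      }
    }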
Test build #47852 has finished for PR 10335 at commit
*
* [[LeafNode]]s must override this.
*/
val sizeInBytes = LongType.defaultSize * numElements
protected
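For context, that val feeds the plan's statistics: leaf nodes have no children to derive a size estimate from, so each must supply its own. A hedged sketch of the wiring, assuming the 1.6-era Statistics API (the surrounding case class is illustrative, not the PR's verbatim code):

    // Illustrative leaf node; only the statistics wiring matters here.
    case class Range(start: Long, step: Long, numElements: BigInt,
        output: Seq[Attribute]) extends LeafNode {
      // 8 bytes per emitted long; marked protected as the review suggests.
      protected val sizeInBytes = LongType.defaultSize * numElements
      override def statistics: Statistics = Statistics(sizeInBytes = sizeInBytes)
    }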
The high-level structure of this looks pretty good to me. Could you also post some numbers from a micro benchmark? It would be good to make sure we're actually speeding things up.
Sure, will do it! Thank you for your guidance!
Test build #47889 has finished for PR 10335 at commit
}
val safePartitionStart = getSafeMargin(partitionStart)
val safePartitionEnd = getSafeMargin(partitionEnd)
val bufferHolder = new BufferHolder(LongType.defaultSize)
So I am not confident that this works. You are allocating 8 bytes by calling this constructor. Without growing the buffer, you hand an 8-byte array to the UnsafeRow, which is happening here. You would need at least 16 bytes for UnsafeRow to work (8 for the bitset and 8 for the long). BufferHolder is meant to be used with the Unsafe*Writer classes; I don't think it adds much value here. I think we should just use a 16-byte array instead.
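To make the size arithmetic concrete, a hedged sketch of the suggested fix (variable names assumed):

    // Row layout: [ 8-byte null bitset word ][ 8-byte long value ] = 16 bytes,
    // so the 8-byte buffer from new BufferHolder(LongType.defaultSize) is too
    // small. A plain 16-byte array is enough:
    val buffer = new Array[Byte](16)
    unsafeRow.pointTo(buffer, 1, buffer.length)
    unsafeRow.setLong(0, 42L) // writes at offset 8, past the bitset word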
uh... True! Let me correct it. Thank you!
Test build #47972 has finished for PR 10335 at commit
RDD range API with
New range API using logical/physical operators with
Original range API with
@marmbrus Compared with the original range API (elapsed time around 30 seconds), the new version is around 33% faster. Of course, the RDD range API is still much faster (its elapsed time is 0.6 seconds). Do you think the performance improvement is good enough, or is there a way to further reduce the overhead? FYI: I tried it on my local machine using spark-shell. Thanks!
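For reproducibility, a spark-shell timing harness of the kind described might look like the following; the range bound and the count() workload are assumptions, since the PR's exact benchmark code is not shown:

    // Hypothetical micro-benchmark; sc and sqlContext are spark-shell builtins.
    def time[T](body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"elapsed: ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    time { sqlContext.range(0L, 100000000L).count() } // new Range operator path
    time { sc.range(0L, 100000000L).count() }         // plain RDD path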
Let me add the fix of #10337 and try the function
RDD range API with
New range API using logical/physical operators with
Original range API with
When the workload is small, both APIs finish in less than 1 second, so I increased the workload by a factor of 100. Compared with the old range API, the new version is 3 times faster. Due to the overhead, the RDD API is of course still much faster.
@gatorsmile I have been playing around with this for a bit. Overall I think we should do this. I do have two things for you to consider. The current approach follows the formal route: it implements a LogicalOperator/PhysicalOperator and changes the planner. It also, as you stated, reuses quite a bit of the code from
What do you think? My second point is about the benchmarking. I have been toying with this PR and the benchmarking code, and I am not sure that the current results are as revealing as they should be. I think the
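For readers unfamiliar with the "formal route", it amounts to a leaf logical node plus a planner strategy that maps it onto a physical operator. A hedged sketch, with names modeled on Catalyst conventions rather than quoted from the PR:

    // Planner strategy: pattern-match the logical Range node into its physical
    // counterpart; anything else falls through so other strategies can handle it.
    object RangeStrategy extends Strategy {
      def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        case logical.Range(start, step, numSlices, numElements, output) =>
          execution.Range(start, step, numSlices, numElements, output) :: Nil
        case _ => Nil
      }
    }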
@gatorsmile small follow-up on the benchmarking. If I execute the following code (note the
I get to an average of 845 ms per run (versus 477 ms for
Hi @hvanhovell, thank you for your comments! Regarding the benchmarking, I do not have a better way to measure it. So far, I have just tried your suggested method. To compare results, I had to increase the workload scale to
Regarding benchmarking, we typically measure this kind of thing using ForeachResults in spark-sql-perf (which measures just the time to pull the rows out of the iterator plus the conversion to the external format). We should probably add an internal version as well that avoids the conversion cost. Regarding @hvanhovell's simplified implementation: I thought about proposing it like this. The only question is whether we will ever want to add optimizations on top of this (i.e., we could do a count(*) on this kind of plan really quickly). Since it's already implemented, I lean towards the more logically transparent implementation. However, it might be nice to reuse the code in RDD as he proposes in the physical operator. Super minor point: we should probably use UnsafeRow.createFromByteArray instead of
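For reference, UnsafeRow.createFromByteArray bundles the allocate-and-point steps into one call; a minimal before/after sketch, with usage assumed from the era's API:

    // Hand-rolled version:
    //   val row = new UnsafeRow
    //   row.pointTo(new Array[Byte](16), 1, 16)
    // Equivalent via the factory method:
    val unsafeRow = UnsafeRow.createFromByteArray(16, 1) // 16 bytes, 1 field
    unsafeRow.setLong(0, 42L)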
BTW, the benchmarks look reasonable to me; I'm okay with merging this as soon as we are happy with the implementation.
Thank you very much, @marmbrus! I am trying to get an account. Hopefully, next time I can directly use your performance benchmarking for other performance-related topics; otherwise, I will try to mimic your benchmarking on my local laptop. :) Also changed the code to use
Test build #48032 has finished for PR 10335 at commit
Test build #48085 has finished for PR 10335 at commit
Thanks, merging to master.
Based on the suggestions from @marmbrus, added logical/physical operators for Range to improve performance.
Also added another API to resolve SPARK-12150.
Could you take a look at my implementation, @marmbrus? If it is not good, I can rework it. :)
Thank you very much!