[SPARK-11243] [SQL] output UnsafeRow from columnar cache #9203
Test build #44083 has finished for PR 9203 at commit
as discussed offline, maybe change this to assert?
will do in separate PR
I'm really excited to see how much of a performance difference this makes when scanning string columns, since this could potentially provide a big perf boost to the sc.textFile replacement that I showed the other day.
Do we ever start from the middle of an existing buffer? If not, can we move startingOffset into holder class?
Yes, in UnsafeProjection, multiple UnsafeRowWriters can share a single BufferHolder.
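To illustrate why the starting offset would live in the writer rather than in the holder, here is a minimal sketch. `SimpleBufferHolder` and `SimpleRowWriter` are hypothetical, simplified stand-ins, not Spark's actual BufferHolder and UnsafeRowWriter; the 8-byte fixed-width slots and the little-endian layout are assumptions made for the example.

```java
// Hypothetical, simplified stand-in for a shared, growable byte buffer.
final class SimpleBufferHolder {
    byte[] buffer = new byte[64];
    int cursor = 0; // index of the next free byte

    // Make sure at least `needed` more bytes fit after the cursor.
    void grow(int needed) {
        if (cursor + needed > buffer.length) {
            int newSize = Math.max(buffer.length * 2, cursor + needed);
            buffer = java.util.Arrays.copyOf(buffer, newSize);
        }
    }
}

// Hypothetical writer: each instance remembers where its own row starts.
final class SimpleRowWriter {
    private SimpleBufferHolder holder;
    private int startingOffset;

    // Reserve one 8-byte slot per field, starting at the holder's current cursor.
    void initialize(SimpleBufferHolder holder, int numFields) {
        this.holder = holder;
        this.startingOffset = holder.cursor;
        int fixedSize = 8 * numFields;
        holder.grow(fixedSize);
        holder.cursor += fixedSize;
    }

    // Write a long into this row's slot, relative to this writer's starting offset.
    void write(int ordinal, long value) {
        int pos = startingOffset + 8 * ordinal;
        for (int i = 0; i < 8; i++) {
            holder.buffer[pos + i] = (byte) (value >>> (8 * i));
        }
    }

    int startingOffset() {
        return startingOffset;
    }

    // Read back a little-endian long, used here only to check the result.
    static long readLong(byte[] buf, int pos) {
        long v = 0;
        for (int i = 7; i >= 0; i--) {
            v = (v << 8) | (buf[pos + i] & 0xFFL);
        }
        return v;
    }
}
```

In this sketch, a second writer initialized on the same holder starts in the middle of the buffer, so the offset has to be tracked per writer while the cursor stays with the holder.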
LGTM otherwise.
Test build #44089 has finished for PR 9203 at commit
@JoshRosen If we only scan the cached string and access it (without adding a ConvertToUnsafe), this could be a little slower, because we now copy the bytes into the UnsafeRow. The UTF8String object is still created either way: before this patch we create it in ColumnAccessor; after this patch we create it in UnsafeRow.getUTF8String() (we could actually re-use that).
Test build #44092 has finished for PR 9203 at commit
This reverts commit c76c759.
Why implement these primitive write methods in the writer class rather than in codegen?
Those in UnsafeRowWriter are needed for ColumnAccessor. The purpose of these *Writer classes is to simplify the generated code (which is already hard to understand), and these methods go in the same direction.
In terms of performance, I don't think there will be a difference; they can all be JITed and inlined.
I think it is simpler to put them there rather than in generated code.
Test build #44098 has finished for PR 9203 at commit
Test build #44100 has finished for PR 9203 at commit
I'm going to merge this first. @davies please take a look at @cloud-fan's comments, and address them in follow-up PRs if necessary.
This PR changes InMemoryTableScan to output UnsafeRow, and optimizes the unrolling and scanning by copying the bytes for var-length types between UnsafeRow and ByteBuffer directly, without creating the wrapper objects. When scanning the decimals in the TPC-DS store_sales table, it's 80% faster (the value is copied as a long without creating Decimal objects).
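The direct-copy idea can be sketched roughly as follows. `DirectCopySketch` and its methods are hypothetical names for illustration only, and the length-prefixed layout of the column buffer is an assumption; this is not the code from the patch.

```java
// Hypothetical sketch: move cached column data into a row buffer without
// materializing String or decimal wrapper objects along the way.
final class DirectCopySketch {

    // Assumes a var-length value is stored as a 4-byte length followed by its bytes.
    // Copies the payload straight into the row buffer and returns its length.
    static int copyVarLength(java.nio.ByteBuffer column, byte[] row, int rowOffset) {
        int length = column.getInt();
        column.get(row, rowOffset, length); // bulk copy, no intermediate object
        return length;
    }

    // Assumes a small-precision decimal is stored as its unscaled value in 8 bytes.
    // The value is passed along as a plain long instead of building a decimal object.
    static long copyCompactDecimal(java.nio.ByteBuffer column) {
        return column.getLong();
    }
}
```

A caller would advance through the column buffer value by value; a wrapper object is only needed later if some operator actually asks for the string or decimal.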