[VL] Optimize ensureFlattened #4415
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on GitHub Issues? https://github.com/oap-project/gluten/issues Then could you also rename the commit message and pull request title in the following format?
auto startTime = std::chrono::steady_clock::now();
// Make sure to load lazy vector if not loaded already.
ScopedTimer timer(&exportNanos_);
Curious: could we use Velox's MicrosecondTimer to avoid reimplementing the same functionality?
We add the Timer and ScopedTimer in gluten/cpp/core, not gluten/cpp/velox, so the classes are aimed at common usage. And the Timer counts nanoseconds for accuracy, not microseconds.
/Benchmark Velox
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
@@ -83,6 +77,7 @@ int64_t VeloxColumnarBatch::numBytes() {
}

velox::RowVectorPtr VeloxColumnarBatch::getRowVector() const {
  VELOX_CHECK_NOT_NULL(rowVector_);
VELOX_DCHECK_NOT_NULL
@@ -91,12 +91,24 @@ const int32_t* getFirstColumn(const facebook::velox::RowVector& rv) {
  VELOX_CHECK(rv.childrenSize() > 0, "RowVector missing partition id column.");

  auto& firstChild = rv.childAt(0);
  VELOX_CHECK(firstChild->type()->isInteger(), "RecordBatch field 0 should be integer");
  VELOX_CHECK(firstChild->isFlatEncoding(), "Partition id (field 0) is not flat encoding.");
VELOX_DCHECK: this invariant is guaranteed by the code, so a DCHECK is enough.
// first column is partition key hash value or pid
return firstChild->asFlatVector<int32_t>()->rawValues();
}

facebook::velox::VectorPtr flatChildAt(const facebook::velox::RowVector& rv, facebook::velox::column_index_t idx) {
  auto column = rv.childAt(idx);
  if (column->isLazy()) {
What if the column is a complex data type? Could this situation happen: the struct column is not lazy, but its child column is lazy?
Complex types will be handled properly by the PrestoSerializer.
assert(stringColumn);
RETURN_NOT_OK(splitBinaryType(binaryIdx, *stringColumn, dstAddrs));
auto column = flatChildAt(rv, colIdx)->asFlatVector<facebook::velox::StringView>();
VELOX_CHECK_NOT_NULL(column);
DCHECK
auto column = rv.childAt(colIdx);
assert(column->isFlatEncoding());

auto column = flatChildAt(rv, colIdx);
Why does flatChildAt become required in this patch?
Because facebook::velox::BaseVector::flattenVector loads lazy vectors but preserves the lazy encoding.
I see. Then is it possible to handle the case inside VeloxColumnarBatch::ensureFlattened?
One Spark AQE UT failed, because w/o copy,
/Benchmark Velox
One question here: we have observed that some queries show an uncompressed size much smaller than the data size, such as 58G (uncompressed size) vs 3.1T (data size).
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
@Yohahaha The "dataSize" metric in Vanilla Spark refers to the uncompressed size of the InternalRow. As it's used in the AQE "LocalShuffleReader" optimization, my understanding is that they consider this value as the raw uncompressed output data size. On the other hand, the size of the RowVector can be considered as the input size for the shuffle. The uncompressed Arrow buffers align more closely with the concept of the raw uncompressed size in this context.
This seems to exceed normal expectations. Could you provide more detailed workload data? For example: data size, data schema, shuffle partitions, etc.
So you think estimateFlatSize is not the real uncompressed size? From my understanding, the shuffle input size is the same as the uncompressed size, which is what Vanilla Spark uses. What I want confirmed is whether the Arrow buffers really reflect the real uncompressed size, and whether there is any reuse logic for Arrow buffers.
The above result is from TPCDS 10T with 8000 partitions. Any query before #4428 can reproduce it.
Yes, and it's true. This PR has one failed UT without #4428, where the RowVector length is 6 but the underlying buffer lengths are 4096.
The reuse logic has nothing to do with the buffer size computation. We accumulate the total raw Arrow buffer size each time before the buffers are destroyed or reused.
SF10T or SF10? The shuffled data size looks very small. I would think the case you provided is reasonable, because the Velox buffer size can be larger than the Arrow buffers. One known case: in Velox, any string of 12 characters or fewer still uses 12 bytes, while Arrow stores the real length. But 58G (uncompressed size) vs 3.1T (data size) is very abnormal and needs further investigation. Could you reproduce this case and compare the data size with Vanilla Spark? What's the compressed data size (shuffle bytes written)?
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
@Yohahaha Yes. The "dataSize" is the original "uncompressed size". If the data after compression is larger than its uncompressed size, then we directly write the original uncompressed data, but with a few metadata fields added. Therefore the "shuffle bytes written" can be greater than the "dataSize" in some cases.
Got it, thank you! But it still makes no sense to me; I will keep tracking this issue.
Use upstream facebook::velox::BaseVector::flattenVector to avoid unnecessary copy.