ARROW-17884: [C++] Add Intel®-IAA/QPL-based Parquet RLE Decode #14585
Conversation
Two high-level concerns:
Thanks for your comments, they help a lot. @emkornfield
@emkornfield Hi, emkornfield. Regarding the two questions you raised last week, is the solution in the comments acceptable?
I'm not sure if we have the infrastructure to do this. For question #2, I think we should discuss on the mailing list whether we want to take on the dependency in the toolchain, along with more details about possible CI options.
@emkornfield I'll find a machine in Intel's lab that you can access as soon as possible, and once the machine is ready I'll share access with you by email.
@yaqi-zhao I'm sorry for the confusion; it isn't about me having access to a test machine. The issue is being able to continuously test this code with GitHub Actions to make sure there are no regressions. I'm not sure if GitHub Actions supports integrating custom machines.
"The new solution provides in general higher performance against current solution, and also consume less CPU." Could you share performance numbers? This description is very vague.
@maqister We ran benchmarks on the ReadRowGroups API; the results show 1-2x performance improvements depending on the RLE-encoded bit width of the Parquet file.
@kou May I draw your attention to this PR? We use IAA to do the RLE decode work, and benchmark results on the ReadRowGroups API show up to 2x performance improvements. Do you think this work is valuable?
I don't object to this feature, but I'm worried about how we can maintain it...
@kou Hi, kou.
Thanks a lot for sharing the results!
As for the maintenance of this feature, I added a runtime check for the IAA device. If the machine's CPU supports IAA, continuous integration will exercise this code path. Once Intel CPUs with IAA launch, CI will be able to check this feature.
Does this mean that we can build this code without the new Intel CPU?
Does QPL require the new Intel CPU? Or can we use QPL with older Intel CPUs?
BTW, how about fixing lint errors before we start careful review?
https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
@kou Yes, QPL has no CPU requirement; it can be built without the new Intel CPU. QPL can check whether IAA is enabled on the CPU the program is running on.
Hi, @kou. I have fixed the lint errors in this patch; could you please review it again?
It seems there are still CMake lint errors: https://github.com/apache/arrow/actions/runs/3591159963/jobs/6045602036#step:5:815
Could you confirm them? https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci
FYI: You can check the lint result in your fork too: https://github.com/yaqi-zhao/arrow/actions/runs/3591159610/jobs/6045360960#step:5:4197
cpp/src/arrow/util/qpl_job_pool.h (outdated)

  QplJobHWPool();
  ~QplJobHWPool();

  static QplJobHWPool& instance();
Function names here look more like the Java style. I'd suggest following the convention in arrow/memory_pool.h and adding more comments about the public APIs. BTW, is it possible to refactor the static functions into member functions? You could simply use a singleton to call member functions. cc @pitrou
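A minimal sketch of what such a refactor could look like, assuming a Meyers-style singleton with Arrow-style CamelCase method names (the class and method names here are illustrative, not the PR's actual API):

```cpp
#include <cstdint>

// Hypothetical sketch: a singleton job pool whose operations are member
// functions reached through a static accessor, instead of free/static
// helpers. Names are illustrative only.
class QplJobPool {
 public:
  // Arrow-style CamelCase accessor; the static local is initialized once,
  // thread-safely, on first use (C++11 magic statics).
  static QplJobPool& GetInstance() {
    static QplJobPool instance;
    return instance;
  }

  // Former static helpers become member functions on the singleton.
  uint32_t AcquireJob() { return next_id_++; }
  void ReleaseJob(uint32_t /*job_id*/) {}

  QplJobPool(const QplJobPool&) = delete;
  QplJobPool& operator=(const QplJobPool&) = delete;

 private:
  QplJobPool() = default;
  uint32_t next_id_ = 0;
};
```

Callers would then write `QplJobPool::GetInstance().AcquireJob()` rather than calling static functions directly.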
@wgtmac I have addressed your comments, including function name styles, comments, and static functions. Could you please take another look? Thanks.
  uint32_t job_id = 0;
  qpl_job* job = ::arrow::util::internal::QplJobHWPool::GetInstance().AcquireJob(job_id);
  if (job == NULL) {
    return -1;
Should we fall back to GetBatchWithDict()?
I was thinking that if the QPL job is not ready, it would not affect the RLE decoding.
Sorry, I couldn't understand. If we return -1 here, does decoding the Parquet data succeed? (I thought decoding the Parquet data would fail.)
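The fallback being discussed is a try-hardware-then-software pattern. A minimal sketch under that assumption (the decoder callables here are stand-ins for the accelerated path and the existing GetBatchWithDict() path, not the actual Arrow API):

```cpp
#include <functional>

// Hypothetical sketch: attempt the accelerated decode first; if the job
// pool has no job available (signalled here by a negative return), fall
// back to the existing software decoder instead of surfacing an error.
int DecodeWithFallback(const std::function<int()>& hw_decode,
                       const std::function<int()>& sw_decode) {
  int decoded = hw_decode();  // e.g. the QPL/IAA path; -1 means "unavailable"
  if (decoded < 0) {
    return sw_decode();       // e.g. GetBatchWithDict(): always available
  }
  return decoded;
}
```

With this shape, an exhausted job pool degrades to the software path rather than failing the read.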
@kou Thanks for your comments; the code has been updated. Please take a look, thanks!
  set(QPL_PATCH_COMMAND)
  find_package(Patch)
  if(Patch_FOUND)
    # This patch is for Qpl <= v0.2.0
Did you upstream this patch?
Please add pull request URL as comment.
This has not been upstreamed yet. Do we have to use patches that have already been merged?
We can use a patch that is not yet merged, but to reduce maintenance cost we don't want to carry patches ourselves.
So we must upstream our patches.
  set(QPL_LIBRARIES ${QPL_STATIC_LIB})
  set(QPL_INCLUDE_DIRS "${QPL_PREFIX}/include")
  set_target_properties(Qpl::qpl
                        PROPERTIES IMPORTED_LOCATION ${QPL_LIBRARIES}
                                   INTERFACE_INCLUDE_DIRECTORIES ${QPL_INCLUDE_DIRS})
We can remove needless variables here. Suggested change:

  -set(QPL_LIBRARIES ${QPL_STATIC_LIB})
  -set(QPL_INCLUDE_DIRS "${QPL_PREFIX}/include")
  -set_target_properties(Qpl::qpl
  -                      PROPERTIES IMPORTED_LOCATION ${QPL_LIBRARIES}
  -                                 INTERFACE_INCLUDE_DIRECTORIES ${QPL_INCLUDE_DIRS})
  +set_target_properties(Qpl::qpl
  +                      PROPERTIES IMPORTED_LOCATION ${QPL_STATIC_LIB}
  +                                 INTERFACE_INCLUDE_DIRECTORIES ${QPL_PREFIX}/include)
    return nullptr;
  }
  uint32_t retry = 0;
  auto index = distribution(random_engine);
Why do we use distribution() to find a free job? Is it efficient?
Our initial idea was to avoid always searching for an available job from the first index, to save time. We chose the Mersenne twister as the random engine and std::uniform_int_distribution as the distribution to guarantee randomness. I'm not sure if there is a more efficient way; do you have any suggestions?
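One common alternative to a random starting slot is round-robin via an atomic counter, which spreads contention without paying for a random-number draw each call. A sketch (the pool size and function name are illustrative, not the PR's code):

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch: choose the starting slot for a free-job scan with an
// atomic round-robin counter instead of a Mersenne-twister draw. Successive
// calls start at successive slots, so callers naturally spread across the
// pool, and a relaxed fetch_add is far cheaper than generating randomness.
constexpr std::size_t kPoolSize = 16;  // assumed pool capacity

std::size_t NextStartIndex() {
  static std::atomic<std::size_t> counter{0};
  return counter.fetch_add(1, std::memory_order_relaxed) % kPoolSize;
}
```

The scan from the chosen index would then proceed exactly as before; only the starting point changes.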
  job->available_out = static_cast<uint32_t>(destination.size());

  if (!bit_reader_.GetBatchWithQpl(batch_size, job)) {
    return -1;
Should we call qpl_fini_job() and ReleaseJob() here too? Can we use RAII to ensure the job is released, like std::lock_guard?
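A sketch of what such an RAII guard could look like. The guard class is hypothetical; only the cleanup it would run (qpl_fini_job() plus the pool's ReleaseJob()) comes from the PR's code:

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard: runs a cleanup action when it leaves scope, so
// every early `return -1` in the decode path releases the QPL job
// automatically, mirroring how std::lock_guard releases a mutex.
class JobGuard {
 public:
  explicit JobGuard(std::function<void()> release)
      : release_(std::move(release)) {}
  ~JobGuard() { release_(); }

  // Non-copyable, like other RAII scope guards.
  JobGuard(const JobGuard&) = delete;
  JobGuard& operator=(const JobGuard&) = delete;

 private:
  std::function<void()> release_;
};
```

At the acquisition site this might read `JobGuard guard([&] { qpl_fini_job(job); pool.ReleaseJob(job_id); });`, after which every return path, success or error, releases the job.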
  job->param_high = batch_size + value_offset_;
  job->num_input_elements = batch_size + value_offset_;

  job->next_in_ptr = const_cast<uint8_t*>(buffer_ - 1);
Why do we need - 1 here? Does it touch invalid memory?
It will not touch invalid memory: in DictDecoderImpl::SetData, the bit reader's buffer is set to the encoded data with the leading <bit_width> byte skipped. IAA needs the full Parquet RLE buffer, so I subtract 1 here to step back to it.
Oh, does it mean that we can use this feature only for Parquet's RLE format?
Could you point me to IAA's documentation on this?
Or is "bit-width" included in general RLE format?
Oh, does it mean that we can use this feature only for Parquet's RLE format? Could you tell me IAA's document for this?
Hi, Kou. You can download the IAA specification from https://www.intel.com/content/www/us/en/content-details/721858/intel-in-memory-analytics-accelerator-architecture-specification.html?wapkw=In-memory%20accelerator
And this PR is only for Parquet's RLE format.
Or is "bit-width" included in general RLE format?
Yes, it's a general format; you can read about it at https://parquet.apache.org/docs/file-format/data-pages/encodings/#a-namerlearun-length-encoding--bit-packing-hybrid-rle--3
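For context, the hybrid encoding referenced above prefixes each run with a ULEB128 varint header whose low bit selects the run type (0 = RLE run, 1 = bit-packed run). A minimal sketch of parsing that header, illustrative only and not the PR's decoder:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the Parquet RLE / bit-packing hybrid run header.
// The header is a ULEB128 varint; its low bit selects the run type:
//   bit 0 == 0 -> RLE run,        repeat count = header >> 1
//   bit 0 == 1 -> bit-packed run, value count  = (header >> 1) * 8
struct RunHeader {
  bool bit_packed;
  uint64_t count;  // repeated values (RLE) or packed values (bit-packed)
};

RunHeader ParseRunHeader(const uint8_t* data, std::size_t* bytes_read) {
  uint64_t header = 0;
  int shift = 0;
  std::size_t i = 0;
  uint8_t byte;
  do {  // ULEB128: 7 payload bits per byte, high bit is the continuation flag
    byte = data[i++];
    header |= static_cast<uint64_t>(byte & 0x7F) << shift;
    shift += 7;
  } while (byte & 0x80);
  *bytes_read = i;
  const bool packed = (header & 1) != 0;
  const uint64_t count = packed ? (header >> 1) * 8 : (header >> 1);
  return RunHeader{packed, count};
}
```

An RLE run is then followed by a single value stored in ceil(bit_width / 8) bytes, while a bit-packed run is followed by the packed values themselves; this is the structure IAA expects to see from the start of the buffer.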
  job->param_low = value_offset_;
  job->param_high = batch_size + value_offset_;
  job->num_input_elements = batch_size + value_offset_;
Why do we need to introduce the new value_offset_? Can we use the existing bit_offset_ and byte_offset_ instead?
  job->next_in_ptr = const_cast<uint8_t*>(buffer_ - 1);
  job->available_in = max_bytes_ + 1;

  qpl_status status = qpl_execute_job(job);
Suggested change:

  -qpl_status status = qpl_execute_job(job);
  +auto status = qpl_execute_job(job);
I haven't looked at this in detail but my general sentiment is negative. This PR does seem to provide very significant speedups. However, it comes with several downsides:
Generally, our Arrow and Parquet C++ maintenance bandwidth is very limited, with few active maintainers. This PR adds maintenance overhead with no value to most users. If Intel were a significant contributor to Arrow and Parquet maintenance, I would view this PR (and similar one-shot PRs that add acceleration to select parts of the codebase) more favorably.
About the missing CI: how about using a self-hosted runner donated by Intel?
Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍
Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator available in the upcoming generation of Intel® Xeon® Scalable processors ("Sapphire Rapids"). Its goal is to speed up common analytics operations such as data (de)compression and filtering, and it supports decoding of the Parquet RLE format. We add a new codec that utilizes Intel® IAA offloading to provide a high-performance RLE decode implementation. The codec uses the Intel® Query Processing Library (QPL), which abstracts access to the hardware accelerator. The new path is in general faster than the current software decoder and also consumes less CPU.