
ARROW-17884: [C++] Add Intel®-IAA/QPL-based Parquet RLE Decode #14585

Closed · wants to merge 1 commit

Conversation

@yaqi-zhao commented Nov 4, 2022

Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator available in the upcoming generation of Intel® Xeon® Scalable processors ("Sapphire Rapids"). Its goal is to speed up common operations in analytics like data (de)compression and filtering, and it supports decoding of the Parquet RLE format. We add a new codec that utilizes Intel® IAA offloading to provide a high-performance RLE decode implementation. The codec uses the Intel® Query Processing Library (QPL), which abstracts access to the hardware accelerator. The new solution generally provides higher performance than the current one and also consumes fewer CPU cycles.
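For readers unfamiliar with the format being offloaded, a minimal software sketch of Parquet's RLE/bit-packing hybrid decode (the operation IAA accelerates here) might look like the following. This is illustrative only, not the PR's code: it handles bit widths up to 8 and omits all error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode a Parquet RLE/bit-packing hybrid buffer (bit widths <= 8).
// Each run starts with a ULEB128 header; the low bit selects the run type.
std::vector<uint32_t> DecodeRleHybrid(const uint8_t* data, size_t len,
                                      int bit_width) {
  std::vector<uint32_t> out;
  size_t pos = 0;
  while (pos < len) {
    // Read the ULEB128 run header.
    uint32_t header = 0;
    int shift = 0;
    while (true) {
      uint8_t byte = data[pos++];
      header |= static_cast<uint32_t>(byte & 0x7F) << shift;
      if (!(byte & 0x80)) break;
      shift += 7;
    }
    if (header & 1) {
      // Bit-packed run: (header >> 1) groups of 8 values, packed LSB-first.
      uint32_t num_groups = header >> 1;
      uint64_t buffer = 0;
      int bits_in_buffer = 0;
      for (uint32_t i = 0; i < num_groups * 8; ++i) {
        if (bits_in_buffer < bit_width) {
          buffer |= static_cast<uint64_t>(data[pos++]) << bits_in_buffer;
          bits_in_buffer += 8;
        }
        out.push_back(static_cast<uint32_t>(buffer & ((1u << bit_width) - 1)));
        buffer >>= bit_width;
        bits_in_buffer -= bit_width;
      }
    } else {
      // RLE run: one value repeated (header >> 1) times, stored in
      // ceil(bit_width / 8) little-endian bytes.
      uint32_t count = header >> 1;
      int value_bytes = (bit_width + 7) / 8;
      uint32_t value = 0;
      for (int i = 0; i < value_bytes; ++i) {
        value |= static_cast<uint32_t>(data[pos++]) << (8 * i);
      }
      out.insert(out.end(), count, value);
    }
  }
  return out;
}
```

The bit-packed test vector below (`0x88 0xC6 0xFA` for the values 0 through 7 at bit width 3) is the example from the Parquet encoding specification.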

@github-actions bot commented Nov 4, 2022

@emkornfield (Contributor)

Two high level concerns:

  1. Apologies if I missed it, but this doesn't seem to integrate with CI, so it will likely not be super maintainable.
  2. Adding a new library dependency might not be desirable. In particular it looks like this library is still in beta?

@yaqi-zhao (Author)

Thanks for your comments, they help a lot. @emkornfield
For Q1: I'll try to add testing to CI, but this feature requires IAA (Intel® In-Memory Analytics Accelerator) hardware, which is a new built-in accelerator in the next-gen Intel® Xeon® Scalable processors. Those processors will launch in the near future. Maybe we can run the CI on a machine in Intel's lab environment that supports this processor.
For Q2: QPL is a necessary library to enable IAA, and a release version of the library will be provided with the launch of the next-gen Intel® Xeon® Scalable processors.
If a new library dependency is not desirable, do you think one of the following solutions would be better?

  1. Install the library directly on the machine instead of adding an explicit dependency in Arrow.
  2. Add a third-party toolchain, just like the implementations for Snappy, ZSTD, etc.
  3. Reimplement the APIs used from the QPL library instead of adding a new library dependency, though this would mean a lot of new code to merge.

@yaqi-zhao (Author)

@emkornfield Hi, for the two questions you raised last week, is the solution in the comment below acceptable?

#14585 (comment)

@emkornfield (Contributor)

Maybe we run the CI together in Intel's lab machine environment that supports this processor.

I'm not sure if we have the infrastructure to do this.

I think for question #2, we should discuss on the mailing list whether we want to take on the dependency in the toolchain, along with more details about possible CI options.

@yaqi-zhao (Author)

@emkornfield I'll set up a machine in Intel's lab that you can access as soon as possible, and once it is ready I'll share the details with you by email.

@emkornfield (Contributor)

@yaqi-zhao I'm sorry for the confusion; it isn't about me having access to a test machine. The issue is being able to continuously test this code with GitHub Actions to make sure there are no regressions. I'm not sure if GitHub Actions supports having custom machines integrated with it.

@maqister

"The new solution provides in general higher performance against current solution, and also consume less CPU."

Could you share performance numbers? This description is very vague.

@yaqi-zhao (Author) commented Nov 21, 2022

@maqister We ran benchmark tests on the ReadRowGroups API; the results show 1-2x performance improvements depending on the RLE-encoded bit width of the Parquet file.
[benchmark screenshots: Arrow with IAA vs. Arrow master]

@yaqi-zhao force-pushed the iaa_execute_job branch 2 times, most recently from b894135 to e257c67 on November 22, 2022 05:56
@yaqi-zhao (Author)

@kou May I draw your attention to this PR?

We use IAA to do the RLE decode work, and benchmark results on the ReadRowGroups API show up to 2x performance improvements.

Do you think this work is valuable?

@kou (Member) left a comment

I don't object to this feature, but I'm worried about how we can maintain it...

Review comments (resolved): .gitmodules, cpp/CMakeLists.txt, cpp/cmake_modules/DefineOptions.cmake, cpp/src/arrow/util/qpl_job_pool.cc
@yaqi-zhao force-pushed the iaa_execute_job branch 9 times, most recently from 69f9e14 to ed0916e on November 28, 2022 05:30
@yaqi-zhao (Author)

Hi @kou,

  1. I have updated the code you commented on last week.
  2. As for the maintenance of this feature, I added a runtime check for the IAA device. If the machine's CPU supports IAA, continuous integration will exercise this code path; once the new Intel processors launch, CI will be able to cover this feature.
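The runtime check described above could follow a dispatch pattern like this sketch. `ProbeIaaDevice()`, `DecodeSoftware()`, and `DecodeWithIaa()` are hypothetical stand-ins, not the PR's actual functions; in the real code the probe would initialize a QPL job on the hardware path and inspect the returned status.

```cpp
#include <cassert>
#include <functional>

// Hypothetical probe for an IAA device. A real implementation would try to
// initialize QPL on the hardware path and check the status it returns.
bool ProbeIaaDevice() {
  return false;  // this sketch assumes no accelerator is present
}

// Placeholder decode paths; both just pass the batch size through here.
int DecodeSoftware(int batch_size) { return batch_size; }  // existing CPU path
int DecodeWithIaa(int batch_size) { return batch_size; }   // offloaded path

// Probe once, then route all decode calls to the chosen implementation.
std::function<int(int)> SelectRleDecoder() {
  return ProbeIaaDevice() ? DecodeWithIaa : DecodeSoftware;
}
```

With this shape, CI machines without IAA still compile and exercise the dispatch logic, and only the hardware branch needs accelerator-equipped runners.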

@maqister

@maqister We run benchmark test on ReadRowGroups APIs, test results show that there are 1~2X perf improvements according to different rle-encoded bit width of parquet file.

thanks a lot for sharing the results!

@kou (Member) left a comment

As for the maintenance of this feature, I add a runtime check the IAA device. If CPU processor of the machine support IAA, the continuous-integration will check the code. After the launching of the Intel CPU processor, the CI is able to check this feature.

Does this mean that we can build this code without the new Intel CPU?

Does QPL require the new Intel CPU? Or can we use QPL with older Intel CPUs?

BTW, how about fixing lint errors before we start careful review?
https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci

Review comments (resolved): cpp/CMakeLists.txt, cpp/cmake_modules/DefineOptions.cmake, cpp/cmake_modules/FindQPL.cmake, cpp/src/arrow/util/qpl_job_pool.cc, cpp/src/arrow/util/rle_encoding.h, cpp/src/parquet/CMakeLists.txt, cpp/src/parquet/encoding.h
@yaqi-zhao (Author)

Does this mean that we can build this code without the new Intel CPU?

@kou Yes, QPL has no special CPU requirement; it can be built without the new Intel CPU.

At runtime, QPL can check whether IAA is enabled on the CPU the program is running on.

@yaqi-zhao (Author)

Hi @kou, I have fixed the lint errors in this patch; could you please review it again?

@kou (Member) commented Dec 1, 2022

It seems there are still CMake lint errors: https://github.com/apache/arrow/actions/runs/3591159963/jobs/6045602036#step:5:815

INFO:archery:Running cmake-format linters
ERROR __main__.py:618: Check failed: /arrow/cpp/cmake_modules/DefineOptions.cmake
ERROR __main__.py:618: Check failed: /arrow/cpp/CMakeLists.txt
ERROR __main__.py:618: Check failed: /arrow/cpp/cmake_modules/FindQPL.cmake
ERROR __main__.py:618: Check failed: /arrow/cpp/cmake_modules/ThirdpartyToolchain.cmake

Could you confirm them?

https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci

CMake files pass style checks, can be fixed by running archery lint --cmake-format --fix. This requires Python 3 and cmake_format (note: this currently does not work on Windows).

FYI: You can check lint result in your fork too: https://github.com/yaqi-zhao/arrow/actions/runs/3591159610/jobs/6045360960#step:5:4197

Review comments: cpp/cmake_modules/FindQPL.cmake, cpp/cmake_modules/ThirdpartyToolchain.cmake, cpp/src/arrow/util/bit_stream_utils.h, cpp/src/parquet/encoding.cc, cpp/src/arrow/util/qpl_job_pool.cc
@yaqi-zhao force-pushed the iaa_execute_job branch 4 times, most recently from a87eb4e to 506b043 on December 5, 2022 03:18
@yaqi-zhao (Author) commented Dec 5, 2022

Hi @kou and @wgtmac, I have updated the code according to the comments, and the lint check now passes on my fork. Could you please continue the review? Thanks.

@yaqi-zhao requested a review from wgtmac on December 7, 2022 03:00
QplJobHWPool();
~QplJobHWPool();

static QplJobHWPool& instance();
@wgtmac (Member):

Function names here look more like Java style. I'd suggest following the convention of arrow/memory_pool.h and adding more comments about the public APIs. BTW, is it possible to refactor the static functions to be member functions? You could simply use a singleton to call member functions. cc @pitrou

@yaqi-zhao (Author):
@wgtmac I have addressed your comments, including function name style, comments, and static functions. Can you please take a look again? Thanks.
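A minimal sketch of the refactor being discussed — a Meyers singleton whose public API is member functions — under the simplifying assumption that the pool tracks only busy flags; the real pool holds qpl_job buffers and is not this code.

```cpp
#include <array>
#include <atomic>
#include <cassert>

class QplJobHWPool {
 public:
  // Meyers singleton: initialization of the static local is thread-safe
  // since C++11, so no explicit locking is needed to create the instance.
  static QplJobHWPool& GetInstance() {
    static QplJobHWPool instance;
    return instance;
  }

  // Returns the index of an acquired job slot, or -1 if all slots are busy.
  int AcquireJob() {
    for (int i = 0; i < static_cast<int>(busy_.size()); ++i) {
      bool expected = false;
      // Atomically claim the slot so concurrent callers never share one.
      if (busy_[i].compare_exchange_strong(expected, true)) return i;
    }
    return -1;
  }

  void ReleaseJob(int index) { busy_[index].store(false); }

 private:
  QplJobHWPool() = default;
  std::array<std::atomic<bool>, 8> busy_{};  // one flag per pooled job
};
```

With this shape the callers use `QplJobHWPool::GetInstance().AcquireJob()` rather than free-standing static functions, which is what the review comment suggests.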

@yaqi-zhao force-pushed the iaa_execute_job branch 2 times, most recently from 7615f46 to 1bbefe1 on December 8, 2022 02:44
@yaqi-zhao requested a review from kou on December 8, 2022 06:39
Review comments (resolved): cpp/CMakeLists.txt, cpp/cmake_modules/FindQPL.cmake, cpp/cmake_modules/ThirdpartyToolchain.cmake
uint32_t job_id = 0;
qpl_job* job = ::arrow::util::internal::QplJobHWPool::GetInstance().AcquireJob(job_id);
if (job == NULL) {
  return -1;
Member:

Should we fall back to GetBatchWithDict()?

@yaqi-zhao (Author) commented Dec 12, 2022

I was thinking that if the QPL job was not ready, it would not affect the RLE decoding.

Member:

Sorry, I couldn't understand. If we return -1 here, does decoding the Parquet data succeed? (I thought decoding the Parquet data would fail.)

Review comments (resolved): cpp/src/arrow/util/rle_encoding.h, cpp/src/parquet/CMakeLists.txt, cpp/thirdparty/versions.txt, cpp/src/arrow/util/bit_stream_utils.h
@yaqi-zhao force-pushed the iaa_execute_job branch 5 times, most recently from a9bc3bc to 1dd6e51 on December 12, 2022 10:07
@yaqi-zhao (Author)

@kou Thanks for your comments; the code has been updated. Please take a look, thanks!

set(QPL_PATCH_COMMAND)
find_package(Patch)
if(Patch_FOUND)
# This patch is for Qpl <= v0.2.0
Member:

Did you upstream this patch?
Please add the pull request URL as a comment.

Author:

This has not been upstreamed. Do we have to use patches that have already been merged?

Member:

We can use a patch that is not yet merged, but we don't want to maintain patches, to reduce maintenance cost.
So we must upstream our patches.

Comment on lines +2245 to +2249
set(QPL_LIBRARIES ${QPL_STATIC_LIB})
set(QPL_INCLUDE_DIRS "${QPL_PREFIX}/include")
set_target_properties(Qpl::qpl
PROPERTIES IMPORTED_LOCATION ${QPL_LIBRARIES}
INTERFACE_INCLUDE_DIRECTORIES ${QPL_INCLUDE_DIRS})
Member:

We can remove needless variables here:

Suggested change
set(QPL_LIBRARIES ${QPL_STATIC_LIB})
set(QPL_INCLUDE_DIRS "${QPL_PREFIX}/include")
set_target_properties(Qpl::qpl
PROPERTIES IMPORTED_LOCATION ${QPL_LIBRARIES}
INTERFACE_INCLUDE_DIRECTORIES ${QPL_INCLUDE_DIRS})
set_target_properties(Qpl::qpl
PROPERTIES IMPORTED_LOCATION ${QPL_STATIC_LIB}
INTERFACE_INCLUDE_DIRECTORIES ${QPL_PREFIX}/include)

cpp/src/arrow/util/qpl_job_pool.cc Show resolved Hide resolved
cpp/src/arrow/util/qpl_job_pool.cc Show resolved Hide resolved
return nullptr;
}
uint32_t retry = 0;
auto index = distribution(random_engine);
Member:

Why do we use distribution() to find a free job?
Is it efficient?

Author:

Our initial idea was to avoid always searching for an available job from the first index, to save time.
We chose the Mersenne Twister as the random engine and std::uniform_int_distribution as the distribution to guarantee randomness.
I'm not sure if there is a more efficient way; do you have any suggestions?
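The strategy described above can be sketched as a random starting index followed by a linear probe with wraparound, so that concurrent callers don't all contend on slot 0. This is an illustrative stand-in, not the PR's code: the `busy` vector models the pool's job state, and `FindFreeSlot` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Pick a random start index, then probe linearly (with wraparound) for a
// free slot. Returns the slot index, or -1 if every slot is busy.
int FindFreeSlot(const std::vector<bool>& busy, std::mt19937& engine) {
  std::uniform_int_distribution<size_t> distribution(0, busy.size() - 1);
  size_t start = distribution(engine);
  for (size_t i = 0; i < busy.size(); ++i) {
    size_t index = (start + i) % busy.size();
    if (!busy[index]) return static_cast<int>(index);
  }
  return -1;  // all slots busy
}
```

A randomized start spreads contention across the pool, at the cost of a nondeterministic probe order; an atomic round-robin counter would be a common alternative with similar effect.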

job->available_out = static_cast<uint32_t>(destination.size());

if (!bit_reader_.GetBatchWithQpl(batch_size, job)) {
return -1;
Member:

Should we call qpl_fini_job() and ReleaseJob() here too?

Can we use RAII to ensure the job is released, like std::lock_guard?
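A sketch of the RAII idea: a guard whose destructor releases the job on every path, including early returns. The `released_count` pointer stands in for the real release work (`ReleaseJob()`/`qpl_fini_job()`), which is not called here; `JobGuard` and `DecodeWithEarlyReturn` are hypothetical names.

```cpp
#include <cassert>

// Guard that "releases" a job slot when it goes out of scope, so no return
// path can leak the job. Here releasing just increments a counter.
class JobGuard {
 public:
  JobGuard(int job_id, int* released_count)
      : job_id_(job_id), released_count_(released_count) {}
  ~JobGuard() { ++*released_count_; }  // runs on every exit path
  int id() const { return job_id_; }

  // Non-copyable: exactly one owner releases the job.
  JobGuard(const JobGuard&) = delete;
  JobGuard& operator=(const JobGuard&) = delete;

 private:
  int job_id_;
  int* released_count_;
};

// Simulated decode with an early-return failure path; the guard releases
// the job whether or not we bail out early.
int DecodeWithEarlyReturn(bool fail, int* released_count) {
  JobGuard guard(/*job_id=*/7, released_count);
  if (fail) return -1;  // guard destructor still releases the job here
  return guard.id();
}
```

This is the same shape as std::lock_guard: acquisition in the constructor, unconditional release in the destructor.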

uint32_t job_id = 0;
qpl_job* job = ::arrow::util::internal::QplJobHWPool::GetInstance().AcquireJob(job_id);
if (job == NULL) {
  return -1;
Member:

Sorry, I couldn't understand. If we return -1 here, does decoding the Parquet data succeed? (I thought decoding the Parquet data would fail.)

job->param_high = batch_size + value_offset_;
job->num_input_elements = batch_size + value_offset_;

job->next_in_ptr = const_cast<uint8_t*>(buffer_ - 1);
Member:

Why do we need - 1 here?
Does it touch invalid memory?

@yaqi-zhao (Author) commented Dec 14, 2022

It will not touch invalid memory: in DictDecoderImpl::SetData, the bit reader buffer is set to the encoded data after skipping the leading <bit_width> byte.

[screenshot of DictDecoderImpl::SetData]

IAA needs the full Parquet RLE buffer, so I subtract 1 here.

Member:

Oh, does it mean that we can use this feature only for Parquet's RLE format?
Could you point me to IAA's documentation for this?

Member:

Or is the "bit-width" part of the general RLE format?

@yaqi-zhao (Author)

Oh, does it mean that we can use this feature only for Parquet's RLE format? Could you tell me IAA's document for this?

Hi, Kou. You can download IAA specification from https://www.intel.com/content/www/us/en/content-details/721858/intel-in-memory-analytics-accelerator-architecture-specification.html?wapkw=In-memory%20accelerator
And this PR is only for Parquet's RLE format.

Or is "bit-width" included in general RLE format?

Yes, it's part of the general format; you can read about it at https://parquet.apache.org/docs/file-format/data-pages/encodings/#a-namerlearun-length-encoding--bit-packing-hybrid-rle--3
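To make the layout concrete: in a Parquet dictionary-encoded data page, the first byte of the encoded data is the bit width, followed by the RLE/bit-packed payload. Arrow's bit reader is pointed past that byte, which is why the IAA path rewinds by one to hand QPL the full buffer. A tiny sketch of that split (`SplitDictPage` and `RleView` are hypothetical helpers, not Arrow code):

```cpp
#include <cassert>
#include <cstdint>

// View over a dictionary page's encoded data: the leading bit-width byte
// plus a pointer to the RLE/bit-packed payload that follows it.
struct RleView {
  int bit_width;
  const uint8_t* payload;
};

// Split a dictionary page buffer the way DictDecoderImpl::SetData consumes
// it: byte 0 is the bit width, the payload starts at byte 1.
RleView SplitDictPage(const uint8_t* data) {
  return RleView{data[0], data + 1};
}
```

Since the bit reader holds only `payload`, reconstructing the buffer QPL expects means stepping back one byte from it, which is the `buffer_ - 1` in the PR.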

Comment on lines +420 to +422
job->param_low = value_offset_;
job->param_high = batch_size + value_offset_;
job->num_input_elements = batch_size + value_offset_;
Member:

Why do we need to introduce the new value_offset_? Can we use the existing bit_offset_ and byte_offset_ instead?

job->next_in_ptr = const_cast<uint8_t*>(buffer_ - 1);
job->available_in = max_bytes_ + 1;

qpl_status status = qpl_execute_job(job);
Member:

Suggested change
qpl_status status = qpl_execute_job(job);
auto status = qpl_execute_job(job);

@pitrou (Member) commented Dec 14, 2022

I haven't looked at this in detail but my general sentiment is negative.

This PR does seem to provide very significant speedups. However, it comes with several downsides:

  • the accelerator is Intel-specific
  • it will only be provided on high-end CPUs (Xeon Scalable), and maybe even only some of them due to market segmentation
  • it requires a dedicated third-party library for which we have to maintain vendoring support

Generally, our Arrow and Parquet C++ maintenance bandwidth is very limited, with few active maintainers. This PR adds maintenance overhead for no value to most users.

If Intel were a significant contributor to Arrow and Parquet maintenance, I would view this PR (and other similar one-shot PRs that add accelerations to select parts of the codebase) more favorably.

@martin-g (Member)

About the missing CI: how about using a self-hosted runner donated by Intel?
@assignUser already started a discussion about using ephemeral self-hosted runners at https://lists.apache.org/thread/mskpqwpdq65t1wpj4f5klfq9217ljodw

@amol- (Contributor) commented Mar 30, 2023

Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍

@amol- closed this Mar 30, 2023

8 participants