
ARROW-17884: [C++] Add Intel®-IAA/QPL-based Parquet RLE Decode #14585

Closed · wants to merge 1 commit

Conversation

@yaqi-zhao commented Nov 4, 2022

Intel® In-Memory Analytics Accelerator (Intel® IAA) is a hardware accelerator available in the upcoming generation of Intel® Xeon® Scalable processors ("Sapphire Rapids"). Its goal is to speed up common operations in analytics like data (de)compression and filtering, and it supports decoding of the Parquet RLE format. We add a new codec that utilizes Intel® IAA offloading to provide a high-performance RLE decode implementation. The codec uses the Intel® Query Processing Library (QPL), which abstracts access to the hardware accelerator. The new solution generally provides higher performance than the current one and also consumes fewer CPU cycles.
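For readers unfamiliar with the format being offloaded, a minimal software sketch of Parquet's RLE/bit-packing hybrid decode (the operation IAA accelerates here) might look like the following. This is illustrative only, not the PR's code: it handles bit widths up to 8 and omits all error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode a Parquet RLE/bit-packing hybrid buffer (bit widths <= 8).
// Each run starts with a ULEB128 header; the low bit selects the run type.
std::vector<uint32_t> DecodeRleHybrid(const uint8_t* data, size_t len,
                                      int bit_width) {
  std::vector<uint32_t> out;
  size_t pos = 0;
  while (pos < len) {
    // Read the ULEB128 run header.
    uint32_t header = 0;
    int shift = 0;
    while (true) {
      uint8_t byte = data[pos++];
      header |= static_cast<uint32_t>(byte & 0x7F) << shift;
      if (!(byte & 0x80)) break;
      shift += 7;
    }
    if (header & 1) {
      // Bit-packed run: (header >> 1) groups of 8 values, packed LSB-first.
      uint32_t num_groups = header >> 1;
      uint64_t buffer = 0;
      int bits_in_buffer = 0;
      for (uint32_t i = 0; i < num_groups * 8; ++i) {
        if (bits_in_buffer < bit_width) {
          buffer |= static_cast<uint64_t>(data[pos++]) << bits_in_buffer;
          bits_in_buffer += 8;
        }
        out.push_back(static_cast<uint32_t>(buffer & ((1u << bit_width) - 1)));
        buffer >>= bit_width;
        bits_in_buffer -= bit_width;
      }
    } else {
      // RLE run: one value repeated (header >> 1) times, stored in
      // ceil(bit_width / 8) little-endian bytes.
      uint32_t count = header >> 1;
      int value_bytes = (bit_width + 7) / 8;
      uint32_t value = 0;
      for (int i = 0; i < value_bytes; ++i) {
        value |= static_cast<uint32_t>(data[pos++]) << (8 * i);
      }
      out.insert(out.end(), count, value);
    }
  }
  return out;
}
```

The bit-packed test vector below (`0x88 0xC6 0xFA` for the values 0 through 7 at bit width 3) is the example from the Parquet encoding specification.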

@github-actions bot commented Nov 4, 2022

@emkornfield (Contributor)

Two high level concerns:

  1. Apologies if I missed it, but this doesn't seem to integrate with CI, so it will likely not be super maintainable.
  2. Adding a new library dependency might not be desirable. In particular it looks like this library is still in beta?

@yaqi-zhao (Author)

Thanks for your comments, they help a lot. @emkornfield
For Q1: I'll try to add testing to CI, but this feature requires IAA (Intel® In-Memory Analytics Accelerator) hardware, which is a new built-in accelerator in the next-gen Intel® Xeon® Scalable processors. Those processors will launch in the near future. Maybe we can run the CI on a machine in Intel's lab environment that supports this processor.
For Q2: QPL is a necessary library to enable IAA, and a release version of the library will be provided with the launch of the next-gen Intel® Xeon® Scalable processors.
If a new library dependency is not desirable, do you think one of the following solutions would be better?

  1. Install the library directly on the machine instead of adding an explicit dependency in Arrow.
  2. Add a third-party toolchain, just like the implementations for Snappy, ZSTD, etc.
  3. Reimplement the APIs used from the QPL library instead of adding a new library dependency, though this would mean a lot of new code to merge.

@yaqi-zhao (Author)

@emkornfield Hi, for the two questions you raised last week, is the solution in the comment below acceptable?

#14585 (comment)

@emkornfield (Contributor)

Maybe we run the CI together in Intel's lab machine environment that supports this processor.

I'm not sure if we have the infrastructure to do this.

I think for question #2, we should discuss on the mailing list whether we want to take on the dependency in the toolchain, along with more details about possible CI options.

@yaqi-zhao (Author)

@emkornfield I'll set up a machine in Intel's lab that you can access as soon as possible, and once it is ready I'll share the details with you by email.

@emkornfield (Contributor)

@yaqi-zhao I'm sorry for the confusion; it isn't about me having access to a test machine. The issue is being able to continuously test this code with GitHub Actions to make sure there are no regressions. I'm not sure if GitHub Actions supports having custom machines integrated with it.

@maqister

"The new solution provides in general higher performance against current solution, and also consume less CPU."

Could you share performance numbers? This description is very vague.

@yaqi-zhao (Author) commented Nov 21, 2022

@maqister We ran benchmark tests on the ReadRowGroups API; the results show 1-2x performance improvements depending on the RLE-encoded bit width of the Parquet file.
[benchmark screenshots: Arrow with IAA vs. Arrow master]

@yaqi-zhao force-pushed the iaa_execute_job branch 2 times, most recently from b894135 to e257c67 on November 22, 2022 05:56
@yaqi-zhao (Author)

@kou May I draw your attention to this PR?

We use IAA to do the RLE decode work, and benchmark results on the ReadRowGroups API show up to 2x performance improvements.

Do you think this work is valuable?

@kou (Member) left a comment

I don't object to this feature, but I'm worried about how we can maintain it...

Review comments (resolved): .gitmodules, cpp/CMakeLists.txt, cpp/cmake_modules/DefineOptions.cmake, cpp/src/arrow/util/qpl_job_pool.cc
@yaqi-zhao force-pushed the iaa_execute_job branch 9 times, most recently from 69f9e14 to ed0916e on November 28, 2022 05:30
@yaqi-zhao (Author)

Hi @kou,

  1. I have updated the code you commented on last week.
  2. As for the maintenance of this feature, I added a runtime check for the IAA device. If the machine's CPU supports IAA, continuous integration will exercise this code path; once the new Intel processors launch, CI will be able to cover this feature.
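The runtime check described above could follow a dispatch pattern like this sketch. `ProbeIaaDevice()`, `DecodeSoftware()`, and `DecodeWithIaa()` are hypothetical stand-ins, not the PR's actual functions; in the real code the probe would initialize a QPL job on the hardware path and inspect the returned status.

```cpp
#include <cassert>
#include <functional>

// Hypothetical probe for an IAA device. A real implementation would try to
// initialize QPL on the hardware path and check the status it returns.
bool ProbeIaaDevice() {
  return false;  // this sketch assumes no accelerator is present
}

// Placeholder decode paths; both just pass the batch size through here.
int DecodeSoftware(int batch_size) { return batch_size; }  // existing CPU path
int DecodeWithIaa(int batch_size) { return batch_size; }   // offloaded path

// Probe once, then route all decode calls to the chosen implementation.
std::function<int(int)> SelectRleDecoder() {
  return ProbeIaaDevice() ? DecodeWithIaa : DecodeSoftware;
}
```

With this shape, CI machines without IAA still compile and exercise the dispatch logic, and only the hardware branch needs accelerator-equipped runners.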

@maqister

@maqister We run benchmark test on ReadRowGroups APIs, test results show that there are 1~2X perf improvements according to different rle-encoded bit width of parquet file.

thanks a lot for sharing the results!

@kou (Member) left a comment

As for the maintenance of this feature, I add a runtime check the IAA device. If CPU processor of the machine support IAA, the continuous-integration will check the code. After the launching of the Intel CPU processor, the CI is able to check this feature.

Does this mean that we can build this code without the new Intel CPU?

Does QPL require the new Intel CPU? Or can we use QPL with older Intel CPUs?

BTW, how about fixing lint errors before we start careful review?
https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci

Review comments (resolved): cpp/CMakeLists.txt, cpp/cmake_modules/DefineOptions.cmake, cpp/cmake_modules/FindQPL.cmake, cpp/src/arrow/util/qpl_job_pool.cc, cpp/src/arrow/util/rle_encoding.h, cpp/src/parquet/CMakeLists.txt, cpp/src/parquet/encoding.h
@yaqi-zhao (Author)

Does this mean that we can build this code without the new Intel CPU?

@kou Yes, QPL has no special CPU requirement; it can be built without the new Intel CPU.

At runtime, QPL can check whether IAA is enabled on the CPU the program is running on.

@yaqi-zhao (Author)

Hi @kou, I have fixed the lint errors in this patch; could you please review it again?

@kou (Member) commented Dec 1, 2022

It seems there are still CMake lint errors: https://github.com/apache/arrow/actions/runs/3591159963/jobs/6045602036#step:5:815

INFO:archery:Running cmake-format linters
ERROR __main__.py:618: Check failed: /arrow/cpp/cmake_modules/DefineOptions.cmake
ERROR __main__.py:618: Check failed: /arrow/cpp/CMakeLists.txt
ERROR __main__.py:618: Check failed: /arrow/cpp/cmake_modules/FindQPL.cmake
ERROR __main__.py:618: Check failed: /arrow/cpp/cmake_modules/ThirdpartyToolchain.cmake

Could you confirm them?

https://arrow.apache.org/docs/developers/cpp/development.html#code-style-linting-and-ci

CMake files pass style checks, can be fixed by running archery lint --cmake-format --fix. This requires Python 3 and cmake_format (note: this currently does not work on Windows).

FYI: You can check lint result in your fork too: https://github.com/yaqi-zhao/arrow/actions/runs/3591159610/jobs/6045360960#step:5:4197

Review comments: cpp/cmake_modules/FindQPL.cmake, cpp/cmake_modules/ThirdpartyToolchain.cmake, cpp/src/arrow/util/bit_stream_utils.h, cpp/src/parquet/encoding.cc, cpp/src/arrow/util/qpl_job_pool.cc
@yaqi-zhao force-pushed the iaa_execute_job branch 4 times, most recently from a87eb4e to 506b043 on December 5, 2022 03:18
@yaqi-zhao (Author) commented Dec 5, 2022

Hi @kou and @wgtmac, I have updated the code according to the comments, and the lint check now passes on my fork. Could you please continue the review? Thanks.

@yaqi-zhao requested a review from wgtmac on December 7, 2022 03:00
QplJobHWPool();
~QplJobHWPool();

static QplJobHWPool& instance();
@wgtmac (Member):

Function names here look more like Java style. I'd suggest following the convention of arrow/memory_pool.h and adding more comments about the public APIs. BTW, is it possible to refactor the static functions to be member functions? You could simply use a singleton to call member functions. cc @pitrou

@yaqi-zhao (Author):
@wgtmac I have addressed your comments, including function name style, comments, and static functions. Can you please take a look again? Thanks.
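A minimal sketch of the refactor being discussed — a Meyers singleton whose public API is member functions — under the simplifying assumption that the pool tracks only busy flags; the real pool holds qpl_job buffers and is not this code.

```cpp
#include <array>
#include <atomic>
#include <cassert>

class QplJobHWPool {
 public:
  // Meyers singleton: initialization of the static local is thread-safe
  // since C++11, so no explicit locking is needed to create the instance.
  static QplJobHWPool& GetInstance() {
    static QplJobHWPool instance;
    return instance;
  }

  // Returns the index of an acquired job slot, or -1 if all slots are busy.
  int AcquireJob() {
    for (int i = 0; i < static_cast<int>(busy_.size()); ++i) {
      bool expected = false;
      // Atomically claim the slot so concurrent callers never share one.
      if (busy_[i].compare_exchange_strong(expected, true)) return i;
    }
    return -1;
  }

  void ReleaseJob(int index) { busy_[index].store(false); }

 private:
  QplJobHWPool() = default;
  std::array<std::atomic<bool>, 8> busy_{};  // one flag per pooled job
};
```

With this shape the callers use `QplJobHWPool::GetInstance().AcquireJob()` rather than free-standing static functions, which is what the review comment suggests.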

@yaqi-zhao force-pushed the iaa_execute_job branch 2 times, most recently from 7615f46 to 1bbefe1 on December 8, 2022 02:44
@yaqi-zhao requested a review from kou on December 8, 2022 06:39
Review comments (resolved): cpp/CMakeLists.txt, cpp/cmake_modules/FindQPL.cmake, cpp/cmake_modules/ThirdpartyToolchain.cmake
uint32_t job_id = 0;
qpl_job* job = ::arrow::util::internal::QplJobHWPool::GetInstance().AcquireJob(job_id);
if (job == NULL) {
  return -1;
Member:

Should we fall back to GetBatchWithDict()?

@yaqi-zhao (Author) commented Dec 12, 2022

I was thinking that if the QPL job was not ready, it would not affect the RLE decoding.

Member:

Sorry, I couldn't understand. If we return -1 here, does decoding the Parquet data succeed? (I thought decoding the Parquet data would fail.)

Review comments (resolved): cpp/src/arrow/util/rle_encoding.h, cpp/src/parquet/CMakeLists.txt, cpp/thirdparty/versions.txt, cpp/src/arrow/util/bit_stream_utils.h
@yaqi-zhao force-pushed the iaa_execute_job branch 5 times, most recently from a9bc3bc to 1dd6e51 on December 12, 2022 10:07
@yaqi-zhao (Author)

@kou Thanks for your comments; the code has been updated. Please take a look, thanks!

set(QPL_PATCH_COMMAND)
find_package(Patch)
if(Patch_FOUND)
# This patch is for Qpl <= v0.2.0
Member:

Did you upstream this patch?
Please add the pull request URL as a comment.

Author:

This has not been upstreamed. Do we have to use patches that have already been merged?

Member:

We can use a patch that is not yet merged, but we don't want to maintain patches, to reduce maintenance cost.
So we must upstream our patches.

Comment on lines +2245 to +2249
set(QPL_LIBRARIES ${QPL_STATIC_LIB})
set(QPL_INCLUDE_DIRS "${QPL_PREFIX}/include")
set_target_properties(Qpl::qpl
PROPERTIES IMPORTED_LOCATION ${QPL_LIBRARIES}
INTERFACE_INCLUDE_DIRECTORIES ${QPL_INCLUDE_DIRS})
Member:

We can remove needless variables here:

Suggested change
set(QPL_LIBRARIES ${QPL_STATIC_LIB})
set(QPL_INCLUDE_DIRS "${QPL_PREFIX}/include")
set_target_properties(Qpl::qpl
PROPERTIES IMPORTED_LOCATION ${QPL_LIBRARIES}
INTERFACE_INCLUDE_DIRECTORIES ${QPL_INCLUDE_DIRS})
set_target_properties(Qpl::qpl
PROPERTIES IMPORTED_LOCATION ${QPL_STATIC_LIB}
INTERFACE_INCLUDE_DIRECTORIES ${QPL_PREFIX}/include)

cpp/src/arrow/util/qpl_job_pool.cc Show resolved Hide resolved
cpp/src/arrow/util/qpl_job_pool.cc Show resolved Hide resolved
return nullptr;
}
uint32_t retry = 0;
auto index = distribution(random_engine);
Member:

Why do we use distribution() to find a free job?
Is it efficient?

Author:

Our initial idea was to avoid always searching for an available job from the first index, to save time.
We chose the Mersenne Twister as the random engine and std::uniform_int_distribution as the distribution to guarantee randomness.
I'm not sure if there is a more efficient way; do you have any suggestions?
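The strategy described above can be sketched as a random starting index followed by a linear probe with wraparound, so that concurrent callers don't all contend on slot 0. This is an illustrative stand-in, not the PR's code: the `busy` vector models the pool's job state, and `FindFreeSlot` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Pick a random start index, then probe linearly (with wraparound) for a
// free slot. Returns the slot index, or -1 if every slot is busy.
int FindFreeSlot(const std::vector<bool>& busy, std::mt19937& engine) {
  std::uniform_int_distribution<size_t> distribution(0, busy.size() - 1);
  size_t start = distribution(engine);
  for (size_t i = 0; i < busy.size(); ++i) {
    size_t index = (start + i) % busy.size();
    if (!busy[index]) return static_cast<int>(index);
  }
  return -1;  // all slots busy
}
```

A randomized start spreads contention across the pool, at the cost of a nondeterministic probe order; an atomic round-robin counter would be a common alternative with similar effect.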

job->available_out = static_cast<uint32_t>(destination.size());

if (!bit_reader_.GetBatchWithQpl(batch_size, job)) {
return -1;
Member:

Should we call qpl_fini_job() and ReleaseJob() here too?

Can we use RAII to ensure the job is released, like std::lock_guard?
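A sketch of the RAII idea: a guard whose destructor releases the job on every path, including early returns. The `released_count` pointer stands in for the real release work (`ReleaseJob()`/`qpl_fini_job()`), which is not called here; `JobGuard` and `DecodeWithEarlyReturn` are hypothetical names.

```cpp
#include <cassert>

// Guard that "releases" a job slot when it goes out of scope, so no return
// path can leak the job. Here releasing just increments a counter.
class JobGuard {
 public:
  JobGuard(int job_id, int* released_count)
      : job_id_(job_id), released_count_(released_count) {}
  ~JobGuard() { ++*released_count_; }  // runs on every exit path
  int id() const { return job_id_; }

  // Non-copyable: exactly one owner releases the job.
  JobGuard(const JobGuard&) = delete;
  JobGuard& operator=(const JobGuard&) = delete;

 private:
  int job_id_;
  int* released_count_;
};

// Simulated decode with an early-return failure path; the guard releases
// the job whether or not we bail out early.
int DecodeWithEarlyReturn(bool fail, int* released_count) {
  JobGuard guard(/*job_id=*/7, released_count);
  if (fail) return -1;  // guard destructor still releases the job here
  return guard.id();
}
```

This is the same shape as std::lock_guard: acquisition in the constructor, unconditional release in the destructor.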

uint32_t job_id = 0;
qpl_job* job = ::arrow::util::internal::QplJobHWPool::GetInstance().AcquireJob(job_id);
if (job == NULL) {
  return -1;
Member:

Sorry, I couldn't understand. If we return -1 here, does decoding the Parquet data succeed? (I thought decoding the Parquet data would fail.)

job->param_high = batch_size + value_offset_;
job->num_input_elements = batch_size + value_offset_;

job->next_in_ptr = const_cast<uint8_t*>(buffer_ - 1);
Member:

Why do we need - 1 here?
Does it touch invalid memory?

@yaqi-zhao (Author) commented Dec 14, 2022

It will not touch invalid memory: in DictDecoderImpl::SetData, the bit reader buffer is set to the encoded data after skipping the leading <bit_width> byte.

[screenshot of DictDecoderImpl::SetData]

IAA needs the full Parquet RLE buffer, so I subtract 1 here.

Member:

Oh, does it mean that we can use this feature only for Parquet's RLE format?
Could you point me to IAA's documentation for this?

Member:

Or is the "bit-width" part of the general RLE format?

@yaqi-zhao (Author)

Oh, does it mean that we can use this feature only for Parquet's RLE format? Could you tell me IAA's document for this?

Hi, Kou. You can download IAA specification from https://www.intel.com/content/www/us/en/content-details/721858/intel-in-memory-analytics-accelerator-architecture-specification.html?wapkw=In-memory%20accelerator
And this PR is only for Parquet's RLE format.

Or is "bit-width" included in general RLE format?

Yes, it's part of the general format; you can read about it at https://parquet.apache.org/docs/file-format/data-pages/encodings/#a-namerlearun-length-encoding--bit-packing-hybrid-rle--3
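To make the layout concrete: in a Parquet dictionary-encoded data page, the first byte of the encoded data is the bit width, followed by the RLE/bit-packed payload. Arrow's bit reader is pointed past that byte, which is why the IAA path rewinds by one to hand QPL the full buffer. A tiny sketch of that split (`SplitDictPage` and `RleView` are hypothetical helpers, not Arrow code):

```cpp
#include <cassert>
#include <cstdint>

// View over a dictionary page's encoded data: the leading bit-width byte
// plus a pointer to the RLE/bit-packed payload that follows it.
struct RleView {
  int bit_width;
  const uint8_t* payload;
};

// Split a dictionary page buffer the way DictDecoderImpl::SetData consumes
// it: byte 0 is the bit width, the payload starts at byte 1.
RleView SplitDictPage(const uint8_t* data) {
  return RleView{data[0], data + 1};
}
```

Since the bit reader holds only `payload`, reconstructing the buffer QPL expects means stepping back one byte from it, which is the `buffer_ - 1` in the PR.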

Comment on lines +420 to +422
job->param_low = value_offset_;
job->param_high = batch_size + value_offset_;
job->num_input_elements = batch_size + value_offset_;
Member:

Why do we need to introduce the new value_offset_? Can we use the existing bit_offset_ and byte_offset_ instead?

job->next_in_ptr = const_cast<uint8_t*>(buffer_ - 1);
job->available_in = max_bytes_ + 1;

qpl_status status = qpl_execute_job(job);
Member:

Suggested change
qpl_status status = qpl_execute_job(job);
auto status = qpl_execute_job(job);

@pitrou (Member) commented Dec 14, 2022

I haven't looked at this in detail but my general sentiment is negative.

This PR does seem to provide very significant speedups. However, it comes with several downsides:

  • the accelerator is Intel-specific
  • it will only be provided on high-end CPUs (Xeon Scalable), and maybe even only some of them due to market segmentation
  • it requires a dedicated third-party library for which we have to maintain vendoring support

Generally, our Arrow and Parquet C++ maintenance bandwidth is very limited, with few active maintainers. This PR adds maintenance overhead for no value to most users.

If Intel were a significant contributor to Arrow and Parquet maintenance, I would view this PR (and other similar one-shot PRs that add accelerations to select parts of the codebase) more favorably.

@martin-g (Member)

About the missing CI: how about using a self-hosted runner donated by Intel?
@assignUser already started a discussion about using ephemeral self-hosted runners at https://lists.apache.org/thread/mskpqwpdq65t1wpj4f5klfq9217ljodw

@amol- (Contributor) commented Mar 30, 2023

Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍

@amol- closed this Mar 30, 2023

8 participants