Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34388: [C++] Build core compute kernels unconditionally #34295

Merged
merged 6 commits into from
Mar 4, 2023

Conversation

benibus
Copy link
Collaborator

@benibus benibus commented Feb 22, 2023

This includes the core compute machinery in libarrow by default - in addition to all cast kernels and several other kernels that are either dependencies of cast (take) or utilized in libarrow/libparquet (unique, filter). The remaining kernels won't be built/registered unless ARROW_COMPUTE=ON (note that this would slightly change the option's meaning, as currently, nothing in arrow/compute is built unless it's set).

Initially this was more substantial as the original goal was to build the extra kernels as a shared library (suggested in the orginal issue). After some discussion in the issue thread, I opted not to do that - primarily because I can't personally see the utility of a separate lib here, even ignoring the complexity it introduces. However, there may be a good reason that simply hasn't occured to me.

@github-actions
Copy link

⚠️ GitHub issue #25025 has been automatically assigned in GitHub to PR creator.

@benibus
Copy link
Collaborator Author

benibus commented Feb 23, 2023

One thing I haven't decided is how to deal with the compute unit tests since most of them make heavy use of the extra kernels, so a good chunk of them will fail without them. Easiest option would be force ARROW_COMPUTE=ON if ARROW_BUILD_TESTS=ON (not the worst idea in the world, i guess - as this is a packaging-focused feature). Alternatively, we could just not build the tests in question - although that would include most of the tests in compute/exec.

(Also, I'll look into the unity build failures - not quite sure what's going wrong there...)

@benibus benibus marked this pull request as ready for review February 23, 2023 02:23
@benibus
Copy link
Collaborator Author

benibus commented Feb 23, 2023

cc @felipecrv

@felipecrv
Copy link
Contributor

felipecrv commented Feb 23, 2023

This isn't enough to close the issue, right? Do you want to associate this one with a sub-issue of #25025 so you can merge it before working on the shared library setup?

I've read the PR description more carefully now.

@benibus
Copy link
Collaborator Author

benibus commented Feb 23, 2023

Well to be fair, that might still be the right move as it'd be easy to make that a follow-up PR if we decide to go down that road (assuming we get the library boundaries right in this one).

@westonpace
Copy link
Member

The goal is to eventually reduce the size of the "core" library right? Do we have any idea how slim this set of baseline kernels is compared to the full set?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a reasonable set of minimal kernels. Do you want to add a CI job (or does one already exist?) that builds without ARROW_COMPUTE to ensure basic functionality (e.g. parquet reading/writing and csv reading/writing) still works?

cpp/src/arrow/compute/kernels/CMakeLists.txt Show resolved Hide resolved
@benibus
Copy link
Collaborator Author

benibus commented Feb 24, 2023

Do we have any idea how slim this set of baseline kernels is compared to the full set?

Here's what would be present in the default build:

array_filter
array_take
cast
dictionary_encode
drop_null
filter
indices_nonzero
take
unique
value_counts

There are currently 240 kernels in the full set, so it's a pretty deep cut.

Do you want to add a CI job (or does one already exist?) that builds without ARROW_COMPUTE to ensure basic functionality (e.g. parquet reading/writing and csv reading/writing) still works?

That would be a good idea, yes. AFAIK none of the existing jobs build without ARROW_COMPUTE. Even if they did, the CSV writer/STL tests wouldn't be included and libparquet wouldn't be built at all.

@westonpace
Copy link
Member

This failure looks related: https://github.com/apache/arrow/actions/runs/4247620924/jobs/7385865305

Perhaps just a changing of include orders has angered the unity build gremlins in some way. Given the two implementations are identical maybe we could put them in util_internal.h?

@benibus benibus force-pushed the GH-25025-split-compute-kernels branch from f2b4597 to 96fe9af Compare February 28, 2023 19:18
@github-actions github-actions bot added the awaiting review Awaiting review label Feb 28, 2023
@raulcd raulcd changed the title GH-25025: [C++] Build core compute kernels unconditionally GH-34388: [C++] Build core compute kernels unconditionally Feb 28, 2023
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #34388 has been automatically assigned in GitHub to PR creator.

@apache apache deleted a comment from github-actions bot Feb 28, 2023
@benibus
Copy link
Collaborator Author

benibus commented Feb 28, 2023

I still need to add the CI job, but in preparation, I set things up so that certain tests won't be built without the complete kernel registry - so we wouldn't need any special ctest flags to avoid expected failures.

The unity build redefinition errors should be fixed now. Most of the problematic code in scalar_round.cc was actually completely unused, so I just removed it.

cpp/src/arrow/compute/CMakeLists.txt Outdated Show resolved Hide resolved
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@westonpace
Copy link
Member

@felipecrv / @lidavidm I haven't been following the discussion on #25025 very closely. However, this change seems good. I assume we want to proceed with it?

@lidavidm
Copy link
Member

lidavidm commented Mar 4, 2023

Yes, either way, I think this is a necessary first step before we can unbundle the kernels + I hope it is easier to review this way

@westonpace westonpace merged commit be7a763 into apache:main Mar 4, 2023
@ursabot
Copy link

ursabot commented Mar 4, 2023

Benchmark runs are scheduled for baseline = 4b31aa4 and contender = be7a763. be7a763 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.4% ⬆️0.06%] test-mac-arm
[Finished ⬇️0.51% ⬆️0.26%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] be7a7634 ec2-t3-xlarge-us-east-2
[Finished] be7a7634 test-mac-arm
[Finished] be7a7634 ursa-i9-9960x
[Finished] be7a7634 ursa-thinkcentre-m75q
[Finished] 4b31aa4f ec2-t3-xlarge-us-east-2
[Failed] 4b31aa4f test-mac-arm
[Finished] 4b31aa4f ursa-i9-9960x
[Finished] 4b31aa4f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

kou added a commit that referenced this pull request Aug 29, 2023
…37409)

### Rationale for this change

GH-34295  changed meaning of `ARROW_COMPUTE`. `ARROW_COMPUTE=ON` means that "all compute kerenels are enabled" not "compute module is enabled".

`arrow-compute.pc`  is for detecting `ARROW_COMPUTE`. So `arrow-compute.pc` should be installed only when `ARROW_COMPUTE=ON`.

### What changes are included in this PR?

Add missing `if (ARROW_COMPUTE)`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: #37408

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…E=ON (apache#37409)

### Rationale for this change

apacheGH-34295  changed meaning of `ARROW_COMPUTE`. `ARROW_COMPUTE=ON` means that "all compute kerenels are enabled" not "compute module is enabled".

`arrow-compute.pc`  is for detecting `ARROW_COMPUTE`. So `arrow-compute.pc` should be installed only when `ARROW_COMPUTE=ON`.

### What changes are included in this PR?

Add missing `if (ARROW_COMPUTE)`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: apache#37408

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…E=ON (apache#37409)

### Rationale for this change

apacheGH-34295  changed meaning of `ARROW_COMPUTE`. `ARROW_COMPUTE=ON` means that "all compute kerenels are enabled" not "compute module is enabled".

`arrow-compute.pc`  is for detecting `ARROW_COMPUTE`. So `arrow-compute.pc` should be installed only when `ARROW_COMPUTE=ON`.

### What changes are included in this PR?

Add missing `if (ARROW_COMPUTE)`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: apache#37408

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Build core compute kernels unconditionally
5 participants