
Update to UCX v1.14.0 and configure/build for release #111

Merged: 15 commits merged into conda-forge:main from pentschev:use-configure-release on Mar 18, 2023

Conversation

@pentschev (Contributor) commented on Oct 17, 2022

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Changes:

  • Upgrade to UCX 1.14.0
  • Add libnuma dependency from conda-forge
  • Switch to release builds
  • Move to a single build compatible with both CPU-only and CUDA
  • Remove verbs support

Closes #26
Closes #66
Closes #102
Closes #110
Closes #112
Closes #115

@conda-forge-linter commented

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@pentschev (Contributor, Author) commented

@conda-forge-admin, please rerender

@pentschev (Contributor, Author) commented

cc @leofang @jakirkham

@github-actions (Contributor) commented

Hi! This is the friendly automated conda-forge-webservice.

I tried to rerender for you, but it looks like there was nothing to do.

This message was generated by GitHub actions workflow run https://github.com/conda-forge/ucx-split-feedstock/actions/runs/3267176097.

@leofang (Member) commented on Oct 17, 2022

Thanks, Peter! LGTM, but I'd like to have @jakirkham or someone else take another look 🙂

@jakirkham (Member) commented

Does this result in ABI changes? In particular, were there debug symbols that would have been linked against before that are now removed?

@pentschev mentioned this pull request on Jan 27, 2023
@conda-forge-webservices (Contributor) commented

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@pentschev changed the title from "Configure/build UCX for release" to "Update to UCX v1.14.0 and configure/build for release" on Mar 14, 2023
@pentschev (Contributor, Author) commented

As per our previous discussion, we wanted to hold off on switching to release builds until the new UCX release was out. Now that v1.14.0 has been released, I've updated this PR to use it.

@jakirkham @leofang would you mind taking a look?

@pentschev (Contributor, Author) commented

It seems that the CentOS 6 builds fail with UCX 1.14 because IBV_ACCESS_ON_DEMAND is missing: it was previously defined in https://github.com/openucx/ucx/blob/c0aa76d0c121445b8665dbb16e1577ef5ebc6200/src/uct/ib/base/ib_verbs.h#L160-L169 but is now expected to be provided by rdma-core. Similar to the existing note in the build script,

# --with-rdmacm requires rdma-core v23+, while CentOS 7 only offers v22.

I believe we should be building against rdma-core v28 for MOFED 5.x compatibility, which we can't do because CentOS 7 only goes up to v22. My suggestion would be to remove --with-verbs entirely until we can guarantee rdma-core v28 or higher is available as part of the conda-forge packaging process.
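
For illustration only, a minimal sketch of what that could eventually look like if conda-forge ships a new enough rdma-core; the package name, version, and placement below are assumptions and not part of this PR:

    # Hypothetical future change, NOT part of this PR: if conda-forge provides
    # rdma-core >=28 (the MOFED 5.x era), verbs support could come from a host
    # dependency instead of the CentOS CDT packages.
    requirements:
      host:
        - rdma-core >=28    # assumed package name and version availability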

@pentschev (Contributor, Author) commented

To be clear, MOFED 5.x is now the only MOFED version tested with UCX. I know MOFED 4.x was still "supported" by UCX until last year or so, but it was not actively tested, so I was constantly finding bugs. As a result, we decided to stop supporting MOFED 4.x entirely for RAPIDS, both internally at NVIDIA and for customers. I would recommend the same be done here, as MOFED 4.x support (rdma-core < v28) is going to be hit or miss.

@jakirkham (Member) commented

Thanks Peter! 🙏

Do we also want to add libnuma ( #112 )?

Wonder if we can also combine these into a single build for both CPU and GPU (just moving cudatoolkit to run_constrained). This would resolve issues #26, #66, #102, and #115.

cc @kkraus14 (for awareness)
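
A minimal sketch of the run_constrained idea, assuming cudatoolkit moves from a run dependency to a constraint so a single package works in both CPU-only and CUDA environments (illustrative only, not the exact change in this PR):

    # Sketch: a single build usable with or without CUDA. cudatoolkit is no
    # longer a hard run dependency; it is only pinned if the user installs it.
    requirements:
      run_constrained:
        - cudatoolkit >=11.2,<12    # illustrative pin, not taken from this PR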

Review thread on recipe/meta.yaml (outdated):

 build:
   skip: true  # [not linux]
-  number: 0
+  number: 1

A reviewer commented:

If we're bumping the version we don't need to bump the build number

@pentschev (Contributor, Author) replied:

This was an oversight: this PR originally only switched to release builds without bumping the UCX version, so the build number bump was left in by accident. We also have the number variable above; is there a reason we can't use it here, @jakirkham? It is currently only used by the ucx output's build.number, so I am wondering whether we can use it for both.

@pentschev (Contributor, Author) replied:

Fixed in 50a5dc2.

@jakirkham (Member) replied:

Yeah we should use the top-level number and template it everywhere else
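
A sketch of that templating, assuming the recipe's top-level Jinja variable is called number as in the discussion above (the exact layout of the final recipe may differ):

    {% set number = 0 %}

    build:
      number: {{ number }}

    outputs:
      - name: ucx
        build:
          number: {{ number }}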

@jakirkham (Member) commented

A separate question would be whether we want to backport the CUDA 12 fixes ourselves ( #116 (comment) ) to ensure they are in 1.14.0.

@kkraus14 commented

> To be clear, MOFED 5.x is now the only MOFED version tested with UCX. I know MOFED 4.x was still "supported" by UCX until last year or so, but it was not actively tested, so I was constantly finding bugs. As a result, we decided to stop supporting MOFED 4.x entirely for RAPIDS, both internally at NVIDIA and for customers. I would recommend the same be done here, as MOFED 4.x support (rdma-core < v28) is going to be hit or miss.

Now that there's a virtual package plugin system for conda, we could in theory have a virtual package for the MOFED version and constrain UCX packages based on it if desired.

Is there a path forward for an updated rdma-core CDT or are we limited to CentOS 7 for CDTs?
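
As a sketch of the virtual-package idea above, assuming a hypothetical plugin exposes the detected MOFED as a __mofed virtual package (the name and versioning scheme here are made up for illustration; no such plugin is part of this PR):

    # Hypothetical: relies on a conda plugin that detects the installed MOFED
    # and reports it as a `__mofed` virtual package.
    requirements:
      run_constrained:
        - __mofed >=5    # only satisfiable where the plugin reports MOFED 5.x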

@jakirkham (Member) commented

It would need to be provided by the user somehow

Anyways if this is not helpful, we can skip it. Was just thinking the option might be useful for local builds

@pentschev (Contributor, Author) commented

> It would need to be provided by the user somehow
>
> Anyways if this is not helpful, we can skip it. Was just thinking the option might be useful for local builds

I agree this would be a good idea if it could be provided by the user, but unfortunately I don't think that's possible. Some time ago I even went down the rabbit hole of trying to link against the system's rdma-core/MOFED, but due to the way conda sets the sysroot that was impossible, or at least cumbersome enough that I gave up after spending about a day on it. With that said, I don't think this is helpful, but if someone can point to a path forward I'm not opposed to doing it.

@jakirkham (Member) commented

Gotcha. If we think it is too complicated, let's punt.

@pentschev (Contributor, Author) commented

> Gotcha. If we think it is too complicated, let's punt.

Agreed. In that case, I think this PR is ready for review and merging (if nothing needs changing). Would you mind taking another look, John?

Review thread on recipe/meta.yaml (outdated, resolved)
@jakirkham (Member) left a review:

Thanks Peter! 🙏

Looks good. One minor suggestion below.

Review thread on recipe/meta.yaml (outdated, resolved)
Co-authored-by: jakirkham <jakirkham@gmail.com>
@jakirkham (Member) commented

Thanks Peter! 🙏

LGTM

Will let this sit for a bit to see if anyone else has feedback. Planning on merging EOD tomorrow if no comments

@jakirkham merged commit 09499c8 into conda-forge:main on Mar 18, 2023
@jakirkham (Member) commented

Thanks Peter for the PR and Keith for help reviewing! 🙏

rapids-bot (bot) pushed a commit to rapidsai/ucx-py that referenced this pull request on Mar 20, 2023:
conda-forge/ucx-split-feedstock#111 was merged and UCX v1.14.0 is now available on conda-forge, which we should support.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Ray Douglass (https://github.com/raydouglass)

URL: #935
rapids-bot (bot) pushed a commit to rapidsai/raft that referenced this pull request on Mar 23, 2023:
UCX 1.14.0 was recently released and the conda-forge package was updated in conda-forge/ucx-split-feedstock#111 with several packaging improvements. Relax the pin to allow installing UCX v1.14.x as well.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #1366
@pentschev deleted the use-configure-release branch on April 12, 2023 07:37
Review thread on recipe/meta.yaml:

requirements:
  build:
    - {{ compiler("c") }}
    - {{ compiler("cxx") }}
    - {{ compiler("cuda") }}         # [cuda_compiler_version != "None"]
    - {{ cdt("libnl") }}             # [cdt_name == "cos6"]
    - {{ cdt("libnl3") }}            # [cdt_name == "cos7"]
    - {{ cdt("libibcm-devel") }}     # [cdt_name == "cos6"]
    - {{ cdt("libibverbs-devel") }}  # [cdt_name == "cos6"]
    - {{ cdt("librdmacm-devel") }}   # [cdt_name == "cos6"]
    - {{ cdt("numactl-devel") }}
A member commented:

Just noticed we still have this CDT. Though we added libnuma below. Should we drop the CDT then?

A member replied:

Addressing in PR ( #118 )
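
For reference, a rough sketch of the direction discussed here (dropping the numactl-devel CDT in favor of the conda-forge libnuma package); the actual change is made in PR #118, so the exact placement below is only an assumption:

    requirements:
      build:
        - {{ compiler("c") }}         # other build dependencies unchanged
        # {{ cdt("numactl-devel") }}  # dropped in favor of libnuma below
      host:
        - libnuma                     # NUMA support from conda-forge instead of a CDT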
