
disable use_pip by default for PyTorch, except for recent versions (>= 2.0) #3079

Merged
merged 2 commits into easybuilders:5.0.x from pytorch_use_pip on Jan 22, 2024

Conversation

boegel
Member

@boegel boegel commented Jan 17, 2024

Up until now, we've only enabled use_pip for the latest versions of PyTorch (>= 2.1).

It's probably not worth the effort to also do that for older PyTorch versions, so we need to overrule the changing default in PythonPackage (cfr. #3022)

With this approach, old PyTorch easyconfigs could still explicitly opt in with use_pip = True, and future PyTorch easyconfigs could opt out of using pip by setting use_pip = False, should the need arise (since PyTorch currently still uses the legacy install procedure that involves setup.py build).
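For illustration, a minimal sketch of the kind of version-dependent default this PR is about, assuming it lives in the PyTorch easyblock's constructor; the class name EB_PyTorch matches the existing easyblock, but the code below is a simplified assumption, not the actual diff:

```python
# Simplified sketch, not the actual change in this PR.
from easybuild.easyblocks.generic.pythonpackage import PythonPackage
from easybuild.tools import LooseVersion


class EB_PyTorch(PythonPackage):
    """PyTorch easyblock (sketch of a version-dependent use_pip default)."""

    def __init__(self, *args, **kwargs):
        super(EB_PyTorch, self).__init__(*args, **kwargs)

        # only use pip by default for recent PyTorch versions (>= 2.0);
        # older versions keep the legacy 'setup.py install' procedure,
        # unless the easyconfig sets use_pip explicitly
        if self.cfg['use_pip'] is None:
            self.cfg['use_pip'] = LooseVersion(self.version) >= LooseVersion('2.0')
```

With a default along these lines, an old easyconfig can still opt in to pip explicitly and a future easyconfig can still opt out, without touching the easyblock again.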

@boegel boegel added change EasyBuild-5.0 EasyBuild 5.0 labels Jan 17, 2024
@boegel boegel added this to the 5.0 milestone Jan 17, 2024
@boegel
Member Author

boegel commented Jan 17, 2024

@Flamefire Anything to add here?

@Flamefire
Contributor

(since PyTorch currently still uses the legacy install procedure that involves setup.py build).

Just to clarify: pip install . for PyTorch executes (at least) the equivalent of setup.py build, due to the legacy build-backend in pyproject.toml: https://github.com/pytorch/pytorch/blob/40a6710ad34ae4c6f4987f0e47b5c94df3fc8ec7/pyproject.toml#L14

Nothing wrong with that. AFAIK this means that, for us, setup.py install and pip install . are equivalent (for now, until they change the pyproject.toml).

I would even use 2.x as the cutoff version to simplify this to "PyTorch 2 uses pip by default"; no need to go into minor versions here.
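To make the comparison above concrete, here is a rough sketch (not the actual PythonPackage code) of the two install paths that use_pip toggles between; the exact commands and options EasyBuild passes differ, so treat the flags below as assumptions:

```python
# Rough illustration of the two install procedures being compared;
# the real PythonPackage easyblock builds these commands with more options.
def install_command(use_pip, installdir):
    """Return an approximate install command for a Python package."""
    if use_pip:
        # 'pip install .' invokes the build backend declared in pyproject.toml;
        # with PyTorch's legacy setuptools backend this still runs the
        # equivalent of 'setup.py build' under the hood.
        return "python -m pip install --prefix=%s --no-deps --no-build-isolation ." % installdir
    # legacy install procedure used when use_pip is disabled
    return "python setup.py install --prefix=%s" % installdir
```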

@boegel boegel changed the title disable use_pip by default for PyTorch, except for recent versions (>= 2.1) disable use_pip by default for PyTorch, except for recent versions (>= 2.0) Jan 19, 2024
@boegel
Member Author

boegel commented Jan 19, 2024

@boegelbot please test @ jsc-zen3
EB_BRANCH=5.0.x
EB_ARGS="PyTorch-2.0.1-foss-2022a.eb PyTorch-2.0.1-foss-2022b.eb PyTorch-2.1.2-foss-2022a.eb PyTorch-2.1.2-foss-2022b.eb PyTorch-2.1.2-foss-2023a.eb"
CORE_CNT=16

@boegelbot

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ "5.0.x" != 'develop' ]]; then EB_BRANCH="5.0.x" ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/"5.0.x" source init_env_easybuild_develop.sh; fi; EB_PR=3079 EB_ARGS="PyTorch-2.0.1-foss-2022a.eb PyTorch-2.0.1-foss-2022b.eb PyTorch-2.1.2-foss-2022a.eb PyTorch-2.1.2-foss-2022b.eb PyTorch-2.1.2-foss-2023a.eb" EB_REPO=easybuild-easyblocks EB_BRANCH="5.0.x" /opt/software/slurm/bin/sbatch --job-name test_PR_3079 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3434

Test results coming soon (I hope)...

- notification for comment with ID 1900472395 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot

Test report by @boegelbot

Overview of tested easyconfigs (in order)

  • FAIL (build issue) PyTorch-2.0.1-foss-2022a.eb (partial log available at https://gist.github.com/boegelbot/32e1931628b3314bf0387fc6307eef76)
  • FAIL (build issue) PyTorch-2.0.1-foss-2022b.eb (partial log available at https://gist.github.com/boegelbot/3b7ea8c748a91f424f2e21793e742e11)
  • SUCCESS PyTorch-2.1.2-foss-2022a.eb
  • SUCCESS pytest-flakefinder-1.1.0-GCCcore-12.2.0.eb
  • SUCCESS pytest-flakefinder-1.1.0-GCCcore-12.3.0.eb
  • SUCCESS PyTorch-2.1.2-foss-2022b.eb
  • SUCCESS pytest-rerunfailures-12.0-GCCcore-12.3.0.eb
  • SUCCESS pytest-shard-0.1.2-GCCcore-12.3.0.eb
  • SUCCESS expecttest-0.1.5-GCCcore-12.3.0.eb
  • SUCCESS Z3-4.12.2-GCCcore-12.3.0-Python-3.11.3.eb
  • SUCCESS PyTorch-2.1.2-foss-2023a.eb

Build succeeded for 9 out of 11 (5 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/d34fa9a82cc45f54a77cc7e9f34e8cbd for a full test report.

@boegel
Member Author

boegel commented Jan 22, 2024

We haven't had test reports for PyTorch on jsc-zen3 yet (only on jsc-zen2, which is being discontinued).

For PyTorch-2.0.1-foss-2022a.eb:

== 2024-01-19 20:45:24,371 pytorch.py:480 WARNING 0 test failures, 3 test errors (out of 180622):
distributed/_tensor/test_device_mesh (24 total tests, errors=3)

For PyTorch-2.0.1-foss-2022b.eb:

== 2024-01-20 03:39:30,018 pytorch.py:480 WARNING 2 test failures, 6 test errors (out of 180621):
test_quantization (997 total tests, failures=2, errors=1, skipped=74)
distributed/_tensor/test_device_mesh (24 total tests, errors=4)
distributed/_tensor/test_dtensor (15 total tests, errors=1)

I'm pretty sure that they are not caused by enabling use_pip, and hence shouldn't block this PR.

@Flamefire Thoughts?

@Flamefire
Contributor

  • test_quantization is kind of expected -> See comment in EC (test may randomly generate inputs where it fails)
  • distributed/_tensor/test_device_mesh -> Only seen this for CUDA builds (when NCCL is included)
  • distributed/_tensor/test_dtensor -> New to me

Generally those are test errors, not failures, so very likely the tests ran into something they didn't expect from the environment. E.g. for the CUDA build I had this because PyTorch was built with NCCL but run without GPUs. But I can't tell from this alone.

@boegel
Member Author

boegel commented Jan 22, 2024

@Flamefire There's indeed no GPU available in this test setup.

In any case, you don't see any reason to block this PR, right?

@Flamefire
Contributor

@Flamefire There's indeed no GPU available in this test setup.

In any case, you don't see any reason to block this PR, right?

That was just an example that applies to the in-progress CUDA ECs. But yes, I agree that this shouldn't block this PR. Might be worth investigating the failures later though.

@branfosj branfosj merged commit 149213d into easybuilders:5.0.x Jan 22, 2024
17 checks passed
@boegel boegel deleted the pytorch_use_pip branch January 22, 2024 16:45