{ai}[foss/2023b] PyTorch v2.3.0 #20489

akesandgren · 2024-05-07T10:49:42Z

(created using eb --new-pr)

…2.3.0_disable_DataType_dependent_test_if_tensorboard_is_not_available.patch, PyTorch-2.3.0_disable_test_linear_package_if_no_half_types_are_available.patch

akesandgren · 2024-05-07T10:51:05Z

Tests that are failing for me are:

inductor/test_torchinductor 1/1 failed!   test_multilayer_var_lowp
inductor/test_torchinductor_dynamic_shapes 1/1 failed!   test_multilayer_var_lowp
test_cpp_extensions_open_device_registration 1/1 failed!   test_open_device_registration (Not implemented yet ?)
inductor/test_cpu_repro 1/1 failed!    test_scatter_using_atomic_add
test_decomp 1/1 failed!   test_sdpa (_nn_functional_scaled_dot_product_attention_cpu_bfloat16)
inductor/test_torchinductor_opinfo 1/1 failed!
 inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfft2_cpu_int32 FAILED
    C++ error

akesandgren · 2024-05-07T10:53:44Z

@Flamefire
The first two are the same failure: a precision problem, at least on AMD zen3.
For cpp_extensions_open_device_registration I haven't a clue yet.
scatter_using_atomic_add looks like it isn't compiling to the code the test expects; I'm not sure why.
test_sdpa is also precision-related.
I didn't tackle the C++ error.

akesandgren · 2024-05-07T11:21:25Z

@boegelbot Please test @ jsc-zen3

boegelbot · 2024-05-07T11:30:11Z

@akesandgren: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20489 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20489 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

exit code: 0
output:

Submitted batch job 4085

Test results coming soon (I hope)...

- notification for comment with ID 2098172875 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

akesandgren · 2024-05-07T20:25:43Z

Test report by @akesandgren
FAILED
Build succeeded for 0 out of 1 (3 easyconfigs in total)
b-an02.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, Python 3.8.10
See https://gist.github.com/akesandgren/ef17ea2435926ca06bbe5cbbe6058158 for a full test report.

boegelbot · 2024-05-07T20:36:01Z

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/1ce9b70e5410ebd3a1d8dbbce992b8c7 for a full test report.

akesandgren · 2024-05-09T13:43:25Z

Test report by @akesandgren
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
b-cn1607.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz, 3 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12
See https://gist.github.com/akesandgren/9a9b6ec51769af98d6d4689b4e1ba93a for a full test report.

akesandgren · 2024-05-13T05:36:53Z

Interesting...
If I run the tests standalone there are fewer failing tests than when run during a build...

Flamefire · 2024-05-13T09:13:05Z

> Interesting... If I run the tests standalone there are fewer failing tests than when run during a build...

Not unusual for PyTorch ;-)
I just got bitten again by $XDG_CACHE_HOME: PyTorch stores JIT-compiled files there, so rerunning the same test with the same value for it behaves differently, because the file is loaded from that directory instead of being JIT-compiled again.
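
A minimal sketch of one way to sidestep this, assuming the test is launched as a subprocess: give every run its own empty XDG_CACHE_HOME so nothing cached by an earlier run can be picked up. The helper name `run_with_fresh_cache` and the probe command are made up for illustration; nothing here is taken from the easyblock.

```python
import os
import subprocess
import sys
import tempfile


def run_with_fresh_cache(cmd):
    """Run cmd with XDG_CACHE_HOME pointing at a new, empty directory.

    Hypothetical helper: the directory is removed afterwards, so a later
    run cannot reuse files cached by this one.
    """
    with tempfile.TemporaryDirectory(prefix="xdg-cache-") as cache_dir:
        env = dict(os.environ, XDG_CACHE_HOME=cache_dir)
        return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Placeholder command: print the cache location the child process sees.
    probe = [sys.executable, "-c", "import os; print(os.environ['XDG_CACHE_HOME'])"]
    print(run_with_fresh_cache(probe).stdout.strip())
```

The trade-off is that every run pays the full JIT compilation cost again, which is the point when reproducing a failure.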

akesandgren · 2024-05-14T11:11:51Z

These fail because SANDCASTLE=1 is set when they run as part of the build:

export/test_lift_unlift
export/test_serialize
export/test_torchbind
export/test_unflatten
higher_order_ops/test_with_effects
test_weak

And those are the difference between my standalone test run (which was without SANDCASTLE) and the test run during the build.

akesandgren · 2024-05-14T11:30:49Z

@Flamefire Do you know why we set SANDCASTLE=1 in the easyblock?
As far as I can see it is a specific machine that they run tests on...

Flamefire · 2024-05-14T13:57:26Z

> @Flamefire Do you know why we set SANDCASTLE=1 in the easyblock? As far as I can see it is a specific machine that they run tests on...

Yes, there are a lot of things like @unittest.skipIf(IS_SANDCASTLE, "NYI: fuser CPU support for Sandcastle") in the tests and the idea was: if they don't even run/work on their machine, we shouldn't even try to run them on ours.

So we might need to patch those failing ones. Under SANDCASTLE, TestWithEffects loads a different library; test_weak.py does something similar, and the export tests likely do too, although I couldn't find the exact ones you mentioned.
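
To illustrate the pattern, here is a self-contained sketch. `IS_SANDCASTLE` is computed locally from the environment as a simplification (assumption: PyTorch's real flag may check more than this one variable), and `ExampleTest` is a made-up test class.

```python
import os
import unittest

# Simplified stand-in for PyTorch's flag (assumption: the real one may
# check more than this single variable).
IS_SANDCASTLE = os.getenv("SANDCASTLE") == "1"


class ExampleTest(unittest.TestCase):  # made-up test class
    @unittest.skipIf(IS_SANDCASTLE, "NYI: fuser CPU support for Sandcastle")
    def test_skipped_under_sandcastle(self):
        self.assertEqual(1 + 1, 2)

    def test_always_runs(self):
        self.assertTrue(True)


def run_example():
    """Run ExampleTest and return (tests run, tests skipped)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun, len(result.skipped)


if __name__ == "__main__":
    print(run_example())
```

The skip condition is evaluated when the module is imported, so the variable has to be set before the tests are loaded.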

akesandgren · 2024-05-14T14:37:52Z

I'm running a test without SANDCASTLE set and with test_hub disabled. test_hub is one of only two tests I found that do external downloads; the other is a single test in test_nn.
By the looks of some of the comments around SANDCASTLE it doesn't feel like a normal x86_64 based machine...

akesandgren · 2024-05-14T14:48:13Z

And I have manually run the full test suite without SANDCASTLE set on a previous build and saw only 3 failed tests.
So I don't think we need SANDCASTLE set.

Flamefire · 2024-05-14T17:04:42Z

> By the looks of some of the comments around SANDCASTLE it doesn't feel like a normal x86_64 based machine...

Might be. I used it because it disables a LOT of tests, especially those downloading stuff IIRC. See https://github.com/search?q=repo%3Apytorch%2Fpytorch%20IS_SANDCASTLE&type=code

Two such instances seem to skip whole classes of tests at once: https://github.com/pytorch/pytorch/blob/20aa7cc6788ff10dee2d927057b10a81af638a32/test/jit/test_backends.py#L69-L73 and https://github.com/pytorch/pytorch/blob/2e4d0111953e6db7e4ce5cf041e6a78770092495/test/jit/test_torchbind.py#L37-L38

> And I have manually run the full test suite without SANDCASTLE set on a previous build and saw only 3 failed tests.

If it is indeed the case that NOT setting it now causes fewer failures, then we should stop setting it. Best to condition that on 2.3+ so we don't introduce regressions for older versions.
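
A rough sketch of such a version condition, under the assumption that the test environment is assembled as a plain dict. The names `parse_major_minor` and `build_test_env` are hypothetical; the real easyblock may use its own version helper and structure.

```python
def parse_major_minor(version):
    """Turn '2.3.0' into (2, 3); only the first two components are used."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def build_test_env(pytorch_version, base_env=None):
    """Return a copy of base_env with SANDCASTLE=1 only for PyTorch < 2.3."""
    env = dict(base_env or {})
    if parse_major_minor(pytorch_version) < (2, 3):
        # Keep the old behaviour for versions that were tested with it.
        env["SANDCASTLE"] = "1"
    return env


if __name__ == "__main__":
    print(build_test_env("2.1.2"))
    print(build_test_env("2.3.0"))
```

Comparing integer tuples rather than strings keeps a version such as 2.10 from sorting below 2.3.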

I'll try to push a change upstream to use something like @skip_if_sandcastle which would give us an easy way to skip all those tests by patching that function without changing any other behavior controlled by that env variable
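
What that could look like, as a sketch only: no such helper exists in the thread, so `skip_if_sandcastle`, `skip_always` and `count_skipped` are all hypothetical names. The point is that the skip decision lives in one function, which a downstream patch can replace without touching the environment variable.

```python
import os
import unittest


def skip_if_sandcastle(reason):
    """Hypothetical helper: the single place that decides on Sandcastle skips."""
    return unittest.skipIf(os.getenv("SANDCASTLE") == "1", reason)


def skip_always(reason):
    """What the helper could look like after a downstream patch."""
    return unittest.skip(reason)


def count_skipped(decorator):
    """Apply decorator to a trivial test and return how many tests were skipped."""

    class Example(unittest.TestCase):
        @decorator("NYI: fuser CPU support for Sandcastle")
        def test_something(self):
            pass

    result = unittest.TestResult()
    unittest.defaultTestLoader.loadTestsFromTestCase(Example).run(result)
    return len(result.skipped)


if __name__ == "__main__":
    print(count_skipped(skip_always))
```

With the patched variant the tests are skipped regardless of SANDCASTLE, so any other behaviour keyed on that variable stays untouched.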

easybuild/easyconfigs/t/tlparse/tlparse-0.3.5-GCCcore-13.2.0.eb

Flamefire · 2024-05-15T08:35:18Z

We have another issue: pytest-rerunfailures interferes with our test parsing. We want some output like

    # ===================== 2 failed, 128 passed, 2 skipped, 2 warnings in 3.43s =====================
    # test_quantization failed!

But now we get:

Running test_cpp_extensions_open_device_registration 1/1 ... [2024-05-13 16:48:56.717884]
Executing ['.../python', '-bb', 'test_cpp_extensions_open_device_registration.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2'] ... [2024-05-13 16:48:56.718522]
===================== test session starts =====================
[...]
('RERUN', {'yellow': True}) [1.1713s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration ('RERUN', {'yellow': True}) [0.0036s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration FAILED [0.0033s]                                                                         [100%]

===================== RERUNS =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== FAILURES =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== short test summary info =====================
FAILED [0.0033s] test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration - AssertionError: RuntimeError not raised
!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 2 rerun in 39.35s =====================
Got exit code 1
Retrying...
===================== test session starts =====================
[...]
('RERUN', {'yellow': True}) [1.9584s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration [W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

('RERUN', {'yellow': True}) [0.0036s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration [W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

FAILED [0.0023s]                                                                         [100%]

===================== RERUNS =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== FAILURES =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== short test summary info =====================
FAILED [0.0023s] test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration - AssertionError: RuntimeError not raised
!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 2 rerun in 40.27s =====================
Got exit code 1
Retrying...
===================== test session starts =====================
[...]
('RERUN', {'yellow': True}) [1.8911s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration [W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

('RERUN', {'yellow': True}) [0.0032s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration [W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

FAILED [0.0021s]                                                                         [100%]

===================== RERUNS =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== FAILURES =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== short test summary info =====================
FAILED [0.0021s] test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration - AssertionError: RuntimeError not raised
!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 2 rerun in 40.00s =====================
Got exit code 1
Retrying...
===================== test session starts =====================
[...]
===================== 1 deselected in 0.02s =====================
The following tests failed consistently: ['test/test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration']
test_cpp_extensions_open_device_registration 1/1 failed!
Running test_cuda 1/1 ... [2024-05-13 16:51:10.730579]

I don't see how we could reasonably parse this.
It exits after the first failed test, so even "1 failed, 2 rerun in 40.00s" only says: "1 test out of an unknown number of tests failed".
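
To make the problem concrete, here is a small sketch of a summary-line parser. It is not the easyblock's actual parser; the regular expressions and the name `parse_summary` are assumptions written against the lines quoted above.

```python
import re

# Matches e.g. "===== 2 failed, 128 passed in 3.43s =====".
SUMMARY_RE = re.compile(r"^=+ (?P<counts>.+?) in [\d.]+s(?: \([^)]*\))? =+$")
COUNT_RE = re.compile(r"(\d+) ([a-z]+)")


def parse_summary(line):
    """Return a dict such as {'failed': 2, 'passed': 128}, or None if no match."""
    match = SUMMARY_RE.match(line.strip())
    if not match:
        return None
    return {name: int(num) for num, name in COUNT_RE.findall(match.group("counts"))}


if __name__ == "__main__":
    wanted = "===== 2 failed, 128 passed, 2 skipped, 2 warnings in 3.43s ====="
    actual = "===== 1 failed, 2 rerun in 39.35s ====="
    print(parse_summary(wanted))
    print(parse_summary(actual))
```

The second line parses fine, but it carries no "passed" count, so the total number of tests cannot be derived from it.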

akesandgren · 2024-05-15T14:02:52Z

@boegelbot Please test @ jsc-zen3
EB_ARGS="--include-easyblocks-from-pr 3330"

boegelbot · 2024-05-15T14:10:08Z

@akesandgren: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20489 EB_ARGS="--include-easyblocks-from-pr 3330" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20489 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

exit code: 0
output:

Submitted batch job 4128

Test results coming soon (I hope)...

- notification for comment with ID 2112638268 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

fizwit · 2024-05-15T21:35:17Z

easybuild/easyconfigs/o/optree/optree-0.11.0-GCCcore-13.2.0.eb

optree requires typing-extensions/4.10.0-GCCcore-13.2.0

Flamefire · 2024-05-16

Why is that? I installed it just fine:

... python -m pip check completed successfully

fizwit · 2024-05-18

Where are you getting typing-extensions? It is not part of Python-3.11.5-GCCcore-13.2.0.eb. optree build fails without typing-extensions.

    == installing...
    == ... (took 29 secs)
    == taking care of extensions...
    == restore after iterating...
    == postprocessing...
    == sanity checking...
    == ... (took 3 secs)
    == FAILED: Installation ended unsuccessfully (build directory: /build/optree/0.11.0/GCCcore-13.2.0): build failed (first 300 chars): `/app/software/Python/3.11.5-GCCcore-13.2.0/bin/python -m pip check` failed: optree 0.11.0 requires typing-extensions, which is not installed.

Flamefire · 2024-05-19

Seems like you need to reinstall Python. The current develop version and release 4.9.1 contain it:

easybuild-easyconfigs/easybuild/easyconfigs/p/Python/Python-3.11.5-GCCcore-13.2.0.eb, lines 56 to 58 in 43ff814:

    ('typing_extensions', '4.8.0', {
        'checksums': ['df8e4339e9cb77357558cbdbceca33c303714cf861d1eef15e1070055ae8b7ef'],
    }),

However, it was only added between EasyBuild 4.8.2 and 4.9.x, by #19777

From the looks of that PR this was made because too many other ECs depended on that. And IMO it makes sense to include it in Python by default

fizwit · 2024-05-21

Thanks, --rebuild --skip added four packages. This will fix many things for me.

    == installing extension tomli 2.0.1 (1/4)...
    == configuring...
    == building...
    == testing...
    == installing...
    == ... (took 11 secs)
    == installing extension packaging 23.2 (2/4)...
    == configuring...
    == building...
    == testing...
    == installing...
    == ... (took 2 secs)
    == installing extension typing_extensions 4.8.0 (3/4)...
    == configuring...
    == building...
    == testing...
    == installing...
    == ... (took 2 secs)
    == installing extension setuptools-scm 8.0.4 (4/4)...
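
A quick way to see whether a given Python installation already provides a distribution, before running into the pip check failure, is to ask `importlib.metadata`. This is a sketch, not part of EasyBuild's sanity check; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Run this with the Python module loaded that optree is built against.
    print(installed_version("typing_extensions"))
```

If this prints None for the loaded Python module, that installation predates the change and needs the missing extensions added.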

boegelbot · 2024-05-15T22:28:57Z

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3330
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/50db8a2d24c3a8139108dc99a9001182 for a full test report.

akesandgren · 2024-08-27T13:39:08Z

@Flamefire Any ideas on how to deal with the error output parsing problem?

Flamefire · 2024-08-28T12:13:24Z

> @Flamefire Any ideas on how to deal with the error output parsing problem?

Not many. I still have an open issue for that: pytorch/pytorch#126523

No luck so far getting machine-readable output from PyTorch directly. I.e. I wanted them to get the --save-xml option to work correctly, but nothing has happened since pytorch/pytorch#126690 failed.

We could try to get that option working by patching the test files to make sure --junit-xml-reruns and --save-xml are always set/passed. Then we can check whether the XML files are any good for us.

Another option would be to revert their change that moved the rerun feature to a custom implementation, which is what broke our detection: pytorch/pytorch@3b7d60b

That might be difficult to maintain going forward, but I don't see any alternatives currently.
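
If the XML route works out, counting results could look roughly like the sketch below. It assumes the reports follow the common JUnit layout (`testcase` elements with optional `failure`, `error` or `skipped` children); whether PyTorch's files look like this, and how reruns show up in them, is not confirmed here. `summarize_junit` and `SAMPLE` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Count test outcomes in a JUnit-style XML report given as a string."""
    root = ET.fromstring(xml_text)
    counts = {"passed": 0, "failed": 0, "errors": 0, "skipped": 0}
    for case in root.iter("testcase"):
        if case.find("failure") is not None:
            counts["failed"] += 1
        elif case.find("error") is not None:
            counts["errors"] += 1
        elif case.find("skipped") is not None:
            counts["skipped"] += 1
        else:
            counts["passed"] += 1
    return counts


# Hand-written sample, not real PyTorch output.
SAMPLE = """<testsuites><testsuite name="pytest">
  <testcase classname="a.TestA" name="test_ok"/>
  <testcase classname="a.TestA" name="test_bad"><failure message="boom"/></testcase>
  <testcase classname="a.TestA" name="test_skip"><skipped message="n/a"/></testcase>
</testsuite></testsuites>"""

if __name__ == "__main__":
    print(summarize_junit(SAMPLE))
```

With reruns enabled the same test case may appear more than once in a report, so a real implementation would probably have to deduplicate by class and test name first.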

adding easyconfigs: PyTorch-2.3.0-foss-2023b.eb and patches: PyTorch-2.3.0_disable_DataType_dependent_test_if_tensorboard_is_not_available.patch, PyTorch-2.3.0_disable_test_linear_package_if_no_half_types_are_available.patch

f55e28a

akesandgren added the update label May 7, 2024

Flamefire reviewed May 14, 2024

easybuild/easyconfigs/t/tlparse/tlparse-0.3.5-GCCcore-13.2.0.eb (outdated, resolved)

tlparse: re-enable sanity_pip_check

f71aee2

pytorch: re-add some ported patches from previous version and disable more tests.

22c71dd

fizwit reviewed May 15, 2024

boegel added this to the 4.x milestone May 22, 2024

migueldiascosta mentioned this pull request Sep 4, 2024

{math}[GCCcore/13.2.0] ArmComputeLibrary v23.08 #21309

Open
