fix finding of failed tests in output of PyTorch test step #2859

boegel · 2023-01-12T19:06:10Z

Test reports in easybuilders/easybuild-easyconfigs#16286 and easybuilders/easybuild-easyconfigs#16385 showed that installation was failing because no details were found on distributed/test_c10d_gloo and test_jit_cuda_fuser

That's fixed in this PR, and I've also moved the code that extracts the details on the failing test suites/groups out into a dedicated function, so we can test it in CI.

@Flamefire Please take a look?
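
A minimal sketch of the idea, not the actual easyblock code: once the parsing lives in a plain function that takes the test output as a string, it can be exercised with a canned log snippet instead of a full PyTorch build. The function name `find_failed_suites`, its pattern, and the sample log are made up for illustration; only the `<suite> failed!` line format is taken from the logs discussed in this PR.

```python
import re


def find_failed_suites(tests_out):
    """Return sorted, de-duplicated names of suites reported as '<suite> failed!'."""
    # hypothetical pattern: suite name at the start of a line, followed by ' failed!'
    regex = re.compile(r"^(?P<suite>\S+) failed!", re.M)
    return sorted(set(match.group('suite') for match in regex.finditer(tests_out)))


# canned output snippet (invented for illustration), no build required
sample_out = '\n'.join([
    "Running test_fx ... [2023-01-12 09:06:37.093571]",
    "test_fx failed! Received signal: SIGSEGV",
    "distributed/test_c10d_gloo failed!",
])
failed_suites = find_failed_suites(sample_out)
print(failed_suites)
```

A CI test can then simply compare the returned list against the suites expected for a given snippet.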


easybuild/easyblocks/p/pytorch.py

test/easyblocks/easyblock_specific.py

easybuild/easyblocks/p/pytorch.py

…led_tests_info in PyTorch easyblock

test/easyblocks/easyblock_specific.py

… extend test for that function

boegel · 2023-01-13T08:25:31Z

@Flamefire suggested changes implemented and test enhanced, please take another look before I fire off test builds again with this?

…style checker (E101 indentation contains mixed spaces and tabs)

Flamefire · 2023-01-13T09:53:11Z

> @Flamefire suggested changes implemented and test enhanced, please take another look before I fire off test builds again with this?

LGTM except for the failing check for the signaled test. I'd also suggest using the named tuple members in the easyblock, not just in the test.

…in test step of PyTorch easyblock

Flamefire · 2023-01-15T09:46:54Z

test/easyblocks/easyblock_specific.py

+            "distributed/fsdp/test_fsdp_input (2 total tests, failures=2)",
+            "distributions/test_distributions (216 total tests, errors=4)",
+            "test_autograd (464 total tests, failures=1, skipped=52, expected failures=1)",
+            "test_fx (123 total tests, errors=2, skipped=2)",


Either this or the SIGSEGV test needs to be a different one; otherwise the failed_test_suites check could pass even when the signal was not detected. In the case of a forced termination due to a signal there won't be a test summary, will there?

Looks like we have a bug here. For the output of

Running distributions/rpc/test_tensorpipe_agent ... [2023-01-12 09:06:37.093571]
...
Ran 123 tests in 7.549s
FAILED (errors=2, skipped=2)
...
test_fx failed! Received signal: SIGSEGV

it must NOT match the RegExp. The original, correct output is:

FAILED (errors=2, skipped=2)
distributed/rpc/test_tensorpipe_agent failed!

So the RegExp must only count the "FAILED" summary if the very next line is the one reporting the failed test, see https://gist.github.com/Flamefire/dc1403ccefdebfc3412c6fbb2d5cbabd#file-pytorch-1-9-0-foss-2020b_partial-log-L480

I would actually add this wrong output as a test against regressions. With that snippet it should only find test_fx as failed (with an unknown number of tests).
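
To make the point concrete, here is a hedged sketch of a pattern that only pairs a FAILED summary with a `failed!` line that directly follows it. The pattern and the helper name `summaries_with_suite` are assumptions for illustration, not the RegExp used in the easyblock. The two inputs are modelled on the log excerpts above, written with one log entry per line (an assumption about the original line breaks).

```python
import re

# the FAILED summary must be followed *immediately* (on the next line) by '<suite> failed!'
STRICT_REGEX = re.compile(r"^FAILED \((?P<summary>[^)\n]*)\)\n(?P<suite>\S+) failed!", re.M)


def summaries_with_suite(tests_out):
    """Map suite name -> summary text, only for directly adjacent lines."""
    return {match.group('suite'): match.group('summary') for match in STRICT_REGEX.finditer(tests_out)}


correct_out = "FAILED (errors=2, skipped=2)\ndistributed/rpc/test_tensorpipe_agent failed!\n"
wrong_out = '\n'.join([
    "Ran 123 tests in 7.549s",
    "FAILED (errors=2, skipped=2)",
    "...",
    "test_fx failed! Received signal: SIGSEGV",
])

paired = summaries_with_suite(correct_out)
unpaired = summaries_with_suite(wrong_out)
print(paired)    # summary is attributed to the suite named on the next line
print(unpaired)  # empty: the '...' line separates the summary from the 'failed!' line
```

For the second input, test_fx would then have to be picked up by a separate pattern for signal terminations, with an unknown number of tests.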

Flamefire · 2023-01-15T09:51:22Z

easybuild/easyblocks/p/pytorch.py

-            error_cnt += get_count_for_pattern(r"([0-9]+) error", failure_summary)
-            failed_test_suites.append(test_suite)
+        failed_tests_info = extract_failed_tests_info(tests_out)
+        failure_report = failed_tests_info.failure_report


Nit: To avoid long distracting lines like sorted(set(failed_tests_info.failed_test_suites)) ("failed" appears twice), maybe either unwrap the named tuple here (failure_report, failed_test_suites = failed_tests_info.failure_report, failed_tests_info.failed_test_suites etc.) or rename it to test_info, as test_info.failed_test_suites is easier to read.
And maybe the test_cnt extraction can be done in there too which would fit the name even better (then drop the "failed" from the function name).
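
A small sketch of the two options, assuming a named tuple with only the two fields mentioned in this PR (`failure_report` and `failed_test_suites`). The type name `FailedTestsInfo` and the sample values are made up here, and the real tuple may carry more fields.

```python
from collections import namedtuple

# hypothetical stand-in for the return type of extract_failed_tests_info
FailedTestsInfo = namedtuple('FailedTestsInfo', ['failure_report', 'failed_test_suites'])

failed_tests_info = FailedTestsInfo(
    failure_report="test_fx (123 total tests, errors=2, skipped=2)",
    failed_test_suites=['test_fx', 'test_autograd', 'test_fx'],
)

# option 1: unwrap once, then use short local names
failure_report, failed_test_suites = failed_tests_info
unique_suites = sorted(set(failed_test_suites))

# option 2: keep the tuple, but under a shorter name
test_info = failed_tests_info
same_suites = sorted(set(test_info.failed_test_suites))

print(unique_suites)
```

Positional unpacking as in option 1 only works while the tuple has exactly these fields; unpacking by attribute, as written in the comment above, is more robust if fields get added later.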

akesandgren · 2024-05-15T06:09:49Z

@boegel @Flamefire Status on this one?

Flamefire · 2024-05-15T07:37:26Z

This conflicts with #3085 now that it has been merged, which possibly makes all changes to the EasyBlock file unnecessary. I'm not fully sure, though, whether there were any changes to the patterns that are not included yet.

I also have #3255 which further improves the function by pulling the last bits of the parsing out of the easyblock.

boegel added 3 commits January 12, 2023 19:14

flesh out standalone function extract_failed_tests_info from PyTorch easyblock test step + add test for it

0f6a9b1

minor refactoring in extract_failed_tests_info (mainly to avoid single-letter local variable 'm')

7187b15

add pattern for failing PyTorch test wihout indication of failing test count

63abd5d

boegel added bug fix tests labels Jan 12, 2023

boegel added this to the next release (4.7.1?) milestone Jan 12, 2023

ignore indentation/long lines in PyTorch test output

b296c04

boegelbot mentioned this pull request Jan 12, 2023

{devel}[foss/2021b] PyTorch v1.11.0 w/ Python 3.9.6 easybuilders/easybuild-easyconfigs#16286

Open

boegel commented Jan 13, 2023
