Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix finding of failed tests in output of PyTorch test step #2859

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

boegel
Copy link
Member

@boegel boegel commented Jan 12, 2023

Test reports in easybuilders/easybuild-easyconfigs#16286 and easybuilders/easybuild-easyconfigs#16385 showed that installation was failing because no details were found on distributed/test_c10d_gloo and test_jit_cuda_fuser

That's fixed in this PR, and I've also fleshed out the code that extracts the details on the failing test suites/groups into a dedicated function, so we can test it in CI

@Flamefire Please take a look?

@boegel
Copy link
Member Author

boegel commented Jan 13, 2023

@Flamefire suggested changes implemented and test enhanced, please take another look before I fire off test builds again with this?

…style checker (E101 indentation contains mixed spaces and tabs)
@Flamefire
Copy link
Contributor

@Flamefire suggested changes implemented and test enhanced, please take another look before I fire off test builds again with this?

LGTM except for the failing check for the signaled test. Also I'd suggest to use the named tuple members in the easyblock, not just the test.

"distributed/fsdp/test_fsdp_input (2 total tests, failures=2)",
"distributions/test_distributions (216 total tests, errors=4)",
"test_autograd (464 total tests, failures=1, skipped=52, expected failures=1)",
"test_fx (123 total tests, errors=2, skipped=2)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This or the SIGSEGV test needs to be another one or failed_test_suites could pass when the signal was not detected as in the case of a forced termination due to a signla there won't be a test summary, will there?

Copy link
Contributor

@Flamefire Flamefire Jan 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have a bug here. For the output of

Running distributions/rpc/test_tensorpipe_agent ... [2023-01-12 09:06:37.093571]
...
Ran 123 tests in 7.549s
FAILED (errors=2, skipped=2)
...
test_fx failed! Received signal: SIGSEGV

it must NOT match the RegExp. The original, correct output is:

FAILED (errors=2, skipped=2)
distributed/rpc/test_tensorpipe_agent failed!

So the RegExp must be made to only add the counting of "FAILED" if the next line is the failed test, see https://gist.github.com/Flamefire/dc1403ccefdebfc3412c6fbb2d5cbabd#file-pytorch-1-9-0-foss-2020b_partial-log-L480

I would actually add this wrong output as a test against regressions. With that snipped it should only find test_fx as failed (with an unknown number of tests)

error_cnt += get_count_for_pattern(r"([0-9]+) error", failure_summary)
failed_test_suites.append(test_suite)
failed_tests_info = extract_failed_tests_info(tests_out)
failure_report = failed_tests_info.failure_report
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: To avoid long distracting lines like sorted(set(failed_tests_info.failed_test_suites)) (2 times "failed") maybe either unwrap the named tuple here: failure_report, failed_test_suites = failed_tests_info.failure_report, failed_tests_info.failed_test_suites etc. or rename to test_info as test_info.failed_test_suites is easier to read.
And maybe the test_cnt extraction can be done in there too which would fit the name even better (then drop the "failed" from the function name).

@akesandgren
Copy link
Contributor

@boegel @Flamefire Status on this one?

@Flamefire
Copy link
Contributor

This conflicts with #3085 after that got merged possibly making all changes to the EasyBlock file unnecessary. Not fully sure though if there were any changes to the patterns that are not included yet.

I also have #3255 which further improves the function by pulling the last bits of the parsing out of the easyblock.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants