PyTorch 1.13.1 test failures: test_native_mha #17712

Open
boegel opened this issue Apr 13, 2023 · 2 comments

boegel commented Apr 13, 2023

cfr.:

test_native_mha is also troublesome with PyTorch 1.12.1 on some systems, see #17615

@boegel boegel added this to the release after 4.7.2 milestone Apr 13, 2023

boegel commented Apr 13, 2023

#17155 (comment) also mentions several tests that we've only seen failing on POWER9

@branfosj

distributed/rpc/test_tensorpipe_agent

INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmpfdjhi28u
INFO:torch.distributed.nn.jit.instantiator:Writing /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmpfdjhi28u/_remote_module_non_scriptable.py
test_profiler_rpc_key_names (__main__.TensorPipeRpcTest) ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 1576671
INFO:torch.testing._internal.common_distributed:Started process 1 with pid 1576672
INFO:torch.testing._internal.common_distributed:Started process 2 with pid 1576673
INFO:torch.testing._internal.common_distributed:Started process 3 with pid 1576674
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmpxmrlfpr0
INFO:torch.distributed.nn.jit.instantiator:Writing /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmpxmrlfpr0/_remote_module_non_scriptable.py
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmpsvntp5ty
INFO:torch.distributed.nn.jit.instantiator:Writing /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmpsvntp5ty/_remote_module_non_scriptable.py
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmpzi8q194z
INFO:torch.distributed.nn.jit.instantiator:Writing /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmpzi8q194z/_remote_module_non_scriptable.py
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp9e9nphe1
INFO:torch.distributed.nn.jit.instantiator:Writing /dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp9e9nphe1/_remote_module_non_scriptable.py
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 1
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 3
fi_getinfo: -61
fi_getinfo: -61
fi_getinfo: -61
fi_getinfo: -61
ERROR:torch.testing._internal.common_distributed:Caught exception:
Traceback (most recent call last):
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 587, in run_test
    getattr(self, test_name)()
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 468, in wrapper
    fn()
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/dist_utils.py", line 79, in new_test_method
    return_value = old_test_method(self, *arg, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 1901, in test_profiler_rpc_key_names
    fut.result()
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 1884, in rpc_with_profiling
    self.assertTrue(
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/unittest/case.py", line 687, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true : Expected {'aten::sigmoid', 'aten::mul', 'aten::add', 'aten::relu', 'aten::ones', 'aten::clamp_min'} to be included in remote profiler output.
 exiting process 1 with exit code: 10
[W tensorpipe_agent.cpp:726] RPC agent for worker0 encountered error when reading incoming request from worker1: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
Process 1 terminated with exit code 10, terminating remaining processes.
ERROR

======================================================================
ERROR: test_profiler_rpc_key_names (__main__.TensorPipeRpcTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 466, in wrapper
    self._join_processes(fn)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 689, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 734, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 587, in run_test
    getattr(self, test_name)()
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 468, in wrapper
    fn()
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/dist_utils.py", line 79, in new_test_method
    return_value = old_test_method(self, *arg, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 1901, in test_profiler_rpc_key_names
    fut.result()
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-i5v53gqh/tmp_13ma9_l/lib/python3.10/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 1884, in rpc_with_profiling
    self.assertTrue(
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/unittest/case.py", line 687, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true : Expected {'aten::sigmoid', 'aten::mul', 'aten::add', 'aten::relu', 'aten::ones', 'aten::clamp_min'} to be included in remote profiler output.



----------------------------------------------------------------------
Ran 1 test in 3.149s

FAILED (errors=1)
Test exited with non-zero exitcode 1. Command to reproduce: /rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.4-GCCcore-11.3.0/bin/python distributed/rpc/test_tensorpipe_agent.py -v TensorPipeRpcTest.test_profiler_rpc_key_names
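
For triaging logs like the one above, a small helper can pull the failing test names out of the unittest summary headers and rebuild a single-test rerun command in the same shape as the "Command to reproduce" line. This is only a sketch: `failed_tests` and `rerun_command` are hypothetical helper names, not part of PyTorch or EasyBuild, and the regular expression assumes the `ERROR: test_name (module.Class)` header format seen in this log.

```python
import re

# Matches unittest summary headers such as
# "ERROR: test_profiler_rpc_key_names (__main__.TensorPipeRpcTest)".
# Assumption: the header format is the one shown in the log above.
HEADER = re.compile(r"^(ERROR|FAIL): (\w+) \(([\w.]+)\)$")


def failed_tests(log_text):
    """Return (status, class_name, test_name) for each failure header in a unittest log."""
    results = []
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            status, test_name, qualified_class = match.groups()
            # Drop the module prefix (e.g. "__main__.") to get the bare class name.
            class_name = qualified_class.rsplit(".", 1)[-1]
            results.append((status, class_name, test_name))
    return results


def rerun_command(python, test_file, class_name, test_name):
    """Build an argument list for rerunning one test, mirroring the log's reproduce line."""
    return [python, test_file, "-v", f"{class_name}.{test_name}"]
```

The list returned by `rerun_command` can be passed to `subprocess.run` from the PyTorch `test` directory; logger lines such as `ERROR:torch.testing...` are not matched because they have no space after the colon.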
