"The connection to the C10d store has failed" on distributed evaluation #84

Open
chawins opened this issue Mar 25, 2022 · 2 comments
chawins commented Mar 25, 2022

I have been trying to get AutoAttack to work reliably in distributed mode (DistributedDataParallel) in PyTorch. The program often crashes with the stack trace below after AutoAttack has run for several batches. It happens more often when AutoAttack runs for longer, i.e., when robust accuracy is high and many samples make it through to the Square attack.

I realize this might be a system-specific and more PyTorch-related issue, but I am curious whether anyone else here has seen a similar error and perhaps has a fix. My current workaround is to keep repeating the evaluation until it gets lucky and runs to completion. Obviously this is not ideal and wastes a lot of time and resources, as some runs need 5-10 repeats to succeed.

Another workaround is to run the evaluation in non-distributed mode. This error never happens outside of distributed mode and might be specific to the c10d backend.
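
For context, here is a minimal sketch of the kind of per-rank evaluation loop this refers to; the model, test tensors, eps, and batch size are placeholders, and the attack runs on the unwrapped module so AutoAttack itself does not issue any DDP collectives.

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from autoattack import AutoAttack


def evaluate(model, x_test, y_test, eps=8 / 255, bs=64):
    # torchrun sets MASTER_ADDR/PORT, RANK, WORLD_SIZE, LOCAL_RANK per worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = DDP(model.to(device).eval(), device_ids=[local_rank])

    # Each rank evaluates an interleaved shard of the test set.
    rank, world = dist.get_rank(), dist.get_world_size()
    x_shard, y_shard = x_test[rank::world], y_test[rank::world]

    # Attack the underlying module, not the DDP wrapper.
    adversary = AutoAttack(model.module, norm="Linf", eps=eps,
                           version="standard", device=device)
    x_adv = adversary.run_standard_evaluation(x_shard, y_shard, bs=bs)

    # Aggregate robust accuracy across ranks.
    with torch.no_grad():
        correct = (model.module(x_adv.to(device)).argmax(1)
                   == y_shard.to(device)).sum()
    total = torch.tensor(len(y_shard), device=device)
    dist.all_reduce(correct)
    dist.all_reduce(total)
    if rank == 0:
        print(f"robust accuracy: {100.0 * correct.item() / total.item():.2f}%")
    dist.destroy_process_group()

Running the same function in a single process, without init_process_group and the DDP wrapper, is the non-distributed fallback mentioned above.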

Some system info:

  • Happens with PyTorch 1.9-1.11
  • CUDA 11.0 and 11.3
  • Using 2 V100 GPUs at a time
  • Launched with torchrun
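
The retry workaround above can be scripted as a thin wrapper around torchrun, which exits nonzero when the rendezvous connection fails; the entry point name and retry limit below are placeholders.

import subprocess
import sys

# torchrun command used for the evaluation; --standalone runs the c10d
# rendezvous on localhost, and eval_autoattack.py is a placeholder script.
CMD = ["torchrun", "--standalone", "--nproc_per_node=2", "eval_autoattack.py"]

def launch_with_retries(max_retries=10):
    for attempt in range(1, max_retries + 1):
        print(f"attempt {attempt}/{max_retries}", file=sys.stderr)
        if subprocess.run(CMD).returncode == 0:
            return
    raise RuntimeError(f"evaluation still failing after {max_retries} attempts")

if __name__ == "__main__":
    launch_with_retries()
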
...
initial accuracy: 75.00%
apgd-ce - 1/1 - 6 out of 48 successfully perturbed
robust accuracy after APGD-CE: 65.62% (total time 26.3 s)
apgd-t - 1/1 - 1 out of 42 successfully perturbed
robust accuracy after APGD-T: 64.06% (total time 95.2 s)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64626 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 64627 closing signal SIGTERM
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'v100-xgcp.internal_64618_0' has failed to shutdown the rendezvous 'dcd635f2-70f3-47cd-941b-fec1c751acd3' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
  "message": {
    "message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 113, in _call_store\n    return getattr(self._store, store_op)(*args, **kwargs)\nRuntimeError: Broken pipe\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n    return f(*args, **kwargs)\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py\", line 724, in main\n    run(args)\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py\", line 715, in run\n    elastic_launch(\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n    return launch_agent(self._config, self._entrypoint, list(args))\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py\", line 236, in launch_agent\n    result = agent.run()\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n    result = f(*args, **kwargs)\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py\", line 709, in run\n    result = self._invoke_run(role)\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py\", line 881, in _invoke_run\n    num_nodes_waiting = rdzv_handler.num_nodes_waiting()\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 1079, in num_nodes_waiting\n    self._state_holder.sync()\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 408, in sync\n    get_response = self._backend.get_state()\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 73, in get_state\n    base64_state: bytes = self._call_store(\"get\", self._key)\n  File \"/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 115, in _call_store\n    raise RendezvousConnectionError(\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
      "timestamp": "1648190455"
    }
  }
}
Traceback (most recent call last):
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/chawins/miniconda3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 881, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1079, in num_nodes_waiting
    self._state_holder.sync()
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
    get_response = self._backend.get_state()
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/home/chawins/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
WARNING:torch.distributed.run:
ScarlettChan commented Mar 25, 2022 via email

fra31 (Owner) commented Apr 5, 2022

Hi,

I've never tried to use AA in distributed mode so far. Anyway, thanks for letting me know; I'll get back to you if I face the same issue or find a fix.
