EOFError at connection.py #12

yanghaoxiang7 · 2022-01-09T14:42:34Z

During training I ran into the following problem:

...
    (critic): Sequential(
      (0): Conv2d(64, 4, kernel_size=(1, 1), stride=(1, 1))
      (1): ReLU()
      (2): Flatten()
      (3): Linear(in_features=400, out_features=256, bias=True)
      (4): ReLU()
    )
    (critic_linear): Linear(in_features=256, out_features=1, bias=True)
  )
  (dist): Categorical(
    (linear): Linear(in_features=256, out_features=100, bias=True)
  )
)
Rotation: False
Process ForkProcess-1:
Traceback (most recent call last):
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/HDD_4T/Reps/baselines/baselines/common/vec_env/shmem_vec_env.py", line 123, in _subproc_worker
    cmd, data = pipe.recv()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/yhx/anaconda3/envs/online3dbpp/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

By adding print statements, I traced a segmentation fault to this code:

(at kfac.py)
      if self.steps % self.Tf == 0:
          # My asynchronous implementation exists, I will add it later.
          # Experimenting with different ways to this in PyTorch.

          self.d_g[m], self.Q_g[m] = torch.symeig(
              self.m_gg[m], eigenvectors=True)
          self.d_a[m], self.Q_a[m] = torch.symeig(
              self.m_aa[m], eigenvectors=True)

          self.d_a[m].mul_((self.d_a[m] > 1e-6).float())
          self.d_g[m].mul_((self.d_g[m] > 1e-6).float())

I suspect the problem is in torch.symeig, since I found several existing issues about it. Unlike those reports, though, my run stops in the first episode rather than after several hours of training. Is there a solution to this problem? Many thanks!
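For context on the traceback above: a worker's pipe.recv() raises EOFError when the other end of the pipe has been closed, which is consistent with the main process dying (for example from a segmentation fault) while the worker is still waiting for a command. The sketch below reproduces that behaviour with only the standard library; the function name is hypothetical, and closing the pipe end by hand only stands in for a crashed process.

```python
from multiprocessing import Pipe


def recv_after_peer_closed():
    """Show what recv() does once the other end of a pipe is gone."""
    parent_end, worker_end = Pipe()
    # Closing the parent's end stands in for the main process crashing.
    parent_end.close()
    try:
        worker_end.recv()
    except EOFError:
        return "EOFError"
    finally:
        worker_end.close()
    return "no error"


if __name__ == "__main__":
    print(recv_after_peer_closed())
```

If this is what happened, the EOFError is a symptom, and the real failure is in the main process.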

yanghaoxiang7 · 2022-01-09T14:43:24Z

BTW, I can run the training code with A2C and the testing code.

yanghaoxiang7 · 2022-01-09T14:45:20Z

I have seen a suggestion to add to the "mask value", but I couldn't find that setting in config.py.

yanghaoxiang7 · 2022-01-15T11:28:01Z

Bug fixed.
The problem is in acktr/algo/kfac.py.
I don't know why, but in my setup torch.symeig only works on the CPU; running it on the GPU leads to a segmentation fault.
Solution:

                self.d_g[m], self.Q_g[m] = torch.symeig(
                    self.m_gg[m].cpu(), eigenvectors=True)
                self.d_g[m], self.Q_g[m] = self.d_g[m].cuda(), self.Q_g[m].cuda()
                self.d_a[m], self.Q_a[m] = torch.symeig(
                    self.m_aa[m].cpu(), eigenvectors=True)
                self.d_a[m], self.Q_a[m] = self.d_a[m].cuda(), self.Q_a[m].cuda()

I'm using torch 1.7.1 + CUDA 11. I'm not sure why this happens.
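The same workaround can be wrapped in a small helper so the tensors return to whatever device they came from instead of a hard-coded .cuda(). This is only a sketch: the helper name symeig_on_cpu is hypothetical and not part of the repository, and the fallback to torch.linalg.eigh assumes a newer PyTorch release in which torch.symeig is deprecated or no longer available.

```python
import torch


def symeig_on_cpu(mat):
    """Eigendecompose a symmetric matrix on the CPU.

    Returns (eigenvalues, eigenvectors) on the same device as `mat`.
    Hypothetical helper, not part of the original kfac.py.
    """
    device = mat.device
    cpu_mat = mat.cpu()
    # Newer PyTorch releases provide torch.linalg.eigh; older ones
    # (such as 1.7.1) only have torch.symeig.
    eigh = getattr(getattr(torch, "linalg", None), "eigh", None)
    if eigh is not None:
        eigvals, eigvecs = eigh(cpu_mat)
    else:
        eigvals, eigvecs = torch.symeig(cpu_mat, eigenvectors=True)
    return eigvals.to(device), eigvecs.to(device)


if __name__ == "__main__":
    a = torch.tensor([[2.0, 1.0], [1.0, 2.0]])
    d, q = symeig_on_cpu(a)
    # Reconstruct the matrix from its eigendecomposition as a sanity check.
    print(torch.allclose(q @ torch.diag(d) @ q.t(), a, atol=1e-5))
```

In kfac.py the calls would then read self.d_g[m], self.Q_g[m] = symeig_on_cpu(self.m_gg[m]), and likewise for self.m_aa[m].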

suoyike1 · 2024-07-08T02:51:04Z

BTW, I can run the training code with A2C and the testing code.

How do I train this model with A2C?
When I run the training code with A2C, I get the following error:
Traceback (most recent call last):
  File "main.py", line 233, in
    main(args)
  File "main.py", line 24, in main
    train_model(args)
  File "main.py", line 99, in train_model
    args.lr,
AttributeError: 'Namespace' object has no attribute 'lr'

yanghaoxiang7 · 2024-07-08T03:35:11Z

Your error indicates that your "args" object has no "lr" attribute. "lr" is the learning rate, which is typically passed in through the command-line arguments ("args"). Check that you are running the code as the authors describe; you can also add print("args:", args) to debug. Hope this helps.
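To illustrate, here is a minimal argparse sketch of how this error arises. The --mode and --lr flags and their defaults are made up for the example and are not taken from the repository's actual argument parser.

```python
import argparse


def build_parser(with_lr):
    """Build a toy parser; flag names and defaults are hypothetical."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", default="train")
    if with_lr:
        # Defining the flag is what creates the `lr` attribute on args.
        parser.add_argument("--lr", type=float, default=7e-4)
    return parser


if __name__ == "__main__":
    args = build_parser(with_lr=False).parse_args([])
    print("args:", args)
    # Without the flag, accessing args.lr would raise AttributeError.
    print(hasattr(args, "lr"))

    args = build_parser(with_lr=True).parse_args(["--lr", "0.001"])
    print("args:", args)
    print(args.lr)
```

If the printed Namespace has no lr entry, the parser never defined that argument for the code path you are running.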

yanghaoxiang7 closed this as completed Jan 15, 2022
